SRE (remote)

Sofia, Sofia

Key Responsibilities:

● Monitoring and Observability:

○ Design and implement monitoring solutions to ensure the health, performance, and availability of SaaS web applications and infrastructure.

○ Develop and maintain dashboards, alerts, and reporting systems for proactive monitoring of application performance, user experience, and system health.

○ Ensure end-to-end observability by integrating log aggregation, metrics, and tracing tools to identify and resolve issues before they impact customers.

● Incident Management & Root Cause Analysis:

○ Lead the response to production incidents, working with cross-functional teams to identify the root cause and implement effective remediation strategies.

○ Drive post-incident reviews and document incidents, identifying areas for improvement in systems, processes, and response strategies.

○ Create and enforce procedures for incident management, on-call rotations, and escalations.

● Reliability & Availability:

○ Collaborate with engineering and DevOps teams to implement strategies for ensuring high availability, scalability, and disaster recovery for critical services.

○ Ensure systems are designed to handle high traffic loads and remain resilient to failures by building and deploying robust monitoring frameworks and automation tools.

○ Focus on reducing mean time to recovery (MTTR) and increasing mean time between failures (MTBF) across the SaaS platform.

● Automation & Efficiency:

○ Drive automation efforts to eliminate manual intervention and improve system reliability through automated testing, deployment, and monitoring pipelines.

○ Collaborate with the development team to implement changes that improve system reliability and efficiency.

● Capacity Planning & Performance Tuning:

○ Monitor system resource usage and identify potential capacity issues, driving proactive scaling and performance tuning initiatives.

○ Use performance metrics to predict scaling needs and ensure the infrastructure can meet the growing demands of the platform.

● Collaboration & Cross-Functional Engagement:

○ Work closely with developers, product managers, and DevOps engineers to improve application performance and reliability through better code, infrastructure, and operational practices.

○ Act as a mentor to junior SREs, sharing knowledge about best practices for monitoring, scaling, and troubleshooting complex web applications.

● Continuous Improvement & Best Practices:

○ Establish and promote best practices for reliability engineering, monitoring standards, incident management, and performance optimization.

○ Stay current with industry trends and evaluate new tools and technologies to improve service reliability and monitoring practices.

Required Skills and Qualifications:

● Experience:

○ 5+ years of experience as a Site Reliability Engineer (SRE), Systems Engineer, or DevOps Engineer with a focus on monitoring, reliability, and performance for SaaS-based web applications.

○ Proven track record in designing and maintaining monitoring systems for large-scale, high-availability applications.

● Technical Skills:

○ Strong experience with monitoring, logging, and alerting tools such as Prometheus, Grafana, Datadog, ELK Stack (Elasticsearch, Logstash, Kibana), New Relic, or similar.

○ Expertise in setting up and managing cloud-based infrastructure monitoring (AWS CloudWatch, Google Cloud Operations, etc.).

○ Experience with containerized applications (Docker, Kubernetes) and orchestrating infrastructure at scale.

● Scripting & Automation:

○ Proficiency in automation tools (e.g., Terraform, Ansible, Chef, Puppet) and programming/scripting languages (e.g., Python, Go, Shell).

○ Experience building and managing automated pipelines for CI/CD, deployment, and monitoring.

● Incident Response & Troubleshooting:

○ Expertise in incident response, troubleshooting production issues, root cause analysis, and leading post-mortems to improve system reliability.

○ Familiarity with on-call responsibilities, managing high-pressure situations, and minimizing downtime for customers.

● Cloud & Infrastructure Experience:

○ Experience with cloud platforms (AWS, GCP, Azure) and managing infrastructure at scale.

○ Understanding of distributed systems, microservices architecture, and how to monitor and manage them effectively.

● Performance Tuning & Optimization:

○ Strong understanding of application performance tuning, database performance, and infrastructure optimizations.

○ Experience with system performance monitoring, profiling, and resource management.
