SRE (remote)

Sofia, Sofia

Key Responsibilities: 

● Monitoring and Observability: 

○ Design and implement monitoring solutions to ensure the health, performance, and availability of SaaS web applications and infrastructure. 

○ Develop and maintain dashboards, alerts, and reporting systems for proactive monitoring of application performance, user experience, and system health.

○ Ensure end-to-end observability by integrating log aggregation, metrics, and tracing tools to identify and resolve issues before they impact customers.

● Incident Management & Root Cause Analysis: 

○ Lead the response to production incidents, working with cross-functional teams to identify the root cause and implement effective remediation strategies. 

○ Drive post-incident reviews and document incidents, identifying areas for improvement in systems, processes, and response strategies. 

○ Create and enforce procedures for incident management, on-call rotations, and escalations. 

● Reliability & Availability: 

○ Collaborate with engineering and DevOps teams to implement strategies for ensuring high availability, scalability, and disaster recovery for critical services.

○ Ensure systems are designed to handle high traffic loads and remain resilient to failures by building and deploying robust monitoring frameworks and automation tools.

○ Focus on reducing mean time to recovery (MTTR) and increasing mean time between failures (MTBF) across the SaaS platform. 

● Automation & Efficiency: 

○ Drive automation efforts to eliminate manual intervention and improve system reliability through automated testing, deployment, and monitoring pipelines.

○ Collaborate with the development team to implement changes that improve system reliability and efficiency.

● Capacity Planning & Performance Tuning: 

○ Monitor system resource usage and identify potential capacity issues, driving proactive scaling and performance tuning initiatives. 

○ Use performance metrics to predict scaling needs and ensure the infrastructure can meet the growing demands of the platform. 

● Collaboration & Cross-Functional Engagement: 

○ Work closely with developers, product managers, and DevOps engineers to improve application performance and reliability through better code, infrastructure, and operational practices. 

○ Act as a mentor to junior SREs, sharing knowledge about best practices for monitoring, scaling, and troubleshooting complex web applications. 

● Continuous Improvement & Best Practices: 

○ Establish and promote best practices for reliability engineering, monitoring standards, incident management, and performance optimization.

○ Stay current with industry trends and evaluate new tools and technologies to improve service reliability and monitoring practices. 

Required Skills and Qualifications: 

● Experience: 

○ 5+ years of experience as a Site Reliability Engineer (SRE), Systems Engineer, or DevOps Engineer with a focus on monitoring, reliability, and performance for SaaS-based web applications. 

○ Proven track record in designing and maintaining monitoring systems for large-scale, high-availability applications. 

● Technical Skills: 

○ Strong experience with monitoring, logging, and alerting tools such as Prometheus, Grafana, Datadog, ELK Stack (Elasticsearch, Logstash, Kibana), New Relic, or similar.

○ Expertise in setting up and managing cloud-based infrastructure monitoring (AWS CloudWatch, Google Cloud Operations, etc.).

○ Experience with containerized applications (Docker, Kubernetes) and orchestrating infrastructure at scale. 

● Scripting & Automation: 

○ Proficiency in automation tools (e.g., Terraform, Ansible, Chef, Puppet) and programming/scripting languages (e.g., Python, Go, Shell). 

○ Experience building and managing automated pipelines for CI/CD, deployment, and monitoring. 

● Incident Response & Troubleshooting: 

○ Expertise in incident response, troubleshooting production issues, root cause analysis, and leading post-mortems to improve system reliability. 

○ Familiarity with on-call responsibilities, managing high-pressure situations, and minimizing downtime for customers. 

● Cloud & Infrastructure Experience: 

○ Experience with cloud platforms (AWS, GCP, Azure) and managing infrastructure at scale. 

○ Understanding of distributed systems, microservices architecture, and how to monitor and manage them effectively. 

● Performance Tuning & Optimization: 

○ Strong understanding of application performance tuning, database performance, and infrastructure optimizations. 

○ Experience with system performance monitoring, profiling, and resource management.



 
