Site Reliability Engineer (Remote)

Sofia, Sofia

Why would you love this job?

The Redis Cloud Operations team is hiring a Site Reliability Engineer, offering you the chance to work on large-scale systems and support our valued customers. You’ll operate state-of-the-art products using best-in-class SRE tools. In this global role, you’ll participate in the monitoring, operation, and management of a Redis managed service at one of the world’s largest cloud service providers (CSPs). We’re looking for highly motivated team players with keen attention to detail and substantial experience working in intense production environments. If you have a passion for tackling technical challenges on a global scale, this position is perfect for you.

What you’ll do:

As a Site Reliability Engineer at Redis, you will:

Handle Technical Escalations: Engage in complex troubleshooting and manage technical escalations within a Follow-the-Sun (FTS) support model, ensuring seamless global service coverage.
Ensure System Reliability: Leverage your software development and problem-solving expertise to create automation tools and runbooks, enhancing the reliability and stability of the Redis database on a leading cloud service provider.
Collaborate with Engineering Teams: Partner closely with engineering teams during service-impacting incidents, leading problem management efforts to maintain service continuity and stability.
Participate in On-Call Rotations: Be available for occasional weekend on-call shifts, providing critical support and ensuring service reliability.

What you’ll need:

B.S. in Computer Science, Information Technology, Software Engineering, or a related field, or 4 or more years of experience working in infrastructure, CloudOps, or SRE domains.
At least 3 years of experience troubleshooting production systems.
At least 2 years of hands-on experience with cloud infrastructure.
Strong working knowledge of Linux/Unix.
Deep understanding of networking (TCP/IP), with emphasis on how it is implemented across the various cloud providers.
Experience with alerting and monitoring systems (Prometheus, Grafana).
Programming experience in Bash and Python. C# is a plus.
Familiarity with source code version control tools like Git, GitLab, SVN, etc.
Experience in analyzing and debugging production issues at scale.
Self-directed, ambitious, authentic, caring, and eager to learn new things.
Experience with 24/7 on-call rotations. Availability for nights and weekends (follow-the-sun).

Extra great if you have:

FedRAMP certification is a plus.
Experience working with large-scale distributed systems.
Experience with NoSQL databases (especially Redis).
Programming experience in C#.
Linux and Azure certifications.
