Responsibilities:
- Oversee and manage the daily technical operations of the Syncplicity SaaS platform, ensuring systems deliver high levels of availability, scalability, and performance.
- Design, implement, and maintain a robust and scalable infrastructure on AWS using industry-standard tools and best practices.
- Lead automation initiatives to minimize manual interventions across deployments, monitoring, and system maintenance.
- Work closely with the Engineering team to ensure smooth and efficient deployment of new releases while maintaining quality and performance standards.
- Maintain and enhance existing monitoring and alerting frameworks to ensure system health, identify bottlenecks, and prevent potential failures.
- Act as the primary point of escalation for production issues, ensuring timely and effective resolution with minimal user impact.
- Conduct thorough root cause analyses of incidents and develop both tactical fixes and long-term prevention strategies.
- Coordinate and lead incident response efforts to effectively address operational challenges.
- Monitor key performance indicators (KPIs) such as uptime, incident resolution time, and overall system reliability, and use them to drive continuous improvement.
- Mentor and guide the TechOps team, fostering collaboration and innovation to address operational challenges.
- Optimize AWS infrastructure to balance cost and performance while staying within budgetary constraints.
- Participate in on-call rotations to ensure 24/7 support for critical systems.
- Ensure operational compliance with industry standards and regulatory requirements (e.g., GDPR, SOC 2).
- Engage with stakeholders to align operational strategies with business needs.
Requirements:
- 6+ years of relevant experience in TechOps or DevOps roles.
- Proficiency with monitoring and alerting tools such as Prometheus and Grafana.
- Familiarity with logging tools like the ELK stack (Elasticsearch, Logstash, Kibana).
- Hands-on experience with CI/CD pipelines and tools like Jenkins or GitHub Actions.
- Expertise in Infrastructure-as-Code (IaC) tools such as Terraform, Terragrunt, and Ansible.
- Proficiency in scripting languages (Python, Bash, PowerShell) for automation.
- Proven ability to manage technical operations and troubleshoot production issues in business-critical environments.
- Strong experience with AWS infrastructure and platform services (EC2, EKS, Lambda, Route 53, RDS) and with containerization technologies such as Docker and Kubernetes.
- Deep understanding of networking, security practices, and compliance frameworks (e.g., SOC 2).
- Strong problem-solving, communication, and leadership skills, with a customer-focused mindset.