STACKIT CLOUD SITE RELIABILITY ENGINEER STORAGE/SRE

Sofia, Bulgaria

The impact you will create:

Stability & reliability: You are responsible for maintaining and optimizing the stability and availability of our highly available, resilient storage infrastructure (Block, Object, Backup and File storage). You ensure this through proactive monitoring, resolving incidents on your own responsibility, and preventing their recurrence.
Automation: You automate provisioning and operational processes in the storage environment, driven by your own ambition to get a little better every day and to continuously optimize our products.
Architecture: Together with your team, you support a robust and efficient storage architecture, because it is important to you to build a stable, reliable, long-term solution that our customers like to use.
End-to-end responsibility: Identifying with the products we provide to our customers is very important to us. That is why we actively practice end-to-end responsibility and receive support from many internal STACKIT service teams to refine our services.
Performance and capacity planning: You analyze and optimize the performance of our existing systems with a view to the future scaling of the landscape. This also includes forward-looking capacity planning.
Incident and post-mortem analysis: You handle (major) incidents involving storage as part of STACKIT's incident and problem management process, with the aim of deriving mitigating measures for the future and then implementing them successfully.

Experience and Skills you will need:

You want to make a big difference and play a key role in shaping the solution with state-of-the-art cloud technologies
You have experience with one or more storage products (e.g. NetApp, Cohesity, Pure, Ceph) in the area of block, object, backup or file storage and have good knowledge of cloud environments and their architectures
You are an expert in the operation of storage infrastructure (e.g. solution scenarios, provisioning, scaling, migration, incident response) and its automation (e.g. using Golang / Python, Bash, Ansible)
You are already familiar with containerized system landscapes of the storage environment (e.g. k8s)
You have experience in monitoring, alerting and logging to ensure complete coverage of our systems (e.g. Prometheus, Grafana, Elasticsearch)
Ideally, you already work with APIs and develop them further (e.g. REST APIs with Golang and Python)
You enjoy the challenges of operating storage systems (e.g. protocols, troubleshooting, performance analysis, high availability, lifecycle)
You have a passion and enthusiasm for new technologies and topics related to various storage systems
You like to be part of a motivated team that always strives for improvement and continuously develops itself (and its products)
Your excellent communication skills in English (and, optionally, German) form the basis for successful cooperation in international, agile teams