Senior Systems Engineer – HPC (remote)

Sofia, Sofia
Responsibilities
●    Oversee the design, deployment, and optimization of the HPC infrastructure, including hardware, platform, software, networking, and storage components. 
●    Participate in the preparation and review of HLD and LLD documents, scopes of work, RFIs, RFPs, and RFQs.
●    Lead efforts to maximize the efficiency and performance of HPC systems, ensuring optimal resource utilization and minimal downtime. 
●    Collaborate closely with product and architecture teams to understand customers' computational needs and implement their requirements. Provide tailored technical solutions that align with the company's strategic goals.
●    Develop and implement automation solutions and tools for deployment and management. 
●    Set up monitoring, logging, and alerting systems. 
●    Act as L3 support for complex technical issues, perform root cause analysis, and implement solutions to ensure the reliability and availability of HPC systems. 
●    Maintain comprehensive documentation of HPC configurations, procedures, and best practices to facilitate knowledge sharing and future reference. 
●    Ensure the security and compliance of the HPC infrastructure by implementing necessary safeguards and adhering to company standards and regulations.
●    Collaborate with HPC vendors and suppliers for hardware and software procurement, support, and delivery. 
●    Assist in budget planning and management for HPC-related expenditures, ensuring cost-effective solutions. 
●    Stay at the forefront of HPC technology trends, evaluating and recommending new technologies and practices to enhance HPC capabilities. 
Qualification, Experience, Competence and Certifications
●    Bachelor's degree in Information Technology, Computer Science, or a related field.
●    Minimum 7 years of hands-on experience in High-Performance Computing (HPC) systems administration and infrastructure management.
●    Advanced knowledge and expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems. 
●    Proficiency in parallel computing principles, distributed computing, and cluster management. 
●    Comprehensive knowledge and hands-on experience in the system administration of Linux environments. 
●    Experience with job schedulers, resource managers, and workflow orchestration tools commonly used in HPC environments (e.g., Slurm, LSF, PBS, Kubernetes).
●    Advanced knowledge of data center network design and related technologies (OSI model, TCP/IP stack, routing, VLAN/VXLAN, etc.).
●    Competence in network design and in the configuration of switches and routers, including InfiniBand and RoCE.
●    Experience with large-scale data storage solutions, particularly Ceph, NFS, and Lustre. 
●    Proficiency in one or more parallel programming libraries or models, such as MPI, OpenMP, oneAPI, and CUDA.
●    Competence in configuration management and infrastructure-as-code tools such as Ansible, Puppet, and Terraform, and their integration with Git.
●    Strong scripting and automation skills (e.g., Python, Bash) for system administration tasks. 
●    Excellent problem-solving skills and the ability to troubleshoot complex HPC issues effectively. 
●    In-depth knowledge of performance tuning and optimization techniques for HPC systems. 
●    Familiarity with containerization and orchestration (Docker, Kubernetes).
●    Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Nagios, Zabbix, Ganglia, ELK).
●    Effective communication and collaboration skills to work with cross-functional teams.
 
