Senior DevOps Engineer for HPC and ML Infrastructure

Attis • United States
Relocation
AI Summary

Lead DevOps Engineer for HPC/ML Infrastructure. Build and operate HPC backbone and cloud environment. Enable scientists to solve complex problems.

Key Highlights
Architect and manage HPC cluster and cloud environment
Implement Infrastructure-as-Code philosophy and automate provisioning and deployment
Serve as key technical partner to science and ML teams
Embody Site Reliability Engineering (SRE) mindset
Technical Skills Required
Kubernetes, Docker, Terraform, Cloud Automation, Linux/UNIX, Python, Go, C++, Rust
Benefits & Perks
Generous salary up to $230k
Equity
15% bonus
Full benefits package
Relocation assistance

Job Description


Lead DevOps Engineer - HPC / ML Infrastructure / Platform Engineer


My client is seeking an exceptional infrastructure engineer to build and operate the HPC backbone and cloud environment for their mission. This is a unique opportunity to create the foundational platform that will power complex predictive modeling and process immense, planetary-scale datasets.

You will be the "engineer's engineer," enabling a team of world-class scientists to solve deeply meaningful real-world problems.


Why Join?


  • A generous salary of up to $230k + equity + 15% bonus + full benefits package.
  • Mission & Impact: Your work will directly contribute to a mission with significant global importance, building systems that generate critical insights from complex environmental data.
  • Greenfield Ownership: You will be the principal owner of the computational environment. This isn't about maintaining an existing system; it's about architecting, building, and scaling it from the ground up.
  • Technical Challenge: This role operates at the intersection of HPC, MLOps, and large-scale data. You will solve challenging problems related to distributed computing, automation, and performance at a massive scale.
  • Culture & Compensation: Join a small, brilliant team that values direct and honest communication. The role comes with a highly competitive salary, significant equity, a performance bonus, and comprehensive benefits.


The Company


My client is a venture-backed startup dedicated to tackling major environmental and scientific challenges. They are leveraging cutting-edge techniques and vast, complex datasets to create a new class of predictive insights. By building robust, scalable technology, they are providing solutions that were previously out of reach.


The Role

As the Principal Software Infrastructure Engineer, you will have ultimate responsibility for the reliability, scalability, and performance of the company's entire computational platform:


  • Architect, implement, and manage a sophisticated HPC cluster and cloud environment designed for both traditional scientific computing and modern machine learning workflows.
  • Champion an "Infrastructure-as-Code" philosophy, automating everything from provisioning and configuration to deployment and monitoring.
  • Build and own the CI/CD pipelines for infrastructure, ensuring the entire environment is reproducible, stable, and secure.
  • Embody a Site Reliability Engineering (SRE) mindset, proactively identifying and eliminating performance bottlenecks across the stack, from the Linux kernel to the network layer (focusing on system optimization, not model code tuning).
  • Serve as the key technical partner to the science and machine learning teams, understanding their requirements and building the robust platform they need to succeed.


This hybrid role requires the ability to work from an office in either the Greater Denver or Greater Boston area at least three days per week. Relocation assistance is available.


The Essential Requirements


This role is for a hands-on builder, not just a user of platforms. You must have:

  • Demonstrable experience architecting, building, and owning core infrastructure from first principles.
  • A deep and practical understanding of large-scale, distributed computing environments (e.g., HPC clusters, supercomputing, or massive data grids).
  • Expertise with modern platform technologies, including Kubernetes and Docker, for service orchestration.
  • Mandatory: Explicit experience with Cloud Automation and Infrastructure-as-Code (IaC) on a major provider (AWS, GCP, or Azure) using tools like Terraform or CloudFormation.
  • Strong, hands-on knowledge of Linux/UNIX systems and proven proficiency in a general-purpose programming language (Python, Go, C++, or Rust) for complex tooling. Note: Reliance on Bash scripting alone is insufficient.
  • Experience handling and processing massive, multi-terabyte datasets.
  • Either a professional background in a scientific, research, or mission-driven R&D environment (e.g., Space/Aerospace, Computational Biology, Genomics, Physics), or experience from a domain with analogous data challenges, such as large-scale IoT sensor systems, autonomous systems/robotics, or complex geospatial logistics.


What Will Make You Stand Out


  • Specific experience building and operating the infrastructure for large-scale AI/ML model training and deployment.
  • Professional development experience with C++ or GoLang.
  • Experience writing custom Kubernetes Operators.


*Data Scientists, ML Scientists, BI Engineers, Data Analysts, Application Engineers, and ML Engineers whose focus is model productionization, feature stores, and high-level MLOps tools are unlikely to pass the deep, low-level systems interview questions and are unfortunately not a fit for this role.


If you are interested in this role, please apply with your resume through this site.




Disclaimer


No terminology in this advert is intended to discriminate on the grounds of age, sex, race, religion or belief, disability, pregnancy and maternity, marriage and civil partnership, sexual orientation, gender, and/or gender reassignment, and we confirm that we are happy to accept applications from anyone for this role. Attis Global Ltd operates as an employment agency and employment business. More information can be found at attisglobal.com.

