Senior DevOps Engineer for HPC and ML Infrastructure

Attis • United States

Relocation

This job is no longer active. This position is no longer accepting applications.

AI Summary

Lead DevOps Engineer for HPC/ML Infrastructure. Build and operate HPC backbone and cloud environment. Enable scientists to solve complex problems.

Key Highlights

Architect and manage HPC cluster and cloud environment

Implement Infrastructure-as-Code philosophy and automate provisioning and deployment

Serve as key technical partner to science and ML teams

Embody Site Reliability Engineering (SRE) mindset

Technical Skills Required

Kubernetes, Docker, Terraform, Cloud Automation, Linux/UNIX, Python, Go, C++, Rust

Benefits & Perks

Generous salary up to $230k

Equity

15% bonus

Full benefits package

Relocation assistance

Job Description

Lead DevOps Engineer - HPC / ML Infrastructure / Platform Engineer

We are seeking an exceptional infrastructure engineer to build and operate the HPC backbone and cloud environment for our client's mission. This is a unique opportunity to create the foundational platform that will power complex predictive modeling and process immense, planetary-scale datasets.

You will be the "engineer's engineer," enabling a team of world-class scientists to solve deeply meaningful real-world problems.

Why Join?

A generous salary of up to $230k + equity + 15% bonus + full benefits package.
Mission & Impact: Your work will directly contribute to a mission with significant global importance, building systems that generate critical insights from complex environmental data.
Greenfield Ownership: You will be the principal owner of the computational environment. This isn't about maintaining an existing system; it's about architecting, building, and scaling it from the ground up.
Technical Challenge: This role operates at the intersection of HPC, MLOps, and large-scale data. You will solve challenging problems related to distributed computing, automation, and performance at a massive scale.
Culture & Compensation: Join a small, brilliant team that values direct and honest communication. The role comes with a highly competitive salary, significant equity, a performance bonus, and comprehensive benefits.

The Company

My client is a venture-backed startup dedicated to tackling major environmental and scientific challenges. They are leveraging cutting-edge techniques and vast, complex datasets to create a new class of predictive insights. By building robust, scalable technology, they are providing solutions that were previously out of reach.

The Role

As the Principal Software Infrastructure Engineer, you will have ultimate responsibility for the reliability, scalability, and performance of the company's entire computational platform:

Architect, implement, and manage a sophisticated HPC cluster and cloud environment designed for both traditional scientific computing and modern machine learning workflows.
Champion an "Infrastructure-as-Code" philosophy, automating everything from provisioning and configuration to deployment and monitoring.
Build and own the CI/CD pipelines for infrastructure, ensuring the entire environment is reproducible, stable, and secure.
Embody a Site Reliability Engineering (SRE) mindset, proactively identifying and eliminating performance bottlenecks across the stack, from the Linux kernel to the network layer (focusing on system optimization, not model code tuning).
Serve as the key technical partner to the science and machine learning teams, understanding their needs and building the robust platform they need to succeed.

This hybrid role requires the ability to work from an office in either the Greater Denver or Greater Boston area at least three days per week. Relocation assistance is available.

The Essential Requirements

This role is for a hands-on builder, not just a user of platforms. You must have:

Demonstrable experience architecting, building, and owning core infrastructure from first principles.
A deep and practical understanding of large-scale, distributed computing environments (e.g., HPC clusters, supercomputing, or massive data grids).
Expertise with modern platform technologies, including Kubernetes and Docker, for service orchestration.
Mandatory: Explicit experience with Cloud Automation and Infrastructure-as-Code (IaC) on a major provider (AWS, GCP, or Azure) using tools like Terraform or CloudFormation.
Strong, hands-on knowledge of Linux/UNIX systems and proven proficiency in a general-purpose programming language (Python, Go, C++, or Rust) for complex tooling. Note: Reliance on Bash scripting alone is insufficient.
Experience handling and processing massive, multi-terabyte datasets.
Either a professional background in a scientific, research, or mission-driven R&D environment (e.g., Space/Aerospace, Computational Biology, Genomics, Physics), or experience from a domain with analogous data challenges, such as large-scale IoT sensor systems, autonomous systems/robotics, or complex geospatial logistics.

What Will Make You Stand Out

Specific experience building and operating the infrastructure for large-scale AI/ML model training and deployment.
Professional development experience with C++ or GoLang.
Experience writing custom Kubernetes Operators.

*Data Scientists, ML Scientists, BI Engineers, Data Analysts, Application Engineers, and ML Engineers who focus on model productionization, feature stores, and high-level MLOps tools will likely fail the deep, low-level systems interview questions and are unfortunately not a fit for this role.

If you are interested in this role, please apply with your resume through this site.

Disclaimer

DISCLAIMER: No terminology in this advert is intended to discriminate on the grounds of age, sex, race, religion or belief, disability, pregnancy and maternity, marriage and civil partnership, sexual orientation, gender, and/or gender reassignment, and we confirm that we are happy to accept applications from anyone for this role. Attis Global Ltd operates as an employment agency and employment business. More information can be found at attisglobal.com.

Job Overview

Posted Date: Nov 26, 2025

Employment Type: Full-time

Experience Level: Mid-Senior level

Location: United States

Category: DevOps

Company: Attis
