Senior Kubernetes Engineer (GPU Acceleration, AI/ML, HPC)
Seeking a Senior Kubernetes Engineer to design, implement, and optimize GPU-accelerated container platforms for AI/ML and HPC workloads. Requires deep expertise in NVIDIA and Kubernetes ecosystems, including GPU scheduling and custom operators. Role involves architecting, operating, and securing high-performance GPU clusters in hybrid or on-prem environments.
Job Description
Our client is seeking a highly skilled Senior Kubernetes Engineer to join their team in Dallas. In this role, you will design, implement, and optimize GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments. You will need deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.
Key responsibilities of the role include:
- Architecting and operating Kubernetes clusters optimized for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
- Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
- Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
- Optimizing GPU utilization and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Volcano and Slurm
- Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
- Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
- Implementing secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement, such as OPA or Gatekeeper
- Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps tools such as ArgoCD and FluxCD
- Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize
- Participating in performance tuning, incident response and production readiness reviews
Who are we looking for?
The ideal candidate will have the following skills and experience:
- Extensive experience running Kubernetes in production-grade environments and working with the NVIDIA stack on Kubernetes, including GPU Operator, device plugin, NVML, MIG and DCGM
- Proficiency in Go or Python for operator development and Kubernetes controller logic
- Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers and scheduler extensions
- Experience with GPU-intensive workloads, such as LLMs, training pipelines and scientific computing
- Hands-on experience with Helm, Kustomize and GitOps workflows
- Familiarity with CNI plugins, especially NVIDIA CNI and Multus
- Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter
The following is beneficial:
- Knowledge of container runtimes such as CRI-O and containerd, and of the NVIDIA Container Toolkit
- Contributions to open-source projects in the Kubernetes or NVIDIA ecosystem
- Experience working with Cilium or other CNI plugins
Relocation Assistance Provided