Senior Kubernetes Engineer (GPU Acceleration, AI/ML, HPC)

Coda Search│Staffing United States
AI Summary

Seeking a Senior Kubernetes Engineer to design, implement, and optimize GPU-accelerated container platforms for AI/ML and HPC workloads. Requires deep expertise in NVIDIA and Kubernetes ecosystems, including GPU scheduling and custom operators. Role involves architecting, operating, and securing high-performance GPU clusters in hybrid or on-prem environments.

Key Highlights
Design, implement, and optimize GPU-accelerated container platforms at scale.
Architect and operate Kubernetes clusters optimized for GPU workloads.
Develop and deploy custom Kubernetes operators and controllers.
Technical Skills Required
Kubernetes, NVIDIA GPU Operator, Network Operator, DCGM, Go, Python, CRDs, RBAC, Custom Controllers, Scheduler Extensions, Helm, Kustomize, GitOps, ArgoCD, FluxCD, Terraform, CNI plugins, NVIDIA CNI, Multus, Prometheus, Grafana, DCGM Exporter, OpenTelemetry, OPA, Gatekeeper, CRI-O, containerd, NVIDIA Container Toolkit, Cilium
Benefits & Perks
Relocation Assistance Provided

Job Description


Our client is seeking a highly skilled Senior Kubernetes Engineer to join their team in Dallas. In this role, you will design, implement, and optimize GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments. You will need deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.


Key responsibilities of the role include:


  • Architecting and operating Kubernetes clusters optimized for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
  • Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
  • Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
  • Optimizing GPU utilization and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Volcano and Slurm
  • Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
  • Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
  • Implementing secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement, such as OPA or Gatekeeper
  • Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps, ArgoCD and FluxCD
  • Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize
  • Participating in performance tuning, incident response and production readiness reviews


Who are we looking for?


The ideal candidate will have the following skills and experience:

  • Extensive experience with Kubernetes in production-grade environments and with the NVIDIA stack on Kubernetes, including GPU Operator, device plugin, NVML, MIG and DCGM
  • Proficiency in Go or Python for operator development and Kubernetes controller logic
  • Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers and scheduler extensions
  • Experience with GPU-intensive workloads, for example LLMs, training pipelines and scientific computing
  • Hands-on experience with Helm, Kustomize and GitOps workflows
  • Familiarity with CNI plugins, especially NVIDIA CNI and Multus
  • Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter


The following is beneficial:

  • Knowledge of container runtimes, such as CRI-O and containerd, and the NVIDIA Container Toolkit
  • Contributions to open-source projects in the Kubernetes or NVIDIA ecosystem
  • Experience working with Cilium or other CNI plugins


Relocation Assistance Provided

