Senior Kubernetes Engineer (GPU Acceleration, AI/ML, HPC)

Coda Search│Staffing • United States

Relocation

AI Summary

Seeking a Senior Kubernetes Engineer to design, implement, and optimize GPU-accelerated container platforms for AI/ML and HPC workloads. Requires deep expertise in NVIDIA and Kubernetes ecosystems, including GPU scheduling and custom operators. Role involves architecting, operating, and securing high-performance GPU clusters in hybrid or on-prem environments.

Key Highlights

Design, implement, and optimize GPU-accelerated container platforms at scale.

Architect and operate Kubernetes clusters optimized for GPU workloads.

Develop and deploy custom Kubernetes operators and controllers.

Technical Skills Required

Kubernetes, NVIDIA GPU Operator, Network Operator, DCGM, Go, Python, CRDs, RBAC, Custom Controllers, Scheduler Extensions, Helm, Kustomize, GitOps, ArgoCD, FluxCD, Terraform, CNI plugins, NVIDIA CNI, Multus, Prometheus, Grafana, DCGM Exporter, OpenTelemetry, OPA, Gatekeeper, CRI-O, containerd, NVIDIA Container Toolkit, Cilium

Benefits & Perks

Relocation Assistance Provided

Job Description

Our client is seeking a highly skilled Senior Kubernetes Engineer to join their team in Dallas. In this role, you will design, implement, and optimize GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments. You will need deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.

Key responsibilities of the role include:

Architecting and operating Kubernetes clusters optimized for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
Optimizing GPU utilization and job placement through scheduler extensions, such as kube-scheduler plugins, Slurm and Volcano
Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
Implementing secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement, such as OPA or Gatekeeper
Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps, ArgoCD and FluxCD
Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize
Participating in performance tuning, incident response and production readiness reviews

Who are we looking for?

The ideal candidate will have the following skills and experience:

Extensive experience running Kubernetes in production-grade environments and working with the NVIDIA stack on Kubernetes, including GPU Operator, device plugin, NVML, MIG and DCGM
Proficiency in Go or Python for operator development and Kubernetes controller logic
Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers and scheduler extensions
Experience with GPU-intensive workloads, for example LLMs, training pipelines and scientific computing
Hands-on experience with Helm, Kustomize and GitOps workflows
Familiarity with CNI plugins, especially NVIDIA CNI and Multus
Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter

The following is beneficial:

Knowledge of container runtimes such as CRI-O and containerd, and of the NVIDIA Container Toolkit
Contributions to open-source projects in the Kubernetes or NVIDIA ecosystem
Experience working with Cilium or other CNI plugins

Relocation Assistance Provided

Job Overview

Posted Date: Jan 17, 2026

Employment Type: Full-time

Experience Level: Mid-Senior level

Location: United States

Category: DevOps

Company: Coda Search│Staffing
