Senior HPC & Kubernetes Architect

Intellectt Inc • United States
Relocation
AI Summary

Lead the architecture and modernization of large-scale compute platforms supporting AI/ML and scientific workloads. Design and optimize job scheduling strategies using Slurm, enable GPU-aware container orchestration, and optimize MPI-based distributed workloads. Implement centralized observability and integrate Kubernetes with traditional schedulers.

Key Responsibilities
Architect hybrid environments integrating dedicated HPC clusters with Kubernetes-based container platforms
Design and optimize job scheduling strategies using Slurm
Enable GPU-aware container orchestration
Optimize MPI-based distributed workloads
Design infrastructure automation pipelines using Terraform and Ansible
Architect secure multi-tenant environments with RBAC and TLS
Implement centralized observability using Elasticsearch, Logstash, and Kibana
Integrate Kubernetes with traditional schedulers
Technical Skills Required
Slurm scheduler
MPI (OpenMPI/MPICH)
GPU clusters (NVIDIA CUDA)
Kubernetes architecture
Infrastructure-as-Code (Terraform and/or Ansible)
Linux systems engineering
AWS HPC services (EKS, EC2 GPU, FSx for Lustre, AWS Batch)

Job Title: Senior HPC & Kubernetes Architect

📍Location: Albany, NY (Relocation Required)

Full-Time | Hybrid/Onsite


Job Description:

  • We are seeking a hands-on Senior Architect with deep expertise in High-Performance Computing (HPC) and Kubernetes to lead the architecture and modernization of large-scale compute platforms supporting AI/ML and scientific workloads.
  • This role requires demonstrated production experience designing Slurm-based HPC clusters, integrating GPU-enabled workloads with Kubernetes, and optimizing MPI-driven applications in hybrid cloud environments.
  • This is not a general DevOps or Kubernetes administration role. Candidates must have production HPC cluster architecture experience.


Core Responsibilities:

  • Architect hybrid environments integrating dedicated HPC clusters with Kubernetes-based container platforms.
  • Design and optimize job scheduling strategies using Slurm (required), including priority queues, gang scheduling, and deterministic resource allocation.
  • Enable GPU-aware container orchestration using Docker and Kubernetes with near bare-metal performance.
  • Optimize MPI-based distributed workloads with low-latency networking (InfiniBand preferred).
  • Design infrastructure automation pipelines using Terraform and Ansible.
  • Architect secure multi-tenant environments with RBAC and TLS across cluster and container layers.
  • Implement centralized observability using Elasticsearch, Logstash, and Kibana for large-scale job monitoring.
  • Integrate Kubernetes with traditional schedulers to support AI/ML and high-throughput compute workloads.
  • Support AWS-based HPC workloads including EKS, EC2 GPU instances, AWS Batch, and FSx for Lustre.


Required Technical Expertise:

  • 5+ years of hands-on experience architecting and operating production HPC clusters.
  • Deep experience with Slurm scheduler (configuration, partitioning, job queues, resource management).
  • Strong knowledge of MPI (OpenMPI/MPICH) and distributed compute models.
  • Experience with GPU clusters (NVIDIA CUDA) and GPU scheduling.
  • Advanced Kubernetes architecture knowledge (cluster networking, resource quotas, performance tuning).
  • Experience running containers in HPC environments (Docker; NVIDIA runtime experience strongly preferred).
  • Proven experience with Infrastructure-as-Code (Terraform and/or Ansible).
  • Strong Linux systems engineering background.
  • Experience with AWS HPC services (EKS, EC2 GPU, FSx for Lustre, AWS Batch).


Strongly Preferred:

  • Experience with NVIDIA Enroot and/or Pyxis.
  • Exposure to InfiniBand networking.
  • Experience integrating Kubernetes-native batch schedulers (Volcano).
  • ELK Stack architecture at enterprise scale.
  • CKA or equivalent certification.


Candidate Profile:

Ideal candidates will have backgrounds in:


  • Research computing environments
  • National laboratories
  • AI/ML infrastructure platforms
  • Financial modeling or quantitative computing
  • Scientific simulation platforms


Important

  • Applicants without direct Slurm or production HPC cluster experience will not be considered.
