Senior HPC & Kubernetes Architect

Intellectt Inc • United States

Relocation

This job is no longer active. This position is no longer accepting applications.

AI Summary

Lead the architecture and modernization of large-scale compute platforms supporting AI/ML and scientific workloads. Design and optimize job scheduling strategies using Slurm, enable GPU-aware container orchestration, and optimize MPI-based distributed workloads. Implement centralized observability and integrate Kubernetes with traditional schedulers.

Key Highlights

Lead the architecture and modernization of large-scale compute platforms

Design and optimize job scheduling strategies using Slurm

Enable GPU-aware container orchestration

Key Responsibilities

Architect hybrid environments integrating dedicated HPC clusters with Kubernetes-based container platforms

Design and optimize job scheduling strategies using Slurm

Enable GPU-aware container orchestration

Optimize MPI-based distributed workloads

Design infrastructure automation pipelines using Terraform and Ansible

Architect secure multi-tenant environments with RBAC and TLS

Implement centralized observability using Elasticsearch, Logstash, and Kibana

Integrate Kubernetes with traditional schedulers

Technical Skills Required

Slurm scheduler

MPI (OpenMPI/MPICH)

GPU clusters (NVIDIA CUDA)

Kubernetes architecture

Infrastructure-as-Code (Terraform and/or Ansible)

Linux systems engineering

AWS HPC services (EKS, EC2 GPU, FSx for Lustre, AWS Batch)

Nice to Have

Experience with NVIDIA Enroot and/or Pyxis

Exposure to InfiniBand networking

Experience integrating Kubernetes-native batch schedulers (Volcano)

ELK Stack architecture at enterprise scale

CKA or equivalent certification

Job Description

Job Title: Senior HPC & Kubernetes Architect

📍Location: Albany, NY (Relocation Required)

Full-Time | Hybrid/Onsite

We are seeking a hands-on Senior Architect with deep expertise in High Performance Computing (HPC) and Kubernetes to lead the architecture and modernization of large-scale compute platforms supporting AI/ML and scientific workloads.
This role requires demonstrated production experience designing Slurm-based HPC clusters, integrating GPU-enabled workloads with Kubernetes, and optimizing MPI-driven applications in hybrid cloud environments.
This is not a general DevOps or Kubernetes administration role. Candidates must have production HPC cluster architecture experience.

Core Responsibilities:

Architect hybrid environments integrating dedicated HPC clusters with Kubernetes-based container platforms.
Design and optimize job scheduling strategies using Slurm (required), including priority queues, gang scheduling, and deterministic resource allocation.
Enable GPU-aware container orchestration using Docker and Kubernetes with near bare-metal performance.
Optimize MPI-based distributed workloads with low-latency networking (InfiniBand preferred).
Design infrastructure automation pipelines using Terraform and Ansible.
Architect secure multi-tenant environments with RBAC and TLS across cluster and container layers.
Implement centralized observability using Elasticsearch, Logstash, and Kibana for large-scale job monitoring.

Integrate Kubernetes with traditional schedulers to support AI/ML and high-throughput compute workloads.
Support AWS-based HPC workloads including EKS, EC2 GPU instances, AWS Batch, and FSx for Lustre.

Required Technical Expertise:

5+ years hands-on experience architecting and operating production HPC clusters.
Deep experience with Slurm scheduler (configuration, partitioning, job queues, resource management).
Strong knowledge of MPI (OpenMPI/MPICH) and distributed compute models.
Experience with GPU clusters (NVIDIA CUDA) and GPU scheduling.
Advanced Kubernetes architecture knowledge (cluster networking, resource quotas, performance tuning).
Experience running containers in HPC environments (Docker; NVIDIA runtime experience strongly preferred).
Proven experience with Infrastructure-as-Code (Terraform and/or Ansible).
Strong Linux systems engineering background.
Experience with AWS HPC services (EKS, EC2 GPU, FSx for Lustre, AWS Batch).

Strongly Preferred:

Experience with NVIDIA Enroot and/or Pyxis.

Exposure to InfiniBand networking.
Experience integrating Kubernetes-native batch schedulers (Volcano).
ELK Stack architecture at enterprise scale.
CKA or equivalent certification.

Candidate Profile:

Ideal candidates will have backgrounds in:

Research computing environments
National laboratories
AI/ML infrastructure platforms
Financial modeling or quantitative computing
Scientific simulation platforms

Important

Applicants without direct Slurm or production HPC cluster experience will not be considered.

Job Overview

Posted Date: Feb 11, 2026

Employment Type: Full-time

Experience Level: Mid-Senior level

Location: United States

Category: DevOps

Company: Intellectt Inc
