Senior Systems Engineer, AI Infrastructure

RemoteStar • Germany
Relocation
AI Summary

Seeking a Senior Systems Engineer with 10+ years of software engineering experience and strong Python proficiency to build and automate bare-metal AI infrastructure. Responsibilities include developing control plane software, orchestrating large-scale GPU compute, and optimizing networking for distributed training. Requires deep knowledge of Kubernetes, the NVIDIA GPU ecosystem, and Linux internals.

Key Highlights
Design and develop software for bare-metal AI infrastructure lifecycle automation.
Orchestrate large-scale distributed training jobs on massive GPU clusters.
Debug complex distributed systems issues across code, network, and silicon.
Technical Skills Required
Python, Kubernetes, Custom Resource Definitions (CRDs), Operators, Kubernetes API server architecture, NVIDIA GPU clusters, NVIDIA drivers, CUDA toolkit, NVIDIA Container Toolkit, Linux kernel, cgroups, namespaces, Terraform, Ansible, DCGM, PyTorch Distributed, Megatron-LM, DeepSpeed

Job Description


About the client:


The client is a well-funded, fast-growing deep-tech company founded in 2019 and the largest quantum software company in the EU. It was also named one of the 100 most promising AI companies in the world (CB Insights, 2023) and has a growing team of 150+ employees that is fully multicultural and international.


Requirements

  • Systems Programming Expertise: 10+ years of software engineering experience with strong proficiency in Python. You must be comfortable building system agents, APIs, and CLI tools.
  • Deep Kubernetes Knowledge: You understand K8s internals beyond simple deployment. Experience with Custom Resource Definitions (CRDs), Operators, and the Kubernetes API server architecture.
  • GPU Ecosystem Experience: Hands-on experience managing NVIDIA GPU clusters. Familiarity with NVIDIA drivers, CUDA toolkit, and the container runtime (NVIDIA Container Toolkit); a minimal health-polling sketch follows this list.
  • Linux Internals: Deep understanding of the Linux kernel, cgroups, namespaces, and system performance tuning.
  • Infrastructure as Code: Mastery of declarative infrastructure tools (Terraform, Ansible), with a focus on provisioning physical hardware rather than just cloud VMs.
  • Problem Solving: A proven track record of debugging complex distributed systems where the root cause could be code, network, or silicon.
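
As an illustration of the kind of system agent this role involves, here is a minimal, purely hypothetical sketch of a Python node agent that polls GPU health by shelling out to nvidia-smi; the names (GpuSample, poll_gpus) are invented for this example and are not part of any existing codebase mentioned in this posting.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: a tiny node agent polling GPU health via nvidia-smi."""
import csv
import io
import subprocess
from dataclasses import dataclass


@dataclass
class GpuSample:
    index: int
    name: str
    temperature_c: int
    utilization_pct: int


def poll_gpus() -> list[GpuSample]:
    # nvidia-smi can emit selected query fields as CSV.
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,temperature.gpu,utilization.gpu",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    samples = []
    for row in csv.reader(io.StringIO(out)):
        idx, name, temp, util = (field.strip() for field in row)
        samples.append(GpuSample(int(idx), name, int(temp), int(util)))
    return samples


if __name__ == "__main__":
    for gpu in poll_gpus():
        print(f"GPU {gpu.index} ({gpu.name}): {gpu.temperature_c} C, {gpu.utilization_pct}% util")
```

In production such an agent would typically expose these samples over an API or push them into a metrics pipeline rather than printing them.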


Preferred qualifications

  • HPC Background: Experience working with traditional supercomputing schedulers (Slurm, PBS) or modern batch schedulers (Volcano, Kueue, Ray).
  • Bare Metal Provisioning: Experience with tools like Cluster API (CAPI), Metal3, Tinkerbell, Canonical MaaS, or OpenStack Ironic.
  • High-Speed Networking: Knowledge of RDMA, InfiniBand, GPUDirect, and how to expose these technologies to containerized workloads.
  • AI/ML Familiarity: Understanding of how distributed training works (e.g., PyTorch Distributed, Megatron-LM, DeepSpeed) and the infrastructure requirements of Large Language Models (LLMs); a minimal sketch follows this list.
  • Observability: Experience building monitoring for hardware health (DCGM) and distributed tracing for long-running jobs.
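
To make the AI/ML familiarity item concrete, here is a minimal, hedged sketch of how a multi-node PyTorch job initializes NCCL-backed collectives; it assumes the standard torchrun launcher, which sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker, and the script name is invented for this example.

```python
"""Minimal sketch: verifying NCCL collectives across a GPU cluster with torch.distributed."""
import os

import torch
import torch.distributed as dist


def main():
    # torchrun injects these environment variables into every worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL is the usual backend for GPU-to-GPU collectives; it uses
    # NVLink / InfiniBand / RoCE transports where the fabric provides them.
    dist.init_process_group(backend="nccl")

    # A trivial all-reduce exercises the interconnect end to end.
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} sees {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nnodes=2 --nproc_per_node=8 allreduce_check.py (plus a rendezvous endpoint for multi-node runs); a result equal to the world size on every rank suggests the NCCL stack and fabric are healthy.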


Location: Applicants must have legal authorization to work in the country where the position is based.


What you will be doing

  • Building the Control Plane: Designing and developing the software layer (APIs, Controllers, Agents) that automates the lifecycle of bare-metal AI infrastructure.
  • Orchestrating High-Scale Compute: Architecting scheduling solutions for large-scale distributed training jobs across massive clusters of GPUs (NVIDIA H200/B200/B300), ensuring efficient bin-packing and gang scheduling.
  • Optimizing the Fabric: Tuning the software-defined networking layer to support low-latency interconnects (InfiniBand/RDMA/RoCEv2) essential for multi-node training.
  • Developing Kubernetes Extensions: Writing custom Kubernetes Operators and CRDs to abstract complex hardware realities (topology awareness, GPU partitioning) into usable interfaces for our Data Scientists (a minimal Operator sketch follows this list).
  • Hardware-Level Debugging: Investigating and resolving deep systems issues, ranging from PCIe bus errors and NCCL communication timeouts to kernel panics on bare-metal nodes.
  • Defining Standards: Creating the "Golden Image" for AI workloads, managing drivers, firmware, and OS optimizations to squeeze maximum performance out of the hardware.
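
As a hedged illustration of the Kubernetes-extension work described above, the sketch below uses the kopf framework to reconcile a hypothetical GpuPool custom resource; the API group infra.example.com, the plural gpupools, the gpuCount field, and the status fields are all invented for this example.

```python
"""Hypothetical sketch: a kopf-based Operator reconciling a made-up GpuPool CRD."""
import kopf


@kopf.on.create("infra.example.com", "v1", "gpupools")
def on_gpupool_create(spec, name, namespace, logger, **kwargs):
    # A real controller would call out to the bare-metal provisioning layer
    # here and record topology (NUMA node, NVLink domain, rail) on the pool.
    requested = spec.get("gpuCount", 0)
    logger.info(f"Reconciling GpuPool {namespace}/{name}: {requested} GPUs requested")
    # kopf records the handler's return value under the object's status.
    return {"phase": "Provisioning", "gpusRequested": requested}


@kopf.on.delete("infra.example.com", "v1", "gpupools")
def on_gpupool_delete(name, namespace, logger, **kwargs):
    # Release hardware back to the free pool when the custom resource goes away.
    logger.info(f"Releasing hardware for GpuPool {namespace}/{name}")
```

Such a handler runs with kopf run gpupool_operator.py against a cluster where the matching CRD has been applied; a team might equally build this in Go with controller-runtime, and kopf appears here only because the role is Python-centric.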


Perks & Benefits

  • Indefinite contract.
  • Equal pay guaranteed.
  • Variable performance bonus.
  • Signing bonus.
  • Relocation package (if applicable).
  • Private health insurance.
  • Eligibility for educational budget according to internal policy.
  • Hybrid opportunity.
  • Flexible working hours.
  • Working in a fast-paced environment on cutting-edge technologies.
  • Career plan with opportunities to learn and teach.
  • Progressive company with a happy-people culture.
