Datacenter Infrastructure Reliability Lead

Doghouse Recruitment • Netherlands
Relocation
AI Summary

Lead the highest escalation layer for critical infrastructure incidents across global datacenters. Build and lead the L3 support team across regions. Design and enforce incident response and escalation frameworks.

Key Highlights
Lead L3 support team
Design incident response frameworks
Act as Incident Commander
Key Responsibilities
Build, lead, and scale the L3 support team
Design and enforce incident response and escalation frameworks
Act as Incident Commander for high-severity production incidents
Technical Skills Required
Linux, server hardware, firmware (BIOS/BMC), GPU server platforms, nvidia-smi, dcgmi, Linux log correlation
Benefits & Perks
Up to 200k base
25% bonus
RSUs
Relocation package provided
Hybrid work arrangement (50/50 in office)
Nice to Have
Deep troubleshooting capability across Linux, server hardware, and firmware (BIOS/BMC)
Strong familiarity with GPU server platforms and common diagnostics

Job Description



Location: Amsterdam, Netherlands – Hybrid (50/50 in office)

Relocation: possible and supported.

Compensation: Up to 200k base + 25% bonus + RSUs

Join a fast-growing AI infrastructure company building large-scale GPU and datacenter platforms from the ground up. This role is ideal for experienced infrastructure leaders who enjoy solving complex production issues, building teams from scratch, and operating at the intersection of hardware, Linux systems, and large-scale datacenter operations.

You will lead the highest escalation layer for critical infrastructure incidents across global datacenters.


Role Overview

Your team will be the final escalation point for anything related to datacenter IT hardware infrastructure (modern servers, GPUs, racks, networking, storage, etc.). Anything L2 cannot resolve will be escalated to this team.

This L3 team is not yet in place; you will be responsible for building and leading it from scratch.


Responsibilities

  • Build, lead, and scale the L3 support team across regions, with full ownership of hiring, team structure, and performance
  • Design and enforce the end-to-end incident response and escalation framework, including workflows, ownership models, and KPIs, and ensure its adoption across multiple teams
  • Act as Incident Commander for high-severity production incidents, driving structured mitigation, clear communication, and long-term resolution
  • Own problem management and continuous improvement, identifying recurring failure patterns and translating them into scalable fixes across infrastructure and operations


What We're Looking For

  • 10+ years of experience in large-scale datacenter environments
  • 3+ years of experience leading highly technical teams
  • 3+ years of experience building teams (hiring and performance management)
  • Experience setting up frameworks, processes, and workflows from scratch


Nice to have:

  • Deep troubleshooting capability across Linux, server hardware, and firmware (BIOS/BMC), with the ability to guide investigations at a systems engineer level
  • Strong familiarity with GPU server platforms and common diagnostics (e.g. nvidia-smi, dcgmi, Linux log correlation)

