Senior Platform Engineer / SRE - GPU Cloud Infrastructure

Circle B • Netherlands

Visa Sponsorship

AI Summary

You will own the observability platform, GPU metering-to-billing pipeline, and AI service catalogue for a sovereign EU GPU cloud. You are responsible for making GPU clusters observable, measurable, billable, and reliable. Requires 5+ years in platform engineering/SRE with deep Kubernetes and observability expertise.

Key Highlights

Build and operate the full observability stack: Prometheus, VictoriaMetrics/Thanos, Loki, Alertmanager, Grafana.

Design and implement GPU metering pipeline from DCGM telemetry to usage-based billing engine.

Deploy and operate AI/ML platform services: KServe, NIM, vLLM, JupyterHub, MLflow, Ray, Qdrant.

Technical Skills Required

Go, Python, Kubernetes, Prometheus, VictoriaMetrics, Thanos, Grafana, Alertmanager, Loki, OpenTelemetry, DCGM Exporter, NCCL, RDMA, RoCE, KServe, NVIDIA NIM Operator, Triton, vLLM, JupyterHub, MLflow, Ray, KubeRay, Harbor, ArgoCD, Flux, Helm, Kustomize, GitOps, Linux, systemd, REST APIs, gRPC, OpenCost, Lago, Stripe Billing, Metronome, Qdrant

Job Description

What we do

Circle B builds sustainable IT infrastructure for the AI and cloud era. For over a decade, we have designed and deployed datacenter, edge, and AI/HPC systems on Open Compute Project (OCP) hardware. We are independent, vendor-neutral, and ISO 9001 / 14001 / 27001 certified, with deployments across multiple countries.

Our newest initiative is a sovereign EU GPU cloud — operating under full Dutch/EU jurisdiction and beyond the reach of the US CLOUD Act, for regulated European organizations that cannot compromise on where their data lives.

The Role

You own everything above the cluster: the observability platform, the GPU metering-to-billing pipeline, the operation of the AI service catalogue, and the SLOs that define reliability. Your job is to make the GPU clusters delivered to you observable, measurable, billable, and reliable.

What you will own

Observability & Reliability

Build and operate the full observability stack: metrics (Prometheus + long-term store such as VictoriaMetrics/Thanos), logs (Loki + a forwarder), and alerting (Alertmanager)
Integrate DCGM GPU telemetry into the pipeline: utilisation, memory, temperature, power, SM activity — per GPU, per tenant
Surface GPU and fabric health (DCGM XID errors, NCCL/RDMA, RoCE/link health) from the Infra and Network engineers into a single pane — you own dashboards, alerting, and tenant-impact SLOs; Infra/Network own the fabric remediation
Build the dashboarding capability (Grafana with SSO), ready for tenants at launch
Define SLOs, SLIs, and error budgets; implement burn-rate alerting
Implement OpenTelemetry instrumentation across platform services; lay the alerting and runbook foundations for reliable operation later

Metering, Billing & Service Delivery

Build the GPU metering pipeline: DCGM + scheduler/namespace state → accurate per-tenant usage events (GPU-hours per tenant) with idempotency and reconciliation → a usage-based billing engine, billing primarily on allocated/reserved GPU-time
Deploy cost-attribution (OpenCost or equivalent) for per-tenant GPU and infrastructure showback/chargeback
Integrate a billing engine (Lago, Stripe Billing, or Metronome): flat fee + storage/egress overages, SEPA B2B direct debit
Deploy and operate the AI service catalogue — model-serving runtimes (KServe / NVIDIA NIM Operator, Triton, vLLM), notebooks (JupyterHub), experiment tracking (MLflow), a vector DB (Qdrant), and a distributed-job framework (Ray/KubeRay) — as repeatable, GitOps-driven templates. Operation, not model authoring
Operate the air-gapped registry and controlled artifact seeding into the sovereign zone (with Network on the path policy)
Expose platform APIs (gateway, rate limiting, integration glue) for the future customer portal
Build the tenant onboarding/offboarding automation — the path from a new tenant to running GPU workloads, end to end

What We Are Looking For

Required Skills

5+ years in platform engineering / SRE, deploying and operating complex service stacks on K8s (operators, CRDs, Helm, scheduling, multi-tenancy / quota)
Go and/or Python to a software-engineering standard — control-plane services and automation, not just scripting
Production observability end-to-end: Prometheus + a long-term store (VictoriaMetrics / Thanos / Mimir), Grafana, Alertmanager, Loki, OpenTelemetry; designing SLIs, SLOs, and error budgets
GPU observability: running the DCGM exporter and surfacing GPU/fabric health (XID, NCCL, RoCE) into dashboards and alerts (surfacing issues, not remediating the fabric)
Ability to build a GPU metering pipeline: DCGM + scheduler/namespace state → accurate, idempotent per-tenant usage events → a billing engine
Deploy and operate (not author) model-serving + ML-platform services on K8s: KServe and/or NIM/Triton/vLLM, plus Ray, JupyterHub, MLflow, Harbor, a vector DB
Infrastructure-as-Code + GitOps: Helm, Kustomize, and ArgoCD or Flux in production
Strong Linux fundamentals and production troubleshooting (systemd, container runtimes, REST/gRPC APIs, event-driven pipelines)
Self-driven and comfortable with the breadth of a pre-launch platform: context-switching across observability, metering, service delivery, and onboarding

Preferred Skills

Prior hands-on experience with a commercial or open-source usage-based billing engine (Lago, OpenMeter, Metronome, Stripe Billing, Amberflo) and event-driven metering; EU payment integration (SEPA, Mollie, or Stripe EU) is a plus
NVIDIA AI Enterprise / NIM Operator; Run:ai or KAI Scheduler for GPU quota, fair-share, and multi-tenancy
OpenCost / FinOps: GPU cost allocation, chargeback / showback
Container registry operations in air-gapped environments (Harbor, Quay, or similar)
Identity & access integration (OIDC/SAML, SSO, tenant-scoped RBAC) and API gateways (Kong, Envoy, Traefik)
EU regulatory awareness: GDPR, AI Act, DORA, NIS2, NEN 7510 in a platform context
TypeScript (billing/admin tooling; interfacing with the future Fullstack portal)

Nice to Have

Event streaming for telemetry / metering
Experience operating a Kubernetes-native multi-cluster management platform
Experience at an AI neocloud, GPU cloud provider, or managed ML platform
Familiarity with NCCL / RDMA fabric troubleshooting

Why Join Us

Help build a sovereign EU GPU cloud from the ground up.
Own a critical platform layer, not just tickets or maintenance.
Work on modern AI infrastructure, GPU platforms, Kubernetes, observability, and automation.
Join a company with deep experience in OCP, datacenter, AI/HPC, and cloud infrastructure.
Build infrastructure for organizations where data location, compliance, and reliability truly matter.

Benefits

Competitive salary.
Pension scheme.
Visa sponsorship.
Company gym.
Modern office in Hoofddorp.
Informal and open working culture.
Participation in relevant conferences and exhibitions across Europe.
Opportunity to develop your skills in a fast-growing technology environment.

Our Work Culture

Circle B offers an informal working atmosphere with energetic people who enjoy being part of a growing technology company. We have an open management culture and encourage colleagues to contribute to improving our products, services, and processes.

If this sounds like a good fit, please send your CV and motivation letter to:

surbi@tauruseu.com

Job Overview

Posted Date: Jun 10, 2026

Employment Type: Full-time

Experience Level: Associate

Location: Netherlands

Category: DevOps

Company: Circle B
