Senior Platform Engineer / SRE - GPU Cloud Infrastructure

Circle B, Netherlands
Visa Sponsorship
AI Summary

You will own the observability platform, GPU metering-to-billing pipeline, and AI service catalogue for a sovereign EU GPU cloud. You are responsible for making GPU clusters observable, measurable, billable, and reliable. Requires 5+ years in platform engineering/SRE with deep Kubernetes and observability expertise.

Key Highlights
Build and operate the full observability stack: Prometheus, VictoriaMetrics/Thanos, Loki, Alertmanager, Grafana.
Design and implement GPU metering pipeline from DCGM telemetry to usage-based billing engine.
Deploy and operate AI/ML platform services: KServe, NIM, vLLM, JupyterHub, MLflow, Ray, Qdrant.
Technical Skills Required
Go, Python, Kubernetes, Prometheus, VictoriaMetrics, Thanos, Grafana, Alertmanager, Loki, OpenTelemetry, DCGM Exporter, NCCL, RDMA, RoCE, KServe, NVIDIA NIM Operator, Triton, vLLM, JupyterHub, MLflow, Ray, KubeRay, Harbor, ArgoCD, Flux, Helm, Kustomize, GitOps, Linux, systemd, REST APIs, gRPC, OpenCost, Lago, Stripe Billing, Metronome, Qdrant

Job Description


What We Do

Circle B builds sustainable IT infrastructure for the AI and cloud era. For over a decade, we have designed and deployed datacenter, edge, and AI/HPC systems on Open Compute Project (OCP) hardware. We are independent, vendor-neutral, and ISO 9001 / 14001 / 27001 certified, with deployments across multiple countries.

Our newest initiative is a sovereign EU GPU cloud — operating under full Dutch/EU jurisdiction and beyond the reach of the US CLOUD Act, for regulated European organizations that cannot compromise on where their data lives.


The Role

You own everything above the cluster: the observability platform, the GPU metering-to-billing pipeline, the operation of the AI service catalogue, and the SLOs that define reliability. Your job is to make the GPU clusters delivered to you observable, measurable, billable, and reliable.


What You Will Own


Observability & Reliability

  • Build and operate the full observability stack: metrics (Prometheus + long-term store such as VictoriaMetrics/Thanos), logs (Loki + a forwarder), and alerting (Alertmanager)
  • Integrate DCGM GPU telemetry into the pipeline: utilisation, memory, temperature, power, SM activity — per GPU, per tenant
  • Surface GPU and fabric health (DCGM XID errors, NCCL/RDMA, RoCE/link health) from the Infra and Network engineers into a single pane — you own dashboards, alerting, and tenant-impact SLOs; Infra/Network own the fabric remediation
  • Build the dashboarding capability (Grafana with SSO), ready for tenants at launch
  • Define SLOs, SLIs, and error budgets; implement burn-rate alerting
  • Implement OpenTelemetry instrumentation across platform services; lay the alerting and runbook foundations for reliable operation later
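
For candidates less familiar with the burn-rate alerting mentioned above, the idea can be sketched in a few lines of Python. This is an illustration only: the function names, the 99.9% SLO target, and the 14.4 threshold (a commonly cited value for a fast-burn page) are assumptions, not a description of Circle B's configuration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means errors arrive exactly at the rate the SLO allows.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    Requiring both windows avoids paging on a spike that has already ended.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)
```

In practice the two error ratios would come from queries against the metrics store over, for example, a one-hour and a five-minute window.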

Metering, Billing & Service Delivery

  • Build the GPU metering pipeline: DCGM + scheduler/namespace state → accurate per-tenant usage events (GPU-hours per tenant) with idempotency and reconciliation → a usage-based billing engine, billing primarily on allocated/reserved GPU-time
  • Deploy cost-attribution (OpenCost or equivalent) for per-tenant GPU and infrastructure showback/chargeback
  • Integrate a billing engine (Lago, Stripe Billing, or Metronome): flat fee + storage/egress overages, SEPA B2B direct debit
  • Deploy and operate the AI service catalogue — model-serving runtimes (KServe / NVIDIA NIM Operator, Triton, vLLM), notebooks (JupyterHub), experiment tracking (MLflow), a vector DB (Qdrant), and a distributed-job framework (Ray/KubeRay) — as repeatable, GitOps-driven templates. Operation, not model authoring
  • Operate the air-gapped registry and controlled artifact seeding into the sovereign zone (with Network on the path policy)
  • Expose platform APIs (gateway, rate limiting, integration glue) for the future customer portal
  • Build the tenant onboarding/offboarding automation — the path from a new tenant to running GPU workloads, end to end
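
To make "idempotent per-tenant usage events" concrete, here is a minimal Python sketch under stated assumptions. The `UsageEvent` and `UsageLedger` names, the `allocation_id` field, and the in-memory store are hypothetical; a real pipeline would derive events from scheduler/namespace state and send them to a billing engine. The sketch bills on allocated GPU-time, as the role description specifies.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class UsageEvent:
    tenant: str
    allocation_id: str      # hypothetical ID of a scheduler allocation
    window_start: datetime
    window_end: datetime
    gpu_count: int          # GPUs allocated during the window

    @property
    def gpu_hours(self) -> float:
        seconds = (self.window_end - self.window_start).total_seconds()
        return self.gpu_count * seconds / 3600.0

    @property
    def idempotency_key(self) -> str:
        # Same tenant, allocation, and window always yield the same key,
        # so a retried delivery can be recognised as a duplicate.
        raw = f"{self.tenant}|{self.allocation_id}|{self.window_start.isoformat()}"
        return hashlib.sha256(raw.encode()).hexdigest()


class UsageLedger:
    """In-memory stand-in for a billing engine's event intake."""

    def __init__(self) -> None:
        self._events: dict[str, UsageEvent] = {}

    def record(self, event: UsageEvent) -> bool:
        """Store the event once; return False for a duplicate delivery."""
        if event.idempotency_key in self._events:
            return False
        self._events[event.idempotency_key] = event
        return True

    def gpu_hours_for(self, tenant: str) -> float:
        return sum(e.gpu_hours for e in self._events.values() if e.tenant == tenant)
```

Reconciliation would then compare the ledger totals per tenant against an independent source, such as allocation records from the scheduler, and flag any difference.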


What We Are Looking For


Required Skills

  • 5+ years in platform engineering / SRE, deploying and operating complex service stacks on K8s (operators, CRDs, Helm, scheduling, multi-tenancy / quota)
  • Go and/or Python to a software-engineering standard — control-plane services and automation, not just scripting
  • Production observability end-to-end: Prometheus + a long-term store (VictoriaMetrics / Thanos / Mimir), Grafana, Alertmanager, Loki, OpenTelemetry; designing SLIs, SLOs, and error budgets
  • GPU observability: running the DCGM exporter and surfacing GPU/fabric health (XID, NCCL, RoCE) into dashboards and alerts (surface, not fabric-remediate)
  • Ability to build a GPU metering pipeline: DCGM + scheduler/namespace state → accurate, idempotent per-tenant usage events → a billing engine
  • Deploy and operate (not author) model-serving + ML-platform services on K8s: KServe and/or NIM/Triton/vLLM, plus Ray, JupyterHub, MLflow, Harbor, a vector DB
  • Infrastructure-as-Code + GitOps: Helm, Kustomize, and ArgoCD or Flux in production
  • Strong Linux fundamentals and production troubleshooting (systemd, container runtimes, REST/gRPC APIs, event-driven pipelines)
  • Self-driven and comfortable with the breadth of a pre-launch platform: context-switching across observability, metering, service delivery, and onboarding


Preferred Skills

  • Prior hands-on with a commercial / OSS usage-based billing engine (Lago, OpenMeter, Metronome, Stripe Billing, Amberflo) and event-driven metering; EU payment integration (SEPA, Mollie, or Stripe EU) a plus
  • NVIDIA AI Enterprise / NIM Operator; Run:ai or KAI Scheduler for GPU quota, fair-share, and multi-tenancy
  • OpenCost / FinOps: GPU cost allocation, chargeback / showback
  • Container registry operations in air-gapped environments (Harbor, Quay, or similar)
  • Identity & access integration (OIDC/SAML, SSO, tenant-scoped RBAC) and API gateways (Kong, Envoy, Traefik)
  • EU regulatory awareness: GDPR, AI Act, DORA, NIS2, NEN 7510 in a platform context
  • TypeScript (billing/admin tooling; interfacing with the future Fullstack portal)


Nice to Have

  • Event streaming for telemetry / metering
  • Experience operating a Kubernetes-native multi-cluster management platform
  • Experience at an AI neocloud, GPU cloud provider, or managed ML platform
  • Familiarity with NCCL / RDMA fabric troubleshooting


Why Join Us

  • Help build a sovereign EU GPU cloud from the ground up.
  • Own a critical platform layer, not just tickets or maintenance.
  • Work on modern AI infrastructure, GPU platforms, Kubernetes, observability, and automation.
  • Join a company with deep experience in OCP, datacenter, AI/HPC, and cloud infrastructure.
  • Build infrastructure for organizations where data location, compliance, and reliability truly matter.


Benefits

  • Competitive salary.
  • Pension scheme.
  • Visa sponsorship.
  • Company gym.
  • Modern office in Hoofddorp.
  • Informal and open working culture.
  • Participation in relevant conferences and exhibitions across Europe.
  • Opportunity to develop your skills in a fast-growing technology environment.


Our Work Culture

Circle B offers an informal working atmosphere with energetic people who enjoy being part of a growing technology company. We have an open management culture and encourage colleagues to contribute to improving our products, services, and processes.

If this sounds like a good fit, please send your CV and motivation letter to:

surbi@tauruseu.com

