You will own the observability platform, GPU metering-to-billing pipeline, and AI service catalogue for a sovereign EU GPU cloud. You are responsible for making GPU clusters observable, measurable, billable, and reliable. Requires 5+ years in platform engineering/SRE with deep Kubernetes and observability expertise.
Job Description
What we do
Circle B builds sustainable IT infrastructure for the AI and cloud era. For over a decade, we have designed and deployed datacenter, edge, and AI/HPC systems on Open Compute Project (OCP) hardware. We are independent, vendor-neutral, and ISO 9001 / 14001 / 27001 certified, with deployments across multiple countries.
Our newest initiative is a sovereign EU GPU cloud — operating under full Dutch/EU jurisdiction and beyond the reach of the US CLOUD Act, for regulated European organizations that cannot compromise on where their data lives.
The Role
You own everything above the cluster: the observability platform, the GPU metering-to-billing pipeline, the operation of the AI service catalogue, and the SLOs that define reliability. Your job is to make the GPU clusters delivered to you observable, measurable, billable, and reliable.
What you will own
Observability & Reliability
- Build and operate the full observability stack: metrics (Prometheus + long-term store such as VictoriaMetrics/Thanos), logs (Loki + a forwarder), and alerting (Alertmanager)
- Integrate DCGM GPU telemetry into the pipeline: utilisation, memory, temperature, power, SM activity — per GPU, per tenant
- Surface GPU and fabric health signals (DCGM XID errors, NCCL/RDMA, RoCE/link health) supplied by the Infra and Network engineers in a single pane of glass: you own dashboards, alerting, and tenant-impact SLOs; Infra/Network own the fabric remediation
- Build the dashboarding capability (Grafana with SSO), ready for tenants at launch
- Define SLOs, SLIs, and error budgets; implement burn-rate alerting
- Implement OpenTelemetry instrumentation across platform services; lay the alerting and runbook foundations for reliable operation later
Metering, Billing & Service Delivery
- Build the GPU metering pipeline: DCGM + scheduler/namespace state → accurate per-tenant usage events (GPU-hours per tenant) with idempotency and reconciliation → a usage-based billing engine, billing primarily on allocated/reserved GPU-time
- Deploy cost-attribution (OpenCost or equivalent) for per-tenant GPU and infrastructure showback/chargeback
- Integrate a billing engine (Lago, Stripe Billing, or Metronome): flat fee + storage/egress overages, SEPA B2B direct debit
- Deploy and operate the AI service catalogue — model-serving runtimes (KServe / NVIDIA NIM Operator, Triton, vLLM), notebooks (JupyterHub), experiment tracking (MLflow), a vector DB (Qdrant), and a distributed-job framework (Ray/KubeRay) — as repeatable, GitOps-driven templates. Operation, not model authoring
- Operate the air-gapped registry and controlled artifact seeding into the sovereign zone (working with Network on the path policy)
- Expose platform APIs (gateway, rate limiting, integration glue) for the future customer portal
- Build the tenant onboarding/offboarding automation — the path from a new tenant to running GPU workloads, end to end
What We Are Looking For
Required Skills
- 5+ years in platform engineering / SRE, deploying and operating complex service stacks on K8s (operators, CRDs, Helm, scheduling, multi-tenancy / quota)
- Go and/or Python to a software-engineering standard — control-plane services and automation, not just scripting
- Production observability end-to-end: Prometheus + a long-term store (VictoriaMetrics / Thanos / Mimir), Grafana, Alertmanager, Loki, OpenTelemetry; designing SLIs, SLOs, and error budgets
- GPU observability: running the DCGM exporter and surfacing GPU/fabric health (XID, NCCL, RoCE) into dashboards and alerts (surface, not fabric-remediate)
- Ability to build a GPU metering pipeline: DCGM + scheduler/namespace state → accurate, idempotent per-tenant usage events → a billing engine
- Deploy and operate (not author) model-serving + ML-platform services on K8s: KServe and/or NIM/Triton/vLLM, plus Ray, JupyterHub, MLflow, Harbor, a vector DB
- Infrastructure-as-Code + GitOps: Helm, Kustomize, and ArgoCD or Flux in production
- Strong Linux fundamentals and production troubleshooting (systemd, container runtimes, REST/gRPC APIs, event-driven pipelines)
- Self-driven and comfortable with the breadth of a pre-launch platform: context-switching across observability, metering, service delivery, and onboarding
Preferred Skills
- Prior hands-on experience with a commercial or OSS usage-based billing engine (Lago, OpenMeter, Metronome, Stripe Billing, Amberflo) and event-driven metering; EU payment integration (SEPA, Mollie, or Stripe EU) a plus
- NVIDIA AI Enterprise / NIM Operator; Run:ai or KAI Scheduler for GPU quota, fair-share, and multi-tenancy
- OpenCost / FinOps: GPU cost allocation, chargeback / showback
- Container registry operations in air-gapped environments (Harbor, Quay, or similar)
- Identity & access integration (OIDC/SAML, SSO, tenant-scoped RBAC) and API gateways (Kong, Envoy, Traefik)
- EU regulatory awareness: GDPR, AI Act, DORA, NIS2, NEN 7510 in a platform context
- TypeScript (billing/admin tooling; interfacing with the future Fullstack portal)
Nice to Have
- Event streaming for telemetry / metering
- Experience operating a Kubernetes-native multi-cluster management platform
- Experience at an AI neocloud, GPU cloud provider, or managed ML platform
- Familiarity with NCCL / RDMA fabric troubleshooting
Why Join Us
- Help build a sovereign EU GPU cloud from the ground up.
- Own a critical platform layer, not just tickets or maintenance.
- Work on modern AI infrastructure, GPU platforms, Kubernetes, observability, and automation.
- Join a company with deep experience in OCP, datacenter, AI/HPC, and cloud infrastructure.
- Build infrastructure for organizations where data location, compliance, and reliability truly matter.
Benefits
- Competitive salary.
- Pension scheme.
- Visa sponsorship.
- Company gym.
- Modern office in Hoofddorp.
- Informal and open working culture.
- Participation in relevant conferences and exhibitions across Europe.
- Opportunity to develop your skills in a fast-growing technology environment.
Our Work Culture
Circle B offers an informal working atmosphere with energetic people who enjoy being part of a growing technology company. We have an open management culture and encourage colleagues to contribute to improving our products, services, and processes.
If this sounds like a good fit, please send your CV and motivation letter to:
surbi@tauruseu.com