DevSecOps/Platform Engineer (AI Infra)

integra.works • United Arab Emirates
Remote • Relocation
AI Summary

We are looking for an experienced DevSecOps/Platform Engineer to build, secure, and scale our production environment. The role includes complete ownership of production readiness, platform stability, observability, and deployment of self-hosted AI/ML models. This is a fully remote opportunity with potential for relocation.

Key Responsibilities
Own infrastructure and production readiness across Hetzner and Cloudflare
Lead Kubernetes cluster engineering including cluster bootstrap, RBAC and network policies
Design and operate highly available PostgreSQL and Kafka clusters
Technical Skills Required
Kubernetes, Linux, Networking, Terraform, GPU setups, CUDA, Inference optimization, Prometheus, Grafana, Alertmanager, Loki, OMD, Nagios, CheckMK, MinIO, Traefik ingress, Docker, Harbor registry, PostgreSQL, MongoDB, ClickHouse, Kafka, ZooKeeper, KRaft
Benefits & Perks
Fully remote opportunity
Potential for relocation to the UAE in 2027

Job Description


Job Summary

We are looking for an experienced DevSecOps / Platform Engineer (DevSecOps + AI Infra) to build, secure, and scale our fully self-hosted production environment across Hetzner, Kubernetes, Kafka, MinIO, GitLab, Redis, and AI/LLM infrastructure. This role includes complete ownership of production readiness, platform stability, observability, and deployment of self-hosted AI/ML models (LLMs, embeddings, vision models) on GPU/CPU infrastructure. This is a fully remote opportunity, with potential for relocation to the UAE in 2027, subject to business needs and mutual interest.

Responsibilities

  • Own infrastructure and production readiness across Hetzner and Cloudflare, including compute sizing (CPU/GPU/RAM/SSD), secure networking and firewalling, DNS/WAF/DDoS configuration, and automated backup, failover, and high-availability setup.
  • Own self-hosted GitLab CI/CD and source control, including runner setup (Docker/VM/Kubernetes), secure secrets management, multi-environment pipelines (dev → staging → prod) with approvals, and GitOps integration using ArgoCD or FluxCD.
  • Lead Kubernetes cluster engineering including cluster bootstrap (kubeadm/k3s/RKE/Terraform), RBAC and network policies, service mesh with mTLS (Istio/Linkerd), autoscaling (HPA/VPA), health probes, and robust etcd and cluster backup strategies.
  • Own end-to-end monitoring, observability, and logging using Prometheus, Grafana, Alertmanager, Loki, and OMD (Nagios/CheckMK), with comprehensive alerting across nodes, pods, applications, networking, Kafka, databases, and storage.
  • Manage and secure object storage using MinIO, including bucket policies, lifecycle rules, TLS integration, and secure credential management via Secrets/Vault.
  • Design and implement secure networking and TLS using Traefik ingress with automated certificate rotation, Cloudflare WAF and DDoS protection with rate limiting, zero-trust networking principles, mTLS, and API gateway–level security policies.
  • Implement secure container and image management using best-practice Docker builds (multi-stage, minimal base images), Harbor registry with RBAC, vulnerability scanning and image signing, along with automated image retention and cleanup policies.
  • Design and operate highly available PostgreSQL (pgvector/CloudNativePG), MongoDB, and ClickHouse clusters with operator-based deployments, replica sets, PITR backups, connection pooling, performance tuning, and full observability via Prometheus exporters.
  • Design and operate a highly available Kafka cluster (ZooKeeper/KRaft) with optimized topics, partitions, replication, lag monitoring, exporters, and robust retry/DLQ strategies for production reliability.
  • Provision and manage Hetzner and Kubernetes infrastructure using Terraform with modular multi-environment setups, remote state management, and CI/CD-driven plan/apply workflows with approval gates.
  • Implement end-to-end DevSecOps and compliance controls including secrets management, least-privilege RBAC, container image scanning, runtime security, and automated CVE detection and patching pipelines.
  • Design, deploy, and operate secure, scalable self-hosted AI/LLM infrastructure including GPU model serving, multi-model routing, vector databases, AI DevSecOps, monitoring, and CI/CD for model lifecycle management.
  • Define and maintain comprehensive backup and disaster recovery strategies for infrastructure, AI models, and object storage, including routine recovery testing and DR playbooks.
  • Ensure production-grade application readiness through ingress/load balancer configuration, environment-specific configs and secrets, and performance, load, and chaos testing.

Qualifications

  • Mid/Senior-level (4–10 years of experience).
  • Bachelor's degree in Computer Science or a related field.
  • Strong Kubernetes, Linux, networking, and Terraform expertise.
  • Hands-on with GPU setups, CUDA, inference optimization.
  • Experience with self-hosted AI/LLM models (Ollama, vLLM, TGI).
  • Strong observability & security foundations.
