We are looking for an experienced DevSecOps/Platform Engineer to build, secure, and scale our production environment. The role includes complete ownership of production readiness, platform stability, observability, and deployment of self-hosted AI/ML models. This is a fully remote opportunity with potential for relocation.
Job Description
Job Summary
We are looking for an experienced DevSecOps / Platform Engineer (DevSecOps + AI Infra) to build, secure, and scale our fully self-hosted production environment across Hetzner, Kubernetes, Kafka, MinIO, GitLab, Redis, and AI/LLM infrastructure. This role includes complete ownership of production readiness, platform stability, observability, and deployment of self-hosted AI/ML models (LLMs, embeddings, vision models) on GPU/CPU infrastructure. This is a fully remote opportunity, with potential for relocation to the UAE in 2027, subject to business needs and mutual interest.
Responsibilities
- Own infrastructure and production readiness across Hetzner and Cloudflare, including compute sizing (CPU/GPU/RAM/SSD), secure networking and firewalling, DNS/WAF/DDoS configuration, and automated backup, failover, and high-availability setup.
- Own self-hosted GitLab CI/CD and source control, including runner setup (Docker/VM/Kubernetes), secure secrets management, multi-environment pipelines (dev → staging → prod) with approvals, and GitOps integration using ArgoCD or FluxCD.
- Lead Kubernetes cluster engineering including cluster bootstrap (kubeadm/k3s/RKE/Terraform), RBAC and network policies, service mesh with mTLS (Istio/Linkerd), autoscaling (HPA/VPA), health probes, and robust etcd and cluster backup strategies.
- Own end-to-end monitoring, observability, and logging using Prometheus, Grafana, Alertmanager, Loki, and OMD (Nagios/CheckMK), with comprehensive alerting across nodes, pods, applications, networking, Kafka, databases, and storage.
- Manage and secure object storage using MinIO, including bucket policies, lifecycle rules, TLS integration, and secure credential management via Secrets/Vault.
- Design, implement, and manage secure networking and TLS using Traefik ingress with automated certificate rotation, Cloudflare WAF and DDoS protection with rate limiting, zero-trust networking principles, mTLS, and API gateway-level security policies.
- Implement secure container and image management using best-practice Docker builds (multi-stage, minimal base images), Harbor registry with RBAC, vulnerability scanning and image signing, along with automated image retention and cleanup policies.
- Design and operate highly available PostgreSQL (pgvector/CloudNativePG), MongoDB, and ClickHouse clusters with operator-based deployments, replica sets, PITR backups, connection pooling, performance tuning, and full observability via Prometheus exporters.
- Design and operate a highly available Kafka cluster (ZooKeeper/KRaft) with optimized topics, partitions, replication, lag monitoring, exporters, and robust retry/DLQ strategies for production reliability.
- Provision and manage Hetzner and Kubernetes infrastructure using Terraform with modular multi-environment setups, remote state management, and CI/CD-driven plan/apply workflows with approval gates.
- Implement end-to-end DevSecOps and compliance controls including secrets management, least-privilege RBAC, container image scanning, runtime security, and automated CVE detection and patching pipelines.
- Design, deploy, and operate secure, scalable self-hosted AI/LLM infrastructure including GPU model serving, multi-model routing, vector databases, AI DevSecOps, monitoring, and CI/CD for model lifecycle management.
- Define and maintain comprehensive backup and disaster recovery strategies for infrastructure, AI models, and object storage, including routine recovery testing and DR playbooks.
- Ensure production-grade application readiness through ingress/load balancer configuration, environment-specific configs and secrets, and performance, load, and chaos testing.
Requirements
- Mid/Senior-level (4–10 years of experience).
- Bachelor's degree in Computer Science or a related field.
- Strong Kubernetes, Linux, networking, and Terraform expertise.
- Hands-on experience with GPU setups, CUDA, and inference optimization.
- Experience with self-hosted AI/LLM models (Ollama, vLLM, TGI).
- Strong observability & security foundations.