Senior MLOps Engineer

zerotoone.ai India
Remote

Job Description

The work

We run a Large Behavioral Model (LBM), a custom transformer-based architecture that generates billions of spatiotemporal predictions daily for Fortune 500 clients. The model is retrained on a regular cadence and serves predictions at scale.

You'll own the infrastructure that makes that possible: ML pipelines, cloud infrastructure across AWS and GCP, CI/CD for models, cost management, and monitoring. You'll work with ML engineers who depend on your systems being reliable, reproducible, and fast.

What you'll do
  • Build and maintain training and inference pipelines for the LBM, which serves billions of predictions daily
  • Operate scalable cloud infrastructure on AWS and GCP for training, deployment, and monitoring
  • Optimize cloud resource usage — cost-effective scaling without sacrificing availability or performance
  • Maintain CI/CD pipelines specifically for ML models, with proper dev / staging / prod separation
  • Implement automation for model lifecycle management, retraining, and data pipeline orchestration
  • Instrument model monitoring — performance, drift, latency, resource utilization — and wire up alerting
  • Collaborate with ML engineers and product teams on model performance, retraining cadence, and infrastructure
  • Document pipelines, deployments, and on-call runbooks
  • Evaluate new MLOps tooling and techniques, and apply them where they measurably improve the platform
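To make the monitoring bullet concrete: one common drift signal is the population stability index (PSI), which compares a feature's current distribution against its training-time distribution. The sketch below is illustrative only, built on stdlib Python with hypothetical names and thresholds; it is not zerotoone.ai's actual implementation.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample (e.g. the
    training distribution) and a live sample of the same feature."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        # Smooth each bin slightly so empty bins don't produce log(0).
        return [(c + 1e-6) / len(xs) for c in counts]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Common rule of thumb: PSI < 0.1 is stable, 0.1-0.25 is moderate drift,
# > 0.25 is often treated as a retraining or alerting trigger.
```

In a pipeline like the one described here, a score like this would typically be computed per feature on a schedule, exported as a metric, and wired into alerting so that drift can feed the retraining cadence.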


What we need
  • Bachelor's, Master's, or PhD in CS, Engineering, AI, or a related field
  • 3-5 years of hands-on experience in MLOps, ML platform, or ML infrastructure
  • Production experience deploying, managing, and scaling ML workloads on AWS or GCP (SageMaker, Vertex AI, EC2, GKE, EKS)
  • Strong proficiency with Docker, Kubernetes, and container orchestration
  • Experience designing and operating CI/CD pipelines for ML models
  • Performance tuning and cost optimization experience in cloud environments
  • Strong programming skills in Python
  • Working experience with PyTorch (or TensorFlow) in production
  • Solid understanding of machine learning concepts, including deep learning and transformer-based architectures
  • Strong written English and comfort working with a US-based team across time zones


Nice to have
  • Production experience with both AWS and GCP
  • Experience serving transformer-based models at scale (TorchServe, Triton, vLLM, or similar)
  • Distributed training experience (DDP, FSDP, Ray, or similar)
  • Terraform or other infrastructure-as-code
  • Feature store experience (Feast, Vertex AI Feature Store, SageMaker Feature Store, or similar)
  • Prometheus, Grafana, or similar observability stacks
  • Prior experience in a fast-paced startup environment
  • AdTech, MarTech, or consumer behavioral data domains


What we offer
  • Full-time, fully remote within India
  • Salary: ₹25-40 LPA depending on experience
  • Overlap with US hours: 3-4 hours with US Eastern, typically evenings IST
  • Hands-on work with a modern ML stack at petabyte scale
  • Direct collaboration with ML engineers and technical leads
  • Clear growth path with increasing ownership over time
