Job Description
The work
We run a Large Behavioral Model — custom transformer-based architecture — that generates billions of spatiotemporal predictions daily for Fortune 500 clients. The model is retrained on a regular cadence and serves predictions at scale.
You'll own the infrastructure that makes that possible: ML pipelines, cloud infrastructure across AWS and GCP, CI/CD for models, cost management, and monitoring. You'll work with ML engineers who depend on your systems being reliable, reproducible, and fast.
Key Responsibilities
- Build and maintain training and inference pipelines for the LBM, handling millions of predictions daily
- Operate scalable cloud infrastructure on AWS and GCP for training, deployment, and monitoring
- Optimize cloud resource usage — cost-effective scaling without sacrificing availability or performance
- Maintain CI/CD pipelines specifically for ML models, with proper dev / staging / prod separation
- Implement automation for model lifecycle management, retraining, and data pipeline orchestration
- Instrument model monitoring — performance, drift, latency, resource utilization — and wire up alerting
- Collaborate with ML engineers and product teams on model performance, retraining cadence, and infrastructure
- Document pipelines, deployments, and on-call runbooks
- Evaluate new MLOps tooling and techniques, and apply them where they measurably improve the platform
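To make the monitoring responsibility concrete, here is a minimal, illustrative sketch of one common drift check — the Population Stability Index (PSI) — wired to a rule-of-thumb alert threshold. The data, bin count, and threshold are assumptions for the example, not a description of our actual stack:

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges come from the expected (baseline) sample; a common
    rule of thumb treats PSI > 0.2 as significant drift.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def frac(sample):
        # Histogram the sample into the baseline's bins, clamping
        # out-of-range values into the edge bins.
        counts = [0] * bins
        for x in sample:
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        return [c / len(sample) + eps for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Illustrative data: a uniform baseline and a shifted "drifted" sample.
baseline = [x / 100 for x in range(1000)]
drifted = [x / 100 + 3 for x in range(1000)]

if psi(baseline, drifted) > 0.2:
    pass  # in production this would page or open an incident
```

In practice a check like this runs on a schedule against fresh prediction inputs, with the result exported to the alerting stack rather than handled inline.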
Technical Skills Required
- Bachelor's, Master's, or PhD in CS, Engineering, AI, or a related field
- 3-5 years of hands-on experience in MLOps, ML platform, or ML infrastructure
- Production experience deploying, managing, and scaling ML workloads on AWS or GCP (SageMaker, Vertex AI, EC2, GKE, EKS)
- Strong proficiency with Docker, Kubernetes, and container orchestration
- Experience designing and operating CI/CD pipelines for ML models
- Performance tuning and cost optimization experience in cloud environments
- Strong programming skills in Python
- Working experience with PyTorch (or TensorFlow) in production
- Solid understanding of machine learning concepts, including deep learning and transformer-based architectures
- Strong written English and comfort working with a US-based team across time zones
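As one illustration of what "CI/CD pipelines for ML models" means day to day: a promotion gate that blocks a candidate model from reaching prod if it regresses against the current production model. The metric names, values, and tolerance below are invented for the example:

```python
def should_promote(candidate, production, max_regression=0.005):
    """Gate a model promotion in CI.

    Both arguments map metric name -> value; metrics are assumed
    higher-is-better (e.g. AUC, recall). The candidate must match
    production on every tracked metric within a small tolerance.
    Returns (ok, failures) where failures maps each failing metric
    to its (candidate, production) values.
    """
    failures = {
        name: (candidate.get(name, 0.0), baseline)
        for name, baseline in production.items()
        if candidate.get(name, 0.0) < baseline - max_regression
    }
    return len(failures) == 0, failures

# Illustrative metric snapshots.
prod = {"auc": 0.871, "recall_at_10": 0.642}
cand_ok = {"auc": 0.874, "recall_at_10": 0.640}   # tiny recall dip, within tolerance
cand_bad = {"auc": 0.843, "recall_at_10": 0.655}  # AUC regression, should block
```

In a real pipeline the metric snapshots would come from an evaluation job on a held-out set, and a failed gate would fail the CI stage rather than return a boolean.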
Nice to Have
- Production experience with both AWS and GCP
- Experience serving transformer-based models at scale (TorchServe, Triton, vLLM, or similar)
- Distributed training experience (DDP, FSDP, Ray, or similar)
- Terraform or other infrastructure-as-code
- Feature store experience (Feast, Vertex AI Feature Store, SageMaker Feature Store, or similar)
- Prometheus, Grafana, or similar observability stacks
- Prior experience in a fast-paced startup environment
- AdTech, MarTech, or consumer behavioral data domains
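On the observability side (Prometheus, Grafana), here is a small sketch of rendering metrics in the Prometheus text exposition format — the payload a `/metrics` endpoint returns. The metric names and labels are illustrative, not a real schema:

```python
def render_metrics(metrics):
    """Render {(name, labels): value} pairs in the Prometheus text
    exposition format.

    `labels` is a tuple of (key, value) pairs so the whole key is
    hashable. Real exporters also emit # HELP and # TYPE comment
    lines, omitted here for brevity.
    """
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

# Illustrative metrics for a model-serving service.
sample = {
    ("lbm_predictions_total", (("env", "prod"), ("model", "lbm-v3"))): 12345,
    ("lbm_inference_latency_seconds", (("env", "prod"), ("quantile", "0.99"))): 0.041,
}
```

In practice you would use an official Prometheus client library rather than hand-rolling this; the sketch just shows what the scrape target actually serves.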
Benefits & Perks
- Full-time, fully remote within India
- Salary: ₹25-40 LPA depending on experience
- Overlap with US hours: 3-4 hours with US Eastern, typically evenings IST
- Hands-on work with a modern ML stack at petabyte scale
- Direct collaboration with ML engineers and technical leads
- Clear growth path with increasing ownership over time