Senior AI/ML Operations Engineer

Jobgether • India
Remote
AI Summary

Jobgether is seeking a Senior AI/ML Operations Engineer to ensure the reliability, availability, and performance of AI/ML services in production environments. The role involves defining and maintaining SLOs/SLIs, monitoring and mitigating model drift and system issues, and designing observability solutions. The ideal candidate will have strong experience in operating distributed systems, cloud platforms, and CI/CD pipelines.

Key Responsibilities
Own the reliability, availability, and performance of AI/ML services in production environments
Define and maintain SLOs/SLIs for AI systems
Monitor, detect, and mitigate model drift and system issues
Design and implement observability solutions
Support deployment workflows for ML models
Operate and improve AI infrastructure components
Manage CI/CD pipelines and automation
Technical Skills Required
Kubernetes, Google Cloud Platform, Terraform, Datadog, Prometheus, Grafana, ELK stack, Docker, Python, Machine learning lifecycle concepts, LLM gateways, Vector databases, RAG systems
Benefits & Perks
Competitive annual salary
Fully remote work
Comprehensive health, accident, and retirement benefits
Paid holidays, generous leave policies, and wellness programs

Job Description


This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior AIOps Engineer I in India.

This role sits at the intersection of AI, machine learning, and platform reliability, focusing on ensuring that production AI systems operate efficiently, securely, and at scale. You will be responsible for maintaining and improving the operational health of AI/ML-powered services running in production environments. The position involves working closely with data scientists, ML engineers, and platform teams to ensure smooth deployment, monitoring, and lifecycle management of AI models. You will play a key role in building observability, automation, and infrastructure that supports reliable AI delivery. The environment is highly collaborative and fast-evolving, with a strong emphasis on scalability, cost optimization, and production readiness. This is a hands-on engineering role where your work directly impacts the stability and performance of AI-driven products used at scale.

Accountabilities

  • Own the reliability, availability, and performance of AI/ML services in production environments.
  • Define and maintain SLOs/SLIs for AI systems, ensuring alignment with user experience and business outcomes.
  • Monitor, detect, and mitigate model drift, performance degradation, and system issues in production.
  • Design and implement observability solutions including monitoring, logging, alerting, and dashboards for AI systems.
  • Support deployment workflows for ML models, including canary, blue/green, and A/B testing strategies.
  • Operate and improve AI infrastructure components such as model serving systems, LLM gateways, and RAG pipelines.
  • Manage CI/CD pipelines and automation to improve deployment reliability and reduce operational overhead.
  • Participate in incident management, on-call rotations, and post-incident reviews to improve system resilience.
  • Collaborate with cross-functional teams to ensure scalable, secure, and cost-efficient AI operations.

Requirements

  • 4+ years of software engineering experience, including at least 3 years in production systems, SRE, DevOps, or platform engineering roles.
  • Strong experience operating distributed systems on Kubernetes and cloud platforms.
  • Hands-on experience with Google Cloud Platform services such as GKE, BigQuery, Pub/Sub, Vertex AI, Cloud SQL, and GCS.
  • Solid understanding of CI/CD pipelines, infrastructure-as-code (Terraform preferred), and deployment automation.
  • Experience with monitoring, logging, and observability tools such as Datadog, Prometheus, Grafana, or ELK stack.
  • Familiarity with containerization and Docker image lifecycle management.
  • Understanding of ML lifecycle concepts including training, deployment, evaluation, and monitoring.
  • Exposure to AI/ML tooling such as LLM gateways, vector databases, RAG systems, or embedding pipelines is a strong plus.
  • Strong Python programming skills and solid software engineering fundamentals.
  • Excellent communication skills with the ability to work across technical and non-technical stakeholders.
  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).

Benefits

  • Competitive annual salary aligned with experience and market standards.
  • Fully remote work with structured overlap hours for global collaboration.
  • Comprehensive health, accident, and retirement benefits.
  • Paid holidays, generous leave policies, and wellness programs.
  • Exposure to cutting-edge AI/ML infrastructure and large-scale production systems.
  • Strong culture of learning, ownership, and cross-functional collaboration.
  • Opportunity to work on high-impact AI systems used in real-world production environments.
  • Inclusive and globally distributed team environment.

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

