Jobgether is seeking a Senior AI/ML Operations Engineer to ensure the reliability, availability, and performance of AI/ML services in production environments. The role involves defining and maintaining SLOs/SLIs, monitoring and mitigating model drift and system issues, and designing observability solutions. The ideal candidate will have strong experience in operating distributed systems, cloud platforms, and CI/CD pipelines.
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior AIOps Engineer I in India.
This role sits at the intersection of AI, machine learning, and platform reliability, focusing on ensuring that production AI systems operate efficiently, securely, and at scale. You will be responsible for maintaining and improving the operational health of AI/ML-powered services running in production environments. The position involves working closely with data scientists, ML engineers, and platform teams to ensure smooth deployment, monitoring, and lifecycle management of AI models. You will play a key role in building observability, automation, and infrastructure that supports reliable AI delivery. The environment is highly collaborative and fast-evolving, with a strong emphasis on scalability, cost optimization, and production readiness. This is a hands-on engineering role where your work directly impacts the stability and performance of AI-driven products used at scale.
Accountabilities
- Own the reliability, availability, and performance of AI/ML services in production environments.
- Define and maintain SLOs/SLIs for AI systems, ensuring alignment with user experience and business outcomes.
- Monitor, detect, and mitigate model drift, performance degradation, and system issues in production.
- Design and implement observability solutions including monitoring, logging, alerting, and dashboards for AI systems.
- Support deployment workflows for ML models, including canary, blue/green, and A/B testing strategies.
- Operate and improve AI infrastructure components such as model serving systems, LLM gateways, and RAG pipelines.
- Manage CI/CD pipelines and automation to improve deployment reliability and reduce operational overhead.
- Participate in incident management, on-call rotations, and post-incident reviews to improve system resilience.
- Collaborate with cross-functional teams to ensure scalable, secure, and cost-efficient AI operations.
Requirements
- 4+ years of software engineering experience, including at least 3 years in production systems, SRE, DevOps, or platform engineering roles.
- Strong experience operating distributed systems on Kubernetes and cloud platforms.
- Hands-on experience with Google Cloud Platform services such as GKE, BigQuery, Pub/Sub, Vertex AI, Cloud SQL, and GCS.
- Solid understanding of CI/CD pipelines, infrastructure-as-code (Terraform preferred), and deployment automation.
- Experience with monitoring, logging, and observability tools such as Datadog, Prometheus, Grafana, or ELK stack.
- Familiarity with containerization and Docker image lifecycle management.
- Understanding of ML lifecycle concepts including training, deployment, evaluation, and monitoring.
- Exposure to AI/ML tooling such as LLM gateways, vector databases, RAG systems, or embedding pipelines is a strong plus.
- Strong Python programming skills and solid software engineering fundamentals.
- Excellent communication skills with the ability to work across technical and non-technical stakeholders.
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
Benefits
- Competitive annual salary aligned with experience and market standards.
- Fully remote work with structured overlap hours for global collaboration.
- Comprehensive health, accident, and retirement benefits.
- Paid holidays, generous leave policies, and wellness programs.
- Exposure to cutting-edge AI/ML infrastructure and large-scale production systems.
- Strong culture of learning, ownership, and cross-functional collaboration.
- Opportunity to work on high-impact AI systems used in real-world production environments.
- Inclusive and globally distributed team environment.
Why Apply Through Jobgether?
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.