Staff AI Support Operations Engineer

CyberCoders, United States
Remote
AI Summary

We're seeking a Staff AI Support Operations Engineer to lead our new Ops team, architecting, deploying, and maintaining AI-optimized data center infrastructure. This role requires expertise in cluster engineering, infrastructure management, and automation. As a technical leader, you'll build operational standards and technical foundations for future engineers.

Key Highlights
Cluster Engineering & Operations
Infrastructure Source of Truth
Automation & Tooling
Tier 3 Escalation Lead
Documentation Excellence
Technical Leadership & Mentorship
Key Responsibilities
Collaborate with engineering teams to architect, deploy, and bring new AI compute clusters online
Deliver expert-level support for existing high-density GPU environments
Own NetBox and related internal systems, ensuring all infrastructure data is accurate, consistent, and reliably maintained
Build and refine internal automation using Python, Ansible, and Terraform to eliminate manual workflows and modernize fragile legacy processes
Serve as the highest technical escalation point for customer and internal issues prior to involvement from Platform or Network/Undercloud teams
Transform tribal knowledge into clear, durable SOPs and technical documentation that establish the operational 'gold standard'
Technical Skills Required
Python, Ansible, Terraform, HPE, Dell, SuperMicro, IPMI, BMC, iDRAC, Redfish, Linux, Kubernetes, NVIDIA Blackwell, InfiniBand, RoCE, Weka, VAST Data, OpenStack, Canonical MAAS
Benefits & Perks
$150-200k/year + BONUS + RSUs
Fully REMOTE
Nice to Have
Next-Generation GPU Hardware
High-Performance Fabrics
Bare-Metal Provisioning

Job Description


Title: Staff AI Support Operations Engineer
Location: Fully REMOTE!
Salary: $150-200k/year + BONUS + RSUs 

We're not following someone else's cloud blueprint - we're creating the next one. While legacy providers hand you a finished process, we're engineering the next generation of AI-optimized data center infrastructure from the ground up.

As our first internal Staff AI Support Operations Engineer, you'll be a foundational technical leader on a brand-new Ops team. This is a role for an architect-practitioner: the kind of engineer who can untangle a complex InfiniBand issue one hour and automate away the root cause the next. You won't just maintain systems - you'll build the operational standards and technical foundations that every future engineer will rely on.

Key Responsibilities

  • Cluster Engineering & Operations: Collaborate with engineering teams to architect, deploy, and bring new AI compute clusters online while delivering expert-level support for existing high-density GPU environments
  • Infrastructure Source of Truth: Own NetBox and related internal systems, ensuring all infrastructure data is accurate, consistent, and reliably maintained
  • Automation & Tooling: Build and refine internal automation using Python, Ansible, and Terraform to eliminate manual workflows and modernize fragile legacy processes
  • Tier 3 Escalation Lead: Serve as the highest technical escalation point for customer and internal issues prior to involvement from Platform or Network/Undercloud teams
  • Documentation Excellence: Transform tribal knowledge into clear, durable SOPs and technical documentation that establish the operational "gold standard"
  • Technical Leadership & Mentorship: Raise the technical bar for the team through code reviews, architectural guidance, and mentorship as the organization scales

Qualifications

  • Enterprise-Grade Server Proficiency: Advanced operational knowledge of HPE, Dell, and SuperMicro platforms, including IPMI, BMC, iDRAC workflows, and familiarity with Redfish-based management.
  • Core Engineering Toolkit: Mastery of Python, Ansible, and Terraform as primary tools for automation, orchestration, and infrastructure lifecycle management.
  • Linux Performance Engineering: Strong capability in diagnosing and tuning Linux systems, resolving performance bottlenecks, and optimizing workloads at the OS level.
  • Advanced Incident Resolution: Demonstrated experience serving as the final technical escalation point for complex, high-impact infrastructure failures.
  • Cloud-Native Operations: Proven production experience operating and troubleshooting Kubernetes environments.

Nice to Have

  • Next-Generation GPU Hardware: Familiarity with NVIDIA Blackwell (B200/B300) or Hopper (H100/H200) architectures.
  • High-Performance Fabrics: Experience with InfiniBand or RoCE networking, and modern high-throughput storage platforms such as Weka or VAST Data.
  • Bare-Metal Provisioning: Exposure to OpenStack or Canonical MAAS for automated provisioning of physical infrastructure.

Legacy is predictable. Safe. Slow. We're none of those things. We're building the Neo-Cloud at AI speed, and the rules aren't handed to you - you define them. If you're ready to trade routine for impact and build systems that actually move the company forward, let's talk.



Email Your Resume In Word To
Sean.Gur@CyberCoders.com
Looking forward to receiving your resume through our website and going over the position with you. Clicking apply is the best way to apply.
Please do NOT change the email subject line in any way. You must keep the JobID: linkedin : AH12-1982723 -- in the email subject line for your application to be considered.
Sean Gur - Lead Recruiter

For this position, you must be currently authorized to work in the United States without the need for sponsorship for a non-immigrant visa. This is a new role.

This job was first posted by CyberCoders on 04/03/2026 and applications will be accepted on an ongoing basis until the position is filled or closed.

This job was posted on 04/03/2026 and is open for 90 days

CyberCoders is proud to be an Equal Opportunity Employer

All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, sexual orientation, gender identity or expression, national origin, ancestry, citizenship, genetic information, registered domestic partner status, marital status, status as a crime victim, disability, protected veteran status, or any other characteristic protected by law. Our hiring process includes AI screening for keywords and minimum qualifications. Recruiters review all results.  CyberCoders will consider qualified applicants with criminal histories in a manner consistent with the requirements of applicable state and local law, including but not limited to the Los Angeles County Fair Chance Ordinance, the San Francisco Fair Chance Ordinance, and the California Fair Chance Act. CyberCoders is committed to working with and providing reasonable accommodation to individuals with physical and mental disabilities. Individuals needing special assistance or an accommodation while seeking employment can contact a member of our Human Resources team at Benefits@CyberCoders.com to make arrangements.

Copyright 1999 - 2026. CyberCoders, Inc. All rights reserved.
