IOC Systems Specialist - HPC & AI Infrastructure Operations

Optomi, United States
AI Summary

Monitor and troubleshoot high-performance compute and AI cluster environments in a 24×7 operations center. Ensure reliability of distributed HPC systems and enterprise storage platforms. Support Kubernetes and Slurm-based compute environments, using Grafana for monitoring and Jira for incident management.

Key Highlights

  • 24×7 operations center support for AI/HPC infrastructure
  • Tier 2 IOC/NOC monitoring and troubleshooting
  • Enterprise storage systems (WEKA, VAST, Dell PowerScale)
  • Kubernetes and Slurm orchestration environments

Key Responsibilities

  • Monitor and troubleshoot HPC and AI cluster environments in a Tier 2 IOC/NOC setting
  • Support and troubleshoot enterprise storage systems (WEKA, VAST, Dell Isilon/PowerScale, or similar SAN/NAS)
  • Investigate performance, connectivity, and network-related storage issues (including VLAN and configuration validation)

Technical Skills Required

  • HPC infrastructure
  • Storage systems
  • Networking (VLANs)
  • Kubernetes

Benefits & Perks

  • Relocation assistance available

Nice to Have

  • Slurm-based compute environments
  • Grafana monitoring tools
  • Jira ticketing systems

Job Description


Optomi, in partnership with our client, is seeking an IOC Systems Specialist to support a large-scale AI/HPC infrastructure environment focused on high-performance compute and data-intensive workloads.


This role sits in a 24×7 operations center and is responsible for monitoring, troubleshooting, and ensuring reliability of distributed HPC systems and enterprise storage platforms.


  • Onsite Fort Worth, TX - relocation assistance available!
  • Direct Hire
  • On-call rotation


What you’ll do:

  • Monitor and troubleshoot HPC and AI cluster environments in a Tier 2 IOC/NOC setting
  • Support and troubleshoot enterprise storage systems (WEKA, VAST, Dell Isilon/PowerScale, or similar SAN/NAS)
  • Investigate performance, connectivity, and network-related storage issues (including VLAN and configuration validation)
  • Work with Kubernetes and Slurm-based compute environments
  • Use monitoring tools (Grafana) and ticketing systems (Jira) for incident management
  • Perform root cause analysis and collaborate with engineering teams for resolution
  • Ensure system health, uptime, and performance across distributed infrastructure


What we’re looking for:

  • Experience supporting enterprise storage or data center storage environments
  • Strong troubleshooting skills across storage, network, and compute systems
  • Familiarity with HPC or high-throughput infrastructure environments
  • Understanding of networking concepts (VLANs, connectivity, throughput)
  • Experience in operational support environments (IOC/NOC or Tier 2 support)
  • Exposure to Kubernetes, Slurm, or similar orchestration/workload tools is a plus

Join a cutting-edge AI infrastructure company building sustainable, large-scale GPU compute environments powering next-generation workloads.


