Monitor and troubleshoot high-performance compute and AI cluster environments in a 24×7 operations center. Ensure reliability of distributed HPC systems and enterprise storage platforms. Support Kubernetes and Slurm-based compute environments, using Grafana for monitoring and Jira for incident management.
Job Description
Optomi, in partnership with our client, is seeking an IOC Systems Specialist to support a large-scale AI/HPC infrastructure environment focused on high-performance compute and data-intensive workloads.
This role sits in a 24×7 operations center and is responsible for monitoring, troubleshooting, and ensuring reliability of distributed HPC systems and enterprise storage platforms.
- Onsite Fort Worth, TX - relocation assistance available!
- Direct Hire
- On-call rotation
What you’ll do:
- Monitor and troubleshoot HPC and AI cluster environments in a Tier 2 IOC/NOC setting
- Support and troubleshoot enterprise storage systems (WEKA, VAST, Dell Isilon/PowerScale, or similar SAN/NAS)
- Investigate performance, connectivity, and network-related storage issues (including VLAN and configuration validation)
- Work with Kubernetes and Slurm-based compute environments
- Use monitoring tools (Grafana) and ticketing systems (Jira) for incident management
- Perform root cause analysis and collaborate with engineering teams for resolution
- Ensure system health, uptime, and performance across distributed infrastructure
What we’re looking for:
- Experience supporting enterprise storage or data center storage environments
- Strong troubleshooting skills across storage, network, and compute systems
- Familiarity with HPC or high-throughput infrastructure environments
- Understanding of networking concepts (VLANs, connectivity, throughput)
- Experience in operational support environments (IOC/NOC or Tier 2 support)
- Exposure to Kubernetes, Slurm, or similar orchestration/workload tools is a plus
Join a cutting-edge AI infrastructure company building sustainable, large-scale GPU compute environments powering next-generation workloads.