Senior Network Engineer - AI Infrastructure & High-Performance Computing

asobbi • United States
Relocation
AI Summary

Design and deploy high-performance datacenter fabrics for AI cluster deployments. Configure Arista or Mellanox switches with VXLAN/EVPN overlays. Implement network configurations for GPU cluster connectivity.

Key Highlights
High-performance datacenter fabrics design and deployment
Arista or Mellanox switch configuration with VXLAN/EVPN overlays
GPU cluster connectivity network configuration
Network automation using Ansible, Terraform, or Python scripting
Performance validation using iperf3, NCCL benchmarks, and monitoring with Prometheus/Grafana
Technical Skills Required
Ansible, Terraform, Python, BGP, OSPF, VXLAN, EVPN, InfiniBand, RoCEv2, RDMA, GPU cluster connectivity, HPC interconnects, Linux, Git, Prometheus, Grafana, NetBox, Device42
Benefits & Perks
Competitive base salary up to $200K
Full benefits package
Relocation assistance available
Professional development budget for certifications and training

Job Description


Network Engineer - AI Infrastructure & High-Performance Computing

Up to $250K base + 2x base in equity + bonus + much more!


About the Role

My client is building the network backbone for next-generation AI and machine learning infrastructure. This isn't traditional enterprise networking - you'll be designing and deploying the high-speed network fabrics that connect GPU clusters for large-scale training workloads. If you've worked with HPC environments, GPU clusters, or high-performance datacenter fabrics, this role could be a great fit.


What You'll Be Building

  • High-performance datacenter fabrics - Spine-leaf topologies optimized for GPU-to-GPU communication using modern protocols
  • Physical network infrastructure - Rack-level designs, structured cabling, and acceptance testing in production datacenters
  • Lossless Ethernet networks - Configure congestion control mechanisms for zero packet loss in AI training workloads
  • Automated provisioning systems - Infrastructure as Code using Ansible, Terraform, and Python to manage network devices
  • Performance validation - Throughput testing with iperf3, NCCL benchmarks, and monitoring with Prometheus/Grafana
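
To give a flavor of the performance validation work, here is a minimal Python sketch that reads a JSON report from iperf3 and checks the received throughput against the link speed. The 90% pass threshold, the function names, and the sample figures are illustrative assumptions, not the client's acceptance criteria; the sketch assumes a TCP test whose report carries `end.sum_received.bits_per_second`.

```python
import json


def received_gbps(iperf3_json: str) -> float:
    """Extract receiver-side throughput in Gbit/s from an iperf3 JSON report."""
    report = json.loads(iperf3_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def passes_acceptance(iperf3_json: str, link_gbps: float,
                      min_fraction: float = 0.9) -> bool:
    """Return True if throughput reaches min_fraction of the nominal link speed.

    min_fraction=0.9 is an assumed threshold chosen for illustration only.
    """
    return received_gbps(iperf3_json) >= link_gbps * min_fraction


# Trimmed, made-up sample in the shape of an iperf3 JSON report (TCP test).
sample = json.dumps({"end": {"sum_received": {"bits_per_second": 94.1e9}}})

print(f"{received_gbps(sample):.1f} Gbit/s, "
      f"pass at 100G: {passes_acceptance(sample, link_gbps=100)}")
```

In practice a single iperf3 stream may not saturate very fast links, so a check like this is usually run with parallel streams and complemented by NCCL benchmarks for GPU-to-GPU traffic.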


This Role Is For You If You Have:

✓ Experience with High-Performance Workloads

  • Worked with GPU cluster networks, HPC environments, or high-throughput computing infrastructure
  • Exposure to InfiniBand (EDR/HDR/NDR) or RDMA over Ethernet (RoCEv2) - even in lab/testing environments
  • Some understanding of lossless Ethernet concepts - Priority Flow Control (PFC), Explicit Congestion Notification (ECN)
  • Interest in high-bandwidth, low-latency networking for compute-intensive applications


✓ Strong Datacenter Fabric Experience

  • Deployed spine-leaf architectures in production (doesn't need to be massive scale)
  • Configured VXLAN/EVPN with BGP routing for modern datacenters
  • Worked with Arista switches OR Mellanox/NVIDIA networking gear (or willingness to learn)
  • Built non-blocking fabrics or understand oversubscription for compute workloads
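
As a quick reference for the last point: a leaf's oversubscription is the ratio of server-facing bandwidth to uplink bandwidth, and a non-blocking design keeps that ratio at 1:1 or better. A minimal Python sketch, with hypothetical port counts and speeds chosen purely for illustration:

```python
from fractions import Fraction


def oversubscription_ratio(downlinks: int, downlink_gbps: int,
                           uplinks: int, uplink_gbps: int) -> Fraction:
    """Return a leaf's downlink-to-uplink bandwidth as an exact ratio."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("a leaf needs at least one uplink with positive speed")
    return Fraction(downlinks * downlink_gbps, uplinks * uplink_gbps)


def is_non_blocking(ratio: Fraction) -> bool:
    """A leaf is non-blocking when uplink capacity covers all downlink capacity."""
    return ratio <= 1


# Hypothetical GPU leaf: 32 x 400G to servers, 32 x 400G to spines.
gpu_leaf = oversubscription_ratio(32, 400, 32, 400)
# Hypothetical general-purpose leaf: 48 x 25G to servers, 6 x 100G to spines.
general_leaf = oversubscription_ratio(48, 25, 6, 100)

for name, ratio in (("GPU leaf", gpu_leaf), ("general leaf", general_leaf)):
    print(f"{name}: {ratio.numerator}:{ratio.denominator}, "
          f"non-blocking={is_non_blocking(ratio)}")
```

GPU training fabrics are typically designed toward 1:1, while general compute racks often tolerate a higher ratio.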


✓ Hands-On Infrastructure Skills

  • Experience with physical datacenter implementations - racking equipment, cable management, labeling
  • Comfortable with structured cabling - fiber optics, copper, DAC cables
  • Can perform acceptance testing - link validation, basic throughput checks
  • Willing to travel to datacenters as needed for hands-on work


✓ Network Automation Experience

  • Built configurations using Ansible, Terraform, or Python
  • Comfortable with command-line interfaces and scripting
  • Experience with version control (Git) for network configs
  • Familiarity with monitoring tools - Prometheus, Grafana, or similar platforms


Required Technical Skills

Core Networking (Must Have):

  • 5+ years working with datacenter networks
  • Strong Layer 2/3 networking - BGP, OSPF, VXLAN, EVPN
  • Experience with spine-leaf topologies in production
  • Understanding of routing protocols and modern datacenter designs


High-Performance Networking (Preferred):

  • Experience with one or more of the following:
      • InfiniBand networks (any exposure counts)
      • RoCEv2 / RDMA networking
      • GPU cluster connectivity
      • HPC interconnects or high-bandwidth environments
  • Basic understanding of congestion control - ECN, PFC, jumbo frames
  • Interest in learning about lossless transport for RDMA workloads


Vendor Platforms (Experience with One or More):

  • Arista switches - 7000/8000 series preferred, but any Arista experience valuable
  • Mellanox/NVIDIA networking - Spectrum switches, ConnectX NICs (even basic exposure)
  • Cisco Nexus datacenter switches - 9K/7K series
  • Palo Alto firewalls - configuration and policy management


Automation & Tooling (Must Have at Least Two):

  • Infrastructure as Code - Ansible, Terraform, or Python scripting
  • Version control - Git for managing network configurations
  • Linux familiarity - comfortable in command-line environments
  • Monitoring tools - Prometheus, Grafana, SolarWinds, or similar
  • DCIM platforms - NetBox, Device42, or asset management tools


Nice to Have (Bonus Skills)

  • InfiniBand experience - any hands-on work with IB fabrics
  • RoCEv2 or RDMA - configuration or testing experience
  • GPU cluster networking - NVIDIA NVLink, GPUDirect
  • Performance testing tools - iperf3, NCCL tests, network benchmarking
  • Bare metal provisioning - PXE boot, Redfish/IPMI
  • Cloud networking - AWS, Azure hybrid connectivity
  • Multi-tenant environments - namespace isolation, traffic segmentation


Day-to-Day Responsibilities

  • Design and deploy spine-leaf fabrics for AI cluster deployments
  • Configure Arista or Mellanox switches with VXLAN/EVPN overlays
  • Implement network configurations for GPU cluster connectivity
  • Perform rack-level implementations - cable routing, labeling, testing
  • Validate network performance using standard testing tools
  • Automate network provisioning using Ansible/Terraform
  • Monitor network health and troubleshoot performance issues
  • Work with datacenter teams on infrastructure requirements
  • Implement firewall policies for network segmentation
  • Document network designs and maintain topology diagrams
  • Travel to datacenters as needed (up to 20% travel)


Compensation & Benefits

  • Base Salary: Competitive, up to $200K depending on experience
  • Full benefits package - health, dental, vision, 401(k)
  • Relocation assistance available if needed
  • Professional development budget for certifications and training
  • Opportunity to work with cutting-edge AI infrastructure


What You'll Get

✓ Modern technology - Work with 400G networking, GPU architectures, automation tools

✓ High-impact work - Networks you build enable AI research and breakthroughs

✓ Greenfield deployments - Design networks without legacy constraints

✓ Technical focus - Engineering-first culture, minimal bureaucracy

✓ Career growth - Become an expert in AI infrastructure networking

✓ Learning opportunities - Hands-on with InfiniBand, RDMA, GPU networking technologies

How to Apply


If you have datacenter networking experience and an interest in HPC, GPU clusters, or other high-performance environments, I'd love to hear from you.

