Senior Network Engineer - AI Infrastructure & High-Performance Computing

asobbi • United States
Relocation
AI Summary

Design and deploy high-performance datacenter fabrics for AI cluster deployments, configure Arista or Mellanox switches with VXLAN/EVPN overlays, and implement network configurations for GPU cluster connectivity.

Key Highlights
High-performance datacenter fabrics
GPU cluster connectivity
Network automation using Ansible/Terraform
Technical Skills Required
BGP, OSPF, VXLAN, EVPN, Ansible, Terraform, Python, Arista switches, Mellanox/NVIDIA networking, Cisco Nexus datacenter switches, Palo Alto firewalls, InfiniBand, RoCEv2/RDMA networking, GPU cluster connectivity, HPC interconnects, high-bandwidth environments
Benefits & Perks
Competitive base salary up to $200K
Full benefits package
Relocation assistance available
Professional development budget for certifications and training

Job Description


Senior Network Engineer - AI Infrastructure & High-Performance Computing

Up to $250K base salary + 2x base in equity + bonus + much more!


About the Role

My client is building the network backbone for next-generation AI and machine learning infrastructure. This isn't traditional enterprise networking - you'll be designing and deploying the high-speed network fabrics that connect GPU clusters for large-scale training workloads. If you've worked with HPC environments, GPU clusters, or high-performance datacenter fabrics, this role could be a great fit.


What You'll Be Building

  • High-performance datacenter fabrics - Spine-leaf topologies optimized for GPU-to-GPU communication using modern protocols
  • Physical network infrastructure - Rack-level designs, structured cabling, and acceptance testing in production datacenters
  • Lossless Ethernet networks - Configure congestion control mechanisms for zero packet loss in AI training workloads
  • Automated provisioning systems - Infrastructure as Code using Ansible, Terraform, and Python to manage network devices
  • Performance validation - Throughput testing with iperf3, NCCL benchmarks, and monitoring with Prometheus/Grafana
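
For illustration, the sketch below shows the kind of throughput check the performance-validation bullet describes: it drives iperf3 from Python and reports the aggregate receive rate. The host name, stream count, and 300 Gb/s threshold are placeholder assumptions, and it presumes an iperf3 server is already listening on the target node.

import json
import subprocess

# Hypothetical GPU-node target; assumes "iperf3 -s" is already running there.
TARGET = "gpu-node-02.example.net"
EXPECTED_GBPS = 300  # placeholder acceptance threshold for a 400G link

# -c client mode, -J JSON output, -P parallel streams, -t duration in seconds
result = subprocess.run(
    ["iperf3", "-c", TARGET, "-J", "-P", "8", "-t", "10"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

# For TCP tests, iperf3 summarises all streams under end.sum_received
gbps = report["end"]["sum_received"]["bits_per_second"] / 1e9
print(f"{TARGET}: {gbps:.1f} Gb/s aggregate")
if gbps < EXPECTED_GBPS:
    print("WARNING: below acceptance threshold, check MTU and congestion-control settings")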


This Role Is For You If You Have:

✓ Experience with High-Performance Workloads

  • Worked with GPU cluster networks, HPC environments, or high-throughput computing infrastructure
  • Exposure to InfiniBand (EDR/HDR/NDR) or RDMA over Ethernet (RoCEv2) - even in lab/testing environments
  • Some understanding of lossless Ethernet concepts - Priority Flow Control (PFC), Explicit Congestion Notification (ECN)
  • Interest in high-bandwidth, low-latency networking for compute-intensive applications


✓ Strong Datacenter Fabric Experience

  • Deployed spine-leaf architectures in production (doesn't need to be massive scale)
  • Configured VXLAN/EVPN with BGP routing for modern datacenters
  • Worked with Arista switches OR Mellanox/NVIDIA networking gear (or willingness to learn)
  • Built non-blocking fabrics, or understand oversubscription trade-offs for compute workloads (a worked example follows this list)
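
As a worked example of the oversubscription point above: the ratio is simply server-facing bandwidth divided by spine-facing bandwidth per leaf. The port counts below are illustrative, not a description of the client's fabric.

# Illustrative leaf profile: 32 x 400G server ports, 8 x 400G uplinks to the spines.
server_ports, uplinks, port_gbps = 32, 8, 400

downlink_gbps = server_ports * port_gbps   # 12,800 Gb/s toward the GPUs
uplink_gbps = uplinks * port_gbps          # 3,200 Gb/s toward the spines
ratio = downlink_gbps / uplink_gbps

print(f"Oversubscription ratio: {ratio:.0f}:1")   # 4:1 in this example
# AI training fabrics usually target 1:1 (non-blocking): as much uplink
# capacity per leaf as server-facing capacity.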


✓ Hands-On Infrastructure Skills

  • Experience with physical datacenter implementations - racking equipment, cable management, labeling
  • Comfortable with structured cabling - fiber optics, copper, DAC cables
  • Can perform acceptance testing - link validation, basic throughput checks
  • Willing to travel to datacenters as needed for hands-on work


✓ Network Automation Experience

  • Built configurations using Ansible, Terraform, or Python (a small templating sketch follows this list)
  • Comfortable with command-line interfaces and scripting
  • Experience with version control (Git) for network configs
  • Familiarity with monitoring tools - Prometheus, Grafana, or similar platforms
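
To make the Infrastructure-as-Code expectation concrete, here is a minimal Python/Jinja2 sketch that renders a per-leaf BGP stanza from structured data. The variables are hypothetical, the CLI syntax is loosely modelled on Arista EOS and should be checked against the target platform, and in practice the rendered file would be committed to Git and pushed by Ansible or a similar tool rather than copied by hand.

from jinja2 import Template

# Hypothetical per-leaf variables; a real deployment would pull these from a
# source of truth such as NetBox rather than hard-coding them.
leaf = {"hostname": "leaf-101", "asn": 65101, "loopback": "10.0.0.101",
        "uplinks": ["Ethernet49/1", "Ethernet50/1"]}

BGP_TEMPLATE = Template("""\
hostname {{ hostname }}
router bgp {{ asn }}
   router-id {{ loopback }}
{% for intf in uplinks %}
   neighbor interface {{ intf }} peer-group SPINES
{% endfor %}
""", trim_blocks=True, lstrip_blocks=True)

config = BGP_TEMPLATE.render(**leaf)
with open(f"{leaf['hostname']}.cfg", "w") as fh:
    fh.write(config)
print(config)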


Required Technical Skills

Core Networking (Must Have):

  • 5+ years working with datacenter networks
  • Strong Layer 2/3 networking - BGP, OSPF, VXLAN, EVPN
  • Experience with spine-leaf topologies in production
  • Understanding of routing protocols and modern datacenter designs


High-Performance Networking (Preferred):

  • Experience with one or more of the following:
      • InfiniBand networks (any exposure counts)
      • RoCEv2 / RDMA networking
      • GPU cluster connectivity
      • HPC interconnects or high-bandwidth environments
  • Basic understanding of congestion control - ECN, PFC, jumbo frames (a quick MTU check sketch follows this list)
  • Interest in learning about lossless transport for RDMA workloads
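
Jumbo frames are easy to misconfigure end-to-end, so one quick sanity check (sketched below with an assumed peer host and Linux ping flags) is a do-not-fragment ping sized for a 9000-byte MTU:

import subprocess

# Hypothetical peer; 8972 = 9000-byte MTU minus 20-byte IP and 8-byte ICMP headers.
PEER = "leaf-101-svi.example.net"

# -M do forbids fragmentation (Linux iputils ping), so the probe only succeeds
# if every hop in the path actually carries 9000-byte frames.
probe = subprocess.run(
    ["ping", "-M", "do", "-s", "8972", "-c", "3", PEER],
    capture_output=True, text=True,
)
print("jumbo path OK" if probe.returncode == 0 else "jumbo path broken:\n" + probe.stdout)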


Vendor Platforms (Experience with One or More):

  • Arista switches - 7000/8000 series preferred, but any Arista experience valuable
  • Mellanox/NVIDIA networking - Spectrum switches, ConnectX NICs (even basic exposure)
  • Cisco Nexus datacenter switches - 9K/7K series
  • Palo Alto firewalls - configuration and policy management


Automation & Tooling (Must Have at Least Two):

  • Infrastructure as Code - Ansible, Terraform, or Python scripting
  • Version control - Git for managing network configurations
  • Linux familiarity - comfortable in command-line environments
  • Monitoring tools - Prometheus, Grafana, SolarWinds, or similar
  • DCIM platforms - NetBox, Device42, or asset management tools


Nice to Have (Bonus Skills)

  • InfiniBand experience - any hands-on work with IB fabrics
  • RoCEv2 or RDMA - configuration or testing experience
  • GPU cluster networking - NVIDIA NVLink, GPUDirect
  • Performance testing tools - iperf3, NCCL tests, network benchmarking
  • Bare metal provisioning - PXE boot, Redfish/IPMI
  • Cloud networking - AWS, Azure hybrid connectivity
  • Multi-tenant environments - namespace isolation, traffic segmentation


Day-to-Day Responsibilities

  • Design and deploy spine-leaf fabrics for AI cluster deployments
  • Configure Arista or Mellanox switches with VXLAN/EVPN overlays
  • Implement network configurations for GPU cluster connectivity
  • Perform rack-level implementations - cable routing, labeling, testing
  • Validate network performance using standard testing tools
  • Automate network provisioning using Ansible/Terraform
  • Monitor network health and troubleshoot performance issues (see the monitoring sketch after this list)
  • Work with datacenter teams on infrastructure requirements
  • Implement firewall policies for network segmentation
  • Document network designs and maintain topology diagrams
  • Travel to datacenters as needed (up to 20% travel)
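
For the monitoring responsibility above, here is a rough sketch of pulling interface throughput from Prometheus over its HTTP query API. The server URL, NIC name, and metric (node_exporter's node_network_receive_bytes_total scraped from the GPU hosts) are assumptions about the environment.

import requests

# Assumed Prometheus endpoint and node_exporter metric from the GPU hosts.
PROM = "http://prometheus.example.net:9090"
QUERY = 'rate(node_network_receive_bytes_total{device="ens1f0"}[5m]) * 8'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    host = series["metric"].get("instance", "unknown")
    bps = float(series["value"][1])
    print(f"{host}: {bps / 1e9:.2f} Gb/s received on ens1f0")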


Compensation & Benefits

  • Base Salary: Competitive, up to $200K depending on experience
  • Full benefits package - health, dental, vision, 401(k)
  • Relocation assistance available if needed
  • Professional development budget for certifications and training
  • Opportunity to work with cutting-edge AI infrastructure


What You'll Get

✓ Modern technology - Work with 400G networking, GPU architectures, automation tools

✓ High-impact work - Networks you build enable AI research and breakthroughs

✓ Greenfield deployments - Design networks without legacy constraints

✓ Technical focus - Engineering-first culture, minimal bureaucracy

✓ Career growth - Become an expert in AI infrastructure networking

✓ Learning opportunities - Hands-on with InfiniBand, RDMA, GPU networking technologies

How to Apply


If you have datacenter networking experience and an interest in HPC or GPU infrastructure, I'd love to hear from you.

