HPC Engineer

Job Description


What You Will Do

Join the High Performance Operations Group (HPC-OPS) in operating and maintaining some of the fastest supercomputers in the world. Designing, operating, and maintaining these systems requires highly skilled personnel who specialize in both the hardware and software aspects of High Performance Computing. Innovators at heart, HPC-OPS Linux Administrators work both independently and collaboratively to maintain and implement capability improvements across a complex computing environment. This team is currently building on-premises cloud-like infrastructure to support the AI/ML/LLM needs of the laboratory.

The Platforms Team is seeking to add highly knowledgeable and motivated team members to help build and deploy the AI/ML/LLM infrastructure for LANL. This person will be an expert Linux Administrator who will help design, build, and run our production Nvidia DGX/HGX pods optimized for our environment and workflow. They will run and manage both admin and user-facing services with an understanding of modern AI/ML/LLM user workflows, Kubernetes, and other common tools. The successful candidate will participate in periodic on-call responsibilities managing HPC clusters and AI infrastructure, while actively growing their technical skills and staying up to date with the latest technologies in the field. In addition, the selected candidate will have the opportunity to develop technical products such as technical documentation, presentations, technical papers, and reports, to communicate findings internally and at conferences.

The selected HPC/AI Linux Administrator (HPC Engineer 2/3) will provide strategic design, testing, analysis, administration, configuration management, verification, and validation of the newly developed cloud-like infrastructure and specialized compute infrastructure for AI/ML workloads. Mentoring of students, junior staff, and peers in technical and professional growth activities is highly valued, as is maintaining state-of-the-art technical expertise and knowledge within HPC system administration and developing new skills in related disciplines. This is your chance to directly support our national security mission and continue to make LANL the best place to work as a member of a dynamic, team-oriented, and leading-edge technical capability team.

This position will be filled at either the HPC Engineer 2 or HPC Engineer 3 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.

What You Need

Minimum Job Requirements:

HPC Engineer 2: ($104,100 - $172,200)

  • Advanced Linux Administration Expertise: Demonstrated knowledge of administering production Linux computer systems, including strong command line Linux operating system skills, working knowledge of or experience with hardware and software security practices, and experience scripting in Bash, Python, or similar languages.
  • Configuration Management Expertise: Demonstrated experience with configuration and automation tools and practices, such as Chef, Puppet, Ansible, Salt, CFEngine, or similar tools.
  • Troubleshooting and Technical Analysis: Significant knowledge and demonstrated experience in formulating and testing hypotheses, investigating alternative solutions, and recommending solutions to technical problems.
  • Computer Networking Expertise: Working knowledge of networking concepts and practices.
  • Communication and Teaming Skills: Demonstrated effective communication skills, both verbal and written, including the ability to communicate technical information to both technical and non-technical personnel, to provide assistance and knowledge to peers, to collaborate with Group members, other HPC Group personnel and vendor representatives, as required, and to formulate and communicate technical results and findings to technical audiences and readerships (examples can include publications, team projects, and presentations).
  • Troubleshooting skills: Demonstrated ability to troubleshoot hardware and software errors, prioritizing problems and assessing impact to stakeholders, documenting problems and solutions.

HPC Engineer 3: ($125,200 - $211,300)

In addition to the requirements outlined at the lower level, the following are required at this level:

  • Container Orchestration Expertise: Demonstrated experience managing, administering and maintaining large production Kubernetes clusters.
  • Troubleshooting Expertise: Experience troubleshooting and debugging user workflows in a Kubernetes environment.
  • Computer Networking Expertise: Experience with high performance interconnects, preferably NVLink and InfiniBand networks.
  • Leadership: Demonstrated experience with project planning and management. Ability to develop and lead complex projects, generate formal project plans, delegate tasks, and provide routine updates to management.
  • Filesystems: Knowledge of or demonstrated experience with parallel and distributed storage systems (e.g. Lustre); knowledge of file systems such as ZFS, EXT, XFS; working knowledge of file system structures and algorithms; and/or experience with Object storage and RESTful storage interfaces. Experience administering cluster storage technologies such as Ceph.
  • HPC Experience: Demonstrated experience building, installing, and administering HPC systems. Experience with modern image building and provisioning tools.
  • Mentoring: Ability to mentor and lead individual junior team members and students.

Education/Experience at HPC Engineer 2:

The position requires a bachelor's degree in Computer Science, Computer Engineering, or a related field, and 3 years of relevant experience in high performance computing, scalable AI computing, or data center environments, or an equivalent combination of education and experience in related fields.

Education/Experience at HPC Engineer 3:

The position requires a bachelor's degree in Computer Science, Computer Engineering, or a related field, and 6 years of relevant experience in high performance computing, scalable AI computing, or data center environments, or an equivalent combination of education and experience in related fields.

Desired Qualifications:

  • Experience running Nvidia DGX/HGX/NVL72 systems or pods in a production environment.
  • Experience using the Nvidia Base Command Manager for system administration of NVL72 clusters.
  • Strong understanding of AI/ML workflows and experience setting up and maintaining user-facing AI/ML tools and services (such as JupyterHub).
  • Experience writing and debugging Kubernetes microservices in Go.
  • Knowledge of Cloud technologies.
  • Experience integrating operational metrics into a monitoring system such as Splunk.
  • Demonstrated effective communication skills, including demonstrated ability to work productively with customers and vendors.
  • High attention to detail including excellent organizational skills, analytical thinking, observational and problem-solving skills. Proven ability to independently multi-task and adjust to the workings of a dynamic and fast paced environment.
  • Experience with Git, creating issues, branches, merge requests and using CI/CD pipelines.
  • Experience modifying Unix/Linux operating systems (e.g., enabling/disabling kernel modules).
  • Practical experience with Splunk or other monitoring tools.
  • Demonstrated ability to develop new methods, techniques, or approaches to address critical technical problems and/or develop new technical capabilities.
  • Experience managing both SCIF and SAPF environments and HPC computing resources.
  • Clearance: Active DOE Q or DoD Top Secret clearance with SCI eligibility.

Work Location: The work location for this position is hybrid and is located in Los Alamos, NM. Hybrid is defined as working partially onsite/partially offsite but within 2 hours ground commute of this location. All work locations are at the discretion of management and can change at any time with appropriate notice.

Position commitment: Regular appointment employees are required to serve a period of continuous service in their current position in order to be eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the time required, they may only apply for Laboratory jobs with the documented approval of their Division Leader. The position commitment for this position is 1 year.

Note to Applicants:

Due to federal restrictions contained in the current National Defense Authorization Act, citizens of the People's Republic of China (including the special administrative regions of Hong Kong and Macau), as well as citizens of the Islamic Republic of Iran, the Democratic People's Republic of Korea (North Korea), and the Russian Federation, who are not Lawful Permanent Residents ("green card" holders) are prohibited from accessing facilities that support the mission, functions, and operations of national security laboratories and nuclear weapons production facilities, which includes Los Alamos National Laboratory.

Where You Will Work

Located in beautiful northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. Our generous benefits package includes:

  • PPO or High Deductible medical insurance with the same large nationwide network
  • Dental and vision insurance
  • Free basic life and disability insurance
  • Paid childbirth and parental leave
  • Award-winning 401(k) (6% matching plus 3.5% annually)
  • Learning opportunities and tuition assistance
  • Flexible schedules and time off (PTO and holidays)
  • Onsite gyms and wellness programs
  • Extensive relocation packages (outside a 50 mile radius)

Additional Details

Directive 206.2 - Employment with Triad requires a favorable decision by NNSA indicating employee is suitable under NNSA Supplemental Directive 206.2. Please note that this requirement applies only to citizens of the United States. Foreign nationals are subject to a similar requirement under DOE Order 142.3A.

Clearance: Q/SCI (Position will be cleared to this level). Selected applicants will be subject to a background investigation conducted by or on behalf of the Federal Government, and must meet eligibility requirements* for access to classified matter. This position requires a Q clearance, and obtaining such clearance requires US Citizenship except in extremely rare circumstances. Dependent upon the position, additional authorization to access classified information may be required, which may or may not be available to dual citizens. Receipt of a Q clearance and additional access authorization ultimately is a decision of the Federal Government and not of Triad.

*Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.

New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing. Although New Mexico and other states have legalized the use of marijuana, use and possession of marijuana remain illegal under federal law. A positive drug test for marijuana will result in termination of employment, even if the use was pre-offer.

Regular position: Term status Laboratory employees applying for regular-status positions are converted to regular status.

Internal Applicants: Regular appointment employees who have served the required period of continuous service in their current position are eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the required period of continuous service, they may only apply for Laboratory jobs with the documented approval of their Division Leader. Please refer to Policy P701 for applicant eligibility requirements.

Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer. All employment practices are based on qualification and merit, without regard to protected categories such as race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal, state, and local laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to applyhelp@lanl.gov or call (505)-664-6947.
