Machine Learning Systems Engineer

Aurora · United States
Visa Sponsorship
AI Summary

Build and optimize production infrastructure for diffusion-based language models, focusing on serving, compilation, and deployment reliability. Own end-to-end systems from model execution to cloud rollout across AWS and Azure. Requires deep technical ownership and cross-functional collaboration with researchers and engineering leadership.

Key Highlights

• Production ML infrastructure ownership
• Diffusion LLM serving and optimization
• Cross-functional systems debugging
• Kubernetes and cloud deployment pipelines

Key Responsibilities

• Build and improve model serving infrastructure with attention to latency, throughput, and stability
• Optimize performance across CUDA, TensorRT, ONNX Runtime, vLLM, and SGLang to reduce bottlenecks
• Create reproducible deployment pipelines across Kubernetes and cloud environments with safe release and rollback mechanisms
• Develop benchmarking and evaluation systems to separate real model gains from runtime noise
• Debug failures across Python, containerized services, GPU execution, and orchestration layers
• Scale infrastructure to support growing customer load and evolving model requirements
• Collaborate directly with researchers to translate model changes into production-ready performance improvements

Technical Skills Required

PyTorch, CUDA, Kubernetes, Python

Benefits & Perks

• Base salary: $200K–$300K
• Competitive equity
• On-site work in Palo Alto, CA
• Visa support available

Job Description


Palo Alto, CA · On-site · Full-time


$200K–$300K base + competitive equity



The company


The company is building diffusion-based language models that generate tokens in parallel instead of one at a time.


That architecture is designed to reduce latency and cost while preserving quality, and it has already moved beyond research.


The team launched Mercury, the first commercially available diffusion LLM (dLLM), in early 2025 and is now deploying large-scale diffusion LLMs at Fortune 500 companies.


The company has raised $56M and operates as a small, deeply technical team of about 20 people in Palo Alto.


This is not a lab demo. The product is already being used in enterprise settings, which makes production performance, deployment reliability, and systems quality first-order problems.



The role


This is a machine learning systems role for someone who wants ownership of the infrastructure around model performance: serving, compilation, optimization, benchmarking, deployment, and reliability.


You will work directly with researchers and engineering leadership to move models from implementation to production systems that are measurable, reproducible, and fast enough for real customers.


The scope is broad enough to feel staff-level in practice. The hard problems are the ones that decide whether the model is usable in the real world: throughput, latency, memory use, hardware efficiency, rollout safety, and operational stability.



The technical problem


Diffusion LLMs change the inference problem.


You are not just serving a model. You are making a new architecture run efficiently across GPUs, runtimes, and cloud environments while preserving output quality and deployment reliability.


The challenge is to connect research code to production systems without losing the performance characteristics that make the architecture valuable in the first place.


That means the work spans model execution, runtime optimization, infrastructure, and evaluation rather than only training or only serving.



What you'll own


• Model serving infrastructure: build and improve the systems that serve diffusion LLMs in production, with attention to latency, throughput, and stability.

• Performance optimization: work across CUDA, TensorRT, ONNX Runtime, vLLM, and SGLang to reduce bottlenecks and improve hardware utilization.

• Deployment pipelines: make model rollout reproducible across Kubernetes and cloud environments, with safe release and rollback mechanisms.

• Benchmarking and evaluation: build measurement systems that separate real model gains from runtime noise and infrastructure effects.

• Systems debugging: trace failures across Python, containerized services, GPU execution, and orchestration layers.

• Scaling infrastructure: help adapt the stack as customer load grows and model requirements evolve across AWS and Azure.

• Cross-functional execution: work closely with researchers to turn model changes into production-ready performance improvements.



Who this is for


You are likely a strong fit if you have:


• Built production ML infrastructure or inference systems where latency, throughput, and cost are explicit design constraints.

• Strong judgment around GPU utilization, memory pressure, batching, and runtime tradeoffs.

• Experience with PyTorch, CUDA, serving runtimes, or deployment stacks that sit between model code and production traffic.

• Comfort reading profiles, tracing bottlenecks, and turning ambiguous performance issues into concrete fixes.

• Shipped systems where correctness, reproducibility, and operational reliability mattered as much as raw speed.

• The ability to work directly with researchers and translate model behavior into systems decisions.

• Experience operating in environments where requirements change as quickly as the model stack.

• Enough range to move from code-level debugging to infrastructure design without handoff overhead.



Tech stack


• Serving and optimization: vLLM, TensorRT, ONNX Runtime, SGLang

• Modeling and training: PyTorch, TensorFlow

• GPU and systems: CUDA, Docker, Kubernetes

• Infrastructure: Python, AWS, Azure, Kubeflow


The stack is broad because the work sits across research, inference, deployment, and cloud infrastructure. The best candidates will understand where each layer creates leverage and where it becomes a bottleneck.



Why now


The company has already proven the core idea with a commercial product and enterprise deployments.


The next problem is not whether the model works in principle. It is whether the system can serve real demand with predictable performance, stable rollouts, and a runtime stack that keeps up with model progress.


This is the point where systems engineering matters most: the architecture decisions made now will shape how efficiently the product can scale across customers and hardware generations.



This role is not for you if


• You want a narrowly scoped feature role with clean handoffs.

• You prefer working only on model research and do not want systems ownership.

• You are uncomfortable debugging across GPU, runtime, container, and orchestration layers.

• You do not want to work on-site 5 days per week in Palo Alto.

• You need strict process separation between research, infrastructure, and product execution.



Compensation and logistics


• Base salary: $200K–$300K

• Equity: competitive

• Location: Palo Alto, CA

• Work model: on-site, 5 days per week in Palo Alto

• Visa support: available

• Employment: full-time



Interview process


Typical process:


• Intro call — 20 min: background, scope, and fit.

• Technical coding rounds: engineering depth and problem-solving.

• Onsite-style panel with founders: usually remote.

• References: final stage.



About Aurora


Aurora helps exceptional engineers find the right role at some of the most ambitious startups worldwide.


We work with teams that expect high ownership, technical depth, and direct accountability.

