Machine Learning Evaluation Engineer

Aurora · United States
Visa Sponsorship

Job Description


Machine Learning Eval Engineer


San Francisco, CA · On-site · Full-time

$150K–$300K base + competitive equity



The company


The vast majority of enterprise data lives in PDFs, spreadsheets, and other files that are awkward for models to handle. The company builds software that turns those documents into LLM-ready inputs with the accuracy and deployment controls enterprise workflows require.


The business has grown revenue 8x year over year, has hundreds of companies using the product, and now processes tens of millions of pages every month.


Customers include leading AI teams like Harvey, Vanta, Scale, and Meta, plus enterprise customers across FAANG and top trading firms.


Deployment is part of the product: cloud, fully air-gapped environments, SOC 2 and HIPAA compliance, and zero data retention.


Founded in 2023, the company has raised over $100M from a16z, Benchmark, and First Round Capital. The team includes people from Stripe, Discord, Scale AI, Dynamo AI, HRT, BAM, and similar backgrounds.



The role


This is the person who owns how the company measures model quality.


You will design the evaluation systems, benchmarks, and inspection tools that tell the team where models are strong, where they fail, and which failures are important enough to change training or product decisions.


The work sits between ML engineering, data analysis, and lightweight tooling. You will work closely with ML, platform, and GTM teams, and you will often be the one translating a vague customer problem into a reproducible benchmark.


This role shapes release confidence and the training priorities that come next.



The technical problem


Document workflows fail in ways that are hard to see in curated benchmarks: layout shifts, table errors, scan quality, long-tail formats, and customer-specific distributions.


A model can look excellent on a small sample and still break on the cases that matter in production.


The hard part is building an evaluation system that stays useful as the product and data distribution change: fast regression checks, hard-example mining, bespoke customer benchmarks, and metrics that predict real-world performance.


The eval surface is already large enough that manual review does not scale. The systems you build need to operate across billions of documents and still surface signals the ML team can act on.
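
As a rough illustration of the kind of regression check described above, here is a minimal Python sketch. It is not the company's actual tooling: the slice names ("tables", "scans"), the function names, and the 0.02 tolerance are all hypothetical. It shows how a candidate model can match a baseline on overall accuracy while still regressing on one slice of documents.

```python
from collections import defaultdict


def slice_accuracy(results):
    """Return accuracy per slice from (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in results:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


def find_regressions(baseline, candidate, tolerance=0.02):
    """Return slices where candidate accuracy drops by more than `tolerance`.

    The tolerance is an illustrative assumption, not a recommended value.
    """
    base_acc = slice_accuracy(baseline)
    cand_acc = slice_accuracy(candidate)
    regressions = {}
    for name, base in base_acc.items():
        cand = cand_acc.get(name)
        # Only compare slices present in both runs.
        if cand is not None and base - cand > tolerance:
            regressions[name] = (base, cand)
    return regressions


# Hypothetical toy data: both runs score 3 of 4 overall,
# but the candidate gets worse on the "tables" slice.
baseline = [("tables", True), ("tables", True), ("scans", True), ("scans", False)]
candidate = [("tables", True), ("tables", False), ("scans", True), ("scans", True)]

print(find_regressions(baseline, candidate))
```

In this toy data the aggregate score is identical for both runs, so only the per-slice comparison flags the drop on tables.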



What you'll own


• Benchmarks and regression suites: design and maintain test sets that capture real document failure modes, not only curated samples.

• Failure detection: build metrics, heuristics, and automated workflows that surface new errors across large, messy datasets.

• Model feedback loops: turn evaluation results into training priorities, error analysis, and concrete model improvements with the ML team.

• Document inspection: work hands-on with PDFs, spreadsheets, and other difficult formats to find edge cases and construct hard examples.

• Internal and customer-facing tooling: build lightweight Python tools, including simple Flask interfaces, so teams can inspect outputs and explain model behavior.

• Customer-specific evals: partner with customers and GTM to define bespoke benchmarks that reflect real deployment requirements.

• Data plumbing: use AWS S3 and analytics systems like Tinybird to store, query, and analyze large-scale evaluation runs.



Who this is for


You are likely a fit if you have:


• 1–5 years of experience in ML or software engineering, with work that already shows strong independence.

• Strong Python skills and the ability to build clean, reliable technical solutions without much hand-holding.

• Experience building evaluation, analytics, or data systems end to end.

• Comfort working through noisy data, ambiguous failure cases, and metrics that need to be validated before anyone trusts them.

• Enough product sense to build tools other engineers, researchers, or customer-facing teams will actually use.

• Comfort with AWS S3 and OLAP or analytics systems like Tinybird.

• The habit of taking ownership from problem definition through implementation and iteration.

• Clear communication with both technical and non-technical stakeholders.


You do not need to come from document AI specifically, but you should enjoy getting close to the data and the failure cases instead of staying at the level of abstract model metrics.



Tech stack


• Python

• Flask for lightweight internal tools

• AWS S3

• OLAP / analytics systems such as Tinybird

• LLM evaluation tooling and data inspection workflows are a plus


The stack is intentionally small because the hard part is the measurement system, not the framework.



Why now


The company already has real customer usage and real enterprise constraints. Revenue has grown 8x year over year, and the product is being used in environments where deployment flexibility and data handling are part of the buying decision.


The next constraint is not whether the company can process documents. It is whether the team can measure quality well enough to keep improving models as the customer base and document surface area expand.


The evaluation systems you build will be run at significant scale, across billions of documents, so reproducibility and signal quality matter more than one-off analysis.


This role will define that measurement layer.



This role is not for you if


• You want a pure research role detached from production systems.

• You prefer clean datasets over messy real-world document distributions.

• You need fully specified tickets before you can start.

• You are not comfortable building tools and workflows that other teams depend on.

• You do not want to be on-site in San Francisco five days a week.



Compensation and logistics


• Base salary: $150K–$300K

• Equity: competitive

• Location: San Francisco, CA

• Work model: on-site, 5 days per week

• Employment: full-time

• Visa sponsorship: available on a case-by-case basis


Benefits include daily lunch, transportation reimbursement, health insurance, a wellness budget, and parental leave support.



About Aurora


Aurora helps exceptional engineers find the right role at some of the most ambitious startups worldwide.


We work with teams that value high ownership, strong technical standards, and clear scope.

