AI Engineer — Production LLM Systems & Evaluation

Aurora · United States
Visa Sponsorship
AI Summary

AI Engineer responsible for owning model behavior in production and for evaluating and refining LLM-powered systems. Requires 3+ years of shipping AI in production at scale, direct ownership of model quality, and experience building evaluation sets.

Key Highlights
Own model behavior in production
Evaluate and refine LLM-powered systems
Work directly with subject matter experts and enterprise customers
Key Responsibilities
Own evaluation, workflow design, fine-tuning, release discipline, and turning customer feedback into product improvements
Work directly with subject matter experts and enterprise customers to understand what they mean by a correct answer, where the system fails, and whether the fix belongs in the model, the prompt, the retrieval layer, or the workflow itself
Generalize one-off wins into reusable platform primitives instead of leaving them as bespoke deployments
Technical Skills Required
Machine Learning, Artificial Intelligence, Python
Benefits & Perks
$210K–$350K base + competitive equity
Hybrid or Remote work
Full-time employment

Job Description




New York, NY or San Francisco, CA · Hybrid or Remote · Full-time

$210K–$350K base + competitive equity


The company


The company is building software that captures expert judgment for regulated industries, starting with financial services.


The first product is an AI-powered third-party risk management platform for financial institutions. It captures the compliance reasoning that normally lives in the heads of senior experts and turns it into software that can be deployed, measured, and improved over time.


The product already serves FDIC-insured banks. The business has gone from 0 to $10M ARR in less than a year, closed a $25M Series A, and has $40M in total contract value. It went from 0 to 5 live deployments in 45 days and is on track to hit 15 live deployments next month.


The team is 8 people, founded in 2023, and includes former regulators, heads of compliance and legal at fintechs, and experienced engineers. The company is backed by leading institutional investors.


The role


This is a hands-on AI engineering role for someone who wants to own model behavior in production.


The scope includes evaluation, workflow design, fine-tuning, release discipline, and turning customer feedback into product improvements.


You will be customer-facing. A big part of the job is working directly with subject matter experts and enterprise customers to understand what they mean by a correct answer, where the system fails, and whether the fix belongs in the model, the prompt, the retrieval layer, or the workflow itself.


The output of your work should not stop at a single deployment. Customer-specific solutions should be generalized back into the core learning library so the platform gets stronger with each new customer.


The technical problem


Model quality here is a systems problem, not a prompt problem.


The product has to produce grounded, defensible output from messy input, keep its behavior stable across edge cases, and make it easy for humans to review and trust what it returns.


The hard part is building a production system with measurable quality, controlled regressions, and clear feedback loops from real users.


Why now


The hard part has shifted from proving demand to making every deployment better than the last.


With live banks, rapid deployment velocity, and recurring enterprise feedback, the next constraint is model quality at scale: evaluation, reliability, and reuse across customers.


This is the point where engineering decisions become durable platform infrastructure.


What you'll own


  • LLM-powered systems and agentic workflows: ship end-user experiences that are accurate, usable, and production-ready.
  • Evaluation frameworks: build gold sets, scoring rubrics, regression tests, and release gates that catch quality issues before customers do.
  • Model refinement: use fine-tuning, prompt iteration, and data-driven feedback to improve accuracy and consistency.
  • Customer-facing iteration: work with SMEs and enterprise users to prototype, validate, and ship improvements quickly.
  • Core learning library: generalize one-off wins into reusable platform primitives instead of leaving them as bespoke deployments.
  • Production quality: keep the system observable, measurable, and stable as the product and customer base grow.


Who this is for


You are likely a strong fit if you have:

  • 3+ years shipping AI in production at scale, with direct ownership of model quality.
  • Built systems where offline evaluation, production behavior, and customer feedback all mattered.
  • Owned more than integration work; you have been responsible for the model behavior itself.
  • Experience building evaluation sets and using them to make release decisions.
  • Comfort working directly with technical and non-technical stakeholders, including domain experts.
  • Judgment about when to use rules, prompts, retrieval, fine-tuning, or workflow changes.
  • Experience in environments where both false positives and false negatives have real cost.
  • The ability to explain technical tradeoffs clearly and without hand-waving.


This role is not for you if


  • You want to stay in prototype mode and avoid production ownership.
  • You want a research-only role with no customer contact.
  • You prefer narrow tickets and heavy specification before you start.
  • You are not interested in evaluation rigor, reliability, or reuse.
  • You optimize for novelty over repeatable quality.


Compensation and logistics


  • Base salary: $210K–$350K
  • Equity: competitive
  • Location: New York, NY or San Francisco, CA
  • Work model: hybrid or remote
  • Employment: full-time
  • Visa sponsorship: available


About Aurora


Aurora helps exceptional engineers find the right role at some of the most ambitious startups worldwide.


We work with teams that value high ownership, strong technical standards, and clear scope.

