Build UI for LLM evaluation outputs, design data visualizations, and partner with researchers to create explorable artifacts.
Job Description
Our client is a well-funded nonprofit research organization focused on measuring frontier AI capabilities, especially agentic/autonomous capabilities and the ability of models to conduct AI R&D. Those capabilities can create outsized societal and security risk if they scale faster than our ability to evaluate and govern them.
Their work is unusually “real-world” compared to typical benchmarks: they build highly realistic evaluations, measure performance against skilled-human baselines (often on multi-hour tasks), and publish research on how quickly models are improving at completing long tasks.
You’d be building the UI that turns messy LLM evaluation outputs into clear, explorable artifacts that researchers can trust.
What you’ll do
- Build React + TypeScript interfaces for exploring LLM evaluation results and experiment outputs.
- Design and implement data visualizations that make model behavior, metrics, and results easy to inspect.
- Build workflows that support end-to-end traceability of LLM runs (prompts → intermediate steps → decisions → outputs).
- Partner closely with researchers; iterate quickly while balancing clarity, accuracy, and performance.
Tech stack / must-haves
- React + TypeScript
- Hands-on experience with at least one major visualization library: D3, Plotly, Vega/Vega-Lite, Visx, Three.js, Highcharts, or ECharts
Why this matters
- Their mission is to give society and AI labs grounded answers to: “What can frontier models actually do?” and “When do capabilities become dangerous?”
- The team includes researchers and engineers with backgrounds at top AI organizations (e.g., OpenAI, DeepMind), as well as alumni of Oxford, Caltech, MIRI, and ML interpretability programs.
Location
- Onsite in Berkeley, CA (relocation sponsored).
*Please do not apply if you do not meet our requirements, as we won’t be able to respond.