Senior Multimodal Foundation Model Engineer

Visa Sponsorship
AI Summary

Design, train, and optimize large multimodal models for real-time human interactions. Collaborate with a world-class founding team to drive architectural decisions and experiment rapidly. Work on groundbreaking technology that unifies real-time emotional and social intelligence across modalities.

Key Highlights
Design and train large multimodal models
Collaborate with a world-class founding team
Work on groundbreaking technology
Technical Skills Required
Large language models
Multimodal systems
Distributed training
Model architecture
Inference optimization
Large-scale data processing
Benefits & Perks
$240K–$300K salary
0.5%–1% equity
Visa & Green Card sponsorship available

Job Description: Senior Multimodal Foundation Model Engineer

Location: Seattle, WA

Compensation: $240K–$300K salary + 0.5%–1% equity

Work Policy: On-site, 5 days/week

Visa Support: Visa & Green Card sponsorship available

About the Role

This role offers a rare opportunity to shape truly foundational technology in a space where the boundaries are still being defined. You will be part of a small, high-performing team building the first real-time human foundation model capable of understanding and generating text, speech, facial expression, and body language as a unified system.

You’ll work on technology that interprets the micro-signals humans intuitively use (the quirk of an eyebrow, a pause in speech, a shift in tone) and build models that can understand and respond with emotional intelligence.

Your work will power lifelike, responsive avatars whose expressions, gestures, and tone evolve naturally frame-by-frame to deliver deeply human interactions.

This is a role for someone who wants to build at the frontier of multimodal AI, push scientific boundaries, and work hands-on at massive scale.

What You’ll Do

  • Design, train, and optimize large multimodal and autoregressive models that operate across text, speech, and visual signals in real time.
  • Build systems that understand fine-grained human cues and infer nuanced intent and emotion.
  • Develop lifelike avatar generation systems capable of natural facial expression, gesture, and tone rendering.
  • Lead model training end-to-end, from data pipeline design to pre-training to evaluation and iteration.
  • Collaborate closely with a world-class founding team to drive architectural decisions, establish research direction, and experiment rapidly.
  • Work in a fast-paced, flat, highly collaborative environment where you will have significant ownership and influence.

Required Qualifications

Experience

  • 3+ years training multimodal LLMs (MLLMs), autoregressive architectures, or closely related models.
  • Hands-on experience with large-scale pre-training and familiarity with full model training pipelines.
  • Prior experience training models in corporate or advanced research environments.

Education

  • Degree in Computer Science, Mathematics, or Engineering from a top-tier institution.
  • PhD (or PhD-level research experience) with a focus on speech synthesis, multimodal modeling, or related fields.

Technical Skills

  • Deep understanding of large language models, especially multimodal systems combining text, audio, and visual data.
  • Demonstrated ability to train models at large scale (e.g., distributed training across 32+ GPUs).
  • Strong understanding of model architecture, inference optimization, and large-scale data processing.

Soft Skills

  • Low ego, collaborative, and easy to work with.
  • Genuine interest in committing to a startup environment and building foundational technology.
  • Strong communication and willingness to iterate quickly.

Why Join

  • Exceptional founding team with deep expertise across AI, speech, embodied intelligence, and real-time modeling.
  • Work on groundbreaking technology: building the first human foundation model that unifies real-time emotional and social intelligence across modalities.
  • Clear impact: your work directly shapes the core product and technical direction.
  • Flat, collaborative structure where top performers can influence decisions and experiment freely.
  • Mission-driven environment focused on creating AI that interacts with people more naturally and meaningfully.
  • Strong funding and an ambitious vision spanning AI companionship, enterprise workflows, interviewing, sales intelligence, and more.

