Senior Multimodal Foundation Model Engineer

oyf (own your future) staffing • United States

Visa Sponsorship

AI Summary

Design, train, and optimize large multimodal models for real-time human interactions. Collaborate with a world-class founding team to drive architectural decisions and experiment rapidly. Work on groundbreaking technology that unifies real-time emotional and social intelligence across modalities.

Key Highlights

Design and train large multimodal models

Collaborate with a world-class founding team

Work on groundbreaking technology

Technical Skills Required

Large language models

Multimodal systems

Distributed training

Model architecture

Inference optimization

Large-scale data processing

Benefits & Perks

$240K–$300K salary

0.5%–1% equity

Visa & Green Card sponsorship available

Job Description

Job Description: Senior Multimodal Foundation Model Engineer

Location: Seattle, WA

Compensation: $240K–$300K salary + 0.5%–1% equity

Work Policy: On-site, 5 days/week

Visa Support: Visa & Green Card sponsorship available

About the Role

This role offers a rare opportunity to shape truly foundational technology in a space where the boundaries are still being defined. You will be part of a small, high-performing team building the first real-time human foundation model capable of understanding and generating text, speech, facial expression, and body language as a unified system.

You’ll work on technology that interprets the micro-signals humans intuitively use — the quirk of an eyebrow, a pause in speech, a shift in tone — and builds models that can understand and respond with emotional intelligence.

Your work will power lifelike, responsive avatars whose expressions, gestures, and tone evolve naturally frame-by-frame to deliver deeply human interactions.

This is a role for someone who wants to build at the frontier of multimodal AI, push scientific boundaries, and work hands-on at massive scale.

What You’ll Do

Design, train, and optimize large multimodal and autoregressive models that operate across text, speech, and visual signals in real time.
Build systems that understand fine-grained human cues and infer nuanced intent and emotion.
Develop lifelike avatar generation systems capable of natural facial expression, gesture, and tone rendering.
Lead model training end-to-end, from data pipeline design through pre-training, evaluation, and iteration.
Collaborate closely with a world-class founding team to drive architectural decisions, establish research direction, and experiment rapidly.
Work in a fast-paced, flat, highly collaborative environment where you will have significant ownership and influence.

Required Qualifications

Experience

3+ years training multimodal LLMs (MLLMs), autoregressive architectures, or closely related models.
Hands-on experience with large-scale pre-training and familiarity with full model training pipelines.
Prior experience training models in corporate or advanced research environments.

Education

Degree in Computer Science, Mathematics, or Engineering from a top-tier institution.
PhD (or PhD-level research experience) with a focus on speech synthesis, multimodal modeling, or related fields.

Technical Skills

Deep understanding of large language models, especially multimodal systems combining text, audio, and visual data.
Demonstrated ability to train models at large scale (e.g., distributed training across 32+ GPUs).
Strong understanding of model architecture, inference optimization, and large-scale data processing.

Soft Skills

Low ego, collaborative, and easy to work with.
Genuine interest in committing to a startup environment and building foundational technology.
Strong communication and willingness to iterate quickly.

Why Join

Exceptional founding team with deep expertise across AI, speech, embodied intelligence, and real-time modeling.
Work on groundbreaking technology: building the first human foundation model that unifies real-time emotional and social intelligence across modalities.
Clear impact: your work directly shapes the core product and technical direction.
Flat, collaborative structure where top performers can influence decisions and experiment freely.
Mission-driven environment focused on creating AI that interacts with people more naturally and meaningfully.
Strong funding and an ambitious vision spanning AI companionship, enterprise workflows, interviewing, sales intelligence, and more.

Job Overview

Posted Date Dec 06, 2025

Employment Type Full-time

Experience Level Mid-Senior level

Location United States

Category Machine Learning

Company oyf (own your future) staffing
