Senior Training Infrastructure Engineer

DeepRec.ai Germany
Remote
AI Summary

Join a fast-growing AI startup as a Senior Training Infrastructure Engineer to own and evolve the full model training stack. Design and optimize large-scale training systems, improve the training pipeline, and work on cutting-edge generative systems.

Key Highlights
Design and evaluate optimal training strategies
Profile, debug, and optimize GPU workloads
Improve the entire training pipeline end to end
Build scalable systems for experiment tracking and model versioning
Design, deploy, and maintain large-scale training clusters
Technical Skills Required
GPU workloads
PyTorch
SLURM
GPU memory hierarchy
Compute constraints
Attention mechanisms
Diffusion or autoregressive models
VAST or large-scale object storage
Benefits & Perks
Competitive salary: €80,000 to €150,000
Equity
Fully remote work within Europe (CET ±2 hours)
Genuine ownership and autonomy from day one

Job Description


Training Infrastructure Engineer

Salary: €80,000 to €150,000 + equity

Location: Fully remote within Europe (CET ±2 hours)

Stage: Recently funded Series A AI startup


We are partnered with a fast-growing generative AI company building the next generation of creative tooling. Their platform generates hyper-realistic sound, speech, and music directly from video, effectively bringing silent content to life. The technology is already being used across gaming, video platforms, and creator ecosystems, with a clear ambition to become foundational infrastructure for audio-visual storytelling.


Backed by top-tier venture capital and fresh Series A funding, the company is now scaling its core engineering group. This is a chance to join at a point where the technical challenges are deep, the scope is wide, and individual impact is unmistakable.


The Role:

As a Training Infrastructure Engineer, you will own and evolve the full model training stack. This is a hands-on, systems-level role focused on making large-scale training fast, reliable, and efficient. You will work close to the hardware and close to the models, shaping how cutting-edge generative systems are trained and iterated.


What You Will Do:

  • Design and evaluate optimal training strategies, including parallelism approaches and precision trade-offs, across different model sizes and workloads
  • Profile, debug, and optimise GPU workloads at the single- and multi-GPU level, using low-level tooling to understand real hardware behaviour
  • Improve the entire training pipeline end to end, from data storage and loading through distributed training, checkpointing, and logging
  • Build scalable systems for experiment tracking, model and data versioning, and training insights
  • Design, deploy, and maintain large-scale training clusters orchestrated with SLURM


What We Are Looking For:

  • Proven experience optimising training and inference workloads through hands-on implementation, not just theory
  • Deep understanding of GPU memory hierarchy and compute constraints, including the gap between theoretical and practical performance
  • Strong intuition for memory-bound vs compute-bound workloads and how to optimise for each
  • Expertise in efficient attention mechanisms and how their performance characteristics change at scale


Nice to Have:

  • Experience writing custom GPU kernels and integrating them into PyTorch
  • Background working with diffusion or autoregressive models
  • Familiarity with high-performance storage systems such as VAST or large-scale object storage
  • Experience managing SLURM clusters in production environments


Why This Role:

  • Join at a pivotal growth stage with fresh funding and strong momentum
  • Genuine ownership and autonomy from day one, with direct influence over technical direction
  • Competitive salary and equity so you share in the upside you help create
  • Work on technology that is redefining how creators produce and experience content


If you want to operate at the intersection of deep systems engineering and frontier generative AI, this is one of the strongest opportunities in the European market right now.

