Senior Software Engineer - ML Data Platform

DuckDuckGoose AI — Netherlands

Job Description


Senior Software Engineer — ML Data Platform


Location: Delft, the Netherlands (hybrid)

Type: Full-time

Start: ASAP


The internet has entered an era where reality is generatable. We build infrastructure that helps institutions distinguish real media from synthetic at scale, protecting citizens, enterprises, and governments from synthetic media fraud. Everything you see and hear online can now be manipulated; our job is to make sure people can trust what they see. As part of our forensics platform team, you'll work on the data backbone that makes large-scale detection possible, from ingestion and versioning through training, evaluation, and production.


You’ll join a small, senior team where your work will have immediate impact, and you’ll have ownership over the systems you build.


You’ll work on technically challenging problems such as:

  • Building dataset lineage for rapidly evolving generative models
  • Tracking model-family clusters across synthetic media types
  • Designing reproducible forensic benchmarks at scale
  • Managing large-scale image/video datasets with auditable provenance
  • Creating deterministic dataset builds for research and production environments
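To make "deterministic splits" concrete: one common approach (a minimal sketch, not a description of this company's actual stack; the function name and ratio scheme are hypothetical) is to derive each item's split from a hash of its ID, so an item lands in the same split on every rebuild regardless of insertion order or parallelism:

```python
import hashlib

def assign_split(
    item_id: str,
    ratios=(("train", 0.8), ("val", 0.1), ("test", 0.1)),
    seed: str = "v1",
) -> str:
    """Deterministically map an item ID to a split.

    The same (seed, item_id) pair always yields the same split,
    independent of process, ordering, or dataset size.
    """
    digest = hashlib.sha256(f"{seed}:{item_id}".encode()).digest()
    # Interpret the first 8 bytes as a uniform value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, ratio in ratios:
        cumulative += ratio
        if bucket < cumulative:
            return name
    return ratios[-1][0]  # guard against float rounding at the boundary
```

Bumping the `seed` string reshuffles all assignments at once, which gives a cheap way to version splits alongside the dataset itself.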


What You’ll Drive

  • Data platform architecture: Define unified schemas, lineage, and dataset versioning for large image/video + context data.
  • Ingestion at scale: Build reliable pipelines from research repos, APIs, and internal generators; automate connectors and jobs.
  • Quality & governance: Implement deduplication, validation, health dashboards, and drift/coverage checks with auditable lineage.
  • Curation & access: Deliver one-command dataset builds, deterministic splits, and fast sampling tools for training/eval.
  • Performance & cost: Tune S3/object storage layouts, partitioning, and lifecycle policies for speed and spend.
  • Orchestration & ops: Productionize pipelines with CI/CD, containerization, scheduling/monitoring, and safe rollbacks.
  • Reliability & operations: Build for simplicity and observability; participate in a planned, compensated support rotation.
  • Engineering productivity: Create internal tools/CLIs, docs, and templates that make everyone faster.
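The deduplication work under "Quality & governance" has a simple exact-match baseline worth naming: content hashing. A minimal sketch (the function and its in-memory interface are illustrative, not the platform's actual API; real pipelines would also need perceptual hashing for near-duplicates):

```python
import hashlib

def find_exact_duplicates(items):
    """Detect byte-identical payloads by content hash.

    items: iterable of (item_id, raw_bytes) pairs.
    Returns a list of (duplicate_id, kept_id) pairs, where the first
    item seen with a given digest is the one kept.
    """
    seen = {}       # digest -> first item_id with that content
    duplicates = []
    for item_id, data in items:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            duplicates.append((item_id, seen[digest]))
        else:
            seen[digest] = item_id
    return duplicates
```

Wired into CI as a QC gate, a non-empty result can fail the dataset build before duplicated samples leak across train/test splits.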


Must-haves

  • Strong software engineering foundation: Master’s in Computer Science, Data Engineering, or a related field.
  • Production experience: 5–8+ years building and operating data platforms for large unstructured datasets (images/video).
  • Data lifecycle ownership: Ingest → validate → catalog → version → sample/serve → monitor.
  • Pipelines & orchestration: Experience with modern schedulers (e.g., Airflow/Prefect) and containerized jobs.
  • Storage & formats: Hands-on with object storage (e.g., S3), columnar formats/partitioning, and performance tuning.
  • Versioning & lineage: Experience with dataset versioning and reproducibility (e.g., DVC/lakeFS/Delta or equivalents).
  • Quality at scale: Deduplication, schema/label checks, and automated QC gates in CI.
  • Security & privacy: IAM, access controls, and privacy-aware workflows suitable for regulated customers.
  • Domain awareness: Familiarity with digital forensics, misinformation threats, or synthetic media — and willingness to deepen expertise.
  • Flexibility: Comfortable moving between data engineering, infra, and tooling tasks when needed.
  • Mindset & delivery: Thrive in a fast-moving environment; proactive problem-solver; ship, measure, simplify.
  • Communication: Excellent written and verbal skills; explain complex ideas clearly.
  • Independence: Deliver quality work on time without constant oversight.
  • Language: Fluent in English.


Nice-to-haves

  • Streaming & events: Kafka/Kinesis or similar for near-real-time ingestion.
  • Vector search: Experience with embedding stores or similarity search at scale.
  • Synthetic data: Building pipelines to generate/stress-test rare scenarios.
  • Cloud & on-prem: Terraform/CDK, Kubernetes, and hybrid/on-prem data deployments.
  • FinOps: Cost monitoring and optimization for data workloads.
  • Technical track record: Strong GitHub, open-source contributions, publications, patents, or public talks.
  • Leadership: Mentoring and guiding technical direction.
  • Dutch language: Fluency is a plus.


Key Deliverables (First 90 Days)

  • A unified schema + catalog with key datasets onboarded, versioned, and reproducibly built via one command.
  • Automated QC gates (dedup/validation) with a red/amber/green dataset health dashboard and clear lineage.
  • Fast sampling/curation tools for the ML team, plus cost controls (storage layouts, lifecycle policies) in place.
  • Data migration: Inventory and migrate existing/legacy datasets into the new platform; reformat to the new schema, backfill metadata, validate checksums/lineage, and deprecate legacy paths with a rollback plan.


Compensation & benefits

  • Own the backbone: Define schemas, lineage, and dataset versioning used across research and production.
  • Company participation: Meaningful equity/virtual shares aligned with company growth.
  • Flexible work: Hybrid (Delft), flexible hours, minimal ceremony, async-first collaboration.
  • Data platform mandate: Real say in stack choices (orchestration, catalog, storage/layout) and time to implement them right.
  • Repro & auditability: Space to enforce deterministic builds, splits, and traceable lineage—no heroics needed.
  • Quality culture: Backing to implement dedup, drift/coverage checks, and dataset health dashboards org-wide.
  • FinOps mindset: Budget and support to balance speed, reliability, and total cost.
  • Pragmatic on-call: Planned, compensated rotation with automation-first recovery and rollback plans.
  • Growth path: IC track to Staff/Principal; opportunities to mentor and codify data standards.
  • Learning budget: Annual budget for courses/books + two data/ML-infra conferences per year.
  • Home office: Modest stipend for an ergonomic setup; commuting support (public transport or mileage).
  • Relocation + visa: Visa sponsorship and relocation support for internationals.


Join us and be part of a company committed to creating a more secure and trustworthy digital future. Apply today to become part of our mission-driven team!

