Senior Python Engineer for part-time, task-based work focused on evaluating and testing Large Language Models (LLMs). Evaluate failure modes, design test cases, and analyze LLM behavior. Requires Python expertise, strong engineering judgment, and comfort with debugging code.
Job Description
Senior Python Engineer / LLM Evaluation - Part-Time, Remote
We are hiring experienced Python engineers for part-time, task-based work focused on evaluating and testing Large Language Models (LLMs).
This is not traditional QA and not junior AI labeling work. We are looking for senior engineers who can reason deeply about system behavior, ambiguity, and real-world usage, not just write code.
What You'll Do
• Design structured test cases that simulate real human workflows
• Define gold-standard outputs and expected behaviors
• Analyze LLM failure modes such as hallucinations, bias, and context limitations
• Work directly with Git repositories and existing codebases
• Navigate incomplete documentation and ambiguous requirements
• Apply engineering judgment to determine what "good" looks like
Who You Are
• 3+ years of software development experience (Python-focused)
• Python is your primary language
• Strong hands-on Git experience in real projects
• Comfortable reading and debugging code you didn't write
• Able to reason about edge cases, trade-offs, and ambiguity
• Strong written and spoken English (B2+)
Nice to Have
• QA or structured testing experience (must be code-capable)
• Experience evaluating AI or LLM systems
• Familiarity with evaluation metrics such as precision, recall, and coverage
• Experience working with Docker
• Consulting or freelance engineering background
What We're Looking For
We value engineers who can explain why something fails, not just that it fails. If you naturally think in terms of scenarios, assertions, failure modes, and user expectations, you'll thrive here.
This role suits senior backend Python engineers, ML engineers who still code regularly, and technically strong evaluators with real production experience.
Fully remote. Flexible schedule. Task-based delivery.
If you're interested in applying your engineering judgment to real-world AI system evaluation, we'd love to hear from you.