Senior Software Engineer — AI Evaluation & Benchmarks
Pay rate · $80 - $100/hr
Remote · Worldwide · Coding · Contract / freelance
About the Role
What if the code you write could determine how smart the next generation of AI truly is? We're looking for experienced Software Engineers to design and build the coding benchmarks and data pipelines used to evaluate frontier AI models — the systems that decide whether an AI can actually reason, debug, and write production-quality software.
This is high-impact, technically demanding work at the intersection of software engineering and AI research. You'll work with large codebases, multiple programming languages, and scalable infrastructure to create evaluation systems that push the boundaries of what AI can do.
This is a fully remote contract role. If you thrive in fast-paced engineering environments and want your work to directly shape the trajectory of AI, this is the role for you.
Organization: the hiring company
Type: Hourly Contract
Location: Remote
Contract Length: 3 Months
Commitment: Full-time availability preferred
---
What You'll Do
Design and implement coding benchmarks used to evaluate frontier AI models across real-world programming tasks
Build and maintain scalable data pipelines for AI evaluation workflows
Analyze AI-generated code for correctness, reliability, and edge-case failures
Create structured evaluation scenarios that rigorously test reasoning, debugging, and code quality
Work with large code repositories and multi-language environments
Collaborate on systems that improve how AI models understand and generate software
Provide detailed technical feedback on model performance and failure patterns
Contribute to the design of evaluation frameworks that set industry standards
---
Who You Are
4+ years of professional software engineering experience — this is non-negotiable
Experience working at a high-growth tech company or top-tier software organization
Expert proficiency in Python — you write clean, performant, well-tested Python code
Hands-on experience with code repositories and working in large, complex codebases
Proven experience designing and implementing LLM coding benchmarks and data pipelines
Track record of working in high-performance engineering environments with large-scale products or platforms
Strong command of version control systems (Git) and modern development workflows
Native or bilingual-level English proficiency with strong written communication skills
Self-directed, technically rigorous, and comfortable operating with autonomy
---
What Makes a Perfect Match
Candidates with these additional qualifications have the highest chance of success:
Senior or Lead-level engineering profiles with a history of technical ownership
Bachelor's or Master's degree in Computer Science, Machine Learning, or a related field — or equivalent professional experience
Proficiency in one or more additional languages, such as JavaScript, Go, or C++
Experience with CI/CD pipelines and writing robust unit tests (pytest, Mocha, JUnit)
Background in security engineering or significant open-source contributions
Familiarity with AI/ML evaluation methodologies or model benchmarking
---
Why Join Us
Work on cutting-edge AI evaluation projects alongside world-class research teams
Fully remote — work from anywhere with a reliable internet connection
Your benchmarks directly influence how the most advanced AI systems in the world are measured and improved
Freelance autonomy with meaningful, high-stakes engineering work
Collaborate with a global community of elite engineers and researchers
Potential for contract extension and ongoing engagement as new evaluation challenges emerge