ExpertGrid

Site Reliability Engineer

Pay rate: $40 - $70/hr · Worldwide, remote · Coding · Contract / freelance

Job Summary

In this role, you'll apply your expertise to help train next-generation AI systems. Your work will shape how models learn, reason, and perform through high-quality, real-world input. No prior experience in AI is required — your domain knowledge is what matters.

Key Responsibilities

  • Lead the deployment, monitoring, and recovery of complex, containerized AI training environments using advanced terminal techniques.
  • Proactively identify, diagnose, and resolve infrastructure bottlenecks and failures in long-running processes.
  • Orchestrate resilient system builds and infrastructure management, ensuring stability and optimal resource utilization.
  • Collaborate closely with engineering teams to refine CI/CD pipelines and automate routine operational tasks.
  • Manage and optimize filesystem structures, networked storage, and process scheduling in Dockerized sandboxes.
  • Replan rapidly mid-execution when error states or unforeseen runtime issues arise.
  • Document best practices and emergent solutions, and contribute to knowledge transfer across the team.

Required Skills and Qualifications

  • Demonstrated expert proficiency with terminal-based problem solving and complex system administration.
  • Mastery of dynamic infrastructure recovery and long-running operational process management.
  • Deep expertise in containerized environments (e.g., Docker, Kubernetes) and sandbox orchestration.
  • Strong Python skills, with the ability to script, automate, and debug real-world production systems.
  • Proficiency in Bash and familiarity with JavaScript/TypeScript, Go, Rust, and C/C++.
  • Experience with build systems, package managers, databases, version control, and cryptography tools.
  • Adept at troubleshooting, documenting, and replanning in high-velocity technical environments.

Preferred Qualifications

  • Background in machine learning operations or AI infrastructure.
  • Familiarity with ML frameworks and distributed computing.
  • Experience supporting multi-phase, high-intensity engineering projects.