In this role, you'll apply your expertise to help train next-generation AI systems. Your work will shape how models learn, reason, and perform through high-quality, real-world input. No prior experience in AI is required — your domain knowledge is what matters.
Key Responsibilities
Monitor, troubleshoot, and support production infrastructure systems, including servers, containers, storage, and networks.
Respond promptly to operational incidents and alerts, leading the resolution of platform-related issues to minimize downtime.
Manage and optimize CI/CD pipelines, leveraging tools such as Jenkins, and oversee container orchestration with Kubernetes and Helm.
Collaborate closely with DevOps and SRE stakeholders to drive automation and enhance system reliability.
Maintain, support, and improve internal tooling and platform services used by all engineering teams.
Participate in on-call rotations or shift work as required, ensuring continuous operational support.
Utilize observability and monitoring tools (e.g., ELK, Grafana, Prometheus) to proactively identify and address potential issues.
Required Skills and Qualifications
At least 2 years of experience in platform/infrastructure support, RunOps, or DevOps roles.
Expertise in Linux systems and shell scripting for automation and troubleshooting.
Strong hands-on experience with Docker, Kubernetes, and Helm in production environments.
Demonstrated proficiency with CI/CD tools such as Jenkins.
Solid knowledge of at least one major cloud platform (AWS, Azure, or GCP) and its associated services.
Familiarity with observability toolsets including ELK, Grafana, and Prometheus.
Exceptional written and verbal communication skills, with a strong focus on clear documentation and collaborative problem-solving.
Preferred Qualifications
Experience supporting large-scale or multi-cloud environments.
Background in infrastructure as code or configuration management solutions.
Ability to drive process improvement and contribute to platform strategy initiatives.