You will play a key role in developing, debugging, and maintaining software to operate a large-scale compute platform. Your duties will include:
- Collaborating closely with teams within and across organizations to support their workflows or integrate their technology into our platform
- Writing software to automate operations processes by developing services and tools
- Designing, implementing, and maintaining robust, scalable, and highly available services that support Infrastructure as Code (Terraform, Pulumi)
- Developing configuration management and fleet orchestration solutions powered by Ansible, Puppet, or similar tools
- Monitoring on-server system performance, identifying bottlenecks, and implementing solutions to improve efficiency
- Conducting root cause analysis for on-server system failures and implementing preventive measures
- Writing and reviewing code, as well as generating and reviewing design documentation
- Participating in qualifications and rollouts of software to production clusters
- Participating in a business-hours rotation where engineers respond to platform issues for same-day resolution