Your mission
- Design, build, and maintain the infrastructure and tools that enable highly reliable and scalable deployment of services and applications, across both cloud-based and on-premises environments
- Implement comprehensive monitoring and observability frameworks to detect and resolve issues proactively, using tools like Prometheus, Grafana, and Zabbix for system health and performance metrics
- Develop and manage incident response protocols, including on-call rotations, incident analysis, and postmortems, to ensure continuous improvement in system reliability and performance
- Automate infrastructure and workflows using Infrastructure as Code (IaC) tools like Ansible
- Optimize system performance through regular tuning, capacity planning, and reliability experiments that identify and mitigate potential points of failure
- Collaborate with development teams to advocate for reliability and scalability practices throughout the software development life cycle, and assist in the design and review of new systems and major changes