● Ensure the reliability and availability of production systems and services by monitoring, troubleshooting, and responding to incidents.
● Develop and maintain tools and automation for system monitoring, alerting, and incident response to minimize manual intervention.
● Collaborate with development teams to plan for capacity scaling and performance improvements based on usage patterns and growth forecasts.
● Collaborate with development and product teams to ensure that new features and services are designed with reliability in mind.
● Maintain documentation for operational processes, system configurations, and best practices.
Requirements
● Bachelor's degree in computer science, information technology, or a related field (or equivalent work experience).
● Proven experience in software development and/or system administration.
● Strong scripting and coding skills (e.g., Python, Go, Shell) for automation and tool development.
● Familiarity with containerization and orchestration technologies such as Docker and Kubernetes.
● Experience with cloud platforms (e.g., AWS, Azure, GCP) and infrastructure as code tools (e.g., Terraform).
● Proficiency in monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
● Knowledge of network, security, and database concepts.
● Strong problem-solving skills and the ability to work well under pressure.
● Understanding of agile and DevOps methodologies.
● Excellent communication and collaboration skills.
● Availability to work during US hours until 3 pm ET is essential for this role.
● Candidates must have their own computer and work setup for remote work.