Associate Site Reliability Engineer

Remote:

Full Remote

Contract:

Full time

Experience:

Mid-level (2-5 years)

Work from:

Philippines

Offer summary

Qualifications:

2+ years of experience in SRE, DevOps, or Production Engineering
Familiarity with observability tools like Grafana and Prometheus
Experience with cloud platforms
Programming skills in Python or Go

Key responsibilities:

Design and maintain scalable resilient systems
Implement automation and performance optimization initiatives

Full Scale

201 - 500 Employees

Job description

This is a remote position.

Join one of the Philippines' fastest-growing tech companies!

Role Description:

As an Associate Site Reliability Engineer, you'll be at the forefront of our platform evolution, architecting solutions that ensure reliability, performance, and efficiency for our customers and their production workloads. You'll lead initiatives to enhance system reliability, implement innovative fault-tolerance strategies, and drive automation that significantly reduces toil. If you're passionate about solving complex challenges, mentoring the next generation of SREs, and implementing best practices that make a real difference in system reliability and cost-effectiveness, this is the role for you. You'll have the opportunity to work with a diverse set of technologies, influence our technical direction, and make a tangible impact on our platform's performance and reliability.

Responsibilities:

Work with a team of SREs to design, implement, and maintain highly scalable and resilient systems.
Help execute initiatives to significantly reduce toil through automation and process improvements.
Aid in executing performance optimization initiatives to enhance system efficiency and user experience
Architect and implement robust, secure, and scalable software solutions to support our platform
Continuously improve our secure, performant platform that supports tens of millions of end users.
Develop and implement strategies to optimize infrastructure costs without compromising reliability or performance
Drive continuous improvement in observability, including metrics, logging, and tracing to enhance system visibility and troubleshooting capabilities
Assist in implementing CI/CD pipelines to enhance deployment velocity while maintaining system stability and reliability
Design and implement sophisticated SLOs and SLIs to better align with business objectives
Constantly look for opportunities to automate and optimize
Contribute to alert management and incident response processes, reducing alert fatigue and minimizing MTTR
Establish monitoring systems to ensure the health, performance, and reliability of our platforms
Collaborate with development teams to build reliability and operability into services from the ground up.

Requirements

2+ years of experience in SRE, Production Engineering, or DevOps roles
Familiarity with modern observability practices and tools (e.g., Grafana, Prometheus, TICK stack, ELK stack, distributed tracing)
Experience with at least one major cloud platform and ability to design and troubleshoot multi-cloud architectures
Proven track record of significantly reducing toil and improving system reliability in large-scale environments
Demonstrated experience in performance tuning and cost optimization for large-scale systems
Proactive with natural problem-solving abilities, an inquisitive personality, a continuous learning approach, and an eagerness to tackle big problems even with uncertain requirements.
Experience designing and implementing effective alerting strategies that minimize noise and maximize signal.
Excellent communication skills with the ability to explain complex technical concepts to both technical and non-technical stakeholders.
Proven ability to drive adoption of SRE best practices across an organization.
Experience with a Kubernetes environment at a large scale.
On-call experience in critical services with good troubleshooting skills.

Desired experience:

Programming skills in languages commonly used for SRE tasks (e.g., Python, Go, Bash)
Understanding of Linux/Unix systems and networking principles
Proven ability to design and implement robust CI/CD pipelines
Experience with containerization and orchestration technologies, particularly Kubernetes
Experience in implementing and managing large-scale distributed systems
Track record of driving adoption of SRE best practices across an organization
Experience participating in major incident responses
Experience defining and implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs)