
Associate Site Reliability Engineer

Remote: Full Remote
Experience: Mid-level (2-5 years)

Offer summary

Qualifications:

  • 2+ years of experience in SRE, DevOps, or Production Engineering
  • Familiarity with observability tools like Grafana and Prometheus
  • Experience with cloud platforms
  • Programming skills in Python or Go

Key responsibilities:

  • Design and maintain scalable resilient systems
  • Implement automation and performance optimization initiatives

Full Scale
201 - 500 Employees

Job description

This is a remote position.

Join one of the Philippines’ fastest-growing tech companies!


Role Description:

As an Associate Site Reliability Engineer, you'll be at the forefront of our platform evolution, architecting solutions that ensure reliability, performance, and efficiency for our customers and their production workloads. You'll lead initiatives to enhance system reliability, implement innovative fault-tolerance strategies, and drive automation that significantly reduces toil. If you're passionate about solving complex challenges, mentoring the next generation of SREs, and implementing best practices that make a real difference in system reliability and cost-effectiveness, this is the role for you. You'll have the opportunity to work with a diverse set of technologies, influence our technical direction, and make a tangible impact on our platform's performance and reliability.

Responsibilities:

  • Work with a team of SREs to design, implement, and maintain highly scalable and resilient systems.

  • Help execute initiatives to significantly reduce toil through automation and process improvements.

  • Aid in executing performance optimization initiatives to enhance system efficiency and user experience.

  • Architect and implement robust, secure, and scalable software solutions to support our platform.

  • Continuously improve our secure, performant platform that supports tens of millions of end users.

  • Develop and implement strategies to optimize infrastructure costs without compromising reliability or performance.

  • Drive continuous improvement in observability, including metrics, logging, and tracing, to enhance system visibility and troubleshooting capabilities.

  • Assist in implementing CI/CD pipelines to enhance deployment velocity while maintaining system stability and reliability.

  • Design and implement sophisticated SLOs and SLIs to better align with business objectives.

  • Constantly look for opportunities to automate and optimize.

  • Contribute to alert management and incident response processes, reducing alert fatigue and minimizing MTTR.

  • Establish monitoring systems to ensure the health, performance, and reliability of our platforms.

  • Collaborate with development teams to build reliability and operability into services from the ground up.



Requirements


  • 2+ years of experience in SRE, Production Engineering, or DevOps roles.

  • Familiarity with modern observability practices and tools (e.g., Grafana, Prometheus, TICK stack, ELK stack, distributed tracing).

  • Experience with at least one major cloud platform and the ability to design and troubleshoot multi-cloud architectures.

  • Proven track record of significantly reducing toil and improving system reliability in large-scale environments.

  • Demonstrated experience in performance tuning and cost optimization for large-scale systems.

  • Proactive with natural problem-solving abilities, an inquisitive personality, a continuous learning approach, and an eagerness to tackle big problems even with uncertain requirements.

  • Experience designing and implementing effective alerting strategies that minimize noise and maximize signal.

  • Excellent communication skills with the ability to explain complex technical concepts to both technical and non-technical stakeholders.

  • Proven ability to drive adoption of SRE best practices across an organization.

  • Experience operating Kubernetes environments at large scale.

  • On-call experience in critical services with good troubleshooting skills.


Desired experience: 

  • Programming skills in languages commonly used for SRE tasks (e.g., Python, Go, Bash)

  • Understanding of Linux/Unix systems and networking principles

  • Proven ability to design and implement robust CI/CD pipelines

  • Experience with containerization and orchestration technologies, particularly Kubernetes

  • Experience in implementing and managing large-scale distributed systems

  • Track record of driving adoption of SRE best practices across an organization

  • Experience participating in major incident responses

  • Experience defining and implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) 



Benefits
  • Permanent Work-from-home/Work-anywhere in the Philippines

  • Work-from-home allowance

  • Health insurance from day 1 of employment, with free coverage for three (3) dependents

  • Group Term Life Insurance

  • A laptop and other equipment

  • Other top benefits



Required profile

Experience

Level of experience: Mid-level (2-5 years)
Spoken language(s): English

Other Skills

  • Communication
  • Problem Solving
