
Principal Architect/Technical Lead Manager, Site Reliability Engineering

Remote: Full Remote

Offer summary

Qualifications:

  • 10+ years of experience in software engineering or site reliability engineering roles.
  • 2+ years of hands-on leadership experience managing and mentoring SRE teams.
  • Proficiency in Golang and Python development.
  • Extensive experience with cloud platforms and Infrastructure-as-Code (IaC) using Terraform.

Key responsibilities:

  • Lead and mentor a small team of Site Reliability Engineers.
  • Ensure uptime for crucial services through proactive monitoring and fault-tolerant design.
  • Develop and implement automation tools to streamline operations and reduce human error.
  • Collaborate with product engineering to meet service-level objectives and reliability targets.

Aviatrix SME https://www.aviatrix.com/
201 - 500 Employees

Job description

 
 

The Aviatrix SRE team is a small but highly skilled global group of Systems Engineers/SREs dedicated to ensuring the reliability, availability, and performance of Aviatrix’s critical systems and services. Our mission is to build and maintain a robust, resilient infrastructure that enables Aviatrix to deliver high-quality services with agility through automation, best practices, and a culture of operational excellence.

About the Role

As an SRE – Principal Architect and Technical Lead Manager, you will lead and manage a small team of SREs in designing, implementing, and maintaining highly available, fault-tolerant, and scalable systems. You’ll focus on automation, proactive monitoring, and Infrastructure-as-Code (IaC) to drive efficiency and reliability across our services.

Tech Stack & Responsibilities

  • Kubernetes – Manage application lifecycles, automate operational tasks, troubleshoot issues, integrate monitoring and alerting, optimize infrastructure, and ensure reliable operations using custom-built operators and cdk8s.
  • Terraform – Implement Infrastructure-as-Code (IaC) to enable rapid provisioning, seamless configuration changes, and efficient scaling.
  • Automation & Development – Build and enhance automation tools and frameworks in Golang and Python to streamline operations.

On-Call Rotation

We maintain a structured on-call rotation to ensure 24/7 coverage.

Location & Eligibility

This is a remote role open to candidates located in the US, ideally in the Eastern or Central time zone.

RESPONSIBILITIES:  

  • Lead and mentor a small global team of Site Reliability Engineers.
  • Ensure Reliability and Availability: You will ensure uptime for crucial services and systems based on business-required SLOs, and minimize service disruptions through proactive monitoring, capacity planning, and fault-tolerant design.
  • Architecture and System Design: You will design and architect complex, scalable, and reliable systems.
  • Automation and Efficiency: You will develop and implement tools and frameworks that automate routine tasks, reduce human error, and streamline operational processes.
  • Build Observability and Monitoring Tools: You will define, build, deploy, maintain, and extend our observability and monitoring tools to enhance system reliability and availability.
  • Incident Management and Response: You will maintain an effective on-call rotation to ensure 24/7 coverage, and follow incident response procedures to swiftly address and mitigate service disruptions.
  • Performance Monitoring and SLIs/SLOs: You will help define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to set clear expectations for system performance.
  • Collaboration: You will work closely with product engineering to ensure service-level objectives and reliability targets are met.
  • Problem-Solving & Troubleshooting: You will respond to escalations by troubleshooting complex system and application incidents, performing root cause analysis, and implementing the necessary corrective actions.
  • Thought Leadership and Innovation: You will stay up to date with the latest industry trends and emerging technologies, and iterate on best practices to increase the quality and velocity of development and deliverables.

QUALIFICATIONS:   

  • 10+ years of experience in software engineering or site reliability engineering roles.
  • 2+ years in a hands-on leadership role, managing and mentoring SRE teams.
  • Proficiency in Golang and Python development.
  • Extensive experience with cloud platforms (e.g., AWS, Azure, GCP) and cloud-native technologies.
  • Infrastructure-as-Code (IaC): Deep understanding of Terraform core components (Terragrunt is a bonus), with real-world experience using Terraform for infrastructure provisioning and management.
  • Good knowledge of Kubernetes (cdk8s and operators are a bonus).
  • Solid experience developing automation tools and frameworks.
  • Experience with logging solutions (e.g., Loki, Syslog, Elasticsearch, Logstash, Kibana, Filebeat, Fluent Bit).
  • Experience with monitoring and metrics solutions (e.g., Prometheus, Grafana, VictoriaMetrics).
  • Practical experience with Linux system administration.
  • Experience with version control systems (e.g., Git, GitHub) and code review.
  • Excellent communication skills are required.

US Pay Range

The US National annual base salary range for this full-time position is $244,000 – $265,000 + annual performance bonus + benefits + 401(k) match + equity. The pay range is determined by role, work location, job-related skills, level, experience, and relevant education. [Certain roles are eligible to earn sales commission, depending on the terms of the applicable plan.] The range displayed is the minimum and maximum target base salary and is applicable only for new hires for the listed position located in the US. Your Talent Advisor can share more details regarding salary ranges, benefits, and equity for your location during the hiring process.


BENEFITS

US: We cover 100% of employee premiums and 88% of dependent(s) premiums for medical, dental and vision coverage, 401(k) match, short and long-term disability, life/AD&D insurance, $1,000/year education reimbursement, and a flexible vacation policy. 

Outside the US: We offer a comprehensive benefits package which (subject to regional variations) could include pension, private medical for you and dependents, generous holiday allowance, life assurance, long-term disability, and an annual wellbeing stipend.

Your total compensation package will be based on job-related knowledge, education, certifications and location, per our aligned ranges.

About Aviatrix
Aviatrix is the cloud networking expert. We’re on a mission to make cloud networking simple so companies stay agile. Trusted by more than 500 of the world’s leading enterprises, our cloud networking platform creates the visibility, security, and control needed to adapt with ease and move ahead at speed. Combined with the Aviatrix Certified Engineer (ACE) Program, the industry's leading multicloud networking and security certification, Aviatrix empowers the cloud networking community to stay at the forefront of digital transformation.

WE WANT TO INCLUDE YOU

We embrace the fact that not everyone’s journey took the same route or started at the same place. If your experience doesn’t quite meet the requirements but the opportunity excites you and you believe you could be great, don’t let that hold you back from applying. Tell us what you CAN bring and what makes you special.

Aviatrix is a community where everyone's career can grow and we want to help you achieve your goals and be “your best YOU,” however that looks. If you're seeking an opportunity where you can be excited to start work every morning with enthusiastic people, make a real difference and be part of something amazing then let’s talk. We want to get to know you and how we could grow together.

Aviatrix, Inc. is an equal opportunity employer and does not make hiring decisions based on race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws. This policy applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation and training.

CPRA - California Applicant Privacy Notice

 

Required profile

Spoken language(s):
English

Other Skills

  • Team Leadership
  • Communication
  • Problem Solving
