About Avra

Avra is a deep tech data intelligence platform powered by foundational AI that translates the complexity of SMEs into strategic decisions for large enterprises. We develop our own foundational models from the ground up, without relying on third-party solutions, to deliver insights that empower some of the leading banks and fintechs across Latin America. Avra was founded in 2024 by Bruno Alano (ex-OpenAI) and Viviane Meister, and our team brings together expertise from NVIDIA, Palantir, Google, and more.

About the Role

As a Senior Site Reliability Engineer at Avra, you will be responsible for designing, building, and maintaining the infrastructure that powers our AI platform. You will play a crucial role in ensuring the reliability, scalability, and security of our systems as we process vast amounts of data and deliver real-time insights. Working closely with our engineering and data science teams, you will create resilient infrastructure that supports our heterogeneous graph neural networks and knowledge graph processing capabilities.

Responsibilities

Platform Reliability: Design and implement highly available, fault-tolerant systems across our multi-cloud environment (AWS and GCP) that support our graph processing and AI inference workloads.
Kubernetes Platform Engineering: Design, implement, and maintain our production Kubernetes environments on GKE and EKS, ensuring high availability, scalability, and security.
Observability & Monitoring: Develop comprehensive monitoring, alerting, and logging systems to ensure 99.9%+ uptime for critical services and provide visibility into system performance.
Infrastructure as Code: Create and maintain infrastructure as code using Terraform to automate provisioning and configuration management.
Performance Optimization: Identify and resolve performance bottlenecks in our distributed systems, particularly around graph processing and real-time inference workflows.
Security Engineering: Collaborate with security teams to implement robust security practices, supporting our ISO 27001 certification and NIST CSF 2.0 alignment efforts.
CI/CD Pipeline Enhancement: Improve and maintain our continuous integration and deployment pipelines to support rapid, reliable software delivery.
Incident Response: Lead incident response efforts, conduct post-mortems, and implement systems to prevent recurrence of issues.

You Stand Out If

You have experience building and maintaining infrastructure for data-intensive or AI applications, particularly those involving graph processing or machine learning.
You have deep expertise with Kubernetes, including advanced concepts such as custom controllers, operators, network policies, and multi-cluster management.
You excel at designing scalable, distributed systems that can handle terabytes of data and millions of requests.
You are proficient with container orchestration tools like Kubernetes and have experience managing deployments across AWS and GCP environments.
You have significant experience with GKE (Google Kubernetes Engine) and EKS (Amazon Elastic Kubernetes Service) in production environments.
You have implemented robust observability solutions and can effectively troubleshoot complex system failures.
You practice a security-first mindset and have experience implementing infrastructure security controls.
You are passionate about automation and eliminating toil through effective tooling.

Qualifications

Experience: 5+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles, with at least 3 years of hands-on Kubernetes experience in production environments.
Kubernetes Expertise: Proven experience managing Kubernetes at scale, including cluster architecture, security hardening, resource optimization, and upgrade management.
Technical Skills: Proficiency in programming (Go, Python, or similar), cloud platforms (AWS, GCP), containerization (Docker, Kubernetes), and monitoring technologies (OpenTelemetry, Prometheus, Grafana, ELK stack, etc.).
System Design: Strong understanding of distributed systems design, failure modes, and mitigation strategies.
Problem-Solving: Exceptional debugging skills and the ability to troubleshoot complex issues across the entire technology stack.
Collaboration: Excellent communication skills and ability to work effectively with cross-functional teams in a remote environment.

Why Join Avra?

Cutting-Edge Technology: Build infrastructure for a deep tech AI platform that processes data from millions of Brazilian companies to enable better business decisions.
Competitive Compensation: Attractive salary, equity participation, and full transparency in our compensation structure.
Direct Impact: Work closely with the founders to shape the infrastructure vision of a fast-growing startup.
Technical Challenges: Solve complex problems around graph processing, real-time inference, and large-scale data systems.
Flexible Work Culture: Enjoy the benefits of 100% remote work with access to an office in São Paulo, unlimited vacation, and a comprehensive benefits package including a national health plan and generous parental leave.

If you are passionate about building reliable, scalable infrastructure for AI systems and want to help us revolutionize how businesses make decisions about SMEs in Brazil, we'd love to hear from you. Apply now to join Avra and help us build the future of AI-powered business intelligence in Latin America.
