Offer summary

Qualifications:

5+ years in SRE, DevOps, or Cloud Infrastructure roles supporting production environments.

Advanced expertise in Microsoft Azure and strong experience with CI/CD using Azure DevOps and GitHub Actions.

Proficient in Python and comfortable with Node.js, with Infrastructure as Code skills using Terraform.

In-depth knowledge of Docker, Helm, Flux, and AKS, plus production experience managing MongoDB.

Key responsibilities:

Architect and manage scalable, cloud-native infrastructure in Microsoft Azure.

Design and maintain CI/CD pipelines for full-stack applications and ML workloads.

Implement observability and alerting strategies for real-time monitoring and logging.

Champion SRE best practices across the organization, including incident response and continuous reliability improvements.

Job description

Role: Senior Site Reliability Engineer (SRE)
Position Type: Full-Time Contract (40 hrs/week)
Contract Duration: 12 months+
Work Schedule: 8 hours/day (Mon-Fri)
Work Timezone: US Time
Location: 100% Remote (Candidates can work from anywhere in LATAM)

We're looking for a Senior Site Reliability Engineer (SRE) to join our Innovation Team, where we're building the next generation of AI-powered SaaS solutions. This is a high-impact, hands-on role supporting a fast-moving, multidisciplinary engineering team—including Angular developers, Node.js engineers, and data scientists working with OpenAI and agentic AI architectures.

You'll play a critical role in ensuring our infrastructure is scalable, resilient, observable, and automated to support production-grade applications and machine learning workloads.

What You'll Do:

Architect and manage scalable, cloud-native infrastructure in Microsoft Azure.
Design and maintain CI/CD pipelines using Azure DevOps, GitHub Actions, and Terraform Cloud for full-stack applications, data workflows, and ML workloads.
Deploy and operate containerized environments with Docker, Helm, Flux, and Azure Kubernetes Service (AKS) using GitOps best practices.
Support and optimize MongoDB clusters and cloud-native data pipelines.
Work closely with data scientists building solutions using OpenAI, LLMs, and agentic AI frameworks—ensuring robust compute and observability integration.
Provide infrastructure support for Databricks environments used in ML development and experimentation.
Implement observability and alerting strategies using Dynatrace for real-time monitoring, logging, and traceability.
Automate DevOps workflows and infrastructure operations using Python and Node.js.
Own production deployments, release coordination, and rollback readiness.
Participate in an on-call rotation and provide after-hours/weekend support as needed.
Champion SRE best practices across the organization: SLOs, SLIs, incident response, postmortems, and continuous reliability improvements.

Must-Have Qualifications:

5+ years in SRE, DevOps, or Cloud Infrastructure roles supporting production environments
Advanced expertise in Microsoft Azure (compute, networking, identity, and security)
Strong experience with CI/CD using Azure DevOps and GitHub Actions
Infrastructure as Code skills using Terraform
Proficient in Python (scripting/automation) and comfortable with Node.js
In-depth knowledge of Docker, Helm, Flux, AKS, and containerized architectures
Production experience managing and scaling MongoDB
Familiar with Databricks and ML pipeline operations
Hands-on experience with Dynatrace for observability and monitoring
Exposure to AI/LLM-based production workloads (e.g., OpenAI APIs, agentic AI systems)
Willingness to provide after-hours and weekend support

Nice-to-Have Skills:

Experience with MLOps and scalable ML model deployment
Familiarity with Angular CI/CD and frontend observability
Exposure to event-driven or serverless architectures (e.g., Azure Functions, Kafka)
Understanding of cloud security, compliance, and secrets management
Azure or Kubernetes certifications

Required profile