Senior Site Reliability Engineer (SRE)

Remote: Full Remote

Offer summary

Qualifications:

  • 5+ years in SRE, DevOps, or Cloud Infrastructure roles supporting production environments
  • Advanced expertise in Microsoft Azure and strong experience with CI/CD using Azure DevOps and GitHub Actions
  • Proficient in Python and comfortable with Node.js, with Infrastructure as Code skills using Terraform
  • In-depth knowledge of Docker, Helm, Flux, and AKS, plus production experience managing MongoDB

Key responsibilities:

  • Architect and manage scalable, cloud-native infrastructure in Microsoft Azure.
  • Design and maintain CI/CD pipelines for full-stack applications and ML workloads.
  • Implement observability and alerting strategies for real-time monitoring and logging.
  • Champion SRE best practices across the organization, including incident response and continuous reliability improvements.

Sky Systems, Inc. (SkySys) Information Technology & Services Startup https://myskysys.com/
11 - 50 Employees

Job description

Role: Senior Site Reliability Engineer (SRE)
Position Type: Full-Time Contract (40hrs/week)
Contract Duration: 12 months+
Work Schedule: 8 hours/day (Mon-Fri)
Work Timezone: US Time
Location: 100% Remote (candidates can work from anywhere in LATAM)

We're looking for a Senior Site Reliability Engineer (SRE) to join our Innovation Team, where we're building the next generation of AI-powered SaaS solutions. This is a high-impact, hands-on role supporting a fast-moving, multidisciplinary engineering team—including Angular developers, Node.js engineers, and data scientists working with OpenAI and agentic AI architectures.

You'll play a critical role in ensuring our infrastructure is scalable, resilient, observable, and automated to support production-grade applications and machine learning workloads.

What You'll Do:

  • Architect and manage scalable, cloud-native infrastructure in Microsoft Azure.
  • Design and maintain CI/CD pipelines using Azure DevOps, GitHub Actions, and Terraform Cloud for full-stack applications, data workflows, and ML workloads.
  • Deploy and operate containerized environments with Docker, Helm, Flux, and Azure Kubernetes Service (AKS) using GitOps best practices.
  • Support and optimize MongoDB clusters and cloud-native data pipelines.
  • Work closely with data scientists building solutions using OpenAI, LLMs, and agentic AI frameworks—ensuring robust compute and observability integration.
  • Provide infrastructure support for Databricks environments used in ML development and experimentation.
  • Implement observability and alerting strategies using Dynatrace for real-time monitoring, logging, and traceability.
  • Automate DevOps workflows and infrastructure operations using Python and Node.js.
  • Own production deployments, release coordination, and rollback readiness.
  • Participate in an on-call rotation and provide after-hours/weekend support as needed.
  • Champion SRE best practices across the organization: SLOs, SLIs, incident response, postmortems, and continuous reliability improvements.

Must-Have Qualifications:

  • 5+ years in SRE, DevOps, or Cloud Infrastructure roles supporting production environments
  • Advanced expertise in Microsoft Azure (compute, networking, identity, and security)
  • Strong experience with CI/CD using Azure DevOps and GitHub Actions
  • Infrastructure as Code skills using Terraform
  • Proficiency in Python (scripting/automation) and working knowledge of Node.js
  • In-depth knowledge of Docker, Helm, Flux, AKS, and containerized architectures
  • Production experience managing and scaling MongoDB
  • Familiarity with Databricks and ML pipeline operations
  • Hands-on experience with Dynatrace for observability and monitoring
  • Exposure to AI/LLM-based production workloads (e.g., OpenAI APIs, agentic AI systems)
  • Willingness to provide after-hours and weekend support

Nice-to-Have Skills:

  • Experience with MLOps and scalable ML model deployment
  • Familiarity with Angular CI/CD and frontend observability
  • Exposure to event-driven or serverless architectures (e.g., Azure Functions, Kafka)
  • Understanding of cloud security, compliance, and secrets management
  • Azure or Kubernetes certifications

Required profile

Experience

Industry: Information Technology & Services
Spoken language(s): English

Other Skills

  • Time Management
  • Teamwork
  • Communication
  • Problem Solving
