Site Reliability Engineer

Remote: Full Remote

Offer summary

Qualifications:

  • 4+ years of experience as a Site Reliability Engineer or in a similar role, with a focus on cloud infrastructure and automation.
  • Expertise in Infrastructure as Code (IaC) using Terraform and Terragrunt.
  • Strong knowledge of AWS cloud services and experience with Kafka, Redis, and Kubernetes.
  • Excellent problem-solving and communication skills, with a background in agile environments.

Key responsibilities:

  • Design, build, and maintain scalable infrastructure and CI/CD pipelines using Terraform and AWS.
  • Manage and optimize cloud environments for cost-efficiency and high availability.
  • Implement monitoring and alerting systems to ensure system reliability and troubleshoot complex issues.
  • Collaborate with development teams to integrate new features and foster a culture of continuous improvement.

Blackpoint Cyber Cybersecurity Scaleup https://www.blackpointcyber.com/
51 - 200 Employees

Job description

Blackpoint Cyber is the leading provider of world-class cybersecurity threat hunting, detection, and remediation technology. Founded by former National Security Agency (NSA) cyber operations experts who applied their learnings to bring national security-grade technology solutions to commercial customers around the world, Blackpoint Cyber is in hyper-growth mode, fueled by a recent $190M Series C round.

Job Overview: 

We’re on the lookout for a passionate and experienced Site Reliability Engineer (SRE) to join our high-impact, fast-moving team. In this role, you’ll take the lead in designing, building, and scaling robust infrastructure, CI/CD pipelines, and build systems that power our products. You’ll work together with cross-functional teams to drive system reliability, performance, and automation, all while championing a culture of innovation, collaboration, and continuous improvement. 

Key Responsibilities: 

Infrastructure & Cloud Management 

  • Design, build, and maintain highly scalable infrastructure using Terraform and Terragrunt to automate cloud resource provisioning. 

  • Manage and optimize AWS cloud environments for cost-efficiency, security, and high availability. 

  • Continuously improve infrastructure automation tools and methodologies to support scalability and maintainability. 

Platform & System Reliability 

  • Manage and scale Kafka and Confluent Cloud platforms for real-time data streaming. 

  • Deploy and maintain Redis instances to support caching and real-time data processing workloads. 

  • Implement and maintain robust monitoring and alerting systems using Prometheus, Grafana, Alertmanager, and Opsgenie to ensure system reliability and visibility.

  • Troubleshoot and resolve complex system issues, ensuring optimal performance and uptime. 

Deployment & Release Engineering 

  • Manage Kubernetes clusters using tools like Helm, ArgoCD, Istio, and Kustomize to support modern infrastructure-as-code and continuous delivery practices. 

  • Enable feature flag management and safe, controlled rollouts using LaunchDarkly. 

Collaboration & Continuous Improvement 

  • Work closely with development teams to seamlessly integrate new features and services into the infrastructure. 

  • Foster a culture of continuous improvement by regularly evaluating and adopting emerging SRE tools, technologies, and best practices. 

 

Skills & Qualifications: 

  • 4+ years of proven experience as a Site Reliability Engineer or in a similar role, with a strong focus on cloud infrastructure and automation.

  • Excellent problem-solving skills with the ability to troubleshoot complex systems in production. 

  • Strong communication and collaboration skills, with experience working in agile environments. 

  • Expertise in Infrastructure as Code (IaC) using Terraform and Terragrunt. 

  • Deep knowledge of AWS cloud services and best practices for designing secure and scalable architectures. 

  • Hands-on experience with Confluent Cloud and Kafka for distributed data streaming. 

  • Strong experience with Redis for caching and RDS for data storage.

  • Strong experience with OpenSearch, Elasticsearch, or ChaosSearch.

  • Proficiency in monitoring and alerting using Prometheus, Grafana, and Alertmanager.

  • Extensive experience managing Kubernetes clusters, including package management with Helm, deployment with ArgoCD, and service mesh configurations using Istio. 

  • Familiarity with Kustomize for Kubernetes resource configuration. 

  • Development experience in Node.js, Python, or Go.

Nice to Have: 

  • Experience with multi-cloud environments (e.g., GCP, Azure). 

  • Familiarity with security and compliance best practices in cloud and containerized environments.

  • Knowledge of serverless architectures and CI/CD tools such as Jenkins and/or GitHub Actions. 

Blackpoint Cyber welcomes and encourages applications from qualified individuals of all races, colors, religions, sex, sexual orientation, gender identity or expression, national origin, age, marital status, or any other legally protected status. We are committed to equality of opportunity in all aspects of employment.

For eligible employees in the US, Blackpoint offers competitive Health, Vision, Dental, and Life Insurance plans, a robust 401(k) plan, Discretionary Time Off, and other minor perks.

Required profile

Experience

Industry: Cybersecurity
Spoken language(s): English

Other Skills

  • Collaboration
  • Communication
  • Problem Solving
