Site Reliability Engineer

extra holidays - extra parental leave
Remote: Full Remote
Contract:
Experience: Senior (5-10 years)
Work from:

Offer summary

Qualifications:

  • Proven experience as an SRE or in a similar role
  • Strong knowledge of Elasticsearch/OpenSearch architecture
  • Experience with performance tuning and cluster optimization
  • Understanding of JVM concepts and programming languages
  • Familiarity with monitoring and automation tools

Key responsibilities:

  • Oversee the performance and reliability of Elasticsearch/OpenSearch clusters
  • Implement best practices for scaling and indexing
  • Develop and maintain automated performance testing and monitoring
  • Diagnose and resolve issues related to cluster health and performance
  • Collaborate with development and DevOps teams for system enhancements

Job description

Description

Coralogix is a modern, full-stack observability platform transforming how businesses process and understand their data. Our unique architecture powers in-stream analytics without reliance on expensive indexing or hot storage. We specialize in comprehensive monitoring of logs, metrics, traces, and security events with features such as APM, RUM, SIEM, Kubernetes monitoring and more, all enhancing operational efficiency and reducing observability spend by up to 70%.

We are seeking a skilled Site Reliability Engineer (SRE) with a strong background in Elasticsearch/OpenSearch to join our team. The ideal candidate will manage and optimize large-scale Elasticsearch/OpenSearch clusters, ensuring the infrastructure's stability, performance, and scalability. You'll work closely with development and operations teams to build robust and efficient systems.

Key Responsibilities:

  • Manage & Monitor: Oversee the performance, reliability, and availability of large-scale Elasticsearch/OpenSearch clusters.
  • Optimize & Scale: Implement best practices for scaling, indexing, and querying to ensure optimal performance.
  • Automate & Streamline: Develop and maintain automated performance testing or benchmarking, monitoring, and alerting for Elasticsearch/OpenSearch clusters.
  • Troubleshoot & Resolve: Quickly identify and resolve issues related to cluster health, data integrity, performance bottlenecks, and search accuracy.
  • Collaborate: Work closely with development, DevOps, and other teams to design and implement enhancements to cluster architecture, stability, performance, and data management flows.

Requirements


  • Experience: Proven experience as an SRE or in a similar role, with specific expertise in managing Elasticsearch or OpenSearch clusters.
  • Technical Skills:
      • Strong knowledge of Elasticsearch/OpenSearch architecture, including index management, sharding, and replication.
      • Experience with performance tuning, scaling, and cluster optimization.
      • Understanding of JVM concepts and ability to code in Java, Scala, Python, or Go.
      • Familiarity with monitoring tools (e.g., Prometheus, Grafana).
      • Experience with configuration management and automation tools (e.g., Ansible, Terraform, Kubernetes).
  • Problem Solving: Ability to diagnose and troubleshoot complex performance and stability issues in large-scale distributed systems.
  • Communication: Strong verbal and written communication skills to collaborate across teams and document processes clearly.

Preferred Skills:

  • Familiarity with other distributed systems (e.g., Apache Solr, Kafka).
  • Knowledge of CI/CD pipelines and experience with DevOps practices.
  • Experience with cloud providers (AWS, Azure, GCP).

Required profile

Experience

Level of experience: Senior (5-10 years)
Spoken language(s):
English

Other Skills

  • Communication
  • Collaboration
  • Problem Solving
