(Senior) Reliability Engineer (m/f/d)

Remote:

Full Remote

Experience:

Senior (5-10 years)

Work from:

Germany

Offer summary

Qualifications:

Experience in SaaS and telecom sectors
Strong knowledge of incident management frameworks
Proficiency in AWS cloud services
Familiarity with programming languages like Python or Java

Key responsibilities:

Lead incident management processes
Design observability frameworks and monitoring solutions

EMnify Telecommunication Services Scaleup http://www.emnify.com

51 - 200 Employees

Job description

(Senior) Reliability Engineer (d/f/m)

Why emnify?

With a predicted 25 billion connected IoT endpoints by 2025, the commercial and technological opportunities presented by the IoT are endless, as are the career paths this exciting space has created. emnify stands out by delivering the next generation of connectivity technology to IoT solution providers worldwide, either directly or via strategic partnerships with CSPs (Communication Service Providers). The most exciting thing about working at emnify is that we are just at the beginning of our journey. We are constantly developing our culture, our people, and our business approach. Our guiding principles are driving transformation, customer centricity, and empowering people. If you share our vision to unlock the potential of connectivity in a people-focused culture that gives you the chance for impact, growth, and shared success, we would be happy for you to join us.

Your Role:

Are you passionate about streamlining software development processes and eager to make a significant impact in the cloud-based telecom services sector? emnify is seeking a talented Reliability Engineer to drive incident management and improve platform observability, supporting incident prevention and shortening resolution times. This role is critical to monitoring, detecting, and resolving potential issues within our complex microservice-based platform. The ideal candidate will have extensive experience with AWS cloud infrastructure, microservices, and modern observability practices.

As part of the larger Engineering department, our Platform team plays a crucial role in enhancing our competitive edge by improving the developer experience to increase development efficiency and scale productivity. You will join a team of 3 engineers, fostering empathy and a collaborative mindset to ensure continuous improvement of the developer experience at emnify.

emnify technology radar: https://emnify.github.io/tech-radar/

The position can be based in emnify’s Berlin or Würzburg office, or remote in Germany or Poland.

Your Impact:

Incident Management:

- Lead the incident management process, ensuring timely identification, resolution and documentation of incidents.

- Coordinate cross-functional teams during incident investigations and resolution processes.

- Conduct post-incident reviews and root cause analyses to prevent future occurrences.

Observability Framework Implementation:

- Design and implement comprehensive observability frameworks to provide monitoring and alerting capabilities.

- Develop and maintain dashboards, alerts, and metrics to ensure the health and performance of services.

- Implement logging strategies that provide insights for debugging and resolving issues.

Monitoring and Alerting:

- Establish monitoring solutions for proactive detection of potential issues across the platform.

- Develop automated alerting systems to notify relevant teams of anomalies and performance degradation.

- Continuously improve monitoring and alerting systems to enhance detection capabilities.

Collaboration and Support:

- Work closely with development and infrastructure teams to implement observability best practices.

- Provide guidance and training to teams on effective use of observability tools and frameworks.

- Assist in the design and deployment of reliable and scalable microservice architectures.

- Actively use metrics data to help prioritize engineering needs.

AWS Cloud Infrastructure:

- Leverage AWS services to build and maintain a resilient cloud infrastructure.

- Implement best practices for security, scalability, and cost optimization in AWS.

- Ensure high availability and disaster recovery capabilities for critical services.

- Build and maintain platform components, including development pipelines, shared infrastructure, and application services.

Your Skills:

- Proven experience as a Reliability Engineer, Site Reliability Engineer (SRE), or in a similar role at a SaaS and/or telecom company.

- Strong understanding of incident management frameworks and best practices (e.g., ITIL, SRE).

- Hands-on experience with observability tools (e.g., Prometheus, Grafana, ELK Stack, Loki, Opsgenie).

- Proficiency in monitoring and alerting systems, including setup and optimization.

- Extensive experience with AWS cloud services (e.g., EC2, S3, RDS, Lambda, CloudWatch).

- Familiarity with programming languages such as Python, Go, or Java.

- Exceptional problem-solving and critical thinking skills, with a passion for enhancing development experiences in fast-paced tech environments.

- AWS certification (e.g., AWS Certified DevOps Engineer, AWS Certified Solutions Architect) preferred.

Required profile

Experience

Level of experience: Senior (5-10 years)

Industry:

Telecommunication Services

Spoken language(s):

English

Hard Skills

Incident Management
Amazon Web Services
Observability
Reliability Engineering
Continuous Monitoring
Prometheus (Software)
Java (Programming Language)
Python (Programming Language)
Elastic (ELK) Stack
Microservices
Go (Programming Language)
Grafana
ITIL v3
AWS Cloud Development Kit (CDK)

Other Skills

Collaboration
Critical Thinking
Problem Solving
