(Senior) Reliability Engineer (d/f/m)
Why emnify?
With a predicted 25 billion connected IoT endpoints by 2025, the commercial and technological opportunities presented by the IoT are endless. Not to mention the career paths this exciting space has created. emnify stands out by delivering the next generation of connectivity technology to IoT solution providers worldwide – either directly or via strategic partnerships with CSPs (Communication Service Providers). The most exciting thing about working at emnify is that we are just at the beginning of our journey. We are constantly developing our culture, our people, and our business approach further. Our guiding principles are driving transformation, customer centricity, and empowering people. If you share our vision to unlock the potential of connectivity in a people-focused culture that gives you the chance for impact, growth, and shared success, we would be happy to have you join us.
Your Role:
Are you passionate about streamlining software development processes and eager to make a significant impact in the cloud-based telecom services sector? emnify is seeking a talented Reliability Engineer to drive incident management and improve platform observability in order to support incident prevention and shorten resolution times. This role is critical to monitoring, detecting, and resolving potential issues within our complex microservice-based platform. The ideal candidate will have extensive experience with AWS cloud infrastructure, microservices, and modern observability practices.
As part of the larger Engineering department, our Platform team plays a crucial role in enhancing our competitive edge by improving the developer experience to increase development efficiency and scale productivity. You will join a team of 3 engineers, fostering empathy and a collaborative mindset to ensure continuous improvement of the developer experience at emnify.
emnify technology radar: https://emnify.github.io/tech-radar/
The position can be based in emnify’s office in Berlin or Würzburg, or remote within Germany or Poland.
Your Impact:
Incident Management:
-Lead the incident management process, ensuring timely identification, resolution and documentation of incidents.
-Coordinate cross-functional teams during incident investigations and resolution processes.
-Conduct post-incident reviews and root cause analyses to prevent future occurrences.
Observability Framework Implementation:
-Design and implement comprehensive observability frameworks to provide monitoring and alerting capabilities.
-Develop and maintain dashboards, alerts, and metrics to ensure the health and performance of services.
-Implement logging strategies that provide insights for debugging and resolving issues.
Monitoring and Alerting:
-Establish monitoring solutions for proactive detection of potential issues across the platform.
-Develop automated alerting systems to notify relevant teams of anomalies and performance degradation.
-Continuously improve monitoring and alerting systems to enhance detection capabilities.
Collaboration and Support:
-Work closely with development and infrastructure teams to implement observability best practices.
-Provide guidance and training to teams on effective use of observability tools and frameworks.
-Assist in the design and deployment of reliable and scalable microservice architectures.
-Actively use metrics data to inform the prioritization of engineering needs.
AWS Cloud Infrastructure:
-Leverage AWS services to build and maintain a resilient cloud infrastructure.
-Implement best practices for security, scalability, and cost optimization in AWS.
-Ensure high availability and disaster recovery capabilities for critical services.
-Build and maintain platform components, including development pipelines, shared infrastructure, and application services.
Your Skills:
-Proven experience as a Reliability Engineer, Site Reliability Engineer (SRE), or in a similar role at a SaaS and/or telecom company.
-Strong understanding of incident management frameworks and best practices (e.g., ITIL, SRE).
-Hands-on experience with observability tools (e.g., Prometheus, Grafana, ELK Stack, Loki, Opsgenie).
-Proficiency in monitoring and alerting systems, including setup and optimization.
-Extensive experience with AWS cloud services (e.g., EC2, S3, RDS, Lambda, CloudWatch).
-Familiarity with programming languages such as Python, Go, or Java.
-Exceptional problem-solving and critical thinking with a passion for enhancing development experiences in fast-paced tech environments.
-AWS certification (e.g., AWS Certified DevOps Engineer, AWS Certified Solutions Architect) is preferred.