Who are we?
IOG is a technology company focused on blockchain research and development. We are renowned for our scientific approach to blockchain development, emphasizing peer-reviewed research and formal methods to ensure security, scalability, and sustainability.
Our projects include the Cardano blockchain, as well as other products in the areas of decentralized finance (DeFi), governance, and identity management, aiming to advance the capabilities and adoption of blockchain and Web3 technology globally.
About Midnight:
IOG's Midnight Tribe is a business technology provider and core contributor to the Midnight Network, a blockchain platform for developing decentralized applications that safeguard personal and commercial data. The Midnight Network is the first blockchain to offer programmable data isolation: it uses zero-knowledge (ZK) proofs to enable selective disclosure of the information visible on-chain. It is designed to help developers implement necessary business policies, such as meeting regulatory requirements.
What the role involves:
As an experienced and visionary Head of Site Reliability Engineering (SRE), you will be responsible for leading the infrastructure and reliability strategy for Midnight, a regulatory-friendly blockchain focused on data protection, privacy, and freedom of expression.
In this senior leadership role, you will own the reliability, scalability, and performance of the Midnight platform. You will be responsible for building and leading a high-performing team of SREs, driving the SRE roadmap, and partnering closely with engineering, security, and product teams to deliver robust production systems.
You will be instrumental in setting the foundations of our infrastructure, designing systems that scale globally, and ensuring high availability, while embracing the unique challenges of a blockchain-based architecture. This is a hands-on leadership role combining technical depth, architectural vision, operational rigor, and people leadership.
- Lead the SRE team, sharing expertise and best practices. Coach, mentor, and develop the SRE team.
- Demonstrate leadership in driving initiatives that enhance service reliability, scalability, and overall performance.
- Lead the entire lifecycle of services, including inception, design, deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Oversee the maintenance of live services by continuously measuring and monitoring factors like availability, latency, and overall system health.
- Assist our teams in creating software that is both simple and flexible to configure and deploy.
- Lead sustainable incident response practices, ensuring timely resolution with a focus on minimizing impact.
- Collaborate with software engineering and testing teams to establish and maintain automated regression suite infrastructure and performance testing.
- Sustainably scale systems through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
- Conduct blameless postmortems to analyze incidents, identify root causes, and implement preventive measures.
Requirements
Who you are:
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- At least 8 years in a Reliability Engineering, DevOps, or infrastructure-focused role.
- Proven track record of leading and managing a high-performing SRE team.
- Experience writing code in Python, Rust/C++ or JavaScript.
- Proven experience in build and release engineering, Linux operational excellence, and automation.
- Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
- You work well both independently and as part of a team.
- You are kind and respectful of others’ opinions, and you are open and act with integrity when engaging in academic or technical discussions.
- Proven experience in capacity planning, performance monitoring, and optimization to ensure systems can handle current and future loads efficiently.
- System engineering experience working with application servers, containers, and web servers.
- Demonstrated ability to analyze incidents, identify root causes, and implement preventive measures to reduce the likelihood of recurring issues.
- Strong understanding of cloud architecture, including the major cloud providers (AWS, GCP, etc.).
- Experience working with Docker containers and related orchestration technologies (such as Kubernetes or ECS).
- Knowledge of SRE principles (observability, SLOs, SLIs, logging, etc.).
- Understanding of the underlying networking and security considerations when developing the architecture of our deployment environments.
- Fluency in Git-based workflows and commit discipline.
- Experience in providing mentorship and coaching to team members.
Are you an IOGer?
Do you find yourself questioning the status quo? Do you tinker with ideas and long to turn those ideas into solutions? Are you able to spark thoughtful debates, bringing out the inquisitiveness in others? Does the promise of continuously growing excite you? Then get ready to reimagine everything you thought wasn’t possible, because that’s what it means to be an IOGer: we don’t set limits, we break them.
Benefits
- Remote work
- Laptop reimbursement
- New starter package to buy hardware essentials (headphones, monitor, etc.)
- Learning & Development opportunities
- Competitive PTO
At IOG, we value diversity and always treat all employees and job applicants based on merit, qualifications, competence, and talent. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.