Offer summary

Qualifications:

5+ years of experience in Site Reliability Engineering or related fields, with at least 3 years in AWS.

Strong experience with Kubernetes and related container orchestration tools.

Proficient in databases like MongoDB, Postgres, and DynamoDB, along with coding skills in IaC languages such as Terraform and programming languages like Python or Java.

Excellent problem-solving abilities and strong communication skills to convey complex technical concepts.

Key responsibilities:

Design and maintain service level objectives (SLOs) that align with business goals.

Develop observability strategies and implement scalable infrastructure solutions using cloud-native technologies.

Drive automation initiatives to enhance system reliability and lead incident response efforts.

Champion reliability best practices and participate in on-call rotations for continuous service improvement.

Job description

iCapital is powering the world’s alternative investment marketplace. Our financial technology platform has transformed how advisors, wealth management firms, asset managers, and banks evaluate and recommend bespoke public and private market strategies for their high-net-worth clients. iCapital services approximately $210 billion in global client assets invested in 1,690 funds, as of November 2024.
iCapital has been named to the Forbes Fintech 50 for six consecutive years (2018-2024), has been selected three times by Forbes for its list of Best Startup Employers (2021-2023), and is a three-time winner of the MMI/Barron’s Solutions Provider award (see link below).

About the Role

The Site Reliability Engineering team at iCapital is fundamental to ensuring our platform delivers consistent, reliable service to our client base. As an Assistant Vice President, you'll work at the intersection of software engineering and operations, applying engineering principles to infrastructure challenges. You'll be responsible for designing and implementing systems that scale efficiently, architecting observability solutions that provide actionable insights, and building automation that enhances our platform's reliability. This role requires someone who thinks systematically about reliability, can translate business requirements into technical implementations, and thrives on making complex systems more robust.

Responsibilities

Design, implement, and maintain service level objectives (SLOs) that align with business goals and customer expectations
Develop observability strategies, focusing on meaningful metrics that drive actionable insights
Architect and implement scalable infrastructure solutions using cloud-native technologies and infrastructure as code
Drive automation initiatives to eliminate toil and improve system reliability
Champion reliability best practices across development teams through consultation and tooling
Design and operate a Kubernetes environment for container management and orchestration
Lead incident response, conduct thorough postmortems, and drive systematic improvements
Participate in on-call rotations with a focus on continuous service improvement

Qualifications

5+ years of SRE or related experience, including 3+ years in AWS
Strong experience with container orchestration platforms like Kubernetes and related ecosystem tools
Working knowledge of databases such as MongoDB, Postgres, and DynamoDB
Strong foundation in reliability engineering principles and distributed systems behavior
Experience defining and implementing SLOs/SLIs and using them to drive system improvements
Demonstrated ability to design and implement observability solutions that provide actionable insights while minimizing alert fatigue
Coding abilities in at least one IaC language (Terraform strongly preferred) and one programming language such as Python, Ruby or Java with a focus on maintainable, tested code
Understanding of modern observability practices and experience implementing and maintaining monitoring solutions (Prometheus/Grafana, Splunk, New Relic, CloudWatch, and ELK in the cloud)
Strong incident response skills with experience leading incident retrospectives and driving improvements
Excellent problem-solving abilities and experience debugging distributed systems
Track record of successfully automating operations and reducing toil
Strong communication skills with ability to explain complex technical concepts to diverse audiences
A desire to share, teach, and learn as part of a team

Employees in this role will work fully remote. Every department has different needs, and some positions will be designated in-office jobs, based on their function.

Benefits

iCapital offers a comprehensive benefits package that includes a total compensation program consisting of competitive salary, annual performance bonus, and equity for all full-time employees; healthcare with 100% employer-paid health and dental insurance; and generous paid time off (PTO).

For additional information on iCapital, please visit https://www.icapital.com/about-us | Twitter: @icapitalnetwork | LinkedIn: https://www.linkedin.com/company/icapital-network-inc

Required profile