
Staff Site Reliability Engineer

extra holidays - extra parental leave
Remote: Full Remote
Contract:
Experience: Senior (5-10 years)
Work from:

Offer summary

Qualifications:

  • 8-10 years of experience in a similar role
  • Bachelor's degree in Computer Science or related field
  • In-depth knowledge of Cloud Ops technologies
  • Strong understanding of programming/scripting languages

Key responsibilities:

  • Define and enforce SRE best practices
  • Lead complex post-incident reviews and improvements

Agiloft SME https://www.agiloft.com
201 - 500 Employees

Job description

As the most trusted global leader in data-first contract lifecycle management (CLM) software, Agiloft helps organizations manage the end-to-end process of proposing, negotiating, signing, and leveraging contracts using our flexible Data-first Agreement Platform (DAP). With contract data as the foundation, customers quickly and collaboratively reach agreement and leverage contract visibility to thrive with competitive advantage. Employing powerful, pragmatic artificial intelligence as a legal force multiplier, and robust integration capabilities as a data liberator, organizations around the world trust Agiloft’s certified implementers to deliver connected, intelligent, and autonomous solutions across the entire contract lifecycle.

Top analysts like Gartner, Forrester, and IDC agree, all showing Agiloft as a leader in the CLM space. Our no-code platform is easily managed and administered by business users, which is why Agiloft is the contract you keep: nearly 100% of new customers are satisfied with their initial implementations, and some 97% of customers renew every year. Ours is a growing, vibrant, successful company at the forefront of a market that is becoming a must-have for all organizations.

We believe that the way to build the strongest, most vibrant place to work is to bring in individuals from all walks of life, and to support them in bringing their authentic selves to their day, every day. Our working philosophy is that “EX = CX”: when employee experience is excellent, so is customer experience. We support multiple Employee Resource Groups (ERGs), and offer a working environment that supports healthy work/life balance, including floating holidays and a quarterly, no-questions-asked wellness day.

Position Overview

As a Staff Site Reliability Engineer (SRE), you will be responsible for developing and implementing highly reliable and scalable systems. You will work closely with different functional teams to create a stable, efficient, and scalable environment, leading complex projects that require collaboration with multiple stakeholders.

Job Responsibilities
  • Define and enforce SRE best practices and standards.
  • Architect and implement highly reliable and scalable systems.
  • Lead complex post-incident reviews and implement systemic improvements.
  • Collaborate with product and engineering teams to set reliability targets.
  • Manage high-impact incidents and coordinate incident response.
  • Contribute to budget planning and resource allocation.
  • Lead efforts to establish disaster recovery strategies.
  • Provide technical leadership and mentorship to the SRE team.
  • Continuously track and improve metrics (for example, DORA) to optimize software delivery and operational performance.
  • Participate in on-call rotation.
  • Other duties as assigned.

Required Qualifications
  • 8-10 years of experience in a similar or related role
  • Bachelor’s degree in Computer Science, Information Technology, or related field (or equivalent experience)
  • In-depth knowledge of Cloud Ops technologies including Amazon Web Services (AWS) and Terraform or other Infrastructure as Code (IaC)
  • Advanced knowledge in Linux operating systems and troubleshooting OS issues
  • Expertise in setting up and managing monitoring tools (such as Prometheus, Grafana, Datadog, Nagios, OpenTelemetry, ELK, or similar tools)
  • In-depth understanding of monitoring and alerting systems and of networking principles (such as load balancing, CDN, and disaster recovery)
  • Strong understanding of:
    • Incident management
    • Capacity planning
    • Disaster recovery
    • Observability practices (in tools such as OpenTelemetry and Jaeger)
  • Advanced experience with or knowledge of security measures and practices (for example, threat modeling, compliance, and secure coding practices)
  • Strong analytical and problem-solving skills
  • Knowledge of Linux systems and common system administration tasks
  • Strong understanding of programming/scripting languages (such as Python), including additional scripting skills in multiple languages to automate SRE operations
  • Excellent communication and teamwork skills
  • A willingness to learn and adapt in a fast-paced, dynamic environment

Preferred Qualifications
  • Familiarity with DevOps practices, Infrastructure as Code (IaC) tools, and Agile methodologies is a plus

    Ensuring a diverse and inclusive workplace is our priority. We are committed to an environment of acceptance where you are free to bring your full self to work. All employment decisions at Agiloft are based on business needs, job requirements, and individual qualifications without regard to race, color, religion or belief, national or social ethnic origin, sex, age, sexual orientation, gender identity and/or expression, parental status, marital status, Veteran status, or any other status protected by the laws or regulations in the locations where we operate. If you have a need that requires accommodation during the recruiting process, please let us know by contacting Director, Talent Acquisition, Brad Toothman at brad.toothman@agiloft.com.

    Applicants from underrepresented groups such as minorities, veterans, or individuals with disabilities are encouraged to apply.

    Applications will be reviewed as submitted. There will be no application deadline for this opportunity.

    Required profile

    Experience

    Level of experience: Senior (5-10 years)
    Spoken language(s): English

    Other Skills

    • Problem Solving
    • Adaptability
    • Communication
    • Analytical Skills
    • Teamwork
