Offer summary

Qualifications:

7+ years of experience in site reliability engineering or a related field, with at least 3 years in a leadership role.

Proven expertise in Kubernetes, microservices, and cloud-native infrastructure.

Strong knowledge of AWS cloud services and infrastructure as code tools like Terraform.

Hands-on experience with observability platforms and a deep understanding of data streaming technologies.

Key responsibilities:

Develop and execute the SRE strategy aligned with business objectives.

Build and lead a team of SREs, fostering a culture of reliability and continuous improvement.

Oversee the design and maintenance of scalable infrastructure and ensure high availability of services.

Champion best-in-class observability and manage service level objectives to enhance system transparency.

Job description

ARE YOU A CURRENT US FOODS EMPLOYEE? PLEASE APPLY DIRECTLY THROUGH OUR INTERNAL WORKDAY CAREER SITE

Join Our Community of Food People!

The Director, Site Reliability Engineering (SRE), will lead a high-performing team responsible for the resilience, scalability, and performance of our digital platforms. This leader will bring deep technical expertise in modern architecture patterns, infrastructure as code (IaC), observability best practices, and the disciplined processes required for a world-class SRE function. The role is instrumental in ensuring reliable, efficient, and secure digital experiences for our customers.

Flexible Work Policy: The Director, Site Reliability Engineering position is fully remote and may be based anywhere in the United States except Hawaii or United States territories. This position may require up to 20% travel.

RESPONSIBILITIES

SRE Strategy & Leadership

Develop and execute the SRE strategy aligned with digital architecture goals and business objectives.
Build and lead a team of SREs, fostering a culture of reliability, accountability, and continuous improvement.
Drive operational excellence through disciplined incident management, blameless post-mortems, and service reviews.
Partner closely with application, platform, and security engineering teams to enable resilient system design.

Technology & Engineering Execution

Oversee the design and maintenance of scalable infrastructure, leveraging Kubernetes, microservices, and infrastructure as code.
Ensure high availability and performance of Single Page Applications (SPAs), APIs, and backend services.
Support efforts in CI/CD automation, infrastructure provisioning, and capacity planning.
Drive proactive performance tuning and failure scenario planning using real-world chaos engineering practices.

Observability & Incident Management

Champion best-in-class observability using tools such as New Relic for monitoring, alerting, and root cause analysis.
Define and manage service level objectives (SLOs), service level indicators (SLIs), and error budgets.
Continuously evolve telemetry and logging strategies to increase system transparency and reduce mean time to resolution (MTTR).

Collaboration & Stakeholder Engagement

Partner cross-functionally with Product, Engineering, Data, and Security teams to align SRE practices with business needs.
Communicate reliability trade-offs and performance insights to technical and non-technical stakeholders.
Collaborate with vendors and internal teams to maintain tooling and operational readiness.

SUPERVISION

Supervision of 12-14 site reliability engineers and support analysts.
Supervision of third-party consultants.

RELATIONSHIPS

Internal: Regular interactions with business and technical leaders across the organization to communicate a vision for what is possible and align to business objectives.
External: Regular interactions with technology partners and contract vendors will be required as a key part of this role.

WORK ENVIRONMENT

Remote: This role is fully remote, and the associate is expected to perform assigned responsibilities from a home-based environment.

MINIMUM QUALIFICATIONS

7+ years of experience in site reliability engineering, infrastructure engineering, or a related field, with at least 3 years in a leadership role.
Excellent leadership, communication, and incident management skills to drive a high-performance engineering culture and cross-functional collaboration.
Proven expertise in Kubernetes, microservices, and SPA-based architecture, with a strong foundation in cloud-native infrastructure.
Strong knowledge of AWS cloud services and infrastructure as code tools (e.g., Terraform, CloudFormation).
Hands-on experience with observability platforms (e.g., New Relic, Datadog, Dynatrace), distributed tracing, and real-time monitoring.
Deep understanding of data streaming technologies (e.g., Kafka), NoSQL databases (e.g., MongoDB), and event-driven architecture.

EDUCATION

BS/BA in computer science or equivalent related work experience.

PREFERRED QUALIFICATIONS

Experience in the foodservice distribution, wholesale, or supply chain industry with a deep understanding of product data challenges.
Familiarity with chaos engineering, automated runbooks, and site reliability maturity models.
Certifications in Kubernetes, AWS, GCP, or observability tools (e.g., New Relic, Datadog, Dynatrace).
Experience leveraging Generative AI for automation and enhanced observability workflows.

This role will also receive an annual incentive plan bonus.

Benefits for this role may include health insurance, pre-tax spending accounts, retirement benefits, paid time off, short-term and long-term disability, employee stock purchase plan, and life insurance.

To review available benefits, please click here: https://www.usfoods.com/careers/benefits.html

Compensation depends on relevant experience and/or education, specific skills, function, geographic location, and other factors as applicable by law (for example: state minimum wage thresholds). The expected base rate for this role is between $110,000 and $180,000.

***EOE Race/Color/Religion/Sex/Sexual Orientation/Gender Identity/National Origin/Protected Veteran/Disability Status***
