Offer summary

Qualifications:

5+ years in software or site reliability engineering

Experience with high-availability applications

Proficient in Bash and Python scripting

Hands-on experience with Kubernetes and Terraform

Key responsibilities:

Design automated cloud environments using infrastructure-as-code

Manage incidents and analyze root causes

Job description

About Opaque Systems, Inc.

What We Do

Opaque is the confidential AI platform unlocking sensitive data to securely accelerate AI into production. Created by world-renowned researchers at the Berkeley RISELab, Opaque’s user-friendly platform empowers organizations to run cloud-scale, general purpose AI workloads on encrypted data. Opaque supports popular languages and frameworks for AI, including Python and Spark, and enables governed data sharing with cryptographic verification of privacy and sovereignty. Opaque customers deploy high-performance AI faster and eliminate the tradeoff between innovation and security.

Who We Are

At Opaque, we cultivate an effective work culture grounded in kindness, customer-centricity, and continuous improvement. By fostering innovation, inclusivity, and excellence, we attract top talent and set industry standards, leading to widespread adoption and trust in AI technologies that keep data private and sovereign.

Job Overview

Join Opaque as a Software Engineer - Infrastructure, where you'll harness your expertise in cloud infrastructure, automation, and modern DevOps practices to build, optimize, and secure our Confidential AI platform. You'll design reliable, high-availability systems using tools like Kubernetes, Terraform, and GitHub Actions while enabling seamless CI/CD workflows. From improving system performance and incident management to ensuring compliance (SOC, HIPAA) and deploying proactive cybersecurity measures, you'll play a critical role in balancing innovation, reliability, and customer trust.

Key Responsibilities

Design and implement automated build, test, and other cloud environments using infrastructure-as-code
Partner with development teams to improve services through rigorous testing and release procedures
Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding
Participate in system design consulting, platform management, and capacity planning
Balance feature development speed and reliability with well-defined service-level objectives
Manage and troubleshoot incidents and analyze root causes
Identify and deploy cybersecurity measures for continuous vulnerability assessment and risk management
Support SOC, HIPAA, and other compliance efforts, and assist with customer infosec questionnaires

Qualifications

5+ years of experience in a software engineering or site reliability engineering role
Excellent communication and problem-solving skills across languages
Experience operating 24x7 high-availability, distributed software applications, including performance tuning and optimizing fleet utilization
Experience building infrastructure and tooling from the ground up
Experience scripting operating system tasks in languages such as Bash and Python
Knowledge of hosting multi-tenant solutions with hybrid deployment models
Strong knowledge of modern web standards and protocols (HTTP, TLS, OAuth2, CORS), network fundamentals (DNS, DHCP, TCP/IP, routing, load balancing, load shedding), and experience with monitoring frameworks (such as CloudWatch, Datadog, Grafana, Elastic or similar)
Hands-on production experience working with:
- Kubernetes, Terraform, GitHub Actions
- At least one major cloud provider (Azure, AWS, GCP)
- Managing critical production workloads

The Pay Range For This Role Is

120,000 - 200,000 USD per year (Any)

Required profile