Site Reliability Engineer
3 weeks ago
Staff Site Reliability Engineer (SRE)

Job Responsibilities
As our Staff SRE, you'll be the primary expert responsible for our entire compute ecosystem. Your key responsibilities will include:
Design, implement, and lead large-scale, cross-functional projects to improve the reliability, performance, and efficiency of our core services and infrastructure (10× impact).
Drive the reduction of toil by developing and deploying sophisticated automation tools and frameworks, championing the "everything as code" philosophy.
Serve as a technical escalation point for critical incidents, perform deep-dive root cause analyses (RCAs), and implement robust corrective measures to prevent recurrence.
Define and implement SLOs, SLIs, and Error Budgets for critical services.
Enhance our monitoring, logging, and tracing systems to provide comprehensive visibility into system health.
Set the technical direction and best practices for the entire SRE and engineering organization.
Mentor mid-level and senior engineers on design patterns, operational rigor, and reliability principles.

We're looking for a leader and a deep technical expert with a proven track record of solving the hardest scaling and reliability challenges.

Required Qualifications
8+ years of progressive experience in Site Reliability Engineering, Production Engineering, or a closely related role.
Expert-level proficiency with AWS, including networking, compute, and storage.
Deep expertise in Kubernetes and the cloud-native ecosystem.
Fluency in at least one major scripting/programming language for automation and tooling (e.g., Python, Go, or Java).
Solid experience with monitoring and logging solutions (Datadog).
Proven ability to design and implement robust, highly available distributed systems.
Demonstrated experience with Infrastructure as Code tools like Terraform.
Exceptional communication skills, capable of explaining complex technical issues to both technical and non-technical audiences.

Nice-to-Have
Experience implementing Service Mesh technologies (e.g., Istio, Linkerd).
A strong understanding of security principles and practices in a cloud environment.
Certifications such as CKA (Certified Kubernetes Administrator) or CKAD (Certified Kubernetes Application Developer).

Seniority level: Mid-Senior level
Employment type: Contract
Job function: Information Technology
Industries: Staffing and Recruiting
San Francisco, CA $130,000 - $155,000 (5 days ago)
-
Site Reliability Engineer
3 weeks ago
San Francisco, United States Alchemy Full time
Our Mission: Our mission is to bring web3 to a billion people, by providing builders with the tools they need to build exceptional onchain products. Alchemy is the only complete developer platform that offers the powerful APIs, SDKs,...
-
Site Reliability Engineer
3 weeks ago
San Francisco, United States Workos Full time
About WorkOS 🚀 WorkOS builds tools and services for developers to help them implement authentication, identity, authorization, and overall enterprise readiness. We’re a fully distributed team with employees across North American time zones. We’re well-funded, having raised an $80M Series B. Our fast-growing customer base includes hundreds of rapidly...
-
Engineering Manager, Site Reliability
2 weeks ago
San Francisco, United States Reddit Full time
As an Engineering Manager for Site Reliability, you will be responsible for ensuring the reliability, performance, efficiency, and resilience of your team's systems and services, as well as working to ensure that the experience of your customers, other internal engineering teams, steadily improves. This includes...
-
Site Reliability Engineer
1 week ago
San Francisco, United States Air Apps Full time
About Air Apps: At Air Apps, we believe in thinking bigger and moving faster. We’re a family-founded company on a mission to create the world’s first...
-
Site Reliability Engineer
3 weeks ago
San Francisco, United States Runloop Full time
About Runloop: Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform enables teams to experiment, iterate, and deploy their projects without the friction of environment setup and dependencies. We are a...
-
Site Reliability Engineer
1 week ago
San Francisco, United States SOLANA FOUNDATION Full time
Our Mission: Our mission is to bring web3 to a billion people, by providing builders with the tools they need to build exceptional onchain products. Alchemy is the only complete developer platform that offers the powerful APIs, SDKs, and tools...
-
Site Reliability Engineer
4 weeks ago
San Francisco, United States Runloop AI, Inc Full time
About Runloop: Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform enables teams to experiment, iterate, and deploy their projects without the friction of environment setup and dependencies. We are a...
-
Site Reliability Engineer
4 days ago
San Francisco, United States Cypress HCM Full time
As a Site Reliability Engineer (Contractor), you will be a hands-on contributor, focused on supporting and improving the reliability of our AWS cloud infrastructure. You will apply core SRE principles to automate operational tasks, monitor system health, and participate in incident response. This role is execution-focused, supporting...
-
Site Reliability Engineer
4 weeks ago
San Francisco, United States ConductorOne Full time
ConductorOne is the first AI-native identity security platform that protects every identity: human, non-human, and AI. With powerful automation, platform-level AI, and out-of-the-box connectors, it centralizes access visibility, enforces fine-grained controls, enables just-in-time access, and automates user access reviews across all apps. It's easy to use,...
-
Site Reliability Engineer
1 week ago
San Francisco, United States ConductorOne Full time
ConductorOne is the first AI-native identity security platform that protects every identity: human, non-human, and AI. With powerful automation, platform-level AI, and out-of-the-box connectors, it centralizes access visibility, enforces fine-grained controls, enables just-in-time access, and automates user access reviews across all apps. It’s easy to use,...