Site Reliability Engineer
2 weeks ago
Lambda is the #1 GPU Cloud for ML/AI teams training, fine-tuning and inferencing AI models, where engineers can easily, securely and affordably build, test and deploy AI products at scale. Lambda’s product portfolio includes on-prem GPU systems, hosted GPUs across public & private clouds and managed inference services, serving government, researchers, startups and enterprises worldwide.

Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

What You’ll Do
* Define Fleet Health metrics and indicators to objectively measure and improve system availability
* Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
* Create runbooks and automated remediations for common failure scenarios
* Build in automation and auditing to ensure compliance and improve efficiency and productivity
* Participate in on-call rotations and provide support for incident response and resolution
* Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc.

You
* 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
* Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
* Strong understanding of Linux-based systems in a distributed environment
* Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
* Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
* Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
* Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
* Excellent problem-solving and troubleshooting skills
* Strong communication and collaboration skills
* Passion for continuous improvement and innovation

Nice to Have
* Experience in the machine learning or computer hardware industry
* Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)
* Experience building and/or operating HPC resources
* Background in chaos engineering or similar reliability testing methodologies
* Understanding of compliance frameworks (SOC 2, ISO 27001, etc.)

Salary Range Information
Based on market data and other factors, the annual salary range for this position is $255,000-$405,000. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
* Founded in 2012, ~350 employees (2024) and growing fast.
* We offer generous cash & equity compensation.
* We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability.
* Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.
* Health, dental, and vision coverage for you and your dependents.
* Wellness and Commuter stipends for select roles.
* 401k Plan with 2% company match (USA employees).
* Flexible Paid Time Off Plan that we all actually use.

A Final Note: You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
-
Site Reliability Engineer
3 weeks ago
San Francisco, United States Alchemy Full time
Our Mission: Our mission is to bring web3 to a billion people, by providing builders with the tools they need to build exceptional onchain products. Alchemy is the only complete developer platform that offers the powerful APIs, SDKs,...
-
Site Reliability Engineer
2 weeks ago
San Francisco, United States Rivago Infotech Inc Full time
Staff Site Reliability Engineer (SRE)
Job Responsibilities: As our Staff SRE, you'll be the primary expert responsible for our entire compute ecosystem. Your key responsibilities will include: Design, implement, and lead large-scale, cross-functional projects to improve the reliability, performance, and efficiency of our core services and infrastructure (10×...
-
Site Reliability Engineer
3 weeks ago
San Francisco, United States WorkOS Full time
About WorkOS 🚀
WorkOS builds tools and services for developers to help them implement authentication, identity, authorization, and overall enterprise readiness. We’re a fully distributed team with employees across North American time zones. We’re well-funded, having raised an $80M Series B. Our fast-growing customer base includes hundreds of rapidly...
-
Engineering Manager, Site Reliability
1 week ago
San Francisco, United States Reddit Full time
As an Engineering Manager for Site Reliability, you will be responsible for ensuring the reliability, performance, efficiency, and resilience of your team's systems and services, as well as working to ensure that the experience of your customers, other internal engineering teams, steadily improves. This includes...
-
Site Reliability Engineer
1 week ago
San Francisco, United States Air Apps Full time
About Air Apps: At Air Apps, we believe in thinking bigger, and moving faster. We’re a family-founded company on a mission to create the world’s first...
-
Site Reliability Engineer
3 weeks ago
San Francisco, United States Runloop Full time
About Runloop: Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform enables teams to experiment, iterate, and deploy their projects without the friction of environment setup and dependencies. We are a...
-
Site Reliability Engineer
1 week ago
San Francisco, United States SOLANA FOUNDATION Full time
Our Mission: Our mission is to bring web3 to a billion people, by providing builders with the tools they need to build exceptional onchain products. Alchemy is the only complete developer platform that offers the powerful APIs, SDKs, and tools...
-
Site Reliability Engineer
4 weeks ago
San Francisco, United States Runloop AI, Inc Full time
About Runloop: Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform enables teams to experiment, iterate, and deploy their projects without the friction of environment setup and dependencies. We are a...
-
Site Reliability Engineer
3 days ago
San Francisco, United States Cypress HCM Full time
As a Site Reliability Engineer (Contractor), you will be a hands-on contributor, focused on supporting and improving the reliability of our AWS cloud infrastructure. You will apply core SRE principles to automate operational tasks, monitor system health, and participate in incident response. This role is execution-focused, supporting...
-
Site Reliability Engineer
4 weeks ago
San Francisco, United States ConductorOne Full time
ConductorOne is the first AI-native identity security platform that protects every identity: human, non-human, and AI. With powerful automation, platform-level AI, and out-of-the-box connectors, it centralizes access visibility, enforces fine-grained controls, enables just-in-time access, and automates user access reviews across all apps. It's easy to use,...