Manager, Site Reliability Engineering

4 weeks ago

Palo Alto, United States Navan Group Full time

At Navan, “It’s all about the user. All of them.” We’re passionate about providing a seamless one-stop experience for business travelers, no matter how they travel, where they stay, or where they’re going. We are committed to building the most reliable, scalable, and efficient infrastructure to ensure our services are always available when travelers need them most. With our rapid growth, we face exciting challenges ahead and are seeking a Site Reliability Engineering (SRE) Manager to join our team at our headquarters in Palo Alto, California.

As an SRE Manager, you will lead a team of senior SREs, driving innovation in infrastructure design, automation, and tooling. You will spearhead the development of infrastructure services that power Navan’s systems, serving thousands of travelers daily. Your role will include partnering with development, release and productivity, and security teams to identify user needs and deliver cutting-edge solutions.

You will oversee a diverse range of systems and technologies with the goal of building autonomous, fault-tolerant, and monitored infrastructure. This infrastructure will be optimized for simplicity, performance, and uptime. Collaborating with backend and frontend engineering teams, you will ensure that our systems are scalable, reliable, and efficient. Additionally, you will lead efforts to design and implement infrastructure capable of supporting our exponential growth while maintaining the highest levels of service reliability and operational excellence.

What You'll Do

Lead & Mentor the SRE Team: Guide and develop a high-performing team of SREs, fostering a culture of collaboration, reliability, and continuous improvement.
Drive Infrastructure Reliability & Automation: Collaborate with Engineering and Product teams to design and implement scalable, fault-tolerant systems. Leverage IaC tools (e.g., Terraform, CloudFormation) and microservices architectures to automate and improve infrastructure.
Incident Management: Improve incident response processes, reduce MTTR, and proactively mitigate risks. Apply resiliency patterns to ensure systems are fault-tolerant and highly available.
Define & Measure SLOs: Develop service-level objectives (SLOs) and KPIs to track and improve system reliability, using tools like New Relic or Datadog for observability.
24x7 Production Support: Ensure system availability in a 24x7 environment, applying expertise in AWS (e.g., ECS, Lambda, DynamoDB) and database management for optimal performance.
Optimize CI/CD Pipelines: Automate and streamline deployment workflows using tools like Jenkins or GitHub Actions to ensure faster and more reliable deployments.
Resource Management: Manage team resources, including capacity planning, hiring, and upskilling, to meet evolving business needs.

What We're Looking For

8+ years in Site Reliability Engineering, DevOps, or Infrastructure roles, with at least 3 years in a leadership position.
Proven ability to lead and mentor teams, fostering a culture of collaboration and reliability.
Hands-on experience with AWS cloud technologies, Infrastructure as Code (Terraform/CloudFormation), microservices architectures, deployment automation (Jenkins/GitHub Actions), and observability tools (New Relic/Datadog).
Strong background in designing scalable, fault-tolerant systems, improving incident response, and driving operational improvements.
Excellent interpersonal and communication skills, with the ability to work effectively across cross-functional teams.

Workplace Policy

Navan believes in the value of in-person connections, whether that is sitting down to have lunch with one another, taking a walking 1:1, or collaborating in a room together. The connections forged through face-to-face interactions improve company culture and drive business results. Navan invests in welcoming global office spaces (in the US, Germany, France, Spain, and the UK, among others) and offers perks such as lunches and happy hours to create a strong team environment that helps you do your best work. We operate on a hybrid working model, which we define as three days a week in-office. Please expect this policy for all roles that are tied to an office.

Navan is an equal opportunity employer. We make all employment decisions based solely on merit. We provide equal employment opportunity to all applicants and employees without discrimination on the basis of race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. We prohibit any such discrimination or harassment. This policy applies to all terms and conditions of employment, including hiring.

Accommodations

Navan complies with the Americans with Disabilities Act (ADA), as amended by the ADA Amendments Act, and all applicable state or local law. Navan will reasonably accommodate qualified individuals with a disability in connection with applications for employment as required by law.


Site Reliability Engineering Manager

2 weeks ago

Palo Alto, California, United States Plume Full time

About the Job: The Technical Manager will lead a team of Site Reliability Engineers, providing technical guidance and oversight. Key responsibilities include: Supervise a team of Site Reliability Engineers who provide first-line support to Customer Clouds. Attend and conduct customer Meetings for Project and Roadmap specification. Manage growth and performance of...
Manager, Site Reliability Engineering

3 days ago

Palo Alto, United States Plume Design, Inc. Full time

We’re looking for a seasoned Technical Manager, experienced with Customer Facing environments, to Captain our Site Reliability Engineering Team. This team is focused on deployments, fixes, and sustainability. The right candidate needs to have strong technical knowledge in key areas while focusing on customer satisfaction. What You’ll Do: Supervise a...
Site Reliability Engineer

3 days ago

Palo Alto, United States JPMorgan Chase Full time

DESCRIPTION: Duties: Design, build and operate large-scale production systems. Debug complex problems across the whole stack. Develop tools for application engineering teams based on operations requirements for microservices. Improve alerting and monitoring for the existing services. Assist with onboarding and mentoring new engineers. Collaborate with the...
Technical Site Reliability Engineering Leader

4 weeks ago

Palo Alto, California, United States Plume Full time

About the Company: Plume is a leader in the smart home and small business market, delivering services to over 50 million locations globally. Our software-defined network platform allows CSPs to decouple their service offerings from hardware and rapidly curate and deliver new services over a multi-vendor, open-platform architecture. We're looking for a seasoned...
Site Reliability Engineer

3 weeks ago

Palo Alto, California, United States Tesla Full time

Role Description: This is a challenging opportunity to work with cutting-edge technology and contribute to the development of automation tools. As a Site Reliability Engineer, you will drive root cause analysis of system failures, manage containerization technology, and maintain site performance using various tools. Expected Compensation: The estimated annual...
Site Reliability Infrastructure Engineer

2 weeks ago

Palo Alto, California, United States Assured Full time

About Assured: At Assured, we modernize insurance by providing software solutions to large insurers. We empower them to win in a technology-driven world with self-service claim filing software and backend fraud detection. Job Overview: We are looking for a Site Reliability Engineer to join our team. The ideal candidate will have experience working in a...
Site Reliability Engineering Team Lead

1 month ago

Palo Alto, California, United States Plume Design, Inc. Full time

We're looking for a seasoned Technical Manager with extensive experience in Customer Facing environments to lead our Site Reliability Engineering Team. This team is focused on deployments, fixes, and sustainability. The ideal candidate will have strong technical knowledge in key areas while focusing on customer satisfaction. Key Responsibilities: Supervise a...
Site Reliability Engineer

4 days ago

Palo Alto, United States Criteo Full time

At Criteo we face some of the most challenging, but interesting problems in the IT industry. We work at a scale of speed, performance and complexity that few others in the industry can compete with. Our data is not big, it’s absolutely HUGE. We have about 40 petabytes in our Hadoop storage (more than 30 TB extra per day), we take less than 10ms to respond...
Site Reliability Engineer

3 days ago

Palo Alto, United States Criteo Full time

At Criteo we face some of the most challenging, but interesting problems in the IT industry. We work at a scale of speed, performance and complexity that few others in the industry can compete with. Our data is not big, it’s absolutely HUGE. We have about 40 petabytes in our Hadoop storage (more than 30 TB extra per day), we take less than 10ms to respond...
Sr. Site Reliability Engineer, Dojo

22 hours ago

Palo Alto, United States Tesla, Inc. Full time

We are seeking an experienced Site Reliability Engineer (SRE) to join our team responsible for ensuring the reliability and performance of our Dojo cluster infrastructure. The successful candidate will be responsible for providing exceptional customer response and support, managing third-party systems, and collaborating with various teams to ensure seamless...
Reliability Engineer for Distributed Systems

3 weeks ago

Palo Alto, California, United States Tesla Full time

Company Overview: Tesla is a leading electric vehicle manufacturer accelerating the world's transition to sustainable energy. Our mission-critical systems enable our engineers to design and develop innovative solutions. Job Summary: We are seeking a highly skilled Site Reliability Engineer to join our Design Technology Operations team. This position will be...
Reliability Engineering Team Lead

15 hours ago

Palo Alto, California, United States Navan Group Full time

At Navan, our vision is centered around providing a seamless user experience. We are passionate about delivering a one-stop-shop for business travelers, catering to their diverse needs and preferences. We are committed to building robust, scalable, and efficient infrastructure that ensures our services are always available when needed most. As we continue to...
Principal Site Reliability Engineer with Luma AI

4 days ago

Palo Alto, United States jobs.lever.co - ATS Full time

The SRE role at Luma AI sits with the Infrastructure and Research teams and is responsible for our GPU clusters. Luma runs on '000s of H100 GPUs across multiple providers and clusters for Training, Data Processing and Inference. We need a highly skilled SRE to ensure those clusters are healthy and to build the monitoring and management tools we need to make...
Site Reliability Engineer

3 weeks ago

Palo Alto, United States Ario Full time

Our Mission at Ario: You generate enormous amounts of personal data when you use the internet. This data is extremely powerful and could make your life easier, better, more magical. So why aren't you using it? At Ario, we've developed a product that effortlessly enables you to consolidate your digital world - from your Twitter likes to your Kindle highlights -...
Senior Site Reliability Engineer Seattle

4 days ago

Palo Alto, United States MongoDB Full time

MongoDB’s mission is to empower innovators to create, transform, and disrupt industries by unleashing the power of software and data. We enable organizations of all sizes to easily build, scale, and run modern applications by helping them modernize legacy workloads, embrace innovation, and unleash AI. Our industry-leading developer data platform, MongoDB...
Vehicle Technology Reliability Expert

3 weeks ago

Palo Alto, California, United States Tesla Full time

About the Job: We are looking for an experienced Site Reliability Engineer to join our team. Your responsibilities will include building release processes, managing Kubernetes infrastructure, and maintaining site performance. You will also participate in on-call rotations and facilitate production and security incidents. Required Skills: To succeed in this role,...
Hardware Reliability Engineer

4 days ago

Palo Alto, United States Wing Inflatables, Inc. Full time

About Wing: Wing offers drone delivery as a safe, fast, and sustainable solution for last mile logistics. Consumer appetites for on-demand services are increasing, but current delivery methods are inefficient, costly, and contribute to road accidents and air pollution. Wing’s fleet of highly automated delivery drones can transport small packages directly...
Reliability Engineering Professional

2 weeks ago

Palo Alto, California, United States Tesla Full time

About the Role: Tesla is looking for a highly motivated Reliability Engineering Professional to join our team. As a key member of our engineering group, you will play a crucial role in ensuring the reliability of our innovative products. This position offers an exciting opportunity to contribute to the development of cutting-edge technology and shape the...
Sr. Mechanical Reliability Engineer, Megapack

4 hours ago

Palo Alto, United States Tesla Full time

As a Sr. Mechanical Reliability Engineer focusing on Tesla Megapack, you will play a key role in designing reliability into Tesla's industrial energy storage systems ensuring the products meet the highest standards of reliability. This role follows the reliability lifecycle of the product from concept to design, validation testing/analysis, manufacturing,...
Reliability Solutions Engineer

2 weeks ago

Palo Alto, California, United States Luma AI Full time

Job Overview: Luma AI is seeking a highly skilled Reliability Solutions Engineer to join our team. As a key member of our Infrastructure and Research teams, you will be responsible for ensuring the health and reliability of our GPU clusters. We are looking for someone with a strong background in cloud infrastructure, containerization, and...
