Senior Site Reliability Engineer San Francisco

3 weeks ago

Palo Alto CA United States MongoDB Full time

MongoDB’s mission is to empower innovators to create, transform, and disrupt industries by unleashing the power of software and data. We enable organizations of all sizes to easily build, scale, and run modern applications by helping them modernize legacy workloads, embrace innovation, and unleash AI. Our industry-leading developer data platform, MongoDB Atlas, is the only globally distributed, multi-cloud database and is available in more than 115 regions across AWS, Google Cloud, and Microsoft Azure. Atlas allows customers to build anywhere—on the edge, on premises, or across cloud providers.

The MongoDB Cloud Team is a diverse collection of individuals working together to provide MongoDB in the cloud at global scale. The Cloud Team is responsible for several services, including MongoDB Atlas, our database-as-a-service offering and fastest-growing product; MongoDB Realm, our serverless platform offering that allows developers to build apps on MongoDB without managing any infrastructure; and our newest offering, Atlas Data Lake.

The Cloud Site Reliability Engineering Team designs and builds the global infrastructure on which we deploy our services. As our customers grow and globalize, our services must satisfy demands for low-latency requests around the globe, and comply with various data sovereignty requirements. The SRE Team’s mission is to build this increasingly complex infrastructure, while continually lowering the operational burden associated with it, and increasing our internal visibility into the health of the system. We are strong believers in infrastructure-as-code and self-healing systems. The SRE Team is fully integrated with all the other Cloud teams, and the teams work closely together with a soft and traversable boundary between their areas of responsibility.

Responsibilities

Design and build the infrastructure for a global cloud service that comprises hundreds of thousands of MongoDB clusters, processes a billion metrics per day, and replicates tens of billions of database writes to our backup service.
Design, implement, and troubleshoot the automation and monitoring of services that seamlessly span the globe, including several cloud providers.
Become an expert in infrastructure performance, helping us optimize from the application level all the way through the firmware.
Build for resilience. Our goal is that nobody’s pager goes off, ever. Are we there yet? No. Are we really close? Very. While we work on that, you will participate in a weekly on-call rotation.
Improve our infrastructure capabilities, optimizing for cost, simplicity, and maintainability.

Requirements

Experience running a mission-critical service at scale.
An understanding of information security issues.
Prior experience running critical production systems in a Linux environment.
Firm grasp of at least one modern programming language, beyond basic scripting.
Solid understanding of web and network protocols and standards (HTTP, TLS, DNS, etc).
Bachelor’s degree in Computer Science or equivalent experience.
Experience writing automation tools and an eagerness to "automate all the things".

Nice to haves

Experience building large applications from scratch, complete with CI/CD infrastructure.
Experience in networking, security, hardware or OS performance tuning.
Experience with at least one of the major cloud providers (Amazon Web Services, Google Compute, Microsoft Azure).
Experience managing Kubernetes clusters or some other container orchestration infrastructure.
Experience with observability of large-scale distributed systems.

What's in it for you

Generous compensation package (top-range salary: we pay in the 95th percentile, and our package includes equity and generous benefits).
Opportunities to learn on the job (time to upskill in new technologies).
High level of independence in your day-to-day work.

To drive the personal growth and business impact of our employees, we’re committed to developing a supportive and enriching culture for everyone. From employee affinity groups, to fertility assistance and a generous parental leave policy, we value our employees’ wellbeing and want to support them along every step of their professional and personal journeys.

MongoDB is committed to providing any necessary accommodations for individuals with disabilities within our application and interview process. To request an accommodation due to a disability, please inform your recruiter.

MongoDB, Inc. provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type and makes all hiring decisions without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.


