Principal Site Reliability Engineer

24 hours ago

San Francisco, CA, United States Salesforce Full time

To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts.

Job Category: Software Engineering

About Salesforce:

We’re Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. Leading with our core values, we help companies across every industry blaze new trails and connect with customers in a whole new way. And, we empower you to be a Trailblazer, too — driving your performance and career growth, charting new paths, and improving the state of the world. If you believe in business as the greatest platform for change and in companies doing well and doing good – you’ve come to the right place.

Job Details:

(Lead/Principal/Architect) Software Engineer - Availability Engineering
Our Availability engineering teams are responsible for driving ‘best in class’ availability. You will work with delivery teams deploying customer-facing/supporting software across a multi-substrate engineering platform that collectively ships hundreds of features to production for tens of millions of users across all industries every day. Our users count on our applications and platforms to be highly reliable, lightning fast, supremely secure, and to preserve all of their customizations and integrations every time we ship. You will need deep experience with concurrency, large scale systems, proficiency with solving real-world data management challenges, a strong understanding of how to craft solutions that are highly available, and a proven ability to design, develop, and optimize the core back-end systems.

What you’ll be doing:

As part of a specialist unit focused on availability and resilience, you will embed with delivery teams, acting in a Lead capacity, creating bandwidth and prioritizing a focus on corrective and proactive availability measures.
You will be contributing to designing, developing, debugging, and operating resilient applications and platforms deployed across distributed systems that run across thousands of compute nodes in multiple data centers.
You will champion resiliency best practices: observability tool integration, horizontal/vertical sizing and auto-scaling, release rollback and recovery workflows, and integration tests and validation procedures for applications running on self-hosted infrastructure as well as public cloud platforms such as AWS, GCP, Azure, and Alibaba.
Using and contributing to open-source technology (Spinnaker, ZooKeeper, etc.).
Developing and leveraging Infrastructure-as-Code using Terraform.
Building/integrating with APIs and microservices deployed on containerization frameworks such as Kubernetes, Docker, Mesos, etc.
Resolving complex technical issues and driving innovations that improve system availability, resilience, and performance.
You have experience balancing live runtime management, feature delivery, and retirement of technical debt.
Participating in the team’s on-call rotation to address complex problems in real time and keep services operational and highly available.

Required Skills:

A related technical degree is required (master's preferred).
15+ years of hands-on software development experience.
5+ years in a Tech Lead, Principal or Architect capacity.
Ability to reverse-engineer solutions through independent code and architecture review, and to envision, define, and then contribute to the delivery of availability-improvement refactoring projects.
Mastery of one or more object-oriented delivery languages such as Java, Go, Apex, or Python.
Deep experience working with core web technologies: HTTP, JSON, REST, XML.
Proficiency with databases including Oracle or other relational and/or NoSQL solutions.
Experience owning and operating multiple instances of a critical service.
Experience running critical infrastructure services: monitoring, alerting, logging, tracing, and reporting.
Subject matter expertise in service ownership best practices and SLO/SLI/SLA definition, with a record of driving proactive operational awareness and experience with incident/problem management.
Thorough knowledge of Agile development methodology, with experience in both Test-Driven and Behavior-Driven Development practices.


Principal Site Reliability Engineer

1 day ago

Sunnyvale, CA, United States Microsoft Full time

There has never been a more exciting time to be working in healthcare at Microsoft. Our Health & Life Sciences Solutions organization is an interdisciplinary team of product managers, designers, engineers, and clinicians who are designing, developing and deploying next-generation healthcare solutions powered by the Microsoft Cloud for healthcare...
Site Reliability Engineer

1 week ago

San Francisco, United States Apollo Solutions Full time

Site Reliability Engineer: Apollo Solutions have partnered with a groundbreaking artificial intelligence business who are making major developments in how we use AI/ML for gaming/security. They are working closely with government contracts as well as gaming console companies and are now searching for an SRE to join their growing team. The Site Reliability...
Principal Site Reliability Engineer

2 weeks ago

San Francisco, United States Salesforce Full time

To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts. Job Category: Software Engineering. About Salesforce: We’re Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. Leading with our core values, we help companies across every...
Principal Site Reliability Engineer

4 days ago

San Francisco, United States Salesforce Full time

To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts. Job Category: Software Engineering. About Salesforce: We’re Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. Leading with our core values, we help companies across every...
Site Reliability Engineer II

2 days ago

San Francisco, CA, United States Earnest Current Job Openings Full time

The Site Reliability Engineer II position will report to the Lead Cloud Engineer. As an SRE II Engineer, you will: Set up and maintain comprehensive monitoring, create and refine playbooks, build dashboards, and adopt industry-standard practices to enhance the reliability and resilience of our site and systems. Develop and manage IaC to ensure reliable,...
Principal Software Engineer, Site Reliability Engineering

1 day ago

San Francisco, CA, United States Salesforce, Inc. Full time

Software Engineering PMTS. Remote type: Office Tech-Flexible. Locations: California - San Francisco, Washington - Bellevue. Time type: Full time. Posted on: Posted 3 Days Ago. Job requisition ID: JR266855. To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts. About...
Site Reliability Engineer

2 days ago

Sunnyvale, CA, United States Natcast, Inc. Full time

Natcast (short for The National Center for the Advancement of Semiconductor Technology) is a new, purpose-built, non-profit entity created to operate the National Semiconductor Technology Center (NSTC) consortium, established by the CHIPS Act of the U.S. government. Working at Natcast represents an opportunity to help extend America’s leadership in...
Site Reliability Engineer

3 weeks ago

San Francisco, United States WEX Full time

The WEX Site Reliability Engineering (SRE) team is seeking an entry-level Site Reliability Engineer Level 1 who is passionate about learning and growing in the field of software development and solutions focused on observability, incident response, reliability and performance, operational excellence, and compliance. The team will be part of the Benefits...
Site Reliability Engineer

20 hours ago

San Francisco, United States Bun Full time

Bun is an open-source JavaScript tooling company focused on making programming simpler. We've raised $26 million from top investors in Silicon Valley, are among the top GitHub repositories and have a growing community of 33,000 Discord members. We're hiring an experienced Site Reliability Engineer to scale and maintain the infrastructure that builds and tests...
Site Reliability Engineer

3 weeks ago

San Francisco, United States Ellation, Inc. Full time

Who We Are: We’re a cast of characters working to shine a spotlight on anime. Crunchyroll is an international business focused on creating both online and offline experiences for fans through content (licensed, co-produced, originals, distribution), merchandise, events, gaming, news, and more. Visit our About Us pages for more information about our...
Site Reliability Engineer

1 week ago

San Francisco, United States Unreal Gigs Full time

Are you passionate about building and maintaining resilient systems that ensure high availability and performance? Do you excel at automating processes, troubleshooting complex issues, and creating systems that scale smoothly? If you're ready to take on the challenge of ensuring reliable, efficient, and secure system operations, our client has the perfect...
Site Reliability Engineer

2 days ago

San Francisco, CA, United States Withorb Full time

Mission: Orb is on an ambitious mission to provide every business with the infrastructure to unlock their revenue. Best-in-class businesses find ways to effectively align their monetization to product usage, whether that's through seats, consumption, feature limits, or usage-based tiers. Orb brings that opportunity to every software company. We are...
Site Reliability Engineer

1 month ago

San Francisco, United States New York Technology Partners Full time

Must-haves, in order of preference: typical Java/J2EE experience between 6 and 10 years; application production support (SRE - Site Reliability Engineering) with 3+ years, preferably in the e-commerce domain; hands-on experience in any of the UI frameworks (AngularJS, VueJS, etc.) - 1+ years.
Site Reliability Engineer

1 week ago

San Francisco, United States New York Technology Partners Full time

Must-haves, in order of preference: typical Java/J2EE experience between 6 and 10 years; application production support (SRE - Site Reliability Engineering) with 3+ years, preferably in the e-commerce domain; hands-on experience in any of the UI frameworks (AngularJS, VueJS, etc.) - 1+ years.
Site Reliability Engineer

2 days ago

San Francisco, CA, United States Mistral AI Full time

About Mistral: At Mistral AI, we are a tight-knit, nimble team dedicated to bringing our cutting-edge AI technology to the world. Our mission is to make AI ubiquitous and open. We are creative, low-ego, team-spirited, and have been passionate about AI for years. We hire people who thrive in competitive environments, because they find them more fun to work...
Site Reliability Engineer

1 week ago

San Francisco, California, United States WEX Inc Full time

The WEX Site Reliability Engineering team is looking for a motivated Site Reliability Engineer to join our Benefits Reliability organization. As a member of our team, you will be responsible for ensuring the reliability, performance, and security of our systems. Key Responsibilities: Learning and Development: Participate in training and mentorship programs to...
Associate Site Reliability Engineer/Site Reliability Engineer

2 days ago

Redwood City, CA, United States C3 AI Full time

We are looking for an Associate Site Reliability Engineer / Site Reliability Engineer to join our team at our HQ in Redwood City, CA. Responsibilities: Maximize system uptime and availability, ensuring functional and performance SLAs. Establish end-to-end monitoring and alerting on all critical aspects. Solve complex problems for critical services...
Site Reliability Engineer

4 weeks ago

San Francisco, United States Focal Systems Full time

Location: San Francisco - hybrid (1-2 days per week). Salary: $165-175k + stock. Company Description: Focal Systems is the industry leader in retail AI solutions. We are a Silicon Valley based startup that has more than doubled in size every year since inception. We are a Deep Learning first company. Our mission is to automate and optimize brick and mortar...
