Principal Site Reliability Engineer

3 days ago


San Francisco, United States Salesforce Full time

To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts.

Job Category: Software Engineering

About Salesforce:

We’re Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. Leading with our core values, we help companies across every industry blaze new trails and connect with customers in a whole new way. And, we empower you to be a Trailblazer, too — driving your performance and career growth, charting new paths, and improving the state of the world. If you believe in business as the greatest platform for change and in companies doing well and doing good – you’ve come to the right place.

Job Details:

(Lead/Principal/Architect) Software Engineer - Availability Engineering
Our Availability engineering teams are responsible for driving ‘best in class’ availability. You will work with delivery teams deploying customer-facing/supporting software across a multi-substrate engineering platform that collectively ships hundreds of features to production for tens of millions of users across all industries every day. Our users count on our applications and platforms to be highly reliable, lightning fast, supremely secure, and to preserve all of their customizations and integrations every time we ship. You will need deep experience with concurrency, large scale systems, proficiency with solving real-world data management challenges, a strong understanding of how to craft solutions that are highly available, and a proven ability to design, develop, and optimize the core back-end systems.

What you’ll be doing:

  • As part of a specialist unit focused on availability and resilience, you will embed with delivery teams, acting in a Lead capacity, creating bandwidth and prioritizing a focus on corrective and proactive availability measures.
  • You will be contributing to designing, developing, debugging, and operating resilient applications and platforms deployed across distributed systems that run across thousands of compute nodes in multiple data centers.
  • You will champion resiliency best practices: observability tool integration, horizontal/vertical sizing and auto-scaling, release rollback and recovery workflows, and integration tests and validation procedures for applications running on self-hosted infrastructure as well as public cloud platforms such as AWS, GCP, Azure, and Alibaba.
  • Using and contributing to open source technology (Spinnaker, Zookeeper, etc.).
  • Developing and leveraging Infrastructure-as-Code using Terraform.
  • Building/integrating with APIs and microservices deployed on containerization frameworks such as Kubernetes, Docker, Mesos, etc.
  • Resolving complex technical issues and driving innovations that improve system availability, resilience, and performance.
  • You have experience balancing live runtime management, feature delivery, and retirement of technical debt.
  • Participate in the team’s on-call rotation to address complex problems in real-time and keep services operational and highly available.

Required Skills:

  • A related technical degree is required (master's preferred).
  • 15+ years of hands-on software development experience.
  • 5+ years in a Tech Lead, Principal or Architect capacity.
  • Ability to reverse engineer solutions via independent code and architecture review, envision, define and then contribute to delivery of availability improvement refactoring projects.
  • Mastery of one or more object-oriented delivery languages such as Java, Golang, APEX, Python.
  • Deep experience working with core web technologies: HTTP, JSON, REST, XML.
  • Proficiency with databases including Oracle or other relational and/or NoSQL solutions.
  • Experience owning and operating multiple instances of a critical service.
  • Experience running critical infrastructure services: monitoring, alerting, logging, tracing, and reporting.
  • Subject matter expertise on Service ownership best practices, SLO/I/A definition, driving proactive operational awareness and experience with Incident/Problem management.
  • Thorough knowledge of Agile development methodology, with experience in both Test-Driven and Behavior-Driven Development practices.

  • San Francisco, United States Apollo Solutions Full time

    Site Reliability Engineer: Apollo Solutions have partnered with a groundbreaking artificial intelligence business who are making major developments in how we use AI/ML for gaming/security. They are working closely with government contracts as well as gaming console companies and are now searching for an SRE to join their growing team. The Site Reliability...


  • San Francisco, United States WEX Full time

    The WEX Site Reliability Engineering (SRE) team is seeking an entry-level Site Reliability Engineer Level 1 who is passionate about learning and growing in the field of software development and solutions focused on observability, incident response, reliability and performance, operational excellence, and compliance. The team will be part of the Benefits...


  • San Francisco, California, United States Outdefine Full time

    About the Job: We are seeking a highly skilled Site Reliability Engineer to join our team at Outdefine. As a key member of our engineering team, you will be responsible for ensuring the reliability, scalability, and performance of our ecommerce platform. Key Responsibilities: Design and implement scalable and highly available cloud infrastructure using Kubernetes...


  • San Francisco, California, United States Roman Health Pharmacy LLC Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Xero. As a key member of our Reliability Enablement team, you will play a critical role in ensuring the reliability and performance of our systems. Key Responsibilities: Investigate operational surprises and support teams in post-incident activities. Conduct in-depth...


  • San Francisco, California, United States Swish Analytics Full time

    Swish Analytics is a sports analytics and betting startup that's revolutionizing the industry with cutting-edge predictive data products. We're on a mission to make oddsmaking a challenge rooted in engineering, mathematics, and sports betting expertise, not intuition. We're looking for a team-oriented...


  • San Francisco, United States Ellation, Inc. Full time

    Who We Are: We’re a cast of characters working to shine a spotlight on anime. Crunchyroll is an international business focused on creating both online and offline experiences for fans through content (licensed, co-produced, originals, distribution), merchandise, events, gaming, news, and more. Visit our About Us pages for more information about our...


  • San Francisco, United States Unreal Gigs Full time

    Are you passionate about building and maintaining resilient systems that ensure high availability and performance? Do you excel at automating processes, troubleshooting complex issues, and creating systems that scale smoothly? If you're ready to take on the challenge of ensuring reliable, efficient, and secure system operations, our client has the perfect...


  • San Francisco, California, United States WEX Full time

    Job Summary: The WEX Site Reliability Engineering team is seeking a highly motivated and quick-learning individual to join our team as a Site Reliability Engineer Level 1. As a key member of our team, you will be responsible for ensuring the reliability, performance, and security of our systems. Key Responsibilities: Actively participate in training and...


  • San Francisco, United States New York Technology Partners Full time

    Must-haves, in order of preference: Typical Java/J2EE experience between 6 and 10 years. Application Production Support (SRE - Site Reliability Engineering) with 3+ years, preferably in the e-commerce domain. Hands-on experience in any of the UI frameworks (AngularJS, VueJS, etc.), 1+ years.


  • San Francisco, California, United States WEX Inc Full time

    The WEX Site Reliability Engineering team is looking for a motivated Site Reliability Engineer to join our Benefits Reliability organization. As a member of our team, you will be responsible for ensuring the reliability, performance, and security of our systems. Key Responsibilities: Learning and Development: Participate in training and mentorship programs to...


  • San Francisco, United States Focal Systems Full time

    Location: San Francisco - hybrid (1-2 days per week). Salary: $165-175k + stock. Company Description: Focal Systems is the industry leader in retail AI solutions. We are a Silicon Valley-based startup that has more than doubled in size every year since inception. We are a Deep Learning first company. Our mission is to automate and optimize brick and mortar...


  • San Francisco, California, United States Arbitrum Inc Full time

    Reliability Engineer: At Arbitrum Inc, we're on a mission to bring blockchain to a billion people. Our developer platform is designed to make building on the blockchain easy, and we're looking for a skilled Reliability Engineer to join our Infrastructure team. As a Reliability Engineer, you'll collaborate with our engineering team to design, deploy, and...


  • San Francisco, United States Perplexity AI Full time

    Perplexity is seeking a Site Reliability Engineer (SRE) to join our small team in revolutionizing the way people search and interact with the internet. You will be responsible for leading the design, implementation, and scaling of the infrastructure and systems that support our web and mobile products. The ideal candidate should have experience in designing...


  • San Francisco, California, United States Tampa Gardens Senior Living Full time

    About the Role: We are seeking a highly skilled Senior Site Reliability Engineer to join our Cloud Infrastructure Team. As a key member of our team, you will be responsible for deploying, managing, optimizing, and upgrading the systems that run Sight Machine software. You will work closely with our Development Engineering team to ensure the stability,...


  • San Francisco, United States Focal Systems Full time

    Location: San Francisco - hybrid (1-2 days per week). Salary: $170-190k + stock. Company Description: Focal Systems is the industry leader in retail AI solutions. We are a Silicon Valley-based startup that has more than doubled in size every year since inception. We are a Deep Learning first company. Our mission is to automate and optimize brick and mortar retail...


  • San Jose, United States EVONA Full time

    Site Reliability Engineer (SRE). Location: San Francisco Bay Area. Role Overview: We are seeking a highly skilled Site Reliability Engineer (SRE) to join a dynamic team at a rapidly growing technology company. As an SRE, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical systems, while implementing automation...


  • San Francisco, United States WEX, Inc. Full time

    About the Role: The WEX Site Reliability Engineering (SRE) team is seeking a Senior Staff SRE who is passionate about developing software and solutions focused on observability, incident response, reliability and performance, operational excellence, and compliance. The team will be part of the Benefits Reliability organization which supports our internal...