Lead Site Reliability Engineer

2 weeks ago


San Diego, California, United States | Platform Science | Full time

About Us

At Platform Science, we are dedicated to connecting all aspects of mobility. Established in 2015, our open IoT platform collaborates with forward-thinking fleets, application developers, vehicle manufacturers, and equipment providers within the transportation sector to deliver groundbreaking solutions for supply chain professionals worldwide.

Our workforce is a vibrant and diverse assembly of individuals who believe in the strength of innovative ideas. We seek individuals with varied experiences and viewpoints to cultivate a company culture that promotes growth through creativity. We emphasize thoughtful actions and empathy, tackling challenges with resilience and ingenuity while fostering transparency, as we are united as one team regardless of our backgrounds or roles.

Role Overview

We are looking for a qualified Senior Site Reliability Engineer to join our team. The position involves addressing operational challenges and supporting development teams that run essential business applications in production. Our primary objective is to ensure reliability across all production services and to enable development teams to measure the reliability of their own services so they can make informed decisions.

The SRE team has the distinct advantage of engaging with all facets of our platform. Our operations are entirely cloud-based, utilizing AWS, Azure, and GCP. Our applications and services are both containerized and serverless. If you are eager to learn and support new technologies across various products—including mobile applications, hardware, websites, messaging queues, serverless pipelines, and more—and work alongside an exceptionally skilled team, this role is tailored for you.

As a Senior SRE, you should have a background in software development or systems engineering, coupled with strong coding abilities. Ideal candidates are keen to gain a comprehensive understanding of our systems, from infrastructure dependencies to customer experiences, and of how to mitigate risks effectively. You should be comfortable both giving and receiving technical guidance, communicate well, and be a proactive self-starter committed to improving our company and technologies.

Key Responsibilities

  • Develop and refine Continuous Integration/Continuous Deployment (CI/CD) pipelines while enhancing release management processes and associated tools.
  • Maintain Helm charts to facilitate application deployment and management.
  • Establish standardized observability solutions to assist development teams in managing their applications efficiently.
  • Champion reliability initiatives, strive to meet uptime objectives, and mentor peers in SRE best practices.
  • Conduct thorough Production Readiness Reviews, collaborating with teams to identify and establish Service Level Indicators and Service Level Objectives (SLIs/SLOs) to ensure high-quality, reliable services.
  • Design and implement software solutions to effectively tackle operational challenges, enhancing system stability and reliability.
  • Provide on-call support, offering expert assistance to development teams for mission-critical applications in production settings.
  • Enhance application and system resilience through chaos engineering practices.

Qualifications

  • Minimum of 5 years of hands-on experience in SRE or Platform Engineering roles.
  • Proven expertise (2+ years) with automation technologies such as Jenkins, ArgoCD, or similar.
  • Experience with Kubernetes (2+ years), Helm, and Docker in production environments.
  • Strong understanding of current software development lifecycle (SDLC) principles and best practices, including CI/CD pipelines and test-driven development.
  • Familiarity with AWS in production, particularly EKS, IAM, autoscaling, networking, and load balancing/request routing.
  • Proficient in programming languages such as Python, Bash, Node.js, and/or Go.
  • Experience with distributed tracing methodologies and observability tools like Prometheus, ELK, or Datadog.
  • Strong emphasis on documentation and promoting knowledge-sharing within the team and organization.
  • Demonstrated success in training and mentoring engineers.
  • Proven ability to optimize performance and manage costs in cloud environments.
  • Solid understanding of SLI/SLO concepts and adherence to SRE best practices.
  • Bachelor's degree in Computer Science or a related field.

Benefits Overview

Platform Science offers a comprehensive benefits package for regular, full-time employees, including:

  • Medical, dental, and vision insurance.
  • Short-term and long-term disability insurance.
  • Accidental death and dismemberment (AD&D) and life insurance.
  • 401(k) retirement plan.
  • Paid vacation, sick leave, and holidays.
  • Six weeks of paid parental leave.

