Sr. Site Reliability Engineer

3 weeks ago


San Francisco, United States hims & hers Full time

About the Role: We are seeking a Site Reliability Engineer to help build a reliable web experience for our users. We believe that moving fast is our competitive advantage and that it enables us to better serve our users. We also know that the faster we move, the more likely we are to break things.

You Will:

Design and implement SRE practices that ensure the availability, scalability, and observability of production systems, with a strong focus on excellent customer experience

Actively seek and identify opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.

Use automation extensively to design, configure, manage, and monitor systems in support of our product development teams

Manage infrastructure through automation (Infrastructure as Code)

Manage incidents and emergency response, track outages, ensure data integrity, and engineer releases to promote safe, efficient, and rapid deployments

Handle emergency response, either while on call or by reacting to symptoms surfaced by monitoring, escalating when needed

Improve the codebase by resolving logic issues, deprecating unused code, etc.

Implement monitoring, logging, alerting, and SLO reporting

Identify Service Level Indicators (SLIs) that will align the team to meet the availability and performance objectives.

Run blameless RCAs on incidents and outages, aggressively looking for answers that will prevent recurrence.

You Have:

8+ years as a software engineer, shipping production code.

5+ years of experience as a Site Reliability Engineer.

Experience with service-oriented architectures and microservices at scale

Strong proficiency with RDBMS databases (PostgreSQL, MySQL, SQL Server, etc.)

Strong proficiency in SQL scripting

Proficiency developing in one or more languages such as Java, Kotlin, Python, and/or others

Ability to use containers and orchestration frameworks (Kubernetes, Docker, container registries, etc.)

Proficiency in Git or other VCS

Experience configuring, customizing, and extending monitoring tools (Datadog, Prometheus, New Relic, etc.)

Excellent debugging and troubleshooting skills

Strong technical competency, with a data-driven analytical approach towards solving complex challenges

A systematic problem-solving approach, coupled with strong and effective communication skills and a sense of drive

Nice-to-have: Experience with Terraform or other IaC tools such as Chef, Puppet, or Ansible

Our Benefits (there are more but here are some highlights):

Competitive salary & equity compensation for full-time roles

Unlimited PTO, company holidays, and quarterly mental health days

Comprehensive health benefits including medical, dental & vision, and parental leave

Employee Stock Purchase Program (ESPP)

Employee discounts on hims & hers & Apostrophe online products

401k benefits with employer matching contribution

Offsite team retreats




  • San Francisco, United States Apollo Solutions Full time

    Principal Site Reliability Engineer Apollo Solutions have partnered with a groundbreaking Fintech start-up backed by top tier venture capital. They are looking to significantly disrupt how we view, store and invest our personal finance and have already made significant waves in the industry. The Principal Site Reliability Engineer will be working closely...


  • San Francisco, United States Patreon Full time

    Patreon is the best place for creators to build exclusive content and community for their fans. We enable creators (podcasters, writers, musicians, illustrators, etc) to connect with their fans directly and make money from their creative work. Creators can sell one-off items from their own shops or offer recurring monthly memberships with exclusive access to...


  • San Francisco, United States Pelago Full time

    Role Overview: At Pelago, we run a serverless architecture on AWS, with infrastructure managed using Terraform. Our system has been built to deliver our virtual clinic for Substance Use Management, and we are looking for a talented Site Reliability Engineer to join the engineering team supporting Pelago. As a HIPAA compliant, HITRUST certified organization it...


  • San Francisco, United States Instabase Full time

    At Instabase, we're passionate about democratizing access to cutting-edge AI innovation to enable any organization to solve previously unsolvable unstructured data problems in their industry. With customers representing some of the largest and most complex organizations in the world, and investors like Greylock, Andreessen Horowitz, and Index Ventures, our...


  • San Francisco, United States Talkdesk Full time

    At Talkdesk, we are courageous innovators focused on helping organizations around the world create better customer experiences. Our AI-powered cloud contact center solutions optimize our customers’ most critical customer service processes. We are recognized as a Contact Center as a Service (CCaaS) leader by influential research organizations including...


  • San Francisco, United States Resource Informatics Group Full time

    Job Title: Site Reliability Engineer. Work Location: San Francisco, CA (Hybrid after showing successful engagement). Duration: 18+ months. Most important skills: 10 years of Oracle database administration experience on large production environments; hands-on database skills, especially around database and system troubleshooting and administration; GoldenGate...


  • San Francisco, United States DAOmatch Full time

    Aptos is a people-first blockchain on a mission to help billions of people achieve universal and fair access to decentralized assets in a safe and scalable way. Founded by some of the original creators and maintainers that researched, designed, and built the Diem blockchain to serve this purpose, we have dedicated several years toward this mission. We believe...


  • San Francisco, United States Resource Informatics Group Full time

    Job Title: Site Reliability Engineer. Work Location: San Francisco, CA (Hybrid after showing successful engagement). Duration: 18+ months. Most important skills: 10 years of Oracle database administration experience on large production environments; hands-on database skills, especially around database and system troubleshooting and administration; GoldenGate setup,...


  • San Francisco, United States Cypress Human Capital Management, LLC Full time

    Site Reliability Engineer (Grafana) Responsibilities Collaborate with Service Owners and Observability Leaders to develop a strategy for monitoring the technology stack using Grafana. Initiate data ingestion by deploying Telegraf and exporters (if necessary), utilizing discovery to feed data into Grafana Mimir. Establish initial alerting by creating alert...


  • San Francisco, United States Swish Analytics Full time

    Swish Analytics is a sports analytics, betting and fantasy startup building the next generation of predictive sports analytics data products. We believe that oddsmaking is a challenge rooted in engineering, mathematics, and sports betting expertise; not intuition. We're looking for team-oriented individuals with an authentic passion for accurate and...


  • San Francisco, United States Cypress HCM Full time

    Job Description: Site Reliability Engineer (Grafana). Responsibilities: Collaborate with Service Owners and Observability Leaders to develop a strategy for monitoring the technology stack using Grafana. Initiate data ingestion by deploying Telegraf and exporters (if necessary), utilizing discovery to feed data into Grafana Mimir. Establish initial...


  • San Francisco, California, United States Zetachain Full time

    We are seeking a Sr. Site Reliability Engineer to join our team and run critical infrastructure for our blockchain and web applications. You'll learn to deploy and maintain a fleet of RPC and validator nodes for multiple blockchain networks. You'll also provide guidance and expertise to development teams to ensure their applications follow modern best...


  • San Francisco, United States Webflow Full time

    At Webflow, our mission is to bring development superpowers to everyone. Webflow is the leading visual development platform for building powerful websites without writing code. By combining modern web development technologies into one platform, Webflow enables people to build websites visually, saving engineering time, while clean code seamlessly generates...


  • San Francisco, California, United States Observable Full time

    Observable is seeking a full-time infrastructure and site reliability engineer to help improve, administer, and grow Observable systems as we scale to meet our customers' needs. What you will do: Perform site reliability and ops work for Observable production and staging environments (manage servers, tweak WAF rules, optimize SQL queries, and more). Design and...


  • San Francisco, United States Orb Full time

    Mission: Orb is on an ambitious mission to provide every business with the infrastructure to unlock their revenue. Best-in-class businesses find ways to effectively align their monetization to product usage, whether that's through seats, consumption, feature limits, or usage-based tiers. Orb brings that opportunity to every software company. We are...


  • San Diego, United States ObjectWin Technology Full time

    Job Title: Site Reliability Engineer Location: San Diego, CA or Remote in CA Duration: 6 Months Description: It is an exciting time to be part of SIE’s CICD and Cloud Site Reliability Engineering (SRE) team. SREs operate right at the intersection of Software Engineering and Infrastructure Engineering. The SRE team strives to make PlayStation highly...


  • San Francisco, United States Orb Full time

    Mission: Orb is on an ambitious mission to provide every business with the infrastructure to unlock their revenue. Best-in-class businesses find ways to effectively align their monetization to product usage, whether that's through seats, consumption, feature limits, or usage-based tiers. Orb brings that opportunity to every software company. We are reimagining...