Staff Site Reliability Engineer

6 days ago


San Francisco, United States Ellation, Inc. Full time
Who We Are

We're a cast of characters working to shine a spotlight on anime. Crunchyroll is an international business focused on creating both online and offline experiences for fans through content (licensed, co-produced, originals, distribution), merchandise, events, gaming, news, and more. Visit our About Us pages for more information about our collection of brands.

About the Team

The Site Reliability Engineering (SRE) team is dedicated to ensuring the reliability, scalability, and performance of our data infrastructure. We focus on standardizing and implementing monitoring and alerting across all datastores to track key metrics like errors, latency, and throughput, and to ensure critical systems are covered. Our team also leads efforts to keep databases up-to-date, implements Infrastructure as Code (IaC) for high availability and performance, and automates key processes to enhance operational efficiency.

We lead and evangelize the principle of 100% automation. Additionally, we define and document operational requirements, develop incident response processes, and automate monitoring and compliance checks to maintain a secure and reliable data environment. By continuously improving load testing and optimizing data governance practices, we support the overall health and efficiency of our data systems.

About the Role

Crunchyroll is growing and changing, presenting unique challenges and opportunities to support millions of anime fans around the world. The Data Engineering team provides seamless help to our internal stakeholders, ensuring an exceptional experience for all Crunchyroll fans.

As a Staff Site Reliability Engineer for the Data Engineering team, you will be responsible for maintaining and enhancing the reliability of our data infrastructure. Your work will directly impact the availability and performance of our data services, enabling the organization to make better decisions. You will collaborate closely with data engineers and software engineers to drive 100% automation and develop best practices for deep monitoring and alerting. This role will report to our Director of Data Engineering. While it is preferred for this role to sit in one of our offices, fully remote is also an option in the United States.

About You
  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • 12+ years of experience in site reliability engineering, database operations, or a related role with a focus on data platforms, data stores, and data operations.
  • Extensive experience with the AWS cloud platform and its data-related services.
  • Proficiency in monitoring tools (e.g., Datadog, CloudWatch, DevOps Guru, DB Performance Insights).
  • Proficiency in one or more programming languages (e.g., Python, Java).
  • Proficiency in automation frameworks (e.g., Terraform, CloudFormation).
  • Strong understanding of performance metrics at both a high level and a low level, such as disk I/O saturation.
  • Experience in identifying and eliminating the bottlenecks in the system.
  • Strong understanding of database internals like types of indexes, schemas, query plans.
  • Strong understanding of database systems (e.g., SQL, NoSQL) and experience in managing large-scale data infrastructures.
  • Strong understanding and hands-on implementation of CI/CD pipelines and DataOps practices.
  • Experience with data governance, compliance, and lifecycle management.
  • Ability to own and execute projects while effectively collaborating with the team to influence and shape the vision of the data engineering organization.
Why you will love working at Crunchyroll

Not only will you get to work with fun, passionate and inspired colleagues, you will also...

  • Receive a great compensation package including salary plus performance bonus earning potential, paid annually.
  • Enjoy flexible PTO and time off policies allowing you to take the time you need to be your whole self.
  • Appreciate the generous medical, dental, vision, STD, LTD, and life insurance options for you and your family.
  • Take advantage of our health savings account (HSA) program plus health care and dependent care FSA programs.
  • Love that we offer an employer match on our 401(k) plan.
  • Receive an employer-paid commuter benefit (for eligible employees).
  • Appreciate the generous support program for new parents.
  • Obtain pet insurance, and enjoy pet-friendly spaces at some of our offices.

#LifeAtCrunchyroll #LI-Remote

