Reliability Operations Manager

2 days ago


San Jose, California, United States The Trade Desk Full time
About The Trade Desk

The Trade Desk is a leading global technology company that empowers brands to drive real connections with their customers. Our mission is to constantly improve the reliability of our platform, ensuring a seamless customer experience.

Job Summary

We are seeking a highly skilled Reliability Operations Manager to join our global Reliability Operations team. This role will be responsible for defining, managing, and measuring incident response engineering practices, liaising with engineering teams, and managing a global team of reliability operations engineers.

Key Responsibilities
  • Define, manage, and measure incident response engineering practices
  • Liaise with engineering teams to ensure work discovered during incident response is prioritized
  • Participate in incident response engineering duties as necessary
  • Manage a global Reliability Operations team (3 to 6+ reliability operations engineers across NAMER, EMEA, and APAC)
  • Meet periodically with direct reports across time zones
  • Provide periodic weekend coverage as required
Requirements
  • Bachelor's Degree from a four-year university or relevant substitute experience
  • 6+ years of relevant work experience in Technical and/or Application Support with strong knowledge of technical troubleshooting
  • 2-5 years of management experience with direct reports
Preferred Skills
  • Adaptive management style according to level and proficiency of engineering reports
  • Ability to understand technical employee career paths and collaboratively develop career plans
  • Experience scheduling a global team through holidays, sick leave, and vacations across time zones
  • Understanding of large-scale distributed system architectures (e.g., databases, web services, application services)
  • Familiarity with monitoring tools (e.g., Prometheus, Grafana, Nagios)
  • Ability to author scripts to facilitate troubleshooting as well as configure alerts
  • Proficiency in scripting languages (e.g., Python, Bash) is a plus
What We Offer

The Trade Desk offers a competitive total compensation and benefits package, including comprehensive healthcare, retirement benefits, short and long-term disability coverage, basic life insurance, well-being benefits, reimbursement for certain tuition expenses, parental leave, sick time, vacation time, and around 13 paid holidays per year.

Employees can also purchase The Trade Desk stock at a discount through The Trade Desk's Employee Stock Purchase Plan.



  • San Jose, California, United States The Trade Desk Full time

    About The Trade Desk: The Trade Desk is a leading global technology company that empowers brands to connect with consumers through its innovative, cloud-based platform. Our mission is to deliver exceptional customer experiences by ensuring the reliability and performance of our platform. Job Summary: We are seeking a highly skilled Reliability Operations Manager...


  • San Jose, California, United States Triune Infomatics Inc Full time

    Role: Senior Site Reliability Manager. Triune Infomatics Inc is seeking an experienced Senior Site Reliability Manager to join our team and contribute to the design and upkeep of our cloud-based IoT edge orchestration solution. Job Summary: The Senior Site Reliability Manager will be responsible for ensuring the availability of our SaaS platform and meeting the...


  • San Jose, California, United States NetApp Full time

    Job Summary: As a Site Reliability Engineer, you will be responsible for ensuring the stability and security of multiple open-source systems and platforms that are run or operated in our environment. Key Responsibilities: Building and maintaining a reliable site environment to meet the development and maintenance requirements of open-source systems and...

  • Reliability Engineer

    4 weeks ago


    San Jose, California, United States Antora Energy Full time

    Job Title: Sr. Reliability Engineer. At Antora Energy, we're committed to revolutionizing the way industries approach energy storage. As a Sr. Reliability Engineer, you'll play a pivotal role in ensuring the high reliability and availability of our thermal battery systems. Key Responsibilities: Collaborate with cross-functional teams to scope, define, design,...


  • San Jose, California, United States Adobe Full time

    About the Role: We are seeking an exceptional Site Reliability Engineering Manager to lead our team in driving reliability for Adobe's AI Inference Platform, Adobe Firefly. As a key member of our Engineering organization, you will be responsible for developing a team of Site Reliability Engineers who will work closely with our Engineering teams to build,...


  • San Jose, California, United States Diverse Lynx Full time

    Job Title: Site Reliability Engineer. We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based infrastructure. Key Responsibilities: Design and implement automation scripts using shell,...

  • Reliability Expert

    1 month ago


    San Jose, California, United States Power Integrations Full time

    Job Summary: We are seeking a highly skilled Senior Reliability Engineer to join our team at Power Integrations. As a key member of our reliability engineering team, you will be responsible for evaluating the reliability of IC products, packages, and process technology to ensure suitability for end applications and conformance to industry standards. Key...


  • San Jose, California, United States Syntricate Technologies Full time

    Job Title: Site Reliability Engineer. We are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based infrastructure. Key Responsibilities: Design, implement, and maintain scalable and highly...


  • San Jose, California, United States NetApp Full time

    Job Summary: As a Site Reliability Engineer at NetApp, you will be responsible for managing, supporting, and maintaining a reliable environment for our site. This involves ensuring the stability and security of multiple open-source systems and platforms that are run or operated in that environment. Key Responsibilities: Building and supporting a reliable site for...


  • San Jose, California, United States NetApp Full time

    Job Summary: We are seeking a highly skilled Site Reliability Engineer to join our team at NetApp. As a Site Reliability Engineer, you will be responsible for managing, supporting, and maintaining a reliable environment for our site to ensure the stability and security of multiple open-source systems/platforms. Key Responsibilities: Building and supporting a...


  • San Jose, California, United States TikTok Full time

    Transforming Data Infrastructure with TikTok: TikTok is a pioneer in innovation, merging software development and infrastructure operations to design, build, and manage large-scale, highly distributed systems. Our Site Reliability Engineering (SRE) team is a key player in this journey, overseeing one of the industry's most extensive cloud...


  • San Jose, California, United States Diverse Lynx Full time

    Job Title: Site Reliability Engineer. Job Summary: Diverse Lynx LLC is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based infrastructure. Key Responsibilities: Design and implement automation scripts using shell,...


  • San Jose, California, United States Trianz Full time

    About Trianz: Trianz is a leading-edge technology platforms and services company that accelerates digital transformations at Fortune 100 and emerging companies worldwide in data & analytics, digital experiences, cloud infrastructure, and security. Our Vision: We believe that companies around the world face three challenges in their digital transformation journeys...


  • San Jose, California, United States Zscaler Full time

    About Zscaler: Zscaler is a leading cloud security company that accelerates digital transformation for its customers. With a cloud-native platform, Zscaler protects thousands of organizations from cyber threats and data loss by securely connecting users, devices, and applications in any location. As a pioneer in cloud security, Zscaler has over 10 years of...


  • San Jose, California, United States TikTok Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our dynamic team at TikTok. As a pioneer in innovation, our data infrastructure SRE team seamlessly merges software development and infrastructure operations to design, build, and manage large-scale, highly distributed systems. Key Responsibilities: Participate in and enhance the...


  • San Jose, California, United States TikTok Full time

    About Us: TikTok is a global leader in short-form mobile video, inspiring creativity and bringing joy to users worldwide. Our mission is to empower creators and communities to thrive in a vibrant, inclusive space. Job Summary: We're seeking a skilled Site Reliability Engineer to join our dynamic team, driving innovation and excellence in our cloud infrastructure....

  • Operations Manager

    2 weeks ago


    San Francisco, California, United States MP Mine Operations LLC Full time

    Job Title: Operations Supervisor. MP Materials is seeking an experienced Operations Supervisor to join our team at our mining and processing site in Mountain Pass, California. As an Operations Supervisor, you will be responsible for managing and directing mineral processing and/or chemical plant activities. Key Responsibilities: Oversee safe, reliable, and...


  • San Jose, California, United States F5 Full time

    Job Summary: F5 is seeking a highly skilled Senior Site Reliability Engineer to join our team. As a key member of our SRE team, you will play a pivotal role in ensuring the reliability and scalability of our distributed cloud product. Key Responsibilities: Design and implement automation solutions to reduce toil and improve operational efficiency. Participate in...


  • San Jose, California, United States Hireio, Inc. Full time

    About the Role: Hireio, Inc. is seeking a highly skilled Senior Site Reliability Engineer to join our team. As a key member of our data infrastructure team, you will be responsible for designing, building, and managing large-scale, highly distributed systems. Our team is a pioneer in innovation, seamlessly merging software development and infrastructure...