Senior SRE Engineering Leader

3 weeks ago


Santa Clara, California, United States NVIDIA Full time

NVIDIA is a leader in the AI revolution, driving innovation in industries with our cutting-edge GPU technology. Our GPUs power groundbreaking advancements in AI, big data, and deep learning.

We're seeking a visionary leader to join us as a Senior SRE Engineering Leader. As a key member of our team, you'll lead our globally distributed clusters, ensuring seamless operations and delivering AI services that drive breakthroughs in life sciences and natural language processing.

As SRE Leader, you'll build and operate large-scale GPU clusters across multiple cloud providers. You'll design and implement processes, tools, and systems that turn our extensive operational experience into improvements across the ecosystem.

Key responsibilities include:

  • Managing distributed, multi-location GPU clusters for AI research
  • Leading a team of SREs, driving cluster operational excellence and efficiency
  • Delivering scalable distributed systems and AI services in fast-paced environments
  • Building strong, globally distributed teams and driving technical strategy
  • Collaborating across the company to improve the GPU ecosystem for AI use cases
  • Solving reliability, efficiency, and productivity challenges for GPU infrastructure
  • Defining strategy, managing projects, and driving technical leadership across multiple areas
  • Collaborating with internal stakeholders to ensure transparency on budget and operational efficiency

Requirements include:

  • 10+ years in engineering management; 3+ years in leadership roles
  • Bachelor's or Master's in Computer Science or a related field, or equivalent experience
  • Experience supporting AI/ML workloads and driving operational standard methodologies
  • Strong Unix/Linux knowledge and proficiency in at least two programming languages (Perl, Python, Go)
  • Expertise in managing large-scale distributed systems and AI/HPC environments
  • Leadership experience, mentoring, and coaching skills
  • Ability to quickly learn and integrate new technologies
  • Strong collaboration skills across engineering, server, storage, and security teams

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We're a company that values diversity and is committed to fostering a work environment that is inclusive and respectful. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.



  • Santa Clara, California, United States Diverse Lynx Full time

    Job Title: Senior Network SRE. We are seeking a seasoned Senior Network SRE to lead our network infrastructure team in achieving Service Level Objectives (SLOs) and minimizing manual labor. Key Responsibilities: Owning the operational aspect of the network infrastructure, ensuring high availability and reliability. Partnering with architecture, tooling, and...


  • Santa Clara, California, United States NVIDIA Full time

    Job Description: We are seeking a highly skilled Sr. Data Engineer, SRE to join our team at NVIDIA. As a key member of our data science and reporting team, you will be responsible for designing and delivering high-performance services and libraries, building streaming data pipelines, and partnering with other engineering and business teams to integrate your...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Summary: We are seeking a seasoned Network SRE technical lead to help actualize the SRE vision for our network infrastructure. As a key member of our Network Support and SRE team, you will be responsible for owning the operational aspect of the network infrastructure, ensuring its high availability and reliability. Key Responsibilities: Partner with...


  • Santa Clara, California, United States NVIDIA Full time

    Job Description: NVIDIA is seeking a Senior Site Reliability Engineer to join our AI Efficiency Team. As a key member of this team, you will contribute to the development of infrastructure that powers our innovative AI research. The AI Efficiency Team focuses on optimizing efficiency and resiliency of AI workloads, as well as developing scalable AI and Data...


  • Santa Clara, California, United States NVIDIA Full time

    At NVIDIA, we're seeking a highly skilled Senior Cloud Reliability Engineer to join our team. As a key member of our Site Reliability Engineering (SRE) team, you'll be responsible for designing, building, and maintaining large-scale production systems with high efficiency and availability. This is a highly specialized discipline that demands knowledge across...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: We are seeking a highly skilled Sr Staff Site Reliability Engineer to join our CDL/SLS team at Palo Alto Networks. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure. As a Sr Staff Site Reliability Engineer, you will contribute to the success of our SRE...


  • Santa Clara, California, United States NVIDIA Full time

    As a Senior Manager in Site Reliability Engineering (SRE) at NVIDIA, you will lead a team dedicated to the design, construction, and maintenance of expansive production systems, emphasizing high efficiency and availability. This role spans various domains, including software and systems engineering, cloud-scale storage, data management, and services. SRE...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About Us: Palo Alto Networks is a leader in the cybersecurity industry, dedicated to protecting the digital way of life. Our mission is to be the cybersecurity partner of choice, and we're looking for innovators who share our passion for shaping the future of cybersecurity. We're a company built on disruption, and we're looking for individuals who are...


  • Santa Clara, California, United States Capgemini Engineering Full time

    Job Title: Senior ASIC Physical Design Engineer. Job Summary: We are seeking a highly skilled Senior ASIC Physical Design Engineer to join our team at Capgemini Engineering. As a key member of our design team, you will be responsible for designing and implementing complex ASICs using cutting-edge technologies and tools. Key Responsibilities: Design and implement...


  • Santa Clara, California, United States Capgemini Engineering Full time

    Job Title: Senior ASIC Physical Design Engineer. Job Summary: We are seeking a highly skilled Senior ASIC Physical Design Engineer to join our team at Capgemini Engineering. As a key member of our design team, you will be responsible for the implementation of complex ASICs, focusing on high frequency block timing closure and physical verification. Key...


  • Santa Clara, California, United States NVIDIA Full time

    Job Description: We are seeking a highly skilled Senior Wireless Network Engineer to join our team at NVIDIA. As a key member of our Network Support and SRE team, you will play a critical role in ensuring the high availability and reliability of our wireless infrastructure. Your primary responsibilities will include owning the operational aspect of the wireless...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: We are seeking an experienced Senior Integration Developer to join our dynamic IT team at Palo Alto Networks. As a key member of our team, you will play a critical part in driving the transformation of our integration landscape, improving transaction speed, data accuracy, and ensuring a seamless user experience. You will be responsible for...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Senior Backend Software Engineer to join our team. As a key member of our engineering team, you will be responsible for designing and developing distributed backend services that serve as the backbone of our cloud-delivered security platform. Key Responsibilities: Analyze requirements and design,...


  • Santa Clara, California, United States TRC Companies Full time

    About Us: At TRC, we're a team of innovators, thinkers, and problem-solvers who are passionate about shaping a brighter, more sustainable future. Our commitment to safety, quality, integrity, creativity, accountability, teamwork, and passion drives everything we do. We're a leading provider of geo-environmental consulting services, and we're seeking a talented...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: We are seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team at Palo Alto Networks. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Our Infrastructure Platform stack includes Terraform, Kubernetes, GitLab CI/CD, GitOps,...

  • Senior Manager

    4 weeks ago


    Santa Clara, California, United States NVIDIA Full time

    Job Summary: NVIDIA is seeking a highly experienced Senior Manager to lead our Storage Systems team. As a key member of our Site Reliability Engineering (SRE) organization, you will be responsible for designing, implementing, and maintaining scalable and reliable storage systems to support our cloud infrastructure. Key Responsibilities: Lead a team of Storage SRE...


  • Santa Clara, California, United States XPENG Motors Full time

    Job Title: Senior Staff AI Infrastructure SRE. Xpeng Motors is a leading smart electric vehicle company that designs, develops, and manufactures cutting-edge EVs with advanced Internet, AI, and autonomous driving technologies. We are committed to in-house R&D and intelligent manufacturing to create a better mobility experience for our customers. About the...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Summary: We are seeking an experienced Senior Integration Developer to join our dynamic IT team at Palo Alto Networks. As a key member of our team, you will be responsible for designing, developing, and implementing scalable integration solutions using SnapLogic and other cutting-edge technologies. As a Senior Integration Developer, you will collaborate...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our Cortex Data Lake team. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Key Responsibilities: Contribute to the success of our SRE and DevOps teams by developing...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Description: Palo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our infrastructure team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Key Responsibilities: Develop expertise in new technologies and contribute to the...