Senior Site Reliability Engineer

2 days ago


Santa Clara, California, United States NVIDIA Full time
NVIDIA is a leader in AI, machine learning, and datacenter acceleration.

We are expanding our leadership into datacenter networking with Ethernet switches, NICs, and DPUs. Our team is responsible for designing and operating large-scale GPU compute clusters that power all AI research across NVIDIA.

Key Responsibilities:
  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Tackle strategic challenges in large-scale, high-performance computing environments
Requirements:
  • Bachelor's degree in Computer Science, Electrical Engineering, or related field
  • Proven experience in site reliability engineering for high-performance computing environments
  • Deep understanding of GPU computing and AI infrastructure
  • Passion for solving complex technical challenges and optimizing system performance
What We Offer:
  • Competitive salary and comprehensive benefits package
  • Opportunity to work with a world-class engineering team
  • Chance to contribute to cutting-edge AI and HPC research


  • Santa Clara, California, United States NVIDIA Full time

    Job Title: Senior Site Reliability Engineer. NVIDIA is a leader in AI, machine learning, and datacenter acceleration. Our company is expanding its leadership into datacenter networking with Ethernet switches, NICs, and DPUs. We have continuously reinvented ourselves over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market,...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to create a better world for everyone, driven by our talented workforce. We prioritize speed and innovation to meet the demands of our customers and communities. Joining ServiceNow means becoming part of a dynamic team of innovators who possess a relentless curiosity and a commitment to creativity. We...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure. Key Responsibilities: Develop expertise in new technologies and contribute to the success of SRE and...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Description: Palo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Key Responsibilities: Contribute to the success of SRE and DevOps teams. Develop expertise...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to enhance global operations, and our dedicated workforce makes it all possible. We operate swiftly because the world demands it, innovating uniquely for our clients and communities. By becoming part of ServiceNow, you join a dynamic team of innovators who possess a relentless curiosity and a passion for...


  • Santa Clara, California, United States Diverse Lynx Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud-based applications and infrastructure. Key Responsibilities: Design, implement, and maintain cloud infrastructure on...


  • Santa Clara, California, United States NVIDIA Full time

    Job Title: Senior Site Reliability Engineer - HPC Storage. NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. We are seeking a phenomenal Senior Site Reliability Engineer to join our team and play a crucial role in designing, implementing, and optimizing on-prem High-Performance...


  • Santa Clara, California, United States Insight Global Full time

    Site Reliability Engineer. About the Role: We are seeking a seasoned Site Reliability Engineer to join our team at Insight Global. As a key member of our Infrastructure, Planning and Processes organization, you will be responsible for developing and maintaining sophisticated internal cloud provisioning products. Key Responsibilities: Collaborate with various teams,...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Description: We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a key member of our infrastructure team, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems. Key Responsibilities: Design, implement, and maintain scalable and highly available cloud...


  • Santa Clara, California, United States Syntricate Technologies Full time

    Job Description: We are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a key member of our infrastructure team, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems. Key Responsibilities: Design, implement, and maintain cloud infrastructure on AWS,...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Title: Principal Site Reliability Engineer. Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure. About the Role: We are looking for a seasoned engineer with expertise in...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a Principal Site Reliability Engineer, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure. You will work closely with developers, researchers, data scientists, and security experts to ensure...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in designing, building, and maintaining scalable and reliable infrastructure for our FedRAMP SASE product portfolio. Key Responsibilities: Design and implement scalable and reliable...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a Principal Site Reliability Engineer, you will be responsible for designing, implementing, and maintaining scalable and reliable infrastructure to support our mission-critical platforms. Key Responsibilities: Design and implement scalable and...


  • Santa Clara, California, United States Centrify Corporation Full time

    Cloud Site Reliability Engineer. At Centrify Corporation, we're seeking a skilled Cloud Site Reliability Engineer to join our Cloud DevOps team. As a key member of our operations team, you'll play a critical role in ensuring the uptime and delivery of our cloud-based services. Key Responsibilities: Manage our cloud application using DevOps and Agile practices to...


  • Santa Clara, California, United States Omni Vision Inc Full time

    Job Title: Senior Reliability Engineer. Omni Vision Inc is seeking a highly skilled Senior Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for ensuring the quality and reliability of our CMOS Image Sensor products. Key Responsibilities: Review reliability qualification testing results and determine whether...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure. Key Responsibilities: Contribute to the success of SRE and DevOps teams. Develop expertise in new...