Senior Site Reliability Engineer

Santa Clara, California, United States NVIDIA Full time

NVIDIA is a leader in AI, machine learning, and datacenter acceleration. Our company is expanding its leadership into datacenter networking with Ethernet switches, NICs, and DPUs. We have continuously reinvented ourselves over two decades.

Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing.

We are a "learning machine" that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world.

This is our life's work, to amplify human imagination and intelligence. We are seeking an expert to join our diverse team.

Job Summary

We are looking for a Senior Site Reliability Engineer to provide leadership in the design and implementation of ground-breaking GPU compute clusters that run demanding deep learning, high-performance computing, and computationally intensive workloads.

The ideal candidate will have experience analyzing and tuning performance for a variety of AI/HPC workloads, including workflows that use MPI. They will also have working knowledge of cluster configuration management tools such as Ansible, Puppet, or Salt, experience with advanced AI/HPC job schedulers, and ideally familiarity with schedulers such as Slurm, K8s, RTDA, or LSF.

Key Responsibilities
  • Design and implement large-scale GPU compute clusters
  • Analyze and tune performance for AI/HPC workloads
  • Work with internal and external stakeholders to troubleshoot and resolve system failures
  • Develop automation solutions for cluster bring-up and scaled-up operation
  • Improve operational excellence and processes
Requirements
  • Bachelor's degree in Computer Science, Electrical Engineering, or related field, or equivalent experience
  • Minimum 5 years of experience designing and operating large-scale compute infrastructure
  • Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
  • Background in machine learning and deep learning concepts, algorithms, and models
  • Familiarity with InfiniBand, including IPoIB and RDMA
  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
  • Familiarity with deep learning frameworks like PyTorch and TensorFlow
What We Offer

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most brilliant and talented people in the world working for us, and our world-class engineering teams are growing fast.

If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

The base salary range is $148,000 - $276,000. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.


