Senior Site Reliability Engineer

3 weeks ago


Santa Clara, United States NVIDIA Full time

NVIDIA is the leader in AI, machine learning and datacenter acceleration. NVIDIA is expanding that leadership into datacenter networking with ethernet switches, NICs and DPUs NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice, join our diverse team today

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of ground breaking GPU compute clusters that powers all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve researchers productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

In this role you will be building and improving our ecosystem around GPU-accelerated computing including developing large scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows. You will design, implement and support operational and reliability aspects of large scale distributed systems with focus on performance at scale, real time monitoring, logging and alerting. Additional responsibilities include:

  • Design and implement state-of-the-art GPU compute clusters.

  • Optimize cluster operations for maximum reliability, efficiency, and performance.

  • Drive foundational improvements and automation to enhance researcher productivity.

  • Tackle strategic challenges in large-scale, high-performance computing environments.

  • Troubleshoot, diagnose and root cause of system failures and isolate the components/failure scenarios while working with internal & external partners.

  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.

  • Practice sustainable incident response and blameless postmortems

  • Be part of an on call rotation to support production systems

  • Write and review code, develop documentation and capacity plans, debug the hardest problems, live, on some of the largest and most complex systems in the world.

  • Implement remediations across software and hardware stack according to plan, while keeping a thorough procedural record and data log.

  • Manage upgrades and automated rollbacks across all clusters.

What we need to see:

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience with a minimum 6+ years of experience designing and operating large scale compute infrastructure

  • Proven experience in site reliability engineering for high-performance computing environments with operational experience of at least 5K GPUs cluster.

  • Deep understanding of GPU computing and AI infrastructure.

  • Passion for solving complex technical challenges and optimizing system performance.

  • Experience with AI/HPC advanced job schedulers, and ideally familiarity with schedulers such as Slurm.

  • Working knowledge of cluster configuration management tools such as BCM or Ansible and infrastructure level applications, such as Kubernetes, Terraform, MySQL, etc.

  • In depth understating of container technologies like Docker, Enroot, etc.

  • Experience programming in Python and Bash scripting.

Ways to stand out from the crowd:

  • Interest in crafting, analyzing and fixing large-scale distributed systems

  • Familiarity with NVIDIA GPUs, Cuda Programming, NCCL and MLPerf benchmarking.

  • Familiarity with InfiniBand with IBoIP and RDMA.

  • Experience with Cloud Deployment, BCM, Terraform.

  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.

  • Familiarity with deep learning frameworks like PyTorch and TensorFlow.

  • Multi-cloud experience.

The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.



  • Santa Clara, California, United States NVIDIA Full time

    About NVIDIANVIDIA is a leader in the field of artificial intelligence, machine learning, and datacenter acceleration. Our company has a rich history of innovation, with a legacy that dates back to the invention of the GPU in 1999. This groundbreaking technology sparked the growth of the PC gaming market, redefined modern computer graphics, and...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the RoleWe are seeking a highly skilled Senior Staff Site Reliability Engineer to join our team at Palo Alto Networks. As a key member of our Cloud Infrastructure team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Our ideal candidate will have a strong background in cloud computing, with...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job OverviewPalo Alto Networks is seeking a highly skilled Cloud Infrastructure Engineer to join our CDL/SLS team. As a Senior Staff Site Reliability Engineer, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Our team is at the forefront of innovation, constantly pushing the boundaries of what is...


  • Santa Clara, California, United States NVIDIA Full time

    As a Senior Manager in Site Reliability Engineering (SRE) at NVIDIA, you will lead a team dedicated to the design, construction, and maintenance of expansive production systems, emphasizing high efficiency and availability. This role spans various domains, including software and systems engineering, cloud-scale storage, data management, and services. SRE...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About UsPalo Alto Networks is a leader in the cybersecurity industry, dedicated to protecting the digital way of life. Our mission is to be the cybersecurity partner of choice, and we're looking for innovators who share our passion for shaping the future of cybersecurity.We're a company built on disruption, and we're looking for individuals who are...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our infrastructure team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Key Responsibilities:Develop expertise in new technologies and contribute to the...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Title: Site Reliability EngineerWe are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems.Key Responsibilities:Design, implement, and maintain cloud infrastructure on AWS,...


  • Santa Clara, California, United States Syntricate Technologies Full time

    Job Title: Site Reliability EngineeringWe are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a Site Reliability Engineer, you will be responsible for ensuring the reliability and scalability of our cloud-based systems.Key Responsibilities:Design and implement scalable and reliable cloud infrastructure using...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the RolePalo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Key ResponsibilitiesContribute to the success of SRE and DevOps teamsDevelop expertise in new...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the RoleWe are seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team at Palo Alto Networks. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Our Infrastructure Platform stack includes Terraform, Kubernetes, GitLab CI/CD, GitOps,...


  • Santa Clara, California, United States Syntricate Technologies Full time

    Job DescriptionWe are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems.Key Responsibilities:Design, implement, and maintain cloud infrastructure on AWS, including EC2,...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA is a leader in AI, machine learning, and datacenter acceleration. Our company is expanding its leadership into datacenter networking with ethernet switches, NICs, and DPUs. We have continuously reinvented ourselves over two decades, with our invention of the GPU in 1999 sparking the growth of the PC gaming market, redefining modern computer graphics,...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will be responsible for designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio.Key ResponsibilitiesDesign and implement scalable and reliable...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our Cortex Data Lake team. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Your CareerAs a Senior Staff Site Reliability Engineer, you will have the opportunity to...


  • Santa Clara, California, United States NVIDIA Full time

    Unlock the Power of Cloud ServicesWe are seeking a highly motivated Site Reliability Engineer to join our Applications Infrastructure organization.This team is responsible for automating, deploying, and maintaining infrastructure for various NVIDIA AI workflows and applications such as Metropolis, ACE, and Riva hosted in the cloud.The SRE role focuses on...


  • Santa Clara, California, United States Nvidia Full time

    Senior Reliability EngineerNVIDIA is seeking a highly skilled Senior Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for planning and implementing the qualifications of new NVIDIA products, including IC chips in AI, Mobile, Automotive, Deep Learning, Graphic Processor, and System on Chip sectors.Key...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our Cortex Data Lake team. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Your CareerAs a Senior Staff Site Reliability Engineer, you will have the opportunity to...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will be responsible for designing, building, and maintaining scalable and reliable infrastructure for our cloud-based products.Key Responsibilities:Design and implement scalable and reliable infrastructure for...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.You will work closely with our development team to ensure that applications are production-ready,...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the RoleWe are seeking a highly skilled Principal Site Reliability Engineer to join our team at Palo Alto Networks. As a key member of our Global Customer Operation Team, you will be responsible for designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio.Key ResponsibilitiesDesign and...