Senior Site Reliability Engineer

19 hours ago


Santa Clara, California, United States NVIDIA Full time
About NVIDIA

NVIDIA is a leader in the field of artificial intelligence, machine learning, and datacenter acceleration. With a rich history of innovation, we have continuously pushed the boundaries of what is possible in the world of computing.

Job Summary

We are seeking an experienced Site Reliability Engineer to join our GPU AI/HPC Infrastructure team. In this role, you will design and implement large-scale GPU compute clusters that run demanding deep learning, high-performance computing, and other computationally intensive workloads.

Key Responsibilities
  • Design and implement large-scale GPU compute clusters
  • Develop large-scale automation solutions
  • Build and maintain deep learning AI/HPC GPU clusters at scale
  • Support researchers in running their flows on our clusters
  • Design, implement, and support the operational and reliability aspects of large-scale distributed systems
Requirements
  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • Minimum 5 years of experience designing and operating large-scale compute infrastructure
  • Experience with advanced AI/HPC job schedulers such as Slurm, K8s, RTDA, or LSF
  • Experience analysing and tuning performance for a variety of AI/HPC workloads
  • Working knowledge of cluster configuration management tools and infrastructure level applications
  • In-depth understanding of container technologies
  • Experience learning development languages
Preferred Qualifications
  • Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
  • Background in machine learning and deep learning concepts, algorithms, and models
  • Familiarity with InfiniBand, including IPoIB and RDMA
  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
  • Familiarity with deep learning frameworks like PyTorch and TensorFlow
What We Offer

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have a diverse and talented team of engineers working on cutting-edge technology, and we are committed to fostering a work environment that is inclusive and supportive of all employees.

We are an equal opportunity employer and welcome applications from qualified candidates from diverse backgrounds. If you are a creative and autonomous engineer with a passion for technology, we encourage you to apply.



  • Santa Clara, California, United States NVIDIA Full time

    Job Title: Senior Site Reliability Engineer. NVIDIA is a leader in AI, machine learning, and datacenter acceleration. Our company is expanding its leadership into datacenter networking with Ethernet switches, NICs, and DPUs. We have continuously reinvented ourselves over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market,...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to create a better world for everyone, driven by our talented workforce. We prioritize speed and innovation to meet the demands of our customers and communities. Joining ServiceNow means becoming part of a dynamic team of innovators who possess a relentless curiosity and a commitment to creativity. We...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Description: Palo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Key Responsibilities: Contribute to the success of SRE and DevOps teams; Develop expertise...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to enhance global operations, and our dedicated workforce makes it all possible. We operate swiftly because the world demands it, innovating uniquely for our clients and communities. By becoming part of ServiceNow, you join a dynamic team of innovators who possess a relentless curiosity and a passion for...


  • Santa Clara, California, United States Diverse Lynx Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud-based applications and infrastructure. Key Responsibilities: Design, implement, and maintain cloud infrastructure on...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Description: We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a key member of our infrastructure team, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems. Key Responsibilities: Design, implement, and maintain scalable and highly available cloud...


  • Santa Clara, California, United States Syntricate Technologies Full time

    Job Description: We are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a key member of our infrastructure team, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems. Key Responsibilities: Design, implement, and maintain cloud infrastructure on AWS,...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Title: Principal Site Reliability Engineer. Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure. About the Role: We are looking for a seasoned engineer with expertise in...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in designing, building, and maintaining scalable and reliable infrastructure for our FedRAMP SASE product portfolio. Key Responsibilities: Design and implement scalable and reliable...


  • Santa Clara, California, United States Centrify Corporation Full time

    Cloud Site Reliability Engineer: At Centrify Corporation, we're seeking a skilled Cloud Site Reliability Engineer to join our Cloud DevOps team. As a key member of our operations team, you'll play a critical role in ensuring the uptime and delivery of our cloud-based services. Key Responsibilities: Manage our cloud application using DevOps and Agile practices to...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our infrastructure platform, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Key Responsibilities: Contribute to the success of SRE and DevOps teams by developing expertise...


  • Santa Clara, California, United States Veear Full time

    Job Description: We are seeking a highly skilled Site Reliability Engineer to join our team at Veear. As a key member of our infrastructure team, you will play a critical role in ensuring the security, compliance, and reliability of our systems. Key Responsibilities: Partner with development teams to ensure that applications have scalability and reliability...


  • Santa Clara, California, United States ServiceNow Full time

    Overview: The ServiceNow SRE team is a group of highly skilled engineers who are responsible for maintaining and developing the reliability, scalability, and performance of the ServiceNow cloud infrastructure. Key Responsibilities: Provide relief and sustainable resolution to issues within our infrastructure. Use expertise in software development, systems...


  • Santa Clara, California, United States ServiceNow Full time

    Overview: The ServiceNow SRE team is a group of highly technical engineers who are tasked with maintaining and developing the reliability, scalability, and performance of the ServiceNow cloud infrastructure. Our SREs are empowered to drive technical resolutions across the technology stack from hardware through to application and all stops in between. They are...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in ensuring the high availability and reliability of our applications and infrastructure. Key Responsibilities: Design, implement, and maintain scalable and reliable infrastructure; Build...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Description: Palo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our Global Customer Operation Team. As a Site Reliability Engineer, you will play a critical role in designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio. Key Responsibilities: Design and...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Join Our Mission to End Breaches and Protect Digital Life. Palo Alto Networks is the fastest-growing security company in history, and we're looking for a motivated, intelligent, and creative individual to join our team as a Site Reliability Engineer DevOps. About the Role: We offer the chance to be part of an important mission: ending breaches and protecting our...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: We are seeking a highly skilled Principal Site Reliability Engineer to join our team at Palo Alto Networks. As a key member of our Global Customer Operation Team, you will be responsible for designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio. Key Responsibilities: Design and...