Site Reliability Engineer

3 weeks ago


Santa Clara, California, United States NVIDIA Full time
Job Title: Site Reliability Engineer

We are seeking a highly motivated Site Reliability Engineer to join our Applications Infrastructure organization. This team is responsible for automating, deploying, and maintaining infrastructure for various NVIDIA AI workflows and applications such as Metropolis, ACE, and Riva hosted in the cloud.

Key Responsibilities:
  • Develop and integrate new software, tools, and analytics to improve the availability, scalability, latency, and efficiency of our cloud services.
  • Manage upgrades and automated rollbacks across all clusters.
  • Maintain Service Level Agreements (SLAs) by collaborating with developers to define Service Level Indicators (SLIs) and design stable, secure services.
  • Guide the Change Advisory Board and Root Cause Corrective Action (RCCA) processes.
  • Collaborate with engineering, DevOps, and product leads across the GPU cloud services stack to build fast, reliable, and durable production systems.
  • Drive process changes to enhance the reliability and performance of cloud services.
  • Debug production issues across services and levels of the stack.
  • Improve operational processes.

Requirements:
  • Bachelor's degree in Computer Science or a related field, or equivalent experience.
  • 5+ years of experience in system design, complexity analysis, software design in Unix/Linux systems, performance tuning, and application issue resolution.
  • 5+ years of experience in authoring and debugging software written in C++ and Python.
  • Hands-on experience with Kubernetes-based cloud environments.
  • Multi-cloud experience.
  • Experience working with partners across multiple teams.
  • Experience operating production systems.

Preferred Qualifications:
  • Background with Software as a Service (SaaS) offerings.
  • Experience with application issue resolution, algorithms, and data structures.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

The base salary range is 140,000 USD - 258,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.


