Senior Site Reliability Engineer supporting NVIDIA

2 months ago


Santa Clara, United States Sustainable Talent Full time
Job Description

Join the Sustainable Talent team as a Senior Site Reliability Engineer supporting NVIDIA's Infrastructure, Planning, and Process organization. This is a W-2 full-time contract based in Santa Clara, CA, with hybrid work options. We offer competitive pay of $75 - $90/hr, based on factors such as experience, education, and location, and provide full benefits, PTO, and an amazing company culture.

As an SRE, you will troubleshoot and manage our client's on-premises infrastructure to support software engineering teams company-wide. Keen attention to detail, problem-solving abilities, and a solid knowledge base are essential.

What you'll be doing:

  • Work on systems deployed in NVIDIA's internal cloud, making them available and reliable for our end users.
  • Monitor system performance and troubleshoot issues related to CPU, memory, disk, and network utilization.
  • Provide high-quality user support.
  • Monitor KPIs and make sure that the team's SLAs are met.
  • Manage and maintain production Kubernetes clusters.
  • Drive automation of monitoring to gain more insight into application and system health.
  • Craft and implement critical metrics using various analytics methods and dashboards.
  • Reuse AI techniques to extract useful signals about machines and jobs from the data generated.

What we need to see:

  • Proven SRE experience in L1 support with on-call responsibilities, ideally 5+ years.
  • Proficient in troubleshooting Linux OS issues such as SSH and performance.
  • Experience troubleshooting networking issues such as DNS and DHCP, and familiarity with networking principles and protocols, including TCP/IP and VLANs.
  • Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, Elastic, or similar.
  • Strong understanding and practical experience with REST API calls.
  • Proficiency in basic scripting; familiarity with Python or a similar programming language is a plus.
  • Knowledge of Ansible roles and playbooks, Jenkins CI/CD processes, and deployment experience with Kubernetes.
  • Experience with the Kickstart process for automated Linux installations.
  • Experience managing and troubleshooting Linux systems, as well as managing systems in data centers, using tools like BMC (Redfish), KVM, and IPMI.
  • Background in SQL databases such as MySQL and time-series databases such as Prometheus.
  • Experience with data analytics and visualization tools like Kibana, Grafana, and Splunk.
  • Proficient with source code management and binary repository systems like GitLab, GitHub, Artifactory, and Perforce.
  • Advanced knowledge of standard methodologies related to security.
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.

Ways to stand out from the crowd:

  • Working knowledge of OpenStack.
  • Previous experience managing NVIDIA hardware such as GPUs and Tegras.
  • Prior experience with large scale operations teams.
  • Experience managing Windows server infrastructure.
  • Outstanding interpersonal skills and ability to communicate effectively with all levels of management.
  • Ability to analyze complex problems, design simple systems that function efficiently with minimal support, and thrive in a multi-tasking environment with evolving priorities.

Sustainable Talent is a M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.


