Senior Site Reliability Engineer supporting NVIDIA

2 months ago

Santa Clara, United States Sustainable Talent Full time

Job Description

Join the Sustainable Talent team, supporting NVIDIA as a Senior Site Reliability Engineer in the Infrastructure, Planning, and Process organization. This is a W-2 full-time contract based in Santa Clara, CA, with hybrid work options. We offer competitive pay of $75 - $90/hr, based on factors such as experience, education, and location, and provide full benefits, PTO, and an amazing company culture.

As an SRE, you will troubleshoot and manage our client's on-premises infrastructure to support software engineering teams company-wide. Keen attention to detail, problem-solving abilities, and a solid knowledge base are essential.

What you'll be doing:

Working on systems deployed in NVIDIA's internal cloud, making them available and reliable for our end users.
Monitoring system performance and troubleshooting issues related to CPU, memory, disk, and network utilization.
Providing high-quality user support.
Monitoring KPIs and making sure that the team's SLAs are met.
Managing and maintaining production Kubernetes clusters.
Driving automation of monitoring to gain more insight into application and system health.
Crafting and implementing critical metrics using various analytics methods and dashboards.
Reusing AI techniques to extract useful signals about machines and jobs from the data generated.

What we need to see:

Proven SRE experience in an L1 support role with on-call responsibilities, ideally 5+ years.
Proficiency in troubleshooting Linux OS issues, such as SSH and performance problems.
Experience troubleshooting networking issues such as DNS and DHCP, and familiarity with networking principles and protocols, including TCP/IP and VLANs.
Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, Elastic, or similar.
Strong understanding and practical experience with REST API calls.
Proficiency in basic scripting; familiarity with Python or a similar programming language is a plus.
Knowledge of Ansible roles and playbooks, Jenkins CI/CD processes, and deployment experience with Kubernetes.
Experience with the Kickstart process for automated Linux installations.
Experience managing and troubleshooting Linux systems, as well as managing systems in data centers, using tools like BMC (Redfish), KVM, and IPMI.
Background in databases such as SQL (MySQL) and time-series databases like Prometheus.
Experience with data analytics and visualization tools like Kibana, Grafana, and Splunk.
Proficient with source code management and binary repository systems like GitLab, GitHub, Artifactory, and Perforce.
Advanced knowledge of standard methodologies related to security.
Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.

Ways to stand out from the crowd:

Working knowledge of OpenStack.
Previous experience managing NVIDIA hardware such as GPUs and Tegras.
Prior experience with large scale operations teams.
Experience managing Windows server infrastructure.
Outstanding interpersonal skills and ability to communicate effectively with all levels of management.
Ability to analyze complex problems, design simple systems that function efficiently with minimal support, and thrive in a multi-tasking environment with evolving priorities.

Sustainable Talent is a M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.
