Senior Site Reliability Engineer

19 hours ago

Santa Clara, California, United States NVIDIA Full time

About NVIDIA

NVIDIA is a leader in the field of artificial intelligence, machine learning, and datacenter acceleration. With a rich history of innovation, we have continuously pushed the boundaries of what is possible in the world of computing.

Job Summary

We are seeking an experienced Site Reliability Engineer to join our GPU AI/HPC Infrastructure team. As a key member of our team, you will be responsible for designing and implementing large scale GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads.

Key Responsibilities

Design and implement large scale GPU compute clusters
Develop large scale automation solutions
Build and maintain deep learning AI/HPC GPU clusters at scale
Help researchers run their workflows on our clusters
Design, implement and support operational and reliability aspects of large scale distributed systems

Requirements

Bachelor's degree in Computer Science, Electrical Engineering or related field or equivalent experience
Minimum 5 years of experience designing and operating large scale compute infrastructure
Experience with advanced AI/HPC job schedulers such as Slurm, K8s, RTDA, or LSF
Experience analysing and tuning performance for a variety of AI/HPC workloads
Working knowledge of cluster configuration management tools and infrastructure level applications
In-depth understanding of container technologies
Experience learning development languages

Preferred Qualifications

Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
Background in machine learning and deep learning concepts, algorithms, and models
Familiarity with InfiniBand, including IPoIB and RDMA
Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
Familiarity with deep learning frameworks like PyTorch and TensorFlow

What We Offer

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have a diverse and talented team of engineers working on cutting-edge technology, and we are committed to fostering a work environment that is inclusive and supportive of all employees.

We are an equal opportunity employer and welcome applications from qualified candidates from diverse backgrounds. If you are a creative and autonomous engineer with a passion for technology, we encourage you to apply.

Senior Site Reliability Engineer

4 days ago

Santa Clara, California, United States NVIDIA Full time

Job Title: Senior Site Reliability Engineer. NVIDIA is a leader in AI, machine learning, and datacenter acceleration. Our company is expanding its leadership into datacenter networking with Ethernet switches, NICs, and DPUs. We have continuously reinvented ourselves over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market,...
Senior Site Reliability Engineer

4 weeks ago

Santa Clara, California, United States ServiceNow Full time

Company Overview: At ServiceNow, we harness technology to create a better world for everyone, driven by our talented workforce. We prioritize speed and innovation to meet the demands of our customers and communities. Joining ServiceNow means becoming part of a dynamic team of innovators who possess a relentless curiosity and a commitment to creativity. We...
Senior Staff Site Reliability Engineer

2 days ago

Santa Clara, California, United States Palo Alto Networks Full time

Job Description: Palo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Key Responsibilities: Contribute to the success of SRE and DevOps teams; Develop expertise...
Senior Site Reliability Engineer

4 weeks ago

Santa Clara, California, United States ServiceNow Full time

Company Overview: At ServiceNow, we harness technology to enhance global operations, and our dedicated workforce makes it all possible. We operate swiftly because the world demands it, innovating uniquely for our clients and communities. By becoming part of ServiceNow, you join a dynamic team of innovators who possess a relentless curiosity and a passion for...
Site Reliability Engineer

2 weeks ago

Santa Clara, California, United States Diverse Lynx Full time

About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud-based applications and infrastructure. Key Responsibilities: Design, implement, and maintain cloud infrastructure on...
Site Reliability Engineer

2 days ago

Santa Clara, California, United States Diverse Lynx Full time

Job Description: We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a key member of our infrastructure team, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems. Key Responsibilities: Design, implement, and maintain scalable and highly available cloud...
Site Reliability Engineer

17 hours ago

Santa Clara, California, United States Syntricate Technologies Full time

Job Description: We are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a key member of our infrastructure team, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems. Key Responsibilities: Design, implement, and maintain cloud infrastructure on AWS,...
Principal Site Reliability Engineer

2 days ago

Santa Clara, California, United States Palo Alto Networks Full time

Job Title: Principal Site Reliability Engineer. Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure. About the Role: We are looking for a seasoned engineer with expertise in...
Principal Site Reliability Engineer

1 week ago

Santa Clara, California, United States Palo Alto Networks Full time

About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in designing, building, and maintaining scalable and reliable infrastructure for our FedRAMP SASE product portfolio. Key Responsibilities: Design and implement scalable and reliable...
Principal Site Reliability Engineer

4 days ago

Santa Clara, California, United States Palo Alto Networks Full time

About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in designing, building, and maintaining scalable and reliable infrastructure for our FedRAMP SASE product portfolio. Key Responsibilities: Design and implement scalable and reliable...
Principal Site Reliability Engineer

2 days ago

Santa Clara, California, United States Palo Alto Networks Full time

About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in designing, building, and maintaining scalable and reliable infrastructure for our FedRAMP SASE product portfolio. Key Responsibilities: Design and implement scalable and reliable...
Cloud Site Reliability Engineer

2 days ago

Santa Clara, California, United States Centrify Corporation Full time

Cloud Site Reliability Engineer: At Centrify Corporation, we're seeking a skilled Cloud Site Reliability Engineer to join our Cloud DevOps team. As a key member of our operations team, you'll play a critical role in ensuring the uptime and delivery of our cloud-based services. Key Responsibilities: Manage our cloud application using DevOps and Agile practices to...
Principal Site Reliability Engineer

2 days ago

Santa Clara, California, United States Palo Alto Networks Full time

About the Role: Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our infrastructure platform, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Key Responsibilities: Contribute to the success of SRE and DevOps teams by developing expertise...
Site Reliability Engineer

11 hours ago

Santa Clara, California, United States Veear Full time

Job Description: We are seeking a highly skilled Site Reliability Engineer to join our team at Veear. As a key member of our infrastructure team, you will play a critical role in ensuring the security, compliance, and reliability of our systems. Key Responsibilities: Partner with development teams to ensure that applications have scalability and reliability...
Site Reliability Engineer

1 week ago

Santa Clara, California, United States ServiceNow Full time

Overview: The ServiceNow SRE team is a group of highly skilled engineers who are responsible for maintaining and developing the reliability, scalability, and performance of the ServiceNow cloud infrastructure. Key Responsibilities: Provide relief and sustainable resolution to issues within our infrastructure. Use expertise in software development, systems...
Site Reliability Engineer

4 days ago

Santa Clara, California, United States ServiceNow Full time

Overview: The ServiceNow SRE team is a group of highly technical engineers who are tasked with maintaining and developing the reliability, scalability, and performance of the ServiceNow cloud infrastructure. Our SREs are empowered to drive technical resolutions across the technology stack from hardware through to application and all stops in between. They are...
Principal Site Reliability Engineer

2 weeks ago

Santa Clara, California, United States Palo Alto Networks Full time

About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in ensuring the high availability and reliability of our applications and infrastructure. Key Responsibilities: Design, implement, and maintain scalable and reliable infrastructure; Build...
Principal Site Reliability Engineer

1 week ago

Santa Clara, California, United States Palo Alto Networks Full time

Job Description: Palo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our Global Customer Operation Team. As a Site Reliability Engineer, you will play a critical role in designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio. Key Responsibilities: Design and...
Site Reliability Engineer DevOps

4 days ago

Santa Clara, California, United States Palo Alto Networks Full time

Join Our Mission to End Breaches and Protect Digital Life. Palo Alto Networks is the fastest-growing security company in history, and we're looking for a motivated, intelligent, and creative individual to join our team as a Site Reliability Engineer DevOps. About the Role: We offer the chance to be part of an important mission: ending breaches and protecting our...
Principal Site Reliability Engineer

1 week ago

Santa Clara, California, United States Palo Alto Networks Full time

About the Role: We are seeking a highly skilled Principal Site Reliability Engineer to join our team at Palo Alto Networks. As a key member of our Global Customer Operation Team, you will be responsible for designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio. Key Responsibilities: Design and...
