Senior Site Reliability Engineer
19 hours ago
NVIDIA is a leader in the field of artificial intelligence, machine learning, and datacenter acceleration. With a rich history of innovation, we have continuously pushed the boundaries of what is possible in the world of computing.
Job Summary
We are seeking an experienced Site Reliability Engineer to join our GPU AI/HPC Infrastructure team. As a key member of our team, you will be responsible for designing and implementing large scale GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads.
Key Responsibilities
- Design and implement large scale GPU compute clusters
- Develop large scale automation solutions
- Maintain and build deep learning AI-HPC GPU clusters at scale
- Support researchers to run their flows on our clusters
- Design, implement and support operational and reliability aspects of large scale distributed systems
Requirements
- Bachelor's degree in Computer Science, Electrical Engineering or related field or equivalent experience
- Minimum 5 years of experience designing and operating large scale compute infrastructure
- Experience with AI/HPC advanced job schedulers and familiarity with schedulers such as Slurm, K8s, RTDA or LSF
- Experience analysing and tuning performance for a variety of AI/HPC workloads
- Working knowledge of cluster configuration management tools and infrastructure level applications
- In-depth understanding of container technologies
- Experience learning development languages
- Experience with NVIDIA GPUs, CUDA programming, NCCL and MLPerf benchmarking
- Background with Machine Learning and Deep Learning concepts, algorithms, models
- Familiarity with InfiniBand, including IPoIB and RDMA
- Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
- Familiarity with deep learning frameworks like PyTorch and TensorFlow
NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have a diverse and talented team of engineers working on cutting-edge technology, and we are committed to fostering a work environment that is inclusive and supportive of all employees.
We are an equal opportunity employer and welcome applications from qualified candidates from diverse backgrounds. If you are a creative and autonomous engineer with a passion for technology, we encourage you to apply.
-
Senior Site Reliability Engineer
4 days ago
Santa Clara, California, United States NVIDIA Full time
Job Title: Senior Site Reliability Engineer
NVIDIA is a leader in AI, machine learning, and datacenter acceleration. Our company is expanding its leadership into datacenter networking with ethernet switches, NICs, and DPUs. We have continuously reinvented ourselves over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market,...
-
Senior Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States ServiceNow Full time
Company Overview
At ServiceNow, we harness technology to create a better world for everyone, driven by our talented workforce. We prioritize speed and innovation to meet the demands of our customers and communities. Joining ServiceNow means becoming part of a dynamic team of innovators who possess a relentless curiosity and a commitment to creativity. We...
-
Senior Staff Site Reliability Engineer
2 days ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Description
Palo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.
Key Responsibilities:
- Contribute to the success of SRE and DevOps teams
- Develop expertise...
-
Senior Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States ServiceNow Full time
Company Overview
At ServiceNow, we harness technology to enhance global operations, and our dedicated workforce makes it all possible. We operate swiftly because the world demands it, innovating uniquely for our clients and communities. By becoming part of ServiceNow, you join a dynamic team of innovators who possess a relentless curiosity and a passion for...
-
Site Reliability Engineer
2 weeks ago
Santa Clara, California, United States Diverse Lynx Full time
About the Role
We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud-based applications and infrastructure.
Key Responsibilities
- Design, implement, and maintain cloud infrastructure on...
-
Site Reliability Engineer
2 days ago
Santa Clara, California, United States Diverse Lynx Full time
Job Description
We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a key member of our infrastructure team, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems.
Key Responsibilities
- Design, implement, and maintain scalable and highly available cloud...
-
Site Reliability Engineer
17 hours ago
Santa Clara, California, United States Syntricate Technologies Full time
Job Description
We are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a key member of our infrastructure team, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems.
Key Responsibilities
- Design, implement, and maintain cloud infrastructure on AWS,...
-
Principal Site Reliability Engineer
2 days ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Title: Principal Site Reliability Engineer
Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure.
About the Role
We are looking for a seasoned engineer with expertise in...
-
Principal Site Reliability Engineer
1 week ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role
We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in designing, building, and maintaining scalable and reliable infrastructure for our FedRAMP SASE product portfolio.
Key Responsibilities
- Design and implement scalable and reliable...
-
Principal Site Reliability Engineer
4 days ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role
We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in designing, building, and maintaining scalable and reliable infrastructure for our FedRAMP SASE product portfolio.
Key Responsibilities
- Design and implement scalable and reliable...
-
Principal Site Reliability Engineer
2 days ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role
We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in designing, building, and maintaining scalable and reliable infrastructure for our FedRAMP SASE product portfolio.
Key Responsibilities
- Design and implement scalable and reliable...
-
Cloud Site Reliability Engineer
2 days ago
Santa Clara, California, United States Centrify Corporation Full time
Cloud Site Reliability Engineer
At Centrify Corporation, we're seeking a skilled Cloud Site Reliability Engineer to join our Cloud DevOps team. As a key member of our operations team, you'll play a critical role in ensuring the uptime and delivery of our cloud-based services.
Key Responsibilities:
- Manage our cloud application using DevOps and Agile practices to...
-
Principal Site Reliability Engineer
2 days ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role
Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our infrastructure platform, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.
Key Responsibilities
- Contribute to the success of SRE and DevOps teams by developing expertise...
-
Site Reliability Engineer
11 hours ago
Santa Clara, California, United States Veear Full time
Job Description:
We are seeking a highly skilled Site Reliability Engineer to join our team at Veear. As a key member of our infrastructure team, you will play a critical role in ensuring the security, compliance, and reliability of our systems.
Key Responsibilities:
- Partner with development teams to ensure that applications have scalability and reliability...
-
Site Reliability Engineer
1 week ago
Santa Clara, California, United States ServiceNow Full time
Overview
The ServiceNow SRE team is a group of highly skilled engineers who are responsible for maintaining and developing the reliability, scalability, and performance of the ServiceNow cloud infrastructure.
Key Responsibilities
- Provide relief and sustainable resolution to issues within our infrastructure.
- Use expertise in software development, systems...
-
Site Reliability Engineer
4 days ago
Santa Clara, California, United States ServiceNow Full time
Overview
The ServiceNow SRE team is a group of highly technical engineers who are tasked with maintaining and developing the reliability, scalability, and performance of the ServiceNow cloud infrastructure. Our SREs are empowered to drive technical resolutions across the technology stack from hardware through to application and all stops in between. They are...
-
Principal Site Reliability Engineer
2 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role
We are seeking a highly skilled Site Reliability Engineer to join our team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in ensuring the high availability and reliability of our applications and infrastructure.
Key Responsibilities
- Design, implement, and maintain scalable and reliable infrastructure
- Build...
-
Principal Site Reliability Engineer
1 week ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Description
Palo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our Global Customer Operation Team. As a Site Reliability Engineer, you will play a critical role in designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio.
Key Responsibilities
- Design and...
-
Site Reliability Engineer DevOps
4 days ago
Santa Clara, California, United States Palo Alto Networks Full time
Join Our Mission to End Breaches and Protect Digital Life
Palo Alto Networks is the fastest-growing security company in history, and we're looking for a motivated, intelligent, and creative individual to join our team as a Site Reliability Engineer DevOps.
About the Role
We offer the chance to be part of an important mission: ending breaches and protecting our...
-
Principal Site Reliability Engineer
1 week ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role
We are seeking a highly skilled Principal Site Reliability Engineer to join our team at Palo Alto Networks. As a key member of our Global Customer Operation Team, you will be responsible for designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio.
Key Responsibilities
- Design and...