Site Reliability Engineer
3 weeks ago
We are seeking a highly motivated Site Reliability Engineer to join our Applications Infrastructure organization. This team is responsible for automating, deploying, and maintaining infrastructure for various NVIDIA AI workflows and applications such as Metropolis, ACE, and Riva hosted in the cloud.
Key Responsibilities:
- Develop and integrate new software, tools, and analytics to improve the availability, scalability, latency, and efficiency of our cloud services.
- Manage upgrades and automated rollbacks across all clusters.
- Maintain Service Level Agreements (SLAs) by collaborating with developers to define Service Level Indicators (SLIs) and design stable, secure services.
- Guide the Change Advisory Board and Root Cause Corrective Action (RCCA) processes.
- Collaborate with engineering, DevOps, and product leads across the GPU cloud services stack to build fast, reliable, and durable production systems.
- Drive process changes to enhance the reliability and performance of cloud services.
- Debug production issues across services and levels of the stack.
- Improve operational processes.
Requirements:
- Bachelor's degree in Computer Science or a related field, or equivalent experience.
- 5+ years of experience in system design, complexity analysis, software design in Unix/Linux systems, performance tuning, and application issue resolution.
- 5+ years of experience in authoring and debugging software written in C++ and Python.
- Hands-on experience with Kubernetes-based cloud environments.
- Multi-cloud experience.
- Experience working with partners across multiple teams.
- Experience operating production systems.
- Background with Software as a Service (SaaS) offerings.
- Experience in application issues, algorithms, and data structures.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
The base salary range is 140,000 USD - 258,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.
-
Site Reliability Engineer
2 weeks ago
Santa Clara, California, United States Diverse Lynx Full time
Job Title: Site Reliability Engineer
We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems.
Key Responsibilities:
Design, implement, and maintain cloud infrastructure on AWS,...
-
Site Reliability Engineer
2 weeks ago
Santa Clara, California, United States Syntricate Technologies Full time
Job Title: Site Reliability Engineering
We are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a Site Reliability Engineer, you will be responsible for ensuring the reliability and scalability of our cloud-based systems.
Key Responsibilities:
Design and implement scalable and reliable cloud infrastructure using...
-
Site Reliability Engineer
3 weeks ago
Santa Clara, California, United States Insight Global Full time
Job Title: Site Reliability Engineer
We are seeking a highly skilled Site Reliability Engineer to join our team at Insight Global. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud-based infrastructure.
Key Responsibilities:
Design, implement, and maintain scalable and highly...
-
Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States Cryptoware Technologies Inc Full time
Job Title: Site Reliability Engineer
We are seeking a highly skilled Site Reliability Engineer to join our team at Cryptoware Technologies Inc. As a Site Reliability Engineer, you will be responsible for leading the global expansion of Huobi's globe-spanning infrastructure.
Key Responsibilities:
Lead the effort of global expansion of Huobi...
-
Site Reliability Engineer
3 weeks ago
Santa Clara, California, United States Syntricate Technologies Full time
Job Description
We are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems.
Key Responsibilities
Design, implement, and maintain cloud infrastructure on AWS, including EC2, SSM,...
-
Site Reliability Engineer
2 weeks ago
Santa Clara, California, United States Syntricate Technologies Full time
Job Description
We are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems.
Key Responsibilities:
Design, implement, and maintain cloud infrastructure on AWS, including EC2,...
-
Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States NVIDIA Full time
Job Title: Site Reliability Engineer
We are seeking a highly skilled Site Reliability Engineer to join our Applications Infrastructure organization at NVIDIA. This team is responsible for designing, deploying, and maintaining infrastructure for various NVIDIA AI workflows and applications hosted in the cloud.
Key Responsibilities:
Develop and integrate new...
-
Principal Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Description
Palo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will play a critical role in designing, building, and maintaining scalable and reliable infrastructure for our FedRAMP SASE product portfolio.
Key Responsibilities
Design and implement scalable and reliable...
-
Principal Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Title: Principal Site Reliability Engineer
We are seeking a highly skilled Principal Site Reliability Engineer to join our team at Palo Alto Networks. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure.
About the Role
This is a unique opportunity to work with a...
-
Principal Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role
Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a Principal Site Reliability Engineer, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure. You will work closely with developers, researchers, data scientists, and security experts to ensure...
-
Principal Site Reliability Engineer
2 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Description
Palo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will be responsible for designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio.
Key Responsibilities
Design and implement scalable and reliable...
-
Principal Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role
Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a Principal Site Reliability Engineer, you will be responsible for designing, implementing, and maintaining scalable and reliable infrastructure to support our mission-critical platforms.
Key Responsibilities
Design and implement scalable and...
-
Cloud Site Reliability Engineer
1 month ago
Santa Clara, California, United States Centrify Corporation Full time
Cloud Site Reliability Engineer
At Centrify Corporation, we're seeking a skilled Cloud Site Reliability Engineer to join our Cloud DevOps team. As a key member of our operations team, you'll play a critical role in ensuring the uptime and delivery of our cloud-based services.
Key Responsibilities:
Manage our cloud application using DevOps and Agile practices to...
-
Site Reliability Engineer
2 weeks ago
Santa Clara, California, United States NVIDIA Full time
Unlock the Power of Cloud Services
We are seeking a highly motivated Site Reliability Engineer to join our Applications Infrastructure organization. This team is responsible for automating, deploying, and maintaining infrastructure for various NVIDIA AI workflows and applications such as Metropolis, ACE, and Riva hosted in the cloud. The SRE role focuses on...
-
Principal Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Description
Palo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud-based security solutions.
Key Responsibilities
Design, build, and maintain scalable and reliable infrastructure for our...
-
Principal Site Reliability Engineer
1 week ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Description
Palo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will be responsible for designing, building, and maintaining scalable and reliable infrastructure for our cloud-based products.
Key Responsibilities:
Design and implement scalable and reliable infrastructure for...
-
Principal Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role
We are seeking a highly skilled Site Reliability Engineer to join our Global Customer Operations team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio.
Key Responsibilities
Design and...
-
Principal Site Reliability Engineer
4 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Title: Principal Site Reliability Engineer
We are seeking a highly skilled Principal Site Reliability Engineer to join our Global Customer Operations team at Palo Alto Networks. As a key member of our SRE team, you will be responsible for designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product...
-
Principal Site Reliability Engineer
3 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role
We are seeking a highly skilled Site Reliability Engineer to join our Global Customer Operation Team at Palo Alto Networks. As a Site Reliability Engineer, you will play a critical role in designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio.
Key Responsibilities
Design and...