Senior Cloud Infrastructure Engineer

4 weeks ago


Santa Clara, California, United States NVIDIA Full time
Job Title: Senior Site Reliability Engineer

We are seeking a highly motivated and experienced Senior Site Reliability Engineer to join our Embedded organization. This team is responsible for automating, deploying, and maintaining infrastructure for various NVIDIA AI workflows and applications such as Metropolis, ACE, and Riva hosted in the cloud.

Key Responsibilities:
  • Develop and integrate new software, tools, and analytics to improve the availability, scalability, latency, and efficiency of our cloud services.
  • Manage upgrades and automated rollbacks across all clusters.
  • Maintain Service Level Agreements (SLAs) by collaborating with developers to define Service Level Indicators (SLIs) and design stable, secure services.
  • Guide the Change Advisory Board and Root Cause Corrective Action (RCCA) processes.
  • Collaborate with engineering, DevOps, and product leads across the GPU cloud services stack to build fast, reliable, and durable production systems.
  • Drive process changes to enhance the reliability and performance of cloud services.
  • Debug production issues across services and levels of the stack.
  • Improve operational processes.

Requirements:
  • Bachelor's degree in Computer Science or a related field, or equivalent experience.
  • 5+ years of experience in system design, complexity analysis, software design in Unix/Linux systems, performance tuning, and application issue resolution.
  • 5+ years of experience in authoring and debugging software written in C++ and Python.
  • Hands-on experience with Kubernetes-based cloud environments.
  • Multi-cloud experience.
  • Experience working with partners across multiple teams.
  • Background in operating production systems.

Preferred Qualifications:
  • Background in Software as a Service (SaaS) offerings.
  • Experience with application issue resolution, algorithms, and data structures.

We offer competitive salaries and a generous benefits package. Our best-in-class engineering teams are rapidly growing, and we are widely considered to be one of the technology world's most desirable employers. If you're a creative and autonomous engineer with a real passion for technology, we want to hear from you.



  • Santa Clara, California, United States NVIDIA Full time

    Job Title: Senior Site Reliability Engineer. NVIDIA is seeking a highly skilled Senior Site Reliability Engineer to join our Infrastructure, Planning and Process (IPP) team. As a key member of our global organization, you will play a critical role in designing and implementing scalable, reliable, and efficient cloud infrastructure solutions. Our cloud services...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Description: Palo Alto Networks is seeking a highly skilled Senior Cloud Infrastructure Engineer to join our CDL/SLS team. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Key Responsibilities: Design and implement scalable and reliable cloud infrastructure using Terraform,...


  • Santa Clara, California, United States Oracle Full time

    Job Title: Senior Cloud Infrastructure Developer. We are seeking a highly skilled Senior Cloud Infrastructure Developer to join our Oracle Cloud Infrastructure (OCI) Platform Integration (PINT) team. As a key member of our team, you will be responsible for designing, implementing, and maintaining cloud infrastructure solutions that meet the needs of our...


  • Santa Clara, California, United States NVIDIA Full time

    We are seeking a highly motivated Senior Cloud Infrastructure Engineer to join our Embedded organization. This team is responsible for automating, deploying, and maintaining infrastructure for various NVIDIA AI workflows and applications such as Metropolis, ACE, and Riva hosted in the cloud. The ideal candidate will focus on ensuring production health to...


  • Santa Clara, California, United States NVIDIA Full time

    Job Description: NVIDIA is seeking a Senior Site Reliability Engineer to join our AI Efficiency Team. As a key member of this team, you will contribute to the development of infrastructure that powers our innovative AI research. The AI Efficiency Team focuses on optimizing efficiency and resiliency of AI workloads, as well as developing scalable AI and Data...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our Cortex Data Lake team. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Key Responsibilities: Contribute to the success of our SRE and DevOps teams by developing...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About Us: Palo Alto Networks is a leading cybersecurity company that protects the digital way of life. Our mission is to be the cybersecurity partner of choice, and we're committed to providing innovative solutions to prevent cyberattacks. Job Description: We're seeking a highly skilled Senior Staff DevOps Engineer to join our CDL/SLS team. As a key member of our...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our infrastructure platform team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Our infrastructure platform stack includes Terraform, Kubernetes, GitLab...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Senior Staff DevOps Engineer to join our CDL/SLS team. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Our infrastructure platform stack includes Terraform, Kubernetes, GitLab CI/CD, GitOps, Prometheus, Grafana,...


  • Santa Clara, California, United States Astera Labs Full time

    Astera Labs: Transforming Data-Driven Applications. Astera Labs is a global leader in purpose-built connectivity solutions that unlock the full potential of AI and cloud infrastructure. Our Intelligent Connectivity Platform integrates PCIe, CXL, and Ethernet semiconductor-based solutions and the COSMOS software suite of system management and optimization tools...


  • Santa Clara, California, United States NVIDIA Full time

    We are seeking a highly skilled Senior Systems Engineer to work on scaling our cloud compute platform for Autonomous Vehicles (AV). Our platform provides access to 100s of PBs of data and exa-scale GPU+CPU compute for various AV workloads including data ingestion, processing and model training. We are embarking on building the next generation of the platform...


  • Santa Clara, California, United States Oracle Full time

    Job Summary: We are seeking a highly skilled Senior Cloud Infrastructure Developer to join our team at Oracle. As a key member of our team, you will be responsible for designing, implementing, and maintaining our cloud infrastructure. This is a unique opportunity to work with cutting-edge technology and be part of a dynamic team that is shaping the future of...


  • Santa Clara, California, United States Amazon Full time

    About the Role: We are seeking a highly skilled Senior Product Marketing Manager for Cloud Infrastructure to join our team at Amazon. As a key member of our Sales, Marketing, and Global Services organization, you will be responsible for developing and executing marketing strategies to drive revenue growth and customer adoption of our cloud infrastructure...


  • Santa Monica, California, United States Volt Full time

    Job Summary: Volt is seeking a highly skilled Senior Cloud Engineer to join our team in Santa Monica, CA. As a Senior Cloud Engineer, you will play a key role in supporting the Cloud Infrastructure team in migrating our infrastructure from GCP to Azure. Key Responsibilities: Collaborate with the Cloud Infrastructure team to identify opportunities to migrate data...


  • Santa Clara, California, United States eTeam Full time

    Job Title: Cloud Infrastructure Architect. We are seeking a highly skilled Cloud Infrastructure Architect to join our eTeam team. As a key member of our team, you will be responsible for designing and implementing scalable, secure, and efficient cloud infrastructure solutions on Google Cloud Platform (GCP). Key Responsibilities: Design and implement cloud...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Description: At Diverse Lynx LLC, we are seeking a skilled Cloud Engineer to join our team. As a key member of our infrastructure team, you will be responsible for designing, implementing, and maintaining our cloud infrastructure. Key Responsibilities: Design and implement cloud infrastructure solutions using AWS, Azure, or Google Cloud...


  • Santa Clara, California, United States Astera Labs Full time

    Astera Labs Job Description: Astera Labs is a global leader in purpose-built connectivity solutions that unlock the full potential of AI and cloud infrastructure. Our Intelligent Connectivity Platform integrates PCIe, CXL, and Ethernet semiconductor-based solutions and the COSMOS software suite of system management and optimization tools to deliver a...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Description: At Palo Alto Networks, we're committed to providing innovative cybersecurity solutions that protect our digital way of life. As a Senior Cloud Security Engineer, you'll play a vital role in shaping the future of our cloud-delivered security platform, Prisma Access. Your Career: Prisma Access is a highly scalable cloud service that addresses the...


  • Santa Clara, California, United States NVIDIA Full time

    Job Title: Senior Cloud Reliability Engineer. We are seeking a highly motivated Senior Cloud Reliability Engineer to join our Embedded organization. This team is responsible for automating, deploying, and maintaining infrastructure for various NVIDIA AI workflows and applications such as Metropolis, ACE, and Riva hosted in the cloud. The Senior Cloud Reliability...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Palo Alto Networks is a leader in cloud security, and we're seeking a skilled Senior Staff SQA Engineer to join our Cloud Intelligence team. As a key member of our team, you will be responsible for ensuring the quality and reliability of our cloud-based security solutions. Responsibilities: Develop and execute comprehensive test plans and test cases to ensure...