Senior Cloud Reliability Engineer

4 weeks ago


Santa Clara, California, United States NVIDIA Full time

At NVIDIA, we're seeking a highly skilled Senior Cloud Reliability Engineer to join our team. As a key member of our Site Reliability Engineering (SRE) team, you'll be responsible for designing, building, and maintaining large-scale production systems with high efficiency and availability.

This is a highly specialized discipline that demands knowledge across systems, networking, coding, databases, capacity management, and continuous delivery and deployment, as well as open-source cloud-enabling technologies like Kubernetes and OpenStack.

As an SRE at NVIDIA, you'll ensure that our internal and external-facing GPU cloud services run with maximum reliability and uptime, while also enabling developers to make changes to the existing system through careful preparation and planning.

SRE is also a mindset and a set of engineering approaches to running and optimizing production systems.

Much of our software development focuses on eliminating manual work through automation, performance tuning, and improving the efficiency of production systems.

As an SRE, you'll be responsible for the big picture of how our systems relate to each other, using a breadth of tools and approaches to tackle a broad spectrum of problems.

Practices such as limiting time spent on reactive operational work, holding blameless postmortems, and proactively identifying potential outages feed the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work.

We promote self-direction to work on meaningful projects, while also striving to build an environment that provides the support and mentorship needed to learn and grow.

Key Responsibilities:

  • Design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging, and alerting.
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  • Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Be part of an on-call rotation to support production systems.

Requirements:

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
  • 5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production.
  • Experience in one or more of the following languages: Python, Go, Perl, or Ruby.
  • In-depth knowledge of Linux, networking, and containers.

What We Offer:

  • A competitive salary range of $132,000 - $310,500 USD, based on location, experience, and pay of employees in similar positions.
  • Eligibility for equity and benefits.
  • A diverse and inclusive work environment.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.


