Senior SRE Software Engineer

3 months ago


Santa Clara, United States NVIDIA Full time
Site Reliability Engineering (SRE) is an engineering discipline that involves designing, building, and maintaining large-scale production systems with high efficiency and availability. It encompasses various areas, including software and systems engineering practices, storage, data management, and services. SRE professionals are highly specialized and possess expertise in different domains such as systems, networking, storage, coding, database management, capacity management, continuous delivery and open-source cloud-enabling technologies like Kubernetes, containers, and virtualization. Their responsibilities encompass ensuring reliable storage solutions, managing data efficiently, and providing related services to support the overall stability and performance of the production systems.

SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. SRE is also a mindset and a set of engineering approaches to running efficient production systems, focusing on eliminating manual work through modern automation practices and performance tuning. Limiting time spent on reactive operational work, holding blameless postmortems, and proactively identifying potential outages are key to product quality and make for interesting and dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem-solving, and openness is important to its success. We encourage collaboration, thinking big, and taking risks in a blame-free environment. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing

  • Assist in designing, implementing, and supporting large-scale storage clusters and data services.
  • Build and improve service reliability tools and frameworks - logging, tracing, and alerts.
  • Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.
  • Support services before they go live through activities such as system design consulting, developing software and frameworks, capacity management, launch reviews, managing data ingress and egress across multiple data centers and public clouds.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Be part of an on-call rotation to support production systems.

What We Need To See

  • BS degree in Computer Science or related technical field involving coding/automation or equivalent experience.
  • At least 5 years of equivalent practical experience.
  • Experience with Infrastructure as Code and Configuration Management Tools: Terraform, CloudFormation, CDK, Ansible
  • Proficiency in one or more of the following: Python, Golang
  • Knowledge of Public Clouds: AWS, Azure, GCP
  • Expertise with Kubernetes, and common Kubernetes Tooling and Approaches such as GitOps and CI/CD
  • Skills with Observability: Logging and Metrics with tools such as Prometheus, Mimir, Loki, Graylog, Grafana

Ways To Stand Out From The Crowd

  • Demonstrated SRE mindset, with a customer-first approach and a passion for ensuring customer success. Experience with Git, code review, pipelines, and CI/CD.
  • Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.
  • Thrive in collaborative environments and enjoy working with various teams. Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker. Experience with building Cloud Architectures: Serverless, Containers

The base salary range is 144,000 USD - 270,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


  • Santa Clara, California, United States Sage Lake Senior Living Full time

    About the Role: We are seeking a seasoned Senior SRE Engineer to join our team at Sage Lake Senior Living, where you will play a critical role in ensuring the high availability and performance of our AI-powered applications. Key Responsibilities: Operate and improve the observability and maintainability of our distributed microservice cloud applications and...

  • Senior SRE Engineer

    3 weeks ago


    Santa Clara, United States Trillium Staffing Full time

    Trillium Professional is now seeking Senior SRE Engineers in Santa Clara, CA! Pay rate is $75 - $90/hour, depending on experience. Our client is looking for a seasoned SRE to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced...

  • Senior SRE Engineer

    2 weeks ago


    Santa Clara, United States NVIDIA Full time

    NVIDIA is looking for a seasoned SRE to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced crew that develops and maintains NVIDIA’s sophisticated internal cloud provisioning product for GPUs and Tegra systems. The team works...


  • Santa Clara, United States Sage Lake Senior Living Full time

    NVIDIA is the platform upon which every new AI-powered application is built. We are seeking a senior SRE to monitor and operate both the factory automation for NVIDIA Inference Microservices (NIMs) and its deployed services. The right person for this role brings technical drive and creativity to change the way NVIDIA provides high-performance inferencing for...

  • Senior SRE Engineer

    4 weeks ago


    Santa Clara, United States NVIDIA Full time

    NVIDIA is looking for a seasoned SRE to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced crew that develops and maintains NVIDIA’s internal cloud provisioning product for GPUs and Tegra systems. The team works with various...

  • Senior SRE Engineer

    2 months ago


    Santa Clara, United States NVIDIA Full time

    NVIDIA is looking for a seasoned SRE to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced crew that develops and maintains NVIDIA’s sophisticated internal cloud provisioning product for GPUs and Tegra systems. The team works...


  • Santa Clara, California, United States NVIDIA Full time

    About the Role: NVIDIA is seeking a highly skilled Senior Production SRE Engineer to join our team. As a key member of our SRE team, you will be responsible for designing, implementing, and supporting large-scale storage clusters, including monitoring, logging, and alerting. You will work closely with peers on the team to improve the lifecycle of services –...


  • Santa Clara, California, United States Nvidia Full time

    Job Summary: NVIDIA is seeking a highly skilled Senior Production SRE Engineer to join our team. As a key member of our SRE team, you will be responsible for designing, implementing, and supporting large-scale storage clusters, as well as working with AI/ML workloads to capture and correlate behavior in large clusters and workflows. Key Responsibilities: Assist...

  • Senior Network SRE

    3 weeks ago


    Santa Clara, United States Diverse Lynx Full time

    Role: Senior Network SRE. Hybrid, 3 days in office, Santa Clara, CA. Contract role. The Network Support and SRE team is in search of a seasoned Network SRE technical lead to help actualize the SRE vision for our network infrastructure. This role demands a unique blend of hands-on expertise in network operations, engineering, and observability....


  • Santa Clara, California, United States Sage Lake Senior Living Full time

    About the Role: We are seeking a seasoned Senior SRE Engineer to join our team at Sage Lake Senior Living, where you will play a critical role in monitoring and operating our NVIDIA Inference Microservices (NIMs) factory automation and deployed services. Key Responsibilities: Operate a software factory that takes an AI model as input and produces a deployable...

  • Senior Manager

    4 months ago


    Santa Clara, United States NVIDIA Full time

    As a Sr Manager in Site Reliability Engineering (SRE), you will lead a team dedicated to the design, construction, and maintenance of expansive production systems, emphasizing high efficiency and availability. This role spans various domains, including software and systems engineering, cloud-scale storage, data management, and services. SRE Senior Managers...

  • Senior Network SRE

    4 weeks ago


    Santa Clara, United States TekWissen LLC Full time

    Job Description. Overview: TekWissen Group is a workforce management provider throughout the USA and many other countries in the world. Our client is an American multinational information technology services and consulting company and is a leading provider of information technology, consulting, and business process outsourcing services,...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Title: Senior Network SRE Lead. We are seeking a seasoned Senior Network SRE Lead to join our team at Diverse Lynx LLC. As a key member of our Network Support and SRE team, you will play a crucial role in actualizing our SRE vision for our network infrastructure. Key Responsibilities: Owning the operational aspect of the network infrastructure, ensuring its...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Title: Senior Software Engineer, Wildfire. We are seeking a highly skilled Senior Software Engineer to join our Wildfire team in the Content Delivered Security Service (CDSS) organization. As a key member of our engineering and Security Research team, you will play a critical role in delivering the best of security services in the cloud to prevent...

  • Sr. SRE Engineer

    3 months ago


    Santa Clara, United States TCWGlobal Full time

    Sr. SRE Engineer. W2 Contract to Possible Hire. Hybrid, Santa Clara, CA. $75-90/hr + PTO, Paid Holidays, Benefits. We are looking for a seasoned SRE to join our multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced crew that develops and...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Title: Senior Backend Software Engineer. We are seeking a highly skilled Senior Backend Software Engineer to join our team at Palo Alto Networks. As a key member of our engineering team, you will be responsible for designing and developing distributed backend services that serve as the backbone of our cloud-delivered security platform. About the Role: As a...

  • Sr. SRE Engineer

    3 months ago


    Santa Clara, United States TCWGlobal Full time

    Job Description: Sr. SRE Engineer. W2 Contract to Possible Hire. Hybrid, Santa Clara, CA. $75-90/hr + PTO, Paid Holidays, Benefits. We are looking for a seasoned SRE to join our multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced...



  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: We are seeking a highly skilled Senior Backend Software Engineer to join our team at Palo Alto Networks. As a key member of our engineering team, you will be responsible for designing and developing distributed backend services that serve as the backbone of our cloud-delivered security platform. Key Responsibilities: Analyze requirements and...

