Current jobs related to Senior SRE Engineer - Santa Clara - NVIDIA


  • Santa Clara, California, United States NVIDIA Full time

    Senior SRE Engineering Leader: NVIDIA is a pioneer in the AI revolution, driving innovation in industries with our cutting-edge GPU technology. Our GPUs power groundbreaking advancements in AI, from self-driving cars to innovative research in computer vision, speech recognition, and more. We're seeking visionary leaders to join us on an exciting journey as...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA is a leader in the AI revolution, driving innovation in industries with our cutting-edge GPU technology. Our GPUs power groundbreaking advancements in AI, big data, and deep learning. We're seeking visionary leaders to join us as Senior SRE Engineering Leader. As a key member of our team, you'll lead our globally distributed clusters, ensuring seamless...


  • Santa Clara, California, United States NVIDIA Full time

    About the Role: NVIDIA is seeking a highly skilled Senior Production SRE Engineer to join our team. As a key member of our SRE team, you will be responsible for designing, implementing, and supporting large-scale storage clusters, including monitoring, logging, and alerting. You will work closely with peers on the team to improve the lifecycle of services –...


  • Santa Clara, California, United States NVIDIA Full time

    Lead the Way in AI Innovation: NVIDIA is revolutionizing industries with our cutting-edge GPU technology, driving groundbreaking innovations in AI, big data, and deep learning. We're seeking visionary leaders to join our team as Senior SRE Engineering Leader, responsible for managing globally distributed clusters and delivering AI services that drive...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Title: Senior Network SRE Lead. We are seeking a seasoned Senior Network SRE Lead to join our team at Diverse Lynx LLC. As a key member of our Network Support and SRE team, you will play a crucial role in actualizing our SRE vision for our network infrastructure. Key Responsibilities: Owning the operational aspect of the network infrastructure, ensuring its...


  • Santa Clara, California, United States Diverse Lynx Full time

    Senior Network SRE Role: We are seeking a seasoned Network SRE technical lead to help actualize the SRE vision for our network infrastructure. This role demands a unique blend of hands-on expertise in network operations, engineering, and observability. Key Responsibilities: Owning the operational aspect of the network infrastructure, ensuring its high...


  • Santa Clara, California, United States Nvidia Full time

    Job Title: Senior Wireless Network SRE. We are seeking a highly skilled Senior Wireless Network SRE to join our team. As a key member of our Network Support and SRE team, you will be responsible for ensuring the high availability and reliability of our wireless infrastructure. Key Responsibilities: Owning the operational aspect of the wireless infrastructure,...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Title: Senior Network SRE. We are seeking a seasoned Senior Network SRE to lead our network infrastructure team in achieving Service Level Objectives (SLOs) and minimizing manual labor. Key Responsibilities: Owning the operational aspect of the network infrastructure, ensuring high availability and reliability. Partnering with architecture, tooling, and...

  • Senior Manager

    5 months ago


    Santa Clara, United States NVIDIA Full time

    As a Sr Manager in Site Reliability Engineering (SRE), you will lead a team dedicated to the design, construction, and maintenance of expansive production systems, emphasizing high efficiency and availability. This role spans various domains, including software and systems engineering, cloud-scale storage, data management, and services. SRE Senior Managers...


  • Santa Clara, California, United States NVIDIA Full time

    Job Description: We are seeking a highly skilled Sr. Data Engineer, SRE to join our team at NVIDIA. As a key member of our data science and reporting team, you will be responsible for designing and delivering high-performance services and libraries, building streaming data pipelines, and partnering with other engineering and business teams to integrate your...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Summary: We are seeking a seasoned Network SRE technical lead to help actualize the SRE vision for our network infrastructure. As a key member of our Network Support and SRE team, you will be responsible for owning the operational aspect of the network infrastructure, ensuring its high availability and reliability. Key Responsibilities: Partner with...


  • Santa Clara, California, United States NVIDIA Full time

    Job Description: NVIDIA is seeking a Senior Site Reliability Engineer to join our AI Efficiency Team. As a key member of this team, you will contribute to the development of infrastructure that powers our innovative AI research. The AI Efficiency Team focuses on optimizing efficiency and resiliency of AI workloads, as well as developing scalable AI and Data...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Summary: We are seeking a seasoned Network SRE technical lead to help drive the SRE vision for our network infrastructure. This role demands a unique blend of hands-on expertise in network operations, engineering, and observability. Key Responsibilities: Owning the operational aspect of the network infrastructure, ensuring its high availability and...


  • Santa Clara, California, United States NVIDIA Full time

    About NVIDIA: NVIDIA is a leader in the technology world, known for its innovative and forward-thinking approach to AI-powered applications. We're a company that values diversity and creativity, and we're looking for talented individuals to join our team. About the Role: We're seeking a highly skilled SRE Manager to lead our team in building and managing SREs...


  • Santa Clara, California, United States NVIDIA Full time

    About the Role: We are seeking a highly skilled SRE Manager to lead our NVIDIA Inference Microservices (NIM) team. As a key member of our organization, you will be responsible for building and managing a team of SREs who monitor and operate the factory automation for NIMs and its deployed services. Key Responsibilities: Lead the operation of highly available...


  • Santa Clara, California, United States NVIDIA Full time

    About NVIDIA: NVIDIA is the driving force behind the innovation revolution in AI, computing, and graphics. We are a leader in the development of technologies that power the world's most advanced computing systems. Job Title: SRE Manager, NIM Factory. We are seeking a highly skilled SRE Manager to join our NIM Factory team. As an SRE Manager, you will be...


  • Santa Clara, California, United States NVIDIA Full time

    About NVIDIA: NVIDIA is a leader in the development of AI-powered applications, and we're seeking a highly skilled SRE Manager to join our team. As a pioneer in the field of AI, we're committed to pushing the boundaries of what's possible with technology. Job Summary: We're looking for a talented SRE Manager to lead our NIM Factory team. As a key member of our...


  • Santa Clara, California, United States NVIDIA Full time

    At NVIDIA, we're seeking a highly skilled Senior Cloud Reliability Engineer to join our team. As a key member of our Site Reliability Engineering (SRE) team, you'll be responsible for designing, building, and maintaining large-scale production systems with high efficiency and availability. This is a highly specialized discipline that demands knowledge across...


  • Santa Clara, California, United States NVIDIA Full time

    About NVIDIA: NVIDIA is the driving force behind the innovation revolution in AI, computing, and graphics. We are a leader in the development of technologies that power the world's most advanced computing systems. Job Summary: We are seeking a highly skilled SRE Manager to join our NVIDIA Inference Microservices (NIM) team. As a key member of our team, you will...


  • Santa Clara, California, United States NVIDIA Full time

    Join NVIDIA's AI Efficiency Team: We are seeking a Senior Site Reliability Engineer to contribute to the infrastructure that powers our innovative AI research. About the Role: This team focuses on optimizing efficiency and resiliency of AI workloads, as well as developing scalable AI and Data infrastructure tools and services. Our objective is to deliver a stable,...

Senior SRE Engineer

2 months ago


Santa Clara, United States NVIDIA Full time

NVIDIA is looking for a seasoned SRE to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced crew that develops and maintains NVIDIA’s internal cloud provisioning product for GPUs and Tegra systems. The team works with various other business units within NVIDIA Software such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars to cater to their infrastructure & systems needs. As an SRE, you’ll also be working in conjunction with various teams such as software engineering to deploy these new products and manage our infrastructure, associated processes, and systems. Keen attention to detail, problem-solving abilities, and a solid knowledge base are essential.

What you’ll be doing:

  • Kubernetes System Administration for DevOps & CI/CD.
  • Designing and implementing clusters, cluster segmentation, internal/external networking for multiple clusters and environments.
  • Monitor system performance and troubleshoot issues related to CPU, memory, disk, and network utilization.
  • Architect CI/CD pipelines for container build and deployment.
  • Craft and develop tools needed for automating workflows.
  • Develop, improve, and maintain our infrastructure codebase.
  • Craft and implement critical metrics using various analytics methods and dashboards.
  • Take part in prototyping, crafting, and developing cloud infrastructure for NVIDIA.
  • Reuse AI techniques to extract useful signals about machines and jobs from the data generated.

What we need to see:

  • Kubernetes domain expertise with extensive experience building scalable, resilient platforms in both public and private cloud capable of providing platform engineering / architecture standard methodologies (including experience with architecting and implementing the overall platform, orchestration, security, and monitoring ecosystem).
  • High proficiency in administering and configuring Kubernetes.
  • Proficient with CI/CD pipelines like Jenkins, GitLab CI, GitHub Actions, ArgoCD, etc.
  • Experience with data analytics/visualization tools like Kibana, Grafana, Splunk, etc.
  • Strong Ansible skills. Experience with other configuration tools like Chef and Puppet is also good to have.
  • Proficient using source code management and binary repository systems like GitLab, GitHub, Artifactory, Perforce, etc.
  • Knowledge of monitoring systems such as Zabbix, Alertmanager, PagerDuty, and/or similar systems.
  • Well versed in Prometheus, writing custom exporters and PromQL.
  • 8+ years of proven experience.
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.

Ways to stand out from the crowd:

  • Experience managing NVIDIA hardware like GPUs and Tegras.
  • Background with GitLab CI.
  • Experience with building and deploying containers.
  • Solid understanding of containerization and microservices architecture.
  • Certified Kubernetes Administrator (CKA), Certified Kubernetes Security Specialist (CKS) & Certified Kubernetes Application Developer (CKAD) preferred.
  • Ability to design simple systems that can work efficiently without needing much support.

With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us and, due to outstanding growth, our exclusive engineering teams are rapidly growing. If you're a creative and autonomous engineer with a real passion for technology, we want to hear from you.

The base salary range is 140,000 USD - 258,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.