Principal Software Engineer for AI Resiliency

2 months ago


Santa Clara, California, United States NVIDIA Full time
Job Description

We are seeking a highly skilled Principal Engineer for AI Software Resiliency to join our team at NVIDIA. As a key member of our organization, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs.

Key Responsibilities
  • Develop and lead the execution of software resiliency features, including fast checkpoint recovery, automatic error detection, error isolation, silent data corruption (SDC) detection and mitigation, and straggler/hang detection.
  • Collaborate with multiple teams and stakeholders across NVIDIA to align on mission requirements, provide regular updates, and ensure the successful integration of resiliency features into AI frameworks like PyTorch and JAX/XLA.
  • Partner with TPMs, PMs, and QA teams to ensure the timely and successful launch of resiliency features.
  • Develop and maintain large software codebases, ensuring high code quality and rigorous testing while solving complex challenges.
  • Lead others by example, fostering a culture of collaboration, innovation, and continuous improvement.
Requirements
  • Master's or Ph.D. in Computer Science, Electrical Engineering, Computer Engineering, or a related field from a reputable institution, or equivalent experience.
  • A minimum of 10 years of experience in systems architecture or related fields, with a deep understanding of distributed systems and large-scale AI infrastructure.
  • At least 10 years of hands-on experience in software development for distributed systems and at least 5 years developing AI frameworks such as PyTorch or JAX/XLA.
  • Proven track record of working effectively across multiple engineering fields and communicating complex technical concepts to a diverse set of collaborators.
Preferred Qualifications
  • Experience with large-scale AI supercomputing applications, including in-depth knowledge of AI workload training and inference requirements.
  • A strong passion for developing AI-specific system architectures, including CPUs, GPUs, memory, storage, and networking.
  • Hands-on involvement in the design, development, and deployment of large-scale AI supercomputers.
  • Practical experience developing high-performance computing (HPC) software in large-scale environments.
What We Offer

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. The salary range for this role is $272,000 - $419,750 USD, determined by location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.