Senior AI Resiliency Software Engineer

6 days ago


Santa Clara, California, United States | NVIDIA | Full time
Job Description

We are seeking a highly skilled Principal Engineer for AI Software Resiliency to join our team at NVIDIA.

Key Responsibilities
  • Lead the development of AI software resiliency features for our most powerful AI supercomputers.
  • Collaborate with multiple teams and stakeholders to align on mission requirements and ensure successful integration of resiliency features into AI frameworks.
  • Partner with TPMs, PMs, and QA teams to ensure timely and successful launch of resiliency features.
  • Develop and implement critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs.
  • Drive engineering excellence by contributing to large software codebases and ensuring high code quality and rigorous testing.
Requirements
  • Master's or Ph.D. in Computer Science, Electrical Engineering, Computer Engineering, or a related field from a reputable institution.
  • A minimum of 10 years of experience in systems architecture or related fields, with a deep understanding of distributed systems and large-scale AI infrastructure.
  • At least 10 years of hands-on experience in software development for distributed systems and at least 5 years developing AI frameworks such as PyTorch or JAX/XLA.
  • Proven track record of working effectively across multiple engineering fields and communicating complex technical concepts to a diverse set of collaborators.
Preferred Qualifications
  • Experience with large-scale AI supercomputing applications, including in-depth knowledge of AI workload training and inference requirements.
  • A strong passion for developing AI-specific system architectures, including CPUs, GPUs, memory, storage, and networking.
  • Hands-on involvement in the design, development, and deployment of large-scale AI supercomputers.
  • Practical experience adopting and applying high-performance computing (HPC) software development practices in large-scale environments.
What We Offer

NVIDIA offers a competitive salary range of $272,000 - $419,750, based on location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.


