Lead Software Engineer for AI Supercomputing

7 days ago


Santa Clara, California, United States NVIDIA Full time

We are seeking a highly skilled Principal Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world.

As a lead focused on AI Software Resiliency, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs.

Your expertise will be crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times.


Key Responsibilities:

AI Resiliency Expertise:

Serve as a trusted authority on AI software resiliency, guiding the architecture, modeling, and scoping of resiliency features to support frontier model training at scale.


Feature Development:

Lead the execution and development of software resiliency features, including fast checkpoint-recovery, automatic error detection, error isolation, silent data corruption (SDC) detection and mitigation, and straggler/hang detection.


Software Engineering Leadership:

Drive engineering excellence by contributing to large software codebases, ensuring high code quality and rigorous testing, and solving complex challenges.

Lead others by example, fostering a culture of collaboration, innovation, and continuous improvement.

Cross-Team Collaboration:

Work closely with multiple teams and stakeholders across NVIDIA to align on mission requirements, provide regular updates, and ensure the successful integration of resiliency features into AI frameworks like PyTorch and JAX/XLA.


Customer Engagement:

Collaborate directly with major customers to embed AI resilience features into their AI frameworks, ensuring seamless integration and optimal performance.


Product Delivery:
Partner effectively with TPMs, PMs, and QA teams to ensure the timely and successful launch of resiliency features.

Requirements:
Master's or Ph.D. in Computer Science, Electrical Engineering, Computer Engineering, or a related field from a reputed institution, or equivalent experience.

A minimum of 10 years of experience in systems architecture or related fields, with a deep understanding of distributed systems and large-scale AI infrastructure.

At least 10 years of hands-on experience in software development for distributed systems and 5 years in developing AI frameworks such as PyTorch or JAX/XLA.

Proven track record of working effectively across multiple engineering fields and communicating complex technical concepts to a diverse set of collaborators.


Preferred Qualifications:

AI Supercomputing Expertise:
Experience with large-scale AI supercomputing applications, including in-depth knowledge of AI workload training and inference requirements.

System Architecture Passion:
A strong passion for developing AI-specific system architectures, including CPUs, GPUs, memory, storage, and networking.

Lifecycle Experience:
Hands-on involvement in the design, development, and deployment of large-scale AI supercomputers.

HPC Best Practices:
Practical experience in adopting and implementing high-performance computing (HPC) software development best practices in large-scale environments.

Compensation:
The base salary range is 272,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.

NVIDIA's Commitment to Diversity:
NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer.

As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


