Senior Software Architect for AI Resilience

1 week ago


Santa Clara, California, United States NVIDIA Full time
About the Role

NVIDIA is seeking a highly skilled Senior Software Architect to lead the development of AI software resilience for our most powerful AI supercomputers.

Key Responsibilities
  • Develop and implement critical resilience features to support frontier model training at scale, ensuring robust and reliable AI systems.
  • Serve as a trusted authority on AI software resilience, guiding architecture, modeling, and scoping of resilience features.
  • Drive engineering excellence by contributing to large software codebases, maintaining high code quality, testing rigorously, and solving complex challenges.
Requirements
  • Master's or Ph.D. in Computer Science, Electrical Engineering, Computer Engineering, or a related field from a reputable institution.
  • Minimum of 10 years of experience in systems architecture or related fields, with a deep understanding of distributed systems and large-scale AI infrastructure.
  • At least 10 years of hands-on experience in software development for distributed systems and 5 years in developing AI frameworks such as PyTorch or JAX/XLA.
About NVIDIA

NVIDIA is a leader in the field of AI and deep learning, and we are committed to fostering a diverse and inclusive work environment.

We are recognized as one of the world's most desirable technology employers, home to some of the most forward-thinking and hardworking people in the world.

We are dedicated to pushing the boundaries of innovation and excellence in AI and deep learning, and we are seeking talented individuals to join our team.



  • Santa Clara, California, United States NVIDIA Full time

    Job Description: We are seeking a highly skilled Principal Engineer for AI Software Resiliency to join our team at NVIDIA. Key Responsibilities: Lead the development of AI software resiliency features for our most powerful AI supercomputers. Collaborate with multiple teams and stakeholders to align on mission requirements and ensure successful integration of...


  • Santa Clara, California, United States NVIDIA Full time

    Job Description: We are seeking a highly skilled Principal Engineer for AI Software Resiliency to join our team at NVIDIA. As a key member of our organization, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Key Responsibilities: Develop and lead the execution of software...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA is a dynamic organization that continuously adapts by pursuing impactful opportunities that only we can address. We attract top talent to achieve our ultimate goal: to create a workplace that allows us to excel in our craft. We are currently looking for a Safety and Resiliency Architect to contribute to the development of GPU (Graphics Processing...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA is a dynamic organization that continually seeks meaningful opportunities to address global challenges that only we can tackle. We attract top talent to achieve our mission: to create an environment where we can excel in our respective fields. We are currently looking for a Resiliency and Safety Architect to contribute to the advancement of GPU...


  • Santa Clara, California, United States NVIDIA Full time

    We are seeking a highly skilled Principal Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a lead focused on AI Software Resiliency, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise...


  • Santa Clara, California, United States TechStar Group Full time

    Position: Artificial Intelligence Researcher. Location: Santa Clara, CA. Duration: Long Term. As an AI Research Engineer, you will leverage your knowledge in various domains of artificial intelligence, including: extraction of critical context from datasets; sequential decision making and recommendation systems; generative AI techniques. You will work alongside...


  • Santa Clara, California, United States Amazon Full time

    Are you a technology enthusiast with a passion for innovation? Do you thrive in environments where hands-on experimentation is valued over mere discussion? If you possess a deep understanding of cloud architectures and are quick to adapt to new technologies, we want to hear from you. About the Role: As a Senior AI/ML Solutions Architect within the AWS...


  • Santa Clara, California, United States Palo Alto Networks, Inc. Full time

    About the Role: We are seeking a highly skilled Senior Principal Software Engineer to join our AI Runtime Security team at Palo Alto Networks, Inc. This is a critical role that will focus on the development and optimization of backend services, with a keen eye for scalability, reliability, and performance. Key Responsibilities: Architect and develop scalable,...


  • Santa Clara, California, United States NVIDIA Full time

    We are seeking an AI Solutions Architect to collaborate with clients, partners, and NVIDIA software engineers to enhance applications and platforms utilizing NVIDIA technology across various industry sectors. At NVIDIA, our Solutions Architects are top-tier developers and scientists who thrive on engaging with cutting-edge GPU and Networking...


  • Santa Clara, California, United States NVIDIA Full time

    We are seeking a highly skilled Senior GPU Performance Architect to join our AI Applications team at NVIDIA. As a key member of our architecture group, you will play a critical role in driving innovation and delivering cutting-edge performance in the field of artificial intelligence. The ideal candidate will have a strong background in computer science,...


  • Santa Clara, California, United States NVIDIA Corporation Full time

    Solutions Architect, Spectrum-X - DPU/RoCE-Centric. About the Role: We are seeking an experienced Senior Network Architect to join our team at NVIDIA Corporation. As a key member of our network architecture team, you will be responsible for designing and implementing high-performance networking solutions for our AI and HPC workloads. Key Responsibilities: Support...


  • Santa Clara, California, United States Celestial AI Full time

    About the Role: Celestial AI is seeking a highly skilled Senior Analog Design Engineer to drive the development of innovative, high-speed analog architectures for low-power, high-performance Analog-Mixed Signal (AMS) solutions customized for AI applications. Key Responsibilities: Top-Down Architectural Analysis: Conduct thorough analysis of AMS systems to...


  • Santa Clara, California, United States Tenstorrent Full time

    At Tenstorrent, we are at the forefront of pioneering advancements in artificial intelligence technology, setting new benchmarks for performance, usability, and cost-effectiveness. As AI reshapes the computing landscape, our solutions are evolving to integrate innovations across software models, compilers, platforms, networking, and semiconductor...


  • Santa Clara, California, United States Platform Ldn Full time

    About Platform Ldn: Platform Ldn is a pioneering company in the field of robotics, dedicated to advancing the development of AI platforms that support industrial-grade robotics solutions. Job Summary: We are seeking a highly skilled Senior Software Engineer to lead the design and development of our AI platform, enabling clients to run their AI workflows...


  • Santa Clara, California, United States NVIDIA Corporation Full time

    Job Description. About NVIDIA Corporation: NVIDIA Corporation is a leader in the technology industry, renowned for its innovative solutions in artificial intelligence, deep learning, and computer vision. As a pioneer in these fields, we are committed to empowering businesses and organizations to harness the power of AI and drive meaningful change. Job Summary: We...


  • Santa Clara, California, United States NVIDIA Full time

    We are currently seeking an AI Solutions Architect at NVIDIA, focusing on cloud infrastructure and hyperscale environments. Your primary role will involve spearheading technical engagements for AI/ML software with customers deploying systems at an extensive scale. Collaborating across various teams within NVIDIA and with our clients, you will play a crucial...


  • Santa Clara, California, United States Amazon Full time

    Are you enthusiastic about Generative AI (GenAI)? Do you aspire to shape the future of Go to Market (GTM) strategies at Amazon Web Services (AWS) through generative AI? In this position, you will assist our prominent clients in constructing and deploying GenAI-enabled applications utilizing Amazon Bedrock and SageMaker. You will fine-tune and develop...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Position Overview: Palo Alto Networks is at the forefront of AI security in today's rapidly evolving technological landscape. Our AI security cloud service engineering team plays a pivotal role in developing robust solutions that safeguard our clients' operations, particularly in the realm of AI and large language model (LLM) services. Key...


  • Santa Clara, California, United States NVIDIA Full time

    About the Role: NVIDIA is seeking a highly skilled Principal Engineer to lead the development of AI software resiliency for our most powerful AI supercomputers. Key Responsibilities: Develop and implement critical resiliency features to support frontier model training at scale. Drive down cluster downtime towards zero, ensuring robust and reliable AI...

  • Solutions Architect

    3 days ago


    Santa Clara, California, United States NVIDIA Corporation Full time

    Solutions Architect - AI and HPC Cloud Expert. NVIDIA Corporation is seeking a highly skilled Solutions Architect to join its Cloud Infrastructure Team. As a key member of the team, you will be responsible for designing and implementing sophisticated cloud solutions that cater to the infrastructure needs of various NVIDIA groups, including Graphics Processors,...