Senior AI Resiliency Software Engineer
6 days ago
We are seeking a highly skilled Principal Engineer for AI Software Resiliency to join our team at NVIDIA.
Key Responsibilities
- Lead the development of AI software resiliency features for our most powerful AI supercomputers.
- Collaborate with multiple teams and stakeholders to align on mission requirements and ensure successful integration of resiliency features into AI frameworks.
- Partner with TPMs, PMs, and QA teams to ensure timely and successful launch of resiliency features.
- Develop and implement critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs.
- Drive engineering excellence by contributing to large software codebases and ensuring high code quality and rigorous testing.
Qualifications
- Master's or Ph.D. in Computer Science, Electrical Engineering, Computer Engineering, or a related field from a reputable institution.
- A minimum of 10 years of experience in systems architecture or related fields, with a deep understanding of distributed systems and large-scale AI infrastructure.
- At least 10 years of hands-on experience in software development for distributed systems and 5 years in developing AI frameworks such as PyTorch or JAX/XLA.
- Proven track record of working effectively across multiple engineering fields and communicating complex technical concepts to a diverse set of collaborators.
- Experience with large-scale AI supercomputing applications, including in-depth knowledge of AI workload training and inference requirements.
- A strong passion for developing AI-specific system architectures, including CPUs, GPUs, memory, storage, and networking.
- Hands-on involvement in the design, development, and deployment of large-scale AI supercomputers.
- Practical experience developing and deploying high-performance computing (HPC) software in large-scale environments.
NVIDIA offers a competitive salary range of $272,000 - $419,750, based on location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.
-
Senior Software Architect for AI Resilience
1 week ago
Santa Clara, California, United States NVIDIA Full time
About the Role
NVIDIA is seeking a highly skilled Senior Software Architect to lead the development of AI software resilience for our most powerful AI supercomputers.
Key Responsibilities
- Develop and implement critical resilience features to support frontier model training at scale, ensuring robust and reliable AI systems.
- Serve as a trusted authority on AI...
-
Santa Clara, California, United States NVIDIA Full time
Job Description
We are seeking a highly skilled Principal Engineer for AI Software Resiliency to join our team at NVIDIA. As a key member of our organization, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs.
Key Responsibilities
- Develop and lead the execution of software...
-
Lead Software Engineer for AI Supercomputing
6 days ago
Santa Clara, California, United States NVIDIA Full time
We are seeking a highly skilled Principal Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world.
As a lead focused on AI Software Resiliency, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs.
Your expertise...
-
Santa Clara, California, United States Platform Ldn Full time
About Platform Ldn
Platform Ldn is a pioneering company in the field of robotics, dedicated to advancing the development of AI platforms that support industrial-grade robotics solutions.
Job Summary
We are seeking a highly skilled Senior Software Engineer to lead the design and development of our AI platform, enabling clients to run their AI workflows...
-
Senior Analog Design Engineer
3 days ago
Santa Clara, California, United States Celestial AI Full time
About the Role
Celestial AI is seeking a highly skilled Senior Analog Design Engineer to drive the development of innovative, high-speed analog architectures for low-power, high-performance Analog-Mixed Signal (AMS) solutions customized for AI applications.
Key Responsibilities
- Top-Down Architectural Analysis: Conduct thorough analysis of AMS systems to...
-
Principal Engineer for AI Systems
1 week ago
Santa Clara, California, United States NVIDIA Full time
About the Role
NVIDIA is seeking a highly skilled Principal Engineer to lead the development of AI software resiliency for our most powerful AI supercomputers.
Key Responsibilities
- Develop and implement critical resiliency features to support frontier model training at scale.
- Drive down cluster downtime towards zero, ensuring robust and reliable AI...
-
Senior Software Engineer, Metropolis AI Workflow
24 hours ago
Santa Clara, California, United States NVIDIA Corporation Full time
Job Description
About NVIDIA Corporation
NVIDIA Corporation is a leader in the technology industry, renowned for its innovative solutions in artificial intelligence, deep learning, and computer vision. As a pioneer in these fields, we are committed to empowering businesses and organizations to harness the power of AI and drive meaningful change.
Job Summary
We...
-
AI Systems Software Engineer
2 weeks ago
Santa Clara, California, United States Tenstorrent Full time
At Tenstorrent, we are at the forefront of pioneering advancements in artificial intelligence technology, setting new benchmarks for performance, usability, and cost-effectiveness. As AI reshapes the computing landscape, our solutions are evolving to integrate innovations across software models, compilers, platforms, networking, and semiconductor...
-
Senior Principal Software Engineer
24 hours ago
Santa Clara, California, United States Palo Alto Networks, Inc. Full time
About the Role
We are seeking a highly skilled Senior Principal Software Engineer to join our AI Runtime Security team at Palo Alto Networks, Inc. This is a critical role that will focus on the development and optimization of backend services, with a keen eye for scalability, reliability, and performance.
Key Responsibilities
- Architect and develop scalable,...
-
Software Engineer, Senior
1 month ago
Santa Clara, California, United States d-Matrix Full time
Software Engineer, Senior - AI/ML Workloads
d-Matrix - Santa Clara, CA
Location: Santa Clara, CA
Type: Full time
Department: R&D - SW Kernels & Workloads
d-Matrix has fundamentally changed the physics of memory-compute integration with our digital in-memory compute (DIMC) engine. The "holy grail" of AI compute has been to break through the memory wall to minimize data...
-
Senior AI Engineer
3 days ago
Santa Clara, California, United States XPENG Full time
About the Role
Xpeng Motors is a leading technology-driven company that is revolutionizing the transportation industry with its electric cars and autonomous driving technology. As a key player in this revolution, we are seeking a highly skilled Senior AI Engineer to join our team and contribute to the development of advanced humanoid robots and large language...
-
Senior Research Software Engineer
6 days ago
Santa Clara, California, United States ServiceNow Full time
Job Description
About ServiceNow
ServiceNow is a global market leader in the field of cloud-based platforms, bringing innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500. Our intelligent cloud-based platform seamlessly connects people, systems, and processes to empower organizations to find smarter, faster, and...
-
Safety and Resiliency Solutions Architect
2 weeks ago
Santa Clara, California, United States NVIDIA Full time
NVIDIA is a dynamic organization that continuously adapts by pursuing impactful opportunities that only we can address. We attract top talent to achieve our ultimate goal: to create a workplace that allows us to excel in our craft. We are currently looking for a Safety and Resiliency Architect to contribute to the development of GPU (Graphics Processing...
-
Architect of Resiliency and Safety Systems
2 weeks ago
Santa Clara, California, United States NVIDIA Full time
NVIDIA is a dynamic organization that continually seeks meaningful opportunities to address global challenges that only we can tackle. We attract top talent to achieve our mission: to create an environment where we can excel in our respective fields. We are currently looking for a Resiliency and Safety Architect to contribute to the advancement of GPU...
-
Senior Principal Software Engineer
2 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
Position Overview
Palo Alto Networks is at the forefront of AI security in today's rapidly evolving technological landscape. Our AI security cloud service engineering team plays a pivotal role in developing robust solutions that safeguard our clients' operations, particularly in the realm of AI and large language model (LLM) services.
Key...
-
Senior Software Engineer
2 days ago
Santa Monica, California, United States Jobot Full time
About the Role
We are seeking a highly skilled Senior Software Engineer to join our team as an AI Application Developer. As a key member of our Application Delivery team, you will be responsible for designing, developing, and supporting custom AI-enhanced applications hosted in the cloud-based Microsoft Azure platform.
Key Responsibilities
- Design and develop...
-
Engineering Leader, AI Workload Management
6 days ago
Santa Clara, California, United States Oracle Full time
About the Role
We are seeking a highly experienced and skilled Engineering Leader to join our team at Oracle. As a Senior Director of Engineering, AI Workload Orchestration, you will be responsible for leading the software development organization building out and operating AI platforms that operate at unprecedented speed, scale, and reliability.
Key...
-
Senior Scientist for AI Security Solutions
2 weeks ago
Santa Clara, California, United States Amazon Full time
Position Overview
We are looking for a Senior Applied Scientist to become a vital member of our AI Security division. This team is dedicated to developing security tools and streamlined solutions that guarantee the Generative AI (GenAI) experiences created by Amazon meet our stringent security requirements. Additionally, we leverage AI to create foundational...
-
Santa Clara, California, United States Aitopics Full time
Job Description
Aitopics is seeking a highly skilled Senior Software Quality Assurance Engineer to join our team. As a Senior Software Quality Assurance Engineer, you will be responsible for ensuring the quality and reliability of our Deep Learning software.
Key Responsibilities
- Work closely with cross-functional teams to understand test requirements and take...
-
Staff Software Engineer, Compilers
2 months ago
Santa Clara, California, United States Tenstorrent Inc. Full time
Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists have developed a high...