Reliability, Availability and Serviceability Expert, Datacenter AI Products Development
3 months ago
Locations: US, CA, Santa Clara; US, TX, Austin; US, OR, Hillsboro
Time type: Full time
Posted: 30+ Days Ago
Job requisition ID: JR1975187
For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU, the engine of modern AI technologies, the field has expanded to encompass AI-powered video games, social networking and web search, IC and other product design, medical diagnosis, and scientific research. Today, visual computing is the critical computing engine for deep learning-based AI, including ChatGPT, and is becoming increasingly central to how people entertain and interact. There has never been a more exciting time to join us and take visual computing and AI into the next chapter.
We are looking for a product development engineer to serve as a subject-matter expert (SME) and drive key aspects of RAS/Resilience features, from chip to module to server, for our next-generation products for AI applications. We expect you to bring deep knowledge and experience in RAS/Resilience testing, characterization, analysis, benchmarking, and risk assessment of large AI training or HPC cluster systems with InfiniBand or enhanced Ethernet.
What you’ll be doing:
Serve as the focal-point SME for manufacturing test requirements, test methodology, test plans, and test flows for AI system RAS/Resilience features to ensure good test coverage and successful production ramp-ups.
Own the AI system RAS/Resilience models, benchmarking, and risk assessment.
Own the troubleshooting and root-causing of AI system RAS/Resilience-related failures at the factory and in the field.
Drive the end-to-end RAS efforts across chip, board, and system to reduce FIT rates.
Lead the data analysis of RAS/Resilience logs to refine, revise, and overhaul test methodology and manufacturing flows; influence and drive the software tools and infrastructure required for new product development, validation, and productization.
Work closely and partner with architecture, hardware, software, and product engineering teams throughout the product development lifecycle.
Be ready to be challenged to assess new hardware features and architect manufacturing RAS tests, flows, and methodologies.
Develop a deep understanding of NVIDIA's AI hardware and software architecture.
What we need to see:
BS or higher in EE, CE, CS, Mathematics, or equivalent experience.
12+ years of proven hands-on experience in the design, testing, benchmarking, and risk assessment of system RAS/Resilience features of large compute, AI, or HPC systems.
Proficient in compute system RAS/Resilience model theory and methodology.
Proficient in HPC or AI system architecture and cluster interconnect technologies.
Proficient in using test equipment, Linux commands, and benchmark utilities to test and troubleshoot compute system RAS/Resilience features.
Strong problem-solving and troubleshooting expertise, including institutionalizing root-cause analysis.
Self-initiative, strong interpersonal skills, and flexibility to adapt to new technologies.
Solid knowledge of or experience in HPC or MLPerf benchmarking is a plus.
NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.
The base salary range is 188,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
You will also be eligible for equity and benefits.
NVIDIA accepts applications on an ongoing basis.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and the metaverse is transforming the world's largest industries and profoundly impacting society.