Reliability, Availability and Serviceability Expert, Datacenter AI Products Development

4 weeks ago


Oklahoma City, United States NVIDIA Full time

Senior Datacenter Product Development Engineer - RAS SME

Locations: US, TX, Austin; US, OR, Hillsboro; US, CA, Santa Clara

Time type: Full time

Posted on: Posted 6 Days Ago

Job requisition ID: JR1975187

For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU, the engine of modern AI technologies, the field has expanded to encompass AI-powered video games, social networking and web search, IC and other product design, medical diagnosis, and scientific research. Today, visual computing is the critical computing engine for deep learning-based AI, including ChatGPT, and is becoming increasingly central to how people entertain and interact. There has never been a more exciting time to join us and help take visual computing and AI into the next chapter.

We are looking for a product development engineer to serve as the subject-matter expert (SME) driving key aspects of RAS/Resilience features, from chip to module to server, for our next-generation products for AI applications. We expect you to bring deep knowledge and experience in RAS/Resilience testing, characterization, analysis, benchmarking, and risk assessment of large AI training or HPC cluster systems with InfiniBand or enhanced Ethernet.

What you’ll be doing:

Serve as the focal-point SME for manufacturing test requirements, test methodology, test plans, and test flows for AI system RAS/Resilience features, ensuring good test coverage and successful production ramp-ups.

Own the AI system RAS/Resilience models, benchmarking, and risk assessment.

Own the troubleshooting and root-causing of AI system RAS/Resilience-related failures at the factory and in the field.

Drive end-to-end RAS efforts across chip, board, and system to reduce FIT (failures in time) rates.

Lead the data analysis of RAS/Resilience logs to refine, revise and overhaul test methodology and manufacturing flows; influence and drive software tools/infrastructure required for new product development, validation, and productization.

Partner closely with architecture, hardware, software, and product engineering teams throughout the product development lifecycle.

Be ready to be challenged to assess new hardware features and architect manufacturing RAS tests, flows, and methodologies.

You'll nurture a deep understanding of NVIDIA's AI hardware and software architecture.

What we need to see:

BS or higher in EE, CE, CS, Mathematics, or equivalent experience.

12+ years of proven hands-on experience in the design, testing, benchmarking, and risk assessment of system RAS/Resiliency features of large compute, AI, or HPC systems.

Proficient in Compute System RAS/Resilience model theory and methodology.

Proficient in HPC or AI system architecture and Cluster Interconnect technologies.

Proficient in using test equipment, Linux commands, and benchmark utilities to test and troubleshoot compute system RAS and Resiliency features.

Strong problem-solving and troubleshooting expertise, and experience institutionalizing root-cause analysis.

Self-initiative, strong interpersonal skills, and flexibility to adapt to new technologies.

Solid knowledge of and/or experience in HPC or MLPerf benchmarking is a plus.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.

The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA accepts applications on an ongoing basis. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
