Reliability, Availability and Serviceability Expert, Datacenter AI Products Development

7 months ago


Hillsboro, United States NVIDIA Full time

For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU, the engine of modern AI technologies, the field has expanded to encompass AI-powered video games, social networking and web search, IC and other product design, medical diagnosis, and scientific research. Today, visual computing is the critical computing engine for deep learning-based AI, including ChatGPT, and is becoming increasingly central to how people entertain and interact. There has never been a more exciting time to join us and help take visual computing and AI into the next chapter.

We are looking for a product development engineer to serve as a subject matter expert (SME) and drive key aspects of RAS/Resilience features, from chip to module to server, for our next-generation products for AI applications. We expect you to bring deep knowledge and experience in RAS/Resilience testing, characterization, analysis, benchmarking, and risk assessment of large AI training or HPC cluster systems with InfiniBand or enhanced Ethernet.

What you’ll be doing:

  • Serve as the focal-point SME for manufacturing test requirements, test methodology, test plans, and test flows for AI system RAS/Resilience features, ensuring good test coverage and successful production ramp-ups.

  • Own the AI system RAS/Resilience models, benchmarking, and risk assessment.

  • Own the troubleshooting and root-causing of AI system RAS/Resilience-related failures at the factory and in the field.

  • Drive end-to-end RAS efforts across chip, board, and system to reduce FIT rates.

  • Lead the data analysis of RAS/Resilience logs to refine, revise, and overhaul test methodology and manufacturing flows; influence and drive the software tools and infrastructure required for new product development, validation, and productization.

  • Partner closely with architecture, hardware, software, and product engineering teams throughout the product development lifecycle.

  • Be ready to be challenged to assess new hardware features and architect manufacturing RAS tests, flows, and methodologies.

  • Develop a deep understanding of NVIDIA's AI hardware and software architecture.

What we need to see:

  • BS or higher in EE, CE, CS, Mathematics, or equivalent experience.

  • 12+ years of proven hands-on experience in the design, testing, benchmarking, and risk assessment of system RAS/Resiliency features of large compute, AI, or HPC systems.

  • Proficient in compute system RAS/Resilience model theory and methodology.

  • Proficient in HPC or AI system architecture and cluster interconnect technologies.

  • Proficient in using test equipment, Linux commands, and benchmark utilities to test and troubleshoot compute system RAS and Resiliency features.

  • Strong problem-solving and troubleshooting expertise, and experience institutionalizing root-cause analysis.

  • Self-initiative, strong interpersonal skills, and flexibility to adapt to new technologies.

  • Solid knowledge of and/or experience in HPC or MLPerf benchmarking is a plus.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.

The base salary range is 188,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


