Lead Systems Reliability Engineer

2 weeks ago


Santa Clara, California, United States NVIDIA Full time

NVIDIA has been at the forefront of technological innovation since it introduced the GPU in 1999, which transformed the PC gaming landscape and redefined modern graphics and parallel computing. More recently, GPU deep learning has ushered in a new era of computing, with the GPU acting as the brain of computers, robots, and autonomous vehicles that perceive and interact with their environments. Today, we proudly identify as 'the AI computing company.' We are growing our organization by assembling teams of the most insightful professionals in the industry.

NVIDIA's DGX, HGX, and MGX systems provide unparalleled solutions for large-scale enterprise AI infrastructure.

We are looking for a seasoned engineer with expertise in RAS (Reliability, Availability, and Serviceability) and FMEA (Failure Mode and Effects Analysis). Your primary responsibility will be to improve the reliability of NVIDIA's GPU and Grace systems through comprehensive failure analysis and the design of fault-resilient software and firmware. You will collaborate closely with cross-disciplinary teams, including hardware engineers, system architects, and software developers, to develop architectures that satisfy rigorous reliability standards and deliver outstanding customer experiences.

Key Responsibilities:

  • Develop and implement server-level FMEA (Failure Mode and Effects Analysis) for NVIDIA's Data Center offerings.
  • Establish server-level reliability, availability, and serviceability specifications in partnership with various stakeholders, including cloud service providers, to deliver fault-tolerant solutions that meet customer expectations.
  • Work alongside hardware, software, and firmware teams to pinpoint potential failure points, conduct FMEA, and suggest mitigation strategies.
  • Create fault detection, isolation, and recovery mechanisms to ensure system resilience and reduce downtime. Design redundancy and fault-tolerant features, such as redundant components, interfaces, and error correction codes (ECC), to maximize system uptime.
  • Assess and select suitable technologies and components to enhance reliability, availability, and serviceability, taking into account metrics like mean time between failures (MTBF), mean time to repair (MTTR), and total cost of ownership (TCO).
  • Collaborate with vendors and suppliers to evaluate and integrate their RAS-related solutions into the overall system architecture.
  • Conduct simulations, analyses, and testing at the system and cluster levels to validate the effectiveness of the RAS architecture and its components.
  • Stay informed about the latest advancements in RAS techniques, fault tolerance mechanisms, and industry trends to inform future system designs.
  • Engage with NVIDIA partners on RAS-related architecture discussions to enhance their utilization of NVIDIA products.
  • Contribute to all stages of product development, from definition and architecture design to implementation, debugging, testing, and initial customer support.

Qualifications:

  • BS, MS, or PhD in Electrical Engineering, Computer Science, or a related field with over 12 years of experience.
  • Proficient in C/C++ programming on Linux, with a strong grasp of Linux kernel internals and the ability to review code.
  • Extensive knowledge in system-level architecture design, reliability engineering, and fault tolerance mechanisms, with a focus on optimizing RAS architectures for complex computing systems, data centers, or mission-critical applications.
  • Familiarity with scale-out architectures; hands-on experience is advantageous.
  • Experience with fault-tolerant design principles and techniques, including redundancy, error correction codes (ECC), and error recovery strategies.
  • Proficient in system-level simulation tools and methodologies (e.g., fault injection, reliability block diagrams, failure rate analysis).
  • Exceptional problem-solving abilities, meticulous attention to detail, and the capacity to analyze intricate system-level challenges.
  • Strong written and verbal communication skills, a solid work ethic, a high degree of teamwork, and a commitment to consistently delivering quality work. You should be self-motivated and enjoy devising innovative solutions to complex problems.

Preferred Qualifications:

  • Proven experience conducting FMEA at the system level.
  • In-depth understanding of how machine check architecture and error flows interact with system firmware and software.
  • Hands-on experience with x86 or ARM system architecture.

NVIDIA is recognized as one of the most sought-after employers in the technology sector. Our team comprises some of the most innovative and dedicated individuals in the industry. If you are creative and independent, we would like to hear from you.

The base salary range is competitive and will be determined based on your location, experience, and the compensation of employees in similar roles. You will also be eligible for equity and benefits. NVIDIA welcomes applications on an ongoing basis.

NVIDIA is committed to promoting a diverse workplace and is proud to be an equal opportunity employer. We highly value diversity among our current and future employees and do not discriminate in our hiring and promotion practices based on race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.



  • Santa Clara, California, United States Celestial AI Full time

    About Celestial AI: At Celestial AI, we are at the forefront of innovation in AI systems. Our ground-breaking Photonic Fabric technology provides a scalable solution to data transfer bottlenecks, revolutionizing AI system performance and delivering unmatched efficiency. Lead Reliability Engineer: We are seeking a dynamic Lead Reliability Engineer to drive...


  • Santa Barbara, California, United States FLIR Systems Full time

    Job Summary: As a Reliability Engineer at FLIR Systems, you will play a critical role in ensuring the quality and reliability of our cooled camera systems. This position requires a strong understanding of engineering principles, quality systems, and problem-solving methodologies. Key Responsibilities: Conduct product and process qualification planning,...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Overview. Company Overview: To comply with U.S. federal government requirements, U.S. citizenship is required for this position. Our Mission: At Palo Alto Networks, our mission is clear: To be the cybersecurity partner of choice, safeguarding our digital existence. We envision a world where each day is safer and more secure than the last. Our foundation is built...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Company Overview. Our Vision: At Palo Alto Networks, our mission is clear: To be the preferred cybersecurity partner, safeguarding our digital lives. We envision a future where each day is safer and more secure than the last. Our foundation is built on challenging the status quo and innovating the cybersecurity landscape. We seek forward-thinkers who are...


  • Santa Clara, California, United States Apollo Professional Solutions Full time

    Job Overview. Position Summary: Lead Systems Engineer. Location: Remote. Compensation: $80.00 per hour. Contractor Benefits: Medical, Vision, Dental, 401k. The Lead Systems Engineer is dedicated to enhancing the reliability and durability of microfluidic electrochemical sensor devices through comprehensive failure analysis, root cause investigations, and the...


  • Santa Clara, California, United States Siri InfoSolutions Inc Full time

    Job Overview. Position: Reliability Engineer. Company: Siri InfoSolutions Inc. Location: Santa Clara, California, United States (On-site). Role Summary: The Reliability Engineer will engage in the Board Level Reliability laboratory setting, where they will establish functional testing hardware and software for a variety of products, including extensive server...


  • Santa Clara, California, United States Anello Full time

    About Anello Photonics: ANELLO Photonics is a leading-edge technology company based in Santa Clara, CA. The company has developed integrated photonic system-on-chip technology for next generation navigation. ANELLO's SIPHOG™ gyroscope is based on its patented photonic integrated circuit technology. The result is a product that is higher performance, much...


  • Santa Clara, California, United States Halo Industries Full time

    Position Overview: As a Lead Automation Systems Engineer at Halo Industries, you will be instrumental in advancing our innovative semiconductor manufacturing solutions. Key Responsibilities: In this role, you will: Direct the design of system architecture, integrate subsystems, and develop control systems for semiconductor manufacturing machinery. Work...


  • Santa Clara, California, United States Halo Industries Full time

    Position Overview: As a Lead Automation Systems Engineer at Halo Industries, you will be instrumental in advancing our innovative semiconductor manufacturing technologies. Key Responsibilities: In this role, you will: Direct the architecture design of systems, integrating subsystems and developing control systems for semiconductor manufacturing equipment. Work...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA, a prominent player in the realms of Artificial Intelligence, High-Performance Computing, and Visualization, is on the lookout for a Lead Site Reliability Engineer specializing in HPC storage systems. This role involves collaborating with our team to architect, implement, and enhance on-premises HPC storage solutions while integrating cloud...


  • Santa Clara, California, United States Omnivision Technologies Full time

    Qualifications: Bachelor's degree in Physics, Electrical Engineering, Materials Science, or a related engineering field, with coursework focused on semiconductor physics and electronic systems. Familiarity with electronic component reliability standards such as JEDEC and AEC-Q100 is advantageous. Experience in wafer-level reliability testing is also...


  • Santa Clara, California, United States Innova Solutions Full time

    Position Overview: Innova Solutions is currently seeking a Systems Reliability Specialist. Employment Type: Full Time. Role Responsibilities: - Engage in the Board Level Reliability laboratory setting, establishing functional testing hardware and software for a variety of NV products, including extensive server systems. - Conduct a range of functional...


  • Santa Clara, California, United States Diverse Lynx Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud-based applications and infrastructure. Key Responsibilities: Design, implement, and maintain cloud infrastructure on...


  • Santa Clara, California, United States Innova Solutions Full time

    Innova Solutions is currently seeking a Systems Reliability Specialist. Before proceeding to apply, please ensure that you have reviewed all relevant application documentation thoroughly. Position Type: Full Time. Role Overview: As a Systems Reliability Specialist, you will be responsible for ensuring the dependability and performance of various technological...


  • Santa Clara, California, United States Wipro Full time

    Position: Reliability Test Engineer. Company: Wipro. Overview: • Engage in the Board Level Reliability laboratory setting, establishing functional testing hardware and software for a variety of products, including extensive server systems, while executing diverse functional assessments for GPU/Tegra products; • Develop scripts for automated testing...


  • Santa Clara, California, United States Johnson & Johnson Full time

    Job Summary: We are seeking a highly skilled Staff Reliability Engineer - Electrical to join our team at Johnson & Johnson Medical Devices Companies. As a key member of our Hardware Team, you will play a critical role in designing and developing the next generation of robotic platforms. Key Responsibilities: Reliability Strategy Implementation: Plan and...


  • Santa Clara, California, United States Blue River Technology Full time

    Job Overview. Position: Lead Software Engineer for Autonomous Systems. Location: Remote with occasional office presence required. Key Responsibilities: Conduct research, design, and development of software applications for computer and network systems. Create resilient and reliable components for robotics systems aimed at autonomous functionality. Implement support...

  • Reliability Engineer

    3 weeks ago


    Santa Clara, California, United States Innova Solutions Full time

    Innova Solutions is immediately hiring a Reliability Engineer. Position type: Full Time. Duration: Full Time. Location: Santa Clara, CA. Minimum Qualifications: An EE education and board-level debugging experience are required. As a Reliability Engineer, you will: Work in the Board Level Reliability lab environment and set up functional test hardware and software for various...


  • Santa Clara, California, United States Innova Solutions Full time

    Innova Solutions is actively seeking a Reliability Engineer. Position Type: Full Time. Location: Santa Clara, CA. As a Reliability Engineer, your responsibilities will include: Key Responsibilities: Engaging in Board Level Reliability laboratory activities, establishing functional test hardware and software for various NV products, including large server...


  • Santa Clara, California, United States Innova Solutions Full time

    Innova Solutions is actively seeking a Reliability Engineer. Position Type: Full Time. Location: Santa Clara, CA. As a Reliability Engineer, your responsibilities will include: Key Responsibilities: Engaging in Board Level Reliability laboratory operations, establishing functional testing hardware and software for various NV products, including extensive server...