Lead Reliability Engineer

4 days ago


Santa Clara, California, United States Celestial AI Full time
About Celestial AI

Celestial AI is a pioneering company in the field of Generative AI, data center infrastructure, and high-performance computing. As we navigate the era of Accelerated Computing, we recognize that data center bottlenecks are no longer limited to compute performance but now extend to the system's interconnect bandwidth, memory bandwidth, and memory capacity.

Our Photonic Fabric technology offers a 10X increase in performance and energy efficiency over competitive technologies, making it an ideal solution for our customers' AI accelerators and GPUs. With our technology, customers can seamlessly integrate high-bandwidth, low-power, and low-latency optical interfaces into their systems.

Job Description

We are seeking a highly skilled Lead Reliability Engineer to spearhead reliability efforts specifically tailored for datacenter and high-performance computing applications. The ideal candidate will have a strong background in reliability engineering, with a focus on these critical environments, ensuring the robustness and uptime of our systems in demanding operational scenarios.

Key Responsibilities
  • Develop and implement reliability strategies, standards, and processes customized for datacenter and high-performance computing applications.
  • Lead reliability testing and qualification activities tailored for datacenter and HPC environments.
  • Collaborate closely with cross-functional teams to integrate reliability considerations into product development and deployment processes.
  • Conduct thorough reliability analyses specific to datacenter and HPC applications.
  • Define reliability requirements and specifications for new products targeting datacenter and HPC markets.
  • Lead root cause analysis and corrective actions for reliability issues identified in datacenter and HPC environments.
Requirements
  • Bachelor's degree in Engineering or related field; Master's or PhD degree preferred.
  • 15+ years of experience in reliability engineering, with a focus on datacenter and high-performance computing applications.
  • Strong understanding of reliability principles, methodologies, and tools relevant to datacenter and HPC environments.
  • Experience working with industry standards and guidelines specific to datacenter and HPC reliability.
  • Proven ability to lead cross-functional teams and drive reliability initiatives in fast-paced environments.
What We Offer

Celestial AI offers a highly competitive total compensation package, including a competitive base salary and a generous grant of our valuable early-stage equity. We also offer great benefits, a collaborative and continuous learning work environment, and the opportunity to work with smart and dedicated people engaged in developing the next-generation architecture for high-performance computing.

Celestial AI is proud to be an equal opportunity workplace and is an affirmative action employer.



  • Santa Clara, California, United States Celestial AI Full time

    About Celestial AI: Celestial AI is a pioneering company in the field of artificial intelligence, striving to push the boundaries of innovation and performance. As the industry grapples with the challenges of AI workloads, we are committed to delivering cutting-edge solutions that address the 'Memory Wall' problem and enable unprecedented scalability and...


  • Santa Clara, California, United States Omni Vision Inc Full time

    Job Title: Sr. Reliability Engineer. Omni Vision Inc is seeking a highly skilled Sr. Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for ensuring the quality and reliability of our CMOS Image Sensor products. Key Responsibilities: Review reliability qualification testing results and determine whether our...

  • Reliability Engineer

    3 weeks ago


    Santa Clara, California, United States Omni Vision Inc Full time

    Job Title: Reliability Engineer. Omni Vision Inc is seeking a highly skilled Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for designing and debugging hardware for biased product reliability evaluation, evaluating and qualifying CMOS Imaging Sensor (CIS) products for mass production, and collaborating...


  • Santa Clara, California, United States Amazon Full time

    Job Description: We are seeking a highly skilled Hardware Reliability Engineer to join our team at Amazon Web Services (AWS). As a key member of our Hardware Engineering team, you will play a critical role in designing and developing cutting-edge compute and storage platforms that enable our cloud services. The successful candidate will have a strong background...


  • Santa Clara, California, United States Ushur Full time

    About Ushur: Ushur is a leading provider of Customer Experience Automation solutions, empowering enterprises to deliver delightful customer and employee experiences. Our cutting-edge technologies, including Conversational AI, Machine Learning, and Intelligent Process Automation, enable Fortune 100 companies to automate their customer engagement. The Role: We are...

  • Reliability Engineer

    2 weeks ago


    Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly motivated and experienced Reliability Engineer to join our team. As a key member of our Hardware Quality and Compliance Engineering team, you will play a critical role in ensuring the quality and reliability of our new products from inception through the first year in production. Key...

  • Reliability Engineer

    3 weeks ago


    Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly motivated and experienced Reliability Engineer to join our team. As a key member of our Hardware Quality and Compliance Engineering team, you will play a critical role in ensuring the quality and reliability of our new products from inception through the first year in production. You will be responsible for...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly motivated and experienced Reliability Engineer to join our team. As a key member of our Hardware Quality and Compliance Engineering team, you will play a critical role in ensuring the quality and reliability of our new products from inception through the first year in production. Key...

  • Reliability Engineer

    3 weeks ago


    Santa Clara, California, United States Palo Alto Networks Full time

    Job Title: Principal NPI Reliability Engineer. Palo Alto Networks is seeking an experienced and highly motivated Reliability Engineer to join our team. The successful candidate will take ownership and drive quality and reliability into the company's new products from inception through the first year in production. Key Responsibilities: Establish and maintain...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly motivated and experienced Reliability Engineer to join our team. As a key member of our Hardware Quality and Compliance Engineering team, you will play a critical role in ensuring the quality and reliability of our new products from inception through the first year in production. Key...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly motivated and experienced Reliability Engineer to join our team. As a key member of our World-Wide Operations team, you will play a critical role in ensuring the quality and reliability of our new products from inception through the first year in production. Key Responsibilities: Establish and maintain...

  • Reliability Engineer

    3 weeks ago


    Santa Clara, California, United States Comtech Full time

    Reliability/Failure Analysis Engineer. Comtech Telecommunications Corp. is seeking a skilled Reliability/Failure Analysis Engineer to join our team in Santa Clara, CA. As a key member of our technical team, you will collaborate with diverse professionals and interact with customers to provide solutions to technical problems of moderate scope and...


  • Santa Clara, California, United States NVIDIA Full time

    Reliability Engineer for NVIDIA's System Products. NVIDIA is a leader in the field of artificial intelligence and high-performance computing, and we're looking for a skilled Reliability Engineer to join our team. As a Reliability Engineer, you will be responsible for ensuring the reliability of our system products, including graphics cards, servers, and data...

  • Reliability Engineer

    3 weeks ago


    Santa Clara, California, United States Innova Solutions Full time

    Job Title: Reliability Engineer. Innova Solutions is seeking a highly skilled Reliability Engineer to join our team. As a Reliability Engineer, you will be responsible for ensuring the reliability and quality of our products. Key Responsibilities: Work in the Board Level Reliability lab environment and set up functional test hardware and software for various NV...

  • Reliability Engineer

    4 weeks ago


    Santa Clara, California, United States Innova Solutions Full time

    Job Title: Reliability Engineer. Innova Solutions is seeking a skilled Reliability Engineer to join our team. As a Reliability Engineer, you will be responsible for ensuring the reliability and quality of our products. Key Responsibilities: Work in the Board Level Reliability lab environment and set up functional test hardware and software for various NV...


  • Santa Clara, California, United States Cryptoware Technologies Inc Full time

    Job Title: Site Reliability Engineer. We are seeking a highly skilled Site Reliability Engineer to join our team at Cryptoware Technologies Inc. As a Site Reliability Engineer, you will be responsible for leading the global expansion of Huobi's globe-spanning infrastructure. Key Responsibilities: Lead the effort of global expansion of Huobi...

  • Reliability Engineer

    4 weeks ago


    Santa Clara, California, United States Innova Solutions Full time

    Job Title: Reliability Engineer. We are seeking a highly skilled Reliability Engineer to join our team at Innova Solutions. As a Reliability Engineer, you will play a critical role in ensuring the reliability and quality of our products. Key Responsibilities: Design and implement test plans and procedures to evaluate the reliability of our products. Conduct...


  • Santa Clara, California, United States Comtech Full time

    Comtech Telecommunications Corp. is seeking a highly skilled Reliability/Failure Analysis Engineer to join our team in Santa Clara, CA. In this critical role, you will collaborate with a diverse team of technical professionals and interact with outside customers to provide solutions to a variety of technical problems of moderate scope and complexity. Key...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable, secure cloud infrastructure. Key Responsibilities: Contribute to the success of SRE and DevOps teams. Develop expertise in new...


  • Santa Clara, California, United States Anello Photonics Full time

    About Anello Photonics: Anello Photonics is a leading-edge technology company based in Santa Clara, CA. We have developed integrated photonic system-on-chip technology for next-generation navigation. Our SIPHOG™ gyroscope is based on our patented photonic integrated circuit technology. This innovative technology enables a product that is higher performance,...