Lead Site Reliability Engineer for HPC Solutions

1 week ago


Santa Clara, California, United States Nvidia Full time

NVIDIA, a prominent player in the realms of Artificial Intelligence, High-Performance Computing, and Visualization, is on the lookout for a Lead Site Reliability Engineer specializing in HPC storage systems. This role involves collaborating with our team to architect, implement, and enhance on-premises HPC storage solutions while integrating cloud technologies.

Key Responsibilities:

  • Architect and deploy on-premises HPC infrastructure complemented by cloud resources.
  • Create scalable storage architectures tailored for data-intensive applications.
  • Automate the deployment and oversight of extensive infrastructure environments.
  • Document workflows and protocols related to distributed file systems.
  • Work closely with engineering teams to ascertain infrastructure needs.
  • Provide guidance on methodologies for developing, testing, and launching applications.

Essential Qualifications:

  • Bachelor's degree in Computer Science or related field with over 8 years of relevant experience.
  • Proven track record in addressing performance challenges for HPC applications.
  • Familiarity with distributed filesystems such as Lustre and GPFS.
  • Experience with enterprise NAS solutions.
  • Proficient in programming languages including Python, Bash, or Golang.
  • Experience managing services in leading cloud environments.
  • Strong communication and teamwork abilities.

Preferred Qualifications:

  • Experience with RDMA fabrics.
  • Familiarity with monitoring solutions like Prometheus and Grafana.
  • Knowledge of HPC cluster management tools.
  • Experience with containerization technologies.

Compensation: The base salary range is competitive based on experience and location, supplemented by equity and benefits.

NVIDIA is an equal opportunity employer dedicated to fostering diversity in the workplace. We appreciate a diverse workforce and do not discriminate based on various characteristics.

Position Summary:

Type: Full-time



  • Santa Clara, California, United States Celestial AI Full time

    About Celestial AI: At Celestial AI, we are at the forefront of innovation in AI systems. Our ground-breaking Photonic Fabric technology provides a scalable solution to data transfer bottlenecks, revolutionizing AI system performance and delivering unmatched efficiency. Lead Reliability Engineer: We are seeking a dynamic Lead Reliability Engineer to drive...


  • Santa Clara, California, United States Nvidia Full time

    NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables unique creativity and discovery, and powers what were...

  • Solutions Architect

    3 days ago


    Santa Clara, California, United States NVIDIA Corporation Full time

    Solutions Architect - AI and HPC Cloud Expert: NVIDIA Corporation is seeking a highly skilled Solutions Architect to join its Cloud Infrastructure Team. As a key member of the team, you will be responsible for designing and implementing sophisticated cloud solutions that cater to the infrastructure needs of various NVIDIA groups, including Graphics Processors,...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Overview. Company Overview: To comply with U.S. federal government requirements, U.S. citizenship is required for this position. Our Mission: At Palo Alto Networks, our mission is clear: To be the cybersecurity partner of choice, safeguarding our digital existence. We envision a world where each day is safer and more secure than the last. Our foundation is built...

  • HPC Cluster Engineer

    4 weeks ago


    Santa Clara, California, United States Sustainable Talent Full time

    Sustainable Talent is partnering with Nvidia, a global leader that has been transforming computer graphics, PC gaming, and accelerated computing for over 25 years. We are looking for an HPC Cluster Engineer to support our client's GPU/HPC Infrastructure Team. As a member of the GPU/HPC Infrastructure team, you will provide leadership in the design and...


  • Santa Clara, California, United States Diverse Lynx Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud-based applications and infrastructure. Key Responsibilities: Design, implement, and maintain cloud infrastructure on...


  • Santa Clara, California, United States AMD Full time

    WHAT YOU DO AT AMD CHANGES EVERYTHING. We are committed to transforming lives through AMD technology, enhancing our industry, communities, and the world. Our mission is to create exceptional products that propel next-generation computing experiences – the foundational elements for data centers, artificial intelligence, personal computing, gaming, and...


  • Santa Clara, California, United States Promote Project Full time

    About the Company: Promote Project is at the forefront of innovation, leveraging cutting-edge technology to redefine the landscape of AI and computing. Our mission is to harness the power of advanced computing to create transformative solutions that impact various industries. Position Overview: We are seeking a Manager of Site Reliability Engineering to...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to enhance global operations, and our dedicated workforce makes it all possible. We operate swiftly because the world demands it, innovating uniquely for our clients and communities. By becoming part of ServiceNow, you join a dynamic team of innovators who possess a relentless curiosity and a passion for...


  • Santa Clara, California, United States Promote Project Full time

    About Promote Project: Promote Project is a leader in innovative technology solutions, dedicated to pushing the boundaries of what is possible in the realm of artificial intelligence and cloud computing. Our commitment to excellence is reflected in our talented workforce and our pursuit of groundbreaking advancements. Position Overview: We are seeking a...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to create a better world for everyone, driven by our talented workforce. We prioritize speed and innovation to meet the demands of our customers and communities. Joining ServiceNow means becoming part of a dynamic team of innovators who possess a relentless curiosity and a commitment to creativity. We...


  • Santa Clara, California, United States Centrify Corporation Full time

    About Centrify Corporation: Centrify Corporation is a leading provider of cloud-based identity and access management solutions. Our software runs on public clouds with 99.9% or better uptime and is mission critical for our customers. Job Summary: We are seeking a highly skilled Cloud Site Reliability Engineer to join our Cloud DevOps team. As a Cloud Site...


  • Santa Clara, California, United States Promote Project Full time

    About the Company: Promote Project is at the forefront of innovation, focusing on redefining technology and enhancing the capabilities of AI. We are dedicated to creating groundbreaking solutions that push the boundaries of what is possible in computing. Position Overview: We are seeking a Manager for Site Reliability Engineering to spearhead our cloud...


  • Santa Clara, California, United States XPENG Motors Full time

    About XPeng Motors: XPeng Motors is a leading innovator in the electric vehicle industry, dedicated to designing, developing, and manufacturing cutting-edge smart electric vehicles that seamlessly integrate advanced Internet, AI, and autonomous driving technologies. Job Summary: We are seeking a highly skilled Senior Staff AI Infrastructure Site Reliability...

  • Reliability Engineer

    3 weeks ago


    Santa Clara, California, United States Innova Solutions Full time

    Innova Solutions is immediately hiring a Reliability Engineer. Position type: Full Time. Duration: Full Time. Location: Santa Clara, CA. Minimum Qualifications: EE education is a must, and board-level debugging experience is a must. As a Reliability Engineer, you will: Work in the Board Level Reliability lab environment and set up functional test hardware and software for various...


  • Santa Clara, California, United States Veear Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Veear. As a key member of our infrastructure team, you will play a critical role in ensuring the reliability, scalability, and security of our cloud-based systems. Key Responsibilities: Collaboration and Partnership: Partner with cross-functional teams to ensure security...


  • Santa Clara, California, United States NVIDIA Corporation Full time

    Position Overview: The role of a Solutions Architect for Hyperscale at NVIDIA involves collaborating with leading-edge clients to develop and implement Artificial Intelligence (AI) and High-Performance Computing (HPC) software solutions at scale. This position is integral to the NVIDIA Solutions Architecture team, focusing on delivering comprehensive...


  • Santa Clara, California, United States Innova Solutions Full time

    Innova Solutions is actively seeking a Reliability Engineer. Position Type: Full Time. Location: Santa Clara, CA. As a Reliability Engineer, your responsibilities will include: Key Responsibilities: Engaging in Board Level Reliability laboratory activities, establishing functional test hardware and software for various NV products, including large server...


  • Santa Clara, California, United States Innova Solutions Full time

    Innova Solutions is actively seeking a Reliability Engineer. Position Type: Full Time. Location: Santa Clara, CA. As a Reliability Engineer, your responsibilities will include: Key Responsibilities: Engaging in Board Level Reliability laboratory operations, establishing functional testing hardware and software for various NV products, including extensive server...


  • Santa Clara, California, United States Innova Solutions Full time

    Innova Solutions is currently seeking a Reliability Engineer. Position Type: Full Time. Location: Santa Clara, CA. As a Reliability Engineer, your responsibilities will include: Key Responsibilities: Engaging in the Board Level Reliability laboratory setting, establishing functional test hardware and software for various NV products, including extensive server...