HPC Cluster Engineer

4 weeks ago


Santa Clara, California, United States Sustainable Talent Full time

Sustainable Talent is partnering with NVIDIA, a global leader that has been transforming computer graphics, PC gaming, and accelerated computing for over 25 years. We are looking for an HPC Cluster Engineer to support our client's GPU/HPC Infrastructure Team.

As a member of the GPU/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. As an expert, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads; effective resource utilization in a heterogeneous compute environment; evolving our private/public cloud strategy; capacity modeling; and growth planning across our global computing environment.

Working knowledge of cluster configuration management tools such as Ansible, Puppet, and Salt. Experience with HPC workflows that use MPI. Understanding of fast, distributed storage systems like Lustre and GPFS for HPC workloads. Familiarity with deep learning frameworks like PyTorch and TensorFlow.



  • Santa Clara, California, United States Nvidia Full time

    NVIDIA, a prominent player in the realms of Artificial Intelligence, High-Performance Computing, and Visualization, is on the lookout for a Lead Site Reliability Engineer specializing in HPC storage systems. This role involves collaborating with our team to architect, implement, and enhance on-premises HPC storage solutions while integrating cloud...

  • Solutions Architect

    5 days ago


    Santa Clara, California, United States NVIDIA Corporation Full time

    Solutions Architect - AI and HPC Cloud Expert. NVIDIA Corporation is seeking a highly skilled Solutions Architect to join its Cloud Infrastructure Team. As a key member of the team, you will be responsible for designing and implementing sophisticated cloud solutions that cater to the infrastructure needs of various NVIDIA groups, including Graphics Processors,...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA has been at the forefront of innovation for over two decades. Our creation of the GPU in 1999 not only propelled the PC gaming industry but also transformed modern graphics and parallel computing. Recently, the advent of GPU deep learning has ushered in a new era of artificial intelligence — a pivotal moment in computing history. At NVIDIA, we pride...


  • Santa Clara, California, United States Nvidia Full time

    NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables unique creativity and discovery, and powers what were...


  • Santa Clara, California, United States NVIDIA Full time

    We are seeking a highly skilled Principal Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a lead focused on AI Software Resiliency, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise...


  • Santa Clara, California, United States Nvidia Full time

    Senior Software Engineer, GPU Communications and Networking. Locations: US, CA, Santa Clara. Time type: Full time. Job requisition ID: JR1972306. NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of...


  • Santa Clara, California, United States Celestial AI Full time

    About Celestial AI: At Celestial AI, we are at the forefront of innovation in AI systems. Our ground-breaking Photonic Fabric technology provides a scalable solution to data transfer bottlenecks, revolutionizing AI system performance and delivering unmatched efficiency. Lead Reliability Engineer: We are seeking a dynamic Lead Reliability Engineer to drive...


  • Santa Clara, California, United States Oracle Full time

    Job Description. Job Summary: We are seeking a highly skilled and experienced Senior Principal Software Engineer to join our Cloud Engineering Infrastructure Development team at Oracle. As a key member of our team, you will be responsible for designing, developing, and performance tuning the networking stack required to run distributed AI/ML/HPC workloads...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA is seeking exceptional software engineers to enhance our enterprise GPU management and monitoring solutions. In this position, you will collaborate with the broader NVIDIA team to architect and develop Linux-based management agents, Kubernetes integrations, and comprehensive integration solutions that merge GPUs with the overall datacenter software...


  • Santa Clara, California, United States Tenstorrent Inc Full time

    Job Description. About the Role: Tenstorrent Inc is seeking a highly skilled and experienced Senior Principal High-Performance Computing Architect to lead the design and implementation of cutting-edge architectures for high-performance computing systems. As a key member of our team, you will play a crucial role in enabling efficient and scalable computation...


  • Santa Clara, California, United States NVIDIA Full time

    About the Role: NVIDIA is seeking a highly skilled and experienced professional to join our team as a GPU Developer Advocate. This is a unique opportunity to work with a leading technology company in the field of High Performance Computing (HPC) and Artificial Intelligence (AI). Key Responsibilities: Recruit and collaborate with researchers, scientists, and data...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA is seeking talented engineers to enhance its AI Infrastructure. We are looking for individuals with a robust programming foundation, profound knowledge of distributed systems, and a strong grasp of software testing and deployment methodologies. Excellent communication and organizational skills are essential. We value innovative thinkers who can...


  • Santa Clara, California, United States NVIDIA Full time

    Job Summary: NVIDIA is seeking a highly skilled Senior SRE Engineer to join its fast-paced Infrastructure, Planning and Processes organization. As a key member of the team, you will be responsible for designing and implementing scalable, resilient cloud infrastructure platforms for NVIDIA's internal cloud provisioning product. Key Responsibilities: Design and...


  • Santa Clara, California, United States Nvidia Corporation Full time

    Perception for autonomous vehicles (AV) is one of the most exciting and challenging areas to work on today. Machine learning plays a crucial role in this field, but to excel in machine learning for Perception AV, we need to master the fundamentals. Join the Perception ML Foundation team, where we combine expertise in machine learning, high-performance...


  • Santa Clara, California, United States Summit Healthcare Inc Full time

    We are thrilled to present an opportunity for a Cloud Solutions Architect at Summit Healthcare Inc. We are in search of a dedicated professional with a keen interest in artificial intelligence and machine learning. If you are passionate about engaging in initiatives that redefine the possibilities of cloud-scale AI, we encourage you to explore this role. Key...


  • Santa Clara, California, United States INTEL Full time

    Job Summary: We are seeking a highly experienced and visionary leader to serve as the Vice President and General Manager of our High-Performance Computing Business Unit at Intel. As a key member of our leadership team, you will be responsible for driving the strategic direction, growth, and profitability of our HPC business, overseeing all aspects of product...


  • Santa Clara, California, United States Oracle Full time

    Cloud Engineering Infrastructure Development at Oracle. We are seeking a highly skilled Cloud Engineering Infrastructure Developer to join our team at Oracle, working on the development of ultra-high performance networks supporting AI/ML/HPC workloads. This dynamic team is responsible for designing, developing, and fine-tuning the networking stack for...


  • Santa Clara, California, United States NVIDIA Full time

    About the Role: NVIDIA is seeking a highly skilled Principal Engineer to lead the development of AI software resiliency for our most powerful AI supercomputers. Key Responsibilities: Develop and implement critical resiliency features to support frontier model training at scale. Drive down cluster downtime towards zero, ensuring robust and reliable AI...


  • Santa Clara, California, United States Intel Full time

    Job Summary: We are seeking a highly skilled Cloud Software Development Engineer to join our team at Intel. As a key member of our Data Platforms Engineering and Architecture (DPEA) Group, you will be responsible for designing, developing, and testing software solutions for our data center products. Key Responsibilities: Design and develop system validation...


  • Santa Clara, California, United States NVIDIA Full time

    Job Description: We are seeking a highly skilled Principal Engineer for AI Software Resiliency to join our team at NVIDIA. Key Responsibilities: Lead the development of AI software resiliency features for our most powerful AI supercomputers. Collaborate with multiple teams and stakeholders to align on mission requirements and ensure successful integration of...