Senior High Performance Computing Cluster Administrator

5 days ago


US CA Santa Clara NVIDIA Full time

NVIDIA's Deep Learning Optimized Frameworks Group is looking for a deeply technical HPC cluster administrator to lead a diverse cluster of GPU-accelerated systems and provide architectural mentorship to product teams in the deep learning and scientific computing domains. As a member of the DLFW Infrastructure team, you will provide leadership in the design and implementation of a groundbreaking GPU compute cluster that runs demanding deep learning, high performance computing, and other computationally intensive workloads. We are looking for an expert who can identify architectural changes or entirely new approaches for our GPU compute cluster. In this role, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads, and effective resource utilization in a heterogeneous compute environment.

What you'll be doing:

  • Administer Linux systems ranging from powerful DGX servers to embedded systems, and from bring-up hardware to publicly available systems.

  • Coordinate storage solutions and plan for growth.

  • Automate configuration management, software updates, maintenance, and monitoring of system availability using modern DevOps tools (Ansible, GitLab, etc.).

  • Actively communicate with management about any problems with the equipment and propose resolutions.

  • Plan, build, and install or upgrade systems that support NVIDIA DL software.

What we need to see:

  • BA, BS, or MS in CS, EE, CE, or equivalent experience

  • 4+ years of experience deploying and administering HPC clusters

  • Familiarity with resource schedulers (Slurm preferred, LSF, etc.)

  • Proven ability to script in Bash, Perl, or Python

  • Experience with containers (Docker, Singularity, LXC)

  • Deep understanding of operating systems, computer networks, and high-performance applications

  • Ability to work well with developers & test engineers

  • Dedication to providing quality support for your users

Ways to stand out from the crowd:

  • Familiarity and prior work experience with technologies such as Ansible, Git, Slurm, Zabbix, Prometheus, Grafana, and Docker

  • Familiarity with GPU usage in compute clusters and with CUDA

  • Experience with mobile and embedded systems

  • Basic knowledge of deep learning

  • Experience coding or scripting in Perl, Python, or Bash

The base salary range is 148,000 USD - 230,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


