Senior AI-HPC Cluster Engineer

1 week ago


Austin, Texas, United States NVIDIA Full time
NVIDIA has continuously reinvented itself over two decades.

Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing.

More recently, GPU deep learning ignited modern AI — the next era of computing.

NVIDIA is a "learning machine" that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world.

This is our life's work, to amplify human imagination and intelligence.

Make the choice to join us today.

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that run demanding deep learning, high-performance computing, and computationally intensive workloads.

We seek an expert to identify architectural changes and/or completely new approaches for our GPU Compute Clusters.

As an expert, you will help us with the strategic challenges we encounter, including:
Compute, networking, and storage design for large-scale, high-performance workloads

Effective resource utilization in a heterogeneous compute environment

Evolving our private/public cloud strategy

Capacity modeling and growth planning across our global computing environment

What you'll be doing:
Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions

Maintaining and building deep learning clusters at scale

Supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows

Performing root cause analysis and suggesting corrective action for problems at large and small scales

Finding and fixing problems before they occur

What we need to see:
Bachelor's degree in Computer Science, Electrical Engineering or related field or equivalent experience.

Minimum 5 years of experience designing and operating large scale compute infrastructure.

Experience analyzing and tuning performance for a variety of AI/HPC workloads.

Working knowledge of cluster configuration management tools such as Ansible, Puppet, Salt.

Experience with AI/HPC advanced job schedulers, and ideally familiarity with schedulers such as Slurm, K8s, RTDA or LSF

In-depth understanding of container technologies like Docker, Singularity, Shifter, Charliecloud

Proficiency in CentOS/RHEL and/or Ubuntu Linux distros, as well as Python programming and Bash scripting

Experience with AI/HPC workflows that use MPI

Ways to stand out from the crowd:

Experience with NVIDIA GPUs, CUDA programming, NCCL and MLPerf benchmarking

Experience with Machine Learning and Deep Learning concepts, algorithms and models

Familiarity with InfiniBand with IBOP and RDMA

Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads

Familiarity with deep learning frameworks like PyTorch and TensorFlow

NVIDIA offers highly competitive salaries and a comprehensive benefits package.

We have some of the most brilliant and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast.

If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.
The base salary range is 148,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA accepts applications on an ongoing basis.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.

As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

