Senior AI-HPC Cluster Engineer

2 weeks ago


Durham, United States NVIDIA Full time

Senior HPC Cluster Engineer

Locations: US, CA, Santa Clara; US, MA, Westford; US, TX, Austin; US, NC, Durham

Time type: Full time

Posted on: Posted 7 Days Ago

Job requisition ID: JR1965956

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice to join us today.

As a member of the GPU/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek an expert to identify architectural changes and/or completely new approaches for our GPU compute clusters. As an expert, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads; effective resource utilization in a heterogeneous compute environment; evolving our private/public cloud strategy; capacity modeling; and growth planning across our global computing environment.

What you'll be doing:

Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions

Maintaining and building deep learning clusters at scale

Supporting our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows

Performing root cause analysis and suggesting corrective action for problems at large and small scales

Finding and fixing problems before they occur

What we need to see:

Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.

Minimum 5 years of experience designing and operating large scale compute infrastructure.

Experience analyzing and tuning performance for a variety of HPC workloads.

Working knowledge of cluster configuration management tools such as Ansible, Puppet, or Salt.

Experience with HPC cluster job schedulers such as SLURM or LSF

In-depth understanding of container technologies like Docker, Singularity, Shifter, and Charliecloud

Proficient in CentOS/RHEL and/or Ubuntu Linux distributions, as well as Python programming and Bash scripting

Experience with HPC workflows that use MPI

Ways to stand out from the crowd:

Understanding of MLPerf benchmarking

Familiarity with InfiniBand, including IPoIB and RDMA

Understanding of fast, distributed storage systems like Lustre and GPFS for HPC workloads.

Background with Software Defined Networking and HPC cluster networking

Familiarity with deep learning frameworks like PyTorch and TensorFlow

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most brilliant and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you. The base salary range is 148,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA accepts applications on an ongoing basis. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and the metaverse is transforming the world's largest industries and profoundly impacting society.



