Principal Infrastructure Performance and Development Engineer

3 weeks ago


Santa Clara, United States NVIDIA Full time


Location: US, CA, Santa Clara

Time type: Full time

Posted on: Posted Yesterday

Job requisition ID: JR1981842

Joining NVIDIA's AI Efficiency Team means contributing to the infrastructure that powers our leading-edge AI research. This team focuses on optimizing efficiency and resiliency of ML workloads, as well as developing scalable AI infrastructure tools and services. Our objective is to deliver a stable, scalable environment for NVIDIA's AI researchers, providing them with the necessary resources and scale to foster innovation. We're transforming the way Deep Learning applications run on tens of thousands of GPUs. Join our team of experts and help us build a supercharged AI platform that maximizes efficiency, resilience, and Model FLOPs Utilization (MFU). In this position you will be collaborating with a diverse team that cuts across many areas of Deep Learning HW/SW stack in building a highly scalable, fault tolerant and optimized AI platform.

What you will be doing:

Build tools and frameworks that provide real time application performance metrics that can be correlated with system metrics.

Develop automation frameworks that empower applications to thoughtfully predict and overcome system/infrastructure failures, ensuring fault tolerance.

Collaborate with software teams to pinpoint performance bottlenecks. Design, prototype, and integrate solutions that deliver demonstrable performance gains in production environments.

Adapt and enhance communication libraries to seamlessly support innovative network topologies and system architectures.

Design or adapt optimized storage solutions to boost Deep Learning efficiency, resilience, and developer productivity.

What We Need to See:

BS/MS/PhD (or equivalent experience) in Computer Science, Electrical Engineering or a related field.

Proven experience in at least one of the following areas:

  • 10+ years of experience in analyzing and improving performance of training applications using PyTorch or a similar framework

  • 10+ years of experience with building distributed software applications

  • 10+ years of experience in building storage solutions for Deep Learning applications

  • 10+ years of background in building automated fault tolerant distributed applications

5+ years building tools for bottleneck analysis and automation of fault tolerance in distributed environments.

Strong background in parallel programming and distributed systems

Experience analyzing and optimizing large scale distributed applications.

Excellent verbal and written communication skills

Ways To Stand Out From The Crowd:

Deep understanding of HPC and distributed system architecture with emphasis on RDMA

Hands-on working experience in more than one of the above areas, especially with performance analysis and profiling of Deep Learning workloads.

Comfortable navigating and working with the PyTorch codebase.

Proven understanding of CUDA and GPU architecture

The base salary range is 272,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA accepts applications on an ongoing basis. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About Us

NVIDIA is a Learning Machine

NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and the metaverse is transforming the world's largest industries and profoundly impacting society.



