Principal Infrastructure Performance and Development Engineer
3 weeks ago
Location: US, CA, Santa Clara
Time type: Full time
Posted on: Posted Yesterday
Job requisition ID: JR1981842
Joining NVIDIA's AI Efficiency Team means contributing to the infrastructure that powers our leading-edge AI research. This team focuses on optimizing the efficiency and resiliency of ML workloads, as well as developing scalable AI infrastructure tools and services. Our objective is to deliver a stable, scalable environment for NVIDIA's AI researchers, providing them with the necessary resources and scale to foster innovation. We're transforming the way Deep Learning applications run on tens of thousands of GPUs. Join our team of experts and help us build a supercharged AI platform that maximizes efficiency, resilience, and Model FLOPs Utilization (MFU). In this position you will collaborate with a diverse team that cuts across many areas of the Deep Learning HW/SW stack to build a highly scalable, fault-tolerant, and optimized AI platform.
What you will be doing:
Build tools and frameworks that provide real-time application performance metrics that can be correlated with system metrics.
Develop automation frameworks that empower applications to thoughtfully predict and overcome system/infrastructure failures, ensuring fault tolerance.
Collaborate with software teams to pinpoint performance bottlenecks. Design, prototype, and integrate solutions that deliver demonstrable performance gains in production environments.
Adapt and enhance communication libraries to seamlessly support innovative network topologies and system architectures.
Design or adapt optimized storage solutions to boost Deep Learning efficiency, resilience, and developer productivity.
What We Need to See:
BS/MS/PhD (or equivalent experience) in Computer Science, Electrical Engineering or a related field.
Proven experience in at least one of the following areas:
10+ years of experience in analyzing and improving performance of training applications using PyTorch or a similar framework
10+ years of experience with building distributed software applications
10+ years of experience in building storage solutions for Deep Learning applications
10+ years of background in building automated, fault-tolerant distributed applications
5+ years building tools for bottleneck analysis and automation of fault tolerance in distributed environments.
Strong background in parallel programming and distributed systems
Experience analyzing and optimizing large scale distributed applications.
Excellent verbal and written communication skills
Ways To Stand Out From The Crowd:
Deep understanding of HPC and distributed system architecture with emphasis on RDMA
Hands-on working experience in more than one of the above areas, especially with performance analysis and profiling of Deep Learning workloads.
Comfortable navigating and working with the PyTorch codebase.
Proven understanding of CUDA and GPU architecture
The base salary range is 272,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
You will also be eligible for equity and benefits.
NVIDIA accepts applications on an ongoing basis.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
About Us
NVIDIA is a Learning Machine
NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and the metaverse is transforming the world's largest industries and profoundly impacting society.