Senior System Software Engineer, NCCL
2 hours ago
NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High Performance Computing, and Visualization. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.
We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver communication runtimes like NCCL and NVSHMEM for Deep Learning and HPC applications. We are looking for a motivated Partner Enablement Engineer to guide our key partners and customers with NCCL.
Key Responsibilities:
- Engage with our partners and customers to root cause functional and performance issues reported with NCCL
- Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
- Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
- Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
- Document and conduct trainings/webinars for NCCL
- Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure, and support
Requirements:
- B.S./M.S. degree in CS/CE (or equivalent experience) and 5+ years of relevant experience
- Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
- Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design
- Experience working with engineering or academic research communities supporting HPC or AI
- Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control
- Expert in Linux fundamentals and a scripting language, preferably Python
- Familiar with containers, cloud provisioning, and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
- Adaptability and passion to learn new areas and tools
- Flexibility to work and communicate effectively across different teams and time zones
Preferred Qualifications:
- Experience conducting performance benchmarking and developing infrastructure on HPC clusters
- Prior system administration experience, especially for large clusters
- Experience debugging network configuration issues in large-scale deployments
- Familiarity with CUDA programming and/or GPUs
- Good understanding of Machine Learning concepts and experience with Deep Learning Frameworks such as PyTorch, TensorFlow
The base salary range is 148,000 USD - 276,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.
-
Senior HPC Systems Engineer
2 weeks ago
Santa Clara, California, United States NVIDIA Full time
About NVIDIA
NVIDIA is a leader in the technology world, renowned for its innovative products and services. As a pioneer in the field of accelerated computing, NVIDIA has been transforming computer graphics, PC gaming, and AI for over 25 years.
Job Summary
We are seeking an exceptional Senior HPC Systems Engineer to join our team. As a key player in our...
-
Senior HPC Systems Engineer
1 month ago
Santa Clara, California, United States NVIDIA Full time
About NVIDIA
NVIDIA is a leader in the field of computer graphics, PC gaming, and accelerated computing. With a legacy of innovation spanning over 25 years, we're committed to pushing the boundaries of what's possible with AI and GPU computing.
Job Summary
We're seeking an exceptional Senior HPC Systems Engineer to join our team. As a key player in our AI...
-
Senior HPC Systems Engineer
4 weeks ago
Santa Clara, California, United States NVIDIA Full time
About NVIDIA
NVIDIA has been a pioneer in computer graphics, PC gaming, and accelerated computing for over 25 years. Our legacy of innovation is fueled by great technology and amazing people. Today, we're pushing the boundaries of AI to define the next era of computing.
Job Summary
We're seeking an exceptional Senior HPC Systems Engineer to join our team. As a...
-
Senior Principal Software Developer
3 weeks ago
Santa Clara, California, United States Oracle Corporation Full time
Unlock the Power of AI and ML with Oracle Cloud Infrastructure
Oracle Cloud Infrastructure is revolutionizing the way we approach AI and ML workloads. As a Senior Principal Software Developer, you will be part of a team that designs and develops ultra-high performance networks required to support these workloads.
About the Role
We are seeking a highly skilled...
-
Senior Principal Software Engineer
4 weeks ago
Santa Clara, California, United States Oracle Full time
Cloud Engineering Infrastructure Development
Oracle Cloud Infrastructure (OCI) Cluster Networking team is building an ultra-high performance network required to support AI/ML/HPC workloads. This is an exciting opportunity to join the AI revolution and design systems that allow customers to scale from tens to thousands of GPUs without compromising on...
-
Senior HPC Systems Engineer
2 weeks ago
Santa Clara, California, United States NVIDIA Full time
About NVIDIA
NVIDIA has been a pioneer in computer graphics, PC gaming, and accelerated computing for over 25 years. Our legacy of innovation is fueled by great technology and amazing people. Today, we're pushing the boundaries of AI to define the next era of computing.
Job Summary
We're seeking an exceptional Senior HPC Systems Engineer to join our team. As a...
-
Senior Systems Software Engineer
2 days ago
Santa Clara, California, United States NVIDIA Full time
We are seeking a Senior Systems Software Engineer to join our TAO Toolkit Team, where you will be responsible for developing novel, scalable, and automated pipelines to make sense of petabytes of unstructured data. You will collaborate with multiple deep-learning architects and engineers to enable the development of pioneering AI models.
Key Responsibilities:...
-
Senior System Software and Firmware Engineer
4 hours ago
Santa Clara, California, United States NVIDIA Full time
We are seeking a highly skilled Senior System Software and Firmware Engineer to join our team at NVIDIA. As a key member of our engineering team, you will be responsible for designing, implementing, and verifying system software and firmware for our next-generation System on Chip (SoC) products.
Key Responsibilities:
- Architect and design system software and...
-
Software Engineering Manager
2 hours ago
Santa Clara, California, United States NVIDIA Full time
Job Description
We are seeking a highly skilled Technical Lead to manage our GPU Communications Libraries and Networking team at NVIDIA. As a key member of our team, you will be responsible for leading, mentoring, and growing your library engineering team, as well as planning and executing projects to ensure the quality and performance of our libraries.
Key...
-
Senior Software Engineer
1 month ago
Santa Clara, California, United States Qualcomm Full time
Job Summary
We are seeking a highly skilled Senior Engineer to join our Systems Engineering team at Qualcomm. As a Senior Engineer, you will play a key role in researching, designing, developing, and optimizing systems-level software, hardware, architecture, algorithms, and machine learning solutions that enable cutting-edge technology in the AI/ML field.
Key...
-
Senior System Software Engineer
3 hours ago
Santa Clara, California, United States NVIDIA Full time
NVIDIA is seeking a talented software engineer to join our Solutions Engineering team and contribute to the development of our autonomous vehicle platform.
You will work closely with experts in Deep Learning, Computer Vision, and vehicle control to design, develop, and implement software and systems that will revolutionize the automotive industry.
The ideal...
-
Santa Clara, California, United States AMD Full time
Transforming Lives with AMD Technology
We are a team of innovators at AMD, driven by a passion to transform lives with our technology. Our mission is to build exceptional products that accelerate next-generation computing experiences, serving as the cornerstone for enterprise Data Centers, Artificial Intelligence, HPC, and Embedded systems.
The Role
We are...
-
Senior Systems Engineer
4 weeks ago
Santa Clara, California, United States Qualcomm Full time
Job Title: Senior Systems Engineer
We are seeking a highly skilled Senior Systems Engineer to join our team at Qualcomm. As a Senior Systems Engineer, you will be responsible for designing and implementing advanced signal-processing algorithms for Wireless LAN (WLAN/Wi-Fi) communications systems.
Key Responsibilities:
- Apply systems knowledge and experience to...
-
Senior AI Optimization Engineer
3 days ago
Santa Clara, California, United States Advanced Micro Devices, Inc. Full time
Transforming Lives with AMD Technology
We are a team of innovators at Advanced Micro Devices, Inc. who are passionate about transforming lives with our technology. Our mission is to build great products that accelerate next-generation computing experiences, driving the evolution of computing experiences for enterprise Data Centers, Artificial Intelligence,...
-
Senior Software Engineer
3 weeks ago
Santa Clara, California, United States ServiceNow Full time
Job Title: Senior Staff Software Engineer
At ServiceNow, we're looking for a highly skilled Senior Staff Software Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, developing, and delivering high-quality software solutions that meet the needs of our customers.
Key Responsibilities:
- Design and develop...
-
Santa Clara, California, United States NVIDIA Full time
Job Title: Senior System Software Engineer, Infrastructure Automation
We are seeking a highly skilled Senior System Software Engineer to join our team at NVIDIA. As a key member of our GPU-accelerated deep learning software team, you will be responsible for designing and implementing infrastructure solutions for our Triton Inference Server.
Our team is...
-
Senior System Software Engineer, CUDA
2 days ago
Santa Clara, California, United States NVIDIA Full time
Job Description
NVIDIA is seeking a highly skilled Senior System Software Engineer to join our team. As a key member of our CUDA Driver team, you will be responsible for designing, developing, and delivering high-quality software solutions for accelerating general-purpose computation on the GPU.
Key Responsibilities:
- Design and implement new features for the...
-
Senior Software Engineer
3 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Title: Senior Software Engineer
Palo Alto Networks is seeking a highly skilled Senior Software Engineer to join our App Acceleration team. As a key member of our engineering team, you will be responsible for designing, developing, and implementing highly scalable software features.
Our team is passionate about building innovative products that shape the...
-
Senior System Software Engineer Platform
2 days ago
Santa Clara, California, United States NVIDIA Full time
NVIDIA is a leader in the field of artificial intelligence and computing. We are seeking a highly skilled Senior System Software Engineer Platform to join our team.
As a Senior System Software Engineer Platform, you will be responsible for designing and implementing microcontroller firmware for GPU Server platforms. This will involve developing C/C++ server...
-
Software Development Engineer
2 days ago
Santa Clara, California, United States Selector Software Full time
Job Overview
Selector Software is seeking a skilled Software Development Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, developing, and deploying scalable cloud-based systems.
Key Responsibilities:
- Design and implement cloud-based systems using Python and Golang
- Develop REST APIs and microservices for...