Senior System Software Engineer, NCCL

2 hours ago


Santa Clara, California, United States NVIDIA Full time

NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High Performance Computing, and Visualization. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.

We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver communication runtimes like NCCL and NVSHMEM for Deep Learning and HPC applications. We are looking for a motivated Partner Enablement Engineer to guide our key partners and customers with NCCL.

Key Responsibilities:

  • Engage with our partners and customers to root cause functional and performance issues reported with NCCL
  • Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
  • Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
  • Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
  • Document and conduct trainings/webinars for NCCL
  • Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure, and support

Requirements:

  • B.S./M.S. degree in CS/CE, or equivalent experience, plus 5+ years of relevant experience
  • Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
  • Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design
  • Experience working with an engineering or academic research community supporting HPC or AI
  • Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control
  • Expertise in Linux fundamentals and a scripting language, preferably Python
  • Familiar with containers, cloud provisioning, and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
  • Adaptability and passion to learn new areas and tools
  • Flexibility to work and communicate effectively across different teams and time zones

Preferred Qualifications:

  • Experience conducting performance benchmarking and developing infrastructure on HPC clusters
  • Prior system administration experience, especially for large clusters
  • Experience debugging network configuration issues in large-scale deployments
  • Familiarity with CUDA programming and/or GPUs
  • Good understanding of Machine Learning concepts and experience with Deep Learning frameworks such as PyTorch and TensorFlow

The base salary range is 148,000 USD - 276,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.


