GPU Cluster System/Network Engineer

3 weeks ago


Santa Clara, United States Advanced Micro Devices, Inc. Full time

WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences - the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.

AMD together we advance_

THE TEAM:

AMD's Data Center GPU organization is transforming the industry with our AI-based graphics processors. Our primary objective is to design exceptional products that drive the evolution of computing experiences, serving as the cornerstone for enterprise data centers, artificial intelligence (AI), HPC, and embedded systems. If this resonates with you, come and join our Data Center GPU organization, where we are building amazing AI-powered products with amazing people.

THE ROLE:

We are seeking a highly motivated and skilled GPU Cluster System/Network Engineer to join our dynamic team. In this role, you will be at the forefront of optimizing and achieving peak performance for GPU clusters. The ideal candidate will have a strong background in GPU architectures and parallel computing, along with hands-on experience in system-level performance tuning and debug methodologies. The team fosters and encourages continuous technical innovation, both to showcase successes and to facilitate continuous career development.

THE PERSON:

The Cluster System/Network Engineer plays a critical role in shaping the future of AI/ML training and inferencing systems as they move into the Ethernet era. This individual will collaborate with a broad range of internal and external partners, including NIC, Switch, and Software Enablement teams, to integrate state-of-the-art technology solutions that pave the way for Ethernet to be used as a viable network technology for the GPU-to-GPU communication required during AI inferencing and training.

KEY RESPONSIBILITIES:

  • Performance Optimization: Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as RDMA throughput, latency, and collective communications
  • Benchmarking and Analysis: Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments
  • Scalability Testing: Evaluate the scalability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across different cluster sizes, configurations, and networking technologies (InfiniBand and RoCE)
  • Performance Profiling: Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement
  • Performance Tuning: Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations
  • Documentation: Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders
  • Collaboration: Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture
  • Continuous Learning: Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance

PREFERRED EXPERIENCE:

  • Proven experience in optimizing the performance of GPU clusters
  • Strong understanding of GPU architectures, parallel computing concepts, and network protocols
  • Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis
  • Experience with system-level performance analysis tools and methodologies for GPU clusters
  • Analytical mindset with excellent problem-solving and debug skills
  • Familiarity with cluster management tools and systems
  • Excellent communication and collaboration skills for effective teamwork
  • RDMA network configuration, troubleshooting and performance tuning
  • Linux kernel networking expertise
  • Machine learning and/or HPC system design

ACADEMIC CREDENTIALS:

Bachelor's or Master's degree in computer science, or equivalent experience


At AMD, your base pay is one part of your total rewards package. Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. Depending on your role, you may be eligible for incentives such as an annual bonus or a sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD's Employee Stock Purchase Plan. You'll also be eligible for competitive benefits.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

