GPU Cluster Performance Engineer

2 months ago


Santa Clara, United States Advanced Micro Devices, Inc. Full time

Overview:

WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences, the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.

AMD together we advance_

Responsibilities:

THE ROLE:

We are seeking a highly motivated and skilled GPU Cluster Performance Attainment Engineer to join our dynamic team. In this role, you will be at the forefront of optimizing and achieving peak performance for GPU clusters. The ideal candidate will have a strong background in GPU architectures, parallel computing, and hands-on experience in system-level performance tuning and debug methodologies. The team fosters and encourages continuous technical innovation to showcase successes as well as facilitate continuous career development.

KEY RESPONSIBILITIES:

  • Performance Optimization: Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as RDMA throughput, latency, and collective communications.
  • Benchmarking and Analysis: Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments.
  • Scalability Testing: Evaluate the scalability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across different cluster sizes, configurations, and networking technologies (IB and RoCE).
  • Performance Profiling: Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.
  • Performance Tuning: Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.
  • Documentation: Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.
  • Collaboration: Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.
  • Continuous Learning: Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance.

PREFERRED EXPERIENCE:

  • Proven experience in optimizing the performance of GPU clusters.
  • Strong understanding of GPU architectures, parallel computing concepts, and network protocols.
  • Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.
  • Experience with system-level performance analysis tools and methodologies for GPU clusters.
  • Analytical mindset with excellent problem-solving and debug skills.
  • Familiarity with cluster management tools and systems.
  • Excellent communication and collaboration skills for effective teamwork.
  • RDMA network configuration, troubleshooting, and performance tuning.
  • Linux kernel networking expertise.
  • Machine learning and/or HPC system design.

ACADEMIC CREDENTIALS:

Bachelor's or Master's degree in computer science, or equivalent experience.


#LI-RW1

#LI-HYBRID

Qualifications:

At AMD, your base pay is one part of your total rewards package. Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. You may be eligible for incentives based upon your role, such as either an annual bonus or sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD's Employee Stock Purchase Plan. You'll also be eligible for competitive benefits described in more detail here.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

