GPU Cluster Performance Optimization Specialist

5 days ago


Santa Clara, California, United States | Advanced Micro Devices, Inc. | Full time
GPU Cluster Performance Engineer

We are seeking a highly motivated and skilled GPU Cluster Performance Engineer to join our dynamic team. In this role, you will be at the forefront of optimizing GPU clusters for peak performance.

Key Responsibilities:
  • Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as RDMA throughput, latency, and collective communications.
  • Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments.
  • Evaluate the scalability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across different cluster sizes, configurations, and networking technologies such as InfiniBand (IB) and RDMA over Converged Ethernet (RoCE).
  • Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.
  • Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.
  • Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.
  • Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.
  • Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance.
Preferred Experience:
  • Proven experience in optimizing the performance of GPU clusters.
  • Strong understanding of GPU architectures, parallel computing concepts, and network protocols.
  • Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.
  • Experience with system-level performance analysis tools and methodologies for GPU clusters.
  • Analytical mindset with excellent problem-solving and debugging skills.
  • Familiarity with cluster management tools and systems.
  • Excellent communication and collaboration skills for effective teamwork.
  • Experience with RDMA network configuration, troubleshooting, and performance tuning.
  • Expertise in Linux kernel networking.
  • Experience in machine learning and/or HPC system design.
Academic Credentials:
  • Bachelor's or Master's degree in computer science, or equivalent experience

At AMD, your base pay is one part of your total rewards package. Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. Depending on your role, you may be eligible for incentives such as an annual bonus or a sales incentive. Many AMD employees have the opportunity to own shares of AMD stock and, by voluntarily participating in AMD's Employee Stock Purchase Plan, to purchase AMD stock at a discount. You'll also be eligible for competitive benefits.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law.

We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

Similar Jobs:
  • Santa Clara, California, United States | Advanced Micro Devices, Inc. | Full time

    GPU Cluster Performance Engineer
    We are seeking a highly motivated and skilled GPU Cluster Performance Engineer to join our dynamic team at Advanced Micro Devices, Inc. In this role, you will be at the forefront of optimizing and achieving peak performance for GPU clusters. The ideal candidate will have a strong background in GPU architectures, parallel...


  • Santa Clara, California, United States | Advanced Micro Devices, Inc. | Full time

    GPU Cluster Performance Engineer
    At Advanced Micro Devices, Inc., we're pushing the boundaries of innovation to solve the world's most complex challenges. We're seeking a highly skilled GPU Cluster Performance Engineer to join our dynamic team. Key Responsibilities: Performance Optimization: Collaborate with hardware and software teams to enhance the overall...


  • Santa Clara, California, United States | Advanced Micro Devices, Inc. | Full time

    Job Title: GPU Cluster Performance Engineer
    We are seeking a highly skilled and motivated GPU Cluster Performance Engineer to join our dynamic team at Advanced Micro Devices, Inc. (AMD). As a key member of our team, you will be responsible for optimizing and achieving peak performance for GPU clusters. Key Responsibilities: Collaborate with hardware and...


  • Santa Clara, California, United States | Apple | Full time

    About the Role
    We are seeking a highly skilled Performance Engineer to join our team at Apple. As a key member of our GPU, Graphics, and Display Software team, you will play a critical role in optimizing the performance of our latest Apple Silicon GPUs. Key Responsibilities: Analyze workloads to identify hardware issues and software bottlenecks. Collaborate with...


  • Santa Clara, California, United States | Advanced Micro Devices, Inc. | Full time

    About the Role
    We are seeking a highly motivated and experienced GPU Performance Optimization Engineer to join our team at Advanced Micro Devices, Inc. (AMD). As a key member of our datacenter GPU platform performance team, you will be responsible for ensuring that our GPU-accelerated systems operate at peak performance, enabling our customers to solve the...


  • Santa Clara, California, United States | Apple | Full time

    About the Role
    We are seeking a highly skilled Performance Engineer to join our GPU, Graphics, and Display Software team at Apple. As a key member of our team, you will play a critical role in achieving excellent performance of our latest Apple Silicon GPUs. Key Responsibilities: Analyze workloads to identify hardware issues and software bottlenecks. Collaborate...


  • Santa Clara, California, United States | Apple | Full time

    About the Role
    We are seeking a highly skilled Performance Engineer to join our team at Apple, where you will play a critical role in optimizing the performance of our latest Apple Silicon GPUs. As a key member of our team, you will work closely with engineers from driver, framework, hardware, and architecture teams to identify and resolve performance...


  • Santa Clara, California, United States | NVIDIA | Full time

    About NVIDIA
    NVIDIA is a leader in the field of artificial intelligence, deep learning, and autonomous vehicles. Our team is passionate about building innovative software solutions that impact the world. Job Summary: We are seeking a highly skilled software engineer to join our team as a Performance Optimization Specialist. In this role, you will work with our...


  • Santa Clara, California, United States | NVIDIA | Full time

    About the Role
    We are seeking a highly skilled performance engineer to join our AI Applications organization at NVIDIA. As a performance engineer, you will work closely with our application teams to optimize the performance of our distributed cloud native accelerated video analytics applications. Key Responsibilities: Plan, enable, and drive performance...


  • Santa Clara, California, United States | NVIDIA | Full time

    Job Title: Senior Performance Optimization Engineer
    We are seeking a highly skilled Senior Performance Optimization Engineer to join our AI Applications organization at NVIDIA. As a key member of our team, you will be responsible for optimizing the performance of our distributed cloud native accelerated video analytics applications. Our team is building...


  • Principal Engineer

    2 weeks ago


    Santa Clara, California, United States | NVIDIA | Full time

    Job Title: Principal Engineer - Performance Optimization
    We are seeking a highly skilled Principal Engineer to join our AI Applications organization at NVIDIA. As a key member of our team, you will be responsible for optimizing the performance of our distributed cloud native accelerated video analytics applications. Our team is building cutting-edge...


  • Santa Clara, California, United States | Advanced Micro Devices, Inc. | Full time

    Unlock the Power of Datacenter GPU Performance
    At Advanced Micro Devices, Inc., we're pushing the boundaries of innovation to solve the world's most complex challenges. As a Datacenter GPU Platform Performance Engineer, you'll play a critical role in ensuring our Instinct GPU-accelerated systems operate at peak performance, empowering our customers to tackle...


  • Santa Clara, California, United States | NVIDIA | Full time

    Job Description
    NVIDIA is seeking a highly skilled Senior High Performance Computing Cluster Administrator to lead a diverse cluster of GPU-accelerated systems and provide architectural mentorship to product teams in the deep learning and scientific computing domains. Key Responsibilities: Administer Linux systems, ranging from powerful DGX servers to embedded...


  • Santa Clara, California, United States | NVIDIA | Full time

    NVIDIA Deep Learning Infrastructure Team
    We are seeking a highly skilled HPC cluster administrator to lead our diverse cluster of GPU-accelerated systems and provide architectural mentorship to product teams in the deep learning and scientific computing domains. Key Responsibilities: Administer Linux systems, including powerful DGX servers and embedded systems,...


  • Santa Clara, California, United States | NVIDIA | Full time

    About NVIDIA
    NVIDIA is a leader in the field of artificial intelligence, deep learning, and autonomous vehicles. Our engineering teams are working on cutting-edge technologies that are transforming the world. Job Summary: We are seeking a highly skilled software engineer to join our team as a Performance Engineer. In this role, you will be responsible for...


  • Santa Clara, California, United States | NVIDIA | Full time

    Unlock the Power of AI with NVIDIA
    NVIDIA is seeking a talented software engineer to join our team and contribute to the development of cutting-edge AI technologies. As a Performance Optimization Engineer, you will play a critical role in optimizing the performance of Deep Learning models for NVIDIA GPUs and systems. Key Responsibilities: Optimize the...

  • HPC Cluster Engineer

    2 weeks ago


    Santa Clara, California, United States | Sustainable Talent | Full time

    Unlock the Power of HPC
    Sustainable Talent is seeking a seasoned HPC Cluster Engineer to join our team in shaping the future of AI, deep learning, and machine learning initiatives. As a key player in our Nvidia-powered HPC environment, you'll leverage cutting-edge GPU technology to drive groundbreaking discoveries and revolutionize industries. With over 25...


  • Santa Clara, California, United States | NVIDIA | Full time

    Join NVIDIA's Ambitious Team as a Performance Engineer
    NVIDIA is seeking a talented software engineer to join our team and contribute to the development of cutting-edge AI and Deep Learning technologies. As a Performance Engineer, you will play a crucial role in optimizing the performance of Deep Learning models for NVIDIA GPUs and systems. Key...


  • Santa Clara, California, United States | Advanced Micro Devices, Inc. | Full time

    About the Role
    We are seeking a highly motivated and experienced Datacenter GPU Platform Performance Engineer to join our team at Advanced Micro Devices, Inc. This is an exciting opportunity to work on cutting-edge technology and contribute to the development of innovative solutions for the data center, artificial intelligence, and other emerging markets. Key...