GPU Cluster Performance Optimization Engineer

4 days ago


Santa Clara, California, United States Advanced Micro Devices, Inc. Full time
Job Title: GPU Cluster Performance Engineer

We are seeking a highly skilled and motivated GPU Cluster Performance Engineer to join our dynamic team at Advanced Micro Devices, Inc. (AMD). As a key member of our team, you will be responsible for tuning GPU clusters to achieve peak performance.

Key Responsibilities:
  • Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as RDMA throughput, latency, and collective communications.
  • Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments.
  • Evaluate the scalability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across different cluster sizes, configurations, and networking technologies (IB & RoCE).
  • Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.
  • Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.
  • Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.
  • Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.
  • Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance.
Preferred Experience:
  • Proven experience in optimizing the performance of GPU clusters.
  • Strong understanding of GPU architectures, parallel computing concepts, and network protocols.
  • Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.
  • Experience with system-level performance analysis tools and methodologies for GPU clusters.
  • Analytical mindset with excellent problem-solving and debugging skills.
  • Familiarity with cluster management tools and systems.
  • Excellent communication and collaboration skills for effective teamwork.
  • Experience with RDMA network configuration, troubleshooting, and performance tuning.
  • Linux kernel networking expertise.
  • Experience with machine learning and/or HPC system design.
Academic Credentials:
  • Bachelor's or Master's degree in computer science, or equivalent experience.


  • Santa Clara, California, United States Advanced Micro Devices, Inc. Full time

    GPU Cluster Performance Engineer: We are seeking a highly motivated and skilled GPU Cluster Performance Engineer to join our dynamic team at Advanced Micro Devices, Inc. In this role, you will be at the forefront of optimizing and achieving peak performance for GPU clusters. The ideal candidate will have a strong background in GPU architectures, parallel...


  • Santa Clara, California, United States Advanced Micro Devices, Inc. Full time

    GPU Cluster Performance Engineer: At Advanced Micro Devices, Inc., we're pushing the boundaries of innovation to solve the world's most complex challenges. We're seeking a highly skilled GPU Cluster Performance Engineer to join our dynamic team. Key Responsibilities: Performance Optimization: Collaborate with hardware and software teams to enhance the overall...


  • Santa Clara, California, United States Advanced Micro Devices, Inc. Full time

    About the Role: We are seeking a highly motivated and experienced GPU Performance Optimization Engineer to join our team at Advanced Micro Devices, Inc. (AMD). As a key member of our datacenter GPU platform performance team, you will be responsible for ensuring that our GPU-accelerated systems operate at peak performance, enabling our customers to solve the...


  • Santa Clara, California, United States Apple Full time

    About the Role: We are seeking a highly skilled Performance Engineer to join our team at Apple, where you will play a critical role in optimizing the performance of our latest Apple Silicon GPUs. As a key member of our team, you will work closely with engineers from driver, framework, hardware, and architecture teams to identify and resolve performance...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA has been at the forefront of innovation for over two decades. Our creation of the GPU in 1999 not only propelled the PC gaming industry but also transformed modern graphics and parallel computing. Recently, the advent of GPU deep learning has ushered in a new era of artificial intelligence — a pivotal moment in computing history. At NVIDIA, we pride...


  • Santa Clara, California, United States Apple Full time

    About the Role: We are seeking a highly skilled Performance Engineer to join our team at Apple. As a key member of our GPU, Graphics, and Display Software team, you will play a critical role in optimizing the performance of our latest Apple Silicon GPUs. Key Responsibilities: Analyze workloads to identify hardware issues and software bottlenecks. Collaborate with...


  • Santa Clara, California, United States NVIDIA Full time

    Job Title: Senior Performance Optimization Engineer. We are seeking a highly skilled Senior Performance Optimization Engineer to join our AI Applications organization at NVIDIA. As a key member of our team, you will be responsible for optimizing the performance of our distributed cloud native accelerated video analytics applications. Our team is building...


  • Santa Clara, California, United States NVIDIA Full time

    About the Role: We are seeking a highly skilled performance engineer to join our AI Applications organization at NVIDIA. As a performance engineer, you will work closely with our application teams to optimize the performance of our distributed cloud native accelerated video analytics applications. Key Responsibilities: Plan, enable, and drive performance...

  • Principal Engineer

    3 days ago


    Santa Clara, California, United States NVIDIA Full time

    Job Title: Principal Engineer - Performance Optimization. We are seeking a highly skilled Principal Engineer to join our AI Applications organization at NVIDIA. As a key member of our team, you will be responsible for optimizing the performance of our distributed cloud native accelerated video analytics applications. Our team is building cutting-edge...


  • Santa Clara, California, United States NVIDIA Full time

    About NVIDIA: NVIDIA is a leader in the field of artificial intelligence, deep learning, and autonomous vehicles. Our engineering teams are working on cutting-edge technologies that are transforming the world. Job Summary: We are seeking a highly skilled software engineer to join our team as a Performance Engineer. In this role, you will be responsible for...


  • Santa Clara, California, United States Apple Full time

    About the Role: We are seeking a highly skilled Performance Engineer to join our GPU, Graphics, and Display Software team at Apple. As a key member of our team, you will play a critical role in achieving excellent performance of our latest Apple Silicon GPUs. Key Responsibilities: Analyze workloads to identify hardware issues and software bottlenecks. Collaborate...


  • Santa Clara, California, United States NVIDIA Full time

    Unlock the Power of AI with NVIDIA: NVIDIA is seeking a talented software engineer to join our team and contribute to the development of cutting-edge AI technologies. As a Performance Optimization Engineer, you will play a critical role in optimizing the performance of Deep Learning models for NVIDIA GPUs and systems. Key Responsibilities: Optimize the...


  • Santa Clara, California, United States NVIDIA Full time

    Join NVIDIA's Ambitious Team as a Performance Engineer: NVIDIA is seeking a talented software engineer to join our team and contribute to the development of cutting-edge AI and Deep Learning technologies. As a Performance Engineer, you will play a crucial role in optimizing the performance of Deep Learning models for NVIDIA GPUs and systems. Key...


  • Santa Clara, California, United States Sustainable Talent Full time

    Unlock the Power of HPC: Sustainable Talent is seeking a seasoned HPC Cluster Engineer to join our team in shaping the future of AI, deep learning, and machine learning initiatives. As a key player in our Nvidia-powered HPC environment, you'll leverage cutting-edge GPU technology to drive groundbreaking discoveries and revolutionize industries. With over 25...


  • Santa Clara, California, United States Nvidia Full time

    Join NVIDIA's Team of Performance Engineers: NVIDIA is seeking highly skilled performance engineers to join our team and contribute to the development of cutting-edge AI and deep learning technologies. As a performance engineer at NVIDIA, you will work closely with our software development teams to optimize the performance of our AI and deep learning...


  • Santa Clara, California, United States Sustainable Talent Full time

    Unlock the Power of HPC: Sustainable Talent is seeking a seasoned HPC Cluster Engineer to join our team in shaping the future of AI, deep learning, and machine learning initiatives. As a key player in our Nvidia-powered HPC environment, you'll leverage cutting-edge GPU technology to drive groundbreaking discoveries and revolutionize industries. As a trusted...


  • Santa Clara, California, United States Apple Full time

    About the Role: We are seeking a highly skilled Performance Engineer to join our team at Apple. As a key member of our GPU, Graphics, and Display Software team, you will play a critical role in developing and optimizing the graphics software foundation for our innovative products. Key Responsibilities: Analyze performance of 3D applications and games to identify...


  • Santa Clara, California, United States Apple Full time

    GPU Performance Modeling Engineer: We are seeking a highly skilled and motivated engineer to join our Platform Architecture GPU Performance Modeling Team. As a GPU Performance Modeling Engineer, you will be responsible for developing and maintaining GPU performance models from the shader core up to the full system. Key Responsibilities: Develop and maintain GPU...


  • Santa Clara, California, United States NVIDIA Full time

    Job Description: NVIDIA is seeking a highly skilled Senior High Performance Computing Cluster Administrator to lead a diverse cluster of GPU-accelerated systems and provide architectural mentorship to product teams in the deep learning and scientific computing domains. Key Responsibilities: Administer Linux systems, ranging from powerful DGX servers to embedded...


  • Santa Clara, California, United States Apple Full time

    About the Role: We are seeking a highly skilled GPU Performance Analysis Engineer to join our team at Apple. As a key member of our Silicon Engineering Group, you will play a critical role in designing and manufacturing our next-generation, high-performance, power-efficient GPU. Key Responsibilities: Analyze unit and system-level performance issues to ensure our...