GPU Cluster Performance Engineer
2 weeks ago
Overview:
WHAT YOU DO AT AMD CHANGES EVERYTHING
We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences: the building blocks for the data center, artificial intelligence, PCs, gaming, and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.
AMD together we advance_
Responsibilities:
THE ROLE:
We are seeking a highly motivated and skilled GPU Cluster Performance Attainment Engineer to join our dynamic team. In this role, you will be at the forefront of optimizing and achieving peak performance for GPU clusters. The ideal candidate will have a strong background in GPU architectures, parallel computing, and hands-on experience in system-level performance tuning and debug methodologies. The team fosters and encourages continuous technical innovation to showcase successes as well as facilitate continuous career development.
KEY RESPONSIBILITIES:
- Performance Optimization: Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as RDMA throughput, latency, and collective communications.
- Benchmarking and Analysis: Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments.
- Scalability Testing: Evaluate the scalability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across different cluster sizes, configurations, and networking technologies (InfiniBand and RoCE).
- Performance Profiling: Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.
- Performance Tuning: Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.
- Documentation: Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.
- Collaboration: Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.
- Continuous Learning: Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance.
PREFERRED EXPERIENCE:
- Proven experience in optimizing the performance of GPU clusters.
- Strong understanding of GPU architectures, parallel computing concepts, and network protocols.
- Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.
- Experience with system-level performance analysis tools and methodologies for GPU clusters.
- Analytical mindset with excellent problem-solving and debugging skills.
- Familiarity with cluster management tools and systems.
- Excellent communication and collaboration skills for effective teamwork.
- RDMA network configuration, troubleshooting, and performance tuning.
- Linux kernel networking expertise.
- Machine learning and/or HPC system design.
ACADEMIC CREDENTIALS:
Bachelor's or Master's degree in computer science or equivalent experience.
Qualifications:
At AMD, your base pay is one part of your total rewards package. Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. You may be eligible for incentives based upon your role, such as either an annual bonus or sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD's Employee Stock Purchase Plan. You'll also be eligible for competitive benefits described in more detail here.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.