RDMA Ops Engineer

4 weeks ago


Sunnyvale, United States Alibaba Cloud Full time

Overview

We're seeking a skilled RDMA Ops Engineer to optimize and maintain high-performance networking infrastructure for our computing clusters. This role focuses on building and operating ultra-low latency, high-throughput networks using RDMA technologies to power next-generation computing workloads.

Responsibilities

• Deploy, operate and maintain RDMA-based network architectures (RoCE/InfiniBand) for clusters with thousands of nodes
• Optimize network performance for distributed collective communication workloads (NCCL, MPI, etc.)
• Solve complex network issues in distributed collective communication (e.g., NCCL/MPI communication bottlenecks)
• Use automation tools for network provisioning, monitoring, diagnostics, and network performance profiling (latency/throughput analysis)
• Implement CI/CD pipelines for network infrastructure-as-code
• Manage the end-to-end network lifecycle: deployment, configuration, monitoring, upgrades
• Collaborate with computing algorithm engineers to troubleshoot network-related bottlenecks in training/inference pipelines
• Bridge computing framework requirements with underlying network infrastructure capabilities
• Ensure compliance with security and scalability requirements

Qualifications

• Strong scripting skills (Python/Go/Bash) for operational automation
• Expert-level RDMA operational experience (RoCEv2/InfiniBand)
• Understanding of Linux internals (kernel bypass, syscall optimization, etc.) and proficiency in Linux network stack tuning (irqbalance, NUMA, hugepages)
• Hands-on experience with RDMA/DPDK performance tuning
• Strong knowledge of network protocols (TCP/IP, RoCEv2) and NIC architecture principles
• Ability to abstract complex technical concepts into architectural diagrams
• Proven track record of translating R&D innovations into production solutions
• Strong communication skills for cross-functional collaboration with computing researchers and SRE teams
• Experience managing production computing networks
• Familiarity with Kubernetes networking (CNI, Multus, SR-IOV) and GPU-aware scheduling
• Background in computing system optimization (NVIDIA collective libraries, MPI tuning)
• Deep understanding of computing workload patterns and their network implications

Compensation and Employment

The pay range for this position at commencement of employment is expected to be between $104,400 and $171,000/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If hired, the employee will be in an “at-will position” and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.


  • RDMA Ops Engineer

    4 days ago


    Sunnyvale, CA, United States Alibaba Cloud Full time

    We're seeking a skilled RDMA Ops Engineer to optimize and maintain high-performance networking infrastructure for our computing clusters. This role focuses on building and operating ultra-low latency, high-throughput networks using RDMA technologies to power next-generation computing workloads. Key Responsibilities: • Deploy, operate and maintain RDMA-based...


  • Sunnyvale, United States Institute of Foundation Models Full time

    A dedicated research lab is seeking a Network Engineer to design and optimize low-latency, high-bandwidth networking solutions for AI supercomputing clusters. You will work on cutting-edge technologies in collaboration with world-class researchers. The ideal candidate has strong experience with NVIDIA RDMA technologies, networking protocols, and Kubernetes....


  • Sunnyvale, United States Institute of Foundation Models Full time

    About the Institute of Foundation Models: We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy. As part of our team, you'll have the opportunity to work on the...


  • Sunnyvale, CA, United States Institute of Foundation Models Full time

    About the Institute of Foundation Models We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy. As part of our team, you'll have the opportunity to work on the...


  • Sunnyvale, United States CMK Resources, Inc. Full time

    CMK Resources is partnering with a fast-scaling AI cloud platform on a high-impact, confidential search. This team is solving cutting-edge infrastructure challenges to support massive-scale AI and HPC workloads. They are urgently seeking an experienced Staff/Sr. Staff+ Network Engineer to lead architecture and design of next-generation networking...


  • Sunnyvale, CA, United States Apple Full time

    Weekly Hours: 40. Role Number: 200633188-3956. Summary: As a Network Systems Integration Engineer, you will build and maintain the critical infrastructure that enables our hardware innovation. Your primary mission is to provide robust, hands-on support for our data center development labs, ensuring our electrical, validation, and cross-functional hardware...


  • Sunnyvale, United States Apple Inc. Full time

    Network Systems Integration Engineer - Data Center Hardware. As a Network Systems Integration Engineer, you will build and maintain the critical infrastructure that enables our hardware innovation. Your primary mission is to provide robust, hands‑on support for our data center development labs, ensuring our electrical, validation, and cross‑functional...


  • Sunnyvale, United States QFocus Technologies LLC Full time

    Domain: Embedded software, network protocol implementation. Description: Join our team as a Network Driver Developer and play a critical role in developing high-performance Network Interface Cards (NICs) with expertise in L2/L3 protocols, RDMA, and RoCE. Key Responsibilities: Develop, validate and maintain NIC drivers for Linux kernel and other operating...


  • Sunnyvale, United States Google Inc. Full time

    A leading technology company in Sunnyvale is seeking a Staff Software Engineer specializing in embedded systems and networking. The role involves providing technical leadership, managing project priorities, and developing high-scale software solutions. Ideal candidates will have extensive experience in software development and embedded operating systems,...