Software Engineering Manager, AI Networking

7 hours ago


Menlo Park, United States META Full time

Summary:

In this role, you will be a member of the Network AI Software team, part of the larger DC networking organization. The team develops and owns the software stack around collective communication libraries at Meta. At a high level, the team aims to enable Meta-wide ML products and innovations to leverage our large-scale training and inference fleet through an observable, reliable, and high-performance distributed AI communication stack. Currently, one of the team’s focuses is building customized features, SW benchmarks, performance tuners, and SW stacks around PyTorch to improve full-stack distributed ML reliability and performance (e.g., large-scale GenAI/LLM training) from the trainer down to the network communication layer. We are seeking leaders to work on GenAI/LLM scaling reliability and performance.

Required Skills:

Software Engineering Manager, AI Networking Responsibilities:

  1. Help define the technical roadmap for the team, drive execution of associated tasks, and support the team in resolving dependencies

  2. Collaborate effectively with other groups such as Hardware, Infrastructure, Operations

  3. Interact with external partners as needed in resolving dependencies associated with objectives

  4. Guide and help team members develop appropriate skill sets to grow in their careers and, where necessary, address underperformance

  5. Communicate cross-functionally and drive engineering efforts

Minimum Qualifications:

  1. BS or MS in Computer Science or related technical discipline or equivalent experience

  2. 2+ years of experience managing a networking-related Software Engineering team

  3. Working knowledge of a network transport stack such as RoCE (RDMA)

  4. Experience with software development for Distributed and Embedded systems

  5. Experience recruiting and managing Software Engineers

Preferred Qualifications:

  1. Experience with NCCL and distributed GPU reliability/performance improvement on RoCE/InfiniBand

  2. Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow

  3. Knowledge of ML, deep learning, and LLMs

Public Compensation:

Salary range not available in the listing; compensation includes bonus + equity + benefits.

Industry: Internet

Equal Opportunity:

Meta is proud to be an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender, gender identity, gender expression, transgender status, sexual stereotypes, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. We also consider qualified applicants with criminal histories, consistent with applicable federal, state and local law. Meta participates in the E-Verify program in certain locations, as required by law. Please note that Meta may leverage artificial intelligence and machine learning technologies in connection with applications for employment.

Meta is committed to providing reasonable accommodations for candidates with disabilities in our recruiting process. If you need any assistance or accommodations due to a disability, please let us know at accommodations-ext@fb.com.



  • Menlo Park, California, United States META Full time

    Job Summary: In this role, you will be a key member of the Network AI Software team, part of the larger DC networking organization at Meta. The team is responsible for developing and owning the software stack around collective communication libraries. The team's primary goal is to enable Meta-wide ML products and innovations to leverage our large-scale...


  • Menlo Park, United States META Full time

    Summary: The Host Networking team is responsible for all aspects of networking specific to servers, including networking applications, network transport and analytics, and NICs. The team is increasingly focused on building high-performance network solutions for our AI workloads. We are looking for a manager who will lead the group developing network drivers and...


  • Menlo Park, United States META Full time

    Summary: The MTIA (Meta Training & Inference Accelerator) Software team has been developing a comprehensive AI Compiler strategy and optimizing compiler toolchains. This enables training and inference of Meta’s production DL/ML workloads on the specialized MTIA AI accelerator hardware in a highly performant and flexible way. We are looking for a Software...


  • Menlo Park, California, United States META Full time

    Job Summary: Meta's AI Training and Inference Infrastructure is rapidly expanding to support the increasing use of AI. This growth presents a significant scaling challenge that our engineers must address daily. We need to design and evolve our network infrastructure to connect numerous GPUs together efficiently. To improve performance, we continuously look for...


  • Menlo Park, California, United States META Full time

    Job Summary: The Meta AI Compiler Software team is seeking a Software Engineering Manager to lead the development and optimization of compiler toolchains for Meta's production DL/ML workloads on the MTIA AI accelerator hardware. The ideal candidate will have experience with compiler architecture, development, and management, as well as a strong understanding...


  • Menlo Park, United States Meta Inc Full time

    Summary: The MTIA (Meta Training & Inference Accelerator) Software team is part of AI Infra PyTorch org. The team’s mission is to explore, develop and help productize high-performance software and hardware technologies for AI at datacenter scale. The team co-optimizes both SW (e.g., algorithms and numerics) and HW (e.g., platform and network) to come up...


  • Menlo Park, United States META Full time

    Summary: Meta is seeking a Technical Program Manager (TPM) experienced in managing large-scale AI cluster design, development, and deployment. This position will work with cross-functional teams in Meta’s Infrastructure organization to build large-scale AI clusters that enable Meta’s AI applications and use cases. This position would focus on creating...


  • Menlo Park, United States OSI Engineering Full time

    Job Overview: We are looking for an experienced Staff/Principal Engineer to lead the development of AI capabilities. As the technical lead, you will focus on architecting and building high-quality front-end solutions while collaborating closely with platform engineers working on the AI infrastructure as well as senior product managers to create innovative...


  • Menlo Park, United States META Full time

    Summary: Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to...


  • Menlo, Georgia, United States Quicken Full time

    Job Title: Principal Software Engineer - AI/ML Solutions. Quicken is a leading provider of personal finance management software, committed to helping individuals achieve financial stability. We're seeking an experienced Principal Software Engineer to lead the development of AI-driven capabilities within our products. Responsibilities: Architect and develop...


  • Menlo Park, United States OSI Engineering, Inc. Full time

    Job Overview: We are looking for an experienced Staff/Principal Engineer to lead the development of AI capabilities. As the technical lead, you will focus on architecting and building high-quality front-end solutions while collaborating closely with platform engineers working on the AI infrastructure as well as senior product managers to create innovative...

  • Infra Hardware TPM

    1 week ago


    Menlo Park, United States META Full time

    Summary: Meta is seeking a Technical Program Manager (TPM) experienced in managing large-scale AI cluster design, development, and deployment. This position will work with cross-functional teams in Meta’s Infrastructure organization to build large-scale AI clusters that enable Meta’s AI applications and use cases. This position would focus on creating...


  • Menlo, Georgia, United States OSI Engineering Full time

    Job Overview: We are seeking a highly skilled Staff/Principal Engineer to lead the development of AI capabilities. As the technical lead, you will focus on architecting and building high-quality front-end solutions while collaborating closely with platform engineers working on the AI infrastructure and senior product managers to create innovative customer...


  • menlo, United States OSI Engineering Full time

    Job Overview: We are looking for an experienced Staff/Principal Engineer to lead the development of AI capabilities. As the technical lead, you will focus on architecting and building high-quality front-end solutions while collaborating closely with platform engineers working on the AI infrastructure as well as senior product managers to create innovative...

