Production Systems Engineer, AI Systems

7 days ago


Menlo Park, California, United States Meta Full time

Summary:

Meta is seeking a Systems Engineer to join our Release to Production (RTP) team working on AI/ML initiatives supporting large-scale AI training and inference. Our servers and data centers are the foundation upon which our rapidly scaling infrastructure operates efficiently to deliver our innovative services. The RTP team is responsible for the end-to-end hardware lifecycle of all Meta servers, including prototyping of experimental hardware, pre-production hands-on system and hardware debugging and stress testing, and enabling production-ready system monitoring, automated provisioning, and automated remediation of issues. The RTP team also helps explore, develop, and productize high-performance software and hardware technologies for AI at data center scale.

RTP engineers work closely with a wide range of cross-functional (XFN) partners, e.g. HW/SW co-design teams, hardware designers, networking teams, system manufacturers, component vendors, capacity engineering, production engineering, production services, and data center operations teams, to enable new systems that will be deployed in our production data centers.

We are looking for a candidate who can support scale-up and scale-out network technologies (e.g. NICs) for Meta AI systems that are powering Meta's tremendous leaps in the AI space. The ideal candidate is knowledgeable about network technologies and has hands-on experience supporting them through at least a couple of hardware/software (firmware, driver) lifecycle phases: design and bring-up, server integration, system validation, customer deployment support, production issue triage, and rolling out new features in firmware and drivers.

Required Skills:

Production Systems Engineer, AI Systems Responsibilities:

  1. Support new AI platform introduction into the Meta fleet by driving scale-up (e.g. NVLink, XGMI) and scale-out (e.g. NICs) interface integration.
  2. Contribute to new feature/technology development/validation across hardware/software stack.
  3. Contribute to enabling hacks for future technology explorations in AI space such as memory, network and storage interdependencies in the context of AI workloads.
  4. Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues.
  5. Troubleshoot, diagnose, and root-cause system failures, isolating the components and failure scenarios while working with internal and external partners.
  6. Develop visibility through data visualization and implement systemic solutions to hardware health issues.
  7. Leverage production experience to drive external and internal teams to continuously improve product quality.

Minimum Qualifications:

  1. Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
  2. 4+ years of work experience in one or more domains such as: network ASIC development (silicon design, bring-up, or characterization), board development, network product deployment and customer support (switches, NICs), or interconnect technologies (e.g. optics, DAC).
  3. Knowledge of server architecture and components.
  4. Experience working with Linux.
  5. Knowledge of TCP/IP and experience using iperf.
  6. Hands-on troubleshooting and debugging experience.

Preferred Qualifications:

  1. Experience working with full server systems, including PCIe.
  2. Experience working with RDMA.
  3. Experience working with large-scale deployments.
  4. 2+ years of experience scripting automation in Python, PHP, or Perl.

Public Compensation:

$124,000/year to $191,000/year + bonus + equity + benefits

Industry: Internet

Equal Opportunity:

Meta is proud to be an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender, gender identity, gender expression, transgender status, sexual stereotypes, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. We also consider qualified applicants with criminal histories, consistent with applicable federal, state and local law. Meta participates in the E-Verify program in certain locations, as required by law. Please note that Meta may leverage artificial intelligence and machine learning technologies in connection with applications for employment.

Meta is committed to providing reasonable accommodations for candidates with disabilities in our recruiting process. If you need any assistance or accommodations due to a disability, please let us know at accommodations-



