Machine Learning Engineer, Distributed Systems

3 weeks ago


Mountain View, California, United States Waymo Full time

Waymo is an autonomous driving technology company with a mission to be the most trusted driver. The Waymo Driver powers Waymo One, a fully autonomous ride-hailing service, and can also be applied to a range of vehicle platforms and product use cases.

The Waymo Driver has provided over one million rider-only trips, enabled by its experience autonomously driving tens of millions of miles on public roads across 13+ U.S. states and tens of billions of miles in simulation.

The Waymo ML Infrastructure team works with Research and Production teams to develop models in Perception and Planning that are core to our autonomous driving software.

We ensure our partners succeed by offering the best solutions for the entire model development lifecycle. These solutions are developed in close collaboration with teams at Google. They are geared towards both scaling models and solving problems unique to ML for autonomous driving.

We develop a set of libraries and tools that enhance TensorFlow and JAX, and address scalability, reliability, and performance challenges faced by Waymo's ML practitioners: training fast and at scale, increasing ML accelerator efficiency, fine-tuning multimodal LLMs for autonomous driving tasks, discovering hyper-parameters, retraining neural networks, computing reliable and noiseless metrics on validation sets, and validating newly trained DNNs when deployed into the full onboard software stack.

In this role, you will report to the Technical Lead Manager of Machine Learning Training.

Key Responsibilities:

  • Develop the infrastructure components necessary for distributed training, including job scheduling, resource management, data distribution, and model synchronization.
  • Implement automation solutions for provisioning, deployment, monitoring, and scaling of distributed training infrastructure to improve operations and reliability.
  • Monitor system health, diagnose and troubleshoot issues, and perform routine maintenance tasks to ensure the reliability of the distributed training infrastructure.
  • Identify performance bottlenecks and optimization opportunities.
  • Improve the developer experience and performance of our scalable ML framework.

Requirements:

  • Bachelor's degree in Computer Science, Engineering, or a related field, or 2+ years of equivalent experience.
  • Knowledge of distributed systems principles and experience building distributed systems for production environments.
  • Solid Python or C++ skills.
  • Prior experience with Machine Learning frameworks (e.g., TensorFlow, PyTorch) and distributed training algorithms.
  • Ability to debug complex distributed systems issues.
  • Experience communicating updates and resolutions to customers and other partners.

Preferred Qualifications:

  • Practical familiarity using ML accelerator profiling tools to uncover performance bottlenecks.
  • Familiarity with cloud computing platforms (e.g., AWS, Azure, GCP) and experience deploying and managing distributed systems in cloud environments.
  • Knowledge of optimization and deep learning algorithms.

Waymo employees are also eligible to participate in Waymo's discretionary annual bonus program, equity incentive plan, and generous Company benefits program, subject to eligibility requirements.

Salary Range $158,000—$200,000 USD


