ML Infra Engineer
2 hours ago
In this role you will help scale and optimize our training systems and core model code. You'll own critical infrastructure for large-scale training, from managing GPU/TPU compute and job orchestration to building reusable and efficient JAX training pipelines. You'll work closely with researchers and model engineers to translate ideas into experiments—and those experiments into production training runs.
This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.
The Team
The ML Infrastructure team supports and accelerates PI's core modeling efforts by building the systems that make large-scale training reliable, reproducible, and fast. The team works closely with research, data, and platform engineers to ensure models can scale from prototype to production-grade training runs.
In This Role You Will
- Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging.
- Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction.
- Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization.
- Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments.
- Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost.
- Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale.
- Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics.
What We Hope You'll Bring
- Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms.
- Hands-on large-scale training experience in JAX (preferred) or PyTorch.
- Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines.
- Experience managing training workloads on cloud platforms and cluster schedulers (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS).
- Ability to debug and optimize performance bottlenecks across the training stack.
- Strong cross-functional communication and ownership mindset.
Bonus Points If You Have
- Deep ML systems background (e.g., training compilers, runtime optimization, custom kernels).
- Experience operating close to hardware (GPU/TPU performance tuning).
- Background in robotics, multimodal models, or large-scale foundation models.
- Experience designing abstractions that balance researcher flexibility with system reliability.