Senior ML Systems Engineer, Frameworks

7 days ago

San Francisco, California, United States Cohere Full time

Who are we?

Our mission is to scale intelligence to serve humanity. We're training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI.

We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast to do what's best for our customers.

Cohere is a team of researchers, engineers, designers, and more, who are passionate about their craft. Each person is one of the best in the world at what they do. We believe that a diverse range of perspectives is a requirement for building great products.

Join us on our mission and shape the future

We're looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training — and build the tooling that connects research ideas to thousands of GPUs.

If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.

What You'll Work On

Build and own the training framework responsible for large-scale LLM training.
Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).
Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.
Investigate and resolve performance bottlenecks across the ML systems stack.
Build robust systems that ensure reproducible, debuggable, large-scale runs.

You Might Be a Good Fit If You Have

Strong engineering experience in large-scale distributed training or HPC systems.
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
Experience working with containerized environments (Docker, Singularity/Apptainer).
A track record of building tools that increase developer velocity for ML teams.
Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
Strong collaboration skills — you'll work closely with infra, research, and deployment teams.

Nice to Have

Experience with training LLMs or other large transformer architectures.
Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).
Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches).
Experience with data pipeline optimization, sharded datasets, or caching strategies.
Background in performance engineering, profiling, or low-level systems.

Bonus: papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).

Why Join Us

You'll work on some of the most challenging and consequential ML systems problems today.
You'll collaborate with a world-class team working fast and at scale.
You'll have end-to-end ownership over critical components of the training stack.
You'll shape the next generation of infrastructure for frontier-scale models.
You'll build tools and systems that directly accelerate research and model quality.

Sample Projects:

Build a high-performance data loading and caching pipeline.
Implement performance profiling across the ML systems stack.
Develop internal metrics and monitoring for training runs.
Build reproducibility and regression testing infrastructure.
Develop a performant fault-tolerant distributed checkpointing system.

If some of the above doesn't line up perfectly with your experience, we still encourage you to apply.

We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.

Full-Time Employees at Cohere enjoy these Perks:

An open and inclusive culture and work environment

Work closely with a team on the cutting edge of AI research

Weekly lunch stipend, in-office lunches & snacks

Full health and dental benefits, including a separate budget to take care of your mental health

100% Parental Leave top-up for up to 6 months

Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement

Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend

6 weeks of vacation (30 working days)

