Machine Learning Engineer

2 weeks ago


Redwood City, CA, United States | C3 AI | Full time

C3 AI (NYSE: AI) is the Enterprise AI application software company. C3 AI delivers a family of fully integrated products, including the C3 Agentic AI Platform, an end-to-end platform for developing, deploying, and operating enterprise AI applications; C3 AI applications, a portfolio of industry-specific SaaS enterprise AI applications that enable the digital transformation of organizations globally; and C3 Generative AI, a suite of domain-specific generative AI offerings for the enterprise. Learn more at: C3 AI

The C3 AI Data Science team is dedicated to pushing the boundaries of what is possible with large-scale AI. We are seeking a hands-on Machine Learning Engineer to design, build, and operate a bespoke, next-generation research platform dedicated to training novel, large-scale foundation models far beyond conventional LLM recipes.

This is a critical systems role. You will create the orchestration, secure data pathways, and frictionless developer experience that empower our researchers to move fast, experiment securely, and scale complex training jobs on heterogeneous GPU clusters.

Responsibilities:

We are looking for an expert who can solve infrastructure problems where off-the-shelf cloud tools are insufficient.

  • Design and manage the core research compute cluster, including node layouts, queues/partitions, preemption/fair-share policies, and multi-tenant isolation.
  • Implement secure access controls for all users and services across the cluster using Kubernetes and/or SLURM.
  • Build robust branch-to-experiment CI/CD workflows, encompassing templated job creation, config promotion, and integrated version control.
  • Implement an experiment and metrics tracking system (runs, configs, checkpoints, logs) with searchable lineage to enable frictionless cross-team collaboration and sharing.
  • Design and integrate auto-checkpointing, artifact retention, and necessary rollout/rollback mechanisms.
  • Stand up robust dataset registries, ensuring data lineage, versioning, and secure access.
  • Implement sharding, streaming, and prefetch mechanisms to support efficient access to TB-scale data corpora, along with long-term archival and reproducible rehydration.
  • Profile NCCL and I/O hotspots and optimize training throughput (mixed precision/AMP, ZeRO/FSDP, kernel fusion, caching).
  • Harden training pipelines for scale and resilience, including checkpoint recovery and tolerance for spot/preemptible instances.
  • Build opinionated templates, job specifications, and guardrails to ensure researchers can focus on modifying custom training code and recipes without fighting infrastructure bottlenecks.

Qualifications:

  • BS/MS in Computer Science/Electrical Engineering or equivalent deep, practical experience.
  • 5+ years of work experience (8+ years for Senior Machine Learning Engineer).
  • Proven track record building custom ML/HPC platforms for specialized research (e.g., novel model architectures, time-series, multimodal AI) where commercial cloud tools were insufficient.
  • Deep expertise with Kubernetes and/or SLURM on GPU clusters, including proficiency with containers, images, and multi-node scheduling.
  • Strong software development skills in Python and one of Go, C++, or Rust. Comfortable developing controllers/operators, high-performance services, and CLI tooling on Linux.
  • Practical, hands-on knowledge of distributed ML frameworks (PyTorch DDP/FSDP/ZeRO, DeepSpeed, or JAX/TPU) and performance profiling (NCCL, CUDA basics, I/O performance).
  • Experience with object stores, Parquet format, dataset version control, streaming/sharding techniques, and efficient artifact management for checkpoints and logs.
  • Practical experience with observability (Prometheus/Grafana/OpenTelemetry) and infra-as-code (Terraform/Helm/Ansible).

Preferred Qualifications:

  • Experience with high-speed networking and storage, including InfiniBand/RDMA, GPUDirect-RDMA, NVLink topology, and high-throughput file/object systems.
  • Direct experience modifying or working with K8s device plugins, custom schedulers/quotas, or SLURM internals (fair-share/preemption).
  • Expertise in implementing true reproducibility at scale: seeding, deterministic builds, environment capture, and building robust dataset and experiment lineage that guarantees re-runnability months later.
  • Experience with advanced performance work such as kernel fusion, custom CUDA operations, and fine-tuning complex FSDP/ZeRO configurations.
  • A pragmatic, product-focused approach to researcher ergonomics, demonstrated by platforms you have shipped that materially increased experiment throughput and velocity.


C3 AI provides excellent benefits, a competitive compensation package, and a generous equity plan.

California Base Pay Range

$140,000-$206,000 USD

C3 AI is proud to be an Equal Opportunity and Affirmative Action Employer. We do not discriminate on the basis of any legally protected characteristics, including disabled and veteran status.
