ML Infrastructure Deployment Specialist

3 days ago


San Francisco, California, United States CentML Full time

About CentML

We believe AI will fundamentally transform how people live and work. Our mission is to massively reduce the cost of developing and deploying ML models so we can enable anyone to harness the power of AI and everyone to benefit from its potential.

Our founding team is made up of experts in AI, compilers, and ML hardware with extensive industry experience at Amazon, Google, Microsoft Research, Nvidia, Intel, Qualcomm, and IBM.

We are seeking a highly motivated and skilled senior infrastructure engineer to join our team in designing, developing, and maintaining the CentML platform. As an infrastructure engineer, you will be responsible for designing the deployment infrastructure for ML training and inference jobs running on GPU clusters that span multiple cloud service providers.

Key Responsibilities
  1. Design and lead the development of the deployment infrastructure of the CentML platform, managing the hardware resources necessary to deploy ML training and inference applications.
  2. Implement GPU cluster scheduling solutions for large-scale ML training and inference workloads to efficiently utilize the hardware resources in the GPU cluster.
  3. Collaborate with product teams to define new features and goals for improving the CentML platform.

Required Qualifications

  • 4+ years of experience working with containerized deployment systems (e.g., Kubernetes, OpenShift, Terraform).
  • A big plus if you have contributed to Kubernetes and have expertise in container runtime technologies like Docker Engine, containerd, or CRI-O.
  • Experience deploying and managing cloud infrastructure on AWS, GCP, and/or Azure.
  • Past experience in building GPU clusters for large-scale ML training and inference is desirable.
  • Knowledge of GPU architecture and Nvidia GPU virtualization technologies is highly desirable.
  • Strong coding skills in languages like Python, Java, Go, and/or C/C++.

Benefits & Perks

An open and inclusive work environment.

Employee stock options.

Best-in-class medical and dental benefits.

Parental Leave top-up for 6 months.

Professional development budget.

Flexible vacation time to promote a healthy work-life blend.

We are an equal opportunity employer and value diversity at our company. We do not discriminate based on race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability, or any other protected ground of discrimination under applicable human rights legislation.

CentML strives to respect the dignity and independence of people with disabilities and is committed to giving them the same opportunity to succeed as all other employees.

Inclusiveness is core to our culture at CentML, and we strive to ensure you get the most from your interview experience. CentML makes reasonable accommodations for applicants with disabilities. If a reasonable accommodation is needed to participate in the job application or interview process, please reach out to the Talent team.


