Senior Software Engineer, Infrastructure

4 days ago


San Francisco, United States CentML Full time
About Us

We believe AI will fundamentally transform how people live and work. CentML's mission is to massively reduce the cost of developing and deploying ML models so we can enable anyone to harness the power of AI and everyone to benefit from its potential.

Our founding team is made up of experts in AI, compilers, and ML hardware and has led efforts at companies like Amazon, Google, Microsoft Research, Nvidia, Intel, Qualcomm, and IBM. Our co-founder and CEO, Gennady Pekhimenko, is a world-renowned expert in ML systems who holds multiple academic and industry research awards from Google, Amazon, Facebook, and VMware.

Position Overview:

We are seeking a highly motivated and skilled senior infrastructure engineer to join our team in a key role focused on designing, developing, and maintaining the CentML platform, which offers cost-effective infrastructure for serving and training large-scale machine learning models. As an infrastructure engineer, you will be responsible for designing the deployment infrastructure for ML training and inference jobs on GPU clusters that span multiple cloud service providers such as AWS, GCP, Azure, CoreWeave, and OCI. You will also be responsible for leading a team of engineers and building a scalable, performant, and reliable platform that lets our customers seamlessly access and use the comprehensive suite of ML services we offer.

Responsibilities
    • Design and lead the development of the deployment infrastructure of the CentML platform. The deployment infrastructure manages the hardware resources necessary to deploy the ML training and inference applications.
    • Implement GPU cluster scheduling solutions for large-scale ML training and inference workloads that make efficient use of the cluster's hardware resources.
    • Communicate with our product teams and define new features and goals for improving the CentML platform.
Qualifications
    • 4+ years of experience working with containerized deployment systems (e.g., Kubernetes, OpenShift, Terraform).
    • A big plus if you have contributed to Kubernetes and have expertise in container runtime technologies like Docker Engine, containerd, or CRI-O.
    • Experience deploying and managing cloud infrastructure on AWS, GCP, or Azure.
    • Past experience building GPU clusters for large-scale ML training and inference is desirable.
    • Knowledge of GPU architecture and Nvidia GPU virtualization technologies is highly desirable.
    • Strong coding skills in languages like Python, Java, Go, and/or C/C++.


Benefits & Perks

- An open and inclusive work environment

- Employee stock options

- Best-in-class medical and dental benefits

- Parental Leave top-up for 6 months

- Professional development budget

- Flexible vacation time to promote a healthy work-life blend

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability, or any other protected ground of discrimination under applicable human rights legislation.

CentML strives to respect the dignity and independence of people with disabilities and is committed to giving them the same opportunity to succeed as all other employees.

Inclusiveness is core to our culture at CentML, and we strive to ensure you get the most from your interview experience. CentML makes reasonable accommodations for applicants with disabilities. If a reasonable accommodation is needed to participate in the job application or interview process, please reach out to the Talent team.
