Software Engineer, Supercomputing

4 weeks ago


San Francisco, CA, United States Adept AI Labs Full time

Adept is working to advance a people-centric approach to AI that optimizes for what’s actually most useful for people and their work. You can see this approach in the technology we’re building: models that are trained to use software and take actions just as a person would.

We’ve recently raised a $350M Series B led by General Catalyst and Spark, on top of a $65M Series A in 2022 with Addition and Greylock. We’re fortunate to be supported by amazing firms and angels such as Chris Re, Andrej Karpathy, Root Ventures, Howie Liu, Dara Khosrowshahi, and others, and were recently highlighted by Forbes. Adept is backed by a coalition of strategic partners, including Atlassian, Microsoft, NVIDIA, and Workday.

We're looking for passionate team members who want to swing for the fences to accomplish our mission, are excited by a startup environment where the hardest problems are yet to be solved, and are eager to learn and collaborate together in our San Francisco office.

For more information, check out our blog.

Position Summary

Adept is building a new class of multimodal AI models designed specifically for digital agents. Our model iteration speed depends on training performance while using some of the largest GPU clusters around. Engineers with an HPC background can help Adept iterate faster. Areas of focus on the Infrastructure team include:

  • Compute - building and managing large GPU clusters training SOTA models
  • Optimization - improving the utilization and reliability of those clusters
  • Research - working directly with researchers to align model architecture with training performance

We value curious engineers who can engage with new problems and get things done at a startup. Our team members come from a variety of backgrounds.

If you have some of these, you might be a good fit:

  • 8+ years of experience as a software engineer
  • Deep understanding of GPU cluster hardware, performance, interconnect, etc.
  • Experience with distributed training concepts and tools, e.g., torch.distributed, NCCL, MPI, etc.
  • Experience at fast-growing companies or startups
  • Demonstrated end-to-end ownership and self-direction
  • Comfort with moving fast and learning by doing
  • Excellent communication and collaboration skills, both verbal and written

The pay range for this position in California is $175,000 - $350,000/yr; however, base pay offered may vary depending on job-related knowledge, skills, candidate location, and experience. We also offer competitive equity packages in the form of stock options and a comprehensive benefits plan.

Our benefits

  • Medical, dental, and vision insurance - 100% covered
  • Unlimited vacation time for exempt employees
  • 4 remote weeks per year - work from anywhere
  • Competitive salary & stock options
  • 24 weeks paid parental leave
  • Monthly wellness stipend
  • Daily meals for those in our comfortable SF office
  • Commuter benefits
  • Dog friendly office

Adept is an equal opportunity employer. We're excited about candidates who will raise the bar of our team, regardless of specific experiences -- we encourage applicants from a range of backgrounds to apply.



  • San Francisco, CA, United States OpenAI Full time

    About The Team Supercomputers scale vertically. The workloads are synchronous and cluster-scale. These conditions demand a novel approach to cluster infrastructure, and it is the work of the Supercomputing Scalability Pillar to invent it. The focus is on scaling beyond k8s supported node counts, deploying cluster wide releases rapidly and atomically,...


  • San Francisco, CA, United States OpenAI Full time

    About The Team The Supercomputing Scheduling Pillar at OpenAI is dedicated to ensuring the reliability, scalability, and user-friendliness of job lifecycle management, with an emphasis on efficient and flexible job scheduling, quota management, and job execution workflows. We maximize researcher productivity by ensuring high goodput, efficient packing, and...


  • San Francisco, United States Omega Venture Partners Full time

    About the Team The Supercomputing Scheduling Pillar at OpenAI is dedicated to ensuring the reliability, scalability, and user-friendliness of job lifecycle management, with an emphasis on efficient and flexible job scheduling, quota management, and job execution workflows. We maximize researcher productivity by ensuring high goodput, efficient packing, and a...


  • San Francisco, CA, United States OpenAI Full time

    About The Team We believe that increasing compute is a huge lever to AI progress. The Supercomputing team owns the entire process of building OpenAI’s compute and infrastructure. This includes the deployment of huge clusters using Kubernetes and Azure, and building the internal experiment platform for running/training the world’s largest AI models. ...

  • Engineering Manager

    6 days ago


    San Francisco, United States OpenAI Full time

    About the Team We believe that increasing compute is a huge lever to AI progress. The Supercomputing team owns the entire process of building OpenAI's compute and infrastructure, which includes the deployment of huge clusters using Kubernetes and Azure, and building the internal experiment platform for running/training the world's largest AI models. We work...

  • Tech Lead Manager

    4 weeks ago


    San Francisco, United States OpenAI Full time

    About the Team Supercomputers scale vertically. The workloads are synchronous and cluster-scale. These conditions demand a novel approach to cluster infrastructure, and it is the work of the Supercomputing Scalability Pillar to invent it. The focus is on scaling beyond k8s supported node counts, deploying cluster wide releases rapidly and atomically,...


  • San Francisco, CA, United States OpenAI Full time

    About The Team The Platform Runtime team builds the low level framework components to power our ML training systems. We work on building robust, scalable, high performance components to support our distributed training workloads. Our priorities are to maximize the productivity of our researchers and our hardware, with the goal of accelerating progress...


  • San Francisco, CA, United States OpenAI Full time

    About The Team Storage Infrastructure provides APIs for data access, placement, and lifecycle management, while ensuring that the storage systems’ capacity, throughput, and IOPs satisfy the needs of our AI researchers. Scalability, reliability, security, and usability are the core concerns of the team. About The Role As an engineer on the Storage...

  • Software Engineer

    6 days ago


    San Francisco, CA, United States Magic AI Full time

    Join us to build and safely deploy aligned, superhuman AI. We are building an AI pair programmer that feels like a full colleague inside your computer - capable, conversational, and reliable across domains. As a Software Engineer working on our large-scale training and inference infrastructure, you will architect and build resilient solutions for AI...


  • San Diego, United States CEREBRAS SYSTEMS INC. Full time

    Cerebras Systems has pioneered a groundbreaking chip and system that revolutionizes deep learning applications. Our system empowers ML researchers to achieve unprecedented speeds in training and inference workloads, propelling AI innovation to new horizons. Condor Galaxy 1 (CG-1), a supercomputer set to revolutionize the world of artificial intelligence....




  • San Francisco, CA, United States OpenAI Full time

    About the Team OpenAI's Hardware Health team oversees all hardware health related aspects of our custom-built hyperscale supercomputers. The team is responsible for maximizing the available supercomputing capacity for research and ensuring that our researchers are minimally impacted by hardware faults. The hardware health team is being incubated...


  • San Francisco, CA, United States Advent Software, Inc. Full time

    Associate Software Engineer. Location: San Francisco, CA. Time type: Full time. Posted 2 days ago. Job requisition ID: R16507. SS&C is a global provider of investment and financial services and software for the financial services and healthcare industries. Named to Fortune 1000 list as top U.S. company...


  • San Francisco, CA, United States OpenAI Full time

    OpenAI's Hardware Health team oversees all hardware health related aspects of our custom-built hyperscale supercomputers. The hardware health team is being incubated inside OpenAI's Research team, which operates at the far edge of all available innovations in AI - doing the engineering and research required to train large-scale AI models of unprecedented...




  • Santa Clara, CA, United States NVIDIA Full time

    We are now looking for a Principal Software Architect for AI and HPC. At NVIDIA, we are advancing the frontiers of AI capabilities. We seek an expert in high-performance computing and AI to design and develop software resiliency features for training AI models on the world’s most powerful and largest supercomputers. In this role, you will outline...

  • Performance Engineer

    1 month ago


    San Francisco, CA, United States Anthropic Limited Full time

    Running machine learning (ML) algorithms at our scale often requires solving novel systems problems. As a Performance Engineer, you'll be responsible for identifying these problems, and then developing systems that optimize the throughput and robustness of our largest distributed systems. Strong candidates here will have a track record of solving...


  • San Francisco, United States OpenAI Full time

    About the Team Our team brings OpenAI's most capable technology to the world through our products. Most recently, we released ChatGPT, GPT-4, the Whisper API, and DALL-E. We empower consumers and developers alike to use and access our state-of-the-art AI models, allowing them to do things that they've never been able to before. Across all product lines, we...

  • Software Engineer

    6 days ago


    San Francisco, CA, United States Bunkerhill Health Full time

    About The Role We are seeking a talented and enthusiastic Software Engineer to join our dynamic team. As a Software Engineer, you will work closely with our senior engineers to develop, test, and maintain software solutions that meet our clients' needs. Responsibilities Collaborate with cross-functional teams to understand project requirements and...
