Software Engineer, Supercomputing Scalability

3 weeks ago


San Francisco, United States OpenAI Full time
About The Team

Supercomputers scale vertically. The workloads are synchronous and cluster-scale. These conditions demand a novel approach to cluster infrastructure, and it is the work of the Supercomputing Scalability Pillar to invent it. The focus is on scaling beyond the node counts that Kubernetes (k8s) supports, deploying cluster-wide releases rapidly and atomically, providing comprehensive telemetry on the health and activity of the cluster, and rapidly onboarding new supercomputing systems with bleeding-edge hardware and world-class scale.

About The Role

As an Engineer for Supercomputing Scalability, you will work to simplify and scale the operations of our DC-scale computers. You will use widely available tools effectively, while building novel solutions when we scale beyond their limits, expanding our ability to handle novel hardware, rapidly growing numbers of (ever larger) clusters, and a fast-growing set of research users.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In This Role, You Will

  • Design and operate the orchestration and monitoring stack for our supercomputers
  • Automate everything until we have unprecedented control of our stack
  • Deeply understand what it means for a supercomputer to be healthy and useful to researchers, and enable frontier model training

You Might Thrive In This Role If You

  • Deeply understand k8s and other cluster orchestration systems
  • Have strong software development skills
  • Have experience working in public clouds (especially Azure)
  • Have a bias for action and are comfortable building in a fast-paced, dynamic environment
  • Have familiarity with AI/ML data access patterns

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or any other legally protected status.

For US Based Candidates: Pursuant to the San Francisco Fair Chance Ordinance, we will consider qualified applicants with arrest and conviction records.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
