Supercomputing, Software Engineer

4 weeks ago


San Francisco, United States OpenAI Full time
About the Team

The Supercomputing Scheduling Pillar at OpenAI is dedicated to ensuring the reliability, scalability, and user-friendliness of job lifecycle management, with an emphasis on efficient and flexible job scheduling, quota management, and job execution workflows. We maximize researcher productivity by ensuring high goodput, efficient packing, and a consistent, ergonomic training workflow, all while scaling to ever-larger supercomputers and reducing the operational burden on the team.

About the Role

As an engineer in the Scheduling Pillar, you will design, write, deploy, and operate job lifecycle management systems for model training on some of the largest supercomputers in the world. The scale is immense, the timelines are tight, and the organization is moving fast; this is an opportunity to shape a critical system in support of OpenAI's mission.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:
  • Design, implement and operate components of our quota management, job scheduling, and queuing systems. In short, you will work on the key user interface to our supercomputers.
  • Interface with researchers to understand workload requirements.
  • Harmonize job lifecycle features with cluster infrastructure, storage, and hardware health requirements.
You might thrive in this role if you:
  • Have significant experience with hyperscale scheduling systems
  • Possess strong programming skills
  • Have experience working in public clouds (especially Azure)
  • Have an execution-focused mentality paired with a rigorous focus on user requirements
  • As a bonus, have an understanding of AI/ML workloads


About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or any other legally protected status.

OpenAI Affirmative Action and Equal Employment Opportunity Policy Statement

For US Based Candidates: Pursuant to the San Francisco Fair Chance Ordinance, we will consider qualified applicants with arrest and conviction records.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
