Supercomputing, Software Engineer
4 weeks ago
The Supercomputing Scheduling Pillar at OpenAI is dedicated to ensuring the reliability, scalability, and user-friendliness of job lifecycle management, with an emphasis on efficient and flexible job scheduling, quota management, and job execution workflows. We maximize researcher productivity by ensuring high goodput, efficient packing, and a consistent, ergonomic training workflow, while scaling to ever-larger supercomputers and reducing the operational burden on the team.
About the Role
As an engineer in the Scheduling Pillar, you will design, write, deploy, and operate job lifecycle management systems for model training on some of the largest supercomputers in the world. The scale is immense, the timelines are tight, and the organization is moving fast; this is an opportunity to shape a critical system in support of OpenAI's mission.
This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.
In this role, you will:
- Design, implement, and operate components of our quota management, job scheduling, and queuing systems. In short, you will work on the key user interface to our supercomputers.
- Interface with researchers to understand workload requirements.
- Harmonize job lifecycle features with cluster infrastructure, storage, and hardware health requirements.
You might thrive in this role if you:
- Have significant experience with hyperscale scheduling systems
- Possess strong programming skills
- Have experience working in public clouds (especially Azure)
- Have an execution-focused mentality paired with a rigorous focus on user requirements
- As a bonus, have an understanding of AI/ML workloads
About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.
We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or any other legally protected status.
OpenAI Affirmative Action and Equal Employment Opportunity Policy Statement
For US Based Candidates: Pursuant to the San Francisco Fair Chance Ordinance, we will consider qualified applicants with arrest and conviction records.
We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.
OpenAI Global Applicant Privacy Policy
At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.
-
Supercomputing, Software Engineer
2 weeks ago
San Francisco, United States OpenAI Full time
Supercomputing, Software Engineer - Scheduling. About the Team: The Supercomputing Scheduling Pillar at OpenAI is dedicated to ensuring the reliability, scalability, and user-friendliness of job lifecycle management, with an emphasis on efficient and flexible job...
-
Software Engineer, Supercomputing Scalability
3 weeks ago
San Francisco, United States OpenAI Full time
About the Team: Supercomputers scale vertically. The workloads are synchronous and cluster-scale. These conditions demand a novel approach to cluster infrastructure, and it is the work of the Supercomputing Scalability Pillar to invent it. The focus is on scaling beyond k8s supported node counts, deploying cluster wide releases rapidly and atomically,...
-
Software Engineer
3 days ago
San Francisco, United States ZipRecruiter Full time
Job Description: Magic’s mission is to build safe AGI that accelerates humanity’s progress on the world’s most important problems. We believe the most promising path to safe AGI lies in automating research and code to improve models and solve alignment more reliably than humans can alone. Our approach combines frontier-scale pre-training, domain-specific...
-
Supercomputing, Software Engineer
2 weeks ago
San Francisco, United States OpenAI Full time
Supercomputing, Software Engineer - Storage. Storage Infrastructure provides APIs for data access, placement, and lifecycle management, while ensuring that the storage systems’ capacity, throughput, and IOPs satisfy the needs of our AI researchers. Scalability, reliability, security, and usability are the core concerns of the team. About the Role: As an...
-
Supercomputing, Front End Engineer
4 weeks ago
San Francisco, United States OpenAI Full time
About the Role: You will design, build, and maintain user interfaces that make managing large-scale job scheduling and cluster orchestration accessible and efficient. Your work will empower researchers and engineers to easily interact with and monitor complex supercomputing resources, focusing on usability and reliability. We're looking for front-end...
-
Supercomputing, Front End Engineer
2 weeks ago
San Francisco, United States OpenAI Full time
You will design, build, and maintain user interfaces that make managing large-scale job scheduling and cluster orchestration accessible and efficient. Your work will empower researchers and engineers to easily interact with and monitor complex supercomputing resources, focusing on usability and reliability. We’re looking for front-end engineers with...
-
Software Engineer
2 weeks ago
San Francisco, United States OpenAI Full time
Software Engineer - Power Management, Hardware Health. OpenAI’s Hardware Health team is dedicated to ensuring the optimal performance and reliability of our custom-built hyperscale supercomputers. We focus on maximizing supercomputing capacity for research and ensuring that our researchers are minimally impacted by hardware faults. This team is critical in...
-
Software Engineer, Hardware Health
1 month ago
San Francisco, United States OpenAI Full time
About the Team: OpenAI’s Hardware Health team is dedicated to ensuring the optimal performance and reliability of our custom-built hyperscale supercomputers. We focus on maximizing supercomputing capacity for research and ensuring that our researchers are minimally impacted by hardware faults. This team is critical in maintaining the infrastructure that...
-
Software Engineer, Hardware Health Specialist
1 month ago
San Francisco, California, United States OpenAI Full time
About the Team: At OpenAI, our Hardware Health team is dedicated to ensuring the optimal performance and reliability of our custom-built hyperscale supercomputers. We focus on maximizing supercomputing capacity for research and ensuring that our researchers are minimally impacted by hardware faults. The Hardware Health team operates within the broader Platform...
-
Software Engineer
2 weeks ago
San Francisco, United States OpenAI Full time
About the Team: OpenAI's Hardware Health team is dedicated to ensuring the optimal performance and reliability of our custom-built hyperscale supercomputers. We focus on maximizing supercomputing capacity for research and ensuring that our researchers are minimally impacted by hardware faults. This team is critical in maintaining the infrastructure that...
-
Software Engineer, Networking
3 weeks ago
San Francisco, United States OpenAI Full time
The Platform Networking team is responsible for the collective communication stack used in our largest training jobs. Using a combination of C++ and CUDA we work on novel collective communication techniques that enable efficient training of our flagship models on our largest custom built supercomputers. The models we train are key ingredients to the AI...
-
Software Engineer, Networking
4 weeks ago
San Francisco, United States OpenAI Full time
About the Team: The Platform Networking team is responsible for the collective communication stack used in our largest training jobs. Using a combination of C++ and CUDA we work on novel collective communication techniques that enable efficient training of our flagship models on our largest custom built supercomputers. The models we train are key ingredients...
-
Software Engineer, Networking Specialist
2 weeks ago
San Francisco, California, United States OpenAI Full time
About the Team: The Platform Networking team is responsible for the collective communication stack used in our largest training jobs. Using a combination of C++ and CUDA, we work on novel collective communication techniques that enable efficient training of our flagship models on our largest custom-built supercomputers. About the Role: As a Software Engineer,...
-
Software Engineer, Distributed Systems
1 week ago
San Francisco, United States OpenAI Full time
About the Team: The Platform Runtime team builds the low-level framework components to power our ML training systems. We work on building robust, scalable, high-performance components to support our distributed training workloads. Our priorities are to maximize the productivity of our researchers and our hardware, with the goal of accelerating progress towards...
-
Software Engineer, Hardware Health
4 weeks ago
San Francisco, United States OpenAI Full time
About the Team: OpenAI's Hardware Health team is dedicated to ensuring the optimal performance and reliability of our custom-built hyperscale supercomputers. We focus on maximizing supercomputing capacity for research and ensuring that our researchers are minimally impacted by hardware faults. This team is critical in maintaining the infrastructure that...
-
Supercomputing Engineer
3 weeks ago
San Francisco, United States San Francisco Compute Co. Full time
About: We’re the San Francisco Compute Company. We’re building the first real-time compute trading platform. We think that over the next decade, thousands of startups and labs are going to be training and serving large models. They need compute to do this, and we’re building a platform on which that compute can be traded. If we’re successful, it will...
-
Software Engineer, Distributed Systems
4 weeks ago
San Francisco, United States OpenAI Full time
About the Team: The Platform Runtime team builds the low level framework components to power our ML training systems. We work on building robust, scalable, high performance components to support our distributed training workloads. Our priorities are to maximize the productivity of our researchers and our hardware, with the goal of accelerating progress...
-
Supercomputing, Software Engineer
3 weeks ago
San Francisco, United States OpenAI Full time
About the Team: Storage Infrastructure provides APIs for data access, placement, and lifecycle management, while ensuring that the storage systems' capacity, throughput, and IOPs satisfy the needs of our AI researchers. Scalability, reliability, security, and usability are the core concerns of the team. About the Role: As an engineer on the Storage...
-
Software Engineer
1 week ago
San Francisco, California, United States Eventual Computing Full time
About Eventual Computing: Eventual Computing is a cutting-edge data platform that empowers data scientists and engineers to build scalable data applications across ETL, analytics, and ML/AI. We are on a mission to bridge the gap between traditional tabular data analytics and modern ML/AI workloads. Our open-source distributed data engine, Daft, runs on 800k CPU...
-
Frontiers Infrastructure Engineer
2 weeks ago
San Francisco, United States OpenAI Full time
The Frontiers Infrastructure team builds the low level framework components to power our ML training systems. We work on building robust, debuggable, high performance libraries to support our distributed training workloads. Our priorities are to maximize the productivity of our researchers and our hardware, with the goal of accelerating progress towards...