Supercomputing, Software Engineer

2 weeks ago


San Francisco, United States OpenAI Full time
Supercomputing, Software Engineer - Storage

About the Team

Storage Infrastructure provides APIs for data access, placement, and lifecycle management, while ensuring that the storage systems’ capacity, throughput, and IOPS satisfy the needs of our AI researchers. Scalability, reliability, security, and usability are the core concerns of the team.

About the Role

As an engineer on the Storage Infrastructure team, you will design, build, and operate exascale systems that manage our research data scalably and reliably across multiple regions.

We’re looking for distributed systems engineers who have worked on exascale data management systems or distributed filesystems.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

  1. Develop software to manage exascale data and make it accessible to researchers.
  2. Drive the reliability, predictability, and cost-effectiveness of our storage systems.
  3. Interface with researchers to understand and accommodate data use cases.
  4. Ensure the security of our critical datasets.

You might thrive in this role if you:

  1. Have a deep understanding of distributed systems principles and a proven track record in designing and building scalable, reliable, and secure storage solutions.
  2. Possess strong programming skills.
  3. Have experience working in public clouds (especially Azure).
  4. Are familiar with AI/ML data access patterns.
  5. Have a bias for action and are comfortable building in a fast-paced, dynamic environment.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability, or any other legally protected status.

For US Based Candidates: Pursuant to the San Francisco Fair Chance Ordinance, we will consider qualified applicants with arrest and conviction records.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.

