Technical Program Manager, GPU Fleet Management

4 weeks ago


San Francisco, United States OpenAI Full time

About the Team

The fleet team runs the GPU fleet that serves the models backing ChatGPT and the API while also supporting training workloads for our next-generation models. We manage one of the largest cutting-edge GPU fleets in the world, exposing it as a single platform on which other OpenAI teams can seamlessly run production Applied AI and training workloads. We seek to learn from deployment and distribute the benefits of AI, while ensuring that this powerful tool is used responsibly and safely. Safety is more important to us than unfettered growth.

About the Role

As a Technical Program Manager for the GPU Fleet, your role is to help make our future compute plans a reality by coordinating with engineers to manage the capacity footprint for OpenAI’s training and inference workloads while upleveling the demand-materialization lifecycle for all compute capacity. You will be responsible for managing and coordinating the overall body of work across many parallel programs and projects, ensuring cohesive communication and consistent alignment across all teams in platform, to all cross-functional teams, and up to leadership.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

  • Guide the roadmap for automation for a fleet that can grow an order of magnitude in size or more.
  • Ensure that the capacity demand signal is operationally materialized within Fleet tooling and scheduling systems.
  • Build net-new systems that allow Fleet to better manage supply vs. demand compute lifecycles across training and inference workstreams.
  • Uplevel our autoscaling and global scheduling infrastructure by improving its reliability and ergonomics and expanding its capabilities.
  • Work with Fleet Turnup engineers to execute on tight timelines and iteratively improve the process, tooling, and automation.
  • Work with external partners to unlock bleeding-edge compute and make it available as a turnkey resource for scheduling workloads.
  • Collaborate closely with a broad set of stakeholders, including product engineering, inference, security, research, and finance.

You might thrive in this role if you:

  • Possess a degree in a hard science, or have a demonstrated track record of engineering expertise.
  • Have 5+ years of experience in program management for major projects, including capital projects or hyperscaler infrastructure deployment.
  • Have a demonstrated ability to serve as the go-to person solely responsible for driving and delivering complex projects.
  • Are comfortable managing cross-functional and cross-company teams, with experience driving information and decision hygiene.
  • Have an extensive track record of successfully delivering high-profile, technical projects against tight deadlines.
  • Are technically adept and have effectively partnered with engineering or fundamental research teams of the highest caliber.
  • Have experience interfacing with and leading external vendors, including engineering firms, equipment suppliers, and/or construction firms.
  • Have expertise in designing and implementing simple, scalable processes that solve complex problems.
  • Have experience managing complicated dependencies such as logistics and/or supply chains.
  • Are relentlessly resourceful and thrive in ambiguous, fast-paced environments.
  • Are interested in and thoughtful about the impacts of AGI.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.
We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.

Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.


