Software Engineer, Distributed Systems

3 weeks ago


San Francisco, United States San Francisco Compute Co. Full time
About

We’re the San Francisco Compute Company. We’re building the first real-time compute trading platform. We think that over the next decade, thousands of startups and labs are going to be training and serving large models. They need compute to do this, and we’re building a platform on which that compute can be traded. If we’re successful, it will be possible to scale to tens of thousands of accelerators for hours at a time without having to build your own infrastructure. This will greatly increase the number of organizations that can afford to train large models, which will make the most important technology of our lifetime accessible to more people.

The Role

As a distributed systems software engineer, you’ll work on our in-house resource orchestration system. This system coordinates state for, and access to, hundreds (soon thousands) of GPU compute nodes in multi-tenant clusters spanning multiple data centers. Responsibilities include:

  • Design of distributed system architectures that enable highly available, fault-tolerant state management
  • Deployment automation and performance optimization of virtual machines that run on bare metal and use GPU passthrough
  • Design and deployment of multi-tier, high-performance, network-attached storage systems
About You
  • You have built fault-tolerant distributed systems that manage hardware resources at scale
  • You enjoy creating self-correcting systems that contribute to hardware health and reliability
  • You have experience with Linux virtualization (Cloud Hypervisor, QEMU, libvirt, virtiofs, SR-IOV, PCIe passthrough)
  • You appreciate and value good documentation
Some Nice to Haves
  • Experience with Rust (our VM orchestrator is written in Rust)
  • Experience with etcd
  • Experience with high-performance storage systems (WEKA, VAST, Ceph, etc.)
Benefits
  • Unlimited office book budget: You can buy as many books for the office as you want. You’re encouraged to spend time during the workday reading
  • Generous equity grant: Team members are offered a competitive salary along with equity in the company
  • Retirement matching: We match 401(k) plans up to 4%
  • Medical, dental & vision: We offer competitive medical, dental, and vision insurance for employees and dependents and cover 100% of premiums
  • Time off: We offer unlimited paid time off as well as 10+ observed holidays
  • Parental leave: We offer biological, adoptive, and foster parents paid time off to spend quality time with family
  • Daily lunch: We cover lunch daily for employees
  • Visa sponsorship: Yes, we sponsor visas and work permits

The San Francisco Compute Company is committed to maintaining a workplace free from discrimination and harassment. We make employment decisions based on business needs, job requirements, and individual qualifications, without regard to race, color, religion, belief, national origin, social or ethnic origin, age, physical, mental, or sensory disability, sexual orientation, gender identity or expression, marital status, civil union or domestic partnership status, past or present military service, HIV status, family medical history or genetic information, family or parental status including pregnancy, or any other status protected by law.

We welcome the opportunity to consider qualified applicants with prior arrest or conviction records. Our commitment to diversity includes hiring talented individuals regardless of their criminal history, in accordance with local, state, and federal laws, including San Francisco’s Fair Chance Ordinance and California’s ban-the-box laws.

If you require reasonable accommodation for any reason, please reach out to us at team@sfcompute.com.

