Senior Distributed Systems Engineer

5 days ago


San Francisco, California, United States xAI Full time

About xAI

xAI is a cutting-edge organization dedicated to creating AI systems that accurately understand the universe and aid humanity in its pursuit of knowledge.

Our team is composed of highly motivated and focused individuals who excel in engineering and strive for excellence.

As a company, we operate with a flat organizational structure, where all employees are expected to be hands-on and contribute directly to our mission.

Leadership is given to those who demonstrate initiative and consistently deliver exceptional results.

We value strong communication skills, which enable our team members to share knowledge concisely and accurately with their peers.

xAI does not have recruiters; every application is reviewed directly by a technical member of our team.

Our Tech Stack

We utilize Python, Rust, and C++ in our development process.

JAX and XLA are key components of our AI systems.

NCCL and CUDA (C++ and Triton) are also essential tools in our arsenal.

Location

The role is based in the Bay Area, with San Francisco and Palo Alto being our primary locations.

Candidates are expected to be located near the Bay Area or be open to relocation.

Focus

Our ideal candidate will design, build, and implement large-scale distributed training systems.

Profiling, debugging, and optimizing multi-host GPU utilization are also key responsibilities.

Hardware, software, and algorithm co-design are essential aspects of this role.

Maintaining and innovating on our codebase is also crucial.

Building tools that boost our team's productivity is also part of the role.

Ideal Experiences

Experience in configuring and troubleshooting operating systems for maximum performance is essential.

Experience building scalable training frameworks for AI models on HPC clusters, including scalable orchestration frameworks and tools.

Experience with machine learning compilers and runtimes such as XLA, MLIR, and Triton is also valuable.

Experience with distributed training strategies such as FSDP, Megatron, and pipeline parallelism is key.

Experience with NCCL or custom communication libraries for performant collective communication is also essential.

Interview Process

After submitting your application, our team reviews your CV and statement of exceptional work.

If your application passes this stage, you will be invited to a 15-minute interview.

Following the initial phone interview, you will enter the main process, which consists of four technical interviews:

Coding assessment in a language of your choice.

Systems hands-on: Demonstrate practical skills in a live problem-solving session.

Project deep-dive: Present your past exceptional work to a small audience.

Meet and greet with the wider team.

Our goal is to finish the main process within one week.

Every application is reviewed by a member of our technical team.

All interviews will be conducted via Google Meet.

Annual Salary Range

$180,000 - $440,000 USD.


