AI Infrastructure Engineer

3 weeks ago

Sunnyvale, United States Scout Ai Full time

The future of defense will be decided by those who field intelligent machines at scale. At Scout, we’re developing Fury, the first robotic foundation model for defense, to give U.S. forces overwhelming, adaptable, and autonomous power across every domain. Fury enables human operators to command fleets of robots through natural language, and empowers those machines to sense, decide, and act together as one. It’s not just a leap in autonomy; it’s a force multiplier built for real-world conflict. This mission will ask everything of us: urgency, precision, and relentless work.

The Role

We’re looking for an AI Infrastructure Engineer to build and scale the backbone of Fury’s model training and deployment ecosystem. You’ll design the data, compute, and orchestration infrastructure that enables our vision-language-action models to learn from massive real-world datasets and operate across edge and cloud environments. This role bridges systems engineering, distributed computing, and machine learning infrastructure. Your work will ensure our teams can iterate rapidly, train large models efficiently, and deploy them reliably on robotic platforms in the field.

We’re a startup. You’ll be moving fast, context-switching daily, and helping define the culture and process as we go. This is a rare opportunity to come in early and architect the future of defense.

Responsibilities

Design and implement data pipelines for ingesting, transforming, and storing petabytes of multimodal data from Fury’s robotic and operator systems
Develop internal tooling for dataset exploration, curation, versioning, and quality monitoring over time
Build and maintain distributed training infrastructure (cloud and on-prem) for large-scale multimodal and foundation model training
Implement job orchestration workflows for launching, tracking, and debugging large-scale model runs
Identify and remediate bottlenecks in compute, memory, storage, and network performance to optimize throughput and cost efficiency
Collaborate with AI, autonomy, and systems teams to ensure data and training infrastructure supports real-time and mission-critical use cases
Maintain observability and reliability tooling for training and inference pipelines
Stay current on best practices in MLOps, distributed training frameworks, and AI infrastructure at scale

Qualifications

3+ years of experience in ML infrastructure, MLOps, or large-scale data systems
Proven experience with distributed training (PyTorch DDP, DeepSpeed, Ray, or similar) and workflow orchestration (Kubernetes, Airflow, or equivalent)
Strong proficiency in Python and cloud-native infrastructure (AWS, GCP, or Azure)
Deep understanding of data engineering (ETL pipelines, object storage, data versioning, metadata management)
Familiarity with containerization and deployment (Docker, Kubernetes) and monitoring systems (Prometheus, Grafana)
Experience optimizing GPU cluster utilization, scaling training jobs, and profiling model performance
Bachelor’s degree or higher in Computer Science, Electrical Engineering, or related technical field
Bonus: Experience with edge-deployed ML systems, federated training, or robotic data collection pipelines
Must be a U.S. Person due to required access to U.S. export-controlled information or facilities

Why Join Scout

Work on the world’s most important frontier, ensuring U.S. and allied dominance in the age of intelligent machines
Be a core part of a team building the first defense-specific robotic foundation model
Collaborate with some of the top engineers in autonomy, AI, and national security
See your work deployed on real systems
Help define the future of intelligent defense systems
Backed by Draper Associates, Booz Allen Ventures, and other top investors

Benefits

Competitive base salary and meaningful equity
Premium medical, dental, and vision plans with $0 paycheck contribution
Competitive PTO and company holiday calendar
Catered lunch daily and fully stocked kitchen
EV charging
Relocation assistance (depending on role eligibility)

AI Infrastructure Operations Engineer

7 days ago

Sunnyvale, CA, United States CEREBRAS SYSTEMS INC. Full time

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to...
AI Infrastructure Operations Engineer

2 weeks ago

Sunnyvale, CA, United States CEREBRAS SYSTEMS INC. Full time

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to...
AI Infrastructure Operations Engineer

1 week ago

Sunnyvale, CA, United States CEREBRAS SYSTEMS INC. Full time

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to...
Senior Software Engineer, Infrastructure, AI and Infrastructure

3 weeks ago

Sunnyvale, United States Google Inc. Full time

Minimum qualifications: Bachelor’s degree or equivalent practical experience. 5 years of experience programming in C++, Python or Go. 3 years of experience testing, maintaining, or launching software products, and 1 year of experience with software design and architecture. 3 years of experience developing large-scale infrastructure, distributed systems or...
Software Engineer III, AI/ML, AI and Infrastructure

4 weeks ago

Sunnyvale, United States Google Inc. Full time

Bachelor’s degree or equivalent practical experience. 2 years of experience programming in Python or C++. 1 year of experience with one or more of the following: Speech/audio (e.g., technology duplicating and responding to the human voice),...
Software Engineer III, AI/ML GenAI, AI and Infrastructure

4 weeks ago

Sunnyvale, United States Google Full time

Minimum qualifications: Bachelor’s degree or equivalent practical experience. 2 years of experience programming in Python or C++. 1 year of experience with ML infrastructure (e.g., model deployment, model...
Software Engineer III, AI/ML GenAI, AI and Infrastructure

4 weeks ago

Sunnyvale, United States Google Inc. Full time

Bachelor’s degree or equivalent practical experience. 2 years of experience programming in Python or C++. 1 year of experience with ML infrastructure (e.g., model deployment, model evaluation, optimization, data processing, debugging). Experience...
Distributed Systems Engineer for AI Infrastructure

2 weeks ago

Sunnyvale, CA, United States The Crypto Recruiters Full time

Join our innovative team as a Distributed Systems Engineer focused on enhancing our cutting-edge data infrastructure for AI workloads. If you are passionate about building robust distributed systems and creating powerful tools for data orchestration and retrieval, we want to hear from you! We are based in downtown SF and require collaboration in the office...
Software Engineer III, AI/ML GenAI, AI and Infrastructure

4 days ago

Sunnyvale, California, United States Google Full time $141,000 - $202,000

Minimum qualifications: Bachelor's degree or equivalent practical experience. 2 years of experience programming in Python or C++. 1 year of experience with ML infrastructure (e.g., model deployment, model evaluation, optimization, data processing, debugging). Experience with core GenAI concepts (LLM, Multi-Modal, Large Vision Models) and experience with text,...
Senior Network Engineer – AI Cloud Infrastructure

2 weeks ago

Sunnyvale, United States CMK Resources, Inc. Full time

CMK Resources is partnering with a fast-scaling AI cloud platform on a high-impact, confidential search. This team is solving cutting-edge infrastructure challenges to support massive-scale AI and HPC workloads. They are urgently seeking an experienced Staff/Sr. Staff+ Network Engineer to lead architecture and design of next-generation networking...
