Software Engineer, AI Training Infrastructure
1 week ago
About Us:
Here at Fireworks, we’re building the future of generative AI infrastructure. Fireworks offers the generative AI platform with the highest-quality models and the fastest, most scalable inference. We’ve been independently benchmarked to have the fastest LLM inference and have been getting great traction with innovative research projects, like our own function calling and multi-modal models. Fireworks is funded by top investors, like Benchmark and Sequoia, and we’re an ambitious, fun team composed primarily of veterans from PyTorch and Google Vertex AI.

The Role:
As a Training Infrastructure Engineer, you'll design, build, and optimize the infrastructure that powers our large-scale model training operations. Your work will be essential to developing high-performance AI training infrastructure. You'll collaborate with AI researchers and engineers to create robust training pipelines, optimize distributed training workloads, and ensure reliable model development.

Key Responsibilities:
• Design and implement scalable infrastructure for large-scale model training workloads
• Develop and maintain distributed training pipelines for LLMs and multimodal models
• Optimize training performance across multiple GPUs, nodes, and data centers
• Implement monitoring, logging, and debugging tools for training operations
• Architect and maintain data storage solutions for large-scale training datasets
• Automate infrastructure provisioning, scaling, and orchestration for model training
• Collaborate with researchers to implement and optimize training methodologies
• Analyze and improve efficiency, scalability, and cost-effectiveness of training systems
• Troubleshoot complex performance issues in distributed training environments

Minimum Qualifications:
• Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience
• 3+ years of experience with distributed systems and ML infrastructure
• Experience with PyTorch
• Proficiency in cloud platforms (AWS, GCP, Azure)
• Experience with containerization and orchestration (Kubernetes, Docker)
• Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)

Preferred Qualifications:
• Master's or PhD in Computer Science or related field
• Experience training large language models or multimodal AI systems
• Experience with ML workflow orchestration tools
• Background in optimizing high-performance distributed computing systems
• Familiarity with ML DevOps practices
• Contributions to open-source ML infrastructure or related projects

Compensation is determined by various factors including individual qualifications, experience, skills, interview performance, market data, and work location. The listed salary range for this role is a guideline and may be modified.

Redwood City Pay Range: $175,000 - $220,000 USD

Why Fireworks AI?
• Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure, from low-latency inference to scalable model serving.
• Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally.
• Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI: no bureaucracy, just results.
• Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation.

Fireworks AI is an equal-opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all innovators.
-
Software Engineer, Training
2 weeks ago
Redwood City, United States | DatologyAI | Full time
Software Engineer, Machine Learning Infrastructure
About The...
-
Principal Software Engineer
3 weeks ago
Redwood City, United States | Snorkel AI | Full time
About Snorkel: At Snorkel, we believe meaningful AI doesn't start with the model, it starts with the data. We're on a mission to help enterprises transform expert knowledge into specialized AI at scale. The AI landscape has gone through incredible changes between 2015, when Snorkel started as a research project in the Stanford AI Lab, to the generative AI...
-
Principal Software Engineer AI Platform
3 weeks ago
Redwood City, United States | Snorkel AI | Full time
At Snorkel, we believe meaningful AI doesn't start with the model, it starts with the data. We're on a mission to help enterprises transform expert knowledge into specialized AI at scale. The AI landscape has gone through incredible changes between 2015, when Snorkel started as a research project in the Stanford AI...
-
Senior Principal Software Engineer
2 weeks ago
Redwood City, United States | Oracle | Full time
Senior Principal Software Engineer - AI Infrastructure Innovation
Oracle Cloud Infrastructure’s (OCI) architecture development engineering team is seeking a highly driven GPU platform software & system...
-
Senior Software Engineer, Platform
3 weeks ago
Redwood City, United States | C3 AI | Full time
Senior Software Engineer, Platform - Data + AI (Back-End)
C3 AI (NYSE: AI) is the Enterprise AI application software company. C3 AI delivers a family of fully integrated products including the C3 Agentic AI Platform, an end-to-end platform for developing, deploying, and...
-
Software Engineer, Training
3 days ago
Redwood City, United States | DatologyAI | Full time
About the Company: Models are what they eat. But a large portion of training compute is wasted training on data that are already learned, irrelevant, or even harmful, leading to worse models that cost more to train and deploy. At DatologyAI, we've built a state-of-the-art data curation suite to automatically curate and optimize petabytes of data to create the...
-
Senior Software Engineer, Platform Data + AI
2 weeks ago
Redwood City, United States | C3 AI | Full time
C3 AI (NYSE: AI) is the Enterprise AI application software company. C3 AI delivers a family of fully integrated products including the C3 Agentic AI Platform, an end-to-end platform for developing, deploying, and operating enterprise AI applications; C3 AI applications, a portfolio of industry-specific SaaS enterprise AI applications that enable the digital...
-
Senior Software Engineer
5 days ago
Redwood City, United States | Retell AI | Full time
ABOUT RETELL AI
Retell AI is using first principles to reimagine the call center with cutting-edge voice AI. We believe voice is still the most natural way humans communicate, yet it has been trapped in outdated call centers for decades. Our mission is to bring intelligence, empathy, and speed to every phone conversation between businesses and their...
-
Senior Software Engineer
2 weeks ago
Redwood City, United States | Retell AI | Full time
ABOUT RETELL AI
Retell AI is using first principles to reimagine the call center with cutting-edge voice AI. We believe voice is still the most natural way humans communicate, yet it has been trapped in outdated call centers for decades. Our mission is to bring intelligence, empathy, and speed to every phone conversation between businesses and their...
-
Senior Software Engineer
2 weeks ago
Redwood City, United States | Retell AI | Full time
ABOUT RETELL AI
Retell AI is using first principles to reimagine the call center with cutting-edge voice AI. We believe voice is still the most natural way humans communicate, yet it has been trapped in outdated call centers for decades. Our mission is to bring intelligence, empathy, and speed to every phone conversation between businesses and their...