ML Systems Engineer, Infrastructure
4 days ago
Basis is a nonprofit applied AI research organization with two mutually reinforcing goals.
The first is to understand and build intelligence. This means to establish the mathematical principles of what it means to reason, to learn, to make decisions, to understand, and to explain; and to construct software that implements these principles.
The second is to advance society's ability to solve intractable problems. This means expanding the scale, complexity, and breadth of problems that we can solve today, and even more importantly, accelerating our ability to solve problems in the future.
To achieve these goals, we're building both a new technological foundation that draws inspiration from how humans reason, and a new kind of collaborative organization that puts human values first.
About the Role
ML Systems Engineers at Basis ensure training and evaluation infrastructure is fast, reliable, and scalable. You will own the full stack from distributed training frameworks through cloud administration, making it possible for researchers to iterate quickly on complex models while managing computational resources efficiently.
We are looking for engineers who combine deep understanding of ML systems with operational excellence. The ideal ML Systems Engineer has experience with distributed training at scale, understands the intricacies of debugging numerical instabilities, and can manage cloud infrastructure that scales from experiments to production. You will be the guardian of training stability, the optimizer of compute costs, and the enabler of reproducible research.
This role spans traditional ML engineering and cloud/DevOps responsibilities. You will manage GPU clusters, optimize cloud spending, ensure security and compliance, and build the infrastructure that lets researchers focus on algorithms rather than operations.
We seek individuals who aspire to build robust ML infrastructure, maintain "logbook culture" for documenting issues and solutions, and treat operational excellence as a first-class concern.
We expect you to:
Have demonstrated expertise in ML systems engineering. Examples include:
Managing distributed training jobs across hundreds of GPUs
Debugging and fixing numerical instabilities in large-scale training
Building infrastructure for reproducible ML experiments
Optimizing training throughput and resource utilization
Possess deep knowledge of distributed training frameworks including PyTorch/JAX distributed strategies (DDP, FSDP, ZeRO), gradient accumulation, mixed precision training, and checkpoint/recovery systems.
Have strong cloud administration skills including AWS/GCP/Azure services, infrastructure as code (Terraform), Kubernetes orchestration, cost optimization, security best practices, and compliance requirements.
Understand the full ML stack from hardware (GPUs, interconnects, storage) through frameworks (PyTorch, JAX) to high-level training loops and evaluation pipelines.
Be skilled at debugging complex failures across the stack—GPU/NCCL issues, data loading bottlenecks, memory leaks, gradient explosions, and convergence problems.
Value documentation and knowledge sharing. You maintain comprehensive logs of issues encountered, solutions found, and lessons learned, building institutional knowledge.
Progress with autonomy while coordinating closely with researchers. You can anticipate infrastructure needs, prevent problems before they occur, and respond quickly when issues arise.
In addition, the following would be an advantage:
Experience at organizations training large models (OpenAI, Anthropic, Google, Meta).
Background in both ML research and production systems.
Contributions to ML frameworks or distributed training libraries.
Experience with on-premise GPU cluster management.
Knowledge of optimization theory and numerical methods.
Understanding of robotics-specific infrastructure requirements.
In this role, you will:
Own distributed training infrastructure including job launchers, checkpointing systems, recovery mechanisms, and monitoring that ensures experiments run reliably at scale.
Debug and resolve training failures by diagnosing issues across GPUs, networking, numerics, and data pipelines, maintaining detailed logs of problems and solutions.
Profile and optimize training performance by identifying bottlenecks in data loading, gradient computation, communication overhead, and implementing solutions that improve step time.
Manage cloud infrastructure and costs including capacity planning, spot instance strategies, storage optimization, and building tools that give researchers visibility into resource usage.
Implement security and compliance measures including access controls, data encryption, audit logging, and ensuring infrastructure meets requirements for handling sensitive data.
Build evaluation and benchmarking infrastructure that enables consistent, reproducible measurement of model performance across different conditions and datasets.
Develop monitoring and alerting systems that detect anomalies in training metrics, resource utilization, or system health, enabling rapid response to issues.
Maintain development environments including containerization, dependency management, and tools that ensure researchers can reproduce results across different systems.
Document and share knowledge through runbooks, post-mortems, and training materials that help the team understand and operate ML infrastructure effectively.
Collaborate with researchers to understand requirements, suggest infrastructure solutions, and ensure systems support rather than constrain research goals.
Exceptional candidates who may not meet all of the criteria above are still encouraged to apply.
FT/PT: Full-time.
In-person Policy: We are in the office four days a week. Be prepared to attend multi-day Basis-wide in-person events.
Location: New York City or Cambridge, MA.
Salary range: Competitive salary.
Privacy Notice
By submitting your application, you grant Basis permission to use your materials for both hiring evaluation and recruitment-related research and development purposes. Your information may be processed in different countries, including the US. You retain copyright while providing Basis a license to use these materials for the stated purposes.
Read our full Global Data Privacy Notice here.