AI Research Engineer, Enterprise Evaluations

3 weeks ago

San Francisco, United States The Rundown AI, Inc. Full time

Scale AI is seeking a technically rigorous and driven AI Research Engineer to join our Enterprise Evaluations team. This high-impact role is critical to our mission of delivering the industry's leading GenAI Evaluation Suite. You will be a hands-on contributor to the core systems that ensure the safety, reliability, and continuous improvement of LLM-powered workflows and agents for the enterprise.

The ideal candidate has strong foundational knowledge of large language models and a passion for tackling complex evaluation challenges, and thrives in a dynamic, fast-paced research environment. We are looking for an engineer who thinks outside the box, stays current with the latest literature in AI evaluation, and is passionate about integrating novel research ideas into our workflows to build best-in-class evaluation systems.

Responsibilities

- Partner with Scale’s Operations team and enterprise customers to translate ambiguity into structured evaluation data, guiding the creation and maintenance of gold-standard human-rated datasets and expert rubrics that anchor AI evaluation systems.
- Analyze feedback and collected data to identify patterns, refine evaluation frameworks, and establish iterative improvement loops that enhance the quality and relevance of human-curated assessments.
- Design, research, and develop LLM-as-a-Judge autorater frameworks and AI-assisted evaluation systems. This includes creating models that critique, grade, and explain agent outputs (e.g., RLAIF, model-judging-model setups), along with scalable evaluation pipelines and diagnostic tools.
- Pursue research initiatives that explore new methodologies for automatically analyzing, evaluating, and improving the behavior of enterprise agents, pushing the boundaries of how AI systems are assessed and optimized in real-world contexts.

Basic Qualifications

- Bachelor’s degree in Computer Science, Electrical Engineering, a related field, or equivalent practical experience.
- 2+ years of experience in Machine Learning or Applied Research, focused on applied ML systems or evaluation infrastructure.
- Hands-on experience with Large Language Models (LLMs) and Generative AI in professional or research environments.
- Strong understanding of frontier model evaluation methodologies and the current research landscape.
- Proficiency in Python and major ML frameworks (e.g., PyTorch, TensorFlow).
- Solid engineering and statistical analysis foundation, with experience developing data-driven methods for assessing model quality.

Preferred Qualifications

- Advanced degree (Master’s or Ph.D.) in Computer Science, Machine Learning, or a related quantitative field.
- Published research in leading ML or AI conferences such as NeurIPS, ICML, ICLR, or KDD.
- Experience designing, building, or deploying LLM-as-a-Judge frameworks or other automated evaluation systems for complex models.
- Experience collaborating with operations or external teams to define high-quality human annotator guidelines.
- Expertise in ML research engineering, stochastic systems, observability, or LLM-powered applications for model evaluation and analysis.
- Experience contributing to scalable pipelines that automate the evaluation and monitoring of large-scale models and agents.
- Familiarity with distributed computing frameworks and modern cloud infrastructure.

Enterprise AI Evaluation Engineer for LLMs

3 weeks ago

San Francisco, United States The Rundown AI, Inc. Full time

A leading AI solutions company in San Francisco is seeking an AI Research Engineer. This role involves partnering with teams to build evaluation datasets, designing AI evaluation systems, and pursuing cutting-edge research in model evaluation. Candidates should have strong experience in Machine Learning and LLMs, with a bachelor's degree in a related field....
Applied AI Researcher, AI Systems

6 days ago

San Francisco, California, United States Distyl AI Full time

Distyl AI develops AI-native technologies that let humans and AI collaborate to power the operations of the Global Fortune 1000. In just 24 months, we've rapidly grown to partner with some of the world's largest enterprises, including F100 telecom, healthcare, manufacturing, insurance, and retail companies, delivering multiple AI deployments with $100M+ impact....
Applied AI Engineering Manager, Enterprise

4 weeks ago

San Francisco, United States Scale AI Full time

AI is becoming vitally important in every function of our society. At Scale, our mission is to accelerate the development of AI applications. For 8 years, Scale has been the leading AI data foundry, helping fuel the most exciting advancements in AI, including generative AI, defense applications, and autonomous...
Head of Evaluation and Oversight Research

1 week ago

San Francisco, CA, United States Scale AI Full time

Scale is the leading data and evaluation partner for frontier AI companies, playing an integral role in advancing the science of evaluating and characterizing large language models (LLMs). Our research focuses on tackling the hardest problems in scalable oversight and the evaluation of advanced AI capabilities. We collaborate broadly across industry and...
