Category

Research

Snorkel AI emerged from a research project, and we remain closely connected to the research community. Students and professors associated with the Snorkel project continue to publish academic papers that push the field forward, and the Snorkel AI research team integrates the most promising of those ideas into our platform.

Our picks

Getting better performance from foundation models (with less data)
August 4, 2023
Fred Sala
Snorkel AI researchers present 18 papers at NeurIPS 2023
The Snorkel AI team will present 18 research papers and talks at the 2023 Neural Information Processing Systems (NeurIPS) conference from December 10-16. The Snorkel papers cover a broad range of topics including fairness, semi-supervised learning, large language models (LLMs), and domain-specific models. Snorkel AI is proud of its roots in the research community and endeavors to remain at the forefront
October 31, 2023
Team Snorkel
Long context models in the enterprise: benchmarks and beyond
Snorkel researchers devised a new way to evaluate long context models and address their “lost-in-the-middle” challenges with medoid voting.
June 6, 2024
Amanda Dsouza

All articles on Research

Intelligence Per Watt: A New Metric for AI’s Future
Snorkel AI contributes specialized datasets to Hazy Research’s “Intelligence-per-Watt” study, advancing how efficiently AI turns energy into intelligence.
November 12, 2025
Kobie Crawford
Snorkeling in RL environments
We unpack what makes a high-quality RL environment for LLMs and show how we build realistic, enterprise-grade environments at Snorkel AI.
November 4, 2025
Armin Parchami
Introducing SnorkelSpatial
A procedurally generated and programmatically verified benchmark for evaluating spatial reasoning capabilities in LLMs. Large language models (LLMs) are showing remarkable results on solving complex reasoning problems across domains, from mathematical proofs and logical puzzles to graduate-level science and engineering questions. On the other hand, their spatial reasoning capabilities are less understood, even though such reasoning underlies many everyday tasks. We
October 24, 2025
Harit Vishwakarma
Scaling Trust: Rubrics in Snorkel’s Quality Process
Snorkel’s “Trusted Scale” philosophy. Welcome to Part 4 of Snorkel AI’s rubric series. In previous posts, we explored how rubrics enable structured evaluation (Part 1), the spectrum of rubric types and use cases (Part 2), and the science behind designing and validating them (Part 3). In this latest installment, we pull back the curtain on how Snorkel puts these principles
October 16, 2025
Derek Pham
Evaluating Multi-Agent Systems in Enterprise Tool Use
In recent months, there has been increasing interest in the area of multi-agent systems and how they can be used to solve more complex tasks than a single agent could accomplish on its own. The topic is particularly interesting and raises several questions and ideas to consider: Anthropic’s blog post about how they architected a multi-agent deep research system is
October 9, 2025
Bhavishya Pohani
Evaluating Coding Agent Capabilities with Terminal-Bench: Snorkel’s Role in Building the Next Generation Benchmark
Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold standard benchmark for evaluating AI agent capabilities in a command line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we’re excited to share that we’re one of the top collaborators contributing
September 30, 2025
Tom Walshe, Kobie Crawford, Jeong Shin
Parsing Isn’t Neutral: Why Evaluation Choices Matter
Behind every AI benchmark is a hidden choice: how to read the model’s answers. That choice, parsing, can quietly tilt results more than the model itself. Parsing is the step where we take an AI system’s raw response and extract the “answer” we use for scoring. It sounds mechanical, but as our research shows, the choice of parser can dramatically change measured accuracy. In
September 26, 2025
Justin Bauer
The science of rubric design
Part 3 of our rubric series explains the science of rubric design. We show why rubrics should be treated like models—structured, measured, and iterated—to maximize objective alignment and inter-rater agreement. Learn how to choose hierarchy and scale points, track agreement (IAA) and LLMAJ alignment, and refine with domain experts, with examples like PaperBench and HealthBench.
September 11, 2025
Chris Glaze, Charles Dickens
The right tool for the job: An A-Z of rubrics
Rubrics turn fuzzy “good vs. bad” into measurable criteria for GenAI. In Part 2, we map what to measure (granularity and dataset-level vs instance-specific), where to measure (process vs outcome), and how to measure (humans, LLM-as-judge, code, reward models)—with examples like HHH, FLASK, HealthBench, and PaperBench.
September 2, 2025
Tom Walshe, Armin Parchami
Data quality and rubrics: how to build trust in your models
Rubrics aren’t just for evaluation—they’re a blueprint for better data annotation. In this post, we explore how structured rubrics enable scalable, high-quality labeling and evaluation of GenAI systems. Learn how Snorkel and leading labs use rubrics to align human and automated judgment and accelerate trusted AI development.
July 29, 2025
Armin Parchami
Research spotlight: is long chain-of-thought structure all that matters when it comes to LLM reasoning distillation?
We’re taking a look at the research paper, LLMs can easily learn to reason from demonstration (Li et al., 2025), in this week’s community research spotlight. It focuses on how the structure of reasoning traces impacts distillation from models such as DeepSeek R1. What’s the big idea regarding LLM reasoning distillation? The reasoning capabilities of powerful models such as DeepSeek
March 19, 2025
Shane Johnson
Research spotlight: Is intent analysis the key to unlocking more accurate LLM question answering?
Learn how ARR improves QA accuracy in LLMs through intent analysis, retrieval, and reasoning. Is intent the key to smarter AI? Explore ARR results!
February 27, 2025
Shane Johnson
Long context models in the enterprise: benchmarks and beyond
Snorkel researchers devised a new way to evaluate long context models and address their “lost-in-the-middle” challenges with medoid voting.
June 6, 2024
Amanda Dsouza
How ROBOSHOT boosts zero-shot foundation model performance
ROBOSHOT acts like a lens on foundation models and improves their zero-shot performance without additional fine-tuning.
April 30, 2024
Dyah Adila
Snorkel teams with Microsoft to showcase new AI research at NVIDIA GTC
Microsoft infrastructure facilitates Snorkel AI research experiments, including our recent high rank on the AlpacaEval 2.0 LLM leaderboard.
March 18, 2024
Snorkel Team