
Parsing Isn’t Neutral: Why Evaluation Choices Matter

September 26, 2025

Behind every AI benchmark is a hidden choice: how to read the model’s answers. That choice—parsing—can quietly tilt results more than the model itself.

Parsing is where we take an AI system’s raw response and extract the “answer” we use for scoring. It sounds mechanical, but as our research shows, the choice of parser can dramatically change measured accuracy.

In this post, we’ll unpack one of the most overlooked pieces of model evaluation, parsing, and explore three key questions:

  • What happens when you enforce strict versus flexible parsing rules?
  • How can structured outputs limit model reasoning itself?
  • Why do the same models look better or worse depending on the evaluation setup?

The Setup: Same Models, Different Parsers

We ran a range of models across SnorkelGraph, a benchmark of graph reasoning problems. Each problem has a structural answer, such as a list of nodes, which must be parsed, normalized, and then passed through graph validators to check correctness.
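To make the parse, normalize, validate sequence concrete, here is a minimal sketch for a path-finding style problem. The graph representation, the `normalize` rules, and the `is_valid_path` check are illustrative assumptions, not the actual SnorkelGraph validators.

```python
from typing import Dict, List, Set

# Hypothetical graph type: undirected adjacency sets keyed by node label.
Graph = Dict[str, Set[str]]


def normalize(nodes: List[str]) -> List[str]:
    """Trim whitespace and upper-case labels so ' a' and 'A' compare equal."""
    return [n.strip().upper() for n in nodes if n.strip()]


def is_valid_path(graph: Graph, path: List[str], start: str, end: str) -> bool:
    """Check the endpoints and that every consecutive pair is an existing edge."""
    if not path or path[0] != start or path[-1] != end:
        return False
    return all(b in graph.get(a, set()) for a, b in zip(path, path[1:]))


graph: Graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
parsed = [" a", "b ", "C"]  # what a parser might hand back
answer = normalize(parsed)
print(is_valid_path(graph, answer, "A", "C"))  # True
```

Note that the validator only ever sees what the parser hands it, which is why the parsing step can change the score before correctness is even checked.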

To prepare those answers, we tested two main parsing strategies:

  • Structured parsers: Require exact JSON outputs, parsed with tools like Pydantic.
  • Unstructured parsers: Extract answers using regex rules or another LLM.
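The difference between the two strategies can be sketched in a few lines. This is an illustration under stated assumptions: the standard-library `json` module stands in for a Pydantic model, and the `{"nodes": [...]}` schema and the `Final answer: [...]` convention are hypothetical formats chosen for the example.

```python
import json
import re
from typing import List, Optional


def parse_structured(response: str) -> Optional[List[str]]:
    """Strict: the whole response must be JSON shaped like {"nodes": [str, ...]}."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    nodes = data.get("nodes")
    if isinstance(nodes, list) and all(isinstance(n, str) for n in nodes):
        return nodes
    return None


# Hypothetical answer convention for free-form responses.
ANSWER_RE = re.compile(r"Final answer:\s*\[([^\]]*)\]", re.IGNORECASE)


def parse_unstructured(response: str) -> Optional[List[str]]:
    """Flexible: take the last 'Final answer: [...]' found anywhere in the text."""
    matches = ANSWER_RE.findall(response)
    if not matches:
        return None
    items = matches[-1].split(",")
    return [item.strip().strip("'\"") for item in items if item.strip()]
```

With these definitions, a response such as `Sure! {"nodes": ["A", "B"]}` is rejected by the structured parser because of the leading prose, while a response that reasons in free text and ends with `Final answer: [A, B]` is accepted by the unstructured one.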

This design put parsing at the center, allowing us to measure its direct impact on evaluation outcomes.

Results: Parsing Changes the Score

Figure 1: Parsing Methods Comparison. LiteLLM JSON mode uses a structured output flag, Pydantic AI applies its parsing library, Regex Parser extracts answers from a specified format, and LLM Parser uses GPT-4o to extract answers.
Figure 2: Model Performance Comparison

The results, shown in Figures 1 and 2, reveal three notable patterns:

  • Structured formats can constrain reasoning. Models like GPT-4.1 and Grok-3 performed worse when forced into rigid JSON structures, as the requirement to maintain exact formatting limited their ability to reason fully through the task.
  • Reasoning-first models held steady. Models like Claude Sonnet 4, Gemini 2.5 Pro, and o4-mini showed minimal sensitivity to the parsing method, with only slight decreases under regex parsing due to its stricter format requirements.
  • Flexible parsing raised scores for weaker models. Regex and LLM parsers captured valid answers from freer-form reasoning, improving reported accuracy. GPT-4o struggled across all formats but performed best when given more flexibility with the LLM parser.

Parsing speed also varied dramatically. As shown in Figure 3, regex and JSON parsing were nearly instantaneous, LLM parsing took a few seconds, and Pydantic AI lagged far behind at nearly 30 seconds per response.

Figure 3: Parsing Methods – Time Comparison
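For readers who want to take similar measurements on their own parsers, a small timing harness along these lines may help. It is an assumption about how one could measure per-response latency, not a description of how the numbers in Figure 3 were collected, and `mean_latency` is a hypothetical helper name.

```python
import time
from statistics import mean
from typing import Callable, List, Sequence


def mean_latency(
    parser: Callable[[str], object],
    responses: Sequence[str],
    repeats: int = 5,
) -> float:
    """Return the mean wall-clock seconds the parser spends per response."""
    samples: List[float] = []
    for _ in range(repeats):
        for response in responses:
            start = time.perf_counter()
            parser(response)
            samples.append(time.perf_counter() - start)
    return mean(samples)


# Any callable that takes a response string can be timed; str.split is a stand-in.
print(f"{mean_latency(str.split, ['A B C', 'D E']):.6f} s per response")
```

For parsers that call a remote model, network time will dominate, so it is worth timing on a representative sample rather than a handful of responses.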

In short: the same model could look better or worse depending on how its answers were parsed.
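A toy example shows the effect. The three responses and both parsers below are made up for illustration and are not drawn from the benchmark; the only point is that one fixed set of model outputs yields two different accuracy numbers.

```python
import json
import re
from typing import Callable, List, Optional, Sequence

Parser = Callable[[str], Optional[List[str]]]


def strict_json(response: str) -> Optional[List[str]]:
    """Accept only a response that is, in its entirety, a JSON list."""
    try:
        value = json.loads(response)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, list) else None


def lenient_regex(response: str) -> Optional[List[str]]:
    """Accept the last bracketed list found anywhere in the response."""
    found = re.findall(r"\[([^\]]*)\]", response)
    if not found:
        return None
    return [t.strip().strip("\"'") for t in found[-1].split(",") if t.strip()]


def accuracy(parser: Parser, responses: Sequence[str], gold: Sequence[List[str]]) -> float:
    """Fraction of responses whose parsed answer equals the reference answer."""
    hits = sum(parser(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)


responses = [
    '["A", "B", "C"]',
    'The path goes A to B to C, so the answer is ["A", "B", "C"].',
    "I think it is [A, C]",
]
gold = [["A", "B", "C"]] * 3

print(f"strict:  {accuracy(strict_json, responses, gold):.2f}")    # 0.33
print(f"lenient: {accuracy(lenient_regex, responses, gold):.2f}")  # 0.67
```

The second response is correct but wrapped in prose, so only the lenient parser credits it. The third is wrong under both.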

Why Structured Outputs Can Hold Models Back

Forcing models into structured formats didn’t just affect evaluation—it actively reduced reasoning quality.

Weaker models, in particular, struggled to balance two demands at once: reasoning through the problem and conforming to schema rules. The result was often incomplete reasoning or failed answers, even before validation.

By contrast, when models were allowed to reason freely and their answers were extracted and validated afterward, they generally scored as well or better. Structured constraints don’t just change how we measure results; they can reshape reasoning itself.

Why It’s Tricky

Parsing isn’t just a technical detail—it’s part of the evaluation. Strict parsing enforces discipline but can constrain reasoning. Flexible parsing captures more reasoning ability but risks overstating robustness by being too forgiving.

It’s a trade-off: exactness versus resilience. Both are valid, but they measure different things.
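The "too forgiving" risk is easy to reproduce. In the sketch below, both parsers and the sample response are hypothetical. A parser that accepts any bracketed list credits a candidate the model itself rejected, while a parser anchored on an explicit marker does not.

```python
import re
from typing import List, Optional


def split_nodes(body: str) -> List[str]:
    """Split a comma-separated list body into trimmed node labels."""
    return [t.strip() for t in body.split(",") if t.strip()]


def first_list(response: str) -> Optional[List[str]]:
    """Very forgiving: accept the first bracketed list anywhere in the text."""
    m = re.search(r"\[([^\]]*)\]", response)
    return split_nodes(m.group(1)) if m else None


def marked_answer(response: str) -> Optional[List[str]]:
    """Stricter: accept only a list that follows an explicit 'Final answer:' marker."""
    m = re.search(r"Final answer:\s*\[([^\]]*)\]", response)
    return split_nodes(m.group(1)) if m else None


response = (
    "One candidate is [A, B, C], but B and C are not connected, "
    "so I cannot find a path."
)
gold = ["A", "B", "C"]

print(first_list(response) == gold)     # True: credited despite being rejected
print(marked_answer(response) is None)  # True: no committed answer, no credit
```

Neither behavior is wrong in itself. The forgiving parser measures whether the right answer appears in the response, and the anchored parser measures whether the model committed to it.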

Recommendations for Practitioners

  • For precision: Use regex parsing—it’s reliable, fast, and strict.
  • For flexibility: Use LLM-based parsing—it better reflects reasoning, though it’s less exact.
  • Use caution with structured outputs: They can depress scores and limit reasoning for weaker models.

Closing Thoughts

Parsing may seem like a technical afterthought, but it shapes the story you tell about your AI. By making deliberate parsing choices—and recognizing their impact before answers even reach an evaluator—we can move from misleading metrics to evaluations we can trust.

That’s why our published SnorkelGraph benchmark uses an LLM parser with unstructured outputs. The goal isn’t to measure whether models can produce perfectly formatted JSON, but whether they can actually solve the complex spatial and mathematical reasoning problems the benchmark was designed to test.

At Snorkel AI, we pay close attention to every aspect of evaluating LLM responses, and we iteratively improve our evaluations by collaborating with our network of experts to develop rubrics of carefully chosen evaluation criteria. Be sure to take a look at our series of posts on rubric development. Get in touch with us if you have a project that needs high-quality data!

Justin Bauer
Research Engineer

Justin Bauer is a Research Engineer at Snorkel AI, working on synthetic data, evaluation, and benchmarks. He previously interned at Google DeepMind and Tesla, focusing on reinforcement learning and sensor perception.
