The Self-Critique Paradox: Why AI Verification Fails Where It’s Needed Most

November 26, 2025

TL;DR: We stress-tested the “generate → critique → improve” loop on 50 visual reasoning tasks. The results were counterintuitive: on tasks the models already solved well, self-critique was corrosive, dropping one model from 98% accuracy to 57%. Yet on tasks where the models initially failed completely, it lifted accuracy from 0% to between 20% and 60%. This difficulty-dependent behavior poses a critical, hidden risk for Reinforcement Learning Fine-Tuning (RLFT) pipelines.


The Promise vs. The Reality

The “agentic loop”—having an AI critique and improve its own work—is a popular method for attempting to boost performance. Techniques like Self-Refine (Madaan et al., 2023) and Reflexion (Shinn et al., 2023) have popularized the idea that iterative feedback can help models solve complex tasks. The logic is simple: two heads (even if they’re the same head) are better than one.

But are they always?

We ran a rigorous experiment:

  • 50 hard visual reasoning tasks (verifiable ground truth)
  • 2 frontier models (Claude Sonnet 4.5, OpenAI o4-mini)
  • 5 critique-improve iterations per task
  • 100 total experiments
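
To make the setup concrete, here is a minimal sketch of a critique-improve loop of the kind described above. It is not the harness used in the experiment: `generate`, `critique`, and `improve` are hypothetical callables standing in for model calls, and the toy stubs at the bottom exist only so the sketch runs without any API access.

```python
from typing import Callable, List


def critique_loop(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    improve: Callable[[str, str, str], str],
    iterations: int = 5,
) -> List[str]:
    """Run generate -> critique -> improve and keep every intermediate answer.

    Index 0 of the returned list is the first draft; index i is the answer
    after the i-th critique-improve iteration.
    """
    answers = [generate(task)]
    for _ in range(iterations):
        feedback = critique(task, answers[-1])
        answers.append(improve(task, answers[-1], feedback))
    return answers


# Toy stand-ins (assumptions, not real model behavior): the first draft is
# "yes", the critic always objects, and the improver flips the answer.
def toy_generate(task: str) -> str:
    return "yes"


def toy_critique(task: str, answer: str) -> str:
    return "possible discrepancy found"


def toy_improve(task: str, answer: str, feedback: str) -> str:
    return "no" if answer == "yes" else "yes"


history = critique_loop(
    "Are the two shapes identical?", toy_generate, toy_critique, toy_improve
)
print(history)  # ['yes', 'no', 'yes', 'no', 'yes', 'no']
```

Keeping the full history, rather than only the final answer, is what lets you compare the initial draft against Loop 5 for each task.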

Why these models? We chose two of the strongest reasoning models available. If state-of-the-art models optimized for reasoning cannot use self-critique without degrading answers they already got right, the limitation is likely to be at least as severe in weaker or smaller models.

What we found challenges the core assumption of agentic AI.


The Data: A Tale of Two Extremes

When we aggregated our results, it looked like a generic failure: accuracy dropped 10% overall. But when we split tasks by difficulty, a startling pattern emerged.

1. The “Corrosive Critique” Effect (Easy Tasks)

For tasks where models started strong (≥75% accuracy), the critique loop was devastating for both models.

Model             | Initial | Loop 5 | Drop    | Result
Claude Sonnet 4.5 | 98.1%   | 56.9%  | ↓ 41.2% | 0 improved, 8 degraded
OpenAI o4-mini    | 94.2%   | 78.4%  | ↓ 15.8% | 0 improved, 5 degraded

What happened? Hallucination. The critic, primed to find errors, invented them. A correct answer of “yes” became “no” because the model “detected” a 2-pixel discrepancy that didn’t exist. Confidence became a liability.

2. The “Lazarus” Effect (Hard Tasks)

For tasks where models failed completely (<35% accuracy), critique was a miracle worker.

Model             | Initial | Loop 5 | Gain    | Result
Claude Sonnet 4.5 | 0.0%    | 60.0%  | ↑ 60.0% | 3 improved, 0 degraded
OpenAI o4-mini    | 0.0%    | 20.0%  | ↑ 20.0% | 1 improved, 0 degraded

Here, the critic had real errors to catch, such as calculation mistakes and logic inversions, and debugging actually worked. Both models showed the same pattern in both regimes, which suggests a fundamental property of LLM reasoning rather than a quirk of one architecture.
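
The difficulty split behind the two tables can be sketched roughly as follows. The thresholds (≥75% and <35% initial accuracy) come from the text above; everything else is an assumption for illustration, including the `TaskResult` record, the "middle" bucket name, and the sample numbers, which are invented and are not the experiment's data.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

EASY_MIN = 0.75  # "started strong" threshold from the article
HARD_MAX = 0.35  # "failed completely" threshold from the article


@dataclass
class TaskResult:
    task_id: str
    initial: float  # accuracy of the first draft, 0.0 to 1.0
    final: float    # accuracy after the last critique loop


def bucket(result: TaskResult) -> str:
    """Assign a task to a difficulty bucket by its initial accuracy."""
    if result.initial >= EASY_MIN:
        return "easy"
    if result.initial < HARD_MAX:
        return "hard"
    return "middle"


def summarize(results: List[TaskResult]) -> Dict[str, dict]:
    """Per-bucket mean accuracy plus counts of improved and degraded tasks."""
    groups: Dict[str, List[TaskResult]] = {}
    for r in results:
        groups.setdefault(bucket(r), []).append(r)
    return {
        name: {
            "n": len(rows),
            "initial": mean(r.initial for r in rows),
            "final": mean(r.final for r in rows),
            "improved": sum(r.final > r.initial for r in rows),
            "degraded": sum(r.final < r.initial for r in rows),
        }
        for name, rows in groups.items()
    }


# Invented sample records, for illustration only.
sample = [
    TaskResult("a", 1.0, 0.5),
    TaskResult("b", 1.0, 1.0),
    TaskResult("c", 0.0, 1.0),
    TaskResult("d", 0.5, 0.5),
]
summary = summarize(sample)
print(summary["easy"])
print(summary["hard"])
```

Reporting the aggregate alone would hide the pattern: the easy and hard buckets move in opposite directions and partly cancel out.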


The Hidden Danger for Model Training

This finding has profound implications beyond just prompt engineering. It strikes at the heart of modern model training, particularly Reinforcement Learning Fine-Tuning (RLFT) and Reinforcement Learning from AI Feedback (RLAIF).

The Reward Modeling Trap

In RLFT/RLAIF pipelines, we often use a strong model (the “Judge”) to score the outputs of a model being trained. If the Judge is the same model (or a similar one), our results suggest a dangerous feedback loop:

  1. Penalty for Perfection: If the student model gets an easy task right, the Judge might hallucinate a flaw and penalize it.
  2. Reward for Uncertainty: The Judge may prefer hedged, uncertain answers over confident, correct ones to avoid “missed” errors.
  3. Drift: Over time, this could train models to be less decisive on simple tasks while over-correcting on complex ones.

If your reward model (Judge) has the same blind spots as your policy model, self-correction isn’t just useless—it’s an adversarial attack on your own training data.
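
One way to see whether a Judge suffers from “Penalty for Perfection” is to audit it on tasks with verifiable ground truth and count how often it flags answers that are actually correct. The sketch below is an illustration under stated assumptions, not part of our pipeline: `false_penalty_rate` is a hypothetical helper, and the audit records are invented.

```python
from typing import List, Tuple


def false_penalty_rate(records: List[Tuple[bool, bool]]) -> float:
    """Share of correct answers that the Judge flagged as flawed anyway.

    Each record is (answer_is_correct, judge_flagged_error), where
    correctness comes from ground truth rather than from the Judge.
    """
    flags_on_correct = [flagged for correct, flagged in records if correct]
    if not flags_on_correct:
        raise ValueError("no correct answers to audit")
    return sum(flags_on_correct) / len(flags_on_correct)


# Invented audit data: four correct answers (one wrongly flagged)
# and one incorrect answer (rightly flagged).
audit = [
    (True, False),
    (True, True),
    (True, False),
    (True, False),
    (False, True),
]
rate = false_penalty_rate(audit)
print(f"Judge penalized {rate:.0%} of correct answers")  # 25%
```

A rate well above zero on easy tasks would mean the reward signal is punishing the policy for being right.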


When to Use Critique in Agentic Solutions

The data is clear: Self-critique is not a free lunch. It’s a high-stakes bet that only pays off when you’re already losing.

The Core Strategy: Triage Your Tasks

Don’t apply a flat “3 loops” policy to everything. You must categorize incoming requests by difficulty or risk corrosive effects.

1. The “Red Zone” (Easy Tasks) -> ZERO Loops

Identify them by: Simple classification, high initial confidence (>90%), or tasks where LLMs historically excel (e.g., sentiment analysis, basic extraction).
Action: Trust the first draft. Critique here is actively harmful (accuracy fell by roughly 16 to 41 percentage points in our runs).
Why: The model is right, but the critic will hallucinate flaws to justify its existence.

2. The “Green Zone” (Hard Tasks) -> 3-5 Loops

Identify them by: Complex reasoning, multi-step logic, or low initial confidence (<50%).
Action: Force critique loops.
Why: The model is likely wrong initially. The critic acts as a debugger, catching calculation errors or logic gaps that the generator missed.
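
The two zones above translate into a simple routing rule. This is a minimal sketch, assuming you have some confidence score between 0.0 and 1.0 for the first draft; `critique_loops_for` is a hypothetical name. The zones say nothing about confidence between 50% and 90%, so the `middle_loops` default of 0 is an assumption, not a recommendation from the data.

```python
def critique_loops_for(
    confidence: float,
    hard_loops: int = 3,
    middle_loops: int = 0,
) -> int:
    """Pick the number of critique loops from first-draft confidence.

    Red Zone   (confidence > 0.90): 0 loops, trust the first draft.
    Green Zone (confidence < 0.50): 3 to 5 loops, force critique.
    Anything in between falls back to `middle_loops` (an assumption).
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if not 3 <= hard_loops <= 5:
        raise ValueError("hard_loops should stay in the 3 to 5 range")
    if confidence > 0.90:
        return 0
    if confidence < 0.50:
        return hard_loops
    return middle_loops


for score in (0.95, 0.70, 0.30):
    print(score, "->", critique_loops_for(score), "loops")
```

How you obtain the confidence score (model self-report, agreement across samples, task type) is left open here and will affect how well the triage works.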

The Golden Rule for Agents

Critique is for debugging, not polishing.
If your agent is confident and the task is standard, shut the critic up. Only engage the loop when the model is struggling or the task complexity demands a “second pair of eyes” to catch structural errors.


References

  1. Self-Refine: Madaan, A., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
  2. Reflexion: Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366.

Armin Parchami
Director of Research Engineering

Armin Parchami is the Director of Research Engineering at Snorkel AI, where he leads work on synthetic data, data quality, and model fine-tuning. He previously held technical leadership roles at Ford and Nokia Bell Labs, focusing on multimodal AI and autonomy. His work centers on moving research into production.