
Harit Vishwakarma

Research Intern, Snorkel AI

Harit Vishwakarma is a Research Intern at Snorkel AI, focusing on evaluating and improving the reasoning capabilities of large language models. He recently completed his PhD in Computer Science at the University of Wisconsin–Madison. His research centers on studying and developing methods for reliable inference, and on applying them to automated data labeling and to improving performance at test time. Next, he is off to the University of Oxford for a postdoc.

The latest from Harit

Automating Benchmark Design

The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but they quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and using LLMs to reason through the resulting parameter space...

Research Paper

Accepted to ICLR 2026


Oct 30, 2025 • Amanda Dsouza, Harit Vishwakarma, Zhengyang Qi, Justin Bauer, Derek Pham, Thomas Walshe, Armin Parchami, Frederic Sala, Paroma Varma


Blog

Introducing SnorkelSpatial

A procedurally generated and programmatically verified benchmark for evaluating spatial reasoning capabilities in LLMs

Large language models (LLMs) are showing remarkable results in solving complex reasoning problems across domains, from mathematical proofs and logical puzzles to graduate-level science and engineering questions. Their spatial reasoning capabilities, on the other hand, are less well understood, even though such reasoning underlies many everyday tasks. We…

Oct 24, 2025 • Harit Vishwakarma


