author

Armin Parchami

Director of Research Engineering, Snorkel AI

Armin Parchami is the Director of Research Engineering at Snorkel AI, where he leads work on synthetic data, data quality, and model fine-tuning. He previously held technical leadership roles at Ford and Nokia Bell Labs, focusing on multimodal AI and autonomy. His work centers on moving research into production.

The latest from Armin

Cua-Bench: benchmarking computer-use agents on professional software
Blog

TL;DR We built a benchmark of 25 expert-authored KiCad schematic-editing tasks and ran a frontier computer-use agent against them. The headline numbers: 1. Why build a computer-use benchmark for electrical engineering? Most computer-use benchmarks today live in the same handful of apps: web browsers, file managers, generic productivity suites. Those evaluations are useful, but they share a structural weakness —…

The Self-Critique Paradox: Why AI Verification Fails Where It’s Needed Most
Blog

TL;DR: We stress-tested the “generate → criticize → improve” loop on 50 visual reasoning tasks. The results were counterintuitive: self-critique acts as a corrosive agent on high-performance tasks, turning 98% accuracy into 57%. Yet, for tasks where models fail completely, it works like magic. This difficulty-dependent behavior poses a critical, hidden risk for RLFT pipelines. The Promise vs. The Reality…

Nov 26, 2025
Snorkeling in RL environments
Blog

We unpack what makes a high-quality RL environment for LLMs and show how we build realistic, enterprise-grade environments at Snorkel AI.

Nov 04, 2025
Automating Benchmark Design
The rapid progress and widespread deployment of LLMs and LLM-powered agents has outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space...
Research Paper
Accepted to ICLR 2026
The right tool for the job: An A-Z of rubrics
Blog

Rubrics turn fuzzy “good vs. bad” into measurable criteria for GenAI. In Part 2, we map what to measure (granularity and dataset-level vs instance-specific), where to measure (process vs outcome), and how to measure (humans, LLM-as-judge, code, reward models)—with examples like HHH, FLASK, HealthBench, and PaperBench.

Sep 02, 2025
Data quality and rubrics: how to build trust in your models
Blog

Rubrics aren’t just for evaluation—they’re a blueprint for better data annotation. In this post, we explore how structured rubrics enable scalable, high-quality labeling and evaluation of GenAI systems. Learn how Snorkel and leading labs use rubrics to align human and automated judgment and accelerate trusted AI development.

Jul 29, 2025
A Clinical Text Classification Paradigm Using Weak Supervision…
This work develops a rule-based NLP algorithm to automatically generate labels for the training data, and then uses pre-trained word embeddings as deep representation features for training machine learning models.
Research Paper
For models that need to be right. Not just good enough.