Benchtalks #2: The Future of Coding Benchmarks with John Yang (SWE-Bench, ProgramBench)

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench.

Highlights

Every frontier model scored 0% at launch: ProgramBench tests whether language models can rebuild real software (FFmpeg, SQLite, the PHP interpreter, the tinycc compiler) from scratch with no internet access. At launch, every frontier model on the leaderboard scored 0%. A week later, GPT 5.5 became the first model to crack a single task. As Yang puts it, “I can’t implement FFmpeg from scratch in one month. But maybe the model can. That’s a gear shift that’s meaningful.”
Cheating detection is its own benchmark problem: When the team allowed models internet access during evaluation, up to 36% of strong-model runs were flagged by a 9-judge LM panel for looking up source code, despite an explicit system prompt forbidding it. Those judges disagreed 40 to 57 percent of the time on what counted as cheating. Yang’s stance: “Rigor is king, reliability is king, reproducibility is king. Everything else is cherry on top.”
Verification has evolved alongside the benchmarks: It has moved from SWE-bench’s unit tests, to CodeClash’s tournament-style head-to-head competition between artifacts, to ProgramBench’s behavioral test suites that never touch the implementation. Yang frames the progression as a question of “fundamentally what do you want to test about the model?” He notes that the fuzzer-generated suites in ProgramBench averaged 79.7% line coverage on the underlying executables, higher than the 56.8% of the projects’ own native test suites.
The next frontier is not autonomy, it is interaction: Yang’s recent position paper argues that as agents get better at solo task-solving, the bottleneck shifts to how humans communicate with, steer, and verify them. As Yang notes, “If a model can get 80% on ProgramBench but it’s implementing everything in Python in single files, that’s a meaningful end state, but we can ask more out of the models.”

More on ProgramBench: See the benchmark and the upcoming leaderboard at programbench.com.

More from John Yang: Publications and writing at john-b-yang.github.io.

Snorkel Open Benchmarks Grants: A $3M commitment to funding benchmarks for frontier agents. Learn more.

Watch the full interview on YouTube

Episode Transcript

ProgramBench launch reception

Vincent Chen: Welcome to Benchtalks, John. Last week you dropped ProgramBench, a new benchmark focused on end-to-end program-level output, similar to Nicholas Carlini’s work on the C compiler. When you first launched it, every frontier model had a 0% pass rate. And then just this week, GPT cracked that 0% ceiling. What has the release been like?

John Yang: It’s been a fun time. We released last Tuesday morning, and I’m grateful for all the good reception. A lot of people were excited because there have already been really great works that investigated this problem at a case study level. ProgramBench is a natural follow-up to the SWE-bench setting, where you have a much simpler environment: here’s a GitHub issue, create a PR for it. That was meaningful for two years. But now that these models are so good at solving those PRs, the question is whether they can put together a large application.

Nicholas Carlini and Anthropic’s blog on the C compiler was a meaningful inspiration. There’s also a blog post from Cursor on different multi-agent setups they were using to put together a browser. And then Epoch AI along with METR had RE-Bench, a case study on four repositories. The role of ProgramBench is to formalize this setting and add enough task instances that we can study this domain with meaningful statistical power. When you have one-offs, the settings don’t necessarily transfer. As benchmark builders, the hope is to provide equal footing where we can study this task with purpose and clarity.

Why artifact-level evaluation

Vincent Chen: Talk to me a little more about the style of the eval. You’re evaluating the artifact, not the code path or the implementation. You’re up-leveling the interface at which these agents are working.

John Yang: There have been 0-to-1 code generation benchmarks before. Sasha Rush and his student Wenting Zhao did great work with Commit0, released around the same time as SWE-bench. They took something like marshmallow or numpy, wiped out the implementation, but kept the function headers, class headers, and typing, and asked the model to implement everything against the existing unit test suite.

The big thing I wanted to do was make the solution space completely up to the model. With Commit0, you’re still imposing a programmatic scaffold that dictates what language the model should implement in, what the classes and functions are, the relationships, the parameters. That’s a very good chunk of system design and language selection. The small eureka moment came when I was talking to Kilian Lieret about this and he said: it’s not so much the implementation that’s important, it’s the artifact of the deliverable. So we made the decision to put aside implementation as a general principle, and do everything around the artifact. That’s why it’s called ProgramBench. We specifically test executables and binaries that you can run in the terminal.
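Editor's sketch: to make the contrast concrete, the scaffold-preserving setup described above (implementations wiped, signatures and typing kept) can be approximated with Python's standard `ast` module. This is only an illustration under stated assumptions. The function name `strip_bodies` and the example source are hypothetical, and nothing here is taken from that benchmark's actual tooling.

```python
import ast

def strip_bodies(source: str) -> str:
    """Replace every function body with `raise NotImplementedError`.

    Signatures, type annotations, class headers, and docstrings are kept,
    so whoever fills the code back in is still handed the original design.
    (Hypothetical helper, not any benchmark's real pipeline.)
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            stub = ast.parse("raise NotImplementedError").body
            if ast.get_docstring(node) is not None:
                # The docstring is the first statement; keep it.
                node.body = [node.body[0]] + stub
            else:
                node.body = stub
    return ast.unparse(ast.fix_missing_locations(tree))

# Hypothetical library code used only for this illustration.
EXAMPLE = '''
def add(a: int, b: int) -> int:
    """Return the sum."""
    return a + b
'''

stubbed = strip_bodies(EXAMPLE)
print(stubbed)
```

The stubbed output still fixes the language, the function name, and the parameter types. An artifact-level benchmark hands the model none of this and only checks how the finished executable behaves.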

Surprising findings: GPT really likes Python

Vincent Chen: What were some of the surprising findings around how agents behave in these end-to-end problems?

John Yang: One I particularly loved: in the default inference setting for ProgramBench, models aren’t allowed to use the internet, and we don’t allow Ghidra or strings-style reverse engineering of the binary. But we do allow the model to implement its solution in any language, including the reference language. Instinctively you might think, oh, if the executable is originally written in Rust and the model picks Rust, it must have a leg up. But that’s not always the case.

Models really like Python. That artifact of post-training is very evident in the setting. With GPT 5.5, which solved the very first task instance, cmatrix, the terminal animation, the solution was totally written in Python. The reference wasn’t. Claude models tend to be a little more varied: sometimes Go or Rust, rarely C or C++. But GPT overwhelmingly enjoys Python. Initially you’d think it’s shooting itself in the foot. It’s very much not so.

ProgramBench as a research tool

Vincent Chen: One higher-level reflection: I see this as almost a research tool. We can ask questions about how to fuzz or validate repos, cost-versus-quality trade-offs, how performance differs from bugs versus features. How do you think about this as a research tool the community should build on?

John Yang: Reading Carlini’s blog or the Cursor blog gave me so much insight from one case study. We do 200, which is the right way to scale these things up. But even the experience of looking at what models do for one task instance, running multiple models, or fixing the scaffold to mini-SWE-agent (which is justified because it’s the paradigm that came from the SWE-bench era and we want to stress-test it) starts a conversation about where your scaffold can be better.

What I’d recommend: take one task instance, over-index on it, look at what the trajectory actually looks like. What proportion of time is the model spending probing the executable? How often is it writing code? That’s where the insight comes from. Claude is a little more interwoven, probing a little, implementing a little. GPT does a really thorough job up front, fleshes out a specification for itself, then one-shots the implementation. It’s just the tip of the iceberg for how we can understand these models as developers.
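Editor's sketch: the single-trajectory questions above (how much probing, how much writing, how interleaved) can be roughed out in a few lines. The trajectory format here is an assumption: a plain list of shell command strings. The classification heuristics are deliberately crude, and the names `classify_step`, `step_proportions`, and `count_switches` are hypothetical, not part of mini-SWE-agent or ProgramBench.

```python
from collections import Counter

def classify_step(command: str, executable: str) -> str:
    """Crude heuristic: 'probe' if the step runs the reference executable,
    'write' if it looks like it writes a file, otherwise 'other'."""
    tokens = command.split()
    if tokens and tokens[0] == executable:
        return "probe"
    if ">" in command or (tokens and tokens[0] in {"tee", "sed", "patch"}):
        return "write"
    return "other"

def step_proportions(commands, executable):
    """Fraction of steps spent probing, writing, and doing anything else."""
    counts = Counter(classify_step(c, executable) for c in commands)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: counts[kind] / total for kind in ("probe", "write", "other")}

def count_switches(commands, executable):
    """How often the agent flips between probing and writing.
    A low count suggests 'probe up front, then implement';
    a high count suggests interleaved probing and implementing."""
    kinds = [classify_step(c, executable) for c in commands]
    kinds = [k for k in kinds if k != "other"]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if prev != cur)

# Hypothetical four-step trajectory against a reference binary at ./ref
trajectory = ["./ref --help", "./ref -V", "cat > main.py <<'EOF'", "python main.py"]
proportions = step_proportions(trajectory, "./ref")
switches = count_switches(trajectory, "./ref")
```

Real trajectories would need a more careful classifier, but even this level of bookkeeping is enough to compare two models on one task instance.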

Benchmarks as community efforts

Vincent Chen: What’s your view on how these benchmarks should be used by software engineers or others in dictating the changes in the field?

John Yang: Something a lot of prior benchmarks, especially Terminal-Bench, have done well is collecting community power. The tagline that you can add your own task instance to Terminal-Bench and have that be something the frontier is evaluated on is very compelling. Big credit to Mike Merrill, Alex Shaw, Ludwig Schmidt, and the folks at Snorkel for that. The way I hope this message carries over: whatever your favorite command-line tool is, put it in ProgramBench. We’ll open up a leaderboard, and we’ll figure out a way to open up adding tasks.

The lineage: from InterCode to ProgramBench

Vincent Chen: Walk me through how you got here. You started with SWE-bench, you did SWE-smith, then CodeClash. How does that journey look to you?

John Yang: A lot of luck, and a lot of thanks to the people I’ve collaborated with. This journey started when I was a master’s student at Princeton, and the first person I worked with was Shunyu Yao, who did ReAct and Tree of Thoughts and a lot of really fundamental agent work.

The fundamentals that have remained: rigor and high reproducibility. When I collaborated with Carlos Jimenez on SWE-bench, finding the 2,294 task instances wasn’t hard. Carlos and I sketched the idea on a whiteboard in one day in June 2023. What took the next three or four months was reproducibility. When we released SWE-bench originally, we didn’t use Docker. We used conda environments. That was a nightmare for reproducibility. This was totally my fault. Carlos suggested Docker. I said it was too heavy. He was 100% right. Kilian and I invested a lot of time making sure ProgramBench was clean to download and run.

What has remained different is being willing to challenge our notions of what language models can do. The first project I did was InterCode. I was taking benchmarks like MBPP and HumanEval, which were meant to be input-output with no interaction, and casting them into an interactive format. With InterCode, the framing was: this stuff is not realistic, or it’s realistic in terms of passing LeetCode questions, but it isn’t what we do day to day. Can models do this? And honestly, when SWE-bench was released, I got comments at the release like, “oh, this seems impossible, I don’t know if the models will ever be able to do GitHub.” Funnily enough, you get the same wave of comments with the ProgramBench release.

That kind of skepticism is incredibly helpful. The reason I’ve gone in the direction I have is to continue to question. We’re entering a realm where before, you could find tasks where it’s “well, Vincent and John and humans in general can do this, can models?” And now, I can’t implement FFmpeg from scratch in one month. But maybe the model can. That’s a gear shift that’s meaningful. It’s challenging to get right, and we’ll get it wrong in some ways. But overall I have faith in that direction.

Methodological evolution of grading

Vincent Chen: You’ve explored a number of ways to grade model outputs, from unit tests to tournament-style eval, now to fuzzing and output-driven validation. What have you learned?

John Yang: The evolution of verification itself is a meaningful problem. With SWE-bench, the verification didn’t do anything new conceptually. MBPP, HumanEval, all the LeetCode-style things had unit tests too. The change was that with HumanEval, they had people write the tests. Carlos and I weren’t OpenAI. We didn’t have that kind of money. So we asked: how do we find tests in the open, in the wild?

With CodeClash, I was inspired by the LMArena work out of Berkeley. What does it mean to bring that into a code setting? There were prior works like Copilot Arena, even VS Code and Cursor where there’s a voting mechanism with version A versus version B. It made sense, but I questioned whether it would scale. How many times can you ask someone “do you prefer A or B” with two code snippets before they say “I don’t care, as long as it works”? So with CodeClash, the idea was: let’s take the code artifacts themselves and make them compete against each other.
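Editor's sketch: the idea of making artifacts compete, instead of asking a person to vote on A versus B, can be illustrated with a round-robin that simply counts wins. This is not CodeClash's actual arena or scoring, which the interview does not describe. The `play` callback, the player names, and the toy strengths are all hypothetical stand-ins.

```python
from collections import Counter
from itertools import combinations

def round_robin(players, play):
    """Pit every pair of artifacts against each other once and count wins.

    `play(a, b)` is a caller-supplied function that runs one match and
    returns the winner's name, or None for a draw.
    """
    wins = Counter({p: 0 for p in players})
    for a, b in combinations(players, 2):
        winner = play(a, b)
        if winner is not None:
            wins[winner] += 1
    return wins

# Toy stand-in for a real match: the artifact with the higher
# hypothetical "strength" wins; equal strengths draw.
STRENGTH = {"bot_a": 3, "bot_b": 1, "bot_c": 2}

def toy_match(a, b):
    if STRENGTH[a] == STRENGTH[b]:
        return None
    return a if STRENGTH[a] > STRENGTH[b] else b

standings = round_robin(list(STRENGTH), toy_match)
```

Because ranking is relative, a stronger new entrant can always move to the top, which is one way to read the remark about not hitting a ceiling.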

Something I still like about CodeClash, even though it hasn’t been picked up as much as ProgramBench, is that it deals with the saturation problem better. It’s truly open-ended. You’re not going to hit a ceiling. If a model can create one solution that’s better than another, that’s all that’s needed for progress. That idea of making artifacts compete against each other was part of the inspiration for ProgramBench. Form-factor-wise it’s still written as pytest tests, but they’re all invocations of the executable. They’re calls of the program. They no longer touch the implementation. You completely disentangle the evaluation from the problem specification.
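Editor's sketch: a behavioral test in the sense described above is written like a pytest test but only ever invokes the program. The example below assumes a differential setup in which the candidate's output is compared against a reference executable's output. Whether ProgramBench compares live or against recorded outputs is not stated in the interview. `REFERENCE`, `CANDIDATE`, and `run_cli` are hypothetical, and two tiny Python one-liners stand in for real binaries so the sketch runs anywhere.

```python
import subprocess
import sys

def run_cli(argv, stdin_text=""):
    """Invoke a program as a black box and return (exit code, stdout)."""
    proc = subprocess.run(
        argv, input=stdin_text, capture_output=True, text=True, timeout=10
    )
    return proc.returncode, proc.stdout

# Stand-ins for two independently written executables with the same behavior.
REFERENCE = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
CANDIDATE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]

def test_uppercases_stdin():
    # The test never imports or inspects the candidate's source code;
    # it only checks observable behavior at the command line.
    assert run_cli(CANDIDATE, "hello\n") == run_cli(REFERENCE, "hello\n")
```

Since the test only sees exit codes and output, the candidate could be written in any language, which is what frees the solution space.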

The position paper: humans in the loop

Vincent Chen: You had a position paper around how humans need to be a bigger part of coding benchmarks and research in general. Where do humans need to be in the loop?

John Yang: I’ve had a couple of conversations with my advisor, Diyi Yang, who’s very into human-AI collaboration. I came into the PhD program more autonomy-driven. But in the two years since, I’ve been convinced this is meaningful.

ProgramBench is a much longer-horizon task than SWE-bench, and the solution space is large. But even as human software developers, we prefer solutions that have certain characteristics, conditioned on what the software is for. Styled well, readable, portable. Some labs are probably post-training things like good style and clean documentation into models. But empowering the individual to steer the model toward their preferences is generally important to me. People should be able to build software the way they want it built.

In the position paper, my coauthor Zora Wang at CMU, who works with Graham Neubig and the OpenHands folks, talks about steerability. The day we get to 80% on ProgramBench will be exciting. But what does that mean for the people who actually use these tools? If a model gets 80% but it’s implementing everything in Python in single files, not really using a file system, and the code isn’t reproducible, that’s a meaningful end state, but we can ask more out of the models.

Quality control with agents in the loop

Vincent Chen: One of the pieces we pay attention to when building data sets or benchmarks is how to ensure aggregate quality is at a really high standard. Having agents in the loop is effective but a double-edged sword. How do you think about this?

John Yang: Honestly with SWE-smith and ProgramBench it’s been case by case. There are probably three vectors where models can help. One: assisting with generating the verification. Two, the SWE-smith-oriented take: actually creating task instances. You have the code base installed, and you ask the model to write a funky implementation that breaks tests. Three: constructing the environment itself. For ProgramBench, if I want the reference executable, I clone the code base and ask the model to write the compilation script with all the source code available. For SWE-smith, here’s the Python library, make sure you install it and run the unit tests.

Vincent Chen: A lot of what our teams find is that injecting human expertise and steering in the right places efficiently is tricky. It’s case by case, pipeline-dependent. But having an intelligent way to say “here’s where the human needs to be in the loop to steer this part, here’s where we let the agents rip” makes a lot of sense. You’re certainly at the frontier of that.

John Yang: In some sense, we can afford to be precise and focus on small-scale experimentation. Unlike before, once you get that right, it’s much easier to scale up with agentic systems.

Cheating, internet access, and the 9-judge disagreement

Vincent Chen: You noted that models with internet access were reward-hacking up to 36% of the time. You’ve also dealt with leaderboard hacking in SWE-bench. How do you address the challenge of measuring model performance honestly?

John Yang: Two parts. When ProgramBench launched, a big topic online was: humans use the internet, why disallow it? The central thesis was that any time there’s an X percent improvement on ProgramBench, we want it to be undoubted. No asterisks. The tricky part with the internet is that you would like to see a model go on GitHub and find adjacent relevant code, but not the source code itself. Maybe it checks Stack Overflow or a language specification. So I agree the internet is useful, but not at the cost of the rigor of the benchmark.

Early in the experimentation process, we allowed internet, then ran nine different LLMs as judges, looking at trajectories to decide whether the model cheated. Big credit to Kilian for that pipeline. The problem wasn’t just that the cheating rate was a third, which is extremely high. Releasing a benchmark that says “0% resolved, but also they cheat a third of the time” muddles the messaging. And Kilian also found that the judges would disagree. We have examples in the paper where it’s a 5-to-4 vote. Five judges say “it’s not allowed to look it up on SourceForge.” The other four say “only GitHub was explicitly disallowed, SourceForge should be fine.” It becomes a cat-and-mouse game.
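Editor's sketch: a panel of judges needs an aggregation rule and some way to quantify disagreement. The code below assumes a simple majority vote and measures disagreement as the share of judge pairs that differ on a run. Both are assumptions; the interview does not say how the team aggregated votes or computed its disagreement figures. `majority_flag` and `pairwise_disagreement` are hypothetical names.

```python
from itertools import combinations

def majority_flag(votes):
    """True if more than half of the judges flagged the run as cheating."""
    return sum(votes) > len(votes) / 2

def pairwise_disagreement(votes_by_run):
    """Fraction of (run, judge pair) combinations where the two judges differ."""
    differing = 0
    total = 0
    for votes in votes_by_run:
        for a, b in combinations(votes, 2):
            total += 1
            differing += a != b
    return differing / total if total else 0.0

# A 5-to-4 split like the SourceForge example, plus one unanimous run.
split_run = [True] * 5 + [False] * 4
clean_run = [False] * 9

flagged = majority_flag(split_run)                    # 5 of 9 judges flag it
split_rate = pairwise_disagreement([split_run])       # 20 of 36 pairs differ
overall = pairwise_disagreement([split_run, clean_run])
```

A 5-to-4 run is flagged under majority vote even though more than half of its judge pairs disagree, which shows how fragile a single flag can be.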

This is a constant tug of war for benchmark builders. Rigor is king. Reliability is king. Reproducibility is king. Everything else is cherry on top. If any of those start to be affected, the answer is no. Rigor is core to having people come together and hill-climb, rather than pointing at each other saying “you didn’t do it right.”

Open leaderboards and trajectory submission

John Yang: Curating submissions for SWE-bench was a wonderful experience. We ended up accepting over 300 entries on the leaderboard. The experience: set up submission pipelines that enable the community to be inspired by and double-check each other’s work. There was a big moment in SWE-bench where, because of some things that happened, we said you must submit trajectories. Terminal-Bench has adopted this, and ProgramBench will too. When trajectories get uploaded, people can look at them, which inspires new approaches and helps answer clarifying questions like: is this cheating?

Credit to the FAIR coding team. They brought up the behavior in SWE-bench where some models were “fast-forwarding,” going to future commits to solve the problem. Thanks to them, because a lot of people weren’t reporting these things, not out of malice, but because they hadn’t caught it. Be as open as possible: it helps with reproducibility and with the trust you can build as a leaderboard host with the people participating.

Failure mode analysis

Vincent Chen: How do you go about that failure mode analysis as trajectories get longer-horizon?

John Yang: Two sides of the coin. One: invest in better tools using language models themselves. The Transluce folks have the Docent tool, where you can inspect SWE-bench trajectories and ask aggregated questions. Tools like that are worth investing in. On the other hand, nothing beats sitting down and manually scrolling through one trajectory. So it’s a balance. Take a little time, figure out how it looks for one or two instances manually, then scale.

Vincent Chen: We used to have an onboarding exercise at Snorkel where you dug into the data. Nothing beats building firsthand intuition for what the shape of the data looks like, then translating that to programmatic approaches.

The 5 levels of benchmarking

Vincent Chen: You tweeted that we’re moving from a regime where language models do what humans do, to one where we’re asking whether they can do things previously impossible. What kinds of tasks fall into that second category?

John Yang: I talked with Ofir Press, a long-time collaborator, about this. He has a diagram of five levels of benchmarking. Levels one, two, three are all in the realm of “can humans do this?” Levels four and five are “humans can’t do this.” ProgramBench is in the realm of level four. The reference solution exists. The FFmpeg source code exists. You can generate tests on top of it. But we’re compressing the timeline. Models can do the same thing humans can, in a way shorter time span. That’s superhuman in the sense that yes, we could pull this off, but it just took us a much longer time.

The fifth realm is more sci-fi to me. Things that are literally unsolved, whether Millennium Prize problems in mathematics, or challenging the boundaries of science. This is where a lot of the excitement around recursive self-improvement and AI scientists lies. I agree with that. My personal approach is: having models recreate the things that exist is a good stepping stone to having empirical faith that they’ll create something new and impactful.

When does ProgramBench hit 80%?

Vincent Chen: GPT 5.5 just cracked one of the tasks in ProgramBench. When does the first model hit 80%?

John Yang: 80%? For our sake, hopefully not too soon. Just joking. Maybe a year. Maybe a year and a half from now. ProgramBench has a gradient of difficulty. cmatrix, the first task solved, is one of the easiest. Our difficulty rating is purely based on lines of code and number of dependencies, totally independent of performance. At the limit of ProgramBench, the toughest tasks are SQLite, the PHP interpreter, the tinycc compiler. There’s one really amazing developer, Fabrice Bellard, who wrote a bunch of this software. FFmpeg. There’s a lot of true human ingenuity in his work.

I have faith the models will get there. 80% is 160 out of 200 instances, and those 160 are meaningful. Maybe the model gets ripgrep or fzf, tools more focused in their utility. The remaining 20%, I’m not sure. My hunch is there’s a bit of a long tail. Things like ripgrep or jq are tricky. Compared to FFmpeg, that’s a different league. One year is my bullish take. Happy to be proven wrong.

Curiosity, persistence, and the Edison metaphor

Vincent Chen: You said there’s a meaningful difference between 80% and the rest. Do you have words to capture what that shift looks like?

John Yang: Some executables have a lot of subcommands. Even putting aside lines of code and dependencies, they encompass different amounts of functionality. FFmpeg handles so many audio and video formats. You can transcode MOV to MP4, apply different encodings. The model has to do heavy probing: downloading different assets, where the assets vary by video length and quality. There’s so much that FFmpeg accounts for. So ProgramBench, beyond just implementation correctness, is the one that tests curiosity. Did you probe enough? Because if you didn’t, it doesn’t matter how good of a software engineer the model is.

That extends to models becoming better research scientists and innovators. So much of what we do has no specification, no well-written doc. It’s someone poking endlessly at something until they discover a behavior that wasn’t realized before and formalize it. It’s the Thomas Edison thing. He just had to try a bunch of different metals until he got the one that worked. That persistence is exciting.

Benchmarks worth paying attention to

Vincent Chen: What benchmark should more people be paying attention to?

John Yang: My personal focus is on long-horizon tasks. Within 20 to 40 turns, which is very much SWE-bench territory, we have the formulas in place. Long-horizon makes us ask questions about how models construct memory for themselves. That memory doesn’t have to be natural language. It could be: this model and that model both solve the same task, but which one provides a solution that’s more extensible for whoever picks it up next?

Kudos to your team. The Continual Learning Bench from Berkeley that you worked on feels really great. ProgramBench is around a hundred turns. Continual Learning Bench is also effectively a hundred turns, but with checkpoints in the middle. One task I really liked resembles a chain of GitHub issues where B is blocked by A, and C is blocked by B which was blocked by A. Once you solve A, it’s one thing to solve it, but writing the function at the right layer in the call stack such that B can correctly invoke it, versus the solution being too high up and causing B to redundantly repeat code that should be one layer below. That’s really fascinating.

A benchmark John wishes existed

Vincent Chen: What’s a benchmark you wish existed?

John Yang: I’m excited about using coding generally to tackle problems in domains that aren’t just coding. SWE-bench is effectively a benchmark where we grade models on their ability to write code for the sake of better code. We fix this bug. But as we’ve seen with Claude Code, people are using these agents for so many things beyond code. Claude Cowork is fantastic, but it’s almost like the hood of an engine where underneath is a coding agent with extra tool calls.

If I had to sit down today and brainstorm: maybe something in the biology or medical industry. Clinical experiments are truly long-horizon. What’s the blocker for why we can’t deploy an agent that can write code, and code is easily operationalized, to do those things?

Vincent Chen: Two reactions. One, I talked to Alex Shaw about this on the last Benchtalks: the bet on the terminal, on code as a universal interface for agents, was an awesome one. Two, I agree that we need to pull in domain experts to the benchmark-building process. A lot of what we as researchers have contact with are problems within our domains. There’s a whole universe of use cases that needs to be represented.

Will benchmarks still look like benchmarks in 5 years?

Vincent Chen: Five years out, does the benchmark still look like a benchmark?

John Yang: If we’d asked this question when SWE-bench was created, the answer would have been “not that different from 2018 to 2023.” I don’t think that’s the case going forward. A lot of benchmarking that shaped the language model space has been silos of what people can do: question answering, entailment, machine translation. Those were different cuts of what people can do with language. Tasks inspired by “I can do this, but can the machine?” were a good source of instances, and we calibrated which benchmarks to pay attention to based on where models were.

Going forward, the human inspiration of “can a human do this?” is going to be less and less the source. More along the lines of “how do we engage with it, work with it, interact with it” as the core premise. In terms of form factor, the idea of having environments, whether digital or real-world, will stay around. Verification is where I’m more curious how it evolves. In the regime of code, verification has evolved a lot in the past three or four years, and I don’t expect that trend to stop.

How to leverage ProgramBench

Vincent Chen: Last question. How can people leverage ProgramBench?

John Yang: programbench.com. The URL is there. It was a little pricey, but we got it. We’re going to set up a leaderboard soon. We’re encouraging people to pick the low-hanging fruit. We use mini-SWE-agent. Use your own scaffold if you think it’s better. Single-agent versus multi-agent, give that a go. I’d also be excited to see people train their own models on ProgramBench and see how far we can get with a 32B model. Given that this task is longer and bigger, what does training on it do to how we approach code tasks with larger solution spaces?

I’d also love to set up a way for people to add tasks. ProgramBench’s formulation is meant to be very generalizable. We want as few constraints on the word “program” as possible. We do executables, but there’s no reason someone couldn’t take a Mac app, an iOS app, a website, anything that constitutes rendered software. Those are the two calls to action. Kilian and I look at the GitHub every day for issues. We’re looking forward to growing this.

Vincent Chen: Thanks again, John. This was an awesome chat.

John Yang: Thanks so much.