The Terminal-Bench team is extending Terminal-Bench to complex scientific workflow tasks in the natural sciences.
TLDR
Terminal-Bench Science is now open for contributions. The team is looking for scientists to turn real research workflows into benchmark tasks that shape the next generation of AI agents.
Terminal-Bench Science is a benchmark for evaluating AI agents on real computational workflows from scientific research. It builds on Terminal-Bench, which has been adopted by frontier labs including Anthropic, OpenAI, and Google DeepMind and has helped drive progress in AI agents on software engineering tasks by defining what those labs measure and optimize for. Terminal-Bench Science brings the same approach to the natural sciences.
Most existing “AI for Science” benchmarks test textbook knowledge, not real workflows. Terminal-Bench Science closes this gap with real computational workflow tasks from research labs, evaluated in containerized environments with programmatic verification. The goal is to give scientists a direct voice in shaping AI progress: domain experts contribute scientific workflows as benchmark tasks, frontier labs evaluate and improve their AI agents against them, and the improved AI agents with stronger scientific capabilities flow back as better tools for researchers.
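To make "programmatic verification" concrete, here is a minimal Python sketch of the kind of check a task might run after an agent finishes. It is illustrative only and not the Terminal-Bench Science or Harbor implementation: the function name `verify_result`, the JSON results file, and the quantity name used in the usage note are all hypothetical, and the tolerance is an arbitrary assumption.

```python
import json
import math
from pathlib import Path


def verify_result(output_path, expected, rel_tol=1e-3):
    """Hypothetical verifier: True only if every expected quantity is
    reported in the JSON file and matches within a relative tolerance."""
    path = Path(output_path)
    if not path.exists():
        return False  # the agent never produced the results file
    try:
        reported = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False  # the file exists but is not valid JSON
    if not isinstance(reported, dict):
        return False  # expected a mapping of quantity name -> value
    for name, target in expected.items():
        value = reported.get(name)
        # Reject missing, non-numeric, or boolean values.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not math.isclose(value, target, rel_tol=rel_tol):
            return False
    return True
```

A task author would call something like `verify_result("results.json", {"binding_energy_ev": 4.52})` inside the task's test step. The point of a check like this is that pass or fail depends only on the files the agent produced, not on a human or model judging the transcript.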
Terminal-Bench Science is targeting 100+ benchmark tasks across the life sciences, physical sciences, and earth sciences, but is also open to tasks from the mathematical sciences and other domains with computational workflows.
| Domain | Areas |
|---|---|
| Life Sciences | Biology, Medicine, Neuroscience |
| Physical Sciences | Physics, Chemistry, Astronomy, Materials Science |
| Earth Sciences | Atmospheric Science, Geoscience, Water Science |
| Mathematical Sciences | Applied Mathematics, Statistics, Autoformalization |
| Other | Interdisciplinary Sciences, Computational Sciences, Engineering Sciences, etc. |
WHY CONTRIBUTE
The Terminal-Bench team is looking for complex, real-world computational workflows from practicing scientists across the natural sciences that meet the following three key criteria:
Tasks follow the Harbor Task Format. Check out example tasks for reference.
The Terminal-Bench team follows a curated contribution process to maintain quality:
Once merged, the Terminal-Bench team runs frontier AI agents against all merged tasks to calibrate difficulty. Tasks that pass are included in the official Terminal-Bench Science release on the Terminal-Bench Benchmarks and Terminal-Bench Leaderboards.
Tasks must be submitted and merged by August 17, 2026. Starting early is highly recommended — most tasks require a few rounds of feedback and iteration before they’re ready to merge.
Join the Discord and reach out to @stevendi11 on Discord or stevendi@stanford.edu to get involved. Key channels: #tb-science for general discussion, #tb-science-announcements for project updates, #tb-science-task-ideas for quick early feedback on ideas, and #tb-science-task-proposals for submitted proposals, automated reviews, and reviewer feedback. There is also a weekly meeting every Monday at 9am PT.
Terminal-Bench Science is an open academic collaboration hosted by Stanford University and the Laude Institute. As part of the Terminal-Bench franchise, it is built by the Terminal-Bench & Harbor Framework team and scientific contributors. We thank Snorkel AI for support via the Open Benchmarks Grants program, and the Laude Institute and 2077AI for API credits that power benchmark evaluations.