Released April 4, 2026
Archived

Finance Reasoning

A benchmark co-created with Snorkel's financial expert network to test agents on financial reasoning questions that require tool-calling and planning.
Overview

This benchmark improves on Snorkel Finance, which tested agents on tool-calling for financial queries but whose questions required little reasoning to answer.

With the Finance Reasoning dataset, we aimed to create question-answer pairs that models can answer correctly only by reasoning. An example query: "For AT&T, how significant are the company's postretirement benefit obligations in terms of interest burden, and what does this indicate about the company's long-term liability management in 2024?"

As with Snorkel Finance, we aimed to create a realistic environment in which a financial analyst agent can answer high-level questions using information from 10-K filings. To do this, we converted the tables in 10-K documents into a relational database. Agents must reason about what information is required, use database tools to look up the correct tables, issue accurate SQL queries (often several in succession), and combine the results into a final response.
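The workflow described above (find the relevant tables, query them in succession, combine the results) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the benchmark's actual tooling: the tool names `list_tables`, `describe_table`, and `run_sql`, the table and column names, the company "ExampleCo", and all figures are hypothetical placeholders, and SQLite stands in for whatever database the benchmark uses.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with placeholder data (not real filing figures)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE benefit_obligations (
            company TEXT, fiscal_year INTEGER, interest_cost REAL
        );
        CREATE TABLE income_statement (
            company TEXT, fiscal_year INTEGER, total_interest_expense REAL
        );
        -- Hypothetical values chosen only to make the arithmetic easy to follow.
        INSERT INTO benefit_obligations VALUES ('ExampleCo', 2024, 50.0);
        INSERT INTO income_statement VALUES ('ExampleCo', 2024, 200.0);
        """
    )
    return conn


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Hypothetical tool: return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def describe_table(conn: sqlite3.Connection, table: str) -> list[str]:
    """Hypothetical tool: return the column names of one table."""
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table}")
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def run_sql(conn: sqlite3.Connection, query: str, params: tuple = ()) -> list[tuple]:
    """Hypothetical tool: run one parameterized SQL query and return all rows."""
    return conn.execute(query, params).fetchall()


def interest_burden_share(conn: sqlite3.Connection, company: str, year: int) -> float:
    """Combine two successive queries into one derived figure."""
    # Step 1: interest cost attributable to benefit obligations.
    (benefit_interest,) = run_sql(
        conn,
        "SELECT interest_cost FROM benefit_obligations "
        "WHERE company = ? AND fiscal_year = ?",
        (company, year),
    )[0]
    # Step 2: the company's total interest expense for the same year.
    (total_interest,) = run_sql(
        conn,
        "SELECT total_interest_expense FROM income_statement "
        "WHERE company = ? AND fiscal_year = ?",
        (company, year),
    )[0]
    # Step 3: express the first figure as a share of the second.
    return benefit_interest / total_interest


if __name__ == "__main__":
    db = build_demo_db()
    print(list_tables(db))
    print(describe_table(db, "benefit_obligations"))
    print(f"{interest_burden_share(db, 'ExampleCo', 2024):.0%}")
```

In a real run the agent, rather than hand-written code, would decide which tables to inspect and which queries to issue. The sketch only shows the shape of a single lookup-query-combine chain.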

Question-answer pairs were co-created with Snorkel's Expert Data-as-a-Service network of financial experts to ensure that they are high quality, accurate, and representative of real-world financial analyst questions, and that they require sufficient reasoning. The task is challenging: answering a question takes an average of 12 steps of reasoning and tool use.

Leaderboard
