Finance Reasoning
This benchmark improves on Snorkel Finance, which tested agents on tool calling for financial queries but whose queries required only limited reasoning to answer.
With the Finance Reasoning dataset, our aim was to create question-answer pairs that models must reason through in order to answer correctly. An example query: "For AT&T, how significant are the company's postretirement benefit obligations in terms of interest burden, and what does this indicate about the company's long-term liability management in 2024?"
As with Snorkel Finance, we aimed to create a realistic environment in which a financial analyst agent can find answers to high-level questions based on information in 10-K filings. To do this, we converted information from tables in 10-K documents into a relational database. Agents must reason about what information is required, use database tools to look up the correct tables, make accurate SQL calls, often in succession, and combine the results to produce a final response.
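The lookup-query-combine loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table names, columns, the `list_tables` and `run_sql` helpers, the company "ExampleCo", and all figures are hypothetical placeholders, not the benchmark's actual schema, tools, or data.

```python
import sqlite3

# Hypothetical schema and placeholder figures, for illustration only.
# The benchmark's real tables and tool names are not shown here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE benefit_obligations (
        company TEXT, year INTEGER, plan TEXT, interest_cost REAL
    );
    CREATE TABLE income_statement (
        company TEXT, year INTEGER, line_item TEXT, amount REAL
    );
    INSERT INTO benefit_obligations VALUES ('ExampleCo', 2024, 'postretirement', 50.0);
    INSERT INTO benefit_obligations VALUES ('ExampleCo', 2024, 'pension', 150.0);
    INSERT INTO income_statement VALUES ('ExampleCo', 2024, 'interest_expense', 1000.0);
""")


def list_tables(conn):
    """Hypothetical discovery tool: return the names of all tables."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def run_sql(conn, query, params=()):
    """Hypothetical query tool: run one parameterized SQL statement."""
    return conn.execute(query, params).fetchall()


# Step 1: find out which tables exist before writing any query.
tables = list_tables(conn)

# Step 2: first SQL call, the interest cost on postretirement obligations.
postretirement_interest = run_sql(
    conn,
    "SELECT interest_cost FROM benefit_obligations "
    "WHERE company = ? AND year = ? AND plan = ?",
    ("ExampleCo", 2024, "postretirement"),
)[0][0]

# Step 3: second SQL call, the company's total interest expense.
total_interest = run_sql(
    conn,
    "SELECT amount FROM income_statement "
    "WHERE company = ? AND year = ? AND line_item = ?",
    ("ExampleCo", 2024, "interest_expense"),
)[0][0]

# Step 4: combine the two results into a figure the final answer can cite.
share = postretirement_interest / total_interest
print(f"Postretirement interest is {share:.1%} of total interest expense")
```

Even in this toy version, no single query answers the question: the agent has to decide which two quantities matter, retrieve each from a different table, and relate them before it can say anything about the interest burden.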
Question-answer pairs were carefully co-created with Snorkel's Expert Data-as-a-Service network of financial experts to ensure that they are high quality, accurate, representative of real-world financial analyst questions, and require sufficient reasoning. The task is challenging: answering a question requires an average of 12 steps of reasoning and tool use.

