Lessons in Benchmarking: FinQA

Jacqueline Garrahan


At Aiera, we build insights against transcripts and documents using LLMs. To
date, we’ve combated errors and hallucinations through human-in-the-loop
validation and benchmarking. As we expand into new applications, our
requirements have grown, and we now use internal benchmarks to ensure we’re
using the right model in the right place.

Leaderboards rank model performance on popular benchmarks such as ARC,
HellaSwag, MMLU, GSM8K, and TruthfulQA. But while standard benchmarks help
assess a model’s ability to generalize across a wide range of tasks, they may
not effectively measure how well it performs in areas requiring highly
specialized knowledge or skills. This gap can create a false sense of model
competency and superiority. Projects like Hugging Face Datasets serve as
communal repositories for diverse natural language tasks, but community
datasets vary significantly in quality and format, may contain errors and
inconsistencies, and often lack thorough documentation. In this article, I’ll
outline some lessons learned from benchmarking model performance on financial
question-and-answer tasks that require multi-step computation.

Quantitative question answering requires domain comprehension, data extraction,
and the execution of numerical operations, making it among the most challenging
tasks for LLMs. In 2021, researchers from the University of Pennsylvania, J.P.
Morgan, and Amazon published “FinQA: A Dataset of Numerical Reasoning over
Financial Data,” introducing a dataset of 8,281 annotated QA pairs built
against publicly available earnings reports of S&P 500 companies from 1999 to
2019 (Chen et al., 2021). Each task is represented as a single
question-and-answer pair derived from tabular and textual data in the earnings
report. The original formulation distills the answer reasoning into sets of
mathematical and tabular operations: add, subtract, multiply, divide, greater,
exp, table-max, table-min, table-sum, table-average.
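
To make the operation format concrete, a FinQA-style reasoning program chains
these operations, with #n referring to the result of step n. A hypothetical
pair (figures invented for illustration) might look like:

Question: what was the percentage change in revenue from 2018 to 2019?
Program: subtract(5829, 5735), divide(#0, 5735)
Answer: 1.6%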



For this project, I used the PIXIU FinQA dataset available on huggingface here.
PIXIU evaluated model responses against the questions for exact-match accuracy,
focusing on the final generation rather than the intermediate computation
steps. For the purpose of side-by-side model ranking, I only cared about the
model’s ability to surface the correct result to the user. Their data is
structured as below:


Example Q&A pair from PIXIU FinQA
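
Roughly, each record pairs a single query string (the instruction, context, and
question concatenated) with an answer string. A trimmed, hypothetical
illustration of the two fields my task config relies on (real records carry
longer contexts and additional columns):

{
  "query": "Please answer the given financial question based on the context.\nContext: ...\nQuestion: what was the percentage change in revenue from 2018 to 2019?\nAnswer:",
  "answer": "1.6%"
}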

I used EleutherAI’s lm-evaluation-harness to run an evaluation task on the
FinQA dataset. For those new to the lm-evaluation-harness, it’s an excellent
open-source tool for templating model evaluation tasks. A guide for configuring
new tasks can be found in the lm-eval docs here, and users can get started
quickly with a number of major model providers. Tasks reference huggingface
dataset paths and are configurable with a variety of generation and evaluation
options. To set up my task, I created a directory tasks in my project and a
subdirectory tasks/finqa. Then, I created a yaml spec for the flare_finqa task
referencing the original dataset in tasks/finqa/flare_finqa.yaml:

task: flare_finqa
dataset_path: TheFinAI/flare-finqa
training_split: null
validation_split: null
test_split: test
doc_to_text: query
doc_to_target: answer
process_results: !function utils.process_results_gen
generation_kwargs:
  max_gen_toks: 100
  do_sample: False
  temperature: 0.0
  until:
    - "<s>"
metric_list:
  - metric: exact_match_manual
    aggregation: mean
    higher_is_better: true

I also set up a utils.py file to postprocess model results.

def process_results_gen(doc, results):
    completion = results[0]
    target = str(doc["answer"])

    # hack fixes to string formatting: pad the target to two decimal places
    if target[-2] == ".":
        target = target + "0"
    elif "." not in target:
        target = target + ".00"

    exact_match_manual = 1 if completion == target else 0
    return {
        "exact_match_manual": exact_match_manual
    }

I added a hack fix for the float → string formatting of the dataset answers,
which affects the precision reflected in the target string. Additionally, I
noticed that the OpenAI models were stopping prematurely on double newlines
(likely the default in the lm-eval-harness), so I added a stop token in the
until field.
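
For instance (illustrative values only), an answer stored as the float 14.1
round-trips to the string "14.1", while models typically report two decimal
places ("14.10"); the padding above brings the target in line:

target = str(14.1)         # "14.1" -- only one decimal digit survives
if target[-2] == ".":
    target = target + "0"  # "14.10" -- now comparable to the completion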

I used the lm-evaluation-harness’s Python API rather than its CLI tools because
I wanted to run some tests in a Jupyter notebook. I found the API simple and
useful, though the CLI is documented as the default usage.

from lm_eval import simple_evaluate, tasks
from lm_eval.models.openai_completions import OpenaiChatCompletionsLM

task_name = "flare_finqa"
model_name = "gpt-4-turbo-2024-04-09"
model = OpenaiChatCompletionsLM(model=model_name)

# point the task manager at the directory containing tasks/finqa/flare_finqa.yaml
task_manager = tasks.TaskManager(include_path="path/to/tasks")

results = simple_evaluate(
    model=model,
    tasks=[task_name],
    num_fewshot=0,
    task_manager=task_manager,
    write_out=True,
)
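
simple_evaluate returns a dictionary; the aggregate scores for each task live
under its "results" key, so a quick check in the notebook looks roughly like:

# inspect the aggregate metrics for the task (including exact_match_manual)
print(results["results"][task_name])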

I ran gpt-4-turbo-2024-04-09 against a subset of 100 dataset samples and
observed an exact-match score of 0.0. Suspecting I’d fumbled something, I
logged the completions:

Answer: The business combination of Entergy Louisiana and Entergy Gulf 
States Louisiana in 2015 significantly impacted the financial results of 
Entergy Corporation in several ways. Firstly, the combination resulted 
in the recognition of a deferred tax asset and a corresponding net increase 
in tax basis amounting to approximately $334 million. This likely provided 
a substantial tax benefit to the...

Consistent with past experience using chat models on targeted tasks, I found
that models often disregard instructions to report only the result and express
a clear preference for including their explanation. Looking back at the dataset
query, the prompt reads:

> Please answer the given financial question based on the context.
> Context: …
> Question: …
> Answer:

This prompt does little to specify the format and precision of the desired
result. For the purpose of this test, I decided to allow the models to generate
their explanation, but discard that explanation before evaluation. Comparing
the verbose gpt-4-turbo-2024-04-09 output with the dataset answers, I found
several cases of incorrect calculations in the original dataset. One issue was
the conflation of the words portion, ratio, and proportion in calculations
reported as a decimal proportion. The semantic difference is small, but portion
refers to the quantity allocated. For example, in the case where 30 balls out
of a total of 100 are green, the portion of balls that are green is 30, the
decimal proportion of green balls is 0.3, and the percentage proportion is 30%.
Ratios were also used in the dataset to mean decimal proportion. To give the
models the best chance of success, I modified the prompt to be explicit about
the output format.

I added a further specification that the result be unitless and reported with a
precision of two decimal places.

The new prompt reads:

> Context: {{context}}
> 
> Given the context, {{question}} Report your answer using the following format:
> 
> Explanation: Explanation of calculation
> Formatted answer: Float number to two decimal point precision and no units

Due to other errors I discovered, I decided to manually verify the calculations
in the set. The verification was arduous, so this dataset is only a small
91-sample subset of the original test set (available here).

The new yaml for the task is:

task: flare_finqa
dataset_path: Aiera/flare-finqa-verified
training_split: null
validation_split: null
test_split: test
doc_to_target: answer
doc_to_text: "Context:\n{{context}}\n\nGiven the context, \
{{question}} Report your answer using the following format:\n\
Explanation: Explanation of calculation\n\
Formatted answer: Float number to two decimal point precision and no units\n"
process_results: !function utils.process_results_gen
generation_kwargs:
  max_gen_toks: 500
  do_sample: False
  temperature: 0.0
  until:
    - "<s>"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
  - metric: exact_match_manual
    aggregation: mean
    higher_is_better: true

The doc_to_text field specifies a Jinja prompt template used to compose the
prompt from the bracketed dataset columns at runtime. The post-generation
processing in my utils.py extracts the formatted answer:

def process_results_gen(doc, results):
    completion = results[0]
    target = str(doc["answer"])

    # keep only the text after the final colon of "Formatted answer: ..."
    if "formatted answer:" in completion.lower():
        completion_splits = completion.split(":")
        completion = completion_splits[-1].strip()

    # hack fix for string formatting: pad the target to two decimal places
    if target[-2] == ".":
        target = target + "0"
    elif "." not in target:
        target = target + ".00"

    exact_match_manual = 1 if completion == target else 0

    return {
        "exact_match_manual": exact_match_manual
    }
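
As a quick sanity check (invented completion text, not actual model output),
the extraction behaves like this:

doc = {"answer": 0.2}
results = ["Explanation: net income fell from 500 to 400, a decline of 100/500.\nFormatted answer: 0.20"]
print(process_results_gen(doc, results))   # {'exact_match_manual': 1}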



Now, I ran my task using the eval harness for a couple of different models:



I found claude-3-opus to be the winner, followed by gemini-1.5-pro then
gpt-4-turbo-2024-04-09.

Because this testing set is a much smaller subset of the original dataset, I
wanted to measure how confident I could be that the smaller sample represents
the model’s broader performance. In the yaml, I specified an exact_match
evaluation metric that scores each trial as a 1 for a hit (correct computation)
or a 0 for a miss (incorrect). The resulting outputs follow a Bernoulli
distribution, where the value 1 occurs with probability p and 0 occurs with
probability q = 1 − p. Using this distribution, we can establish the minimum
dataset size we need to understand the model’s performance on this specific
task:

n = Z² · p(1 − p) / E²

where n is the minimum number of samples, Z is the Z score for the chosen
confidence level, p is the probability of a hit (the exact-match rate), and E
is the acceptable margin of error.
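
As a minimal sketch (not part of the harness), the formula translates directly
into a small helper; the values of p and E are whatever hit rate and margin of
error you are targeting:

def min_samples(p: float, margin_of_error: float, z: float = 1.96) -> float:
    # minimum sample count to estimate a Bernoulli success rate p
    # within +/- margin_of_error at the confidence level implied by z
    return (z ** 2) * p * (1 - p) / margin_of_error ** 2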



The lm-eval-harness reports the standard error associated with our exact_match
calculation.


Scores and stderr for FinQA test

For a Z score of 1.96 and a margin of error of 0.02 score points, we can
calculate the minimum number of samples needed to evaluate performance at the
95% confidence level:


Minimum sample counts for each model on FinQA to establish a 95% confidence bound

Our 91 sub-samples exceed n across models, so we can be reasonably confident
these scores represent model performance on this specific task and dataset. In
closing, this sufficiency demonstrates why smaller, high-integrity datasets are
most valuable in evaluating model competence. Natural follow-ups to this brief
exploration include evaluating significant digits and unit comprehension, and
expanding into other datasets such as ConvFinQA using few-shot and
chain-of-thought prompting.


