Lessons in Benchmarking: FinQA

Jacqueline Garrahan · 7 min read

At Aiera, we build insights against transcripts and documents using LLMs. To date, we've combatted errors and hallucinations through human-in-the-loop validation and benchmarking. As we expand into new applications, our requirements have grown, and we use internal benchmarks to ensure we're using the right model in the right place.

Leaderboards rank model performance on popular standards such as ARC, HellaSwag, MMLU, GSM8K, and TruthfulQA. While standard benchmarks help assess the generalization ability of models across a wide range of tasks, they may not effectively measure how well a model performs in areas requiring highly specialized knowledge or skills. This gap can lead to a false sense of model competency and superiority. Projects like Hugging Face Datasets function as communal repositories of datasets for diverse natural language tasks. However, community datasets vary significantly in quality and format, and may contain errors, inconsistencies, or lack thorough documentation.

In this article, I'll outline some lessons learned from my exploration into benchmarking model performance on financial question-and-answer tasks focused on multi-step computation.

Quantitative question answering requires domain comprehension, data extraction, and the execution of numerical operations, making it among the most challenging tasks for LLMs. In 2021, researchers from the University of Pennsylvania, J.P. Morgan, and Amazon published "FinQA: A Dataset of Numerical Reasoning over Financial Data," introducing a dataset of 8,281 annotated QA pairs built against publicly available earnings reports of S&P 500 companies from 1999 to 2019 (Chen et al., 2021). Each task is a single question-and-answer pair derived from the tabular and textual data of an earnings report. The original formulation distills the answer reasoning into sets of mathematical and tabular operations: add, subtract, multiply, divide, greater, exp, table-max, table-min, table-sum, table-average.

For this project, I used the PIXIU FinQA dataset available on Hugging Face here. PIXIU evaluates model responses for exact-match accuracy against the reference answers, focusing on the final generation rather than the intermediate computation steps. For the purpose of side-by-side model ranking, I only cared about the model's ability to surface the correct result to the user. Their data is structured as below:

[Figure: Example Q&A pair from PIXIU FinQA]

For the execution, I used EleutherAI's lm-evaluation-harness to run an evaluation task on the FinQA dataset. For those new to the lm-evaluation-harness, it's an excellent open-source tool for templating model evaluation tasks. A guide for configuring new tasks can be found in the lm-eval docs here, and users can get started quickly with a number of major model providers. Tasks reference Hugging Face dataset paths and are configurable with a variety of generation and evaluation options. To set up my task, I created a directory tasks in my project and a subdirectory tasks/finqa.
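Before writing the task spec, it can help to confirm the dataset's field names. The snippet below is a small sketch of my own (not from the original write-up) that loads the PIXIU test split with the Hugging Face datasets library, assuming the TheFinAI/flare-finqa path and the query/answer columns used in the config that follows.

```python
# Quick sanity check: load the PIXIU FinQA test split and inspect one record.
from datasets import load_dataset

ds = load_dataset("TheFinAI/flare-finqa", split="test")
sample = ds[0]

print(sample["query"][:300])  # prompt text: instructions, context, and question
print(sample["answer"])       # gold answer, stored as a string
```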
Then, I created a yaml spec for the flare_finqa task referencing the original dataset in tasks/finqa/flare_finqa.yaml:

```yaml
task: flare_finqa
dataset_path: TheFinAI/flare-finqa
training_split: null
validation_split: null
test_split: test
doc_to_text: query
doc_to_target: answer
process_results: !function utils.process_results_gen
generation_kwargs:
  max_gen_toks: 100
  do_sample: False
  temperature: 0.0
  until:
    - "<s>"
metric_list:
  - metric: exact_match_manual
    aggregation: mean
    higher_is_better: true
```

I also set up a utils.py file to postprocess model results:

```python
def process_results_gen(doc, results):
    completion = results[0]
    target = str(doc["answer"])

    # hack fixes for float -> string formatting in the dataset targets
    if target[-2] == ".":
        target = target + "0"    # pad "3.5" to "3.50"
    elif "." not in target:
        target = target + ".00"  # pad "3" to "3.00"

    exact_match_manual = 1 if completion == target else 0
    return {"exact_match_manual": exact_match_manual}
```

The hack fix addresses the float-to-string formatting of the dataset, which affects the precision reflected in the target string. Additionally, I noticed that the OpenAI models were stopping prematurely on double newlines (likely the lm-eval-harness default), so I added a stop sequence in the until field of generation_kwargs.

I used the lm-evaluation-harness's Python API rather than its CLI because I wanted to run some tests in a Jupyter notebook. I found the API simple and useful, though the CLI is documented as the default workflow:

```python
from lm_eval import simple_evaluate, tasks
from lm_eval.models.openai_completions import OpenaiChatCompletionsLM

task_name = "flare_finqa"
model_name = "gpt-4-turbo-2024-04-09"

model = OpenaiChatCompletionsLM(model=model_name)
task_manager = tasks.TaskManager(include_path="path/to/tasks")

# call simple_evaluate against the custom task
results = simple_evaluate(
    model=model,
    tasks=[task_name],
    num_fewshot=0,
    task_manager=task_manager,
    write_out=True,
)
```

I ran gpt-4-turbo-2024-04-09 against a subset of 100 dataset samples and observed an exact-match score of 0.0. Suspecting I'd fumbled something, I logged the completions:

> Answer: The business combination of Entergy Louisiana and Entergy Gulf States Louisiana in 2015 significantly impacted the financial results of Entergy Corporation in several ways. Firstly, the combination resulted in the recognition of a deferred tax asset and a corresponding net increase in tax basis amounting to approximately $334 million. This likely provided a substantial tax benefit to the...

Consistent with past experience using chat models on targeted tasks, I found that models often disregard instructions to report only the result and show a clear preference for including their explanation. Looking back at the dataset query, the prompt reads:

> Please answer the given financial question based on the context.
> Context: …
> Question: …
> Answer:

This prompt does little to specify the format and precision of the desired result. For the purpose of this test, I decided to allow the models to generate their explanation, but discard that explanation before evaluation.

Comparing the verbose gpt-4-turbo-2024-04-09 output with the dataset answers, I found several cases of incorrect calculations in the original dataset. One issue was the conflation of the words portion, ratio, and proportion in calculations reported as a decimal proportion. The semantic difference is small, but a portion refers to the quantity allocated. For example, where 30 out of 100 balls are green, the portion of balls that are green is 30, the decimal proportion of green balls is 0.3, and the percentage proportion is 30%.
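To make that distinction concrete, here is a tiny illustration of my own (not taken from the dataset):

```python
# Three readings of "30 green balls out of a total of 100":
green, total = 30, 100

portion = green                # the allocated quantity itself -> 30
proportion = green / total     # decimal proportion            -> 0.3
percentage = 100 * proportion  # percentage proportion         -> 30.0 (i.e. 30%)
```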
Ratios were also used to mean decimal proportions in the dataset. To give the model the best chance of success, I modified the prompt to specify the output format explicitly: a unitless result returned with two decimal points of precision. The new prompt reads:

> Context: {{context}}
>
> Given the context, {{question}} Report your answer using the following format:
>
> Explanation: Explanation of calculation
> Formatted answer: Float number to two decimal point precision and no units

Due to other errors I discovered, I decided to manually verify the calculations in the set. The verification was arduous, so this dataset is only a small, 91-sample subset of the original test set (available here). The new yaml for the task is:

```yaml
task: flare_finqa
dataset_path: Aiera/flare-finqa-verified
training_split: null
validation_split: null
test_split: test
doc_to_target: answer
doc_to_text: "Context:\n{{context}}\n\nGiven the context, \
  {{question}} Report your answer using the following format:\n\
  Explanation: Explanation of calculation\n\
  Formatted answer: Float number to two decimal point precision and no units\n"
process_results: !function utils.process_results_gen
generation_kwargs:
  max_gen_toks: 500
  do_sample: False
  temperature: 0.0
  until:
    - "<s>"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
  - metric: exact_match_manual
    aggregation: mean
    higher_is_better: true
```

The doc_to_text field specifies a Jinja prompt template used to compose the prompt from the bracketed dataset columns at runtime. The post-generation processing in my utils.py extracts the formatted answer:

```python
def process_results_gen(doc, results):
    completion = results[0]
    target = str(doc["answer"])

    # keep only the text following the final "Formatted answer:" label
    if "formatted answer:" in completion.lower():
        completion_splits = completion.split(":")
        completion = completion_splits[-1].strip()

    # hack fix for float -> string formatting in the dataset targets
    if target[-2] == ".":
        target = target + "0"
    elif "." not in target:
        target = target + ".00"

    exact_match_manual = 1 if completion == target else 0
    return {"exact_match_manual": exact_match_manual}
```

Now, I ran my task through the eval harness for a few different models. I found claude-3-opus to be the winner, followed by gemini-1.5-pro and then gpt-4-turbo-2024-04-09.

Because this testing set is a much smaller subset of the original dataset, I wanted to measure how well the smaller sample could represent the models' broader performance. In the yaml, I specified an exact_match evaluation metric that scores each trial as a 1 for a hit (correct computation) or a 0 for a miss (incorrect). The resulting outputs follow a discrete Bernoulli distribution, where the value 1 occurs with probability p and 0 with probability q = 1 − p. Using the distribution, we can establish the minimum dataset size we need to understand the model's performance on this specific task:

n = (Z · σ / E)²

Where:
- n is the minimum number of samples,
- Z is the Z-score for the desired confidence level,
- σ is the standard deviation of the exact_match outcomes (√(p · q) for a Bernoulli variable),
- E is the margin of error.

The lm-eval-harness reports the standard error associated with our exact_match calculation.

[Table: Scores and standard error for the FinQA test]

For a Z-score of 1.96 and a margin of error of 0.02 score points, we can calculate the minimum number of samples needed to evaluate performance at the 95% confidence level:

[Table: Minimum sample counts for each model on FinQA to establish a 95% confidence bound]

Our 91-sample subset exceeds the minimum n for every model, so we can be reasonably confident these scores represent model performance on this specific task and dataset.
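As a quick back-of-the-envelope check, the formula above is easy to evaluate directly. The sketch below (mine, not from the original post) plugs in the Z = 1.96 and E = 0.02 used here, with a hypothetical spread value standing in for the harness-reported figure; for a Bernoulli score p, √(p(1 − p)) can be substituted.

```python
import math

def min_samples(sigma: float, z: float = 1.96, margin: float = 0.02) -> int:
    """Minimum sample count n = (Z * sigma / E)^2, rounded up."""
    return math.ceil((z * sigma / margin) ** 2)

# Hypothetical spread value for illustration only; substitute the figure
# reported by the eval harness for your own run.
print(min_samples(0.05))  # -> 25
```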
In closing, this sufficiency demonstrates why smaller, high-integrity datasets can be the most valuable tools for evaluating model competence. Natural follow-ups to this brief exploration include evaluating significant digits and unit comprehension, expanding into other datasets such as ConvFinQA, and experimenting with few-shot and chain-of-thought prompting.