benchmarks.kensho.com
Submitted URL: http://benchmarks.kensho.com/
Effective URL: https://benchmarks.kensho.com/
Submission: On November 16 via api from US — Scanned from DE
S&P AI BENCHMARKS BY KENSHO

A SERIES OF BENCHMARKS THAT EVALUATE AI SYSTEMS, INCLUDING LARGE LANGUAGE MODELS (LLMS), FOR BUSINESS AND FINANCE USE CASES.

S&P AI Benchmarks by Kensho consists of two evaluation sets informed by S&P Global’s world-class data and industry expertise. These benchmarks are designed to assess the ability of LLMs to solve real-world business and finance questions and were developed in collaboration with experts across S&P Global to ensure accuracy and reliability. Everyone is welcome to sign up and participate, from academic labs and large corporations to independent model developers. The public-facing leaderboards are designed to encourage innovation and collaborative understanding.

FINANCE FUNDAMENTALS

Rank  Model Name         Organization  Overall (%)
1     o1-preview         OpenAI        91.05
2     Claude 3.5 Sonnet  Anthropic     88.07
3     GPT-4o             OpenAI        87.96
4     GPT-4 Turbo        OpenAI        87.66
5     GPT-4              OpenAI        85.41
6     Mistral Large 2    Mistral AI    85.31
7     Claude 3 Opus      Anthropic     83.13
8     Gemini 1.5 Pro     Google        82.06
9     Claude 3 Sonnet    Anthropic     79.47
10    Llama 3 70B        Meta          79.42

LONG-DOCUMENT QA

Rank  Model Name         Architecture                 Score (%)  Source
1     Claude 3.5 Sonnet  N/A                          47.11      N/A
2     o1-preview         N/A                          43.11      N/A
3     GPT-4o             N/A                          39.56      N/A
4     Claude 3.5 Sonnet  text-embedding-3-large       31.11      N/A
5     GPT-4o             text-embedding-3-large       30.67      N/A
6     Mistral Large 2    text-embedding-3-large       27.11      N/A
7     GPT-4o             text-embedding-3-small       26.22      N/A
8     Llama 3 70B        text-embedding-3-large       24.89      N/A
9     o1-preview         text-embedding-3-large       22.67      N/A
10    GPT-4o             nvidia-Llama3-ChatQA-1.5-8B  21.33      N/A

WHY WE CREATED S&P AI BENCHMARKS

Although today’s LLMs generally demonstrate strong performance on question-answering (QA) and code generation tasks, it remains difficult for models to reason about quantities and numbers.
This poses issues for using LLMs in real-world business and finance applications, as these fields can require transparent and precise reasoning capabilities along with a wide breadth of technical knowledge. Existing benchmarks for these domains include tasks such as sentiment analysis, text classification, or named-entity extraction. With S&P AI Benchmarks, we’ve created rigorous and challenging tasks that are rooted in realistic use cases for business professionals. Our goal is to build trustworthy, objective evaluation sets to encourage the development of better models for business and finance.

To learn more, read our latest research papers:
“BizBench: A Quantitative Reasoning Benchmark for Business and Finance” (ACL 2024)
“DocFinQA: A Long-Context Financial Reasoning Dataset” (ACL 2024)

READY TO FIND YOUR PLACE ON THE LEADERBOARDS? Submit now

FREQUENTLY ASKED QUESTIONS

Where can I find the evaluation set/questions?
Once you click “submit” and sign into your Kensho account, you will have access to the full list of questions.

When can I expect to receive the results for my submission?
Please allow 2-4 business days for a Kensho team member to review and approve your submission. You will receive an email notification with details about your scores.

How many times can I submit?
You can submit the same model to each leaderboard once per business day.

Is there a participation fee?
This is a free benchmarking leaderboard, but participants bear the cost of running the model.

What am I expected to submit?
In our submission process, you are only required to provide the outputs of your model; there is no need to share the actual model itself.

Do I need to give my LLM extra information to accurately run the tests?
Your LLM simply needs to generate a response to each question in our evaluation set; no pre-training is necessary. However, you and your team are free to prompt however you wish. You’re not restricted to asking the model the questions verbatim.
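Since only model outputs are submitted, preparing an entry amounts to running each evaluation question through your model and collecting the answers. The sketch below illustrates this loop; the JSONL record layout and the my_model() callable are purely hypothetical assumptions for illustration, not Kensho's actual submission format — consult the instructions shown after signing in.

```python
# Hypothetical sketch of collecting model outputs for submission.
# The JSONL layout and my_model() are illustrative assumptions,
# not Kensho's actual submission format.
import json


def my_model(question: str) -> str:
    """Placeholder for your LLM call; prompt however you wish."""
    return "42"


def build_submission(questions: list[str], path: str) -> None:
    """Write one JSON record per question with the model's answer."""
    with open(path, "w", encoding="utf-8") as f:
        for q in questions:
            record = {"question": q, "answer": my_model(q)}
            f.write(json.dumps(record) + "\n")
```

Keeping one record per line makes it easy to spot-check individual answers before submitting.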
What measures are in place to ensure the evaluation process is fair and free from bias?
The evaluation is based on numerical and string matching, so the model either generates the right or the wrong answer to each question. As a result, the evaluation does not require any subjective judgment calls.

Who would use this, and how? Can you share a couple of example use cases?
A researcher would use this benchmark to track improvements to their model during pre-training or fine-tuning. A technology consultant can use it to prove the value-add of their LLM-powered application services. A product manager or risk manager would use it as an independent third-party benchmark to verify their AI product’s performance and reliability. A product or business leader would use this tool as a standardized metric to compare LLMs and inform purchasing decisions.

How will you manage my submission?
Because users submit their model’s output, not the model itself, we do not see or collect your model code or model weights. We use submitted outputs to calculate your score, which is then populated on the leaderboard. Any user who wishes to have their scores removed from the leaderboard can request this at any time.

Will the ranking give me feedback on why my LLM did or didn’t rank well?
After completing the benchmark, you will receive a score for each task. We don’t provide further or specific feedback on which questions were correct or incorrect.

Do my results have to be displayed on the leaderboard?
By default, your model’s score will be displayed on the leaderboard. If you prefer not to have your results shown, you can reach out to our team at benchmarks@kensho.com to opt out. By allowing your score to remain public, you are benefiting the broader community, as we can all collectively learn about model performance, ultimately fueling improvements over time.
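The numerical- and string-matching evaluation described in the FAQ above can be sketched roughly as follows. The normalization rules and tolerance value here are illustrative assumptions, not Kensho's actual grading code:

```python
# Illustrative sketch of a numerical/string-matching grader.
# The normalization rules and tolerance are assumptions, not
# Kensho's actual scoring implementation.

def normalize(text: str) -> str:
    """Lowercase and strip whitespace and common formatting characters."""
    return text.strip().lower().replace(",", "").replace("$", "").rstrip("%")


def is_correct(predicted: str, gold: str, rel_tol: float = 1e-3) -> bool:
    """Exact string match, falling back to a tolerant numeric comparison."""
    p, g = normalize(predicted), normalize(gold)
    if p == g:
        return True
    try:
        pn, gn = float(p), float(g)
    except ValueError:
        return False  # non-numeric and not an exact string match
    return abs(pn - gn) <= rel_tol * max(abs(gn), 1.0)


def score(predictions: list[str], golds: list[str]) -> float:
    """Fraction of questions answered correctly, as a percentage."""
    hits = sum(is_correct(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)
```

Because grading reduces to a deterministic comparison like this, two graders running the same outputs get the same score, which is what makes the process free of subjective judgment calls.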
HARVARD SQUARE + AI LAB
44 Brattle St, Cambridge, MA 02138

NEW YORK CITY
55 Water Street, New York, NY 10041

Data Agreement | Contact | Privacy Policy | Web Terms | Service Terms
Email | LinkedIn | Twitter

Copyright © 2024 Kensho Technologies, LLC. Kensho marks are the property of Kensho Technologies, LLC. All rights reserved.