

S&P AI BENCHMARKS BY KENSHO


A SERIES OF BENCHMARKS THAT EVALUATE AI SYSTEMS INCLUDING LARGE LANGUAGE MODELS
(LLMS) FOR BUSINESS AND FINANCE USE CASES.

S&P AI Benchmarks by Kensho consists of two evaluation sets informed by S&P
Global’s world-class data and industry expertise. These benchmarks are designed
to assess the ability of LLMs to solve real-world business and finance questions
and were developed in collaboration with experts across S&P Global to ensure
accuracy and reliability.

Everyone is welcome to sign up and participate, from academic labs and large
corporations to independent model developers. The public-facing leaderboards are
designed to encourage innovation and collaborative understanding.

Submit now


FINANCE FUNDAMENTALS

Rank | Model Name        | Organization | Overall (%)
-----|-------------------|--------------|------------
1    | o1-preview        | OpenAI       | 91.05
2    | Claude 3.5 Sonnet | Anthropic    | 88.07
3    | GPT-4o            | OpenAI       | 87.96
4    | GPT-4 Turbo       | OpenAI       | 87.66
5    | GPT-4             | OpenAI       | 85.41
6    | Mistral Large 2   | Mistral AI   | 85.31
7    | Claude 3 Opus     | Anthropic    | 83.13
8    | Gemini 1.5 Pro    | Google       | 82.06
9    | Claude 3 Sonnet   | Anthropic    | 79.47
10   | Llama 3 70B       | Meta         | 79.42


LONG-DOCUMENT QA

Rank | Model Name        | Architecture                | Score (%) | Source
-----|-------------------|-----------------------------|-----------|-------
1    | Claude 3.5 Sonnet | N/A                         | 47.11     | N/A
2    | o1-preview        | N/A                         | 43.11     | N/A
3    | GPT-4o            | N/A                         | 39.56     | N/A
4    | Claude 3.5 Sonnet | text-embedding-3-large      | 31.11     | N/A
5    | GPT-4o            | text-embedding-3-large      | 30.67     | N/A
6    | Mistral Large 2   | text-embedding-3-large      | 27.11     | N/A
7    | GPT-4o            | text-embedding-3-small      | 26.22     | N/A
8    | Llama 3 70B       | text-embedding-3-large      | 24.89     | N/A
9    | o1-preview        | text-embedding-3-large      | 22.67     | N/A
10   | GPT-4o            | nvidia-Llama3-ChatQA-1.5-8B | 21.33     | N/A
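
The rows that pair an LLM with an embedding model in the Architecture column suggest a retrieval-augmented setup, in which the embedding model retrieves relevant passages from the long document and the listed LLM answers from them. The page does not describe that pipeline, so the Python sketch below is only an illustration of the pattern; the chunking strategy, model names, and prompt are assumptions, not Kensho's actual harness.

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts, model="text-embedding-3-large"):
    """Embed a batch of texts and return unit-normalized vectors."""
    resp = client.embeddings.create(model=model, input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def answer_over_long_doc(question, doc, chunk_size=2000, k=5):
    # Naive fixed-size chunking of the long document (for illustration only).
    chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
    # Cosine similarity between the question and every chunk (vectors are unit norm).
    scores = embed(chunks) @ embed([question])[0]
    # Keep the k most similar chunks as context for the generator model.
    context = "\n---\n".join(chunks[i] for i in np.argsort(scores)[-k:])
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}\n"
                       "Answer with a single number or short phrase.",
        }],
    )
    return resp.choices[0].message.content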


WHY WE CREATED S&P AI BENCHMARKS

Although today’s LLMs generally demonstrate strong performance on
question-answering (QA) and code-generation tasks, they still struggle to
reason about quantities and numbers. This is a problem for real-world
applications in business and finance, fields that demand transparent,
precise reasoning and broad technical knowledge.

Existing benchmarks for these domains include tasks such as sentiment analysis,
text classification, or named-entity extraction. With S&P AI Benchmarks, we’ve
created rigorous and challenging tasks that are rooted in realistic use cases
for business professionals. Our goal is to build trustworthy, objective
evaluation sets to encourage the development of better models for business and
finance.

To learn more, read our latest research papers.

“BizBench: A Quantitative Reasoning Benchmark for Business and Finance” (ACL
2024)
“DocFinQA: A Long-Context Financial Reasoning Dataset” (ACL 2024)


READY TO FIND YOUR PLACE ON THE LEADERBOARDS?

Submit now


FREQUENTLY ASKED QUESTIONS

Where can I find the evaluation set/questions?
Once you click “submit” and sign into your Kensho account, you will have access
to the full list of questions.
When can I expect to receive the results for my submission?
Please allow 2-4 business days for a Kensho team member to review and approve
your submission. You will receive an email notification with details about your
scores.
How many times can I submit?
You can submit the same model to each leaderboard once per business day.
Is there a participation fee?
This is a free benchmarking leaderboard, but participants bear the cost of
running the model.
What am I expected to submit?
In our submission process, you are only required to provide the outputs of your
model; there is no need to share the actual model itself.
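
The page does not publish a submission file format, so the snippet below is purely hypothetical; it only illustrates the idea of submitting per-question outputs rather than a model. The file name and question IDs are invented.

import json

# Hypothetical shape: one model output per evaluation question.
outputs = {
    "q-001": "42.7",
    "q-002": "1,250,000",
}

with open("submission.json", "w") as f:
    json.dump(outputs, f, indent=2)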
Do I need to give my LLM extra information to accurately run the tests?
Your LLM simply needs to generate a response to each question in our evaluation
set; no pre-training is necessary. However, you and your team are free to prompt
however you wish. You’re not restricted to asking the model the questions
verbatim.
What measures are in place to ensure the evaluation process is fair and free
from bias?
The evaluation is based on numerical and string matching: the model either
generates the right answer to a question or it does not. As a result, the
evaluation does not require any subjective judgment calls.
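
As a rough illustration of what numerical and string matching can look like, here is a minimal sketch that assumes answers are normalized before comparison; Kensho's actual grading code is not public.

def is_correct(predicted, gold, tol=1e-6):
    """Score one answer by numerical match, falling back to string match."""
    pred, ref = predicted.strip().lower(), gold.strip().lower()
    clean = lambda s: s.replace(",", "").lstrip("$").rstrip("%")
    try:
        # Numerical match: compare as floats within a small tolerance,
        # ignoring thousands separators, currency signs, and percent signs.
        return abs(float(clean(pred)) - float(clean(ref))) <= tol
    except ValueError:
        # Non-numeric answers: case-insensitive exact string match.
        return pred == ref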
Who would use this, and how? Can you share a couple of example use cases?
A researcher can use this benchmark to track improvements in their model
during pre-training or fine-tuning. A technology consultant can use it to
demonstrate the value of their LLM-powered application services. A product
manager or risk manager can use it as an independent, third-party benchmark
to verify their AI product's performance and reliability. A product or
business leader can use it as a standardized metric to compare LLMs and
inform purchasing decisions.
How will you manage my submission?
Because users submit their model’s outputs, not the model itself, we do not
see or collect your model code or model weights. We use submitted outputs to
calculate your score, which is then published on the leaderboard. Any user
who wishes to have their scores removed from the leaderboard can request
removal at any time.
Will the ranking give me feedback on why my LLM did or didn’t rank well?
After completing the benchmark, you will receive a score for each task. We do
not provide more specific feedback on which questions were answered correctly
or incorrectly.
Do my results have to be displayed on the leaderboard?
By default, your model’s score will be displayed on the leaderboard. If you
prefer not to have your results shown, you can reach out to our team at
benchmarks@kensho.com to opt out. By allowing your score to remain public, you
are benefiting the broader community, as we can all collectively learn about
model performance, ultimately fueling improvements over time.

HARVARD SQUARE + AI LAB

44 Brattle St
Cambridge, MA 02138

NEW YORK CITY

55 Water Street
New York, NY 10041
Copyright © 2024 Kensho Technologies, LLC. Kensho marks are the property of
Kensho Technologies, LLC. All rights reserved.