chatbot-arena-leaderboard.kevinlidk.cn
172.67.135.60  Public Scan

URL: https://chatbot-arena-leaderboard.kevinlidk.cn/
Submission: On January 05 via api from US — Scanned from DE

Form analysis: 0 forms found in the DOM

Text Content

🏆 CHATBOT ARENA LLM LEADERBOARD: COMMUNITY-DRIVEN EVALUATION FOR BEST LLM AND
AI CHATBOTS

Twitter | Discord | Blog | GitHub | Paper | Dataset | Kaggle Competition

Vote!

This is a mirror of the live leaderboard created and maintained at
https://lmarena.ai/leaderboard. Please link to the original URL for citation
purposes.

Chatbot Arena (lmarena.ai) is an open-source platform for evaluating AI through
human preference, developed by researchers at UC Berkeley SkyLab and LMSYS. With
over 1,000,000 user votes, the platform ranks the best LLMs and AI chatbots
using the Bradley-Terry model and generates live leaderboards. For technical
details, check out our paper.
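The Bradley-Terry fit behind these rankings can be sketched in a few lines. This is an illustrative maximum-likelihood fit by gradient ascent, not the leaderboard's actual code; the Elo-style display scale (base 10, scale 400, anchor 1000) is an assumption about how raw strengths map to Arena scores.

```python
import numpy as np

def fit_bradley_terry(battles, n_models, lr=0.1, steps=2000):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs by
    gradient ascent on the log-likelihood sum(log sigmoid(s_w - s_l))."""
    s = np.zeros(n_models)
    w = np.array([b[0] for b in battles])
    l = np.array([b[1] for b in battles])
    for _ in range(steps):
        p_win = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))  # P(winner beats loser)
        grad = np.zeros(n_models)
        np.add.at(grad, w, 1.0 - p_win)   # winners pull their strength up
        np.add.at(grad, l, p_win - 1.0)   # losers get pushed down
        s += lr * grad / len(battles)
        s -= s.mean()  # strengths are identifiable only up to a shift
    return s

def to_arena_scale(s, scale=400, anchor=1000):
    """Assumed Elo-style display mapping for raw BT strengths."""
    return scale / np.log(10) * s + anchor
```

With a toy set of battles, the model that wins most head-to-heads ends up with the highest strength, and `to_arena_scale` only rescales without changing the ordering.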

Chatbot Arena thrives on community engagement — cast your vote to help improve
AI evaluation!

New Launch! Copilot Arena: VS Code Extension to compare Top LLMs
Tabs: Arena | 📣 NEW: Overview | Arena (Vision) | Arena-Hard-Auto | Full Leaderboard

Total #models: 187.    Total #votes: 2,488,392.    Last updated: 2024-12-29.

Code to recreate leaderboard tables and plots in this notebook. You can
contribute your vote at lmarena.ai!



OVERALL QUESTIONS

    #MODELS: 187 (100%)     #VOTES: 2,488,392 (100%)   

Rank* (UB)

Rank (StyleCtrl)

Model

Arena Score

95% CI

Votes

Organization

License

Knowledge Cutoff


1
1

Gemini-Exp-1206

1373
+4/-5
16361
Google
Proprietary
Unknown
1
3

Gemini-2.0-Flash-Thinking-Exp-1219

1366
+5/-4
10633
Google
Proprietary
Unknown
1
5

Gemini-Exp-1121

1365
+5/-5
17193
Google
Proprietary
Unknown
1
1

ChatGPT-4o-latest (2024-11-20)

1365
+4/-3
29314
OpenAI
Proprietary
Unknown
1
1

o1-2024-12-17

1359
+9/-11
2934
OpenAI
Proprietary
Unknown
4
4

Gemini-2.0-Flash-Exp

1356
+4/-5
15282
Google
Proprietary
Unknown
5
9

Gemini-Exp-1114

1346
+5/-4
17119
Google
Proprietary
Unknown
8
4

o1-preview

1335
+4/-3
33232
OpenAI
Proprietary
2023/10
9
8

DeepSeek-V3

1315
+12/-12
2199
DeepSeek
DeepSeek
Unknown
9
13

o1-mini

1306
+3/-3
44113
OpenAI
Proprietary
2023/10
9
9

Gemini-1.5-Pro-002

1302
+4/-3
40515
Google
Proprietary
Unknown
9
9

Gemini-1.5-Pro-Exp-0827

1299
+4/-3
32287
Google
Proprietary
2023/11
13
16

Grok-2-08-13

1288
+3/-3
62620
xAI
Proprietary
2024/3
13
16

Yi-Lightning

1287
+4/-4
29169
01 AI
Proprietary
Unknown
13
11

GPT-4o-2024-05-13

1285
+3/-2
117891
OpenAI
Proprietary
2023/10
13
8

Claude 3.5 Sonnet (20241022)

1283
+4/-3
42796
Anthropic
Proprietary
2024/4
13
24

Qwen2.5-plus-1127

1279
+9/-8
3979
Alibaba
Proprietary
Unknown
13
20

Deepseek-v2.5-1210

1279
+7/-7
7231
DeepSeek
DeepSeek
Unknown
16
24

Athene-v2-Chat-72B

1277
+5/-6
16292
NexusFlow
NexusFlow
Unknown
17
22

GLM-4-Plus

1274
+3/-4
27973
Zhipu AI
Proprietary
Unknown
17
23

GPT-4o-mini-2024-07-18

1273
+2/-3
58283
OpenAI
Proprietary
2023/10

*Rank (UB): model's ranking (upper-bound), defined by one + the number of models
that are statistically better than the target model. Model A is statistically
better than model B when A's lower-bound score is greater than B's upper-bound
score (in 95% confidence interval). See Figure 1 below for visualization of the
confidence intervals of model scores.
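The footnote's definition translates directly into code. This is a small illustrative helper, not the leaderboard's implementation, using the Arena score and the +/- CI half-widths shown in the table:

```python
def rank_ub(scores, ci):
    """Rank (UB): 1 + the number of models whose 95% CI lower bound
    exceeds this model's CI upper bound.
    scores: list of Arena scores; ci: list of (plus, minus) half-widths."""
    upper = [s + plus for s, (plus, minus) in zip(scores, ci)]
    lower = [s - minus for s, (plus, minus) in zip(scores, ci)]
    # A model's own lower bound can never exceed its own upper bound,
    # so no self-exclusion is needed.
    return [1 + sum(lo > up for lo in lower) for up in upper]
```

Applied to the top of the table above (1373 +4/-5, 1366 +5/-4, 1365 +4/-3, 1359 +9/-11, 1356 +4/-5), this reproduces the displayed ranks 1, 1, 1, 1, 4: Gemini-2.0-Flash-Exp's upper bound of 1360 is below three other models' lower bounds.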

Rank (StyleCtrl): model's ranking with style control, which accounts for factors
like response length and markdown usage to decouple model performance from these
potential confounding variables. See blog post for further details.
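One plausible reading of style control, sketched under stated assumptions: fit the same pairwise logistic model but add per-battle style-difference features (e.g. response-length difference, markdown-usage difference) so the style coefficients absorb those effects and the model coefficients are decoupled from them. The feature choice and fitting details here are illustrative, not the blog post's exact method.

```python
import numpy as np

def fit_with_style_control(X_model, X_style, y, lr=0.05, steps=3000):
    """Logistic model P(A wins) = sigmoid(s_A - s_B + beta . style_diff).
    X_model: (n, m) matrix with +1 in model A's column, -1 in model B's;
    X_style: (n, k) style feature differences between A and B (assumed);
    y: 1.0 if A won, 0.0 otherwise."""
    X = np.hstack([X_model, X_style])
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)  # gradient ascent on log-likelihood
    m = X_model.shape[1]
    return w[:m], w[m:]  # model strengths, style coefficients
```

Ranking on the model strengths alone then gives a style-controlled ordering.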

Note: in each category, we exclude models with fewer than 300 votes as their
confidence intervals can be large.


MORE STATISTICS FOR CHATBOT ARENA (OVERALL)

FIGURE 1: CONFIDENCE INTERVALS ON MODEL STRENGTH (VIA BOOTSTRAPPING)

[Interactive Plotly chart omitted: bootstrap confidence intervals on each model's rating (x-axis: Model, y-axis: Rating).]
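Figure 1's bootstrapped intervals can be sketched as follows: resample the battle log with replacement, refit the Bradley-Terry model on each resample, and take the 2.5th/97.5th percentiles of the resulting strengths. The inner fit is a minimal gradient-ascent BT fit, illustrative rather than the leaderboard's code.

```python
import numpy as np

def bootstrap_ci(battles, n_models, n_boot=200, seed=0):
    """95% bootstrap intervals for Bradley-Terry strengths.
    battles: iterable of (winner, loser) index pairs."""
    rng = np.random.default_rng(seed)
    battles = np.asarray(battles)

    def fit(b):
        s = np.zeros(n_models)
        w, l = b[:, 0], b[:, 1]
        for _ in range(500):
            p = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))
            g = np.zeros(n_models)
            np.add.at(g, w, 1.0 - p)
            np.add.at(g, l, p - 1.0)
            s += 0.2 * g / len(b)
            s -= s.mean()
        return s

    samples = np.stack([
        fit(battles[rng.integers(0, len(battles), len(battles))])
        for _ in range(n_boot)
    ])
    return (np.percentile(samples, 2.5, axis=0),
            np.percentile(samples, 97.5, axis=0))
```

Two models whose intervals do not overlap are "statistically better/worse" in the sense used by Rank (UB) above.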

FIGURE 2: AVERAGE WIN RATE AGAINST ALL OTHER MODELS (ASSUMING UNIFORM SAMPLING
AND NO TIES)

[Interactive Plotly chart omitted: average win rate per model (x-axis: Model, y-axis: Average Win Rate).]
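Figure 2's quantity follows from the fitted scores: under the Bradley-Terry model, a model's average win rate is its expected win probability against each other model, averaged uniformly with ties excluded. The base-10 / scale-400 form of the win-probability formula is an assumption about the Arena score scale.

```python
import numpy as np

def average_win_rate(scores, scale=400):
    """Expected win rate of each model against all others, averaged
    uniformly (no ties), from Elo-style Arena scores:
    P(a beats b) = 1 / (1 + 10 ** ((s_b - s_a) / scale))."""
    s = np.asarray(scores, dtype=float)
    p = 1.0 / (1.0 + 10.0 ** ((s[None, :] - s[:, None]) / scale))  # p[a, b]
    np.fill_diagonal(p, np.nan)       # exclude self-play
    return np.nanmean(p, axis=1)
```

A useful sanity check: because P(a beats b) + P(b beats a) = 1, the win rates always average to exactly 0.5 across models.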

FIGURE 3: FRACTION OF MODEL A WINS FOR ALL NON-TIED A VS. B BATTLES

[Interactive Plotly heatmap omitted: fraction of Model A wins for each Model A vs. Model B pairing.]

FIGURE 4: BATTLE COUNT FOR EACH COMBINATION OF MODELS (WITHOUT TIES)

[Interactive Plotly heatmap omitted: battle count for each Model A vs. Model B pairing, without ties.]
For a more holistic comparison, we've updated the leaderboard to show model rank
(UB) across tasks and languages. Check out the 'Arena' tab for more categories,
statistics, and model info.


  Task Leaderboard


  Chatbot Arena Overview

Model

Overall

Overall w/ Style Control

Hard Prompts

Hard Prompts w/ Style Control

Coding

Math

Creative Writing

Instruction Following

Longer Query

Multi-Turn




gemini-exp-1206

1
1
1
1
1
1
1
1
1
1

gemini-2.0-flash-thinking-exp-1219

1
3
1
1
1
1
1
1
1
1

gemini-exp-1121

1
5
2
5
3
3
1
3
2
1

chatgpt-4o-latest-20241120

1
1
3
5
1
6
1
3
1
1

o1-2024-12-17

1
1
1
1
1
1
5
1
1
1

gemini-2.0-flash-exp

4
4
2
3
2
4
3
4
1
3

gemini-exp-1114

5
9
4
6
5
3
3
4
3
3

o1-preview

8
4
1
1
1
1
8
4
6
3

deepseek-v3

9
8
2
3
3
5
7
9
3
6

o1-mini

9
13
4
6
1
1
26
9
8
9

gemini-1.5-pro-002

9
9
10
9
11
9
7
10
9
12

gemini-1.5-pro-exp-0827

9
9
12
10
12
9
8
10
9
9

grok-2-2024-08-13

13
16
18
16
14
15
12
16
16
13

yi-lightning

13
16
10
12
11
9
12
13
10
9

gpt-4o-2024-05-13

13
11
14
13
11
15
12
13
12
12

claude-3-5-sonnet-20241022

13
8
10
1
8
9
11
10
9
9

qwen2.5-plus-1127

13
24
10
13
6
9
15
10
10
9

deepseek-v2.5-1210

13
20
10
13
11
11
11
13
9
9

athene-v2-chat

16
24
10
13
11
9
26
14
12
13

glm-4-plus

17
22
15
19
13
16
16
16
14
15

gpt-4o-mini-2024-07-18

17
23
19
26
13
24
13
21
13
13

gemini-1.5-flash-002

17
26
26
32
30
18
12
20
14
33

gemini-1.5-flash-exp-0827

17
21
20
17
28
16
12
19
9
15

llama-3.1-nemotron-70b-instruct

17
38
18
20
17
16
12
21
31
15

llama-3.1-405b-instruct-bf16

18
14
18
12
14
13
17
20
19
13

claude-3-5-sonnet-20240620

20
13
14
7
11
9
27
14
16
12

llama-3.1-405b-instruct-fp8

20
15
20
14
17
15
17
19
26
13

gemini-advanced-0514

22
14
30
17
30
20
12
23
30
21

grok-2-mini-2024-08-13

22
35
26
31
28
21
24
28
20
27

gpt-4o-2024-08-06

22
15
20
16
18
15
13
17
15
17

yi-lightning-lite

22
21
15
17
21
15
17
18
14
13

qwen-max-0919

23
26
15
18
12
16
22
16
13
16

gemini-1.5-pro-001

29
21
26
16
28
20
13
23
14
27

deepseek-v2.5

31
31
18
19
11
17
27
24
17
27

  Language Leaderboard


  Chatbot Arena Overview

Model

English

Chinese

German

French

Spanish

Russian

Japanese

Korean




gemini-exp-1206

1
1
1
1
1
1
1
1

gemini-2.0-flash-thinking-exp-1219

1
1
1
1
1
1
1
1

gemini-exp-1121

2
1
1
1
1
1
1
1

chatgpt-4o-latest-20241120

2
1
1
1
1
1
1
1

o1-2024-12-17

2
1
1
1
-1
1
1
1

gemini-2.0-flash-exp

2
1
1
1
1
1
1
2

o1-preview

2
8
7
1
2
8
2
8

gemini-exp-1114

7
1
1
1
2
2
1
2

deepseek-v3

8
1
4
1
-1
8
1
1

o1-mini

9
8
7
6
2
11
11
8

gemini-1.5-pro-002

11
8
7
6
3
8
5
8

yi-lightning

11
8
9
6
2
23
11
11

qwen2.5-plus-1127

11
8
11
1
2
19
11
20

gemini-1.5-pro-exp-0827

12
8
7
3
2
8
5
8

grok-2-2024-08-13

12
14
7
6
5
12
9
8

deepseek-v2.5-1210

12
8
7
6
2
11
11
11

llama-3.1-nemotron-70b-instruct

12
17
4
1
2
33
11
24

gpt-4o-2024-05-13

14
16
7
6
5
12
9
8

athene-v2-chat

14
9
7
6
5
12
15
10

llama-3.1-405b-instruct-fp8

15
33
11
10
6
19
26
15

llama-3.1-405b-instruct-bf16

15
36
9
7
6
20
11
16

claude-3-5-sonnet-20241022

16
16
7
6
5
8
9
12

gpt-4o-mini-2024-07-18

17
20
9
6
5
12
12
12

grok-2-mini-2024-08-13

18
20
7
6
5
20
15
12

llama-3.3-70b-instruct

18
35
9
1
4
20
31
47

glm-4-plus

20
13
7
6
3
12
12
8

yi-lightning-lite

21
20
7
4
4
27
10
13

gpt-4o-2024-08-06

23
26
11
11
5
13
11
10

qwen-max-0919

23
23
9
5
4
12
12
21

llama-3.1-70b-instruct

23
42
23
15
10
35
40
23

gemini-1.5-flash-002

24
13
9
7
6
12
9
10

claude-3-5-sonnet-20240620

25
20
9
6
5
12
11
11

deepseek-v2.5

25
14
15
6
5
23
15
12

gpt-4-turbo-2024-04-09

25
30
11
10
6
25
21
15


Total #models: 49.    Total #votes: 156,151.    Last updated: 2024-12-30.




OVERALL QUESTIONS

    #MODELS: 49 (100%)     #VOTES: 156,151 (100%)   

Rank* (UB)

Rank (StyleCtrl)

Model

Arena Score

95% CI

Votes

Organization

License

Knowledge Cutoff


1
1

Gemini-Exp-1121

1299
+14/-11
2361
Google
Proprietary
Unknown
2
9

Gemini-Exp-1114

1272
+11/-15
2116
Google
Proprietary
Unknown
2
1

Gemini-2.0-Flash-Thinking-Exp-1219

1268
+15/-11
1096
Google
Proprietary
Unknown
2
1

Gemini-2.0-Flash-Exp

1256
+13/-10
2178
Google
Proprietary
Unknown
4
1

Gemini-Exp-1206

1236
+13/-11
2480
Google
Proprietary
Unknown
5
4

Gemini-1.5-Pro-Exp-0827

1232
+7/-6
10647
Google
Proprietary
2023/11
5
1

ChatGPT-4o-latest (2024-11-20)

1226
+8/-9
4151
OpenAI
Proprietary
Unknown
5
4

Gemini-1.5-Pro-002

1221
+7/-5
6780
Google
Proprietary
Unknown
7
10

Gemini-1.5-Flash-Exp-0827

1212
+7/-10
3741
Google
Proprietary
2023/11
9
6

GPT-4o-2024-05-13

1206
+4/-5
23055
OpenAI
Proprietary
2023/10
9
10

Gemini-1.5-Flash-002

1205
+7/-6
6247
Google
Proprietary
Unknown
12
10

Claude 3.5 Sonnet (20240620)

1189
+6/-5
22261
Anthropic
Proprietary
2024/4
12
7

Claude 3.5 Sonnet (20241022)

1183
+8/-7
6417
Anthropic
Proprietary
2024/4
14
16

Pixtral-Large-2411

1158
+11/-10
2259
Mistral
MRL
Unknown
14
14

Gemini-1.5-Pro-001

1151
+6/-5
17183
Google
Proprietary
2023/11
14
14

GPT-4-Turbo-2024-04-09

1151
+7/-5
13735
OpenAI
Proprietary
2023/12
14
15

Qwen-VL-Max-1119

1133
+18/-19
834
Alibaba
Proprietary
Unknown
17
12

GPT-4o-2024-08-06

1126
+11/-8
3392
OpenAI
Proprietary
2023/10
17
17

GPT-4o-mini-2024-07-18

1124
+5/-6
15066
OpenAI
Proprietary
2023/10
17
20

Gemini-1.5-Flash-8B-Exp-0827

1112
+9/-9
3427
Google
Proprietary
2023/11
17
20

Step-1V-32K

1111
+16/-17
1555
StepFun
Proprietary
Unknown
18
17

Qwen2-VL-72b-Instruct

1111
+6/-7
5435
Alibaba
Qwen
2024/9
19
20

Gemini-1.5-Flash-8B-001

1106
+8/-7
5571
Google
Proprietary
Unknown
24
23

Claude 3 Opus

1076
+5/-6
15954
Anthropic
Proprietary
2023/8
24
23

Molmo-72B-0924

1076
+8/-9
3095
AI2
Apache 2.0
Unknown
24
21

Gemini-1.5-Flash-001

1072
+8/-5
13595
Google
Proprietary
2023/11
24
29

Pixtral-12B-2409

1072
+6/-7
6249
Mistral
Apache 2.0
2024/9
24
24

Llama-3.2-90B-Vision-Instruct

1072
+7/-6
7078
Meta
Llama 3.2
2023/11
24
29

InternVL2-26B

1067
+8/-6
5266
OpenGVLab
MIT
2024/7
24
27

Amazon Nova Lite 1.0

1057
+14/-10
1686
Amazon
Proprietary
Unknown
29
26

Qwen2-VL-7B-Instruct

1054
+8/-10
5613
Alibaba
Apache 2.0
Unknown
30
29

Claude 3 Sonnet

1048
+6/-5
12689
Anthropic
Proprietary
2023/8
30
29

Yi-Vision

1045
+15/-14
1237
01 AI
Proprietary
2024/7
30
28

Amazon Nova Pro 1.0

1041
+14/-17
1739
Amazon
Proprietary
Unknown



MORE STATISTICS FOR CHATBOT ARENA (OVERALL)

FIGURE 1: CONFIDENCE INTERVALS ON MODEL STRENGTH (VIA BOOTSTRAPPING)

[Interactive Plotly chart omitted: bootstrap confidence intervals on each model's rating (x-axis: Model, y-axis: Rating).]

FIGURE 2: AVERAGE WIN RATE AGAINST ALL OTHER MODELS (ASSUMING UNIFORM SAMPLING
AND NO TIES)

[Interactive Plotly chart omitted: average win rate per model (x-axis: Model, y-axis: Average Win Rate).]

FIGURE 3: FRACTION OF MODEL A WINS FOR ALL NON-TIED A VS. B BATTLES

[Interactive Plotly heatmap omitted: fraction of Model A wins for each Model A vs. Model B pairing.]

FIGURE 4: BATTLE COUNT FOR EACH COMBINATION OF MODELS (WITHOUT TIES)

[Interactive Plotly heatmap omitted: battle count for each Model A vs. Model B pairing, without ties.]

Last Updated: 2024-07-31

Arena-Hard-Auto v0.1 - an automatic evaluation tool for instruction-tuned LLMs
with 500 challenging user queries curated from Chatbot Arena.

We prompt GPT-4-Turbo as a judge to compare each model's responses against a
baseline model (default: GPT-4-0314). If you are curious how well your model
might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto. Check out
our paper for more details on how Arena-Hard-Auto works as a fully automated
pipeline that converts crowdsourced data into high-quality benchmarks ->
[Paper | Repo]
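The judging loop described above can be sketched as follows. The `judge`, `model_answer`, and `baseline_answer` callables are hypothetical stand-ins (the real pipeline calls GPT-4-Turbo via the repo's own API, and its exact tie handling may differ; counting ties as half a win is an assumption here).

```python
def arena_hard_winrate(queries, model_answer, baseline_answer, judge):
    """For each query, ask the judge which answer is better; return the
    model's win rate against the baseline, counting ties as half a win.
    judge(query, model_text, baseline_text) -> "model" | "baseline" | "tie"
    """
    wins = 0.0
    for q in queries:
        verdict = judge(q, model_answer(q), baseline_answer(q))
        if verdict == "model":
            wins += 1.0
        elif verdict == "tie":
            wins += 0.5
    return wins / len(queries)
```

By construction the baseline scores exactly 50 against itself, which is why GPT-4-0314 appears in the table below with a win-rate of 50 and a +0.0/-0.0 interval.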

Rank* (UB)

Model

Win-rate

95% CI

Average Tokens

Organization


1

GPT-4-Turbo-2024-04-09

82.63
+2.0/-1.9
662
OpenAI
2

Claude 3.5 Sonnet (20240620)

79.35
+1.3/-2.1
567
Anthropic
2

GPT-4o-2024-05-13

79.21
+1.5/-1.8
696
OpenAI
2

GPT-4-0125-preview

77.96
+1.9/-2.0
619
OpenAI
2

Athene-70B

76.83
+1.9/-2.0
683
NexusFlow
4

GPT-4o-mini-2024-07-18

74.94
+2.1/-2.3
668
OpenAI
6

Gemini-1.5-Pro-001

71.96
+2.7/-2.3
676
Google
6

Yi-Large-preview

71.48
+1.9/-2.5
720
01 AI
7

Mistral-Large-2407

70.42
+2.0/-2.3
623
Mistral
10

Meta-Llama-3.1-405B-Instruct-fp8

64.09
+2.5/-2.7
633
Meta
10

GLM-4-0520

63.84
+2.3/-2.6
636
Zhipu AI
10

Yi-Large

63.7
+2.2/-1.9
626
01 AI
10

DeepSeek-Coder-V2-Instruct

62.3
+2.4/-2.5
578
DeepSeek AI
10

Claude 3 Opus

60.36
+2.0/-2.8
541
Anthropic
13

Gemma-2-27B-it

57.51
+2.6/-2.4
577
Google
14

Meta-Llama-3.1-70B-Instruct

55.73
+2.5/-2.9
628
Meta
14

GLM-4-0116

55.72
+2.4/-1.9
622
Zhipu AI
15

Gemini-1.5-Pro-Preview-0409

53.37
+3.3/-2.2
478
Google
17

GLM-4-AIR

50.88
+2.3/-2.3
619
Zhipu AI
19

GPT-4-0314

50
+0.0/-0.0
423
OpenAI
18

Gemini-1.5-Flash-001

49.61
+2.6/-2.1
642
Google
20

Qwen2-72B-Instruct

46.86
+2.4/-2.3
515
Alibaba
20

Claude 3 Sonnet

46.8
+2.2/-2.7
552
Anthropic
20

Llama-3-70B-Instruct

46.57
+2.6/-2.7
591
Meta
24

Claude 3 Haiku

41.47
+2.6/-1.9
505
Anthropic
25

GPT-4-0613

37.9
+2.5/-2.3
354
OpenAI
25

Mistral-Large-2402

37.71
+2.1/-2.9
400
Mistral
26

Mixtral-8x22b-Instruct-v0.1

36.36
+2.2/-2.1
430
Mistral
26

Qwen1.5-72B-Chat

36.12
+2.0/-2.2
474
Alibaba
27

Phi-3-Medium-4k-Instruct

33.37
+1.8/-2.1
517
Microsoft
27

Command R+ (04-2024)

33.07
+2.0/-2.2
541
Cohere
28

Mistral Medium

31.9
+2.4/-2.2
485
Mistral
30

Phi-3-Small-8k-Instruct

29.77
+2.2/-1.8
568
Microsoft
33

Mistral-Next

27.37
+1.7/-2.0
297
Mistral

Three benchmarks are displayed: Arena Elo, MT-Bench and MMLU.

 * Chatbot Arena - a crowdsourced, randomized battle platform. We use 1M+ user
   votes to compute model strength.
 * MT-Bench: a set of challenging multi-turn questions. We use GPT-4 to grade
   the model responses.
 * MMLU (5-shot): a test to measure a model's multitask accuracy on 57 tasks.

💻 Code: The MT-bench scores (single-answer grading on a scale of 10) are
computed by fastchat.llm_judge. The MMLU scores are mostly computed by
InstructEval. Higher values are better for all benchmarks. Empty cells mean not
available.

Model

Arena Score

arena-hard-auto

MT-bench

MMLU

Organization

License




Gemini-Exp-1206

1373



Google
Proprietary

Gemini-2.0-Flash-Thinking-Exp-1219

1366



Google
Proprietary

ChatGPT-4o-latest (2024-11-20)

1365



OpenAI
Proprietary

Gemini-Exp-1121

1365



Google
Proprietary

o1-2024-12-17

1359



OpenAI
Proprietary

Gemini-2.0-Flash-Exp

1356



Google
Proprietary

Gemini-Exp-1114

1346



Google
Proprietary

ChatGPT-4o-latest (2024-09-03)

1339



OpenAI
Proprietary

o1-preview

1335



OpenAI
Proprietary

ChatGPT-4o-latest (2024-08-08)

1317



OpenAI
Proprietary

DeepSeek-V3

1315



DeepSeek
DeepSeek

o1-mini

1306



OpenAI
Proprietary

Gemini-1.5-Pro-002

1302



Google
Proprietary

Gemini-1.5-Pro-Exp-0827

1299



Google
Proprietary

Gemini-1.5-Pro-Exp-0801

1298



Google
Proprietary

Grok-2-08-13

1288



xAI
Proprietary

Yi-Lightning

1287



01 AI
Proprietary

GPT-4o-2024-05-13

1285
79.21

88.7
OpenAI
Proprietary

Claude 3.5 Sonnet (20241022)

1283


88.7
Anthropic
Proprietary

Deepseek-v2.5-1210

1279



DeepSeek
DeepSeek

Qwen2.5-plus-1127

1279



Alibaba
Proprietary

Athene-v2-Chat-72B

1277



NexusFlow
NexusFlow

GLM-4-Plus

1274



Zhipu AI
Proprietary

GPT-4o-mini-2024-07-18

1273
74.94

82
OpenAI
Proprietary

Gemini-1.5-Flash-002

1271



Google
Proprietary

Gemini-1.5-Flash-Exp-0827

1269



Google
Proprietary

Llama-3.1-Nemotron-70B-Instruct

1269



Nvidia
Llama 3.1

Claude 3.5 Sonnet (20240620)

1268
79.35

88.7
Anthropic
Proprietary

Gemini Advanced App (2024-05-14)

1267



Google
Proprietary

Meta-Llama-3.1-405B-Instruct-fp8

1267
64.09

88.6
Meta
Llama 3.1 Community

Meta-Llama-3.1-405B-Instruct-bf16

1266


88.6
Meta
Llama 3.1 Community

Grok-2-Mini-08-13

1266



xAI
Proprietary

GPT-4o-2024-08-06

1265



OpenAI
Proprietary

Yi-Lightning-lite

1264



01 AI
Proprietary


Citation ▼


CITATION

Please cite the following paper if you find our leaderboard or dataset helpful.

@misc{chiang2024chatbot,
    title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
    author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
    year={2024},
    eprint={2403.04132},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}



TERMS OF SERVICE

Users are required to agree to the following terms before using the service:

The service is a research preview. It only provides limited safety measures and
may generate offensive content. It must not be used for any illegal, harmful,
violent, racist, or sexual purposes. Please do not upload any private
information. The service collects user dialogue data, including both text and
images, and reserves the right to distribute it under a Creative Commons
Attribution (CC-BY) or a similar license.

PLEASE REPORT ANY BUG OR ISSUE TO OUR DISCORD/ARENA-FEEDBACK.


ACKNOWLEDGMENT

We thank UC Berkeley SkyLab, Kaggle, MBZUAI, a16z, Together AI, Hyperbolic,
RunPod, Anyscale, HuggingFace for their generous sponsorship.


Use via API
·
Built with Gradio

KevinLi0628 / chatbot-arena-leaderboard