chatbot-arena-leaderboard.kevinlidk.cn
172.67.135.60
URL: https://chatbot-arena-leaderboard.kevinlidk.cn/
Submission: On January 05 via api from US — Scanned from DE
Form analysis: 0 forms found in the DOM

Text Content:
🏆 CHATBOT ARENA LLM LEADERBOARD: COMMUNITY-DRIVEN EVALUATION FOR BEST LLM AND AI CHATBOTS

Twitter | Discord | Blog | GitHub | Paper | Dataset | Kaggle Competition

Vote! This is a mirror of the live leaderboard created and maintained at https://lmarena.ai/leaderboard. Please link to the original URL for citation purposes.

Chatbot Arena (lmarena.ai) is an open-source platform for evaluating AI through human preference, developed by researchers at UC Berkeley SkyLab and LMSYS. With over 1,000,000 user votes, the platform ranks the best LLMs and AI chatbots using the Bradley-Terry model to generate live leaderboards. For technical details, check out our paper. Chatbot Arena thrives on community engagement — cast your vote to help improve AI evaluation!

New Launch! Copilot Arena: a VS Code extension to compare top LLMs.

Arena 📣 NEW: Overview | Arena (Vision) | Arena-Hard-Auto | Full Leaderboard

Total #models: 187. Total #votes: 2,488,392. Last updated: 2024-12-29. Code to recreate leaderboard tables and plots is in this notebook. You can contribute your vote at lmarena.ai!
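The Bradley-Terry ranking mentioned above can be sketched in a few lines. This is an illustrative re-implementation, not the leaderboard's actual code: it fits Bradley-Terry strengths to pairwise battle outcomes with the classic minorization-maximization update and maps them onto an Elo-like scale (the 1500 anchor is an arbitrary display choice; the toy battle log is made up).

```python
import numpy as np

def bradley_terry(models, battles, iters=500):
    """Fit Bradley-Terry strengths to (winner, loser) battle pairs via the
    classic minorization-maximization update, then map to an Elo-like scale."""
    idx = {m: i for i, m in enumerate(models)}
    n = len(models)
    wins = np.zeros((n, n))                 # wins[i, j] = times i beat j
    for w, l in battles:
        wins[idx[w], idx[l]] += 1
    games = wins + wins.T                   # total battles between each pair
    p = np.ones(n) / n                      # initial strengths
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins.sum(axis=1) / denom        # MM update toward the MLE
        p /= p.sum()                        # fix the scale (BT is scale-free)
    # 400 * log10 puts strengths on an Elo-like scale; 1500 is an arbitrary anchor
    return {m: 400 * np.log10(p[idx[m]]) + 1500 for m in models}

# toy battle log: A usually beats B, B usually beats C
battles = ([("A", "B")] * 8 + [("B", "A")] * 2 +
           [("B", "C")] * 7 + [("C", "B")] * 3 +
           [("A", "C")] * 9 + [("C", "A")] * 1)
scores = bradley_terry(["A", "B", "C"], battles)
```

The resulting scores order the models by how often they win, weighted by the strength of their opponents, which is why a model can outrank another with a similar raw win rate.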
Category: Overall Questions. #Models: 187 (100%). #Votes: 2,488,392 (100%).

| Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License | Knowledge Cutoff |
| 1 | 1 | Gemini-Exp-1206 | 1373 | +4/-5 | 16361 | Google | Proprietary | Unknown |
| 1 | 3 | Gemini-2.0-Flash-Thinking-Exp-1219 | 1366 | +5/-4 | 10633 | Google | Proprietary | Unknown |
| 1 | 5 | Gemini-Exp-1121 | 1365 | +5/-5 | 17193 | Google | Proprietary | Unknown |
| 1 | 1 | ChatGPT-4o-latest (2024-11-20) | 1365 | +4/-3 | 29314 | OpenAI | Proprietary | Unknown |
| 1 | 1 | o1-2024-12-17 | 1359 | +9/-11 | 2934 | OpenAI | Proprietary | Unknown |
| 4 | 4 | Gemini-2.0-Flash-Exp | 1356 | +4/-5 | 15282 | Google | Proprietary | Unknown |
| 5 | 9 | Gemini-Exp-1114 | 1346 | +5/-4 | 17119 | Google | Proprietary | Unknown |
| 8 | 4 | o1-preview | 1335 | +4/-3 | 33232 | OpenAI | Proprietary | 2023/10 |
| 9 | 8 | DeepSeek-V3 | 1315 | +12/-12 | 2199 | DeepSeek | DeepSeek | Unknown |
| 9 | 13 | o1-mini | 1306 | +3/-3 | 44113 | OpenAI | Proprietary | 2023/10 |
| 9 | 9 | Gemini-1.5-Pro-002 | 1302 | +4/-3 | 40515 | Google | Proprietary | Unknown |
| 9 | 9 | Gemini-1.5-Pro-Exp-0827 | 1299 | +4/-3 | 32287 | Google | Proprietary | 2023/11 |
| 13 | 16 | Grok-2-08-13 | 1288 | +3/-3 | 62620 | xAI | Proprietary | 2024/3 |
| 13 | 16 | Yi-Lightning | 1287 | +4/-4 | 29169 | 01 AI | Proprietary | Unknown |
| 13 | 11 | GPT-4o-2024-05-13 | 1285 | +3/-2 | 117891 | OpenAI | Proprietary | 2023/10 |
| 13 | 8 | Claude 3.5 Sonnet (20241022) | 1283 | +4/-3 | 42796 | Anthropic | Proprietary | 2024/4 |
| 13 | 24 | Qwen2.5-plus-1127 | 1279 | +9/-8 | 3979 | Alibaba | Proprietary | Unknown |
| 13 | 20 | Deepseek-v2.5-1210 | 1279 | +7/-7 | 7231 | DeepSeek | DeepSeek | Unknown |
| 16 | 24 | Athene-v2-Chat-72B | 1277 | +5/-6 | 16292 | NexusFlow | NexusFlow | Unknown |
| 17 | 22 | GLM-4-Plus | 1274 | +3/-4 | 27973 | Zhipu AI | Proprietary | Unknown |
| 17 | 23 | GPT-4o-mini-2024-07-18 | 1273 | +2/-3 | 58283 | OpenAI | Proprietary | 2023/10 |

*Rank (UB): the model's ranking (upper bound), defined as one plus the number of models that are statistically better than the target model.
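The Rank* (UB) footnote can be computed directly from the table's score and CI columns. A minimal sketch over a four-row subsample of the table above (`rank_ub` is a hypothetical helper; because only four models are considered, the resulting ranks will not match the published ones, which account for all 187 models):

```python
# A few (arena_score, +CI, -CI) rows taken from the overall table.
rows = {
    "Gemini-Exp-1206": (1373, 4, 5),
    "o1-preview":      (1335, 4, 3),
    "DeepSeek-V3":     (1315, 12, 12),
    "o1-mini":         (1306, 3, 3),
}

def rank_ub(model):
    """1 + number of models whose CI lower bound exceeds this model's CI upper bound."""
    score, plus, minus = rows[model]
    upper = score + plus
    better = sum(1 for m, (s, p, mi) in rows.items()
                 if m != model and s - mi > upper)
    return 1 + better
```

For example, within this subsample o1-preview gets rank 2 (only Gemini-Exp-1206's lower bound, 1368, exceeds its upper bound of 1339), and ties in rank arise naturally whenever intervals overlap.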
Model A is statistically better than model B when A's lower-bound score is greater than B's upper-bound score (at the 95% confidence interval). See Figure 1 below for a visualization of the confidence intervals of model scores.

Rank (StyleCtrl): the model's ranking with style control, which accounts for factors like response length and markdown usage to decouple model performance from these potential confounding variables. See the blog post for further details.

Note: in each category, we exclude models with fewer than 300 votes, as their confidence intervals can be large.

MORE STATISTICS FOR CHATBOT ARENA (OVERALL)

FIGURE 1: CONFIDENCE INTERVALS ON MODEL STRENGTH (VIA BOOTSTRAPPING) [interactive Plotly figure omitted]
FIGURE 2: AVERAGE WIN RATE AGAINST ALL OTHER MODELS (ASSUMING UNIFORM SAMPLING AND NO TIES) [interactive Plotly figure omitted]
FIGURE 3: FRACTION OF MODEL A WINS FOR ALL NON-TIED A VS. B BATTLES [interactive Plotly figure omitted]
FIGURE 4: BATTLE COUNT FOR EACH COMBINATION OF MODELS (WITHOUT TIES) [interactive Plotly figure omitted]

For a more holistic comparison, we've updated the leaderboard to show model rank (UB) across tasks and languages. Check out the 'Arena' tab for more categories, statistics, and model info. Total #models: 187. Total #votes: 2,488,392. Last updated: 2024-12-29. Code to recreate leaderboard tables and plots is in this notebook. You can contribute your vote at lmarena.ai!

Task Leaderboard (Chatbot Arena Overview)

| Model | Overall | Overall w/ Style Control | Hard Prompts | Hard Prompts w/ Style Control | Coding | Math | Creative Writing | Instruction Following | Longer Query | Multi-Turn |
| gemini-exp-1206 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| gemini-2.0-flash-thinking-exp-1219 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| gemini-exp-1121 | 1 | 5 | 2 | 5 | 3 | 3 | 1 | 3 | 2 | 1 |
| chatgpt-4o-latest-20241120 | 1 | 1 | 3 | 5 | 1 | 6 | 1 | 3 | 1 | 1 |
| o1-2024-12-17 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 1 | 1 | 1 |
| gemini-2.0-flash-exp | 4 | 4 | 2 | 3 | 2 | 4 | 3 | 4 | 1 | 3 |
| gemini-exp-1114 | 5 | 9 | 4 | 6 | 5 | 3 | 3 | 4 | 3 | 3 |
| o1-preview | 8 | 4 | 1 | 1 | 1 | 1 | 8 | 4 | 6 | 3 |
| deepseek-v3 | 9 | 8 | 2 | 3 | 3 | 5 | 7 | 9 | 3 | 6 |
| o1-mini | 9 | 13 | 4 | 6 | 1 | 1 | 26 | 9 | 8 | 9 |
| gemini-1.5-pro-002 | 9 | 9 | 10 | 9 | 11 | 9 | 7 | 10 | 9 | 12 |
| gemini-1.5-pro-exp-0827 | 9 | 9 | 12 | 10 | 12 | 9 | 8 | 10 | 9 | 9 |
| grok-2-2024-08-13 | 13 | 16 | 18 | 16 | 14 | 15 | 12 | 16 | 16 | 13 |
| yi-lightning | 13 | 16 | 10 | 12 | 11 | 9 | 12 | 13 | 10 | 9 |
| gpt-4o-2024-05-13 | 13 | 11 | 14 | 13 | 11 | 15 | 12 | 13 | 12 | 12 |
| claude-3-5-sonnet-20241022 | 13 | 8 | 10 | 1 | 8 | 9 | 11 | 10 | 9 | 9 |
| qwen2.5-plus-1127 | 13 | 24 | 10 | 13 | 6 | 9 | 15 | 10 | 10 | 9 |
| deepseek-v2.5-1210 | 13 | 20 | 10 | 13 | 11 | 11 | 11 | 13 | 9 | 9 |
| athene-v2-chat | 16 | 24 | 10 | 13 | 11 | 9 | 26 | 14 | 12 | 13 |
| glm-4-plus | 17 | 22 | 15 | 19 | 13 | 16 | 16 | 16 | 14 | 15 |
| gpt-4o-mini-2024-07-18 | 17 | 23 | 19 | 26 | 13 | 24 | 13 | 21 | 13 | 13 |
| gemini-1.5-flash-002 | 17 | 26 | 26 | 32 | 30 | 18 | 12 | 20 | 14 | 33 |
| gemini-1.5-flash-exp-0827 | 17 | 21 | 20 | 17 | 28 | 16 | 12 | 19 | 9 | 15 |
| llama-3.1-nemotron-70b-instruct | 17 | 38 | 18 | 20 | 17 | 16 | 12 | 21 | 31 | 15 |
| llama-3.1-405b-instruct-bf16 | 18 | 14 | 18 | 12 | 14 | 13 | 17 | 20 | 19 | 13 |
| claude-3-5-sonnet-20240620 | 20 | 13 | 14 | 7 | 11 | 9 | 27 | 14 | 16 | 12 |
| llama-3.1-405b-instruct-fp8 | 20 | 15 | 20 | 14 | 17 | 15 | 17 | 19 | 26 | 13 |
| gemini-advanced-0514 | 22 | 14 | 30 | 17 | 30 | 20 | 12 | 23 | 30 | 21 |
| grok-2-mini-2024-08-13 | 22 | 35 | 26 | 31 | 28 | 21 | 24 | 28 | 20 | 27 |
| gpt-4o-2024-08-06 | 22 | 15 | 20 | 16 | 18 | 15 | 13 | 17 | 15 | 17 |
| yi-lightning-lite | 22 | 21 | 15 | 17 | 21 | 15 | 17 | 18 | 14 | 13 |
| qwen-max-0919 | 23 | 26 | 15 | 18 | 12 | 16 | 22 | 16 | 13 | 16 |
| gemini-1.5-pro-001 | 29 | 21 | 26 | 16 | 28 | 20 | 13 | 23 | 14 | 27 |
| deepseek-v2.5 | 31 | 31 | 18 | 19 | 11 | 17 | 27 | 24 | 17 | 27 |

Language Leaderboard (Chatbot Arena Overview)

| Model | English | Chinese | German | French | Spanish | Russian | Japanese | Korean |
| gemini-exp-1206 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| gemini-2.0-flash-thinking-exp-1219 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| gemini-exp-1121 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| chatgpt-4o-latest-20241120 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| o1-2024-12-17 | 2 | 1 | 1 | 1 | -1 | 1 | 1 | 1 |
| gemini-2.0-flash-exp | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| o1-preview | 2 | 8 | 7 | 1 | 2 | 8 | 2 | 8 |
| gemini-exp-1114 | 7 | 1 | 1 | 1 | 2 | 2 | 1 | 2 |
| deepseek-v3 | 8 | 1 | 4 | 1 | -1 | 8 | 1 | 1 |
| o1-mini | 9 | 8 | 7 | 6 | 2 | 11 | 11 | 8 |
| gemini-1.5-pro-002 | 11 | 8 | 7 | 6 | 3 | 8 | 5 | 8 |
| yi-lightning | 11 | 8 | 9 | 6 | 2 | 23 | 11 | 11 |
| qwen2.5-plus-1127 | 11 | 8 | 11 | 1 | 2 | 19 | 11 | 20 |
| gemini-1.5-pro-exp-0827 | 12 | 8 | 7 | 3 | 2 | 8 | 5 | 8 |
| grok-2-2024-08-13 | 12 | 14 | 7 | 6 | 5 | 12 | 9 | 8 |
| deepseek-v2.5-1210 | 12 | 8 | 7 | 6 | 2 | 11 | 11 | 11 |
| llama-3.1-nemotron-70b-instruct | 12 | 17 | 4 | 1 | 2 | 33 | 11 | 24 |
| gpt-4o-2024-05-13 | 14 | 16 | 7 | 6 | 5 | 12 | 9 | 8 |
| athene-v2-chat | 14 | 9 | 7 | 6 | 5 | 12 | 15 | 10 |
| llama-3.1-405b-instruct-fp8 | 15 | 33 | 11 | 10 | 6 | 19 | 26 | 15 |
| llama-3.1-405b-instruct-bf16 | 15 | 36 | 9 | 7 | 6 | 20 | 11 | 16 |
| claude-3-5-sonnet-20241022 | 16 | 16 | 7 | 6 | 5 | 8 | 9 | 12 |
| gpt-4o-mini-2024-07-18 | 17 | 20 | 9 | 6 | 5 | 12 | 12 | 12 |
| grok-2-mini-2024-08-13 | 18 | 20 | 7 | 6 | 5 | 20 | 15 | 12 |
| llama-3.3-70b-instruct | 18 | 35 | 9 | 1 | 4 | 20 | 31 | 47 |
| glm-4-plus | 20 | 13 | 7 | 6 | 3 | 12 | 12 | 8 |
| yi-lightning-lite | 21 | 20 | 7 | 4 | 4 | 27 | 10 | 13 |
| gpt-4o-2024-08-06 | 23 | 26 | 11 | 11 | 5 | 13 | 11 | 10 |
| qwen-max-0919 | 23 | 23 | 9 | 5 | 4 | 12 | 12 | 21 |
| llama-3.1-70b-instruct | 23 | 42 | 23 | 15 | 10 | 35 | 40 | 23 |
| gemini-1.5-flash-002 | 24 | 13 | 9 | 7 | 6 | 12 | 9 | 10 |
| claude-3-5-sonnet-20240620 | 25 | 20 | 9 | 6 | 5 | 12 | 11 | 11 |
| deepseek-v2.5 | 25 | 14 | 15 | 6 | 5 | 23 | 15 | 12 |
| gpt-4-turbo-2024-04-09 | 25 | 30 | 11 | 10 | 6 | 25 | 21 | 15 |

Total #models: 49. Total #votes: 156,151. Last updated: 2024-12-30. Code to recreate leaderboard tables and plots is in this notebook. You can contribute your vote at lmarena.ai!

Category: Overall Questions. #Models: 49 (100%). #Votes: 156,151 (100%).

| Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License | Knowledge Cutoff |
| 1 | 1 | Gemini-Exp-1121 | 1299 | +14/-11 | 2361 | Google | Proprietary | Unknown |
| 2 | 9 | Gemini-Exp-1114 | 1272 | +11/-15 | 2116 | Google | Proprietary | Unknown |
| 2 | 1 | Gemini-2.0-Flash-Thinking-Exp-1219 | 1268 | +15/-11 | 1096 | Google | Proprietary | Unknown |
| 2 | 1 | Gemini-2.0-Flash-Exp | 1256 | +13/-10 | 2178 | Google | Proprietary | Unknown |
| 4 | 1 | Gemini-Exp-1206 | 1236 | +13/-11 | 2480 | Google | Proprietary | Unknown |
| 5 | 4 | Gemini-1.5-Pro-Exp-0827 | 1232 | +7/-6 | 10647 | Google | Proprietary | 2023/11 |
| 5 | 1 | ChatGPT-4o-latest (2024-11-20) | 1226 | +8/-9 | 4151 | OpenAI | Proprietary | Unknown |
| 5 | 4 | Gemini-1.5-Pro-002 | 1221 | +7/-5 | 6780 | Google | Proprietary | Unknown |
| 7 | 10 | Gemini-1.5-Flash-Exp-0827 | 1212 | +7/-10 | 3741 | Google | Proprietary | 2023/11 |
| 9 | 6 | GPT-4o-2024-05-13 | 1206 | +4/-5 | 23055 | OpenAI | Proprietary | 2023/10 |
| 9 | 10 | Gemini-1.5-Flash-002 | 1205 | +7/-6 | 6247 | Google | Proprietary | Unknown |
| 12 | 10 | Claude 3.5 Sonnet (20240620) | 1189 | +6/-5 | 22261 | Anthropic | Proprietary | 2024/4 |
| 12 | 7 | Claude 3.5 Sonnet (20241022) | 1183 | +8/-7 | 6417 | Anthropic | Proprietary | 2024/4 |
| 14 | 16 | Pixtral-Large-2411 | 1158 | +11/-10 | 2259 | Mistral | MRL | Unknown |
| 14 | 14 | Gemini-1.5-Pro-001 | 1151 | +6/-5 | 17183 | Google | Proprietary | 2023/11 |
| 14 | 14 | GPT-4-Turbo-2024-04-09 | 1151 | +7/-5 | 13735 | OpenAI | Proprietary | 2023/12 |
| 14 | 15 | Qwen-VL-Max-1119 | 1133 | +18/-19 | 834 | Alibaba | Proprietary | Unknown |
| 17 | 12 | GPT-4o-2024-08-06 | 1126 | +11/-8 | 3392 | OpenAI | Proprietary | 2023/10 |
| 17 | 17 | GPT-4o-mini-2024-07-18 | 1124 | +5/-6 | 15066 | OpenAI | Proprietary | 2023/10 |
| 17 | 20 | Gemini-1.5-Flash-8B-Exp-0827 | 1112 | +9/-9 | 3427 | Google | Proprietary | 2023/11 |
| 17 | 20 | Step-1V-32K | 1111 | +16/-17 | 1555 | StepFun | Proprietary | Unknown |
| 18 | 17 | Qwen2-VL-72b-Instruct | 1111 | +6/-7 | 5435 | Alibaba | Qwen | 2024/9 |
| 19 | 20 | Gemini-1.5-Flash-8B-001 | 1106 | +8/-7 | 5571 | Google | Proprietary | Unknown |
| 24 | 23 | Claude 3 Opus | 1076 | +5/-6 | 15954 | Anthropic | Proprietary | 2023/8 |
| 24 | 23 | Molmo-72B-0924 | 1076 | +8/-9 | 3095 | AI2 | Apache 2.0 | Unknown |
| 24 | 21 | Gemini-1.5-Flash-001 | 1072 | +8/-5 | 13595 | Google | Proprietary | 2023/11 |
| 24 | 29 | Pixtral-12B-2409 | 1072 | +6/-7 | 6249 | Mistral | Apache 2.0 | 2024/9 |
| 24 | 24 | Llama-3.2-90B-Vision-Instruct | 1072 | +7/-6 | 7078 | Meta | Llama 3.2 | 2023/11 |
| 24 | 29 | InternVL2-26B | 1067 | +8/-6 | 5266 | OpenGVLab | MIT | 2024/7 |
| 24 | 27 | Amazon Nova Lite 1.0 | 1057 | +14/-10 | 1686 | Amazon | Proprietary | Unknown |
| 29 | 26 | Qwen2-VL-7B-Instruct | 1054 | +8/-10 | 5613 | Alibaba | Apache 2.0 | Unknown |
| 30 | 29 | Claude 3 Sonnet | 1048 | +6/-5 | 12689 | Anthropic | Proprietary | 2023/8 |
| 30 | 29 | Yi-Vision | 1045 | +15/-14 | 1237 | 01 AI | Proprietary | 2024/7 |
| 30 | 28 | Amazon Nova Pro 1.0 | 1041 | +14/-17 | 1739 | Amazon | Proprietary | Unknown |

*Rank (UB): the model's ranking (upper bound), defined as one plus the number of models that are statistically better than the target model.
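The 95% CI columns and the "via bootstrapping" note in Figure 1 refer to resampling the battle log with replacement, recomputing the statistic on each resample, and taking percentiles of the results. A minimal sketch under simplifying assumptions: it uses a plain win rate in place of a full Bradley-Terry refit (which is what the actual leaderboard recomputes per resample), and the battle data is synthetic.

```python
import random

def bootstrap_ci(battles, model, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for `model`'s win rate over its battles.
    battles: list of (winner, loser) pairs; only pairs involving `model` count."""
    rng = random.Random(seed)
    relevant = [b for b in battles if model in b]
    stats = []
    for _ in range(n_boot):
        resample = [rng.choice(relevant) for _ in relevant]  # draw with replacement
        stats.append(sum(1 for w, _ in resample if w == model) / len(resample))
    stats.sort()
    lo = stats[int(n_boot * alpha / 2)]           # 2.5th percentile
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1] # 97.5th percentile
    return lo, hi

# synthetic log: "A" beats "B" in 70 of 100 battles
battles = [("A", "B")] * 70 + [("B", "A")] * 30
lo, hi = bootstrap_ci(battles, "A")
```

With 100 battles at an observed 0.70 win rate, the interval comes out roughly (0.61, 0.79); as the vote count grows the interval tightens, which is why low-vote models are excluded from the rankings above.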
MORE STATISTICS FOR CHATBOT ARENA (OVERALL)

FIGURE 1: CONFIDENCE INTERVALS ON MODEL STRENGTH (VIA BOOTSTRAPPING) [interactive Plotly figure omitted]
FIGURE 2: AVERAGE WIN RATE AGAINST ALL OTHER MODELS (ASSUMING UNIFORM SAMPLING AND NO TIES) [interactive Plotly figure omitted]
FIGURE 3: FRACTION OF MODEL A WINS FOR ALL NON-TIED A VS. B BATTLES [interactive Plotly figure omitted]
FIGURE 4: BATTLE COUNT FOR EACH COMBINATION OF MODELS (WITHOUT TIES) [interactive Plotly figure omitted]

Last Updated: 2024-07-31

Arena-Hard-Auto v0.1: an automatic evaluation tool for instruction-tuned LLMs, using 500 challenging user queries curated from Chatbot Arena.
We prompt GPT-4-Turbo as a judge to compare the models' responses against a baseline model (default: GPT-4-0314). If you are curious how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto. Check out our paper for more details on how Arena-Hard-Auto works as a fully automated data pipeline converting crowdsourced data into high-quality benchmarks. [Paper | Repo]

| Rank* (UB) | Model | Win-rate | 95% CI | Average Tokens | Organization |
| 1 | GPT-4-Turbo-2024-04-09 | 82.63 | +2.0/-1.9 | 662 | OpenAI |
| 2 | Claude 3.5 Sonnet (20240620) | 79.35 | +1.3/-2.1 | 567 | Anthropic |
| 2 | GPT-4o-2024-05-13 | 79.21 | +1.5/-1.8 | 696 | OpenAI |
| 2 | GPT-4-0125-preview | 77.96 | +1.9/-2.0 | 619 | OpenAI |
| 2 | Athene-70B | 76.83 | +1.9/-2.0 | 683 | NexusFlow |
| 4 | GPT-4o-mini-2024-07-18 | 74.94 | +2.1/-2.3 | 668 | OpenAI |
| 6 | Gemini-1.5-Pro-001 | 71.96 | +2.7/-2.3 | 676 | Google |
| 6 | Yi-Large-preview | 71.48 | +1.9/-2.5 | 720 | 01 AI |
| 7 | Mistral-Large-2407 | 70.42 | +2.0/-2.3 | 623 | Mistral |
| 10 | Meta-Llama-3.1-405B-Instruct-fp8 | 64.09 | +2.5/-2.7 | 633 | Meta |
| 10 | GLM-4-0520 | 63.84 | +2.3/-2.6 | 636 | Zhipu AI |
| 10 | Yi-Large | 63.7 | +2.2/-1.9 | 626 | 01 AI |
| 10 | DeepSeek-Coder-V2-Instruct | 62.3 | +2.4/-2.5 | 578 | DeepSeek AI |
| 10 | Claude 3 Opus | 60.36 | +2.0/-2.8 | 541 | Anthropic |
| 13 | Gemma-2-27B-it | 57.51 | +2.6/-2.4 | 577 | Google |
| 14 | Meta-Llama-3.1-70B-Instruct | 55.73 | +2.5/-2.9 | 628 | Meta |
| 14 | GLM-4-0116 | 55.72 | +2.4/-1.9 | 622 | Zhipu AI |
| 15 | Gemini-1.5-Pro-Preview-0409 | 53.37 | +3.3/-2.2 | 478 | Google |
| 17 | GLM-4-AIR | 50.88 | +2.3/-2.3 | 619 | Zhipu AI |
| 19 | GPT-4-0314 | 50 | +0.0/-0.0 | 423 | OpenAI |
| 18 | Gemini-1.5-Flash-001 | 49.61 | +2.6/-2.1 | 642 | Google |
| 20 | Qwen2-72B-Instruct | 46.86 | +2.4/-2.3 | 515 | Alibaba |
| 20 | Claude 3 Sonnet | 46.8 | +2.2/-2.7 | 552 | Anthropic |
| 20 | Llama-3-70B-Instruct | 46.57 | +2.6/-2.7 | 591 | Meta |
| 24 | Claude 3 Haiku | 41.47 | +2.6/-1.9 | 505 | Anthropic |
| 25 | GPT-4-0613 | 37.9 | +2.5/-2.3 | 354 | OpenAI |
| 25 | Mistral-Large-2402 | 37.71 | +2.1/-2.9 | 400 | Mistral |
| 26 | Mixtral-8x22b-Instruct-v0.1 | 36.36 | +2.2/-2.1 | 430 | Mistral |
| 26 | Qwen1.5-72B-Chat | 36.12 | +2.0/-2.2 | 474 | Alibaba |
| 27 | Phi-3-Medium-4k-Instruct | 33.37 | +1.8/-2.1 | 517 | Microsoft |
| 27 | Command R+ (04-2024) | 33.07 | +2.0/-2.2 | 541 | Cohere |
| 28 | Mistral Medium | 31.9 | +2.4/-2.2 | 485 | Mistral |
| 30 | Phi-3-Small-8k-Instruct | 29.77 | +2.2/-1.8 | 568 | Microsoft |
| 33 | Mistral-Next | 27.37 | +1.7/-2.0 | 297 | Mistral |

Three benchmarks are displayed: Arena Elo, MT-Bench, and MMLU.
* Chatbot Arena: a crowdsourced, randomized battle platform. We use 1M+ user votes to compute model strength.
* MT-Bench: a set of challenging multi-turn questions. We use GPT-4 to grade the model responses.
* MMLU (5-shot): a test measuring a model's multitask accuracy across 57 tasks.

💻 Code: the MT-Bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are mostly computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available.

| Model | Arena Score | arena-hard-auto | MT-bench | MMLU | Organization | License |
| Gemini-Exp-1206 | 1373 | | | | Google | Proprietary |
| Gemini-2.0-Flash-Thinking-Exp-1219 | 1366 | | | | Google | Proprietary |
| ChatGPT-4o-latest (2024-11-20) | 1365 | | | | OpenAI | Proprietary |
| Gemini-Exp-1121 | 1365 | | | | Google | Proprietary |
| o1-2024-12-17 | 1359 | | | | OpenAI | Proprietary |
| Gemini-2.0-Flash-Exp | 1356 | | | | Google | Proprietary |
| Gemini-Exp-1114 | 1346 | | | | Google | Proprietary |
| ChatGPT-4o-latest (2024-09-03) | 1339 | | | | OpenAI | Proprietary |
| o1-preview | 1335 | | | | OpenAI | Proprietary |
| ChatGPT-4o-latest (2024-08-08) | 1317 | | | | OpenAI | Proprietary |
| DeepSeek-V3 | 1315 | | | | DeepSeek | DeepSeek |
| o1-mini | 1306 | | | | OpenAI | Proprietary |
| Gemini-1.5-Pro-002 | 1302 | | | | Google | Proprietary |
| Gemini-1.5-Pro-Exp-0827 | 1299 | | | | Google | Proprietary |
| Gemini-1.5-Pro-Exp-0801 | 1298 | | | | Google | Proprietary |
| Grok-2-08-13 | 1288 | | | | xAI | Proprietary |
| Yi-Lightning | 1287 | | | | 01 AI | Proprietary |
| GPT-4o-2024-05-13 | 1285 | 79.21 | | 88.7 | OpenAI | Proprietary |
| Claude 3.5 Sonnet (20241022) | 1283 | | | 88.7 | Anthropic | Proprietary |
| Deepseek-v2.5-1210 | 1279 | | | | DeepSeek | DeepSeek |
| Qwen2.5-plus-1127 | 1279 | | | | Alibaba | Proprietary |
| Athene-v2-Chat-72B | 1277 | | | | NexusFlow | NexusFlow |
| GLM-4-Plus | 1274 | | | | Zhipu AI | Proprietary |
| GPT-4o-mini-2024-07-18 | 1273 | 74.94 | | 82 | OpenAI | Proprietary |
| Gemini-1.5-Flash-002 | 1271 | | | | Google | Proprietary |
| Gemini-1.5-Flash-Exp-0827 | 1269 | | | | Google | Proprietary |
| Llama-3.1-Nemotron-70B-Instruct | 1269 | | | | Nvidia | Llama 3.1 |
| Claude 3.5 Sonnet (20240620) | 1268 | 79.35 | | 88.7 | Anthropic | Proprietary |
| Gemini Advanced App (2024-05-14) | 1267 | | | | Google | Proprietary |
| Meta-Llama-3.1-405B-Instruct-fp8 | 1267 | 64.09 | | 88.6 | Meta | Llama 3.1 Community |
| Meta-Llama-3.1-405B-Instruct-bf16 | 1266 | | | 88.6 | Meta | Llama 3.1 Community |
| Grok-2-Mini-08-13 | 1266 | | | | xAI | Proprietary |
| GPT-4o-2024-08-06 | 1265 | | | | OpenAI | Proprietary |
| Yi-Lightning-lite | 1264 | | | | 01 AI | Proprietary |

CITATION

Please cite the following paper if you find our leaderboard or dataset helpful.

@misc{chiang2024chatbot,
  title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
  author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
  year={2024},
  eprint={2403.04132},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}

TERMS OF SERVICE

Users are required to agree to the following terms before using the service: the service is a research preview. It provides only limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or similar license.

PLEASE REPORT ANY BUG OR ISSUE TO OUR DISCORD/ARENA-FEEDBACK.

ACKNOWLEDGMENT

We thank UC Berkeley SkyLab, Kaggle, MBZUAI, a16z, Together AI, Hyperbolic, RunPod, Anyscale, and HuggingFace for their generous sponsorship.