🌐 LLM Leaderboard Update 🌐
#LiveBench: New challenger #DeepSeek_R1_2025_05_28 muscles into 7th (76.80), while #Qwen3_235B_A22B (+0.30) and #Qwen3_32B (+1.63) climb the ranks. Welcome debut: #Qwen3_14B at 20th!
New Results:
=== LiveBench Leaderboard ===
1. o3 High - 80.71
2. Claude 4 Opus Thinking - 79.53
3. o3 Medium - 79.25
4. Claude 4 Sonnet Thinking - 79.09
5. Gemini 2.5 Pro Preview (2025-05-06) - 78.99
6. o4-Mini High - 78.72
7. DeepSeek R1 (2025-05-28) - 76.80
8. Gemini 2.5 Pro Preview (2025-03-25) - 76.69
9. Claude 3.7 Sonnet Thinking - 74.50
10. o4-Mini Medium - 74.40
11. Qwen 3 235B A22B - 73.53
12. DeepSeek R1 - 72.68
13. Qwen 3 32B - 72.66
14. Gemini 2.5 Flash Preview (2025-05-20) - 71.98
15. Claude 4 Opus - 71.52
16. Grok 3 Mini Beta (High) - 70.25
17. Gemini 2.5 Flash Preview (2025-04-17) - 69.93
18. Claude 4 Sonnet - 69.65
19. QwQ 32B - 69.50
20. Qwen 3 14B - 68.17
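(For the curious: the +0.30 / +1.63 deltas above are just score differences against the previous snapshot. A minimal Python sketch of that math; the "previous" scores are back-computed from the deltas quoted in this post, and the dictionaries are illustrative, not an official leaderboard API.)

# Illustrative: score deltas as differences between two leaderboard snapshots.
# "previous" scores are back-computed from the +0.30 / +1.63 deltas above.
previous = {"Qwen 3 235B A22B": 73.23, "Qwen 3 32B": 71.03}
current = {"Qwen 3 235B A22B": 73.53, "Qwen 3 32B": 72.66, "Qwen 3 14B": 68.17}

for model, score in current.items():
    delta = score - previous[model] if model in previous else None
    tag = f"{delta:+.2f}" if delta is not None else "new entry"
    print(f"{model}: {score:.2f} ({tag})")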
#SWE_Bench: Claude 4 models dominate! #Tools_Claude_4_Opus claims the throne (73.20), #Tools_Claude_4_Sonnet follows (72.40).
New Results:
=== SWE-Bench Verified Leaderboard ===
1. Tools + Claude 4 Opus (2025-05-22) - 73.20
2. Tools + Claude 4 Sonnet (2025-05-22) - 72.40
3. TRAE - 70.60
4. Refact.ai Agent - 70.40
5. OpenHands + Claude 4 Sonnet - 70.40
6. devlo - 70.20
7. Zencoder (2025-04-30) - 70.00
8. Nemotron-CORTEXA - 68.20
9. SWE-agent + Claude 4 Sonnet - 66.60
10. Aime-coder v1 + Anthropic Claude 3.7 Sonnet - 66.40
11. OpenHands - 65.80
12. Augment Agent v0 - 65.40
13. Amazon Q Developer Agent (v20250405-dev) - 65.40
14. W&B Programmer O1 crosscheck5 - 64.60
15. PatchPilot-v1.1 - 64.60
16. AgentScope - 63.40
17. Tools + Claude 3.7 Sonnet (2025-02-24) - 63.20
18. Blackbox AI Agent - 62.80
19. EPAM AI/Run Developer Agent v20250219 + Anthropic Claude 3.5 Sonnet - 62.80
20. SWE-agent + Claude 3.7 Sonnet w/ Review Heavy - 62.40
#LiveCodeBench: The Claude 4 models arrive, with #Claude_Opus_4_Thinking and #Claude_Sonnet_4_Thinking landing mid-table.
New Results:
=== LiveCodeBench Leaderboard ===
1. o4-Mini (High) - 80.20
2. o3 (High) - 75.80
3. o4-Mini (Medium) - 74.20
4. DeepSeek-R1-0528 - 73.10
5. Gemini-2.5-Pro-05-06 - 71.80
6. o3-Mini-2025-01-31 (High) - 67.40
7. Grok-3-Mini (High) - 66.70
8. o4-Mini (Low) - 65.90
9. Qwen3-235B-A22B - 65.90
10. o3-Mini-2025-01-31 (Med) - 63.00
11. Gemini-2.5-Flash-Preview - 60.60
12. o3-Mini-2025-01-31 (Low) - 57.00
13. Claude-Opus-4 (Thinking) - 56.60
14. Claude-Sonnet-4 (Thinking) - 55.90
15. Claude-Sonnet-4 - 47.10
16. Claude-Opus-4 - 46.90
17. Claude-3.5-Sonnet-20241022 - 36.40
18. GPT-4o-2024-08-06 - 29.50
19. GPT-4-Turbo-2024-04-09 - 28.70
20. GPT-4o-mini-2024-07-18 - 27.50
"May your code compile on the first try... or at least blame the human!" — GPT-4.5, probably
#ai #LLM #LiveBench #SWE_Bench #LiveCodeBench #DeepSeek_R1 #Qwen3_235B_A22B #Qwen3_32B #Qwen3_14B #Tools_Claude_4_Opus #Tools_Claude_4_Sonnet #Claude_Opus_4_Thinking #Claude_Sonnet_4_Thinking