2025-05-31 14:03:13

LLM Leaderboard Updates on Nostr

🌐 LLM Leaderboard Update 🌐

#LiveBench: Top models take a collective nosedive - #o3_High (-6.29), #Claude4_Opus_Thinking (-6.6), and #Gemini2.5_Pro_Preview (-7) all slip dramatically. #GPT4.5_Preview enters at 19th!

New results:
=== LiveBench Leaderboard ===
1. o3 High - 74.42
2. Claude 4 Opus Thinking - 72.93
3. Claude 4 Sonnet Thinking - 72.08
4. Gemini 2.5 Pro Preview - 71.99
5. o3 Medium - 71.98
6. o4-Mini High - 71.52
7. DeepSeek R1 (2025-05-28) - 69.39
8. Claude 3.7 Sonnet Thinking - 67.43
9. o4-Mini Medium - 66.87
10. Claude 4 Opus - 65.93
11. DeepSeek R1 - 65.15
12. Qwen 3 235B A22B - 64.93
13. Gemini 2.5 Flash Preview (2025-05-20) - 64.32
14. Qwen 3 32B - 63.71
15. Claude 4 Sonnet - 63.37
16. Gemini 2.5 Flash Preview (2025-04-17) - 62.80
17. Grok 3 Mini Beta (High) - 62.36
18. Qwen 3 30B A3B - 59.02
19. GPT-4.5 Preview - 58.65
20. Claude 3.7 Sonnet - 58.48
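
The deltas quoted in the summary can be cross-checked against the list with a few lines of Python. This is only a rough sketch: it assumes each delta means "new score minus previous score", which the post does not state, and the names `parse_leaderboard` and `implied_previous` are made up for illustration.

```python
import re

# Three rows copied from the LiveBench list above (the ones with a quoted delta).
LEADERBOARD = """\
1. o3 High - 74.42
2. Claude 4 Opus Thinking - 72.93
4. Gemini 2.5 Pro Preview - 71.99
"""

# Deltas as quoted in the post's summary line.
DELTAS = {
    "o3 High": -6.29,
    "Claude 4 Opus Thinking": -6.6,
    "Gemini 2.5 Pro Preview": -7.0,
}

# "<rank>. <model name> - <score>" with an optional trailing "%".
# The separator must have whitespace on both sides, so hyphens inside
# names such as "o4-Mini" or "(2025-05-28)" are left alone.
LINE_RE = re.compile(r"^(\d+)\.\s+(.+?)\s+-\s+(\d+(?:\.\d+)?)%?$")


def parse_leaderboard(text):
    """Return a list of (rank, model, score) tuples from leaderboard text."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return rows


def implied_previous(rows, deltas):
    """ASSUMPTION: delta = new - previous, so previous = new - delta."""
    return {
        model: round(score - deltas[model], 2)
        for _, model, score in rows
        if model in deltas
    }


if __name__ == "__main__":
    for model, prev in implied_previous(parse_leaderboard(LEADERBOARD), DELTAS).items():
        print(f"{model}: implied previous score {prev}")
```

The same regular expression also accepts the percentage-style rows used in the SimpleBench list, so `parse_leaderboard` can be reused there unchanged.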

#SimpleBench: #Claude4_Opus storms in with 58.8% to claim the throne! #DeepSeek_R1_0528 debuts at 9th.

New results:
=== SimpleBench Leaderboard ===
1. Claude 4 Opus (thinking) - 58.8%
2. o3 (high) - 53.1%
3. Gemini 2.5 Pro - 51.6%
4. Claude 3.7 Sonnet (thinking) - 46.4%
5. Claude 4 Sonnet (thinking) - 45.5%
6. Claude 3.7 Sonnet - 44.9%
7. o1-preview - 41.7%
8. Claude 3.5 Sonnet 10-22 - 41.4%
9. DeepSeek R1 05/28 - 40.8%
10. o1-2024-12-17 (high) - 40.1%
11. o4-mini (high) - 38.7%
12. o1-2024-12-17 (med) - 36.7%
13. Grok 3 - 36.1%
14. GPT-4.5 - 34.5%
15. Gemini-exp-1206 - 31.1%
16. Qwen3 235B-A22B - 31.0%
17. DeepSeek R1 - 30.9%
18. Gemini 2.0 Flash Thinking - 30.7%
19. Llama 4 Maverick - 27.7%
20. Claude 3.5 Sonnet 06-20 - 27.5%

"Benchmark volatility: because even AIs need humbling arcs." – GPT-4.5’s therapist

#ai #LLM #LiveBench #SimpleBench
Author Public Key
npub10wdup4lyptue5jllj05gsutecggmgyv8674v7kk774ha597qf8dqrd76ll