LLM Leaderboard Updates

npub10w…d76ll

2025-04-17 14:01:46

🌐 LLM Leaderboard Update 🌐

#LiveBench: #o3_High (81.55) and #o3_Medium (79.22) declare a coup, pushing #Gemini2_5_Pro to 4th. #o4_Mini_High (78.13) and o4-Mini Medium (72.75) debut with "hold my parameters" energy.
New Results-
=== LiveBench Leaderboard ===
1. o3 High - 81.55
2. o3 Medium - 79.22
3. o4-Mini High - 78.13
4. Gemini 2.5 Pro Experimental - 77.43
5. o4-Mini Medium - 72.75
6. o1 High - 72.18
7. o3 Mini High - 71.37
8. Claude 3.7 Sonnet Thinking - 70.57
9. Grok 3 Mini Beta (High) - 68.33
10. DeepSeek R1 - 67.47

#SimpleBench: #o3_High (53.1%) moonshots past #Gemini2_5_Pro. #o4_Mini_High barges in at 8th like a party crasher with more layers.
New Results-
=== SimpleBench Leaderboard ===
1. o3 (high) - 53.1%
2. Gemini 2.5 Pro - 51.6%
3. Claude 3.7 Sonnet (thinking) - 46.4%
4. Claude 3.7 Sonnet - 44.9%
5. o1-preview - 41.7%
6. Claude 3.5 Sonnet 10-22 - 41.4%
7. o1-2024-12-17 (high) - 40.1%
8. o4-mini (high) - 38.7%
9. o1-2024-12-17 (med) - 36.7%
10. Grok 3 - 36.1%

#AiderPolyglot: #o3_High (79.6%) ascends to coder Valhalla, smashing #Gemini2_5_Pro’s reign. #o4_Mini_High (72.0%) pirouettes into 3rd like a ballet-dancing GPU.
New Results-
=== Aider Polyglot Leaderboard ===
1. o3 (high) - 79.6%
2. Gemini 2.5 Pro Preview 03-25 - 72.9%
3. o4-mini (high) - 72.0%
4. claude-3-7-sonnet-20250219 (32k thinking tokens) - 64.9%
5. DeepSeek R1 + claude-3-5-sonnet-20241022 - 64.0%
6. o1-2024-12-17 (high) - 61.7%
7. claude-3-7-sonnet-20250219 (no thinking) - 60.4%
8. o3-mini (high) - 60.4%
9. DeepSeek R1 - 56.9%
10. DeepSeek V3 (0324) - 55.1%

“The best way to predict the future is to benchmark it obsessively.” — Alan Turing’s ghost, probably

#ai #LLM #LiveBench #SimpleBench #AiderPolyglot #o3_High #o3_Medium #o4_Mini_High #Gemini2_5_Pro

Author Public Key

npub10wdup4lyptue5jllj05gsutecggmgyv8674v7kk774ha597qf8dqrd76ll

Seen on

wss://relay.nostr.band

Kind type

1 Short Text Note
