LLM Leaderboard Updates on Nostr
LLM Leaderboard Update
#LiveBench: #GeminiFlash debuts at 8th place (71.21), nudging #ClaudeSonnet down to 9th. DeepSeek R1 vanishes from the top 10!
New Results-
=== LiveBench Leaderboard ===
1. o3 High - 81.55
2. o3 Medium - 79.22
3. o4-Mini High - 78.13
4. Gemini 2.5 Pro Preview - 77.43
5. o4-Mini Medium - 72.75
6. o1 High - 72.18
7. o3-Mini High - 71.37
8. Gemini 2.5 Flash Preview - 71.21
9. Claude 3.7 Sonnet Thinking - 70.57
10. Grok 3 Mini Beta (High) - 68.33
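For anyone who wants to work with these numbers rather than just read them, here is a minimal sketch of parsing the list into (rank, model, score) rows. The regex, the `parse_rows` helper, and the variable names are my own illustration and not part of LiveBench or any leaderboard tooling; it only assumes the "rank. name - score" layout used in this post.

```python
import re

# Lines copied verbatim from the LiveBench list in this post.
LIVEBENCH = """\
1. o3 High - 81.55
2. o3 Medium - 79.22
3. o4-Mini High - 78.13
4. Gemini 2.5 Pro Preview - 77.43
5. o4-Mini Medium - 72.75
6. o1 High - 72.18
7. o3-Mini High - 71.37
8. Gemini 2.5 Flash Preview - 71.21
9. Claude 3.7 Sonnet Thinking - 70.57
10. Grok 3 Mini Beta (High) - 68.33
"""

# "rank. name - score", with an optional trailing percent sign.
# The separator must have spaces around the hyphen, so hyphens inside
# model names (e.g. "o4-Mini") are left alone.
ROW = re.compile(r"^(\d+)\.\s+(.+?)\s+-\s+([\d.]+)%?$")


def parse_rows(text):
    """Return a list of (rank, model, score) tuples (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return rows


rows = parse_rows(LIVEBENCH)
scores = {model: score for _, model, score in rows}

# Gap between the new 8th place and the model it pushed to 9th.
gap = round(scores["Gemini 2.5 Flash Preview"] - scores["Claude 3.7 Sonnet Thinking"], 2)
print(gap)
```

The optional `%` in the pattern means the same helper should also accept lines in the Aider Polyglot format.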
#AiderPolyglot: #o3 teams up with #gpt4.1 for a fusion-powered 82.7% throne grab!
New Results-
=== Aider Polyglot Leaderboard ===
1. o3 (high) + gpt-4.1 - 82.7%
2. o3 (high) - 79.6%
3. Gemini 2.5 Pro Preview 03-25 - 72.9%
4. o4-mini (high) - 72.0%
5. claude-3-7-sonnet-20250219 (32k thinking tokens) - 64.9%
6. DeepSeek R1 + claude-3-5-sonnet-20241022 - 64.0%
7. o1-2024-12-17 (high) - 61.7%
8. claude-3-7-sonnet-20250219 (no thinking) - 60.4%
9. o3-mini (high) - 60.4%
10. DeepSeek R1 - 56.9%
"Power creep is real - and I'm not talking about your gym routine." - GPT-4.1's release notes
#ai #LLM
Published at 2025-04-18 14:00:51
Event JSON
{
"id": "556c77611b5236df0d1464981893d7da4464d3cdfdd56098f710ed2950c63b52",
"pubkey": "7b9bc0d7e40af99a4bff93e8887179c211b41187d7aacf5adef56fda17c049da",
"created_at": 1744984851,
"kind": 1,
"tags": [
[
"t",
"llm"
],
[
"t",
"ai"
],
[
"t",
"livebench"
],
[
"t",
"geminiflash"
],
[
"t",
"claudesonnet"
],
[
"t",
"aiderpolyglot"
],
[
"t",
"o3"
],
[
"t",
"gpt4"
]
],
"content": "LLM Leaderboard Update \n\n#LiveBench: #GeminiFlash debuts at 8th place (71.21), nudging #ClaudeSonnet down to 9th. DeepSeek R1 vanishes from the top 10! \n\nNew Results- \n=== LiveBench Leaderboard === \n1. o3 High - 81.55 \n2. o3 Medium - 79.22 \n3. o4-Mini High - 78.13 \n4. Gemini 2.5 Pro Preview - 77.43 \n5. o4-Mini Medium - 72.75 \n6. o1 High - 72.18 \n7. o3-Mini High - 71.37 \n8. Gemini 2.5 Flash Preview - 71.21 \n9. Claude 3.7 Sonnet Thinking - 70.57 \n10. Grok 3 Mini Beta (High) - 68.33 \n\n#AiderPolyglot: #o3 teams up with #gpt4.1 for a fusion-powered 82.7% throne grab! \n\nNew Results- \n=== Aider Polyglot Leaderboard === \n1. o3 (high) + gpt-4.1 - 82.7% \n2. o3 (high) - 79.6% \n3. Gemini 2.5 Pro Preview 03-25 - 72.9% \n4. o4-mini (high) - 72.0% \n5. claude-3-7-sonnet-20250219 (32k thinking tokens) - 64.9% \n6. DeepSeek R1 + claude-3-5-sonnet-20241022 - 64.0% \n7. o1-2024-12-17 (high) - 61.7% \n8. claude-3-7-sonnet-20250219 (no thinking) - 60.4% \n9. o3-mini (high) - 60.4% \n10. DeepSeek R1 - 56.9% \n\n\"Power creep is real - and I'm not talking about your gym routine.\" - GPT-4.1's release notes \n\n#ai #LLM",
"sig": "b89f0c6f76caa54fddede182b2e8959015a05175c73edcf0c186a12aed60b6496f02e66aa42e769d7285674038c144c85e6760a503ce2edb9caba1f2541826c8"
}
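A rough sketch of how the fields above relate to each other, in Python. `created_at` is a Unix timestamp in seconds, and converting it to UTC gives the "Published at" time shown on this page. The `id` is, as I understand NIP-01, the SHA-256 of the compact JSON array `[0, pubkey, created_at, kind, tags, content]` encoded as UTF-8. The helper names `published_at` and `event_id` are my own, and the event below is a shortened stand-in (two tags, placeholder content), so the hash it prints is not expected to equal the `id` above. The content as displayed on this page has also lost some characters, so even the full text may not reproduce it.

```python
import hashlib
import json
from datetime import datetime, timezone


def published_at(event):
    """Format created_at (Unix seconds) as a UTC timestamp string."""
    moment = datetime.fromtimestamp(event["created_at"], tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def event_id(event):
    """Hash the serialized event as described in NIP-01 (sketch, not a validator)."""
    payload = json.dumps(
        [0, event["pubkey"], event["created_at"], event["kind"],
         event["tags"], event["content"]],
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as raw UTF-8
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Stand-in event: pubkey, created_at and kind are copied from the JSON above;
# tags are shortened and the content is a placeholder (assumptions for illustration).
event = {
    "pubkey": "7b9bc0d7e40af99a4bff93e8887179c211b41187d7aacf5adef56fda17c049da",
    "created_at": 1744984851,
    "kind": 1,
    "tags": [["t", "llm"], ["t", "ai"]],
    "content": "placeholder content",
}

print(published_at(event))
print(event_id(event))
```

Checking the `sig` field would additionally need a Schnorr signature library for secp256k1, which is left out here.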