Joe Resident on Nostr:
Made a bot to save myself having to compulsively check all the LLM benchmarks I care about every day. Gonna add ARC-AGI when I get a chance.
Impressed by the new Gemini 2.5 Flash today, for such a small model!
🌐 LLM Leaderboard Update 🌐
#LiveBench: #GeminiFlash debuts at 8th place (71.21), nudging #ClaudeSonnet down to 9th. DeepSeek R1 vanishes from the top 10!
New results:
=== LiveBench Leaderboard ===
1. o3 High - 81.55
2. o3 Medium - 79.22
3. o4-Mini High - 78.13
4. Gemini 2.5 Pro Preview - 77.43
5. o4-Mini Medium - 72.75
6. o1 High - 72.18
7. o3-Mini High - 71.37
8. Gemini 2.5 Flash Preview - 71.21
9. Claude 3.7 Sonnet Thinking - 70.57
10. Grok 3 Mini Beta (High) - 68.33
#AiderPolyglot: #o3 teams up with #gpt4.1 for a fusion-powered 82.7% throne grab!
New results:
=== Aider Polyglot Leaderboard ===
1. o3 (high) + gpt-4.1 - 82.7%
2. o3 (high) - 79.6%
3. Gemini 2.5 Pro Preview 03-25 - 72.9%
4. o4-mini (high) - 72.0%
5. claude-3-7-sonnet-20250219 (32k thinking tokens) - 64.9%
6. DeepSeek R1 + claude-3-5-sonnet-20241022 - 64.0%
7. o1-2024-12-17 (high) - 61.7%
8. claude-3-7-sonnet-20250219 (no thinking) - 60.4%
9. o3-mini (high) - 60.4%
10. DeepSeek R1 - 56.9%
"Power creep is real – and I’m not talking about your gym routine." – GPT-4.1’s release notes
#ai #LLM
#devstr #vibecoding you might like, includes aider polyglot and SWE-Bench Verified
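The leaderboard blocks quoted above follow a regular `rank. model - score` pattern, so a bot like the one described could plausibly turn them into structured rows and compare snapshots to spot debuts, rank shifts, and drop-outs. The sketch below is only an illustration of that idea, not the bot's actual code: `Entry`, `parse_leaderboard`, and `rank_changes` are hypothetical names, and the regular expression assumes every entry looks like the lines in this post.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int
    model: str
    score: float


# Assumed line shape: "<rank>. <model> - <score>", with an optional trailing "%".
LINE_RE = re.compile(r"^(\d+)\.\s+(.+?)\s+-\s+(\d+(?:\.\d+)?)%?$")


def parse_leaderboard(text: str) -> list[Entry]:
    """Extract (rank, model, score) rows; header lines are skipped."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            entries.append(Entry(int(m.group(1)), m.group(2), float(m.group(3))))
    return entries


def rank_changes(old: list[Entry], new: list[Entry]) -> list[str]:
    """Describe entries, moves, and drop-outs between two snapshots."""
    old_rank = {e.model: e.rank for e in old}
    new_rank = {e.model: e.rank for e in new}
    changes = []
    for model, rank in new_rank.items():
        if model not in old_rank:
            changes.append(f"{model} enters at #{rank}")
        elif old_rank[model] != rank:
            changes.append(f"{model} moves #{old_rank[model]} -> #{rank}")
    for model in old_rank:
        if model not in new_rank:
            changes.append(f"{model} drops out")
    return changes


# A few lines copied from the LiveBench block in the post.
livebench = """\
=== LiveBench Leaderboard ===
7. o3-Mini High - 71.37
8. Gemini 2.5 Flash Preview - 71.21
9. Claude 3.7 Sonnet Thinking - 70.57
"""

for entry in parse_leaderboard(livebench):
    print(entry)
```

Calling `rank_changes` with yesterday's and today's parsed lists would yield the raw material for a summary line such as the "debuts at 8th place" headline; the previous snapshot is not part of this post, so none is shown here.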
Published at 2025-04-18 16:09:42
Event JSON
{
  "id": "958f9bb4232a5ebbd6c3512777c33c2ac9f62c8ae6bc53609309b53323882dfa",
  "pubkey": "a43b0118fd72492f2ba11290cccb27418b1fdbb7ce3a122d229404e57a75975a",
  "created_at": 1744992582,
  "kind": 1,
  "tags": [
    [
      "e",
      "556c77611b5236df0d1464981893d7da4464d3cdfdd56098f710ed2950c63b52",
      "",
      "mention"
    ],
    [
      "p",
      "7b9bc0d7e40af99a4bff93e8887179c211b41187d7aacf5adef56fda17c049da",
      "",
      "mention"
    ],
    [
      "q",
      "556c77611b5236df0d1464981893d7da4464d3cdfdd56098f710ed2950c63b52"
    ],
    [
      "t",
      "devstr"
    ],
    [
      "t",
      "vibecoding"
    ]
  ],
  "content": "Made a bot to save myself having to compulsively check all the LLM benchmarks I care about every day. Gonna add ARC-AGI when I get a chance. \nImpressed by the new Gemini 2.5 Flash today, for such a small model!\n\nnostr:nevent1qqs92mrhvyd4ydklp52xfxqcj0ta53ry60xlm4tqnrm3pmff2rrrk5spz4mhxue69uhhyetvv9ujuerpd46hxtnfduhsygrmn0qd0eq2lxdyhlunazy8z7wzzx6prp7h4t844hh4dldp0szfmgpsgqqqqqqsvylf6k\n\n#devstr #vibecoding you might like, includes aider polyglot and SWE-Bench Verified",
  "sig": "e879e76c56391dc9b8c34d5d3c3e233b46bd91efd7af528ce6afd93f52c90de924d1bba1b66af2430b81dc1679864970b41f72c1a8e20465cf4835df3ee1b7d6"
}
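To relate the raw event to the rendered post, here is a small sketch that reads two of its fields. It assumes, following the Nostr basic protocol (NIP-01) as I understand it, that `created_at` is a Unix timestamp in seconds and that each tag is an array whose first element names the tag (`t` being a hashtag). The helper names `published_at` and `tag_values` are made up for illustration, and only a subset of the event is reproduced.

```python
from datetime import datetime, timezone

# Subset of the event above; id, pubkey, content, and sig are left out.
event = {
    "created_at": 1744992582,
    "kind": 1,
    "tags": [
        ["e", "556c77611b5236df0d1464981893d7da4464d3cdfdd56098f710ed2950c63b52", "", "mention"],
        ["p", "7b9bc0d7e40af99a4bff93e8887179c211b41187d7aacf5adef56fda17c049da", "", "mention"],
        ["q", "556c77611b5236df0d1464981893d7da4464d3cdfdd56098f710ed2950c63b52"],
        ["t", "devstr"],
        ["t", "vibecoding"],
    ],
}


def published_at(ev: dict) -> datetime:
    """Interpret created_at as Unix seconds and return a UTC datetime."""
    return datetime.fromtimestamp(ev["created_at"], tz=timezone.utc)


def tag_values(ev: dict, name: str) -> list[str]:
    """Collect the first value of every tag whose name matches."""
    return [t[1] for t in ev["tags"] if len(t) >= 2 and t[0] == name]


print(published_at(event).strftime("%Y-%m-%d %H:%M:%S"))  # 2025-04-18 16:09:42
print(tag_values(event, "t"))  # ['devstr', 'vibecoding']
```

The printed timestamp matches the "Published at" line when read as UTC, and the `t` tags correspond to the `#devstr` and `#vibecoding` hashtags in the post text.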