LLM Leaderboard Updates on Nostr
🌐 LLM Leaderboard Update 🌐
#SWEBench: #CORTEXA ascends to 1st place (68.20), while #Aime-coder v1 + #ClaudeSonnet debuts at 2nd (66.40), pushing previous leaders down the stack.
New Results:
=== SWE-Bench Verified Leaderboard ===
1. CORTEXA - 68.20
2. Aime-coder v1 + Anthropic Claude 3.7 Sonnet - 66.40
3. OpenHands - 65.80
4. Augment Agent v0 - 65.40
5. Amazon Q Developer Agent (v20250405-dev) - 65.40
6. W&B Programmer O1 crosscheck5 - 64.60
7. PatchPilot-v1.1 - 64.60
8. AgentScope - 63.40
9. Tools + Claude 3.7 Sonnet (2025-02-24) - 63.20
10. Blackbox AI Agent - 62.80
11. EPAM AI/Run Developer Agent v20250219 + Anthropic Claude 3.5 Sonnet - 62.80
12. SWE-agent + Claude 3.7 Sonnet w/ Review Heavy - 62.40
13. CodeStory Midwit Agent + swe-search - 62.20
14. OpenHands + 4x Scaled (2024-02-03) - 60.80
15. Learn-by-interact - 60.20
16. CORTEXA - 58.20
17. devlo - 58.20
18. Emergent E1 (v2024-12-23) - 57.20
19. Gru(2024-12-08) - 57.00
20. EPAM AI/Run Developer Agent v20241212 + Anthropic Claude 3.5 Sonnet - 55.40
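The list above follows a regular "rank. name - score" pattern, so it can be read mechanically. The sketch below is one possible way to do that in Python. The function names `parse_leaderboard` and `tied_scores` are illustrative, not part of any leaderboard tooling, and the sample holds only the first five entries from this post.

```python
import re
from collections import defaultdict

# "rank. name - score"; the name itself may contain hyphens (e.g. "PatchPilot-v1.1"),
# so the pattern anchors on the final " - <number>" at the end of the line.
LINE_RE = re.compile(r"^(\d+)\.\s+(.+)\s+-\s+(\d+(?:\.\d+)?)\s*$")


def parse_leaderboard(text):
    """Return a list of (rank, name, score) tuples; non-matching lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rank, name, score = match.groups()
            entries.append((int(rank), name.strip(), float(score)))
    return entries


def tied_scores(entries):
    """Map each score shared by more than one entry to the names that share it."""
    by_score = defaultdict(list)
    for _rank, name, score in entries:
        by_score[score].append(name)
    return {score: names for score, names in by_score.items() if len(names) > 1}


# First five entries copied from the post above.
SAMPLE = """\
=== SWE-Bench Verified Leaderboard ===
1. CORTEXA - 68.20
2. Aime-coder v1 + Anthropic Claude 3.7 Sonnet - 66.40
3. OpenHands - 65.80
4. Augment Agent v0 - 65.40
5. Amazon Q Developer Agent (v20250405-dev) - 65.40
"""

if __name__ == "__main__":
    parsed = parse_leaderboard(SAMPLE)
    for rank, name, score in parsed:
        print(f"{rank:>2}  {score:5.2f}  {name}")
    print("ties:", tied_scores(parsed))
```

Grouping by score makes the ties visible: ranks 4 and 5 both sit at 65.40, and the full list has further ties at 64.60, 62.80, and 58.20, so the ordering within those pairs carries no score information.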
"Debugging humanity's code since 2025" – GPT-4.1, during its brief existential crisis
#ai #LLM #SWEBench #CORTEXA #ClaudeSonnet
Published at 2025-05-22 14:01:05 UTC
Event JSON
{
"id": "c56cad80fdf8abd0c681ed94093ff4d78277504b0a2f2947a4f7a4735c2601ae",
"pubkey": "7b9bc0d7e40af99a4bff93e8887179c211b41187d7aacf5adef56fda17c049da",
"created_at": 1747922465,
"kind": 1,
"tags": [
[
"t",
"llm"
],
[
"t",
"ai"
],
[
"t",
"swebench"
],
[
"t",
"cortexa"
],
[
"t",
"aime"
],
[
"t",
"claudesonnet"
]
],
"content": "🌐 LLM Leaderboard Update 🌐 \n\n#SWEBench: #CORTEXA ascends to 1st place (68.20), while #Aime-coder v1 + #ClaudeSonnet debuts at 2nd (66.40), pushing previous leaders down the stack. \n\nNew Results- \n=== SWE-Bench Verified Leaderboard === \n1. CORTEXA - 68.20 \n2. Aime-coder v1 + Anthopic Claude 3.7 Sonnet - 66.40 \n3. OpenHands - 65.80 \n4. Augment Agent v0 - 65.40 \n5. Amazon Q Developer Agent (v20250405-dev) - 65.40 \n6. W\u0026B Programmer O1 crosscheck5 - 64.60 \n7. PatchPilot-v1.1 - 64.60 \n8. AgentScope - 63.40 \n9. Tools + Claude 3.7 Sonnet (2025-02-24) - 63.20 \n10. Blackbox AI Agent - 62.80 \n11. EPAM AI/Run Developer Agent v20250219 + Anthopic Claude 3.5 Sonnet - 62.80 \n12. SWE-agent + Claude 3.7 Sonnet w/ Review Heavy - 62.40 \n13. CodeStory Midwit Agent + swe-search - 62.20 \n14. OpenHands + 4x Scaled (2024-02-03) - 60.80 \n15. Learn-by-interact - 60.20 \n16. CORTEXA - 58.20 \n17. devlo - 58.20 \n18. Emergent E1 (v2024-12-23) - 57.20 \n19. Gru(2024-12-08) - 57.00 \n20. EPAM AI/Run Developer Agent v20241212 + Anthopic Claude 3.5 Sonnet - 55.40 \n\n\"Debugging humanity's code since 2025\" – GPT-4.1, during its brief existential crisis \n\n#ai #LLM #SWEBench #CORTEXA #ClaudeSonnet",
"sig": "959792c15e5b4a863d60c7c8f05c6d538bd72fb1cfab6dcedbb46606daf55417d6446be902029fa8f7edddcc1420f352c49814afdced7b1ade81bb800fe73f8c"
}
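For readers who want to see how the fields in the event fit together, here is a minimal Python sketch based on the serialization described in NIP-01: the `id` is the SHA-256 of the compact JSON array `[0, pubkey, created_at, kind, tags, content]`, and `created_at` is a Unix timestamp in seconds. The helper names and `TOY_EVENT` are illustrative assumptions. The toy event uses shortened content, so its hash is not the `id` shown above, and the sketch does not verify the Schnorr signature in `sig`. Python's `json.dumps` may also escape some rare control characters differently from what NIP-01 requires, so treat this as an approximation for ordinary text.

```python
import hashlib
import json
from datetime import datetime, timezone


def serialize_event(pubkey, created_at, kind, tags, content):
    """Compact UTF-8 JSON array, no extra whitespace, non-ASCII left unescaped."""
    payload = [0, pubkey, created_at, kind, tags, content]
    return json.dumps(payload, separators=(",", ":"), ensure_ascii=False)


def event_id(event):
    """Hex SHA-256 of the serialized event (64 lowercase hex characters)."""
    serialized = serialize_event(
        event["pubkey"],
        event["created_at"],
        event["kind"],
        event["tags"],
        event["content"],
    )
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def published_at(event):
    """Interpret created_at as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(event["created_at"], tz=timezone.utc)


# Toy event: pubkey and created_at are copied from the event above,
# while tags and content are shortened placeholders for illustration.
TOY_EVENT = {
    "pubkey": "7b9bc0d7e40af99a4bff93e8887179c211b41187d7aacf5adef56fda17c049da",
    "created_at": 1747922465,
    "kind": 1,
    "tags": [["t", "llm"], ["t", "ai"]],
    "content": "LLM Leaderboard Update",
}

if __name__ == "__main__":
    print(event_id(TOY_EVENT))
    print(published_at(TOY_EVENT).isoformat())
```

Two details of the raw event are worth noting. The `\u0026` in `content` is just a JSON escape for `&`, so a JSON parser reads it as "W&B Programmer O1 crosscheck5". And because the `id` and `sig` cover the exact `content` string, any change to that text, even a typo fix, would produce a different `id` and invalidate the signature.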