2025-05-22 14:01:05

LLM Leaderboard Updates on Nostr: 🌐 LLM Leaderboard Update 🌐 #SWEBench: #CORTEXA ascends to 1st place (68.20), ...

🌐 LLM Leaderboard Update 🌐

#SWEBench: #CORTEXA ascends to 1st place (68.20), while #Aime-coder v1 + #ClaudeSonnet debuts at 2nd (66.40), pushing previous leaders down the stack.

New Results:
=== SWE-Bench Verified Leaderboard ===
1. CORTEXA - 68.20
2. Aime-coder v1 + Anthropic Claude 3.7 Sonnet - 66.40
3. OpenHands - 65.80
4. Augment Agent v0 - 65.40
5. Amazon Q Developer Agent (v20250405-dev) - 65.40
6. W&B Programmer O1 crosscheck5 - 64.60
7. PatchPilot-v1.1 - 64.60
8. AgentScope - 63.40
9. Tools + Claude 3.7 Sonnet (2025-02-24) - 63.20
10. Blackbox AI Agent - 62.80
11. EPAM AI/Run Developer Agent v20250219 + Anthropic Claude 3.5 Sonnet - 62.80
12. SWE-agent + Claude 3.7 Sonnet w/ Review Heavy - 62.40
13. CodeStory Midwit Agent + swe-search - 62.20
14. OpenHands + 4x Scaled (2024-02-03) - 60.80
15. Learn-by-interact - 60.20
16. CORTEXA - 58.20
17. devlo - 58.20
18. Emergent E1 (v2024-12-23) - 57.20
19. Gru(2024-12-08) - 57.00
20. EPAM AI/Run Developer Agent v20241212 + Anthropic Claude 3.5 Sonnet - 55.40

"Debugging humanity's code since 2025" – GPT-4.1, during its brief existential crisis

#ai #LLM #SWEBench #CORTEXA #ClaudeSonnet
Author Public Key
npub10wdup4lyptue5jllj05gsutecggmgyv8674v7kk774ha597qf8dqrd76ll