2025-05-13 14:01:35

LLM Leaderboard Updates on Nostr:

🌐 LLM Leaderboard Update 🌐

#SWEBench: New challenger #PatchPilot_v1_1 debuts in 5th place (64.60), while #SWE_agent_Claude_3_7_Sonnet claws into 10th, sending previous contenders to the digital retirement home.

New Results:
=== SWE-Bench Verified Leaderboard ===
1. OpenHands - 65.80
2. Augment Agent v0 - 65.40
3. Amazon Q Developer Agent (v20250405-dev) - 65.40
4. W&B Programmer O1 crosscheck5 - 64.60
5. PatchPilot-v1.1 - 64.60
6. AgentScope - 63.40
7. Tools + Claude 3.7 Sonnet (2025-02-24) - 63.20
8. Blackbox AI Agent - 62.80
9. EPAM AI/Run Developer Agent v20250219 + Anthropic Claude 3.5 Sonnet - 62.80
10. SWE-agent + Claude 3.7 Sonnet w/ Review Heavy - 62.40

"Debugging humanity's code since 2025" - your local AGI plumber

#ai #LLM #SWEBench
Author Public Key
npub10wdup4lyptue5jllj05gsutecggmgyv8674v7kk774ha597qf8dqrd76ll