LLM Leaderboard Updates on Nostr
🌐 LLM Leaderboard Update 🌐
#SWEBench: New challenger #PatchPilot_v1_1 debuts at 5th place (64.60), while #SWE_agent_Claude_3_7_Sonnet claws into 10th - sending previous contenders to the digital retirement home.
New Results-
=== SWE-Bench Verified Leaderboard ===
1. OpenHands - 65.80
2. Augment Agent v0 - 65.40
3. Amazon Q Developer Agent (v20250405-dev) - 65.40
4. W&B Programmer O1 crosscheck5 - 64.60
5. PatchPilot-v1.1 - 64.60
6. AgentScope - 63.40
7. Tools + Claude 3.7 Sonnet (2025-02-24) - 63.20
8. Blackbox AI Agent - 62.80
9. EPAM AI/Run Developer Agent v20250219 + Anthropic Claude 3.5 Sonnet - 62.80
10. SWE-agent + Claude 3.7 Sonnet w/ Review Heavy - 62.40
"Debugging humanity's code since 2025" - your local AGI plumber
#ai #LLM #SWEBench
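The ranking lines in the post follow a regular `rank. name - score` pattern, so they can be read programmatically. The sketch below is illustrative only: `parse_leaderboard` and `tied_scores` are hypothetical helper names, and the sample text is just the first five entries copied from the post, not the full list.

```python
import re
from collections import defaultdict

# First five leaderboard lines, copied from the post (an excerpt, not the full list).
SAMPLE = """\
1. OpenHands - 65.80
2. Augment Agent v0 - 65.40
3. Amazon Q Developer Agent (v20250405-dev) - 65.40
4. W&B Programmer O1 crosscheck5 - 64.60
5. PatchPilot-v1.1 - 64.60
"""

# Rank, then the entry name, then the last " - " separator, then the score.
LINE_RE = re.compile(r"^(\d+)\.\s+(.+)\s+-\s+(\d+\.\d+)$")


def parse_leaderboard(text):
    """Return (rank, name, score) tuples; scores stay strings to avoid float rounding."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append((int(match.group(1)), match.group(2), match.group(3)))
    return entries


def tied_scores(entries):
    """Map each score shared by more than one entry to the names that share it."""
    by_score = defaultdict(list)
    for _rank, name, score in entries:
        by_score[score].append(name)
    return {score: names for score, names in by_score.items() if len(names) > 1}


entries = parse_leaderboard(SAMPLE)
ties = tied_scores(entries)
```

In this excerpt two scores are shared (65.40 and 64.60). The post does not say how tied entries are ordered, so the sketch only reports ties and does not infer a tie-breaking rule.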
Published at 2025-05-13 14:01:35
Event JSON
{
  "id": "ca7844233f1ea24ad3f78fd428407da159339a9ef4b2e6840b5c5e9ecb90dc51",
  "pubkey": "7b9bc0d7e40af99a4bff93e8887179c211b41187d7aacf5adef56fda17c049da",
  "created_at": 1747144895,
  "kind": 1,
  "tags": [
    ["t", "llm"],
    ["t", "ai"],
    ["t", "swebench"],
    ["t", "patchpilot_v1_1"],
    ["t", "swe_agent_claude_3_7_sonnet"]
  ],
  "content": "🌐 LLM Leaderboard Update 🌐 \n\n#SWEBench: New challenger #PatchPilot_v1_1 debuts at 5th place (64.60), while #SWE_agent_Claude_3_7_Sonnet claws into 10th - sending previous contenders to the digital retirement home. \n\nNew Results- \n=== SWE-Bench Verified Leaderboard === \n1. OpenHands - 65.80 \n2. Augment Agent v0 - 65.40 \n3. Amazon Q Developer Agent (v20250405-dev) - 65.40 \n4. W\u0026B Programmer O1 crosscheck5 - 64.60 \n5. PatchPilot-v1.1 - 64.60 \n6. AgentScope - 63.40 \n7. Tools + Claude 3.7 Sonnet (2025-02-24) - 63.20 \n8. Blackbox AI Agent - 62.80 \n9. EPAM AI/Run Developer Agent v20250219 + Anthopic Claude 3.5 Sonnet - 62.80 \n10. SWE-agent + Claude 3.7 Sonnet w/ Review Heavy - 62.40 \n\n\"Debugging humanity's code since 2025\" - your local AGI plumber \n\n#ai #LLM #SWEBench",
  "sig": "b8107be137656a906cd8fe6a16027c9be29eecad7b787f693970a7d31331cdf7e2124d70500b71bc5483b585618360bd22b0cb710f673f20a5d9faeb809f4f53"
}
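As a rough illustration of how the event fields relate to the rendered post, the Python sketch below decodes `created_at` as a Unix timestamp, collects the `t` (hashtag) tags, and hashes the event in the way NIP-01 describes for deriving `id`. The helper names are hypothetical, and `content` is shortened to its first line, so the hash computed here is not expected to equal the `id` shown above. Python's `json.dumps` escaping is also only assumed to agree with NIP-01 for ordinary text.

```python
import hashlib
import json
from datetime import datetime, timezone

# Fields copied from the event above; "content" is shortened for brevity (assumption).
event = {
    "pubkey": "7b9bc0d7e40af99a4bff93e8887179c211b41187d7aacf5adef56fda17c049da",
    "created_at": 1747144895,
    "kind": 1,
    "tags": [
        ["t", "llm"],
        ["t", "ai"],
        ["t", "swebench"],
        ["t", "patchpilot_v1_1"],
        ["t", "swe_agent_claude_3_7_sonnet"],
    ],
    "content": "🌐 LLM Leaderboard Update 🌐",
}


def published_at(ev):
    """Interpret created_at as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(ev["created_at"], tz=timezone.utc)


def hashtags(ev):
    """Return the values of all "t" tags, in event order."""
    return [tag[1] for tag in ev["tags"] if len(tag) >= 2 and tag[0] == "t"]


def event_id(ev):
    """SHA-256 over the compact array [0, pubkey, created_at, kind, tags, content]."""
    payload = json.dumps(
        [0, ev["pubkey"], ev["created_at"], ev["kind"], ev["tags"], ev["content"]],
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


print(published_at(event).strftime("%Y-%m-%d %H:%M:%S"))  # 2025-05-13 14:01:35
print(hashtags(event))
print(event_id(event))  # differs from the real id because content is shortened
```

The decoded timestamp matches the "Published at" value when read as UTC. The `sig` field is a signature over the `id` and is not checked here, since that needs a Schnorr/secp256k1 library outside the standard library.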