sudocarlos on Nostr:
Shocker, big ai is throwing their dick around
> We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time.
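
A rough way to see why "choose the best score" biases a leaderboard is a toy simulation. This is not the paper's method: it assumes every private variant has the same true skill and that each measured score is that skill plus Gaussian noise. The `true_score` and `noise_sd` values are made up for illustration; only the 27 comes from the quote.

```python
import random
import statistics


def reported_score(n_variants, rng, true_score=1200.0, noise_sd=15.0):
    """Best measured score among n_variants equally skilled variants.

    Assumption (hypothetical): each measurement is true_score plus
    Gaussian noise; only the maximum is disclosed publicly.
    """
    return max(rng.gauss(true_score, noise_sd) for _ in range(n_variants))


def mean_reported(n_variants, trials=5000, seed=0):
    """Average disclosed score over many simulated release cycles."""
    rng = random.Random(seed)
    return statistics.fmean(
        reported_score(n_variants, rng) for _ in range(trials)
    )


if __name__ == "__main__":
    # 27 is the variant count mentioned in the quote; the rest are arbitrary.
    for n in (1, 3, 10, 27):
        print(f"{n:>2} variants -> mean disclosed score {mean_reported(n):.1f}")
```

Under these assumptions the disclosed score climbs with the number of private variants even though no variant is actually better, which is the selection effect the abstract describes.
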
https://arxiv.org/pdf/2504.20879
Published at 2025-05-02 13:06:40
Event JSON
{
"id": "7ab9d721533c1806c1a0561f0ab355b4eaf5bad9273d19fe5c2e01388d52a7bf",
"pubkey": "03612b0ebae0ec8d30031c440ba087ff9bd162962dffba4b6e021ec4afd71216",
"created_at": 1746191200,
"kind": 1,
"tags": [
[
"client",
"Nostur",
"31990:9be0be0fc079548233231614e4e1efc9f28b0db398011efeecf05fe570e5dd33:1685868693432"
]
],
"content": "Shocker, big ai is throwing their dick around\n\n\u003e We find that undisclosed\nprivate testing practices benefit a handful of providers who are able to test multiple variants before\npublic release and retract scores if desired. We establish that the ability of these providers to choose\nthe best score leads to biased Arena scores due to selective disclosure of performance results. At an\nextreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release.\nWe also establish that proprietary closed models are sampled at higher rates (number of battles) and\nhave fewer models removed from the arena than open-weight and open-source alternatives. Both\nthese policies lead to large data access asymmetries over time.\n\nhttps://arxiv.org/pdf/2504.20879",
"sig": "26e7282788275e9e693edf7db8014ead30420d10b6f14ebbeeff9e854195573b034b37f956e951fee79ab92350a543361b3501f4839cee74495adb181c33f653"
}