2024-12-22 04:19:56

@gurupanguji on Nostr:

ARC-AGI is a genuine AGI test but o3 cheated 🙁 — LessWrong:

They ask “how much of the performance is due to ARC-AGI data.” Probably most of it. If untuned o3 can do as well, don’t you think OpenAI would publish that (in addition to tuned o3)?

By training o3 on the public training set, ARC-AGI is no longer an AGI test. It becomes yet another test of memorizing rules from training data. This is still impressive, but it is something else.

I do not know what exactly OpenAI did. Did they let o3 spend a long time generating chains of thought, and reward ones which led to the correct answer? If that failed, did they have a human give examples of correct reasoning steps, and train it on that first? I don’t know.

They admitted they “cheated,” without saying how they “cheated.”

And there’s this at the end:

EDIT: People at OpenAI seem to be denying “fine-tuning” o3 for the ARC, see this comment by Zach Stein-Perlman. It’s unclear whether they’re denying reinforcement learning (rewarded by correct ARC training set answers), or whether they’re just denying they used a separate derivative of o3 (that’s fine-tuned for the test) to take the test.

o3 is clearly a step up. However, the objection that ARC was treated as just another dataset to solve, rather than as a test of general ability, is an argument that must be taken seriously.
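Side note on the speculation above: "letting the model generate chains of thought and rewarding the ones that led to the correct answer" is essentially rejection-sampling fine-tuning. Below is a minimal, purely illustrative Python sketch of that loop; sample_chain_of_thought and fine_tune are hypothetical stubs, not anything OpenAI has described doing.

```python
import random

def sample_chain_of_thought(task):
    # Stub: a real system would sample a reasoning trace and a final answer
    # from the model; here we just guess among the task's candidate answers.
    guess = random.choice(task["candidates"])
    return {"reasoning": f"try {guess}", "answer": guess}

def fine_tune(model, examples):
    # Stub: a real system would update the model weights on the kept traces.
    print(f"fine-tuning on {len(examples)} correct traces")
    return model

def rejection_sample_train(model, train_tasks, samples_per_task=8):
    # "Reward" here just means: keep only chains of thought whose final
    # answer matches the known training-set solution, then train on them.
    kept = []
    for task in train_tasks:
        for _ in range(samples_per_task):
            trace = sample_chain_of_thought(task)
            if trace["answer"] == task["solution"]:
                kept.append((task, trace))
                break
    return fine_tune(model, kept)

# Toy usage with two fake tasks standing in for ARC training puzzles.
tasks = [
    {"candidates": ["A", "B", "C"], "solution": "B"},
    {"candidates": ["X", "Y"], "solution": "Y"},
]
rejection_sample_train(model=None, train_tasks=tasks)
```

The only point of the sketch is that the training-set solutions themselves act as the reward signal, which is exactly why training on the public set is contentious.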

#ai #llm #o3 #openai #reasoning
Author Public Key: npub1g2fyrnzsx8anvenmv3rahuueqvasyumj2dd4l6rx63hjcpleyctqnku9kf