2024-12-22 04:19:56

@gurupanguji on Nostr:

ARC-AGI is a genuine AGI test but o3 cheated 🙁 — LessWrong:

They ask “how much of the performance is due to ARC-AGI data.” Probably most of it. If untuned o3 can do as well, don’t you think OpenAI would publish that (in addition to tuned o3)?

By training o3 on the public training set, ARC-AGI is no longer an AGI test. It becomes yet another test of memorizing rules from training data. This is still impressive, but it is something else.

I do not know what exactly OpenAI did. Did they let o3 spend a long time generating chains of thought, and reward ones which led to the correct answer? If that failed, did they have a human give examples of correct reasoning steps, and train it on that first? I don’t know.

They admitted they “cheated,” without saying how they “cheated.”

And there’s this at the end:

EDIT: People at OpenAI seem to be denying “fine-tuning” o3 for the ARC, see this comment by Zach Stein-Perlman. It’s unclear whether they’re denying reinforcement learning (rewarded by correct ARC training set answers), or whether they’re just denying they used a separate derivative of o3 (that’s fine-tuned for the test) to take the test.

o3 is clearly a step up. However, the objection that ARC was treated as just another dataset to solve, rather than as a test of general ability, is an argument that must be taken seriously.
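Side note on the speculation above: "letting the model generate chains of thought and rewarding the ones that led to the correct answer" is essentially rejection-sampling fine-tuning. Below is a minimal, purely illustrative Python sketch of that loop; sample_chain_of_thought and fine_tune are hypothetical stubs, not anything OpenAI has described doing.

```python
import random

def sample_chain_of_thought(task):
    # Stub: a real system would sample a reasoning trace and a final answer
    # from the model; here we just guess among the task's candidate answers.
    guess = random.choice(task["candidates"])
    return {"reasoning": f"try {guess}", "answer": guess}

def fine_tune(model, examples):
    # Stub: a real system would update the model weights on the kept traces.
    print(f"fine-tuning on {len(examples)} correct traces")
    return model

def rejection_sample_train(model, train_tasks, samples_per_task=8):
    # "Reward" here just means: keep only chains of thought whose final
    # answer matches the known training-set solution, then train on them.
    kept = []
    for task in train_tasks:
        for _ in range(samples_per_task):
            trace = sample_chain_of_thought(task)
            if trace["answer"] == task["solution"]:
                kept.append((task, trace))
                break
    return fine_tune(model, kept)

# Toy usage with two fake tasks standing in for ARC training puzzles.
tasks = [
    {"candidates": ["A", "B", "C"], "solution": "B"},
    {"candidates": ["X", "Y"], "solution": "Y"},
]
rejection_sample_train(model=None, train_tasks=tasks)
```

The only point of the sketch is that the training-set solutions themselves act as the reward signal, which is exactly why training on the public set is contentious.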

#ai #llm #o3 #openai #reasoning
Author Public Key: npub1g2fyrnzsx8anvenmv3rahuueqvasyumj2dd4l6rx63hjcpleyctqnku9kf