Taggart :donor: on Nostr: Nope, still can't reason. We found that our more informative one-shot prompt improved ...
Nope, still can't reason.

We found that our more informative one-shot prompt improved GPT-4’s performance in the text case, but its performance remained well below that of humans and the special-purpose Kaggle-ARC program. We also found that giving minimal tasks as images to the multimodal GPT-4 resulted in substantially worse performance than in the text-only case. Our results support the hypothesis that GPT-4, perhaps the most capable “general” LLM currently available, is still not able to robustly form abstractions and reason about basic core concepts in contexts not previously seen in its training data. It is possible that other methods of prompting or task representation would increase the performance of GPT-4 and GPT-4V; this is a topic for future research.
https://arxiv.org/abs/2311.09247
Published at 2023-11-19 14:35:32
Event JSON
{
"id": "95964e3660bb268643af5e7088e506db3aa784b4e109571e812bfb47a5b284b7",
"pubkey": "4afb3830f7c5db05d5934438779f63c3ed1401aa03a2eb0cc3cda743633aea61",
"created_at": 1700404532,
"kind": 1,
"tags": [
[
"proxy",
"https://infosec.town/notes/9m959nvyfknxrp19",
"activitypub"
]
],
"content": "Nope, still can't reason.We found that our more informative one-shot prompt improved GPT-4’s performance in the text case, but its performance remained well below that of humans and the special-purpose Kaggle-ARC program. We also found that giving minimal tasks as images to the multimodal GPT-4 resulted in substantially worse performance than in the text-only case. Our results support the hypothesis that GPT-4, perhaps the most capable “general” LLM currenly available, is still not able to robustly form abstractions and reason about basic core concepts in contexts not previously seen in its training data. It is possible that other methods of prompting or task representation would increase the performance of GPT-4 and GPT-4V; this is a topic for future research. \nhttps://arxiv.org/abs/2311.09247",
"sig": "6c1e2b0f1ef94ff646f8c148652c7dd0862334215813d35bc7e523de2eeff4a24f9d7c242405d0bf4c39116f25f42f0f74983fbd7ef6fb58e75d6cabf07a93dc"
}