Human Intelligence Still Outshines AI on Abstract Reasoning Tasks
The ability to quickly grasp abstract patterns and apply them to new situations is a hallmark of human intelligence. Despite rapid advances in artificial intelligence, humans still vastly outperform even the most sophisticated AI systems on tasks requiring this kind of flexible reasoning.
NYU Psychology PhD student Solim LeGris, CDS Research Scientist Wai Keen Vong, CDS Associate Professor of Psychology and Data Science Brenden Lake, and NYU Psychology Professor Todd Gureckis recently conducted a comprehensive study evaluating human performance on the Abstraction and Reasoning Corpus (ARC), a challenging benchmark designed to test general intelligence. Their findings, published in a new paper titled “H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark,” build upon earlier work by many of the same researchers.
In 2021, Vong, Lake, and Gureckis — with former NYU Psychology PhD student Aysja Johnson — co-authored a paper titled “Fast and flexible: Human program induction in abstract reasoning tasks,” which examined human performance on a subset of 40 ARC tasks. That initial study found impressive human capabilities, with participants achieving an average accuracy of 84% across tasks.
The latest research expands significantly on the 2021 study, evaluating human performance across all 800 publicly available ARC tasks. “We now have a much more robust estimate of human performance for the training tasks. It’s a little bit lower than [the earlier paper’s results],” said LeGris, the first author on the new paper, which shows that humans, on average, still solve these abstract visual puzzles with significantly higher accuracy than top AI models.
“Humans are just much better problem solvers here,” Vong said. “The top models are still nowhere near what humans are doing.”
The ARC benchmark, created by AI researcher François Chollet in 2019, consists of 1000 visual pattern completion tasks. Each task provides a few example input-output pairs showing a transformation, and test-takers must infer the underlying rule and apply it to a new input. The tasks are intentionally designed to be unlike anything one would encounter on the internet, requiring genuine abstract thinking rather than simply retrieving memorized information.
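For readers unfamiliar with the benchmark, the sketch below illustrates the general shape of an ARC task as distributed in the public repository: a few “train” input-output grid pairs plus a “test” input whose output the solver must produce, with grids encoded as small arrays of color indices. The specific grids and the toy “mirror each row” rule here are invented for illustration and are not drawn from the benchmark itself.

```python
# Illustrative sketch of the ARC task format: grids are lists of lists of
# integers 0-9, one integer per colored cell. The grids and the toy
# "mirror each row" rule below are invented for illustration.
import json

example_task = {
    "train": [  # demonstration input/output pairs showing the transformation
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 2], [0, 0]], "output": [[2, 0], [0, 0]]},
    ],
    "test": [   # the solver must produce the output for this input
        {"input": [[0, 0], [3, 0]]}
    ],
}

def solve(grid):
    """Hand-written rule for this toy task: mirror each row."""
    return [list(reversed(row)) for row in grid]

predicted = solve(example_task["test"][0]["input"])
print(json.dumps(predicted))  # [[0, 0], [0, 3]]
```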
The challenge posed by ARC has sparked a $1,000,000+ competition, the ARC Prize, aimed at developing AI systems that can match or exceed human performance on these tasks. The prize, hosted by Mike Knoop and Chollet, aims to underscore the significance of ARC as a measure of progress towards artificial general intelligence (AGI).
In fact, Knoop and Chollet will be speaking at NYU as part of the ARC Prize 2024 University Tour on October 11. This event, organized in partnership with CDS’ Minds, Brains, and Machines, will provide deeper insights into the technical approaches designed to tackle ARC and the future of AGI research.
To obtain an estimate of human performance on the ARC benchmark, LeGris, Vong, and their co-authors recruited over 1,700 participants through Amazon Mechanical Turk to attempt the ARC tasks. Each participant was given five randomly selected puzzles and allowed three attempts per puzzle.
The results showed that humans solved 76.2% of the training set tasks and 64.2% of the more difficult evaluation set tasks on average. Remarkably, 98.8% of the 800 tasks were solved by at least one of the participants who attempted them. “If you contact 10 random people on the internet, at least one will be able to solve any given ARC problem,” the authors noted in the paper. This highlights the universality of human abstract reasoning capabilities.
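To make these statistics concrete, here is a minimal sketch, using an assumed per-attempt record format rather than the authors’ actual analysis pipeline, of how the two headline numbers could be computed: a participant counts as solving a task if any of their up to three attempts is correct, each task’s solve rate is averaged over the participants who saw it, and a task counts as solvable if at least one participant solved it.

```python
# Hedged sketch (not the authors' analysis code): computing the headline
# statistics from hypothetical per-attempt records of the form
# (participant_id, task_id, attempt_number, correct).
from collections import defaultdict

records = [
    ("p1", "taskA", 1, False), ("p1", "taskA", 2, True),
    ("p2", "taskA", 1, False), ("p2", "taskA", 2, False), ("p2", "taskA", 3, False),
    ("p1", "taskB", 1, True),
    ("p2", "taskB", 1, False), ("p2", "taskB", 2, True),
]

# A participant "solves" a task if any of their (up to three) attempts is correct.
solved = defaultdict(dict)  # task_id -> participant_id -> solved?
for pid, tid, _attempt, correct in records:
    solved[tid][pid] = solved[tid].get(pid, False) or correct

# Average per-task solve rate (reported as 76.2% / 64.2% in the paper).
per_task_rate = [sum(p.values()) / len(p) for p in solved.values()]
print(sum(per_task_rate) / len(per_task_rate))

# Fraction of tasks solved by at least one participant (98.8% in the paper).
print(sum(any(p.values()) for p in solved.values()) / len(solved))
```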
In contrast, even the best AI systems struggle with ARC. The top-performing model, based on GPT-4, achieves only 42% accuracy on the evaluation set. Vong explained that unlike humans, AI models seem unable to flexibly search through the space of possible transformations: “There’s something Chollet talks a lot about, which is the differences between memorization/knowledge versus skill.”
LeGris, a researcher tangentially interested in the psychological phenomenon of ‘insight’, explained: “People can flexibly reason and use on-the-fly abstractions to solve arbitrary tasks. And that’s a very different thing from current AI systems.”
The researchers also analyzed the types of errors made by humans and AI models. While the AI systems made fewer mistakes involving grid dimensions, the overall pattern of errors differed significantly between humans and machines, suggesting fundamentally different approaches to solving the tasks.
One key advantage humans demonstrated was the ability to learn from minimal feedback and correct initial mistakes. Human performance improved substantially given multiple attempts, while AI models showed little improvement beyond their first guess. “People will often make initially wrong guesses but they are capable of self-correction and can flexibly consider alternative solutions,” the authors said. “Understanding how people achieve this is likely to be useful for improving machine intelligence.”
While the results demonstrate clear human superiority on ARC for now, LeGris and Vong emphasized that the goal isn’t to prove humans are “better” than AI. Rather, by deeply analyzing human performance, they aim to uncover insights that could lead to more genuinely intelligent and flexible AI systems. They hope this rich dataset, which they’ve made publicly available on H-ARC’s website, will spur further research.
By Stephen Thomas