How do we test the learning capabilities of AI systems?

NYU Center for Data Science
Aug 21, 2023

CDS Associate Professor Sam Bowman and CDS Assistant Professor Brenden Lake discuss large language models’ ability to reason in an article for Nature

Since its launch in November 2022, ChatGPT has outperformed previous AI systems, stunning users with its ability to generate text. Trained on a vast amount of language gleaned from various internet sources, the chatbot has produced a steady stream of conversations, essays, and even books that are often indistinguishable from human writing. Yet while the large language model (LLM) can ace tests for machine intelligence, a study, “The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain,” published this May in Transactions on Machine Learning Research (TMLR), found that the AI program is easily stumped by simple visual logic puzzles.

The test, known as ConceptARC, presents a series of colored blocks on a screen, in which most people can readily spot the connecting patterns. When tested, GPT-4 (the LLM that ChatGPT runs on) correctly solved one-third of the puzzles in one pattern category and only 3% of the puzzles in another. While the results of the study may add a new dimension to our understanding of GPT-4’s ability to reason with abstract concepts, one major issue AI researchers note is that ConceptARC is a visual test translated into text for GPT-4 to process, which makes the test more challenging for the LLM. The study raises significant questions about how AI models should be tested for intelligence in the future.

A recent Nature article, “ChatGPT broke the Turing test — the race is on for new ways to assess AI,” explains that AI researchers have formed two opposing opinions on how these LLMs function. While some attribute the models’ achievements to an ability to reason, others disagree. Inconclusive evidence on either side, along with the novelty of the technology, has deepened the division.

“These systems are definitely not anywhere near as reliable or as general as we want, and there probably are some particular abstract reasoning skills that they’re still entirely failing at,” CDS Associate Professor of Linguistics, Data Science, & Computer Science Sam Bowman told Nature. “But I think the basic capacity is there.”

What nearly everyone can agree on is that logic puzzles like those in the TMLR study, which reveal differences between the capabilities of humans and AI, are the future of machine intelligence testing. In the article for Nature, Bowman, along with CDS Assistant Professor of Psychology and Data Science Brenden Lake and other AI researchers, discusses some of the challenges that arise with current AI benchmarks, proposing the creation of an open, unsolved problem as the best alternative for testing LLMs’ reasoning abilities.

To learn more, check out “ChatGPT broke the Turing test — the race is on for new ways to assess AI.”

By Meryl Phair


