Children Beat AI at a Simple Test of Scientific Thinking
Four-year-olds consistently outperform today’s most advanced language models at inferring causal relationships when the evidence points toward complex explanations. CDS PhD student Anthony GX-Chen discovered this surprising failure mode while testing whether AI systems could learn to discover causal structures the way scientists do, revealing that large language models inherit the same cognitive biases that limit adult reasoning.
GX-Chen and his collaborators, including Rob Fergus, a CDS-associated Professor and the Director of AI Research and Head of FAIR at Meta, tested this using a classic psychology experiment called the Blicket Test, where participants must figure out which objects activate a machine and what rule governs it. The test presents two possibilities: either any special object can turn on the machine alone (a disjunctive / “OR” rule), or all special objects must be present together (a conjunctive / “AND” rule). While an optimal agent should handle both rules equally well, the researchers found that language models systematically failed when the correct answer required conjunctive thinking.
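To make the two rules concrete, here is a minimal sketch in Python of how a blicket machine might behave under each rule. It is illustrative only, not the authors' experimental code, and the object names are hypothetical.

```python
# Illustrative blicket machine (not the authors' code).
# "Blickets" are the special objects; the rule decides when the machine lights up.

def machine_activates(objects_on_machine, blickets, rule):
    """Return True if the machine turns on for the given set of objects."""
    blickets_present = set(objects_on_machine) & set(blickets)
    if rule == "disjunctive":      # "OR": any single blicket is enough
        return len(blickets_present) >= 1
    if rule == "conjunctive":      # "AND": every blicket must be present together
        return blickets_present == set(blickets)
    raise ValueError(f"unknown rule: {rule}")

blickets = {"A", "B"}  # hypothetical ground truth: A and B are both blickets
print(machine_activates({"A"}, blickets, "disjunctive"))       # True: A alone suffices
print(machine_activates({"A"}, blickets, "conjunctive"))       # False: needs A and B together
print(machine_activates({"A", "B"}, blickets, "conjunctive"))  # True
```

The same evidence can look very different under the two rules, which is exactly what makes the test a probe of causal reasoning rather than pattern matching.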
The work, “Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?”, accepted to this year’s Conference on Language Modeling (COLM), tested models including GPT-4 and DeepSeek across thousands of trials. Even the best-performing models achieved near-perfect accuracy on disjunctive cases but struggled dramatically with conjunctive ones, sometimes performing no better than random guessing.
“We found all of that very interesting,” GX-Chen said. “And then in this work, we try to ask this one question very rigorously, which is: can these language model agents — ‘agents’ meaning they take actions and make decisions on their own in some kind of environment — actually act in a way as to discover causal structure within their own environment?”
The bias appears deeply rooted in how these models explore their environment. When the researchers analyzed the models’ exploration patterns, they found the AI agents behaved as though they expected simple explanations: they took fewer exploratory steps and tried fewer object combinations that could have tested alternative hypotheses. Even when provided with perfect exploration data that clearly indicated a conjunctive rule, models like GPT-4 still showed lower accuracy than in disjunctive cases.
The parallels to human cognition proved particularly striking. Psychologists have long known that adults show a strong preference for disjunctive explanations, while young children remain more open to both possibilities. When GX-Chen’s team replicated classic developmental psychology experiments with language models, they found the AI systems matched adult response patterns almost exactly. Given evidence that supported conjunctive rules, both language models and human adults overwhelmingly chose the simpler disjunctive interpretation, while four-year-olds correctly identified conjunctive relationships.
The implications extend beyond academic curiosity. Companies like OpenAI have identified scientific discovery as a key frontier for AI development, with systems like Sakana AI’s “AI Scientist” already attempting to automate entire research workflows. But if language models can’t reliably infer that, for example, a heart attack might require both genetic predisposition and high cholesterol — a basic conjunctive relationship — their ability to make genuine scientific discoveries remains questionable.
“These findings indicate that we should be at least measured in what we want to expect out of these LLMs, and we should be very careful to measure how well they’re actually doing,” GX-Chen explained.
The researchers developed a solution that significantly reduces this bias through what they call inference-time hypothesis sampling. Instead of relying on the model’s skewed internal prior, the method explicitly generates diverse hypotheses about causal relationships and systematically eliminates them based on evidence. With this approach, the performance gap between conjunctive and disjunctive reasoning largely disappeared.
The technique works by having the model generate multiple Python functions representing different causal hypotheses, then prompting it to take actions that would eliminate as many hypotheses as possible. By forcing the model to consider a “flatter” distribution of possibilities rather than its biased prior, the method moves language models closer to the unbiased curiosity that characterizes both optimal algorithms and young children.
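Here is a rough sketch of the core elimination step, with the candidate hypotheses written by hand for clarity; in the actual method they are sampled from the language model itself, and the full prompting and action-selection pipeline is more involved than this simplified loop.

```python
# Illustrative sketch of inference-time hypothesis sampling (not the authors' implementation).
# Each hypothesis is a Python function mapping a set of objects to a predicted machine state.

def h_or_a(objs):   return "A" in objs                 # disjunctive: A alone activates
def h_or_b(objs):   return "B" in objs                 # disjunctive: B alone activates
def h_and_ab(objs): return {"A", "B"} <= set(objs)     # conjunctive: A and B together

# Stand-in for hypotheses sampled from the model rather than drawn from its skewed prior.
hypotheses = {"OR(A)": h_or_a, "OR(B)": h_or_b, "AND(A,B)": h_and_ab}

# Observations gathered by acting in the environment: (objects placed, did the machine turn on?)
observations = [({"A"}, False), ({"B"}, False), ({"A", "B"}, True)]

# Keep only the hypotheses consistent with every observation.
surviving = {
    name: h for name, h in hypotheses.items()
    if all(h(objs) == outcome for objs, outcome in observations)
}
print(list(surviving))  # ['AND(A,B)'] — the conjunctive rule survives despite the "OR" bias
```

Making the hypothesis set explicit is what flattens the prior: the conjunctive rule is considered on equal footing from the start, so the evidence, rather than the model’s default preference, decides which explanation remains.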
GX-Chen speculates the bias originates in training data: “At every single stage of the training process, the data generating process itself comes, most of the time, from human adults,” he noted. In other words, language models are pre-trained on adult-generated text, fine-tuned on adult instructions, and aligned using adult preferences. So the models may have absorbed not just human knowledge but the cognitive limitations of human adults — the mental shortcuts and biases that help us navigate daily life but hinder scientific thinking.
The findings raise fundamental questions about what it means to build AI systems for scientific discovery. While language models have achieved remarkable capabilities in many domains, this work suggests they may need more than scale to overcome the cognitive biases baked into their training. Until then, when it comes to unbiased causal reasoning, four-year-olds remain the gold standard.
By Stephen Thomas
