Beyond Predictability: The Limits of Language Models in Explaining Human Language Processing

NYU Center for Data Science
Mar 21, 2024

In the quest to understand the human brain, prediction has emerged as a captivating explanatory principle. From anticipating the trajectory of a ball to predicting the next word in a sentence, our brains seem to thrive on forecasting the future. But can the predictive power of language models truly capture the nuances of human language processing? In a recent paper, “Large-scale benchmark yields no evidence that language model surprisal explains syntactic disambiguation difficulty,” published in the Journal of Memory and Language, CDS Associate Professor of Linguistics and Data Science Tal Linzen and his colleagues investigated this question, focusing on one of the most challenging aspects of language comprehension: syntactic complexity.

Syntactically complex sentences often contain temporary ambiguities that lead to processing difficulties. “Garden path” sentences, such as “I read the book was bad,” lure readers into initially interpreting “the book” as the direct object of “read,” only to force a jarring reanalysis when the word “was” comes into view. These structures have long been a subject of fascination for psycholinguists, as they provide a window into the real-time processes of language comprehension.

To put the predictive power of language models to the test, Linzen and his team, which included Kuan-Jung Huang, Mari Kugemoto, Christian Muxica, Brian Dillon, Suhas Arehalli, and Grusha Prasad (the latter two are former PhD students of Linzen's), embarked on an ambitious project. They collected the Syntactic Ambiguity Processing Benchmark, a large-scale dataset of self-paced reading times from 2,000 participants. This resource, unprecedented in its size, allowed the researchers to measure the processing difficulty associated with individual sentences, as well as entire constructions, with far greater precision than before.

The results were striking. Language models did assign lower probability (that is, higher surprisal) to the disambiguating words in garden path sentences, but the processing slowdown predicted from that surprisal fell far short of reality. In some cases, the models predicted a delay of less than 10 milliseconds, while the empirical data revealed a slowdown of more than 100 milliseconds. “What this shows is that it’s true that the word is unpredictable, but unpredictability doesn’t explain why it’s so difficult to process,” Linzen remarked in an interview.
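To make the comparison concrete: surprisal is the negative log probability that a language model assigns to a word in its context, and a common way to turn it into a reading-time prediction is a linear linking function, so that the predicted garden path effect is the surprisal difference between the ambiguous sentence and an unambiguous control, multiplied by a slope in milliseconds per bit. The sketch below is a minimal illustration of that arithmetic, not the study's pipeline. The probabilities for “was” and the slope `MS_PER_BIT` are made-up values chosen for illustration; neither comes from the paper.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(prob)


# Hypothetical probabilities a language model might assign to "was" after
# "I read the book ..." (ambiguous) vs. "I read that the book ..." (unambiguous).
p_was_ambiguous = 0.01
p_was_unambiguous = 0.20

# Hypothetical linking slope: extra milliseconds of reading time per bit of surprisal.
MS_PER_BIT = 2.0


def predicted_slowdown_ms(p_ambig: float, p_unambig: float,
                          ms_per_bit: float = MS_PER_BIT) -> float:
    """Predicted garden path effect: surprisal difference times the linking slope."""
    diff_bits = surprisal_bits(p_ambig) - surprisal_bits(p_unambig)
    return diff_bits * ms_per_bit


effect = predicted_slowdown_ms(p_was_ambiguous, p_was_unambiguous)
print(f"{effect:.1f} ms")
```

With these illustrative numbers, a 20-fold drop in probability (about 4.3 bits) yields a predicted effect of 8.6 ms. Even a sizeable surprisal difference translates into a small predicted slowdown when each bit costs only a few milliseconds, which is the shape of the gap the study reports.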

Linzen suspects that the discrepancy stems from a fundamental difference between human and machine language processing. Unlike language models, which implicitly consider a multitude of interpretations, humans tend to strongly commit to a single analysis. When that analysis proves incorrect, the process of revision demands significant cognitive effort. This theory aligns with eye-tracking data showing that people frequently backtrack and reread when encountering syntactic surprises.
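The contrast between hedging across interpretations and committing to one can be illustrated with a toy comparison. The sketch below is a hypothetical illustration of the idea, not a model from the paper: the parse probabilities, the slope `MS_PER_BIT`, and the fixed reanalysis cost `REANALYSIS_COST_MS` are all invented values. A “parallel” reader pays a cost that scales with the surprisal of the correct parse; a “serial” reader commits to the single most probable parse and pays a flat penalty only when that commitment turns out to be wrong.

```python
import math

# All numbers below are illustrative assumptions, not estimates from the study.
# Probabilities of two parses of "I read the book ..." before "was" arrives.
PARSE_PROBS = {"direct_object": 0.9, "embedded_subject": 0.1}

MS_PER_BIT = 2.0            # hypothetical surprisal-to-time slope
REANALYSIS_COST_MS = 100.0  # hypothetical fixed cost of abandoning a committed parse


def parallel_cost_ms(probs: dict, correct_parse: str,
                     ms_per_bit: float = MS_PER_BIT) -> float:
    """Parallel reader: cost scales with the surprisal of the correct parse."""
    return -math.log2(probs[correct_parse]) * ms_per_bit


def serial_cost_ms(probs: dict, correct_parse: str,
                   reanalysis_cost: float = REANALYSIS_COST_MS) -> float:
    """Serial reader: commits to the most probable parse and pays a fixed
    reanalysis cost only if that parse turns out to be wrong."""
    committed = max(probs, key=probs.get)
    return 0.0 if committed == correct_parse else reanalysis_cost


# "was" reveals that "the book" was the subject of an embedded clause.
print(round(parallel_cost_ms(PARSE_PROBS, "embedded_subject"), 1))
print(serial_cost_ms(PARSE_PROBS, "embedded_subject"))
```

Here the parallel reader's cost grows only logarithmically as the correct parse becomes less likely, while the serial reader's cost is a separate, potentially large penalty tied to the act of revision itself. The toy does not explain where that penalty comes from; it only shows why a pure surprisal account and a commit-and-revise account can diverge.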

The study’s success hinged on the sheer scale of the dataset. “When you have data that is this thorough, you can see how good language models are at predicting very subtle differences and that different sentences with the same structure are challenging to very different extents,” Linzen said. “That’s something that was not possible before.”

While prediction undoubtedly plays a role in language comprehension, this study highlights its limits in fully accounting for the difficulty posed by syntactic complexity. To bridge the gap between artificial and human language processing, researchers may need to develop models that more faithfully reflect human cognitive constraints and strategies. By pushing the boundaries of computational modeling and empirical research, researchers at CDS are paving the way for a deeper understanding of the remarkable feat that is human language processing.

By Stephen Thomas
