Language Models Provide Insight into Linguistic Redundancy
Can language models, trained solely on next-word prediction, come to understand the meanings or logical implications of sentences? Researchers from CDS have taken up this question, examining how language model predictions reflect semantic relationships between sentences. Their findings, detailed in the preprint “Can You Learn Semantics Through Next-Word Prediction? The Case of Entailment,” suggest that language models leverage semantic relationships between sentences when predicting the next word. In the process, the paper also identifies possible gaps in how current linguistic theories account for redundant speech.
The study builds on a 2022 paper by Merrill et al., previously covered on our blog, which derived a theoretical correspondence between next-word prediction and the ability to discern entailment relationships. Under certain assumptions about the authors of language model training data, including the notion that speakers avoid redundancy, the researchers posited that an optimal language model’s co-occurrence probabilities should encode information about whether one sentence entails another. This suggests that learning which sentences can co-occur might give language models an important pathway to learning aspects of the meaning of text.
But the original result was highly theoretical: it was unclear whether language models like ChatGPT actually learn to reflect entailment when predicting sentence co-occurrences.
William Merrill, a CDS PhD student, and co-first author Zhaofeng Wu, an MIT PhD student, set out to empirically evaluate the predictions from the original paper, alongside their collaborators: NYU computer science master’s student Norihito Naka, CDS Associate Professor Tal Linzen, and Yoon Kim, an assistant professor at MIT. Their goal? To determine whether sentence probabilities from real-world language models can indeed be used to reconstruct semantic entailment relationships between complex sentences.
The findings were surprising and thought-provoking. Running their test on GPT-2, OPT, Llama-1, Vicuna, Llama-2, and Llama-2-Chat, Merrill, Wu, and their team found that a variant of the entailment test consistently detected entailment well above random chance across a wide range of entailment benchmarks. This result suggests that, in order to predict how sentences can co-occur, language models trained on next-word prediction implicitly model semantic properties of text to some extent.
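To make this kind of test concrete, here is a minimal sketch, assuming the Hugging Face transformers library and GPT-2 (one of the models in the study). The scoring rule shown, a pointwise-mutual-information-style comparison of a sentence pair against its parts, is an illustrative stand-in rather than the paper’s exact procedure, and the example sentences are invented.

```python
# Illustrative sketch only: score how readily two sentences co-occur under a
# causal language model, relative to how probable they are in isolation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_prob(text: str) -> float:
    """Total log-probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # `loss` is the mean negative log-likelihood per predicted token;
    # multiply by the number of predicted tokens to recover the total.
    return -out.loss.item() * (ids.size(1) - 1)

def cooccurrence_score(premise: str, hypothesis: str) -> float:
    """Higher when the pair co-occurs more readily than its parts would suggest."""
    joint = log_prob(premise + " " + hypothesis)
    return joint - (log_prob(premise) + log_prob(hypothesis))

# Invented example pair: the first sentence entails the second.
print(cooccurrence_score("The beer contained a lemon.",
                         "The beer contained a piece of fruit."))
```

An actual evaluation would compute scores like this over labeled entailment benchmarks and measure how well they separate entailing from non-entailing pairs.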
However, a perplexing twist emerged. Contrary to the theoretical predictions, the direction of the test that best predicted entailment was flipped. Higher co-occurrence probabilities correlated with entailment, the opposite of what was expected under the original assumptions. “The weird thing is that it kind of worked exactly backwards from what our original paper had predicted,” Merrill explained.
This discrepancy prompted the researchers to reexamine the core assumption behind the original theoretical framework: that human speakers always avoid redundancy. A corpus study revealed that humans in fact produce contextually entailed sentences at rates far higher than would be expected from strictly non-redundant speakers.
Merrill and his colleagues analyzed examples from various web domains, including book excerpts, Wikipedia articles, news reports, and online reviews. They found many instances where speakers intentionally repeated or summarized information for emphasis, explanation, or logical reasoning. As Merrill put it, “After looking at the results, we found that the linguistic theory that predicted the original entailment test did not fully account for natural data.”
One compelling example from a Yelp review illustrates how humans violate the non-redundancy theory:
When he returned with it, he just placed it in front of me on the wet bar — no napkin/coaster, the beer was flat, and contained a FREAKING lemon. Not an orange — a lemon.
In this case, the speaker emphatically repeats the fact that the beer contained a lemon, violating the assumption of non-redundancy.
Findings like this challenged the idealized assumption, built into the original paper’s “Gricean speaker” model (“a computational model of human speakers implementing fundamental principles for effective communication”), that speakers always avoid redundancy. The surprise suggests that modeling the human authors of web corpora requires more nuanced linguistic theories, ones that better account for certain kinds of redundant speech. One existing approach the researchers considered was a noise-tolerant speaker model, which assumes a noisy communication channel and so incentivizes speakers to repeat important information for clarity. However, further analysis suggested that noise tolerance alone could not explain the flipped test direction.
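As a rough intuition for why a noisy channel rewards redundancy, consider a toy calculation; the utility form and the numbers below are assumptions for illustration, not the formal model from the paper. If each sentence is independently dropped in transmission, repeating an important proposition raises the chance it gets through, and that gain can outweigh the cost of producing an extra sentence.

```python
# Toy noisy-channel calculation (illustrative assumptions, not the paper's model).

def p_received(n_repeats: int, drop_prob: float) -> float:
    """Chance that at least one copy of a proposition survives the channel."""
    return 1.0 - drop_prob ** n_repeats

def expected_utility(n_repeats: int, drop_prob: float,
                     value: float, cost_per_sentence: float) -> float:
    """Value of getting the proposition across, minus a per-sentence cost."""
    return value * p_received(n_repeats, drop_prob) - cost_per_sentence * n_repeats

# With these made-up numbers, saying the sentence twice (utility 0.71) beats
# saying it once (0.6), even though the repetition is logically redundant.
for k in (1, 2, 3):
    print(k, round(expected_utility(k, drop_prob=0.3, value=1.0,
                                    cost_per_sentence=0.1), 3))
```

The point is simply that a lossy channel can make literal repetition rational, independent of its logical redundancy.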
Alternatively, the researchers proposed that a speaker model formalizing the goal of explaining might better account for the observed redundancy patterns. The idea is that explanatory context can lower the processing cost of a subsequent conclusion; the researchers hypothesized that a revised theory built on this idea could predict the flipped entailment test and offer a more nuanced account of human pragmatics (how humans use language to communicate effectively).
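One way to make “processing cost” concrete, purely as an assumption of this post (the paper’s formalization may differ), is language model surprisal: the negative log-probability of the conclusion, with and without the explanatory context. The sketch below reuses the same probability machinery as the earlier snippet; the example sentences and the use of GPT-2 are again just for illustration.

```python
# Illustrative sketch: surprisal of a conclusion with vs. without an explanation,
# using surprisal as an assumed proxy for processing cost.
# (Tokenization boundary effects are ignored in this toy.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def total_nll(text: str) -> float:
    """Total negative log-likelihood the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item() * (ids.size(1) - 1)

def conclusion_cost(conclusion: str, context: str = "") -> float:
    """Surprisal of `conclusion`, optionally conditioned on an explanatory `context`."""
    if not context:
        return total_nll(conclusion)
    return total_nll(context + " " + conclusion) - total_nll(context)

conclusion = "So the battery was dead."
print(conclusion_cost(conclusion))                                        # no explanation
print(conclusion_cost(conclusion, "The car had sat unused all winter."))  # with explanation
```

If the conditioned surprisal comes out lower, that is at least consistent with the intuition that an explanatory premise makes the conclusion cheaper to process, even when the extra sentence adds little new information.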
Whether and how language models like ChatGPT can help advance theories in linguistics has recently become a topic of much discussion and debate. Noam Chomsky has argued that the models fall short of human-like language use and understanding, while others, like Steven Piantadosi, contend that these models in fact demonstrate key failings of Chomsky’s approach to language. The work of Merrill, Wu, and their team is an example of language models pushing forward our collective understanding of human language. While language models learn and use language differently from humans, the fact that they approximate the speech patterns of a large population of humans means they can, viewed appropriately, serve as one source of data about humans. “Our results motivate future work in computational pragmatics accounting for redundancy and are a case study for how the data aggregated about many speakers in LMs can be used to test and develop pragmatic theories,” Merrill et al. wrote.
The study not only challenges existing linguistic theories but also highlights the potential of language models as a rich source of data for investigating the intricacies of human language. By leveraging the vast corpora used to train these models, researchers can gain insights into the patterns and nuances of linguistic redundancy, potentially refining our understanding of pragmatics and how sentence co-occurrence patterns relate to meaning.
As the field of natural language processing continues to evolve, studies like this one underscore the importance of empirically testing theoretical predictions and embracing the complexities of human communication. The path to truly understanding semantics may lie not only in the pursuit of ever-larger language models but also in a deeper exploration of the principles that govern how we use language to convey meaning.
By Stephen Thomas