Do Large Language Models Really Generalize? This Paper Says Yes

NYU Center for Data Science
3 min read · Jun 5, 2024


In the world of artificial intelligence, large language models (LLMs) have captured attention with their impressive abilities to generate text, translate languages, and answer questions. But a crucial question remains: Do these models genuinely understand the data they process — that is, can they generalize — or are they merely repeating what they’ve been trained on? A team of researchers in CDS Associate Professor of Computer Science and Data Science Andrew Gordon Wilson’s research group tackles this question head-on in their new preprint, “Non-Vacuous Generalization Bounds for Large Language Models,” co-led by CDS PhD student Sanae Lotfi.

Modern LLMs, which can contain billions of parameters (the settings that the model adjusts during training to make predictions), are often accused of simply memorizing their training data and regurgitating it back as output. This is known as the “stochastic parrot” hypothesis. Arguing against this, Lotfi, Wilson, and their co-authors provide the first “non-vacuous generalization bounds” for these models, which means they have found a mathematical way to prove that LLMs can indeed generalize, or apply what they have learned to new, unseen data. “We wanted to answer a fundamental question: Can large language models generalize beyond their training data? The answer is not as obvious as it was with smaller models,” Lotfi explained.

But what does “non-vacuous” mean? Many generalization bounds, especially for large neural networks, are vacuous: they do not guarantee that the model will predict any better than random guessing. By contrast, this paper provides the first non-trivial guarantees on the generalization of large language models, guarantees that additionally yield practical insight into why these models generalize.
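As a rough illustration of what such a guarantee looks like, a compression-style bound relates a model’s error on unseen data to its error on the training data plus a penalty that grows with the model’s compressed size. The form below is a generic Occam-style bound with notation introduced only for this sketch; it is not the exact statement proved in the paper.

```latex
% Generic compression-style (Occam) bound, for illustration only;
% not the exact bound proved in the paper.
% With probability at least 1 - \delta over the draw of n training samples:
R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{C(h)\,\ln 2 + \ln(1/\delta)}{2n}}
% R(h):       expected risk of model h on unseen data
% \hat{R}(h): empirical risk of h on the training data
% C(h):       length, in bits, of a compressed description of h
% The bound is "non-vacuous" when its right-hand side is below the error of
% random guessing; a smaller compressed size C(h) gives a tighter bound.
```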

The paper is led by a team from Professor Andrew Gordon Wilson’s research group, co-authored by Sanae Lotfi, CDS PhD student Yilun Kuang, CDS Postdoctoral Researcher Micah Goldblum, CDS Instructor Tim G. J. Rudner, CMU postdoctoral researcher Marc Finzi, and Andrew Gordon Wilson. Wilson said: “A long-standing focus of our research group has been to pursue an actionable understanding of generalization — an understanding that allows us to intervene in order to achieve better performance. Large language models have a remarkable ability to generalize across many different domains, even without updating their representations after training on text completion. We believe an important component of this success is a simplicity (Occam’s razor) bias. We are in the process of formalizing this bias through the lens of compression.”

Compression is essential for proving that the models are not just memorizing but can actually generalize. One of the key innovations in the paper is the introduction of a “compression bound” that works with a metric called “log-likelihood loss.” Simply put, this is a way of measuring how well a model predicts the next word in a sentence. The team’s method makes computing this bound 900 times faster, which is crucial when dealing with massive datasets.
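As a simplified illustration (a hypothetical sketch in plain Python, not the authors’ code), the log-likelihood loss averages how surprised the model is by each correct next word:

```python
# Hypothetical sketch of the log-likelihood loss for next-word prediction;
# this is just the standard definition of the metric, not the paper's code.
import math

def avg_negative_log_likelihood(probs_of_true_next_words):
    """Average negative log-likelihood (in nats) over a sequence.

    `probs_of_true_next_words` is an illustrative list of the probabilities
    the model assigned to each correct next word; lower is better.
    """
    n = len(probs_of_true_next_words)
    return -sum(math.log(p) for p in probs_of_true_next_words) / n

# A model that puts probability 0.5 on every correct next word scores ln(2) ~= 0.69;
# one that puts only 0.05 on each scores ~= 3.0, i.e., far worse prediction.
print(avg_negative_log_likelihood([0.5, 0.5, 0.5]))     # ~0.693
print(avg_negative_log_likelihood([0.05, 0.05, 0.05]))  # ~2.996
```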

“To achieve this result, we developed SubLoRA, a simple low-dimensional nonlinear parameterization that leads to non-vacuous generalization bounds for very large models with up to 849 million parameters,” Lotfi noted. In other words, they created a new, efficient way to compress the models’ trainable parameters, showing that even models at this scale can generalize.
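The general flavor of such a parameterization can be sketched as follows. This is an illustrative toy in NumPy with made-up sizes and names, not the authors’ implementation: the only trainable parameters are a small vector of subspace coordinates, which a fixed random projection maps into low-rank (LoRA-style) factors; because those factors are multiplied together, the map is nonlinear, and its low dimensionality keeps the compressed description of the model short.

```python
# Toy sketch of a subspace + low-rank parameterization (illustrative assumptions
# only; dimensions and details are invented, not the authors' implementation).
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank, subspace_dim = 64, 64, 4, 32    # hypothetical sizes

W0 = rng.standard_normal((d_out, d_in))            # frozen pretrained weight matrix
P = rng.standard_normal((rank * (d_out + d_in), subspace_dim))  # fixed random projection

def subspace_lora_weight(z):
    """Map subspace coordinates z (the only free parameters) to a full weight matrix."""
    theta = P @ z                                   # lift z into the low-rank parameter space
    A = theta[: rank * d_out].reshape(d_out, rank)  # low-rank factor A
    B = theta[rank * d_out:].reshape(rank, d_in)    # low-rank factor B
    return W0 + A @ B                               # nonlinear in z because A and B are multiplied

z = rng.standard_normal(subspace_dim) * 0.01        # 32 numbers control the whole update
print(subspace_lora_weight(z).shape)                # (64, 64)
```

Because only the subspace coordinates (plus the seed of the fixed random projection) need to be stored, the trained model admits a very short compressed description, which is exactly what a compression bound rewards.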

This work is a follow-up to their previous paper, “PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization,” which focused on deep neural networks for image classification. While that paper made strides in explaining generalization for image models, extending these techniques to language models presented unique challenges. Language models are trained on sequences of text, making their predictions dependent on the context, unlike image models where each prediction is independent.

The new bounds apply to the unbounded bits-per-dimension (BPD) objective, which measures how many bits the model needs, on average, to encode each token it predicts. Their approach shows that LLMs, despite their size, are not merely memorizing data but are capable of meaningful generalization.
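Concretely (assuming the standard definition), bits per dimension is just the average negative log-likelihood per token expressed in bits rather than nats:

```python
# Converting an average negative log-likelihood (in nats) to bits per dimension.
# Assumes the standard definition: BPD = NLL / ln(2).
import math

def bits_per_dimension(avg_nll_nats):
    return avg_nll_nats / math.log(2)

print(bits_per_dimension(math.log(2)))  # a loss of ln(2) nats -> 1.0 bit per token
```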

Lotfi has presented this work at various venues, including Cohere For AI, ML Collective, and the University of Illinois Urbana-Champaign’s ML Seminar, where it has garnered significant interest. “The entire field is trying to understand these models better. This work is one of the first to show that large language models can indeed generalize beyond their training data, which is a critical step in developing more robust and reliable AI systems,” Lotfi said.

By Stephen Thomas
