Middle Layers Excel: New Research Challenges Final-Layer Focus in Language Models
The intermediate layers of large language models (LLMs) contain surprisingly rich representations that often outperform the final layer on downstream tasks, according to new research from CDS Research Scientist Ravid Shwartz-Ziv, CDS Professor Yann LeCun, and their collaborators.
Their paper, “Layer by Layer: Uncovering Hidden Representations in Language Models,” led by Oscar Skean and Md Rifat Arefin, with contributions from NYU Courant MS student Dan Zhao and others, challenges the conventional practice of using final-layer outputs as embeddings, showing it can be suboptimal. Through extensive analysis, the researchers found that mid-depth layers in autoregressive transformers undergo significant information compression that ultimately helps them better capture and distill relevant features.
“When training the model for next-token prediction, intermediate layers often provide better representations for downstream tasks than the final layer,” said Shwartz-Ziv. “This is because these layers strike an optimal balance between preserving task-relevant information and discarding noise.”
The research team developed a unified framework combining information theory, geometric analysis, and invariance metrics to quantify representation quality across several architectures. Their comparative analysis covered decoder-only transformers like Pythia, encoder-only models like BERT, and state space models (SSMs) like Mamba. Each architecture showed its own pattern, but the researchers found that intermediate-layer compression was most pronounced in autoregressive transformers, while bidirectional models like BERT showed milder intermediate changes. This suggests the behavior is tied to the autoregressive training objective rather than to the transformer architecture itself.
This work builds on the same team’s previous paper, “Does Representation Matter? Exploring Intermediate Layers in Large Language Models,” which first identified this phenomenon. The team’s latest research expands the analysis to more models and training regimes while offering a comprehensive theoretical framework to explain why intermediate representations excel.
Their analysis employed multiple metrics, including prompt entropy, curvature, and augmentation invariance. These metrics revealed that as training progresses, intermediate layers learn to compress information more efficiently, developing a distinctive “valley” in their entropy measures that correlates with better downstream performance.
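To make the entropy measure concrete, the sketch below computes a spectral entropy of each layer’s token representations for a single prompt. This is one plausible instantiation of a prompt-entropy-style metric, not necessarily the paper’s exact formulation, and the Pythia checkpoint is just an illustrative choice.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative sketch: a spectral "prompt entropy" per layer, computed as the
# Shannon entropy of the normalized eigenvalue spectrum of the token Gram matrix.
# Lower values indicate more compressed (lower-rank) representations.
def layerwise_prompt_entropy(text, model_name="EleutherAI/pythia-410m"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # tuple of [1, T, D] tensors

    entropies = []
    for h in hidden_states:
        X = h[0]                                 # [T, D] token representations
        X = X - X.mean(dim=0, keepdim=True)      # center tokens
        gram = X @ X.T                           # [T, T] Gram matrix
        gram = gram / gram.trace()               # normalize to unit trace
        eigvals = torch.linalg.eigvalsh(gram).clamp(min=1e-12)
        entropies.append(-(eigvals * eigvals.log()).sum().item())
    return entropies

print(layerwise_prompt_entropy("The quick brown fox jumps over the lazy dog."))
```

Plotting these values against layer index is one way to see the mid-depth “valley” the authors describe.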
“Our last observation, and maybe what I find most interesting, is what we discovered about reasoning models,” noted Shwartz-Ziv. “When models are explicitly trained on reasoning tasks, you see an increase in information preservation in the intermediate layers, preventing this collapse. We used this insight in our Seq-VCR paper to create a regularizer that explicitly maintains higher information in these layers, which significantly improved performance on math problems.”
In that follow-up paper, “Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning,” led by Arefin with contributions from LeCun and Shwartz-Ziv, the team did exactly that: it deliberately preserved information in intermediate layers. The results were striking. Their method achieved 99.5% accuracy on complex 5×5 digit multiplication tasks, surpassing even GPT-4 with chain-of-thought prompting, which managed only 44% accuracy.
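The “VCR” in Seq-VCR refers to variance-covariance regularization applied to intermediate representations. As a rough, hedged sketch only, not the paper’s exact loss, a penalty of this general family might look like the following; the hinge threshold `gamma`, the weighting `lambda_vcr`, and the choice of layer are illustrative assumptions.

```python
import torch

def vcr_penalty(h, gamma=1.0, eps=1e-4):
    """Variance-covariance style penalty on intermediate representations.

    h: [batch, seq_len, dim] hidden states from a chosen intermediate layer.
    Returns a scalar that is small when each dimension's std stays above
    `gamma` and cross-dimension covariances stay near zero, discouraging
    representation collapse.
    """
    x = h.reshape(-1, h.shape[-1])              # flatten tokens: [N, D]
    x = x - x.mean(dim=0, keepdim=True)

    # Variance term: hinge loss pushing each dimension's std above gamma.
    std = torch.sqrt(x.var(dim=0) + eps)
    var_loss = torch.relu(gamma - std).mean()

    # Covariance term: penalize off-diagonal entries of the covariance matrix.
    n, d = x.shape
    cov = (x.T @ x) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d

    return var_loss + cov_loss

# total_loss = task_loss + lambda_vcr * vcr_penalty(intermediate_hidden_states)
```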
These findings have practical implications for how we use LLMs: for many downstream tasks, practitioners should consider taking representations from intermediate layers rather than from the final layer. The research also illuminates why chain-of-thought reasoning improves model performance: it helps maintain richer context throughout the network’s layers by preserving information that would otherwise be compressed away.
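For practitioners who want to try this, pulling embeddings from a mid-depth layer with the Hugging Face transformers library is a small change from the usual last-layer recipe. The sketch below is only illustrative: the model name, the mean-pooling choice, and the halfway layer index are assumptions, and the best layer is usually found by sweeping over layers on a validation set.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch: mean-pooled sentence embeddings taken from a mid-depth layer instead
# of the final one. Model and layer index are illustrative choices.
model_name = "EleutherAI/pythia-410m"
tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token               # needed for batch padding
model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

def embed(texts, layer_index):
    inputs = tok(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer_index]   # [B, T, D]
    mask = inputs["attention_mask"].unsqueeze(-1).float()     # ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)       # mean over real tokens

mid_layer = model.config.num_hidden_layers // 2
embeddings = embed(["Middle layers can win.", "Final layers are not always best."], mid_layer)
print(embeddings.shape)
```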
By Stephen Thomas