Making the Invisible Visible: New Method Reveals How Language Models Encode Concepts
Language models encode numerous concepts in their internal representations — from demographic characteristics to factual knowledge to sentiment. CDS Faculty Fellow Shauli Ravfogel and his colleagues have developed a method to make these hidden patterns visible by translating representation-level interventions into natural language text.
The research, presented in their paper “A Practical Method for Generating String Counterfactuals,” addresses a significant challenge in understanding how various concepts are encoded in language models’ internal representations. While previous techniques could intervene at the representation level to modify or remove information, translating these changes back into readable text has remained difficult. In their study, the team demonstrated the technique using gender as one illustrative application, revealing how gender information manifests in subtle patterns of word choice and topic emphasis.
“People have developed different techniques to intervene in representations, either to delete information or to steer the model from one value of a concept to another,” Ravfogel said. “Our goal was to translate these interventions into natural language.”
The method works by first encoding text into a representation, then applying an intervention to modify how gender is encoded, and finally inverting the modified representation back into text. The resulting “string counterfactuals” reveal precisely how gender influences language generation beyond obvious markers like pronouns.
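For intuition, here is a minimal sketch of that encode, intervene, invert loop in Python. The encoder, the linear “concept direction,” and the inversion step are hypothetical placeholders written for illustration under stated assumptions, not the implementation described in the paper.

```python
# Illustrative sketch of the encode -> intervene -> invert pipeline.
# The encoder, intervention, and inverter below are hypothetical stand-ins.
import numpy as np

def encode(text: str) -> np.ndarray:
    """Map text to a fixed-size hidden representation (e.g., a pooled
    transformer embedding). Placeholder: returns a deterministic random vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=768)

def intervene(h: np.ndarray, concept_direction: np.ndarray, target: float) -> np.ndarray:
    """Shift the representation along a learned 'concept direction'
    (e.g., a gender direction found by a linear probe) toward a target value."""
    direction = concept_direction / np.linalg.norm(concept_direction)
    current = h @ direction
    return h + (target - current) * direction

def invert(h: np.ndarray) -> str:
    """Decode the modified representation back into text using a trained
    representation-to-text inversion model. Placeholder only."""
    raise NotImplementedError("requires a trained inversion model")

# Usage: produce a "string counterfactual" for one biography.
h = encode("She is a software developer with ten years of experience.")
concept_direction = np.random.default_rng(0).normal(size=768)  # stand-in for a probe's weight vector
h_shifted = intervene(h, concept_direction, target=1.0)
# counterfactual_text = invert(h_shifted)
```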
When the team examined which words became more or less likely after a gender intervention, clear patterns emerged. Technical terms like “hardware,” “enterprise,” and “developer” became more common when shifting from female to male biographies, while terms related to caregiving and social relationships became more common when shifting from male to female.
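One way such an analysis could be run, sketched here purely for illustration, is to compare relative word frequencies between original and counterfactual texts; the helper below and its toy data are assumptions, not the paper’s actual analysis.

```python
# Illustrative sketch: rank words by how much their relative frequency
# changes between original and counterfactual texts (hypothetical analysis).
from collections import Counter

def word_freqs(texts):
    counts = Counter(w.lower() for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def frequency_shifts(original_texts, counterfactual_texts, smoothing=1e-6):
    orig, cf = word_freqs(original_texts), word_freqs(counterfactual_texts)
    vocab = set(orig) | set(cf)
    # Ratio > 1: the word became more likely after the intervention.
    return sorted(
        ((w, (cf.get(w, 0) + smoothing) / (orig.get(w, 0) + smoothing)) for w in vocab),
        key=lambda x: x[1],
        reverse=True,
    )

# Toy example with made-up sentences.
shifts = frequency_shifts(
    ["she leads the nursing team", "she mentors community volunteers"],
    ["he leads the enterprise hardware team", "he mentors junior developers"],
)
print(shifts[:5])  # words whose likelihood increased most after the shift
```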
Beyond its value for interpretation, the technique has practical applications for reducing bias. The researchers demonstrated that adding these counterfactual examples to training data improved fairness in profession classification tasks without sacrificing accuracy. Models trained on both original biographies and their gender-swapped counterfactuals showed reduced gender bias compared to those trained solely on original data.
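A minimal sketch of this kind of counterfactual data augmentation is shown below, assuming a simple bag-of-words profession classifier; the model choice and toy biographies are illustrative stand-ins rather than the researchers’ setup.

```python
# Illustrative sketch: augmenting a profession classifier's training data
# with gender-swapped counterfactuals (model and data are placeholders).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

original_bios = [
    "She is a nurse who cares for patients.",
    "He is a software developer building enterprise tools.",
]
labels = ["nurse", "developer"]

# Counterfactuals would come from the encode -> intervene -> invert pipeline;
# hard-coded here for illustration.
counterfactual_bios = [
    "He is a nurse who cares for patients.",
    "She is a software developer building enterprise tools.",
]

# Train on the union of original and counterfactual biographies, with the
# same profession labels, so the classifier sees both gender variants.
train_texts = original_bios + counterfactual_bios
train_labels = labels + labels

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
print(clf.predict(["She builds enterprise software tools."]))
```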
“The original motivation was interpretation, but bias reduction is an immediate use case,” Ravfogel explained. His team focused on gender for this study, but the same technique could potentially be applied to any concept captured in language model representations.
One potential future direction for this work is creating more general methods for translating representations into natural language. “If we could translate these hidden representations to natural language, we might gain insight into how models actually operate when solving complex reasoning tasks,” Ravfogel said.
By making the implicit explicit, Ravfogel’s work provides a window into the subtle ways language models encode social biases and offers a practical method for mitigating their effects.
By Stephen Thomas