CDS Researcher Develops Method for Targeted Language Model Updates
Language models can now generate text with remarkable fluency, but they still occasionally produce undesirable outputs such as factual errors or inappropriate content. Addressing these issues by fine-tuning or retraining models from scratch, however, can often result in unforeseen performance changes or degradations. Now, CDS PhD student Lily H. Zhang, along with CDS Associate Professor of Computer Science and Data Science Rajesh Ranganath and co-author Arya Tafvizi at Google, has developed a novel approach for making precise, targeted updates to language models without compromising their overall capabilities.
In a new paper, “Towards Minimal Targeted Updates of Language Models with Targeted Negative Training,” published in Transactions on Machine Learning Research, Zhang introduced a method called Targeted Negative Training (TNT) that allows for minimal targeted updates to existing language models. TNT enables models to avoid specific undesirable outputs while maintaining their performance on other tasks.
“The motivation was to bring modeling closer to software development,” Zhang said. “As models become more important aspects of products, we want to enable more targeted feature development, similar to how you would make specific code changes when updating an app or website.”
The targeted nature of TNT updates makes the method well suited for iterative refinement of language models. “Conceptually, negative updates can be applied in any order,” explained Zhang. This allows for flexible, ongoing improvement of model behavior.
Existing techniques for constraining language model outputs often rely on modifying the decoding process, which can result in an increasingly complex prediction pipeline as the number of changes grows. In contrast, TNT is a fine-tuning-based approach, meaning iterative updates do not affect prediction-time complexity.
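As a rough illustration of what a minimal, targeted update could mean at the level of a single token, the sketch below takes a next-token distribution, removes the probability of a flagged token, and renormalizes what remains, so the relative preferences among all other tokens are unchanged. This is a toy under stated assumptions, not the paper's implementation: the function names, the example distribution, and the choice of KL divergence as a measure of how far a model sits from the target are all hypothetical and chosen only for illustration.

```python
import math


def renormalized_target(probs, banned):
    """Drop the flagged tokens and rescale the remaining probabilities to sum to 1."""
    kept = {tok: p for tok, p in probs.items() if tok not in banned}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("all probability mass was on flagged tokens")
    return {tok: p / total for tok, p in kept.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), summed over the tokens that p assigns nonzero probability."""
    return sum(
        pv * math.log(pv / max(q.get(tok, 0.0), eps))
        for tok, pv in p.items()
        if pv > 0
    )


# Hypothetical next-token distribution from an original model (assumption).
original = {"Paris": 0.6, "London": 0.3, "Berlin": 0.1}

# Hypothetical token-level annotation: "London" is the unwanted token here.
target = renormalized_target(original, banned={"London"})

# Distance of the unchanged model from the target; a fine-tuning objective
# in this spirit would try to drive a quantity like this toward zero.
gap = kl_divergence(target, original)
```

In this toy, `target` keeps "Paris" six times as likely as "Berlin", exactly as in `original`; only the flagged token is affected. Because the change would be absorbed into the model's weights during fine-tuning, nothing extra runs at prediction time, which is the contrast with decoding-time filters drawn above.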
Zhang and her collaborators tested TNT on two key tasks: reducing hallucinations in text summarization and avoiding toxic language in dialogue generation. Compared to baseline methods, TNT achieved better trade-offs between reducing unwanted behaviors and maintaining similarity to the original model outputs. Moreover, TNT introduced far fewer disfluencies, such as repetitive text, than other techniques.
While further work is needed to address limitations like the requirement for token-level annotations, TNT represents a promising step toward more controlled and responsible development of large language models. As these models become increasingly central to real-world applications, techniques like TNT that enable precise, targeted updates will be crucial for addressing safety and quality concerns.
By Stephen Thomas