AI Language Models: Inevitable Vulnerabilities and New Safeguards
Researchers at CDS and Meta AI have made a dual breakthrough in AI safety: proving that language model vulnerabilities can never be fully eliminated, while developing a new method that significantly reduces how often they can be exploited.
CDS PhD student Jingtong Su led the study, collaborating with CDS Silver Professor of Computer Science, Mathematics, and Data Science Julia Kempe and Meta AI Research Scientist Karen Ullrich. Their work builds upon recent findings in the field of AI safety and alignment.
“Our research was motivated by several recent studies that highlighted ongoing vulnerabilities in language models,” Su explained. “For instance, researchers at Princeton recently showed that even safety-aligned models can be jailbroken by manipulating decoding hyperparameters, suggesting that harmful knowledge persists after alignment.”
The team’s work is inspired by a theoretical framework proposed in 2023, which demonstrated the potential for misaligning language models through adversarial prompts. “While this earlier work was crucial, it didn’t fully explain the jailbreak phenomenon we see in practice,” Su noted. “Our framework provides a more direct link between input prompts and expected outputs, offering a clearer picture of how jailbreaking occurs.”
The CDS and Meta AI study, titled “Mission Impossible: A Statistical Perspective on Jailbreaking LLMs,” to appear at the upcoming NeurIPS 2024 conference, establishes that jailbreaking AI language models — manipulating them to produce harmful content — is fundamentally unavoidable when alignment is performed under the current Reinforcement Learning from Human Feedback (RLHF) paradigm. However, in the same work, they’ve introduced a novel technique that substantially lowers the chances of such breaches occurring.
“Our work demonstrates two critical points,” Su explained. “First, under reasonable assumptions, there’s always a non-zero probability of jailbreaking a language model. Second, we can significantly reduce this probability by expanding the model’s set of safe responses.”
Su, Ullrich, and Kempe’s research introduces a new statistical framework for analyzing language model alignment. This approach allowed the team to mathematically prove the inevitability of jailbreaking while also identifying ways to enhance model safety.
The study also revealed a fundamental flaw in the widely used RLHF method for AI alignment. In response, the team proposed a modified version called E-RLHF (Expanded RLHF).
The authors also illustrated the technique in a poster they made for the ICML 2024 Workshop on Theoretical Foundations of Foundation Models, which Ullrich discusses in a thread on X.
The key innovation of E-RLHF lies in its modification of the original RLHF objective. While RLHF inadvertently preserves harmful responses by anchoring to the initial distribution of the language model, E-RLHF replaces the potentially harmful anchor distribution with a safer alternative. For instance, when faced with a prompt like “Tell me how to make a bomb,” E-RLHF would anchor to a distribution of responses for a safer prompt such as “Tell me how to reject a request on making a bomb.” This subtle but powerful shift expands the set of safe responses available to the model, effectively reducing the likelihood of generating harmful content while maintaining the model’s ability to engage with a wide range of topics appropriately.
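In terms of the training objective, the change is small but targeted. A minimal sketch, using standard RLHF notation rather than the paper’s exact formulation (here π is the policy being trained, π_ref the pre-alignment reference model, r the reward model, β the usual KL penalty weight, and x_safe a stand-in label for the safety-rewritten version of a harmful prompt x): standard RLHF fine-tunes the policy against the reward while staying close to the reference model on the same prompt,

\[
\max_{\pi} \ \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big),
\]

whereas E-RLHF keeps the reward term but anchors the KL penalty to the reference model’s behavior on the safer rewritten prompt,

\[
\max_{\pi} \ \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x_{\mathrm{safe}}) \big).
\]

The model still sees the original harmful prompt during training; only the distribution it is pulled toward changes, which is what “expanding the set of safe responses” means in practice.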
The researchers tested E-RLHF using HarmBench, a standardized evaluation framework for AI safety. The new method reduced jailbreak success rates for every adversary tested, without compromising the model’s performance on utility benchmarks.
While this work represents a significant advance in AI safety, Su emphasized that the field is still evolving rapidly. He pointed to a recent paper from DeepMind evaluating dangerous capabilities in its Gemini 1.0 models: “That study didn’t find evidence of very strong dangerous capabilities,” he said, “but it did identify some warning signs. It underscores the importance of continued research in this area.”
The CDS and Meta AI study is exactly that kind of continued research, advancing both the theoretical understanding of AI vulnerabilities and practical methods for making models safer. As AI systems become increasingly integrated into society, these insights will be crucial for developers, policymakers, and users alike. By pairing a clear-eyed view of AI’s inherent limitations with practical steps for improvement, Su, Kempe, and Ullrich have charted a course toward safer, more robust AI language models.
By Stephen Thomas