New Research Finds Method to Reduce AI Language Models’ Biased Reasoning

3 min read3 days ago

In a paper last year, recently departed CDS Junior Research Scientist Miles Turpin, CDS Associate Professor of Linguistics and Data Science Samuel R. Bowman, CDS Research Scientist Julian Michael, and Anthropic’s Ethan Perez found that asking an AI language model to explain its reasoning often yields rationalizations that fail to fully account for the factors influencing the model’s outputs. In a recent follow-up paper, “Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought,” co-authored with James Chua, Edward Rees, and Hunar Batra, however, they introduce a novel training method that shows promise in reducing this problem of unfaithful and biased reasoning in language models.

The earlier paper introducing the problem “got a lot of attention,” said Turpin. “It was showing that if you give a model a sequence of multiple-choice questions where the correct answer is always ‘A’, the model will give rationalizations about why the answer to a new question should be ‘A’ without mentioning that it was influenced by the very obvious ‘all-As’ pattern.” This is an example of the kind of “bias” Turpin and his colleagues are trying to inoculate models against.

Chain-of-thought prompting, where models are asked to provide step-by-step reasoning before giving a final answer, has generated excitement as one path towards greater explainability of language models. However, the rationalizations outlined in the earlier paper, showing clear examples of a model’s explanations of itself diverging significantly from what it’s actually doing under the hood, have, for some, been a major cause for concern.

To tackle this issue, the new paper introduces bias-augmented consistency training (BCT). The method works by first having the model generate reasoning without any biasing features in the prompt. This “unbiased reasoning” is then used as the target when training the model on prompts that do include various biases. “We train the model to give that unbiased reasoning even when we insert biases into the prompt,” explained Turpin. “Reducing models’ sensitivity to biases unverbalized in explanations results in models that more reliably behave as their explanations would suggest.”

Crucially, the researchers find that training the model to be insensitive to one type of bias using BCT helps reduce its sensitivity to other, held-out biases as well. “It’s hard to anticipate at deployment time all the undesirable features a model will be sensitive to,” said Turpin. “This generalization is a promising sign that BCT can reduce biased reasoning even on biases we haven’t anticipated.”

The applications are potentially far-reaching, especially as language models become more advanced and are used for higher stakes purposes, such as in medicine, law, and national security. “Eliciting accurate explanations is fundamental to trusting models,” said Turpin. “If we can understand why they’re giving certain outputs, we can detect if they’re using unsafe or flawed reasoning that we wouldn’t approve of.”

While the paper is currently under review, the findings point to an important step forward in developing trustworthy AI systems. “As models get smarter, being able to understand the process they use to give an answer is useful for detecting flaws or outputs that are undesirable or harmful in some way,” said Turpin. “Explanations are a very fundamental way that humans establish trust in each other, and that is beginning to apply to models too as they become more advanced.”

By Stephen Thomas

New Research Finds Method to Reduce AI Language Models’ Biased Reasoning

Written by NYU Center for Data Science