Language Models Often Favor Their Own Text, Revealing a New Bias in AI
As artificial intelligence becomes increasingly adept at mimicking human language, a new bias is emerging: large language models (LLMs) often prefer text generated by themselves over that created by other AI systems or humans. This self-preference bias has significant implications for the development and deployment of AI systems, particularly in areas such as content moderation and model benchmarking.
Shi Feng, a Postdoctoral Researcher at CDS, and his colleagues have been investigating this phenomenon in their recent paper, “LLM Evaluators Recognize and Favor Their Own Generations.” Feng conducted the research in collaboration with his mentee Arjun Panickssery, a scholar in the ML Alignment & Theory Scholars (MATS) program, and Sam Bowman, an Associate Professor of Linguistics and Data Science at CDS and a Member of Technical Staff at Anthropic.
The researchers discovered that LLMs such as GPT-4 and Llama 2 have a non-trivial ability to recognize their own outputs, even without fine-tuning. GPT-4, for example, was 73.5% accurate in distinguishing its own text from that of other LLMs and humans. Fine-tuning these models led to near-perfect self-recognition, with GPT-3.5 and Llama 2 achieving over 90% accuracy after training on just 500 examples.
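In practice, that self-recognition number comes from pairwise tests: the evaluator is shown two texts, one it wrote and one it did not, and is asked to pick its own. The sketch below is only a rough illustration of that setup, not the authors' code; the prompt wording, the `query_model` stub, and the data format are all assumptions.

```python
import random

def query_model(prompt: str) -> str:
    # Stand-in for an API call to the evaluator LLM; here it guesses at
    # random so the sketch runs end to end (an assumption, not real code).
    return random.choice(["1", "2"])

def self_recognition_accuracy(pairs):
    """Estimate how often an evaluator picks out its own text.

    `pairs` holds (own_text, other_text) tuples, where own_text was generated
    by the evaluator itself and other_text by another LLM or a human
    (illustrative data format).
    """
    correct = 0
    for own_text, other_text in pairs:
        options = [own_text, other_text]
        random.shuffle(options)  # randomize order so position gives nothing away
        prompt = (
            "One of the following two summaries was written by you.\n"
            f"Summary 1: {options[0]}\n"
            f"Summary 2: {options[1]}\n"
            "Answer with 1 or 2 only."
        )
        answer = query_model(prompt).strip()
        own_label = "1" if options[0] == own_text else "2"
        correct += int(answer == own_label)
    return correct / len(pairs)

# A random guesser hovers near 50%; GPT-4 reportedly reached 73.5%.
demo_pairs = [(f"my summary {i}", f"their summary {i}") for i in range(200)]
print(f"self-recognition accuracy: {self_recognition_accuracy(demo_pairs):.1%}")
```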
Feng and his team also found a linear correlation between an LLM’s self-recognition capability and the strength of its self-preference bias. This suggests that the bias is not merely a coincidence but rather a result of the model’s ability to identify its own outputs.
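To make the correlation claim concrete, here is a minimal sketch of how one might check for such a linear relationship, assuming per-evaluator self-recognition and self-preference scores are already in hand. The numbers below are placeholders, not figures from the paper.

```python
import numpy as np

# Placeholder scores on a 0-1 scale (NOT results from the paper): how reliably
# each evaluator identifies its own text, and how strongly it favors that text
# when judging quality.
self_recognition = np.array([0.55, 0.62, 0.71, 0.80, 0.91])
self_preference = np.array([0.51, 0.58, 0.66, 0.74, 0.88])

# Pearson correlation plus a least-squares fit to eyeball the linear trend.
r = np.corrcoef(self_recognition, self_preference)[0, 1]
slope, intercept = np.polyfit(self_recognition, self_preference, 1)
print(f"Pearson r = {r:.2f}; fit: preference ~ {slope:.2f} * recognition + {intercept:.2f}")
```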
“If this sort of correlation between self-recognition and self-preference turned out to be a universal bias — like ordering bias — this means we should incorporate some countermeasure against it in basically all self-evaluation,” Feng said. He proposes that obfuscating authorship identity by paraphrasing inputs in various ways could be one approach to mitigate this bias.
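That countermeasure can be pictured as a small wrapper around the evaluation call: both candidate texts are reworded before the evaluator sees them, washing out the stylistic fingerprints that signal authorship. The sketch below is a hypothetical illustration; `paraphrase` and `evaluate_pair` are stand-ins for whatever paraphrasing model and evaluator are actually used.

```python
def paraphrase(text: str) -> str:
    # Stand-in: a real pipeline would call a rewriting model here so the
    # evaluator no longer sees its own characteristic phrasing (assumption).
    return text.lower().replace(",", ";")

def evaluate_pair(text_a: str, text_b: str) -> str:
    # Stand-in for the evaluator LLM's judgment; returns "A" or "B".
    return "A" if len(text_a) >= len(text_b) else "B"

def debiased_comparison(text_a: str, text_b: str) -> str:
    # Both candidates are reworded before judging, so any preference the
    # evaluator expresses cannot come from recognizing its own output.
    return evaluate_pair(paraphrase(text_a), paraphrase(text_b))

print(debiased_comparison("Answer written by model X.", "Answer written by a person."))
```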
The implications of this research extend beyond academic curiosity. As LLMs increasingly replace human annotation and data curation work, self-preference bias could lead to inflated safety scores and overly optimistic assessments of whether current monitoring and safety methods can provide reliable oversight for newer AI models. In the most extreme cases, it could even cause a monitoring system to overlook potentially harmful content simply because it resembles the AI’s own output.
Feng’s motivation for this work stems from a broader concern about the risks associated with LLMs communicating with each other, a scenario that is becoming more common as AI is used to both generate and evaluate content. Take resumes, for example. Someone might use a particular model to write their resume — and then a company might use that same model to evaluate it. “What used to be like a human to human interaction is now mediated by two copies of the same model,” said Feng. As this kind of thing happens more and more, said Feng, we want to be particularly careful, because “we really don’t know how that affects the decisions that we make.”
As AI continues to advance and integrate into various aspects of our lives, understanding and addressing biases like self-preference will be crucial to ensuring the technology’s safe and responsible development. Feng and his colleagues’ research is an important step in this direction, shedding light on a previously unexplored phenomenon and paving the way for further investigation and potential solutions.
By Stephen Thomas