Preference Learning Algorithms Fail to Learn Human Preference Rankings

NYU Center for Data Science
2 min read · Sep 19, 2024

Language models trained to align with human preferences rarely achieve high ranking accuracy on those same preferences, according to new research from CDS PhD student Angelica Chen and colleagues. Their study reveals fundamental flaws in popular alignment techniques like reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO).

“Most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets,” Chen said. This means that when given two potential responses, the models often fail to rank the human-preferred option higher.
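
To make the metric concrete, here is a minimal sketch of one simple way to estimate ranking accuracy: count how often a model assigns a higher log-likelihood to the human-preferred response than to the rejected one. It assumes the Hugging Face transformers library; the model name and toy preference pair are placeholders, not the models or datasets evaluated in the paper, and the paper's exact evaluation protocol may differ.

```python
# Minimal sketch (not the paper's evaluation code): ranking accuracy as the fraction
# of preference pairs where the model assigns a higher log-likelihood to the chosen
# response. "gpt2" and the toy pair below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies various preference-tuned models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def response_logprob(prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` conditioned on `prompt`."""
    # Note: tokenizing prompt and prompt+response separately can mis-split at the
    # boundary for some tokenizers; acceptable for a rough sketch.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits[0, :-1]              # position t predicts token t+1
    targets = full_ids[0, 1:]
    token_logprobs = torch.log_softmax(logits, dim=-1)[
        torch.arange(targets.numel()), targets
    ]
    return token_logprobs[prompt_len - 1:].sum().item()  # keep only response tokens

def ranking_accuracy(pairs):
    """pairs: list of (prompt, chosen_response, rejected_response) triples."""
    correct = sum(
        response_logprob(p, chosen) > response_logprob(p, rejected)
        for p, chosen, rejected in pairs
    )
    return correct / len(pairs)

# Toy example; real evaluations use human preference datasets.
pairs = [("Q: What is the capital of France?\nA:", " Paris.", " Berlin.")]
print(f"ranking accuracy: {ranking_accuracy(pairs):.2f}")
```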

The researchers analyzed several open-source language models across multiple preference datasets. They found that even under idealized training conditions, the theoretically achievable ranking accuracy sometimes falls below 100%. More concerning, real-world models exhibited large “alignment gaps”: differences of up to 59 percentage points between their actual ranking accuracies and the theoretical maximums.

Chen and her collaborators traced this issue back to the DPO training objective. “DPO rarely flips the ranking of the two continuations,” Chen explained. “It’s actually very difficult for DPO to correct even mild ranking errors in the reference model.”

Their analysis showed that to flip an incorrect ranking, the DPO loss would need to be reduced to an extremely small value — often infeasibly small in practice. This helps explain why models struggle to achieve high ranking accuracy even on their training data.
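
To see roughly why, one can plug the standard DPO objective (a logistic loss on the implicit reward margin) into the condition for the policy to rank the preferred response higher. The back-of-the-envelope sketch below illustrates the resulting loss threshold; the β value and reference-model log-probability gaps are made up for illustration, and the paper's formal bound may differ in form.

```python
# Back-of-the-envelope illustration (not the paper's exact theorem). The standard DPO
# loss on one example is L = -log(sigmoid(margin)), where
#   margin = beta * [(log pi(y_w|x) - log pi(y_l|x)) - (log pi_ref(y_w|x) - log pi_ref(y_l|x))].
# The policy ranks y_w above y_l exactly when margin > beta * ref_gap, where
# ref_gap = log pi_ref(y_l|x) - log pi_ref(y_w|x) > 0 measures how strongly the
# reference model misranks the pair. Equivalently, the per-example DPO loss must fall
# below log(1 + exp(-beta * ref_gap)).
import math

def loss_needed_to_flip(ref_gap_nats: float, beta: float) -> float:
    """DPO loss a training example must reach before the policy's own ranking flips."""
    return math.log1p(math.exp(-beta * ref_gap_nats))

# Sequence-level log-probability gaps of tens of nats are plausible for long responses;
# these numbers are illustrative, not taken from the paper.
for ref_gap in (5.0, 20.0, 50.0, 100.0):
    threshold = loss_needed_to_flip(ref_gap, beta=0.1)
    print(f"reference-model gap = {ref_gap:5.1f} nats -> loss must drop below {threshold:.1e}")
```

Even modest reference-model errors translate into per-example loss targets far smaller than what typical DPO runs reach, which is consistent with the paper's observation that DPO rarely corrects the reference model's ranking mistakes.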

The study also explored the relationship between ranking accuracy and win rate, a popular metric for evaluating aligned models. Interestingly, the two metrics were strongly correlated only early in training, while the model remained close to its starting point, the reference model.

“Once it gets too far away from that reference model, these metrics start to actually anti-correlate,” Chen noted. This suggests that offline metrics like ranking accuracy may not reliably predict online performance as training progresses.
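
For context, win rate is usually measured online: sample a response from the tuned model and one from a baseline (often the reference model) for each prompt, and ask a judge, human or a strong LLM, which it prefers. A schematic sketch follows; the generation and judging functions are hypothetical placeholders, not the paper's setup.

```python
# Schematic sketch of win rate (not the paper's evaluation pipeline). `generate_tuned`,
# `generate_baseline`, and `judge_prefers_a` are hypothetical placeholders for a
# sampling routine and a human/LLM judge.
from typing import Callable, Sequence

def win_rate(
    prompts: Sequence[str],
    generate_tuned: Callable[[str], str],
    generate_baseline: Callable[[str], str],
    judge_prefers_a: Callable[[str, str, str], bool],
) -> float:
    wins = 0
    for prompt in prompts:
        a = generate_tuned(prompt)      # response from the preference-tuned model
        b = generate_baseline(prompt)   # response from the baseline/reference model
        if judge_prefers_a(prompt, a, b):
            wins += 1
    return wins / len(prompts)
```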

While the research identifies key limitations in current preference learning algorithms, Chen remains optimistic about improving model alignment. She hopes these insights will motivate the development of better algorithms and more fine-grained analyses of preference training dynamics.

The full paper, “Preference Learning Algorithms Do Not Learn Preference Rankings,” is available on arXiv. The work was conducted in collaboration with Sadhika Malladi at Princeton, Qiuyi (Richard) Zhang at Google DeepMind, Xinyi Chen at both Princeton and Google DeepMind, as well as CDS PhD student Lily H. Zhang, CDS Associate Professor of Computer Science and Data Science Rajesh Ranganath, and CDS Professor of Computer Science and Data Science Kyunghyun Cho.

By Stephen Thomas

