Why Win Rate Should Be the Guiding Principle in Preference Learning
Interest in training generative models with preference data has exploded in recent years, but is there a unified way to understand the barrage of methods and efforts in this space? CDS PhD student Lily H. Zhang and CDS Associate Professor of Computer Science and Data Science Rajesh Ranganath tackle this conceptual gap in their paper “Preference learning made easy: Everything should be understood through win rate.”
“This is the paper that I wish I could have read first, before reading all the other methods and analysis papers in this space,” Zhang said.
While other machine learning tasks have well-established evaluation metrics, preference learning has lacked a unifying framework. When you present an AI model with pairs of outputs and tell it one is preferred over the other, how do you evaluate whether the model is learning effectively?
Zhang and Ranganath demonstrate that win rate, meaning how often a model generates content preferred over a competitor's, is the only evaluation metric that truly respects both the preferences in the data and the relative prevalence of the outputs being compared.
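To make the quantity concrete, here is a minimal Monte Carlo sketch of estimating win rate. The function names are hypothetical, and `prefer_prob(a, b)` stands in for whatever judge (human labels or a learned preference model) returns the probability that output `a` beats output `b`:

```python
import random

def estimate_win_rate(model_samples, reference_samples, prefer_prob, n_pairs=10_000):
    """Monte Carlo estimate of win rate: how often an output drawn from the model
    is preferred over an output drawn from the competitor."""
    wins = 0.0
    for _ in range(n_pairs):
        a = random.choice(model_samples)      # output from the model being evaluated
        b = random.choice(reference_samples)  # output from the competitor model
        wins += prefer_prob(a, b)             # probability that a is preferred over b
    return wins / n_pairs

# Toy usage with an illustrative length-based preference judge:
judge = lambda a, b: 0.5 if len(a) == len(b) else float(len(a) > len(b))
print(estimate_win_rate(["a longer answer", "ok"], ["ok", "no"], judge))
```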
“The information has to come from somewhere,” Ranganath often says, according to Zhang. “If it’s not from the data, it’s from some assumption that you’re baking in.”
This insight shaped their approach to understanding preference learning. They created a comprehensive framework that sorts existing methods by whether or not they perform direct win rate optimization (WRO).
Their analysis reveals that while WRO methods are theoretically superior, they often underperform in practice due to optimization difficulties. Meanwhile, popular non-WRO methods like Direct Preference Optimization (DPO) and Supervised Fine-Tuning (SFT) have limitations either in how closely their training loss tracks win rate or in how much win rate improvement they can achieve.
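For concreteness, DPO trains the policy with a logistic loss on how much more it favors the preferred response over the dispreferred one, relative to a frozen reference model. Below is a minimal per-example sketch in NumPy; the log-probability inputs are hypothetical stand-ins for the summed token log-probabilities of each response:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss, given log-probabilities of the preferred ("chosen")
    and dispreferred ("rejected") responses under the policy being trained and
    under a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    return np.logaddexp(0.0, -margin)

# Example: the policy already ranks the chosen response higher than the reference does,
# so the loss falls below log(2), its value when policy and reference coincide.
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```

A falling DPO loss does not by itself guarantee a rising win rate, which is why the authors recommend checking win rate directly when selecting models.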
“A common thing that’s done for preference learning is to first do some supervised fine-tuning on your preferred sample,” Zhang explained. “We show that the expected win rate improvement of this procedure is directly a function of your original model’s diversity in how preferred its samples are.”
This helps explain why certain approaches succeed or fail in different settings. For example, if a model only generates responses that are equally preferred to one another, you won’t be able to improve it through preference learning on its own responses.
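The intuition can be seen in an idealized toy simulation (with made-up preference scores, not the paper's analysis): if fine-tuning on the preferred of two of the model's own samples worked perfectly, the expected preference score would rise by roughly the gap between the better of a pair and an average sample, and that gap vanishes when all samples are equally preferred.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_gain(sample_scores, n_pairs=100_000):
    """Toy estimate of the gain from 'keep the preferred of two of the model's own
    samples and fine-tune on it', assuming the fine-tuned model matches the kept
    samples exactly: E[max of a pair] minus the base model's average score."""
    a = rng.choice(sample_scores, n_pairs)
    b = rng.choice(sample_scores, n_pairs)
    return np.maximum(a, b).mean() - sample_scores.mean()

diverse = np.array([0.1, 0.5, 0.9])   # samples vary in how preferred they are
uniform = np.array([0.5, 0.5, 0.5])   # all samples equally preferred

print(expected_gain(diverse))   # positive: there is something to learn from
print(expected_gain(uniform))   # ~0: preference learning on own samples cannot help
```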
The work offers practical takeaways for AI researchers: WRO methods benefit from multiple training runs with different random seeds. DPO and similar approaches should use win rate rather than loss for model selection. And for SFT methods, increasing the diversity of training examples can yield significant gains.
The authors provide a roadmap for future preference learning research, suggesting either improved win rate optimization techniques or better surrogate objectives that more closely align with win rate.
“Once I had spent a lot of time reading papers in this field, I kept feeling like there must be some way to make sense of it all,” Zhang said. “This paper tries to provide that unified framework.”
By Stephen Thomas