
When Language Models Grade, the Average Score Wins

May 9, 2025 · 2 min read

Large language models have been keeping a quiet secret: their hidden probabilities judge text more reliably than their single “final answer.” In a new paper, “Improving LLM-as-a-Judge Inference with the Judgment Distribution,” CDS Assistant Professor Eunsol Choi, UT Austin undergraduate student Victor Wang, and Courant PhD student Michael J.Q. Zhang showed that reading the full probability spread behind a model’s score — then taking the mean — beats today’s standard practice of picking the most likely score in almost every test they ran.

The trio asked popular models such as GPT-4o and Llama-3.1-8B to grade answers from two benchmark suites, RewardBench and MT-Bench. Instead of accepting the models’ top-voted score token (“7”, “8”, etc.), they averaged the entire distribution. That change alone lifted accuracy in 92 out of 120 comparisons, sometimes by double-digit points. “The mean aggregates what the model already knows, while the mode throws most of that away,” Wang said.
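To make the contrast concrete, here is a minimal sketch of the two inference rules, assuming access to per-token log-probabilities (which most judge APIs expose) and a 1-10 grading scale. The helper names are illustrative, not the authors' released code.

```python
import math

SCORE_TOKENS = {str(s): s for s in range(1, 11)}  # grading scale "1".."10"

def mean_score(top_logprobs: dict[str, float]) -> float:
    """Expected score under the judge's full distribution over score tokens."""
    # Keep only tokens that parse as valid scores, then renormalize.
    probs = {SCORE_TOKENS[t]: math.exp(lp)
             for t, lp in top_logprobs.items() if t in SCORE_TOKENS}
    total = sum(probs.values())
    return sum(score * p for score, p in probs.items()) / total

def mode_score(top_logprobs: dict[str, float]) -> int:
    """Greedy decoding: pick the single most likely score token."""
    best = max((t for t in top_logprobs if t in SCORE_TOKENS),
               key=top_logprobs.get)
    return SCORE_TOKENS[best]

# Example: the judge puts 40% on "7", 35% on "8", 25% on "9".
lp = {"7": math.log(0.40), "8": math.log(0.35), "9": math.log(0.25)}
print(mode_score(lp))   # 7    (greedy keeps only the peak)
print(mean_score(lp))   # 7.85 (the mean uses the whole spread)
```

Nothing about the judge changes here; the same forward pass produces both numbers, and only the read-out rule differs.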

They then questioned another convention: forcing the judge model to explain its reasoning through chain-of-thought prompts before giving the score. Counterintuitively, the explanations hurt. When the authors removed that step, the score distribution became less peaked, preserving spread that helped mean aggregation improve further, especially on reasoning-heavy tasks. “Once the model spells out an explanation, the final number loses almost all uncertainty,” Wang noted. Dropping the explanation restored that useful spread.
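One way to see the effect is through the entropy of the score distribution. The numbers below are made up for illustration, not drawn from the paper, but they show the pattern Wang describes: after a chain-of-thought explanation, the distribution collapses, leaving the mean almost nothing extra to aggregate.

```python
import math

def entropy_bits(probs: dict[str, float]) -> float:
    """Shannon entropy (in bits) of a distribution over score tokens."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Made-up numbers for illustration only. After a chain-of-thought
# explanation, the judge's distribution tends to collapse onto one token,
# so mean and mode coincide; without CoT, useful spread remains.
with_cot    = {"7": 0.97, "8": 0.02, "9": 0.01}
without_cot = {"7": 0.45, "8": 0.35, "9": 0.20}
print(entropy_bits(with_cot))     # ~0.22 bits: nearly deterministic
print(entropy_bits(without_cot))  # ~1.51 bits: spread the mean can use
```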

Because the study compared point-wise, pair-wise, and list-wise setups, the team could propose concrete recommendations for practitioners. For large judges like GPT-4o, pair-wise ranking without chain-of-thought worked best; smaller judges preferred simple point-wise scoring, also without chain-of-thought. In every case, the safer bet was averaging rather than picking the single most likely score.
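The same idea carries over to the pair-wise setup. In a hedged sketch with illustrative names: rather than taking the argmax of the “A” versus “B” verdict tokens, read out a soft preference probability that downstream systems can weight.

```python
import math

def soft_preference(top_logprobs: dict[str, float]) -> float:
    """P(response A preferred), renormalized over the two verdict tokens."""
    p_a = math.exp(top_logprobs.get("A", float("-inf")))  # exp(-inf) == 0.0
    p_b = math.exp(top_logprobs.get("B", float("-inf")))
    return p_a / (p_a + p_b)

# Example: a 58/42 split is a weak win for A, not an absolute verdict.
lp = {"A": math.log(0.58), "B": math.log(0.42)}
print(soft_preference(lp))  # ~0.58, a confidence signal rather than a coin flip
```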

The findings mattered beyond leaderboard tweaks. Companies already use “LLM-as-a-Judge” systems to grade student essays, screen legal drafts and filter medical advice. A habit as small as swapping the mode for the mean could cut annotation budgets while matching human raters more closely. And because the mean exposes the judge’s confidence on a continuous scale, downstream systems can weight that signal instead of treating every win as absolute.

Future work may tune models directly for distributional output, but the authors chose an intentionally frugal path: they left the judges untouched and only changed the inference rule. Their code is public, so any lab can plug the method into its own pipeline.

Greedy decoding — i.e., picking the modal next token — was quick, but it ignored most of the evidence on the table. Wang and colleagues showed that a quiet average, calculated in milliseconds, saw more — and scored better.

By Stephen Thomas
