LiveBench: Challenging Language Models with Contamination-Free Questions

NYU Center for Data Science
Aug 2, 2024


Language models have been cheating on their tests without even knowing it. This unintentional academic dishonesty has made it increasingly difficult to accurately measure the true capabilities of artificial intelligence systems.

CDS Faculty Fellow Ravid Shwartz-Ziv, CDS founding director Yann LeCun, recently departed CDS postdoctoral researcher Micah Goldblum, and many other co-authors, including the project's two leads, Colin White and Samuel Dooley of Abacus.AI, have developed a solution to this problem. Their new benchmark, LiveBench, introduces a novel approach to evaluating language models by using constantly updated, contamination-free questions.

“Benchmarks are really the core of progress in machine learning,” Goldblum said. “They give us a target. If I raise this number, that means I’m making my model better.”

The challenge with existing benchmarks is that as language models are trained on vast swaths of internet data, they inadvertently learn the answers to test questions. This “test set contamination” is a kind of cheating, in that it enables a model to simply regurgitate an answer it’s already seen to a question that’s supposed to be novel. This contamination, unfortunately, renders many benchmarks obsolete shortly after their creation.

LiveBench addresses this issue by generating new questions on a monthly basis, drawing from recent sources such as newly published arXiv papers, recent math olympiad problems, and current events. Introduced in a recent preprint, “LiveBench: A Challenging, Contamination-Free LLM Benchmark,” this approach ensures that the questions remain fresh and unseen by the models during their training.

The benchmark covers six categories: math, coding, reasoning, language comprehension, instruction following, and data analysis. It includes tasks like solving high school math competition problems, generating code, and analyzing recent datasets from Kaggle and Socrata.
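To make that structure concrete, here is a minimal sketch in Python of how per-task results might roll up into category and overall scores. The category names follow the article, but the placeholder scores and the averaging convention are illustrative assumptions, not LiveBench's actual data or code.

```python
from statistics import mean

# The six LiveBench categories named in the article. The per-task scores
# below are placeholders; the real tasks and results live in the project's
# public leaderboard and repository.
category_scores = {
    "math": [0.62, 0.48],
    "coding": [0.55, 0.41],
    "reasoning": [0.50, 0.44],
    "language comprehension": [0.58, 0.47],
    "instruction following": [0.71, 0.66],
    "data analysis": [0.53, 0.49],
}

# One common convention: average tasks within each category, then average
# the category scores into a single headline number.
per_category = {cat: mean(scores) for cat, scores in category_scores.items()}
overall = mean(per_category.values())

print(per_category)
print(f"overall: {overall:.3f}")
```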

One of LiveBench’s key innovations is its method of evaluation. Unlike some recent benchmarks that rely on human judges or other language models to assess responses, LiveBench uses objective, ground-truth scoring. This eliminates potential biases and errors in the evaluation process.
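For a sense of what judge-free scoring can look like, here is a minimal sketch. It is not LiveBench's released grading code; the normalization rules and the assumption that the final answer has already been parsed out of the model's response are illustrative.

```python
import re

def score_exact_match(model_answer: str, ground_truth: str) -> float:
    """Judge-free scoring: return 1.0 if the model's parsed answer matches
    the stored ground truth after light normalization, else 0.0."""
    def normalize(s: str) -> str:
        # Collapse whitespace, lowercase, and drop a trailing period so that
        # trivial formatting differences don't change the verdict.
        return re.sub(r"\s+", " ", s.strip().lower()).rstrip(".")
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

# Hypothetical usage: the answer is assumed to have been extracted from the
# model's full response (e.g. the text inside a requested answer marker).
print(score_exact_match("42 ", "42"))        # 1.0
print(score_exact_match("forty-two", "42"))  # 0.0
```

Because every question ships with a known correct answer, this kind of grader needs no human rater and no second language model, which is what removes the biases the team was worried about.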

“We went to a lot of effort to ensure that people really use it,” Shwartz-Ziv explained. “We wrote a paper, a blog post, and we have a leaderboard that people can use and compare different models.”

The team has already evaluated dozens of models, both proprietary and open-source, ranging in size from 0.5 billion to 8x22 billion parameters. The results show that LiveBench is indeed challenging, with top models achieving less than 65% accuracy across all tasks.

LiveBench’s approach offers benefits for both model developers and users. Developers can use the benchmark to identify which changes improve their models’ capabilities, while users can make informed decisions about which models best suit their needs.

The project, which began as a conversation between Goldblum and Colin White of Abacus.AI (the paper’s first author), quickly grew into a large-scale collaborative effort involving researchers from multiple institutions. The team’s diverse expertise allowed them to create a comprehensive benchmark that addresses many of the shortcomings of previous evaluation methods.

As the AI community continues to push the boundaries of language model capabilities, LiveBench stands poised to provide a reliable, transparent, and ever-evolving measure of progress. With its commitment to frequent updates and community engagement, this new benchmark may well become the gold standard for language model evaluation in the years to come.

By Stephen Thomas
