Language Models’ Prediction of Current Events Degrades Over Time, Even With Latest Information
Large language models forget how to interpret new information accurately, even when provided with the most up-to-date context. That surprising finding emerged from research by CDS MS in Data Science student Amelia (Hui) Dai, CDS PhD student Ryan Teehan, and CDS Assistant Professor of Computer Science and Data Science Mengye Ren, who developed a new benchmark called Daily Oracle to track how well AI models maintain their ability to predict current events over time.
While human experts have traditionally made forecasts about everything from healthcare outcomes to financial markets, language models have recently emerged as promising alternatives due to their ability to learn from vast and diverse datasets. These AI systems have shown increasing capability to analyze historical events and identify patterns that could predict future outcomes. However, no previous research had systematically tracked how well these models maintain their predictive abilities over time.
The researchers found that language models’ performance declined by approximately 20% when making predictions about recent events, compared to their accuracy on older information. Most strikingly, this degradation occurred even in what should have been a simple reading comprehension task, where models were given the exact articles containing the answers to their questions.
“We originally included this test just to verify our questions were answerable,” Dai said during her presentation at NeurIPS. “But surprisingly, we observed that even when given the source articles, the downward trends persisted across all models. This suggests the decline isn’t just about lacking future knowledge — something in the models’ internal representations may be becoming outdated.”
The Daily Oracle benchmark draws from over a million news articles across five major news outlets, automatically generating questions that test both simple factual understanding and more complex forecasting abilities. The system produces an average of 17.3 question-answer pairs per day, spanning topics from politics to technology, in both true-or-false and multiple choice formats.
Example true-or-false question: Will the prosecution’s key witness in the New York hush money trial in April 2024 be someone other than Michael Cohen?
Example multiple-choice question: What will be the starting price range for the Google Pixel 8a as of May 2024? A. $599–$649, B. $199–$249, C. $750–$800, D. $499–$559.
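A daily record in a benchmark of this kind might be organized along the following lines. The field names and types here are illustrative assumptions, not the paper’s actual schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class OracleQuestion:
    """One auto-generated QA pair, keyed by the date it tests (hypothetical schema)."""
    question: str
    answer: str
    qtype: str                    # "true_false" or "multiple_choice" (assumed labels)
    choices: Optional[List[str]]  # None for true/false questions
    target_date: date             # the day the question resolves

# The true-or-false example from the article, encoded under this assumed schema:
tf = OracleQuestion(
    question=("Will the prosecution’s key witness in the New York hush money "
              "trial in April 2024 be someone other than Michael Cohen?"),
    answer="No",
    qtype="true_false",
    choices=None,
    target_date=date(2024, 4, 1),
)
```

Keeping a `target_date` on every record is what makes it possible to plot accuracy as a function of time and compare it against each model’s training cutoff.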
“We wanted a way to evaluate something that language models haven’t seen during training,” Ren said. “While there are other forecasting benchmarks, they lack an automatic way of generating testing data. With news articles, we can generate new testing data every day and track trends over time.”
The team discovered an unexpected pattern: models’ performance began declining even before their official knowledge cutoff dates. All tested models showed a performance drop after September 2021, which the researchers hypothesize may be related to increased restrictions on web scraping following ChatGPT’s release, potentially limiting the training data available to newer models.
To test whether access to current information could solve the problem, the team experimented with retrieval-augmented generation (RAG), allowing models to reference news articles up to different cutoff dates. While more recent information generally improved performance, it didn’t prevent the overall pattern of degradation.
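The cutoff-limited retrieval setup can be sketched as follows. The tiny corpus, the word-overlap scoring, and the prompt format are illustrative stand-ins, not the paper’s actual retrieval pipeline:

```python
from datetime import date

# Hypothetical mini-corpus; the real benchmark draws on over a million articles.
ARTICLES = [
    {"date": date(2024, 4, 10),
     "text": "Michael Cohen testified in the New York hush money trial."},
    {"date": date(2024, 5, 7),
     "text": "Google announced the Pixel 8a starting at $499."},
    {"date": date(2023, 9, 1),
     "text": "Markets rallied after the Federal Reserve held rates steady."},
]

def retrieve(question: str, cutoff: date, k: int = 2):
    """Return up to k articles published on or before `cutoff`,
    ranked by naive word-overlap with the question."""
    q_words = set(question.lower().split())
    candidates = [a for a in ARTICLES if a["date"] <= cutoff]
    return sorted(
        candidates,
        key=lambda a: len(q_words & set(a["text"].lower().split())),
        reverse=True,
    )[:k]

def build_prompt(question: str, cutoff: date) -> str:
    """Prepend the retrieved articles to the question, RAG-style."""
    context = "\n".join(a["text"] for a in retrieve(question, cutoff))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Moving the `cutoff` forward lets more recent articles enter the context window, which is how the experiment varies the information available to each model.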
The research points to the need for new approaches to keeping AI systems’ knowledge current. “This work provides the first clear evidence of how language models’ capabilities degrade temporally,” Ren said. “While we don’t have the solution for continuous updates yet, we’ve presented a practical scenario that opens up new directions for research into continual learning algorithms.” Continual learning would let models gradually acquire new knowledge over time, much as humans naturally learn about and adapt to changing world events without forgetting what they already know.
The implications extend beyond academic interest. “This has real applications in how we think about AI systems making predictions about future events,” Teehan said. “Just as analysts on a news program like Meet The Press reason about political outcomes or economic forecasts, it would be useful to understand how well AI can engage in this kind of predictive reasoning.”
The project was completed in less than six months and presented at NeurIPS 2024 by Dai, marking her first conference presentation. The Daily Oracle benchmark continues to generate new evaluation data daily, providing an ongoing measure of how well AI systems can understand and reason about our changing world.
By Stephen Thomas