Overcoming the AI Data Crisis: A New Solution to Model Collapse
Researchers have discovered a way to prevent AI models from deteriorating when trained on synthetic data, potentially averting a looming crisis in artificial intelligence. A collaborative effort led by CDS Silver Professor of Computer Science, Mathematics, and Data Science Julia Kempe, CDS PhD Student Yunzhen Feng, and Meta AI scientist Elvis Dohmatob has not only provided a new mathematical proof for the widely recognized problem of “model collapse,” but has also proposed a novel solution to mitigate its effects.
In a series of three papers, Kempe, Feng, Dohmatob, and their collaborators investigated the impact of using AI-generated data to train subsequent generations of AI models. Their first paper, “A Tale of Tails: Model Collapse as a Change of Scaling Laws,” which appeared at the International Conference on Machine Learning (ICML) 2024, reveals a concerning trend: as more synthetic data is incorporated into training datasets, the traditional scaling laws that have driven AI progress (the predictable improvement in performance as training data grows) begin to break down.
“We found that models trained on synthetic data eventually hit a performance plateau,” Feng explained. “No matter how much more data you add, the model stops improving beyond a certain point.”
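As a rough schematic of what breaks down (illustrative notation, not the exact equations derived in the paper): classical scaling laws say that test loss falls as a power of the training-set size, while training on synthetic data effectively adds a floor that the loss cannot drop below, no matter how much data is added.

```latex
% Illustrative only; A, \alpha, and E_{\infty} are placeholder constants,
% not the quantities derived in "A Tale of Tails".
% Classical scaling law: more training data T keeps lowering the test loss.
L(T) \approx A\,T^{-\alpha}
% With synthetic data in the training mix, a floor term appears:
L(T) \approx A\,T^{-\alpha} + E_{\infty}
% As T grows, L(T) approaches E_{\infty}: the plateau Feng describes.
```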
This phenomenon, called “model collapse,” occurs because AI-generated data lacks the rich diversity found in real-world data. AI models tend to focus on the most common patterns and lose the nuanced “long-tail” information crucial for continued improvement.
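One can see this dynamic in a minimal simulation (a hypothetical sketch, not the authors’ code): fit a simple Gaussian model to data, sample a new dataset from the fit, refit, and repeat. Generation after generation, the fitted spread tends to shrink and the tail of the distribution disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # small samples make the effect visible within a few dozen generations

# Generation 0: "real" data from a standard Gaussian.
data = rng.normal(0.0, 1.0, size=n)

for gen in range(51):
    # "Train" a toy generative model: estimate mean and spread from current data.
    mu, sigma = data.mean(), data.std()
    if gen % 10 == 0:
        # Long-tail mass: fraction of points beyond 2.0 in the original units.
        tail = np.mean(np.abs(data) > 2.0)
        print(f"gen {gen:2d}: fitted sigma = {sigma:.3f}, tail mass = {tail:.3f}")
    # The next generation trains only on samples drawn from the fitted model.
    data = rng.normal(mu, sigma, size=n)
```

Each refit slightly underestimates the spread and discards rare points, so the losses compound: the common center of the distribution survives while the long tail collapses.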
The implications of this research are far-reaching. As AI-generated content proliferates online, future AI models trained on web-scraped data will inevitably encounter increasing amounts of synthetic information. This could slow or even halt the rapid progress the field has experienced in recent years. (The significance of this issue was recently highlighted on the cover of Nature.)
While the problem of model collapse was already known in the AI community, the work of Kempe, Feng, Dohmatob, and their co-authors provides the first analytic mathematical characterization of the phenomenon. Their second paper, “Model Collapse Demystified: The Case of Regression,” presented as a poster at the ICLR Workshop on Bridging the Gap Between Practice and Theory in Deep Learning (BGPT) 2024, goes deeper into the theoretical underpinnings of model collapse.
“Our theoretical framework allows us to predict and quantify the decay in model performance across multiple generations,” Feng said. “This gives us a solid foundation for understanding and addressing the issue.”
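As a concrete illustration of that kind of prediction, here is a hypothetical toy version of the regression setting (assumed for illustration, not the paper’s code): each generation fits ordinary least squares to labels produced by the previous generation’s model plus fresh noise, so estimation errors accumulate and the test error against the true labels grows roughly linearly across generations.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_train, n_test = 20, 50, 5_000
sigma = 0.5  # label noise added at every generation

w_true = rng.normal(size=d)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true  # noiseless ground truth, used only for evaluation

w = w_true  # generation 0 "labeler" is the true model
for gen in range(1, 8):
    # Fresh inputs, but labels come from the PREVIOUS generation's model.
    X = rng.normal(size=(n_train, d))
    y = X @ w + sigma * rng.normal(size=n_train)
    # Ordinary least squares fit on the synthetic labels.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    test_err = np.mean((X_test @ w - y_test) ** 2)
    print(f"generation {gen}: test MSE vs. true labels = {test_err:.4f}")
```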
But the team didn’t stop at identifying the problem. In their third paper, “Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement,” accepted as a poster at the ICML 2024 Workshop on Theoretical Foundations of Foundation Models, they proposed a solution: using reinforcement techniques to curate high-quality synthetic data. By employing external verifiers, such as existing metrics, separate AI models, oracles, or humans, to rank and select the best AI-generated data, they demonstrated that it’s possible to overcome the performance plateau.
“With careful data curation, we can actually push model performance beyond that of the generator,” Feng said. This approach could pave the way for continued AI advancement even in a world awash with synthetic data.
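In code, the curation step amounts to best-of-n selection with an external verifier. The sketch below uses hypothetical generate and verify functions standing in for the generator model and whichever verifier (metric, model, oracle, or human rating) is available; it is a minimal illustration, not the paper’s implementation.

```python
from typing import Callable, List, Tuple

def curate_synthetic_data(
    prompts: List[str],
    generate: Callable[[str], str],       # hypothetical: the data-generating model
    verify: Callable[[str, str], float],  # hypothetical: external verifier, higher = better
    n_candidates: int = 8,
    keep_top: int = 1,
) -> List[Tuple[str, str]]:
    """Best-of-n selection: sample several candidates per prompt, rank them
    with the verifier, and keep only the highest-scoring outputs for training."""
    curated = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=lambda c: verify(prompt, c), reverse=True)
        curated.extend((prompt, c) for c in ranked[:keep_top])
    return curated
```

The key design point is that the verifier injects information the generator alone does not have, which is what makes it possible for the trained model to surpass the model that produced its data.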
The researchers tested their method on various tasks, including mathematical problem-solving and news summarization. In both cases, they found that their reinforcement-based data curation technique significantly improved model performance, even when training on synthetic data.
As AI continues to evolve and integrate more deeply into our digital ecosystem, understanding and mitigating the challenges posed by synthetic data will be crucial. The research of Kempe, Feng, Dohmatob, and their collaborators not only sounds a warning bell but also offers a potential roadmap for navigating this complex landscape.
By Stephen Thomas