The ImageNet Moment for Physics Simulation: CDS Researcher Leads Creation of “The Well”
Machine learning models must crawl through massive amounts of diverse data before they can walk confidently across complex tasks. While language models have internet-scale text repositories and image models have billions of photos, physics-based machine learning has lacked a similarly comprehensive benchmark — until now.
CDS Senior Research Scientist Shirley Ho has led an international team to create The Well, a groundbreaking collection of physics simulations designed to serve as a unified benchmark for numerical machine learning models.
“For any large machine learning model, you need an extensive, internet-scale dataset to train on,” Ho explained. “When machine learning first started making significant advances in computer vision, researchers created ImageNet as a standardized benchmark that allowed people to compare different methods and models. But until now, we haven’t had a comprehensive dataset for numerical simulations that serves the same purpose for physics-based machine learning.”
The Well fills this critical gap, providing 15 TB of data across 16 different physical settings. These simulations range from biological systems and active matter to aerospace engineering scenarios and even neutron star mergers. Each simulation essentially functions as a multi-dimensional movie, allowing researchers to test how well their models can predict the next frame, or several frames, in the physical sequence.
“They are all movies,” Ho explained. “When you think about it, they are 2D or 3D movies of numerical simulations.”
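To make the "movie" framing concrete, here is a minimal sketch of what next-frame and multi-frame prediction can look like. Everything in it is an illustrative assumption rather than The Well's actual data format, API, or benchmark code: the toy 1D simulation, the two stand-in models, and the normalized error metric are all hypothetical.

```python
import numpy as np

def make_toy_movie(n_frames=50, n_points=64, speed=1):
    """Toy 'simulation' (an assumption): a bump moving around a periodic 1D domain."""
    x = np.arange(n_points)
    first = np.exp(-0.5 * ((x - n_points / 2) / 4.0) ** 2)
    # Frame t is the first frame shifted by t * speed grid cells.
    return np.stack([np.roll(first, t * speed) for t in range(n_frames)])

def persistence_model(frame):
    """Baseline: predict that the next frame equals the current one."""
    return frame

def shift_model(frame, speed=1):
    """Hypothetical stand-in for a trained model that captures the dynamics exactly."""
    return np.roll(frame, speed)

def rollout(model, first_frame, n_steps):
    """Autoregressive rollout: feed each prediction back in as the next input."""
    frames, current = [], first_frame
    for _ in range(n_steps):
        current = model(current)
        frames.append(current)
    return np.stack(frames)

def normalized_rmse(pred, target):
    """Per-frame RMSE divided by the standard deviation of the true frame."""
    rmse = np.sqrt(np.mean((pred - target) ** 2, axis=-1))
    return rmse / np.std(target, axis=-1)

movie = make_toy_movie()
n_steps = 10
target = movie[1 : n_steps + 1]  # the frames the models should reproduce
for name, model in [("persistence", persistence_model), ("shift", shift_model)]:
    err = normalized_rmse(rollout(model, movie[0], n_steps), target)
    print(f"{name}: 1-step error {err[0]:.3f}, {n_steps}-step error {err[-1]:.3f}")
```

The rollout is the important part: a model that looks fine when predicting one frame ahead can drift badly once its own predictions are fed back in, which is why comparing one-step and several-step errors is informative.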
Beyond simply providing the data, the project delivers a comprehensive toolkit for the research community. “We not only provide the simulations that they can train on, but also provide the tasks they should try, and also provide benchmark codes,” Ho noted. This standardized approach enables direct comparison of different machine learning techniques.
The project represents a remarkable collaborative effort, with two lead scientists, Ruben Ohana of Polymathic AI and the Flatiron Institute, and Michael McCabe of Polymathic AI (soon to join CDS), coordinating contributions from experts across multiple institutions. Together they’ve created not just a dataset but a potential catalyst for the field’s advancement.
The potential impact of The Well goes far beyond academic benchmarking. Ho envisions practical applications in fields such as weather forecasting, climate modeling, and astrophysics. “Our goal is to train large machine learning models on this massive and diverse dataset, just like OpenAI did when it crawled the entire internet,” Ho said. “By exposing AI models to a wide range of fluid dynamic simulations, we hope they’ll develop an unprecedented understanding of physical and biological principles — rather than just recognizing patterns.”
This deeper understanding could be a breakthrough in how AI models approach scientific prediction. “We call this hypothesis POLYMATH,” Ho explained. “The idea is that if a model learns a broad scientific foundation about the world, it will be able to make better predictions even with limited data. Instead of just memorizing patterns, it will develop an intuitive grasp of the underlying physics.”
The potential applications of this approach could be transformative, particularly for complex systems like Earth’s climate. “We have very few simulations of Earth’s climate because it’s incredibly difficult to model,” Ho said. “But if we can train AI models to make accurate predictions with minimal data, we might finally have a way to tackle problems where information is scarce — giving us a real shot at understanding and forecasting critical environmental changes.”
The team expects that foundation models trained on The Well could eventually drive breakthroughs in areas like climate modeling, solar weather prediction, and other complex physical systems that currently push the limits of computational power.
But beyond the data itself, Ho emphasizes the spirit of collaboration that made the project possible. “We’re incredibly grateful to all the scientists who worked with us,” she said. “Without them, this wouldn’t have happened. This is a collective effort — not just to advance machine learning or physics, but to push the boundaries of human understanding.”
By Stephen Thomas