
When Good Data Is Scarce, Planning Beats Reinforcement Learning in AI Decision-Making

3 min read · May 7, 2025

Artificial intelligence often relies on abundant, high-quality data to learn effectively. But a recent study led by CDS PhD student Vlad Sobal and Wancong (Kevin) Zhang, a computer science PhD student at NYU’s Courant Institute, shows that when good data is scarce or of poor quality, planning ahead, rather than blindly following a learned policy, can significantly outperform traditional reinforcement learning methods.

In their paper, “Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models,” Sobal, Zhang, and colleagues, including CDS Professor Kyunghyun Cho, CDS Faculty Fellow Tim G. J. Rudner, and CDS founding director Yann LeCun, examined how different AI methods perform when forced to rely solely on suboptimal datasets. Instead of training an AI to chase immediate rewards, their method learns the underlying dynamics of the environment, allowing the system to plan multiple steps ahead at test time.

“Traditional reinforcement learning methods are highly effective when provided with abundant, quality data,” Sobal said. “But in realistic scenarios, such high-quality data is often unavailable. Planning with latent dynamics models lets us stitch together even poor-quality, fragmented data into coherent, useful predictions.”

The research specifically tested “joint embedding predictive architectures” (JEPA), a method promoted by LeCun. This approach creates internal representations of the environment, enabling the AI to make informed predictions and decisions without needing explicit rewards during training.
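
To make the idea concrete, here is a minimal sketch of planning with a latent dynamics model. It is not the authors’ implementation: the `encode` and `latent_dynamics` functions below are toy stand-ins for the learned JEPA encoder and predictor, and the random-shooting planner is just one simple way to search over action sequences in latent space.

```python
import numpy as np

def encode(obs):
    """Stand-in for the JEPA encoder: map an observation to a latent state."""
    return np.asarray(obs, dtype=float)

def latent_dynamics(z, a):
    """Stand-in for the learned latent dynamics model: predict the next latent."""
    return z + 0.1 * a  # toy linear dynamics, purely for illustration

def plan(obs, cost_fn, horizon=10, n_candidates=256, action_dim=2, seed=0):
    """Random-shooting planner: roll candidate action sequences through the
    latent model and return the first action of the lowest-cost sequence."""
    rng = np.random.default_rng(seed)
    z0 = encode(obs)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    best_cost, best_seq = np.inf, None
    for seq in candidates:
        z = z0
        for a in seq:
            z = latent_dynamics(z, a)
        cost = cost_fn(z)  # score the predicted final latent
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq[0]  # execute the first action, then replan (MPC-style)

# Plan toward a goal: minimize distance to the goal's latent.
z_goal = encode([1.0, 1.0])
action = plan([0.0, 0.0], cost_fn=lambda z: np.linalg.norm(z - z_goal))
```

Note that no reward signal appears anywhere in this loop: the model predicts consequences, and the objective is supplied only at planning time.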

One surprising finding from the study was how dramatically data quality affected various learning methods. While traditional reinforcement learning faltered when presented with short, random trajectories, the JEPA-based planning approach demonstrated robust performance, successfully navigating complex tasks and generalizing to unseen environments.

To test their system, the researchers employed navigation tasks, including maze environments. Even when trained on just a handful of maze layouts, the latent dynamics planning method consistently outperformed other methods when introduced to entirely new maze structures. This ability to generalize effectively from limited examples suggests significant promise for real-world applications, such as robotics and autonomous driving, where data collection is often challenging.

The JEPA method also excelled at what researchers call “trajectory stitching” — the ability to combine multiple short sequences of suboptimal actions into an effective strategy. Such stitching is crucial for scenarios where long, successful demonstrations are rare or difficult to acquire.
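
A toy example makes the idea clear. Suppose the offline data contains only two short segments that happen to share an intermediate state; a dynamics model fit on individual transitions can still support a plan that composes them into a route no single trajectory demonstrated (the states and actions below are hypothetical):

```python
# Offline data: two short segments, A->B and B->C; no single trajectory
# demonstrates A->C end to end.
segment_1 = [("A", "right", "B")]          # (state, action, next_state)
segment_2 = [("B", "up", "C")]
model = {(s, a): s2 for s, a, s2 in segment_1 + segment_2}

# A planner searching over the model can compose the segments into a
# route the data never showed in one piece.
state, route = "A", []
for action in ["right", "up"]:
    route.append((state, action))
    state = model[(state, action)]
print(route, "->", state)                  # [('A', 'right'), ('B', 'up')] -> C
```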

“This method isn’t just better with limited data — it generalizes to new tasks with surprising ease,” Zhang said. For example, without additional training, their system adapted from navigating toward a goal to actively avoiding an adversarial agent, simply by modifying the planning objective.
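
In the sketch above, that switch amounts to swapping the cost function handed to the planner; the encoder, dynamics model, and planner are untouched, and nothing is retrained. The adversary latent here is a hypothetical placeholder:

```python
# Same encoder, dynamics model, and planner as in the earlier sketch;
# only the objective changes.
z_adversary = encode([0.5, 0.5])  # hypothetical adversary position

# Goal-reaching: minimize distance to the goal latent.
reach = plan([0.0, 0.0], cost_fn=lambda z: np.linalg.norm(z - z_goal))

# Avoidance: maximize distance from the adversary's latent (negated distance).
avoid = plan([0.0, 0.0], cost_fn=lambda z: -np.linalg.norm(z - z_adversary))
```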

The authors noted that their research is related to earlier efforts like the paper “DINO-WM: World Models on Pre-trained Visual Features Enable Zero-shot Planning,” but goes beyond that approach by training the latent dynamics and representation models jointly from scratch rather than relying on separately pretrained representations.

Though the current research is limited to navigation tasks, the implications stretch far beyond simple environments. The ability to efficiently leverage suboptimal datasets could unlock more practical and scalable AI systems, capable of robust, general decision-making in complex, real-world conditions.

By Stephen Thomas
