
Learning to See Like Animals: How Small Objects in Dense Video Scenes Challenge AI Vision

2 min read · Jun 11, 2025


Most AI systems trained on carefully curated datasets struggle when faced with the messy reality of naturalistic video — where critical objects like pedestrians occupy less than 0.3% of a video frame, yet detecting them can be a matter of life and death in applications like self-driving cars. CDS Assistant Professor Mengye Ren and collaborators have developed a new self-supervised learning method that tackles this challenge.

The research, published as “PooDLe🐩: Pooled and Dense Self-Supervised Learning from Naturalistic Videos” at ICLR 2025, addresses a fundamental problem in computer vision. Traditional methods work well on iconic images with a single, clear subject. But naturalistic videos from dashcams contain cluttered scenes with many objects of varying sizes, creating “spatial imbalance.”

“We were interested in building embodied learning algorithms that could learn directly from video streams,” Ren said, drawing inspiration from how animals, including humans, acquire visual understanding from “a continuous stream of visual data” with minimal supervision.

The team’s solution, PooDLe, combines two complementary learning strategies. Dense learning objectives, which compare features at every spatial location, work well for large background regions but struggle with small foreground objects. Pooled learning objectives, which summarize a whole crop into a single representation, capture the semantics of smaller regions but may miss spatial relationships in crowded scenes. A rough sketch of how the two objectives differ appears below.
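
The snippet below is a minimal, hypothetical sketch of what a dense (per-pixel) loss and a pooled (per-crop) loss look like in a generic joint-embedding setup; it is not the authors’ implementation, and the function names and shapes are illustrative only.

```python
# Illustrative only: generic dense vs. pooled self-supervised losses,
# not PooDLe's actual objective or architecture.
import torch
import torch.nn.functional as F

def dense_loss(feat_a, feat_b):
    """Per-pixel agreement between two (B, C, H, W) feature maps that are
    assumed to be spatially aligned. Every location contributes, so large
    background regions dominate the gradient signal."""
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    return (1 - (a * b).sum(dim=1)).mean()

def pooled_loss(feat_a, feat_b):
    """Crop-level agreement after global average pooling. Pooling collapses
    each feature map to one vector, capturing the crop's overall semantics
    but discarding spatial layout."""
    a = F.normalize(feat_a.mean(dim=(2, 3)), dim=1)
    b = F.normalize(feat_b.mean(dim=(2, 3)), dim=1)
    return (1 - (a * b).sum(dim=1)).mean()

# Hypothetical usage: feature maps from two views of the same scene.
feat_view1 = torch.randn(4, 128, 28, 28)
feat_view2 = torch.randn(4, 128, 28, 28)
loss = dense_loss(feat_view1, feat_view2) + pooled_loss(feat_view1, feat_view2)
```
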

PooDLe addresses these limitations through a cropping strategy that uses optical flow — information about how pixels move between video frames — to identify paired regions containing the same object across time. These “pseudo-iconic” subcrops increase the prevalence of smaller objects in the learning process.
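
As a toy illustration of flow-guided pairing, the sketch below shifts a crop box from one frame to the next by the average optical flow inside it, so the two crops cover roughly the same object. This is a deliberate simplification under assumed inputs, not the paper’s cropping strategy; `pair_crop_with_flow` and its arguments are hypothetical.

```python
# Toy sketch: estimate the matching crop in the next frame by translating
# the box by the mean flow inside it. Simplified, hypothetical example.
import torch

def pair_crop_with_flow(box, flow):
    """box: (x0, y0, x1, y1) in frame t; flow: (2, H, W) with per-pixel (dx, dy).
    Returns an estimated matching box in frame t+1."""
    x0, y0, x1, y1 = box
    region = flow[:, y0:y1, x0:x1]             # flow vectors inside the crop
    dx, dy = region.mean(dim=(1, 2)).tolist()  # average motion of the crop
    return (round(x0 + dx), round(y0 + dy), round(x1 + dx), round(y1 + dy))

# Hypothetical usage: a constant rightward flow of 5 pixels.
flow = torch.zeros(2, 240, 320)
flow[0] = 5.0
print(pair_crop_with_flow((100, 60, 180, 140), flow))  # -> (105, 60, 185, 140)
```
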

Testing on driving and first-person videos, PooDLe outperformed existing methods by 3.5% on semantic segmentation and showed particularly strong gains on small objects like traffic lights and pedestrians.

“This could be very useful in cases where we want to adapt to visual features that are not already pre-trained, because we don’t know where the robot is going to be deployed,” Ren explained.

Co-first authors Alex N. Wang and Christopher Hoang, both Courant PhD students in NYU’s CILVR lab, conducted the experiments. The collaboration included CDS founding director Yann LeCun, whose joint embedding models influenced the research.

The work builds on Ren’s previous research, FlowE, which used optical flow for dense video learning but lacked the pooled objective crucial for small object recognition.

PooDLe represents a step toward vision systems that can adapt to new environments and capture the full complexity of visual scenes, much like the animals that inspired this research.

By Stephen Thomas

