DRoP: Making AI Fairer Through Smarter Data Reduction

NYU Center for Data Science
3 min read · Jan 29, 2025


Most data pruning techniques for machine learning models achieve strong overall accuracy while secretly making the models more biased. A new paper, “DRoP: Distributionally Robust Pruning,” by recent CDS PhD graduate Artem Vysogorets, now a machine learning engineer at Rockefeller University, CDS Silver Professor Julia Kempe, and Meta FAIR’s Kartik Ahuja, reveals this troubling trade-off and proposes a solution. The paper has been accepted to ICLR 2025.

The problem of biased AI models is well illustrated by the classic “water birds” example. In natural settings, water birds are usually photographed against water backgrounds like lakes and rivers, while land birds appear against green backgrounds like forests and fields. A model trained on this data might learn to classify birds based on their backgrounds rather than their actual features.

“Many models, presented with an image of a land bird with a background of water, fail miserably,” Vysogorets explained. “They just use the background as the basis for their decision.”

The researchers found that popular data pruning methods, which aim to reduce training costs by removing redundant or uninformative samples, often achieve their improved average performance by sacrificing accuracy on difficult classes. Some methods even completely remove challenging categories from the training data, masking severe bias beneath seemingly strong metrics.
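
One way this happens is when samples are ranked by a single global difficulty score, with no per-class accounting. The toy sketch below (an illustration, not a reproduction of any specific method studied in the paper) shows such a class-blind filter silently deleting the harder class:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy difficulty scores: class 0 is easy (low loss), class 1 is hard (high loss).
losses = np.concatenate([rng.normal(0.2, 0.05, 900),   # 900 easy samples
                         rng.normal(1.5, 0.10, 100)])  # 100 hard samples
labels = np.array([0] * 900 + [1] * 100)

# Class-blind pruning: keep the 50% of samples the score ranks as "best",
# with no per-class bookkeeping.
keep = np.argsort(losses)[: len(losses) // 2]
print(np.bincount(labels[keep], minlength=2))  # [500   0]: the hard class vanishes
```

Average accuracy on the retained data can look excellent, because the class the model struggled with is simply no longer there to be measured against during training.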

“You can get good average accuracy by just removing difficult classes entirely,” said Vysogorets, lead author of the study. “But that’s clearly not what we want.”

To address this issue, the team developed DRoP (Distributionally Robust Pruning), a new pruning approach that carefully selects how many samples to keep from each class based on how difficult that class is for the model to learn. When tested on standard computer vision datasets, DRoP substantially reduced bias compared to existing pruning methods — improving worst-case accuracy by up to 10% while only modestly impacting overall performance.
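
To make the idea concrete, here is a minimal sketch of difficulty-aware pruning in this spirit. The paper's actual quota rule is more involved; the function name, the toy numbers, and the simple proportional-to-error heuristic below are illustrative assumptions, not the published algorithm:

```python
import numpy as np

def difficulty_aware_quotas(class_errors, class_sizes, target_fraction):
    """Per-class keep quotas that grow with class difficulty."""
    class_errors = np.asarray(class_errors, dtype=float)
    class_sizes = np.asarray(class_sizes, dtype=int)
    budget = int(target_fraction * class_sizes.sum())  # total samples to keep

    # Allocate the budget in proportion to each class's validation error,
    # so harder classes retain more of their training data. (A fuller
    # implementation would redistribute any budget left over when a
    # quota is capped at the class size.)
    weights = class_errors / class_errors.sum()
    return np.minimum(class_sizes, np.round(weights * budget).astype(int))

# Toy example: three equally sized classes; the third is hardest.
print(difficulty_aware_quotas([0.05, 0.10, 0.45], [5000] * 3, 0.3))
# -> [ 375  750 3375]: the hardest class keeps 9x more data than the easiest
```

Within each class, samples can then be subsampled uniformly at random up to the quota; the key point is that the keep/discard decision is made at the class level, informed by how hard each class is, rather than by a single global score.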

The research has implications beyond academic benchmarks. Modern language models and AI systems are trained on increasingly massive datasets, making data pruning essential for practical deployment. The team’s method could help preserve fairness when pruning datasets that span demographic groups, suggesting applications in reducing bias against minority populations.

“Large language models process text from many different sources — Wikipedia, GitHub, Stack Overflow,” Vysogorets explained. “There may be a lot of duplicates or similar content. You need to balance these domains while removing redundancy to ensure the model performs well across all of them.”

Now at Rockefeller University’s Data Science Platform, Vysogorets applies the machine learning expertise he developed at CDS, collaborating with research labs to analyze their data and implement AI solutions.

“My work is a very smooth continuation of what I’ve been doing,” he said of the transition from CDS to Rockefeller. “It’s implementing different algorithms, reading papers, writing code, and hopefully writing papers too.”

Beyond computer vision applications, the researchers believe their approach could help reduce bias when training large language models and other AI systems that rely on web-scale datasets.

By Stephen Thomas
