New Method Makes AI Model Fine-Tuning More Efficient and Accessible
Modern AI models contain trillions of parameters and are trained on enormous datasets, making it impossible for most researchers to work with the full training data. A new approach developed by Courant Instructor Yijun Dong, CDS PhD students Hoang Phan and Xiang Pan, and CDS Assistant Professor of Mathematics and Data Science Qi Lei demonstrates how to identify and use only the most relevant data points, making fine-tuning both faster and more memory-efficient.
While data pruning — the practice of selecting a subset of training data — has been studied extensively for training models from scratch, less attention has been paid to pruning for fine-tuning existing models. This distinction matters because most researchers today start with pre-trained models rather than building from scratch. Additionally, this new work stands out by providing a theoretically sound framework, in contrast to previous approaches that lacked rigorous mathematical foundations.
Dong, Phan, Pan, and Lei’s method, called Sketchy Moment Matching (SkMM), works by first identifying key patterns in how the model processes information, then selecting training examples that best match those patterns. This two-stage approach proves particularly effective for researchers working with limited computational resources. The method is presented in their paper, “Sketchy Moment Matching: Toward Fast and Provable Data Selection for Finetuning.”
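To make the two-stage idea concrete, here is a minimal toy sketch of that workflow — a hypothetical illustration, not the authors’ released code. It assumes per-example features (for instance, embeddings or last-layer gradients from a pre-trained model), compresses them with a random projection (“sketching”), and then greedily selects examples whose second moment in the compressed space matches that of the full dataset (“moment matching”). The sizes, the Gaussian sketch, and the greedy selection rule are all illustrative choices.

```python
# Illustrative two-stage data selection: random sketching, then greedy
# moment matching in the sketched space. Not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n examples with D-dimensional features,
# e.g. embeddings or last-layer gradients from a pre-trained model.
n, D, k, budget = 500, 2048, 32, 50
features = rng.normal(size=(n, D))

# Stage 1: sketching -- project features down to dimension k << D.
sketch = rng.normal(size=(D, k)) / np.sqrt(k)
Z = features @ sketch                      # (n, k) sketched features

# The quantity to preserve: the second moment of the full sketched data.
target = Z.T @ Z / n                       # (k, k)

# Stage 2: greedy moment matching -- repeatedly add the example that
# brings the selected subset's second moment closest to the full-data moment.
selected = []
running = np.zeros((k, k))
for _ in range(budget):
    best_i, best_err = -1, np.inf
    for i in range(n):
        if i in selected:
            continue
        cand = (running + np.outer(Z[i], Z[i])) / (len(selected) + 1)
        err = np.linalg.norm(cand - target)   # Frobenius distance to target
        if err < best_err:
            best_i, best_err = i, err
    selected.append(best_i)
    running += np.outer(Z[best_i], Z[best_i])

print(f"kept {len(selected)}/{n} examples, moment-matching error {best_err:.4f}")
```

In practice the features would come from the pre-trained model itself, and the selection step could be carried out with faster machinery than this brute-force greedy loop; the toy example is only meant to show how sketching keeps the memory footprint small while moment matching keeps the selected subset representative.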
“Most previous work [on data pruning] focused on selecting data to train models from scratch. But nowadays, with foundation models, we rarely train from scratch — we usually use off-the-shelf pre-trained models and fine-tune them,” Lei said.
The work addresses a critical challenge in machine learning: as models grow larger, researchers increasingly struggle to work with them using standard computing resources. The team’s approach reduces both the computational power and memory needed to adapt these models to new tasks.
Pan explained that data efficiency is becoming more crucial as large language models consume more of the available training data. “If we continue to increase model size without considering data efficiency, models like GPT will exhaust all available training data within five years,” Pan said.
The research also aligns with broader efforts to make AI development more accessible to researchers outside major tech companies. Lei’s work aims to provide tools that let everyday researchers and engineers work effectively with large AI models, rather than restricting such capabilities to organizations with massive computing resources.
The team’s paper was accepted to NeurIPS 2024, one of the premier conferences in machine learning. Beyond its immediate applications in computer vision tasks, the method’s framework could extend to other types of data and models.
By Stephen Thomas