CDS PhD Student Publishes Research Paper on Early-Learning Regularization
CDS PhD student Sheng Liu — in collaboration with Jonathan Niles-Weed and Carlos Fernandez-Granda, assistant professors of data science and mathematics, and Narges Razavian, NYU Langone assistant professor in the Departments of Population Health and Radiology — recently published a research paper on early-learning regularization.
After recognizing that label noise is widespread in medical datasets during their work on automatic early detection of Alzheimer’s disease[1], Liu and coauthors were motivated to address label noise in classification. The paper, titled “Early-Learning Regularization Prevents Memorization of Noisy Labels”, proposes “a novel framework to perform classification via deep learning in the presence of noisy annotations.”[2] The research shows that deep neural networks trained on noisy labels initially fit the training data with clean labels during an “early learning” phase, before eventually memorizing the falsely labeled examples. The team’s theoretical and empirical results reveal that both early learning and memorization are characteristic features of high-dimensional classification, even in the case of simple linear models.
From these findings, the team developed a new approach to noisy classification tasks that capitalizes on the progress made during the early-learning phase. This is in contrast to existing methods, which use model output during early learning to detect the examples with clean labels and either disregard or attempt to correct the false labels. The team’s approach does not require detecting the examples with clean labels at all. Instead, they propose a novel regularization procedure that preserves the progress made during the early stages of training.
The approach has two principal elements. The first leverages “semi-supervised learning techniques to produce target probabilities based on the model outputs.”[2] The second designs a regularizer that steers the model toward these targets, preventing the memorization of false labels. “The resulting framework demonstrates…to noisy annotations on several standard benchmarks and real-world datasets,” says Liu, “where it achieves results comparable to the state of the art.”[2]
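The two elements described above can be sketched in code. The following is a minimal NumPy illustration, not the authors’ implementation: it maintains a running (temporal) average of the model’s softmax outputs as targets, and adds a regularizer of the form log(1 − ⟨p, t⟩) that rewards agreement with those targets; the class name `ELRLoss` and the hyperparameter values `beta` and `lam` are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the class dimension
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class ELRLoss:
    """Sketch of an early-learning-regularization-style loss.

    Targets are a running average of past model outputs per example;
    the regularizer log(1 - <p, t>) penalizes drifting away from them,
    which discourages memorization of falsely labeled examples.
    """
    def __init__(self, num_examples, num_classes, beta=0.7, lam=3.0):
        self.targets = np.zeros((num_examples, num_classes))
        self.beta = beta  # momentum of the temporal average (assumed value)
        self.lam = lam    # regularization strength (assumed value)

    def __call__(self, logits, labels, indices):
        p = softmax(logits)
        # update the running-average targets for the examples in this batch
        self.targets[indices] = (
            self.beta * self.targets[indices] + (1 - self.beta) * p
        )
        t = self.targets[indices]
        # standard cross-entropy on the (possibly noisy) labels
        ce = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
        # regularizer: larger <p, t> makes log(1 - <p, t>) more negative,
        # so minimizing the loss keeps the model close to its early fit
        inner = np.clip((p * t).sum(axis=1), 0.0, 1.0 - 1e-6)
        reg = np.mean(np.log(1.0 - inner))
        return ce + self.lam * reg
```

In a training loop, each batch would pass its example indices along with logits and labels so the per-example targets stay aligned with the dataset.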
This work has the potential to advance the development of machine-learning methods that can be deployed in contexts where it is costly to gather accurate annotations. This is an important issue in applications such as medicine, where machine learning has great potential societal impact.
The project’s GitHub page provides additional information regarding requirements, data, and training.
About the Team
Sheng Liu is a Ph.D. student at CDS, co-advised by Professor Carlos Fernandez-Granda, Professor Jonathan Niles-Weed, and Professor Narges Razavian. His research interests are in robust representation learning in computer vision and its applications in healthcare. He is also a member of the Math and Data (MaD) group (a joint effort of CDS and the NYU Courant Institute of Mathematical Sciences to advance the mathematical and statistical foundations of data science), where he works on inverse problems and optimization.
Jonathan Niles-Weed is an assistant professor of data science and mathematics at CDS and the NYU Courant Institute of Mathematical Sciences. He is also a core member of the Math and Data (MaD) group. His research centers on statistical and computational problems arising from data with geometric structure, with recent work focusing on optimal transport. Jonathan holds a Ph.D. in Mathematics and Statistics from MIT.
Carlos Fernandez-Granda is an assistant professor of data science and mathematics at CDS and the NYU Courant Institute of Mathematical Sciences. His research focuses on developing and analyzing optimization-based methods to tackle problems in applications such as neuroscience, computer vision, and medical imaging. He is particularly interested in machine learning applied to signal processing, the theory of inverse problems, and applications in healthcare. He is also a member of the Math and Data (MaD) group. Carlos holds a Ph.D. in electrical engineering from Stanford University.
To view the paper in its entirety, please visit the paper’s arXiv page.
References
1. “Automatic detection of Alzheimer’s disease” GitHub page
2. “Early-Learning Regularization Prevents Memorization of Noisy Labels” GitHub page
By Ashley C. McDonald