Learning From Others’ Mistakes: How Fine-Grained Error Data Improves Machine Translation

NYU Center for Data Science
Dec 18, 2024


Professional translators have been meticulously marking errors in machine translations for years as part of the Workshop on Machine Translation (WMT), but this rich feedback data had never been used to directly finetune machine translation systems themselves.

CDS PhD student Lily H. Zhang, working with Google researchers Hamid Dadkhahi, Mara Finkelstein, Firas Trabelsi, Jiaming Luo, and Markus Freitag, developed a new method called Training with Annotations (TWA) that utilizes these detailed error annotations to enhance machine translation quality. During an internship at Google this past summer, Zhang and her collaborators found that TWA significantly outperformed existing approaches by learning from the precise location and severity of errors that human evaluators had marked.

Each year, WMT hosts a shared task competition to assess machine translation capabilities across different languages and genres. Companies and organizations submit their systems, and top performers undergo rigorous evaluation using a framework called Multidimensional Quality Metrics (MQM). Professional translators annotate specific error spans in translations, categorizing issues like fluency and accuracy while rating their severity as major or minor.
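To make the annotation format concrete, here is a hypothetical sketch of what one MQM-style record might contain. The field names and values below are illustrative only and do not reflect the official MQM schema or WMT data format.

```python
# Hypothetical MQM-style annotation record (illustrative field names).
mqm_annotation = {
    "source": "Der Vertrag wurde gestern unterzeichnet.",
    "translation": "The contract was signed tomorrow.",
    "errors": [
        {
            "span": (24, 32),                       # character offsets of "tomorrow"
            "category": "accuracy/mistranslation",  # error type marked by the annotator
            "severity": "major",                    # "major" or "minor"
        }
    ],
}
```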

“There has been this incredibly detailed evaluation data collected through the annual WMT competition, with professional translators carefully annotating specific problematic spans in translations and categorizing the types of errors,” Zhang said. “These annotations have previously been used to evaluate machine translation systems as well as to train models for automated evaluation metrics, but not to finetune and improve machine translation models directly.”

The key innovation of the work was developing an algorithm that could effectively learn from span-level error annotations — markings of exactly which parts of a translation contained mistakes. While previous methods only looked at overall translation quality scores, TWA uses the detailed error locations to help translation models understand precisely what went wrong.
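As a rough intuition for how span-level annotations could enter training, the sketch below weights a token-level loss differently inside and outside annotated error spans: tokens outside error spans are trained toward the reference as usual, while tokens inside marked spans receive an unlikelihood-style penalty. This is a minimal illustration under assumed names and weights, not the exact TWA objective.

```python
# Illustrative sketch only: one way span-level error annotations could be
# folded into a token-level training loss. Not the exact TWA objective;
# function names, weights, and the penalty form are hypothetical.
import torch
import torch.nn.functional as F

def span_weighted_loss(logits, targets, error_mask, penalty=0.5):
    """
    logits:     (seq_len, vocab_size) model outputs for one translation
    targets:    (seq_len,) target token ids from the annotated translation
    error_mask: (seq_len,) 1.0 for tokens inside an annotated error span, else 0.0
    """
    # Standard per-token negative log-likelihood
    nll = F.cross_entropy(logits, targets, reduction="none")  # (seq_len,)

    # Encourage the model on tokens outside error spans...
    good = (1.0 - error_mask) * nll

    # ...and discourage it from reproducing tokens inside annotated error spans
    # via an unlikelihood-style term, -log(1 - p(token)).
    log_probs = F.log_softmax(logits, dim=-1)
    p_tok = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    bad = error_mask * (-(1.0 - p_tok).clamp_min(1e-6).log())

    return (good + penalty * bad).mean()

# Toy usage with random tensors
seq_len, vocab = 6, 100
logits = torch.randn(seq_len, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (seq_len,))
error_mask = torch.tensor([0., 0., 1., 1., 0., 0.])  # tokens 2-3 marked as an error span
loss = span_weighted_loss(logits, targets, error_mask)
loss.backward()
```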

Zhang and her coauthors tested TWA on English-to-German and Chinese-to-English translation tasks. The method consistently achieved better results than baseline approaches that either ignored the annotations or only used sequence-level scores.

The project emerged from Zhang’s interest in finding innovative ways to train language models using alternative forms of data. When she learned about the rich error annotation datasets during her internship on the Google Translate research team, she immediately saw an opportunity.

“The Google Translate team has a unique setup where the research and production teams work closely together,” Zhang said. “Working with folks from both teams helped ensure we were thinking about real-world applicability from the start.”

The results suggest that detailed error feedback, when properly utilized, can be a powerful signal for improving machine translation systems. TWA provides a framework that could potentially be extended to help models learn from fine-grained human feedback in other domains as well.

By Stephen Thomas
