
Applying Multilingual Hyperlink Prediction to Improve Semantic Task Performance

2 min read · Jul 1, 2021

Masked language modeling (MLM) — a task in which part of the input text is masked and the model is asked to predict the missing tokens — has become the default approach to processing text. A number of alternative approaches have recently been proposed to enrich word representations with external knowledge sources. However, these models are designed and evaluated in a monolingual setting only, which limits their usefulness across the world’s many languages. Iacer Calixto, a visiting academic currently working with researchers at CDS, has co-authored the paper “Wikipedia Entities as Rendezvous across Languages: Grounding Multilingual Language Models by Predicting Wikipedia Hyperlinks,” which proposes a language-independent entity prediction task as an intermediate training procedure to ground word representations in entity semantics. The work grew out of Iacer’s current project, Improving Multi-modal language Generation wIth world kNowledgE (IMAGINE), a central goal of which is linking language models to massive knowledge graphs such as Wikipedia. “Wikipedia Entities” was also recently accepted to the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
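
To make the MLM setup concrete, the toy sketch below fills in a masked token with an off-the-shelf multilingual model via the Hugging Face transformers library; the model choice is an illustrative assumption, not the setup used in the paper.

```python
# Minimal illustration of masked language modeling (MLM): the model
# predicts the token hidden behind [MASK] from the surrounding context.
# The multilingual model chosen here is illustrative, not the paper's.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

for prediction in fill_mask("Paris is the capital of [MASK]."):
    print(f"{prediction['token_str']:>12}  (score: {prediction['score']:.3f})")
```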

Using a dataset of Wikipedia articles in up to 100 languages, the team introduces a multilingual Wikipedia entity prediction objective that leverages the knowledge-rich task of hyperlink prediction and is designed to inject:

  1. “semantic knowledge from Wikipedia entities and concepts into the multilingual MLM token representations” (1)
  2. “explicit language-independent knowledge into a model trained via self-supervised learning, but in our case without parallel data.” (1)

They developed a training procedure in which hyperlinks in Wikipedia articles are masked out and the model is trained to predict the identifier of the hyperlinked article, analogous to standard MLM but with a shared entity vocabulary of 250,000 concepts spanning multiple languages in place of the usual token vocabulary. This consistently improved performance “on several zero-shot crosslingual tasks.”
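
For intuition, here is a minimal PyTorch sketch of what such an objective can look like: encoder hidden states at masked hyperlink positions are scored against a shared entity vocabulary and trained with cross-entropy, just as MLM scores tokens. All names and sizes below are assumptions for illustration, not the authors’ implementation.

```python
# Hedged sketch of the hyperlink-prediction objective described above:
# masked hyperlink anchors are classified against a shared,
# language-independent entity vocabulary rather than a per-language
# token vocabulary. Class and parameter names are hypothetical.
import torch
import torch.nn as nn

ENTITY_VOCAB_SIZE = 250_000  # shared entity identifiers across languages
HIDDEN_SIZE = 768            # typical multilingual-encoder width (assumption)

class EntityPredictionHead(nn.Module):
    """Maps hidden states at masked hyperlink positions to scores
    over the entity vocabulary."""
    def __init__(self, hidden_size: int, entity_vocab_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, entity_vocab_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_masked_anchors, hidden_size)
        return self.classifier(hidden_states)

# Toy usage: score two masked anchor positions against all entities and
# train with cross-entropy on the gold hyperlink targets, as in MLM.
head = EntityPredictionHead(HIDDEN_SIZE, ENTITY_VOCAB_SIZE)
anchor_states = torch.randn(2, HIDDEN_SIZE)  # stand-in encoder outputs
gold_entities = torch.tensor([12, 40_321])   # gold article identifiers
loss = nn.functional.cross_entropy(head(anchor_states), gold_entities)
loss.backward()
```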

When asked about the future of the project, Iacer said the team has many directions it could explore. One he specifically noted would be training a model to generate hyperlinked Wikipedia text in context, as opposed to predicting the linked articles as in the current work. “Something I personally want to work on at some point is to include the rich set of visual information (i.e. images) available in Wikipedia articles in this framework. In more general terms, I am interested in grounding language models using ‘real-world’ information. Knowledge graphs are certainly an important source, but I expect that visual information should also be a very promising way to ground these models,” says Iacer.

To read “Wikipedia Entities…” in its entirety, please visit the paper’s ACL Anthology page.

  1. “Wikipedia Entities as Rendezvous across Languages: Grounding Multilingual Language Models by Predicting Wikipedia Hyperlinks,” p. 1.

By Ashley C. McDonald
