Addressing Benchmarking Issues in Natural Language Understanding

NYU Center for Data Science
Jun 24, 2021

Sam Bowman, CDS Assistant Professor of Data Science & Linguistics, recently gave a presentation at the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). NAACL organizes conferences and promotes information exchange among communities working in computational linguistics and related scientific and professional fields. The conference was held virtually from June 6 to June 11, 2021.

Sam’s NAACL 2021 Talk: What Will it Take to Fix Benchmarking in Natural Language Understanding?

Sam’s talk is based on “What Will it Take to Fix Benchmarking in Natural Language Understanding?”, a paper he co-authored with colleague George E. Dahl, a research scientist on Google Research’s Brain Team. Their work tackles the problem of natural language understanding (NLU) evaluation: current evaluation is broken because unreliable and biased systems can score highly on standard benchmarks even though experts can readily identify flaws in these high-scoring models. To prevent this from happening in future models, the authors propose and describe four criteria that NLU benchmarks should be required to meet:

  1. Good performance on a benchmark should imply robust in-domain performance on the task, which will require more work on dataset design and data collection methods.
  2. Benchmark examples should be accurately and clearly annotated. Text examples should be validated thoroughly enough to remove inaccuracies and to properly handle ambiguity.
  3. Benchmarks should offer adequate statistical power, which ultimately means that benchmark datasets need to be much larger and/or more challenging (a rough illustration follows this list).
  4. Benchmarks should reveal potentially harmful social biases in systems, i.e., the development and use of auxiliary bias evaluation metrics should be encouraged.

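To get a feel for what “adequate statistical power” demands, the sketch below runs a hypothetical power calculation (ours, not the paper’s): it estimates how many test examples are needed to reliably detect a one-percentage-point accuracy difference between two systems, treating the two evaluations as independent samples and using standard routines from statsmodels. The accuracy values are made up for illustration.

```python
# Hypothetical power calculation: test set size needed to detect a
# one-point accuracy gap (90% vs. 91%) at alpha=0.05 with 80% power.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_acc, new_acc = 0.90, 0.91                       # assumed accuracies
effect = proportion_effectsize(new_acc, baseline_acc)    # Cohen's h
n_per_system = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(round(n_per_system))  # roughly several thousand examples per system
```

Under these illustrative assumptions, several thousand test examples per system are already required, which gives a sense of why the paper argues that benchmark datasets need to be larger and/or more challenging.
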
Ultimately, though important open research questions remain and no single concrete solution is established, the authors argue that working toward these criteria should lead to significant improvements in NLU benchmarking.

To read “What Will It Take…” in its entirety, please visit the project’s arXiv.org page.

By Ashley C. McDonald
