Addressing Benchmarking Issues in Natural Language Understanding

Sam Bowman, CDS Assistant Professor of Data Science & Linguistics, recently gave a presentation at the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). NAACL organizes conferences and promotes information exchange among researchers and professionals in computational linguistics and related fields. The 2021 conference was held virtually from June 6 to June 11, 2021.

Sam’s NAACL 2021 Talk: What Will it Take to Fix Benchmarking in Natural Language Understanding?

Sam’s talk is based on “What Will it Take to Fix Benchmarking in Natural Language Understanding?”, a paper he co-authored with George E. Dahl, a research scientist on Google Research’s Brain Team. Their research tackles the problem of NLU (natural language understanding) evaluation, which is currently broken: unreliable and biased systems can score well on standard benchmarks even though experts can readily identify flaws in these high-scoring models. To prevent this phenomenon from recurring in future models, the authors propose and describe four criteria that NLU benchmarks should be required to meet:

  1. Good benchmark performance should imply robust in-domain performance on the task, which calls for more work on dataset design and data collection methods.
  2. Benchmark examples should be accurately and unambiguously annotated. Examples should be validated thoroughly enough to remove mislabeled cases and to handle genuine ambiguity appropriately.
  3. Benchmarks should offer adequate statistical power, which ultimately means that benchmark datasets need to be much larger and/or more challenging (see the sketch after this list).
  4. Benchmarks should reveal potentially harmful social biases in systems, which means the development and use of auxiliary bias evaluation metrics should be encouraged.

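To make the statistical-power point concrete, here is a minimal back-of-the-envelope sketch; it is our illustration rather than anything from the paper, and the function name and default values are hypothetical. It uses a standard normal approximation to estimate the smallest accuracy gap between two models that a test set of a given size can reliably distinguish.

```python
# Illustrative sketch (not from the paper): how test-set size limits the
# smallest accuracy gap a benchmark can reliably detect. Assumes a normal
# approximation to the binomial and treats the two models' scores as
# independent estimates.
import math

def minimum_detectable_gap(n_examples: int, accuracy: float = 0.9,
                           z: float = 1.96) -> float:
    """Rough 95% threshold for the accuracy difference a test set of
    n_examples can reliably distinguish between two models scoring
    near the given accuracy."""
    # Standard error of one model's measured accuracy.
    se_one = math.sqrt(accuracy * (1 - accuracy) / n_examples)
    # Standard error of the difference between two independent estimates.
    se_diff = math.sqrt(2) * se_one
    return z * se_diff

for n in (500, 5_000, 50_000):
    print(f"n={n:>6}: gaps below ~{minimum_detectable_gap(n):.1%} "
          "are indistinguishable from noise")
```

Under this rough approximation, a 500-example test set cannot reliably separate two strong models whose true accuracies differ by less than about four points, which illustrates why larger or harder evaluation sets matter.
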
Ultimately, though important open research questions remain and no single concrete solution is established, the authors argue that building toward these criteria should lead to significant improvement.

To read “What Will It Take…” in its entirety, please visit the project’s arXiv.org page.

By Ashley C. McDonald
