Humans vs. Machines: Natural Language Understanding
Human annotators narrowly surpassed BERT on NLU tasks set by the GLUE benchmark
With technological innovation comes intense speculation: will machines soon take over the world? Will your Roomba turn on you? We may not be quite there yet, but the classic sci-fi theme of humans vs. machines is quickly becoming relevant in natural language processing. CDS’ Nikita Nangia and Samuel R. Bowman, also of NYU’s Department of Linguistics and Department of Computer Science, presented research that redefines the target performance standard for GLUE (General Language Understanding Evaluation). The GLUE benchmark aims to train, evaluate, and analyze performance in NLU (Natural Language Understanding) using nine distinct NLU tasks. These tasks are diverse and include natural language inference, sentiment analysis, acceptability judgment, sentence similarity, and commonsense reasoning. The objective of GLUE is to drive the development of robust systems that perform well across multiple NLU tasks without relying on massive amounts of task-specific training data.
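For readers who want a concrete look at the benchmark itself, the sketch below loads the nine GLUE tasks through the Hugging Face `datasets` library. This tooling is an illustrative convenience rather than part of the original study, and the task names are the library’s own GLUE configuration identifiers.

```python
# Illustrative sketch: inspect the nine GLUE tasks via the Hugging Face `datasets`
# library (assumed tooling for convenience; not used in the study described above).
from datasets import load_dataset

GLUE_TASKS = [
    "cola",   # acceptability judgment
    "sst2",   # sentiment analysis
    "mrpc",   # paraphrase / sentence similarity
    "qqp",    # paraphrase / sentence similarity
    "stsb",   # sentence similarity (regression)
    "mnli",   # natural language inference
    "qnli",   # natural language inference
    "rte",    # natural language inference
    "wnli",   # Winograd-schema inference (commonsense reasoning)
]

for task in GLUE_TASKS:
    data = load_dataset("glue", task)
    # Print the number of labeled examples available in each split.
    print(task, {split: len(rows) for split, rows in data.items()})
```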
The researchers use BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art model from Devlin et al. (2018), as the basis of comparison with human performance. “BERT is pre-trained with a language modeling like objective on a large amount of unlabeled data, and then fine-tuned to a specific task.” BERT’s impressive performance prompted the question: how much better are humans than BERT at NLU tasks? To find out, Nangia and Bowman collect human results on GLUE’s NLU tasks and compare them with BERT’s performance. They gather the human annotations via Hybrid, a crowdsourcing platform, similar to Amazon Mechanical Turk, for collecting human judgments.
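The pre-train-then-fine-tune recipe described in that quote can be sketched, under some assumptions, with the Hugging Face `transformers` library: a publicly released BERT checkpoint (already pre-trained on unlabeled text) is fine-tuned on a single GLUE task. The task choice (MRPC) and hyperparameters below are illustrative, not the authors’ exact setup.

```python
# Illustrative sketch of the pre-train / fine-tune recipe (assumed tooling: the
# Hugging Face `transformers` and `datasets` libraries, not the authors' code).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a publicly released BERT checkpoint, already pre-trained on
# large amounts of unlabeled text with a masked language modeling objective.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # MRPC is a binary paraphrase task

# Fine-tune on a single GLUE task (MRPC chosen here purely for illustration).
mrpc = load_dataset("glue", "mrpc")

def encode(batch):
    # MRPC examples are sentence pairs; the tokenizer joins them with [SEP].
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

mrpc = mrpc.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mrpc",  # illustrative hyperparameters
                           num_train_epochs=3, per_device_train_batch_size=32),
    train_dataset=mrpc["train"],
    eval_dataset=mrpc["validation"],
)
trainer.train()
print(trainer.evaluate())  # reports validation loss for the fine-tuned model
```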
Human crowdworkers first go through a short training phase, during which they may be eliminated for inadequate performance (though only abysmally poor performance disqualifies candidates). This phase is intended to familiarize workers with the task. The annotators then move on to the actual task, in which they each label a random subset of the test examples.
At the time of publication, Nangia and Bowman found that “humans robustly outperform the current state of the art on six of the nine GLUE tasks.” The human performance baseline reached a score of 86.9, exceeding BERT’s score of 80.3. Given BERT’s recent tremendous strides, however, the researchers conjectured that this gap would soon disappear, and indeed it has: BERT’s score now exceeds the human baseline. Complicating matters, the researchers acknowledge that their estimates of human performance are conservative and would likely improve with more annotator training. This is especially true of two tasks, MRPC and QQP, whose problem definitions are subtle; annotators may need more training on the task to pick up the quirks of the data, quirks that are far more evident to a system like BERT, which has seen massive amounts of training data.
Interestingly, WNLI is one of the nine tasks on which humans achieved their highest scores. Some papers had criticized the task as “somewhat broken” to rationalize poor model performance. Given these results, Nangia and Bowman note that tasks like WNLI, with small training sets and no simple cues, represent a blind spot in model development. They thus recommend developing new tasks that challenge “machine learning systems in different ways than our current benchmark tasks.” Furthermore, BERT’s performance suffers in low-resource settings, where the amount of available training data is limited. In light of these findings, Nangia and Bowman recommend designing systems with low sample complexity to improve robustness and adaptability.
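One way to see the low-resource sensitivity described above is to fine-tune on progressively smaller slices of a task’s training set and watch validation performance degrade. The sketch below does this for RTE; the task choice, subset sizes, and hyperparameters are assumptions made for illustration, not the authors’ experiments.

```python
# Illustrative sketch (not the authors' setup): simulate low-resource fine-tuning by
# training BERT on progressively smaller subsets of RTE's training data.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
rte = load_dataset("glue", "rte").map(
    lambda batch: tokenizer(batch["sentence1"], batch["sentence2"],
                            truncation=True, padding="max_length", max_length=128),
    batched=True)

def accuracy(eval_pred):
    # Fraction of validation examples the fine-tuned model labels correctly.
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

for n_examples in (250, 1000, len(rte["train"])):  # assumed, illustrative budgets
    # Re-initialize from the pre-trained checkpoint for each data budget.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    subset = rte["train"].shuffle(seed=0).select(range(n_examples))
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"rte-{n_examples}", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=subset,
        eval_dataset=rte["validation"],
        compute_metrics=accuracy,
    )
    trainer.train()
    print(n_examples, trainer.evaluate())
```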
By Sabrina de Silva