Understanding the Results of the NLP Community Metasurvey: Interview with CDS Research Scientist Julian Michael

NYU Center for Data Science
Dec 7, 2022
Julian Michael, CDS Research Scientist

If you’ve ever wondered what those in the NLP community think about whether language models can understand language, which predictive models are ethical for researchers to build and release, or whether too many resources are being dedicated to scaling up existing machine learning methods, then you’re reading the right blog post. The NLP Community Metasurvey, which asks many of these questions and more, was conducted from May to June 2022. The survey covers over thirty potentially controversial positions on NLP, tackling concerns often raised about AGI (artificial general intelligence) and ethics. “What Do NLP Researchers Believe? Results of the NLP Community Metasurvey,” led by CDS Research Scientist Julian Michael, analyzes these survey results. Other participating CDS authors include CDS PhD students Angelica Chen, Nikita Nangia, and Jason Phang, as well as CDS Associate Professor of Linguistics and Data Science Sam Bowman. We caught up with Julian to discuss how the survey was designed, what surprised him most about the results, and more.

Can you tell us how you became involved in this project and what inspired you to research this topic?

The project was originally inspired by the PhilPapers surveys, a similar study done by philosophers, including David Chalmers here at NYU. The idea is that there are a lot of controversial questions in our field, but it’s hard to argue productively about them when we don’t know how controversial they actually are, where people stand, and what arguments are actually working. So some kind of study on what the research community believes is useful. But on top of that, studying what the research community believes about its own beliefs can be especially interesting, because that shows us exactly how we might be able to change our discourse from its current state and make it more productive.

Personally, I got into this project before I started at NYU, while I was in discussions with Sam Bowman’s group here about possible collaborations. Sam suggested the metasurvey idea in a group meeting, and I found it fascinating and decided to run with it.

What was the process of designing the survey, i.e. how did you and your team decide what to ask?

Part of our goal with the survey is to help us figure out whether ostensibly controversial questions are actually controversial. The goal was to try to cover issues that are frequently discussed (e.g., at conferences, in papers, and on social media) and subject to public disagreement. We were especially interested in questions where, if we know what the community thinks, it can help us have more informed discussions about research priorities and community norms. So we included questions about ethics, what kind of research directions are promising, the relationship between academia and industry, and other related issues.

The process of coming up with questions was, by necessity, a bit informal and subjective. After all, if we knew which questions were actually controversial, or where the community had false sociological beliefs about itself, then we wouldn’t need to run the survey. So, we started with a big brainstorm on a Google Doc, and then we took votes among the research team to pare down the questions to the ones we thought would be most interesting. Then it was a long process of refining, adding, and subtracting questions while consulting colleagues and running pilots of the survey with various research groups.

What are some things that surprised you about the survey results?

One of the cool things about the metasurvey formulation is that we can actually quantify how surprising the results are. So I can tell you that the most surprising result was that support for this idea we call “scaling maximalism” was greatly overestimated by our respondents. This is the idea that simply scaling up existing machine learning techniques (like neural network language models, for example) to bigger and bigger datasets and new kinds of data (like images, video, etc.), and training them for longer on all of the data we can reasonably find, will produce general-purpose AI systems that will be versatile enough to solve any practical real-world problem we throw at them. Kind of like humans. So, while that’s a pretty rough description of the view and everyone has their own nuanced perspective on it, there’s a subset of the NLP research community that believes in “scaling maximalism” in some form. Our respondents thought that subset was about half, predicting 47% would lie in that camp. It turns out that the real number is closer to 17%. So that’s a pretty big difference, and it’s important for the public discourse because a lot of people seem to think that the dominance of “scaling” is taken for granted, and that if you’re not working on scaling up, or using these large-scale models, then other people won’t care about your research. And this is especially concerning for academics who don’t have access to big compute clusters or budgets. But our results suggest that may not be true, and there are a lot of researchers who don’t see scaling up as a silver bullet.
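As a rough illustration of what quantifying that kind of surprise can look like, the sketch below compares the predicted and actual agreement figures quoted above. The function name and the choice of a simple difference in percentage points are assumptions made for illustration; they are not necessarily the measure used in the paper, and the 17% figure is approximate.

```python
# Illustrative sketch only. The two percentages come from the interview;
# the helper name and the simple-difference measure are assumptions.

def overestimate(predicted_pct: float, actual_pct: float) -> float:
    """Return how many percentage points the prediction exceeds the actual figure."""
    return predicted_pct - actual_pct


# Respondents predicted about 47% support for "scaling maximalism";
# the observed figure was closer to 17%.
scaling_maximalism_gap = overestimate(predicted_pct=47, actual_pct=17)
print(scaling_maximalism_gap)  # 30
```

A positive value means the community overestimated support for a position; a negative value would mean it was underestimated.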

A lot of the press coverage and discussion on social media has focused on another set of results, regarding AGI (artificial general intelligence) and catastrophic risk. The community was surprisingly well-calibrated on these questions, but it turns out that about a third (36%) of our respondents think that decisions made by AI or machine learning systems could plausibly cause a major global catastrophe at some point in the next 100 years. Like, “worse than nuclear war” major. The “plausibly” is big there — the question just suggests that it’s plausible, not necessarily likely. But global catastrophe is pretty bad, so that might be something worth digging deeper into.

Separately, one thing that surprised me personally — but didn’t surprise the rest of the community as much — is that most of the respondents thought there shouldn’t be government regulation of NLP systems. Admittedly it’s a bit tricky to nail down what this actually means (do broad data protections like in GDPR count?), but I thought it would be an easy yes for most people. So anyway, this means that for those of us who think regulation might be necessary, it could be important to think about why the research community might be against it and how to make the case for it or how to incorporate researchers into the development of regulations.

How do you see this research potentially evolving going forward?

I would love to do a follow-up survey sometime down the line and see how things change, maybe get a bit of longitudinal data. A challenge is: who knows if today’s issues will even be relevant in 3 or 5 years? But even if the questions have to change, I think having some way of keeping track of these issues, and getting the community to more regularly reflect back on itself and the research discourse, would be really beneficial.

Any additional comments/thoughts you’d like to share?

When people interpret the results of the survey, I always want to remind them of a couple of things.

  • First, there’s response bias to keep in mind. We tried to quantify it. For example, we know that people in the US and senior faculty are overrepresented in the survey’s responses, and researchers in China are deeply underrepresented compared to their presence in the research community as a whole. But there are also unseen biases: we probably disproportionately got our friends, direct co-workers, and people directly interested in some of the ideas asked about in the survey to participate. So our results shouldn’t be taken as ground truth.
  • Second, we’re asking NLP researchers for their opinions, but those opinions don’t necessarily reflect the correct answers. We’re trained to do research, not predict economic or geopolitical events. So while a third of respondents said some kind of global catastrophe from AI may be plausible, or that most of the NLP jobs will be gone in the next 30 years, it’s hard to extract any precise predictive value out of that. But it’s valuable in understanding how NLP researchers are thinking about their field and what they think about the state — and potential — of the technology.

By Ashley C. McDonald


