CDS Guest Editorial: Dialect map, jargon dialects across science domains
Work by Sinclert Perez, Quynh M. Nguyen, and Kyle Cranmer.
This entry is part of the NYU Center for Data Science blog’s recurring guest editorial series. Sinclert Perez is a CDS research software engineer. Quynh M. Nguyen recently graduated from NYU, where he completed his PhD research in the Applied Math Laboratory at the Courant Institute of Mathematical Sciences. Kyle Cranmer is a CDS professor of data science and professor of physics at the NYU College of Arts & Science.
English speakers in the same local area tend to use the same word for a given concept (e.g., “coke” vs. “soda”). Over a large enough distance, crossing natural and man-made boundaries, word choice changes, fragmenting maps into different dialect regions.
The same phenomenon occurs in science. Researchers within one discipline share the same jargon for a specific concept, which lets them communicate effectively and build their work upon existing literature. In a closely related discipline, a slight variation of that jargon might refer to the same concept, and in a less related discipline, a completely different term might be used. Being unaware of these equivalences (for example, statistics’ ELBO vs. physics’ variational free energy) hinders interdisciplinary communication and the progress of science as a whole. It would therefore be extremely useful to better understand this dialect structure and to map out the dialects of science.
In this context, the goal of Dialect map is to provide a complete web application that allows researchers and the public alike to compare jargon terms across areas of science. The tool could also help researchers take an interdisciplinary approach to literature review and, more generally, help ideas spread across disciplinary boundaries.
The comparison of jargon terms across sciences is built upon the article corpus of ArXiv, one of the biggest open-access academic archives, which was made public as a Kaggle dataset. The full ArXiv corpus comprises more than 2 million PDFs, with articles from a wide variety of science categories such as computer science, economics, mathematics, physics, and statistics, among others.
Representing this corpus as a 2-D map, and updating it daily, is a challenge on its own. However, PaperScape, the open source project that inspired Dialect map, already provides a 2-dimensional representation of how articles relate to each other. The entire collection is modeled as particles in a physical system, with references acting as attractive forces between articles. Articles tend to reference other articles in the same discipline, so each discipline forms a cluster, and disciplines that cross-reference each other more often end up closer together on the map. This naturally provides a pseudo-geographical map well suited for overlaying Dialect map.
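To make the particle analogy concrete, here is a minimal sketch of a force-directed layout in Python. It is not PaperScape's actual algorithm: the function name `layout_articles`, the linear spring attraction, the inverse-distance repulsion, and every parameter value are assumptions chosen purely for illustration.

```python
import random


def layout_articles(num_articles, references, steps=200,
                    attraction=0.05, repulsion=0.01, seed=0):
    """Place articles on a 2-D plane with a toy force-directed scheme.

    `references` is a list of (citing, cited) index pairs. All names and
    constants here are hypothetical, not taken from PaperScape.
    """
    rng = random.Random(seed)
    positions = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
                 for _ in range(num_articles)]

    for _ in range(steps):
        forces = [[0.0, 0.0] for _ in range(num_articles)]

        # References act as springs pulling the two articles together.
        for citing, cited in references:
            dx = positions[cited][0] - positions[citing][0]
            dy = positions[cited][1] - positions[citing][1]
            forces[citing][0] += attraction * dx
            forces[citing][1] += attraction * dy
            forces[cited][0] -= attraction * dx
            forces[cited][1] -= attraction * dy

        # Every pair of articles repels, so clusters do not collapse
        # into a single point.
        for i in range(num_articles):
            for j in range(i + 1, num_articles):
                dx = positions[i][0] - positions[j][0]
                dy = positions[i][1] - positions[j][1]
                dist_sq = dx * dx + dy * dy + 1e-9  # avoid division by zero
                fx = repulsion * dx / dist_sq
                fy = repulsion * dy / dist_sq
                forces[i][0] += fx
                forces[i][1] += fy
                forces[j][0] -= fx
                forces[j][1] -= fy

        # Move each article one step along its net force.
        for i in range(num_articles):
            positions[i][0] += forces[i][0]
            positions[i][1] += forces[i][1]

    return positions


# Two toy "disciplines": articles 0-2 cite each other, as do articles 3-5.
toy_references = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
toy_positions = layout_articles(6, toy_references)
```

In this toy setup, articles that cite each other settle close together while the two unconnected groups drift apart, which is the clustering behavior described above. The pairwise repulsion loop is quadratic in the number of articles, so a corpus of millions of articles would need a far more scalable approach than this sketch.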
The last piece of the puzzle is to define the list of jargon terms that are available for comparison. Terms referring to the same concept need to be grouped together by human experts. For this reason, defining them remains a manual process.
To build this list of grouped jargon terms, Dialect map relies on the scientific community: the list of jargon groups is public on GitHub, so any member of the scientific community with a GitHub account can fork the repository and propose changes to it. Once the changes are approved, the new jargon terms are added to the back-end database, and their metrics over the ArXiv article corpus are computed as a background process.
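As an illustration of what a jargon group and one possible corpus metric could look like, the sketch below counts the share of articles in each category that mention each term of a group. The group layout, the function `term_share_by_category`, the metric itself, and the sample articles are all assumptions made for this example; the project's real data format and metrics may differ.

```python
import re
from collections import defaultdict

# Hypothetical jargon group: terms that experts consider the same concept.
JARGON_GROUPS = {
    "variational-bound": [
        "ELBO",
        "evidence lower bound",
        "variational free energy",
    ],
}

# Made-up (category, text) pairs standing in for real articles.
SAMPLE_ARTICLES = [
    ("stat.ML", "We maximize the ELBO with stochastic gradients."),
    ("stat.ML", "A tighter evidence lower bound is derived."),
    ("cond-mat.stat-mech", "We minimize the variational free energy."),
    ("cond-mat.stat-mech", "The variational free energy bounds the true one."),
]


def term_share_by_category(articles, terms):
    """Return {category: {term: share of articles mentioning the term}}."""
    # Whole-word, case-insensitive matching, so "ELBO" does not match "elbow".
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))

    for category, text in articles:
        totals[category] += 1
        for term, pattern in patterns.items():
            if pattern.search(text):
                hits[category][term] += 1

    return {
        category: {term: hits[category][term] / totals[category]
                   for term in terms}
        for category in totals
    }


shares = term_share_by_category(
    SAMPLE_ARTICLES, JARGON_GROUPS["variational-bound"]
)
```

With the made-up sample above, "variational free energy" appears only in the physics category and "ELBO" only in the statistics one, which is the kind of contrast the map is meant to reveal.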
The project so far only offers a renovated UI compared to the legacy PaperScape interface (upper right box), but it will soon include the jargon comparison functionality described in this article.
The Dialect map project, although still in development, would not be possible without the collaboration of:
- Data Science M.S. student Jiayu Du, who helped shape some of the early functionality.
- NYU Research Technology and HPC teams, for their guidance and for the huge computational power they made available to us as researchers, in the form of the NYU Greene cluster.
- PaperScape authors, for maintaining and constantly updating their public APIs, without which this project would not be possible.
- Google Cloud Platform, for providing us with research credits to support the deployment of certain software components.
If you are interested in the project, please consider contributing to the public list of jargon terms, or contact us for further collaboration!