The State of NLP Literature: Part I

Size and Demographics

This series of posts presents a diachronic analysis of the ACL Anthology, or, as I like to think of it, making sense of the NLP literature through pictures.

The world of scientific publishing is a rain forest, where ideas compete for sunlight/attention, and where some win out and grow taller while others are forgotten. (Photo credit: Héctor J. Rivas)

The ACL Anthology (AA) is a digital repository of tens of thousands of articles on Natural Language Processing (NLP) / Computational Linguistics (CL). It includes papers published in the family of ACL conferences as well as in other NLP conferences such as LREC and RANLP.

AA is the largest single source of scientific literature on NLP.

This project, which I call NLP Scholar, examines the literature as a whole to identify broad trends in productivity, focus, and impact. I will present the analyses in a sequence of questions and answers. The questions range from fairly mundane to oh-that-will-be-good-to-know. My broader goal here is simply to record the state of the AA literature: who and how many of us are publishing? what are we publishing on? where and in what form are we publishing? and what is the impact of our publications? The answers are usually in the form of numbers, graphs, and inter-connected visualizations.

The posts in this series include:

Subsequent parts will be published in the coming days.

Before we begin, some quick notes:

Target Audience: The posts are likely to be of interest to any NLP researcher. This might be particularly the case for those who are new to the field and want a broad view of the NLP publishing landscape, both current and past. On the other hand, even if you attended NLP conferences long before deep learning was a thing, you have likely wondered about the questions raised here and are interested in what the data tells us.

Data: The analyses presented below are based on information about the papers taken directly from AA (as of June 2019) and citation information extracted from Google Scholar (as of June 2019). Thus, papers and citations that appeared after that date are not included in the analysis. A fresh data collection is planned for January 2020.

Interactive Visualizations and Anonymity: The visualizations I am developing for this work (using Tableau) are interactive, so one can hover, click to select and filter, move sliders, etc. However, I am not currently able to publish the interactive visualizations in a way that can be anonymized. Since I want to be able to anonymize public posts about this work as per the ACL guidelines, I include relevant screenshots here. The visualizations and data will be available once the work is published in a peer-reviewed conference. During the relevant anonymity period, this post and the associated paper will be anonymized.

Caveats and Ethical Considerations: This is work in progress and is not meant to be a complete or comprehensive view of the AA literature.

See the About NLP Scholar page for a list of caveats, ethical considerations, related work, and acknowledgments.

Papers (most pertinent to this post):

Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations. Saif M. Mohammad. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL-2020). July 2020. Seattle, USA.

NLP Scholar: A Dataset for Examining the State of NLP Research. Saif M. Mohammad. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC-2020). May 2020. Marseille, France.

The State of NLP Literature: A Diachronic Analysis of the ACL Anthology. Saif M. Mohammad. arXiv preprint arXiv:1911.03562. November 2019.

See the full list of associated papers on the About page.

Let’s jump in!!

Size

Q1. How big is the ACL Anthology (AA)? How is it changing with time?

A. As of June 2019, AA had ~50K entries. However, this includes a number of entries that are not truly research publications (for example, forewords, prefaces, tables of contents, programs, schedules, indexes, calls for papers/participation, lists of reviewers, lists of tutorial abstracts, invited talks, appendices, session information, obituaries, book reviews, newsletters, lists of proceedings, lifetime achievement awards, errata, and notes). We discard them for the analyses here. (Note: The CL journal includes position papers such as squibs, letters to the editor, and opinion pieces. We do not discard them.)
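For readers who want to try a similar clean-up on their own copy of the metadata, here is a minimal sketch of one way such entries could be discarded. It is not the filter used for this analysis: the title lists, the function names, and the sample entries (including their titles and years) are all assumptions made up for illustration, and a real filter would need a longer, hand-checked list.

```python
# Hypothetical title lists (assumptions for illustration only).
# Short generic titles must match exactly, so that a real paper whose
# title merely starts with "Program" or "Index" is not thrown away.
EXACT_TITLES = {
    "foreword", "preface", "table of contents", "program", "schedule",
    "index", "list of reviewers", "erratum", "errata",
}
# Longer phrases are matched as prefixes of the title.
PREFIXES = ("call for papers", "call for participation", "book review", "obituary")


def is_research_paper(title: str) -> bool:
    """Return False when the title looks like front matter or similar."""
    normalized = title.strip().lower()
    if normalized in EXACT_TITLES:
        return False
    # str.startswith accepts a tuple and checks each prefix in turn.
    return not normalized.startswith(PREFIXES)


def filter_entries(entries):
    """Keep only entries whose titles pass the heuristic above."""
    return [entry for entry in entries if is_research_paper(entry["title"])]


# Made-up sample entries; none of these are real AA records.
sample = [
    {"title": "Preface", "year": 2018},
    {"title": "A Hypothetical Parsing Paper", "year": 2018},
    {"title": "Book Review: An Imaginary Book", "year": 2001},
    {"title": "Program Synthesis from Natural Language", "year": 2017},
    {"title": "Call for Papers", "year": 1995},
]

kept = filter_entries(sample)
for entry in kept:
    print(entry["year"], entry["title"])
```

A title-only heuristic like this will miss some entries and may wrongly drop others, so spot-checking what gets discarded is worthwhile before relying on the resulting counts.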

We are then left with 44,896 articles. Below is a graph of when they were published: