Scraping openreview.net and building the corpus

Scraping the site itself is slightly non-trivial, since most of the content is rendered through AJAX; by monitoring the browser's network activity, however, the underlying API calls are easy to find. The following code block shows how to politely scrape all ICLR reviews and represent them as a Pandas data frame.

Code to crawl openreview.net.
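The gist above handles the full details; a minimal sketch of the approach looks like the following. The API endpoint, query parameters, and note layout are assumptions based on OpenReview's public notes API (they may have changed), and the flattening helper is hypothetical.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

import pandas as pd

# Assumed OpenReview API endpoint; verify against the network activity you observe.
API_URL = "https://api.openreview.net/notes"

def fetch_notes(invitation, limit=1000, pause=1.0):
    """Page through the notes endpoint, sleeping between requests to stay polite."""
    notes, offset = [], 0
    while True:
        query = urlencode({"invitation": invitation, "limit": limit, "offset": offset})
        with urlopen(f"{API_URL}?{query}") as resp:
            batch = json.load(resp).get("notes", [])
        if not batch:
            break
        notes.extend(batch)
        offset += limit
        time.sleep(pause)  # rate-limit ourselves
    return notes

def notes_to_dataframe(notes):
    """Flatten each note's nested 'content' dict into DataFrame columns."""
    rows = [{"id": n.get("id"), "forum": n.get("forum"), **n.get("content", {})}
            for n in notes]
    return pd.DataFrame(rows)
```

The `forum` field ties each review back to its paper, which is what lets us join reviews to decisions later.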

Next, we can parse the data frame and assemble its contents into a set of categorized reviews. The code snippet here is a bit long, so please see the Jupyter notebook (http://nbviewer.jupyter.org/github/JasonKessler/ICLR18ReviewVis/blob/master/MiningICLR2018.ipynb). The result is that 930 papers were identified, along with 2,806 unique reviews.

Below is a sample of the first 10 reviews scraped, including their metadata.

The first 10 reviews and their metadata. This is our core dataset.

ICLR has a decent acceptance rate, although the vast majority of accepted papers were accepted as posters. For the purpose of this study, we’ll group all accepted papers together and omit the workshop category.
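The grouping can be expressed as a small lookup table. The exact decision strings below are assumptions; check the actual values in the scraped metadata.

```python
# Hypothetical decision labels; verify against the scraped metadata.
DECISION_TO_GROUP = {
    "Accept (Oral)": "Accepted",
    "Accept (Poster)": "Accepted",
    "Reject": "Rejected",
    "Invite to Workshop Track": None,  # workshop papers are omitted from this study
}

def group_decision(decision):
    """Collapse oral and poster acceptances into one class; None means 'omit'."""
    return DECISION_TO_GROUP.get(decision)
```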

The reviewer ratings were fairly tentative, with most hovering between 4 and 7 (out of 10). Ratings of 4 and below were labeled by the conference committee as “reject”, 7 and up as “accept”, and 5 and 6 as either “marginally above” or “marginally below” the acceptance threshold. For the analysis, I grouped [1,4] as “Negative”, [7,10] as “Positive”, and omitted [5,6].
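The binning is mechanical. A sketch (the rating-string format is an assumption; OpenReview ratings typically arrive as strings like “7: Good paper, accept”):

```python
def parse_rating(raw):
    """Ratings arrive as strings like '7: Good paper, accept'; keep the number."""
    return int(str(raw).split(":")[0])

def bin_rating(rating):
    """Bin a 1-10 reviewer rating into the polarity label used in the analysis."""
    if rating <= 4:
        return "Negative"
    if rating >= 7:
        return "Positive"
    return None  # 5s and 6s are omitted as borderline
```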

The distribution of the ratings found in the data.

Difference in Frequency Ranks for Term-Category Associations

Plot 1. A Scattertext plot showing how positive and negative reviews differ. Click the image for an interactive version. We can see that positive sentiment expressions like “nice”, “well written”, “useful” and “novel” are highly associated with positive reviews, while concerns about a paper’s “novelty”, its “limited” nature, and markers of skepticism dominate negative language. The “novel” vs. “novelty” dichotomy provides some justification for the decision not to stem or lemmatize.

Let’s first use Scattertext to visualize the difference in language between positive reviews and negative reviews. We’ll grab the review data frame we created in the last section, parse it with spaCy (Honnibal and Johnson 2015), and then use Scattertext to plot unigrams and bigrams which occur at least three times. By default, Scattertext requires that bigrams match the PMI “phrase-like” criterion; the default threshold coefficient is 8, a fairly stringent value.

The PMI formula for identifying bigrams as phrases. Used in Scattertext, but originally introduced for this purpose in Schwartz et al. (2013).
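To make the criterion concrete, here is a plain-Python sketch of a PMI computation for a candidate bigram. The function names are my own, and the exact way Scattertext scales PMI against its threshold coefficient is an assumption here.

```python
from math import log2

def pmi(bigram_count, w1_count, w2_count, total_bigrams, total_unigrams):
    """Pointwise mutual information of a bigram, in bits:
    log2( p(w1 w2) / (p(w1) * p(w2)) )."""
    p_bigram = bigram_count / total_bigrams
    p_w1 = w1_count / total_unigrams
    p_w2 = w2_count / total_unigrams
    return log2(p_bigram / (p_w1 * p_w2))

def is_phrase_like(bigram_count, w1_count, w2_count,
                   total_bigrams, total_unigrams, threshold=8):
    """Keep a bigram as 'phrase-like' when its PMI clears the threshold.
    (Scattertext's default coefficient is 8; the exact scaling it applies
    is an assumption in this sketch.)"""
    return pmi(bigram_count, w1_count, w2_count,
               total_bigrams, total_unigrams) >= threshold
```

Intuitively, a bigram scores high when its two words co-occur far more often than their individual frequencies would predict by chance.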

The code to produce the visualization is fairly concise. One can read the generated set of reviews as a Pandas data frame, add a column containing the spaCy-parsed reviews, and create a Scattertext Corpus object from the data frame. The Corpus object categorizes documents based on their binned rating — i.e., “Positive”, “Negative” or “Neutral”. Documents from the Neutral class are removed.

The visualization is created in HTML form, with the Positive category’s term ranks on the y-axis, contrasted against those of the “not-categories” (here, only the “Negative” category) on the x-axis. The words are scored and colored based on their difference in rank. The ranks used in this plot (Plot 1) are dense.

Snippet 1. The code to produce the scatter plot.

Note that spaCy tokenizes contractions as multiple words, so the “s” that appears in the upper righthand corner is likely part of a possessive.

In this plot, the axes both refer to the ranks of unigrams and bigrams in each category. The higher a word is on the y-axis, the higher its frequency rank among positive reviews, while the further right a word is on the x-axis, the higher its frequency rank in negative reviews.
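To make the dense-rank scoring concrete, here is a toy sketch with made-up counts showing how dense ranks and their scaled difference can be computed with pandas. The normalization mirrors the idea of the plot, not Scattertext’s exact internals.

```python
import pandas as pd

# Toy term counts per category (hypothetical numbers for illustration only).
counts = pd.DataFrame({
    "Positive": {"nice": 30, "novel": 25, "unclear": 3, "novelty": 2},
    "Negative": {"nice": 4, "novel": 5, "unclear": 20, "novelty": 18},
})

# Dense ranks: tied counts share a rank and no ranks are skipped.
ranks = counts.rank(method="dense")

# Scale each category's ranks to [0, 1], then score each term by the
# difference between its scaled rank in the two categories.
scaled = ranks / ranks.max()
score = scaled["Positive"] - scaled["Negative"]
```

A term near +1 sits high in positive reviews and low in negative ones; a term near -1 is the reverse; terms near 0 are used similarly in both.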

No stop-listing or normalization (other than case-insensitivity) is applied to the terms. Had we lemmatized, we’d fail to pick up that “novelty” is the best predictor that a review is negative, while “novel” is a good indicator that a review is positive.

To see how the words appear in context, click the image, mouseover a term, and click on it.

We can use the same workflow to visualize the difference between reviews of accepted papers and rejected papers (omitting workshop papers).

Figure 2. Dense ranks of words used in reviews of accepted and rejected papers.

While the terms highlighted here are similar to those associated with a review’s polarity (e.g., “well written” or “unclear”), we can see that terms associated with a paper’s content (e.g., “memory” and “theoretical” vs. “regularization” and “LSTM”) appear to be influential, potentially reflecting topics favored by the organizers.

Snippet 2. Code used to make Figure 2.

Making use of neutral data: the Log-Odds-Ratio with an Informative Dirichlet Prior for term-associations

Let’s take a brief digression to see how the neutral data (here, the workshop-track reviews we’ve been omitting) can be put to use.

The next plot covers the same contrast as Figure 2 (reviews of accepted vs. rejected papers), but uses a different technique for finding interesting terms: the log-odds-ratio with an informative Dirichlet prior from Monroe et al. (2008). It was popularized in the NLP world by Jurafsky et al. (2014).

Feel free to look at the above papers for an explanation of this score. I’ve created a Jupyter notebook which describes how this score works, along with Python code and a number of charts. The notebook also covers the Dense Rank Difference measure, a measure derived from tf.idf, and a novel scoring measure, Scaled F-Score. http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Class-Association-Scores.ipynb

Snippet 3. Source code for producing the log-odds-ratio with an informative Dirichlet prior chart.

We use the reviews of workshop papers as the background corpus and, instead of looking at the absolute number of word occurrences, use the number of times a word or phrase occurred in a document as our term-count definition. This is accomplished in lines 12–13 of the snippet above. Finally, we scale the sum of the prior vector to a typical document length (following Monroe et al.) to create the term-scorer object (lines 14–16).
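As a standalone sketch of what the snippet computes, the scorer fits in a few lines of NumPy. The function name and the optional prior-rescaling argument are my own; the variance formula is the usual Monroe et al. approximation.

```python
import numpy as np

def log_odds_ratio_informative_prior(y1, y2, prior, prior_scale=None):
    """Monroe et al. (2008) log-odds-ratio with an informative Dirichlet prior.

    y1, y2: term-count vectors for the two categories.
    prior:  term-count vector from a background corpus, used as the prior.
    prior_scale: if given, rescale the prior's total mass to this value
                 (e.g. a typical document length).
    Returns z-scores: delta divided by its approximate standard deviation.
    """
    y1, y2, a = np.asarray(y1, float), np.asarray(y2, float), np.asarray(prior, float)
    if prior_scale is not None:
        a = a * prior_scale / a.sum()
    n1, n2, a0 = y1.sum(), y2.sum(), a.sum()
    # Difference of smoothed log-odds between the two categories.
    delta = (np.log((y1 + a) / (n1 + a0 - y1 - a))
             - np.log((y2 + a) / (n2 + a0 - y2 - a)))
    # Approximate variance of delta, per Monroe et al.
    var = 1.0 / (y1 + a) + 1.0 / (y2 + a)
    return delta / np.sqrt(var)
```

The prior pulls rare terms toward zero, so a term needs both a frequency imbalance and enough evidence to earn a large z-score.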

The result is below.

Plot 3. Looking at how language differs between reviews of papers that were subsequently rejected or accepted. The log-odds-ratio with an informative Dirichlet prior was used. In this plot, the x-axis is the log frequency of a word or phrase, while the y-axis is the z-score of the log-odds-ratio. Terms with a z-score in (-1.96,1.96) are listed in gray.

Reviews of accepted papers praised their writing (“well written”), discussed their appendices, and expressed appreciation (“thank”, “nice to see”). Reviews of rejected papers questioned their novelty, contained a number of negations (“is not”, “is no”, “never”), and criticized the writing style (“unclear”).

Acceptance vs. Positivity

Inevitably, some accepted papers received negative reviews, while many rejected papers received positive reviews. Below, we’ll construct a plot that shows how terms are associated both with a review’s polarity (i.e., whether it was positive or negative) and with the ultimate acceptance decision of the paper being reviewed.

There are many ways to construct this chart, but we will define the axes in a way that distinguishes terms present in reviews which aligned with the acceptance decision (“good” reviews) from those in reviews which ran contrary to it (“bad” reviews).

In this case, we’ll only look at unigrams.
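The “good”/“bad” labeling described above can be sketched as a one-line rule (the helper name is hypothetical):

```python
def review_alignment(review_polarity, paper_accepted):
    """Label a review 'good' when its polarity agreed with the final decision.

    review_polarity: "Positive" or "Negative" (the binned rating from earlier).
    paper_accepted:  True if the paper was ultimately accepted.
    """
    aligned = (review_polarity == "Positive") == bool(paper_accepted)
    return "good" if aligned else "bad"
```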