1. Scraping 1.5 Million Reviews

We previously examined the distribution of ratings across the Audible catalog, noting the extreme concentration of ratings among the select few programs at the top (a roughly linear relationship between the log of a title’s rank and the log of its number of reviews).

We’ll use that skewness to our advantage today in order to maximize the number of reviews we can scrape per hour. Assuming that the number of written reviews per title is proportional to the number of ratings per title, we can expect roughly 50% of all the site’s reviews to be concentrated in the top ~1% of programs in our dataset.
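To make the concentration claim concrete, here is a minimal sketch of how one might measure the share of ratings held by the top fraction of titles. The function name `top_share` and the synthetic Zipf-like counts are assumptions for illustration only; they are not the actual Audible data, so the number printed will not match the figures quoted above.

```python
def top_share(rating_counts, top_fraction):
    """Return the fraction of all ratings held by the top `top_fraction` of titles."""
    ranked = sorted(rating_counts, reverse=True)
    # Always include at least one title, even for very small fractions.
    n_top = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:n_top]) / total if total else 0.0


# Synthetic, Zipf-like rating counts (hypothetical): count is proportional to 1 / rank.
synthetic_counts = [1_000_000 // rank for rank in range(1, 10_001)]
share = top_share(synthetic_counts, 0.01)
print(f"Top 1% of titles hold {share:.0%} of ratings in the synthetic data")
```

Running the same function on the real per-title rating counts would tell us how many programs we need to cover to reach a target share of reviews.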

There’s one obstacle to scraping these reviews from the HTML. Most of the reviews of the most popular titles are hidden beneath a ‘See More’ button.

Inspecting the network requests made when that button is clicked provides a solution: a URL that points to a spare page containing just review text and ratings. By incrementing the page number in this URL until the HTML comes back blank, we can gather all of the reviews and ratings for a given program. Because this page is so quick to load (and because we’re using 16 threads), we’re able to scrape roughly 80% of all reviews by covering only the top ~6% (26,101) of programs, all in a reasonable amount of time.
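The pagination loop and the threading can be sketched roughly as follows. This is not the original scraper: `fetch_page` and `parse_reviews` are hypothetical callables that we assume you supply (one downloads the HTML for a program and page number, the other extracts reviews from it), and `max_pages` is an added safety cap. Only the "increment until blank" logic and the 16 worker threads come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_program(program_id, fetch_page, parse_reviews, max_pages=10_000):
    """Walk the review pages for one program until a page yields no reviews."""
    reviews = []
    for page in range(1, max_pages + 1):
        html = fetch_page(program_id, page)   # hypothetical: returns the page's HTML
        batch = parse_reviews(html)           # hypothetical: returns a list of reviews
        if not batch:                         # a blank page means we are past the end
            break
        reviews.extend(batch)
    return reviews


def scrape_all(program_ids, fetch_page, parse_reviews, workers=16):
    """Scrape many programs concurrently; returns {program_id: [reviews]}."""
    program_ids = list(program_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda pid: scrape_program(pid, fetch_page, parse_reviews),
            program_ids,
        )
        return dict(zip(program_ids, results))
```

Threads suit this job because each worker spends most of its time waiting on the network rather than computing. Passing the fetch and parse steps in as arguments also lets us test the loop against canned pages before pointing it at the live site.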

While scraping, we’ll take the opportunity to do a few housekeeping tasks. We’ll be using our text to train a machine learning model to classify reviews as favorable (4 or 5 stars) or unfavorable (1, 2 or 3 stars). If we leave the titles and authors in our review text, our model is likely to ‘memorize’ that, say, a given program has very high review scores. This does not bode well for generalization, so we’ll want to remove the authors and titles from the reviews. We’ll do it now so we don’t need to store that information for each review.
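The labeling rule itself is simple enough to state in a few lines. The sketch below assumes star ratings arrive as integers from 1 to 5; the function name `label_review` and the 1/0 encoding are our own choices for illustration.

```python
def label_review(stars):
    """Map a 1-5 star rating to 1 (favorable: 4 or 5 stars) or 0 (unfavorable: 1-3 stars)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating between 1 and 5, got {stars!r}")
    return 1 if stars >= 4 else 0
```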

The process of replacing words with a common token (‘<unk>’, short for ‘unknown’) is called ‘unking’. In addition to unking the author and title in each review, we’ll also replace periods with the word ‘stop’, strip all other punctuation, and change all text to lower case. This reduces the number of distinct ‘words’ in our dataset.
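Here is one way the cleaning steps might look in code. It is a sketch under a few assumptions: we match the title and author only as whole phrases (so a reviewer who writes just a surname is not caught), we normalize the review before unking so that the ‘<unk>’ token survives punctuation stripping, and the helper names `normalize` and `clean_review` are ours.

```python
import re

UNK = "<unk>"


def normalize(text):
    """Lower-case, turn periods into the word 'stop', and strip other punctuation."""
    text = text.lower().replace(".", " stop ")
    # Keep word characters (letters, digits, underscores) and whitespace only.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())  # collapse runs of whitespace


def clean_review(text, title, author):
    """Normalize a review and replace whole-phrase mentions of title/author with <unk>."""
    cleaned = normalize(text)
    for phrase in (title, author):
        target = normalize(phrase)  # normalize the same way so the phrases line up
        if target:
            cleaned = re.sub(r"\b" + re.escape(target) + r"\b", UNK, cleaned)
    return cleaned


print(clean_review("I loved Dune. Frank Herbert is great!", "Dune", "Frank Herbert"))
# i loved <unk> stop <unk> is great
```

Very short or common titles would be unked wherever they appear as ordinary words, so a production version would likely need extra care there.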

A few hours later we have more than 1.5 million reviews and ratings.