The purpose of this tutorial is to show a concrete example of how web scraping can be used to build a dataset purely from an external, non-preformatted source of data.

Our example will be the website Fivebooks.com, which I’ve been using for many years to find book recommendations. As explained on the website itself, Fivebooks asks experts to recommend the five best books in their subject and explain their selection in an interview. Their archive consists of more than one thousand interviews (i.e. five thousand book recommendations), and they add two new interviews every week.

Our objective will be to use R, and in particular the rvest package, to gather the entire list of books recommended on Fivebooks, and see which ones are the most popular.

Prerequisites

This tutorial assumes a very basic knowledge of HTML/CSS, which we’ll use to select elements in web pages. If you don’t know anything about those two languages, I’d strongly recommend spending an hour or two learning the basics. Not only will you be able to apply this tutorial, but HTML and CSS are very useful languages to understand for everyday tasks involving web pages.

Khan Academy has a very good interactive course on the subject: Intro to HTML/CSS: Making webpages. (In order to apply this tutorial you only really need to go through the Intro to HTML and Intro to CSS sections.)

Packages

rvest is part of the tidyverse. We’ll also be using a few other packages, namely dplyr, stringr, and data.table. If you haven’t installed these packages before, make sure to run install.packages(c("rvest", "dplyr", "stringr", "data.table")) first.

library(rvest)
library(dplyr)
library(stringr)
library(data.table)

What do we want?

In order to get all the data we want and check the most popular books and authors on the website, we need to gather the following elements on each interview page:

The general category of the interview (e.g. Philosophy or World & Travel)

The subject of the interview (e.g. Emotional Intelligence)

The titles of the recommended books (e.g. Wuthering Heights)

The authors of those books (e.g. Emily Brontë)

The URL of the page.

But in order to do this, we’ll also need to build a list of all the interviews on Fivebooks. There could be several ways to do this, including using Google’s indexing to look for pages from the fivebooks.com domain. But if you go to any interview page on Fivebooks, you’ll notice that after some scrolling, there are links to other interviews on the right. Every time we visit an interview page, we’ll check these links, and if any of them have not been visited yet, we’ll simply add them to our list of interviews to check next.

Let’s start coding!

We need to start from somewhere, so we’ll simply use a recent interview that appeared on the home page of Fivebooks when I wrote this tutorial.

to_be_visited <- "https://fivebooks.com/best-books/adam-smith-dennis-rasmussen/"

For now our vector to_be_visited only contains one page that we need to look at. But as soon as we visit it, we’ll add to to_be_visited all the recommended links given in the sidebar. Slowly this vector will start including all the interviews on the website.

visited <- c()

As soon as a page has been visited, we’ll remove it from to_be_visited and add it to visited. We’ll keep going from page to page in to_be_visited like this, until all interviews have been checked and the sidebar recommendations stop adding new links to to_be_visited.

fivebooks_data <- list()
i <- 1

As we go from page to page, we’ll create a small data.frame for each page with the information we need, and we’ll add it to a list called fivebooks_data . At the end of the process, we’ll be able to use the rbindlist function from the data.table package to elegantly merge this list into one data.frame.
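As a quick illustration of this pattern (with made-up subjects and book titles), rbindlist merges a list of small data.frames row-wise into a single table:

```r
library(data.table)

# Two hypothetical per-interview data.frames, as built inside the loop
page1 <- data.frame(subject = "Adam Smith", book = c("Book A", "Book B"),
                    stringsAsFactors = FALSE)
page2 <- data.frame(subject = "Globalisation", book = c("Book C", "Book D"),
                    stringsAsFactors = FALSE)

fivebooks_data <- list(page1, page2)
rbindlist(fivebooks_data)  # one table with 4 rows, columns subject and book
```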

Our main loop will take the following form:

while (length(to_be_visited) > 0) {
  # 1. Select the next page in to_be_visited
  # 2. Gather the information we need from the interview
  # 3. Put this information in a data.frame and insert it into fivebooks_data
  # 4. Find the recommended links and add them to to_be_visited
  # 5. Remove the page from to_be_visited and add it to visited
}
# Merge fivebooks_data into a data.frame

Instructions 1 and 5 are probably the easiest to write:

while (length(to_be_visited) > 0) {

  # 1. Select the next page in to_be_visited
  current_url <- to_be_visited[1]

  # 2. Gather the information we need from the interview

  # 3. Put this information in a small data.frame and insert it into the list

  # 4. Find the recommended links and add them to to_be_visited,
  #    if they haven't been visited already!

  # 5. Remove the page from to_be_visited and add it to visited
  visited <- c(visited, current_url)
  to_be_visited <- setdiff(to_be_visited, visited)
}

Gathering information from a webpage using rvest

In order to write instructions 2, 3, and 4, we’re going to have to use rvest to scrape data from the page. Web scraping itself isn’t terribly complicated, and rvest really makes the process very easy in R.

For any given page, the first thing we’ll need to do is to read the HTML source:

page <- read_html(current_url)

From there, our object page contains all the source code, and we simply need to navigate its contents and find the relevant pieces of data. To navigate in the source, we’ll use HTML and CSS tags. Fortunately, modern web browsers give us tools to find the tags we need without having to understand too much about the intricate structure of the page.

If you go to a page like the Adam Smith book interview, you can open the developer tools of your browser (Ctrl + Shift + I in Firefox and Chrome for Windows, and Cmd + Alt + I on a Mac). You can then use the element selector (step 1 below), and click on any element in the page to automatically select it in the HTML source (step 2).

This tool is an excellent way to navigate the HTML source of a page, and find how to exactly select the data you need. By looking for the various elements we need in the page, we find that:

The name of the category is mentioned at the top of each page, in a CSS class called interview-heading.

The subject of the interview is also at the top of the page, in an element with class subject.

The book names are written in elements with the class title, inside a larger element with class interview-page-bookshelf. Keep in mind that there are 5 book names to collect, so we’ll need to use the right function to collect all elements under this class, not just the first one.

Finally, the name of the author is also in the “bookshelf” (class interview-page-bookshelf), in each element of class book-title. These elements contain the name of the book, followed by the name of the author, such as “Adam Smith in His Time and Ours by Jerry Muller”.

Now, how do we select elements in a web page using rvest? Simply by using the page object we created above, and the html_node function to select an element based on its HTML tag or CSS class. CSS classes should be prefixed with a dot. For example, page %>% html_node(".interview-heading") selects the first element of the page with a CSS class of interview-heading. (This notation uses the pipe operator introduced by the magrittr package and included in rvest.)

A couple more things to know:

When multiple elements should be selected, such as for our books and authors, we need to use html_nodes (plural) instead of html_node (singular) to return all elements, not just the first one.

We can specify a series of nested CSS classes in the function, in which case rvest will select the element that uses the first CSS class, and then the element inside it that uses the second class, etc. For example, html_nodes(".interview-page-bookshelf .title") selects the element with a class of interview-page-bookshelf, and then looks for elements of class title inside.

Once we have selected the right elements, our last step is to extract the text inside each element. Otherwise the element will contain its HTML and CSS tags, which is not what we want. For this, we use the html_text function and apply it to the selected elements. Thus, "<span class=\"subject\">Adam Smith</span>" simply becomes "Adam Smith".

To summarise:

We import the page using read_html;

We select the elements using html_node or html_nodes;

We extract the text using html_text.
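These three steps can be tried on a tiny hand-written HTML fragment standing in for a real interview page:

```r
library(rvest)

# A minimal stand-in for a real page, parsed straight from a string
page <- read_html('<div><span class="subject">Adam Smith</span></div>')

page %>% html_node(".subject") %>% html_text()
# [1] "Adam Smith"
```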

Finally, I’m using the str_replace function from the stringr package to only extract the name of the author from the book-title elements. This means for example that "Adam Smith in His Time and Ours by Jerry Muller" becomes "Jerry Muller".
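Concretely, str_replace removes everything up to and including the line break and “by ” (in the source text, a newline separates the title from the author):

```r
library(stringr)

# The book-title text is "<title>\nby <author>"
str_replace("Adam Smith in His Time and Ours\nby Jerry Muller", ".*\nby ", "")
# [1] "Jerry Muller"
```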

This might seem like a lot of functions to use, but our code is actually not that long at this point:

while (length(to_be_visited) > 0) {

  # 1. Select the next page in to_be_visited
  current_url <- to_be_visited[1]

  # 2. Gather the information we need from the interview
  page <- read_html(current_url)
  category <- page %>% html_node(".interview-heading") %>% html_text()
  subject <- page %>% html_node(".subject") %>% html_text()
  book <- page %>% html_nodes(".interview-page-bookshelf .title") %>% html_text()
  author <- page %>% html_nodes(".interview-page-bookshelf .book-title") %>%
    html_text() %>% str_replace(".*\nby ", "")

  # 3. Put this information in a small data.frame and insert it into the list

  # 4. Find the recommended links and add them to to_be_visited

  # 5. Remove the page from to_be_visited and add it to visited
  visited <- c(visited, current_url)
  to_be_visited <- setdiff(to_be_visited, visited)
}

Storing the information

For each interview, we then create a small data.frame at step 3. This includes the data from all of our variables: category, subject, book, author and current_url.

This data.frame is stored in the fivebooks_data list that we created before the loop, using the index i. After each interview, we increment i by 1 so that the next interview is stored in the next slot of the list.

while (length(to_be_visited) > 0) {

  # 1. Select the next page in to_be_visited
  current_url <- to_be_visited[1]

  # 2. Gather the information we need from the interview
  page <- read_html(current_url)
  category <- page %>% html_node(".interview-heading") %>% html_text()
  subject <- page %>% html_node(".subject") %>% html_text()
  book <- page %>% html_nodes(".interview-page-bookshelf .title") %>% html_text()
  author <- page %>% html_nodes(".interview-page-bookshelf .book-title") %>%
    html_text() %>% str_replace(".*\nby ", "")

  # 3. Put this information in a small data.frame and insert it into the list
  fivebooks_data[[i]] <- data.frame(category, subject, book, author,
                                    url = current_url, stringsAsFactors = FALSE)
  i <- i + 1

  # 4. Find the recommended links and add them to to_be_visited

  # 5. Remove the page from to_be_visited and add it to visited
  visited <- c(visited, current_url)
  to_be_visited <- setdiff(to_be_visited, visited)
}

Adding recommendations

Let’s not forget that our scraping process relies on gathering interview links from the recommendations in the sidebar. Here we’ll need another bit of rvest code: page %>% html_nodes(".related-item a") %>% html_attr("href").

This selects all links (HTML <a> tags) inside all elements with CSS class related-item. After the links are selected, we use the html_attr function to extract an HTML attribute of those links. Here we’re interested in the href attribute, since it holds the actual URLs of the recommended interviews.
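A minimal sketch of this selection, using a hand-written HTML fragment in place of a real page (the URL is made up):

```r
library(rvest)

page <- read_html(
  '<div class="related-item"><a href="https://fivebooks.com/best-books/b/">Next</a></div>'
)

page %>% html_nodes(".related-item a") %>% html_attr("href")
# [1] "https://fivebooks.com/best-books/b/"
```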

We then add those links to the vector to_be_visited . Note that this could potentially create duplicates in this vector, but since we later use setdiff at step 5, all duplicates will be removed.
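Here is a small illustration of that deduplication, with hypothetical URLs — setdiff returns the unique elements of its first argument that are absent from the second:

```r
# Hypothetical crawler state after visiting one page
visited <- c("https://fivebooks.com/best-books/a/")
to_be_visited <- c("https://fivebooks.com/best-books/a/",  # already visited
                   "https://fivebooks.com/best-books/b/",
                   "https://fivebooks.com/best-books/b/")  # duplicate link

setdiff(to_be_visited, visited)
# [1] "https://fivebooks.com/best-books/b/"
```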

while (length(to_be_visited) > 0) {

  # 1. Select the next page in to_be_visited
  current_url <- to_be_visited[1]

  # 2. Gather the information we need from the interview
  page <- read_html(current_url)
  category <- page %>% html_node(".interview-heading") %>% html_text()
  subject <- page %>% html_node(".subject") %>% html_text()
  book <- page %>% html_nodes(".interview-page-bookshelf .title") %>% html_text()
  author <- page %>% html_nodes(".interview-page-bookshelf .book-title") %>%
    html_text() %>% str_replace(".*\nby ", "")

  # 3. Put this information in a small data.frame and insert it into the list
  fivebooks_data[[i]] <- data.frame(category, subject, book, author,
                                    url = current_url, stringsAsFactors = FALSE)
  i <- i + 1

  # 4. Find the recommended links and add them to to_be_visited
  recommended <- page %>% html_nodes(".related-item a") %>% html_attr("href")
  to_be_visited <- c(to_be_visited, recommended)

  # 5. Remove the page from to_be_visited and add it to visited
  visited <- c(visited, current_url)
  to_be_visited <- setdiff(to_be_visited, visited)
}

Putting the final touches

When I execute a web scraping script that will take a while to run, I like to know how far in the process it is, to make sure it is actually running and not hanging because of a bug. For this we’ll simply add a console message at the bottom of the loop, to tell us the subject of the interview that was just processed, how many pages have been visited so far, and how many are left to visit.

Finally, after the loop, we use the rbindlist function from the data.table package to bind the rows of all the data.frames in our fivebooks_data list.

Phew! That was a lot of explanations overall, but again the final code is not that long:

while (length(to_be_visited) > 0) {

  # 1. Select the next page in to_be_visited
  current_url <- to_be_visited[1]

  # 2. Gather the information we need from the interview
  page <- read_html(current_url)
  category <- page %>% html_node(".interview-heading") %>% html_text()
  subject <- page %>% html_node(".subject") %>% html_text()
  book <- page %>% html_nodes(".interview-page-bookshelf .title") %>% html_text()
  author <- page %>% html_nodes(".interview-page-bookshelf .book-title") %>%
    html_text() %>% str_replace(".*\nby ", "")

  # 3. Put this information in a small data.frame and insert it into the list
  fivebooks_data[[i]] <- data.frame(category, subject, book, author,
                                    url = current_url, stringsAsFactors = FALSE)
  i <- i + 1

  # 4. Find the recommended links and add them to to_be_visited
  recommended <- page %>% html_nodes(".related-item a") %>% html_attr("href")
  to_be_visited <- c(to_be_visited, recommended)

  # 5. Remove the page from to_be_visited and add it to visited
  visited <- c(visited, current_url)
  to_be_visited <- setdiff(to_be_visited, visited)

  message(subject, " - ", length(visited), " pages visited",
          " - ", length(to_be_visited), " pages left to visit")
}

fivebooks_data <- rbindlist(fivebooks_data)

Let’s run it!

As the script starts running, we get the expected messages, telling us the interview being scraped, how many pages have been visited and how many are left to visit.

Adam Smith - 1 pages visited - 46 pages left to visit
Economic History - 2 pages visited - 49 pages left to visit
An Economic Historian's Favourite Books - 3 pages visited - 55 pages left to visit
The Lessons of the Great Depression - 4 pages visited - 54 pages left to visit
Globalisation - 5 pages visited - 66 pages left to visit
Entrepreneurship - 6 pages visited - 67 pages left to visit
Best Economics Books of 2017 - 7 pages visited - 66 pages left to visit
The Indian Economy - 8 pages visited - 70 pages left to visit
Saving Capitalism and Democracy - 9 pages visited - 69 pages left to visit
Financial Speculation - 10 pages visited - 68 pages left to visit
(...)

As expected, the number of pages “left to visit” quickly increases, as new interviews are found in the recommended links. The script takes several minutes to go through all the interviews, but finally ends with a total of 1,018 interview pages visited.

Analysing our results

I won’t go into details since this isn’t the goal of this tutorial, but by using dplyr we can easily tabulate counts on our newly created dataset, and find out which books and authors are the most recommended across more than 1,000 interviews on Fivebooks!

Most cited books overall:

fivebooks_data %>%
  group_by(author, book) %>%
  tally %>%
  ungroup %>%
  top_n(n = 10, wt = n)

## # A tibble: 18 x 3
##    author           book                         n
##    <chr>            <chr>                    <int>
##  1 Adam Smith       The Wealth of Nations        5
##  2 Aldous Huxley    Brave New World              5
##  3 Charles Darwin   On the Origin of Species     5
##  4 Charles Dickens  A Tale of Two Cities         5
##  5 Charlotte Brontë Jane Eyre                    6
##  6 Clay Shirky      Here Comes Everybody         5
##  7 Evelyn Waugh     Scoop                        7
##  8 George Eliot     Middlemarch                  5
##  9 Harper Lee       To Kill a Mockingbird        5
## 10 Homer            The Odyssey                  6
## 11 John Buchan      The Thirty-Nine Steps        5
## 12 John Stuart Mill On Liberty                   6
## 13 Leo Tolstoy      War and Peace                7
## 14 Mary Shelley     Frankenstein                 5
## 15 Michael Lewis    Moneyball                    6
## 16 Michael Lewis    The Big Short                5
## 17 Richard Dawkins  The Blind Watchmaker         5
## 18 Vasily Grossman  Life and Fate                5

Most cited authors overall:

fivebooks_data %>%
  filter(author != "") %>%
  group_by(author) %>%
  tally %>%
  ungroup %>%
  top_n(n = 10, wt = n)

## # A tibble: 10 x 2
##    author                  n
##    <chr>               <int>
##  1 Charles Dickens        17
##  2 Evelyn Waugh           11
##  3 George Eliot           12
##  4 George Orwell          11
##  5 Jane Austen            11
##  6 Leo Tolstoy            12
##  7 Michael Lewis          12
##  8 Plato                  11
##  9 Virginia Woolf         14
## 10 William Shakespeare    19

Most cited author by category:

fivebooks_data %>%
  filter(author != "") %>%
  group_by(category, author) %>%
  tally %>%
  top_n(n = 1, wt = n) %>%
  print.data.frame()