In this R tutorial, we will be web scraping the Wikipedia page List of United States cities by crime rate. We will use the rvest package to scrape the crime rate table from Wikipedia and create crime rate visualizations. The rvest package provides wrappers around the ‘xml2‘ and ‘httr‘ packages to make it easy to download web pages. Once the data is downloaded, we can manipulate the HTML and XML.

Install and Load Packages

Below are the packages and libraries that we will need to load to complete this tutorial.

Input:

install.packages("rvest")
install.packages("xml2")
install.packages("magrittr")
library(rvest)
library(xml2)
library(magrittr)

Crime Data Table for Extraction

The below Wikipedia crime rate table is based on the crime rates per 100,000 people.

Import the Wiki Crime Rate URL

First, we will simply assign the URL to a variable using the below command.

Input:

wiki_crime <- 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate'

Useful Functions for Web Scraping

Once the wiki URL variable is created, we will be utilizing the below functions to read and manipulate the crime data.

read_html()

The read_html() function downloads the page at the given URL and returns an xml_document object that we can then parse.

Input:

wiki_crime <- read_html(wiki_crime)
wiki_crime

Output:

{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject page-List ...

In order to move to our next step, we must find the XPath to pass to html_nodes(). The steps are below:

1. Go to the URL: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate
2. Find the table with the crime data to extract.
3. Right-click the table -> click Inspect.
4. In the developer panel that appears on the right-hand side, select the Elements tab.
5. Right-click the table element -> Copy -> Copy XPath.

Below is the table from the URL, and to the right is where the elements of the webpage are located. Most importantly, the table that we will be extracting is highlighted below. When you hover your cursor over <table class="wikitable sortable jquery-tablesorter" style="text-align:right">, the table will be highlighted in blue. This ensures you are copying the correct table's XPath for data extraction.

html_nodes() and html_table()

For the next step, we will be using the html_nodes() and html_table() functions. The html_nodes() function extracts pieces out of HTML documents using XPath or CSS selectors. The html_table() function parses an HTML table into a data frame. One thing you will notice upon executing the code below is that us_crime will only show a ‘List of 1‘ under Data.

Input:

us_crime <- wiki_crime %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  html_table(fill = TRUE)

Return a single element with [[1]]

As you can see from the above execution, us_crime shows a ‘List of 1‘ under Data. If you run head(us_crime), it will return 76 rows and omit 8 rows from the data table we scraped. We must use [[1]] to return the single element from the list. After running the code below, us_crime will show as ‘83 obs. of 13 variables’ under Data.

Input:

us_crime <- us_crime[[1]]
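
To see why [[1]] is needed, note that html_table() returns a list of data frames, one per matched node. Below is a minimal sketch of this list behavior using a hand-made list; the small data frame is a stand-in for the real scraped table, not actual output:

```r
# html_table() returns a list of data frames; [[1]] extracts the first
# element itself, while [1] returns a one-element sub-list.
results <- list(data.frame(City       = c("Albuquerque", "Anaheim"),
                           Population = c("559,721", "349,471"),
                           stringsAsFactors = FALSE))
class(results)       # "list"
sub_list <- results[1]
class(sub_list)      # still "list"
crime_df <- results[[1]]
class(crime_df)      # "data.frame"
```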

Data Web Scraping Clean-up

Let’s take a look at the data and see if the data needs to be cleaned.

Input:

head(us_crime)

Output:

       State        City Population Violent Crime                         Violent Crime
1      State        City Population         Total Murder and\nNonnegligent Manslaughter
2 New Mexico Albuquerque    559,721         965.8                                   7.7
3 California     Anaheim    349,471         363.7                                   5.2
4     Alaska   Anchorage    301,239        1070.9                                   8.6
5      Texas   Arlington    387,565         502.1                                   2.1
6    Georgia     Atlanta    464,710        1119.6                                  20.2
  Violent Crime Violent Crime      Violent Crime Property Crime Property Crime
1 Violent Crime          Rape            Robbery Aggravated Assault      Total
2          72.2         301.2              584.8             6073.2     1071.2
3          36.9         125.6              196.0             2872.3      422.4
4         171.6         206.1              684.5             3917.5      559.4
5          53.7         136.5              309.9             3443.6      559.9
6          36.6         429.3              633.5             5499.3     1028.8
  Property Crime  Property Crime              Arson1
1       Burglary Larceny - Theft Motor Vehicle Theft
2         4076.7           925.3                15.9
3         1972.4           477.6                 8.0
4         2975.0           383.1                35.2
5         2657.1           226.5                 7.5
6         3549.1           921.4                10.8

In addition, to print only the column names we can use the colnames() function.

Input:

colnames(us_crime)

Output:

 [1] "State"          "City"           "Population"     "Violent Crime"  "Violent Crime"
 [6] "Violent Crime"  "Violent Crime"  "Violent Crime"  "Property Crime" "Property Crime"
[11] "Property Crime" "Property Crime" "Arson1"

From the above output, the web scraping pulled the main headers into the column names and the sub-headers into row 1. In addition, there is a duplicate Violent Crime column, which pushed the sub-header labels off by one column, and the Arson1 and Motor Vehicle Theft headers ended up merged under one column. This can be fixed by renaming the columns into a more readable format.

We will use the names() function to rename each column to our desired name. After executing the names() function, we can confirm the names are updated by running the colnames() function.

Input:

names(us_crime) <- c("State", "City", "Population", "VCTotal",
                     "VCMurderandManslaughter", "VCRape", "VCRobbery",
                     "VCAssault", "PCTotal", "PCBurglary", "PCLarcenry",
                     "PCVehicleTheft", "Arson")
colnames(us_crime)

Output:

 [1] "State"                   "City"                    "Population"
 [4] "VCTotal"                 "VCMurderandManslaughter" "VCRape"
 [7] "VCRobbery"               "VCAssault"               "PCTotal"
[10] "PCBurglary"              "PCLarcenry"              "PCVehicleTheft"
[13] "Arson"

Now our column names are lining up correctly with the data. The above fixed the main header titles; however, the sub-header row still needs to be removed, which we can do with the below command.

Input:

us_crime <- us_crime[-1, ]
head(us_crime)

Output:

       State        City Population VCTotal VCMurderandManslaughter VCRape VCRobbery
2 New Mexico Albuquerque    559,721   965.8                     7.7   72.2     301.2
3 California     Anaheim    349,471   363.7                     5.2   36.9     125.6
4     Alaska   Anchorage    301,239  1070.9                     8.6  171.6     206.1
5      Texas   Arlington    387,565   502.1                     2.1   53.7     136.5
6    Georgia     Atlanta    464,710  1119.6                    20.2   36.6     429.3
7   Colorado      Aurora    360,237   460.8                     6.7   97.7     124.1
  VCAssault PCTotal PCBurglary PCLarcenry PCVehicleTheft Arson
2     584.8  6073.2     1071.2     4076.7          925.3  15.9
3     196.0  2872.3      422.4     1972.4          477.6   8.0
4     684.5  3917.5      559.4     2975.0          383.1  35.2
5     309.9  3443.6      559.9     2657.1          226.5   7.5
6     633.5  5499.3     1028.8     3549.1          921.4  10.8
7     232.3  2936.7      467.2     2119.4          350.0  17.8

The sub-headers are removed, but the crime table now starts with a row ID of 2. This is not a big deal, but we can reset the row IDs with the seq() function.

Input:

rownames(us_crime) <- seq(length = nrow(us_crime))
head(us_crime)

Output:

       State        City Population VCTotal VCMurderandManslaughter VCRape VCRobbery
1 New Mexico Albuquerque    559,721   965.8                     7.7   72.2     301.2
2 California     Anaheim    349,471   363.7                     5.2   36.9     125.6
3     Alaska   Anchorage    301,239  1070.9                     8.6  171.6     206.1
4      Texas   Arlington    387,565   502.1                     2.1   53.7     136.5
5    Georgia     Atlanta    464,710  1119.6                    20.2   36.6     429.3
6   Colorado      Aurora    360,237   460.8                     6.7   97.7     124.1
  VCAssault PCTotal PCBurglary PCLarcenry PCVehicleTheft Arson
1     584.8  6073.2     1071.2     4076.7          925.3  15.9
2     196.0  2872.3      422.4     1972.4          477.6   8.0
3     684.5  3917.5      559.4     2975.0          383.1  35.2
4     309.9  3443.6      559.9     2657.1          226.5   7.5
5     633.5  5499.3     1028.8     3549.1          921.4  10.8
6     232.3  2936.7      467.2     2119.4          350.0  17.8
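
One more clean-up step worth considering before plotting: html_table() typically returns every column as character text, and Population still contains comma separators (e.g. 559,721). Below is a minimal sketch of the conversion; the small data frame is a stand-in for the real us_crime, which has all 13 columns:

```r
# Stand-in for the scraped table; in the real us_crime these columns
# arrive as character strings.
us_crime <- data.frame(
  City       = c("Albuquerque", "Anaheim"),
  Population = c("559,721", "349,471"),
  VCTotal    = c("965.8", "363.7"),
  stringsAsFactors = FALSE
)
# Strip the thousands separators, then convert to numeric
us_crime$Population <- as.numeric(gsub(",", "", us_crime$Population))
us_crime$VCTotal    <- as.numeric(us_crime$VCTotal)
us_crime$Population  # 559721 349471
```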

write.csv() Function Dataset Export

A very cool feature within R is the ability to export the US crime data we imported from Wikipedia. The write.csv() function will export the dataset to the working directory set in RStudio. Alternatively, you can download the file from the dataset page. Also, make sure to add row.names = FALSE unless you would like the row IDs that R creates to be included as a column in the export.

Input:

write.csv(us_crime, "wiki_united_states_crime_data.csv", row.names = FALSE)
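
To confirm the export worked, you can read the file straight back with read.csv(). Here is a quick round-trip sketch using a temporary file and a toy data frame standing in for the full crime table:

```r
# Toy data frame standing in for us_crime
df <- data.frame(City    = c("Albuquerque", "Anaheim"),
                 VCTotal = c(965.8, 363.7))
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
# Read the exported file back and check it matches
back <- read.csv(path)
nrow(back)   # 2
names(back)  # "City" "VCTotal"
```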

With the crime data cleaned, you can create various tables and plots for data visualization. Since there is a large number of cities in the data, I would recommend only plotting the top 10 to 20 cities.
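
As a starting point, here is a sketch of selecting the top cities by violent crime total and plotting them with base R's barplot(). It assumes the columns have already been converted to numeric; the small data frame is a stand-in for the cleaned us_crime:

```r
# Stand-in for the cleaned table with numeric VCTotal
us_crime <- data.frame(
  City    = c("Albuquerque", "Anaheim", "Anchorage", "Arlington", "Atlanta"),
  VCTotal = c(965.8, 363.7, 1070.9, 502.1, 1119.6),
  stringsAsFactors = FALSE
)
# Order descending by violent crime total and keep the top 3
top <- us_crime[order(-us_crime$VCTotal), ][1:3, ]
top$City  # "Atlanta" "Anchorage" "Albuquerque"
# Bar plot of the top cities (las = 2 rotates the axis labels)
barplot(top$VCTotal, names.arg = top$City, las = 2,
        main = "Top cities by violent crime rate per 100,000")
```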