Visualizing the Clinton Email Network in R

This isn’t a post about politics. I do have opinions about the now infamous e-mail server (which will no doubt come out here), but when the WSJ folks made it possible to search the Clinton email releases I though it would be fun to get the data into R to show how well the igraph and ggnetwork packages could work together, and also show how to use svgPanZoom to make it a bit easier to poke around the resulting hairball network.

NOTE: There are a couple “Assignment” blocks in here. My Elements of Data Science students are no doubt following the blog by now so those are meant for you :-) Other intrepid readers can ignore them.

We’ll need some packages:

library (jsonlite) # read in the JSON data from the API library (dplyr) # data munging library (igraph) # work with graphs in R library (ggnetwork) # devtools::install_github("briatte/ggnetwork") library (intergraph) # ggnetwork needs this to wield igraph things library (ggrepel) # fancy, non-ovelapping labels library (svgPanZoom) # zoom, zoom library (SVGAnnotation) # to help svgPanZoom; it's a bioconductor package library (DT) # pretty tables

There’s an API backing the WSJ web app. It’s not advertised, but it’s not hidden either. They were kind enough to actually make this resource available to the public to help them make up their minds as to whether this was a horrible, awful, terrible, inexcusable breach of national security through conceit, hubris and naïvety (see, I have opines :-) – or not – and we really shouldn’t constantly hit their API just because we want to work with the data on our own.

To that end, we grab the data from the API and save the R object off so we can work with the local copy whenever we want to.

if (! file.exists ( "clinton_emails.rda" )) { clinton_emails <- fromJSON ( "http://graphics.wsj.com/hillary-clinton-email-documents/api/search.php?subject=&text=&to=&from=&start=&end=&sort=docDate&order=desc&docid=&limit=27159&offset=0" )$rows save (clinton_emails, file= "clinton_emails.rda" ) } load ( "clinton_emails.rda" )

There are some from/to paris with multipe recipients (~140). We can munge those into shape, but I’m just going to get rid of them since this is just a post about visualizing the network. strsplit and tidyr::unnest can help if you want to preserve those small number of emails.

clinton_emails %>% mutate ( from= trimws (from), to= trimws (to)) %>% filter (from != "" ) %>% filter (to != "" ) %>% filter (! grepl ( ";" , from)) %>% filter (! grepl ( ";" , to)) -> clinton_emails

Assignment: Reduce the number of filter() statements in that code block.

Data with “from” and “to” characteristics lend themselves to graphs. Graphs (in yet another opinion of mine) are inherently objects designed for computation and this could be a fun data set to use to learn some basic graph theory. Let’s make a graph first:

gr <- graph_from_data_frame (clinton_emails[, c ( "from" , "to" )], directed= FALSE )

Aye, that was all we needed to do. Just tell igraph where the “from” and “to” bits are and it does the rest. You can add extra data to nodes & edges, but this will do just fine for this post.

First, we’ll take a look at the degree centrality so we can properly size the nodes for the final vis.

V (gr)$size <- centralization.degree (gr)$res datatable ( arrange ( data_frame ( person= V (gr)$name, centrality_degree= V (gr)$size), desc (centrality_degree)))

The names with higher degrees shouldn’t be a shocker. This is all about former Secretary Clinton and if you google a bit (or just follow politics like some folks follow $SPORTSBALL) you’ll grok why the others are so high on the e-mail frequency list.

Note that this is a bit different than just doing a simple crosstab count:

datatable ( arrange ( ungroup ( count (clinton_emails, from, to)), desc (n)))

Assigment: “pipify” that code block.

That does show that there are a large number of redundant edges. We’ll combine them by simplifying the graph and stroring the sum of the edge connections (it will be stored in the weight attribute as long as there is an existing weight attribute).

E (gr)$weight <- 1 g <- simplify (gr, edge.attr.comb= "sum" )

You can use that weight computationally or to size the line connections between vertices. That will be an exercise left to the reader.

Since we’re just going to visualize the network, we’ll pick a layout and one of my favs is Fruchterman–Reingold. Here’s where we’ll use ggnetwork .

First, we set a random seed since you’ll get a different orientation each time if you don’t (the graph algorithm starts at a random point). Then we tell ggnetwork to use the FR algorithm to do it’s work.

set.seed ( 1492 ) dat <- ggnetwork (g, layout= "fruchtermanreingold" , arrow.gap= 0 , cell.jitter= 0 )

What do arrow.gap and cell.jitter do? Be curious! Hit up help and play with the settings.