Have two politicians ever despised one another more than Bernie Sanders and Donald Trump? Not likely. And yet as much as these two mortal combatants despise one another they do share a few things. As of September 2019 of the 3 million donations given to either Sanders or Trump 108 donors gave to both.

This project is really an excuse to try out Google’s BigQuery. It is cloud based SQL where one can store and analyze terabytes of data almost instantaneously and almost for free.

There were 4 steps to this project:

1. Download the Federal Election Data into a CSV spreadsheet and do some rudimentary cleaning there.

2. Further clean the data in R.

3. Mutate and filter the data using Google’s BigQuery SQL.

4. Graph it in Neo4J

Step 1: MAKING A SPREADSHEET OF THE DATA.

Open the FEC file from this page:

https://www.fec.gov/data/committee/C00580100/?cycle=2016&tab=filings

Click on the download csv button. NOTE the label is one month after the data. 5/1 to 5/30 file is labeled July.

Locate file and rename it June.

Open file in Libre or whatever your favorite spreadsheet program is. The top of the sheet is littered with extraneous detail. Delete these rows.

There are many more columns than are of any use to us . Open a new spreadsheet and paste the column headings below into it, then copy and paste the relevant columns from the original FEC file into this new spreadsheet. In this way we preserve the original data file, drop the extraneous , and fix the col names.

I confess I do not recall now why but I remember the best way to do this was to paste right on top of the column name and then when done, inserting a new blank row, and re- inserting the col names again.

name_last name_first address city state zip amnt employer occupation

We will be stacking all of our data sets together so it is critical that the columns are always in this order.

Save this sheet as something like Sanders_June_19_Select.csv

Step 2: CLEAN THE DATA IN R

Load tidyverse and Lubridate

library(tidyverse)

library(lubridate)

Next run these steps over the file for each month for each candidate.

Sanders_Oct_Nov_16_Select <- read_csv(“Sanders_Oct_Nov_16_Select.csv”)

View(Sanders_Oct_Nov_16_Select)

Sanders_Oct_Nov_16_Select$zip <-as.integer(gsub(‘[a-zA-Z]’, ”,Sanders_Oct_Nov_16_Select$zip))

Sanders_Oct_Nov_16_Select$amnt <- round(Sanders_Oct_Nov_16_Select$amnt)

Sanders_Oct_Nov_16_Select$date <- as.Date(as.character(Sanders_Oct_Nov_16_Select$date) , “%Y%m%d”)

dim(Sanders_Oct_Nov_16_Select)

Sanders_Oct_Nov_16_Select <- na.omit(na.omit(Sanders_Oct_Nov_16_Select, cols=c(“name_last”,”address”,”city”,”state”,”zip”,”amnt”,”date”)))

dim(Sanders_Oct_Nov_16_Select)

write_csv(Sanders_Oct_Nov_16_Select , “Sanders_Oct_Nov_16_Select_A.csv”)

Step 3: Mutate and filter the data using Google’s BigQuery SQL.

The first thing to do is to upload each file to Google’s cloud Storage. Once every file is uploaded to Google Cloud Storage from within BigQuery create a dataset.

* * *

A frustration saving note about Google’s BigQuery.

Upload your data to Google Cloud Storage and import it into BigQuery from there.

Bewilderingly, the name that Google Cloud Storage assigns needs to be tweaked to run in BigQuery.

This is the name assigned by Google Cloud Storage:

jf11579campaigncontributions:obama.Sanders_2016_through_Sept_2019_2

Note the full colon, change that to a decimal.

And then enclose the entire name between backticks NOT single quote marks.

It needs to look like this:

`jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`

* * *

SELECT *

FROM `jf11579campaigncontributions.test.Test_bernie`

UNION ALL

SELECT *

FROM `jf11579campaigncontributions.test.Test_trump`

Once all of the files are imported into BigQuery the next thing to do is to join them all into a single table. I chose to separately join all of the Trump files together into one file and the Sanders files together in a separate file. Later I joined these two together.

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_Jan_16_Select_BB`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_Feb_16_Select_BB`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_March_16_Select_BB`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_April_16_Select_BB`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_May_16_Select_BB`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_June_16_Select_BB`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_July_16_Select_BB`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_Aug_16_Select`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_Sept_16_Select`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_Oct_16_Select`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_Oct_Nov_16_Select`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_Jan_Feb_March_19_C`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_April_May_June_C`

UNION ALL

SELECT *

FROM

`jf11579campaigncontributions.obama.Trump_July_Aug_Sept_19_Select_C`

Let us take a look at the table. The Trump file has 1.3 million rows

SELECT *

FROM

`jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

Limit 3

name_last name_first address city state zip amnt employer occupation date candidate Carson Richard POB 123 APO AA 34022 300 U.S. DEPARTMENT OF STATE MANAGEMENT OFFICER 2016-03-03 Trump Carson Richard POB 123 APO AA 34022 49 U.S. DEPARTMENT OF STATE MANAGEMENT OFFICER 2016-03-10 Trump Cool Dwayne POB 456 FPO AE 9645 50 UNIVERSITY OF TENNESSEE USN ELECTRONICS TECHNICIAN 2016-03-17 Trump

The Sanders donation data has 2million rows.

SELECT *

FROM

`jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`

Limit 3

Here we join Trump and Sanders

SELECT *

FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`

UNION ALL

SELECT *

FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

But before we get into the weeds with actual data let us have a run through with a practice data set first.

Here is a practice Sanders test data

name_first name_last address zip state amount candidate ann adams 1 A 11111 NY 5 bernie agatha boxer 2 B 11111 NY 10 bernie attilda carlson 3 C 11222 CT 15 bernie annabelle davidson 4 d 12345 NJ 15 bernie alexa ericson 5 fifith 12566 NJ 10 bernie amelia frazer 6 F 11111 NJ 10 bernie abigail grant 7 seventy 11510 NJ 5 bernie bill gunter 1 first 11111 NY 20 bernie amy harris 8 H 11545 CT 10 bernie ben harris 2 second 21212 NJ 25 bernie athena izzo 9 I 11579 NY 15 bernie angela james 10J 11579 NY 20 bernie

And here is the Trump test data

name_first name_last address zip state amount candidate ballontine adams 2 B 55555 AL 20 trump bain boxer 4 A 66666 WY 5 trump brian frazer 5 fifth 11111 BY 10 trump babson gomez 1 A 11111 NY 5 trump bill gunter 1 first 11111 NY 20 trump ben harris 2 second 21212 NJ 25 trump baker hill 3 B 11579 NY 10 trump bob inez 3 third 31313 VT 30 trump athena izzo 9 I 11579 NY 15 trump angela james 10J 11579 NY 20 trump brad james 4 fourth 11222 CT 5 trump

Again, upload them to Google cloud then move them into BigQuery by creating a datasets. Once that is done join the two datasets.

SELECT *

FROM `jf11579campaigncontributions.test.Test_bernie`

UNION ALL

SELECT *

FROM `jf11579campaigncontributions.test.Test_trump`

Here is the combined Sanders and Trump test data.

name_first name_last address zip state amount candidate attilda carlson 3 C 11222 CT 15 bernie amy harris 8 H 11545 CT 10 bernie annabelle davidson 4 d 12345 NJ 15 bernie alexa ericson 5 fifith 12566 NJ 10 bernie amelia frazer 6 F 11111 NJ 10 bernie abigail grant 7 seventy 11510 NJ 5 bernie ben harris 2 second 21212 NJ 25 bernie ann adams 1 A 11111 NY 5 bernie agatha boxer 2 B 11111 NY 10 bernie athena izzo 9 I 11579 NY 15 bernie angela james 10J 11579 NY 20 bernie bill gunter 1 first 11111 NY 20 bernie ballontine adams 2 B 55555 AL 20 trump baker hill 3 B 11579 NY 10 trump athena izzo 9 I 11579 NY 15 trump angela james 10J 11579 NY 20 trump bob inez 3 third 31313 VT 30 trump bill gunter 1 first 11111 NY 20 trump brian frazer 5 fifth 11111 BY 10 trump babson gomez 1 A 11111 NY 5 trump bain boxer 4 A 66666 WY 5 trump brad james 4 fourth 11222 CT 5 trump ben harris 2 second 21212 NJ 25 trump

Here is a not small difference between SQL and BigQuery’s version of SQL. Big Query only offers –as of Jan 2020–a single version of filtering by intersection: INTERSECT DISTINCT. It is a combined GROUP BY and INNER JOIN.

Let us look for the same last name that appear in both Sander’s and Trump’s list of donors:

SELECT name_last

FROM `jf11579campaigncontributions.test.Test_bernie`

INTERSECT DISTINCT

SELECT name_last

FROM `jf11579campaigncontributions.test.Test_trump`

This returns 7 names:

A quick glance at the Sanders Trump joined table shows that there are more than one harris and frazer so which harris and frazer contributed to both candidates?

Let us add more criteria, name_first:

SELECT name_last, name_first

FROM `jf11579campaigncontributions.test.Test_bernie`

INTERSECT DISTINCT

SELECT name_last, name_first

FROM `jf11579campaigncontributions.test.Test_trump`

Our results drop from 7 to 4.

And if we add in address our results are unchanged.

And if we add in address our results are unchanged.

SELECT name_last, name_first,address

FROM `jf11579campaigncontributions.test.Test_bernie`

INTERSECT DISTINCT

SELECT name_last, name_first, address

FROM `jf11579campaigncontributions.test.Test_trump`

Now back to our real world data.

MOST unfortunately our real world data do not work out so simply.

Let us finally compare the two tables:

SELECT name_last

FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`

INTERSECT DISTINCT

SELECT name_last

FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

This returns 1991 names. But this is not meaningful. We are only looking at names in common so far, not people in common. Let us add name_first to the criteria. This drops us down to 551 rows.

SELECT name_last , name_first

FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`

INTERSECT DISTINCT

SELECT name_last, name_first

FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

Heartbreakingly , if we add address we get 0 results.

SELECT name_last , name_first, address

FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`

INTERSECT DISTINCT

SELECT name_last, name_first, address

FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

Interestingly, if we search only on matching addresses we get 11. Apparently there are 11 households where different people residing under the same roof where one donated for Sanders while the other donated for Trump. Meals must be quiet there.

SELECT address

FROM `jf11579campaigncontributions.obama.Sanders_2016_through_Sept_2019_2`

INTERSECT DISTINCT

SELECT address

FROM `jf11579campaigncontributions.obama.TRUMP_2016_thru_Sept_19_B`

But back to our question. If we search for matches on name_last and name_first we get 551 matches. If we add zip it drops to 3.

If we search name_last and name_first and state we get back 108

It would be nice if we could match on all eight criteria (name_last, name_first, address, city, state, zip, employer) but we cannot . Or maybe we can and the truth of the matter is that there are zero people who gave to both Trump and Sanders but out of 3 million donors I am guessing this is not the case.

If you have never donated online to a candidate, which is where this data comes from , one has to enter quite a bit of detail about yourself including employer and occupation. People are entering this themselves so the data should be accurate. It is my guess that people would prefer to give anonymously and resent having to disclose this much detail about themselves and so somewhat sabotage the data but at this point , this is only a guess.

Another possibility may be BigQuery’s limited JOIN capability. INTERSECT DISTINCT which is an inner join + group by might be giving us lower results than if we had done this in SQL. In any event next we will graph it using Neo4J

Step 4. GRAPH IN NEO4J

Step one clear out any old data.

MATCH (n)

DETACH DELETE n

Step two load the dat and create the nodes.

LOAD CSV WITH HEADERS FROM ‘file:///Bernie_Trump_108_Indexed.csv’ AS csvDATA

MERGE (candidateTrump:CandidateTrump{candidate:”trump”})

MERGE (candidateSanders:CandidateSanders{candidate:”sanders”})

MERGE (donor:Donor {LastName: csvDATA.name_last, FirstName: csvDATA.name_first, FullName: csvDATA.full_name, State: csvDATA.state})

FOREACH (_ IN case when csvDATA.candidate = ‘trump’ then [1] else [] end|

CREATE (donor)-[donorRel:DONATED]->(candidateTrump)

)

FOREACH (_ IN case when csvDATA.candidate = ‘sanders’ then [1] else [] end|

CREATE (donor)-[donorRel:DONATED]->(candidateSanders)

)

Step three graph.

MATCH p = (d:Donor)-[r:DONATED]->()

RETURN p