Data Analysis Using SQL, Python and BigQuery

Recently Google’s BigQuery came across my radar, so I decided get hands on it to see what all the buzz was about. The US traffic fatality records database looked interesting because it contains information on all traffic accidents in the US where at least 1 person died. No doubt a sad topic, but one that is also quite interesting in terms of how understanding the trends could potentially help prevent future accidents. Not only was I able to zero in on some pretty neat insights, but I was also able the utilize some Python to visualize some of my findings.

BigQuery is a Google Cloud product for storing and accessing large databases super quickly. If you’re interested in experimenting more with BigQuery then you should definitely head over to Kaggle, where you can explore with some of the data sets they have available.

Set Up:

Setting up for a BigQuery notebook is actually quite simple thanks to the fine folks at Kaggle. They created a Python package called (bq_helper) which has some helper functions for getting data out of BigQuery.

Once you’ve imported the helper package the next step is to create a BigQueryHelper object that points to the specific data set you want to use. Below is the BigQueryHelper object I created, which I appropriately named ‘accidents’. The active_project argument takes the BigQuery information, which is currently “bigquery-public-data” for all of the BigQuery data sets on Kaggle. The dataset_name argument take the name of the data set, in this case “nhtsa_traffic_fatalities”.

With those two pieces of code you environment is ready to go! It’s time get analyzing!

Analysis

A good starting point to get a feel for a new data set is to look at the schema, which is simply a description of how the data is organized. This will give you an idea as to how the data is structured. Below is the list of tables in the accidents BigQueryHelper object that I created in the previous step.

Once you have a feel for the tables in the data set, try digging deeper into a specific table to get more information on the columns within that table. I chose the accident_2015 table.

The SchemaField gives us important information about a specific column, such as:

The name of the column

The datatype in the column

Whether or not the column is NULLABLE or not

And a description of the data in that column

So for the state_name column we have the following schema field:

SchemaField(‘state_name’, ‘string’, ‘NULLABLE’, “This data element identifies the state in which the crash occurred.”,())

To take this a step further let’s look at the first few rows of the accidents_2015 table

Ok once you’ve done some exploratory research and got a feel for the data set, it’s time to have some fun writing SQL queries.

The first thing I was interested in was trying to identify was which day of the week the most fatal accidents happen on. So I wrote a query to get the count of unique id’s (consecutive_number in this data set), which I aliased as Count, as well as the day of week for each accident, which I aliased as Day. And finally I sorted the table in descending order based on count.

Just a little clarification. Sunday is the first day of the week (Day 1) in the data set and Saturday is the last day of the week (Day 7), so according to the information returned from this query, Saturday is the most dangerous day to be out on the road and Tuesday is the day where the least amount of fatal accidents occur.

Next I imported matplotlib.pyplot so I could show this information graphically.

Next, I was interested in which hours of the day tend to be the most dangerous in terms of fatal accidents.

Below is the query, again I selected the count of the unique id of each accident, extracted the hour of day that the accident occurred on, grouped by hour and ordered by count in descending order.

I assigned ‘query3’ to a DataFrame called accidents_by_hour and then called the DataFrame to run my query.

This data is recorded from 00-23 hours in a day, so if I’m interpreting correctly the three most deadly hours to be on the road are 6:00pm, 8:00pm, and 5:00pm, which makes sense as those times are people are heading home from work, running errands after the workday, etc.

Instead of another line graph, I imported seaborn and created a barplot to visualize the information and added a splash of color.

Finally I wanted to see which states had the most hit and runs, so I can avoid those in the future.

Here are the top 10 states with the most hit and runs. Your guess is as good as mine as to what the “Unknown” grouping is, but for the year 2015 California had the most hit and runs, followed by Florida and Texas.

I hope you enjoyed following along with this short data analysis exercise and hopefully this sparked your interested in BigQuery. Once again, if you’re interested head over to Kaggle to check out some of the BigQuery data sets they have available. As always, if you have any questions or comments please shoot me an email at:

Kristi Smith,

ksmith@ecapitaladvisors.com