This article was posted by Vik Paruchuri.

Introduction

Working with large JSON datasets can be a pain, particularly when they are too large to fit into memory. In cases like this, a combination of command line tools and Python can make for an efficient way to explore and analyze the data. In this post, we’ll look at how to leverage tools like Pandas to explore and map out police activity in Montgomery County, Maryland. We’ll start with a look at the JSON data, then segue into exploration and analysis of the JSON with Python.

When data is stored in SQL databases, it tends to follow a rigid structure that looks like a table. Here’s an example from a SQLite database:

id|code|name|area|area_land|area_water|population|population_growth|birth_rate|death_rate|migration_rate|created_at|updated_at 1|af|Afghanistan|652230|652230|0|32564342|2.32|38.57|13.89|1.51|2015-11-01 13:19:49.461734|2015-11-01 13:19:49.461734 2|al|Albania|28748|27398|1350|3029278|0.3|12.92|6.58|3.3|2015-11-01 13:19:54.431082|2015-11-01 13:19:54.431082 3|ag|Algeria|2381741|2381741|0|39542166|1.84|23.67|4.31|0.92|2015-11-01 13:19:59.961286|2015-11-01 13:19:59.961286

As you can see, the data consists of rows and columns, where each column maps to a defined property, like id , or code . In the dataset above, each row represents a country, and each column represents some fact about that country.

But as the amount of data we capture increases, we often don’t know the exact structure of the data at the time we store it. This is called unstructured data. A good example is a list of events from visitors on a website. Here’s an example of a list of events sent to a server:

As you can see, three separate events are listed above. Each event has different fields, and some of the fields are nested within other fields. This type of data is very hard to store in a regular SQL database. This unstructured data is often stored in a format called JavaScript Object Notation (JSON). JSON is a way to encode data structures like lists and dictionaries to strings that ensures that they are easily readable by machines. Even though JSON starts with the word Javascript, it’s actually just a format, and can be read by any language.

Python has great JSON support, with the json library. We can both convert lists and dictionaries to JSON, and convert strings to lists and dictionaries. JSON data looks much like a dictionary would in Python, with keys and values stored.

In this post, we’ll explore a JSON file on the command line, then import it into Python and work with it using Pandas.

Table of Contents:

The dataset

Exploring the JSON data

Extracting information on the columns

Extracting the data

Reading the data into Pandas

Converting columns

Subsetting the stops

Conclusion

To read the full article, click here.

Top DSC Resources

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge