A lot of sweat, a lot of late nights, and a lot of squinting at bad handwriting went into NJ Advance Media’s project tracking the use of police force.

Not only did we pay $30,000 for a data entry company to turn messy and oft-handwritten forms into a spreadsheet, but the data analysis and presentation also involved every single member of our six-person NJ Advance Media data team and several journalists elsewhere in the newsroom.

But the end result—a database of every time N.J. cops punched, kicked, or used other force over a five-year period, and a series of stories to go with it—was well worth the effort. As of January, we’ve digitized 72,677 documents and created a searchable database for more than 17,000 officers.

The Data

If you’ve done any reporting on police force, you have probably noticed police departments are reluctant to hand over documents on the matter. Our project was no different.

When we ran into challenges getting use-of-force forms from county prosecutors—required to collect them under the 2001 law that created the forms—we turned to local police departments instead. But sometimes, they were even less cooperative.

We initially sought ten years of data—from 2007 to 2016—but immediately got feedback that the request would be impossible, either because departments didn’t keep the forms that long, or because it would take great effort to collect them. So we revised the timeline to just five years.

Even then, several police departments tried to charge us for the time to collect the forms or redact minors’ names from them. The forms were often scanned crooked, rarely in order, and sometimes handwritten. A good form would look like this…

But we also got forms that looked like this…

We also only got two years of data from Phillipsburg because their older reports were under quarantine for mold.

During this process, reporter Craig McCarthy created a list of records officers and their emails to help keep track of the status of each request and remember who’s on the phone when calling a department.

Data entry was the next challenge to tackle. With so many handwritten and unstandardized forms, we chose to pay a data entry company to put forms into a spreadsheet, rather than attempt a programmatic solution.

We paid $30,000 to have people input each form into a Google Form, which automatically populated a spreadsheet. That way, they saved time with multiple choice answers and easy selections, and we didn’t have to worry about data being altered by mistake. This whole process took a total of five months.

Still, we ran into plenty of problems keeping the data clean, thanks to the nature of the forms themselves. Officers marked themselves as dead, signed their own forms, or named subjects “John Deere” when they meant that they had killed an actual deer.

We did multiple rounds of data cleaning, passing off versions of the file among three reporters while recording what we did at each stage in Jupyter notebooks and README files. Erin Petenko began with the basics: A week of lowercasing weird capitalizations and replacing glitchy characters. She also standardized the racial categories for subjects and officers, since the forms had everything from “A” to “Arab” written in the Subject Race box. In addition, there was one department, New Jersey State Police, that had their own numbered designations for race: 1 for white, 2 for black, etc.
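As a rough illustration of that kind of cleaning pass, here is a minimal Python sketch. The lookup tables and the function name are assumptions made for this example, not the team's actual code. Only the State Police codes 1 (white) and 2 (black) come from the description above, and ambiguous entries such as “A” are deliberately left for a person to review rather than guessed.

```python
import re

# The two State Police codes named in the text; the rest of that scheme is unknown.
STATE_POLICE_CODES = {"1": "white", "2": "black"}

# Hypothetical spelled-out variants; a real list would come from the forms themselves.
SPELLED_OUT = {
    "w": "white",
    "white": "white",
    "b": "black",
    "black": "black",
    "arab": "arab",
}


def standardize_race(raw, state_police=False):
    """Return a standardized category, or 'needs review' when the entry is ambiguous."""
    # Collapse odd whitespace and capitalization before looking anything up.
    value = re.sub(r"\s+", " ", str(raw)).strip().lower()
    if state_police and value in STATE_POLICE_CODES:
        return STATE_POLICE_CODES[value]
    return SPELLED_OUT.get(value, "needs review")
```

Routing unknown values to “needs review” instead of a best guess keeps a human in the loop for entries like “A,” which could plausibly mean more than one thing.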

We also knew that we wanted to compare department incidents to the population and arrests for each town. Carla Astudillo spent weeks cleaning FBI arrest data to use in the story, and Stephen Stirling used census figures to create adjusted population data for 10- to 65-year-olds in each town.

However, that comparison meant that we had to consider whether the subject of an incident would have actually been arrested, which is not the case with “emotionally disturbed persons,” New Jersey’s term for people with mental health disorders or disabilities. There was no set way for officers to designate an EDP incident; some labeled them “psych” or “mental.”

Craig McCarthy and Erin Petenko looked at the columns where forms tended to be labeled as EDPs and marked them as such. But we’re certain that there are still ones we missed. When it comes to missing data, sometimes you have to learn to recognize what you do not have the power to change.
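A keyword pass like the one described above might look something like the following sketch. The column names and the exact keyword list are assumptions for illustration: “psych” and “mental” come from the forms as described, while “edp” and “emotionally disturbed” are guesses at other likely labels. As noted, a pass like this will still miss incidents that were never labeled.

```python
import re

# "psych" and "mental" are from the text; the other keywords are assumed additions.
EDP_PATTERN = re.compile(
    r"\b(edp|psych\w*|mental\w*|emotionally disturbed)\b", re.IGNORECASE
)


def flag_edp(row, columns=("incident_type", "notes")):
    """Return True if any of the given free-text columns mentions an EDP keyword.

    `row` is a dict for one form; the column names here are hypothetical.
    """
    return any(EDP_PATTERN.search(str(row.get(col) or "")) for col in columns)
```

The word boundaries keep “mental” from matching inside unrelated words such as “environmental,” but any match is only a candidate and still deserves a human check.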

The biggest challenge was standardizing officer names. We knew from the beginning that we wanted to be able to say how many uses of force were tied to each officer and whether that officer could have triggered early warning systems in other cities.

But officers often recorded their names differently, inconsistently added a middle initial or even misspelled them in certain forms. Were Eric B. Hendrickson and Eric Hendrickson of the Galloway Police Department the same person? Our aim was for each officer to have one unique officer ID created for their name, regardless of how the names appear on the form.

First, we ran the names through OpenRefine’s cluster tool to flag officers we suspected appeared under more than one spelling. This was important for spotting cases such as an officer writing Tom on one form and Thomas on others. That gave us a list of more than 4,000 officers who might have spelled their names differently on different forms.

Then, we checked each of those officers against their badge number and the New Jersey pension database to make sure there was only one officer with that name in that department. Once we confirmed that the name variations belonged to one officer, we assigned them a single officer ID with one consistent, correct spelling.

Finally, we skimmed through all of the officer names one last time to catch any that we had missed.
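For readers who want to try something similar in code, here is a minimal Python sketch of the clustering and ID steps above. It is loosely modeled on the “fingerprint” idea behind key-collision clustering in OpenRefine, not a copy of it, and it is not the team's workflow, which ran in OpenRefine and relied on manual checks. The function names, the record keys, and the choice to drop single-letter initials are all assumptions for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name):
    """Reduce a name to a key: ASCII, lowercase, no punctuation, sorted unique tokens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    tokens = re.sub(r"[^\w\s]", " ", ascii_name.lower()).split()
    # Assumption: drop single-letter tokens so a middle initial does not split a cluster.
    tokens = [t for t in tokens if len(t) > 1]
    return " ".join(sorted(set(tokens)))


def cluster_names(rows):
    """Group (department, name) pairs whose fingerprints collide within a department."""
    clusters = defaultdict(set)
    for department, name in rows:
        clusters[(department, fingerprint(name))].add(name)
    # Only keys with more than one spelling are candidates for review.
    return {key: names for key, names in clusters.items() if len(names) > 1}


def assign_officer_ids(records, canonical):
    """Attach one ID per (department, confirmed name).

    records: dicts with hypothetical keys 'department' and 'officer_name'.
    canonical: reviewer-confirmed map of (department, variant) -> correct spelling.
    """
    ids = {}
    out = []
    for rec in records:
        key = (rec["department"], rec["officer_name"])
        name = canonical.get(key, rec["officer_name"])
        # IDs are handed out in first-seen order; any stable scheme would do.
        officer_id = ids.setdefault((rec["department"], name), len(ids) + 1)
        out.append({**rec, "officer_name": name, "officer_id": officer_id})
    return out
```

A fingerprint like this would pair “Eric B. Hendrickson” with “Eric Hendrickson” as candidates, but it would not catch Tom versus Thomas, and it cannot tell you whether two spellings are really one person. That is why the `canonical` map is meant to be filled in only after checks like the badge number and pension lookups.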

Because the personnel records available to us were limited, we couldn’t account for officers who moved between police departments during the time frame or officers who changed their names after marriage.

The Design

From the very beginning, it was important that the data be searchable by our readers. As the “beating heart” of this project, the database had to be easy to use and to understand. But how do we make over 72,000 rows with more than 40 data points digestible to readers?

First, what would they want to search? We thought searching by local police departments would be a no-brainer since most people would be curious as to the use of force rates in their town. We created individual department pages for each of New Jersey’s 468 municipal police departments and one separate page for New Jersey State Police that could be accessed through the search box.

Each department page includes aggregate five-year numbers for several different data points, including total uses of force, incidents per 1,000 arrests, total officers who used force, racial breakdowns with odds ratios, and a breakdown of both types and reasons for force. In addition, we included statewide rankings and numbers for comparison. We tried to be extensive in the categories we included. At the same time, we didn’t want to overwhelm readers with too many different numbers and figures.
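To make two of those figures concrete, here is a small Python sketch. The rate per 1,000 arrests is straightforward arithmetic. The odds ratio uses the standard two-by-two formulation, which is an assumption on our part for this example, since the exact formulation and groupings used on the pages are not spelled out here. The function and parameter names are hypothetical.

```python
def per_1000(incidents, arrests):
    """Use-of-force incidents per 1,000 arrests."""
    if arrests <= 0:
        raise ValueError("arrests must be positive")
    return 1000 * incidents / arrests


def odds_ratio(force_a, no_force_a, force_b, no_force_b):
    """Odds of force for group A relative to group B.

    Each argument is a count of arrests: with or without force, for each group.
    A value above 1 means force was more likely for group A than for group B.
    """
    return (force_a * no_force_b) / (no_force_a * force_b)
```

With made-up numbers, a department with 50 incidents and 2,000 arrests works out to `per_1000(50, 2000)`, or 25 incidents per 1,000 arrests.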

We wanted the database to also be searchable by police officer. Each of the over 17,000 New Jersey officers who filled out a use-of-force form would get their own page with that officer’s aggregate five-year numbers.

We showed the pages to several people not involved in the project and tweaked the design based on their feedback. It was a careful balancing act of scaling back the numbers to make the pages simpler, but not so simple that they became unclear.

Throughout the page, we also included plus signs which readers could click to read more about certain data. It was important to provide the readers as much context as possible while also keeping the design of the pages uncluttered. Additionally, Blake Nelson drew some wonderful icons which not only broke up the numbers and text but also visually communicated some of our context. It helped to have a drawing of a “compliance hold” being performed next to compliance hold totals.

While we didn’t want to overwhelm visitors with too much information, we did want to give readers plenty of options to explore further. One way we achieved this was searchable tables of officers and incidents with clickable links to that officer’s page and that department’s page. It was a great compliment when readers gave us feedback like, “I spent hours just looking up different police departments and officers.”

Finally, we decided not to include any subject names from the use-of-force forms in the forward-facing database, in case any of the people were cleared of charges, were involved in domestic or mental health calls, or were juveniles.

The Tech

Carla Astudillo built the database using Django, a Python-based framework, which made it easy to create an application with individual pages for each police department and officer. The bar, circle, and line graphs in every page were created using D3.js while the searchable tables were created using DataTables, which is a plug-in for jQuery.

For the backend, we hosted our data in a private data.world dataset, which the Django application accesses through data.world’s simple-to-use Python library, a wrapper for the data.world API. What we like about the library is that you can either download a dataset to your local file system, which works offline and only overwrites your local copy when the dataset has changed, or query the dataset live using SQL. Because it wasn’t a live, updating dataset, and we didn’t want to overload the API, we opted for the former. By using the data.world API, we also don’t have to deal with making migrations every time we change our models in the application.
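The two access patterns described above can be sketched roughly as follows with the `datadotworld` Python package. This is an illustration under assumptions, not the project's code: the helper names are made up, the dataset key and table names a caller would pass in are placeholders, and the comments reflect our understanding of the `auto_update` flag. Running it against data.world requires installing the package with its pandas extra and configuring an API token.

```python
def load_table(dataset_key, table, loader=None):
    """Return one table from a locally cached copy of a data.world dataset.

    `loader` defaults to datadotworld.load_dataset; it is injectable so the
    function can be exercised without network access.
    """
    if loader is None:
        import datadotworld as dw  # imported lazily; needs a configured API token

        loader = dw.load_dataset
    # As we understand it, auto_update=True refreshes the local copy only
    # when the remote dataset has changed since the last download.
    dataset = loader(dataset_key, auto_update=True)
    return dataset.dataframes[table]


def query_live(dataset_key, sql, runner=None):
    """Run a SQL query against the hosted dataset and return the result table."""
    if runner is None:
        import datadotworld as dw

        runner = dw.query
    return runner(dataset_key, sql).dataframe
```

For a dataset that rarely changes, the first function keeps page loads from depending on a remote API call, which matches the reasoning given above for choosing the local copy.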

Because we were using Django as the framework, it made sense to use Python in Jupyter notebooks for the calculations on the raw data, especially because we wanted to be able to easily document how we calculated everything.

Finally, we deployed the Django application to Amazon Web Services Elastic Beanstalk, which was surprisingly the most challenging part of it all, since we had to make the Elastic Beanstalk environment play nice with the permissions configuration of the data.world API. The data.world support team was more than happy to help us figure it out.

Then, the day before launch, we made some minor changes to the HTML in our home page, and the whole thing went down. After spending hours trying to decode what exactly we did wrong so that we would never do it again, we just deployed the whole project in a different Elastic Beanstalk environment.

The one thing we would definitely do differently is to be smarter about caching. We were so worried about how fast the site was running that we added multiple levels of caching, which we quickly realized meant any changes made to the database would take a long time to show up on everyone’s computers. In fact, during launch, our redirect cached a version of the site that took almost a week to expire on some computers!

If You Want to Look Into Force

Here are some tips and lessons from our 16-month investigation that may prove handy if you want to investigate use of force and build a database for your state or local police departments:

Collaboration is the only way to get this done. It took six data reporters, two investigative reporters, five additional reporters, and about four social media and audiovisual specialists to create all the content for our project.

Get familiar with the case law that makes a police department use-of-force form public. That way, when you ask for the forms, you can note the case law in your request. It’s especially useful to have handy if any forms are denied.

Start by filing one batch request from one department first and see what you get and if it’s even worth requesting.

Find out the guidelines for police use of force in your state and/or in your local police department. Have any of them changed during the time frame you are investigating?

Learn the official protocol as to when and how use-of-force forms are filled out by officers. Are they typed or handwritten? Are officers required to fill out certain parts? Where do they go after being filled out? If they’re collected by another agency, is there any kind of review process?

Name your data by date, because you’ll never remember whether it was Master_file_final_v4.csv or Master_file_final_v5.csv you were looking for.

It’s never too late to catch a data cleaning error. We were correcting and changing numbers very close to the publication date. If we were to do it again, we’d have all the numbers in our articles feed from a spreadsheet so we didn’t have to go back and manually change them each time we caught a duplicate form. Also, Jupyter notebooks are a lifesaver when you have to re-run statistics every time a file is changed.
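One way to have article numbers feed from a spreadsheet, as suggested in that last tip, is to keep the figures in a small CSV and fill them into the story text with placeholders. The sketch below uses only Python's standard library. The `name` and `value` column headers, the placeholder names, and the function names are assumptions for illustration.

```python
import csv
import io
from string import Template


def load_figures(csv_text):
    """Read rows with 'name' and 'value' columns into a dict of figures."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"]: row["value"] for row in reader}


def fill(template_text, figures):
    """Substitute $placeholders with figures; raises KeyError if one is missing."""
    return Template(template_text).substitute(figures)


# Example: the figure lives in one place and every sentence reads from it.
figures = load_figures('name,value\ntotal_forms,"72,677"\n')
sentence = fill("We digitized $total_forms documents.", figures)
```

Because `substitute` raises an error on a missing placeholder rather than leaving it blank, a renamed or deleted figure fails loudly instead of slipping into print.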

The Most Important Part is People

Don’t forget the most important part is to find the human side to these numbers. In the end, these are real people featured in the forms. Remember that the use-of-force form might not tell the whole story. Talk to residents from overpoliced communities. Talk to officers who police them. Some of the best stories from our project came after our reporters went out and talked to people.