From raw data to the #1 spot on DataIsBeautiful

Is data visualization art or science? Does the clarity from bar charts and line graphs always trump data viz that is unusual and/or beautiful?

These are some polarizing questions in the data viz community. Some of you just screamed out loud, “Science! Clarity!”, while others would happily die on art mountain.

This rift in the data viz community has made it possible for data visualizations to be loved and hated at the same time. For example, my Hunger Games graphic was 87% upvoted, received an awesome original content award, and was simultaneously roasted for being awful in the comments section. Thanks, Reddit.

This love-hate data viz situation is common. As a second example, in a blog post, Bryan Pierce pushes back on a visualization Bill Gates picked for Wired Magazine. In the post, he argues that being clear has more advantages.

Bill Gates’s Choice

Bryan Pierce’s Remix

Who is right here? Should we focus on being simple and clear — or should we focus on being complex and beautiful?

While sharing my data visualization project, this was a question that I kept returning to.

In addition to expanding on this dilemma, I will also cover other project highlights such as:

Origin of the data set and Hunger Games visualization idea

How I used Python to get from a 25GB file to something Tableau could use

My data visualization process

The entertainment value from being roasted on Reddit

Incorporating feedback to get to a final version

Part I: How I came up with the data and Hunger Games visualization idea

This all began with a school project. The goal of the project was to tell a visualization story. We could choose any data set we wanted.

Having tons of experience at work with Tableau, I would say I am tapped into that community. At least enough to be aware of #MakeoverMonday. For all 52 weeks of the year, a data set is posted by Eva Murray or Andy Kriebel with a challenge to give it a makeover. Can you guess what day they share it on?

This is great for people who are aspiring data scientists and want to work on data viz. This is also great for lazy people who want to quickly find a data set for their school project.

That very week, they analyzed James Patterson’s checkout records from the Seattle public libraries’ open-sourced database. Because the full file is 34+ million rows, weighing in at 25 gigabytes, #MakeoverMonday narrowed it down to just one author first. That way the public could easily work with it.

I figured this was going to be an interesting data set because it dated back to 2005 and contained checkouts for both books and DVDs.

The next step was to figure out what question I wanted to ask the data. My first idea was to check out Harry Potter and see the relationship between book checkouts, movie release dates, and DVD checkouts.

However, the first book came out in 1997, well before the data starts in 2005, so I wasn’t going to get the full picture. I ended up pivoting to other popular book-to-movie series.

Twilight maybe!?!? That’s gonna be a hard no for me. I ended up picking Hunger Games.

Part II: How I used Python to get from a 25GB file to something Tableau could use

Now I have a 25,417,394 KB (~25 GB) file sitting in my Documents. Probably not a good idea to try to plug it into Tableau as is.

I like Python. There were just a few steps to prep the data for Tableau.

Apply some logic to label rows as Hunger Games

Figure out a way to read in just little chunks of the data at a time — the file is so big it will lock up your computer real quick

Only keep checkouts that had to do with the Hunger Games series

Data before:

Now that’s what I call Christmas and Indiana Jones in just a 5 row sample

Code to process the data:

The last cell is where the magic happens — just processing 50,000 rows at a time
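Here is a minimal sketch of that chunked approach, using only Python's standard library. The column name `Title`, the keyword list, and the function names are assumptions made for illustration; they are not taken from the original notebook, and a real title match would likely need tuning to avoid false positives.

```python
import csv
from itertools import islice

# Hypothetical values: the keywords and the "Title" column are assumptions.
SERIES_KEYWORDS = ("hunger games", "catching fire", "mockingjay")
CHUNK_SIZE = 50_000  # rows held in memory at a time


def is_hunger_games(title):
    """Label a row as Hunger Games if its title mentions any series keyword."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in SERIES_KEYWORDS)


def filter_in_chunks(src_path, dst_path, title_column="Title", chunk_size=CHUNK_SIZE):
    """Stream src_path in chunks and write only the matching rows to dst_path."""
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        while True:
            # Pull at most chunk_size rows; the rest of the file stays on disk.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            matches = [row for row in chunk if is_hunger_games(row[title_column])]
            writer.writerows(matches)
            kept += len(matches)
    return kept
```

If pandas is available, `pandas.read_csv` with its `chunksize` parameter gives the same "a few rows at a time" behavior with less code.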

Data after:

Catching Fire CD?!?! Hmmm…

The new file came to just 17,000 KB (0.017 GB). Much better.

Part III: My data visualization process

There is no better moment in data science than when the data is ready for analytics.

I think there is no debate on this topic. Even though data engineering is equally important and the first domino that needs to be pushed — only weirdos like data prep.

We have our ideas about what the data might tell us. This is the exciting moment where we find out if we were right or wrong.

Was my hypothesis correct? Will my predictors predict the target variable? Was my idea even worthwhile or was all of this effort a waste of time?

Now that data prep is over we will quickly be able to answer any questions we have.

Iteration 1 (made with Tableau)

One of the first charts I made was a line chart. The pro of this chart is that you can clearly see how many checkouts there are month over month.

Book checkouts peaked right before the first DVD release, but then books never spiked again.
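The chart itself was built in Tableau, but the month-over-month numbers behind a line chart like this can be sketched in plain Python. The field names (`year`, `month`, `format`, `checkouts`) and both function names are assumptions for illustration, not the actual columns of the processed file.

```python
from collections import defaultdict


def monthly_totals(rows):
    """Sum checkouts per (year, month) for each format, e.g. BOOK and DVD."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        month_key = (int(row["year"]), int(row["month"]))
        totals[row["format"]][month_key] += int(row["checkouts"])
    # Sort each series chronologically so it can be plotted left to right.
    return {fmt: dict(sorted(series.items())) for fmt, series in totals.items()}


def peak_month(series):
    """Return the (year, month) key with the most checkouts in one series."""
    return max(series, key=series.get)
```

Calling `peak_month` on the book series is one way to check a statement like "books peaked right before the first DVD release" against the numbers rather than by eye.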

Iteration 2 (made with Tableau)

Since I was most interested in visualizing Hunger Games checkouts in total, I switched to an area chart, which shows DVD + Book checkouts.

I also started to focus on formatting, removing chart junk, and minimizing ink. For example, there is no need for the year axis label: the year values alone make it pretty clear that this is a time-series graph.

I think an important thing to account for would probably be when the movies were released in theaters.

If this project were for a business setting, where decision making and clarity are most important, I probably would have stopped here.

However, I knew I wanted to post to DataIsBeautiful. So I needed to push forward with the goal to make something more fun to look at.

So finally, here we are… pushing away from clarity and toward beauty. Why are we doing this? Because of our goal and target audience.

People want to see something new, interesting, exciting. So I went with a cousin of the area chart, the stream graph.

Before posting my viz to DataIsBeautiful I knew I was going to catch some negative comments on breaking at least two data visualization best practices. Here are the naughty things I did and my rationale:

1. No Y-axis

I wanted the user to focus more on the relative relationships between the books and DVDs and not the raw checkout counts.

2. The mirrored nature of this graph is redundant; you would be fine with just the top half alone

This is true, and it goes against Edward Tufte’s data visualization principle of only using ink that adds insight (the data-ink ratio). However, it has also been found that people love things that are symmetric. In this case, I felt the symmetry was justified because it makes the data viz more memorable.
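For anyone curious where the mirrored shape comes from, one common way to lay out a stream graph is to start each time step at minus half of the column total, so the stack sits symmetrically around zero. The sketch below shows that idea in plain Python. It is not necessarily how the Tableau version was built, and the function name `symmetric_stream_layout` is made up for illustration.

```python
def symmetric_stream_layout(layers):
    """Return a (lower, upper) pair of edges per layer, centered on zero.

    layers: list of equal-length lists, one list of non-negative values
    per series (for example, one for books and one for DVDs).
    """
    n_steps = len(layers[0])
    if any(len(layer) != n_steps for layer in layers):
        raise ValueError("all layers must have the same length")
    # Start each time step at minus half the column total, so the stack
    # extends equally above and below the zero line.
    baseline = [-0.5 * sum(layer[t] for layer in layers) for t in range(n_steps)]
    bands = []
    for layer in layers:
        upper = [b + v for b, v in zip(baseline, layer)]
        bands.append((baseline, upper))
        baseline = upper  # the next layer stacks on top of this one
    return bands
```

Because only the band thickness carries meaning in this layout, dropping the Y-axis costs less than it would on a regular area chart. In matplotlib, `stackplot` offers a similar layout through `baseline="sym"`.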

Part IV: The entertainment value from being roasted on Reddit

Building a data visualization cherished by all is not easy. People have different preferences for things as basic as color choice. I am pretty sure it is impossible to make everyone happy.

Want proof? Post a viz to DataIsBeautiful, grab some popcorn, and watch chaos unfold. After reading some of these comments it was hard to even remember this was 87% upvoted.

The Roast

The Optimistic

The High

Part V: Incorporating feedback to get to a final version

There is more than just entertainment from posting to Reddit. Some people will actually give helpful suggestions or even remix your post. My post was remixed and I added their contribution to make my final version:

The final hypothesis might be that book checkouts spiked in anticipation of the first movie. For subsequent movie releases, there was no similar book spike, perhaps because anyone who was inclined to read the books due to a movie had already done so.

Also, you will notice that the Mockingjay DVD weirdly had two spikes. That is because it was released as two different parts on different dates.

Final Thoughts:

You cannot build something that is beautiful to everyone. There are best practices in color choices. However, even those won’t get you to a 100% happy camper rate. 87% thumbs-up is probably pretty good.

You cannot build something that is clear to everyone. However, your goal should be to get your point across to as many people as possible.

I probably could have done more to maximize clarity. However, if I hadn’t made these choices, would my post still have been upvoted to #1?

In my opinion, data visualization is closer to art. Science is mostly defined by the rules it establishes. Data visualization is guided by best practices that are much less firm. Science cares little about your feelings and reactions, while data visualization is all about catering to its audience.

There are times when being clear is much more important. Imagine a hospital using data visualization to make critical decisions. This is a much different goal than trying to make an interesting data viz to post to Reddit.

Data visualization can benefit from testing and learning. Getting feedback is invaluable. Even better would be to run a randomized A/B test, where you expose each group to a different version with only one thing changed. If the goal is decision making, see what decision they would make. If the goal is to engage people, ask them which one they liked better.
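As a rough sketch of that kind of test, the snippet below randomly splits viewers into two groups and computes a simple rate per group. The function names and the idea of recording each viewer's outcome as True/False are assumptions for illustration. A real test would also need enough participants and a significance check before declaring a winner.

```python
import random


def assign_groups(participants, seed=None):
    """Shuffle participants and split them into two near-equal groups, A and B."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return {"A": shuffled[:midpoint], "B": shuffled[midpoint:]}


def success_rate(outcomes):
    """Share of True outcomes, e.g. 'made the right decision' or 'liked it'."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Group A would see one version of the viz and group B the other, with the two versions differing in a single design choice such as the presence of a Y-axis.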

The holy grail is being clear and beautiful. These two things don’t have to be mutually exclusive, although it is not easy to do both. When time is limited, focus on your goal and intended audience.