Why you should start by learning data visualization and manipulation

One of the biggest issues that comes up when I talk to people who want to get started learning data science is the following:

I don’t know where to get started!

Recently, I argued that R is the best programming language to learn when you’re getting started with data science.

While this helps you select a programming language, it still doesn’t tell you what skills to focus on.

Just like when you select a programming language, selecting the skills to start with can be overwhelming.

Again, I want to be direct: learn data visualization first and then learn data manipulation.

Data visualization is critical for finding insights

There are a few reasons why I recommend learning data visualization first, but at its core, my reason is that I want you to be productive quickly. Your time is your most valuable resource, so you need to make a habit of focusing on the “big wins.” Focus on learning the skills with a high return on investment (high-ROI skills).

For most people, the highest ROI skill when you’re starting out is data visualization.

To understand why, you need to think about the goal. As a data scientist your job is to find insights in data.

Clients want insight.

(If you need proof of this, just examine data science job postings. The word “insight” comes up over and over.)

Ultimately, insight is about seeing differently and using data to see problems and potential solutions.

Visualizations are tools for seeing.

So, clients want insight, insight is about seeing, and visualizations are tools to help you see.

Data visualization is useful for several parts of the data workflow

Finding insights is a multi-step process and data visualization is highly useful for almost every step of the data science workflow.

Finding insights

For starters, you need to see the insights for yourself.

When you’re starting out, data visualization is perhaps the highest ROI method for finding insight in data (and it becomes more powerful when you combine it with data manipulation, which I will cover in a moment).

As I’ve mentioned above, data visualization techniques are critical for exploring your data to find insights. Visualization helps you as an analyst see important features in your data.

Communicating insights

Visualization is also critical for communicating your insights.

When you walk into a meeting with an executive or a business partner, nine times out of ten, you will have to show them. You can’t talk about equations or algorithms. You need to show them using the right data visualizations.

You’ve almost certainly heard the phrase “a picture is worth a thousand words.” This is absolutely true. Communicating visually (through the appropriate visualization techniques) will magnify your ability to communicate important issues and opportunities to your clients.

The alternative in most cases is text. Have you ever seen a slide presentation that was just a “wall of text”? Just lots of words? These are notoriously ineffective.

Although presentation design is outside the scope of this blog post, you need to understand that once you find insights that your clients need, you need to show them. You need to convince them. They need to see the insights you’ve seen. Hands down, one of the best ways of showing these insights to business partners and executives is via data visualization.

Ultimately, you want to be able to walk into a meeting with an executive or business partner, point at a data visualization and say, “There. Right there. That’s your problem. Do you see that red area on the chart? That’s the issue you need to fix.”

If you know the right visualization techniques, communicating concisely in that way is absolutely possible. And if you can do it, you will be very valuable to your clients and partners.

Machine learning and model building

At some stage though, pure data visualization isn’t the best tool for the job. As data sets become larger and the questions you’re trying to answer become more complex, pure data visualization may not work. You might need to employ more advanced tools, like machine learning.

The thing is, you’ll probably still need to use data visualization in the process of using these more advanced techniques.

Before you build a model, you’ll typically still need to use data visualization to explore your dataset. You’ll need to visualize your data to see how variables are distributed and to help you select the best technique to use.
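As a rough sketch of what that pre-modeling look at your data might involve, here is how you could check distributions with ggplot2. The built-in mtcars dataset and the choice of variables are my assumptions for illustration; swap in your own data.

```r
library(ggplot2)

# mtcars ships with R; it stands in for "your dataset" here (an assumption).

# Histogram: is fuel economy roughly symmetric, skewed, or multi-modal?
mpg_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Boxplots: how does that distribution shift across cylinder counts?
mpg_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()
```

Printing either object (for example, print(mpg_hist)) draws the chart. A strongly skewed histogram or a clear difference between groups is the kind of feature that can influence which technique you reach for next.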

Later, when you have the results of these advanced techniques, you’ll likely need data visualization to interpret them. The results that we generate using these more advanced techniques need to be explored in order to be understood.

Said differently, the results of machine learning techniques (and other advanced techniques) can be very difficult to understand. Data visualization helps you understand those results.

Finally, because these advanced techniques (and their results) can be somewhat difficult to explain, it is quite common to use data visualization techniques to demonstrate and explain the results to business partners.

This is one of the reasons I recommend that beginning students wait to learn machine learning. You’ll almost certainly need to know data visualization before you can successfully employ those more advanced techniques.

Why you should learn data manipulation second

As you learn data visualization, eventually you’ll reach a bottleneck.

Either your data will be in the wrong format, you’ll need more data, or you’ll simply need to “dig deeper” into the data you already have.

At this point, you should learn some basic data manipulation.

This will allow you to subset your data, aggregate it, and otherwise transform your data to help you find more insights (you can also use data manipulation techniques to merge in new data, though that is slightly more complicated).
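To make those three operations concrete, here is a minimal dplyr sketch of subsetting, aggregating, and transforming. I am assuming the built-in mtcars dataset as a stand-in for your own data, and the column choices are only illustrative.

```r
library(dplyr)

# Subset: keep only the four-cylinder cars.
four_cyl <- mtcars %>%
  filter(cyl == 4)

# Aggregate: average fuel economy and number of cars for each gear count.
mpg_by_gear <- mtcars %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())

# Transform: add a derived column (wt is recorded in thousands of pounds).
with_lbs <- mtcars %>%
  mutate(weight_lbs = wt * 1000)
```

Each step returns a new data frame and leaves mtcars untouched, so you can try a different subset or grouping without worrying about damaging the original data.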

And with regard to finding insights, you can combine data visualization and data manipulation to perform more sophisticated data exploration.

Data exploration: finding insights with ggplot2 + dplyr

There are many possible paths to discovery, but some are surer and faster than others. When skilled seekers venture into the world of data exploration, they tend to follow a particular path … expressed in the form of a mantra: Overview first, zoom and filter, then details-on-demand. … This is a sure path to discovery!

– Stephen Few

Ultimately though, the reason I suggest learning data visualization first and data manipulation second is that you can combine them. When you combine data visualization and data manipulation and use them together with the right process, you can rapidly find insights in your data.

This is an absolutely critical skill.

Before you begin learning machine learning, before you dive into advanced techniques, and before you learn “big data” tools, you absolutely need to learn data exploration and analysis.

For most beginning data science students, I believe that competence in data exploration is the first milestone.

As it turns out, this is one of my biggest reasons for suggesting that beginning students learn R.

Two of R’s tools, ggplot2 and dplyr, are perfect for performing data exploration. They are the tools that I wish I had when I was starting out.

In particular, you can combine them by using the ‘%>%’ operator to do rapid data exploration.

When you combine ggplot2 with dplyr, you can create subsets and aggregations of your data and immediately “pipe” the output of your dplyr manipulation into ggplot2.
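Here is a small sketch of that pattern: subset, aggregate, and plot in one chain. The mtcars dataset and the specific filter are assumptions I picked for illustration.

```r
library(dplyr)
library(ggplot2)

# Subset to manual-transmission cars (am == 1), aggregate by cylinder count,
# then pipe the summary table straight into ggplot().
# Note: dplyr steps are chained with %>%, while ggplot2 layers are added with +.
manual_mpg_plot <- mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col()
```

Because the summary table never has to be saved under its own name, changing the question (say, automatic cars instead of manual) is a one-line edit followed by a re-run.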

This allows you to easily implement Ben Shneiderman’s mantra of “overview first, zoom and filter, details on demand.”

As noted above, you can use visualization and data manipulation to “zoom in” and examine your dataset in a variety of ways.
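One way the three steps of the mantra might map onto ggplot2 and dplyr is sketched below. This is my own illustrative reading of it, using the built-in mtcars dataset and an arbitrary weight cutoff as assumptions.

```r
library(dplyr)
library(ggplot2)

# Keep the car names as a regular column so they survive later steps.
cars <- mtcars %>%
  mutate(model = rownames(mtcars))

# 1. Overview first: plot every car to see the overall shape of the data.
overview <- ggplot(cars, aes(x = wt, y = mpg)) +
  geom_point()

# 2. Zoom and filter: narrow to the heavier cars (wt is in thousands of pounds).
heavy_cars <- cars %>%
  filter(wt > 3.5)

zoomed <- heavy_cars %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()

# 3. Details on demand: list the individual rows behind the zoomed view.
details <- heavy_cars %>%
  select(model, wt, mpg, cyl) %>%
  arrange(mpg)
```

Printing overview and zoomed draws the two charts, and details is an ordinary table you can read row by row once the chart has told you where to look.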

As Stephen Few noted, the “visual path” is perhaps the surest path to discovery.

Let me rephrase that: when you’re starting out, visual exploration is the fastest, most reliable path to discovering insights. You need to master data exploration first.

I want you to be productive immediately

What’s great about the tools I’ve recommended (ggplot2 and dplyr) is that you can learn the syntax within weeks (probably faster if you’re diligent).

The syntax for ggplot2 and dplyr is relatively straightforward. Once you know the syntax, creating core visualizations like the scatterplot, or slightly more advanced charts like the bubble chart, becomes extremely easy. Moreover, once you know the syntax, even visualizations that appear complicated become surprisingly easy to build.

Once you learn the syntax, you’ll be able to create beautiful, insightful data visualizations.

If you work hard and master ggplot2 and dplyr – if you master foundational data exploration first – you’ll be on your way to mastering how to find data insights.