How to do an analysis in R (part 2, visualization and analysis)

In several recent blog posts, I’ve emphasized the importance of data analysis.

My main point has been that if you want to learn data science, you need to learn data analysis. Data analysis is the foundation of practical data science.

With that statement in mind, my most recent blog post showed you “part one” of an example analysis.

In that post, we kicked off the analysis by getting a dataset (from Wikipedia) and manipulating it into shape. It is a dataset about shipping volume at the world’s busiest ports, from 2004 to 2014.

Now that the dataset is ready, we’re going to move into “part 2” of the analysis. We’ll analyze the data using a combination of tools from the tidyverse.

Essentially, I want to show you how to use tools from ggplot2 and dplyr together to analyze data.

Before we start, keep in mind that what you’re going to see is “intermediate” level ggplot2 / dplyr.

That said, even if you’re a beginner, you can look at the code and try to follow the general analysis process.

Moreover, although I occasionally disparage copy-and-paste coding, if you’re a beginner, it will still be instructive to copy and run this code. You’ll still be able to take it apart and see how it works. Copy-and-paste won’t allow you to master ggplot2 and dplyr (and memorize the syntax), but in this case it will help you understand.

On the other hand, if you’re not a beginner, and you’ve learned some tidyverse syntax, this will show you how to put the pieces together and get things done. If you’re serious about learning data science, study this post. There’s a lot here for you …

Ok. Let’s get into it.

Get the data

First, we need to retrieve the data.

We built the dataset using the code in the last blog post. We originally scraped it from Wikipedia, and hammered it into shape using dplyr and tidyr.

If you’ve learned a little dplyr and tidyr, I highly recommend that you review that blog post, and possibly run the code and create the dataset yourself. It will give you an integrated view of how to use data manipulation tools together to shape a dataset prior to analysis.

Having said that, if you don’t want to run the code and build the dataset yourself, you can get it by loading it from a URL:

#=============
# GET THE DATA
#=============
url.world_ports

Create 'themes' for plot formatting

Before we actually plot our data, we'll create a few themes that we can apply to our plots.

If you're not familiar with them, "themes" in ggplot2 are just bundles of formatting code. In ggplot, we use the theme() function to format specific parts of our plot. We can then "bundle" those pieces of code together to create a reusable theme.

That's what we're doing here. We're taking several lines of formatting code (that we execute with the theme() function) and bundling them together.

Notice that we're actually creating a few themes here:

theme.porttheme : this is just a general theme that we'll apply to most of our plots in this analysis.

theme.smallmult : this theme will be applied to "small multiple" charts. Because small multiples can be quite large if you have a lot of panels, they can have special formatting requirements. We're setting up that formatting here.

theme.widebar : this is for "wide" bar charts. As you'll see in a moment, we're going to plot some bar charts that are quite wide, and again, these specific charts will have specific formatting requirements. We're bundling that specific formatting code together here.
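To make the "bundling" idea concrete, here is a minimal sketch of how a reusable theme can be built and layered. It is not the formatting code used in this analysis: the names theme.example and theme.example_smallmult, the specific sizes and colors, and the toy data frame df.toy are all made up for illustration.

```r
library(ggplot2)

# Hypothetical general-purpose theme. The element choices here are
# illustrative assumptions, not the formatting used in this analysis.
theme.example <- theme(
  text = element_text(family = "sans", color = "#444444"),
  plot.title = element_text(size = 24),
  axis.title = element_text(size = 14),
  axis.text = element_text(size = 10),
  legend.position = "none"
)

# Hypothetical add-on theme for small multiples: smaller, rotated axis text.
theme.example_smallmult <- theme(
  axis.text = element_text(size = 6),
  axis.text.x = element_text(angle = 90)
)

# Toy data, invented only so that the sketch runs on its own.
df.toy <- data.frame(
  year = rep(2012:2014, times = 2),
  port = rep(c("A", "B"), each = 3),
  volume = c(10, 12, 15, 7, 8, 8)
)

# Themes are added to a plot with `+`, like any other ggplot2 component.
# When two themes set the same element, the one added later takes precedence.
p <- ggplot(df.toy, aes(x = year, y = volume)) +
  geom_line() +
  facet_wrap(~ port) +
  labs(title = "Toy small multiple") +
  theme.example +
  theme.example_smallmult
```

Printing p draws the chart with both bundles applied. The advantage of this pattern is that the formatting lives in one place, so every plot that adds the theme picks up the same look without repeating the theme() calls.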