Connected Scatterplots and Strikeouts

Shan Carter, Kevin Quealy and Joe Ward of The New York Times recently published a thorough analysis of the rise of strikeouts in Major League Baseball. In it, they showed how the number of strikeouts per game has risen along with the number of pitchers per game using two line plots, one for each variable. It’s good stuff, you should read it. I especially like the grayed out dots for each team, which give a sense of the team-by-team variation without overwhelming the reader.

I found the summary table for average MLB game stats since 1871 here, and I wondered what this correlation, and other pairings of MLB stats, would look like if they were plotted as connected scatterplots. The connected scatterplot is a visualization form that has been featured at NYT recently (more about this form of visualization, including a number of examples, in Alberto Cairo’s blog post “In praise of connected scatterplots”).

Here’s what it looks like, along with a second method shown below it, the dual axis line plot:

Effort and Reward

I struggled with connected scatterplots at first. Maybe the engineer in me stubbornly resisted the notion of including time on anything other than the x-axis. But I found that after investing a small but not insignificant amount of time in orienting myself to the axes, the connected scatterplot actually became a fun chart to explore. To quote Andy Kirk, my effort was “ultimately rewarded with a worthy amount of insight gained.” (Kirk, Data Visualization – a successful design process, p26).

The connected scatterplot imparts a sense of travelling a pathway through a terrain that has twists and turns, loops and sudden rises and falls that encode how the two different variables changed together. It’s a roller coaster ride of sorts, and once you’ve learned how to read the encoding, you’re out of the turnstiles and on your way.

The Other Method: Dual Axes

You have to admit, though, the dual axis line plots below the connected scatterplot do a fine job as well. In fact, they probably require the reader to invest less time upfront to begin to glean some insight (sorry, no experimental data on that claim). If my feeling is right, it probably has something to do with the fact that we’re more used to seeing changes over time shown from left-to-right. It’s still an abstract way to represent time, it’s just one we’re more familiar with.
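For readers who want to try the dual axis form outside Tableau, here is a minimal Python sketch. It is not how the charts in this post were built. It assumes matplotlib is installed, and the names `padded_limits` and `plot_dual_axis` are hypothetical helpers made up for this illustration. The point it shows is that each variable gets its own y-axis range while both share the left-to-right time axis.

```python
from typing import Sequence, Tuple


def padded_limits(values: Sequence[float], pad: float = 0.05) -> Tuple[float, float]:
    """Return (low, high) axis limits with a small margin around the data."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Flat series: fall back to the magnitude of the value (or 1.0 for zero)
        span = abs(hi) or 1.0
    margin = span * pad
    return lo - margin, hi + margin


def plot_dual_axis(years, left_values, right_values, left_label, right_label):
    """Draw two line plots over the same years, one per y-axis."""
    # Imported here so the helper above works without matplotlib (assumption:
    # matplotlib is available whenever this function is actually called).
    import matplotlib.pyplot as plt

    fig, left_ax = plt.subplots()
    right_ax = left_ax.twinx()  # second y-axis sharing the same x-axis

    left_ax.plot(years, left_values, color="tab:blue")
    right_ax.plot(years, right_values, color="tab:orange")

    # Each axis is scaled independently to its own variable
    left_ax.set_ylim(*padded_limits(left_values))
    right_ax.set_ylim(*padded_limits(right_values))

    left_ax.set_xlabel("Year")
    left_ax.set_ylabel(left_label, color="tab:blue")
    right_ax.set_ylabel(right_label, color="tab:orange")
    return fig
```

Calling `plot_dual_axis(years, strikeouts, pitchers, "Strikeouts per game", "Pitchers per game")` with your own lists would return a figure you can save with `fig.savefig(...)`.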

Virtues and Vices

The dual axis method has some distinct advantages: if you open up the year slider to show the entire range from 1871 to 2012, you will see what I mean. The connected scatterplot becomes much more difficult to read, but the dual axis line plot does not require any additional effort. You can adjust the slider in the interactive version above, or here’s a screen shot:

Additionally, not all pairs of variables render well in the connected scatterplot format, even with the shorter time window of 1981-2012. If one variable basically contains a bunch of random noise, or doesn’t change much at all, the connected scatterplot will look very jumbled, and will be hard to read since all the points will just form clumps. For example, change variable 1 to “Avg Pitcher Age” and change variable 2 to “Batters Faced”. What you get isn’t an exciting journey, it’s a wild goose chase, and you can see why if you take a look at the dual axis plot, which immediately tells the story – two flat lines:

In conclusion, my opinion at this point is that the connected scatterplot is a special case visualization type for showing how two variables change together over time – if it works well, it really works well. If it doesn’t, ditch it for the more all-purpose (and admittedly more utilitarian) dual axis line plot. I guess to go along with the baseball theme, my advice would be to swing for the fences if the pitch is right, otherwise just make contact and get on base.

How I made the connected scatterplot in Tableau

This section will serve as a very brief how-to for making a connected scatterplot in Tableau Public. The key is dragging “Year” to the “Path” mark landing pad.

Here are the steps:

1. Drag the first measure you want to use to the “Columns” shelf and the second to the “Rows” shelf
2. Convert both from SUMs to Dimensions by clicking in the down arrow of the pills and selecting “Dimension” (now you have a basic scatterplot)
3. Change the Marks type from “Automatic” to “Line” and drag “Year” to the Path landing pad
4. Also drag Year to Label

Of course, I used Parameters to allow the reader to control the two variable types, and I also used a dual axis to format the data points, but the above steps do the trick.
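The same idea can be sketched in Python for anyone who doesn’t use Tableau. This is a rough analogue, not the method used for the chart in this post: sorting the points by year plays the role of dragging “Year” to Path, and annotating each point plays the role of dragging Year to Label. It assumes matplotlib is installed; `path_order` and `plot_connected_scatter` are hypothetical names, and the numbers at the bottom are made-up placeholders, not real MLB statistics.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, float, float]  # (year, x value, y value)


def path_order(years: Sequence[int], xs: Sequence[float], ys: Sequence[float]) -> List[Point]:
    """Return (year, x, y) points sorted by year, i.e. the order the line is drawn in."""
    if not (len(years) == len(xs) == len(ys)):
        raise ValueError("years, xs and ys must have the same length")
    return sorted(zip(years, xs, ys), key=lambda p: p[0])


def plot_connected_scatter(points: Sequence[Point], x_label: str, y_label: str):
    """Plot one variable against the other and connect the points in year order."""
    # Imported here so path_order works without matplotlib (assumption:
    # matplotlib is available whenever this function is actually called).
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    xs = [p[1] for p in points]
    ys = [p[2] for p in points]
    ax.plot(xs, ys, marker="o")  # the line follows the order of the points

    # Label each point with its year so the reader can follow the path
    for year, x, y in points:
        ax.annotate(str(year), (x, y), textcoords="offset points", xytext=(4, 4))

    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    return fig


# Placeholder values for illustration only (NOT real MLB data),
# deliberately out of order to show the sort.
years = [2003, 2001, 2002]
pitchers = [3.2, 3.0, 3.1]
strikeouts = [6.4, 6.0, 6.3]
points = path_order(years, pitchers, strikeouts)
```

Passing `points` to `plot_connected_scatter(points, "Pitchers per game", "Strikeouts per game")` would return the figure. Without the sort, the line would connect the points in whatever order the rows happen to arrive, which is the problem the Path shelf solves in Tableau.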

Here’s a screen shot of the final connected scatterplot sheet that I used:

Thanks for stopping by, and I’d love to know your thoughts on the virtues and/or vices of connected scatterplots,

Ben