This post offers a brief summary of what graphical inference means, why it is relevant, and how to use it to discover hidden patterns. It is written in plain language (rather than with an academic mindset) so that the foundations are easy to understand.

The analysis of the visualization is based on the nullabor package written by Hadley Wickham, Niladri Roy Chowdhury, Di Cook, and Heike Hofmann.

First updated – June 1, 2019

This post was edited after the comments made by u/Thaufas and u/t4YWqYUUgDDpShW2.

1. Background

Exploratory data analysis (EDA) and model diagnostics (MD) are data analytic activities that rely primarily on visual displays and only secondarily on numeric summaries.[1]

With EDA, analysts can get very good results by deducing from images what is happening in the dataset, or terrible results by reaching absurd conclusions based on visualization alone.

Until recently there were no formal visual methods for determining the statistical significance of findings. The paper “Statistical inference for exploratory data analysis and model diagnostics” addressed this problem by presenting two protocols, the line-up and the Rorschach, as tools to confirm visual findings.

Later, the paper “Graphical inference for infovis” presented a new R package called nullabor, which semi-automates the protocols.


2. What is the nullabor package?

The nullabor package implements both protocols for graphical inference: the “Rorschach” and the “line-up”. The Rorschach is a calibrator that helps the analyst become accustomed to the vagaries of random data, while the line-up provides a simple inferential process that produces a valid p-value for a data plot.


3. What is the importance of nullabor?

Nullabor seems to be perfect for people who want to validate visual findings. That’s positive. However, the package only reduces the risk that people who lack the appropriate statistical training will find false associations; it does not make that risk disappear.


What is a false association?

The Spurious Correlations website has many great examples of false associations that arise when thousands (or more) of correlations are tested. One of them: the divorce rate and the per capita consumption of margarine move together over time, but that does not imply that divorce is caused by the use of margarine.


What are the caveats of both protocols?

There are some issues to keep in mind when using this package:

The wording of the instructions to viewers can affect their plot choices.

Limiting the time it takes to answer may produce less reliable results.

Readers will respond differently to the same plots, depending on training and even the state of mind. [1]


4. Can you explain the meaning and relevance of graphical inference?

Sometimes we perceive something in a visualization, some pattern or logic, and we come to conclusions that may or may not hold up under a more specific technique. In those cases, when “we see something” but are not sure what we see, we can rely on graphical inference to determine whether the visualization has a structure that sets it apart from random data, and that therefore allows us to draw conclusions and extrapolate them to a larger population.

From a practical point of view, there is a set of steps to follow. For a particular sample, we first generate null samples (as in the Rorschach protocol), and then use the line-up protocol to create a single figure that shows the plot of the original sample among the plots of the null samples. As a final step, we check whether it is possible to tell at a glance which plot shows the real data. If the answer is yes, we have more solid arguments for drawing conclusions from the original sample.


5. What is inference? And what is graphical inference?

A pretty informal definition of inference could be: making statements about a large population using a small sample. Graphical inference is extrapolating the conclusions obtained from a graph of a sample to a larger population.

Inference happens when you have information on a subset of data, and you want to make statements about the full set. Typically, inference is done using the sample statistics, and what we know about the behavior of that statistic over all possible subsets of the same size.[1]

From Dianne Cook – To the Tidyverse and beyond – RStudio


6. Why do we need graphical inference? Can you show us some examples?

Because people sometimes make inferences from a single visualization without a firm foundation. We want to be sure that discoveries based on visualizations rest on something solid before we confidently crow about them.


Example

David Robinson (@drob), January 29, 2018: “.@gtyranny starts his blog with an interesting hypothesis: do states with highly populated capitals have less corrupt governments? https://t.co/shtM5KSlNz #datablog pic.twitter.com/eDvqTi6t9G”

Below is a simple scatterplot of the two variables of interest. A slight negative slope is observed, but it does not look very large. There are a lot of states whose capitals are less than 5% of the total population. The two outliers are Hawaii (government rank 33 and capital population 25%) and Arizona (government rank 26 and capital population 23%). Without those two the downward trend (an improvement in ranking) would be much stronger.

I ran linear regressions of government rank on the percentage of each state’s population living in the capital city, state population (in 100,000s), and state GDP (in $100,000s)…. The coefficient is not significant for any regression at the traditional 5% level.

… I’m not convinced that the lack of significance is itself significant.

Analysis in Tick Tock blog, by Graham Tierney. [5]


7. What are population parameters?

If a population can be described by a distribution characterized by parameters like mean μ and standard deviation σ, then a sample statistic such as the sample mean behaves nicely: it lands close to the population mean, and closer still when the sample size n is large. [2]
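A small simulation makes the “closer still when n is large” part concrete. This is only a sketch: the normal population and the values of `mu` and `sigma` are made up for illustration and do not come from any dataset in this post.

```r
set.seed(42)

# Hypothetical population: normal with mean 50 and standard deviation 10.
mu <- 50
sigma <- 10

# Draw `reps` samples of size n and return the mean of each sample.
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
}

# Spread of the sample mean around mu for a small and a large sample size.
sd_small <- sd(sample_means(10))
sd_large <- sd(sample_means(1000))

# Theory says the spread should be roughly sigma / sqrt(n).
c(simulated = sd_small, theory = sigma / sqrt(10))
c(simulated = sd_large, theory = sigma / sqrt(1000))
```

The spread of the sample means shrinks as n grows, which is what lets a sample say something about the population.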

If we don’t want to assume the population has a known distribution, but we’d still like to be able to say something about the population based on the sample at hand, we can use methods like bootstrap to sample the sample with replacement, or permutation to break associations between variables.
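Both ideas can be sketched in a few lines of base R using the mtcars data that appears later in this post. This is a hand-written illustration, not the code nullabor runs internally; the number of replicates (2000) is an arbitrary choice.

```r
set.seed(2019)

# Bootstrap: resample the observations with replacement and recompute
# the statistic each time to see how much it varies.
boot_means <- replicate(2000, {
  idx <- sample(nrow(mtcars), replace = TRUE)
  mean(mtcars$mpg[idx])
})
boot_se <- sd(boot_means)  # bootstrap standard error of the mean of mpg

# Permutation: shuffle one variable so that any association with the
# other variable is broken, while each variable keeps its own distribution.
observed_cor <- cor(mtcars$mpg, mtcars$wt)
permuted_cors <- replicate(2000, cor(sample(mtcars$mpg), mtcars$wt))

# The permuted correlations scatter around zero; the observed one does not.
summary(permuted_cors)
observed_cor
```

The permuted data sets are exactly the kind of “null samples” that the protocols below put on display.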


8. What are the visual protocols for graphical inference?

There are two protocols used for graphical inference, both implemented in the nullabor package:

8.1. Rorschach protocol

Create a set of plots of null samples generated from the original sample, to see what data similar to yours look like when there is no real structure. How do you obtain your null samples? Via permutation or simulation.
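To show what such a set of null samples looks like as data, here is a hand-rolled stand-in written in base R. The helper `make_nulls` is hypothetical (it is not part of nullabor), and it covers only the permutation route, using mtcars as the original sample.

```r
set.seed(1)

# Hypothetical helper: build n null data sets by permuting one variable,
# and stack them with a .sample label. No real data is included.
make_nulls <- function(data, var, n = 20) {
  do.call(rbind, lapply(seq_len(n), function(i) {
    null <- data
    null[[var]] <- sample(null[[var]])  # shuffle the chosen variable
    null$.sample <- i                   # label for faceting
    null
  }))
}

nulls <- make_nulls(mtcars, "mpg", n = 20)

# Each null sample keeps the same mpg values, only in a different order.
table(nulls$.sample)
```

Faceting a scatterplot of `nulls` by `.sample` gives 20 panels of pure noise, which is a good way to get used to how much “pattern” random data can show.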


8.2. Line-up protocol

The first step is to embed the plot of the original data among a field of plots of null samples. Then ask someone who has not seen the data to pick the plot that looks different. If they choose the original data plot, this is evidence that the data have a structure that is significantly different from what might be expected by chance.
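Where does the p-value come from? If the real plot is hidden among m plots and there is nothing special about it, an observer picks it with probability 1/m. The sketch below turns that into a number. The function `lineup_p_value` is a hypothetical helper, and the binomial calculation assumes that the observers judge independently (for example, each one sees a separately generated line-up); it is not the only way to compute a line-up p-value.

```r
# Probability that at least x of K independent observers pick the real plot
# purely by chance, when each of them sees a line-up of m plots.
lineup_p_value <- function(x, K, m = 20) {
  pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)
}

# One observer, 20 plots, and the observer picks the real one.
p_one <- lineup_p_value(x = 1, K = 1)

# Five observers, three of them pick the real plot.
p_five <- lineup_p_value(x = 3, K = 5)

c(one_observer = p_one, three_of_five = p_five)
```

With a single observer and 20 plots the chance of a lucky pick is 1/20, which is why 20 is a common size for a line-up.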


9. Example of use of the nullabor package step by step


9.1 Example showing how to use the nullabor package

In this case it is not really necessary to use the package: the relationship between the two variables (wt and mpg) can be established with a linear regression.

    summary(lm(mpg ~ wt, data = mtcars))
    # mpg  Miles/(US) gallon
    # wt   Weight (1000 lbs)

    # Call:
    # lm(formula = mpg ~ wt, data = mtcars)
    #
    # Residuals:
    #     Min      1Q  Median      3Q     Max
    # -4.5432 -2.3647 -0.1252  1.4096  6.8727
    #
    # Coefficients:
    #             Estimate Std. Error t value Pr(>|t|)
    # (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
    # wt           -5.3445     0.5591  -9.559 1.29e-10 ***
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #
    # Residual standard error: 3.046 on 30 degrees of freedom
    # Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446
    # F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

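R-squared is the share of the variance in mpg that is explained by its linear relationship with wt, in this case 0.75. Values closer to 1 mean a stronger linear relationship, and the very small p-value for the wt coefficient (1.29e-10) indicates that the relationship is unlikely to be due to chance.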
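If you prefer to pull those numbers out of the model instead of reading them off the printout, a minimal sketch looks like this (the variable names are my own):

```r
# Fit the same model as above and keep its summary.
fit <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(fit)

# R-squared and the p-value of the wt coefficient.
r_squared <- fit_summary$r.squared
slope_p <- coef(fit_summary)["wt", "Pr(>|t|)"]

# With a single predictor, R-squared is the squared correlation.
same_as_cor <- isTRUE(all.equal(r_squared, cor(mtcars$mpg, mtcars$wt)^2))

c(r_squared = r_squared, slope_p = slope_p)
```

The last check is a reminder that, for a regression with one predictor, R-squared and the correlation between the two variables carry the same information.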


9.1.1. Scatterplot between mpg and wt

Let’s say that we want to find out whether there is a relationship between mpg and wt in the mtcars dataset. [6]

    library(nullabor)
    library(ggplot2)

    # Our original data
    ggplot(mtcars, aes(mpg, wt)) + geom_point()


Visually, we can see some kind of relationship between the two variables. Now let’s test that using the nullabor package.


9.1.2. Replacing original data

There are two functions to generate the null plots:

inf <- lineup(null_permute("mpg"), mtcars)

null_permute: Generates null data by permuting a variable. In this case, the mpg variable.

lineup: Implements the line-up protocol. It embeds the real data among a field of data sets generated to be consistent with some null hypothesis.


9.1.3. Selecting the real data

If the observer can pick the real data as different from the others, this lends weight to the statistical significance of the structure in the plot.

    # Display different samples
    ggplot(inf, aes(mpg, wt)) + geom_point() + facet_wrap(~ .sample)


Solution

Plot 12 stands out from the others and corresponds to the original data, so we have grounds to believe that the relationship between the two variables exists.
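As a numeric cross-check of the visual pick, the same idea can be sketched by hand in base R: hide the real data at a known position among 19 permuted copies and let the correlation play the role of the observer. This is my own illustration, not what nullabor does internally, and the position 12 is fixed here only to mirror the example above.

```r
set.seed(12)

n_panels <- 20
real_pos <- 12  # assumed position of the real data, as in the example

# Correlation between mpg and wt in each panel: the real data in one panel,
# a permuted copy of mpg in all the others.
panel_cor <- sapply(seq_len(n_panels), function(i) {
  mpg <- if (i == real_pos) mtcars$mpg else sample(mtcars$mpg)
  cor(mpg, mtcars$wt)
})

# The "observer" picks the panel with the strongest correlation.
picked <- which.max(abs(panel_cor))

round(panel_cor, 2)
picked
```

A human observer looks at much more than the correlation, of course, but when the structure is this strong the eye and the number agree.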

10. Conclusions

It is a convenient package when you suspect that there is something in a visualization, but you don’t know how to justify that hunch.

Several papers explain and develop graphical inference from an academic point of view; see the bibliography section for the references. In particular, I recommend reading the last one, because it explains in detail how to use this package.


Bibliography