Prerequisites

Experience with the specific topic: Novice

Professional experience: No industry experience

To follow this article, the reader should be familiar with Python syntax and have some understanding of basic statistical concepts (e.g. average, standard deviation).

Introduction: What Is Correlation and Why Is It Useful?

Correlation is one of the most widely used — and widely misunderstood — statistical concepts. In this overview, we provide the definitions and intuition behind several types of correlation and illustrate how to calculate correlation using the Python pandas library.

The term "correlation" refers to a mutual relationship or association between quantities. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example, sales might increase when the marketing department spends more on TV advertisements, or a customer's average purchase amount on an e-commerce website might depend on a number of factors related to that customer. Often, correlation is the first step to understanding these relationships and subsequently building better business and statistical models.

So, why is correlation a useful metric?

Correlation can help in predicting one quantity from another

Correlation can (but often does not, as we will see in some examples below) indicate the presence of a causal relationship

Correlation is used as a basic quantity and foundation for many other modeling techniques

More formally, correlation is a statistical measure that describes the association between random variables. There are several methods for calculating the correlation coefficient, each measuring a different type of association. Below we summarize three of the most widely used methods.

Types of Correlation

Before we go into the details of how correlation is calculated, it is important to introduce the concept of covariance. Covariance is a statistical measure of association between two variables X and Y. First, each variable is centered by subtracting its mean. These centered scores are multiplied together to measure whether an increase in one variable is associated with an increase in the other. Finally, the expected value (E) of the product of these centered scores is calculated as a summary of association:

$$\operatorname{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$$

Intuitively, the product of centered scores can be thought of as the area of a rectangle with each point's distance from the mean describing a side of the rectangle.

If both variables tend to move in the same direction, we expect the "average" rectangle connecting each point (X_i, Y_i) to the means (X_bar, Y_bar) to have a large and positive diagonal vector, corresponding to a larger positive product in the equation above. If both variables tend to move in opposite directions, we expect the average rectangle to have a diagonal vector that is large and negative, corresponding to a larger negative product in the equation above. If the variables are unrelated, then the vectors should, on average, cancel out — and the total diagonal vector should have a magnitude near 0, corresponding to a product near 0 in the equation above.
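
The three cases described here can be checked with a small from-scratch sketch. The `covariance` helper and the toy data are illustrative assumptions, not part of the original article; the helper divides by N (the expectation form), whereas pandas' `DataFrame.cov` divides by N - 1 by default.

```python
def covariance(xs, ys):
    """Population covariance: the mean of the products of centered scores."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


x = [1, 2, 3, 4, 5]

same_direction = covariance(x, [2, 4, 6, 8, 10])      # y rises with x
opposite_direction = covariance(x, [10, 8, 6, 4, 2])  # y falls as x rises
no_linear_trend = covariance(x, [4, 1, 5, 1, 4])      # products cancel out

print(same_direction, opposite_direction, no_linear_trend)  # 4.0 -4.0 0.0
```

The sign of the result matches the direction of the relationship, and the symmetric third dataset averages out to zero.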

If you are wondering what "expected value" is, it is another way of saying the average, or mean μ, of a random variable. It is also referred to as "expectation." In other words, we can write the following equation to express the same quantity in a different way:

$$\operatorname{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right]$$

The problem with covariance is that it keeps the scale of the variables X and Y, and therefore can take on any value. This makes covariances difficult to interpret and impossible to compare directly. For example, Cov(X, Y) = 5.2 and Cov(Z, Q) = 3.1 tell us that both pairs are positively associated, but it is difficult to tell whether the relationship between X and Y is stronger than that between Z and Q without looking at the means and distributions of these variables. This is where correlation becomes useful: by standardizing covariance by some measure of variability in the data, it produces a quantity that has an intuitive interpretation and a consistent scale.
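
To make the scale problem concrete, here is a minimal sketch with hypothetical advertising data (the numbers and variable names are invented for illustration). The same spend is expressed once in thousands of dollars and once in dollars; the relationship is unchanged, but the covariance grows by a factor of 1,000.

```python
def covariance(xs, ys):
    """Population covariance: the mean of the products of centered scores."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


# Hypothetical data: TV ad spend and resulting sales.
spend_thousands = [1, 2, 3, 4, 5]
sales = [12, 15, 21, 24, 28]

# The same spend, expressed in dollars instead of thousands of dollars.
spend_dollars = [s * 1000 for s in spend_thousands]

cov_thousands = covariance(spend_thousands, sales)
cov_dollars = covariance(spend_dollars, sales)

print(cov_thousands, cov_dollars)  # 8.2 8200.0
```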

Pearson Correlation Coefficient

Pearson is the most widely used correlation coefficient. Pearson correlation measures the linear association between continuous variables. In other words, this coefficient quantifies the degree to which a relationship between two variables can be described by a line. Remarkably, while correlation can have many interpretations, the same formula developed by Karl Pearson over 120 years ago is still the most widely used today.

In this section, we will introduce several popular formulations and intuitive interpretations for Pearson correlation (referred to as ρ).

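The original formula for correlation, developed by Pearson himself, uses raw data and the means of two variables, X and Y:

$$\rho_{X,Y} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}}$$

In this formulation, raw observations are centered by subtracting their means and rescaled by a measure of their standard deviations.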

A different way to express the same quantity is in terms of expected values, means $\mu_X$, $\mu_Y$, and standard deviations $\sigma_X$, $\sigma_Y$:

$$\rho_{X,Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}$$

Notice that the numerator of this fraction is identical to the above definition of covariance, since mean and expectation can be used interchangeably. Dividing the covariance between two variables by the product of standard deviations ensures that correlation will always fall between -1 and 1. This makes interpreting the correlation coefficient much easier.
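
As a rough check that the raw-data and expectation formulations agree, the sketch below computes both by hand on hypothetical data. The helper names `pearson_raw` and `pearson_cov` are made up for this illustration. It also rescales one variable to show that, unlike covariance, the coefficient does not change with the units.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_raw(xs, ys):
    """Raw-data form: centered cross-products over the root sums of squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = math.sqrt(sum((x - x_bar) ** 2 for x in xs)) * math.sqrt(
        sum((y - y_bar) ** 2 for y in ys)
    )
    return numerator / denominator


def pearson_cov(xs, ys):
    """Expectation form: covariance over the product of standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = mean([(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)])
    sigma_x = math.sqrt(mean([(x - x_bar) ** 2 for x in xs]))
    sigma_y = math.sqrt(mean([(y - y_bar) ** 2 for y in ys]))
    return cov / (sigma_x * sigma_y)


# Hypothetical data: ad spend (in thousands) and sales.
spend = [1, 2, 3, 4, 5]
sales = [12, 15, 21, 24, 28]

rho_raw = pearson_raw(spend, sales)
rho_cov = pearson_cov(spend, sales)
rho_rescaled = pearson_raw([s * 1000 for s in spend], sales)  # spend in dollars

print(round(rho_raw, 4), round(rho_cov, 4), round(rho_rescaled, 4))
```

All three values coincide up to floating-point error, and each lies between -1 and 1.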

The figure below shows three examples of Pearson correlation. The closer ρ is to 1, the more an increase in one variable is associated with an increase in the other. Conversely, the closer ρ is to -1, the more an increase in one variable is associated with a decrease in the other. Note that if X and Y are independent, then ρ is 0 (and its estimate from a sample will be close to 0), but not vice versa! In other words, Pearson correlation can be small even if there is a strong relationship between two variables. We will see shortly how this can be the case.

So, how can we interpret the Pearson correlation? It turns out there is a clear connection between Pearson correlation and the slope of a line. In the above figure, a regression line through each scatter plot is shown. The regression line is optimal in the least-squares sense: it minimizes the sum of squared vertical distances from the points to the line. Because of this property, the correlation between X and Y is mathematically equivalent to the slope of the regression line of Y on X, standardized by the ratio of their standard deviations:

$$\rho = b \, \frac{s_x}{s_y}$$

where b is the slope of the regression line of Y from X, and $s_x$ and $s_y$ are the standard deviations of X and Y. In other words, correlation reflects the association and amount of variability between the two variables.

This relationship with the slope of the line has two important implications:

It makes it clearer why Pearson correlation describes linear relationships

It also shows why correlation is important and so widely used in predictive modeling

However, note that in the above equation for ρ, correlation does not equal slope. Rather, the slope is standardized by a measure of data variability. For example, it is possible to have a slope of very small magnitude but a large correlation between variables. In the figure below, the line describing this relationship is relatively flat, but the correlation is 1 since the variability $s_y$ is very small.
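
The distinction between slope and correlation can be sketched numerically. The two datasets below are invented for illustration: one is perfectly linear with a nearly flat line, the other has a much steeper fitted line but noisy points. The correlation is recovered from the slope as b multiplied by the ratio of the standard deviations of X and Y.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    m = mean(values)
    return math.sqrt(mean([(v - m) ** 2 for v in values]))


def slope(xs, ys):
    """Least-squares slope of the regression of Y on X."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )


x = [1, 2, 3, 4, 5]
y_flat = [10.00, 10.01, 10.02, 10.03, 10.04]  # perfectly linear, tiny slope
y_noisy = [1, 9, 4, 12, 9]                    # steeper trend, scattered points

b_flat = slope(x, y_flat)
rho_flat = b_flat * std(x) / std(y_flat)

b_noisy = slope(x, y_noisy)
rho_noisy = b_noisy * std(x) / std(y_noisy)

print(round(b_flat, 4), round(rho_flat, 4))
print(round(b_noisy, 4), round(rho_noisy, 4))
```

The flat dataset has the smaller slope but the larger correlation, because its points barely deviate from the line.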

Note that, so far, we have not made any assumptions about the distribution of X and Y. The only restriction is that Pearson ρ assumes a linear relationship between the two variables. Pearson correlation relies on means and standard deviations, so it is only defined for distributions where those statistics are finite, and it is sensitive to outliers.

Another way to interpret Pearson correlation is to use the coefficient of determination, also known as R². While ρ is unitless, its square is interpreted as the proportion of variance of Y explained by X. In the above example, ρ = -0.65 implies that (-0.65)² × 100 ≈ 42% of the variation in Y can be explained by X.

There are many other ways to interpret ρ. Check out the classic paper "Thirteen ways to look at the correlation coefficient" if you are interested in connections between correlation and vectors, ellipses and more.

Spearman's Correlation

Spearman's rank correlation coefficient can be defined as a special case of Pearson ρ applied to ranked variables. Unlike Pearson, Spearman's correlation is not restricted to linear relationships. Instead, it measures monotonic association (consistently increasing or consistently decreasing, but not a mix of both) between two variables and relies on the rank order of values. In other words, rather than comparing means and variances, Spearman's coefficient looks at the relative order of values for each variable. This makes it appropriate to use with both continuous and discrete data.

The formula for Spearman's coefficient looks very similar to that of Pearson, with the distinction of being computed on ranks instead of raw scores:

$$\rho_s = \frac{\operatorname{Cov}\left(\operatorname{rank}(X), \operatorname{rank}(Y)\right)}{\sigma_{\operatorname{rank}(X)} \, \sigma_{\operatorname{rank}(Y)}}$$

If all ranks are unique (i.e. there are no ties in ranks), you can also use a simplified version:

$$\rho_s = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$$

where $d_i = \operatorname{rank}(X_i) - \operatorname{rank}(Y_i)$ is the difference between the two ranks of each observation and N is the number of observations.

The difference between Spearman and Pearson correlations is best illustrated by example.
In the figure below, there are three scenarios with both correlation coefficients shown.

In the first example, there is a clear monotonic (always increasing) and non-linear relationship. Since the ranks of the values align perfectly in this case, Spearman's coefficient is 1. Pearson correlation is weaker in this case, but it still shows a very strong association due to the partial linearity of the relationship.

The data in Example 2 shows clear groups in X and a strong, although non-monotonic, association for both groups with Y. In this case, Pearson correlation is almost 0 since the data is very non-linear. Spearman rank correlation shows weak association, since the data is non-monotonic.

Finally, Example 3 shows a nearly perfect quadratic relationship centered around 0. However, both correlation coefficients are almost 0 due to the non-monotonic, non-linear, and symmetric nature of the data.

These hypothetical examples illustrate that correlation is by no means an exhaustive summary of relationships within the data. Weak or no correlation does not imply lack of association, as seen in Example 3, and even a strong correlation coefficient might not fully capture the nature of the relationship. It is always a good idea to use visualization techniques and multiple statistical data summaries to get a better picture of how your variables relate to each other.
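
The behaviour in the first and third scenarios can be reproduced with a short from-scratch sketch. The data and helper functions here are assumptions made for illustration, not the data behind the figure. Spearman's coefficient is computed as Pearson correlation on ranks, with tied values given the average of their positions, and the simplified formula is checked on a dataset without ties.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return num / den


def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman(xs, ys):
    """Pearson correlation applied to the ranks of the data."""
    return pearson(ranks(xs), ranks(ys))


def spearman_simplified(xs, ys):
    """Shortcut formula; only valid when there are no tied ranks."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Monotonic but non-linear: exponential growth.
x_mono = [1, 2, 3, 4, 5, 6]
y_mono = [math.exp(v) for v in x_mono]
pearson_mono = pearson(x_mono, y_mono)
spearman_mono = spearman(x_mono, y_mono)

# Symmetric quadratic centered at 0 (y contains ties).
x_quad = [-3, -2, -1, 0, 1, 2, 3]
y_quad = [v ** 2 for v in x_quad]
pearson_quad = pearson(x_quad, y_quad)
spearman_quad = spearman(x_quad, y_quad)

# No ties: the rank-based and simplified formulas should agree.
x_small = [1, 2, 3, 4, 5]
y_small = [7, 5, 1, 6, 9]
spearman_full = spearman(x_small, y_small)
spearman_short = spearman_simplified(x_small, y_small)

print(round(pearson_mono, 3), round(spearman_mono, 3))
print(round(pearson_quad, 3), round(spearman_quad, 3))
print(round(spearman_full, 3), round(spearman_short, 3))
```

On the exponential data Spearman's coefficient is 1 while Pearson's is lower, and on the quadratic data both coefficients are 0 even though Y is completely determined by X.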

Kendall's Tau

The third correlation coefficient we will discuss is also based on variable ranks. However, unlike Spearman's coefficient, Kendall's τ does not take into account the difference between ranks, only directional agreement. Therefore, this coefficient is more appropriate for discrete data.

Formally, Kendall's τ coefficient is defined as:

$$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{N(N-1)/2}$$

As an example, consider a simple dataset consisting of five observations. In practice, such a small number of data points would be neither sufficient nor reliable for drawing any conclusions. But here, we consider it for the sake of the simplicity of calculation:

|   | X | Y |
|---|---|---|
| a | 1 | 7 |
| b | 2 | 5 |
| c | 3 | 1 |
| d | 4 | 6 |
| e | 5 | 9 |

Concordant pairs $(x_1, y_1)$, $(x_2, y_2)$ are pairs of values in which the ranks agree: $x_1 < x_2$ and $y_1 < y_2$, or $x_1 > x_2$ and $y_1 > y_2$. In our mini example, (4,6) and (5,9) in rows d and e form a concordant pair. A discordant pair would be one that does not satisfy this condition, such as (1,7) and (2,5).

To calculate the numerator of τ, we compare all possible pairs in the dataset and count the number of concordant pairs, 6 in this case:

(1,7) and (5,9)

(2,5) and (4,6)

(2,5) and (5,9)

(3,1) and (4,6)

(3,1) and (5,9)

(4,6) and (5,9)

and the number of discordant pairs, 4 in this case:

(1,7) and (2,5)

(1,7) and (3,1)

(1,7) and (4,6)

(2,5) and (3,1)

The denominator of Kendall's τ is just the number of possible combinations of pairs, which ensures that τ varies between -1 and 1. With five data points, there are 5 * 4/2 = 10 possible combinations, making τ = (6-4) / 10 = 0.2 in this example. Kendall's correlation is particularly useful for discrete data, where the relative position of data points is more important than the difference between them.
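
The pair counting above can be sketched in a few lines of plain Python. The `kendall_tau` helper is a hypothetical name for this illustration, and this simple form assumes there are no ties in either variable.

```python
from itertools import combinations


def kendall_tau(points):
    """Count concordant and discordant pairs, then compute tau (no ties assumed)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        # Positive product: x and y move in the same direction within the pair.
        direction = (x2 - x1) * (y2 - y1)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(points) * (len(points) - 1) // 2
    return concordant, discordant, (concordant - discordant) / n_pairs


# The five observations (rows a to e) from the table above.
data = [(1, 7), (2, 5), (3, 1), (4, 6), (5, 9)]

concordant, discordant, tau = kendall_tau(data)
print(concordant, discordant, tau)  # 6 4 0.2
```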

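```python
import numpy as np
import pandas as pd

# Toy dataset from the Kendall's tau example above
k = pd.DataFrame()
k['X'] = np.arange(5) + 1
k['Y'] = [7, 5, 1, 6, 9]
print(k.corr(method='kendall'))
```

```
     X    Y
X  1.0  0.2
Y  0.2  1.0
```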

Calculating Correlation in Pandas