SQL Recipes Summarizing Data Calculating Relationships with Correlation Matrices

Making Correlation Coefficient Matrices to understand relationships in SQL

A scatter X-Y plot is a straightforward way to visualize the dependency between two variables. However, at times you want to understand how more than two variables are related. The correlation coefficient can be calculated between pairs of variables that scales between +1 and -1 demonstrating the degree of positive or negative correlation. The relationship between three variables can be presented as a symmetric matrix of dimensions with a value of 1 along the diagonal.

In this recipe, we'll look for correlations betwen two or more variables and visualize it as a matrix. You can use the correlation matrix to figure out what activities are correlated, to plan future activities. For example, within your customer support department, how are net promoter scores (NPS) related to support wait times.

The Correlation Coefficient math

One of the most commonly used correlation formula is Pearson’s. For a sample,

where, and are the sample means of variables and .

The SQL to correlate email subject lengths and open rates

Using Mailchimp's email campaign data – we are interested in finding if there's a relationship between the length of the subject and open/click rates.

1. Calculating correlation by hand

We are going to be using Common Table Expressions (CTEs) to organize our intermediate results:

with table_mean as ( select avg(char_length(subject_line)) as mean_subject_length, avg(report_summary_open_rate) as mean_open_rate from mailchimp.campaigns ), table_corrected as ( select char_length(subject_line) - mean_subject_length as mean_subject_length_corrected, report_summary_open_rate - mean_open_rate as mean_open_rate_corrected from table_mean, mailchimp.campaigns ), select sum(mean_subject_length_corrected * mean_open_rate_corrected) / sqrt(sum(mean_subject_length_corrected * mean_subject_length_corrected) * sum(mean_open_rate_corrected * mean_open_rate_corrected)) as r from table_corrected;

This is a direct translation of the math equation. Fortunately, we don't have to repeat this each time, we can simply use the in-built corr function to calculate the correlation for us.

2. Calculating pairwise correlation using corr

The correlation can be calculated as follows:

select corr(char_length(subject_line), report_summary_open_rate) as r;

For more than two variables, we are going to repeat the correlation calculation pairwise between the variables and organize the results in the follow format. In step 3, it'll be clear why we use this format:

row col coeff subject_length subject_length (always 1) subject_length open_rate coefficient value subject_length click_rate coefficient value open_rate open_rate (always 1) open_rate click_rate coefficient value click_rate click_rate (always 1)

where subject_length is char_length(subject_line) .

select 'subject_length' as row, 'subject_length' as col, corr(subject_length, subject_length) as coeff from mailchimp.campaigns union select 'subject_length' as row, 'open_rate' as col, corr(subject_length, open_rate) as coeff from mailchimp.campaigns union select 'subject_length' as row, 'click_rate' as col, corr(subject_length, click_rate) as coeff from mailchimp.campaigns union select 'open_rate' as row, 'open_rate' as col, corr(open_rate, open_rate) as coeff from mailchimp.campaigns union select 'open_rate' as row, 'click_rate' as col, corr(open_rate, click_rate) as coeff from mailchimp.campaigns union select 'click_rate' as row, 'click_rate' as col, corr(click_rate, click_rate) as coeff from mailchimp.campaigns

3. Pivoting the table to get a matrix

In this step, we'll be building a manual pivot table using the values from the col column. After the pivot, the values of the col column will become new columns in the resulting table.

select row, sum(case when col='subject_length' then coeff else 0 end) as subject_length, sum(case when col='open_rate' then coeff else 0 end) as open_rate, sum(case when col='click_rate' then coeff else 0 end) as click_rate from ( // ... query as before ) group by row order by row DESC

and our table will look like this:

row subject_length open_rate click_rate subject_length 1 -0.31595114872448 -0.102740332427165 open_rate 0 1 0.798087505055931 click_rate 0 0 1

which matches the values we get from Excel or Google sheets. Our calculations show that open and click rates are negatively correlated with subject lengths and there's a strong positive relation between open and click rates (which is obvious because they are not independent variables – you need to open an email in order to click it.)