In the last post, comparing two datasets from the Scottish datastore, I assumed that there would be a relationship between the number of alcohol-related hospital discharges in a council area and the number of drug-related discharges. This turned out not to be the case.

To explore the data further, in this article we’re going to use SPARQL to extract a range of datasets from the Scottish datastore, and see how they all relate to each other in a correlation matrix:

I’ll talk about what this matrix means and how to read it in more detail towards the end of this piece, but first up, I’ll explain how I went about creating it.

As before, the grounding for this is in the previous posts (using R to analyse linked data, and R and Linked Data — comparing linked datasets for analysis).

The first thing we need to do is write our query to get the data out of the datastore. We want a good selection of datasets broken down by Council Area. We can see which datasets are available to us by viewing this page:

There are 117 datasets about Council Areas in Scotland, some of which have many indicators inside them. We will choose 20 to compare, giving a good balance of health, economy, skills and environment:

Alcohol-related hospital discharges

Antenatal smoking

Breastfeeding rates

Childhood immunisations

Crime clear-up rates

Deliberate fires

Disability living allowance

Drug-related hospital discharges

Dwellings per hectare

Fuel poverty

Gender pay gap

Housing standard

Houses near derelict sites

Job Seeker’s Allowance Claimants

Life expectancy at birth for males and females

Low birthweight babies

Qualifications of adults

School attendance rate

Teenage pregnancy

The first thing to do, then, is get the data. As in the previous two articles, we will write the SPARQL query against the Scottish Government datastore’s endpoint, and then pass the query text to R for data manipulation.

Each dataset/indicator is modelled in a different way, because each has different dimensions associated with it, such as age or gender. This means we need to write a separate mini-query for each indicator (one of the reasons we’ve limited the number to 20). These mini-queries are then glued together using a series of UNION joins, much like a UNION in SQL. The resulting query ended up being too big, in terms of the number of characters, for R to handle (there seems to be a limit of around 7,000 characters in the SPARQL library), so we split the query into two and handle the merging of dataframes within R.
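To show the shape of these mini-queries, here is a hedged sketch of the UNION pattern. The dataset URIs and property names below are placeholders, not the real ones from the Scottish datastore; the actual queries are linked at the end of this article.

```sparql
# Sketch only: example.org URIs stand in for the real dataset,
# dimension and measure URIs used against the Scottish datastore.
PREFIX qb: <http://purl.org/linked-data/cube#>

SELECT ?area ?indicator ?value WHERE {
  {
    ?obs qb:dataSet <http://example.org/data/alcohol-discharges> ;
         <http://example.org/def/refArea> ?area ;
         <http://example.org/def/measure> ?value .
    BIND("Alcohol discharges" AS ?indicator)
  }
  UNION
  {
    ?obs qb:dataSet <http://example.org/data/deliberate-fires> ;
         <http://example.org/def/refArea> ?area ;
         <http://example.org/def/measure> ?value .
    BIND("Deliberate fires" AS ?indicator)
  }
}
```

Each branch of the UNION handles one indicator’s particular dimensions, and the BIND gives every row a label so the indicators can be told apart once the results land in a single dataframe.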

The two queries that we used are here (you can also see these at the bottom of this article):

You can test either of these queries by pasting them straight into the endpoint on the Scottish datastore. As in the previous posts, though, we’re going to use R to grab the data.

The first step is to load the necessary packages — SPARQL for getting the data from the SPARQL endpoint, reshape2 for creating the matrix, corrplot for creating the visual, and RColorBrewer to use the colour schemes from ColorBrewer.
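The package-loading code was embedded in the original post; it amounts to the following (install any missing packages first with `install.packages`):

```r
# Load the four packages used in this post
library(SPARQL)        # querying the SPARQL endpoint
library(reshape2)      # acast, for pivoting the data into a matrix
library(corrplot)      # drawing the correlation visual
library(RColorBrewer)  # ColorBrewer colour schemes
```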

Next, we load a couple of variables with the SPARQL queries (remember, we need to split the query in two because of a character limit in the SPARQL R library). We tell R where the endpoint is, and then load the results of the queries into two result sets. Finally, we take only the data, and push it into two dataframes.
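The extraction step looks roughly like this. The query strings are elided here (the full text is linked at the end of the article), and the variable names are assumptions; the endpoint is the Scottish datastore’s SPARQL endpoint.

```r
# Sketch of the extraction step: the "..." placeholders stand for the
# two full query texts given at the end of the article.
endpoint <- "http://statistics.gov.scot/sparql"
query1 <- "..."  # first half of the UNION query
query2 <- "..."  # second half of the UNION query

# SPARQL() returns a list; the $results element holds the data itself
resultset1 <- SPARQL(endpoint, query1)
resultset2 <- SPARQL(endpoint, query2)

df1 <- resultset1$results
df2 <- resultset2$results
```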

We can now merge these dataframes into one. Because they have the same structure, we can simply append one onto the other. To do this, we use the rbind function.
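As a toy illustration of the append (column names here are assumptions about what the queries return), rbind simply stacks one dataframe under the other because the columns match:

```r
# Two small stand-in dataframes with identical columns
df1 <- data.frame(area = c("Fife", "Moray"),
                  indicator = "Alcohol discharges",
                  value = c(10, 12))
df2 <- data.frame(area = c("Fife", "Moray"),
                  indicator = "Deliberate fires",
                  value = c(3, 5))

# rbind appends the rows of df2 onto df1
alldata <- rbind(df1, df2)
nrow(alldata)  # 4: all rows from both frames
```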

Now that we have all the data in one dataframe, we need to turn it into a matrix (similar to pivoting in MS Excel). To do this, we use the reshape2 package — specifically the acast function.

This function takes the dataframe, which looks like this:

And turns it into a matrix, which looks like this:
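Continuing the toy example, the acast call pivots the long dataframe into a wide matrix with one row per council area and one column per indicator. The formula’s variable names are assumptions about the query’s SELECT variables:

```r
library(reshape2)

# Long-format stand-in for the merged dataframe
alldata <- data.frame(
  area = c("Fife", "Moray", "Fife", "Moray"),
  indicator = c("Alcohol discharges", "Alcohol discharges",
                "Deliberate fires", "Deliberate fires"),
  value = c(10, 12, 3, 5)
)

# area ~ indicator: rows become areas, columns become indicators
mdata <- acast(alldata, area ~ indicator, value.var = "value")
```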

We can then run the cor function against this matrix to test the correlation between each variable:
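A minimal sketch of that step on synthetic data (the real call runs against the pivoted indicator matrix; `use = "pairwise.complete.obs"` is an assumption that stops areas with missing values from knocking out whole column pairs):

```r
# Synthetic stand-in for the area-by-indicator matrix
set.seed(1)
mdata <- matrix(rnorm(40), nrow = 10, ncol = 4,
                dimnames = list(NULL, c("a", "b", "c", "d")))

# Correlate every indicator column against every other one
cormatrix <- cor(mdata, use = "pairwise.complete.obs")
dim(cormatrix)   # 4 x 4: one row and column per indicator
diag(cormatrix)  # each indicator correlates perfectly with itself
```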

This gives us another matrix — showing how each indicator relates to the other indicators in the matrix.

This post isn’t going to go into the theory behind correlation calculations, other than to loudly say CORRELATION DOES NOT EQUAL CAUSATION. In other words, this technique is useful for seeing how different characteristics of an area may vary in relation to other characteristics, eg that areas where breastfeeding rates are lower are more likely to have higher numbers of deliberate fires, NOT that people who are not breastfed tend to set fire to things.

To interpret the matrix: a coefficient of 1 is a perfect positive correlation, ie as one indicator increases, the other increases in a perfectly linear way. A coefficient of -1 is a perfect inverse correlation, where as one indicator increases, the other decreases in a linear way. 0 indicates no correlation between the two, and in between there is a sliding scale from weak to moderate to strong.

As well as the correlation coefficient, it is important to consider the p-value, which is a measure of how significant the correlation is. To calculate the p-values, we can use the following code:
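The snippet was embedded as a gist in the original; a version along the lines of the widely used cor.mtest helper from the corrplot examples looks like this (the synthetic mdata at the bottom is a stand-in for the real indicator matrix):

```r
# Run cor.test on every pair of columns, collecting the p-values and
# confidence-interval bounds into matrices
cor.mtest <- function(mat, conf.level = 0.95) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  diag(lowCI.mat) <- diag(uppCI.mat) <- 1
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], conf.level = conf.level)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
      lowCI.mat[i, j] <- lowCI.mat[j, i] <- tmp$conf.int[1]
      uppCI.mat[i, j] <- uppCI.mat[j, i] <- tmp$conf.int[2]
    }
  }
  list(p.mat, lowCI.mat, uppCI.mat)
}

# Synthetic stand-in for the real area-by-indicator matrix
set.seed(42)
mdata <- matrix(rnorm(100), nrow = 20, ncol = 5)
res1 <- cor.mtest(mdata, 0.95)
```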

This step creates a function called cor.mtest, into which we pass our matrix, along with the desired confidence level (in this case 95%). We can then use the values in res1 to display the significance of the correlation.

Although we have the correlation matrix now, we really want to make it more visual, to make it easier to draw conclusions and prompt further investigation. To do this, we can use the corrplot library.

The first line creates a function that sets the colour scheme to be used, using the RColorBrewer library. Then the second line uses the corrplot library to draw the visualisation. There are many optional parameters available when drawing the chart, which are described in detail in the corrplot reference manual. This is the final output:
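The plotting code was embedded as a gist; a sketch along these lines would produce a similar chart. The palette and parameter choices here are assumptions (the BrBG brown-to-green scheme is chosen to match the chart’s colours), and it reuses the cormatrix and res1 objects from the earlier steps:

```r
# Assumed palette and parameters; see the corrplot reference manual
# for the full set of options
col <- colorRampPalette(brewer.pal(11, "BrBG"))

corrplot(cormatrix, method = "square", col = col(200),
         tl.col = "black", tl.cex = 0.7,       # label colour and size
         p.mat = res1[[1]], sig.level = 0.05,  # cross out insignificant squares
         insig = "pch", pch.cex = 0.9)
```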

The visualisation shows the linear relationship between the 20 indicators that we selected. The darker, larger green squares show stronger positive correlation; the darker, larger brown squares show stronger negative correlation. The squares with crosses show where the correlation is not significant (ie the p-value is greater than 0.05).

The visualisation makes it fairly easy to pick out some interesting points from the data.

Higher proportions of teenage mothers are strongly correlated with higher proportions of mothers who are current smokers at the time of booking with maternity services.

The difference between men’s and women’s pay doesn’t really have a relationship with any other dataset, other than perhaps the proportion of crimes ‘cleared up’.

The higher the proportion of people in a council area who live within 500m of a ‘derelict site’, the more deliberate fires there are.

Alcohol-related hospital discharges (the thing that started these articles) do not strongly correlate with any other indicator, with the closest being male life expectancy.

This correlation matrix would be further improved by making it more interactive — ie hovering the mouse over a square would pop up a tooltip showing the correlation coefficient and the p-value. This is beyond the capabilities of corrplot, but should be possible by preparing the data in R, and then using d3 to generate an interactive web-based visualisation. Hopefully we’ll look at this in a future article.

Here is the full code to paste into R Studio to make this matrix. Note that when pasting into R from Medium and Github, there may be errors related to quotation marks and inverted commas (‘ and “). It seems to be best to paste into a text editor and convert the punctuation using find and replace.