Following on from last month’s post on using R to analyse linked data, we’re going to look in a bit more depth at the sorts of things you can do with R and linked data. We’re going to carry on using the statistics.gov.scot site, and in particular the dataset relating to alcohol-related discharges from hospital.

This post will look at using R to compare datasets — pulling two different datasets in, merging them, and showing the relationship between the two.

If you haven’t already read the last post in this series, you can read it here. That post gives some of the grounding in terms of using SPARQL to get data out of the datastore, and using the SPARQL library in R to get that data for analysis, and I’ll be referring to it throughout this post.

So last time we identified the Orkney Islands as having the highest alcohol-related hospital discharge rates of all Council areas in Scotland:

This seemed at odds with what I would expect from what I perceive as a rural area. It is worth exploring what other data there is in the datastore that might help explain this. (Note that this isn’t an exercise in solving society’s problems; it is just a step-by-step example of a practical application of linked data for policy-types.)

There is another indicator in the datastore — drug-related hospital discharges. Intuitively, I would expect this to be higher in areas where alcohol-related discharges are also high. We can test this in R.

We can re-use the SPARQL query from the earlier post to bring in the alcohol-related discharge data, and we can copy it to get the drug-related discharge data, by replacing the dataset in the WHERE statement:
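As a rough sketch of the "swap the dataset" idea, the query can be kept as a template in R with the dataset URI filled in per indicator. The query body here is a generic RDF Data Cube stand-in rather than the query from the earlier post, and the two dataset URIs (and the `build_query` helper) are placeholders I have made up for illustration:

```r
# Template with a single %s placeholder where the dataset URI goes.
# The WHERE clause is a generic Data Cube pattern, not the original query.
query_template <- "
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?obs ?property ?value
WHERE {
  ?obs qb:dataSet <%s> ;
       ?property ?value .
}"

# Hypothetical helper: returns the query text for one dataset
build_query <- function(dataset_uri) {
  sprintf(query_template, dataset_uri)
}

# Placeholder URIs; substitute the real dataset identifiers from the datastore
alcohol_query <- build_query("http://example.org/data/alcohol-related-discharge")
drug_query    <- build_query("http://example.org/data/drug-related-discharge")

cat(drug_query)
```

Keeping one template means the two queries can only differ in the dataset they point at, which makes the later comparison easier to trust.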

If you paste this into the SPARQL endpoint then you’ll get a set of results for all Council areas in Scotland. You can also see the actual query here.

Taking these results automatically into R is as easy as last time. Creating a new project in RStudio, we can import both datasets, and put them into a couple of dataframes:
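A minimal sketch of that import step, assuming the SPARQL package for R is installed. The endpoint URL and the `fetch_indicator` helper are my assumptions, and the calls that hit the network are left commented out:

```r
# Assumed endpoint address; check the datastore's own documentation
endpoint <- "http://statistics.gov.scot/sparql"

# Hypothetical helper: run one query and return the results as a dataframe
fetch_indicator <- function(query, endpoint_url = endpoint) {
  if (!requireNamespace("SPARQL", quietly = TRUE)) {
    stop("The SPARQL package is needed for this sketch")
  }
  # SPARQL() returns a list; the query results sit in its $results element
  SPARQL::SPARQL(url = endpoint_url, query = query)$results
}

# Usage (needs network access; alcohol_query and drug_query are query strings):
# dfalc  <- fetch_indicator(alcohol_query)
# dfdrug <- fetch_indicator(drug_query)
```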

This block of code retrieves two datasets from the Scottish datastore and loads them into two separate dataframes (dfdrug and dfalc). We can then merge these into a single dataframe to allow easy comparison:
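To illustrate the merge on its own, here is a small sketch using made-up values in place of the real query results. The column name `areaname` and the "Area A/B/C" rows are assumptions for illustration; only `nratioalcohol` and `nratiodrug` are names used in this post:

```r
# Toy stand-ins for the two dataframes (values are invented, not real rates)
dfalc <- data.frame(
  areaname = c("Area A", "Area B", "Area C"),
  nratioalcohol = c(700, 550, 900)
)
dfdrug <- data.frame(
  areaname = c("Area B", "Area C", "Area A"),
  nratiodrug = c(120, 60, 150)
)

# Match rows on the area name; row order in the inputs does not matter
dfmerged <- merge(dfalc, dfdrug, by = "areaname")
print(dfmerged)
```

By default `merge()` keeps only areas present in both dataframes, so it is worth checking that the merged row count matches the number of Council areas you expect.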

This merge function is using the area name as the matching column to bring the two indicators together. Using the ggplot line of code we used last time, we can quickly swap out the alcohol-related data for drug-related data to give a visual view of the data:
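The chart code could look something like the sketch below, assuming ggplot2 is installed. It uses invented values and an assumed `areaname` column, and it is not necessarily the same ggplot call as in the earlier post:

```r
library(ggplot2)

# Toy merged dataframe (values are invented, not real rates)
dfmerged <- data.frame(
  areaname = c("Area A", "Area B", "Area C"),
  nratioalcohol = c(700, 550, 900),
  nratiodrug = c(150, 120, 60)
)

# Horizontal bar chart of the drug-related rates, ordered by value.
# Swapping nratiodrug for nratioalcohol gives the alcohol version.
p_drug <- ggplot(dfmerged, aes(x = reorder(areaname, nratiodrug), y = nratiodrug)) +
  geom_col() +
  coord_flip() +
  labs(x = "Council area", y = "Drug-related discharge rate")

print(p_drug)
```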

(This is an important point about the value of R — the repeatability of the code is extremely powerful, and really speeds up the analysis and visualisation of data.)

So the code produces this chart, which looks similar to the alcohol one (at the top of this post), but with some major differences in the ordering. The area that we are particularly interested in — the Orkney Islands — is way down the chart, with the 6th lowest rates in Scotland.

Comparing the two charts suggests there is no firm relationship between the two measures (which was surprising to me). We can go a bit further with this, and show the relationship. Using R’s cor.test() function, we can do a very basic statistical test:
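As a self-contained sketch of the call, here it is run on invented values rather than the real data, so its numbers will not match the result reported in this post:

```r
# Toy merged dataframe (values are invented, not real rates)
dfmerged <- data.frame(
  nratioalcohol = c(700, 550, 900, 620, 810, 480),
  nratiodrug = c(150, 120, 60, 140, 110, 90)
)

# Pearson correlation test (the default method) between the two columns
result <- cor.test(dfmerged$nratioalcohol, dfmerged$nratiodrug)
print(result)

# Individual pieces can be pulled out of the result object
result$estimate   # correlation coefficient
result$p.value    # two-sided p-value
result$conf.int   # 95 percent confidence interval
```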

This line of code in R is testing the correlation between the values in the dataframe ‘dfmerged’, for the columns ‘nratioalcohol’ and ‘nratiodrug’. This is the result:

    Pearson's product-moment correlation

    data:  dfmerged$nratioalcohol and dfmerged$nratiodrug
    t = 2.0022, df = 30, p-value = 0.05437
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     -0.006092636  0.618035163
    sample estimates:
          cor
    0.3433307
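To see how the figures in that output relate to each other, the sketch below recomputes the t statistic, p-value and confidence interval from just the reported correlation and degrees of freedom, using the standard formulas for a Pearson test (the variable names are mine). It should land close to the values printed above:

```r
r <- 0.3433307        # sample correlation reported by cor.test()
deg_free <- 30        # degrees of freedom reported by cor.test()
n <- deg_free + 2     # number of Council areas in the sample

# t statistic for a Pearson correlation
t_stat <- r * sqrt(deg_free) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), deg_free)

# 95 percent confidence interval via the Fisher z-transformation
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
conf_int <- tanh(c(z - half_width, z + half_width))

print(c(t = t_stat, p = p_value))
print(conf_int)
```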

Because we’ve used Council areas as our sample, the sample size is small (32 areas, which is where df = 30 comes from). The correlation is fairly weak at 0.34 and, with a p-value of 0.054, just above the conventional 0.05 threshold, it is not statistically significant at the 5% level; the 95% confidence interval also includes zero. The final step we might want to take here is to show this relationship on a scatterplot, so that we can see the outliers.

We can do this using ggplot2 again. The data is already prepared, so we can just make a scatterplot:
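A sketch of such a scatterplot, assuming ggplot2 is installed, and again with invented values and an assumed `areaname` column used to label the points:

```r
library(ggplot2)

# Toy merged dataframe (values are invented, not real rates)
dfmerged <- data.frame(
  areaname = c("Area A", "Area B", "Area C", "Area D"),
  nratioalcohol = c(700, 550, 900, 620),
  nratiodrug = c(150, 120, 60, 140)
)

# One point per Council area; the labels make outliers easy to identify
p_scatter <- ggplot(dfmerged, aes(x = nratioalcohol, y = nratiodrug)) +
  geom_point() +
  geom_text(aes(label = areaname), vjust = -0.8, size = 3) +
  labs(x = "Alcohol-related discharge rate", y = "Drug-related discharge rate")

print(p_scatter)
```

Labelling every point works for a few dozen areas; with many more points it would be tidier to label only the outliers.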

This gives us a chart which shows that the correlation isn’t that strong, with the points (Council areas) spread out quite widely. It also clearly shows the Orkney Islands as an outlier, with high alcohol-related but low drug-related discharges.

This means that to try and understand the data in the Orkney Islands better, we need to look at more datasets. In the next post, we will look at a way to create and visualise a correlation matrix, still using R, that will allow us to explore the relationship between lots of datasets in the Scottish Datastore.