How much you like open and accessible data probably depends on the kind of person you are. I happen to like it a lot! But it turns out that, at a national scale, more open and accessible government data is positively correlated with happiness.

In this post, I want to share with you how I used Kaggle Kernels — our in-browser code execution environment — to explore two very interesting open datasets on Kaggle’s Datasets platform to come to this conclusion.

In my R analysis reproduced here, I demonstrate a few things:

How to write an Rmarkdown report using Kaggle Kernels featuring ggplot, dplyr, and formattable

How to combine multiple data sources from a catalogue of over 1,000 datasets into a single “kernel” (analysis) thanks to a super cool new feature

And, finally, the answer to the question: Are “open data friendly” countries “happy” countries?

Let’s go!

Introduction

In the kernel I wrote, executed, and published on Kaggle, I examine the question of whether countries whose governments adopt open policies with respect to data sharing are the same countries that score highly on the world happiness index. Let’s hypothesize that the two are positively correlated!

Thanks to the new multiple data sources feature, I can easily combine datasets from a catalogue of over 1,000 sources on Kaggle’s public data platform (plus several hundred from competitions, too, if I want!).

Here are the two datasets shared on Kaggle that I’ve chosen to work with:

Open Knowledge International’s 2015 Global Open Data Index

The Global Open Data Index is an annual effort to measure the state of open government data around the world. The crowdsourced survey is designed to assess the openness of specific government datasets according to the Open Definition.

Sustainable Development Solutions Network’s World Happiness Report from 2016

The World Happiness Report is a landmark survey of the state of global happiness. The World Happiness Report 2016 Update, which ranks 156 countries by their happiness levels, was released in Rome in advance of UN World Happiness Day, March 20th. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness. They reflect a new worldwide demand for more attention to happiness as a criterion for government policy.

Now that I’ve got my question plus the data to help me answer it, I’m ready to start a new kernel on Kaggle. The next section shows how to do this, including the code for reading in, joining, and manipulating the two datasets.

Reading in Multiple Data Sources

It’s quite straightforward to read in multiple data sources in a kernel on Kaggle:

Click on “New Kernel” from any page including https://www.kaggle.com/kernels

Select “Script” for Rmarkdown (“Notebook” starts a new Jupyter notebook)

Search for and add data sources (you can go back and add more at any time)

A demo showing how to start a new kernel with multiple data sources. In this case it’s a notebook which seamlessly combines markdown and either Python or R code.

On any kernel, you can see its data sources by clicking on the “Input” tab. This is what you see on my kernel once it’s published:

The “Input” tab on my Happiness and Open Data kernel on Kaggle.
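
The same information can also be checked from code. Here is a minimal sketch, assuming each data source is mounted in its own subfolder of `../input` (which is what the file paths used later in this post suggest). The helper name `list_sources` is made up for illustration and is not part of Kaggle or any package.

```r
# List the files under an input directory, grouped by data source folder.
# Assumption: every data source sits in its own subfolder, e.g.
# "../input/open-data/" and "../input/world-happiness/".
list_sources <- function(input_dir = "../input") {
  # Relative paths such as "open-data/countries.csv"
  files <- list.files(input_dir, recursive = TRUE)
  # Group the file names by the folder they live in
  split(basename(files), dirname(files))
}

print(list_sources())
```

If a folder you expected is missing from the output, the data source probably wasn't added to the kernel.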

Now that I’ve selected my data sources, I can get coding. In this post, I’m sharing the code portions of my Rmarkdown file which you can see in its entirety here.

The code below reads in the data sources and joins them together by country name. There are some country names that don’t exactly match, so I’ll leave it to you to fork this and tweak the code (click the blue “Fork” button on the kernel).

library(dplyr)

# Read in data files from `open-data` and `world-happiness` datasets
open_data <- read.csv("../input/open-data/countries.csv", stringsAsFactors = FALSE)
happiness <- read.csv("../input/world-happiness/2015.csv", stringsAsFactors = FALSE)

# Rename from "Country Name" to just "Country" so it's easier to join
colnames(open_data)[2] <- "Country"

# Join the two dataset files on "Country"
open_data_happiness <- open_data %>%
  left_join(happiness, by = "Country") %>%
  mutate(Country = factor(Country)) %>%
  # Keep only columns I plan to use
  select(Country, Region, X2015.Score, Happiness.Score, Economy..GDP.per.Capita.,
         Family, Health..Life.Expectancy., Freedom, Trust..Government.Corruption.,
         Generosity, Dystopia.Residual)

# Give the columns nicer names now that our data is in one dataframe
colnames(open_data_happiness) <- c("Country", "Region", "Openness", "Happiness", "GDP",
                                   "Family", "Health", "Freedom", "Trust",
                                   "Generosity", "DystopiaResidual")
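
If you do fork the kernel to fix the country names that don't match, one possible starting point is dplyr's `anti_join`, which returns the rows of one table that have no partner in the other. The sketch below runs on toy data frames: the rows, the placeholder scores, and the `fixes` lookup are all invented for illustration and are not taken from the real datasets.

```r
library(dplyr)

# Toy stand-ins for the two data frames (placeholder values, not real scores)
open_data_toy <- data.frame(
  Country = c("Denmark", "Russian Federation", "Taiwan"),
  X2015.Score = c(1, 2, 3),
  stringsAsFactors = FALSE
)
happiness_toy <- data.frame(
  Country = c("Denmark", "Russia", "Taiwan"),
  Happiness.Score = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Rows in the open data table with no matching country in the happiness table
unmatched <- anti_join(open_data_toy, happiness_toy, by = "Country")
print(unmatched$Country)

# Hypothetical lookup: name used in the open data -> name used in the happiness data
fixes <- c("Russian Federation" = "Russia")

# Swap in the corrected name where one exists, otherwise keep the original
open_data_fixed <- open_data_toy %>%
  mutate(Country = ifelse(Country %in% names(fixes),
                          unname(fixes[Country]),
                          Country))

# After the fix, no rows should be left without a match
still_unmatched <- anti_join(open_data_fixed, happiness_toy, by = "Country")
print(nrow(still_unmatched))
```

Running the same `anti_join` in the other direction (happiness first, open data second) shows which names are missing on the other side, and that makes it easier to build the lookup.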

Now that I have the data roughly how I want it, let’s have a quick peek. I really like this package called formattable for presenting information in dataframes. I’ll use it to look at the characteristics of the top 10 countries rated highest for their open data sharing policies: