"A heatmap is basically a table that has colors in place of numbers. Colors correspond to the level of the measurement. Each column can be a different metric like above. It’s useful for finding highs and lows and sometimes, patterns."

Nathan Yau, Visualize This

One of the problems with a large amount of data is finding the right way to visualize it and give the reader a simple yet general view of all the information.

To visualize trends within large data sets, it is useful to consider a heat map, which uses color instead of a table of numbers.

As with everything in life, there ain’t no such thing as a free lunch, and that holds here too: we lose accuracy because we replace numbers with a range of colors, but in exchange we gain a broad view of the trends.

The colors in the heat map belong to a spectrum and are assigned according to each value's distance from the statistical mean. Intuitively, darker colors mean one thing and lighter colors another, which makes it quick to spot patterns and the maximum and minimum values.
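To make the idea concrete, here is a minimal sketch with made-up numbers. The matrix m and the five-step white-to-blue palette are my assumptions for illustration, not the data or the palette used later in this post. It centers each column on its mean with scale() and then bins the result into colors with cut(). Base R's heatmap does its own scaling internally, so treat this as an illustration of the principle rather than of its actual implementation.

```r
# Toy data (invented for illustration): 3 teams x 2 metrics
m <- matrix(c(10, 20, 60,    # first column: goals
              1,  2,  3),    # second column: titles
            nrow = 3,
            dimnames = list(c("A", "B", "C"), c("goals", "titles")))

# Center each column on its mean and divide by its standard deviation
z <- scale(m)

# Five colors from light (low) to dark (high)
palette_5 <- colorRampPalette(c("white", "darkblue"))(5)

# Split the range of z into 5 equal-width bins and look up one color per cell
bins <- cut(as.vector(z), breaks = 5, labels = FALSE)
cell_colors <- matrix(palette_5[bins], nrow = nrow(z), dimnames = dimnames(z))

print(round(z, 2))
print(cell_colors)
```

Every cell now carries only a color, so the exact numbers are gone, but the highest and lowest values stand out immediately.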

Updates (Sat 11/24/2018)

After some comments from u/ELKronos and u/prv about how to improve this example, I added what the data looks like before and after the transformations.

Idea

Let’s use a heatmap to visualize the stats for the America Soccer Cup since the beginning of time (well, actually since 1916).

Data

To see what we are getting in exchange, let’s take a look at the table with the stats for the America Soccer Cup.

Team            Titles  Match  Points  Matches Played  Wins  Drawn  Losses  Goals scored  Goals against  Difference of Goals  Points  Performance
Argentina           14     41     398             189   120     38      31           455            173                 +282    2,11       70,19%
Uruguay             15     43     358             197   108     34      55           399            218                 +181    1,82       60,58%
Brasil               8     35     332             178    99     35      44           405            200                 +205    1,87       62,17%
Paraguay             2     36     225             168    62     39      67           253            293                  -40    1,34       44,64%
Chile                2     38     222             177    64     30      83           281            304                  -23    1,25       41,81%
Perú                 2     31     197             148    54     35      59           213            232                  -19    1,33       44,37%
Colombia             1     21     150             113    42     24      47           131            184                  -53    1,33       44,25%
Bolivia              1     26      86             112    20     26      66           104            279                 -175    0,77       25,60%
México               0     10      70              48    19     13      16            66             62                   +4    1,46       48,61%
Ecuador              0     27      70             118    16     22      80           127            311                 -184    0,59       19,77%
Venezuela            0     17      34              62     7     13      42            47            171                 -124    0,55       18,28%
Costa Rica           0      5      18              17     5      3       9            17             31                  -14    1,06       35,29%
Estados Unidos       0      4      17              18     5      2      11            18             29                  -11    0,94       31,48%
Honduras             0      1      10               6     3      1       2             7              5                   +2    1,67       55,55%
Panamá               0      1       3               3     1      0       2             4             10                   -6    1,00       33,33%
Japón                0      1       1               3     0      1       2             3              8                   -5    0,33       11,11%
Jamaica              0      2       0               6     0      0       6             0              9                   -9    0,00        0,00%
Haití                0      1       0               3     0      0       3             1             12                  -11    0,00        0,00%

As you can see, it is extremely hard to reach any conclusion quickly from this table.

Visualization

This is the visualization of the America Soccer Cup data. It makes it very simple to determine which teams are the best across the different tournaments, even though we lose accuracy because the numbers for each event are no longer shown.

Some conclusions we can draw after checking this visualization:

Argentina and Uruguay are the best teams across all the tournaments.

Argentina is the team with the most goals scored and the best goal difference.

Argentina, Brazil and Uruguay are the teams with the best performance.

There are three groups of countries with similar trajectories:

Argentina and Uruguay

Brazil, Peru, Chile, Paraguay, Bolivia and Colombia

The rest of the teams, with low performance, from Bolivia to Mexico



Technical implementation

To make it easier to implement any heatmap, I am going to split the code into sections and give a short explanation of each part. If you want to see all the code and the dataset used in this example, check my GitHub account.

1. Setup libraries

We will use two libraries: readr, to read a csv file (the dataset), and RColorBrewer, to use its color palettes.

library(readr)
library(RColorBrewer)

2. Get the data

The dataset is in my GitHub account because I prefer that my examples work out of the box (if you copy, paste and execute the example, the code should work).

A second benefit is that no matter what happens to the original dataset used in my example, I have a copy in my account.

# get data
url_soccer <- 'https://raw.githubusercontent.com/frm1789/soccer_ea/master/AmericaCupData.csv'
df_soccer <- read_csv(url(url_soccer))

3. Order by

Of all the data we have, the most relevant is the number of titles a team has. Everything else (goals, scoring power, matches won…) is subordinate to that.

# Order data by titles
df_soccer <- df_soccer[order(df_soccer$Titles, decreasing = FALSE),]
df_soccer <- data.frame(df_soccer)
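Several teams have the same number of titles, so it can help to see how order() treats ties. This is a small sketch on an invented data frame (the name df_toy and its values are hypothetical): with a single key, tied rows keep their original order, and a second key such as Points can break the tie. Whether a tie-breaker is wanted for this heatmap is a judgment call; the code above sorts by Titles only.

```r
# Toy data frame (values invented for illustration)
df_toy <- data.frame(Team   = c("A", "B", "C", "D"),
                     Titles = c(2, 0, 2, 14),
                     Points = c(225, 70, 222, 398))

# Ascending by Titles only: the tied teams A and C keep their original order
by_titles <- df_toy[order(df_toy$Titles), ]

# Ascending by Titles, then by Points to break the tie between A and C
by_titles_points <- df_toy[order(df_toy$Titles, df_toy$Points), ]

print(by_titles$Team)
print(by_titles_points$Team)
```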

3. Transformations

One main point to consider: the function heatmap requires a numeric matrix. For that reason, we will delete the columns we don’t need and convert the rest to numeric columns.

What does the data look like before the transformation?

    Team            Titles  Match  Points  Matches.Played  Wins  Drawn  Losses  Goals.scored  Goals.against  Difference.of.Goals  Points_1  Performance
 1  México               0     10      70              48    19     13      16            66             62                    4      1,46       48,61%
 2  Ecuador              0     27      70             118    16     22      80           127            311                 -184      0,59       19,77%
 3  Venezuela            0     17      34              62     7     13      42            47            171                 -124      0,55       18,28%
 4  Costa Rica           0      5      18              17     5      3       9            17             31                  -14      1,06       35,29%
 5  Estados Unidos       0      4      17              18     5      2      11            18             29                  -11      0,94       31,48%
 6  Honduras             0      1      10               6     3      1       2             7              5                    2      1,67       55,55%
 7  Panamá               0      1       3               3     1      0       2             4             10                   -6      1,00       33,33%
 8  Japón                0      1       1               3     0      1       2             3              8                   -5      0,33       11,11%
 9  Jamaica              0      2       0               6     0      0       6             0              9                   -9      0,00        0,00%
10  Haití                0      1       0               3     0      0       3             1             12                  -11      0,00        0,00%
11  Colombia             1     21     150             113    42     24      47           131            184                  -53      1,33       44,25%
12  Bolivia              1     26      86             112    20     26      66           104            279                 -175      0,77       25,60%
13  Paraguay             2     36     225             168    62     39      67           253            293                  -40      1,34       44,64%
14  Chile                2     38     222             177    64     30      83           281            304                  -23      1,25       41,81%
15  Perú                 2     31     197             148    54     35      59           213            232                  -19      1,33       44,37%
16  Brasil               8     35     332             178    99     35      44           405            200                  205      1,87       62,17%
17  Argentina           14     41     398             189   120     38      31           455            173                  282      2,11       70,19%
18  Uruguay             15     43     358             197   108     34      55           399            218                  181      1,82       60,58%
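To see why the conversion matters, here is a short sketch using two values from the stats table (the data frame df_text is a hypothetical stand-in for the real dataset). When data.matrix() meets a character column, it turns the text into factor codes instead of parsing it, so the numbers have to be converted first.

```r
# Two values taken from the stats table, still stored as text
df_text <- data.frame(Points_1 = c("2,11", "0,59"), stringsAsFactors = FALSE)

# data.matrix() does not parse the text: it returns factor codes
codes <- data.matrix(df_text)
print(codes)

# Converting the text first gives the real values
df_text$Points_1 <- as.numeric(sub(",", ".", df_text$Points_1, fixed = TRUE))
values <- data.matrix(df_text)
print(values)
```

The first matrix holds small integers that only reflect the alphabetical order of the strings, which would silently produce a misleading heatmap.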

Validations before changes

All the other columns in the dataset are numeric or integer; only Points_1 and Performance are not.

sapply(df_soccer, class)
(...)
#    Points_1
# "character"
# Performance
# "character"
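Instead of reading through the full sapply output, the columns that still need work can be listed directly. This is a minimal sketch on a hypothetical two-row data frame (df_check) that mimics the structure of the dataset; the values come from the stats table.

```r
# Small stand-in for the dataset (two rows taken from the stats table)
df_check <- data.frame(Titles      = c(14L, 15L),
                       Points_1    = c("2,11", "1,82"),
                       Performance = c("70,19%", "60,58%"),
                       stringsAsFactors = FALSE)

# TRUE for columns that are already numeric or integer
is_num <- vapply(df_check, is.numeric, logical(1))

# Names of the columns that still have to be converted
to_fix <- names(df_check)[!is_num]
print(to_fix)
```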

Code for changes

# heatmap requires a numeric matrix, so we move the team names to row.names
# and then delete the column "Team"
row.names(df_soccer) <- df_soccer$Team
df_soccer <- df_soccer[,-1]

# transformation to numeric for column "Points_1"
options(digits=2)
df_soccer$Points_1 <- sub(',', '.', df_soccer$Points_1)
df_soccer$Points_1 <- as.double(df_soccer$Points_1)

# transformation to numeric for column "Performance"
df_soccer$Performance <- substr(df_soccer$Performance,1,nchar(df_soccer$Performance)-1)
df_soccer$Performance <- sub(',', '.', df_soccer$Performance)
df_soccer$Performance <- as.double(df_soccer$Performance)
df_soccer$Performance <- log(df_soccer$Performance)

# Dataframe to matrix
america_matrix <- data.matrix(df_soccer)

What does the data look like after the transformation?

                Titles  Match  Points  Matches.Played  Wins  Drawn  Losses  Goals.scored  Goals.against  Difference.of.Goals  Points_1       Performance
México               0     10      70              48    19     13      16            66             62                    4      1.46  3.88382927105736
Ecuador              0     27      70             118    16     22      80           127            311                 -184      0.59  2.98416563718253
Venezuela            0     17      34              62     7     13      42            47            171                 -124      0.55    2.905807566026
Costa Rica           0      5      18              17     5      3       9            17             31                  -14      1.06  3.56359963768718
Estados Unidos       0      4      17              18     5      2      11            18             29                  -11      0.94   3.4493524235492
Honduras             0      1      10               6     3      1       2             7              5                    2      1.67  4.01728351608564
Panamá               0      1       3               3     1      0       2             4             10                   -6         1  3.50645789231965
Japón                0      1       1               3     0      1       2             3              8                   -5      0.33  2.40784560365154
Jamaica              0      2       0               6     0      0       6             0              9                   -9         0              -Inf
Haití                0      1       0               3     0      0       3             1             12                  -11         0              -Inf
Colombia             1     21     150             113    42     24      47           131            184                  -53      1.33  3.78985537145394
Bolivia              1     26      86             112    20     26      66           104            279                 -175      0.77  3.24259235148552
Paraguay             2     36     225             168    62     39      67           253            293                  -40      1.34  3.79863031807306
Chile                2     38     222             177    64     30      83           281            304                  -23      1.25  3.73313554536847
Perú                 2     31     197             148    54     35      59           213            232                  -19      1.33  3.79256356539082
Brasil               8     35     332             178    99     35      44           405            200                  205      1.87  4.12987256828125
Argentina           14     41     398             189   120     38      31           455            173                  282      2.11  4.25120585074233
Uruguay             15     43     358             197   108     34      55           399            218                  181      1.82  4.10396480559909
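The -Inf entries for Jamaica and Haití come from taking log(0), because both teams have a performance of 0,00%. If you prefer to avoid non-finite values, one possible alternative is log1p(), which computes log(1 + x). This is a sketch of a different transformation, not the one used in this post, and the helper to_number is a hypothetical name of my own.

```r
# Hypothetical helper: turn text such as "70,19%" into the number 70.19
to_number <- function(x) {
  x <- sub("%", "", x, fixed = TRUE)   # drop the percent sign, if any
  x <- sub(",", ".", x, fixed = TRUE)  # decimal comma to decimal point
  as.numeric(x)
}

# Two values from the stats table: Argentina and Jamaica
perf <- to_number(c("70,19%", "0,00%"))

plain_log <- log(perf)     # the zero becomes -Inf
safe_log  <- log1p(perf)   # log(1 + x): the zero stays at 0

print(plain_log)
print(safe_log)
```

Keep in mind that log1p changes the values slightly compared with log, so the two are not interchangeable if you need the exact numbers shown in the table.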

Validations after changes

We can see that every variable in our data frame is now either integer or, after the transformations, numeric.

sapply(df_soccer, class)
(...)
#    Points_1
#   "numeric"
# Performance
#   "numeric"

4. Creating a heatmap

We are using the function heatmap almost out of the box; the only additions are margins and colors.