Create effective data visualizations of proportions

Best ways to see individual contributions to a whole and changes over time, at various dataset sizes — (includes simple, visual demonstrations, code & data)

Various visualisations of proportions

Plotting proportions of a whole might be one of the most common tasks in data visualisation. Examples include regional differences in happiness, economic indicators or crime, demographic differences in voting patterns, income or spending, or contributions of parts of a business to its bottom line. Often, the data also describes changes over time, which may be months, quarters, years or decades.

Even though they all relate to proportions of a whole, there often isn’t a one-size-fits-all approach that would work for everything.

In this article, I describe what I think are effective techniques for communicating proportions of a whole, and also changes to them over time. I will also explore changes in charts’ effectiveness as the number of data points or series change.

As always, this article includes code examples so that you can follow along and create your own interesting data visualisations.

For the code, I am going to use the famous Gapminder dataset, and some data of basketball shot shares for the Toronto Raptors during last season. These are simply examples of datasets showing proportions, so you need not know anything about economics or basketball to follow along!

Before we get started

Data

I include the code and data in my GitLab repo here (viz_proportions directory). So please feel free to play with it / improve upon it.

Packages

I assume you’re familiar with Python. Even if you’re relatively new to it, this tutorial shouldn’t be too tricky.

You’ll need pandas and plotly. Install each (in your virtual environment) with a simple pip install [PACKAGE_NAME].

Visualising simple proportions

Load & Inspect Data

Handily, the plotly package provides a few toy data sets for us to play with, the Gapminder dataset being one of them. We can load it with:

import plotly.express as px

gap_df = px.data.gapminder()

Inspect the data with gap_df.info() and gap_df.head(), and we will see that it holds data for multiple countries in each year.

It includes population, and the GDP per capita — so let’s multiply the two to get the GDP data.

gap_df = gap_df.assign(gdp=gap_df['pop'] * gap_df['gdpPercap'])

Visualising single year data only

For the very first visualisations, let’s compare a few different types of charts. The initial data covers 142 countries, starting from 1952. Let’s simplify it by aggregating the data by continent, for the latest year only, into cont_df.

year_df = gap_df[gap_df.year == max(gap_df.year)]

cont_df = year_df.groupby('continent').agg({'gdp': 'sum'})

cont_df.reset_index(inplace=True)

Here, the dataframe is grouped by continent, and the index is then reset because a ‘flat’ dataframe is easier to deal with in Plotly Express.

The data can now be plotted using Plotly Express. The code to plot these basic graphs is very simple. Note that for the bubble chart, I add an arbitrary variable called dataType, simply so that it can be used to align the bubbles in the Y direction.
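Since the charts that follow are about proportions, it can help to make each group's share of the total explicit. The sketch below is not part of the original code: the helper name add_share and the share column are my own, and the toy dataframe uses made-up numbers rather than the real Gapminder values.

```python
import pandas as pd


def add_share(df, value_col, share_col="share"):
    """Return a copy of df with each row's fraction of the column total."""
    total = df[value_col].sum()
    return df.assign(**{share_col: df[value_col] / total})


# Toy data with made-up values (NOT the real Gapminder figures)
toy_df = pd.DataFrame({
    "continent": ["A", "B", "C"],
    "gdp": [50.0, 30.0, 20.0],
})
toy_shares = add_share(toy_df, "gdp")
print(toy_shares)
```

With the real data, add_share(cont_df, 'gdp') should give a share column that sums to 1, which could then be plotted in place of the raw gdp values.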

# Pie chart

fig = px.pie(cont_df, values='gdp', names='continent')

fig.show()

# Bar chart

fig = px.bar(cont_df, color='continent', x='continent', y='gdp')

fig.show()

# Horizontal bar chart - stacked

fig = px.bar(cont_df, color='continent', x='gdp', orientation='h')

fig.show()

# Bubble chart

fig = px.scatter(cont_df.assign(dataType='GDP'), color='continent', x='continent', y='dataType', size='gdp', size_max=50)

fig.show()

I have collected the results here:

A comparison of chart types for simple, proportional data

Apart from the column graph, none of these charts does well at indicating comparative sizes.

When data points are close in size, as the GDP data for Asia, the Americas and Europe are, stacked bar charts and pie charts do not allow easy comparisons between the data points, as each segment begins from a different reference point.

Pie charts are also problematic in that differences in angles are notoriously difficult to perceive accurately — so we will just ignore them going forward.

Bubble charts do slightly better, but because it is the area of each bubble, rather than its radius, that is proportional to the value, the differences in radii are smaller than the corresponding differences in column heights (the radius scales with the square root of the value).
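To put numbers on that square-root effect, here is a small sketch. It assumes the bubble's area is proportional to the value, which to my understanding is how Plotly Express sizes markers; the function name radius_ratio is my own.

```python
import math


def radius_ratio(value_ratio):
    """Ratio of bubble radii when bubble AREA is proportional to the value."""
    # area = pi * r**2, so r is proportional to sqrt(value)
    return math.sqrt(value_ratio)


for ratio in (2, 4, 25):
    print(f"value ratio {ratio}:1 -> radius ratio {radius_ratio(ratio):.2f}:1")
```

In other words, a value twice as large gets a bubble only about 1.4 times as wide, which is why close values are hard to tell apart in a bubble chart.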

What happens when we add a dimension of time?

Visualising data over time

For this portion, we will need a dataframe with multiple years’ worth of data. We could use the entire dataset, but let’s still keep it simple, with just a small number of years’ data.

The dataset contains multiple years, but not every year. We can use gap_df.year.unique() to see which years’ data are available, and choose the years after 1985, which gives five different years.

Our summary dataframe can be built as follows:

mul_yrs_df = gap_df[gap_df.year > 1985]

mul_yr_cont_df = mul_yrs_df.groupby(['continent', 'year']).agg({'gdp': 'sum'})

mul_yr_cont_df.reset_index(inplace=True)

The groupby method will create a multi-index dataframe, which is best thought of as having a nested index of (continent, year) (if you would like to learn more about hierarchical/multi-index dataframes, this is a great resource). The index is then flattened again before we plot the data.
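As an alternative sketch (not the original code), groupby also accepts as_index=False, which keeps the grouping keys as ordinary columns so that no reset_index call is needed. The toy dataframe below uses made-up values.

```python
import pandas as pd

# Toy data with made-up values (NOT the real Gapminder figures)
toy_df = pd.DataFrame({
    "continent": ["A", "A", "B", "B", "B"],
    "year": [2002, 2007, 2002, 2007, 2007],
    "gdp": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Default: the grouping keys become a (continent, year) MultiIndex
nested = toy_df.groupby(["continent", "year"]).agg({"gdp": "sum"})

# as_index=False: the keys stay as regular columns, giving a flat dataframe
flat = toy_df.groupby(["continent", "year"], as_index=False).agg({"gdp": "sum"})

print(nested.index.nlevels)  # number of index levels in the nested result
print(list(flat.columns))
```

Either route should end up with the same flat table; which one reads better is largely a matter of taste.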

# Bar chart

mul_yr_cont_df = mul_yr_cont_df.assign(yrstr=mul_yr_cont_df.year.astype(str))

fig = px.bar(mul_yr_cont_df, color='continent', y='gdp', x='yrstr', barmode='group')

fig.show()

# Horizontal bar chart - stacked

fig = px.bar(mul_yr_cont_df, color='continent', x='gdp', orientation='h', y='yrstr')

fig.show()

# Bubble chart

fig = px.scatter(mul_yr_cont_df, y='continent', x='yrstr', color='continent', size='gdp', size_max=50)

fig.show()

Grouped columns

Stacked bars

Bubble chart

With these, the previous observations still hold true with regard to the ease with which relative proportions can be seen in bar graphs. But new observations can be made with the addition of another dimension to the data.

The stacked bar comes into its own in being able to demonstrate changes in the overall total, although the sizes of each series are still very difficult to compare.

As far as grouped bar charts go, the choice of grouping along the x-axis becomes important, as comparisons between different groups become more difficult, while comparisons within the same group remain easy. Try plotting and comparing these two:
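If the relative shares matter more than the totals, one option is to normalise each year to 100% before stacking. The sketch below is my own addition rather than the original code: the share_pct column name is an assumption, and the toy dataframe uses made-up values.

```python
import pandas as pd

# Toy data with made-up values (NOT the real Gapminder figures)
toy_df = pd.DataFrame({
    "continent": ["A", "B", "A", "B"],
    "yrstr": ["2002", "2002", "2007", "2007"],
    "gdp": [30.0, 70.0, 100.0, 100.0],
})

# Total per year, broadcast back to every row of that year
year_total = toy_df.groupby("yrstr")["gdp"].transform("sum")

# Each continent's percentage share within its own year
toy_df = toy_df.assign(share_pct=100 * toy_df["gdp"] / year_total)
print(toy_df)
```

With the real data, the same two steps could be applied to mul_yr_cont_df and share_pct passed as the x value of the stacked horizontal bar chart, so that every bar has the same length. The trade-off is that the changes in the overall total are no longer visible.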

fig = px.bar(mul_yr_cont_df, color='continent', y='gdp', x='yrstr', barmode='group')

fig.show()

fig = px.bar(mul_yr_cont_df, color='yrstr', y='gdp', x='continent', barmode='group')

fig.show()

Prioritising comparisons across continents

Prioritising comparisons across years

In the top figure, comparisons across continents are prioritised to the detriment of comparisons across years, and vice versa in the bottom figure.

For comparisons across both axes, the bubble chart lays out the data in a grid, which makes it easier to compare changes across both dimensions.

If the size variations in bubble charts are not perceptible enough, a grid of bar chart subplots might work better:

fig = px.bar(mul_yr_cont_df, color='continent', facet_col='continent', x='gdp', orientation='h', facet_row='yrstr')

fig.update_yaxes(showticklabels=False)

fig.show()

Gridded subplots

That’s great. But often, there are more than five data points along each axis. So, what happens if the number of series is increased?

Visualising larger data sets

Let’s repeat the plots, for all data across the 12 years of the dataset (code not shown here for brevity — see the git repo).

Here, we can already start to see some space limitations with grouped bar graphs. Comparisons across different groups are also becoming more difficult, as individual bars get lost in a forest of towering colour bars, and the relative changes in the adjacent bars play tricks with our minds.

Although this dataset visually scales relatively well with size, helped by the increases in GDP with time, it is easy to see the limitations.

As a last point, let’s take a look at a larger dataset that is less ordered.

Bonus plots: Basketball shot shares (2019 Toronto Raptors)

Over the course of an NBA season of 82 games, each with 48 minutes of regulation time, a team takes about 7000 shots. In the case of the Toronto Raptors, 14 players each took at least 1% of those shots. The resulting 15 ‘players’ (14 + 1 ‘Others’) are shown below, with the data split into 2-minute segments.

This dataset is also slightly different from the above, as I will be showing distributions of percentages.

What do they look like in each of the above plot types?

Grouped bar chart

Stacked bar chart

Gridded bar subplots

Bubble chart

As the data points become more numerous across both dimensions, and as the ratios of maximum to minimum sizes increase, the bubble chart in a grid comes into its own.

In the bubble chart, what would have been a 1 to 25 change in height in a bar graph becomes a 1 to 5 change in radius. Translating a change in value into a change in radius in effect compresses the visual disparity and helps display larger ranges. It (along with the bar subplot approach) also has the added advantage of being able to show a fourth variable as the colour, as the spatial location already specifies two of the dimensions.

As you might be able to tell, I prefer this type of subplot approach as the dataset gets larger. But as you saw earlier, in other situations, other types of visualisations work far better.

So — what to choose?

You might have seen this coming, but in my view, there is no ‘one-size-fits-all’ approach. And I haven’t even begun to scratch the surface of this area in visualising proportions — of scaling, different approaches to using colours, symbol types, utilising overall distribution shapes and highlighting proportions, etc.

But I hope that these examples were at least useful to you in seeing different types of visualisations, and how their efficacy might change over sample sizes. I find that there’s nothing quite like practicing to really learn data visualisation and get better at it. So pick a dataset and go for it — the more familiar you are with the subject area, the better.