Interactive climate data visualizations with Python & Plotly

Visualising time-series data using plotly with bar (column) graphs and subplots (source code & data in my GitLab repo)

Original photo by Kenrick Mills on Unsplash

Bar graphs (or column graphs, to be exact) are very effective forms of data visualisation. They’re perceptually great, and usually don’t require as much explanation as some of the more unusual plots do. Subplots are not often talked about, but they can be very powerful and effective. Used in a certain way, they allow us to lay out data in four dimensions, as two-dimensional charts can themselves be arranged in a two-dimensional grid. Sometimes, they just allow us to nicely lay out multiple graphs in one figure.

I have recently been looking at temperature data from the Bureau of Meteorology (BOM) in Australia. The BOM has a great deal of high-quality data, with temperature observations that go back as far as 1910 at many weather stations.

Here, I will share how to create plots like these using Python and Plotly / Plotly Express, using this temperature data set.

As usual, I included the code for this in my GitLab repo here (climate_data directory), so please feel free to download it and play with it / improve upon it.

Before we get started

Data

The original data comes from the BOM’s ACORN-SAT dataset. This is a high-quality dataset, and I didn’t have to do much pre-processing at all. Nonetheless, the data set is relatively large with daily observations from over a hundred sites for a hundred years — so I provide a processed dataset file in the git repo.

Packages

I assume you’re familiar with Python. But even if you’re relatively new, this tutorial shouldn’t be too tricky. Feel free to reach out on Twitter or here if you’re not sure about something.

You’ll need plotly and pandas. Install them (in your virtual environment) with a simple pip install [PACKAGE NAME] .

Bar charts with Plotly Express

Data

The dataset includes data from multiple weather observation stations. To get a feel for the data, let’s load it, take a quick look, and then plot just one station’s worth of data.

As usual, load the data with:

import pandas as pd

flat_avg_df = pd.read_csv('climate_data/srcdata/flat_avg_data.csv', index_col=0)

The index_col parameter specifies to pandas which column is to be used as the index.

Loading CSV files without this parameter can lead to duplication of index columns, resulting in outputs like this. Notice the duplicated index column, where the index column has been saved as ‘Unnamed: 0’ .

Notice the duplicated index column
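To see the effect in isolation, here is a minimal sketch that round-trips a tiny, made-up dataframe through CSV text, with and without index_col. The column values are placeholders, not figures from the real dataset.

```python
import io

import pandas as pd

# Tiny stand-in frame; the values are made up for illustration only.
df = pd.DataFrame({"year": [1910, 1911], "avg_temp_C": [19.1, 19.4]})

# to_csv writes the index as an unnamed first column by default.
csv_text = df.to_csv()

# Without index_col, that column comes back as data named 'Unnamed: 0'.
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is restored as the index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(without_index.columns))
print(list(with_index.columns))
```

The first print shows the extra 'Unnamed: 0' column, while the second shows only the two original columns.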

Other than that, the columns should be straightforward. avg_temp_C shows the annual average temperature, rel_avg_temp_C is a relative figure vs the median value for the site, and I also included the site name string, year and datatype. datatype is either tmax or tmin , indicating whether the value relates to the daily maximum or minimum temperature.

Simple bar plot

For now, let’s just plot the tmax data from one site. We are going to filter the dataframe based on the site name (to simple_df ), and pass it to Plotly Express.

import plotly.express as px

simple_df = flat_avg_df[(flat_avg_df.site_name == 'ALBANY AIRPORT') & (flat_avg_df.datatype == 'tmax')]

fig = px.bar(simple_df, x='year', y='rel_avg_temp_C', color='rel_avg_temp_C')

fig.show()

Just like that, we can see the annual relative temperatures measured at Albany Airport.

It shows an increasing trend since the measurements began in 1910. But is this occurring everywhere? What if we picked multiple locations? How does that look?

Subplots

With Plotly Express

Subplots allow you to include multiple plots in one figure. Plotly allows creation of subplots with either Plotly Express or classic Plotly.

As you might expect, Plotly Express is quicker and easier to use, albeit with less control. Let’s get started with that.

Plotly Express’ subplot functions are based on its facet_row and facet_col parameters, allowing creation of subplots using these categorical variables. Let’s put data from different sites onto rows, and tmin / tmax values onto columns.

Running len(flat_avg_df.site_name.unique()) tells us that there are 112 unique names, which is probably too many rows to look at for now. I’m going to just pick the first 5 names, and plot the data like so:

site_names = flat_avg_df.site_name.unique()

short_df = flat_avg_df[flat_avg_df.site_name.isin(site_names[:5])]

fig = px.bar(short_df, x='year', y='rel_avg_temp_C', color='rel_avg_temp_C', facet_row='site_name', facet_col='datatype')

fig.show()

Isn’t that fantastically efficient!

X-axes are shared and nicely aligned, which allows us to quickly see which data points are missing. For example, data in the bottom right figure (from Scone Airport) probably only begins in 1963. Our Y-axis scales and colour contour scales are also uniform for easy comparisons.

Sure, it’s visually a little messy, but remember, we created this in just 4 lines of code. We can clean it up, but we won’t do that just now. I want to highlight how quickly we can visualise and compare qualities and properties of the datasets, as well as the data itself.

One limitation of this method is that subplot rows and columns need to be tied to a particular categorical variable. To plot data from ten different sites using Plotly Express, I could create a new column, say called subplot_cols , assign values (like 1 or 2 ) based on which plots would be put onto which column, and pass the parameter facet_col='subplot_cols' . I would also have to do the same thing with rows.

This is fine, but it’s really a workaround rather than using the feature as intended. After all, Plotly Express is for exploratory analysis and for very well-organised data. It’s not the most flexible of tools.

So, let’s look next at generating subplots with regular Plotly, which will give us more flexibility.

With (classic) Plotly

The very minimum syntax for creating subplots in Plotly is as follows:

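from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=n_rows, cols=n_cols)  # Create subplot grid

# Add a trace and assign it to a grid cell (repeat for each subplot)
fig.add_trace(
    go.Bar(
        x=[X_DATA],
        y=[Y_DATA],
    ),
    row=[SUBPLOT_ROW], col=[SUBPLOT_COL],
)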

You will see that the add_trace function needs to be called for each subplot used. Preferably, traces are added using loops rather than manually, especially as the number of subplots grows.

Let’s start small. We’ll create a 2 by 2 subplot grid, and plot data from the first four stations (by station name), only plotting the ‘tmax’ values.

We saw earlier that some stations were clearly missing data from some years. It didn’t matter above as Plotly Express took care of dealing with the missing data by omission, but here we cannot do that. We are going to be passing a list or an array as the Y values, and skipping over values will (at best) lead to misalignment in data.

As we are dealing with relative values, let’s fill in the missing data with zeroes. (It’s debatable whether this is the optimal thing to do, but we can discuss that on another day.) We have to remember to do this as we pass the data to the plot.

So, let’s loop over each subplot, in each iteration a) collecting the Y data (temperature) for the years to be plotted, and b) passing that data to Plotly from within the loop. It’s relatively straightforward; take a look at this code snippet:

Here, I create a list of names ( site_names ) & a list of years ( year_list ). I sort the year list, and then loop over the number of subplots ( 4 ), with an inner loop going over the year, and when there is no data for that year, I simply specify a value of zero.

Finally, I use the data to add a simple trace to our figure, using the .add_trace method. I’ve parameterised the row & column number (the +1 at the end ensures that the numbers start at 1 and 1, rather than 0 and 0).

More work =/= always better subplots, unfortunately.

So, unfortunately I’d say that this looks significantly worse. We have plotted the same data, in the same format, but the colours are now meaningless, trace names are gone, and we’ve used about 25 lines of code. D’oh.

But let’s keep going — it gets better, I promise.

Simply adding subplot titles, marker specifications, adding common y-axis parameters and updating the layout to hide the meaningless legends does wonders:

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=site_names
)

go.Bar(
    ...
    marker=dict(color=temp_vals, colorscale='RdYlBu_r'),
),

fig.update_yaxes(tickvals=[-2, 0, 2], range=[-2.5, 2.5], fixedrange=True)

fig.update_layout(showlegend=False)

Look how much better this is!

Next, let’s just extend this concept and build our next chart, with as many subplots as we’d like.

The modifications look lengthy, but they’re really not. What I’ve done is to:

parameterise the number of rows & columns ( subplot_rows = … ) in our subplot (so that I can change it easily),

parameterise the height & width of the figure ( height=120 * subplot_rows, width=230 * subplot_cols ), to keep subplot sizes consistent, rather than shrinking or enlarging as the number of rows or columns changes,

order the station names ( names_by_obs = … ) by number of observations, so we can prioritise plotting the ones with the most data,

reduce spacing between subplots ( horizontal_spacing=… ), and

reduce font sizes ( font=dict(...) ).

Sexy subplots!

Edit: An earlier version of this writeup used the code below to set the subplot title fonts, as I was not aware that there was a method for accessing these:

for i in fig['layout']['annotations']:

i['font'] = dict(size=10, color='#404040')

Instead, you can use this:

fig.update_annotations(patch=dict(font=dict(size=10, color='#404040')))

If you *do* need to access any of these properties manually, though, the fig object allows you to do that quite easily, which is great.

Stacked bar charts with Plotly Express

This is going to blow your mind. I assume that you still have the main dataframe loaded in memory. Okay. Ready? Just run this bit of code:

fig = px.bar(flat_avg_df, x='year', y='rel_avg_temp_C', color='rel_avg_temp_C')

fig.show()

I was just giddy when I saw this plot. Yes, it’s not perfect. But in just two lines of code, I am able to visualise the entire annual dataset, showing aggregate yearly trends in temperature variations, from 112 stations.

Just for comparison, I did recreate this in regular Plotly and it took something like 30–40 lines of code, mostly in wrangling data and formatting. (I didn’t include it here.)

In this graph, I am stacking the tmax and tmin values, which is not ideal. Let’s separate them. Also, since we are plotting temperatures, let’s stick to the red=hot and blue=cold convention, ensuring that the midpoint of the color scale is at zero. (You’ll notice above that zero doesn’t quite line up with the midpoint of the scale.)

Now with delicious subplots!

Honestly, I am incredibly happy with how this looks. But there are a couple of big issues: I am stacking up temperatures to numbers that don’t make any sense, and while Plotly Express deals with missing data, some of these stacks are made up of more observations than others, which skews the output.

Data visualisation isn’t just about making pretty pictures; it’s about communicating information to readers. We can do better. Let’s normalise the temperatures to indicate a reasonable value, like the sums of hot/cold variations across weather observation stations, in each year.

I created a new column in the dataframe, where the temperature value is divided by the number of samples from that year:

Once that’s been completed, we can simply plot it the same as before, with the only change being y=’norm_rel_avg_temp_C’ .
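The normalisation step described above could be sketched as follows. The helper name add_normalised_column is hypothetical, and counting samples per year and per datatype (rather than per year only) is my assumption, made so that the tmax and tmin stacks are each scaled by their own number of observations.

```python
import pandas as pd


def add_normalised_column(df):
    """Return a copy of df with rel_avg_temp_C divided by the sample count."""
    out = df.copy()
    # Number of observations contributing to each (year, datatype) stack.
    n_obs = out.groupby(["year", "datatype"])["rel_avg_temp_C"].transform("count")
    out["norm_rel_avg_temp_C"] = out["rel_avg_temp_C"] / n_obs
    return out


# Made-up stand-in data: two tmax samples in 1910, one in 1911.
demo_df = pd.DataFrame({
    "site_name": ["A", "B", "A", "A"],
    "year": [1910, 1910, 1911, 1910],
    "datatype": ["tmax", "tmax", "tmax", "tmin"],
    "rel_avg_temp_C": [0.4, -0.2, 0.3, 0.5],
})
norm_df = add_normalised_column(demo_df)
print(norm_df[["year", "datatype", "norm_rel_avg_temp_C"]])
```

Because each bar segment is divided by the number of samples in its stack, the stacked total for a year becomes the mean of the relative temperatures rather than their sum.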

You can see that the figures towards the left side of the image (further in the past) have been scaled up to account for the fact that there are fewer samples.

As a final touch, we include formatting elements — figure title, axis title, legend title, and annotation. The legend is resized with a border, and scaled down. This is the result:

Nice, isn’t it? I’m pretty happy with the results. Here is the interactive version.

Bar plots are very important tools in a data visualisation toolbox. Bar plots allow easy comparisons of values, grouping values together when needed, and are easily understood for their familiarity.

Subplots are not as critical, but have high utility. Anyone who’s ever looked at pairplots can attest to this. I personally also find that I often have data that is best visualised divided up, and often want to use subplots. Plotly Express’ ability to generate subplots from categorical features saves a lot of time, as it reduces the effort required to visualise the effects of certain categorical variables.

Regular Plotly’s subplot features are more powerful, if also more verbose. They allow things like manipulating grid sizes, so if you are interested in further customised subplot layouts, you may wish to explore that option.

I hope the above was useful in showing you the kinds of good-looking visualisations that can be easily created using Plotly / Plotly Express. Try playing with the parameters: the number of columns, rows, colormaps, font sizes, whatever. I find that I learn a lot that way.

As ever, hit me up if you have any questions or comments.