Using Rodeo To Transform Olympics Data Into GIFs by Emily | August 15, 2016

Going For Gold

Like many people around the globe, I've spent the past week and a half pretty captivated by the biggest sporting spectacle in the world, the Olympics.

As I watched each genetic superhuman rack up medals while consuming some combination of pizza and beer on my couch, I noticed that certain countries seemed to stockpile wins each year while leaving others in the dust, and wondered how that had changed over time. Even better, would it be possible to find data on this and create some kind of visual from it? Turns out - It is!

Here's how I used pandas , cartopy , matplotlib.pyplot and images2gif to visualize olympic medal count data from past to present on a map of the globe, and then turn those images into a GIF.

All of this was done using Rodeo - a data science IDE created right here at Yhat.

A Preview Of What We'll Be Making!

We Need Olympics Data

To get started, we need data. Luckily, The Guardian's DataBlog posted an article a few years ago which provides a spreadsheet with all Summer Olympic medal winners from 1896 to 2008.

Here's what that dataset looks like:





City Edition Sport Discipline Athlete NOC Gender Event Event_gender Medal Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men 100m freestyle M Gold Athens 1896 Aquatics Swimming HERSCHMANN, Otto AUT Men 100m freestyle M Silver Athens 1896 Aquatics Swimming DRIVAS, Dimitrios GRE Men 100m freestyle for sailors M Bronze Athens 1896 Aquatics Swimming MALOKINIS, Ioannis GRE Men 100m freestyle for sailors M Gold Athens 1896 Aquatics Swimming CHASAPIS, Spiridon GRE Men 100m freestyle for sailors M Silver



We're going to download the Guardian DataBlog file and use the "ALL MEDALISTS" and "Country Codes" tabs for our data. pandas will help us organize and sort our data before start mapping.

Our final dataframe is going to need both country codes and country names, so let's first merge that "country_codes" CSV with our olympics data:

import os import pandas as pd filename = os.path.join("./olympics.csv") df = pd.read_csv(filename) cc_file = os.path.join("./country_codes.csv") cc_df = pd.read_csv(cc_file) df = df.merge(cc_df, on=['NOC'], how='left').rename(columns = {'country':'Country'})

The Olympics Started Out Weird

If at this point you're mostly distracted by "100m freestyle for sailors", and wondering what that was/why Greece seemed to be AMAZING at it -- have no fear.

We're on the same page, and I've already sailed down this rabbit hole for you.

The men's sailors 100 metre freestyle was one of the four swimming events on the Swimming at the 1896 Summer Olympics programme. The second swimming race was open only to sailors of the Greek Royal Navy. Eleven entered but only three actually took part in the event. wikipedia.org

That's right. Greece won all 3 medals for this event because literally only Greek sailors could enter, and exactly 3 of them.

Some would call this an unfair swimming advantage.

On the other hand, the USA is currently allowed to compete with 2 actual live dolphins on their team, so we'll just call it even.

WHAT DO YOU TWO EVEN EAT?!

Anyway. Back to business.

Grouping And Sorting

This part was a little tricky, because we don't want to just plot every medal winner from every country - this would hugely throw off our data.

For example, if an individual American swimmer wins a gold medal, and the American rowing team wins silver, this should be counted as 1 gold (swimmer) and 1 silver (rowing team) for the country. It should not be counted as 1 gold (swimmer), and n silvers (n = however many people fit on rowing boats. I feel like they're big boats. I may be thinking of vikings.)

To make sure our medal counts do their job, we're going to clip our data to just the columns we need, and then drop the duplicates to get rid of multi-player teams while keeping individual medals intact:

df = df[['Edition', 'City', 'Event','NOC','Country','Medal']] df = df.drop_duplicates() df.head()

Edition City Event NOC Medal 1896 Athens 100m freestyle HUN Gold 1896 Athens 100m freestyle AUT Silver 1896 Athens 100m freestyle for sailors GRE Bronze 1896 Athens 100m freestyle for sailors GRE Gold 1896 Athens 100m freestyle for sailors GRE Silver

We're then going to set a "Count" column using the transform method, and then sort our data by year and medal count so it will be easier to work with once we get to plotting.

df['Count'] = df.groupby(['Edition','NOC','Country','Event', 'Medal'])['Medal'].transform('count') df = df[['Edition', 'City', 'NOC', 'Country','Count']].groupby(['Edition', 'City','NOC','Country']).count().reset_index() df = df.sort(['Edition', 'Count'], ascending=[1, 0]) df.head()

Success!:

Edition City NOC Country Count 1896 Athens GRE Greece 45 1896 Athens USA United States 19 1896 Athens GER Germany 13 1896 Athens FRA France 11 1896 Athens GBR United Kingdom 7

.... Weren't There Olympics In 2012?

Unfortunately for us, our data was posted in 2012 BEFORE the Summer Olympic games that year, so we will have to grab the 2012 data manually. Please hold...

OK! With a little help from a 2012 medal tally table via Topend Sports, Panda's read_html method, and some reformatting gymnastics, we're back in business:

import pandas as pd last_ol = pd.read_html("http://www.topendsports.com/events/summer/medal-tally/2012.htm",header = 0)[0].rename(columns = {'Total':'Count'}).replace({'USA': 'United States'}, regex=True).replace({'Great Britain': 'United Kingdom'}, regex=True) last_ol = last_ol.merge(cc_df, on=['Country'], how='left') last_ol['Edition'] = pd.Series("2012", index=last_ol.index) last_ol['City'] = pd.Series("London", index=last_ol.index) last_ol = last_ol[['Edition', 'City', 'NOC', 'Country','Count']] last_ol = last_ol.sort(['Edition', 'Count'], ascending=[1, 0])

Edition City NOC Country Count 2012 London USA United States 104 2012 London CHN China 88 2012 London RUS Russia 82 2012 London GBR United Kingdom 65 2012 London GER Germany 44

Now let's throw that onto our data set, and we should be good to go.

df = pd.concat([df, last_ol])

ONWARD!!

Time To Plot

We'll be using cartopy to visualize our data on a map, and matplotlib.pyplot for the actual plotting of the map.

Cartopy provides a huge selection of projections for easily creating maps. We're going to go with the simple PlateCarree method which gives us a nice, clear view of the globe:

Now for the tough part. We're going to cycle through each Olympic year, as well as each country in shpreader 's natural_earth shapefile, grabbing data and plotting along the way..

import matplotlib.pyplot as plt import cartopy.io.shapereader as shpreader import cartopy.crs as ccrs maps = [] for _, olympics in df.groupby('Edition'): # set the figure size plt.figure(figsize=(12,16)) # slice the top 10 medal winners from the group olympics = olympics.head(10) # create our globe projection, adding our boundaries and country borders along the way ax = plt.axes(projection=ccrs.PlateCarree()) ax.add_feature(cartopy.feature.LAND) ax.set_extent([-145, 41, -5, 35]) shpfilename = shpreader.natural_earth(resolution='110m', category='cultural', name='admin_0_countries') reader = shpreader.Reader(shpfilename) # loop through each country, and add some orange color if we find it in our Top 10 from that year. If we # don't find it, we'll turn it blue instead. countries = reader.records() for country in countries: print country.attributes['adm0_a3'] if any(olympics.NOC == country.attributes['adm0_a3']) or any(olympics.Country == country.attributes['geounit']): ax.add_geometries(country.geometry, ccrs.PlateCarree(), facecolor=(0.93, 0.325, 0.0666), label=country.attributes['adm0_a3']) else: ax.add_geometries(country.geometry, ccrs.PlateCarree(), facecolor=(0.21, 0.58, 0.84), label=country.attributes['adm0_a3']) # add a blank row to keep consistent table size (1904 had only 9 competing countries) if year == "1904": olympics.loc[len(olympics)] = ['1904', 'St Louis', '', '', ''] #add the Olympics Edition as our title year = str(olympics['Edition'].iloc[0]) plt.title(year, fontsize=25) # add our table, format it, and place it on the bottom of the map table_vals= olympics[['Country','Count']].head(10).values col_labels=['Country','Medal Count'] the_table = plt.table(cellText=table_vals, colWidths = [0.5, 0.5], colLabels=col_labels, cellLoc='center', loc='bottom',) the_table.set_fontsize(14) the_table.scale(1, 5) # save each figure as a png and add it to an array (we'll come back to this later) pic_file = "medal_count_" + year + ".png" maps.append(pic_file) plt.savefig(pic_file, bbox_inches='tight') # take a look at your map! plt.show()

medal_count_1896.png

Make It A GIF!

Converting all of these PNGs to a single GIF on our own is a little tricky, but luckily there is a module in the echomesh repo called images2gif that provides a helpful writeGif function, which takes an array of images and exports them as a GIF.

We'll use the Python Imaging Library (PIL), along with numpy to convert our files into a suitable "GIF-able" format, throw them into an array, and then pass that array into the writeGif function.

*There is one caveat with using the writeGif function. All of the images used to create the final GIF must have the exact same dimensions. Our 1904 winners only have 9 countries listed, which made the resulting PNG slightly shorter than the others, and therefore unpassable to the function. We fixed this by simply adding a blank row to that table.

from images2gif import writeGif from PIL import Image import numpy as np gifs = [] for file in maps: im = Image.open(file) im = np.asarray(im) imgs.append(im) writeGif("olympics_globe.gif", gifs, duration=0.9, dither=0 )

TAH DAH!!

olympics_globe.gif