What Software Engineers Earn Compared to the General Population

In this notebook we'll compare the median annual income of software engineers to the average annual income (GDP per Capita) in 50 countries. It's shown how to scrape the data from a web page using lxml, turn it into a Pandas dataframe, clean it up and create scatter and bar plots with matplotlib to visualize the general trend and see which countries are the best and worst for software engineers based on how much they earn compared to the average person.

The data comes from PayScale and the International Monetary Fund and was published by Bloomberg in May 2014. It includes figures for 50 countries for which data was most available to PayScale. The software engineer figures represent income data collected from May 1, 2013, to May 1, 2014, and use exchange rates from May 5, 2014. The median years of work experience for survey respondents from each country range from two to five years.

Setup

First we load the necessary libraries, set the plot style and some variables, including a geonamescache object for adding country codes used to link geographic data with income figures in this interactive map.

%load_ext signature %matplotlib inline import os import pandas as pd import matplotlib as mpl import matplotlib.pyplot as plt import numpy as np import geonamescache from lxml import html mpl.style.use('ramiro') data_dir = os.path.expanduser('~/data') gc = geonamescache.GeonamesCache() chartinfo = '''Figures represent income data from May 2013 to May 2014 using exchange rates from May 2014. Average annual income figures are for 2014. Author: Ramiro Gómez - ramiro.org • Data: Bloomberg/PayScale - bloomberg.com/visual-data/best-and-worst/highest-paid-software-engineers-countries'''

Data retrieval and cleanup

To scrape the data from the web page, I looked at the source to determine a way to identify the data table. It is the only table on the page with a class of hid so the xpath expression below can be used to extract the table from the HTML source after it was loaded.

url ='http://www.bloomberg.com/visual-data/best-and-worst/highest-paid-software-engineers-countries' xpath = '//table[@class="hid"]' tree = html.parse(url) table = tree.xpath(xpath)[0] raw_html = html.tostring(table)

Pandas makes it easy to turn this raw HTML string into a dataframe. We instruct it to use the Rank column as the index and the first row as the header. The read_html function returns a list, in our case of one dataframe object, so we just grab this and print the first few rows

df = pd.read_html(raw_html, header=0, index_col=0)[0] df.head()

Country Ratio of median software engineer pay to average income Median annual pay for software engineers Average annual income Rank 1 Pakistan t 5.56 $7,200 $1,296 2 India t 3.91 $6,200 $1,584 3 South Africa t 3.64 $24,000 $6,595 4 Bulgaria t 3.28 $25,200 $7,682 5 China t 3.15 $23,100 $7,333

and the data types that were automatically determined by Pandas.

df.dtypes

Country object Ratio of median software engineer pay to average income float64 Median annual pay for software engineers object Average annual income object dtype: object

The output above shows, that we need to do some cleanup before continuing with further exploration. The values in the Country column all end in a space followed by a t, which we just remove. Also we need to turn the dollar amounts into a numeric type for use in calculations and plots.

df['Country'] = df['Country'].apply(lambda x: x.rstrip('t').strip()) for col in df.columns[2:]: df[col] = pd.to_numeric(df[col].apply(lambda x: x.lstrip('$').replace(',', ''))) df.dtypes

Country object Ratio of median software engineer pay to average income float64 Median annual pay for software engineers int64 Average annual income int64 dtype: object

This looks better now. As a sanity check we can test whether the ratio that is contained in the data agrees with the income numbers. To do so we calculate the ratio ourselves and compare it to the existing one.

ratio = round(df['Median annual pay for software engineers'] / df['Average annual income'], 2) all(ratio == df['Ratio of median software engineer pay to average income'])

True

Exploration and visualization

Income comparison

To get a holistic view we first create a scatter plot with all 50 records showing the median software engineer income on the X axis and the annual average of the whole population on the Y axis. We also draw a quadratic polynomial fitting curve to see if we can spot a trend.

x = 'Median annual pay for software engineers' y = 'Average annual income' title = 'Median annual income of software engineers vs. general average in 50 countries' fig = plt.figure(figsize=(11, 9)) ax = fig.add_subplot(111) ax.plot(x, y, '.', data=df, ms=10, alpha=.7) ax.set_title(title, fontsize=20, y=1.04) ax.set_xlabel(x) ax.set_ylim(bottom=0, top=101000) ax.set_ylabel(y) # Polynomial curve fitting # http://docs.scipy.org/doc/numpy/reference/routines.polynomials.classes.html polynomial = np.polynomial.Polynomial.fit(df[x], df[y], 2) xp = np.linspace(0, 120000, 100) yp = polynomial(xp) ax.plot(xp, yp, '-', lw=1, alpha=.5) fig.text(0, -.04 , chartinfo, fontsize=11) plt.show()

Before interpreting this and the following plots, I'll point out a few issues with the data. We do not know how many respondents took part in the PayScale survey and we do know that their work experiences range from two to five years. Whether this sample is a good representation for the income of software engineers in the respective countries is questionable.

Moreover, we compare median annual values for software engineers with mean annual values for the general population. Considering how large the income share of top earners is in several countries, annual median values for the general population might well show a different picture.

With this in mind, software engineer looks like a good career choice in the majority of the 50 countries. In some of the countries we see in the lower left, mid-level software engineers earn multiples of what the average person does. But there are also countries, where software engineers earn a lot less. To find out which are the best and worst countries for software engineers with respect to income, we'll plot rankings in form of bar plots next.

Country rankings

In the following bar plot countries are ranked by the ratio of median software engineer pay to average income. Since our dataframe is already ordered by the ratio from high to low, we can simply use the head and tail methods to get the slices we want to show.

col = 'Ratio of median software engineer pay to average income' title = 'Best and worst countries ranked by ratio of median software engineer pay to average income' limit = 10 best = df.head(limit)[::-1] worst = df.tail(limit) ticks = np.arange(limit)

Now we create a plot consisting of two bar charts showing the best countries for software engineers on the left and the worst on the right based on income ratio.

fig = plt.figure(figsize=(14, 5)) fig.suptitle(title, fontsize=20) ax1 = fig.add_subplot(1, 2, 1) ax1.barh(ticks, best[col], alpha=.5, color='#00ff00') ax1.set_yticks(ticks) ax1.set_yticklabels(best['Country'].values, fontsize=15, va='bottom') ax2 = fig.add_subplot(1, 2, 2) ax2.barh(ticks, worst[col], alpha=.5, color='#ff0000') ax2.set_yticks(ticks) ax2.set_yticklabels(worst['Country'].values, fontsize=15, va='bottom') fig.text(0, -.07, chartinfo, fontsize=12) plt.show()

This chart again shows that the income differences are huge in some countries and that software engineers are likely to earn more. In our dataset Pakistan and India are the two countries with the lowest average annual income and Qatar has the 2nd highest average income after Norway. I assume that the income distribution in countries with a high ratio is skewed towards lower incomes and in countries with low ratios towards higher incomes. Again, it'd be interesting to compare medians to medians and not medians to means, but we'd have to find another data source for such a comparison.

Adding country codes

In this last short section, I'll show how to add ISO 3 country codes to our dataset in order to visualize the data in an interactive D3.js based map.

df_map = df.copy() names = gc.get_countries_by_names() df_map['iso3'] = df_map['Country'].apply(lambda x: names[x]['iso3']) df.head(5)

Country Ratio of median software engineer pay to average income Median annual pay for software engineers Average annual income Rank 1 Pakistan 5.56 7200 1296 2 India 3.91 6200 1584 3 South Africa 3.64 24000 6595 4 Bulgaria 3.28 25200 7682 5 China 3.15 23100 7333

Here we use the geonamescache object initialized in the beginning to get a dictionary of countries keyed by names. The values are dictionaries as well, which, among other things, contain ISO 3 country codes. Finally, to save a few bytes the country column is removed and the dataframe saved as a CSV file.

del df_map['Country'] df_map.to_csv(data_dir + '/economy/income-software-engineers-countries.csv', encoding='utf-8', index=False)

Summary

In this notebook we looked at income data for software engineers in 50 countries and compared their earnings to the general population. In the process we scraped the dataset from the source web page, cleaned it up, visualized it and interpreted the results pointing out potential issues with the dataset and methodology.

While software engineer seems to be a good career choice in most of the countries, keep the caveats in mind before you start making emigration plans. Also, income should certainly not be your only criterion for choosing a profession or a place to live in.

signature

Author: Ramiro Gómez • Last edited: April 08, 2016

Linux 4.2.0-35-generic - CPython 3.5.1 - IPython 4.1.2 - matplotlib 1.5.1 - numpy 1.10.4 - pandas 0.18.0

Shirts for Python Programmers