Data Analysis Resources for Python

Introduction

This post collects resources that, over about 12 weeks, should give you a flavour of data analysis with Python and a basis for further learning. Important notes:

12 weeks is nowhere near enough to become familiar with a language to the point of being consistently productive. You will get syntax errors. You’ll have to google. You’ll get dismayed at the callousness of Stack Overflow users. This is normal and doesn’t indicate a failing on your part; learning this takes a lot of time.

Total time spent learning is important, but so is frequency. If this takes 12 weeks full time(ish), it might take 30 weeks part time, because with longer gaps between sessions you forget more and have to relearn it.

Business domain knowledge is super important. If you’re learning this for a particular industry, try to get hold of data sets for that industry, and feel free to skip anything you don’t think is relevant (though at this level little is irrelevant).

Credentials

I’m a data scientist with a maths PhD (unrelated to stats, but somewhat algebra-focused) and was a quantitative analyst before that. I work in the energy industry and spend a lot of time working with generalized additive models for time series forecasting, chucking stuff at random forests, doing Bayesian inference with pymc3, and survival analysis with lifelines. I don’t use TensorFlow or PyTorch much because they tend not to fit the domain of my problems well, but I revisit them every few months to pit them against our existing models.

Disclaimer

This post is purely my opinion, and in particular reflects my view that people too quickly jump to ML/DL methods when ‘traditional’ methods could do better, perhaps because they don’t have sufficient grounding in statistics. Obviously this is very domain-specific—you’d struggle to generate meaningful text with a linear regression.

Learning Resources

Python Basics

Nothing here is specific to data analysis, so just take a look at the r/learnpython FAQ.

Data Analysis

There’s no getting away from the fact that mathematics is at the core of data analysis, but you don’t have to be John Conway to be useful. Statistics is by far the most important branch at this level, and you don’t need to understand the minutiae of the subject (which is based in measure theory and is tough). Unfortunately I’ve never found a good introduction to statistics with Python (there are plenty for R!), so you have to dip into a number of different resources.

Perhaps not all, but in All of Statistics (AoS) Larry Wasserman has written a very approachable introduction to statistics. The link includes the few data sources given in the book, but it’s very much a textbook. At 500 pages it’s a bit daunting, so I recommend focusing on chapters 1–11 first, then the chapters on linear regression and multivariate models, which is about 200 pages in total. Read it alongside the SciPy docs, and take a look at pythonfordatascience.org, which calls out useful functions in SciPy and statsmodels.

An alternative (and possibly a better one) to AoS, this textbook is available for an optional contribution and is used by a number of colleges in the U.S. I’ve not read it in full, but on a closer look it appears to be pretty great. As with AoS, you’ll have to read along with the SciPy and statsmodels docs.

Jake VanderPlas is the author of the excellent Altair plotting library and a pretty bright chap. This book serves as a good introduction to NumPy, Pandas, Matplotlib and Scikit-Learn, and the link includes its full text as Jupyter Notebooks, which is awesome. You needn’t bother with the Scikit-Learn chapters unless you want to jump ahead.

Which of these you prefer is largely a matter of which medium you like. PfDA’s second edition is already slightly outdated as of pandas 1.0.3, though not enough to stop it being a very useful resource.

Joel Grus’s book kinda does what I assert isn’t possible: take you from zero to data scientist hero in a relatively short text. The criticism I would level at it is that it (necessarily) doesn’t go into sufficient depth everywhere, but what it does brilliantly is implement most things from scratch (duh!) to give you a good grounding in the basics.
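To give a flavour of what "from scratch" means, here is a minimal sketch of a least-squares line fit for a single predictor in plain Python, with no NumPy or pandas. It is an illustration in the spirit of the book, not code taken from it; the function names (mean, covariance, variance, fit_line) and the toy data are made up for the example.

```python
from typing import List, Tuple


def mean(xs: List[float]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


def covariance(xs: List[float], ys: List[float]) -> float:
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def variance(xs: List[float]) -> float:
    """Sample variance, i.e. the covariance of a list with itself."""
    return covariance(xs, xs)


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x."""
    slope = covariance(xs, ys) / variance(xs)
    intercept = mean(ys) - slope * mean(xs)
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

Once this makes sense, the equivalent calls in SciPy or statsmodels stop feeling like magic, which is really the point of doing it by hand first.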

This is a great video to get a better understanding of how to work with Matplotlib, which is definitely the least Pythonic library still in use by data analysts today. It’s also slightly outdated, but hugely valuable.

Great introduction to survival analysis, which will either help you look like a superstar or be completely irrelevant.

I was at this talk at PyData London a few years ago and it was the best of the conference in my opinion. Vincent makes the argument that people are too quick to leap to ML/DL methods when simpler models could do as well, if not better.

Data Science

Briefly, here are a few resources that cover data science proper, but don’t expect to get here any time soon!

Data Sources

As mentioned before, if you’re interested in a particular industry then see if you can get data related to it. Otherwise, these are some general sources of good-quality data.

Scikit-Learn has some really good ‘toy’ datasets that are useful for playing around with descriptive and inferential statistics, as well as with the Scikit-Learn estimators

data.gov.uk and data.gov have hundreds of thousands of data sets. Many of these offer a great opportunity to practise cleaning up data with pandas because they come in all shapes and sizes

OpenIntro Statistics data sets used in this textbook
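As a rough idea of the kind of clean-up that messy public data tends to need, here is a minimal pandas sketch. The CSV is made up for illustration (it is not from data.gov.uk or data.gov), and the column names and placeholder values are assumptions; real files will have their own quirks.

```python
import io

import pandas as pd

# A made-up CSV standing in for a messy download; not real data.
raw = io.StringIO(
    " Region ,Date,Usage (kWh)\n"
    "north ,2020-01-01,120\n"
    "South,2020-01-02,unknown\n"
    " east,not recorded,95\n"
)

df = pd.read_csv(raw)

# Replace the untidy headers with consistent snake_case names.
df.columns = ["region", "date", "usage_kwh"]

# Strip stray whitespace and normalise capitalisation in the text column.
df["region"] = df["region"].str.strip().str.title()

# Convert to proper types; anything unparseable becomes NaN / NaT
# instead of raising an error.
df["usage_kwh"] = pd.to_numeric(df["usage_kwh"], errors="coerce")
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Count the missing values per column before deciding what to do with them.
print(df.isna().sum())
print(df)
```

Whether you then drop, fill or investigate the missing values depends entirely on the data set and the question you are asking of it.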

Out-of-scope

The following topics haven’t been mentioned in this post because I consider them adjuncts to the main theme, but they will probably be important:

SQL (probably very important!)

Big data, including e.g. PySpark (possibly less important, but in general the problems of big data are about finding efficient ways of doing the same stuff with… big data)

git/other version control

Python packaging

Unit testing

Continuous integration/continuous delivery

Docker/Kubernetes

Postscript

This page first appeared as a text post on the r/learnpython subreddit (to which I’ve been a frequent contributor for several years, helping people to understand Python) but was deleted by the mods without notification or explanation. I’ve directly messaged the mods asking for an explanation but haven’t received a reply.