Two views on regression with PyMC3 and scikit-learn

Colin Carroll

Introduction

This is a series of three essays, based on my notes from a 2017 PyData NYC tutorial. The first two essays are completely independent, and may be used as an introduction to linear regression or probabilistic programming, respectively. The third builds on the first two and uses the data set introduced in the first, but it can be read on its own if you are already familiar with linear regression and probabilistic programming.

The essays go along with Jupyter notebooks and exercises. You can install the requirements by following the instructions below. All material is hosted on GitHub, and comments may be posted there as issues.

These talks cover a reasonable portion of three undergraduate courses in math, but require only a hazy memory of the subjects to follow. We derive linear regression, along with regularization, from the standpoint of

calculus, as the minimizer of a cost function,

linear algebra, as a projection onto a subspace, and

statistics, as the maximum a posteriori estimate.
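As a rough preview of how these three views line up, here is a minimal NumPy sketch. It is not taken from the notebooks: the synthetic data, the variable names, and the choice of noise scale `sigma` and prior scale `tau` are all assumptions made for illustration. The calculus view solves the normal equations, the linear algebra view projects `y` onto the column space of `X`, and the statistics view assumes Gaussian noise and a zero-mean Gaussian prior on the weights, under which the maximum a posteriori estimate is ridge regression with penalty `sigma**2 / tau**2`.

```python
import numpy as np

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Calculus: setting the gradient of ||y - Xw||^2 to zero
# gives the normal equations (X^T X) w = X^T y.
w_calc = np.linalg.solve(X.T @ X, X.T @ y)

# Linear algebra: project y onto the column space of X.
# With a reduced QR factorization X = QR, the projection is Q Q^T y.
Q, R = np.linalg.qr(X)
y_hat = Q @ (Q.T @ y)
w_proj = np.linalg.solve(R, Q.T @ y)

# Statistics: assume y ~ N(Xw, sigma^2 I) and a prior w ~ N(0, tau^2 I).
# Maximizing the posterior is then ridge regression with lam = sigma^2 / tau^2.
sigma, tau = 0.1, 10.0  # assumed noise and prior scales
lam = sigma**2 / tau**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("normal equations:", w_calc)
print("projection:      ", w_proj)
print("MAP (ridge):     ", w_map)
```

The first two estimates agree up to floating point error. The third is pulled slightly toward zero by the prior, and shrinking `tau` (a tighter prior) strengthens that pull, which is one way to build intuition for regularization.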

The goal of this talk is to help participants understand the math underlying so much of modern machine learning. I do not expect that attendees (or readers) will have new models or libraries to try, but I do expect that they will be better at tasks like diagnosing problems in the linear parts of their neural networks, explaining why logistic regression will be good (or bad) for a task, and giving colleagues some intuition for regularization.

Installation

If you wish to follow along in the essays and exercises (which I would recommend), the easiest way to install the requirements is with conda, which is quickest to get via Miniconda. Installing with pip and the supplied requirements.txt file should also work.

Clone the repository from https://github.com/ColCarroll/pydata_nyc2017:

    git clone https://github.com/ColCarroll/pydata_nyc2017.git

Navigate to the folder:

    cd pydata_nyc2017

Create the conda environment:

    conda env create -f environment.yml

Activate the environment with one of:

    conda activate pydata_nyc20173.6    # new conda
    source activate pydata_nyc20173.6   # OSX/Linux
    activate pydata_nyc20173.6          # Windows

Start the Jupyter notebook server:

    jupyter notebook

Also See

There were a number of other talks and workshops at PyData NYC 2017 covering similar Bayesian approaches. To mention a few talks, along with links to their videos (coming soon! placeholder for now):

There were also three other workshops which (like this one) were not videotaped.