“We demand rigidly defined areas of doubt and uncertainty!” — Douglas Adams, The Hitchhiker’s Guide to the Galaxy

Why is uncertainty important?

Let’s imagine for a second that we’re building a computer vision model for a construction company, ABC Construction. The company is interested in automating its aerial site surveillance process, and would like our algorithm to run on their drones.

We happily get to work, and deploy our algorithm onto their fleets of drones, and go home thinking that the project is a great success. A week later, we get a call from ABC Construction saying that the drones keep crashing into the white trucks that they have parked on all their sites. You rush to one of the sites to examine the vision model, and realize that it is mistakenly predicting that the side of the white truck is just the bright sky. Given this single prediction, the drones are flying straight into the trucks, thinking that there is nothing there.

When making predictions about data in the real world, it’s a good idea to include an estimate of how sure your model is about its predictions. This is especially true if models are required to make decisions that have real consequences to peoples lives. In applications such as self driving cars, health care, insurance, etc, measures of uncertainty can help prevent serious accidents from happening.

Sources of uncertainty

When modeling any process, we are primarily concerned with two types of uncertainty.

Aleatoric Uncertainty: This is the uncertainty that is inherent in the process we are trying to explain. e.g. A ping pong ball dropped from the same location above a table will land in a slightly different spot every time, due to complex interactions with the surrounding air. Uncertainty in this category tends to be irreducible in practice.

Epistemic Uncertainty: This is the uncertainty attributed to an inadequate knowledge of the model most suited to explain the data. This uncertainty is reducible given more knowledge about the problem at hand. e.g. reduce this uncertainty by adding more parameters to the model, gather more data etc.

So how do we estimate uncertainty?

Let’s consider the case of a bakery trying to estimate the number of cakes it will sell in a given month based on the number of customers that enter the bakery. We’re going to try and model this problem using a simple linear regression model. We will then try to estimate the different types of epistemic uncertainty in this model from the available data that we have.

Equation for Linear Regression

The coefficients of this model are subject to sampling uncertainty, and it is unlikely that we will ever determine the true parameters of the model from the sample data. Therefore, providing an estimate of the set of possible values for these coefficients will inform us of how appropriately our current model is able to explain the data.

First, lets generate some data. We’re going to sample our x values from a scaled and shifted unit normal distribution. Our y values are just perturbations of these x values.

import numpy as np from numpy.random import randn, randint

from numpy.random import seed # random number seed

seed(1) # Number of samples

size = 100 x = 20 * (2.5 + randn(size))

y = x + (10 * randn(size) + 50)

Our resulting data ends up looking like this

Distribution of Sample Data

We’re going to start with estimating the uncertainty in our model parameters using bootstrap sampling. Bootstrap sampling is a technique to build new datasets by sampling with replacement from the original dataset. It generates variants of our dataset, and can give us some intuition into the range of parameters that could describe the data.

In the code below, we run 1000 iterations of bootstrap sampling, fit a linear regression model to each sample dataset, and log the coefficients, and intercepts of the model at every iteration.

from sklearn.utils import resample coefficients = []

intercepts = [] for _ in range(1000):

idx = randint(0, size, size)

x_train = x[idx]

y_train = y[idx]



model = LinearRegression().fit(x_train, y_train)



coefficients.append(model.coef_.item())

intercepts.append(model.intercept_)

Finally, we extract the 97.5th, 2.5th percentile from the logged coefficients. This gives us the 95% confidence interval of the coefficients and intercepts. Using percentiles to determine the interval has the added advantage of not making assumptions about the sampling distribution of the coefficients.