Normal Distributions With Python

(For the full code, please check out my GitHub here)

First, let’s get our inputs out of the way:

import numpy as np

from scipy.stats import norm

import matplotlib.pyplot as plt

import seaborn as sns

Now let’s generate some data. We will assume that the true mean height of a person is 5 feet 6 inches and the true standard deviation is 1 foot (12 inches). We also define a variable called “target” of 6 feet, which is the height that our friend inquired about. To make life easier, we will convert everything to inches.

mean_height = 5.5*12

stdev_height = 1*12

target = 6*12

With these parameters, we can now fabricate some height data. We will make a 10,000 by 10 array to hold our survey results, where each row is a survey. Then we will fill out each element of the array with a normally distributed random variable. You can give the random variable function a mean and a standard deviation, but I wanted to show you how to manually customize the mean and standard deviation of your random variable (more intuitive):

You can shift the mean by adding a constant to your normally distributed random variable (where the constant is your desired mean). It changes the central location of the random variable from 0 to whatever number you added to it.

(where the constant is your desired mean). It changes the central location of the random variable from 0 to whatever number you added to it. You can modify the standard deviation of your normally distributed random variable by multiplying a constant to your random variable (where the constant is your desired standard deviation).

In the code below, np.random.normal() generates a random number that is normally distributed with a mean of 0 and a standard deviation of 1. Then we multiply it by “stdev_height” to obtain our desired volatility of 12 inches and add “mean_height” to it in order to shift the central location by 66 inches.

mean_height + np.random.normal()*stdev_height

We can use nested for loops to fill out our survey data and then check that the output conforms to our expectations:

height_surveys = np.zeros((10000,10))

for i in range(height_surveys.shape[0]):

for j in range(height_surveys.shape[1]):

height_surveys[i,j] = mean_height +\

np.random.normal()*stdev_height print('Mean Height:', round(np.mean(height_surveys)/12,1), 'feet')

print('Standard Deviation of Height:',

round(np.var(height_surveys)**0.5/12,1), 'foot')

When I ran the code, it printed that the mean height was 5.5 feet and the standard deviation of peoples’ heights was 1 foot — these match our inputs. OK cool, we have our survey data. Now let’s do some statistical inference.

Assume for a second that we only had the time and resources to run one height survey (of 10 people). And we got the following result:

Single Survey Histogram

Half the people are taller than 6 feet and half are shorter (the red line denotes 6 feet) — not super informative. The average height in our sample is 69 inches, slightly below 6 feet.

Still, we remember that because our sample size is small (only 10 people in a survey), we should expect a lot of variance in our results. The cool thing is that even with just a single survey, we can get a pretty decent estimate of how much the mean varies. And we know the shape of the distribution — it’s normally distributed (hardcore statisticians will say that I need to say it is roughly normally distributed)!

Standard Error = Sample Standard Deviation / sqrt(N)

Where N is the number of observations (10 in our case) and sqrt denotes taking the square root.

The standard error is the standard deviation of the mean (if we keep conducting 10 people surveys and calculating a mean from each, the standard deviation of these means would eventually converge to the standard error). Like we mentioned, the distribution of sample means is the normal distribution. Meaning that if we were to conduct a large number of surveys and look at their individual means in aggregate, we would expect to see a bell curve. Let’s check this out.

Recall that earlier we created a 10,000 by 10 array of survey results. Each of our 10,000 rows is a survey. Let’s calculate the mean for each of our 10,000 surveys and plot the histogram via the following lines of code:

# Histogram that shows the distribution for the mean of all surveys

fig, ax = plt.subplots(figsize=(12,8))

sns.distplot(np.mean(height_surveys,axis=1),

kde=False, label='Height') ax.set_xlabel("Height in Inches",fontsize=16)

ax.set_ylabel("Frequency",fontsize=16)

plt.axvline(target, color='red')

plt.legend()

plt.tight_layout()

Which gives us the following histogram — that’s a bell curve if I ever saw one:

Histogram of the Sample Means

Now let’s use just our original survey (the first wonky looking histogram I showed) to calculate the distribution. With just a single sample of 10 people, we have no choice but to guess the sample mean to be the true mean (the population mean as they say in statistics). So we will guess the mean to be 69 inches.

Now let’s calculate the standard error. The standard deviation of our sample is 12.5 inches:

# I picked sample number 35 at random to plot earlier

np.var(height_surveys[35])**0.5) # = 12.5

Let’s overlay our inferred distribution, a normal distribution with a mean of 69 inches and a standard deviation of 12.5 inches on the true distribution (from our 10,000 simulated surveys, which we assume to be ground truth). We can do so with the following lines of code where

# Compare mean of all surveys with inferred distribution

fig, ax = plt.subplots(figsize=(12,8)) # Plot histogram of 10,000 sample means

sns.distplot(np.mean(height_surveys,axis=1),

kde=False, label='True') # Calculate stats using single sample

sample_mean = np.mean(height_surveys[35])

sample_stdev = np.var(height_surveys[35])**0.5

# Calculate standard error

std_error = sample_stdev/(height_surveys[35].shape[0])**0.5 # Infer distribution using single sample

inferred_dist = [sample_mean + np.random.normal()*\

std_error for i in range(10000)] # Plot histogram of inferred distribution

sns.distplot(inferred_dist, kde=False,

label='Inferred', color='red') ax.set_xlabel("Height in Inches",fontsize=16)

ax.set_ylabel("Frequency",fontsize=16)

plt.legend()

plt.tight_layout()

And we get the following set of histograms:

We are somewhat off, but honestly it’s not too bad given that we generated the inferred distribution (in red) with just 10 observations. And we are able to achieve this thanks to the fact that the distribution of sample means is normally distributed.

Our Estimate of the Standard Deviation Has Variance as Well

For the curious, the distribution of the sample standard deviations is also roughly normal (where a sample standard deviation is the standard deviation of a single survey of 10 people):

# Check out the distribution of the sample standard deviations

vol_dist = np.var(height_surveys, axis=1)**0.5 # Histogram that shows the distribution of sample stdev

fig, ax = plt.subplots(figsize=(12,8))

sns.distplot(vol_dist, kde=False,

label='Sample Stdev Distribution') ax.set_xlabel("Inches",fontsize=16)

ax.set_ylabel("Frequency",fontsize=16)

plt.legend()

plt.tight_layout()

Which gives us the plot below. That’s a pretty wide range of standard deviations. Does that mean our estimate of standard error is not as reliable as we first thought?

The Distribution of the Sample Standard Deviations (Pretty Normal!)

Let’s add the standard error distribution (in red) to the plot above (recall that the standard error is a function of the standard deviation). It’s lower (since standard error is standard deviation divided by the square root of the number of observations in each survey). The standard error also wiggles less for the same reason. Even so, that’s still more wiggle than we are comfortable with — but there’s an easy fix.

The Variance of the Standard Error Looks High

We can easily reduce the wiggle of our standard error distribution by including more observations. Let’s double it to 20 and see what happens — the red histogram has gotten significantly skinnier by just including 10 more people in our survey. Cool!

We Can Reduce the Variance of Standard Error By Taking More Observations

Finally, we can give a better answer to our friend (who was wondering whether the average height of the population was greater than 6 feet). We can do this one of two ways. First, we can just simulate it by generating a bunch of random variables (normally distributed of course). When I ran the code below, I got that 23% of the observations were 6 feet or taller.

# Simulation method for answering the question # Generate 10,000 random variables

inferred_dist = [sample_mean + np.random.normal()*\

std_error for i in range(10000)] # Figure out how many are > than target

sum([1 for i in inferred_dist if i>=target])/len(inferred_dist)

Or since we know that it’s normally distributed, we can use the cumulative density function to figure out the area under the curve for 6 feet or more (the area under the curve tells us the probability). The single line of code below tells us that the probability of being 6 feet or taller is 23%, the same as above.

1 - norm.cdf(target, loc=sample_mean, scale=std_error)

So we should tell our friend that while there is uncertainty given how informal and small our sample was, it’s still more likely than not (given the data that we have) that people’s average height is less than 6 feet (we could also do a more formal hypothesis test, but that’s a story for another time). There are three points that I want you to keep in mind: