A z-score is the number of standard deviations a specific data point lies above or below the mean. Also known as the standard score, it is helpful for automatically detecting outliers (for example, to trigger alerts) and for comparing data points measured on different scales.

I will use part of the Austin, Texas crime data (2016), available as a public dataset in BigQuery, to find which days had public intoxication incident counts more than 2 standard deviations from the mean.

If the data follows a normal distribution (also known as a Gaussian distribution), roughly 68% of the observations will fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3. So if a particular data point is more than 2 standard deviations from the mean, it lies outside the range that contains about 95% of the data, which makes it a reasonable candidate for an outlier.
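The 68-95-99.7 figures above can be verified directly from the normal CDF: the fraction of a normal distribution within k standard deviations of the mean is erf(k / √2). A quick check with only the standard library:

```python
import math

# Fraction of a normal distribution within k standard deviations of the mean:
# P(|Z| <= k) = erf(k / sqrt(2))
for k in (1, 2, 3):
    fraction = math.erf(k / math.sqrt(2))
    print(f"within {k} sd: {fraction:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```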

Dan Kernler [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)]

The above chart shows what a normal distribution looks like. In the real world, not all data is normally distributed. That's OK though: per the central limit theorem, even with a non-normal (i.e. skewed) distribution, the z-score remains useful as long as we have more than 30 observations. The more the merrier, though; 30 is really the bare minimum.

The calculation itself is straightforward: z = (x - mean) / std, where x is the particular data point you are calculating the z-score for, mean is the mean of all the observations in the dataset, and std is the standard deviation of all the observations in the dataset.
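The formula is simple enough to sketch in a few lines of Python. The daily counts below are made-up numbers for illustration, not the actual Austin dataset:

```python
from statistics import mean, pstdev

def z_score(x, data):
    """z = (x - mean) / std, using the population standard deviation."""
    m = mean(data)
    s = pstdev(data)
    return (x - m) / s

# Hypothetical daily incident counts; the last day looks unusually high.
counts = [4, 5, 3, 6, 5, 4, 14]
print(round(z_score(14, counts), 2))  # ~2.37, i.e. over 2 sd above the mean
```

Note the choice of population standard deviation (`pstdev`) rather than sample standard deviation; either is defensible for exploration, but be consistent with whatever your SQL function computes.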

In machine learning, calculating a z-score is one option for scaling your features. Certain algorithms are sensitive to the scale of the data (SVM, k-NN), while others couldn't care less (decision trees, Naive Bayes). In general though, it's usually a good idea to scale your data, or at least explore the option. Scikit-learn's StandardScaler is a z-score calculation.
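To make the StandardScaler connection concrete, here is a minimal stdlib-only sketch of what that scaling does to a feature column: each value is replaced by its z-score, so the scaled column has mean 0 and standard deviation 1 (scikit-learn's StandardScaler applies this per column):

```python
from statistics import mean, pstdev

def standard_scale(column):
    """Replace each value in a feature column with its z-score."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]

feature = [10.0, 20.0, 30.0, 40.0]
scaled = standard_scale(feature)
# The scaled column is centered at 0 with unit standard deviation.
print([round(v, 3) for v in scaled])  # [-1.342, -0.447, 0.447, 1.342]
```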

Why calculate it in BigQuery? The more data exploration you can do directly in the database, the better, especially if you're working with big data. It will be much faster than doing it in a pandas dataframe, and you can pass the calculation through as a column to whatever BI platform you are using.

OK, now onto the fun part. Here is what a preview of the table looks like: