Abraham de Moivre laid the groundwork for probability theory with his book ‘The Doctrine of Chances’ (first edition 1718).

Specifically, in the later editions of his book, de Moivre recorded the first known statement of an approximation to the binomial distribution, now commonly known as the normal distribution. In addition, he proved a special case of the central limit theorem. The discovery of these principles is as important for statistical analysis as the discovery of the laced leather ball is for football.

The normal distribution can be described through the mean (μ) and the standard deviation (σ). The standard deviation (σ) is a measure of dispersion around the mean. In a normal distribution, ~68% of all observations fall within one standard deviation of the mean, ~95% within two, and ~99.7% within three.
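The 68–95–99.7 rule can be checked empirically by simulation. A minimal sketch using only the Python standard library (the sample size and seed are arbitrary choices):

```python
import random
import statistics

# Simulate a large sample from a standard normal distribution.
random.seed(42)
sample = [random.gauss(0, 1) for _ in range(100_000)]
mu = statistics.mean(sample)
sigma = statistics.stdev(sample)

# Fraction of observations within k standard deviations of the mean.
fractions = {
    k: sum(abs(x - mu) <= k * sigma for x in sample) / len(sample)
    for k in (1, 2, 3)
}
print(fractions)  # roughly 0.68, 0.95, and 0.997
```

With a large enough sample, the empirical fractions land very close to the theoretical 68%, 95%, and 99.7%.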

The well-established normal distribution, described by the Gaussian function, provides a general yardstick for comparing numerical observations and returns the probability of each observation.

Figure: the Gaussian function (source: http://www.drcruzan.com/Images/ProbStat/Distributions/GaussianFunctionWithExplanation.png)

And that standardization is exactly where the z-score becomes incredibly handy.

The z-score function: z = (x − μ) / σ
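In code, the z-score is a one-liner. A minimal sketch (the function name is my own):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations by which x deviates from the mean mu."""
    return (x - mu) / sigma
```

For example, `z_score(60, 50, 10)` returns `1.0`: a score of 60 lies one standard deviation above a mean of 50.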

Normalization for better comparisons

Let’s say Mr. de Moivre wanted to check whether his students are as good in Algebra as they are in Geometry. Obviously, he would plot the distributions of both data series.

For Algebra the average score is 50 and we have calculated the standard deviation to be 10. We can calculate the z-score as (Score − Mean) / Standard Deviation = (X − 50)/10.

For Geometry the average score is 70 and we have calculated the standard deviation to be 10. We can calculate the z-score as (Score − Mean) / Standard Deviation = (X − 70)/10.

If de Moivre’s favorite student scored a 60 on both tests, she would have been one standard deviation above the mean on the Algebra test.

However, for the Geometry test she would be one standard deviation below the mean.

Or in other words: in Algebra, 84% of the other students (50% below the mean plus 34% within the first standard deviation above it) would have scored worse; in Geometry, 84% would have scored better.
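Under the normality assumption, these percentages fall out of the standard normal CDF, which the Python standard library supports via `math.erf`. A sketch (the helper names are my own):

```python
import math

def z_score(x, mu, sigma):
    return (x - mu) / sigma

def normal_cdf(z):
    # Standard normal CDF, expressed via the error function from the stdlib.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The favorite student's score of 60 in both subjects:
algebra_percentile = normal_cdf(z_score(60, 50, 10))   # z = +1: about 84% scored worse
geometry_percentile = normal_cdf(z_score(60, 70, 10))  # z = -1: only about 16% scored worse
```

Rounded to two decimals, the Algebra percentile is 0.84 and the Geometry percentile is 0.16, matching the 84% figures above.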

Now we have two data series with different means. Is it possible to use the performance on these two tests to forecast the students’ future performance in general? Of course it is.

In many statistical applications it is common to normalize observations towards the standard normal distribution: a normal distribution with mean (μ) = 0 and standard deviation (σ) = 1.

We simply calculate the z-scores for all observation values and, voilà, we have all observations on the standard scale.
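Standardizing a whole series is just the z-score applied element-wise. A minimal sketch, using hypothetical Algebra scores with the mean of 50 from the example above:

```python
import statistics

def standardize(values):
    """Map a series onto the standard scale: mean 0, standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical Algebra scores (mean 50, as in the example above).
algebra = [35, 42, 50, 58, 65]
standardized = standardize(algebra)
```

After the transformation the series has mean 0 and standard deviation 1, so it can be compared directly with any other standardized series, such as the Geometry scores.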

Standard scores have several advantages for evaluating the performance of the students. If you simply calculated average ranks for the sample, you would get a sense of where the students stand, but differences in rank near the average represent less of a difference in ability than differences at the extremes of the distribution. In other words, in a group of 100 students, the difference between the first- and fifth-ranked students is usually greater than the difference between the fifty-first- and fifty-fifth-ranked.

Outlier Detection

After converting these measures to z-scores, you add them up and get a fair overall rating for each student. The rating assumes that all three measures are of equal importance.

We can see that Student 3 performs much worse than the other two. Student 3 can thus be seen as an outlier who needs special support. The commonly used threshold for outlier detection is a z-score of 3 or −3, i.e. being more than three standard deviations away from the mean. If you had to decide whom to support, it would obviously be Student 3. But outlier detection is not only useful for data exploration: in financial services, it is a common Machine Learning application for fraud detection, improving lending operations.
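The |z| > 3 rule translates directly into code. A minimal sketch with made-up scores (one of which is an obvious anomaly):

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Hypothetical test scores clustered around 50, plus one anomalous result.
scores = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
          50, 48, 52, 50, 49, 51, 50, 48, 52, 120]
print(find_outliers(scores))  # only the anomalous 120 is flagged
```

Note that the anomaly inflates the sample mean and standard deviation itself, which is why with small samples the threshold is sometimes computed on robust statistics (median and MAD) instead.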

Feature Scaling

In fact, most clustering algorithms in Machine Learning work better with, or even require, data normalization using z-score or min-max normalization. In ML this step is also called feature scaling. Feature scaling is applied, for example, with k-nearest-neighbors with a Euclidean distance measure between the observations.
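Why scaling matters for a Euclidean distance is easy to demonstrate: a feature on a large scale (like income) drowns out a feature on a small scale (like age). A sketch with made-up data:

```python
import math
import statistics

def z_scale(column):
    """Z-score normalization of one feature column."""
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)
    return [(v - mu) / sigma for v in column]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical two-feature observations: income (large scale), age (small scale).
incomes = [30_000, 32_000, 90_000, 31_000]
ages = [25, 60, 26, 27]

raw = list(zip(incomes, ages))
scaled = list(zip(z_scale(incomes), z_scale(ages)))

# Raw: income dominates, so observation 1 (similar income, very different age)
# looks far closer to observation 0 than observation 2 does.
# Scaled: both features contribute comparably and the picture changes.
```

On the raw data, the distance from observation 0 to observation 1 is roughly 2,000 while the distance to observation 2 is 60,000; after z-scaling, both distances are of the same order, so the age difference is no longer ignored by k-nearest-neighbors.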

Even cutting-edge ML libraries like Google’s TensorFlow rely heavily on a technique that is almost 300 years old. Abraham de Moivre, however, never saw the impact his research made on the world. And despite his membership in the honorable Royal Society, he died a poor man.


This article was brought to you by tenqyu, a startup making urban living more fun, healthy, inclusive, and thriving using big data, machine learning, and LOTs of creativity.