There are many ways to detect anomalies in data and, like most things in life, there's no clear-cut 'right way' to do it. The approach you take will be dictated by your data, but also by the requirements of the project.

Does this need to be as accurate as possible, like credit card fraud detection, or do you just need to monitor some metrics for potential issues? Does it need to be up and running fast, or do you have six months to get it into production? What kind of resources do you have available to you?


There are two basic approaches to anomaly detection: rule-based methods and machine learning. Machine learning models like isolation forests, local outlier factor and autoencoders can all be used to successfully detect anomalies. If you're going that route, check out scikit-learn's outlier detection tools as well as the book Hands-On Unsupervised Learning as good starting places.

However, rule-based methods work well, are easier to implement, easier to put into production and faster to get up and running. Depending on your requirements, they may be the best solution.

Look, I get it: everyone wants to jump straight to machine/deep learning to solve problems because, let's face it, it's fun and challenging. However, going the machine learning route isn't always the best option, especially in a business environment.

One rule-based method that works well and is easy to implement in SQL is the interquartile range (IQR). The IQR is defined as the difference between the 75th percentile and the 25th percentile of your dataset.

If a data point is more than 1.5x the IQR above the 75th percentile, or more than 1.5x the IQR below the 25th percentile, it can be considered an outlier.

Upper outlier threshold = 75th percentile + 1.5 * (75th percentile - 25th percentile)

Lower outlier threshold = 25th percentile - 1.5 * (75th percentile - 25th percentile)
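As a quick illustration, the two thresholds can be computed in a few lines of Python using only the standard library. The `iqr_bounds` helper name and the sample data are invented for this sketch:

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier thresholds using the IQR rule.

    `k` is the sensitivity multiplier (1.5 by default).
    """
    # statistics.quantiles with n=4 returns the 25th, 50th and 75th percentiles
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [10, 12, 11, 13, 12, 14, 11, 95]  # 95 is an obvious outlier
lower, upper = iqr_bounds(data)
outliers = [x for x in data if x < lower or x > upper]  # flags 95
```

In SQL you would compute the same quartiles with your database's percentile function and filter rows falling outside the two bounds.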

Why 1.5x and not 2x or 3x? For normally distributed data, 1.5x the IQR beyond the quartiles lands at roughly 2.7 standard deviations from the mean, close to the common 3-sigma cutoff (see the diagram below). You can certainly change it to 1.75x or 2.25x, etc., depending on how 'sensitive' you want the outlier detection to be, which is another great thing about using the IQR.

Since the IQR is based on percentiles rather than the average, it's considered a robust measure of scale and is far less likely to be influenced by a few extreme values than measures built on the average, such as the standard deviation. This makes it easier to identify the actual outliers.
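To see that robustness in action, here is a small stdlib-only sketch with made-up numbers, comparing how a single extreme value moves the mean versus the quartiles:

```python
import statistics

clean = [10, 11, 12, 13, 14]
dirty = clean + [1000]  # one extreme value added

# The mean (and any measure built on it, like standard deviation)
# shifts dramatically...
mean_shift = statistics.mean(dirty) - statistics.mean(clean)  # ~164.7

# ...while the quartiles, and therefore the IQR, barely move.
q1_c, _, q3_c = statistics.quantiles(clean, n=4, method="inclusive")
q1_d, _, q3_d = statistics.quantiles(dirty, n=4, method="inclusive")
iqr_clean = q3_c - q1_c  # 2.0
iqr_dirty = q3_d - q1_d  # 2.5
```

Thresholds built from the mean and standard deviation would be dragged toward the outlier; the IQR-based thresholds stay put.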

If you've ever seen a boxplot, that's just the IQR method of outlier detection visualized. The edges of the 'box' are the 25th and 75th percentiles, the 'whiskers' extend up to 1.5x the IQR beyond them, and any data points beyond the whiskers (called 'fliers') are the outliers, as you can see from the below diagram:

A couple more notes about anomaly detection before we get to the example. It can be univariate or multivariate. Univariate detection is much simpler and easier to interpret, since you know exactly which metric triggered the anomaly. If you have multiple metrics, you can set up a separate query to detect anomalies for each one, which keeps the end product much more interpretable. If you're going the multivariate route, machine learning may be better suited. Anomaly detection can also be either supervised or unsupervised, but the majority of the time it is unsupervised, as it is here.
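For the multiple-metric case, one way to keep things interpretable is to run the same univariate IQR check once per metric. A quick sketch in Python, where the metric names and values are invented (in practice each check could be its own SQL query):

```python
import statistics

def iqr_outliers(values, k=1.5):
    # Univariate IQR check: return the values outside the thresholds.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# One check per metric, so you always know which metric fired.
metrics = {
    "daily_orders":   [120, 130, 125, 128, 131, 40],  # 40 looks anomalous
    "avg_basket_usd": [25.0, 26.5, 24.8, 25.5, 26.1],
}
flagged = {name: iqr_outliers(vals) for name, vals in metrics.items()}
```

Each metric's result stands on its own, so when an alert fires you immediately know which metric caused it and by how much.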