Introduction

Outliers are the values in dataset which standouts from the rest of the data. The outliers can be a result of error in reading, fault in the system, manual error or misreading

To understand outliers with the help of an example: If every student in a class scores less than or equal to 100 in an assignment but one student scores more than 100 in that exam then he is an outlier in the Assignment score for that class

For any analysis or statistical tests it’s must to remove the outliers from your data as part of data pre-processing

Any outlier in data may give a biased or invalid results which can impact your Analysis and further processing

In this post we will see following two robust methods to remove outliers from the data and Data Smoothing techniques using Exponential Weighted Moving Average

a) IQR - Interquartile Range

b) Z-Score method for Outlier Removal

Data Smoothing:

a) Exponential Weighted Moving Average

Inter Quartile Range (IQR)

IQR is part of Descriptive statistics and also called as midspead , middle 50%

IQR is first Quartile minus the Third Quartile (Q3-Q1)

In order to create Quartiles or Percentiles you first need to sort the data in ascending order and find the Q1,Q2,Q3 and Q4. Basically you have to divide the data in four equal parts after sorting

The middle value of this sorted data will be the median or Q2 or 50th Percentile

Let’s create our data first and then calculate the 1st and 3rd Quartile

The Interquartile IQR for the above data is

IQR = Q3 - Q1 = 64 - 19 = 45

For finding out the Outlier using IQR we have to define a multiplier which is 1.5 ideally that will decide how far below Q1 and above Q3 will be considered as an Outlier.

Any value below Q1-1.5*IQR or above Q3+1.5*IQR is an Outlier

Q1 - 1.5*IQR = 19 - 1.5*45 = -48.5 Q3 + 1.5*IQR = 64 + 1.5*45 = 131.5

We will remove the last item in this dataset i.e. 1456 which is greater than 86.5

Finding IQR using Scipy

from scipy.stats import iqr data = [-2,8,13,19,34,49,50,53,59,64,87,89,1456] iqr(data, axis=0)

IQR = 45, which is same as above calculated manually

You can also use numpy to calculate the First and 3rd Quantile and then do Q3-Q1 to find IQR

import numpy as np Q1 = np.quantile(data,0.25) Q3 = np.quantile(data,0.75) IQR = Q3 - Q1

Z-Score

Z score is an important measurement or score that tells how many Standard deviation above or below a number is from the mean of the dataset

Any positive Z score means the no. of standard deviation above the mean and a negative score means no. of standard deviation below the mean

Z score is calculate by subtracting each value with the mean of data and dividing it by standard deviation

The Mu and Sigma above is population mean and Standard deviation and not of sample

In case population mean and standrad deviation is not known then sample mean and standard deviation can be used

Let’s calculate the Z score of all the values in the dataset which is used above using scipy zscore function

Finding Z-score using Scipy

import pandas as pd from scipy import stats df=pd.DataFrame({'data':[-2,8,13,19,34,49,50,53,59,64,87,89,1456]}) df['z_score']=stats.zscore(df['data'])

These are the respective z-score for each of these values. You can see almost all of them have a negative value except the last one which clearly indicates that most of these values lies on the left side of the mean and are within a range of mean and mean-stddev

Now as per the empirical rule any absolute value of z-score above 3 is considered as an Outlier. This is quite debatable and may not hold true for every dataset in this world.

df.loc[df['z_score'].abs()<=3]

But that’s in-line with the six sigma and statistical process control limits as well. So we have discarded any values which is above 3 values of Standard deviation to remove outliers

In this case only z score which is above 3 is 1456. so that clearly stands out as an outlier

Data Smoothing

Smoothing of data is done for a variety of reasons and one of them is eliminating the spikes and outliers. You can use various techniques like rolling mean, moving averages and Exponential smoothing(EWMA)

if you have some outliers which are really high or a absolute low then smoothing helps to summarize the data and remove the noise from the data

We will discuss Exponential Smoothing(EWMA) unlike moving average which doesn’t treat all the data points equally while smoothing. Most of the time in a Time series data we want to treat the most recent data with more weight than the previous data

In EWMA we are weighting the more recent points higher than the lags or lesser recent points

For a time period t the smoothed value using exponential smoothing is given by following equation. Source: wikipedia link

The value alpha in this equation is the smoothing factor which is a kind of decides that how much the value is updated from the original value versus retaining information from the existing average

For example: if your current value if 12 and previous value is 8 and smoothing level is 0.6 then the smoothed value is given by

12*0.6 + (1-0.6)*8

Finding Exponential Smoothing values using Pandas

Pandas has a EWM function which you can use to calculate the smoothed value with different level of Alpha

df['ewm_alpha_1']=df['data'].ewm(alpha=0.1).mean() df['ewm_alpha_3']=df['data'].ewm(alpha=0.3).mean() df['ewm_alpha_6']=df['data'].ewm(alpha=0.6).mean() df

Conclusion

To sumarize our learning here are the key points that we discussed in this post

Outliers can be removed from the data using statistical methods of IQR, Z-Score and Data Smoothing

For claculating IQR of a dataset first calculate it’s 1st Quartile(Q1) and 3rd Quartile(Q3) i.e. 25th and 75 percentile of the data and then subtract Q1 from Q3

Z-Score tells how far a point is from the mean of dataset in terms of standard deviation

An absolute value of z score which is above 3 is termed as an outlier

Data smoothing is a process to remove the spikes and peaks from the data

Moving Average, Rolling Mean and Exponential smoothing are some of the process to smooth the data

Pandas Exponential smoothing function (EWM) can be used to calculate the value at different alpha level

Hope you must have got enough insight on how to use these methods to remove outlier from your data. if you know of any other methods to eliminate the outliers then please let us know in the comments section below