You want to be a Data Scientist. Perhaps you are already one, coming from a software engineering background. You know that 80% to 90% of a Data Scientist's job is actually data cleaning, yet you still want the job because of the small share of machine learning tasks you get to perform. Nothing beats the high you get after completing the analysis of a dataset. So how do you cut down the time you spend on data cleaning? How do you reserve your energy for that small but critically important part of your job, so that you can do it better?

From my experience, it's always good to have a solid understanding of the processes involved in data cleaning. Understanding the process, why it matters, and the techniques it uses will cut down the time you spend on data cleaning tasks.

The Importance of Good Data

As we are inundated with data these days, it's easy to lose sight of what "Good Data" means. Good Data is data that is accurate, complete, conformant, consistent, timely, unique, and valid. Your machine learning algorithms depend on good data to build a model, to perform well, and to generalize to new data. With real-world data, it's common to discover data issues only down the line, when your ML algorithm simply won't work or its performance doesn't generalize to larger sets of data.

It’s virtually impossible to find all the data issues on the first go of your data science process. You need to be prepared for an iterative process of Data Cleaning -> Data Modeling -> Performance Tuning. In this iterative process, the time can be significantly shortened by getting the fundamentals right from the first go.

In statistics, you will frequently find people who liken the data analysis process to dating. During the first few dates, it's critical to get a feel for your partner (your data). Are there any deal breakers that might creep up later on? These are the issues you want to catch first, because they will bias your data.

One of the biggest deal breakers in your data is “missing data”.

Understanding Missing Data

Missing data comes in all shapes and sizes. You can have a record like row 1 below, where only the Insulin column is missing. You can have a record that's missing values across many columns, like row 2. You can have a record that contains 0s across many columns, like row 3. There are many variations, and you need to know what they are. Looking at each column on its own will only get you so far. You can draw boxplots of each column to find outliers, and you can use heatmaps to highlight where the data is missing.

Figure: Diabetes missing data (by Jun Wu)

In Python:

import seaborn as sb

# Missing values (NaN) light up as contrasting cells in the heatmap.
sb.heatmap(df.isnull(), cbar=False)
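As a complement to the heatmap, boxplots make per-column outliers easy to spot. A minimal sketch, assuming the data is already loaded into the DataFrame df (adjust the layout to the number of columns):

import matplotlib.pyplot as plt

# One boxplot per column of the DataFrame.
df.plot(kind="box", subplots=True, layout=(3, 3), figsize=(12, 10), sharex=False, sharey=False)
plt.tight_layout()
plt.show()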

How to categorize Missing Data?

The first thing that you want to do after visualizing missing data is to categorize your missing data.

There are three categories of missing data: Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR).

MCAR — The value is missing completely at random. The propensity for a data point to be missing has nothing to do with its hypothetical value or with the values of the other variables.

MAR — The value is missing at random, due to some of the observed data. The propensity for a data point to be missing is not related to the missing value itself, but it is related to some of the observed data.

MNAR — The value is missing not at random: there is a reason for the missingness. Often, the reason is that the missing value depends on its own hypothetical value, or on the value of another variable that is itself unobserved.
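To make the three categories concrete, here is a small illustrative sketch that simulates each mechanism on a made-up dataset (the columns, probabilities, and thresholds are invented purely for illustration):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.integers(20, 70, 1000),
                    "income": rng.normal(50_000, 15_000, 1000)})

# MCAR: every income value has the same 10% chance of being missing.
mcar = toy.copy()
mcar.loc[rng.random(len(toy)) < 0.10, "income"] = np.nan

# MAR: income is more likely to be missing for younger respondents
# (missingness depends only on the observed column "age").
mar = toy.copy()
mar.loc[(toy["age"] < 30) & (rng.random(len(toy)) < 0.40), "income"] = np.nan

# MNAR: high incomes are more likely to be withheld
# (missingness depends on the missing value itself).
mnar = toy.copy()
mnar.loc[(toy["income"] > 70_000) & (rng.random(len(toy)) < 0.50), "income"] = np.nan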

Is the Data Missing At Random?

If data is missing at random, you will handle it differently than data that's not missing at random. You can use Little's MCAR test to check whether your data is missing completely at random.

The null hypothesis of Little's MCAR test is that the data are missing completely at random. Based on the outcome of the test, you either reject or fail to reject this null hypothesis.

In SPSS: you can use Analyze -> Missing Value Analysis -> EM

In R, you can use the LittleMCAR() function in package BaylorEdPsych for this.

LittleMCAR(df) #df is the dataframe with no more than 50 variables

Interpretation: if the reported significance (the p-value) is greater than 0.05, the result is not statistically significant, which means you fail to reject the null hypothesis that the data are missing completely at random.

If the data are MCAR or MAR, deletion is an option. Otherwise (MNAR), impute.

Methods of Deletion

Listwise Deletion — This method removes the entire record (row) if it contains one or more missing values.

Disadvantages — Statistical power relies on sample size, and in smaller datasets listwise deletion can shrink the sample considerably. Unless you are sure the missingness is not MNAR (ideally, that it is MCAR), this technique may also introduce bias into the dataset.

In R:

# Only rows with no missing values contribute to the covariance matrix (listwise deletion).
nMat <- cov(diabetes_data, use="complete.obs")
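The pandas equivalent of listwise deletion is a one-liner; a minimal sketch, assuming the data is in a DataFrame named diabetes_data:

# Keep only the rows that have no missing values at all.
complete_cases = diabetes_data.dropna()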

Pairwise Deletion — For each pair of variables, this method uses every record where both values are observed, maximizing the data available on an analysis-by-analysis basis.

In R:

# Each pairwise covariance uses all the rows where that pair of columns is observed.
nMat <- cov(diabetes_data, use="pairwise.complete.obs")
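In pandas, cov() and corr() already compute each entry from the pairwise-complete observations, so a sketch of pairwise deletion (again assuming a DataFrame named diabetes_data) is simply:

# Each cell of the matrix is computed from the rows where both columns are observed,
# so different cells can be based on different numbers of records.
cov_matrix = diabetes_data.cov()
corr_matrix = diabetes_data.corr()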

Disadvantages — It's difficult to interpret parts of your model, because different numbers of observations contribute to different parts of it.

Dropping Variables — This method drops a variable (column) entirely when a large share of its values, roughly 60% or more, is missing.

# Drop the column entirely (axis=1 refers to columns).
diabetes_data.drop('column_name', axis=1, inplace=True)
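To apply the 60% rule without naming columns by hand, one possible sketch (the 0.60 threshold is just the rule of thumb above):

# Fraction of missing values in each column.
missing_fraction = diabetes_data.isnull().mean()

# Drop every column where more than 60% of the values are missing.
cols_to_drop = missing_fraction[missing_fraction > 0.60].index
diabetes_data = diabetes_data.drop(columns=cols_to_drop)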

Disadvantages — It’s difficult to know how your dropped variable may affect other variables inside the dataset.

When you can't delete, imputation is the other way to go.

Methods of Missing Data Imputation

Categorical Variables — These are variables that have a fixed number of possible values. An example of such a variable is Gender, with values Male, Female, and Not Applicable.

For categorical variables, there are 3 methods you can use to impute the data.

Create a new level out of the missing values

Use predictive models such as logistic regression, KNN to estimate the data

Use multiple imputation.

Continuous Variables — These are variables that take real values on some interval. An example of such a variable is payment amount, ranging from 0 to infinity.

For continuous variables, there are 3 methods you can use to impute the data.

Use Mean, Median, Mode

Use predictive models such as linear regression, KNN to estimate the data

Use multiple imputation

Create New Level Out of the Missing Values

Creating a new level of a categorical variable for the missing values is a good way to handle them when there aren't many missing values.

In Python:

import pandas as pd

diabetes = pd.read_csv('data/diabetes.csv')

# Treat missing gender values as their own category.
diabetes["Gender"].fillna("No Gender", inplace=True)

Mean, Median, Mode

This method involves imputing the missing data with the mean, median or mode. The advantage of this method is that it’s very easy to implement. However, there are many disadvantages.

In Python:

df.Column_Name.fillna(df.Column_Name.mean(), inplace=True)

df.Column_Name.fillna(df.Column_Name.median(), inplace=True)

# mode() returns a Series, so take the first (most frequent) value.
df.Column_Name.fillna(df.Column_Name.mode()[0], inplace=True)

The disadvantages of Mean, Median, Mode Imputation — It reduces the variance of the imputed variables. It also shrinks the standard errors, which invalidates most hypothesis tests and confidence interval calculations. It disregards the correlations between variables, and it can over-represent or under-represent certain values.

Logistic Regression

This is a statistical model that uses a logistic function to model a binary dependent variable, whose two values are labeled "0" and "1". The logistic function is a sigmoid: it takes the log odds as input and outputs a probability. (Example: Y is the probability of passing an exam, X is hours of study.)
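In symbols, the model passes the log odds, a linear combination of the inputs, through the sigmoid to obtain a probability:

$$ p(y = 1 \mid x) = \sigma(\beta_0 + \beta_1 x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} $$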

In Python:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # replaces the deprecated sklearn.preprocessing.Imputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Fill missing values with the column mean, then fit the classifier.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
logmodel = LogisticRegression()
steps = [('imputation', imp), ('logistic_regression', logmodel)]
pipeline = Pipeline(steps)

# X, y: feature matrix and target from the diabetes data.
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, Y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, Y_test)

Disadvantages of Logistic Regression:

- prone to overconfidence and overfitting, overstating the accuracy of its predictions.

- tends to underperform when there are multiple or nonlinear decision boundaries.

Linear Regression

This is a statistical model that uses a linear predictor function to model a dependent variable. The relationship between the dependent variable y and the inputs x is assumed to be linear, and the coefficients are the slopes of the fitted line. The vertical distance from each point to the line is the error term (the residual).
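In symbols, with the coefficients as slopes and an additive error term:

$$ y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon $$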

In Python:

from sklearn.linear_model import LinearRegression  # scikit-learn has no LinearModel class
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import numpy as np

# Fill missing values with the column mean, then fit the regressor.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
linmodel = LinearRegression()
steps = [('imputation', imp), ('linear_regression', linmodel)]
pipeline = Pipeline(steps)

# X, y: feature matrix and continuous target from the diabetes data.
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, Y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, Y_test)

Disadvantages of Linear Regression:

- when used to impute, it deflates the standard errors.

- it requires a linear relationship between x and y.

KNN (K Nearest Neighbors)

This model is widely used for missing data imputation because it can handle both continuous data and categorical data.

It is a non-parametric method that assigns each point the class of its nearest (optionally distance-weighted) neighbors. Euclidean distance is typically used for continuous variables, and Hamming distance can be used for categorical data. In the classic illustration, the green circle is the point to classify: it gets classified with the red triangles rather than the blue squares because most of its nearest neighbors are red triangles.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import numpy as np

# X, y: feature matrix and target from the diabetes data.
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Try a range of k values and score each pipeline on the held-out set.
for k in range(1, 26):
    imp = SimpleImputer(missing_values=np.nan, strategy="mean")
    knn = KNeighborsClassifier(n_neighbors=k)
    pipeline = Pipeline([('imputation', imp), ('k_nearest_neighbors', knn)])
    pipeline.fit(X_train, Y_train)
    y_pred = pipeline.predict(X_test)
    print(k, pipeline.score(X_test, Y_test))
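Note that the pipeline above still fills missing values with the column mean and only uses KNN as the downstream classifier. If you want KNN to do the imputing itself, scikit-learn provides KNNImputer; a minimal sketch, assuming the features are in a DataFrame X with NaN marking the missing entries:

from sklearn.impute import KNNImputer

# Each missing entry is replaced by the mean of that feature across
# the 5 nearest neighbors, measured on the features that are observed.
knn_imputer = KNNImputer(n_neighbors=5)
X_imputed = knn_imputer.fit_transform(X)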

Disadvantages of KNN

- time-consuming on larger datasets

- on high dimensional data, accuracy can be severely degraded

Multiple Imputation

Multiple imputation, for example via the MICE algorithm, works by running multiple regression models, with each missing value modeled conditionally on the observed (non-missing) values. The power of multiple imputation is that it can impute mixes of continuous, binary, unordered categorical, and ordered categorical data.

The steps to multiple imputations are:

imputing data with mice()

building model using with()

pooling results for all models using pool()

In R, the mice package offers multiple imputation.

library(mice)

# Single imputation using (deterministic) regression prediction.
imp <- mice(diabetes, method="norm.predict", m=1)
data_imp <- complete(imp)

# Proper multiple imputation: create 5 imputed datasets,
# fit the model on each one, then pool the results.
imp <- mice(diabetes, m=5)
fit <- with(data=imp, lm(y~x+z))
combine <- pool(fit)
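For a Python route to the same idea, scikit-learn's experimental IterativeImputer implements a MICE-style chained-equations approach; a minimal sketch, assuming the numeric diabetes data is in a DataFrame named diabetes:

# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Each column with missing values is modeled from the other columns,
# and the imputations are refined over several iterations.
mice_imputer = IterativeImputer(max_iter=10, random_state=0)
diabetes_imputed = mice_imputer.fit_transform(diabetes)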

Disadvantages of MICE: