This article is highly interactive. Have fun scrolling through it!

A perfect dataset, with no incomplete or missing data, just doesn't exist. Dealing with missing values is one of the most commonly encountered problems in data science. In any pipeline there is always a step where you have to handle them. And it's not so simple: there are actually multiple kinds of missing values. In this article, we explore what kinds of missing values can be encountered, which methods are used today to tackle them, and finally how craft ai deals with them.

Multiple kinds of Missing Values

In general, data engineers know how the data have been collected and can therefore make an assumption about the type of missing values they are facing. For instance, missing values can occur if a sensor does not send any information above a certain threshold, or if it simply doesn't work well and sends only partial data. Throughout this article we will illustrate the notions around missing values using the SML2010 dataset, a weather dataset generated by two temperature and humidity sensors. In the following, we will consider only the behaviour of the temperature sensor. This sensor can exhibit three different behaviours, corresponding to the 3 main kinds of missing values: Missing Not At Random (MNAR), Missing At Random (MAR) and Missing Completely At Random (MCAR).

Missing Not At Random - MNAR:

Our temperature sensor works properly up to a certain threshold; above it, the sensor stops working and thus generates missing values. In the following illustration, by moving the temperature threshold (using the red bar), you can see temperature values becoming missing. Please also note the difference between the distributions of missing and non-missing values.

In this case, missing values are said to be Missing Not At Random, i.e. the probability that a variable is missing depends on the variable itself (here the temperature).

Drag the bar in the scatter plot in order to modify the temperature threshold.
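For readers without the interactive plot, here is a minimal sketch of the same behaviour. It uses synthetic readings rather than the SML2010 data, and the function name `inject_mnar` and the threshold of 25 are illustrative assumptions, not part of the original demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the temperature readings (not the SML2010 data).
temperature = rng.normal(loc=20.0, scale=5.0, size=1000)

def inject_mnar(values, threshold):
    """Return a copy where every value above `threshold` is replaced by NaN."""
    out = values.astype(float).copy()
    out[values > threshold] = np.nan
    return out

observed = inject_mnar(temperature, threshold=25.0)
missing_mask = np.isnan(observed)

# The missing readings are exactly the hot ones, so the two groups differ.
print("mean of missing readings :", temperature[missing_mask].mean())
print("mean of observed readings:", temperature[~missing_mask].mean())
```

In real life the first of these two means could not be computed, since the hot readings are precisely the ones that were never recorded.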

Missing At Random - MAR:

Let's take a new temperature sensor that works at any temperature, so it doesn't generate missing values as before. Unluckily, it turns out that this one doesn't work above a certain humidity level! So this time, above a particular humidity threshold the temperature sensor stops working, and data become missing. In the following illustration, by moving the humidity threshold (using the red bar), you can see temperature values becoming missing.

In this case, missing values are said to be Missing At Random, i.e. the probability that a variable is missing depends on other, observed variables (here the humidity) and not on the variable itself.

Please note that the difference between the distributions of missing and non-missing values is much the same as in the previous Missing Not At Random (MNAR) section. One important conclusion is that MNAR and MAR cannot be distinguished from each other.

Drag the bar in the scatter plot in order to modify the humidity threshold.
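The same idea can be sketched in a few lines. The data are synthetic, and the linear link between humidity and temperature, the name `inject_mar` and the threshold of 70 are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, negatively correlated humidity and temperature (illustrative only).
humidity = rng.uniform(20.0, 90.0, size=1000)
temperature = 35.0 - 0.2 * humidity + rng.normal(0.0, 1.5, size=1000)

def inject_mar(target, driver, threshold):
    """Blank out `target` wherever the other variable `driver` exceeds `threshold`."""
    out = target.astype(float).copy()
    out[driver > threshold] = np.nan
    return out

observed_temp = inject_mar(temperature, humidity, threshold=70.0)
missing_mask = np.isnan(observed_temp)

# Missingness is fully explained by humidity, which is still observed.
print("lowest humidity among missing rows:", humidity[missing_mask].min())
print("share of temperature missing      :", missing_mask.mean())
```

Unlike the MNAR case, the variable that drives the missingness is still available in the dataset, which is what later imputation methods can exploit.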

Missing Completely at Random - MCAR:

I've already bought 2 temperature sensors, but let's buy a third one! This one finally works under any temperature or humidity conditions. But it's just crap and drops some measurements at random. In the collected data, nothing can explain why it sometimes just doesn't send data. In this case, missing values are said to be Missing Completely at Random, i.e. the probability that a variable is missing does not depend on other variables, nor on the variable itself. Now, the distributions of missing and non-missing values overlap, and therefore it is easy to distinguish MCAR from MAR/MNAR.

Use the slider below to set the probability of the *temperature* variable to be missing from 0 to 100%.
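As a rough equivalent of the slider, the sketch below drops each reading independently with a fixed probability. The readings are synthetic, and `inject_mcar` and the 30% rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
temperature = rng.normal(20.0, 5.0, size=5000)

def inject_mcar(values, p_missing, rng):
    """Drop each value independently with probability `p_missing`."""
    out = values.astype(float).copy()
    out[rng.random(values.shape[0]) < p_missing] = np.nan
    return out

observed = inject_mcar(temperature, p_missing=0.3, rng=rng)
missing_mask = np.isnan(observed)

# Both groups are random subsets of the same readings, so their means are close.
print("mean of missing readings :", temperature[missing_mask].mean())
print("mean of observed readings:", temperature[~missing_mask].mean())
```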

In the previous example, I was the sensor's owner AND the data scientist, meaning that I had some knowledge about the behavior of my sensor. Unfortunately, it can happen that data scientists (or even the domain experts) cannot make any assumption about the missingness of the collected data. In this case a statistical test is required to determine its type. This test consists in separating the dataset into two groups, one with missing values and the rest without. The difference between these two groups can be measured with a Student's t-test and, as mentioned before, if the two distributions are different, it can be concluded that the missing values are not MCAR but MNAR or MAR. But there is no method to distinguish MNAR from MAR, because we cannot infer that the missingness of a variable depends only on the other variables.
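Here is one way such a test could look, as a sketch under stated assumptions: the rows are split by whether temperature is missing, and the humidity of the two groups is compared. The data and both masks are synthetic, the helper `welch_t` is hypothetical, it uses Welch's unequal-variance variant of the t-test, and its p-value relies on a large-sample normal approximation rather than the exact Student distribution.

```python
import math
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and a large-sample two-sided p-value (normal approximation)."""
    se = math.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    t = (a.mean() - b.mean()) / se
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p

rng = np.random.default_rng(3)
humidity = rng.uniform(20.0, 90.0, size=2000)

# Two hypothetical missingness masks for the temperature column.
mar_mask = humidity > 70.0            # driven by humidity
mcar_mask = rng.random(2000) < 0.3    # pure chance

for name, mask in [("MAR", mar_mask), ("MCAR", mcar_mask)]:
    t, p = welch_t(humidity[mask], humidity[~mask])
    print(f"{name}: t = {t:.2f}, p = {p:.3g}")
```

A very small p-value suggests the missing values are not MCAR. For an exact p-value, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same comparison.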

Are Missing Values valuable?

Often, the first reaction when missing values appear in a dataset is to simply drop them. But dropping missing values can actually have a catastrophic impact on model accuracy, because the dropped samples may have held a lot of information. It's true that when only a small percentage (less than 5%) of a single variable is missing, dropping those samples usually leads to small differences in accuracy. For multivariate datasets where many attributes can be missing, however, this may result in large performance drops. In the next section, we will see several methods to handle missing values in data science.
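To get a feel for why the multivariate case hurts more, the sketch below counts how many rows survive when every row with at least one missing value is dropped. The setup is an assumption made for illustration: 10 synthetic columns, each with 5% of values missing completely at random. It measures lost rows, not model accuracy.

```python
import numpy as np

rng = np.random.default_rng(4)
n_rows, n_cols, p_missing = 10_000, 10, 0.05

data = rng.normal(size=(n_rows, n_cols))
data[rng.random(data.shape) < p_missing] = np.nan   # MCAR holes in every column

# Listwise deletion: keep only rows with no missing value at all.
complete_rows = ~np.isnan(data).any(axis=1)

print("rows kept when dropping on one column :", (~np.isnan(data[:, 0])).mean())
print("rows kept when dropping on all columns:", complete_rows.mean())
print("expected for all columns (0.95 ** 10) :", (1 - p_missing) ** n_cols)
```

With independent 5% holes in 10 columns, only about 60% of the rows are complete, even though each column on its own looks almost intact.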

Everything you always wanted to know about handling Missing Values* (*But were afraid to ask)

I gotta filling!

The most common way to handle missing values consists in substituting them with an approximation. These methods are called imputation, and many imputation techniques exist.

Mean Median

The most common method consists in using the mean or the median for continuous values and the mode for categorical values.

In the next graph, you can choose the kind of missing values (use the tabs to switch between them), inject them into your data, and finally reconstruct the data using the Mean or Median button. This gives you an idea of the impact these methods can have on your data.

As you can see, this technique is very fast computationally, but it doesn't take into account the correlation between variables and often does not lead to the best possible result.
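A minimal sketch of these three fills, on a handful of made-up values. The helper names `impute_constant` and `impute_mode` are hypothetical, and missing categories are assumed to be encoded as `None`.

```python
from collections import Counter
import numpy as np

def impute_constant(values, strategy="mean"):
    """Fill NaNs in a 1-D float array with the mean or median of the observed values."""
    fill = np.nanmean(values) if strategy == "mean" else np.nanmedian(values)
    return np.where(np.isnan(values), fill, values)

def impute_mode(labels):
    """Fill None entries in a list of categories with the most frequent category."""
    mode = Counter(x for x in labels if x is not None).most_common(1)[0][0]
    return [mode if x is None else x for x in labels]

temperature = np.array([18.0, 21.0, np.nan, 19.0, 30.0, np.nan])
print(impute_constant(temperature, "mean"))     # NaNs become 22.0
print(impute_constant(temperature, "median"))   # NaNs become 20.0
print(impute_mode(["sunny", None, "rainy", "sunny"]))
```

Every missing entry receives the same value, whatever the other variables say, which is why the correlation between variables is ignored.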

K-Nearest Neighbors

Another technique consists in applying the famous KNN algorithm. To illustrate this method, we add a new sensor measuring the sun irradiance. As before, missing values can be injected using the different missingness mechanisms. Another scatter plot shows the relation between the humidity and the sun irradiance; in it, a red dot means that the temperature value is missing. With the KNN slider you can adjust the K value for the K-Nearest Neighbors algorithm, then run it on our dataset with the Start KNN button (missing points can also be clicked in order to launch KNN on them). You can experiment and see that the KNN algorithm performs especially well in an MCAR context.
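The sketch below shows one plausible way to write such a KNN imputation with NumPy only. It is not the code behind the demo: the data are synthetic, the relation between the three variables is invented, and `knn_impute` is a hypothetical helper that averages the temperature of the K nearest rows in (humidity, irradiance) space.

```python
import numpy as np

def knn_impute(features, target, k=5):
    """Fill NaNs in `target` with the mean target of the k nearest complete rows.

    Distances are Euclidean, computed on standardized `features` columns.
    """
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    known = ~np.isnan(target)
    filled = target.copy()
    for i in np.flatnonzero(~known):
        dist = np.linalg.norm(z[known] - z[i], axis=1)
        nearest = np.argsort(dist)[:k]
        filled[i] = target[known][nearest].mean()
    return filled

rng = np.random.default_rng(5)
humidity = rng.uniform(20.0, 90.0, size=500)
irradiance = rng.uniform(0.0, 1000.0, size=500)
# Synthetic relation, chosen only for illustration.
temperature = 15.0 + 0.01 * irradiance - 0.05 * humidity + rng.normal(0.0, 0.5, size=500)

observed = temperature.copy()
hidden = rng.random(500) < 0.2    # MCAR: hide about 20% of the readings
observed[hidden] = np.nan

imputed = knn_impute(np.column_stack([humidity, irradiance]), observed, k=5)
mae = np.abs(imputed[hidden] - temperature[hidden]).mean()
print("mean absolute error on hidden values:", mae)
```

Standardizing the features keeps irradiance, which has a much larger scale, from dominating the distance. In practice, scikit-learn's `KNNImputer` offers a ready-made version of this idea.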

Other approaches

For categorical attributes, another technique consists in duplicating the samples and assigning all possible categorical values. This technique necessarily adds the right value at least once, but it also adds a lot of variance through all the other, wrong samples.
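As an illustration, the duplication step could look like this. The rows, the `sky` attribute and its three categories are made up, and `expand_missing_category` is a hypothetical helper.

```python
def expand_missing_category(rows, column, categories):
    """Replace each row whose `column` is None by one copy per possible category."""
    expanded = []
    for row in rows:
        if row[column] is None:
            expanded.extend({**row, column: value} for value in categories)
        else:
            expanded.append(dict(row))
    return expanded

rows = [
    {"temperature": 21.0, "sky": "sunny"},
    {"temperature": 17.5, "sky": None},    # categorical value is missing
]
expanded_rows = expand_missing_category(rows, "sky", ["sunny", "cloudy", "rainy"])
for row in expanded_rows:
    print(row)
```

The incomplete sample now appears three times, once with the right category and twice with a wrong one, which is where the extra variance comes from.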

Another approach consists in inferring the missing value from the known attributes. Any algorithm, such as linear regression, logistic regression, neural networks (for example a multilayer perceptron), or support vector machines, can be used.
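Taking linear regression as the simplest case, a sketch could look like the following. The data are synthetic with an assumed linear link between humidity and temperature, and `regression_impute` is a hypothetical helper built on NumPy's least-squares solver.

```python
import numpy as np

def regression_impute(features, target):
    """Fit ordinary least squares on complete rows, then predict the missing targets."""
    known = ~np.isnan(target)
    design = np.column_stack([np.ones(len(target)), features])   # intercept + features
    coef, *_ = np.linalg.lstsq(design[known], target[known], rcond=None)
    filled = target.copy()
    filled[~known] = design[~known] @ coef
    return filled

rng = np.random.default_rng(6)
humidity = rng.uniform(20.0, 90.0, size=300)
temperature = 35.0 - 0.2 * humidity + rng.normal(0.0, 1.0, size=300)

observed = temperature.copy()
missing = humidity > 70.0    # MAR: missing above a humidity threshold
observed[missing] = np.nan

imputed = regression_impute(humidity.reshape(-1, 1), observed)
mae = np.abs(imputed[missing] - temperature[missing]).mean()
print("mean absolute error on missing values:", mae)
```

Here the model has to extrapolate beyond the humidity range it was trained on, which only works this well because the synthetic relation really is linear.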

Decision trees can also be used as an imputation method. This technique builds a decision tree (regressor or classifier, depending on the type of the missing value) using the classic decision tree algorithm, Random Forests, or even XGBoost. The resulting model is then used to infer the missing values.

Clustering methods such as K-means, Fuzzy K-means or the Expectation Maximization algorithm, as well as K-Nearest Neighbors (KNN), can be used to replace missing values.

The algorithm itself deals with missing values

Some other techniques deal with missing values directly in the algorithm itself, without any preprocessing step. For example, for decision trees it is common to create a new branch for missing values (known as the null branch approach). This treats "missing" as a value in its own right, which is a very efficient way of handling the Missing Not At Random (MNAR) case. If you don't want to separate missing values from the other possible values, you can distribute them among all the children without creating a new branch (as proposed by Quinlan in his C4.5 algorithm). This is particularly useful for Missing Completely At Random (MCAR) values.
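The two strategies can be contrasted at a single split node. This is a simplified sketch, not a full tree implementation: the split "temperature <= 22.0" and the training counts are hypothetical, and each function returns the weight with which a sample is sent to each child.

```python
def route_null_branch(value, threshold):
    """Null branch approach: missing values get a child of their own."""
    if value is None:
        return {"null": 1.0}
    return {"left": 1.0} if value <= threshold else {"right": 1.0}

def route_fractional(value, threshold, n_left, n_right):
    """C4.5-style routing: a missing value is sent down every child, weighted by
    the share of training samples with a known value that went to each child."""
    if value is None:
        total = n_left + n_right
        return {"left": n_left / total, "right": n_right / total}
    return {"left": 1.0} if value <= threshold else {"right": 1.0}

# Hypothetical split "temperature <= 22.0" with 30 training samples left, 10 right.
print(route_null_branch(None, 22.0))                          # {'null': 1.0}
print(route_fractional(None, 22.0, n_left=30, n_right=10))    # {'left': 0.75, 'right': 0.25}
print(route_fractional(25.0, 22.0, n_left=30, n_right=10))    # {'right': 1.0}
```

With fractional routing, the final prediction is a weighted combination of the leaves the sample reaches, so the missing value never becomes a signal in itself.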

There is no perfect solution for handling missing values

At craft ai, we've decided to introduce a new type of data, called optional values. Optional values are variables that are known, by the sender, to be missing under some conditions, like MAR values. For instance, sending an optional value tells craft ai that the temperature sensor is out of its working range, rather than the value simply being missing. The sensor can still have missing data for any other reason, and craft ai will handle those in another way. Introducing this new type was very important, since it allows us to distinguish the various kinds of missing values while still extracting as much information as possible from every data point.