Because databases often contain missing, redundant, or obsolete data, we need to preprocess the data before mining it. According to Dorian Pyle, data preparation can represent as much as 60% of the total time spent on a data mining project.

“Much of the raw data contained in databases is unpreprocessed, incomplete, and noisy.”i (Larose, 2005)

Beyond the three methods for handling missing values (replacing them with a chosen value, a statistic or a random value), you can also use these methods:

A scatter plot lets us see the trend line, and a histogram gives a quick view of the distribution, which we can compare against the normal curve. Z-score standardization expresses each value as its distance from the mean in standard deviations, so it flags any value that lies more than 2 standard deviations below or above the mean (3 is another common cutoff). (By the way, we used to have a z-score on our report cards at the cegep level back then. I don’t know if that’s still the case?) We could also use linear regression, or min–max normalization, which rescales the values to the range 0 to 1 (0.5 being the midpoint of the range). Finally, the interquartile range (IQR) is the spread between the 25th and 75th percentiles, which covers the middle 50 percent of the data. It can be used to flag values that fall far outside that middle band, typically more than 1.5 × IQR below the 25th percentile or above the 75th. These numerical methods are better suited to identifying outliers.
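To make the numerical methods a bit more concrete, here is a minimal Python sketch of z-score standardization, min–max normalization and the IQR rule, using only the standard library. The function names, the sample values, the cutoff of 2 standard deviations and the 1.5 × IQR fences are my own assumptions for illustration; they are not taken from Larose.

```python
import statistics


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]


def z_score_outliers(values, threshold=2.0):
    """Return the values whose |z-score| exceeds the threshold (assumed: 2)."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


def min_max(values):
    """Rescale values to the range 0..1 (the minimum maps to 0, the maximum to 1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: nine ordinary values and one suspicious entry.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 50]

print(z_score_outliers(sample))  # [50]
print(iqr_outliers(sample))      # [50]
print(min_max([0, 5, 10]))       # [0.0, 0.5, 1.0]
```

On a sample this small, one extreme value inflates the standard deviation, so the z-score rule can miss outliers that the IQR rule still catches. That is one reason to try more than one method.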

I personally think the best method would be to set up as many validation alerts as possible upstream in the system, at the point where users enter the data. For instance, to return to the examples in Larose, there should be alerts when:

The user enters a value other than 4, 6 or 8 cylinders. For instance, it could be mandatory for the user to enter valid data before moving on to another field.

The number looks like an outlier. In that case, a dialogue box should ask the user to double-check that the entry is correct. Of course, among the best methods for dealing with outliers are neural networks and the k-nearest neighbor algorithm.

Users should also choose among predefined entries whenever possible (e.g. a drop-down list, radio buttons, checkboxes or any other predefined set of options). That way, we normalise the answers, and it is easier to sort the data afterward.

The field in which we expect data should be as descriptive as possible (e.g. “Enter a valid 5-digit US zip code. If you don’t live in the US, choose the option N/A.”). Another example could be: “Enter your annual income in CAN$. If your business had a loss of revenue, enter it in another field.” Of course, the best option would be that, when the user selects Canada in the country field, the postal code field is automatically formatted (letter, number, letter, then number, letter, number).
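Here is a rough Python sketch of what such upstream checks could look like. Everything in it is hypothetical: the function names, the income range used for the confirmation prompt, and the choice to check only the shape of a Canadian postal code (not whether the code really exists) are assumptions made for illustration.

```python
import re

VALID_CYLINDERS = {4, 6, 8}  # predefined set, as in a drop-down list
US_ZIP = re.compile(r"[0-9]{5}")
CA_POSTAL = re.compile(r"([A-Za-z][0-9][A-Za-z])[ -]?([0-9][A-Za-z][0-9])")


def check_cylinders(value):
    """Hard rule: reject anything outside the predefined set."""
    return value in VALID_CYLINDERS


def check_us_zip(text):
    """Accept a 5-digit zip code, or N/A for users outside the US."""
    text = text.strip()
    return text == "N/A" or US_ZIP.fullmatch(text) is not None


def format_ca_postal(text):
    """Normalise a Canadian postal code to the form 'A1A 1A1'.

    Returns None when the input does not have the expected shape.
    """
    match = CA_POSTAL.fullmatch(text.strip())
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}".upper()


def needs_confirmation(value, low, high):
    """Soft rule: the value is allowed, but it falls outside the
    expected range, so the form should ask the user to confirm it."""
    return not (low <= value <= high)


print(check_cylinders(5))                         # False
print(check_us_zip("90210"))                      # True
print(format_ca_postal("h2x-1y4"))                # H2X 1Y4
print(needs_confirmation(10_000_000, 0, 500_000))  # True
```

The distinction between the hard rules (block the entry) and the soft rule (ask for confirmation) matters: a real outlier can still be a correct value, so the form should question it rather than refuse it.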

Dominique Loyer

i Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, 2005.
