What is missing data?¶

Not all missing data is equal. At the heart of the matter is the need to distinguish between two types of missingness:

Unknown but existing data: This is data that we know exists, but whose value we do not know due to sparse or incomplete sampling. There is some true value there, however, and it can be worth applying a missing data interpolation technique in order to estimate it. For example, in 2013 The New York Times published a survey of income mobility in the United States. As often happens in datasets that drill down this deep (to the county level), there were several counties for which the newspaper could not trace data. Yet it would be possible, and easy, if it were truly necessary, to interpolate reasonable values for these counties based on data from the surrounding ones, for instance, or based on data from other counties with similar demographic profiles. This is, fundamentally speaking, data that can be filled in by some means.
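As a quick sketch of the idea (using made-up county figures, not the Times data), linear interpolation with `pandas` can estimate an unknown-but-existing value from its neighbors:

```python
import numpy as np
import pandas as pd

# Hypothetical median income figures (in thousands); County C is unknown.
incomes = pd.Series([42.0, 45.0, np.nan, 51.0],
                    index=["County A", "County B", "County C", "County D"])

# Linear interpolation estimates the gap from the surrounding values.
filled = incomes.interpolate()
print(filled["County C"])  # → 48.0, the midpoint of 45.0 and 51.0
```

Interpolating from demographically similar counties, as the text suggests, would be a more involved model, but the principle is the same: the value exists, so we can estimate it.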

Data that doesn't exist: data that does not exist at all, in any shape or form. For example, it would make no sense to ask what the average household income is of residents of an industrial park or other such location where no people actually live. It would not really make sense to use 0 as a sentinel value in this case, either, because the existence of such a number implies in the first place the existence of people over whom an average can be taken: otherwise, in trying to compute an average, you are making a divide-by-zero error! This is, fundamentally speaking, data that cannot be filled in by any means.

This is an important distinction to keep in mind, and implementing it in some standard way significantly complicates the picture. It means that the question "is this data entry filled?" actually admits three possible answers: "Yes", "No, but it can be", and "No, and it cannot be". There are two dominant paradigms for handling this distinction:

Bitpatterns: Embed sentinel values into the array itself. For instance, for integer data one might take 0 or -9999 to signal unknown but existent data. This requires no storage overhead, but can be confusing, and robs you of values that you might otherwise want to use (like 0 or -9999).
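A minimal sketch of the bitpattern approach, using a hypothetical column of ages with -9999 as the sentinel:

```python
import numpy as np
import pandas as pd

# Hypothetical integer data using -9999 as a "missing" sentinel.
ages = np.array([34, 29, -9999, 41])

# To work with the data safely, the sentinel must be swapped for a real null.
# Note that this forces the column to float, since np.nan is a float value.
cleaned = pd.Series(ages).replace(-9999, np.nan)
print(cleaned.isnull().sum())  # → 1
```

The downside the text mentions is visible here: if -9999 ever appeared as a legitimate value, it would be silently treated as missing.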

Masks: Use a separate boolean array to "mask" the data whenever missing data needs to be represented. This requires allocating a second array and knowing when to apply it to the dataset, but is more robust.

NumPy is the linear algebra and vectorized mathematical operations library that underpins the Python scientific programming stack, and its methodology informs how everything else works. NumPy has masks: these are provided via the numpy.ma module. But it has no native bitpatterns: no performant native bitpattern NA type exists whatsoever.
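A short example of the numpy.ma masked-array approach, with invented data:

```python
import numpy as np
import numpy.ma as ma

# A masked array pairs the data with a boolean mask; True marks a hidden entry.
data = np.array([1.0, 2.0, 3.0, 4.0])
masked = ma.masked_array(data, mask=[False, False, True, False])

# Aggregations skip masked entries automatically:
# the mean here is taken over [1.0, 2.0, 4.0] only.
print(masked.mean())
```

Notice the cost the text alludes to: every masked array carries a second, full-size boolean array alongside the data.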

The lack of a native NA type, such as the one found in, say, R, is a huge problem for libraries, like pandas, that are meant to be able to efficiently handle large datasets.

Indeed, pandas does not use the numpy.ma mask. Masks are simply not performant enough for the purposes of a library that is expected to be able to handle literally millions of entries entirely in-memory, as pandas does. pandas instead defines and uses its own null-value sentinels, particularly NaN ( np.nan ) for null numbers and NaT (a pseudo-native datetime sentinel handled under the hood); it then allows you to build your own isnull() mask over your dataset on demand (more on that shortly).
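These sentinels can be seen in action with a small invented frame; isnull() produces the boolean mask only when you ask for it, rather than storing one alongside the data:

```python
import numpy as np
import pandas as pd

# pandas marks missing numbers with NaN and missing datetimes with NaT.
df = pd.DataFrame({
    "price": [10.0, np.nan, 12.5],
    "date": pd.to_datetime(["2013-01-01", None, "2013-01-03"]),
})

# isnull() builds a boolean mask on demand.
print(df.isnull())
print(df["price"].isnull().sum())  # → 1
```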

To read more on the conscientious backstory of missing data representation in `numpy` see the [NA-overview summary](http://www.numpy.org/NA-overview.html). To learn more about masked arrays see [its documentation](http://docs.scipy.org/doc/numpy/reference/maskedarray.html). For more on `pandas` missing data representation see the [`pandas` missing data documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html).