UMAP for Supervised Dimension Reduction and Metric Learning

While UMAP can be used for standard unsupervised dimension reduction, the algorithm offers significant flexibility, allowing it to be extended to other tasks, including making use of categorical label information to do supervised dimension reduction, and even metric learning. We'll look at some examples of how to do that below.

First we will need to load some base libraries – numpy, obviously, but also mnist to read in the Fashion-MNIST data, and matplotlib and seaborn for plotting.

```python
import numpy as np
from mnist.loader import MNIST
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

sns.set(style='white', context='poster')
```

Our example dataset for this exploration will be the Fashion-MNIST dataset from Zalando Research. It is designed to be a drop-in replacement for the classic MNIST digits dataset, but uses images of fashion items (dresses, coats, shoes, bags, etc.) instead of handwritten digits. Since the images are more complex it provides a greater challenge than MNIST digits. We can load it in (after downloading the dataset) using the mnist library. We can then package up the train and test sets into one large dataset, normalise the values (to be in the range [0,1]), and set up labels for the 10 classes.

```python
mndata = MNIST('fashion-mnist/data/fashion')
train, train_labels = mndata.load_training()
test, test_labels = mndata.load_testing()
data = np.array(np.vstack([train, test]), dtype=np.float64) / 255.0
target = np.hstack([train_labels, test_labels])
classes = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
           'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
```
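If you don't have the Fashion-MNIST files on hand, the stacking and scaling step above can be sketched with stand-in arrays of the same shape (synthetic data here, purely to illustrate what the preprocessing produces):

```python
import numpy as np

# stand-in arrays shaped like (a tiny slice of) the Fashion-MNIST splits:
# rows are flattened 28x28 images, values are raw pixel intensities in [0, 255]
train = np.random.randint(0, 256, size=(60, 784))
test = np.random.randint(0, 256, size=(10, 784))
train_labels = np.random.randint(0, 10, size=60)
test_labels = np.random.randint(0, 10, size=10)

# stack the train and test splits into one dataset,
# and scale the pixel values into the range [0, 1]
data = np.array(np.vstack([train, test]), dtype=np.float64) / 255.0
target = np.hstack([train_labels, test_labels])
```

With the real dataset `data` ends up with shape `(70000, 784)` and `target` with shape `(70000,)`; here the shapes are `(70, 784)` and `(70,)`.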

Next we’ll load the umap library so we can do dimension reduction on this dataset.

import umap

UMAP on Fashion MNIST

First we'll just do standard unsupervised dimension reduction using UMAP so we have a baseline of what the results look like for later comparison. This is simply a matter of instantiating a UMAP object (in this case setting the n_neighbors parameter to 5 – we are interested mostly in very local information), then calling the fit_transform() method with the data we wish to reduce. By default UMAP reduces to two dimensions, so we'll be able to view the results as a scatterplot.

```python
%%time
embedding = umap.UMAP(n_neighbors=5).fit_transform(data)
```

```
CPU times: user 1min 45s, sys: 7.22 s, total: 1min 52s
Wall time: 1min 26s
```

That took a little time, but not all that long considering it is 70,000 data points in 784-dimensional space. We can simply plot the results as a scatterplot, colored by the class of the fashion item. We can use matplotlib's colorbar with suitable tick labels to give us the color key.

```python
fig, ax = plt.subplots(1, figsize=(14, 10))
plt.scatter(*embedding.T, s=0.3, c=target, cmap='Spectral', alpha=1.0)
plt.setp(ax, xticks=[], yticks=[])
cbar = plt.colorbar(boundaries=np.arange(11) - 0.5)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(classes)
plt.title('Fashion MNIST Embedded via UMAP');
```

The result is fairly good. We successfully separated a number of the classes, and the global structure (separating pants and footwear from shirts, coats and dresses) is well preserved as well. Unlike the results for MNIST digits, however, there were a number of classes that did not separate quite so cleanly. In particular T-shirts, shirts, dresses, pullovers, and coats are all a little mixed. At the very least the dresses are largely separated, and the T-shirts are mostly in one large clump, but they are not well distinguished from the others.
Worse still are the coats, shirts, and pullovers (somewhat unsurprisingly as these can certainly look very similar) which all have significant overlap with one another. Ideally we would like much better class separation. Since we have the label information we can actually give that to UMAP to use!

Using Labels to Separate Classes (Supervised UMAP)

How do we go about coercing UMAP to make use of target labels? If you are familiar with the sklearn API you'll know that the fit() method takes a target parameter y that specifies supervised target information (for example when training a supervised classification model). We can simply pass the UMAP model that target data when fitting and it will make use of it to perform supervised dimension reduction!

```python
%%time
embedding = umap.UMAP().fit_transform(data, y=target)
```

```
CPU times: user 3min 28s, sys: 9.17 s, total: 3min 37s
Wall time: 2min 45s
```

This took a little longer – both because we are using a larger n_neighbors value (which is suggested when doing supervised dimension reduction; here we are using the default value of 15), and because we need to condition on the label data. As before we have reduced the data down to two dimensions, so we can again visualize it with a scatterplot, coloring by class.

```python
fig, ax = plt.subplots(1, figsize=(14, 10))
plt.scatter(*embedding.T, s=0.1, c=target, cmap='Spectral', alpha=1.0)
plt.setp(ax, xticks=[], yticks=[])
cbar = plt.colorbar(boundaries=np.arange(11) - 0.5)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(classes)
plt.title('Fashion MNIST Embedded via UMAP using Labels');
```

The result is a cleanly separated set of classes (and a little bit of stray noise – points that were sufficiently different from their class as not to be grouped with the rest). Aside from the clear class separation (which is expected – we gave the algorithm all the class information), there are a couple of important points to note. The first is that we have retained the internal structure of the individual classes.
Both the shirts and pullovers still have the distinct banding pattern that was visible in the original unsupervised case; the pants, t-shirts and bags all retained their shape and internal structure; and so on. The second point to note is that we have also retained the global structure. While the individual classes have been cleanly separated from one another, the inter-relationships among the classes have been preserved: footwear classes are all near one another; trousers and bags are at opposite sides of the plot; and the arc of pullovers, shirts, t-shirts and dresses is still in place. The key point is this: the important structural properties of the data have been retained while the known classes have been cleanly pulled apart and isolated. If you have data with known classes and want to separate them while still having a meaningful embedding of individual points, then supervised UMAP can provide exactly what you need.

Using Partial Labelling (Semi-Supervised UMAP)

What if we only have some of our data labelled, and a number of items are without labels? Can we still make use of the label information we do have? This is now a semi-supervised learning problem, and, yes, we can work with those cases too. To set up the example we'll mask some of the target information – we'll do this by using the sklearn standard of giving unlabelled points the label -1 (as with, for example, the noise points from a DBSCAN clustering).

```python
masked_target = target.copy().astype(np.int8)
masked_target[np.random.choice(70000, size=10000, replace=False)] = -1
```

Now that we have randomly masked some of the labels we can try to perform supervised learning again. Everything works as before, but UMAP will interpret the -1 label as being an unlabelled point and learn accordingly.

```python
%%time
fitter = umap.UMAP().fit(data, y=masked_target)
embedding = fitter.embedding_
```

```
CPU times: user 3min 8s, sys: 7.85 s, total: 3min 16s
Wall time: 2min 40s
```

Again we can look at a scatterplot of the data colored by class.

```python
fig, ax = plt.subplots(1, figsize=(14, 10))
plt.scatter(*embedding.T, s=0.1, c=target, cmap='Spectral', alpha=1.0)
plt.setp(ax, xticks=[], yticks=[])
cbar = plt.colorbar(boundaries=np.arange(11) - 0.5)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(classes)
plt.title('Fashion MNIST Embedded via UMAP using Partial Labels');
```

The result is much as we would expect – while we haven't cleanly separated the data as we did in the totally supervised case, the classes have been made cleaner and more distinct. This semi-supervised approach provides a powerful tool when labelling is potentially expensive, or when you have more data than labels, but still want to make use of that extra data.