t-SNE is one of the most popular dimensionality reduction algorithms. It represents a high-dimensional dataset in a 2- or 3-dimensional space, making it easy to visualize high-dimensional data points. In this post, we explain the algorithm and give a light overview of the math. Together with this post, we wrote a Jupyter notebook where we show an example of t-SNE using the sklearn and CUDA implementations.

The t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm is designed for visualizing high-dimensional data in scatter plots.

t-SNE reduces the dimensionality of a high-dimensional dataset $X=\{x_1, x_2, ..., x_n\}$ to a 2- or 3-dimensional dataset $Y=\{y_1, y_2, ..., y_n\}$ by modeling the high-dimensional Euclidean distances between datapoints as joint probabilities $p_{ij}$ that represent similarities. For nearby data points $p_{ij}$ is relatively high, and for widely separated points $p_{ij}$ is low. The joint probability is computed from the conditional probability $p_{j|i}$, which is modeled as a Gaussian centered at $x_i$, and is defined as $p_{ij} = (p_{j|i} + p_{i|j})/2n$, where:

$$p_{j|i} = \frac{\exp( - \lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp( - \lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}$$
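As a concrete illustration, these affinities can be computed directly with NumPy. This is a minimal sketch, not the real implementation: it assumes one fixed bandwidth `sigma` for every point, whereas t-SNE finds each $\sigma_i$ by binary search so that the conditional distribution matches a user-chosen perplexity (the function name is ours).

```python
import numpy as np

def high_dim_affinities(X, sigma):
    """Symmetrized joint probabilities p_ij for the rows of X.

    Simplification: a single bandwidth `sigma` shared by all points,
    instead of a per-point sigma_i tuned to a target perplexity.
    """
    n = X.shape[0]
    # Squared Euclidean distances ||x_i - x_j||^2 via broadcasting
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Unnormalized Gaussian affinities
    P = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)               # p_{i|i} = 0 by convention
    P = P / P.sum(axis=1, keepdims=True)   # row-normalize -> p_{j|i}
    return (P + P.T) / (2.0 * n)           # symmetrize -> p_ij
```

Since each row of the conditional matrix sums to 1, the symmetrized $p_{ij}$ sum to 1 over all pairs, so they form a valid joint distribution.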

For the low-dimensional space, the authors use a Student t-distribution with one degree of freedom to convert pairwise distances into probabilities:

$$q_{ij} = \frac{ ( 1 + \lVert y_i - y_j \rVert^2 )^{-1} }{\sum_{k \neq l} ( 1 + \lVert y_k - y_l \rVert^2 )^{-1}}$$
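The low-dimensional affinities are even simpler to compute, since there is no per-point bandwidth. A minimal NumPy sketch (again, the function name is our own):

```python
import numpy as np

def low_dim_affinities(Y):
    """Joint probabilities q_ij under a Student t-distribution with 1 dof."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)   # heavy-tailed kernel (1 + d^2)^(-1)
    np.fill_diagonal(inv, 0.0)     # q_{ii} = 0 by convention
    return inv / inv.sum()         # normalize over all pairs
```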

The t-distribution was chosen for two reasons. First, it addresses the crowding problem, which the authors illustrate as follows: in ten dimensions we can place 11 equidistant datapoints, but it is impossible to model them faithfully in a two-dimensional map. The Student t-distribution is heavy-tailed, and therefore almost scale-invariant for points that are far apart. This effectively means that when pairwise distances are mapped from the high-dimensional space to the low-dimensional space, the relative scale between points is roughly maintained.

Second, the t-distribution is computationally faster to evaluate than the Gaussian, since it does not involve an exponential.

The final step in the algorithm is to minimize the mismatch between $p_{ij}$ and $q_{ij}$, measured by the Kullback-Leibler divergence, using gradient descent on the map points $y_i$:

$$\min_Y KL(P \,\|\, Q) = \min_Y \sum_i \sum_j p_{ij}\log\frac{p_{ij}}{q_{ij}}$$
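This cost and its gradient are what an implementation actually iterates on; the original paper derives the closed form $\partial C/\partial y_i = 4\sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \lVert y_i - y_j \rVert^2)^{-1}$. Below is a hedged sketch of the cost and one plain gradient-descent step, with `P` being the joint-probability matrix from the high-dimensional side; real implementations add momentum and early exaggeration, which we omit.

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q); eps guards against division by zero in the log."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

def gradient_step(Y, P, learning_rate=100.0):
    """One vanilla gradient-descent step on the map points Y."""
    diff = Y[:, None, :] - Y[None, :, :]              # y_i - y_j
    inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # (1 + ||y_i-y_j||^2)^-1
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                               # current q_ij
    # Gradient of the KL cost from the paper's closed form
    grad = 4.0 * np.sum((P - Q)[:, :, None] * diff * inv[:, :, None], axis=1)
    return Y - learning_rate * grad, Q
```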

In the accompanying Jupyter notebook, we use a CNN to generate high-dimensional features from images and then show how they can be projected into a two-dimensional space and visualized with t-SNE.
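To give a feel for the sklearn side, here is a minimal usage sketch with random vectors standing in for the CNN features (the shapes are placeholders, not the ones from the notebook):

```python
import numpy as np
from sklearn.manifold import TSNE

# Random vectors as a stand-in for high-dimensional CNN features.
features = np.random.RandomState(0).randn(200, 512)

# Project 512-dimensional features down to 2 dimensions.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(features)

print(embedding.shape)  # (200, 2): one 2-D point per input feature vector
```

The `embedding` array can then be fed straight into a scatter plot, coloring each point by its image's class label.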