The Beginner's Guide to Dimensionality Reduction Explore the methods that data scientists use to visualize high-dimensional data. Click to Begin

Art or science? Your browser has just loaded information about roughly 800 artworks from the collection at the Metropolitan Museum of Art. The museum has publicly released a large dataset about their collection[5], just a small fraction are displayed here. They are positioned randomly. Hover over an artwork to see its details.

Each artwork includes basic metadata, such as its title, artist, date made, medium, and dimensions. Data scientists like to call metadata for each data point (artwork) features. Below are some of the features of 10 artworks in the dataset.

Year Title Artist 1680 Spindle-back armchair 1875 Armchair Pottier and Stymus Manufacturing Company 1776 Basket Myer Myers 1650 Basin Master Potter A 1710 Two-handled Bowl Cornelius Kierstede 1876 The Bryant Vase James Horton Whitehouse|Tiffany & Co.|Eugene J. Soligny|Augustus Saint-Gaudens 1765 Bureau table John Townsend 1880 Cabinet Daniel Pabst 1866 Cabinet Alexander Roux 1835 Celery vase Boston & Sandwich Glass Company Loading...



Projecting onto a line These features can be thought of as vectors existing in a high-dimensional space. Visualizing the vectors would reveal a lot about the distribution of the data, however humans can’t see so many dimensions all at once. Instead the data can be projected onto a lower dimension, one that can be visualized directly. This kind of projection is called an embedding. Computing a 1-dimensional embedding requires taking each artwork and computing a single number to describe it. A benefit of reducing to 1D is that the numbers, and the artworks, can be sorted on a line. On the right you see the artwork positioned according to their average pixel brightness. Notice that the images are sorted, with the darkest images appearing at the top and the brightest images on the bottom!

For the mathematically inclined Dimensionality reduction can be formulated mathematically in the context of a given dataset. Consider a dataset represented as a matrix X X X , where X X X is of size m × n m \times n m×n , where m m m represents the number of rows of X X X , and n n n representes the number of columns. Typically, the rows of the matrix are data points and the columns are features. Dimensionality reduction will reduce the number of features of each data point, turning X X X into a new matrix, X ′ X' X′ , of size m × d m \times d m×d , where d < n d < n d<n . For visualizations we typically set d d d to be 1, 2 or 3. Say m = n m=n m=n , that is X X X is a square matrix. Performing dimensionality reduction on X X X and setting d = 2 d=2 d=2 will change it from a square matrix to a tall, rectangular matrix. X = [ x x x x x x x x x ] ⟹ [ x ′ x ′ x ′ x ′ x ′ x ′ ] = X ′ X = \begin{bmatrix} x & x & x \\ x & x & x \\ x & x & x \end{bmatrix} \implies \begin{bmatrix} x' & x' \\ x' & x' \\ x' & x' \end{bmatrix} = X' X = ⎣ ⎡ ​ x x x ​ x x x ​ x x x ​ ⎦ ⎤ ​ ⟹ ⎣ ⎡ ​ x ′ x ′ x ′ ​ x ′ x ′ x ′ ​ ⎦ ⎤ ​ = X ′ Each data point only has two features now, i.e., each data point has been reduced from a 3 dimensional vector to a 2 dimensional vector.

Embedding data in two dimensions The same brightness feature can be used to position the artworks in 2D space instead of 1D. The pieces have more room to spread out. On the right you see a simple 2-dimensional embedding based on image brightness, but this isn’t the only way to position the artworks. In fact, there are many, and some projections are more useful than others. Use the slider to vary the influence that the brightness and artwork age have in determining the embedding positions. As you move the slider from brightness to artwork age, the embedding changes from highlighting bright and dark images, and starts to cluster recent modern-day images in the bottom left corner whereas older artworks are moved farther away (hover over images to see their date). Artwork Age Brightness Show technical details

Real-world algorithms The previous section showed an example of a user-driven embedding, where the exact influence of each feature is known. However, you may have noticed that it’s hard to find meaningful combinations of feature weights. State-of-the-art algorithms can find an optimal combination of features so that distances in the high dimensional space are preserved in the embedding. Use the tool below to project the artworks using three commonly used algorithms. In this example the reduction is performed on the pixels of each image: each image is flattened into a single vector, where each pixel represents one feature. The vectors records are then reduced to two dimensions. PCA t-SNE UMAP Principal component analysis Pros: Relatively computationally cheap.

Can save embedding model to then project new data points into the reduced space. Cons: Linear reduction limits information that can be captured; not as discriminably clustered as other algorithms. There are many algorithms that compute a dimensionality reduction of a dataset. Simpler algorithms such as principal component analysis (PCA) maximize the variance in the data to produce the best possible embedding. More complicated algorithms, such as t-distributed stochastic neighbor embedding (t-SNE)[2], iteratively produce highly clustered embeddings. Unfortunately, whereas before the influence of each feature was explicitly known, one must relinquish control to the algorithm to determine the best embedding— that means that it is not clear what features of the data are used to compute the embedding. This can be problematic for misinterpreting what an embedding is showing[11]. Dimensionality reduction, and more broadly the field of unsupervised learning, is an active area of research where researchers are developing new techniques to create better embeddings. A new technique, uniform manifold approximation and projection (UMAP)[4], is a non-linear reduction that aims to create visually striking embeddings fast, scaling to larger datasets.