Klustr: get it? ;)

Last month, I was super inspired by Leon Fedden’s post comparing dimensionality reduction techniques like UMAP and TSNE on features such as STFT (short-time Fourier transform) and WaveNet features. The post came out right as I was working on a final project with Avneesh Sarwate for a course on Audio Content Analysis. Our project started with the premise of using techniques from Kyle McDonald’s Infinite Drum Machine to solve a producer’s worst nightmare: endless scrolling through samples!

All our work (including extracted features and embeddings) is available as a self-contained Docker container on my GitHub. If you are unsure how to use Docker, you can follow along with our nicely prepared Jupyter notebook.

Essentially, we tried to find the optimal combination of feature extraction and dimensionality reduction techniques for producing a 2D map of drum samples typically found in hip-hop, electronic and pop music. We built on top of previous work by scraping ground truth labels from a large collection of drum samples, which allowed us to compare the quality of the embeddings produced by different techniques and select the best one from a population of plots. Lastly, we optimized the entire process for time.

From speech to images to sensor input, data from the natural world is often high dimensional. Instead of interacting directly with this high dimensional data, however, a collection of audio samples is often organized using simple high-level descriptors, such as the type of sound, e.g. “vocal_shout”. These labels are often not available, and when they are, they do not capture the nuances of the relationships between sounds. How similar are vocal_shout_1.wav and vocal_shout_2.wav? What about vocal_shout_7.wav?

In order to build relationships between samples at the timbre level, we must form representations drawn from the high dimensional audio data. These are typically STFT or MFCC features, but even these have a dimensionality on the order of several thousand. Our work presents an approach that lets users navigate these relatively high dimensional features in 2D space.

Our study yielded a lot of beautiful plots, color coded by sample type (e.g. black = kick drums, red = snares…). The Jupyter notebook contains all the plots from our study.

TSNE 2D maps of MFCC features from a dataset of 10,000 drum samples

We defined a set of scores that ranked the “visual quality” of an embedding: a combination of silhouette score and metrics like “roundness”, which used the Polsby-Popper test to determine how spread out the samples were in 2D space. This enabled us to sample a population of embeddings with different TSNE, PCA and UMAP hyperparameters and find nice embeddings that musicians could use to navigate their sample collection.
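To make the “roundness” idea concrete, here is a minimal sketch of a Polsby-Popper score for a 2D point cloud. The Polsby-Popper measure is 4πA / P², which equals 1 for a circle and approaches 0 for long, thin shapes. Taking the area A and perimeter P from the convex hull of the points is an assumption made for this illustration (the notebook may derive the shape differently), and the function names `convex_hull` and `polsby_popper` are hypothetical.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) < 3:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


def polsby_popper(points):
    """4*pi*A / P**2 of the convex hull (the hull is an assumption of this sketch)."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0  # degenerate: all points coincide or are collinear
    area2 = 0.0
    perimeter = 0.0
    for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]):
        area2 += x1 * y2 - x2 * y1          # shoelace formula (twice the area)
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return 4 * math.pi * (abs(area2) / 2) / perimeter ** 2


# A unit square (with one interior point) has A = 1 and P = 4.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
print(polsby_popper(square))
```

An embedding whose points fill a roughly circular region scores close to 1, while one squashed into a thin streak scores close to 0, so a score like this could be combined with the silhouette score to rank candidate plots.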

Our best mapping for organizing drum samples based on timbre

We also want to note that UMAP (Uniform Manifold Approximation and Projection) is FAST. TSNE could take 45 minutes to compute, whereas UMAP would take just a couple of minutes to compute the embeddings. We managed to cut the embedding time by first projecting the STFT features onto their first 14 PCA components, and then reducing this to 2 dimensions using UMAP.
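The two-stage reduction can be sketched roughly as follows. This is an illustration rather than the notebook's actual code: the PCA step is written with a plain NumPy SVD, the random matrix stands in for real flattened STFT features, and `pca_project` is a hypothetical helper name. The UMAP step is left as a comment because it assumes the umap-learn package is installed.

```python
import numpy as np


def pca_project(features, n_components=14):
    """Project the rows of `features` onto their top principal components."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)  # centre each feature column
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T


# Placeholder data: 200 "samples" with 512 features each (not real STFT features).
rng = np.random.default_rng(0)
stft_features = rng.normal(size=(200, 512))

reduced = pca_project(stft_features, n_components=14)
print(reduced.shape)  # (200, 14)

# Second stage, assuming umap-learn is available:
#   import umap
#   embedding = umap.UMAP(n_components=2).fit_transform(reduced)
```

Handing UMAP 14 columns instead of several thousand means its nearest-neighbour search works on much smaller vectors, which is where the time saving would be expected to come from.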

These findings can be used to develop a tool where users “drag and drop” their sample bank, and the app automatically generates the mapping in 2D space, allowing the musician to quickly audition similar sounding samples!

We also found some very heartwarming mappings…