In this guide, I will explain how to cluster a set of documents using Python. My motivating example is to identify the latent structures within the synopses of the top 100 films of all time (per an IMDb list). See the original post for a more detailed discussion of the example. This guide covers:

tokenizing and stemming each synopsis

transforming the corpus into vector space using tf-idf

calculating cosine distance between each pair of documents as a measure of similarity

clustering the documents using the k-means algorithm

using multidimensional scaling to reduce dimensionality within the corpus

plotting the clustering output using matplotlib and mpld3

conducting a hierarchical clustering on the corpus using Ward clustering

plotting a Ward dendrogram

topic modeling using Latent Dirichlet Allocation (LDA)
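
As a rough preview of how the core steps above fit together, here is a minimal sketch using scikit-learn and SciPy. It assumes four made-up one-sentence "synopses" (the `docs` list is hypothetical, not the IMDb data) and skips stemming, multidimensional scaling, plotting, and LDA. Computing the Ward linkage on the dense tf-idf vectors is one reasonable choice for illustration, not necessarily the one used later in the guide.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import ward

# Hypothetical toy corpus standing in for the film synopses.
docs = [
    "A mafia family fights rival gangs for control of the city.",
    "A gangster rises through the mafia and betrays his family.",
    "Soldiers fight a brutal war on a distant battlefield.",
    "A young soldier survives the war and returns from the battlefield.",
]

# Transform the corpus into a tf-idf weighted document-term matrix.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(docs)

# Cosine distance is 1 minus cosine similarity (0 means identical direction).
dist = 1 - cosine_similarity(tfidf_matrix)

# Partition the documents into k clusters with k-means (k=2 is an assumption).
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tfidf_matrix)

# Build a Ward hierarchical clustering from the dense tf-idf vectors.
linkage_matrix = ward(tfidf_matrix.toarray())

print(labels)
print(dist.round(2))
```

With a corpus this small, the two mafia sentences should land in one cluster and the two war sentences in the other, since the two groups share no non-stop-word terms.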

My GitHub repo for the whole project is available. The 'cluster_analysis' workbook is fully functional; the 'cluster_analysis_web' workbook has been trimmed down for the purpose of creating this walkthrough. Feel free to download the repo and use 'cluster_analysis' to step through the guide yourself.

If you have any questions, feel free to reach out to me on Twitter at @brandonmrose.