Latent Dirichlet Allocation (LDA) topic analysis of the redacted Mueller report

Choose the number of topics: 15, 30, 50, 80, or 150.

On the left, each circle (identified by a number) is a topic: the size of the circle indicates the topic's prevalence, and circles that are closer together indicate stronger similarity. A topic is the collection of words it represents, shown on the right with each term's overall frequency and its estimated frequency within the selected topic. Use the slider to adjust the relevance metric ("The authors of the paper conducted a study to determine whether there was an optimal value regarding the use of relevance to aid topic interpretation and found that value to be 0.6, as described in section 3 of their paper.").
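As a concrete illustration of what the slider does, here is a minimal sketch of the relevance formula from the paper behind the metric; the probabilities below are invented example values, not figures from the report.

```python
import math

def relevance(p_w_given_t, p_w, lam=0.6):
    """Relevance of term w to topic t:
    r = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w)).
    lam = 1 ranks terms purely by in-topic probability;
    lam = 0 ranks by lift, favouring terms exclusive to the topic."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Made-up example: a term common everywhere vs. a topic-specific term.
common = relevance(p_w_given_t=0.02, p_w=0.015)    # frequent in the whole corpus
specific = relevance(p_w_given_t=0.01, p_w=0.001)  # concentrated in this topic
```

At the suggested value of 0.6, the topic-specific term outranks the globally common one, which is why lowering the slider surfaces more distinctive vocabulary.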

Why topic modeling?

Investigative journalists can use LDA to streamline their investigations about particular topics of interest, such as serious crimes, political corruption, or corporate wrongdoing.

Historians can use LDA to identify important events in history by analysing text based on year.

Web-based libraries can use LDA to recommend books based on your past reading.

News providers can use topic modelling to understand articles quickly or cluster similar articles.

Another interesting application is unsupervised clustering of images, where each image is treated analogously to a document.

What is Latent Dirichlet Allocation (LDA)?

"Each document can be described by a distribution of topics and each topic can be described by a distribution of words, molecules and atoms, employees and skills, or keyboards and crumbs." It is out of the scope of this project to explain LDA, if you are curious please visit It is out of the scope of this project to explain LDA, if you are curious please visit this guide

Analysis Details

The algorithm is written in Python and uses the following libraries:

spaCy

gensim

pyLDAvis

First, we downloaded the OCR version of the redacted Mueller report. Optical character recognition introduces errors, which are reflected in the results (e.g., Comey is recognized as Corney). Next, we pre-processed the text by removing unwanted characters and stopwords, then tokenizing (with bi-grams) and lemmatizing. The LDA analysis was performed with gensim's LdaMulticore with num_topics = [15, 30, 50, 80, 150], passes = 40, chunksize = 2000, eval_every = 10, and iterations = 5000; alpha and eta use the default values. Finally, we visualized the results with pyLDAvis, a Python library for interactive topic model visualization.
