Hi, everyone.

Today, let me introduce Algorithm::LDA.

This module is a Latent Dirichlet Allocation (i.e., LDA) implementation for topic modeling.

Introduction

What’s LDA? LDA is one of the popular unsupervised machine learning methods.

It models document generation process and represents each document as a mixture of topics.

So, what does “a mixture of topics” mean? Fig. 1 shows an article in which some of the words are highlighted in three colors: yellow, pink, and blue. Words about genetics are marked in yellow; words about evolutionary biology are marked in pink; words about data analysis are marked in blue. Imagine all of the words in this article are colored, then we can represent this article as a mixture of topics (i.e., colors).

Fig. 1:



(This image is from “Probabilistic topic models.” (David Blei 2012))

OK, then I’ll demonstrate how to use Algorithm::LDA in the next section.

Modeling Quotations

In this article, we explore Wikiquote. Wikiquote is a cloud-sourced platform providing sourced quotations.

By using Wikiquote API, we get quotations that are used for LDA estimation. After that, we execute LDA and plot the result.

Finally, we create an information retrieval application using the resulting model.

Preliminary

Wikiquote API

Wikiquote has action API that provides means for getting Wikiquote resources.

For example, you can get content of the Main Page as follows:

$ curl "https://en.wikiquote.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&rvprop=content&format=json"

The result of the above command is: