The goal of machine learning at Qchain is to match advertisers with publishers based on their content. Broadly speaking, content can be described by two characteristics: topic and style. For example, FiveThirtyEight is a data-driven, analytical blog about sports and politics: its topics are sports and politics, and its style is data-driven and analytical. An advertiser may wish to place their content on blogs similar to FiveThirtyEight, and machine learning can help them achieve this goal at scale.

The domain of machine learning we will be working in is natural language processing (NLP). Topic modeling is a well-known unsupervised learning problem in NLP. It is closely related to document clustering, and it differs from document classification, which is usually supervised and assigns documents to predefined labels. The goal of topic modeling is to group documents by the topics they discuss. The topics are usually latent: they are not specified in advance and must be inferred from the text itself (hence unsupervised).

The traditional approach to topic modeling is based on word (or phrase) frequency. Given a large collection of text, called a corpus, we take all the unique words and phrases in it as the vocabulary. For each document, we count how often each vocabulary term appears. Certain keywords tend to appear more often in documents of a given category, so documents with similar keyword distributions can be grouped into clusters.
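To make the frequency-based idea concrete, here is a minimal sketch in plain Python. It is not Qchain's pipeline: the four-sentence toy corpus, the tiny stop-word list, the similarity threshold of 0.3, and the greedy grouping rule are all assumptions chosen for illustration. Each document is turned into a vector of relative word frequencies, and documents are grouped when their vectors are similar enough under cosine similarity.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "after", "and", "in", "is", "of", "on", "the", "was"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def term_frequencies(tokens):
    """Map each word to its relative frequency within one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(vectors, threshold=0.3):
    """Greedy one-pass grouping: a document joins the first cluster whose
    first member is at least `threshold` similar; otherwise it starts a
    new cluster. Returns lists of document indices."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy corpus (made up for this example): two sports and two politics sentences.
corpus = [
    "The team won the game after a late goal in the final minute",
    "The senate vote on the election bill was delayed again",
    "A late goal decided the game and the team advanced",
    "Polls show the election is close ahead of the senate vote",
]

vectors = [term_frequencies(tokenize(doc)) for doc in corpus]
vocabulary = sorted({word for vec in vectors for word in vec})
groups = cluster(vectors)

print(len(vocabulary), "unique terms in the vocabulary")
print(groups)
```

On this toy corpus the sports sentences share keywords such as "team", "game", and "goal", while the politics sentences share "senate", "vote", and "election", so the two pairs end up in separate groups. In practice the grouping step would typically be replaced by a proper clustering or topic-modeling algorithm, and raw frequencies by a weighting scheme such as TF-IDF.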