Every day, users of stackoverflow.com post many technical questions, and each question gets tagged with one or more topics. In this article, we will discuss a classification model that can automatically suggest which tags to attach to an unanswered question.

A single question can be associated with multiple tags, so the problem becomes ‘classifying a question and attaching several class labels to it’. In machine learning terms, this is a ‘multi-label classification’ problem.
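To make the multi-label idea concrete, here is a minimal sketch of how a set of tags per question can be turned into a binary indicator matrix with scikit-learn's MultiLabelBinarizer. The tag lists are made up for illustration and are not taken from the dataset; the variable names are also just assumptions for this example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag sets for three questions (illustrative only).
question_tags = [
    ["python", "pandas"],
    ["java"],
    ["python", "machine-learning", "scikit-learn"],
]

# Each distinct tag becomes one column; a 1 means the question carries that tag.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(question_tags)

print(mlb.classes_)  # column order (tags sorted alphabetically)
print(y)             # one row per question, one column per tag
```

Each row can contain more than one 1, which is what separates a multi-label target from an ordinary multi-class one.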

We already discussed the theoretical techniques and accuracy metrics required for multi-label models in the article below.

The article above is a prerequisite for the current discussion. Readers are encouraged to go through it before continuing with this one.

We will use scikit-multilearn, gensim & scikit-learn for our work.

Getting the Data and Exploration

The data for this article can be found on Kaggle. It contains the “Questions.csv” and “Tags.csv” files required for our discussion.

Let’s explore these files, starting with the tags.

import pandas as pd

tag_df = pd.read_csv('../data/Tags.csv')

tag_df.head()

Figure 1
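Beyond looking at the first few rows, two quick summaries are usually helpful at this stage: how often each tag occurs, and which tags belong to each question. The sketch below uses a small hand-made DataFrame as a stand-in for Tags.csv. The column names "Id" and "Tag" are an assumption about the file layout, and the rows are invented, so adjust both to match what tag_df.head() actually shows.

```python
import pandas as pd

# Hypothetical stand-in for Tags.csv; the "Id" and "Tag" column names are assumed.
tag_df = pd.DataFrame({
    "Id": [80, 80, 90, 90, 120],
    "Tag": ["flex", "sql", "svn", "branch", "sql"],
})

# How many questions use each tag (most frequent first).
tag_counts = tag_df["Tag"].value_counts()

# Collect all tags of a question into one list, indexed by question Id.
tags_per_question = tag_df.groupby("Id")["Tag"].apply(list)

print(tag_counts)
print(tags_per_question)
```

The per-question lists are the shape a multi-label target needs, since each question ends up with its full set of tags in one place.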