Multi-Class Text Classification with Doc2Vec & Logistic Regression

The goal is to classify consumer finance complaints into 12 pre-defined classes using Doc2Vec and Logistic Regression.

Doc2vec is an NLP tool for representing documents as vectors and is a generalization of the word2vec method.

To understand doc2vec, it helps to understand the word2vec approach first. However, the complete mathematical details are beyond the scope of this article. If you are new to word2vec and doc2vec, the following resources can help you get started:

Using the same data set as in Multi-Class Text Classification with Scikit-Learn, in this article we’ll classify complaint narratives by product using doc2vec techniques in Gensim. Let’s get started!

The Data

The goal is to classify consumer finance complaints into 12 pre-defined classes. The data can be downloaded from data.gov.

import pandas as pd

import numpy as np

from tqdm import tqdm

tqdm.pandas(desc="progress-bar")

from gensim.models import Doc2Vec

from sklearn import utils

from sklearn.model_selection import train_test_split

import gensim

from sklearn.linear_model import LogisticRegression

from gensim.models.doc2vec import TaggedDocument

import re

import seaborn as sns

import matplotlib.pyplot as plt

df = pd.read_csv('Consumer_Complaints.csv')

df = df[['Consumer complaint narrative','Product']]

df = df[pd.notnull(df['Consumer complaint narrative'])]

df.rename(columns = {'Consumer complaint narrative':'narrative'}, inplace = True)

df.head(10)

Figure 1

After removing null values from the narrative column, we need to re-index the data frame.

df.shape

(318718, 2)

df.index = range(318718)

df['narrative'].apply(lambda x: len(x.split(' '))).sum()

63420212

We have over 63 million words, so this is a relatively large data set.
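The filtering, renaming, re-indexing, and word-count steps above can be tried on a small toy frame before running them on the full file. The rows below are made up for illustration, and `reset_index(drop=True)` is shown as one possible alternative to assigning a hard-coded `range`, since it does not depend on knowing the row count in advance.

```python
import pandas as pd

# Toy stand-in for Consumer_Complaints.csv (hypothetical rows, not real data)
toy = pd.DataFrame({
    "Consumer complaint narrative": [
        "I was charged twice for one payment",
        None,
        "My credit report shows an account I never opened",
    ],
    "Product": ["Credit card", "Mortgage", "Credit reporting"],
})

# Keep only rows that have a narrative, then shorten the column name
toy = toy[toy["Consumer complaint narrative"].notnull()]
toy = toy.rename(columns={"Consumer complaint narrative": "narrative"})

# Renumber the rows 0..n-1 without hard-coding the row count
toy = toy.reset_index(drop=True)

# Whitespace-split word count, summed over all narratives
total_words = toy["narrative"].apply(lambda x: len(x.split(" "))).sum()
print(toy.shape, list(toy.index), total_words)
```

Note that splitting on a single space is only a rough word count: punctuation stays attached to words and consecutive spaces produce empty tokens.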

Exploring