Classification is the process of identifying the category of a new, unseen observation based on a training set of data whose categories are known.

In our case, the headlines are the observations and the positive/negative sentiment labels are the categories. This is a binary classification problem -- we're trying to predict whether a headline is positive or negative.

First Problem: Imbalanced Dataset

One of the most common problems in machine learning is working with an imbalanced dataset. As we'll see below, we have a slightly imbalanced dataset, where there are more negatives than positives.

Compared to some problems, like fraud detection, our dataset isn't super imbalanced. Sometimes you'll have datasets where the positive class is only 1% of the training data, the rest being negatives.

We want to be careful when interpreting results from imbalanced data. When scoring our classifier, we may see accuracy as high as 90% even though the model is nearly useless; this is commonly known as the Accuracy Paradox.

The reason we might see 90% accuracy is that the model can learn to always predict the majority class (negative), which alone yields high accuracy when most of the data is negative.
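To make this concrete, here is a small sketch with made-up toy labels (a 90/10 negative/positive split, not our actual dataset) showing how a classifier that always predicts negative still scores 90% accuracy:

```python
# Toy labels: 90% negative (0), 10% positive (1) -- an imbalanced split.
y_true = [1] * 10 + [0] * 90

# A "classifier" that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9 -- looks great, yet the model never finds a single positive
```

The 90% score tells us nothing about how well the model detects the class we actually care about.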

There are a number of ways to counter this problem, such as:

- Collect more data: could help balance the dataset by adding more minority-class examples.

- Change your metric: use the confusion matrix, precision, recall, or the F1 score (a combination of precision and recall).

- Oversample the data: randomly sample the attributes from examples in the minority class to create more 'fake' data.

- Penalized model: impose an additional cost on the model for making classification mistakes on the minority class during training. These penalties bias the model towards the minority class.
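The metric-based options above can be sketched with plain Python. The helper below (a hypothetical `precision_recall_f1`, not from any library) derives precision, recall, and F1 from the confusion-matrix counts:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) counts for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many were right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Unlike raw accuracy, recall drops to zero for a model that never predicts the positive class, immediately exposing the problem. In practice you would likely use `sklearn.metrics` rather than hand-rolling these.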

In our dataset, we have fewer positive examples than negative examples, and we will explore both different metrics and an oversampling technique called SMOTE.
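In practice we'd use an existing SMOTE implementation (e.g. from the imbalanced-learn package), but the core idea is simple enough to sketch: each synthetic point is an interpolation between a minority-class sample and one of its nearest minority-class neighbours. The `smote_sketch` function below is a hypothetical, minimal illustration, not the full algorithm:

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x itself),
        # by squared Euclidean distance
        neighbours = sorted(
            (m for m in minority if m is not x),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
        )[:k]
        n = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, n)))
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies, rather than duplicating points exactly.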

Let's establish a few basic imports: