What are some of these ‘models’ you speak of?

There are a ton of machine learning models that can be used for classification problems, including the following:

SVM

Logistic Regression

Perceptrons

Neural Networks

The good news is that we will explore various formulations of each of these models in depth later on, but for now we will treat them as black-box models and focus on producing a working classification system.
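For reference, each of the models listed above has an off-the-shelf, black-box implementation in sklearn:

```python
# Off-the-shelf sklearn implementations of the models listed above.
from sklearn.svm import SVC                          # SVM
from sklearn.linear_model import LogisticRegression  # Logistic Regression
from sklearn.linear_model import Perceptron          # Perceptron
from sklearn.neural_network import MLPClassifier     # Neural Network (MLP)
```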

Alright … let’s start coding already!

Sure, that sounds like a good idea. We can start off by defining a generalized framework for training and testing our models. The following is a defining principle of any machine learning system:

We MUST make sure that the data we use to train our models is separate from the data we use to test the accuracy of our models!

In our case we will use 80% of the data to train and 20% of the data to test the models we create.

Okay, but what data will we be using?


We will be using the Wine dataset from the UCI Machine Learning Repository. Our model will be trained to predict which of 3 wine categories each individual wine belongs to. Some of the features we will use to make this prediction include quantities of:

Alcohol

Malic Acid

Proanthocyanins

Flavanoids

Magnesium

The data can be downloaded here: http://archive.ics.uci.edu/ml/datasets/Wine
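Conveniently, the same dataset also ships with sklearn, so we can sanity-check it without any manual download:

```python
# The UCI Wine dataset is bundled with sklearn.
from sklearn.datasets import load_wine

data = load_wine()
print(data.data.shape)      # (178, 13): 178 wines, 13 features each
print(data.feature_names)   # includes 'alcohol', 'malic_acid', 'magnesium', ...
print(data.target_names)    # the 3 wine categories
```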

Let’s see how well we can do using only black-box models!

Step 1 — Train-Test Splitting the data:

Here our code will simply split the overall dataset into a training subset (80%) and a testing subset (20%), taking full advantage of sklearn’s train_test_split function. To keep this process easy to generalize later via subclassing, we define the following DataSplitter class (data_splitter.py):
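A minimal sketch of what data_splitter.py could look like; the split method name and random_state are illustrative choices, not fixed requirements:

```python
# data_splitter.py -- a minimal sketch; details beyond the class name
# and the 80/20 split ratio are assumptions.
from sklearn.model_selection import train_test_split


class DataSplitter:
    """Splits a feature matrix X and label vector y into train/test subsets."""

    def __init__(self, test_size=0.2, random_state=42):
        # 20% of the data is held out for testing, per the split above.
        self.test_size = test_size
        self.random_state = random_state

    def split(self, X, y):
        # Delegate to sklearn's train_test_split; subclasses can override
        # this method to implement other splitting strategies.
        return train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state
        )
```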

Step 2 — Define and test the baseline model:

We will be using sklearn’s pre-configured SVM model for training and prediction on the given data. We can make a simple wrapper for any of sklearn’s default models as follows (classifier.py):
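Here is a minimal sketch of what classifier.py could look like; the class and method names here are illustrative:

```python
# classifier.py -- a sketch of a thin wrapper around any sklearn estimator.
class Classifier:
    """Wraps an sklearn model with simple train/predict/accuracy methods."""

    def __init__(self, model):
        self.model = model  # e.g. sklearn.svm.SVC()

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    def predict(self, X_test):
        return self.model.predict(X_test)

    def accuracy(self, X_test, y_test):
        # sklearn classifiers expose mean accuracy via .score()
        return self.model.score(X_test, y_test)
```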

We can now use the above class to define our model pipeline. Here is what a simple SVM pipeline could look like:
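A sketch of that pipeline, reusing the DataSplitter and Classifier sketches from above (the file names come from the steps above; the rest is illustrative):

```python
# A simple SVM pipeline: split, train, predict, report accuracy.
from sklearn.datasets import load_wine
from sklearn.svm import SVC

from data_splitter import DataSplitter
from classifier import Classifier

data = load_wine()
X_train, X_test, y_train, y_test = DataSplitter(test_size=0.2).split(
    data.data, data.target
)

clf = Classifier(SVC())            # out-of-the-box SVM from sklearn.svm
clf.train(X_train, y_train)        # training phase
predictions = clf.predict(X_test)  # prediction phase on the test data
print("Test accuracy:", clf.accuracy(X_test, y_test))
```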

This pipeline simply runs a training phase using sklearn.svm’s out-of-the-box SVM model, followed by a prediction phase on the test data, and finally prints out the accuracy of the test predictions. For our wine dataset, this baseline model achieves an accuracy of only 66.66%, which is still twice the random classification baseline of 33.33%! Let’s look at how we can improve on this!

Step 3 — Fine-tune model parameters:

We will tune our model’s hyperparameters via an extremely common method known as grid search cross-validation, which essentially finds the ‘best’ model by evaluating every combination of the parameter values we provide. Again, for simplicity, we will utilize sklearn’s GridSearchCV functionality and refactor our previous code:
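A sketch of the refactored, grid-searched pipeline, continuing from the split above; the parameter grid below is just one plausible choice for an SVM, not the only one:

```python
# Grid search over SVM hyperparameters (the grid itself is illustrative).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.001],
    "kernel": ["linear", "rbf"],
}

# GridSearchCV fits one SVM per parameter combination, scores each with
# cross-validation on the training set, and keeps the best combination.
# X_train, y_train, X_test, y_test come from the pipeline above.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters found:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```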

And just like that, the accuracy improves to 97.22%, a gain of 30.56 percentage points over our previous model! Let’s explore what is happening here in some more detail.