Beginning of a new journey. Credit to www.pexels.com/@tony-367405

Last Time

Last time we explored the concept of classification and used sklearn’s build-in SVM model to predict wine type based on quantities of various compounds such as Alcohol, Malic Acid, Flavanoids, and Magnesium. We were able to achieve a ~97.2% accuracy on our test set, however we just black-boxed much of how SVMs actually function.

This Time

This time we will start a deep-dive straight into the theory behind binary SVMs (Support Vector Machines), including the different types of SVMs and the mathematical optimizations behind them.

Binary Classification

Let’s start off with the simplest subset of classification problems, ones with two possible output classes. The input data for this situation would be in the form (xi, yi) where xi is the set of input features of the ith sample and yi is the correct output label for the ith sample. Here each yi would only take on one of two possible values.

Credit to https://pixabay.com/users/geralt-9301/

Loss Functions

Okay, so we have defined how our data looks and our desired outcome, but what do we do next? Well, we define a loss function. Loss functions measure how closely aligned our model’s current predictions are relative to the set of true labels, and thus we general will find ways to minimize the value of our selected loss.

Types of SVMs

What type of loss function will be use for SVMs? Well it matters what type of SVM we decide to use. There are two types of SVMs we will talk about: (i) Hard Margin SVMs and (ii) Soft Margin SVMs. Their loss functions and optimization methods differ because the two models prioritize different measurements in fitting data. Hard Margin SVMs can’t exhibit misclassification errors while Soft Margin SVMs can have misclassification errors.

Hard Margin SVMs (The Intuition)

Hard Margin SVMs prioritize maximizing the margin for the data they are trained on. Margin is the minimum distance between an input point and the decision boundary our model comes up with. Why would we want to do this in the first place though?

Think about it like this, we have the following set of red and blue points given to us in the training data.

Notice that these points are linearly separable, meaning we can fully separate out these points by just using a line. Furthermore, it is obvious that we can use many different lines to separate out the two classes:

Now, how do we judge which of these lines, or decision boundaries do the best job of separating the data? Well, if we adhere to the Maximum Margin Principle like Hard Margin SVMs, we would conclude that the central line does the best job as it maintains the maximum amount of distance from both the blue and red points that are closest to it. This provides our model with the most leeway for correctly classifying any new data points, even if they exhibit some variation from the training data!

Another topic that naturally follows from our discussion of margin maximization is the concept of support vectors. Support Vectors are the set of vectors between the decision boundary of a SVM and the small subset of the total training points that define the decision boundary. It is important to note that the other training points do not affect the decision boundary! Can you identify the support vectors in the example image above?

Soft-Margin SVMs (The Intuition)

Above we saw an ideal case where all points were in fact linearly separable, but the question remains, what do we do when that isn’t the case? Well, this is where Soft-Margin SVMs come to the rescue! Where Hard-Margin SVMs fail to find a decision boundary successfully, this second type of SVM can successfully find a solution since it follows relaxed constraints which allow for misclassifications.

Hard-Margin SVMs (The Math)

Why can’t Hard-Margin SVMs work on data that isn’t linearly separable? After all, what exactly restricts these models from making errors? Well, the proof lies in the pudding, or in this case the math. Here we go.

First let’s define what some of the variables we will be using mean. As in all parametric machine learning models we will be fine-tuning a set of weights w through training. Furthermore we will denote m as being the margin with respect to the current decision boundary and the given points. Finally, we will have b simply be a bias term.

We know that we have to maximize the margin, but there are two additional constraints we define:

(1) All of the points MUST be perfectly separated, and each individual point should have a distance from the decision boundary that is at least as large as the margin m. (2) The margin has to be great than or equal to 0.

Since our decision boundary will be the line defined by w^T*x + b = 0, we can derive the first constraint as follows:

This formula may seem like rocket-science at first, but let’s break it down piece-by-piece and hopefully it’ll make much more sense. We are essentially utilizing the point to line distance formula in order to derive the distance between ith sample point (xi, yi) and our decision boundary. The base formula looks like this:

Plugging in our line for the numerator we obtain:

Note: that we take out the absolute value sign for a very important critical reason that is explained a little further down.

For deriving the numerator, it is handy to know a bit about vector norms. A norm is simply an operator that is used on vectors to tell us something about them. There are many different types of norms, but the most common one is the L2-norm, which calculates the distance of a given vector coordinate from the origin. We can rewrite the denominator of the base point to line distance formula as being the L2-norm of the 2-d vector <a,b>:

Now, for our SVM case, we can just generalize and have the denominator be the distance of our d-length weight vector w:

We have most of the constraint’s expression set now, except for the leading yi. Why do we multiply our expression by it? We do this in order to ensure that points are on the correct side of the decision boundary!

Let’s say we take our two possible output classes and them labels of -1 and +1. Points with a label of -1 belong to the negative class where as points with a label of +1 belong to the positive class. The simple analysis below shows that our Hard-Margin SVM would need to assign correct labels in order for its optimization to work!

The second constraint is much simpler and is represented as follows (see the tie-in between the above analysis and this?):

Putting everything above together we get the following optimization formulation:

There is still one more tweak we need to make. As we see below, currently all multiples of our weight vector and our bias vector b are potential solutions to this problem for any given margin m!

We can cure this by tying m back to w and converting our three-variable optimization into a two-variable optimization as follows:

Soft-Margin SVMs (The Math)

We want a way of being able to expand on Hard-Margin SVMs so that the newly formed model is capable of producing decision boundaries even when it isn’t possible to perfectly classify data. The way we will accomplish this feat is by adding in extra variables into our existing optimization known as slack variables.

What is happening here is that each of the slack terms reduces the minimum distance a specific point needs to maintain from the decision boundary. The optimization also avoids producing very large slack terms by including them in the objective function being optimized. Lastly, C is a parameter to our model that we have to tweak via cross-validation.

We can analyze two main cases to verify that the Soft-Margin SVM algorithm would in fact work regardless of whether or not the inputted data is linearly separable :

Case I: Point (xi,yi) is linearly separable -> 0 ≤ epsilon_i ≤1 Case II: Point (xi,yi) is not linearly separable -> epsilon_i > 1

Dual Formulation of the Optimization

Although the above optimization does in fact make mathematical sense, it is tough to optimize. Luckily for us, in comes the concept of duality, where we can express a minimization problem as an equivalent maximization problem! Not going to go into the details of deriving this dual in this post for the sake of length, but the process is called Lagrangian Duality, and the resulting maximization becomes:

Next Time

That’s quite a bit of theory and math for this time, next time we will further explore Lagrangian Duality, see how the coefficients we derive from the optimization above help us get optimal weights for our SVM model, and introduce the concept of kernels!

Feedback

Feedback really helps me to figure out how these articles are being perceived and also helps to increase the quality of future posts, so if you enjoyed this read be sure to leave a clap, and if you have any questions/concerns/feedback be sure to leave them in the comments. Thank you!