Introduction

Data classification is a very important task in machine learning. Support Vector Machines (SVMs) are widely applied in the field of pattern classifications and nonlinear regressions. The original form of the SVM algorithm was introduced by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. Since then, SVMs have been transformed tremendously to be used successfully in many real-world problems such as text (and hypertext) categorization, image classification, bioinformatics (Protein classification, Cancer classification), handwritten character recognition, etc.

Table of Contents

What is a Support Vector Machine? How does it work? Derivation of SVM Equations Pros and Cons of SVMs Python and R implementation

What is a Support Vector Machine(SVM)?

A Support Vector Machine is a supervised machine learning algorithm which can be used for both classification and regression problems. It follows a technique called the kernel trick to transform the data and based on these transformations, it finds an optimal boundary between the possible outputs.

In simple words, it does some extremely complex data transformations to figure out how to separate the data based on the labels or outputs defined.We will be looking only at the SVM classification algorithm in this article.

How does it work?

The main idea is to identify the optimal separating hyperplane which maximizes the margin of the training data. Let us understand this objective term by term.

What is a separating hyperplane?

We can see that it is possible to separate the data given in the plot above. For instance, we can draw a line in which all the points above the line are green and the ones below the line are red. Such a line is said to be a separating hyperplane.

Now the obvious confusion, why is it called a hyperplane if it is a line?

In the diagram above, we have considered the simplest of examples, i.e., the dataset lies in the 2-dimensional plane(\(\mathbb{R}^2\)). But the support vector machine can work for a general n-dimensional dataset too. And in the case of higher dimensions, the hyperplane is the generalization of a plane.

More formally, it is an n-1 dimensional subspace of an n-dimensional Euclidean space. So for a

1D dataset, a single point represents the hyperplane.

2D dataset, a line is a hyperplane.

3D dataset, a plane is a hyperplane.

And in the higher dimension, it is called a hyperplane.

We have said that the objective of an SVM is to find the optimal separating hyperplane. When is a separating hyperplane said to be optimal?

The fact that there exists a hyperplane separating the dataset doesn’t mean that it is the best one.

Let us understand the optimal hyperplane through a set of diagrams.

Multiple hyperplanes

There are multiple hyperplanes, but which one of them is a separating hyperplane? It can be easily seen that line B is the one which best separates the two classes.

Multiple separating hyperplanes

There can be multiple separating as well. How do we find the optimal one? Intuitively, if we select a hyperplane which is close to the data points of one class, then it might not generalize well. So the aim is to choose the hyperplane which is as far as possible from the data points of each category.

In the diagram above, the hyperplane that meets the specified criteria for the optimal hyperplane is B.

Therefore, maximizing the distance between the nearest points of each class and the hyperplane would result in an optimal separating hyperplane. This distance is called the margin.

The goal of SVMs is to find the optimal hyperplane because it not only classifies the existing dataset but also helps predict the class of the unseen data. And the optimal hyperplane is the one which has the biggest margin.

Mathematical Setup

Now that we have understood the basic setup of this algorithm, let us dive straight into the mathematical technicalities of SVMs.

I will be assuming you are familiar with basic mathematical concepts such as vectors, vector arithmetic(addition, subtraction, dot product) and the orthogonal projection. Some of these concepts can also be found in the article, Prerequisites of linear algebra for machine learning.

Equation of Hyperplane

You must have come across the equation of a straight line as \(y=mx+c\), where \(m\) is the slope and \(c\) is the y-intercept of the line.

The generalized equation of a hyperplane is as follows:

\(\displaystyle w^Tx=0\)

Here \(w\) and \(x\) are the vectors and \(w^Tx\) represents the dot product of the two vectors. The vector \(w\) is often called as the weight vector.

Consider the equation of the line as \(y-mx-c=0\). In this case,

\(w=\begin{pmatrix}-c\\-m\\1 \end{pmatrix}\) and \(x=\begin{pmatrix}1\\x\\y \end{pmatrix}\)

\(\displaystyle w^Tx=-c\times 1-m\times x+y=y-mx-c=0\)

It is just two different ways of representing the same thing. So why do we use \(w^Tx=0\)? Simply because it is easier to deal with this representation in the case of higher dimensional dataset and \(w\) represents the vector which is normal to the hyperplane. This property will be useful once we start computing the distance from a point to the hyperplane.

Understanding the constraints

The training data in our classification problem is of the form \(\{(x_1,y_1),(x_2,y_2),…,(x_n,y_n)\} \in \mathbb{R}^n \times {-1,1}\). This means that the training dataset is a pair of \(x_i\), an n-dimensional feature vector and \(y_i\), the label of \(x_i\). When \(y_i=1\) implies that the sample with the feature vector \(x_i\) belongs to class 1 and if \(y_i=-1\) implies that the sample belongs to class -1.

In a classification problem, we thus try to find out a function, \( y=f(x): \mathbb{R}^n \longrightarrow \{-1,1\}\). \(f(x)\) learns from the training data set and then applies its knowledge to classify the unseen data.

There are an infinite number of functions, \(f(x)\) that can exist, so we have to restrict the class of functions that we are dealing with. In the case of SVM’s, this class of functions is that of the hyperplane represented as \(w^Tx=0\).

It can also be represented as \(\vec{w}.\vec{x}+b=0; \vec{w}\in \mathbb{R}^n \mbox{ and } b \in \mathbb{R}\)

This divides the input space into two parts, one containing vectors of class ?1 and the other containing vectors of class +1.

For the rest of this article, we will consider 2-dimensional vectors. Let \(\mathcal{H}_0\) be a hyperplane separating the dataset and satisfying the following:

\(\displaystyle \vec{w}.\vec{x}+b=0\)

Along with \(\mathcal{H}_0\), we can select two others hyperplanes \(\mathcal{H}_1\) and \(\mathcal{H}_2\) such that they also separate the data and have the following equations:

\(\vec{w}.\vec{x}+b=\delta\) and \(\vec{w}.\vec{x}+b=\mbox{-}\delta\)

This makes \(\mathcal{H}_o\) equidistant from \(\mathcal{H}_1\) as well as \(\mathcal{H}_2\).

The variable ? is not necessary so we can set ?=1 to simplify the problem as \(\vec{w}.\vec{x}+b=1\) and \(\vec{w}.\vec{x}+b = \mbox{-}1\)

Next, we want to ensure that there is no point between them. So for this, we will select only those hyperplanes which satisfy the following constraints:

For every vector \(x_i\) either:

\(\vec{w}.\vec{x}+b\leq \mbox{-}1\) for \(x_i\) having the class ?1 or \(\vec{w}.\vec{x}+b\geq 1\) for \(x_i\) having the class 1

Combining the constraints

Both the constraints stated above can be combined into a single constraint.

Constraint 1:

For \(x_i\) having the class -1, \(\vec{w}.\vec{x}+b\leq \mbox{-}1\)

Multiplying both sides by \(y_i\) (which is always -1 for this equation)

\(y_i\left(\vec{w}.\vec{x}+b\right)\geq y_i(-1)\) which implies \(y_i\left(\vec{w}.\vec{x}+b\right) \geq 1\) for \(x_i\) having the class?1.

Constraint 2: \(y_i=1\)

\(y_i\left(\vec{w}.\vec{x}+b\right) \geq 1\) for \(x_i\) having the class 1

Combining both the above equations, we get \(y_i\left(\vec{w}.\vec{x}+b\right)\geq 1 \mbox{ for all }1\leq i\leq n\)

This leads to a unique constraint instead of two which are mathematically equivalent. The combined new constraint also has the same effect, i.e., no points between the two hyperplanes.

Maximize the margin

For the sake of simplicity, we will skip the derivation of the formula for calculating the margin, \(m\) which is

\(\displaystyle m=\frac{2}{||\vec{w}||}\)

The only variable in this formula is \(w\), which is indirectly proportional to \(m\), hence to maximize the margin we will have to minimize \(||\vec{w}||\). This leads to the following optimization problem:

Minimize in \(\displaystyle (\vec{w},b) \{ \frac{||\vec{w}||^2}{2}\) subject to \( y_i\left(\vec{w}.\vec{x}+b\right) \geq 1 \mbox{ for any } i=1,\dots, n \)

The above is the case when our data is linearly separable. There are many cases where the data can not be perfectly classified through linear separation. In such cases, Support Vector Machine looks for the hyperplane that maximizes the margin and minimizes the misclassifications.

For this, we introduce the slack variable,\(\zeta_i\) which allows some objects to fall off the margin but it penalizes them.

In this scenario, the algorithm tries to maintain the slack variable to zero while maximizing the margin. However, it minimizes the sum of distances of the misclassification from the margin hyperplanes and not the number of misclassifications.

Constraints now changes to \(y_i(\vec{w}.\vec{x_i}+b)\geq 1-\zeta_i \mbox{ for all } 1\leq i\leq n, \zeta_i\geq 0\)

and the optimization problem changes to

Minimize in \(\displaystyle (\vec{w},b) \{ \frac{||\vec{w}||^2}{2}+C\sum_i\zeta_i\) subject to \( y_i\left(\vec{w}.\vec{x}+b\right) \geq 1-\zeta_i \mbox{ for any } i=1,\dots, n \)

Here, the parameter \(C\) is the regularization parameter that controls the trade-off between the slack variable penalty (misclassifications) and width of the margin.

Small \(C\) makes the constraints easy to ignore which leads to a large margin.

Large \(C\) allows the constraints hard to be ignored which leads to a small margin.

For \(C=\inf \), all the constraints are enforced.

The easiest way to separate two classes of data is a line in case of 2D data and a plane in case of 3D data. But it is not always possible to use lines or planes and one requires a nonlinear region to separate these classes. Support Vector Machines handle such situations by using a kernel function which maps the data to a different space where a linear hyperplane can be used to separate classes. This is known as the kernel trick where the kernel function transforms the data into the higher dimensional feature space so that a linear separation is possible.



If \( \phi \) is the kernel function which maps \(x_i \)to \(\phi(x_i)\), the constraints change to \(y_i(\vec{w}.\phi(x_i)+b)\geq 1-\zeta_i \mbox{ for all } 1\leq i \leq n, \zeta_i\geq 0\)

And the optimization problem is

Minimize in \(\displaystyle (\vec{w},b) \{ \frac{||\vec{w}||^2}{2} + C\sum_i\zeta_i \) subject to \(y_i(\vec{w}.\phi(x_i)+b)\geq 1-\zeta_i\) \( \mbox{ for all } 1\leq i \leq n, \zeta_i\geq 0\)

We will not get into the solution of these optimization problems. The most common method used to solve these optimization problems is Convex Optimization.

Pros and Cons of Support Vector Machines

Every classification algorithm has its own advantages and disadvantages that are come into play according to the dataset being analyzed. Some of the advantages of SVMs are as follows:

The very nature of the Convex Optimization method ensures guaranteed optimality. The solution is guaranteed to be a global minimum and not a local minimum.

SVM is an algorithm which is suitable for both linearly and nonlinearly separable data (using kernel trick). The only thing to do is to come up with the regularization term, \(C\).

SVMs work well on small as well as high dimensional data spaces. It works effectively for high-dimensional datasets because of the fact that the complexity of the training dataset in SVM is generally characterized by the number of support vectors rather than the dimensionality. Even if all other training examples are removed and the training is repeated, we will get the same optimal separating hyperplane.

SVMs can work effectively on smaller training datasets as they don’t rely on the entire data.

Disadvantages of SVMs are as follows:

They are not suitable for larger datasets because the training time with SVMs can be high and much more computationally intensive.

They are less effective on noisier datasets that have overlapping classes.

SVM with Python and R

Let us look at the libraries and functions used to implement SVM in Python and R.

Python Implementation

The most widely used library for implementing machine learning algorithms in Python is scikit-learn. The class used for SVM classification in scikit-learn is svm.SVC()

sklearn.svm.SVC (C=1.0, kernel=’rbf’, degree=3, gamma=’auto’)

Parameters are as follows:

C: It is the regularization parameter, C, of the error term.

kernel: It specifies the kernel type to be used in the algorithm. It can be ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’, or a callable. The default value is ‘rbf’.

degree: It is the degree of the polynomial kernel function (‘poly’) and is ignored by all other kernels. The default value is 3.

gamma: It is the kernel coefficient for ‘rbf’, ‘poly’, and ‘sigmoid’. If gamma is ‘auto’, then 1/n_features will be used instead.

There are many advanced parameters too which I have not discussed here. You can check them out here.

https://gist.github.com/HackerEarthBlog/07492b3da67a2eb0ee8308da60bf40d9

One can tune the SVM by changing the parameters \(C, \gamma\) and the kernel function. The function for tuning the parameters available in scikit-learn is called gridSearchCV().

sklearn.model_selection.GridSearchCV(estimator, param_grid)

Parameters of this function are defined as:

estimator: It is the estimator object which is svm.SVC() in our case.

param_grid: It is the dictionary or list with parameters names (string) as keys and lists of parameter settings to try as values.

To know more about other parameters of GridSearch.CV(), click here.

https://gist.github.com/HackerEarthBlog/a84a446810494d4ca0c178e864ab2391

In the above code, the parameters we have considered for tuning are kernel, C, and gamma. The values from which the best value is to be are the ones written in the bracket. Here, we have only given a few values to be considered but a whole range of values can be given for tuning but it will take a longer time for execution.

R Implementation

The package that we will use for implementing SVM algorithm in R is e1071. The function used will be svm().

https://gist.github.com/HackerEarthBlog/0336338c5d93dc3d724a8edb67ad0a05

Summary

In this article, I have gone through a very basic explanation of SVM classification algorithm. I have left out a few mathematical complications such as calculating distances and solving the optimization problem. But I hope this gives you enough know-how about how a machine learning algorithm, that is, SVM, can be modified based on the type of dataset provided.