Personal collection

This website is my personal collection and exists as a reminder of stuff I was interested in.

1 • Support Vector Machines

One of the major classification techniques is the Support Vector Machine (SVM). The task of classification in its plainest form deals with the separation of data into two classes along a discriminant. We do not know this discriminant, but we have a training set of labeled data with which it can be estimated. In the case of SVM we seek a linear discriminant which separates our data into two classes. This linear discriminant tells us which features are important for our classification, and allows us to interpret its decision. In its simplest form its usage is rather limited, but with kernel methods SVM can be extended to work well with non-linear data, and an extension to more than two classes is possible as well. In the beginning we will focus on the textbook case of two-class, separable data.

The article comprises three parts. In the beginning we will explore the application of a linear discriminant for classification and develop our quadratic problem. Lagrange duality theory is then applied to derive a simpler form with regularization. Finally a toy example in Rust is explained and an outlook on multi-class classification and kernel functions is given.

In essence, SVMs seek a hyperplane that separates the data points optimally and maximizes their margin, i.e., the distance between the hyperplane and the closest points of the training set from each class. The hyperplane is trained with a set of pairs $(x_i, y_i)$ where the class membership is denoted by $y_i \in \{-1, +1\}$.

An example of a hyperplane can be seen in Figure 1.1.

Figure 1.1 An illustrative example of an SVM classifier with data points in $\mathbb{R}^2$ and a separating hyperplane with a margin.

1.1 Hyperplanes and optimal margins

A hyperplane is a subspace whose dimension is one less than that of the data vectors $x \in \mathbb{R}^n$. Suppose that a normal $w$ and an offset $b$ are given, then the hyperplane is defined as

$$H = \{\, x \in \mathbb{R}^n : \langle w, x \rangle + b = 0 \,\},$$

or, with the offset stated as an explicit offset vector $x_0$ satisfying $b = -\langle w, x_0 \rangle$,

$$H = \{\, x \in \mathbb{R}^n : \langle w, x - x_0 \rangle = 0 \,\}.$$

Hence the hyperplane is an $(n-1)$-dimensional subspace orthogonal to $w$, which is shifted by the vector $x_0$, see Figure 1.2.

Figure 1.2 Illustrative hyperplane shifted by $x_0$ with normal $w$.

Assume we have two parallel hyperplanes $H_1$ and $H_2$ (e.g. same normal $w$ but different offsets $b_1$, $b_2$) and we want to determine their distance to each other

$$d(H_1, H_2) = \frac{|b_1 - b_2|}{\lVert w \rVert}. \qquad (1.1)$$

This distance can also be expressed from a hyperplane to a single point $x$

$$d(H, x) = \frac{|\langle w, x \rangle + b|}{\lVert w \rVert}.$$
We can therefore conclude that a hyperplane separates the points of two classes if one group is contained in the half-space $\langle w, x \rangle + b > 0$ and the other group in $\langle w, x \rangle + b < 0$. The minimal distance from the points to the separating hyperplane is called the margin. Our aim is to find a hyperplane which maximizes the margin and correctly classifies all points.
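As a quick numerical check of Eq. (1.1) (an illustrative example with made-up numbers, not taken from the figures): for $w = (3, 4)^T$ we have $\lVert w \rVert = 5$, so two parallel hyperplanes with $b_1 = 2$ and $b_2 = -8$ are

$$d(H_1, H_2) = \frac{|2 - (-8)|}{5} = 2$$

apart, while the point $x = (1, 1)^T$ has distance $d(H_1, x) = \frac{|3 + 4 + 2|}{5} = \frac{9}{5}$ from the first hyperplane.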

1.2 The Optimal Margin Classifier

After a short recap on hyperplanes, this chapter will introduce the Optimal Margin Classifier (OMC). The former definitions will help us to define our quadratic problem (QP), whose solution will give us an optimal hyperplane. Given a training set

$$S = \{\, (x_i, y_i) \mid x_i \in \mathbb{R}^n,\; y_i \in \{-1, +1\},\; i = 1, \dots, m \,\} \qquad (1.2)$$

with $m$ the size of the training set and $n$ its dimension.

Assume we have already found a separating hyperplane $(w, b)$ with a minimum margin $d$ (see Eq. (1.1)). Then all points of our training set have to be classified correctly

$$y_i \, \frac{\langle w, x_i \rangle + b}{\lVert w \rVert} \geq d, \qquad i = 1, \dots, m.$$

The objective is therefore to find a separating hyperplane such that the minimum margin is maximized

$$\max_{w, b} \; d \quad \text{s.t.} \quad y_i \, \frac{\langle w, x_i \rangle + b}{\lVert w \rVert} \geq d, \qquad i = 1, \dots, m.$$

This problem is scale invariant: as a simple example, note that if $(w, b)$ is a solution then $(2w, 2b)$ is a solution too. To remove this ambiguity we can normalize the problem by fixing $d \, \lVert w \rVert = 1$, which turns the maximization of $d = 1 / \lVert w \rVert$ into

$$\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t.} \quad y_i \left( \langle w, x_i \rangle + b \right) \geq 1, \qquad i = 1, \dots, m.$$
This is a quadratic problem with linear inequality constraints, a special case of a convex optimization problem, and it can be solved by any quadratic programming software. The solution is called the optimal margin classifier. With an optimal solution $(w^*, b^*)$ and a nearest point $x_+$ and $x_-$ from each class we can find

$$\langle w^*, x_+ \rangle + b^* = +1, \qquad \langle w^*, x_- \rangle + b^* = -1,$$

combine both and obtain

$$b^* = -\tfrac{1}{2} \left( \langle w^*, x_+ \rangle + \langle w^*, x_- \rangle \right).$$
Without more effort we could use this QP to find $(w^*, b^*)$ and classify our data (under the assumption that it is separable). But we can actually do better by deriving the dual function with Lagrange duality theory. This will give us a problem whose solution can be used to classify training points as support vectors or outliers. With this we can study the behaviour of non-separable data and apply regularization.
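As a tiny sanity check of the primal formulation, here is a sketch in Rust (purely illustrative, with made-up data; only nalgebra is assumed) that tests the constraints $y_i(\langle w, x_i \rangle + b) \geq 1$ and reports the margin $1 / \lVert w \rVert$ of a candidate hyperplane:

```rust
use nalgebra::DVector;

// Check whether (w, b) separates the labeled points with functional margin >= 1.
fn is_feasible(w: &DVector<f64>, b: f64, xs: &[DVector<f64>], ys: &[f64]) -> bool {
    xs.iter().zip(ys).all(|(x, y)| y * (w.dot(x) + b) >= 1.0)
}

fn main() {
    let w = DVector::from_vec(vec![1.0, 1.0]);
    let b = -3.0;
    let xs = vec![
        DVector::from_vec(vec![1.0, 1.0]), // class -1
        DVector::from_vec(vec![3.0, 3.0]), // class +1
    ];
    let ys = [-1.0, 1.0];
    println!("feasible: {}", is_feasible(&w, b, &xs, &ys));
    println!("margin:   {}", 1.0 / w.norm()); // 1/sqrt(2)
}
```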

1.3 SVM and Lagrange Duality

In order to derive a new form of our problem we shall start with a short excursion into Lagrange Duality Theory. A much deeper explanation of Convex Optimization can be found in Boy09.

A standard convex optimization problem has the form

$$\min_{x} \; f_0(x) \quad \text{s.t.} \quad f_i(x) \leq 0, \; i = 1, \dots, k, \qquad h_j(x) = 0, \; j = 1, \dots, l,$$

where $f_0$ and the $f_i$ are convex and the $h_j$ are linear.

Duality theory augments the optimization criterion by extending it with a weighted sum of the constraints. The Lagrange multiplier $\alpha_i \geq 0$ is associated with the $i$th inequality constraint $f_i(x) \leq 0$, and $\beta_j$ weights the $j$th equality constraint $h_j(x) = 0$. The Lagrangian function is therefore composed of the primal optimization criterion and the weighted constraints

$$L(x, \alpha, \beta) = f_0(x) + \sum_{i=1}^{k} \alpha_i f_i(x) + \sum_{j=1}^{l} \beta_j h_j(x).$$

The Lagrangian dual function is defined as the infimum of the Lagrangian function over the primal variable

$$g(\alpha, \beta) = \inf_{x} \, L(x, \alpha, \beta);$$

the minimizing primal variable can thus be expressed with respect to $\alpha$ and $\beta$. This transforms the Lagrangian function into its dual form.

The Lagrange dual problem is defined as follows

$$\max_{\alpha, \beta} \; g(\alpha, \beta) \quad \text{s.t.} \quad \alpha_i \geq 0, \; i = 1, \dots, k.$$

Suppose that $(\alpha^*, \beta^*)$ is an optimal solution of the Lagrange dual problem and $x^*$ is an optimal solution of the primal optimization problem. The (weak) duality theorem states that

$$g(\alpha^*, \beta^*) \leq f_0(x^*).$$

Strong duality (i.e. the duality gap is zero) holds in particular when the problem is convex and the constraints are linear, a special case of Slater's condition. It is then easy to show the chain

$$f_0(x^*) = g(\alpha^*, \beta^*) = \inf_x L(x, \alpha^*, \beta^*) \leq f_0(x^*) + \sum_{i=1}^{k} \alpha_i^* f_i(x^*) + \sum_{j=1}^{l} \beta_j^* h_j(x^*) \leq f_0(x^*),$$

since $\alpha_i^* f_i(x^*) \leq 0$ and $h_j(x^*) = 0$. We can conclude that $x^*$ minimizes $L(x, \alpha^*, \beta^*)$ with both sums zero. From this we can derive the complementary slackness condition

$$\alpha_i^* f_i(x^*) = 0, \qquad i = 1, \dots, k,$$

so either the Lagrange multiplier vanishes, $\alpha_i^* = 0$, or the constraint attains its maximal value, $f_i(x^*) = 0$ (i.e. it is active). Later on we will use this condition to decide whether a training point is a support vector or not.

The dual function can now be derived for the SVM classifier problem. Given the training set Eq. (1.2), the primal optimization problem was introduced as

$$\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t.} \quad y_i \left( \langle w, x_i \rangle + b \right) \geq 1, \qquad i = 1, \dots, m.$$

The Lagrangian function writes as

$$L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{m} \alpha_i \left( y_i ( \langle w, x_i \rangle + b ) - 1 \right);$$

setting its derivatives with respect to $w$ and $b$ to zero expresses the optimal parameters in terms of the Lagrange multipliers

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0,$$

which we can now substitute to obtain our dual function

$$g(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle.$$

The dual problem is thus composed of the dual function and the constraints

$$\max_{\alpha} \; g(\alpha) \quad \text{s.t.} \quad \alpha_i \geq 0, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0.$$
From the KKT conditions at the optimum, complementary slackness follows

$$\alpha_i^* \left( y_i ( \langle w^*, x_i \rangle + b^* ) - 1 \right) = 0, \qquad i = 1, \dots, m;$$

from here we can see

$$\alpha_i^* > 0 \;\Rightarrow\; y_i ( \langle w^*, x_i \rangle + b^* ) = 1,$$

with $\alpha_i^* > 0$ indicating the support vectors, those which have the smallest distance to the separating hyperplane, i.e. after solving the problem the support vectors are determined by $\alpha_i^* > 0$.

Let $S = \{ i \mid \alpha_i^* > 0 \}$, and let $x_+$ and $x_-$ be support vectors of the positive and negative class. Then the hyperplane is parameterized by

$$w^* = \sum_{i \in S} \alpha_i^* y_i x_i, \qquad b^* = -\tfrac{1}{2} \left( \langle w^*, x_+ \rangle + \langle w^*, x_- \rangle \right). \qquad (1.3)$$

1.4 Outliers and regularization

Are we finished in our endeavour? Not really, the handling of outliers is still missing. If we apply the previous approach to a non-separable dataset, the solver will simply tell us that there is no solution. It would be much nicer if the solution told us which points are outliers and gave us an approximation without them. Furthermore we have a problem with extreme data points causing our hyperplane to fluctuate strongly (see Figure 1.3).

Figure 1.3 Influence of an outlier on the separating hyperplane; note that a single support vector changes the border completely.

Both problems are addressed by regularization

$$\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i \left( \langle w, x_i \rangle + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \qquad i = 1, \dots, m;$$

the problem is identical except for the introduction of the loss variables $\xi_i$.

Intuitively we allow mis-classification in the constraints (with a cost $C \sum_i \xi_i$ paid in the objective). The hyperparameter $C$ controls how much we pay for mis-classification. We can solve for the dual problem again and obtain

$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0. \qquad (1.4)$$

From the complementary slackness conditions it follows that

$$\alpha_i^* = 0 \;\Rightarrow\; y_i ( \langle w^*, x_i \rangle + b^* ) \geq 1, \qquad 0 < \alpha_i^* < C \;\Rightarrow\; y_i ( \langle w^*, x_i \rangle + b^* ) = 1, \qquad \alpha_i^* = C \;\Rightarrow\; y_i ( \langle w^*, x_i \rangle + b^* ) \leq 1,$$

so any training vector is automatically classified as either a support vector ($0 < \alpha_i^* < C$), an outlier ($\alpha_i^* = C$) or a mundane vector ($\alpha_i^* = 0$).
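A small sketch of this bookkeeping in Rust (my own illustration, not code from the original post): sort each training point into one of the three categories by its optimal $\alpha_i$.

```rust
#[derive(Debug)]
enum VectorKind {
    Mundane, // alpha_i = 0
    Support, // 0 < alpha_i < C
    Outlier, // alpha_i = C
}

// `eps` absorbs the numerical tolerance of the QP solver.
fn classify_alpha(alpha: f64, c: f64, eps: f64) -> VectorKind {
    if alpha <= eps {
        VectorKind::Mundane
    } else if alpha >= c - eps {
        VectorKind::Outlier
    } else {
        VectorKind::Support
    }
}

fn main() {
    let c = 1.0;
    for a in [0.0, 0.3, 1.0] {
        println!("alpha = {a}: {:?}", classify_alpha(a, c, 1e-6));
    }
}
```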

The hyperplane classifies new data $x$ by projecting it

$$f(x) = \langle w^*, x \rangle + b^*. \qquad (1.5)$$

The decision can then be used in two ways.

Hard classifier: output a binary decision, decide $y = +1$ if $f(x) \geq 0$, otherwise $y = -1$.

Soft classifier: compute a graded decision with $c(x) = \max(-1, \min(+1, f(x)))$, i.e. the projection clipped to the interval $[-1, +1]$.
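Both rules are one-liners once $w^*$ and $b^*$ are known; a minimal sketch in Rust (illustrative only, assuming the parameters come from the solver later on):

```rust
use nalgebra::DVector;

fn project(w: &DVector<f64>, b: f64, x: &DVector<f64>) -> f64 {
    w.dot(x) + b // Eq. (1.5)
}

// Hard classifier: binary decision in {-1, +1}.
fn hard(w: &DVector<f64>, b: f64, x: &DVector<f64>) -> f64 {
    if project(w, b, x) >= 0.0 { 1.0 } else { -1.0 }
}

// Soft classifier: projection clipped to [-1, +1].
fn soft(w: &DVector<f64>, b: f64, x: &DVector<f64>) -> f64 {
    project(w, b, x).clamp(-1.0, 1.0)
}
```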

1.5 Also sprach Zarathustra

Now it's finally time to apply the quadratic problem in a toy example. In the previous chapter we derived a recipe for linear classification, and we saw how to interpret its results. We will now use Rust, nalgebra and osqp for a small classification problem.

The problem is rather simple. We want to decide whether a word is used in a positive or negative context, so for example dishonest should be classified as class $-1$ and friendly as $+1$. Of course we won't use the words themselves as input for the classifier, but word vectors trained on Wikipedia. These word vectors represent contextual meaning (examples are adjective, female, positive, time, emotional, etc.), but the representation of sentiment is hidden somewhere in the vector space.

First you should download the word vector file from here. Its format is very simple: each line represents a single word as the tuple $(\text{word}, x_1, \dots, x_{300})$. We will select a couple of words into a new file (don't try to open the original file with vim or anything, because it is so large) with

grep "^somewhat " wiki-news-300d-1M.vec >> training.vec

The training set can now be read in with Rust and processed into the matrices representing our QP. We will use osqp here, which can be used to solve any QP but is specialised for large sparse matrices. The website states that the solver allows input of the form

$$\min_{x} \; \frac{1}{2} x^T P x + q^T x \quad \text{s.t.} \quad l \leq A x \leq u.$$
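A sketch of the loading step (my own reconstruction; the post's original listing is not reproduced in this copy). It reads training.vec into a list of words and an $m \times 300$ nalgebra matrix; the class labels still have to be assigned per word afterwards.

```rust
use nalgebra::DMatrix;
use std::fs;

// Parse the grep-selected .vec file: one word per line followed by its components
// (assumes the header line of the original fastText file is not present).
fn read_vec_file(path: &str) -> (Vec<String>, DMatrix<f64>) {
    let content = fs::read_to_string(path).expect("could not read file");
    let mut words = Vec::new();
    let mut rows: Vec<Vec<f64>> = Vec::new();
    for line in content.lines() {
        let mut parts = line.split_whitespace();
        words.push(parts.next().expect("missing word").to_string());
        rows.push(parts.map(|s| s.parse().expect("bad number")).collect());
    }
    let (n, dim) = (words.len(), rows[0].len());
    let data: Vec<f64> = rows.into_iter().flatten().collect();
    (words, DMatrix::from_row_slice(n, dim, &data))
}
```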

So osqp requires us to re-express Eq. (1.4) with the matrices $P, q, A, l, u$. If we compare both optimization problems we get

$$P = (y y^T) \circ (X X^T), \qquad q = -\mathbf{1}, \qquad A = \begin{pmatrix} y^T \\ I \end{pmatrix}, \qquad l = \begin{pmatrix} 0 \\ \mathbf{0} \end{pmatrix}, \qquad u = \begin{pmatrix} 0 \\ C \mathbf{1} \end{pmatrix},$$

with $X$ the data matrix (one row per training vector) and $y$ the corresponding classes (the Hadamard product $\circ$ means element-wise multiplication). The sign of the objective is flipped because osqp minimizes while Eq. (1.4) maximizes, which leaves $P$ as above and gives $q = -\mathbf{1}$.

The input matrices can now be constructed from the data matrix dm and the class vector classes; a sketch of how this could look is given below.
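This is my own hedged reconstruction (the original listing is not included in this copy): dm is assumed to be the $m \times 300$ matrix of word vectors and classes the $m$-vector of labels in $\{-1, +1\}$.

```rust
use nalgebra::{DMatrix, DVector};

// Assemble P, q, A, l, u of the osqp problem from the training data.
fn build_qp(
    dm: &DMatrix<f64>,
    classes: &DVector<f64>,
    c: f64,
) -> (DMatrix<f64>, DVector<f64>, DMatrix<f64>, DVector<f64>, DVector<f64>) {
    let m = dm.nrows();
    // P = (y y^T) ∘ (X X^T), the Hadamard product of label matrix and Gram matrix
    let gram = dm * dm.transpose();
    let yyt = classes * classes.transpose();
    let p = gram.component_mul(&yyt);
    // q = -1
    let q = DVector::from_element(m, -1.0);
    // A = [y^T ; I] with bounds l = [0 ; 0] and u = [0 ; C]
    let mut a = DMatrix::zeros(m + 1, m);
    a.row_mut(0).copy_from(&classes.transpose());
    for i in 0..m {
        a[(i + 1, i)] = 1.0;
    }
    let l = DVector::zeros(m + 1);
    let mut u = DVector::from_element(m + 1, c);
    u[0] = 0.0;
    (p, q, a, l, u)
}
```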

Solve the quadratic problem
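Again a hedged sketch rather than the original code; nalg_to_osqp is my guess at the helper mentioned below, converting a dense nalgebra matrix into the CSC format of the osqp crate.

```rust
use nalgebra::{DMatrix, DVector};
use osqp::{CscMatrix, Problem, Settings};

// Flatten a dense matrix column by column into osqp's sparse CSC format.
fn nalg_to_osqp(m: &DMatrix<f64>) -> CscMatrix<'static> {
    let (nrows, ncols) = (m.nrows(), m.ncols());
    let mut indptr = vec![0usize];
    let mut indices = Vec::new();
    let mut data = Vec::new();
    for j in 0..ncols {
        for i in 0..nrows {
            if m[(i, j)] != 0.0 {
                indices.push(i);
                data.push(m[(i, j)]);
            }
        }
        indptr.push(indices.len());
    }
    CscMatrix {
        nrows,
        ncols,
        indptr: indptr.into(),
        indices: indices.into(),
        data: data.into(),
    }
}

fn solve_qp(
    p: &DMatrix<f64>,
    q: &DVector<f64>,
    a: &DMatrix<f64>,
    l: &DVector<f64>,
    u: &DVector<f64>,
) -> DVector<f64> {
    let settings = Settings::default().verbose(false);
    // osqp only needs the upper triangular part of P.
    let mut prob = Problem::new(
        nalg_to_osqp(p).into_upper_tri(),
        q.as_slice(),
        nalg_to_osqp(a),
        l.as_slice(),
        u.as_slice(),
        &settings,
    )
    .expect("failed to set up the QP");
    let result = prob.solve();
    DVector::from_column_slice(result.x().expect("solver did not converge"))
}
```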

This sets up the solver and propagates the problem through the matrices. We need to convert the nalgebra matrices to the sparse matrix format of osqp with nalg_to_osqp. The result is then converted back to a nalgebra vector.

The parameters of the hyperplane can now be derived: the normal

$$w^* = \sum_{i: \alpha_i^* > 0} \alpha_i^* y_i x_i,$$

and the offset from two support vectors of different classes

$$b^* = -\tfrac{1}{2} \left( \langle w^*, x_+ \rangle + \langle w^*, x_- \rangle \right).$$

Finally we can classify our test data set by Eq. (1.5).
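A sketch of these last steps (again my own reconstruction, assuming alpha is the solver output and test is a matrix of test word vectors, one row per word):

```rust
use nalgebra::{DMatrix, DVector};

// Recover (w, b) from the dual solution, Eq. (1.3).
fn hyperplane(
    alpha: &DVector<f64>,
    dm: &DMatrix<f64>,
    classes: &DVector<f64>,
) -> (DVector<f64>, f64) {
    let (m, eps) = (dm.nrows(), 1e-6);
    // w = sum_i alpha_i y_i x_i
    let mut w = DVector::zeros(dm.ncols());
    for i in 0..m {
        w += alpha[i] * classes[i] * dm.row(i).transpose();
    }
    // Offset from one vector with alpha_i > 0 per class (assumes both exist;
    // for a cleaner offset one would restrict to 0 < alpha_i < C).
    let sv = |cls: f64| (0..m).find(|&i| alpha[i] > eps && classes[i] == cls).unwrap();
    let b = -0.5 * (w.dot(&dm.row(sv(1.0)).transpose()) + w.dot(&dm.row(sv(-1.0)).transpose()));
    (w, b)
}

// Soft classification of every row of `test`: Eq. (1.5) clipped to [-1, +1].
fn classify(w: &DVector<f64>, b: f64, test: &DMatrix<f64>) -> Vec<f64> {
    (0..test.nrows())
        .map(|i| (w.dot(&test.row(i).transpose()) + b).clamp(-1.0, 1.0))
        .collect()
}
```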

The result looks like this:

word       score
bad        -1.0
evil       -1.0
wicked     -1.0
angry      -1.0
annoy      -1.0
anxious    -1.0
sun        -0.052
earnest     0.151
light       0.3952
funny       0.9568
good        1.0
tasty       1.0
happy       1.0
easy        1.0
amazing     1.0
adorable    1.0

1.6 Non-linear kernel function

So far we used the linear similarity measure $\langle x_i, x_j \rangle$ in Eq. (1.4). But actually we can choose our similarity function $K(x_i, x_j)$ arbitrarily (instead of the dot product) with the constraint that $K$ is non-negative definite. The general idea is to augment the feature vector with additional non-linear dimensions which could exhibit properties of the underlying structure. In Figure 1.4 the one-dimensional problem can't be solved with a linear discriminant, but if we map it to two dimensions, e.g. $x \mapsto (x, x^2)$, we can find a separating hyperplane.

Figure 1.4 Illustrative example for a kernel function mapping

If we use this general form, we lose the ability to precalculate the hyperplane in a closed form like in Eq. (1.3), because we can't precompute $w^* = \sum_i \alpha_i^* y_i x_i$. The similarity function can't be separated anymore, because it is not linear,

$$f(x) = \sum_{i: \alpha_i^* > 0} \alpha_i^* y_i \, K(x_i, x) + b^*;$$

the classification needs to re-evaluate the similarity of the support vectors to a new data point.
There are a couple of possible kernel functions; please refer to Yek14 for a full comparison. Commonly used ones (listed here as standard examples) are the polynomial kernel $K(x, z) = (\langle x, z \rangle + c)^d$, the Gaussian (RBF) kernel $K(x, z) = \exp(-\lVert x - z \rVert^2 / 2\sigma^2)$ and the sigmoid kernel $K(x, z) = \tanh(\kappa \langle x, z \rangle + c)$.

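As an illustration (my own sketch, not part of the original example), a kernelized decision function in Rust with the RBF kernel; the sum over the support vectors has to be evaluated for every new point.

```rust
use nalgebra::DVector;

// Gaussian RBF kernel, one common choice of similarity function.
fn rbf(x: &DVector<f64>, z: &DVector<f64>, sigma: f64) -> f64 {
    (-(x - z).norm_squared() / (2.0 * sigma * sigma)).exp()
}

struct KernelSvm {
    support: Vec<DVector<f64>>, // support vectors x_i
    weights: Vec<f64>,          // alpha_i * y_i
    b: f64,
    sigma: f64,
}

impl KernelSvm {
    // f(x) = sum_i alpha_i y_i K(x_i, x) + b
    fn decision(&self, x: &DVector<f64>) -> f64 {
        self.support
            .iter()
            .zip(&self.weights)
            .map(|(s, w)| w * rbf(s, x, self.sigma))
            .sum::<f64>()
            + self.b
    }
}
```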
There is a special class of similarity functions which can be separated into a dot product with a feature mapping $\phi$

$$K(x, z) = \langle \phi(x), \phi(z) \rangle.$$

Figure 1.4 exhibits for example the feature mapping $\phi(x) = (x, x^2)$.

Example. Given $x, z \in \mathbb{R}^n$, define the kernel function as

$$K(x, z) = \left( \langle x, z \rangle \right)^2.$$

We can show that this kernel actually has a feature mapping. For example, let $n = 2$, $x = (x_1, x_2)$ and $z = (z_1, z_2)$. Use $\phi(x) = (x_1^2, x_1 x_2, x_2 x_1, x_2^2)$. Then

$$\langle \phi(x), \phi(z) \rangle = x_1^2 z_1^2 + 2\, x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \left( x_1 z_1 + x_2 z_2 \right)^2 = K(x, z).$$

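A concrete numerical check of this identity (my own added example): for $x = (1, 2)$ and $z = (3, 4)$,

$$K(x, z) = (1 \cdot 3 + 2 \cdot 4)^2 = 11^2 = 121, \qquad \langle \phi(x), \phi(z) \rangle = 1 \cdot 9 + 2 \cdot 12 + 2 \cdot 12 + 4 \cdot 16 = 121.$$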
For a full treatment of kernel functions you should refer to the book Joh04.

1.7 Multiclass classification

SVMs are inherently two-class classifiers, but we can extend the method to multiple classes by training multiple discriminants and classifying data multiple times. The method to use depends on whether the classes are mutually exclusive or not.

The simplest case is classification for classes which are not mutually exclusive. Then a feature vector can belong to any of these classes, which is why they are called any-of SVMs. Their approach is very simple:

build a classifier for each class, where the training set consists of the data belonging to that class (positive label) and the remaining data (negative label),

apply each classifier separately to a new data point.

The other case is one-of classification, where a data point belongs to exactly one label out of a bag of more than two labels. Consider for example whether a word is female, male or neuter in the German language. The word Mond (moon) is male, Sonne (sun) is female and Haus (house) is neuter. A classifier should tell us for a new word to which of these three classes it probably belongs.

The one-against-all method constructs k SVM models for k classes, labeling in each model the data of the desired class as positive and the rest as negative. After training these k models and finding their parameters $(w_c, b_c)$ we can classify a new data point $x$: we say $x$ is in the class $c$ with the largest value of the decision function

$$\hat{c} = \arg\max_{c = 1, \dots, k} \left( \langle w_c, x \rangle + b_c \right).$$
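A minimal sketch of that decision rule in Rust (my own illustration; models holds the $(w_c, b_c)$ pairs of the k trained SVMs):

```rust
use nalgebra::DVector;

// Pick the class whose decision function scores highest.
fn one_against_all(models: &[(DVector<f64>, f64)], x: &DVector<f64>) -> usize {
    models
        .iter()
        .enumerate()
        .map(|(c, (w, b))| (c, w.dot(x) + b))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(c, _)| c)
        .expect("no models given")
}
```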
Another method is one-against-one, which constructs one SVM model for each pair of classes, $k(k-1)/2$ models for k classes, labeling one class of the pair as positive and the other as negative. Further methods and a study of their performance and application can be found in Hsu02.

1.8 Conclusion

SVM is one of the best off-the-shelf classification algorithms, with good properties for linear data. Its well-explainable behaviour and simple derivation make it very popular among data science people, and extensions can make it feasible for non-linear data and multi-class decisions.

We derived the quadratic problem in its simplest form with regularization and used the result for a Natural Language Processing (NLP) task. The framework is rather generic and can be used to classify many things. For example, a useful exercise could be the multi-class gender decision for some words. We skipped the practical application of the extensions to be concise, but background information can be found in the given papers and the implementation should be straightforward.

Finally, we skipped practical implementations of SVM for huge datasets. This is a topic of its own, which includes for example the SMO algorithm. A broad overview and comparison can be found in Men09.

Bibliography