Although a support vector machine model (binary classifier) is more commonly built by solving a quadratic programming problem in the dual space, it can be built fast by solving the primal optimization problem also. In this article a Support Vector Machine implementation is going to be described by solving the primal optimization problem with sub-gradient solver using stochastic gradient decent. The algorithm is called the Pegasos algorithm, as described by Shai Shalev-Shwartz et al, in their original paper.

First the vanilla version and then the kernelized version of the the Pegasos algorithm is going to be described along with some applications on some datasets.

Next the hinge-loss function for the SVM is going to be replaced by the log-loss function for the Logistic Regression and the primal SVM problem is going to be converted to regularized logistic regression.

Finally document sentiment classification will be done by first training a Perceptron, SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing it on an unseen part of the corpus.

The time to train the classifiers along with accuracy obtained on a held-out dataset will be computed.

Most of the above problems appeared as an assignment in this course. The description of the problems are sometimes taken from the assignment itself.

1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITM, these slides from CMU and also shown in the next figure taken from the slides, the Soft-SVM Primal Lagrangian can be represented as follows:

or as the following if the explicit bias term is discarded:

where the 0-1 loss is approximated by the hinge-loss.

Changing the regularization constant to λ, it can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.

The next figure also describes the Pegasos algorithm, which performs an SGD on the primal objective (Lagrangian) with carefully chosen steps.

Since the hinge-loss is not continuous, the sub-gradient of the objective is considered instead for the gradient computation for a single update step with SGD.

The learning rate η is gradually decreased with iteration.

The following figure shows a simplified version of the algorithm:

The following python code implements the algorithm:

class SVMPegasos(LinearClassifier): """Implementation of SVM with SGD with PEGASOS Algorithm""" def __init__(self, n_iter=10, lambda1=1): self.n_iter = n_iter self.lambda1 = lambda1 def fit(self, X, Y): Y = list(Y) self.find_classes(Y) # convert all output values to +1 or -1 Yn = [sign(y, self.positive_class) for y in Y] X = X.toarray() m, n_features = X.shape[0], X.shape[1] self.w = numpy.zeros( n_features ) for i in range(self.n_iter): eta = 1. / (self.lambda1*(i+1)) j = numpy.random.choice(m, 1)[0] x, y = X[j], Yn[j] score = self.w.dot(x) if y*score < 1: self.w = (1 - eta*self.lambda1)*self.w + eta*y*x else: self.w = (1 - eta*self.lambda1)*self.w

Some Notes

The optional projection step has been left out (the line in square brackets in the paper).

As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.

The number η is the step length in gradient descent.

The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).

Since we compute the step length by dividing by t, it will gradually become smaller and smaller. The purpose of this is to avoid the “bounce around” problem as it gets close to the optimum.

Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a bias term (non-regularized) too, the fastest implementation is probably with the binary search on a real interval after the PEGASOS algorithm returns an optimum w.

Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

The following animation and the figure show the final decision surface and how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for primal SVM classifier, respectively.

As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.

The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS SVM implementation ran very fast taking only 24 milliseconds.

Classification accuracy obtained on the test split was 95%.

2. Kernelized PEGASOS

The next figure, taken from the same paper shows how the algorithm can be adapted to kernel-SVM.

Using the PEGASOS implementation to classify the following linearly non-separable datasets

Dataset 1

The following animation and the figure show the final decision surface and how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.

As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.

The dataset was small (with 240 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used) ran very fast, taking only 2.09 seconds.

Classification accuracy obtained on the test split was 95.8%.

Dataset 2

The following animation and the figure show the final decision surface and how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.

As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.

The dataset was small (with 373 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used) training ran very fast taking only 5.1 seconds.

Classification accuracy obtained on the test split was 100%.

3. From SVM to Logistic Regression

The following figures show how by changing the loss function (from hinge-loss to log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.

Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

The following animation and the figure show the final decision surface and how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for the Logistic Regression classifier, respectively.

As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.

The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS LR implementation ran very fast taking only 27 milliseconds.

Classification accuracy obtained on the test split was 97.5%.

4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let’s fit a Perceptron model along with SVM / LR models with PEGASOS algorithm on 80% sample and predict to classify the sentiments of the 20% held-out text, followed by computing the accuracy. The next figure shows first few rows of the corpus and number of positive and negative sentiment examples.

The next figures show the distribution of the word lengths for the texts for different sentiments.

Steps

Create sklearn Pipeline to Compute simple BOW features (CountVectorizer from sklearn): obtained 51063 columns with BOW. Use sklearn SelectKBest to choose top 5000 features Train / test the model

Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.

Predict sentiments on held-out 20% of the dataset and compute accuracy by comparing with the ground truths.

Also Record the time taken to train the classifier.

Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.

The following python code shows how the text-classification pipeline is implemented with SVMPegasos.

def classify_with_svm(): # read all the documents X, Y = read_corpus('all_sentiment_shuffled.txt') # split into training and test parts Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8, random_state=0) classifier = Pipeline( [('vec', CountVectorizer(preprocessor = lambda x: x, tokenizer = lambda x: x)), ('fs', SelectKBest(k=5000)), ('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))] ) t0 = time.time() classifier.fit(Xtrain, Ytrain) t1 = time.time() print('Training time:', t1-t0, 'seconds.') Yguess = classifier.predict(Xtest) print('accuracy:', accuracy_score(Ytest, Yguess))

Some Notes

The accuracy of the classifier can be improved with applying more text-mining techniques such as pre-processing, including language model / tf-idf features.

This experiment was just to test/compare the SVM / LR PEGASOS with Perceptron.

The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., regularization parameter λ for PEGASOS LR/SVM, number of iterations to train etc.)

For Perceptron the entire training data was passed 10 times to the model while training.

For PEGASOS SGD implementations of regularized SVM and logistic regression, number of iterations used were 10*size_of_training_dataset. Also, the λ value used was 0.01.

Results

As can be seen, even with very simple features, the performance of all the classifiers in terms of accuracy of prediction on the test set and the time taken to train the model are good and comparable to each other.