Traditionally, many classification problems involve two or more classes: the goal of the machine learning application is to distinguish test data between a number of classes, using training data. But what if you only have data of one class and the goal is to test new data and find out whether it is similar to the training data or not? A method for this task, which has gained much popularity over the last two decades, is the One-Class Support Vector Machine. This (quite lengthy) blog post will give an introduction to this technique and show the two main approaches.

Just one class?

First, look at our problem situation: we would like to determine whether (new) test data is a member of a specific class, determined by our training data, or not. Why would we want this? Imagine a factory setting: heavy machinery under constant surveillance by some advanced system. The task of the controlling system is to determine when something goes wrong: the products are below quality, the machine produces strange vibrations, or a temperature rises. It is relatively easy to gather training data of situations that are OK: it is just the normal production situation. But collecting example data of a faulty system state can be rather expensive, or simply impossible. Even if a faulty system state could be simulated, there is no way to guarantee that all faulty states are simulated and thus recognized in a traditional two-class problem.

To cope with this problem, one-class classification problems (and solutions) are introduced. By providing only the normal training data, an algorithm creates a (representational) model of this data. If newly encountered data is too different, according to some measurement, from this model, it is labeled as out-of-class. We will look into the application of Support Vector Machines to this one-class problem.

Basic concepts of Support Vector Machines

Let us first take a look at the traditional two-class support vector machine. Consider a data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$; points $x_i \in \mathbb{R}^d$ in a (for instance two-dimensional) space, where $x_i$ is the $i$-th input data point and $y_i \in \{-1, 1\}$ is the $i$-th output pattern, indicating the class membership.

A very nice property of SVMs is that they can create a non-linear decision boundary by projecting the data through a non-linear function $\phi$ to a space of higher dimension. This means that data points which cannot be separated by a straight line in their original input space $I$ are "lifted" to a feature space $F$ where a "straight" hyperplane can separate the data points of one class from another. When that hyperplane is projected back to the input space $I$, it has the form of a non-linear curve. The following video illustrates this process: the blue dots (in the white circle) cannot be linearly separated from the red dots. By using a polynomial kernel for the projection (more on that later), all the dots are lifted into the third dimension, in which a hyperplane can be used for separation. When the intersection of the plane with the space is projected back to the two-dimensional space, a circular boundary arises.

The hyperplane is represented with the equation $w \cdot \phi(x) + b = 0$, with $w \in F$ and $b \in \mathbb{R}$. The constructed hyperplane determines the margin between the classes: all the data points of class $-1$ are on one side, and all the data points of class $1$ on the other. The distance from the closest point of each class to the hyperplane is equal; thus the constructed hyperplane searches for the maximal margin ("separating power") between the classes. To prevent the SVM classifier from over-fitting on noisy data (or to create a soft margin), slack variables $\xi_i$ are introduced to allow some data points to lie within the margin, and the constant $C > 0$ determines the trade-off between maximizing the margin and the number of training data points within that margin (and thus training errors). The objective function of the SVM classifier is the following minimization formulation:

$$ \min_{w, b, \xi_i} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^n \xi_i $$

subject to:

$$ y_i (w \cdot \phi(x_i) + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0 \quad \text{for all } i = 1, \ldots, n $$

When this minimization problem (with quadratic programming) is solved using Lagrange multipliers, it gets really interesting. The decision function (classification) rule for a data point $x$ then becomes: $$ f(x) = \operatorname{sgn}( \sum_{i=1}^n \alpha_i y_i K(x, x_i) + b) $$

Here the $\alpha_i$ are the Lagrange multipliers; every data point with $\alpha_i > 0$ is weighted in the decision function and thus "supports" the machine: hence the name Support Vector Machine. Since the solution of an SVM is sparse, there will be relatively few Lagrange multipliers with a non-zero value.
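This sparsity is easy to observe in practice. The sketch below uses scikit-learn (which wraps LibSVM internally) rather than the Matlab setup used later in this post; the data set and parameters are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, 100 points each
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
# Only the points with non-zero alpha_i are kept as support vectors,
# typically far fewer than the 200 training points
print(len(clf.support_))
```

With easily separable data, only the handful of points near the margin end up with non-zero multipliers; all other training points can be discarded after training.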

Kernel Function

The function $K(x, x_i) = \phi(x)^T \phi(x_i)$ is known as the kernel function. Since the outcome of the decision function only relies on the dot-product of the vectors in the feature space $F$ (i.e. all the pairwise distances between the vectors), it is not necessary to perform an explicit projection to that space (as was done in the above video). As long as a function $K$ yields the same results, it can be used instead. This is known as the kernel trick, and it is what gives SVMs such great power with non-linearly separable data points: the feature space $F$ can be of unlimited dimension and thus the hyperplane separating the data can be very complex. In our calculations, though, we avoid that complexity.

Popular choices for the kernel function are linear, polynomial, and sigmoidal, but most often the Gaussian Radial Basis Function is used: $$ K(x, x') = \exp \left( - \frac{ \lVert x - x' \rVert ^2}{2 \sigma^2 } \right) $$ where $\sigma \in \mathbb{R}$ is a kernel parameter and $\lVert x - x' \rVert$ is the dissimilarity measure (the Euclidean distance).
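As a minimal sketch, the Gaussian RBF kernel is just a few lines of Python; the sample points below are made up to show its basic properties (identical points map to 1, distant points to nearly 0, and the function is symmetric):

```python
import numpy as np

def rbf_kernel(x, x_prime, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x - x'||^2 / (2 * sigma^2))."""
    sq_dist = np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([4.0, 6.0])

print(rbf_kernel(x, x))                      # identical points -> 1.0
print(rbf_kernel(x, z))                      # distant points -> close to 0
print(rbf_kernel(x, z) == rbf_kernel(z, x))  # symmetric -> True
```

Note that $K(x, x') \in (0, 1]$, which can be read as a similarity score between the two points.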

With this set of formulas and concepts we are able to classify a set of data points into two classes with a non-linear decision function. But we are interested in the case of a single class of data. Roughly, there are two different approaches, which we will discuss in the next two sections.

One-Class SVM according to Schölkopf

The Support Vector Method For Novelty Detection by Schölkopf et al. basically separates all the data points from the origin (in feature space $F$) and maximizes the distance from this hyperplane to the origin. This results in a binary function which captures regions in the input space where the probability density of the data lives. Thus the function returns $+1$ in a “small” region (capturing the training data points) and $-1$ elsewhere.

The quadratic programming minimization function is slightly different from the original stated above, but the similarity is still clear:

$$ \min_{w, \xi_i, \rho} \ \frac{1}{2} \lVert w \rVert^2 + \frac{1}{\nu n} \sum_{i=1}^n \xi_i - \rho $$

subject to:

$$ (w \cdot \phi(x_i)) \geq \rho - \xi_i \quad \text{and} \quad \xi_i \geq 0 \quad \text{for all } i = 1, \ldots, n $$

In the previous formulation the parameter $C$ decided the smoothness. In this formula it is the parameter $\nu \in (0, 1]$ that characterizes the solution:

- it sets an upper bound on the fraction of outliers (training examples regarded as out-of-class), and
- it is a lower bound on the fraction of training examples used as support vectors.

Due to the importance of this parameter, this approach is often referred to as $\nu$-SVM.

Again by using Lagrange techniques and a kernel function for the dot-product calculations, the decision function becomes: $$ f(x) = \operatorname{sgn}((w \cdot \phi(x)) - \rho) = \operatorname{sgn}\left( \sum_{i=1}^n \alpha_i K(x, x_i) - \rho \right) $$
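This decision function can be reconstructed from a fitted model. The sketch below uses scikit-learn's `OneClassSVM` (an implementation of this $\nu$-SVM method, wrapping LibSVM) on made-up Gaussian data: the fitted model exposes the $\alpha_i$ as `dual_coef_`, the support vectors $x_i$, and $-\rho$ as `intercept_`, so we can rebuild $\sum_i \alpha_i K(x, x_i) - \rho$ by hand and compare it with the library's own output:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (200, 2))  # "normal" training data, a single class

gamma = 0.5  # sklearn's RBF uses exp(-gamma * ||x - x'||^2), i.e. gamma = 1/(2 sigma^2)
clf = OneClassSVM(kernel="rbf", gamma=gamma, nu=0.1).fit(X)

# Rebuild f(z) = sum_i alpha_i K(z, x_i) - rho from the fitted model:
z = np.array([[0.1, -0.2]])
K = np.exp(-gamma * np.sum((clf.support_vectors_ - z) ** 2, axis=1))
manual = float(clf.dual_coef_ @ K + clf.intercept_)

print(np.isclose(manual, clf.decision_function(z)[0]))  # True
print(clf.predict(z)[0])  # 1: the point looks like the training data
```

A point far from the training cloud (say `[[5.0, 5.0]]`) would get a negative decision value and be predicted as $-1$, i.e. out-of-class.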

This method thus creates a hyperplane, characterized by $w$ and $\rho$, which has maximal distance from the origin in feature space $F$ and separates all the data points from the origin. Another method is to create a circumscribing hypersphere around the data in feature space. The following section shows that approach.

One-Class SVM according to Tax and Duin

The method of Support Vector Data Description by Tax and Duin (SVDD) takes a spherical, instead of planar, approach. The algorithm obtains a spherical boundary, in feature space, around the data. The volume of this hypersphere is minimized, to minimize the effect of incorporating outliers in the solution.

The resulting hypersphere is characterized by a center $\mathbf{a}$ and a radius $R > 0$, the distance from the center to (any support vector on) the boundary, of which the volume $R^2$ will be minimized. The center $\mathbf{a}$ is a linear combination of the support vectors (the training data points for which the Lagrange multiplier is non-zero). Just as in the traditional formulation, one could require that all the distances from the data points $x_i$ to the center are strictly less than $R$, but to create a soft margin again slack variables $\xi_i$ with penalty parameter $C$ are used. The minimization problem then becomes:

$$ \min_{R, \mathbf{a}, \xi_i} \ R^2 + C \sum_{i=1}^n \xi_i $$

subject to:

$$ \lVert \phi(x_i) - \mathbf{a} \rVert^2 \leq R^2 + \xi_i \quad \text{and} \quad \xi_i \geq 0 \quad \text{for all } i = 1, \ldots, n $$

After solving this by introducing Lagrange multipliers $\alpha_i$, a new data point $z$ can be tested to be in or out of the class. It is considered in-class when its distance to the center is smaller than or equal to the radius. Using a kernel function (such as the Gaussian kernel) for the dot-products, this test becomes:

$$ \lVert \phi(z) - \mathbf{a} \rVert^2 = K(z, z) - 2 \sum_{i=1}^n \alpha_i K(z, x_i) + \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j K(x_i, x_j) \leq R^2 $$
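The kernel-expanded distance test is straightforward to evaluate once the multipliers are known. The sketch below does not solve the quadratic program; the training points and the $\alpha_i$ (uniform weights summing to 1) are made up purely to illustrate the distance computation:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel exp(-||a - b||^2 / (2 sigma^2)), broadcasting over rows."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))

# Toy training data (the "normal" class) and illustrative multipliers alpha_i
# summing to 1 -- in practice these come from solving the QP above.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
alpha = np.full(4, 0.25)

def dist_sq_to_center(z):
    # ||phi(z) - a||^2 = K(z,z) - 2 sum_i alpha_i K(z, x_i)
    #                    + sum_{i,j} alpha_i alpha_j K(x_i, x_j)
    Kzz = rbf(z, z)
    Kzx = rbf(X, z)
    Kxx = np.array([[rbf(xi, xj) for xj in X] for xi in X])
    return Kzz - 2 * alpha @ Kzx + alpha @ Kxx @ alpha

inside = dist_sq_to_center(np.array([0.5, 0.5]))   # near the data
outside = dist_sq_to_center(np.array([5.0, 5.0]))  # far away
print(inside < outside)  # True
```

A point near the training data ends up closer to the center $\mathbf{a}$ in feature space than a far-away point, which is exactly what the $\leq R^2$ test exploits.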

You can see the similarity between the traditional two-class method and the algorithms by Schölkopf and by Tax and Duin. So far for the theoretical fundamentals of Support Vector Machines. Let's take a very quick look at some applications of this method.

Applications (in Matlab)

A very good and much used library for SVM-classification is LibSVM, which can be used for Matlab. Out of the box it supports one-class SVM following the method of Schölkopf. Also available in the LibSVM tools is the method for SVDD, following the algorithm of Tax and Duin.

To give a nice visual clarification of how the kernel mapping (to feature space $F$) works, I created a small Matlab script that lets you create two data sets of red and blue dots (note: this simulates a two-class example). After clicking, you are able to inspect the data after it has been projected to the three-dimensional space. The data will then result in a shape like the following image.
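The Matlab script itself is not reproduced here, but the lifting it visualizes can be sketched in a few lines of Python. This is an illustrative reconstruction, not the original script: it uses the explicit degree-2 polynomial feature map $\phi(x, y) = (x^2, \sqrt{2}xy, y^2)$, whose dot product equals the polynomial kernel $(x \cdot x')^2$, on made-up blue dots inside a circle and red dots on a ring around them:

```python
import numpy as np

def lift(points):
    """Degree-2 polynomial feature map phi(x, y) = (x^2, sqrt(2)*x*y, y^2)."""
    x, y = points[:, 0], points[:, 1]
    return np.column_stack([x ** 2, np.sqrt(2) * x * y, y ** 2])

rng = np.random.default_rng(2)
# Blue dots inside the unit circle, red dots on a ring of radius 2 around it
blue = rng.uniform(-0.5, 0.5, (50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
red = np.column_stack([2 * np.cos(angles), 2 * np.sin(angles)])

# After lifting, the plane z1 + z3 = c (i.e. x^2 + y^2 = c) separates the sets
blue_h = lift(blue)[:, 0] + lift(blue)[:, 2]
red_h = lift(red)[:, 0] + lift(red)[:, 2]
print(blue_h.max() < red_h.min())  # True: linearly separable in 3-D
```

Projecting the separating plane back to the two-dimensional input space yields exactly the circular boundary shown in the video earlier in this post.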


Application to change detection

As a conclusion to this post, I will give a look at the perspective from which I am using one-class SVMs in my current research for my master's thesis (which is performed at the Dutch research company Dobots). My goal is to detect change points in time series data, a task also known as novelty detection. One-class SVMs have already been applied to novelty detection for time series data. I will apply it specifically to accelerometer data, collected by smartphone sensors. My theory is that when the change points in the time series are explicitly discovered, representing changes in the activity performed by the user, the classification algorithms should perform better. In a next post I will probably take a further look at an algorithm for novelty detection using one-class Support Vector Machines.