The Kernel Trick

Kernelized SVMs

Kernel approximation - Marrying kernels and SGD

Some experiments

Other Embeddings

Afterthoughts

The Literature

Recently we added another method for kernel approximation, the Nyström method, to scikit-learn , which will be featured in the upcoming 0.13 release.Kernel-approximations were my first somewhat bigger contribution to scikit-learn and I have been thinking about them for a while.To dive into kernel approximations, first recall the kernel-trick The motivation is to obtain a non-linear decision boundary, because not all problems are linear. Consider this simple 1d datasetThere is obviously no way this can be linearly separated.There is an easy way to make it linearly separable, though. By embedding the data it in a higher-dimensional space using some non-linear mapping, for example $x \mapsto x^2$This is obviously a contrived example, but not totally detached from reality.Learning classifiers in high dimensions is expensive, though, and you have to come up with the non-linear mapping. If you start out with a given number of features $n$ and you want to compute all polynomial terms up to degree 2, you get more than $n^2$ features, and this rises exponentially in the degree $d$ of polynomials you want!There is a very simple trick to circumvent this problem, though, which is the kernel-trick.The essence of the kernel-trick is that if you can describe an algorithm in a certain way -- which is using only inner products -- then you never need to actually use the feature mapping, as long as you can compute the inner product in the feature space.For the polynomial feature map, the inner product in the feature space is given by $k(x, y) = (x^Ty + c)^d$ which is easy enough to compute for any degree $d$.What is even better is that you don't really need to start from a feature map. You can specify an inner product $k(x, y)$ directly and under mild condition (if $k$ is a Mercer-kernel), there exists a space $H$ for which $k$ is the scalar product. It is possible to construct a mapping from the original $R^n \rightarrow H$ but you never actually need to compute it.One of the most popular $k$ is the Gaussian (or RBF) kernel$k(x, y) = \text{exp}(\gamma ||x - y|| ^ 2)$,which is a scalar product in a space that is even infinite dimensional.One of the most popular applications of the kernel trick is the kernelized Support Vector Machine (SVM), which is one of the best of-the-shelf classifiers today.One of the characteristics of kernelized algorithms is that their runtime and space complexity is basically independent of the dimensionality of the input space, but rather scales with the number of data points used for training. There are lots of tricks to make SVM training fast, but in general you can assume that the run time is cubic in the number of samples.Usually good implementations avoid computing the kernel values for all pairs of training points but this comes at the cost of some runtime and algorithmic complexity. If you could afford it, you'd really like to store the whole kernel matrix, i.e. the kernel value for all pairs of training points, which is quadratic in the number of samples.If you have very complex, but small (say <10.000 samples or <100.000 depending on your patience) kernels are pretty neat. But if you deal with really A LOT of data, there is no way you can train an SVM - it will be just to slow or the memory complexity will just be tohigh. One trend in recent years has been to apply stochastic gradient descent (SGD) optimization to learn linear classifiers.These can be really fast - a single iteration is linear in the dimensionality (or rather in the number of non-zero features per sample, which is very nice for sparse data with very few non-zero features per sample) and convergence is sublinear in the number of samples (in theory)!So that is really, really fast. But classifiers learned in this way are usually linear in the data (except neural networks which I will not talk about much today).So on the one hand, we have kernelized SVMs, which work quite well on complicated that that is not linearly separable, but doesn't scale well to many samples. On the other hand, we have SGD optimization that is very efficient, but only produces linear classifiers.A somewhat recent trend is to combine the two by (sort of) moving away from the kernel trick and computing explicit feature maps. So we actually map our features to a high dimensional space an then apply a linear classifier, which yields non-linear decisions in the original space. I feel like this is a bit strange given the original motivation for studying kernels, but why not.So what we want is to have an embedding into a reasonably sized space, so that we can then learn a linear classifier using SGD on the new representation.As some people really loved the rbf-kernel, it would be great to have a way to compute the mapping there explicitly. But mapping to an infinite-dimensional space is clearly not practical. Fortunately it is known that we only need a finite subspace of that infinite space to solve the SVM problem, the one that is spanned by the images of the training data (this is called Representer Theorem).If we would use all data points, we would map to an $\mathbb{R}^N$ dimensional space and have the same scaling problems that the kernel-SVM has. Actually we would do worse, as we would really need to store all kernel values. But we can think of an easy approximation: We don't go to the full space spanned by all $N$ training points, but we just use a subset. This will only yield an approximate embedding but if we keep the number of samples we use the same, the resulting embedding is independent of dataset size and we can basically choose the complexity to suit our problem. This algorithm is called Nyström method (or Nyström embedding) and is what I just added to scikit-learn.There is also another method to compute approximatios to the rbf-kernel, which is based on some ideas of Fourier-Analysis, connecting kernels to measures. It produces a monte-carlo (i.e. randomized) approximation to the feature map. If you sample infinitely long, you will actually get "the real" feature map back. But obviously you stop at a certain point, depending on your budget. I implemented this method in scikit-learn last year.Let's do a simple experiment comparing the Nyström approximation and the Fourier approximation of an rbf kernel. I use a subset of MNIST for that (as I am impatient): 20000 taining examples and 10000 test examples. This is basically the same as this sklearn example , only with a somewhat bigger dataset. You can find the code here I forgot to set C on the approximate kernel SVMs. Here is the new plot, in which the Nystroem method is somewhat closer to the exact kernel and the gap between the two methods is bigger.The plot compares accuracy and training time of the two approximate kernel embeddings with an exact SVM on the rbf kernel.What we can see from this example immediately is that even with relatively few dimensions (about 1000) both methods produce quite decent classifiers. It is also clear that for the same number of features (dimensions in the embedding space), the Nyström method always beats the Fourier method.On the other hand, looking at the running time, Nyström scales a lot worse than the Fourier embedding. This is caused mainly by an additional normalization step (one needs to compute a singular value decomposition of the kernel on the choosen subset of data points used to construct the mapping).I might not use the fastest method for SVD here, but even with more efficient methods, this step seems quiet costly.With many more samples, computing the embedding for all points might dominate, giving a different picture. Any how, I feel it is nice to be able to compare these methods quite easily side-by-side.I plan to throw these at the cover type dataset, but didn't have the time so far.Going through the motivation above, you might wonder: why bother with kernels at all? In the end, what you care about is some form of embedding that allows you to do good classification.So why not construct one directly? There are several people working on this, but as far as I know mostly completely disconnected (and not comparing to) the above methods.The ImageNet challenge led to some interesting methods in computer vision, like Fisher Vectors, but research in this direction is still pretty early. Intuitively it seems important to find out what is important about the data to create a good embedding. This argument has been made by some people in the neural network community for some time now.There is only one embedding in scikit-learn right now that targets linear classification, which is a very unsophisticated random forest based embedding. You can find an example here . The top left shows a toy dataset that is not linearly separable. Below is the decision boundary as given by naive Bayes using a high-dimensional embedding of the data.Hopefully restricted Boltzmann machines (and maybe at some point Fisher vectors) will be added ( PR ) and we can compare all these very different approaches to see what actually works - which, in my opinion is all that counts.Though I'd find it much more satisfying if we also know why ;)Obviously the embedding $\rightarrow$ linear classifier path is not the only promising avenue for general purpose non-linear classifiers, but you can read about neural networks and random forests enough elsewhere (and also sometimes here ;).An just to mention it: there is an interesting connection between the Nystroem method and rbf-networks (remember those?).I wrote this blog-post to be somewhat concise (obviously without much success) and readable and didn't give the references in the text.If you are interested, there is a huge amount of inspiring research around what I talked above. Here are just some pointers to relevant work.The very recent paper that inspired me to revisit kernel approximations is from this year's NIPS: Nystroem Method vs Random Fourier Features: A Theoretical and Empirical Comparison The original formulation of Nyström approximations is from:Using the Nystrom method to speed up kernel machinesThe paper that (re-)ignited interested in kernel-approxiamtion is by Rahimi and Recht: Random features for large-scale kernel machines Finally, feature computation and embedding that don't relate to kernels where investiated a lot in computer vision recently.A good overview is: The devil is in the details: an evaluation of recent feature encoding methods After writing this I just read "Do you really have big data" which discusses the complexity / time trade of from a slightly different angle.