A Thousand Foot View of Machine Learning | December 31, 2009

Since I plan to spend a fair amount of time on this blog talking about machine learning, I thought it would make sense to give a basic introduction to what in the world machine learning is. From the outside, people seem to think it’s some kind of magic. In this post, I will give a brief introduction to the principles of machine learning, highlighting the way machine learning researchers frame problems and how the algorithms they develop work to solve these problems.

The field is separated into two main categories: supervised learning and unsupervised learning.

Supervised learning is when we are given input vectors (X’s) and associated results (Y’s), which could be categories (classification problems) or numbers (regression problems). Given this information, we then seek to develop a model whereby, given a new vector (X), we can determine the associated result (Y).

Sorry if that was too abstract. Maybe we’re given a vector of a user’s preferences in music and we’re trying to classify that user as a fan of “rock” or “rap.” Maybe we’re looking at some market variables and trying to tell whether the market is going to go up or down.
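To make the framing concrete, here is a minimal sketch in Python of what the X’s and Y’s might look like for the music example. The numbers, the meaning of each feature, and the `majority_baseline` helper are all made up for illustration. The “model” is the crudest one possible: it ignores X entirely and always predicts the most common label.

```python
from collections import Counter

# Hypothetical training set. Each X is a vector of made-up weekly
# listening hours: (guitar-heavy tracks, hip-hop tracks).
X = [(9.0, 1.0), (7.5, 0.5), (8.0, 2.0), (1.0, 8.0), (0.5, 6.5)]
# Each Y is the category associated with the X at the same position.
Y = ["rock", "rock", "rock", "rap", "rap"]

def majority_baseline(labels):
    """Build a 'model' that ignores its input and always predicts the most common label."""
    most_common_label = Counter(labels).most_common(1)[0][0]
    return lambda x: most_common_label

model = majority_baseline(Y)
print(model((2.0, 7.0)))  # rock (3 of the 5 training labels are "rock")
```

Any real learning algorithm has to beat this kind of baseline by actually using the X’s, which is what the rest of this post is about.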

In unsupervised problems, we are only given the input vectors (X). Our goal, then, is to “cluster” these vectors (that is, to find logical groupings of them) subject to some constraints. That’s not as important to what I do, so I’m going to leave it for another time.

Supervised learning

A vector is simply a list of numbers. For simplicity’s sake, we’ll start with a problem where the input vectors (X) are two-dimensional, since these vectors can easily be visualized on an x-y plane. However, it is important to remember that these algorithms can work on input vectors of any dimension.

Consider the following “toy problem.” We are given two-dimensional input vectors (X), and associated Y values which can be either “square” or “triangle.” Then, given a new two-dimensional point (X), depicted below as a green circle, we want to decide whether it corresponds to “square” or “triangle”:

This image demonstrates how perhaps the simplest of supervised learning algorithms, known as the k-nearest neighbor algorithm (or KNN), works. KNN simply looks at the new point, finds the k closest points in the training set (the nearest neighbors), and classifies the new point by majority vote among them. The inner circle demonstrates that, when we use the closest 3 points (k = 3), the algorithm will predict “triangle.” However, the outer circle shows that, when we select k = 5, it will predict “square.”

Naturally, since the Euclidean distance function typically used by KNN can work with vectors of any dimension, this principle can be applied to input vectors of any size (although they wouldn’t be quite as easy to visualize).
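Here is a minimal sketch of KNN in Python, assuming Euclidean distance and a simple majority vote. The function names and the training points are made up; the points are arranged so that the two closest neighbors of the query are triangles and the next three are squares, which reproduces the k = 3 versus k = 5 behavior described above.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two vectors of the same (arbitrary) dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_x, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-D training points and their labels.
train_x = [(0.5, 0.0), (0.0, 0.6), (-0.7, 0.0), (0.0, -1.0),
           (1.1, 0.0), (3.0, 3.0), (-3.0, 2.0)]
train_y = ["triangle", "triangle", "square", "square",
           "square", "triangle", "square"]
query = (0.0, 0.0)

print(knn_predict(train_x, train_y, query, k=3))  # triangle (2 triangles vs 1 square)
print(knn_predict(train_x, train_y, query, k=5))  # square (3 squares vs 2 triangles)
```

Nothing in `euclidean` depends on the vectors being two-dimensional, so the same code works unchanged on longer vectors.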

This is the fundamental problem of supervised learning: we are given examples (input vectors X and their associated outcomes Y), and we look for a way to train a model to make predictions about new points.

Although KNN does it in a fairly simple way, other algorithms tackle the problem from other directions.

Another example is an algorithm called a linear support vector machine (SVM). This algorithm relies on the intuition of maximum margin to solve supervised learning problems. Consider the following 2-dimensional example, in which we train an algorithm to separate black from white Y values:

The linear SVM draws a line between the two sets in such a way that it accomplishes two things:

The line is as far as possible from the closest training points of each class (maximum margin); and

As few points as possible are on the “wrong side” of the line.

The training points along the dotted line are called the “support vectors”—the line is drawn so as to maximize its distance from these support vectors.
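One common way to turn those two goals into something a computer can optimize is to penalize large weights (which corresponds to preferring a wide margin) and to penalize points that fall inside the margin or on the wrong side (the “hinge loss”). The sketch below minimizes that combination with plain subgradient descent. It is an illustration under assumptions, not how a production SVM library solves the problem: the function names, the data, the ±1 label encoding, and the values of `lam`, `lr` and `epochs` are all arbitrary choices of mine.

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=1000):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent.

    xs: list of input vectors; ys: list of labels, each +1 or -1.
    Returns the weight vector w and the offset b of the separating line.
    """
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the margin (regularization) term
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # inside the margin or on the wrong side of the line
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Classify x by which side of the line w.x + b = 0 it falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Made-up, linearly separable data: +1 for "black", -1 for "white".
xs = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (3.0, 2.0),
      (0.0, 0.0), (-1.0, 0.0), (0.0, -1.0), (-1.0, -1.0)]
ys = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(xs, ys)
print(w, b)
print(predict(w, b, (4.0, 4.0)), predict(w, b, (-2.0, -2.0)))
```

The parameter `lam` controls the trade-off between the two goals: a larger value favors a wider margin at the cost of tolerating more points on the wrong side.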

The linear SVM has one serious flaw: in order for it to work well, the two classes must be separable, or very nearly so, by a straight line (in higher dimensions, a flat plane). That doesn’t occur very often. In the above KNN demonstration, for example, the squares weren’t linearly separable from the triangles. In that case, the linear SVM would fail miserably.

The Kernel Trick

The way we get around this problem is by using the kernel trick, which I believe is one of the most beautiful pieces of mathematics I’ve ever seen.

The kernel trick works as follows. We have our original points (X’s), which are 2-dimensional and sit in the x-y plane. As it turns out, there are functions (called kernel functions) that correspond to taking those original 2-dimensional points and projecting them (according to some rules) into a much higher-dimensional space. In fact, the most popular kernel (called the Gaussian radial basis function kernel) corresponds to taking points of any dimension and projecting them into an infinite-dimensional space (believe it or not, this is possible).

Of course, computers can’t store infinite-dimensional points. But these kernel functions have a special property: they let us compute the operation we care about (mainly the “dot” product) between two of these projected vectors directly from the original points, without ever actually calculating the infinite-dimensional vectors themselves. Since SVMs use dot products for doing their analysis, they can essentially use a kernel function to project the 2-dimensional points into an infinite-dimensional space in which the two classes are linearly separable, and train the algorithm based on where these points lie in that infinite-dimensional space. When an SVM is used this way, it is called a Kernel Support Vector Machine.
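A small worked example may help. For the degree-2 polynomial kernel (a simpler kernel than the Gaussian one, chosen here because its projection is finite and can be written out), the sketch below checks numerically that the kernel value computed from the original 2-dimensional points equals the dot product of the explicitly projected 3-dimensional points. The function names and the `gamma` parameter are my own choices; `gamma` is a common way to parameterize the width of the Gaussian kernel.

```python
import math

def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit projection of a 2-D point into 3-D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel: equals dot(phi(x), phi(z)) without ever building phi."""
    return dot(x, z) ** 2

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel. Its projected space is infinite-dimensional,
    so there is no finite phi to write down, only this formula."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (1.0, 2.0), (3.0, 4.0)
print(poly2_kernel(x, z))   # 121.0
print(dot(phi(x), phi(z)))  # the same value, up to floating-point rounding
print(rbf_kernel(x, z))     # exp(-8), since the squared distance is 8
```

The first two lines agree because (x·z)² expands to exactly the sum of products that the projected vectors produce. The Gaussian kernel plays the same role, except that the projection it stands in for can never be written out in full.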

This concept can be confusing at first. To get an idea of how this works in practice, you can play around with this SVM applet, which will allow you to use different data sets and different kernels and see how the SVM reacts.
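The same effect can also be seen in a few lines of code. The sketch below does not implement an SVM. It uses a kernel perceptron, a much simpler algorithm that, like the SVM, only ever touches the data through dot products and can therefore be “kernelized” in the same way. The data is the classic XOR arrangement, which no straight line can separate; the function names and the `gamma` and `max_epochs` values are assumptions for illustration.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel between two points."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_score(xs, ys, alphas, kernel, x):
    """Weighted vote of the training points, using only kernel evaluations."""
    return sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alphas, xs, ys))

def train_kernel_perceptron(xs, ys, kernel, max_epochs=100):
    """Return one mistake count per training point (the 'alphas')."""
    alphas = [0] * len(xs)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * decision_score(xs, ys, alphas, kernel, x) <= 0:
                alphas[i] += 1  # this point was misclassified: give it more weight
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: done
            break
    return alphas

def predict(xs, ys, alphas, kernel, x):
    return 1 if decision_score(xs, ys, alphas, kernel, x) > 0 else -1

# XOR layout: opposite corners share a label, so no line separates the classes.
xs = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
ys = [-1, -1, 1, 1]

alphas = train_kernel_perceptron(xs, ys, rbf_kernel)
print([predict(xs, ys, alphas, rbf_kernel, x) for x in xs])  # [-1, -1, 1, 1]
```

Swapping `rbf_kernel` for an ordinary dot product turns this back into a linear classifier, which cannot get all four points right.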

KSVMs are truly revolutionary. The same basic algorithm has been applied, with little task-specific modification, to a variety of tasks, including handwriting, speech and image recognition. I have also applied KSVMs to some problems in the financial sector, although their main selling point is that they can generalize well from a small number of training examples. Since we have plenty of training examples, KSVMs may not be the right choice for us.

Conclusions and further reading

For further reading, I suggest taking a look at Andrew Moore’s tutorials, which I have found to be very helpful. Andrew Moore is a well-known AI researcher from CMU. Mainly, I suggest taking a look at his tutorials on Decision Trees, Gaussian Mixture Models, K-Means Clustering and Support Vector Machines. For a broad look at the field, his Intro to AI tutorial might be helpful.

Hopefully I’ve introduced the basic concept of machine learning. We come up with training examples, each of which is just a list of numbers tagged with a category/outcome. Then, given a new list of numbers, we try to classify it or predict the outcome associated with it. Thus, the data we feed these algorithms is of paramount importance. The real key is to find a set of data that makes our outcomes separable in the high-dimensional space in which these training examples live; then, we simply have to write an algorithm that can learn how to separate them.

Naturally, this is a bit of a simplification. These algorithms rarely work “out of the box,” and it takes a good understanding of their internals to figure out which algorithm to use and how to implement it so as to maximize the chances of success. The programmer has to determine which kernel to use (if any), how to prepare the data to make it as useful as possible to the algorithm, and how to fine-tune the parameters that each of these algorithms takes.
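As one small illustration of parameter tuning, the sketch below picks k for KNN by leave-one-out validation: each training point is predicted from all the others, and we keep the k that gets the most points right. This is only one of several reasonable ways to tune a parameter, and the data, the candidate values of k, and the function names are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query` (Euclidean distance)."""
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(xs, ys, k):
    """Predict each point from all the other points; return the fraction predicted correctly."""
    correct = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if knn_predict(rest_x, rest_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)

# Two made-up, well-separated clusters of four points each.
xs = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
ys = ["square"] * 4 + ["triangle"] * 4

candidates = [1, 3, 7]
scores = {k: leave_one_out_accuracy(xs, ys, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])  # first candidate with the top score

print(scores)  # {1: 1.0, 3: 1.0, 7: 0.0}
print(best_k)  # 1
```

With k = 7, every held-out point is outvoted by the four points of the other cluster, so a k that is too large for the data set gets every prediction wrong.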