Dimensionality Reduction¶

Dimensionality reduction is the other main approach to unsupervised learning. We haven't looked closely at any dimensionality reduction algorithms, but the idea behind them is to summarize your data set with a reduced number of 'latent' variables, like a codebook. Dimensionality reduction algorithms were first studied in the social sciences, in particular in psychology. The IQ measurement is an attempt to reduce all facets of your intelligence down to one single value. Talking about how left wing or right wing someone is is an attempt to reduce all their political opinions down to one value. In practice, to make good models, we often need more than one value to summarize the complexity of a data set. Common approaches for dimensionality reduction include principal components analysis and factor analysis.
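The idea can be sketched in a few lines. Below is a minimal illustration of principal components analysis via the singular value decomposition; the data set and the choice of two latent variables are made up purely for illustration:

```python
import numpy as np

# Toy data: 100 observations of 5 variables (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

Xc = X - X.mean(axis=0)                      # centre each variable
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                            # two 'latent' variables per observation
print(Z.shape)                               # (100, 2)
```

Each row of `Z` summarizes one observation with just two numbers, chosen to retain as much of the data's variance as possible.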

Probability: The Calculus of Uncertainty¶

We have spent most of our introduction to artificial intelligence and machine learning focussing on particular problem types which we have categorized into supervised and unsupervised learning. The wider field of artificial intelligence contains many challenges that aren't so easy to categorize. In fact there are times when such categorizations can become a hindrance to progress. For example, is robot navigation supervised or unsupervised learning? I'm not sure, but I think it depends on whether the robot has a map or not. But how does the map appear as labels?

In general modern artificial intelligence is about modelling your data and propagating the uncertainty around your system. The main way we do this is through probability and in particular Bayes' rule. All the algorithms we have introduced have a probabilistic interpretation. The common way of finding it is to look at the error function and see what the equivalent maximum likelihood model is. Remember the error function can often be derived as the negative log of the probability distribution we use for defining the model.
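As a concrete sketch of that equivalence: with a Gaussian noise model of fixed variance, the negative log likelihood is the sum-of-squared-errors function up to a scale and an additive constant. The data and predictions below are made up for illustration:

```python
import numpy as np

# Made-up observations and model predictions.
rng = np.random.default_rng(1)
y = rng.normal(size=10)      # observations
f = np.zeros(10)             # model predictions
sigma2 = 1.0                 # assumed Gaussian noise variance

# Negative log of the Gaussian likelihood of y given f.
neg_log_lik = (0.5 * np.sum((y - f)**2) / sigma2
               + 0.5 * len(y) * np.log(2 * np.pi * sigma2))
sse = np.sum((y - f)**2)

# The two differ only by a constant scale and offset.
print(np.isclose(neg_log_lik, 0.5 * sse + 0.5 * len(y) * np.log(2 * np.pi)))  # prints True
```

Because the extra terms don't depend on the model predictions, minimizing the squared error is the same as maximizing the Gaussian likelihood.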

When using probabilities for artificial intelligence, we think about two things: the state of the world (often represented by $\dataMatrix$) and our belief about the state of the world, which is represented by a probability distribution. The model involves specifying a probability distribution over what we think the most likely state of the world is, making observations, and updating that probability distribution. A nice example is robot navigation.

A robot's state includes its $x$, $y$ position, the direction it's facing, and perhaps a number of other variables, such as where its arms and legs are. You can see quite quickly how the state space of the robot grows. Maybe the robot doesn't know exactly where it is: it is uncertain. The robot expresses this uncertainty by placing a probability distribution over its location: $p(\latentMatrix)$. This is known as the prior distribution. When the robot makes an observation, for example of a landmark, like a wall, or perhaps, if the robot is in the Peak District, a particular hill, the robot can absorb that information through probability. It does this by modelling the observations, $\dataMatrix$, with a probability distribution that says what the observations should be given the state of the robot, $p(\dataMatrix|\latentMatrix)$. This distribution is known as the likelihood. What we want to know is what effect the observation has had on the robot's updated belief about its location. This distribution is known as the posterior distribution.

These three distributions are related by Bayes' rule. Bayes' rule of probability comes about from combining the two main rules of probability.

Product Rule of Probability¶

The product rule of probability says that the joint distribution, $p(\latentMatrix, \dataMatrix)$ can be computed from the conditional and marginal distributions as follows

$$p(\latentMatrix, \dataMatrix) = p(\dataMatrix | \latentMatrix) p(\latentMatrix)$$

or conversely

$$p(\latentMatrix, \dataMatrix) = p(\latentMatrix |\dataMatrix) p(\dataMatrix).$$

The product rule gives the relationship between the marginal probability, the conditional probability and the joint probability.
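A quick numerical check of the product rule, on a made-up joint distribution over two binary variables:

```python
import numpy as np

# Illustrative joint distribution p(z, y): rows index z, columns index y.
p_joint = np.array([[0.3, 0.1],
                    [0.2, 0.4]])

p_z = p_joint.sum(axis=1)                 # marginal p(z)
p_y_given_z = p_joint / p_z[:, None]      # conditional p(y|z)

# Product rule: p(z, y) = p(y|z) p(z).
print(np.allclose(p_y_given_z * p_z[:, None], p_joint))  # prints True
```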

Sum Rule of Probability¶

The sum rule of probability gives the relationship between the marginal probability and the joint probability.

$$p(\dataMatrix) = \sum_{\latentMatrix} p(\dataMatrix, \latentMatrix)$$

where here the sum is over all possible values that $\latentMatrix$ can take (in the robot example, over all possible states of the robot!). For continuous density functions, the sum has the form of an integral.
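For a discrete distribution the sum rule is just a sum over one axis of the joint table. A small self-contained check on a made-up joint distribution over two binary variables:

```python
import numpy as np

# Illustrative joint distribution p(y, z): rows index z, columns index y.
p_joint = np.array([[0.3, 0.1],
                    [0.2, 0.4]])

p_y = p_joint.sum(axis=0)     # sum rule: marginalize z out of the joint
print(p_y)                    # [0.5 0.5]
```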

Bayes' rule is so simple to derive from these two rules that it doesn't really deserve a name (in fact I don't like the name Bayes' rule, because it was first used by other people, including Laplace). However, that's the common name for it. Bayes' rule gives us the relationship between what we knew before we observed the location of the hill ($p(\latentMatrix)$) and what we know after ($p(\latentMatrix|\dataMatrix)$). The relationship is given by equating the two possible forms of the joint distribution as given by the product rule, so we have

$$p(\latentMatrix|\dataMatrix)p(\dataMatrix) = p(\dataMatrix|\latentMatrix) p(\latentMatrix)$$

With a little rearrangement we obtain

$$p(\latentMatrix|\dataMatrix) = \frac{p(\dataMatrix|\latentMatrix) p(\latentMatrix)}{p(\dataMatrix)}.$$

Everything in this rule is specified by our model of the world: our prior belief, $p(\latentMatrix)$, and the relationship between our measurements and the state of the world, known as the likelihood, $p(\dataMatrix|\latentMatrix)$. The only thing we're missing is the marginal likelihood, $p(\dataMatrix)$, but this can be derived through the sum and product rules,

$$p(\dataMatrix) = \sum_{\latentMatrix} p(\dataMatrix, \latentMatrix) = \sum_{\latentMatrix} p(\dataMatrix|\latentMatrix) p(\latentMatrix).$$

Now we can express everything we need to get our updated belief about the state of the world through knowing the likelihood and the prior, i.e. our initial belief about the state of the world, and the relationship between the state of the world and the types of measurements we can make. The major difficulty is in computing the sum: it can involve many terms and therefore be intractable. In continuous systems it becomes a high-dimensional integral, which is also difficult to compute.
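The whole update can be carried out in a few lines for a discrete toy version of the robot example. The map, sensor probabilities and prior below are all illustrative assumptions:

```python
import numpy as np

# The robot is in one of 5 cells; cells 1 and 3 contain a landmark (made up).
landmark = np.array([0, 1, 0, 1, 0])

prior = np.full(5, 0.2)                        # p(z): uniform prior belief
# p(y = 'landmark seen' | z): 0.8 where there is a landmark, 0.1 otherwise.
likelihood = np.where(landmark == 1, 0.8, 0.1)

marginal = (likelihood * prior).sum()          # p(y), via the sum rule
posterior = likelihood * prior / marginal      # Bayes' rule
print(posterior.round(3))
```

After observing a landmark, belief concentrates on the two cells that contain one; the other cells aren't ruled out, because the sensor model allows a false detection.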

Despite the challenges associated with computing the marginal likelihood, Bayesian inference (as it's widely known, although I prefer the simpler term 'the calculus of uncertainty') is the most promising approach we have for practical artificial intelligence. Many of the advances you may have heard about in the media (such as Sebastian Thrun and the Google driverless car) rely on application of this formula. Amazing what can emerge from a fairly simple set of rules. Although, of course, practical application requires a great deal of engineering knowledge and approximations. In robot navigation the formula is applied repeatedly to absorb information as it arrives. When this is done the algorithm is known as a Bayes filter.
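That repeated absorption can be sketched as a measurement-only update loop, where each posterior becomes the prior for the next observation. A full Bayes filter would also include a motion model between updates; the map, sensor probabilities and observation sequence here are illustrative assumptions:

```python
import numpy as np

# Made-up map: the robot is in one of 5 cells; cells 1 and 3 hold a landmark.
landmark = np.array([0, 1, 0, 1, 0])

def update(belief, saw_landmark):
    """Absorb one observation: the posterior becomes the next prior."""
    lik = np.where(landmark == 1, 0.8, 0.1)   # p(landmark seen | state)
    if not saw_landmark:
        lik = 1.0 - lik                       # p(no landmark seen | state)
    post = lik * belief
    return post / post.sum()                  # normalize by the marginal p(y)

belief = np.full(5, 0.2)                      # start from a uniform prior
for obs in [True, True, False]:               # a made-up observation sequence
    belief = update(belief, obs)
print(belief.round(3))
```

Each pass through the loop is one application of Bayes' rule; the belief sharpens (or spreads) as the observations accumulate.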