Inductive Inference and Machine Learning by Minimum Message Length (MML) encoding.

Also see models & parameters, William of Ockham, Thomas Bayes, R. A. Fisher, and C. S. Wallace.

There is a non-Technical Introduction [here].

For a hypothesis H and data D we have from Bayes:

P(H&D) = P(H).P(D|H) = P(D).P(H|D)

P(H), prior probability of hypothesis H

P(H|D), posterior probability of hypothesis H

P(D|H), likelihood of the hypothesis H; it is actually a function of the data D given H.

From Shannon's Mathematical Theory of Communication (1949) we know that in an optimal code, the message length of an event E, MsgLen(E), where E has probability P(E), is given by MsgLen(E) = -log2(P(E)):

MsgLen(H&D) = MsgLen(H)+MsgLen(D|H) = MsgLen(D)+MsgLen(H|D)
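
To make these identities concrete, here is a minimal Python sketch; the probabilities in it are invented purely for illustration and do not come from any particular model.

    from math import log2

    def msg_len(p):
        # Optimal code length, in bits, of an event with probability p.
        return -log2(p)

    # Invented prior, likelihood and marginal for one hypothesis H and one data set D.
    p_H, p_D_given_H = 0.25, 0.125
    p_HandD = p_H * p_D_given_H            # P(H&D) = P(H).P(D|H)
    p_D = 0.05                             # assumed marginal probability of the data
    p_H_given_D = p_HandD / p_D            # Bayes: P(H|D) = P(H&D)/P(D)

    print(msg_len(p_HandD))                        # MsgLen(H&D)
    print(msg_len(p_H) + msg_len(p_D_given_H))     # MsgLen(H)+MsgLen(D|H)
    print(msg_len(p_D) + msg_len(p_H_given_D))     # MsgLen(D)+MsgLen(H|D)

All three lines print the same length (5 bits here, up to floating-point rounding), as the identity requires.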

Now in inductive inference one often wants the hypothesis H with the largest posterior probability. MsgLen(H) can usually be estimated well, for some reasonable prior on hypotheses. MsgLen(D|H) can also usually be calculated. Unfortunately it is often impractical to estimate P(D), which is a pity because it would yield P(H|D). However, for two rival hypotheses, H and H':

MsgLen(H|D) - MsgLen(H'|D)
= MsgLen(H)+MsgLen(D|H) - MsgLen(H')-MsgLen(D|H')
= -log2(posterior odds ratio)

because MsgLen(D) is common to both hypotheses and so cancels out of the comparison.
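
As a hedged numerical illustration (the coin-toss data, the two candidate hypotheses and their equal priors are all invented), the Python sketch below scores two rival explanations of a short sequence by MsgLen(H)+MsgLen(D|H); the difference of the two totals is the posterior -log odds ratio, and MsgLen(D) is never needed.

    from math import log2

    data = "HHTHHHTHHH"                      # invented sequence: 8 heads, 2 tails

    def msg_len_data(p_head, seq):
        # MsgLen(D|H) in bits when H asserts that P(head) = p_head.
        return sum(-log2(p_head if c == "H" else 1.0 - p_head) for c in seq)

    # Assumed code book agreed in advance: two hypotheses with equal prior 1/2,
    # so MsgLen(H) = MsgLen(H') = 1 bit each.
    rivals = {"H : fair coin, P(head)=0.5": 0.5,
              "H': biased coin, P(head)=0.8": 0.8}
    for name, p_head in rivals.items():
        total = 1.0 + msg_len_data(p_head, data)   # MsgLen(H) + MsgLen(D|H)
        print(f"{name} -> {total:.2f} bits")

The biased-coin hypothesis gives the shorter total here (about 8.2 bits against 11 bits), so it has the greater posterior probability.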

Consider a transmitter T and a receiver R connected by one of Shannon's communication channels. T must transmit some data D to R. T and R may have previously agreed on a code book for hypotheses, using common knowledge and prior expectations. If T can find a good hypothesis, H (a theory, structure, pattern, ...), that fits the data, then she may be able to transmit the data economically. An explanation is a two-part message:

(i) transmit H taking MsgLen(H) bits, and (ii) transmit D given H taking MsgLen(D|H) bits.

The message paradigm keeps us "honest": Any information that is not common knowledge must be included in the message for it to be decipherable by the receiver; there can be no hidden parameters.

This issue extends to inferring (and stating) real-valued parameters to the "appropriate" level of precision.

The method is "safe": If we use an inefficient code it can only make the hypothesis look less attractive than otherwise.

There is a natural hypothesis test: The null-theory corresponds to transmitting the data "as is". (That does not necessarily mean in 8-bit ASCII; the language must be efficient.) If a hypothesis cannot better the null-theory then it is not acceptable.
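
The sketch below illustrates the null-theory test under some loudly-flagged simplifications: the symbol sequence is invented, and MsgLen(H) is charged at a flat, assumed 6 bits per free parameter rather than at the precision a full MML analysis would derive.

    from math import log2
    from collections import Counter

    data = "aaaacaaagaaaataaacaaaa" * 2      # invented sequence over {a,c,g,t}
    n = len(data)

    # Null theory: transmit the data "as is" under a uniform code, 2 bits per symbol.
    null_len = n * log2(4)

    # Hypothesis H: the symbols are independent with the fitted frequencies below.
    freqs = {s: c / n for s, c in Counter(data).items()}
    msg_len_H = 6 * (len(freqs) - 1)          # assumed cost: 6 bits per free parameter
    msg_len_D_given_H = sum(-log2(freqs[s]) for s in data)
    explanation_len = msg_len_H + msg_len_D_given_H

    print(f"null theory          : {null_len:.1f} bits")
    print(f"two-part explanation : {explanation_len:.1f} bits")
    print("H is acceptable" if explanation_len < null_len else "H is not acceptable")

Here the explanation (about 60 bits) beats the null theory (88 bits), so the skewed-frequency hypothesis is acceptable; for data with near-uniform frequencies the parameter cost would not pay for itself and H would be rejected.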

In general, a more complex hypothesis can fit the data better than a simpler one. We see that MML encoding gives a trade-off between hypothesis complexity, MsgLen(H), and the goodness of fit to the data, MsgLen(D|H). The MML principle is one way to justify and realise Occam's razor.

Continuous Real-Valued Parameters

When a model has one or more continuous, real-valued parameters they must be stated to an "appropriate" level of precision. The parameter must be stated in the explanation, and only a finite number of bits can be used for the purpose, as part of MsgLen(H). The stated value will often be close to the maximum-likelihood value which minimises MsgLen(D|H). If the -log likelihood, MsgLen(D|H), varies rapidly for small changes in the parameter, the parameter should be stated to high precision. If the -log likelihood varies only slowly with changes in the parameter, the parameter should be stated to low precision.
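
The following Python sketch illustrates this trade-off by brute force rather than by the analytic MML precision formula; the data, the known standard deviation, the agreed prior range for the mean, and the simple uniform quantisation of the stated value are all assumptions made just for the example.

    from math import log, log2, pi
    import random

    random.seed(1)
    data = [random.gauss(7.3, 1.0) for _ in range(30)]   # invented data, sigma known = 1
    LO, HI = 0.0, 16.0                                    # agreed prior range for the mean
    LN2 = log(2)

    def msg_len_data(mu):
        # -log2 likelihood of the data under a Normal(mu, 1) model, in bits.
        # (The constant cost of stating the data to its measurement precision is
        # omitted, since it is the same for every hypothesis.)
        return sum(0.5 * log2(2 * pi) + (x - mu) ** 2 / (2 * LN2) for x in data)

    ml_mean = sum(data) / len(data)           # maximum-likelihood estimate of the mean
    results = []
    for bits in range(1, 13):                 # precision used to state the mean
        cell = (HI - LO) / 2 ** bits
        k = min(int((ml_mean - LO) / cell), 2 ** bits - 1)
        stated_mu = LO + (k + 0.5) * cell     # centre of the cell containing the ML estimate
        total = bits + msg_len_data(stated_mu)    # MsgLen(H) + MsgLen(D|H)
        results.append((total, bits))
        print(f"{bits:2d} bits for the mean -> total {total:7.2f} bits")

    print("shortest explanation states the mean to", min(results)[1], "bits")

Too few bits force the stated mean far from its best value and MsgLen(D|H) blows up; too many bits are wasted on precision that the data cannot support; the total message length is minimised somewhere in between.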

The simplest case is the multi-state or multinomial distribution, where the data is a sequence of independent values from such a distribution. The hypothesis, H, is an estimate of the probabilities of the various states (e.g. the bias of a coin or a die). The estimate must be stated to an "appropriate" precision, i.e. in an appropriate number of bits.
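
As a small, hedged sketch (the die-roll counts are invented, and the formula below is the commonly quoted MML multi-state estimate under a uniform prior, stated here without its derivation), the probability of state i with count n_i out of N observations over k states can be estimated as (n_i + 1/2)/(N + k/2):

    from collections import Counter

    rolls = [1, 3, 3, 6, 3, 2, 3, 4, 3, 6, 3, 5, 3, 3, 1, 3]   # invented die rolls
    k = 6                                                       # number of states
    N = len(rolls)
    counts = Counter(rolls)

    # MML estimate for a multi-state distribution with a uniform prior:
    # p_i = (n_i + 1/2) / (N + k/2)
    for face in range(1, k + 1):
        p = (counts.get(face, 0) + 0.5) / (N + k / 2)
        print(f"P(face {face}) = {p:.3f}")

Unlike the maximum-likelihood estimate n_i/N, this estimate never assigns zero probability to an unseen state, so MsgLen(D|H) remains finite if that state turns up in later data.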

Applications and Related Areas