New Year celebrations are just behind us, but things are already happening in 2019. One very exciting development for machine learning researchers all around the world is the new journal from the venerable Nature family: Nature Machine Intelligence. Its volume 1 is dated January 2019, and it’s already out (all papers are open access, so you can read them right there). Machine learning results have made it into Nature before; for example, I’ve always wondered how a paper about a piece of software playing a board game is about Nature. But now we have a new top venue specifically devoted to machine learning.

Nature Machine Intelligence begins on a high note. We might go back to its first volume in later installments, but today I want to discuss one especially unexpected and exciting result: a paper by Ben-David et al. called Learnability Can Be Undecidable. It brings Gödel’s incompleteness theorem into our humble and so far very practical field of science.

Independence Results: Things We Cannot Know

They might or they might not. You never can tell with bees. A.A. Milne

Mathematical logic is a new subject for our series, and indeed it doesn’t appear too often in the context of machine learning. So let’s start with a brief introduction.

Gödel’s incompleteness theorem establishes that in any (sufficiently complex) formal system, there are things we can neither prove nor disprove. A formal system is basically a set of symbols and axioms that define relations between these symbols. For example, you can have two functions, + and *, and constants 0 and 1, with the usual axioms for addition and multiplication that define a field. Then you can have models of this formal system, i.e., interpretations of the symbols such that all axioms hold. As an example, the set of real numbers with standard interpretations of 0,1,+,* is one model of the theory of fields, and the set of rational numbers is another.

The original constructions given by Gödel are relatively involved and not easy to grasp without a logical background. They have been quite beautifully explained for the layperson in Douglas Hofstadter’s famous book Gödel, Escher, Bach, but it does take a few dozen pages, so we won’t go into that here.

How can you prove that a certain statement is unprovable? Sounds like an oxymoron, but the basic idea of many such proofs is straightforward: you construct two models of the formal system such that in one of them the statement is true and in the other it’s not.

For example, consider a very simple formal system with only one function s(x), which we interpret as “taking the next element”, and one constant 0. We can construct formulas (terms, to be precise) like s(0), s(s(0)), s(s(s(0))), etc. We can think of them as natural numbers: 1:=s(0), 2:=s(1)=s(s(0)), and so on. But do negative numbers also exist? Formally, is there an x such that s(x)=0?

The question makes sense (it’s easy to write as a logical formula: ∃x s(x)=0) but has no answer. First, the set of natural numbers 0,1,2,… is a valid model for this formal system, with the function s defined as s(x)=x+1. And in this model, the answer is no: there is no number preceding zero. But the set of integers …,-2,-1,0,1,2,… is also a valid model, with the same interpretation s(x)=x+1! And now, we clearly have s(-1)=0. This means that the original formal system does not know whether negative numbers exist.
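The two models above can be poked at with a few lines of Python. This is only a minimal sketch: a program can scan a finite window of each model, not the whole infinite set, so the "no" answer for the natural numbers is a spot check rather than a proof (the actual argument is that x+1 is at least 1 for every natural x). The helper name has_predecessor_of_zero and the window sizes are arbitrary choices made for illustration.

```python
def s(x: int) -> int:
    """Successor function, interpreted as x + 1 in both models."""
    return x + 1


def has_predecessor_of_zero(candidates) -> bool:
    """Check whether some x among the candidates satisfies s(x) == 0."""
    return any(s(x) == 0 for x in candidates)


# Model 1: natural numbers. Only a finite prefix can be scanned here;
# the real argument is that s(x) = x + 1 >= 1 for every natural x.
naturals_prefix = range(0, 10_000)
found_in_naturals = has_predecessor_of_zero(naturals_prefix)

# Model 2: integers. One witness (x = -1) settles an existential statement.
integers_window = range(-10_000, 10_000)
found_in_integers = has_predecessor_of_zero(integers_window)

print(found_in_naturals)  # False
print(found_in_integers)  # True
```

The same formula gets different truth values depending on which model it is evaluated in, which is all that independence means here.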

Of course, this was a very, very simple formal system and nobody really expected it to have answers to complicated questions. But the same kind of reasoning can be applied to much more complex systems. For example, the axioms of a field in mathematics do not have an answer to whether a number like the square root of 2 exists; e.g., ∃x(x*x=2) is true in the real numbers but false in the rational numbers, and both are fields. Gödel’s incompleteness theorem says that we can find such statements for any reasonably powerful formal system, including, for example, Zermelo-Fraenkel set theory with the axiom of choice (ZFC), which is basically what we usually mean by mathematics. Logicians have constructed statements that are independent of ZFC axioms.
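The field example can be checked exactly if we swap the real and rational numbers for finite fields, where brute force over all elements is a complete check. The sketch below is an illustration added here, not something from the paper: it uses the integers modulo a prime p (a standard example of a field) and tests the same sentence ∃x(x*x=2) in two of them. The function name is hypothetical.

```python
def has_square_root_of_two(p: int) -> bool:
    """Brute-force check of the sentence "exists x with x*x = 2"
    in the field of integers modulo p (p is assumed to be prime)."""
    return any((x * x) % p == 2 % p for x in range(p))


for p in (5, 7):
    print(p, has_square_root_of_two(p))
# 5 False
# 7 True
```

Modulo 7 we have 3*3=9=2, while modulo 5 the only squares are 0, 1, and 4. Both structures satisfy all field axioms, so the axioms alone cannot decide the sentence.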

One such statement is the famous continuum hypothesis. Modern mathematical logic was in many ways initiated by Georg Cantor, who was the first to try to systematically develop the foundations of mathematics, specifically a formal and consistent set theory. Cantor was the first to understand that there are different kinds of infinities: the set of natural numbers is smaller than the set of reals because you cannot enumerate all real numbers. The cardinality (size) of the set of natural numbers, denoted ℵ₀ (“aleph-null”), is the smallest infinite number (the smallest infinite cardinal, as they are called in mathematical logic), and the next infinite cardinal after it is denoted ℵ₁ (“aleph-one”). The set of reals is said to have the cardinality of the continuum, which equals 2^ℵ₀ and is often denoted 𝔠.

There is no doubt that 𝔠 > ℵ₀, but is there anything in between the natural numbers and the reals? The continuum hypothesis says that there is not: 𝔠 is the smallest infinite cardinal larger than ℵ₀, that is, 𝔠 = ℵ₁. And it turns out to be independent of ZFC: you can construct a model of mathematics where there is an intermediate cardinality, and you can construct a model where there isn’t. There is really no point in asking which model we live in: it’s unclear if there is anything truly infinite in our world at all.
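For reference, the hypothesis can be written in symbols without naming any particular cardinal. In the formulas below, S is just a bound variable ranging over sets, and the first line is Cantor’s theorem that the reals cannot be enumerated:

```latex
\begin{align*}
  &|\mathbb{N}| < |\mathbb{R}|
    && \text{(Cantor)} \\
  &\neg\,\exists S \;\bigl(\, |\mathbb{N}| < |S| < |\mathbb{R}| \,\bigr)
    && \text{(continuum hypothesis)}
\end{align*}
```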

Undecidability in Machine Learning

Some problems are so complex that you have to be highly intelligent and well informed just to be undecided about them. Laurence J. Peter

Okay, so what does all of this have to do with machine learning? In our field, we usually talk about finite datasets that define optimization problems for the weights. How can we find obscure statements about the existence of various infinities within our practical and usually well-defined field?

Ben-David et al. speak about the “estimating the maximum” problem (EMX):

Given a family F of subsets of some domain X, find a set in F whose measure with respect to an unknown probability distribution P is close to maximal, based on a finite sample generated independently from P.

Sounds complicated, but it’s really just a general formulation of many machine learning problems. Ben-David et al. give the following example: suppose you are placing ads on a website. The domain X is the set of visitors for the website, every ad A has its target audience Fᴬ, and P is the distribution of visitors for the site. Then the problem of finding the best ad to show is exactly the problem of finding a set Fᴬ that has the largest measure with respect to P, i.e., it will most probably resonate with a random visitor.

In fact, EMX is a very general problem, and its relation to machine learning is much deeper than this example shows. You can think of a set F as a function from the domain X to {0,1}: F(x)=1 if x belongs to F and F(x)=0 if it doesn’t. The EMX problem then asks to find a function F from a given family that approximately maximizes the expectation Eᴾ(F) with respect to the distribution P.
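To make the setting concrete, here is a minimal sketch of the most naive approach to EMX on the ad example: estimate the measure of every set by its frequency in the sample and pick the largest. This is not the learner analyzed by Ben-David et al., just the obvious empirical estimate; the function empirical_emx, the audience names, and the visitor ages are all made up for illustration.

```python
from typing import Callable, Dict, Sequence


def empirical_emx(family: Dict[str, Callable[[int], bool]],
                  sample: Sequence[int]) -> str:
    """Return the name of the set with the largest empirical measure.

    Each set is given as an indicator function F(x) in {False, True};
    its empirical measure is the fraction of sample points it contains.
    """
    def empirical_measure(indicator: Callable[[int], bool]) -> float:
        return sum(1 for x in sample if indicator(x)) / len(sample)

    return max(family, key=lambda name: empirical_measure(family[name]))


# Hypothetical target audiences, described by visitor age.
audiences = {
    "students": lambda age: 18 <= age <= 25,
    "parents": lambda age: 30 <= age <= 45,
    "retirees": lambda age: age >= 65,
}

# A hypothetical i.i.d. sample of visitor ages drawn from the unknown P.
visitor_ages = [19, 22, 34, 40, 41, 38, 70, 24, 36, 33]

best_ad = empirical_emx(audiences, visitor_ages)
print(best_ad)  # parents
```

For a finite family like this one, the empirical estimate gets better as the sample grows. The interesting questions start when the family is infinite, as in the special case discussed next.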

Let us now think of samples from the distribution P as data samples, and treat the functions as classifiers. Now the setting begins to make a lot of sense for machine learning: given a finite sample of data with known labels, you need to find a classifier from a given family that will have low error with respect to the data distribution. Sounds very much like a standard machine learning problem, right? For more details on this setting, check out an earlier paper by Ben-David (two Ben-Davids, actually).

Ben-David et al. consider a rather simple special case of the EMX problem, where X is the interval [0,1] and the family consists of all finite subsets of X, that is, finite collections of real numbers from [0,1]. They prove that EMX learnability with probability 2/3 (given some i.i.d. samples from a distribution P, find a finite subset of [0,1] that has probability at least 2/3) is independent of ZFC! That is, our regular mathematics cannot say whether you can find a good classifier in this setting. They do it by constructing a (rather intricate) reduction that ties this case of EMX learnability to the continuum hypothesis.
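To get a feeling for why this special case is not trivial, consider the most obvious learner: output exactly the points you have seen. The sketch below is not the paper’s argument, only an illustration under assumptions chosen here: P is uniform on n points of [0,1], the sample size m is fixed at 10, and naive_learner is a hypothetical name. A set of at most m points can carry at most m/n of the mass, so this learner cannot reach 2/3 once n is much larger than m.

```python
import random
from fractions import Fraction


def uniform_on_n_points(n: int) -> dict:
    """Uniform distribution on the n points 0, 1/n, ..., (n-1)/n of [0, 1]."""
    return {Fraction(i, n): Fraction(1, n) for i in range(n)}


def mass(finite_set, pmf) -> Fraction:
    """Probability that a fresh draw from pmf lands in finite_set."""
    return sum((pmf.get(x, Fraction(0)) for x in set(finite_set)), Fraction(0))


def naive_learner(sample) -> set:
    """Hypothetical learner: output exactly the points that were observed."""
    return set(sample)


rng = random.Random(0)  # fixed seed so that reruns give the same draws
m = 10                  # sample size available to the learner

for n in (5, 15, 150):
    pmf = uniform_on_n_points(n)
    sample = rng.choices(list(pmf), k=m)  # m i.i.d. draws from P
    achieved = mass(naive_learner(sample), pmf)
    # The achieved mass never exceeds m/n, whatever the sample looks like.
    print(n, achieved, achieved >= Fraction(2, 3))
```

A smarter learner has to output points it has not seen, and the paper shows that whether this can be done well enough depends on set-theoretic assumptions.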

So What’s the Takeaway?

A conclusion is the place where you got tired thinking. Martin H. Fischer

The results of Ben-David et al. are really beautiful. They connect a lot of dots: unprovability and independence, machine learning, compression schemes (used in the proof), and computational learning theory. One important corollary of the paper’s main result is that there can be no general notion of dimension for EMX learnability, like the VC (Vapnik-Chervonenkis) dimension is for PAC learnability. I have no doubt these ideas will blossom into a whole new direction of research.

Still, as sadly often happens with mathematical logic, this result can leave you a bit underwhelmed. It only makes sense in the context of uncountable sets, which you can hardly find in real life. Ben-David et al. themselves admit in the conclusion that the proof hinges on the fact that EMX asks to find a function over an infinite domain rather than, say, an algorithm, which would be a much simpler object. In theoretical computer science, algorithms are defined as Turing machines, basically finite sets of instructions for a very simple formalized “computer”. There are only countably many finite sets of instructions, while there is, obviously, a continuum of finite subsets of [0,1] and hence of functions.

Nevertheless, it is really exciting to see different fields of mathematics connected in such unexpected and beautiful ways. I hope that more results like this will follow, and I hope that in the future, modern mathematics will play a more important role in machine learning than it does now. Thank you for reading!

Sergey Nikolenko

Chief Research Officer, Neuromation