News media these days is filled with stories of AI victories over human experts in tasks ranging from playing Go to reading MRIs. One might imagine that smart machines would have an easy time with mathematics — but as a domain, math remains relatively unexplored by AI. The new DeepMind paper Analyzing Mathematical Reasoning Abilities of Neural Models pits a neural network against a high school mathematics test with surprising results: the AI failed.

Humans apply a variety of cognitive skills to solve simple math substitution questions:

Parsing the characters into entities such as numbers, arithmetic operators, variables (which together form functions) and words (determining the question).

Planning (for example, identifying the functions in the correct order to compose).

Using sub-algorithms for function composition (addition, multiplication).

Exploiting working memory to store intermediate values (such as the composition h(f(x))).

Generally applying acquired knowledge of rules, transformations, processes, and axioms.
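The skills listed above can be made concrete with a small sketch of a function-composition question. The functions f and h and the helper compose_with_trace are hypothetical, chosen only for illustration and not taken from the paper. The list of functions stands in for the planning step, and the recorded trace plays the role of working memory.

```python
# Hypothetical example functions (illustrative only, not from the paper).
def f(x):
    return 2 * x + 3


def h(x):
    return -5 * x - 8


def compose_with_trace(functions, x):
    """Apply the functions innermost-first, recording each intermediate value."""
    trace = [x]
    for fn in functions:
        x = fn(x)
        trace.append(x)
    return x, trace


# To evaluate h(f(1)), plan the order (f first, then h) and keep the
# intermediate value f(1) = 5 before applying h.
result, trace = compose_with_trace([f, h], 1)
print(result)  # -33
print(trace)   # [1, 5, -33]
```

A person does all of this implicitly. A sequence model has to learn the same steps from nothing but question and answer text.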

DeepMind trained and tested its neural model by first collecting a dataset consisting of different types of mathematics problems. Rather than crowd-sourcing, the team synthesized the dataset, which let them generate a larger number of training examples, control the difficulty level and reduce training time. The team used a “freeform” text format to ensure, for example, that tree-diagram or graph-type questions could be accommodated in the dataset.
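A minimal sketch of what synthesizing freeform question and answer pairs with a difficulty control might look like is given here. The function make_addition_question and its difficulty parameter are assumptions made for illustration. This is not DeepMind's generator, which covers far more problem types.

```python
import random


def make_addition_question(difficulty, rng):
    """Return a (question, answer) pair as freeform text.

    `difficulty` is a hypothetical knob: the maximum number of digits
    allowed in each operand.
    """
    upper = 10 ** difficulty - 1
    a = rng.randint(0, upper)
    b = rng.randint(0, upper)
    return f"What is {a} + {b}?", str(a + b)


# A seeded generator keeps the synthetic dataset reproducible.
rng = random.Random(0)
for difficulty in (1, 3, 5):
    question, answer = make_addition_question(difficulty, rng)
    print(question, "->", answer)
```

Because every example is generated by code, the number of examples and their difficulty can be set directly, which is harder to do with crowd-sourced questions.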

The data is based on UK national school mathematics curriculums (up to age 16), and covers Algebra, Arithmetic, Calculus, Comparisons, Measurement, Numbers, Manipulating Polynomials, and Probability.

While there have been previous studies on math using neural network-driven approaches, DeepMind limited themselves to a general sequence-processing architecture to present the most general baseline possible for future comparisons. The team selected LSTM (long short-term memory) and Transformer architectures for the test.

DeepMind tested two LSTM models on mathematics questions: a Simple LSTM trained with question data one character at a time using one-hot encoding; and an Attentional LSTM representing a commonly used neural machine translation encoder/decoder architecture as shown in the figure below.
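To illustrate what feeding a question one character at a time with one-hot encoding means, here is a small sketch in plain Python. The helper names build_vocab and one_hot_encode are hypothetical, and the vocabulary is built from a single example question purely for demonstration.

```python
def build_vocab(texts):
    """Map every distinct character in the texts to an integer index."""
    chars = sorted(set("".join(texts)))
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_encode(text, vocab):
    """Turn each character into a vector with a single 1 at its index."""
    vectors = []
    for ch in text:
        vec = [0] * len(vocab)
        vec[vocab[ch]] = 1
        vectors.append(vec)
    return vectors


question = "What is 2 + 3?"
vocab = build_vocab([question])
encoded = one_hot_encode(question, vocab)

# One vector per character, each as wide as the vocabulary.
print(len(encoded), len(vocab))
```

The model would receive these vectors in order, one per time step, so it sees digits and operators as individual symbols with no built-in notion of what a number is.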

The Transformer model, meanwhile, is a sequence-to-sequence model that achieves state-of-the-art results in machine translation. Its general problem-solving logic is shown below.

Researchers observed that the Simple LSTM, Attentional LSTM and Transformer models had roughly the same overall performance on the math test. The Transformer model, however, proved superior on some types of problems, an advantage attributed to the Transformer:

(1) doing more calculations with the same number of parameters

(2) having a shallower architecture (with better gradient propagation)

(3) having an internal “memory” that is sequential, which is more pre-disposed to mathematical objects like sequences of digits.

The models’ results on the 40-question test were all around 35 percent correct, a failing grade on any high school report card. Detailed results are as follows:

The paper Analyzing Mathematical Reasoning Abilities of Neural Models is on arXiv.