Optimal training difficulty for binary classification tasks

In a standard binary classification task, an animal or machine ‘agent’ makes binary decisions about simple stimuli. For example, in the classic Random Dot Motion paradigm from Psychology and Neuroscience15,16, stimuli consist of a patch of moving dots—most moving randomly but a small fraction moving coherently either to the left or the right—and participants must decide in which direction the coherent dots are moving. A major factor in determining the difficulty of this perceptual decision is the fraction of coherently moving dots, which can be manipulated by the experimenter to achieve a fixed error rate during training using a procedure known as ‘staircasing’17.

We assume that agents make their decision on the basis of a scalar, subjective decision variable, h, which is computed from a stimulus that can be represented as a vector x (e.g., the direction of motion of all dots)

$$h = \Phi ({\mathbf{x}},{\boldsymbol{\phi }})$$ (1)

where Φ(⋅) is a function of the stimulus and (tunable) parameters ϕ. We assume that this transformation of stimulus x into the subjective decision variable h yields a noisy representation of the true decision variable, Δ (e.g., the fraction of dots moving left). That is, we write

$$h = \Delta + n$$ (2)

where the noise, n, arises due to the imperfect representation of the decision variable. We further assume that this noise, n, is random and sampled from a zero-mean Gaussian distribution with standard deviation σ (Fig. 1a).

Fig. 1 Illustration of the model. a Distributions over the decision variable h at a particular difficulty, Δ = 16, with lower precision before learning and higher precision after learning. The shaded regions correspond to the error rate, the probability of making an incorrect response at each difficulty. b The error rate as a function of difficulty before and after learning. c The derivative that determines the rate of learning as a function of difficulty, before and after learning, showing that the optimal difficulty for learning is lower after learning than before. d The same derivative as in c re-plotted as a function of error rate, showing that the optimal error rate (15.87%, i.e., ~85% accuracy) is the same both before and after learning.

If the decision boundary is set to 0, such that the model chooses option A when h > 0, option B when h < 0 and randomly when h = 0, then the noise in the representation of the decision variable leads to errors with probability

$${\mathrm{ER}} = {\int_{ - \infty }^0} p (h|\Delta ,\sigma )\mathrm{d}h = F( - \Delta /\sigma ) = F( - \beta \Delta )$$ (3)

where F(x) is the cumulative distribution function of the standardized noise distribution, p(x) = p(x|0, 1), and β = 1/σ quantifies both the precision of the representation of Δ and the agent’s skill at the task. As shown in Fig. 1b, this error rate decreases as the decision gets easier (Δ increases) and as the agent becomes more accomplished at the task (β increases).
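Under the Gaussian-noise assumption, Eq. (3) is straightforward to evaluate directly. The following minimal sketch (the function and variable names are ours, chosen for illustration) computes the error rate from the difficulty Δ and precision β and reproduces both trends visible in Fig. 1b:

```python
from math import erf, sqrt

def error_rate(delta, beta):
    """Eq. (3): ER = F(-beta * delta), with F the standard normal CDF,
    written via erf so only the math module is needed."""
    return 0.5 * (1.0 + erf(-beta * delta / sqrt(2.0)))

# The error rate falls as the decision gets easier (larger delta)...
er_hard, er_easy = error_rate(0.5, 1.0), error_rate(2.0, 1.0)
# ...and as the agent becomes more skilled (larger beta).
er_novice, er_expert = error_rate(1.0, 0.5), error_rate(1.0, 2.0)
```

At βΔ = 1 this gives ER = F(−1) ≈ 0.1587, the value that recurs below.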

The goal of learning is to tune the parameters ϕ such that the subjective decision variable, h, is a better reflection of the true decision variable, Δ. That is, the model should aim to adjust the parameters ϕ so as to decrease the magnitude of the noise σ or, equivalently, increase the precision β. One way to achieve this tuning is to adjust the parameters using gradient descent on the error rate, i.e. changing the parameters over time t according to

$$\frac{{\mathrm{d}{\boldsymbol{\phi }}}}{{\mathrm{d}t}} = - \eta \nabla _{\boldsymbol{\phi }}{\mathrm{ER}}$$ (4)

where η is the learning rate and ∇ϕER is the derivative of the error rate with respect to parameters ϕ. This gradient can be written in terms of the precision, β, as

$$\nabla _{\boldsymbol{\phi }}{\mathrm{ER}} = \frac{{\partial {\mathrm{ER}}}}{{\partial \beta }}\nabla _{\boldsymbol{\phi }}\beta$$ (5)

Note here that only the first term on the right-hand side of Eq. (5) depends on the difficulty Δ, while the second describes how the precision changes with ϕ. Note also that Δ itself, as the ‘true’ decision variable, is independent of ϕ. This means that the optimal training difficulty, the one that maximizes the change in the parameters ϕ at this time point, is the value of the decision variable, Δ*, that maximizes ∂ER/∂β. Of course, this analysis ignores the effect of changing ϕ on the form of the noise, instead assuming that it only changes the scale factor β. This assumption likely holds in the relatively simple cases we consider here, although whether it holds in more complex cases is an important question for future work.

In terms of the decision variable, the optimal difficulty changes as a function of precision (Fig. 1c) meaning that the difficulty of training must be adjusted online according to the skill of the agent. Using the monotonic relationship between Δ and ER (Fig. 1b) it is possible to express the optimal difficulty in terms of the error rate, ER* (Fig. 1d). Expressed this way, the optimal difficulty is constant as a function of precision, meaning that optimal learning can be achieved by clamping the error rate during training at a fixed value, which, for Gaussian noise is

$${\mathrm{ER}}^ \ast = \frac{1}{2}\left( {1 - {\mathrm{erf}}\left( {\frac{1}{{\sqrt 2 }}} \right)} \right) \approx 0.1587$$ (6)

That is, the optimal error rate for learning is 15.87%, and the optimal accuracy is around 85%. We call this the Eighty Five Percent Rule for optimal learning.
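The value 15.87% can be recovered numerically. The sketch below (a brute-force grid search, not the paper's analytic derivation) maximizes the magnitude of the learning-rate term ∂ER/∂β from Eq. (5), which for Gaussian noise is Δ p(βΔ), and reads off the corresponding error rate:

```python
from math import erf, exp, pi, sqrt

def std_normal_pdf(x):
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)

def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def learning_speed(delta, beta=1.0):
    """|dER/dbeta| = delta * p(beta * delta) for Gaussian noise (from Eq. (5))."""
    return delta * std_normal_pdf(beta * delta)

# Grid search for the difficulty that maximizes the speed of learning.
deltas = [i / 1000.0 for i in range(1, 5000)]
delta_star = max(deltas, key=learning_speed)   # maximum sits at beta * delta = 1
er_star = std_normal_cdf(-delta_star)          # the corresponding error rate, Eq. (6)
```

Repeating the search with a different β moves Δ* but leaves er_star unchanged, which is the content of Fig. 1c, d.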

Dynamics of learning

While the previous analysis allows us to calculate the error rate that maximizes the rate of learning, it does not tell us how much faster learning occurs at this optimal error rate. In this section we address this question by comparing learning at the optimal error rate with learning at a fixed error rate, ER_f (which may be suboptimal), and, alternatively, at a fixed difficulty, Δ_f. If stimuli are presented one at a time (i.e., not batch learning), then in both cases gradient-descent-based updating of the parameters ϕ (Eq. (4)) implies that the precision β evolves in a similar manner, i.e.,

$$\frac{{\mathrm{d}\beta }}{{\mathrm{d}t}} = - \eta \frac{{\partial {\mathrm{ER}}}}{{\partial \beta }}$$ (7)

For a fixed error rate, ER_f, as shown in the Methods, integrating Eq. (7) gives

$$\beta (t) = \sqrt {\beta _0^2 + 2\eta K_{\mathrm{f}}(t - t_0)}$$ (8)

where t_0 is the initial time point, β_0 is the initial value of β, and K_f is the following function of the training error rate

$$K_{\mathrm{f}} = - F^{ - 1}({\mathrm{ER}}_{\mathrm{f}})p(F^{ - 1}({\mathrm{ER}}_{\mathrm{f}}))$$ (9)

Thus, for a fixed training error rate, the precision grows as the square root of time, with the exact rate determined by K_f, which depends on both the training error rate and the noise distribution.
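As a check on Eq. (8), one can integrate Eq. (7) numerically while the error rate is clamped. In the sketch below (the learning rate, step size, and duration are arbitrary choices of ours), the training error rate is held at 15.87%, for which F⁻¹(ER_f) = −1 under Gaussian noise, and the Euler-integrated precision is compared against the closed form:

```python
from math import exp, pi, sqrt

def std_normal_pdf(x):
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)

# Eq. (9) at ER_f = 0.1587: K_f = -F^{-1}(ER_f) * p(F^{-1}(ER_f)) = 1 * p(-1)
K_f = 1.0 * std_normal_pdf(1.0)
eta, beta0, dt, steps = 0.1, 1.0, 0.01, 100_000

# With ER clamped, Eq. (7) reduces to dbeta/dt = eta * K_f / beta;
# Euler-integrate it forward in time.
beta = beta0
for _ in range(steps):
    beta += dt * eta * K_f / beta

t = steps * dt
beta_closed = sqrt(beta0**2 + 2.0 * eta * K_f * t)  # Eq. (8) with t_0 = 0
```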

For a fixed decision variable, Δ_f, integrating Eq. (7) is more difficult and the solution depends more strongly on the distribution of the noise. In the case of Gaussian noise, there is no closed-form solution for β. However, as shown in the Methods, an approximate form can be derived at long times, where we find that β grows as

$$\beta (t) \propto \sqrt {\log t}$$ (10)

i.e., exponentially slower than the square-root growth of Eq. (8).

Simulations

To demonstrate the applicability of the Eighty Five Percent Rule we simulated the effect of training accuracy on learning in three cases, two from AI and one from computational neuroscience. From AI we consider how training at 85% accuracy impacts learning in the simple case of a one-layer Perceptron14 with artificial stimuli, and in the more complex case of a two-layer neural network9 with stimuli drawn from the MNIST (Modified National Institute of Standards and Technology) dataset of handwritten digits18. From computational neuroscience we consider the model of Law and Gold11, which accounts for both the behavior and neural firing properties of monkeys learning the Random Dot Motion task. In all cases we see that learning is maximized when training occurs at 85% accuracy.

Perceptron with artificial stimuli

The Perceptron is a classic one-layer neural network model that learns to map multidimensional stimuli, x, onto binary labels, y, via a linear threshold process14. To implement this mapping, the Perceptron first computes the decision variable h as

$$h = {\mathbf{w}}\cdot {\mathbf{x}}$$ (11)

where w are the weights of the network, and then assigns the label according to

$$y = \left\{ {\begin{array}{*{20}{c}} 1 & {h \, > \, 0} \\ 0 & {h \le 0} \end{array}} \right.$$ (12)

The weights, w, which constitute the parameters of the model, are updated based on feedback about the true label, t, by the learning rule,

$${\mathbf{w}} \leftarrow {\mathbf{w}} + (t - y){\mathbf{x}}$$ (13)

This learning rule implies that the Perceptron only updates its weights when the predicted label y does not match the actual label t—that is, the Perceptron only learns when it makes mistakes. Naively then, one might expect that optimal learning would involve maximizing the error rate. However, because Eq. (13) is closely related (albeit not identical) to a gradient descent based rule (e.g., Chapter 39 in ref. 19), the analysis of the previous sections applies and the optimal error rate for training is 15.87%.

To test this prediction we simulated the Perceptron learning rule for a range of training error rates between 0.01 and 0.5 in steps of 0.01 (1000 simulations per error rate, 1000 trials per simulation). The error rate was kept constant by varying the difficulty, and the degree of learning was captured by the precision, β (see Methods). As predicted by the theory, the network learns most effectively when trained at the optimal error rate (Fig. 2a) and the dynamics of learning are well described, up to a scale factor, by Eq. (8) (Fig. 2b).
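A toy version of this setup can be sketched as follows. This is not the paper's exact simulation: the stimulus construction (signal along one assumed direction plus unit Gaussian noise), the staircase step ratios, and the probe difficulty are all our own illustrative choices.

```python
import math
import random

random.seed(0)
dim = 20
e = [1.0] + [0.0] * (dim - 1)        # assumed true signal direction
w = [random.gauss(0.0, 0.1) for _ in range(dim)]
delta = 2.0                          # staircased difficulty

def make_trial(difficulty):
    """Stimulus = +/-difficulty along e plus unit Gaussian noise; label from the sign."""
    sign = random.choice([-1.0, 1.0])
    x = [sign * difficulty * e[i] + random.gauss(0.0, 1.0) for i in range(dim)]
    return x, (1 if sign > 0 else 0)

for _ in range(2000):
    x, t = make_trial(delta)
    h = sum(wi * xi for wi, xi in zip(w, x))         # Eq. (11)
    y = 1 if h > 0 else 0                            # Eq. (12)
    w = [wi + (t - y) * xi for wi, xi in zip(w, x)]  # Eq. (13): learn on errors only
    # Multiplicative 1-up/1-down staircase; these step ratios hold
    # accuracy roughly in the mid-80% range (our choice, for illustration)
    delta = max(0.1, delta * (0.98 if y == t else 1.10))

# Probe performance at a fixed, fairly easy difficulty after training
probe = [make_trial(3.0) for _ in range(1000)]
accuracy = sum((sum(wi * xi for wi, xi in zip(w, x)) > 0) == (t == 1)
               for x, t in probe) / len(probe)
```

After training, the weight vector aligns with the signal direction, and probe accuracy at the fixed difficulty is well above chance.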

Fig. 2 The Eighty Five Percent Rule applied to the Perceptron. a The relative precision, β/β_max, as a function of training error rate and training duration. Training at the optimal error rate leads to the fastest learning throughout. b The dynamics of learning agree well with the theory.

Two-layer network with MNIST stimuli

As a more demanding test of the Eighty Five Percent Rule, we consider the case of a two-layer neural network applied to more realistic stimuli from the MNIST dataset of handwritten digits18. The MNIST dataset is a labeled dataset of 70,000 images of handwritten digits (0 through 9) that has been widely used as a test of image classification algorithms (see ref. 20 for a list). The dataset is broken down into a training set consisting of 60,000 images and a test set of 10,000 images. To create binary classification tasks based on these images, we trained the network to classify the images according to either the parity (odd or even) or the magnitude (less than 5 or not) of the number.
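The two binary tasks are simple relabelings of the ten digit classes. A sketch of the mapping (which grouping is coded as 1 is our assumption; the text above only specifies the two groupings):

```python
def parity_label(digit):
    """Parity task: 1 for odd digits, 0 for even digits."""
    return digit % 2

def magnitude_label(digit):
    """Magnitude task: 1 for digits of 5 or more, 0 for digits below 5."""
    return 1 if digit >= 5 else 0

# Every digit 0-9 gets one binary label per task
parity = [parity_label(d) for d in range(10)]
magnitude = [magnitude_label(d) for d in range(10)]
```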

The network itself consisted of one input layer with 400 units corresponding to the pixel values in the images, one hidden layer with 50 units, and one output unit. Unlike the Perceptron, the activity of the output unit was graded and was determined by a sigmoid function of the decision variable, h

$$y = \frac{1}{{1 + \exp \left( { - h} \right)}} = S(h)$$ (14)

where the decision variable was given by

$$h = {\mathbf{w}}_2\cdot {\mathbf{a}}$$ (15)

where w_2 were the weights connecting the hidden layer to the output unit and a was the activity in the hidden layer. This hidden-layer activity was also determined by a sigmoidal function

$${\mathbf{a}} = S({\mathbf{w}}_1\cdot {\mathbf{x}})$$ (16)

where the inputs, x, corresponded to the pixel values in the image and w_1 were the weights from the input layer to the hidden layer.

All weights were trained using the Backpropagation algorithm9 which takes the error,

$$e = t - y$$ (17)

and propagates it backwards through the network, from output to input stage, as a teaching signal for the weights. This algorithm implements stochastic gradient descent and, if our assumptions are met, should optimize learning at a training accuracy of 85%.
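A minimal sketch of this architecture and a single backpropagation update, assuming a squared-error loss and illustrative weight scales and learning rate (none of which are specified in the text above):

```python
import math
import random

random.seed(1)
n_in, n_hid = 400, 50
w1 = [[random.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_hid)]
w2 = [random.gauss(0.0, 0.05) for _ in range(n_hid)]

def S(h):
    """Sigmoid nonlinearity used in Eqs. (14) and (16)."""
    return 1.0 / (1.0 + math.exp(-h))

def forward(x):
    a = [S(sum(w1[j][i] * x[i] for i in range(n_in))) for j in range(n_hid)]  # Eq. (16)
    y = S(sum(w2[j] * a[j] for j in range(n_hid)))                            # Eqs. (14)-(15)
    return a, y

def backprop_step(x, t, lr=0.5):
    """One stochastic-gradient step on the loss 0.5 * (t - y)**2."""
    a, y = forward(x)
    e = t - y                       # Eq. (17)
    dh = e * y * (1.0 - y)          # error passed back through the output sigmoid
    for j in range(n_hid):
        da = dh * w2[j] * a[j] * (1.0 - a[j])   # ...and through the hidden sigmoid
        w2[j] += lr * dh * a[j]
        for i in range(n_in):
            w1[j][i] += lr * da * x[i]

x = [random.random() for _ in range(n_in)]
_, y_before = forward(x)
backprop_step(x, t=1.0)             # teach the network that this input is class 1
_, y_after = forward(x)             # the output moves toward the target
```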

To test this prediction we trained the two-layer network for 5000 trials to perform either the Parity or the Magnitude task while clamping the training error rate at values between 5% and 30% (Fig. 3). After training, performance was assessed on the entire test set and the whole process was repeated 1000 times for each task. As shown in Fig. 3, the training error rate has a relatively large effect on test accuracy, with around a 10% difference between the best and worst training accuracies. Moreover, for both tasks, optimal training occurs at 85% training accuracy. This suggests that the Eighty Five Percent Rule holds even for learning of more realistic stimuli by more complex, multi-layered networks.

Fig. 3 The Eighty Five Percent Rule applied to a multilayered neural network. Test accuracy vs training error rate on the MNIST dataset for the a Parity and b Magnitude tasks for 1000 different simulations. In both cases the test accuracy peaks at or near the optimal error rate. Each color corresponds to a different target training accuracy.

Biologically plausible model of perceptual learning

To demonstrate how the Eighty Five Percent Rule might apply to learning in biological systems, we simulated the Law and Gold model of perceptual learning11. This model has been shown to capture the long term changes in behavior, neural firing and synaptic weights as monkeys learn to perform the Random Dot Motion task.

Specifically, the model assumes that monkeys make the perceptual decision between left and right on the basis of neural activity in area MT—an area in the dorsal visual stream that is known to represent motion information15. In the Random Dot Motion task, neurons in MT have been found to respond to both the direction θ and coherence COH of the dot motion stimulus such that each neuron responds most strongly to a particular ‘preferred’ direction and that the magnitude of this response increases with coherence. This pattern of firing is well described by a simple set of equations (see “Methods”) and thus the noisy population response, x, to a stimulus of arbitrary direction and coherence is easily simulated.

From this MT population response, Law and Gold proposed that animals construct a decision variable in a separate area of the brain (the lateral intraparietal area, LIP) as the weighted sum of activity in MT; i.e.,

$$h = {\mathbf{w}}\cdot {\mathbf{x}} + \epsilon$$ (18)

where w are the weights between MT and LIP neurons and ϵ is random neuronal noise that cannot be reduced by learning. The presence of this irreducible neural noise is a key difference between the Law and Gold model (Eq. (18)) and the Perceptron (Eq. (11)), as it means that no amount of learning can lead to perfect performance. However, as shown in the Methods section, the presence of irreducible noise does not change the optimal accuracy for learning, which is still 85%.

Another difference between the Perceptron and the Law and Gold model is the form of the learning rule. In particular, weights are updated according to a reinforcement learning rule based on a reward prediction error

$$\delta = r - E[r]$$ (19)

where r is the reward presented on the current trial (1 for a correct answer, 0 for an incorrect answer) and E[r] is the predicted reward

$$E[r] = \frac{1}{{1 + \exp ( - B|h|)}}$$ (20)

where B is a proportionality constant that is estimated online by the model (see “Methods”). Given the prediction error, the model updates its weights according to

$${\mathbf{w}} \leftarrow {\mathbf{w}} + \eta C\delta {\mathbf{x}}$$ (21)

where C is the choice (−1 for left, +1 for right) and η is the learning rate. Despite the superficial differences from the Perceptron learning rule (Eq. (13)), the Law and Gold model still implements stochastic gradient descent on the error rate13, and learning should therefore be optimized at 85% accuracy.
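A toy version of this update rule can be sketched as follows. The 'MT' population here is a crude stand-in (half the units prefer rightward motion, half leftward, with arbitrary noise levels), not the Law and Gold parameterization, and B is fixed rather than estimated online as in the full model.

```python
import math
import random

random.seed(2)
dim = 10
w = [0.0] * dim
B, eta, noise_sd = 1.0, 0.05, 0.5

def mt_response(direction, coherence):
    """Toy MT population: units 0..4 'prefer' right (+1), units 5..9 left (-1)."""
    return [random.gauss(coherence if (i < dim // 2) == (direction > 0) else 0.0, 0.3)
            for i in range(dim)]

def rl_update(w, x, h, correct):
    r = 1.0 if correct else 0.0
    E_r = 1.0 / (1.0 + math.exp(-B * abs(h)))    # Eq. (20): predicted reward
    delta = r - E_r                              # Eq. (19): prediction error
    C = 1.0 if h > 0 else -1.0                   # choice: +1 right, -1 left
    return [wi + eta * C * delta * xi for wi, xi in zip(w, x)]  # Eq. (21)

for _ in range(3000):
    direction = random.choice([-1, 1])
    x = mt_response(direction, 0.5)
    # Eq. (18): weighted MT activity plus irreducible decision noise
    h = sum(wi * xi for wi, xi in zip(w, x)) + random.gauss(0.0, noise_sd)
    choice = 1 if h > 0 else -1
    w = rl_update(w, x, h, correct=(choice == direction))

# Learning should come to weight right-preferring units above left-preferring ones
right_w, left_w = sum(w[:dim // 2]), sum(w[dim // 2:])
```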

To test this prediction we simulated the model at a variety of different target training error rates. Each target training error rate was simulated 100 times with different parameters for the MT neurons (see “Methods”). The precision, β, of the trained network was estimated by fitting the simulated behavior of the network on a set of test coherences that varied logarithmically between 1% and 100%. As shown in Fig. 4a, the precision after training is well described (up to a scale factor) by the theory. In addition, in Fig. 4b, we show the expected difference in behavior, in terms of psychometric choice curves, for three different training error rates. While these differences are small, they are large enough that they could be distinguished experimentally.