Moreover, with large, complex sets of training patterns, it is likely that some errors will be present, either in the inputs or in the outputs. In that case, and again particularly in the later parts of the learning process, backprop is liable to contort the weights so as to fit precisely around training patterns that are actually erroneous! This phenomenon is known as over-fitting.

This problem can to some extent be avoided by stopping learning early. How does one tell when to stop? One method is to partition the training patterns into two sets (assuming that there are enough of them). The larger part of the training patterns, say 80% of them, chosen at random, form the training set, and the remaining 20% are referred to as the test set. Every now and again during training, one measures the performance of the current set of weights on the test set. One normally finds that the error on the training set drops monotonically (that's what a gradient descent algorithm is supposed to do, after all). However, error on the test set (which will be larger, per pattern, than the error on the training set) will fall at first, then start to rise as the algorithm begins to overtrain. Best generalization performance is gained by stopping the algorithm at the point where error on the test set starts to rise.
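
For concreteness, here is a minimal sketch of this early-stopping procedure in Python. The train_one_epoch and error callables are placeholders for whatever backprop implementation is in use, and the 80/20 split, checking interval, and epoch limit are illustrative assumptions, not fixed prescriptions:

    import random

    def early_stopping(initial_weights, patterns, train_one_epoch, error,
                       check_every=10, max_epochs=10000):
        # Partition the patterns at random: 80% training set, 20% test set.
        patterns = list(patterns)
        random.shuffle(patterns)
        split = int(0.8 * len(patterns))
        train, test = patterns[:split], patterns[split:]

        weights = initial_weights
        best_weights, best_test_error = weights, float("inf")
        for epoch in range(max_epochs):
            weights = train_one_epoch(weights, train)
            if epoch % check_every == 0:
                test_error = error(weights, test)
                if test_error < best_test_error:
                    # Test error still falling: remember these weights.
                    best_weights, best_test_error = weights, test_error
                else:
                    # Test error has started to rise: stop training here.
                    break
        return best_weights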

generalized delta rule An improvement on the delta rule used in error backpropagation learning. If the learning rate (often denoted by η) is small, the backprop algorithm proceeds slowly, but accurately follows the path of steepest descent on the error surface. If η is too large, the algorithm may "bounce off the canyon walls of the error surface", i.e. overshoot and oscillate rather than converge. This can be largely avoided by modifying the delta rule to include a momentum term:

Δw_ji(n) = α Δw_ji(n−1) + η δ_j(n) y_i(n)

in the notation of Haykin's text (Neural Networks: A Comprehensive Foundation). The constant α is termed the momentum constant and can be adjusted to achieve the best effect. The second summand corresponds to the standard delta rule, while the first summand says "add α × the previous change to this weight."

This new rule is called the generalized delta rule. The effect is that if the basic delta rule consistently pushes a weight in the same direction, the weight change gradually gathers "momentum" in that direction.
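
As a concrete illustration, the following is a minimal sketch of one generalized-delta-rule update in Python/NumPy, for a single layer whose weights w_ji connect inputs y_i to units j. The function name, array shapes, and the values of η and α are illustrative assumptions:

    import numpy as np

    def generalized_delta_update(w, prev_dw, delta, y, eta=0.1, alpha=0.9):
        # Δw_ji(n) = α Δw_ji(n-1) + η δ_j(n) y_i(n)
        dw = alpha * prev_dw + eta * np.outer(delta, y)
        return w + dw, dw   # updated weights, plus Δw(n) for the next call

    # Example: two units, three inputs. Because delta and y are the same
    # on every step, each change is larger than the last (0.1, 0.19,
    # 0.271 times the outer product) as momentum accumulates.
    w = np.zeros((2, 3))
    dw = np.zeros_like(w)
    for n in range(3):
        w, dw = generalized_delta_update(
            w, dw, delta=np.array([0.5, -0.2]), y=np.array([1.0, 0.3, -0.1]))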

gradient descent