Basic concepts

In the following section, you will find some basic concepts of machine learning that are important to know and that hold true for almost all machine learning methods.

Capacity, Overfitting and Underfitting

The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs — not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

Typically, when training a machine learning model, we have access to a training set; we can compute some error measure on the training set, called the training error; and we reduce this training error. What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well. We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set. The factors determining how well a machine learning algorithm will perform are its ability to:

1. Make the training error small.

2. Make the gap between training and test error small.

These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.

We can control whether a model is more likely to overfit or underfit by altering its capacity. Informally, a model’s capacity is its ability to fit a wide variety of functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.
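
To make these regimes concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available; the synthetic sine data and the particular degrees are arbitrary illustrative choices. Polynomial degree plays the role of capacity: degree 1 tends to underfit (high training and test error), while degree 15 tends to overfit (low training error, much higher test error).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))                       # small training set
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=30)   # noisy targets
X_test = rng.uniform(-1, 1, size=(200, 1))
y_test = np.sin(3 * X_test[:, 0]) + rng.normal(scale=0.2, size=200)

for degree in (1, 4, 15):  # low, appropriate, and excessive capacity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    train_err = mean_squared_error(y, model.predict(X))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```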

One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as the solution. For example, restricting a regression model to linear functions of its input yields low capacity, while also allowing higher-degree polynomials enlarges the hypothesis space and increases capacity.

Machine learning algorithms will generally perform best when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task, they may overfit.

Figure: Visualization of underfitting and overfitting [1]

We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test error), we must still choose a sufficiently complex hypothesis to achieve low training error.

Typically, training error decreases until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum value). On the other hand, generalization error has a U-shaped curve as a function of model capacity. This is illustrated in the following figure:

Figure: Relationship between capacity and error [1]

At the left end of the graph, training error and generalization error are both high. This is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.
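
This curve can also be traced numerically. In the sketch below (again assuming scikit-learn; the synthetic data and degree range are illustrative choices), training error keeps falling as capacity grows, while test error bottoms out near the optimal capacity and then rises again.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=40)
X_test = rng.uniform(-1, 1, size=(400, 1))
y_test = np.sin(3 * X_test[:, 0]) + rng.normal(scale=0.2, size=400)

test_errs = {}
for degree in range(1, 16):  # capacity sweep
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    train_err = mean_squared_error(y, model.predict(X))
    test_errs[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_errs[degree]:.3f}")

print("lowest test error (empirical optimal capacity): degree",
      min(test_errs, key=test_errs.get))
```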

The No Free Lunch Theorem

The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.

This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the “real world” that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data-generating distributions we care about.

Regularization

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

The no free lunch theorem has made it clear that there is no best machine learning algorithm and, in particular, no best form of regularization. Instead, we must choose a form of regularization that is well suited to the particular task we want to solve.
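
One widely used form of regularization is L2 weight decay, which penalizes large weights. The sketch below is a minimal illustration, assuming scikit-learn is available; the synthetic data, the degree-15 feature map, and the alpha value are arbitrary illustrative choices, not recommended settings. The penalty typically leaves training error slightly higher while lowering test error.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=30)
X_test = rng.uniform(-1, 1, size=(200, 1))
y_test = np.sin(3 * X_test[:, 0]) + rng.normal(scale=0.2, size=200)

# Same high-capacity degree-15 model, with and without an L2 penalty on the weights.
for name, reg in [("unregularized", LinearRegression()),
                  ("weight decay", Ridge(alpha=0.1))]:
    model = make_pipeline(PolynomialFeatures(15), reg).fit(X, y)
    print(f"{name:14s} train MSE={mean_squared_error(y, model.predict(X)):.3f}  "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.3f}")
```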

Hyperparameters and Validation Sets

Most machine learning algorithms have hyperparameters, settings that we can use to control the algorithm’s behavior. The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure in which one learning algorithm learns the best hyperparameters for another learning algorithm).

Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because the setting is difficult to optimize. More frequently, the setting must be a hyperparameter if it is not appropriate to learn on the training set. This applies to all hyperparameters that control model capacity. If learned on the training set, such hyperparameters would always choose the maximum possible model capacity, resulting in overfitting. To solve this problem, we need a validation set of examples that the training algorithm does not observe.

Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed. It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data. Specifically, we split the training data into two disjoint subsets. One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly. The subset of the training data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process.

The other subset of the training data used to guide the selection of hyperparameters is called the validation set. Typically, one uses about 80 percent of the training data for training and 20 percent for validation. Since the validation set is used to “train” the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error does. After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.
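
Here is a minimal sketch of this workflow, assuming scikit-learn and synthetic data (the candidate alpha values are arbitrary illustrative choices): about 80 percent of the training data fits the parameters, about 20 percent scores each candidate hyperparameter, and the test set is touched only once at the end.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)

# Hold out a test set first; it is never used for any modeling choice.
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the remaining training data: ~80% for parameters, ~20% for validation.
X_train, X_val, y_train, y_val = train_test_split(X_pool, y_pool, test_size=0.2, random_state=0)

best_alpha, best_err = None, float("inf")
for alpha in (0.01, 0.1, 1.0, 10.0):  # candidate hyperparameter values (illustrative)
    err = mean_squared_error(y_val, Ridge(alpha=alpha).fit(X_train, y_train).predict(X_val))
    if err < best_err:
        best_alpha, best_err = alpha, err

# After hyperparameter selection is complete, refit on the full training pool
# and estimate generalization error once on the untouched test set.
final = Ridge(alpha=best_alpha).fit(X_pool, y_pool)
print("chosen alpha:", best_alpha, " test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```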

Cross-Validation

Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set being small.

A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm A works better than algorithm B on the given task.

When the dataset has hundreds of thousands of examples or more, this is not a serious issue. When the dataset is too small, alternative procedures enable one to use all the examples in the estimation of the mean test error, at the price of increased computational cost. These procedures are based on the idea of repeating the training and testing computation on different randomly chosen subsets or splits of the original dataset. The most common of these is the k-fold cross-validation procedure, in which a partition of the dataset is formed by splitting it into k non-overlapping subsets. The test error may then be estimated by taking the average test error across k trials. On trial i, the i-th subset of the data is used as the test set, and the rest of the data is used as the training set.
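
A minimal sketch of 5-fold cross-validation, assuming scikit-learn and synthetic data: KFold handles the partitioning into non-overlapping subsets, and the k per-fold errors are averaged into a single estimate.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=100)

fold_errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Trial i: the i-th fold is the test set, the remaining folds form the training set.
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("per-fold MSE:", np.round(fold_errors, 3))
print("cross-validated estimate of test MSE:", np.mean(fold_errors))
```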

Bias and Variance

Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation of the estimator from the true value of the function or parameter. Variance, on the other hand, measures how much any particular sampling of the data is likely to cause the estimate to deviate from its expected value.

What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them? The most common way to negotiate this trade-off is to use cross-validation. Empirically, cross-validation is highly successful on many real-world tasks. Alternatively, we can also compare the mean squared error (MSE) of the estimates:

$\mathrm{MSE} = \mathbb{E}\big[(\hat{\theta}_m - \theta)^2\big] = \mathrm{Bias}(\hat{\theta}_m)^2 + \mathrm{Var}(\hat{\theta}_m)$
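
This decomposition can be checked numerically. The sketch below is a Monte Carlo simulation with NumPy (the Gaussian setup and the 0.9 shrinkage factor are arbitrary illustrative choices), comparing the unbiased sample mean with a deliberately biased shrunken estimator; accepting a little bias can lower the variance enough to reduce the overall MSE.

```python
import numpy as np

rng = np.random.default_rng(4)
true_mean, n, trials = 1.0, 10, 100_000

# Each row is one dataset of n points; each estimator maps a dataset to a number.
samples = rng.normal(loc=true_mean, scale=1.0, size=(trials, n))
for name, est in [("sample mean (unbiased)", samples.mean(axis=1)),
                  ("shrunk mean (biased)  ", 0.9 * samples.mean(axis=1))]:
    bias = est.mean() - true_mean          # Monte Carlo estimate of the bias
    var = est.var()                        # Monte Carlo estimate of the variance
    mse = ((est - true_mean) ** 2).mean()  # Monte Carlo estimate of the MSE
    print(f"{name}: bias^2={bias**2:.4f}  var={var:.4f}  "
          f"bias^2+var={bias**2 + var:.4f}  mse={mse:.4f}")
```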

The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting. When generalization error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias. This is illustrated in the next figure, where we see again the U-shaped curve of generalization error as a function of capacity.

Figure: Visualization of bias and variance [1]

Consistency

Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. In particular, we usually wish that, as the number of data points in our dataset increases, our point estimates converge to the true value of the corresponding parameters. This is called consistency. Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows.
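
As a minimal sketch of consistency with NumPy (the Gaussian data is an assumed illustrative choice): the sample mean is a consistent estimator of the true mean, so its error typically shrinks toward zero as the number of examples m grows.

```python
import numpy as np

rng = np.random.default_rng(5)
true_mean = 2.0
for m in (10, 100, 10_000, 1_000_000):
    # Sample mean of m i.i.d. draws; a consistent estimator of the true mean.
    estimate = rng.normal(loc=true_mean, scale=1.0, size=m).mean()
    print(f"m={m:>9,d}  estimate={estimate:.4f}  |error|={abs(estimate - true_mean):.4f}")
```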