A model is biased when it is not complex enough to capture the relationships present in the training data. In contrast, a model has high variance when it captures too much of the training data, including its noise, resulting in high variation between predictions made on test data. Models with excessively high bias or variance exhibit greater prediction error than models trained to minimize both. One way of navigating this tradeoff is to understand whether a model is under- or overfitting, terms we associate with a model's performance when evaluated on test or holdout data.

A visual representation of underfitting, appropriate-fitting, and overfitting. Source: What is underfitting and overfitting in machine learning and how to deal with it.

Underfit models exhibit high bias. An underfit model fails to capture important relationships in the training data, relationships that could have been used in making predictions, and this failure results in poor predictive performance. Overfit models, by contrast, exhibit high variance: they typically perform well on training data but poorly when predicting outputs from new, unseen data. Models usually overfit when they are allowed too much complexity, when performance is not validated on holdout data, or when cross-validation is not used during training. As with the bias-variance tradeoff, the goal when training predictive models is to find the balance that minimizes both under- and overfitting.

The impacts of under- and overfit models are illustrated visually in this article by plotting decision surfaces from models trained on US Census Bureau data, which contains latitude and longitude coordinates of state borders and an associated state label (e.g. California, Illinois, Tennessee). By plotting the decision surface of models fit to this data, we can visually interpret model behavior in cases of under- and overfitting using a map that is familiar to many: the 48 contiguous United States.

The Data

The data comes from the US Census Bureau. In its original format, the data is a single Keyhole Markup Language (KML) file containing latitude and longitude coordinates of the borders of US states. The necessary latitude, longitude, and label (state) data were parsed from the KML file using a simple Python script.
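A minimal sketch of that parsing step is shown below, using only the standard library. The tag names follow the standard KML 2.2 schema, and the `parse_state_borders` helper and sample document are hypothetical; the actual Census Bureau file may be structured differently.

```python
# Hypothetical sketch of the KML parsing step. Tag names follow the
# standard KML 2.2 schema; the real Census Bureau file may differ.
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

def parse_state_borders(kml_text):
    """Return (latitude, longitude, state) tuples parsed from KML text."""
    rows = []
    root = ET.fromstring(kml_text)
    for placemark in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        name_el = placemark.find("kml:name", KML_NS)
        state = name_el.text if name_el is not None else "Unknown"
        for coords in placemark.iter("{http://www.opengis.net/kml/2.2}coordinates"):
            # KML stores each point as "lon,lat[,alt]", separated by whitespace
            for point in coords.text.split():
                lon, lat = point.split(",")[:2]
                rows.append((float(lat), float(lon), state))
    return rows

sample = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark><name>Tennessee</name>
    <LineString><coordinates>-90.31,35.0 -81.65,36.6</coordinates></LineString>
  </Placemark>
</kml>"""
borders = parse_state_borders(sample)
# borders == [(35.0, -90.31, 'Tennessee'), (36.6, -81.65, 'Tennessee')]
```

Note that KML lists longitude before latitude within each coordinate triple, so the parser swaps the order when building each row.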

The dataset contains 11,192 observations. The number of observations varies from state to state: California has 467 observations whereas Rhode Island has only 59 (the US Census Bureau does not suggest using this data for any serious geospatial or modeling work). The data is rich enough, however, to use as an exploratory tool, and it contains enough signal to predict a US state given a pair of latitude and longitude coordinates. We can then note differences in the decision surfaces of under- and overfit models by plotting those surfaces on the geographic coordinate plane.
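That predictive task can be sketched as follows, assuming scikit-learn is available. The handful of border-like points below are made up for illustration and stand in for the 11,192 parsed observations.

```python
# Hypothetical sketch: predict a state from (latitude, longitude) pairs.
# The points below are made up; the real data has 11,192 border points.
from sklearn.ensemble import RandomForestClassifier

X = [[35.0, -90.0], [35.0, -85.0], [36.5, -89.5], [36.5, -82.0],  # Tennessee-like
     [37.8, -89.5], [40.5, -91.0], [42.3, -88.5], [39.0, -87.6]]  # Illinois-like
y = ["Tennessee"] * 4 + ["Illinois"] * 4

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[35.8, -86.7]]))  # a point well inside the Tennessee cluster
```

With the real border data, the same fit-and-predict pattern applies; only the size and richness of `X` and `y` change.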

Note that we only have the border coordinates for each US state and do not have any coordinates within the borders of each state for this data. This means that our coordinates and state labels lie close to one another in the latitude-longitude coordinate plane since they may share a border (e.g. Tennessee and Georgia or Idaho and Montana). This a data property that tree-based models should be able to take advantage of (trees create a series of branches based on the input data). Other types of models would perform better with datasets that contain latitude-longitude coordinates within the borders of each state in addition to the border coordinates.

Under- and Overfitting Shown Graphically

Below, a Random Forest Classifier is used to illustrate the decision surfaces of models that are underfit, overfit, and somewhere in between. In these illustrations, Voronoi cells are constructed, each containing several latitude-longitude coordinates and their associated predicted states; only the most likely state is used to color and shade each cell. The color indicates the model's prediction for the cell, and the shading indicates its confidence in that prediction.
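The coloring-and-shading step can be sketched like this, again assuming scikit-learn. The toy training points and the `cell_color` helper are illustrative; the real pipeline would apply the same idea to each Voronoi cell's centroid.

```python
# Hypothetical sketch: pick the most likely state for a cell (its color)
# and use the predicted probability as the shading intensity.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy (latitude, longitude) training points standing in for the real data
X = np.array([[35.0, -90.0], [36.5, -82.0], [35.8, -86.0],
              [40.5, -91.0], [42.3, -88.5], [39.0, -87.6]])
y = np.array(["Tennessee", "Tennessee", "Tennessee",
              "Illinois", "Illinois", "Illinois"])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def cell_color(centroid):
    """Most likely state (hue) and its probability (shading) for one cell."""
    proba = clf.predict_proba([centroid])[0]
    best = int(np.argmax(proba))
    return clf.classes_[best], float(proba[best])

state, confidence = cell_color([35.5, -88.0])
```

The returned probability can be mapped directly to an alpha channel, so cells where the model is uncertain appear washed out on the map.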

A Random Forest Classifier is simply a collection of Decision Tree Classifiers. Tree-based classifiers create a series of branches that act as arguments. For example, if latitude is greater than 36, the point must belong to a state farther north than, or at the same latitude as, Arkansas. These branching arguments can be followed to their termini to arrive at a predicted value. Rules about our coordinate plane (e.g. Tennessee lies above 34 degrees latitude) can be learned by a decision tree and used to predict the state for a given pair of coordinates; the architecture of decision trees is well suited to this data and problem. Given that an individual Decision Tree Classifier should perform reasonably well at predicting a US state from a pair of latitude-longitude coordinates, a Random Forest Classifier, being a collection of such trees, should generalize better than any individual tree.
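The branching described above can be written out as nested conditionals. The thresholds and state choices here are illustrative stand-ins, not values learned from the Census data.

```python
# Hypothetical sketch of the kind of rules a single decision tree might
# learn. Thresholds are illustrative, not learned from the Census data.
def predict_state(latitude, longitude):
    if latitude > 36.5:                # north of Arkansas's northern border
        if longitude > -87.5:
            return "Kentucky"
        return "Missouri"
    else:                              # at or south of that latitude
        if longitude > -84.3:
            return "Georgia"
        return "Tennessee"

print(predict_state(35.8, -86.7))  # → Tennessee
```

A fitted decision tree is exactly this kind of structure, with thresholds chosen to best separate the labels; a random forest averages many such trees, each trained on a different sample of the data.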