Model Fitting vs Model Selection

The difference between model fitting and model selection is often a cause of confusion. Model fitting proceeds by assuming a particular model is true, and tuning the model so it provides the best possible fit to the data. Model selection, on the other hand, asks the larger question of whether the assumptions of the model are compatible with the data.

Let's make this more concrete. By model here I essentially mean a formula, usually with tunable parameters, which quantifies the likelihood of observing your data. For example, your model might consist of the statement, "the $(x, y)$ observations come from a straight line, with known normal measurement errors $\sigma_y$". Labeling this model $M_1$, we can write:

$$ \begin{aligned} y_{M_1}(x;\theta) &= \theta_0 + \theta_1 x\\ y &\sim \mathcal{N}(y_{M_1}, \sigma_y^2) \end{aligned} $$

where the second line indicates that the observed $y$ is normally distributed about the model value, with variance $\sigma_y^2$. There are two tunable parameters to this model, represented by the vector $\theta = [\theta_0, \theta_1]$ (i.e. the intercept and slope).

Another model might consist of the statement "the observations $(x, y)$ come from a quadratic curve, with known normal measurement errors $\sigma_y$". Labeling this model $M_2$, we can write:

$$ \begin{aligned} y_{M_2}(x;\theta) &= \theta_0 + \theta_1 x + \theta_2 x^2\\ y &\sim \mathcal{N}(y_{M_2}, \sigma_y^2) \end{aligned} $$

There are three tunable parameters here, again represented by the vector $\theta$.
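Both models are polynomials in $x$ with Gaussian errors, so one pair of functions can express either of them. The following is a minimal sketch, assuming NumPy. The names `polynomial_model` and `log_likelihood` are illustrative choices rather than part of any library, and the three data points are made up purely to exercise the functions.

```python
import numpy as np


def polynomial_model(theta, x):
    """Evaluate theta[0] + theta[1] * x + theta[2] * x**2 + ...

    A length-2 theta gives the straight line M1;
    a length-3 theta gives the quadratic M2.
    """
    x = np.asarray(x, dtype=float)
    return sum(t * x ** n for n, t in enumerate(theta))


def log_likelihood(theta, x, y, sigma_y):
    """Gaussian log-likelihood of the observed y given parameters theta."""
    y = np.asarray(y, dtype=float)
    sigma_y = np.asarray(sigma_y, dtype=float)
    residual = y - polynomial_model(theta, x)
    return -0.5 * np.sum(np.log(2 * np.pi * sigma_y ** 2)
                         + (residual / sigma_y) ** 2)


# Hypothetical points lying exactly on y = 1 + 2x, with error 0.5 on each
x_demo = np.array([0.0, 1.0, 2.0])
y_demo = np.array([1.0, 3.0, 5.0])
sigma_demo = 0.5

logL_M1 = log_likelihood([1.0, 2.0], x_demo, y_demo, sigma_demo)
logL_M2 = log_likelihood([1.0, 2.0, 0.0], x_demo, y_demo, sigma_demo)
print(logL_M1, logL_M2)
```

Setting $\theta_2 = 0$ in $M_2$ reproduces $M_1$ exactly, which is why the two printed values agree: the line is a special case of the quadratic.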

Model fitting, in this case, is the process of finding constraints on the values of the parameters $\theta$ within each model. That is, it allows you to make statements such as, "assuming $M_1$ is true, this particular $\theta$ gives the best-fit line" or "assuming $M_2$ is true, this particular vector $\theta$ gives the best-fit curve." Model fitting proceeds without regard to whether the model is capable of describing the data well; it just arrives at the best-fit parameters under the assumption that the model is accurate.
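With known Gaussian errors, the best-fit $\theta$ is the one that minimizes $\chi^2$, and for polynomial models that is a weighted linear least-squares problem. Here is a minimal sketch of fitting both models, assuming NumPy. The helper `best_fit_theta` and the synthetic data set are hypothetical, introduced only for illustration.

```python
import numpy as np


def best_fit_theta(x, y, sigma_y, degree):
    """Maximum-likelihood coefficients [theta_0, ..., theta_degree]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sigma_y = np.broadcast_to(np.asarray(sigma_y, dtype=float), y.shape)
    # Design matrix with columns 1, x, x**2, ...
    design = np.vander(x, degree + 1, increasing=True)
    # Dividing each row by its error turns ordinary least squares
    # into chi-squared minimization.
    theta, *_ = np.linalg.lstsq(design / sigma_y[:, None],
                                y / sigma_y, rcond=None)
    return theta


def chi2(theta, x, y, sigma_y):
    """Sum of squared, error-normalized residuals for a polynomial fit."""
    design = np.vander(np.asarray(x, dtype=float), len(theta), increasing=True)
    return np.sum(((np.asarray(y) - design @ theta) / sigma_y) ** 2)


# Hypothetical synthetic data: a straight line plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
sigma_y = 0.1
y = 1.0 + 2.0 * x + rng.normal(0, sigma_y, x.size)

theta_M1 = best_fit_theta(x, y, sigma_y, degree=1)  # assumes M1 is true
theta_M2 = best_fit_theta(x, y, sigma_y, degree=2)  # assumes M2 is true
print("M1 best fit:", theta_M1, "chi2:", chi2(theta_M1, x, y, sigma_y))
print("M2 best fit:", theta_M2, "chi2:", chi2(theta_M2, x, y, sigma_y))
```

Each call returns a best-fit vector whether or not its model suits the data. Nothing in the fitting step itself says which of the two models should be preferred.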

Model selection, on the other hand, is not concerned with the parameters themselves, but with the question of whether the model is capable of describing the data well. That is, it allows you to say, "for my data, a line ($M_1$) provides a better description than a quadratic curve ($M_2$)".

To see the difference in practice, let's introduce some data.