Six More Reasons for Doubting a Regression Model

There are more than a few reasons for being skeptical about a regression model. Some are easy to identify, others are more subtle. Here are six more reasons you might doubt the validity of a regression model.

Overfitting

Overfitting involves building a statistical model solely by optimizing statistical parameters, and usually involves using a large number of variables and transformations of the variables. The resulting model may fit the data almost perfectly but will produce erroneous results when applied to another sample from the population.

The concern about overfitting may be somewhat overstated. Overfitting is like becoming too muscular from weight training. It doesn’t happen suddenly or simply. If you know what overfitting is, you’re not likely to become a victim. It’s not something that happens in a keystroke. It takes a lot of work fine-tuning variables, transformations, and whatnot. It’s also usually easy to identify overfitting in other people’s models. Simply look for a conglomeration of manual numerical adjustments, mathematical functions, and variable combinations.
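To see what overfitting looks like in numbers, here is a minimal Python sketch. Everything in it is made up for illustration: the data are simulated from a straight line plus noise, and the 12th-degree polynomial stands in for a model loaded with too many terms. The idea is only to compare the error on the data used for fitting with the error on a fresh sample from the same population.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated population (an assumption for illustration): y = 1 + 2x + noise.
x_train = np.linspace(-1.0, 1.0, 15)
y_train = 1.0 + 2.0 * x_train + rng.normal(0.0, 0.3, x_train.size)

# A second sample from the same population, not used for fitting.
x_new = rng.uniform(-1.0, 1.0, 200)
y_new = 1.0 + 2.0 * x_new + rng.normal(0.0, 0.3, x_new.size)


def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))


simple = np.polyfit(x_train, y_train, deg=1)    # straight line
overfit = np.polyfit(x_train, y_train, deg=12)  # 13 coefficients for 15 points

train_simple = mse(simple, x_train, y_train)
train_overfit = mse(overfit, x_train, y_train)
new_simple = mse(simple, x_new, y_new)
new_overfit = mse(overfit, x_new, y_new)

print("fitting sample: line", train_simple, "degree 12", train_overfit)
print("new sample:     line", new_simple, "degree 12", new_overfit)
```

The degree-12 model always fits the original sample at least as well as the line does. The telltale sign of overfitting is the gap between its error on the fitting sample and its error on the new sample.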

Misspecification

Misspecification involves including terms in a model that make the model look great statistically even though the model is problematical. Often, misspecification involves placing the same or very similar variable on both sides of the equation.

Consider this example from economics. A model for the U.S. Gross Domestic Product (GDP) was developed using data on government spending and unemployment from 1947 to 1997. The model:

GDP = (121*Spending) – (3.5*Spending²) + (136*Time) – (61*Unemployment) – 566

had an R-squared value of 0.9994. Such a high R-squared value is a signal that something is amiss. R-squared values that high are usually only seen in models involving equipment calibration, and certainly not in anything involving capricious human behavior. A closer look at the study indicated that the model terms involving spending were an index of the government’s outlays relative to the economy. Usually, indexing a variable to a baseline or standard is a good thing to do. In this case, though, the spending index was the ratio of government outlays to the GDP. Thus, the model was:

GDP = (121*(Outlays/GDP)) – (3.5*(Outlays/GDP)²) + (136*Time) – (61*Unemployment) – 566

GDP appears on both sides of the equation, thus accounting for the near perfect correlation. This is a case in which an index, at least one involving the dependent variable, should not have been used.
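The effect is easy to reproduce with made-up numbers. The Python sketch below does not use the actual GDP data or the coefficients from the model above. It simulates a trending "gdp" series and an "outlays" series that is unrelated to it by construction, then compares the R-squared from the raw outlays with the R-squared from the outlays-to-GDP index. All names and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 51                                   # e.g., one value per year
time = np.arange(n, dtype=float)
gdp = 1000.0 + 80.0 * time + rng.normal(0.0, 100.0, n)  # simulated, not real GDP
outlays = rng.normal(500.0, 50.0, n)     # unrelated to gdp by construction


def r_squared(predictors, y):
    """R-squared of an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(1.0 - resid.var() / y.var())


index = outlays / gdp                    # the dependent variable sneaks in here

r2_raw = r_squared([outlays, outlays ** 2], gdp)
r2_index = r_squared([index, index ** 2], gdp)

print("R-squared using raw outlays:      ", r2_raw)
print("R-squared using outlays/GDP index:", r2_index)
```

The raw outlays carry no information about the simulated GDP, so any jump in R-squared for the index comes entirely from dividing by the dependent variable.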

Another misspecification involves creating a prediction model having independent variables that are more difficult, time-consuming, or expensive to generate than the dependent variable. You might as well just measure the dependent variable when you need to know its value. The same goes for forecasting (prediction of the future) models. If you need to forecast something a year in advance, don’t use predictors that are measured less than a year in advance.

Multicollinearity

Multicollinearity occurs when a model has two or more independent variables that are highly correlated with each other. The consequences are that the model will look fine, but predictions from the model will be erratic. It’s like a football team. The players perform well together but you can’t necessarily tell how good individual players are. The team wins, yet in some situations, the cornerback or offensive tackle will get beat on most every play.

If you have ever tried to use independent variables that add to a constant, you’ve seen multicollinearity in action. In the case of perfect correlations, such as these, statistical software will fail, or will drop one of the variables, because it won’t be able to perform the matrix mathemagics of regression. Most instances of multicollinearity involve weaker correlations that allow statistical software to function, yet the predictions of the model will still be erratic.

Multicollinearity occurs often in the social sciences and other fields of study in which many variables are measured in the process of model building. Diagnosis of the problem is simple if you have access to the data. Look at correlations between the independent variables. You can also look at the variance inflation factors (VIFs). The VIF for an independent variable is the reciprocal of one minus the R-squared value obtained by regressing that variable on all the other independent variables. VIFs are measures of how much the variances of the model’s coefficients are inflated because of multicollinearity. The VIF for a variable should be less than 10 and ideally near 1.
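If your software doesn’t report VIFs, they are not hard to calculate. Here is a minimal Python sketch, assuming the usual definition of VIF = 1 / (1 – R-squared), where the R-squared comes from regressing one independent variable on all the others. The function name vif and the simulated variables x1, x2, and x3 are made up for illustration.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)              # unrelated to the other two

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this simulated example, x1 and x2 should show VIFs far above 10 while x3 should sit near 1.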

If you suspect multicollinearity, don’t worry about the model but don’t believe any of the predictions.

Heteroscedasticity

Regression, like practically all parametric statistics, requires that the variance of the model residuals be equal across the range of the predicted values. This assumption is called equal variances, homogeneity of variances, or coolest of all, homoscedasticity. Violate the assumption and you have heteroscedasticity.

Heteroscedasticity is assessed much more commonly in analysis of variance models than in regression models. This is probably because the independent variables in ANOVA are measured on a categorical scale, which provides ready-made groups for comparing variances, while the independent variables in regression are measured on a continuous scale. The workaround for regression is fairly simple. Break the scale of the predicted values (or of an independent variable) into intervals, like in a histogram, and calculate the variance of the residuals for each interval. The variances don’t have to be precisely equal, but variances different by a factor of five are problematical. Unequal variances will wreak havoc on any tests or confidence limits calculated for model predictions.
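The interval check can be sketched in a few lines of Python. This is one way to do it, not the only way. It assumes simulated data in which the scatter grows with x, sorts the residuals by predicted value, splits them into five equal-count intervals (the number five is arbitrary), and compares the largest and smallest residual variances.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data (an assumption): the noise standard deviation grows with x.
x = np.linspace(1.0, 10.0, 300)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)

# Fit a straight line and calculate the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
resid = y - fitted

# Sort residuals by predicted value and split into five intervals.
order = np.argsort(fitted)
intervals = np.array_split(resid[order], 5)
variances = [float(chunk.var(ddof=1)) for chunk in intervals]

ratio = max(variances) / min(variances)
print("residual variance by interval:", variances)
print("largest / smallest:", ratio)
```

A ratio well above five, as this simulated example is built to produce, is the warning sign described above.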

Autocorrelation

Autocorrelation involves a variable being correlated with itself. It is the correlation between data points and the data points listed a fixed number of positions earlier (the number of positions is termed the lag). Usually, autocorrelation involves time-series data or spatial data, but it can also involve the order in which data are collected. The terms autocorrelation and serial correlation are often used interchangeably. If the data points are collected at a constant time interval, the term autocorrelation is more typically used.

If the residuals of a model are autocorrelated, it’s a sure bet that the variances will also be unequal. That means, again, that tests or confidence limits calculated from variances should be suspect.

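To check the residuals from a model for autocorrelation, you can conduct a Durbin-Watson test, which looks for correlation between consecutive residuals (a lag of one). The Durbin-Watson test statistic ranges from 0 to 4. If the statistic is close to 2.0, then serial correlation is not a problem. Values approaching 0 indicate positive autocorrelation and values approaching 4 indicate negative autocorrelation. Most statistical software will allow you to conduct this test as part of a regression analysis.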
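The Durbin-Watson statistic itself is simple to calculate: it is the sum of the squared differences between consecutive residuals divided by the sum of the squared residuals. Here is a minimal Python sketch. The function name and the simulated residual series are made up for illustration, and the sketch calculates only the statistic, not the significance test.

```python
import numpy as np


def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


rng = np.random.default_rng(4)

# Residuals with no serial correlation (simulated white noise).
white = rng.normal(size=500)

# Residuals with strong positive serial correlation (simulated AR(1) series).
shocks = rng.normal(size=500)
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.9 * ar1[t - 1] + shocks[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # negative autocorrelation

print("white noise:", dw_white)
print("AR(1):      ", dw_ar1)
print("alternating:", dw_alternating)
```

The white-noise residuals should land near 2, the positively autocorrelated series well below 2, and the alternating series above 2.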

Weighting

Most software that calculates regression parameters also allows you to weight the data points. You might want to do this for several reasons. Weighting is used to make more reliable or relevant data points more important in model building. It’s also used when each data point represents more than one value. The issue with weighting is that it will change the degrees of freedom, and hence, the results of statistical tests. Usually this is OK, a necessary change to accommodate the realities of the model. However, if you ever come upon a weighted least squares regression model in which the weightings are arbitrary, perhaps done by an analyst who doesn’t understand the consequence, don’t believe the test results.
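Here is a minimal Python sketch of why arbitrary weights matter. It assumes the software treats the weights as frequencies, so that the degrees of freedom are the sum of the weights minus the number of coefficients. Not all software does this, so check your documentation. The function name and the data are made up for illustration.

```python
import numpy as np


def wls_frequency(x, y, w):
    """Weighted least squares for a straight line, treating the weights as
    frequency counts (an assumption; software conventions differ)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    X = np.column_stack([np.ones(x.size), x])
    xtwx = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(xtwx, X.T @ (w * y))
    resid = y - X @ beta
    df = w.sum() - X.shape[1]            # degrees of freedom depend on the weights
    sigma2 = np.sum(w * resid ** 2) / df
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(xtwx)))
    return beta, se, df


rng = np.random.default_rng(5)
x = np.arange(10, dtype=float)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)

beta_1, se_1, df_1 = wls_frequency(x, y, np.ones(10))           # every weight is 1
beta_10, se_10, df_10 = wls_frequency(x, y, 10.0 * np.ones(10))  # every weight is 10

print("coefficients:   ", beta_1, beta_10)
print("standard errors:", se_1, se_10)
print("degrees of freedom:", df_1, df_10)
```

Multiplying every weight by ten leaves the coefficients unchanged but shrinks the standard errors, so the same ten data points suddenly look far more convincing in any statistical test.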

No Doubts

So, there are six more reasons for doubting a regression model. These are a bit more sophisticated than the last five reasons, and though they might appear less often, they are still good reasons for doubting a regression model. You just have to be able to diagnose and treat the regression maladies. But that is a topic for another time.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.



