This short post provides an overview of my typical workflow when working with regression models. I think it is helpful to start with a simple example first (i.e. bivariate OLS) and work toward more complex examples. The syntax for OLS is a bit easier to understand, yet model construction translates well to more complex forms. In this example I’ll use one of the datasets from the ggplot2 library, mpg .

Exploratory data analysis The mpg dataset contains a number of variables including the manufacturer ( mpg$manufactur ), engine displacement ( mpg$displ ), number of cylinders ( mpg$cyl ), miles per gallon when driving in the city ( mpg$cty ), and miles per gallon when driving on the highway ( mpg$hwy ). We can get some information about the dataset with the following: library(ggplot2) ## view the contents mpg ## # A tibble: 234 x 11 ## manufacturer model displ year cyl trans drv cty hwy fl class ## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> ## 1 audi a4 1.8 1999 4 auto(l… f 18 29 p comp… ## 2 audi a4 1.8 1999 4 manual… f 21 29 p comp… ## 3 audi a4 2 2008 4 manual… f 20 31 p comp… ## 4 audi a4 2 2008 4 auto(a… f 21 30 p comp… ## 5 audi a4 2.8 1999 6 auto(l… f 16 26 p comp… ## 6 audi a4 2.8 1999 6 manual… f 18 26 p comp… ## 7 audi a4 3.1 2008 6 auto(a… f 18 27 p comp… ## 8 audi a4 quat… 1.8 1999 4 manual… 4 18 26 p comp… ## 9 audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p comp… ## 10 audi a4 quat… 2 2008 4 manual… 4 20 28 p comp… ## # … with 224 more rows Or, for a little more flexibility we can use the datatable function from the DT library to create an interactive table: library(DT) datatable(mpg) If we’re interested in only a few of these components, we can use some of the functions below. The output is hidden here since much of it is redundant. ## number of rows nrow(mpg) ## number of columns ncol(mpg) ## names of columns names(mpg) ## view metadata, only useful for a built-in dataset though ?mpg ## view the unique values of a column, useful for categorical data unique(mpg$manufacturer)

OLS Suppose we are interested in the relationship between engine displacement ( mpg$displ ) and miles per gallon when driving on the highway ( mpg$hwy ). We expect a negative relationship and we can explore this with the base plot function: plot(mpg$displ, mpg$hwy) Or for a more sophisticated plot that is perhaps a bit more publication worthy: ggplot(data = mpg, aes(displ, hwy)) + geom_point() I think this is a good example for a polynomial regression problem because a linear function will work pretty well, but a polynomial function will likely be better. Of course, when conducting regression it can be useful to explore characteristics of the individual variables: ## set plot parameters; one row and two column (to get figures side by side) par(mfrow = c(1, 2)) hist(mpg$displ) hist(mpg$hwy) …and a more sophisticated plot: library(gridExtra) ## engine displacment histogram displ.hist <- ggplot(data = mpg, aes(displ)) + geom_histogram(bins = 10) ## highway miles per gallon histogram hwy.hist <- ggplot(data = mpg, aes(hwy)) + geom_histogram(bins = 10) ## place plots side by side grid.arrange(displ.hist, hwy.hist, nrow = 1) Clearly the x variable may pose problems, but let’s move on. To actually complete the linear regression model, we can use the lm function which is a base R function. In the lm call above, I like to think of replacing the ~ operator with the phrase “as a function of…” In this case, hwy is a function of displ . ## we don't have to put the result in a variable but it gives us access to more information. ols.model.1 <- lm(hwy ~ displ, data = mpg) ## view the results ols.model.1 ## ## Call: ## lm(formula = hwy ~ displ, data = mpg) ## ## Coefficients: ## (Intercept) displ ## 35.698 -3.531 ## get some more detailed information summary(ols.model.1) ## ## Call: ## lm(formula = hwy ~ displ, data = mpg) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.1039 -2.1646 -0.2242 2.0589 15.0105 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 35.6977 0.7204 49.55 <0.0000000000000002 *** ## displ -3.5306 0.1945 -18.15 <0.0000000000000002 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.836 on 232 degrees of freedom ## Multiple R-squared: 0.5868, Adjusted R-squared: 0.585 ## F-statistic: 329.5 on 1 and 232 DF, p-value: < 0.00000000000000022 Use graphical measures to evaluate the model: plot(ols.model.1) Plot the actual vs. fitted values: plot(mpg$displ, mpg$hwy) abline(ols.model.1) We can evaluate the assumption of residual normality with graphical measures… hist(ols.model.1$residuals) …and/or with a test shapiro.test(ols.model.1$residuals) ## ## Shapiro-Wilk normality test ## ## data: ols.model.1$residuals ## W = 0.94125, p-value = 0.00000004343 Adding more variables to the regression model is easy; just use the + operator in the function call. Here, highway miles per gallon is a function of engine displacement ( displ ), whether or not the car is an SUV (manually created dummay variable, suv ), and number of cylinders ( cyl ). ## create a dummy variable for SUVs mpg$suv <- 0 mpg[mpg$class == "suv",]$suv <- 1 ## new model ols.model.2 <- lm(hwy ~ displ + suv + cyl, data = mpg,) There are different ways to evaluate VIF, but the car library has a straightforward function: vif . library(car) ## some significant multicollinearity here vif(ols.model.2) ## displ suv cyl ## 7.918850 1.273033 7.464803