A note on how to read the plots. I want you to pay attention to two things:

1. The distance between the true value (the black dashed line) and the average predicted value of each model (a dashed line in that model's color). This distance is the bias (or, when squared, the bias squared) of the model. A large shift from the true value (11) means a large bias.

2. The width of the histograms is the variance of the model. A wider histogram means a larger variance. (A sketch of the simulation behind these histograms follows below.)
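To make the setup concrete, here is a minimal sketch of the kind of simulation that produces these histograms. This is my reconstruction, not the original code: the data-generating process y = 1 + 2x + noise, the query point x0 = 5 (chosen so the true value is 11), the noise scale, and the use of scikit-learn are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
ALPHA, BETA = 1.0, 2.0      # assumed true coefficients
X0 = 5.0                    # assumed query point: 1 + 2*5 = 11, the true value
N, N_SIMS, LAM = 50, 2000, 0.01

preds = {"OLS": [], "Ridge": [], "Lasso": []}
for _ in range(N_SIMS):
    x = rng.uniform(0, 10, N)                   # fresh sample each round
    y = ALPHA + BETA * x + rng.normal(0, 5, N)  # assumed noise scale
    X = x.reshape(-1, 1)
    for name, model in [("OLS", LinearRegression()),
                        ("Ridge", Ridge(alpha=LAM)),
                        ("Lasso", Lasso(alpha=LAM))]:
        preds[name].append(model.fit(X, y).predict([[X0]])[0])

# Histogram preds["OLS"], preds["Ridge"], preds["Lasso"] to reproduce the plots.
```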

λ ~ 0

We start with a very tiny lambda value. This is practically equivalent to having no penalty, so we expect Ridge and Lasso to give the same results as OLS.
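A quick way to check this equivalence, under the same assumed setup (scikit-learn's alpha parameter plays the role of λ here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 5, 50)   # same assumed process as above
X = x.reshape(-1, 1)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1e-8).fit(X, y)                    # lambda ~ 0
lasso = Lasso(alpha=1e-8, max_iter=100_000).fit(X, y)  # lambda ~ 0

# All three slope estimates should agree to several decimal places.
print(ols.coef_[0], ridge.coef_[0], lasso.coef_[0])
```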

The plot gives no surprises. All three distributions overlap, with means around the true value. Notice how spread out the distributions are, though: there is a large variance, with predictions ranging from 9 to 13.

λ = 0.01

With a (very) small penalty it is easy to see regularization in effect. The distributions have shifted to the left (evident from the means). A small bias is observed in Ridge, and a relatively larger one in Lasso. It is not clear whether the variance has changed.

λ = 0.05

At λ = 0.05, Lasso is already too aggressive, with a bias of about 3 units. Ridge stays close to the true value but appears to have the same variance as OLS, so Ridge offers no advantage yet for this data.

λ = 0.1

Almost the same result as above. It is still hard to notice any change in variance.

λ = 0.5

A higher penalty gives some (reasonably) satisfactory clues. The bias of Ridge has increased to close to three units, but its variance is smaller. Lasso has aggressively pushed the coefficient estimate for β to zero, resulting in a very high bias but a small variance.
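You can watch Lasso shrink the slope all the way to zero by sweeping λ, again under the assumed setup. Note that the λ at which the coefficient hits exactly zero depends on the data scale and on how the library scales the penalty (scikit-learn divides the squared loss by 2n), so the threshold in this sketch will not match the article's plots:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 5, 50)
X = x.reshape(-1, 1)

for lam in [0.1, 0.5, 1.0, 5.0, 20.0]:
    beta_hat = Lasso(alpha=lam).fit(X, y).coef_[0]
    print(f"lambda={lam}: beta_hat={beta_hat:.4f}")
# beta_hat shrinks as lambda grows; past some threshold it is exactly 0,
# after which the model predicts y = alpha_hat for every x.
```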

λ = 1 — Some good results!

Here the tradeoff has clearly switched sides. The variance of Ridge is small at the cost of higher bias.

λ = 5

Just to really drive the point home, here is a very large penalty. The variance of Ridge is small at the cost of much higher bias. You will probably never need such large penalties, but the pattern is clear: lower variance comes at the cost of higher bias.

Bias and variance for various regularization values

Repeating the above for a range of regularization values gives a clear picture.

Bias is computed as the distance between the average prediction and the true value: true value minus mean(predictions).

Variance is the average squared deviation from the average prediction: mean((predictions minus mean(predictions))²). (Without the square, the deviations would average out to zero.)
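In code, with predictions collected as in the first sketch, these two definitions look like this (note the square inside the variance):

```python
import numpy as np

def bias_and_variance(predictions, true_value):
    """Bias and variance of a model's sampling distribution of predictions."""
    predictions = np.asarray(predictions)
    mean_pred = predictions.mean()
    bias = true_value - mean_pred                       # signed distance from truth
    variance = ((predictions - mean_pred) ** 2).mean()  # mean squared deviation
    return bias, variance

# e.g. bias_and_variance(preds["Ridge"], 11.0) using preds from the first sketch
```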

The plots tell the same story. OLS has the lowest bias but the highest variance. Ridge shows a smooth shift, while Lasso is constant after around λ = 0.2 (β becomes 0, so the model predicts y = α for all values of x).
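Putting the pieces together, here is a sketch of the sweep that would produce such curves, under the same assumptions as the earlier sketches (data-generating process, query point, and simulation sizes are all mine):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
TRUE_VALUE, X0, N, N_SIMS = 11.0, 5.0, 50, 500
lambdas = np.linspace(0.01, 1.0, 20)

curves = {"Ridge": [], "Lasso": []}
for name, Model in [("Ridge", Ridge), ("Lasso", Lasso)]:
    for lam in lambdas:
        preds = []
        for _ in range(N_SIMS):
            x = rng.uniform(0, 10, N)
            y = 1.0 + 2.0 * x + rng.normal(0, 5, N)
            model = Model(alpha=lam).fit(x.reshape(-1, 1), y)
            preds.append(model.predict([[X0]])[0])
        preds = np.asarray(preds)
        bias = TRUE_VALUE - preds.mean()                 # as defined above
        variance = ((preds - preds.mean()) ** 2).mean()  # as defined above
        curves[name].append((lam, bias ** 2, variance))

# Plot lambda against bias**2 and variance for each model to get the curves.
```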

Ideal distributions

A better choice of data could give us an ideal plot of the sampling distributions of predictions.

The advantage Ridge offers is immediately evident here because of the overlapping distributions. Ridge gives a slightly biased prediction, but will give a closer prediction much more often than OLS. This is the true value of Ridge: a small bias, but more consistent predictions. OLS gives an unbiased result but is not very consistent. This is key: OLS gives an unbiased result on average, not always. And that is the bias and variance tradeoff taking shape in linear models.
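One way to quantify "closer much more often" is to count how frequently each model lands within some tolerance of the truth. A small sketch on the simulated predictions from the first code block (the 1-unit tolerance is an arbitrary choice):

```python
import numpy as np

def hit_rate(predictions, true_value=11.0, tol=1.0):
    """Fraction of simulations whose prediction lands within tol of the truth."""
    predictions = np.asarray(predictions)
    return np.mean(np.abs(predictions - true_value) < tol)

# With a well-chosen lambda, hit_rate(preds["Ridge"]) typically exceeds
# hit_rate(preds["OLS"]): a small bias buys a distribution that sits
# more tightly around the true value.
```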