Machine Learning: Ridge Regression

Ridge regression is a regression technique that is quite similar to unadorned least squares linear regression: simply adding an \(\ell_2\) penalty on the parameters \(\beta\) to the objective function for linear regression yields the objective function for ridge regression.

Our goal is to find an assignment to \(\beta\) that minimizes the function

\[f(\beta) = \|X\beta - Y\|_2^2 + \lambda \|\beta\|_2^2,\]

where \(\lambda\) is a hyperparameter and, as usual, \(X\) is the training data and \(Y\) the observations. In practice, we tune \(\lambda\) until we find a model that generalizes well to the test data.

Ridge regression is an example of a shrinkage method: compared to least squares, it shrinks the parameter estimates in the hopes of reducing variance, improving prediction accuracy, and aiding interpetation.

In this notebook, we show how to fit a ridge regression model using CVXPY, how to evaluate the model, and how to tune the hyper-parameter \(\lambda\).

Writing the objective function

We can decompose the objective function as the sum of a least squares loss function and an \(\ell_2\) regularizer.

Generating data

Because ridge regression encourages the parameter estimates to be small, and as such tends to lead to models with less variance than those fit with vanilla linear regression. We generate a small dataset that will illustrate this.

Fitting the model

All we need to do to fit the model is create a CVXPY problem where the objective is to minimize the the objective function defined above. We make \(\lambda\) a CVXPY parameter, so that we can use a single CVXPY problem to obtain estimates for many values of \(\lambda\).

Evaluating the model

Notice that, up to a point, penalizing the size of the parameters reduces test error at the cost of increasing the training error, trading off higher bias for lower variance; in other words, this indicates that, for our example, a properly tuned ridge regression generalizes better than a least squares linear regression.

../../_images/ridge_regression_9_0.svg

Regularization path

As expected, increasing \(\lambda\) drives the parameters towards \(0\). In a real-world example, those parameters that approach zero slower than others might correspond to the more informative features. It is in this sense that ridge regression can be considered model selection.

../../_images/ridge_regression_11_0.svg