# Machine Learning: Ridge Regression

Ridge regression is a regression technique that is quite similar to
unadorned least squares linear regression: simply adding an
\(\ell_2\) **penalty** on the parameters \(\beta\) to the
objective function for linear regression yields the objective function
for ridge regression.

Our goal is to find an assignment to \(\beta\) that minimizes the function

\[
f(\beta) = \lVert X\beta - Y\rVert_2^2 + \lambda \lVert \beta \rVert_2^2,
\]

where \(\lambda \geq 0\) is a hyperparameter and, as usual, \(X\) is the training data and \(Y\) the observations. In practice, we tune \(\lambda\) until we find a model that generalizes well to the test data.

Ridge regression is an example of a **shrinkage method**: compared to
least squares, it shrinks the parameter estimates in the hopes of
**reducing variance, improving prediction accuracy, and aiding
interpretation**.

In this notebook, we show how to fit a ridge regression model using CVXPY, how to evaluate the model, and how to tune the hyperparameter \(\lambda\).

## Writing the objective function

We can decompose the **objective function** as the sum of a **least
squares loss function** and an \(\ell_2\) **regularizer**.

## Generating data

Because ridge regression encourages the parameter estimates to be small,
it tends to lead to models with **less variance** than those fit with
vanilla linear regression. We generate a small dataset that will
illustrate this.
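
A minimal sketch of such a dataset, assuming NumPy; the sizes (100 samples, 20 features), the noise level, and the 50/50 train/test split are arbitrary illustrative choices:

```python
import numpy as np

def generate_data(m=100, n=20, sigma=5.0):
    """Generate a data matrix X and noisy observations Y = X @ beta_star + noise."""
    np.random.seed(1)
    beta_star = np.random.randn(n)          # ground-truth parameters
    X = np.random.randn(m, n)               # training data
    Y = X @ beta_star + np.random.normal(0, sigma, size=m)  # noisy observations
    return X, Y, beta_star

X, Y, beta_star = generate_data()
# Hold out half of the data for evaluating the fitted model
X_train, Y_train = X[:50], Y[:50]
X_test, Y_test = X[50:], Y[50:]
```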

## Fitting the model

All we need to do to fit the model is create a CVXPY problem in which the objective is to minimize the objective function defined above. We make \(\lambda\) a CVXPY parameter, so that we can use a single CVXPY problem to obtain estimates for many values of \(\lambda\).

## Evaluating the model

Notice that, up to a point, penalizing the size of the parameters
reduces test error at the cost of increasing the training error, trading
higher bias for lower variance. In other words, for our example, a
properly tuned ridge regression **generalizes better** than a least
squares linear regression.

## Regularization path

As expected, increasing \(\lambda\) drives the parameters towards
\(0\). In a real-world example, those parameters that approach zero
more slowly than others might correspond to the more **informative**
features. It is in this sense that ridge regression can be considered
a form of **model selection**.