Machine Learning: Ridge Regression
==================================

Ridge regression is a regression technique that is quite similar to
unadorned least squares linear regression: simply adding an
:math:`\ell_2` **penalty** on the parameters :math:`\beta` to the
objective function for linear regression yields the objective function
for ridge regression. Our goal is to find an assignment to
:math:`\beta` that minimizes the function

.. math::

   f(\beta) = \|X\beta - Y\|_2^2 + \lambda \|\beta\|_2^2,

where :math:`\lambda` is a hyperparameter and, as usual, :math:`X` is
the training data and :math:`Y` the observations. In practice, we tune
:math:`\lambda` until we find a model that generalizes well to the test
data.

Ridge regression is an example of a **shrinkage method**: compared to
least squares, it shrinks the parameter estimates in the hopes of
**reducing variance, improving prediction accuracy, and aiding
interpretation**.

In this notebook, we show how to fit a ridge regression model using
CVXPY, how to evaluate the model, and how to tune the hyperparameter
:math:`\lambda`.

.. code:: python

    import cvxpy as cp
    import numpy as np
    import matplotlib.pyplot as plt

Writing the objective function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can decompose the **objective function** as the sum of a **least
squares loss function** and an :math:`\ell_2` **regularizer**.

.. code:: python

    def loss_fn(X, Y, beta):
        return cp.pnorm(X @ beta - Y, p=2)**2

    def regularizer(beta):
        return cp.pnorm(beta, p=2)**2

    def objective_fn(X, Y, beta, lambd):
        return loss_fn(X, Y, beta) + lambd * regularizer(beta)

    def mse(X, Y, beta):
        return (1.0 / X.shape[0]) * loss_fn(X, Y, beta).value

Generating data
~~~~~~~~~~~~~~~

Because ridge regression encourages the parameter estimates to be
small, it tends to produce models with **less variance** than those fit
with vanilla linear regression. We generate a small dataset that
illustrates this.

.. code:: python

    def generate_data(m=100, n=20, sigma=5):
        "Generates data matrix X and observations Y."
        np.random.seed(1)
        beta_star = np.random.randn(n)
        # Generate the data matrix
        X = np.random.randn(m, n)
        # Corrupt the observations with additive Gaussian noise
        Y = X.dot(beta_star) + np.random.normal(0, sigma, size=m)
        return X, Y

    m = 100
    n = 20
    sigma = 5

    X, Y = generate_data(m, n, sigma)
    X_train = X[:50, :]
    Y_train = Y[:50]
    X_test = X[50:, :]
    Y_test = Y[50:]

Fitting the model
~~~~~~~~~~~~~~~~~

All we need to do to fit the model is create a CVXPY problem in which
the objective is to minimize the objective function defined above. We
make :math:`\lambda` a CVXPY parameter, so that we can use a single
CVXPY problem to obtain estimates for many values of :math:`\lambda`.

.. code:: python

    beta = cp.Variable(n)
    lambd = cp.Parameter(nonneg=True)
    problem = cp.Problem(cp.Minimize(objective_fn(X_train, Y_train, beta, lambd)))

    lambd_values = np.logspace(-2, 3, 50)
    train_errors = []
    test_errors = []
    beta_values = []
    for v in lambd_values:
        lambd.value = v
        problem.solve()
        train_errors.append(mse(X_train, Y_train, beta))
        test_errors.append(mse(X_test, Y_test, beta))
        beta_values.append(beta.value)

Evaluating the model
~~~~~~~~~~~~~~~~~~~~

Notice that, up to a point, penalizing the size of the parameters
reduces the test error at the cost of increasing the training error,
trading higher bias for lower variance; in other words, for our
example, a properly tuned ridge regression **generalizes better** than
a least squares linear regression.
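We can confirm this numerically by picking out the value of
:math:`\lambda` with the smallest test error. The snippet below is a
minimal check that uses only the arrays computed in the loop above.

.. code:: python

    # Find the value of lambda with the lowest test error and compare
    # it with the (nearly) unregularized fit at the smallest lambda.
    best_idx = int(np.argmin(test_errors))
    print("best lambda:                ", lambd_values[best_idx])
    print("test MSE at best lambda:    ", test_errors[best_idx])
    print("test MSE at smallest lambda:", test_errors[0])

The plot below shows the same trade-off across the whole range of
:math:`\lambda`.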
.. code:: python

    %matplotlib inline
    %config InlineBackend.figure_format = 'svg'

    def plot_train_test_errors(train_errors, test_errors, lambd_values):
        plt.plot(lambd_values, train_errors, label="Train error")
        plt.plot(lambd_values, test_errors, label="Test error")
        plt.xscale("log")
        plt.legend(loc="upper left")
        plt.xlabel(r"$\lambda$", fontsize=16)
        plt.title("Mean Squared Error (MSE)")
        plt.show()

    plot_train_test_errors(train_errors, test_errors, lambd_values)

.. image:: ridge_regression_files/ridge_regression_9_0.svg

Regularization path
~~~~~~~~~~~~~~~~~~~

As expected, increasing :math:`\lambda` drives the parameters towards
:math:`0`. In a real-world example, the parameters that approach zero
more slowly than the others might correspond to the more
**informative** features. It is in this sense that ridge regression can
be considered a form of **model selection**.

.. code:: python

    def plot_regularization_path(lambd_values, beta_values):
        num_coeffs = len(beta_values[0])
        for i in range(num_coeffs):
            plt.plot(lambd_values, [wi[i] for wi in beta_values])
        plt.xlabel(r"$\lambda$", fontsize=16)
        plt.xscale("log")
        plt.title("Regularization Path")
        plt.show()

    plot_regularization_path(lambd_values, beta_values)

.. image:: ridge_regression_files/ridge_regression_11_0.svg
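A closed-form check
~~~~~~~~~~~~~~~~~~~

For a fixed :math:`\lambda`, the ridge objective is a convex quadratic,
so it also admits the closed-form solution
:math:`\hat{\beta} = (X^T X + \lambda I)^{-1} X^T Y`. The minimal
sketch below (with an arbitrarily chosen value of :math:`\lambda`)
compares this analytical solution against the CVXPY estimate; the two
should agree up to solver tolerance. The advantage of the CVXPY
formulation is that the objective can be modified, for example by
adding constraints or swapping in a different penalty, without
rederiving a solution.

.. code:: python

    # Closed-form ridge solution beta_hat = (X^T X + lambda I)^{-1} X^T Y,
    # compared against the CVXPY estimate for one arbitrarily chosen lambda.
    lambd.value = 1.0
    problem.solve()
    beta_closed_form = np.linalg.solve(
        X_train.T @ X_train + lambd.value * np.eye(n),
        X_train.T @ Y_train,
    )
    # The difference should be on the order of the solver's tolerance.
    print(np.linalg.norm(beta.value - beta_closed_form))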