In [1]:
from lec_utils import *
import lec20_util as util
def show_cv_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTydTrLDr-y4nxQu1OMsaoqO5EnPEISz2VYmM6pd83ke8YnnTBJlp40NfNLI1HMgoaKx6GBKXYE4UcA/embed?start=false&loop=false&delayms=60000&rm=minimal"
    display(IFrame(src, width=900, height=361))
import warnings
warnings.simplefilter('ignore')

Lecture 20¶

Cross-Validation and Regularization¶

EECS 398-003: Practical Data Science, Fall 2024¶

practicaldsc.org • github.com/practicaldsc/fa24

Announcements 📣¶

  • Homework 9 is due on Monday, November 11th.
    We're adding more office hours tomorrow and on Monday.
  • The Portfolio Homework has been released! Read all about it here. It has two due dates:
    • A checkpoint (worth 15 points / 100) is due on Monday, November 25th (no slip days!).
    • The full homework is due on Saturday, December 7th (no slip days!).
  • Homework 8 solutions can be found in #282 on Ed, and Homework 7 solutions can be found in #259 on Ed.
  • Please help spread the word about this class by submitting an anonymous testimonial here 🙏.
    We'll share some of the responses from this form at practicaldsc.org/next and in emails/posts advertising the class to other students.

Come say hi tonight!¶

A few other professors and I are hosting a faculty-student panel, where you can learn more about our career (and personal) paths. Come say hi – there will be pizza 🍕!


RSVP here.

Agenda¶

  • Recap: Generalization, hyperparameters, and train-test splits.
  • Cross-validation.
  • Grid search.
  • Regularization.

Additional resources¶

  • Take a look at mlu-explain.github.io, a site with interactive explanations for a lot of core machine learning topics, like:
    • Linear Regression.
    • The Bias-Variance Tradeoff.
    • Train, Test, and Validation Sets.
    • Cross-Validation.
    • and other ideas we'll see later in the semester!
  • We've linked these articles in the Resources tab of the course website, too.

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

Recap: Generalization, hyperparameters, and train-test splits¶


As we increase the complexity (e.g., polynomial degree) of our models, their training error decreases.

Here, polynomial degree is a hyperparameter – something we get to choose before fitting the model to the data. We're trying to determine how to choose the "best" hyperparameter.
But, as we increase complexity, our models overfit to their training data, and can fail to generalize to unseen test data.

Generalization¶

  • We care about how well our models generalize to unseen data.
  • The more complex a model is, the more it will overfit to the noise in the training data, and have high model variance.
  • The less complex a model is, the more it will underfit the training data, and have high bias.
  • To navigate this bias-variance tradeoff, we choose model complexity by choosing the model with the lowest error on an unseen test set.

Conducting train-test splits¶

  • Recall, training data is used to fit our model, and test data is used to evaluate our model.
  • Question: How should we split?
    • sklearn's train_test_split splits randomly, which usually works well.
    • However, if there is some element of time in the training data (say, when predicting the future price of a stock), a better split is "past" and "future" (see the sketch after this list).
  • Question: How large should the split be, e.g. 90%-10% vs. 75%-25%?
    • There's a tradeoff – a larger training set should lead to a "better" model, while a larger test set should lead to a better estimate of our model's ability to generalize.
    • There's no "right" choice, but we usually choose between 10% to 25% for the test set.
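  • To make the two strategies concrete, here's a minimal sketch. The DataFrame data and its 'date' column are hypothetical placeholders, not one of this lecture's datasets.

from sklearn.model_selection import train_test_split

# `data` is a hypothetical DataFrame with a 'date' column.
# Random split: reasonable when the rows are exchangeable.
train, test = train_test_split(data, test_size=0.25, random_state=23)

# Chronological split: when there's an element of time, train on the "past"
# and evaluate on the "future" so no future information leaks into training.
data_sorted = data.sort_values('date')
cutoff = int(len(data_sorted) * 0.75)
train_past, test_future = data_sorted.iloc[:cutoff], data_sorted.iloc[cutoff:]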

But wait...¶

  • With our current strategy, we are choosing the hyperparameter that creates the model that performs best on the test set.
  • As such, we are overfitting to the test set – the best hyperparameter for the test set might not be the best hyperparameter for a totally unseen dataset!
  • It seems like we need another split.
  • Next, we'll cover the more robust solution to the problem of selecting hyperparameters: cross-validation.

Cross-validation¶


Idea: A single validation set¶

  1. Split the data into three sets: training, validation, and test.
  2. For each hyperparameter choice, train the model only on the training set, and evaluate the model's performance on the validation set.
  3. Find the hyperparameter with the best validation performance.
  4. Retrain the final model on the training and validation sets, and report its performance on the test set.
  • Issue: This strategy is too dependent on the validation set, which may be small and/or not a representative sample of the data. We're not going to do this. ❌

A better idea: $k$-fold cross-validation¶

  • Instead of relying on a single validation set, we can create $k$ validation sets, where $k$ is some positive integer (5 in the example below).
  • Since each data point is used for training $k-1$ times and validation once, the (averaged) validation performance should be a good metric of a model's ability to generalize to unseen data.
  • $k$-fold cross-validation (or simply "cross-validation") is the technique we will use for finding hyperparameters, or more generally, for choosing between different possible models. It's what you should use in your Portfolio Homework! ✅

Illustrating $k$-fold cross-validation¶

  • To illustrate $k$-fold cross-validation, let's use the following example dataset with $n = 12$ rows.
    Suppose this dataset represents our training set, i.e. suppose we already performed a train-test split.
In [2]:
training_data = pd.DataFrame().assign(x=range(0, 120, 10),
                                      y=[9, 1, 58, 3, 6, 4, -2, 8, 1, 10, 1.1, -45])        
display_df(training_data, rows=12)
x y
0 0 9.0
1 10 1.0
2 20 58.0
3 30 3.0
4 40 6.0
5 50 4.0
6 60 -2.0
7 70 8.0
8 80 1.0
9 90 10.0
10 100 1.1
11 110 -45.0
  • Suppose we choose $k = 4$. Then, each fold has $\frac{12}{4} = 3$ rows.
In [3]:
show_cv_slides()

$k$-fold cross-validation, in general¶

  • First, shuffle the entire training set randomly and split it into $k$ disjoint folds, or "slices". Then:
  • For each hyperparameter:
    • For each slice:
      • Let the slice be the "validation set", $V$.
      • Let the rest of the data be the "training set", $T$.
      • Train a model using the selected hyperparameter on the training set $T$.
      • Evaluate the model on the validation set $V$.
    • Compute the average validation error (e.g. MSE) for the particular hyperparameter.
  • Choose the hyperparameter with the lowest average validation error. (A sketch of this procedure, written out by hand, appears below.)
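  • As a sketch only (we'll use sklearn's built-in helpers in a moment), here's the procedure above written out by hand with KFold. X_train and y_train are assumed to come from an earlier train-test split, and the degree grid mirrors the polynomial example that follows.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X_train and y_train are assumed to come from an earlier train-test split.
kf = KFold(n_splits=5, shuffle=True, random_state=23)   # Shuffle, then make k disjoint folds.
avg_errs = {}
for d in range(1, 26):                                   # For each hyperparameter...
    fold_errs = []
    for train_idx, val_idx in kf.split(X_train):         # ...and for each slice:
        pl = make_pipeline(PolynomialFeatures(d), LinearRegression())
        pl.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])    # Train on T.
        preds = pl.predict(X_train.iloc[val_idx])                   # Evaluate on V.
        fold_errs.append(mean_squared_error(y_train.iloc[val_idx], preds))
    avg_errs[d] = np.mean(fold_errs)                     # Average validation MSE for degree d.
best_degree = min(avg_errs, key=avg_errs.get)            # Lowest average validation error.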

Loading the data¶

  • To illustrate, let's continue using the roughly-cubic dataset titled "Sample 1" from the past few lectures.
In [4]:
np.random.seed(23) # For reproducibility.
def sample_from_pop(n=100):
    x = np.linspace(-2, 3, n)
    y = x ** 3 + (np.random.normal(0, 3, size=n))
    return pd.DataFrame({'x': x, 'y': y})
sample_1 = sample_from_pop()
In [5]:
px.scatter(sample_1, x='x', y='y', title='Sample 1')
  • We'll look at how to use $k$-fold cross-validation to choose a polynomial degree that best generalizes to unseen data.
  • Before doing cross-validation, we must always perform a train-test split.
In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(sample_1[['x']], sample_1['y'], random_state=23)

$k$-fold cross-validation in sklearn¶

  • The cross_val_score function in sklearn implements $k$-fold cross-validation for us! Its general form is:

cross_val_score(estimator, X_train, y_train, cv)
In [7]:
from sklearn.model_selection import cross_val_score
  • Let's perform $k$-fold cross-validation in order to help us pick a degree for polynomial regression from the list [1, 2, ..., 25].
  • We'll use $k=5$ since it's a common choice (and the default in sklearn).
In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
In [9]:
errs_df = pd.DataFrame()
for d in range(1, 26):
    pl = make_pipeline(PolynomialFeatures(d), LinearRegression())
    # The `scoring` argument is used to specify that we want to compute the MSE; 
    # the default is R^2. It's called "neg" MSE because, 
    # by default, sklearn likes to "maximize" scores, and maximizing -MSE is the same
    # as minimizing MSE.
    errs = cross_val_score(pl, X_train, y_train, 
                           cv=5, scoring='neg_mean_squared_error')
    errs_df[f'Deg {d}'] = -errs # Negate to turn positive (sklearn computed negative MSE).
errs_df.index = [f'Fold {i}' for i in range(1, 6)]
errs_df.index.name = 'Validation Fold'

$k$-fold cross-validation in sklearn¶

  • Note that for each choice of degree (our hyperparameter), we have five MSEs, one for each "fold" of the data. This means that in total, $5 \cdot 25 = 125$ models were trained!
In [10]:
errs_df
Out[10]:
Deg 1 Deg 2 Deg 3 Deg 4 ... Deg 22 Deg 23 Deg 24 Deg 25
Validation Fold
Fold 1 24.36 14.06 7.45 7.62 ... 40.48 69.03 43.38 220.49
Fold 2 31.50 19.17 7.05 10.85 ... 796.97 1446.11 2208.78 5666.03
Fold 3 15.64 11.78 9.39 9.48 ... 61.79 80.64 474.69 4620.71
Fold 4 25.95 17.57 9.47 9.46 ... 12.24 12.02 11.86 15.02
Fold 5 11.12 4.74 4.44 4.81 ... 8.85 8.54 8.65 8.27

5 rows × 25 columns

  • Remember, our goal is to choose the degree with the lowest average validation error.
In [11]:
errs_df.mean(axis=0)
Out[11]:
Deg 1       21.71
Deg 2       13.46
Deg 3        7.56
           ...   
Deg 23     323.27
Deg 24     549.47
Deg 25    2106.10
Length: 25, dtype: float64
In [12]:
fig = errs_df.mean(axis=0).iloc[:18].plot(kind='line', title='Average Validation Error')
fig.update_layout(xaxis_title='Degree', yaxis_title='Average Validation MSE', showlegend=False)
In [13]:
errs_df.mean(axis=0).idxmin()
Out[13]:
'Deg 3'
  • Note that if we hadn't performed $k$-fold cross-validation, but had instead just used a single validation set, we might have ended up with a different result:
In [14]:
errs_df.idxmin(axis=1)
Out[14]:
Validation Fold
Fold 1     Deg 9
Fold 2     Deg 3
Fold 3     Deg 3
Fold 4    Deg 10
Fold 5     Deg 3
dtype: object

Question 🤔 (Answer at practicaldsc.org/q)

  • Suppose you have a training dataset with 1000 rows.
  • You want to decide between 20 hyperparameters for a particular model.
  • To do so, you perform 10-fold cross-validation.
  • How many times is the first row in the training dataset (X.iloc[0]) used for training a model?

Grid search¶


An easier approach: GridSearchCV¶

  • Instead of having to for-loop over possible hyperparameter values, we can let sklearn do the hard work for us, using GridSearchCV.
  • GridSearchCV takes in:

    • an un-fit instance of an estimator, and
    • a dictionary of hyperparameter values to try,

    and performs $k$-fold cross-validation to find the combination of hyperparameters with the best average validation performance.

In [15]:
from sklearn.model_selection import GridSearchCV
  • Why do you think it's called "grid search"?

Grid searching for the best polynomial degree¶

  • Let's once again aim to fit a polynomial regression model to X_train and y_train, taken from sample_1.
  • Here, we want to try values of degree from 1 through 25, so we'll need to specify these values in a dictionary.
In [16]:
# The key names in this dictionary are chosen very carefully.
# They need to be of the format pipelinestep__hyperparametername,
# where pipelinestep is a lowercase version of the name of the step in the
# pipeline that we want to tune, and
# hyperparametername is the formal name of the hyperparameter (see the documentation).
hyperparams = {
    'polynomialfeatures__degree': range(1, 26)
}
In [17]:
searcher = GridSearchCV(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    param_grid=hyperparams,
    cv=5, # k = 5.
    scoring='neg_mean_squared_error'
)
searcher
Out[17]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('linearregression',
                                        LinearRegression())]),
             param_grid={'polynomialfeatures__degree': range(1, 26)},
             scoring='neg_mean_squared_error')
  • Like any other estimator, GridSearchCV instances need to be fit.
In [18]:
searcher.fit(X_train, y_train)
Out[18]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('linearregression',
                                        LinearRegression())]),
             param_grid={'polynomialfeatures__degree': range(1, 26)},
             scoring='neg_mean_squared_error')
  • Once fit, searcher can tell us what it found!
In [19]:
searcher.best_params_
Out[19]:
{'polynomialfeatures__degree': 3}
In [20]:
-pd.DataFrame(np.vstack([searcher.cv_results_[f'split{i}_test_score'] for i in range(5)]))
Out[20]:
0 1 2 3 ... 21 22 23 24
0 24.36 14.06 7.45 7.62 ... 40.48 69.03 43.38 220.49
1 31.50 19.17 7.05 10.85 ... 796.97 1446.11 2208.78 5666.03
2 15.64 11.78 9.39 9.48 ... 61.79 80.64 474.69 4620.71
3 25.95 17.57 9.47 9.46 ... 12.24 12.02 11.86 15.02
4 11.12 4.74 4.44 4.81 ... 8.85 8.54 8.65 8.27

5 rows × 25 columns

  • searcher is now a fit regression model – there's no need to refit it on the entire training set, since it already did that.
In [21]:
searcher.predict([[4], 
                  [-1], 
                  [0]])
Out[21]:
array([64.1 , -1.28,  0.28])

Another example: Commute times¶

  • We can also use $k$-fold cross-validation to determine which subset of features to use in a linear model that predicts commute times!
In [22]:
np.random.seed(23)
df = pd.read_csv('data/commute-times.csv')
df['day_of_month'] = pd.to_datetime(df['date']).dt.day
df['month'] = pd.to_datetime(df['date']).dt.month_name()
df.head()
Out[22]:
date day home_departure_time home_departure_mileage ... work_departure_time_hr mileage_to_home day_of_month month
0 5/15/2023 Mon 2023-05-15 10:49:00 15873.0 ... 17.17 53.0 15 May
1 5/16/2023 Tue 2023-05-16 07:45:00 15979.0 ... NaN NaN 16 May
2 5/22/2023 Mon 2023-05-22 08:27:00 50407.0 ... 15.90 54.0 22 May
3 5/23/2023 Tue 2023-05-23 07:08:00 50535.0 ... NaN NaN 23 May
4 5/30/2023 Tue 2023-05-30 09:09:00 50664.0 ... 17.12 54.0 30 May

5 rows × 20 columns

  • Let's make several candidate pipelines. But first, as always, a train-test split.
In [23]:
# Here, we're letting X_train and X_test keep all of the columns in the DataFrame
# OTHER than 'minutes'.
X_train, X_test, y_train, y_test = train_test_split(df.drop('minutes', axis=1), df['minutes'], random_state=23)

Creating many pipelines¶

In [24]:
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
In [25]:
selecter = FunctionTransformer(lambda x: x) # Identity transformer; keeps the selected columns as-is.
week_converter = FunctionTransformer(lambda s: 'Week ' + ((s - 1) // 7 + 1).astype(str))
day_of_month_transformer = make_pipeline(week_converter, OneHotEncoder(drop='first')) # From last class.
pipes = {
    'departure_hour only': make_pipeline(
        make_column_transformer((selecter, ['departure_hour'])),
        LinearRegression()
    ),
    'departure_hour + day_of_month': make_pipeline(
        make_column_transformer((selecter, ['departure_hour', 'day_of_month'])),
        LinearRegression()
    ),
    'departure_hour + day OHE': make_pipeline(
        make_column_transformer(
            (selecter, ['departure_hour']),
            (OneHotEncoder(drop='first', handle_unknown='ignore'), ['day'])
        ),
        LinearRegression()
    ),
    'departure_hour + day OHE + month OHE': make_pipeline(
        make_column_transformer(
            (selecter, ['departure_hour']),
            (OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month'])
        ),
        LinearRegression()
    ),
    'departure_hour with poly features + day OHE + month OHE + week': make_pipeline(
        make_column_transformer(
            (PolynomialFeatures(3), ['departure_hour']),
            (OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']),
            (day_of_month_transformer, ['day_of_month']),
        ),
        LinearRegression()
    )
}
  • Here, we won't be able to use GridSearchCV directly, because we're choosing between many different pipelines, not between hyperparameters for a particular pipeline.
In [26]:
results = pd.DataFrame(columns=['Average Training MSE', 'Average Validation MSE'])
for pipe in pipes:
    fitted = GridSearchCV(
        pipes[pipe],
        param_grid={}, # No hyperparameters, but we could have them.
        scoring='neg_mean_squared_error',
        cv=10, # Change this and see what happens!
        return_train_score=True # So that we can compute training MSEs, too.
    )
    fitted.fit(X_train, y_train)
    results.loc[pipe] = [-fitted.cv_results_['mean_train_score'][0], -fitted.cv_results_['mean_test_score'][0]]
commute_models_summarized = (
    results
    .sort_values('Average Training MSE')
    .plot(kind='barh', barmode='group', width=1000)
    .update_layout(xaxis_title='Mean Squared Error', yaxis_title='Model')
)
commute_models_summarized
  • What's the "right" combination of features to choose?

Summary: Generalization¶

  1. Split the data into two sets: training and test.
  2. Use only the training data when designing, training, and tuning the model.
    • Use $k$-fold cross-validation to choose hyperparameters and estimate the model's ability to generalize.
    • Do not ❌ look at the test data in this step!
  3. Commit to your final model and train it using the entire training set.
  4. Evaluate the final model using the test data. If the performance (e.g. MSE) is not acceptable, return to step 2.
  5. Finally, train on all available data and ship the model to production! 🛳
  • 🚨 This is the process you should always use! 🚨 (A compact sketch of the full workflow appears below.)
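  • Here's a rough end-to-end sketch of steps 1-5, reusing the polynomial-degree example from earlier in the lecture; treat it as a template, not the only way to implement the workflow.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Train-test split. `sample_1` is the dataset from earlier in this lecture.
X_train, X_test, y_train, y_test = train_test_split(
    sample_1[['x']], sample_1['y'], test_size=0.25, random_state=23)

# 2. Tune hyperparameters using only the training data (5-fold cross-validation here).
searcher = GridSearchCV(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    param_grid={'polynomialfeatures__degree': range(1, 26)},
    cv=5, scoring='neg_mean_squared_error')

# 3. Commit to a final model; by default, GridSearchCV refits it on the entire training set.
searcher.fit(X_train, y_train)

# 4. Evaluate once on the test set. If this isn't acceptable, return to step 2.
test_mse = mean_squared_error(y_test, searcher.predict(X_test))

# 5. Finally, retrain on all available data before shipping.
final_model = searcher.best_estimator_.fit(sample_1[['x']], sample_1['y'])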

Question 🤔 (Answer at practicaldsc.org/q)

What questions do you have?

Regularization¶


Motivation¶

  • So far, to us, "model complexity" has essentially meant "number of features."
    The main hyperparameter we've tuned is polynomial degree. For instance, a polynomial of degree 5 has 5 features – an $x$, $x^2$, $x^3$, $x^4$, and $x^5$ feature.
    In the more recent example, we manually created several different pipelines, each of which used different combinations of features from the commute times dataset.
  • Once we've created several different candidate models, we've used cross-validation to choose the one that best generalizes to unseen data.
  • Another approach: instead of manually choosing which features to include, put some constraint on the optimal parameters, $w_0^*, w_1^*, ..., w_d^*$.
    This would save us from having to manually come up with combinations of features that might be relevant.
  • Intuition: The bigger the optimal parameters $w_0^*, w_1^*, ..., w_d^*$ are, the more overfit the model is to the training data.
    Why?

Polynomial regression returns!¶

  • Once again, let's try fitting a polynomial model to Sample 1.
In [27]:
X_train, X_test, y_train, y_test = train_test_split(sample_1[['x']], sample_1['y'], random_state=23)
px.scatter(x=X_train['x'], y=y_train, title="Sample 1's Training Data")
  • This time, instead of using cross-validation to choose a polynomial degree, let's pick a degree in advance, like 25.
In [28]:
model = make_pipeline(PolynomialFeatures(25), LinearRegression(fit_intercept=False))
model.fit(X_train, y_train)
Out[28]:
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=25)),
                ('linearregression', LinearRegression(fit_intercept=False))])
In [29]:
fig = px.scatter(x=X_train['x'], y=y_train, title="Sample 1's Training Data")
fig.add_trace(go.Scatter(
    x=X_train['x'].sort_values(),
    y=model.predict(X_train.sort_values('x')),
    mode='lines',
    line=dict(width=4),
    name='Fit Polynomial of Degree 25'
))
  • This degree 25 polynomial is clearly overfit to the training data.
  • Observe: sklearn assigned really large values to many coefficients. For instance, the coefficient on $x^{10}$ has magnitude 1712.92. This means that if $x$ changes a little, the output is going to change a lot.
In [30]:
pd.Series(model.named_steps['linearregression'].coef_).abs().sort_values(ascending=False).head()
Out[30]:
10    1712.92
8     1418.08
9     1405.07
7     1395.33
12    1268.88
dtype: float64

Penalizing large parameters¶

  • Idea: In addition to just minimizing mean squared error, what if we could also try to prevent large parameter values?
    Maybe this would lead to less overfitting!
  • Previously, we minimized mean squared error to find $\vec{w}_\text{OLS}^*$: $$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2$$ Here, OLS stands for "ordinary least squares."
  • Idea: Instead, minimize regularized mean squared error:
$$R_\text{sq-reg}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \mathbf{+} \underbrace{\lambda \sum_{j = 1}^d w_j^2}_{\text{penalty!}}$$
  • For each $w_j$, we've added a penalty on how large it is! The larger each $w_j$ is, the larger $R_\text{sq-reg}$ will be.
    Remember, the goal is to find the $\vec{w}^*$ that minimizes $R_\text{sq-reg}$.
    Note that we're not penalizing the intercept term, which doesn't really contribute to overfitting.
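  • Here's a small numeric sketch of $R_\text{sq-reg}$, assuming the first column of $X$ is the all-1s intercept column so that $w_0$ is left out of the penalty; the toy numbers are made up for illustration.

import numpy as np

def regularized_mse(w, X, y, lam):
    # R_sq-reg(w) = (1/n) * ||y - Xw||^2 + lam * sum_{j=1}^{d} w_j^2.
    # Assumes X's first column is all 1s, so w[0] is the (unpenalized) intercept.
    mse = np.mean((y - X @ w) ** 2)
    penalty = lam * np.sum(w[1:] ** 2)
    return mse + penalty

# Toy usage with made-up numbers; a larger lam puts more pressure on keeping w_1, ..., w_d small.
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([1.0, 2.0, 4.0])
w_toy = np.array([0.5, 1.5])
regularized_mse(w_toy, X_toy, y_toy, lam=0.1)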

Ridge regression¶

  • The specific penalty we've chosen – $w_j^2$ for each $w_j$ – is called $L_2$ regularization.
  • Minimizing mean squared error, with $L_2$ regularization, is called ridge regression. The objective function for ridge regression is:
$$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \mathbf{+} \underbrace{\lambda \sum_{j = 1}^d w_j^2}_{\text{penalty!}}$$
  • $\lambda$ is a hyperparameter, which we choose through cross-validation.
  • The $\vec{w}_\text{ridge}^*$ that minimizes $R_\text{ridge}(\vec{w})$ is not necessarily the same as $\vec{w}_\text{OLS}^*$, which minimizes $R_\text{sq}(\vec{w})$!

Question 🤔 (Answer at practicaldsc.org/q)

The objective function we minimize to find $\vec{w}_\text{ridge}^*$ in ridge regression is:

$$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$

$\lambda$ is a hyperparameter, which we choose through cross-validation.

  • What if we pick $\lambda = 0$ – what is $\vec{w}_\text{ridge}^*$ then?
  • What happens to $\vec{w}_\text{ridge}^*$ as $\lambda \rightarrow \infty$?
  • Can $\lambda$ be negative?

Ridge regression, visualized¶

  • As $\lambda$ increases, the penalty on the size of $\vec{w}_\text{ridge}^*$ increases, meaning that each $w_j^*$ inches closer to 0.
  • An equivalent way of formulating the ridge regression objective function,

    $$\text{minimize} \:\:\:\: \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$

    is as a constrained optimization problem:

    $$\text{minimize} \:\:\:\:\frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \text{ such that } \sum_{j = 1}^d w_j^2 \leq Q$$

  • $Q$ and $\lambda$ are inversely related: the larger $Q$ is, the less of a penalty we're putting on the size of $\vec{w}_\text{ridge}^*$, so the smaller $\lambda$ is.
    The exact relationship between $Q$ and $\lambda$ is outside of the scope of this course, as is the proof of this fact. Take IOE 310!
  • Intuitively:

    • The loss surface for just the mean squared error component is in blue.
    • The constraint, $\sum_{j = 1}^d w_j^2 \leq Q$, is in green.
  • The bigger $Q$ is – so, the smaller $\lambda$ is – the larger the green circle is!
  • The bigger $Q$ is, the larger the range of possible values for $\vec{w}_\text{ridge}^*$ is, and the closer $\vec{w}_\text{ridge}^*$ gets to $\vec{w}_\text{OLS}^*$.

Finding $\vec{w}_\text{ridge}^*$¶

  • We know that the $\vec{w}_\text{OLS}^*$ that minimizes mean squared error, $$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2$$ is the one that satisfies the normal equations, $X^TX \vec{w} = X^T \vec{y}$.
  • Sometimes, $\vec{w}^*_\text{OLS}$ is unique, and sometimes there are infinitely many possible $\vec{w}^*_\text{OLS}$.
    There are infinitely many possible $\vec{w}^*_\text{OLS}$ when the design matrix, $X$, is not full rank! All of these infinitely many solutions minimize mean squared error.
  • Which vector $\vec{w}_\text{ridge}^*$ minimizes the ridge regression objective function?
$$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$
  • It turns out there is always a unique solution for $\vec{w}_\text{ridge}^*$, even if $X$ is not full rank. It is: $$\vec{w}_\text{ridge}^* = (X^TX + n \lambda I)^{-1} X^T \vec{y}$$
    The proof is outside of the scope of the class, and requires vector calculus.
  • Since there is always a unique solution, ridge regression is often used in the presence of multicollinearity!
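  • Here's a quick numerical sanity check of this formula, under two simplifying assumptions: the toy design matrix has no intercept column, and sklearn's alpha plays the role of $n\lambda$ (sklearn's Ridge objective doesn't divide the squared-error term by $n$).

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(23)
n, d = 100, 5
X = rng.normal(size=(n, d))      # Toy design matrix with no intercept column.
y = rng.normal(size=n)
lam = 0.1

# Closed form from above: w* = (X^T X + n*lambda*I)^{-1} X^T y.
w_closed_form = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# sklearn's Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2, so alpha corresponds to n*lambda.
w_sklearn = Ridge(alpha=n * lam, fit_intercept=False).fit(X, y).coef_

np.allclose(w_closed_form, w_sklearn)    # Should be True.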

Taking a step back¶

  • $\vec{w}_\text{ridge}^*$ doesn't minimize mean squared error – it minimizes a slightly different objective function.
  • So, why would we ever use ridge regression?

Ridge regression in sklearn¶

  • Fortunately, sklearn can perform ridge regression for us.
In [31]:
from sklearn.linear_model import Ridge
  • Just to experiment, let's set $\lambda$ to something extremely large and look at the resulting predictions.
In [32]:
# The name of the lambda hyperparameter in sklearn is alpha.
model_large_lambda = make_pipeline(PolynomialFeatures(25), Ridge(alpha=1000000000000000000000000000))
model_large_lambda.fit(X_train, y_train)
fig = px.scatter(x=X_train['x'], y=y_train, title="Sample 1's Training Data")
fig.add_trace(go.Scatter(
    x=X_train['x'].sort_values(),
    y=model_large_lambda.predict(X_train.sort_values('x')),
    mode='lines',
    line=dict(width=4, color='purple'),
    name='Extremely Regularized Polynomial of Degree 25'
))
  • What do you notice?
In [33]:
model_large_lambda.named_steps['ridge'].intercept_ 
Out[33]:
3.6606542664388497
In [34]:
y_train.mean() 
Out[34]:
3.662415381952351

Using GridSearchCV to choose $\lambda$¶

  • In general, we won't just arbitrarily choose a value of $\lambda$.
  • Instead, we'll perform $k$-fold cross-validation to choose the $\lambda$ that leads to predictions that work best on unseen test data.
In [35]:
hyperparams = {
    'ridge__alpha': 10.0 ** np.arange(-2, 15) # Try 0.01, 0.1, 1, 10, 100, 1000, ... 
}
model_regularized = GridSearchCV(
    estimator=make_pipeline(PolynomialFeatures(25), Ridge()),
    param_grid=hyperparams,
    scoring='neg_mean_squared_error'
)
model_regularized.fit(X_train, y_train)
Out[35]:
GridSearchCV(estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures(degree=25)),
                                       ('ridge', Ridge())]),
             param_grid={'ridge__alpha': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05,
       1.e+06, 1.e+07, 1.e+08, 1.e+09, 1.e+10, 1.e+11, 1.e+12, 1.e+13,
       1.e+14])},
             scoring='neg_mean_squared_error')
  • Let's check the optimal $\lambda$ it found!
In [36]:
model_regularized.best_params_
Out[36]:
{'ridge__alpha': 1000000000.0}
  • What do the resulting predictions look like?
In [37]:
fig = px.scatter(x=X_train['x'], y=y_train, title="Sample 1's Training Data")
fig.add_trace(go.Scatter(
    x=X_train['x'].sort_values(),
    y=model.predict(X_train.sort_values('x')),
    mode='lines',
    line=dict(width=4),
    name='Unregularized Polynomial of Degree 25'
))
fig.add_trace(go.Scatter(
    x=X_train['x'].sort_values(),
    y=model_regularized.predict(X_train.sort_values('x')),
    mode='lines',
    line=dict(width=4, color='green'),
    name='Regularized Polynomial of Degree 25'
))
  • It seems that the regularized polynomial is less overfit to the specific noise in the training data than the unregularized polynomial!
  • The largest coefficients are all much smaller now, too. For instance, the coefficient on $x^{20}$ is only 0.000136.

In [38]:
pd.Series(model_regularized.best_estimator_.named_steps['ridge'].coef_).abs().sort_values(ascending=False).head()
Out[38]:
20    1.36e-04
17    1.29e-04
15    8.57e-05
19    8.49e-05
16    5.68e-05
dtype: float64

Another example: Commute times¶

  • Let's use ridge regression to find coefficients for the most complicated commute times model we looked at earlier, to see if it can generalize well to unseen data.
In [39]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('minutes', axis=1), df['minutes'], random_state=23)
  • We'll use the model previously labeled departure_hour with poly features + day OHE + month OHE + week.
In [40]:
commute_models_summarized

Ridge regression for commute times¶

  • Let's first instantiate a Pipeline for the steps we want to execute.
In [41]:
commute_pipe = make_pipeline(
    make_column_transformer(
        (PolynomialFeatures(3), ['departure_hour']),
        (OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']),
        (day_of_month_transformer, ['day_of_month']),
    ),
    Ridge()
)
  • Then, as before, we'll fit a GridSearchCV instance with a hyperparameter grid.
In [42]:
lambdas = 10.0 ** np.arange(-10, 15) # Try 1e-10, 1e-9, ..., 0.1, 1, 10, 100, ..., 1e14.
hyperparams = {
    'ridge__alpha': lambdas 
}
In [43]:
commute_model_regularized = GridSearchCV(
    commute_pipe,
    param_grid = hyperparams,
    scoring='neg_mean_squared_error',
    cv=10
)
commute_model_regularized.fit(X_train, y_train)
Out[43]:
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('polynomialfeatures',
                                                                         PolynomialFeatures(degree=3),
                                                                         ['departure_hour']),
                                                                        ('onehotencoder',
                                                                         OneHotEncoder(drop='first',
                                                                                       handle_unknown='ignore'),
                                                                         ['day',
                                                                          'month']),
                                                                        ('pipeline',
                                                                         Pipeline(steps=[('functiontransformer',
                                                                                          FunctionTransformer(func=<function <lambda> at 0x177a80820>)),
                                                                                         ('onehotencoder',
                                                                                          OneHotEncoder(drop='first'))]),
                                                                         ['day_of_month'])])),
                                       ('ridge', Ridge())]),
             param_grid={'ridge__alpha': array([1.e-10, 1.e-09, 1.e-08, ..., 1.e+12, 1.e+13, 1.e+14])},
             scoring='neg_mean_squared_error')
  • Which $\lambda$ did it choose?
In [44]:
commute_model_regularized.best_params_
Out[44]:
{'ridge__alpha': 1.0}
  • How did the average validation MSE change with $\lambda$?
    Here, large values of $\lambda$ mean less complex models, not more complex.
In [45]:
(
    pd.Series(-commute_model_regularized.cv_results_['mean_test_score'], 
              index=np.log(lambdas))
    .to_frame()
    .reset_index()
    .plot(kind='line', x='index', y=0)
    .update_layout(xaxis_title=r'$\log(\lambda)$', yaxis_title='Average Validation MSE')
)

What's next?¶

  • Could we have chosen a different method of penalizing each $w_j$ other than $w_j^2$?
  • Ridge regression's objective function happened to have a closed-form solution. What if we want to minimize a function that can't be minimized by hand?