from lec_utils import *
import lec21_util as util
np.random.seed(23) # For reproducibility.
def sample_from_pop(n=100):
x = np.linspace(-2, 3, n)
y = x ** 3 + (np.random.normal(0, 3, size=n))
return pd.DataFrame({'x': x, 'y': y})
sample_1 = sample_from_pop()
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, FunctionTransformer, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
import warnings
warnings.simplefilter('ignore')
from IPython.display import Markdown
Announcements 📣¶
- The Portfolio Homework has been released! Read all about it here. It has two due dates:
- A checkpoint (worth 15 points / 100) is due on Monday, November 25th (no slip days!).
- The full homework is due on Saturday, December 7th (no slip days!).
- Homework 10 will be out later this week.
- The Grade Report now includes scores and slip days through Homework 8.
- Please help spread the word about this class by submitting an anonymous testimony here 🙏.
We'll share some of the responses we get on this form at practicaldsc.org/next, and in advertisement emails/posts we share with other students.
- Interested in research opportunities in data science? See #302 on Ed.
Agenda¶
- Recap: Ridge Regression.
- LASSO.
- Gradient descent.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
Recap: Ridge Regression¶
Motivation: Polynomial regression¶
- Last class, we fit a degree 25 polynomial model to Sample 1.
X_train, X_test, y_train, y_test = train_test_split(sample_1[['x']], sample_1['y'], random_state=23)
px.scatter(x=X_train['x'], y=y_train, title="Sample 1's Training Data")
# include_bias=False makes sure that PolynomialFeatures
# doesn't create a new column of all 1s in the design matrix, since
# LinearRegression already makes one.
model = make_pipeline(PolynomialFeatures(25, include_bias=False), LinearRegression())
model.fit(X_train, y_train)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=25, include_bias=False)), ('linearregression', LinearRegression())])
- This degree 25 polynomial is clearly overfit to the training data.
fig = px.scatter(x=X_train['x'], y=y_train, title="Sample 1's Training Data")
fig.add_trace(go.Scatter(
x=X_train['x'].sort_values(),
y=model.predict(X_train.sort_values('x')),
mode='lines',
line=dict(width=4),
name='Fit Polynomial of Degree 25'
))
Inspecting the fit degree 25 polynomial¶
- What are the resulting coefficients of the fit polynomial?
pd.Series(model.named_steps['linearregression'].coef_, index=range(1, 26))
1 1.32e+01 2 -3.43e+00 3 -1.85e+02 ... 23 6.85e-02 24 -1.95e-02 25 1.47e-03 Length: 25, dtype: float64
- What does the resulting polynomial actually look like, as an equation?
# These coefficients are rounded to two decimal places.
# The coefficient on x^25 is not 0.00, but is something very small.
util.display_features(model.named_steps['linearregression'])
- sklearn assigned really large values to many features.
This means that if $x$ changes a little, the output is going to change a lot. It seems like some of the terms are trying to "cancel" each other out – some have large negative coefficients, some have large positive coefficients.
- Intuition: In general, the bigger the optimal parameters $w_0^*, w_1^*, ..., w_d^*$ are, the more overfit the model is to the training data.
Ridge regression¶
- Idea: In addition to just minimizing mean squared error, what if we could also try and prevent large parameter values?
Maybe this would lead to less overfitting!
- Minimizing mean squared error, with $L_2$ regularization, is called ridge regression. The objective function for ridge regression is:
$$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$
- Intuition: Instead of just minimizing mean squared error, we balance minimizing mean squared error and a penalty on the size of the fit coefficients, $w_1^*$, $w_2^*$, ..., $w_d^*$.
We don't regularize the intercept term!
- $\lambda$ is a hyperparameter, which we choose through cross-validation.
- The $\vec{w}_\text{ridge}^*$ that minimizes $R_\text{ridge}(\vec{w})$ is not necessarily the same as $\vec{w}_\text{OLS}^*$, which minimizes $R_\text{sq}(\vec{w})$!
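- To make the objective concrete, here's a minimal NumPy sketch of $R_\text{ridge}(\vec{w})$. The name ridge_objective is ours (not something from sklearn), and it assumes X has no intercept column, since we don't regularize the intercept.
# A minimal sketch of the ridge regression objective, assuming X is an n x d
# design matrix with no intercept column, y has length n, and w has length d.
def ridge_objective(w, X, y, lam):
    mse = np.mean((y - X @ w) ** 2)    # Mean squared error term.
    penalty = lam * np.sum(w ** 2)     # L2 penalty on the coefficients.
    return mse + penalty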
Another interpretation of ridge regression¶
- As $\lambda$ increases, the penalty on the size of $\vec{w}_\text{ridge}^*$ increases, meaning that each $w_j^*$ inches closer to 0.
An equivalent way of formulating the ridge regression objective function,
$$\text{minimize} \:\:\:\: \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$
is as a constrained optimization problem:
$$\text{minimize} \:\:\:\:\frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \text{ such that } \sum_{j = 1}^d w_j^2 \leq Q$$
- $Q$ and $\lambda$ are inversely related: the larger $Q$ is, the less of a penalty we're putting on size of $\vec{w}_\text{ridge}^*$, so the smaller $\lambda$ is.
The exact relationship between $Q$ and $\lambda$ is outside of the scope of this course, as is the proof of this fact. Take IOE 310!
Ridge regression, visualized¶
Intuitively:
- The loss surface for just the mean squared error component is in blue.
- The constraint, $\sum_{j = 1}^d w_j^2 \leq Q$, is in green.
The larger $Q$ is, the larger the radius of the ball is.
- The bigger $Q$ is – so, the smaller $\lambda$ is – the larger the green circle is!
- The bigger $Q$ is, the larger the range of possible values for $\vec{w}_\text{ridge}^*$ is, and the closer $\vec{w}_\text{ridge}^*$ gets to $\vec{w}_\text{OLS}^*$.
Finding $\vec{w}_\text{ridge}^*$¶
- We know that the $\vec{w}_\text{OLS}^*$ that minimizes mean squared error,
$$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2$$
is the one that satisfies the normal equations, $X^TX \vec{w} = X^T \vec{y}$.
Recall, linear regression that minimizes mean squared error, without any other constraints, is called ordinary least squares (OLS).
- Sometimes, $\vec{w}^*_\text{OLS}$ is unique, and sometimes there are infinitely many possible $\vec{w}^*_\text{OLS}$.
There are infinitely many possible $\vec{w}^*_\text{OLS}$ when the design matrix, $X$, is not full rank! All of these infinitely many solutions minimize mean squared error.
- Which vector $\vec{w}_\text{ridge}^*$ minimizes the ridge regression objective function?
- It turns out there is always a unique solution for $\vec{w}_\text{ridge}^*$, even if $X$ is not full rank. It is:
$$\vec{w}_\text{ridge}^* = (X^TX + n \lambda I)^{-1} X^T \vec{y}$$
The proof is outside of the scope of the class, and requires vector calculus.
- Since there is always a unique solution, ridge regression is often used in the presence of multicollinearity!
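- As a sanity check, here's a sketch of this closed-form solution, compared against sklearn's Ridge. Since sklearn's Ridge minimizes $\lVert \vec{y} - X \vec{w} \rVert^2 + \alpha \lVert \vec{w} \rVert^2$, we'd expect alpha = n * lam with fit_intercept=False to line up with the formula above; the data and variable names below are made up for illustration.
# A sketch of the closed-form ridge solution, assuming X (n x d) and y (length n)
# are NumPy arrays and X has no intercept column.
from sklearn.linear_model import Ridge

def ridge_closed_form(X, y, lam):
    n, d = X.shape
    # Solve (X^T X + n*lam*I) w = X^T y rather than explicitly inverting the matrix.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(23)
X_demo, y_demo = rng.normal(size=(100, 3)), rng.normal(size=100)
lam = 0.5
w_formula = ridge_closed_form(X_demo, y_demo, lam)
w_sklearn = Ridge(alpha=len(y_demo) * lam, fit_intercept=False).fit(X_demo, y_demo).coef_
np.allclose(w_formula, w_sklearn)  # Should be True, up to numerical precision.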
Taking a step back¶
- $\vec{w}_\text{ridge}^*$ doesn't minimize mean squared error – it minimizes a slightly different objective function.
- So, why would we ever use ridge regression?
Ridge regression in sklearn¶
- Fortunately, sklearn can perform ridge regression for us.
from sklearn.linear_model import Ridge
- Just to experiment, let's set $\lambda$ to something extremely large and look at the resulting predictions.
# The name of the lambda hyperparameter in sklearn is alpha.
model_large_lambda = make_pipeline(PolynomialFeatures(25, include_bias=False),
Ridge(alpha=1000000000000000000000000000))
model_large_lambda.fit(X_train, y_train)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=25, include_bias=False)), ('ridge', Ridge(alpha=1000000000000000000000000000))])
Visualizing the extremely regularized model¶
- What do the resulting predictions look like?
fig = px.scatter(x=X_train['x'], y=y_train, title="Sample 1's Training Data")
fig.add_trace(go.Scatter(
x=X_train['x'].sort_values(),
y=model_large_lambda.predict(X_train.sort_values('x')),
mode='lines',
line=dict(width=4, color='purple'),
name='Extremely Regularized Polynomial of Degree 25'
))
fig.update_layout(width=1000, height=800)
- What do you notice?
model_large_lambda.named_steps['ridge'].intercept_
3.6606542664388497
# All 0!
model_large_lambda.named_steps['ridge'].coef_
array([0., 0., 0., ..., 0., 0., 0.])
y_train.mean()
3.662415381952351
Using GridSearchCV to choose $\lambda$¶
- In general, we won't just arbitrarily choose a value of $\lambda$.
- Instead, we'll perform $k$-fold cross-validation to choose the $\lambda$ that leads to predictions that work best on unseen test data.
hyperparams = {
'ridge__alpha': 10.0 ** np.arange(-2, 15) # Try 0.01, 0.1, 1, 10, 100, 1000, ...
}
model_regularized = GridSearchCV(
estimator=make_pipeline(PolynomialFeatures(25, include_bias=False), Ridge()),
param_grid=hyperparams,
scoring='neg_mean_squared_error'
)
model_regularized.fit(X_train, y_train)
GridSearchCV(estimator=Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=25, include_bias=False)), ('ridge', Ridge())]), param_grid={'ridge__alpha': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06, 1.e+07, 1.e+08, 1.e+09, 1.e+10, 1.e+11, 1.e+12, 1.e+13, 1.e+14])}, scoring='neg_mean_squared_error')
- Let's check the optimal $\lambda$ it found!
model_regularized.best_params_
{'ridge__alpha': 1000000000.0}
Visualizing the regularized degree 25 model¶
- What do the resulting predictions look like?
fig = px.scatter(x=X_train['x'], y=y_train, title="Sample 1's Training Data")
fig.add_trace(go.Scatter(
x=X_train['x'].sort_values(),
y=model.predict(X_train.sort_values('x')),
mode='lines',
line=dict(width=4),
name='Unregularized Polynomial of Degree 25'
))
fig.add_trace(go.Scatter(
x=X_train['x'].sort_values(),
y=model_regularized.predict(X_train.sort_values('x')),
mode='lines',
line=dict(width=4, color='green'),
name='Regularized Polynomial of Degree 25'
))
fig.update_layout(width=1000, height=800)
- It seems that the regularized polynomial is less overfit to the specific noise in the training data than the unregularized polynomial!
- The largest coefficients are all much smaller now, too.
The coefficient on $x^{20}$ is 0.000136.
util.display_features(model_regularized.best_estimator_.named_steps['ridge'], precision=8)
- Note that none of them are exactly 0, but many of them are close!
This will be important later.
Tuning multiple hyperparameters at once¶
- What if we don't want to fix a polynomial degree in advance, and instead want to choose the degree using cross-validation, while also using ridge regression?
- No problem!
Note that the next cell takes much longer than the previous call to fit took, since it needs to try every combination of $\alpha$ and polynomial degree.
hyperparams = {
'ridge__alpha': 10.0 ** np.arange(-2, 15),
'polynomialfeatures__degree': range(1, 26)
}
model_regularized_degree = GridSearchCV(
estimator=make_pipeline(PolynomialFeatures(include_bias=False), Ridge()),
param_grid=hyperparams,
scoring='neg_mean_squared_error'
)
model_regularized_degree.fit(X_train, y_train)
GridSearchCV(estimator=Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)), ('ridge', Ridge())]), param_grid={'polynomialfeatures__degree': range(1, 26), 'ridge__alpha': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06, 1.e+07, 1.e+08, 1.e+09, 1.e+10, 1.e+11, 1.e+12, 1.e+13, 1.e+14])}, scoring='neg_mean_squared_error')
- Now, let's check the optimal $\lambda$ and polynomial degree it found!
model_regularized_degree.best_params_
{'polynomialfeatures__degree': 3, 'ridge__alpha': 100.0}
Visualizing the regularized degree 3 model¶
- What do the resulting predictions look like?
fig = px.scatter(x=X_train['x'], y=y_train, title="Sample 1's Training Data")
fig.add_trace(go.Scatter(
x=X_train['x'].sort_values(),
y=model.predict(X_train.sort_values('x')),
mode='lines',
line=dict(width=4),
name='Unregularized Polynomial of Degree 25'
))
fig.add_trace(go.Scatter(
x=X_train['x'].sort_values(),
y=model_regularized.predict(X_train.sort_values('x')),
mode='lines',
line=dict(width=4, color='green'),
name='Regularized Polynomial of Degree 25'
))
fig.add_trace(go.Scatter(
x=X_train['x'].sort_values(),
y=model_regularized_degree.predict(X_train.sort_values('x')),
mode='lines',
line=dict(width=4, color='skyblue'),
name='Regularized Polynomial of Degree 3'
))
polyfig = fig.update_layout(width=1000, height=800)
polyfig
util.display_features(model_regularized_degree.best_estimator_.named_steps['ridge'])
Run the cell below to set up the next slide.
from sklearn.metrics import mean_squared_error
unregularized_train = mean_squared_error(y_train, model.predict(X_train))
unregularized_test = mean_squared_error(y_test, model.predict(X_test))
regularized_lambda_train = mean_squared_error(y_train, model_regularized.predict(X_train))
regularized_lambda_validation = (-model_regularized.cv_results_['mean_test_score']).min()
regularized_lambda_test = mean_squared_error(y_test, model_regularized.predict(X_test))
regularized_lambda_degree_train = mean_squared_error(y_train, model_regularized_degree.predict(X_train))
regularized_lambda_degree_validation = (-model_regularized_degree.cv_results_['mean_test_score']).min()
regularized_lambda_degree_test = mean_squared_error(y_test, model_regularized_degree.predict(X_test))
results_df = pd.DataFrame(index=['training MSE', 'average validation MSE (across all folds)', 'test MSE']).assign(
unregularized=[unregularized_train, np.nan, unregularized_test],
regularized_lambda_only=[regularized_lambda_train, regularized_lambda_validation, regularized_lambda_test],
regularized_lambda_and_degree=[regularized_lambda_degree_train, regularized_lambda_degree_validation, regularized_lambda_degree_test]
)
reprs = {'unregularized': '<b><span style="color:#ff7f0f">Unregularized (Degree 25)</span></b>',
'regularized_lambda_only': '<b><span style="color:green">Regularized (Degree 25)<br><small>Used cross-validation to choose $\lambda$</span></b>',
'regularized_lambda_and_degree': '<b><span style="color:skyblue">Regularized (Degree 3)<br><small>Used cross-validation to choose $\lambda$ and degree</small></span></b>'}
results_df_str = results_df.to_html()
for rep in reprs:
results_df_str = results_df_str.replace(rep, reprs[rep])
Comparing training, validation, and test errors¶
- Let's compare the training and testing error of the three polynomials below.
polyfig
display(HTML(results_df_str))
| | Unregularized (Degree 25) | Regularized (Degree 25)<br>Used cross-validation to choose $\lambda$ | Regularized (Degree 3)<br>Used cross-validation to choose $\lambda$ and degree |
|---|---|---|---|
| training MSE | 4.72 | 10.33 | 7.11 |
| average validation MSE (across all folds) | NaN | 17.60 | 7.40 |
| test MSE | 14.21 | 17.17 | 10.52 |
- It seems that the regularized polynomial, in which we used cross-validation to choose both the regularization penalty $\lambda$ and degree, generalizes best to unseen data!
What's next?¶
- Could we have chosen a different method of penalizing each $w_j$ other than $w_j^2$?
We're about to see another option!
- Ridge regression's objective function happened to have a closed-form solution.
What if we want to minimize a function that can't be minimized by hand?
We'll talk about how towards the end of lecture!
- Why is it called ridge regression?
See Homework 10!
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have about ridge regression?
LASSO¶
Penalizing large parameters¶
- The ridge regression objective function, $$R_\text{ridge}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d w_j^2$$ minimizes mean squared error, plus a squared penalty on the size of the fit coefficients, $w_1^*, w_2^*, ..., w_d^*$.
- We called the regularization strategy above "$L_2$ regularization." Could we have regularized another way?
The LASSO objective function uses $L_1$ regularization, which penalizes the absolute value of each coefficient:
$$R_\text{LASSO}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d |w_j| $$
- LASSO stands for "least absolute shrinkage and selection operator."
We'll make sense of this name shortly.
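- Analogous to ridge, here's a minimal NumPy sketch of the LASSO objective – only the penalty term changes. Again, lasso_objective is just an illustrative name, and X is assumed to have no intercept column.
# A minimal sketch of the LASSO objective, assuming X is an n x d design matrix
# with no intercept column, y has length n, and w has length d.
def lasso_objective(w, X, y, lam):
    mse = np.mean((y - X @ w) ** 2)       # Mean squared error term.
    penalty = lam * np.sum(np.abs(w))     # L1 penalty on the coefficients.
    return mse + penalty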
LASSO in sklearn¶
- Unlike with ridge regression or ordinary least squares, there is no general closed-form solution for $\vec{w}_\text{LASSO}^*$.
- But, it can be estimated using numerical methods, which sklearn uses under the hood. Let's test it out.
More on numerical methods soon!
from sklearn.linear_model import Lasso
- Let's use LASSO to fit a degree 25 polynomial to Sample 1.
Here, we'll fix the degree, and cross-validate to find $\lambda$.
hyperparams = {
'lasso__alpha': 10.0 ** np.arange(-2, 15)
}
model_regularized_lasso = GridSearchCV(
estimator=make_pipeline(PolynomialFeatures(25, include_bias=False), Lasso()),
param_grid=hyperparams,
scoring='neg_mean_squared_error'
)
model_regularized_lasso.fit(X_train, y_train)
GridSearchCV(estimator=Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=25, include_bias=False)), ('lasso', Lasso())]), param_grid={'lasso__alpha': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06, 1.e+07, 1.e+08, 1.e+09, 1.e+10, 1.e+11, 1.e+12, 1.e+13, 1.e+14])}, scoring='neg_mean_squared_error')
- Our cross-validation routine ends up choosing $\lambda = 0.1$, though on its own, this doesn't really tell us anything.
model_regularized_lasso.best_params_
{'lasso__alpha': 0.1}
Visualizing the regularized degree 25 model, fit with LASSO¶
- What do the resulting predictions look like, relative to the fit polynomials from earlier in the lecture?
polyfig.add_trace(go.Scatter(
x=X_train['x'].sort_values(),
y=model_regularized_lasso.predict(X_train.sort_values('x')),
mode='lines',
line=dict(width=4, color='red'),
name='Regularized Polynomial of Degree 25, using LASSO'
))
- What do you notice about the coefficients of the polynomial themselves?
util.display_features(model_regularized_lasso.best_estimator_.named_steps['lasso'], precision=8)
- Important: Note that we fit a degree 25 polynomial, but many of the higher-order terms are missing, since their coefficients ended up being 0!
There are no $x^{18}, x^{19}, x^{20}, ..., x^{25}$ terms above, and also no $x$ term.
- The resulting polynomial ends up being of degree 17.
- Run the cell below to set up the calculations.
regularized_lasso_train = mean_squared_error(y_train, model_regularized_lasso.predict(X_train))
regularized_lasso_validation = (-model_regularized_lasso.cv_results_['mean_test_score']).min()
regularized_lasso_test = mean_squared_error(y_test, model_regularized_lasso.predict(X_test))
results_df['regularized_lasso'] = [regularized_lasso_train, regularized_lasso_validation, regularized_lasso_test]
results_df_str = results_df.to_html()
reprs['regularized_lasso'] = '<b><span style="color:red">Regularized using LASSO (Degree 25)<br><small>Used cross-validation to choose $\lambda$</small></span></b>'
for rep in reprs:
results_df_str = results_df_str.replace(rep, reprs[rep])
- How does our LASSO-regularized polynomial compare, in terms of errors, to our earlier polynomials?
display(HTML(results_df_str))
| | Unregularized (Degree 25) | Regularized (Degree 25)<br>Used cross-validation to choose $\lambda$ | Regularized (Degree 3)<br>Used cross-validation to choose $\lambda$ and degree | Regularized using LASSO (Degree 25)<br>Used cross-validation to choose $\lambda$ |
|---|---|---|---|---|
| training MSE | 4.72 | 10.33 | 7.11 | 6.34 |
| average validation MSE (across all folds) | NaN | 17.60 | 7.40 | 7.89 |
| test MSE | 14.21 | 17.17 | 10.52 | 12.28 |
When using LASSO, many coefficients are set to 0!¶
- When using $L_1$ regularization – that is, when performing LASSO – many of the optimal coefficients $w_1^*, w_2^*, ..., w_d^*$ end up being exactly 0.
- This was not the case in ridge regression – there, the optimal coefficients were all very small, but none were exactly 0.
display(Markdown('#### Fit using Ridge:'))
util.display_features(model_regularized.best_estimator_.named_steps['ridge'], precision=8)
Fit using Ridge:¶
- If a feature has a coefficient of 0, it means it's not being used at all in making predictions.
display(Markdown('#### Fit using LASSO (notice the larger coefficient on $x^3$):'))
util.display_features(model_regularized_lasso.best_estimator_.named_steps['lasso'], precision=8)
Fit using LASSO (notice the larger coefficient on $x^3$):¶
- LASSO implicitly performs feature selection for us – it automatically tells us which features we don't need to use.
Here, it told us "don't use $x$, don't use $x^{18}$, don't use $x^{19}$, ..., don't use $x^{25}$, and instead weigh the $x^2$ and $x^3$ terms more."
- This is where the name "least absolute shrinkage and selection operator" comes from.
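- One way to see this feature selection programmatically: pull out the features whose LASSO coefficients are nonzero. The sketch below assumes the model_regularized_lasso fit from above, and that your sklearn version supports get_feature_names_out.
# A sketch: which features did LASSO keep? Uses model_regularized_lasso from above.
best_lasso_pipe = model_regularized_lasso.best_estimator_
feature_names = best_lasso_pipe.named_steps['polynomialfeatures'].get_feature_names_out()
lasso_coefs = pd.Series(best_lasso_pipe.named_steps['lasso'].coef_, index=feature_names)
lasso_coefs[lasso_coefs != 0]  # The features LASSO "selected", i.e. those with nonzero coefficients.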
Why does LASSO encourage sparsity?¶
- The fact that many of the optimal coefficients – $w_1^*, w_2^*, ..., w_d^*$ – are 0 when performing LASSO is often stated as: LASSO induces sparsity – that is, it produces solutions in which many coefficients are exactly 0.
To make sense of this, let's look at the equivalent formulation of LASSO as a constrained optimization problem.
$$\text{minimize} \:\:\:\: \frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 + \lambda \sum_{j = 1}^d | w_j |$$
is equivalent to:
$$\text{minimize} \:\:\:\:\frac{1}{n} \lVert \vec{y} - X \vec{w} \rVert^2 \text{ such that } \sum_{j = 1}^d | w_j | \leq Q$$
- Again, $Q$ and $\lambda$ are inversely related: the larger $Q$ is, the less of a penalty we're putting on size of $\vec{w}_\text{LASSO}^*$, so the smaller $\lambda$ is.
LASSO, visualized¶
- Again:
- The loss surface for just the mean squared error component is in blue.
- The constraint, $\sum_{j = 1}^d |w_j| \leq Q$, is in green.
The larger $Q$ is, the larger the side length of the diamond is.
- Notice that the constraint set has clearly defined "corners," which lie on the axes. The axes are where the parameter values, $w_1$ and $w_2$ here, are 0.
- Due to the shape of the constraint set, it's likely that the minimum value of mean squared error, among all options in the green diamond, will occur at a corner, where some of the parameter values are 0.
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have about LASSO, or regularization in general?
Example: Commute times¶
Another example: Commute times¶
- Last class, before we learned about regularization, we used $k$-fold cross-validation to choose between the following five models that predict commute time in 'minutes'.
- The most complicated model, labeled departure_hour with poly features + day OHE + month OHE + week, didn't generalize well to unseen data, relative to simpler models.
At least, not when we used ordinary least squares to train it.
- Let's use ordinary least squares, ridge regression, and LASSO to train commute time models, and compare the results.
df = pd.read_csv('data/commute-times.csv')
df['day_of_month'] = pd.to_datetime(df['date']).dt.day
df['month'] = pd.to_datetime(df['date']).dt.month_name()
df.head()
| | date | day | home_departure_time | home_departure_mileage | ... | work_departure_time_hr | mileage_to_home | day_of_month | month |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5/15/2023 | Mon | 2023-05-15 10:49:00 | 15873.0 | ... | 17.17 | 53.0 | 15 | May |
| 1 | 5/16/2023 | Tue | 2023-05-16 07:45:00 | 15979.0 | ... | NaN | NaN | 16 | May |
| 2 | 5/22/2023 | Mon | 2023-05-22 08:27:00 | 50407.0 | ... | 15.90 | 54.0 | 22 | May |
| 3 | 5/23/2023 | Tue | 2023-05-23 07:08:00 | 50535.0 | ... | NaN | NaN | 23 | May |
| 4 | 5/30/2023 | Tue | 2023-05-30 09:09:00 | 50664.0 | ... | 17.12 | 54.0 | 30 | May |

5 rows × 20 columns
X_train, X_test, y_train, y_test = train_test_split(df.drop('minutes', axis=1), df['minutes'], random_state=23)
Ordinary least squares for commute times¶
- Since we'll do the feature transformations repeatedly, we'll save them as a single pipeline.
week_converter = FunctionTransformer(lambda s: 'Week ' + ((s - 1) // 7 + 1).astype(str),
feature_names_out='one-to-one')
day_of_month_transformer = make_pipeline(week_converter, OneHotEncoder(drop='first'))
# Note the include_bias=False once again!
commute_feature_pipe = make_pipeline(
make_column_transformer(
(PolynomialFeatures(3, include_bias=False), ['departure_hour']),
(OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']),
(day_of_month_transformer, ['day_of_month']),
)
)
- First, we'll fit a "vanilla" linear regression model, i.e. one that just minimizes mean squared error, with no regularization.
commute_model_ols = make_pipeline(commute_feature_pipe, LinearRegression())
commute_model_ols
Pipeline(steps=[('pipeline', Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(feature_names_out='one-to-one', func=<function <lambda> at 0x15ba2dea0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])]))])), ('linearregression', LinearRegression())])
- There are no hyperparameters to grid search for here, so we'll just fit the model directly.
commute_model_ols.fit(X_train, y_train)
Pipeline(steps=[('pipeline', Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(feature_names_out='one-to-one', func=<function <lambda> at 0x15ba2dea0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])]))])), ('linearregression', LinearRegression())])
- We'll keep commute_model_ols aside for now, and compare its performance to the fit regularized models in a few moments.
Ridge regression for commute times¶
- Again, let's instantiate a Pipeline for the steps we want to execute.
commute_pipe_ridge = make_pipeline(commute_feature_pipe, Ridge())
commute_pipe_ridge
Pipeline(steps=[('pipeline', Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(feature_names_out='one-to-one', func=<function <lambda> at 0x15ba2dea0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])]))])), ('ridge', Ridge())])
- Now, since we need to choose the regularization penalty, $\lambda$, we'll fit a GridSearchCV instance with a hyperparameter grid.
lambdas = 10.0 ** np.arange(-10, 15)
hyperparams = {
'ridge__alpha': lambdas
}
commute_model_ridge = GridSearchCV(
commute_pipe_ridge,
param_grid = hyperparams,
scoring='neg_mean_squared_error',
cv=10
)
commute_model_ridge.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=Pipeline(steps=[('pipeline', Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(feature_names_out='one-to-one', func=<function <lambda> at 0x15ba2dea0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])]))])), ('ridge', Ridge())]), param_grid={'ridge__alpha': array([1.e-10, 1.e-09, 1.e-08, ..., 1.e+12, 1.e+13, 1.e+14])}, scoring='neg_mean_squared_error')
- Which $\lambda$ did it choose?
On its own, this value of $\lambda$ doesn't really tell us anything.
commute_model_ridge.best_params_
{'ridge__alpha': 1.0}
Aside: average validation error vs. $\lambda$¶
- How did the average validation MSE change with $\lambda$?
Here, large values of $\lambda$ mean less complex models, not more complex.
(
pd.Series(-commute_model_ridge.cv_results_['mean_test_score'],
index=np.log10(lambdas))
.to_frame()
.reset_index()
.plot(kind='line', x='index', y=0)
.update_layout(xaxis_title='$\log(\lambda)$', yaxis_title='Average Validation MSE')
)
LASSO for commute times¶
- Let's instantiate a third Pipeline for the steps we want to execute.
commute_pipe_lasso = make_pipeline(commute_feature_pipe, Lasso())
commute_pipe_lasso
Pipeline(steps=[('pipeline', Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(feature_names_out='one-to-one', func=<function <lambda> at 0x15ba2dea0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])]))])), ('lasso', Lasso())])
- Now, since we need to choose the regularization penalty, $\lambda$, we'll fit a GridSearchCV instance with a hyperparameter grid.
lambdas = 10.0 ** np.arange(-10, 15)
hyperparams = {
'lasso__alpha': lambdas
}
commute_model_lasso = GridSearchCV(
commute_pipe_lasso,
param_grid = hyperparams,
scoring='neg_mean_squared_error',
cv=10
)
commute_model_lasso.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=Pipeline(steps=[('pipeline', Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(feature_names_out='one-to-one', func=<function <lambda> at 0x15ba2dea0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])]))])), ('lasso', Lasso())]), param_grid={'lasso__alpha': array([1.e-10, 1.e-09, 1.e-08, ..., 1.e+12, 1.e+13, 1.e+14])}, scoring='neg_mean_squared_error')
- Which $\lambda$ did it choose?
On its own, this value of $\lambda$ doesn't really tell us anything.
commute_model_lasso.best_params_
{'lasso__alpha': 0.1}
Run the cell below to set up the next slide.
commute_results = pd.concat([
util.display_commute_coefs(commute_model_ols),
util.display_commute_coefs(commute_model_ridge.best_estimator_),
util.display_commute_coefs(commute_model_lasso.best_estimator_)
], axis=1)
commute_results.columns = ['ols', 'ridge', 'lasso']
Comparing coefficients across models¶
- What do the resulting coefficients look like in all three models?
display_df(commute_results, rows=22)
| feature | ols | ridge | lasso |
|---|---|---|---|
| intercept | 460.31 | 214.15 | 2.54e+02 |
| polynomialfeatures__departure_hour | -94.79 | -0.71 | -2.10e+01 |
| polynomialfeatures__departure_hour^2 | 6.80 | -4.63 | -1.70e+00 |
| polynomialfeatures__departure_hour^3 | -0.14 | 0.31 | 1.81e-01 |
| onehotencoder__day_Mon | -0.61 | -5.74 | -2.70e+00 |
| onehotencoder__day_Thu | 13.30 | 6.04 | 9.00e+00 |
| onehotencoder__day_Tue | 11.19 | 5.52 | 8.68e+00 |
| onehotencoder__day_Wed | 5.73 | -0.46 | 0.00e+00 |
| onehotencoder__month_December | 8.90 | 2.82 | 4.06e+00 |
| onehotencoder__month_February | -5.33 | -7.14 | -5.81e+00 |
| onehotencoder__month_January | 1.93 | 0.39 | 0.00e+00 |
| onehotencoder__month_July | 2.46 | 0.44 | 0.00e+00 |
| onehotencoder__month_June | 6.28 | 4.45 | 5.14e+00 |
| onehotencoder__month_March | -0.76 | -1.70 | -8.17e-01 |
| onehotencoder__month_May | 9.36 | 4.95 | 5.57e+00 |
| onehotencoder__month_November | 1.40 | -1.81 | -0.00e+00 |
| onehotencoder__month_October | 2.06 | 0.22 | 0.00e+00 |
| onehotencoder__month_September | -3.20 | 0.05 | -0.00e+00 |
| pipeline__day_of_month_Week 2 | 0.91 | 1.39 | 3.23e-01 |
| pipeline__day_of_month_Week 3 | 6.30 | 4.70 | 4.57e+00 |
| pipeline__day_of_month_Week 4 | 0.28 | -0.20 | -0.00e+00 |
| pipeline__day_of_month_Week 5 | 2.09 | 0.76 | 4.78e-03 |
- The coefficients in the OLS model tend to be the largest in magnitude.
- In the ridge model, the coefficients are all generally small, but none are 0.
- In the LASSO model, many coefficients are 0 exactly.
Comparing training and test errors across models¶
model_dict = {'ols': commute_model_ols, 'ridge': commute_model_ridge, 'lasso': commute_model_lasso}
for model in model_dict:
display(Markdown(f'#### {model.upper()} model for commute times'))
display(Markdown(f'Training error: {mean_squared_error(y_train, model_dict[model].predict(X_train))}<br>Test error: {mean_squared_error(y_test, model_dict[model].predict(X_test))}'))
- The best-fitting LASSO model seems to have a lower training and testing MSE than the best-fitting ridge model.
- But, in general, sometimes LASSO performs better on unseen data, and sometimes ridge does. Cross-validate!
Sometimes, machine learning practitioners say "there's no free lunch" – there's no universal, always-best technique for making predictions; it always depends on the specific data you have.
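- To follow our own advice, here's a sketch that cross-validates the two tuned pipelines on the training set, rather than relying on a single test-set comparison. It assumes the fitted commute_model_ridge and commute_model_lasso objects from above.
# A sketch: comparing the tuned ridge and LASSO pipelines via 5-fold cross-validation.
from sklearn.model_selection import cross_val_score
for name, grid in [('ridge', commute_model_ridge), ('lasso', commute_model_lasso)]:
    mses = -cross_val_score(grid.best_estimator_, X_train, y_train,
                            cv=5, scoring='neg_mean_squared_error')
    print(f'{name}: average validation MSE = {mses.mean():.2f}')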
Standardize when regularizing!¶
- As we discussed a few lectures ago, by standardizing our features, we bring them all to the same scale.
- Standardizing features in ordinary least squares doesn't change our model's performance; rather, it impacts the interpretability of the coefficients.
- But, when regularizing, we're penalizing the sizes of the coefficients, which can be on wildly different scales if the features are on different scales.
- So, when regularizing a linear model, you should standardize the features first, so the coefficients for all features are on the same scale, and are penalized equally.
# In other words, commute_feature_pipe should've been this!
make_pipeline(commute_feature_pipe, StandardScaler())
Pipeline(steps=[('pipeline', Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first', handle_unknown='ignore'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(feature_names_out='one-to-one', func=<function <lambda> at 0x15ba2dea0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])]))])), ('standardscaler', StandardScaler())])
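- Concretely, a regularized commute-time pipeline with standardization might look like the sketch below; 'ridge__alpha' still works as the hyperparameter name, since make_pipeline names the Ridge step 'ridge'. This is only a sketch: depending on your encoder settings, you may need OneHotEncoder(sparse_output=False) or StandardScaler(with_mean=False), since StandardScaler can't center sparse matrices.
# A sketch of a ridge pipeline that standardizes the features before regularizing.
commute_pipe_ridge_std = make_pipeline(commute_feature_pipe, StandardScaler(), Ridge())
commute_model_ridge_std = GridSearchCV(
    commute_pipe_ridge_std,
    param_grid={'ridge__alpha': 10.0 ** np.arange(-10, 15)},
    scoring='neg_mean_squared_error',
    cv=10
)
# commute_model_ridge_std.fit(X_train, y_train) would then tune lambda as before.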
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have about regularization in general?
Gradient descent intuition¶
Minimizing empirical risk¶
- Repeatedly, we've been tasked with minimizing the value of empirical risk functions.
Why? To help us find the best model parameters, $h^*$ or $w^*$, which help us make the best predictions!
- We've minimized empirical risk functions in various ways.
Minimizing arbitrary functions¶
- Assume $f(w)$ is some differentiable function.
For now, we'll assume $f$ takes in a single number, $w$, as input and returns a single number as its output.
- When tasked with minimizing $f(w)$, our general strategy has been to:
- Find $\frac{df}{dw}(w)$, the derivative of $f$.
- Find the input $w^*$ such that $\frac{df}{dw}(w^*) = 0$.
- However, there are cases where we can find $\frac{df}{dw}(w)$, but it is either difficult or impossible to solve $\frac{df}{dw}(w^*) = 0$.
- Then what?
util.draw_f()
What does the derivative of a function tell us?¶
- Goal: Given a differentiable function $f(w)$, find the input $w^*$ that minimizes $f(w)$.
- What does $\frac{d}{dw} f(w)$ mean?
from ipywidgets import interact
interact(util.show_tangent, w0=(-1.5, 1.5));
Let's go hiking!¶
- Suppose you're at the top of a mountain 🏔️ and need to get to the bottom.
- Further, suppose it's really cloudy ☁️, meaning you can only see a few feet around you.
- How would you get to the bottom?
Searching for the minimum¶
- Suppose we're given an initial guess for a value of $w$ that minimizes $f(w)$.
- If the slope of the tangent line at $f(w)$ is positive 📈:
- Increasing $w$ increases $f$.
- This means the minimum must be to the left of the point $(w, f(w))$.
- Solution: Decrease $w$ ⬇️.
- The steeper the slope is, the further we must be from the minimum – so, the steeper the slope, the quicker we should decrease $w$!
Searching for the minimum¶
- Suppose we're given an initial guess for a value of $w$ that minimizes $f(w)$.
- If the slope of the tangent line at $f(w)$ is negative 📉:
- Increasing $w$ decreases $f$.
- This means the minimum must be to the right of the point $(w, f(w))$.
- Solution: Increase $w$ ⬆️.
- The steeper the slope is, the further we must be from the minimum – so, the steeper the slope, the quicker we should increase $w$!
Intuition¶
- To minimize $f(w)$, start with an initial guess for the minimizing input, $w^{(0)}$.
- Where do we go next?
- If $\frac{df}{dw}(w^{(0)}) > 0$, decrease $w^{(0)}$.
- If $\frac{df}{dw}(w^{(0)}) < 0$, increase $w^{(0)}$.
- One way to accomplish this is the update rule:
$$w^{(1)} = w^{(0)} - \frac{df}{dw}\left(w^{(0)}\right)$$
- A consequence of the above update rule: the larger $\frac{df}{dw}$ is, the bigger a step we take!
This matches our intuition from the previous few slides – the further we are from the minimum, the bigger of a step we should take!
Gradient descent¶
- To minimize a differentiable function $f$:
- Pick a positive number, $\alpha$. This number is called the learning rate, or step size.
Think of $\alpha$ as a hyperparameter of the minimization process.
- Pick an initial guess, $w^{(0)}$.
- Then, repeatedly update your guess using the update rule:
$$w^{(t+1)} = w^{(t)} - \alpha \frac{df}{dw}\left(w^{(t)}\right)$$
- Repeat this process until convergence – that is, when $w$ doesn't change much from iteration to iteration.
- This procedure is called gradient descent.
What is gradient descent?¶
- Gradient descent is a numerical method for finding the input to a function $f$ that minimizes the function.
- It is called gradient descent because the gradient is the extension of the derivative to functions of multiple variables.
- A numerical method is a technique for approximating the solution to a mathematical problem, often by using the computer.
- Gradient descent is widely used in machine learning to train models, from linear regression to neural networks and transformers (including ChatGPT)!
In machine learning, we use gradient descent to minimize empirical risk when we can't minimize it by hand, which is the case for most more sophisticated models.
Implementing gradient descent¶
- In practice, we typically don't implement gradient descent ourselves – we rely on existing implementations of it. But, we'll implement it here ourselves to understand what's going on.
- Let's start with an initial guess $w^{(0)} = 0$ and a learning rate $\alpha = 0.01$.
w = 0
for i in range(50):
print(round(w, 4), round(util.f(w), 4))
w = w - 0.01 * util.df(w)
0 -9 -0.02 -9.042 -0.042 -9.0927 -0.0661 -9.1537 -0.0925 -9.2267 -0.1214 -9.3135 -0.1527 -9.4158 -0.1866 -9.5347 -0.2229 -9.6708 -0.2615 -9.8235 -0.302 -9.9909 -0.344 -10.1687 -0.3867 -10.3513 -0.4293 -10.5311 -0.4709 -10.7001 -0.5104 -10.8511 -0.547 -10.9789 -0.58 -11.0811 -0.6089 -11.1586 -0.6335 -11.2141 -0.654 -11.2521 -0.6706 -11.277 -0.6839 -11.2927 -0.6943 -11.3023 -0.7023 -11.308 -0.7085 -11.3113 -0.7131 -11.3132 -0.7166 -11.3143 -0.7193 -11.3149 -0.7213 -11.3153 -0.7227 -11.3155 -0.7238 -11.3156 -0.7247 -11.3156 -0.7253 -11.3157 -0.7257 -11.3157 -0.726 -11.3157 -0.7263 -11.3157 -0.7265 -11.3157 -0.7266 -11.3157 -0.7267 -11.3157 -0.7268 -11.3157 -0.7268 -11.3157 -0.7269 -11.3157 -0.7269 -11.3157 -0.7269 -11.3157 -0.7269 -11.3157 -0.727 -11.3157 -0.727 -11.3157 -0.727 -11.3157 -0.727 -11.3157
- We see that pretty quickly, $w^{(i)}$ converges to $-0.727$!
Visualizing $w^{(0)} = 0, \alpha = 0.01$¶
util.minimizing_animation(w0=0, alpha=0.01)
Visualizing $w^{(0)} = 1.1, \alpha = 0.01$¶
What if we start with a different initial guess?
util.minimizing_animation(w0=1.1, alpha=0.01)
Visualizing $w^{(0)} = 0, \alpha = 0.1$¶
What if we use a different learning rate?
util.minimizing_animation(w0=0, alpha=0.1)
Visualizing $w^{(0)} = 0, \alpha = 1$¶
Some learning rates are so large that the values of $w^{(t)}$ explode towards infinity! Watch what happens when we use a learning rate of 1:
w = 0
for i in range(50):
    print(round(w, 4), round(util.f(w), 4))  # Current guess and f evaluated there.
    w = w - 1 * util.df(w)                   # The same update rule, now with alpha = 1.
0 -9
-2 55
148 2395575055
-64768502 87988399093209215258221002525055
...
[output truncated: the values of w grow astronomically large until Python refuses to print them]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[59], line 3
      1 w = 0
      2 for i in range(50):
----> 3     print(round(w, 4), round(util.f(w), 4))
      4     w = w - 1 * util.df(w)

ValueError: Exceeds the limit (4300) for integer string conversion; use sys.set_int_max_str_digits() to increase the limit
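When the iterates explode like this, the usual fix is simply to pick a smaller learning rate. If you're experimenting, one simple, purely illustrative safeguard is to bail out as soon as the guesses leave a reasonable range; the `1e6` threshold below is an arbitrary choice, not something from the lecture:
w = 0
for i in range(50):
    if abs(w) > 1e6:  # The guesses are exploding: the learning rate is too large.
        print(f'Diverged after {i} iterations; try a smaller learning rate.')
        break
    w = w - 1 * util.df(w)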
Gradient descent and empirical risk minimization¶
- While gradient descent can minimize other kinds of differentiable functions, its most common use case is in minimizing empirical risk.
- For example, consider:
- The constant model, $H(x) = h$.
- The dataset $-4, -2, 2, 4$.
- The initial guess $h_0 = 4$ and the learning rate $\alpha = \frac{1}{4}$.
- Exercise: Find $h_1$ and $h_2$.
- See the annotated slides for the solution, or check your work against the sketch below!
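Here's a minimal sketch for checking your work, assuming the empirical risk being minimized is mean squared error, $R(h) = \frac{1}{n}\sum_{i=1}^n (y_i - h)^2$ (the annotated slides are the authoritative source for the setup; the derivative below follows from that assumption):
import numpy as np

y = np.array([-4, -2, 2, 4])

def dR(h):
    # Derivative of mean squared error for the constant model:
    # d/dh [(1/n) * sum((y_i - h)^2)] = -(2/n) * sum(y_i - h).
    return -2 * np.mean(y - h)

h = 4          # Initial guess, h_0.
alpha = 1 / 4  # Learning rate.
for i in [1, 2]:
    h = h - alpha * dR(h)  # Gradient descent update.
    print(f'h_{i} = {h}')
Under this assumed loss, $\frac{dR}{dh}(h) = 2(h - \bar{y})$, and since the mean of this dataset is $0$, each update with $\alpha = \frac{1}{4}$ simply halves the current guess.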
Lingering questions¶
- When is gradient descent guaranteed to converge to a global minimum? What kinds of functions work well with gradient descent?
- How do we choose a step size?
- How do we use gradient descent to minimize functions of multiple variables, e.g. mean squared error in multiple linear regression,
$$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2?$$
A small sketch of this appears after this list.
- Question: Why can't we use gradient descent to find $\vec{w}_\text{LASSO}^*$?
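As a preview of that multivariable case, here's a minimal sketch, assuming we minimize mean squared error, whose gradient is $\nabla R_\text{sq}(\vec{w}) = -\frac{2}{n} X^T (\vec{y} - X\vec{w})$. The synthetic `X` and `y`, and the choices of `alpha` and the iteration count, are made up for illustration:
import numpy as np

np.random.seed(42)
X = np.random.normal(size=(100, 3))                  # Hypothetical design matrix.
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + np.random.normal(0, 0.1, size=100)  # Hypothetical targets.

def grad_R(w):
    # Gradient of mean squared error: -(2/n) X^T (y - Xw).
    return -(2 / len(y)) * X.T @ (y - X @ w)

w = np.zeros(3)  # Initial guess, w^(0).
alpha = 0.1      # Learning rate.
for _ in range(1000):
    w = w - alpha * grad_R(w)  # Same update rule, now applied to a whole vector of weights.
print(w)                       # Should end up close to w_true.
The update rule is exactly the same as before; it just moves an entire vector of weights at once, in the direction opposite the gradient.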