from lec_utils import *
import lec19_util as util
Announcements 📣¶
- Homework 9 is due on Monday, November 11th.
- The Portfolio Homework will be released by tomorrow.
- Homework 8 solutions can be found in #282 on Ed.
Agenda¶
- Pipelines.
- Generalization.
- Train-test splits.
- Hyperparameters.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
Pipelines¶
Loading the data¶
- We'll start with our trusty commute times dataset.
df = pd.read_csv('data/commute-times.csv')
df['day_of_month'] = pd.to_datetime(df['date']).dt.day
df['month'] = pd.to_datetime(df['date']).dt.month_name()
df.head()
 | date | day | home_departure_time | home_departure_mileage | ... | work_departure_time_hr | mileage_to_home | day_of_month | month
---|---|---|---|---|---|---|---|---|---
0 | 5/15/2023 | Mon | 2023-05-15 10:49:00 | 15873.0 | ... | 17.17 | 53.0 | 15 | May |
1 | 5/16/2023 | Tue | 2023-05-16 07:45:00 | 15979.0 | ... | NaN | NaN | 16 | May |
2 | 5/22/2023 | Mon | 2023-05-22 08:27:00 | 50407.0 | ... | 15.90 | 54.0 | 22 | May |
3 | 5/23/2023 | Tue | 2023-05-23 07:08:00 | 50535.0 | ... | NaN | NaN | 23 | May |
4 | 5/30/2023 | Tue | 2023-05-30 09:09:00 | 50664.0 | ... | 17.12 | 54.0 | 30 | May |
5 rows × 20 columns
- Our goal, as always, is to predict commute time in `'minutes'`:
df['minutes']
0 68.0 1 94.0 2 63.0 ... 62 68.0 63 90.0 64 83.0 Name: minutes, Length: 65, dtype: float64
- The main numerical feature we have is `'departure_hour'`.
(
df
.plot(kind='scatter', x='departure_hour', y='minutes')
.update_layout(xaxis_title='Home Departure Time (AM)',
yaxis_title='Minutes',
title='Commuting Time vs. Home Departure Time')
)
- Last class, we used transformer classes to one hot encode `'day'` and `'month'`. We'll look at how we can easily use these columns – and more! – as inputs to a linear model that predicts commute times.
So far, we've used transformers (like `OneHotEncoder` and `StandardScaler`) for feature engineering and models (like `LinearRegression`) for prediction. We can combine these steps into a single `Pipeline`.
Pipelines in `sklearn`¶
- From `sklearn`'s documentation:
`Pipeline` allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.
Intermediate steps of the pipeline must be "transforms", that is, they must implement `fit` and `transform` methods. The final estimator only needs to implement `fit`.
- General template: `pl = Pipeline([trans_1, trans_2, ..., model])`.
Note that the `model` is optional, meaning you can have Pipelines of just transformers.
Each element in the list must be a tuple: the first item in the tuple should be a "name" for the step, and the second item should be a transformer or estimator instance.
- Once a Pipeline is instantiated, you can fit all steps (transformers and model) using `pl.fit(X, y)`.
- To make predictions using raw, untransformed data, use `pl.predict(X)`.
Our first Pipeline¶
- Let's build a Pipeline that:
  - One hot encodes `'day'` and `'month'`.
  - Fits a regression model on just the one hot encoded data.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
pl = Pipeline([
('one-hot', OneHotEncoder(drop='first')),
('lin-reg', LinearRegression())
])
- Now that `pl` is instantiated, we `fit` it the same way we would fit the individual steps.
pl.fit(X=df[['day', 'month']], y=df['minutes'])
Pipeline(steps=[('one-hot', OneHotEncoder(drop='first')), ('lin-reg', LinearRegression())])
- Now, to make predictions using raw data, all we need to do is use `pl.predict`:
pl.predict([['Mon', 'November']])
/Users/surajrampure/miniforge3/envs/pds/lib/python3.10/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but OneHotEncoder was fitted with feature names
array([68.61])
- `pl` performs both feature transformation and prediction with just a single call to `predict`!
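The warning above appears because we passed `pl.predict` a plain list, while the `OneHotEncoder` inside was fit on a DataFrame with column names. A minimal sketch of the same prediction that avoids the warning, passing a one-row DataFrame with matching column names:

# Same prediction as above, but using a DataFrame whose columns match the ones
# the Pipeline was fit with, so sklearn doesn't warn about feature names.
pl.predict(pd.DataFrame([{'day': 'Mon', 'month': 'November'}]))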
- We can access individual "steps" of a `Pipeline` through the `named_steps` attribute:
pl.named_steps
{'one-hot': OneHotEncoder(drop='first'), 'lin-reg': LinearRegression()}
pl.named_steps['one-hot'].transform(df[['day', 'month']]).toarray()
array([[1., 0., 0., ..., 0., 0., 0.], [0., 0., 1., ..., 0., 0., 0.], [1., 0., 0., ..., 0., 0., 0.], ..., [1., 0., 0., ..., 0., 0., 0.], [0., 0., 1., ..., 0., 0., 0.], [0., 1., 0., ..., 0., 0., 0.]])
pl.named_steps['one-hot'].get_feature_names_out()
array(['day_Mon', 'day_Thu', 'day_Tue', 'day_Wed', 'month_December', 'month_February', 'month_January', 'month_July', 'month_June', 'month_March', 'month_May', 'month_November', 'month_October', 'month_September'], dtype=object)
pl.named_steps['lin-reg'].coef_
array([ 1.65, 8.35, 13.2 , 2.68, -2.1 , 6.06, -4.44, -3.08, 9.14, 8.62, 6.24, 2.98, -5.6 , 3.29])
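A convenient way (a small sketch, not one of the original cells) to line up each one hot encoded feature name with its learned coefficient:

# Pair each one hot encoded feature name with its corresponding coefficient.
pd.Series(pl.named_steps['lin-reg'].coef_,
          index=pl.named_steps['one-hot'].get_feature_names_out())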
- `pl` also has a `score` method, the same way a fit `LinearRegression` instance does:
# Why is this so low?
pl.score(df[['day', 'month']], df['minutes'])
0.2973846534941993
More sophisticated Pipelines¶
- In the previous example, we one hot encoded every input column, and didn't use any columns that were originally numeric.
That's not realistic or useful!
# Why is this so low?
pl.score(df[['day', 'month']], df['minutes'])
0.2973846534941993
- What if we want to perform different transformations on different columns, or include some columns without transformation?
- Or, what if we want to perform multiple transformations to the same column?
- There are a variety of useful functions/classes we can use:
Name | Functionality
---|---
`ColumnTransformer` | Allows us to transform different columns with different transformations. Instantiate a `ColumnTransformer` using a list of tuples, where: • the first element is a "name" we choose for the transformer, • the second element is a transformer instance (e.g. `OneHotEncoder()`), • the third element is a list of relevant column names.
`FunctionTransformer` | Allows us to create a custom transformation (similar to using `.apply` on a DataFrame's columns).
`make_pipeline` | Helper function for creating a `Pipeline` (slightly less verbose). Note that you can make a Pipeline of just transformations, if you want to use multiple transformations on the same column!
`make_column_transformer` | Helper function for creating a `ColumnTransformer`.
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import FunctionTransformer, make_pipeline
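Before using the helper functions, here's a minimal sketch of what instantiating a `ColumnTransformer` directly looks like, using the (name, transformer, columns) tuple format from the table above. The step names `'poly'` and `'ohe'` are labels we're choosing just for illustration; the preprocessing we actually build below uses `make_column_transformer` instead.

# A sketch of the verbose ColumnTransformer syntax; make_column_transformer
# (used below) builds something similar but picks the step names for us.
from sklearn.preprocessing import PolynomialFeatures
example_ct = ColumnTransformer([
    ('poly', PolynomialFeatures(3), ['departure_hour']),
    ('ohe', OneHotEncoder(drop='first'), ['day', 'month']),
], remainder='drop')
example_ct.fit_transform(df[['departure_hour', 'day', 'month']]).shape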
The plan¶
- Before writing any code, let's plan out how we want to transform our data.
df[['departure_hour', 'day', 'month', 'day_of_month']]
 | departure_hour | day | month | day_of_month
---|---|---|---|---
0 | 10.82 | Mon | May | 15 |
1 | 7.75 | Tue | May | 16 |
2 | 8.45 | Mon | May | 22 |
... | ... | ... | ... | ... |
62 | 7.58 | Mon | March | 4 |
63 | 7.45 | Tue | March | 5 |
64 | 7.60 | Thu | March | 7 |
65 rows × 4 columns
- `'departure_hour'`: Create degree 2 and degree 3 polynomial features.
- `'day'`: One hot encode.
- `'month'`: One hot encode.
- `'day_of_month'`: Separate into five weeks, then one hot encode.
Days 1 to 7 are Week 1, days 8 to 14 are Week 2, and so on.
- After all of these transformations, we'll fit a `LinearRegression` object – i.e., fit a linear model.
- `'departure_hour'`: Create degree 2 and degree 3 polynomial features.
- `'day'`: One hot encode.
- `'month'`: One hot encode.
- `'day_of_month'`: Separate into five weeks, then one hot encode.
- Let's start with `'day_of_month'`, since it seems to involve the most complicated transformations.
- First, let's figure out how to extract the week number given the day of the month.
example_vals = df['day_of_month'].tail()
example_vals
60 27 61 29 62 4 63 5 64 7 Name: day_of_month, dtype: int32
# Expression to convert from day of month to Week #.
'Week ' + ((example_vals - 1) // 7 + 1).astype(str)
60 Week 4 61 Week 5 62 Week 1 63 Week 1 64 Week 1 Name: day_of_month, dtype: object
# The function that FunctionTransformer takes in
# itself takes in a Series/DataFrame, not a single element!
# Here, we're having that function return a new Series/DataFrame,
# depending on what's passed in to .transform (experiment on your own).
week_converter = FunctionTransformer(lambda s: 'Week ' + ((s - 1) // 7 + 1).astype(str))
week_converter.transform(df[['day_of_month']])
 | day_of_month
---|---
0 | Week 3 |
1 | Week 3 |
2 | Week 4 |
... | ... |
62 | Week 1 |
63 | Week 1 |
64 | Week 1 |
65 rows × 1 columns
- We need to apply two consecutive transformations to `'day_of_month'`, which calls for a Pipeline.
day_of_month_transformer = make_pipeline(week_converter, OneHotEncoder(drop='first'))
day_of_month_transformer
Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function <lambda> at 0x1661f96c0>)), ('onehotencoder', OneHotEncoder(drop='first'))])
day_of_month_transformer.fit_transform(df[['day_of_month']]).toarray()
array([[0., 1., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]])
- So, `day_of_month_transformer` does everything we need to transform `'day_of_month'`.
- `'departure_hour'`: Create degree 2 and degree 3 polynomial features.
- `'day'`: One hot encode.
- `'month'`: One hot encode.
- `'day_of_month'`: Separate into five weeks, then one hot encode. ✅ Use `day_of_month_transformer`.
- Every other column only needs a single transformation. We can specify the transformations needed for each column using `make_column_transformer`.
from sklearn.preprocessing import PolynomialFeatures
preprocessing = make_column_transformer(
(PolynomialFeatures(3), ['departure_hour']),
(OneHotEncoder(drop='first'), ['day', 'month']),
(day_of_month_transformer, ['day_of_month']),
remainder='drop'
)
- Now, we're ready for a final Pipeline!
model = make_pipeline(preprocessing, LinearRegression())
model
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function <lambda> at 0x1661f96c0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])])), ('linearregression', LinearRegression())])
model.fit(X=df[['departure_hour', 'day', 'month', 'day_of_month']], y=df['minutes'])
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function <lambda> at 0x1661f96c0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])])), ('linearregression', LinearRegression())])
- Once our Pipeline is fit, we can use it to make predictions!
What's the predicted commute time if I leave at 8:30AM on a Tuesday in November, which happens to be the 5th of the month?
model.predict(pd.DataFrame([{
'departure_hour': 8.5,
'day': 'Tue',
'month': 'November',
'day_of_month': 5
}]))
array([77.48])
Activity
How many columns does the final design matrix that `model` creates have? If you write code to determine the answer, make sure you can walk through the steps over the past few slides to figure out why the answer is what it is.
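One way to check your answer after reasoning through it by hand (a sketch, not part of the original activity): transform the data with the fitted preprocessing step and count the resulting columns.

# Number of columns in the design matrix produced by the preprocessing step.
# make_pipeline named that step 'columntransformer' (see the repr below).
model.named_steps['columntransformer'].transform(
    df[['departure_hour', 'day', 'month', 'day_of_month']]
).shape[1]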
model
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function <lambda> at 0x1661f96c0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])])), ('linearregression', LinearRegression())])
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have?
Generalization¶
What went wrong with polls in 2016? Can we trust them now?¶
- Trump's victory in 2016 came as a surprise to many, since most polls in swing states had Clinton ahead.
- Polls severely underestimated support for Trump; many voters were undecided until the last minute, and many didn't want to share that they supported Trump.
- But a more systematic issue with the polls was that:
  - college-educated voters were more likely to respond to polls.
  - college-educated voters tended to support Clinton over Trump.
  - since there are fewer college-educated voters than non-college-educated voters in the electorate, the polls over-represented Clinton supporters, so the polled support for Trump was lower than the true support.
- Read more at this CNBC article.
Motivation¶
- You and Billy are studying for an upcoming exam. You both decide to test your understanding by taking a practice exam.
Your logic: If you do well on the practice exam, you should do well on the real exam.
- You each take the practice exam once and look at the solutions afterwards.
- Your strategy: Memorize the answers to all practice exam questions, e.g. "Question 1: A; Question 2: C; Question 3: A."
- Billy's strategy: Learn high-level concepts from the solutions, e.g. "the TF-IDF of term $t$ in document $d$ is large when $t$ occurs often in $d$ but rarely overall."
- Who will do better on the practice exam? Who will probably do better on the real exam? 🧐
Evaluating the quality of a model¶
- So far, we've computed the MSE (and $R^2$) of our fit regression models on the data that we used to fit them, i.e. the training data.
This mean squared error is called the training MSE, or training error.
- We've said that Model A is better than Model B if Model A's MSE is lower than Model B's MSE.
- Remember, our training data is a sample from some population.
- Just because a model fits the training data well doesn't mean it will generalize and work well on similar, unseen samples from the same population!
Overfitting and underfitting¶
- Let's collect two samples $\{(x_i, y_i)\}$ from the same population.
np.random.seed(23) # For reproducibility.
def sample_from_pop(n=100):
x = np.linspace(-2, 3, n)
y = x ** 3 + (np.random.normal(0, 3, size=n))
return pd.DataFrame({'x': x, 'y': y})
sample_1 = sample_from_pop()
sample_2 = sample_from_pop()
- For now, let's just look at Sample 1. The relationship between $x$ and $y$ is roughly cubic; that is, $y \approx x^3$.
Remember, in reality, you won't get to see the population distribution. If you could, there'd be no need to build a model!
px.scatter(sample_1, x='x', y='y', title='Sample 1')
Polynomial regression¶
- Let's fit three polynomial models on Sample 1: degree 1, degree 3, and degree 25.
Again, we'll use the `PolynomialFeatures` transformer.
# fit_transform fits and transforms the same input.
d3 = PolynomialFeatures(3)
d3.fit_transform(np.array([1, 2, 3, 4, -2]).reshape(-1, 1))
array([[ 1., 1., 1., 1.], [ 1., 2., 4., 8.], [ 1., 3., 9., 27.], [ 1., 4., 16., 64.], [ 1., -2., 4., -8.]])
- Below, we look at our three models' predictions on Sample 1, which they were trained on.
# Look at the definition of train_and_plot in lec19_util.py if you're curious as to how the plotting works.
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_1, degs=[1, 3, 25], data_name='Sample 1')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 1')
- The degree 25 polynomial has the lowest MSE on Sample 1.
- How do the same fit polynomials look on Sample 2?
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_2, degs=[1, 3, 25], data_name='Sample 2')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 2')
- The degree 3 polynomial has the lowest MSE on Sample 2.
- Note that we didn't get to see Sample 2 when fitting our models!
- As such, it seems that the degree 3 polynomial generalizes better to unseen data than the degree 25 polynomial does.
- What if we fit a degree 1, degree 3, and degree 25 polynomial on Sample 2 as well?
util.plot_multiple_models(sample_1, sample_2, degs=[1, 3, 25])
- Key idea: Degree 25 polynomials seem to vary more when trained on different samples than degree 3 and 1 polynomials do.
Bias and variance¶
- The training data we have access to is a sample from the population. We are concerned with our model's ability to generalize and work well on different datasets drawn from the same population.
- Suppose we fit a model $H^*$ (e.g. a degree 3 polynomial) on several different datasets from a population. There are three sources of error that arise.
- Bias: The expected deviation between a predicted value and an actual value.
In other words, for a given $x_i$, how far is $H^*(x_i)$ from the true $y_i$, on average?- Low bias is good! ✅
- High bias is a sign of underfitting, i.e. that our model is too basic to capture the relationship between our features and response.
- Model variance ("variance"): The variance of a model's predictions.
In other words, for a given $x_i$, how much does $H^*(x_i)$ vary across all datasets?- Low model variance is good! ✅
- High model variance is a sign of overfitting, i.e. that our model is too complicated and is prone to fitting to the noise in our training data.
- Observation error: The error due to the random noise in the process we are trying to model (e.g. measurement error).
We can't reduce this without collecting more data!
- Here, suppose:
- The red bulls-eye represents your true weight and height 🧍.
- The dark blue darts represent predictions of your weight and height using different models that were fit using different samples drawn from the same population.
- We'd like our models to be in the top left (low bias and low variance), but in practice that's hard to achieve!
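To make model bias and variance concrete, here's a minimal simulation sketch (not one of the original cells) that estimates them for a polynomial model by repeatedly drawing new training sets with `sample_from_pop` from earlier and examining the fit models' predictions on a fixed grid of $x$-values. It assumes, as `sample_from_pop` does, that the true population relationship is $y = x^3$ plus noise.

# Estimate model bias^2 and model variance by simulation.
# Assumes sample_from_pop, PolynomialFeatures, LinearRegression, and
# make_pipeline from earlier cells are in scope.
def estimate_bias_variance(degree, n_repetitions=200):
    x_grid = pd.DataFrame({'x': np.linspace(-2, 3, 50)})
    true_y = x_grid['x'] ** 3                      # population relationship, minus noise
    preds = []
    for _ in range(n_repetitions):
        train = sample_from_pop()                  # a fresh training set each time
        pl = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        pl.fit(train[['x']], train['y'])
        preds.append(pl.predict(x_grid))
    preds = np.array(preds)                        # shape: (n_repetitions, 50)
    bias_sq = np.mean((preds.mean(axis=0) - true_y) ** 2)
    model_variance = np.mean(preds.var(axis=0))
    return bias_sq, model_variance

# Degree 25 should show noticeably higher model variance than degree 3.
estimate_bias_variance(3), estimate_bias_variance(25)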
Risk vs. empirical risk¶
- Since Lecture 14, we've minimized empirical risk to find optimal model parameters $\vec{w}^*$:
$$\vec{w}^* = \underset{\vec{w}}{\text{argmin}} \ \frac{1}{n} \sum_{i = 1}^n \big( y_i - H(x_i) \big)^2$$
- Key idea: A model that works well on past data should work well on future data, if future data looks like past data.
- What we really want is for the expected loss for a new data point $(x_{\text{new}}, y_{\text{new}})$, drawn from the same population as the training set, to be small. That is, we want $$\mathbb{E}\left[ \big( y_{\text{new}} - H(x_{\text{new}}) \big)^2 \right]$$ to be minimized. The quantity above is called risk.
- What's that fancy $\mathbb{E}$? It is the expectation operator of a random variable: it computes the average value of the random variable across its entire distribution.
For example, if $X \sim \text{Binomial}(n, p)$, then $\mathbb{E}[X] = np$.
Here, the expectation is being computed across the entire population distribution of $(x_i, y_i)$ pairs.
- In general, we don't know the entire population distribution of $x$s and $y$s, so we can't compute risk exactly. That's why we compute empirical risk!
The bias-variance decomposition¶
- Risk can be decomposed as follows:
$$\underbrace{\mathbb{E}\left[ \big( y_{\text{new}} - H(x_{\text{new}}) \big)^2 \right]}_{\text{risk}} = \text{model bias}^2 + \text{model variance} + \text{observation error}$$
Remember, this expectation $\mathbb{E}$ is over the entire population of $x$s and $y$s. In real life, we don't know what this population distribution is, so we can't put actual numbers to this.
- Key takeaway: If we care about minimizing (empirical) risk, we can equivalently try to minimize both model bias and model variance.
- If $H$ is too simple to capture the relationship between $x$s and $y$s in the population, $H$ will underfit to training sets and have high bias.
- If $H$ is overly complex, $H$ will overfit to training sets and have high variance, meaning it will change significantly from one training set to the next.
- We won't cover the proof of the decomposition here – read this for more – but note that in Homework 7, you proved a related formula for $R_\text{sq}(h)$:
$$R_\text{sq}(h) = \sigma_y^2 + \left( \bar{y} - h \right)^2$$
The bias-variance tradeoff¶
- As model variance increases, model bias tends to decrease, and vice versa.
- The graph below shows, conceptually, this tradeoff:
- As we collect more data points (i.e. as $n \uparrow$):
- Model variance decreases.
- If $H$ can exactly model the true population relationship between $x$ and $y$ (e.g. cubic), then model bias also decreases.
- If $H$ can't exactly model the true population relationship between $x$ and $y$, then model bias will remain large.
- As we add more features (i.e. as $d \uparrow$):
- Model variance increases, whether or not the feature was useful.
- Adding a useful feature decreases model bias.
- Adding a useless feature doesn't change model bias.
- Example: Suppose the actual relationship between $x$ and $y$ in the population is linear, and we fit $H$ using simple linear regression.
- Model bias = 0.
- Model variance $\propto \frac{d}{n}$.
- As $d \uparrow$, model variance $\uparrow$.
- As $n \uparrow$, model variance $\downarrow$.
Activity
Determine how each change below affects model bias and variance compared to this model:
For each change, choose all of the following that apply: increase bias, decrease bias, increase variance, decrease variance.
- Add degree 3 polynomial features.
- Add a feature of numbers chosen at random between 0 and 1.
- Collect 100 more points for the training set.
- Don’t use the `'veg'` feature.
Train-test splits¶
Avoiding overfitting¶
- We won't know whether our model has overfit to our sample (training data) unless we get to see how well it performs on a new sample from the same population.
- 💡Idea: Split our sample into a training set and test set.
- Use only the training set to fit the model (i.e. find $\vec{w}^*$).
- Use the test set to evaluate the model's error (MSE, $R^2$).
- The test set is like a new sample of data from the same population as the training data!
- Generally:
- Training error reflects bias, not variance.
- Test error reflects both bias and variance, so we need to compute it to understand the true error of our model.
Train-test split 🚆¶
- `sklearn.model_selection.train_test_split` implements a train-test split for us! 🙏🏼
- If `X` is an array/DataFrame of features and `y` is an array/Series of responses, then
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)`
randomly splits the features and responses into training and test sets, such that the test set contains 0.25 of the full dataset.
from sklearn.model_selection import train_test_split
# Read the documentation!
train_test_split?
- Let's perform a train/test split on `sample_1`, for illustration.
sample_1
 | x | y
---|---|---
0 | -2.00 | -6.00 |
1 | -1.95 | -7.33 |
2 | -1.90 | -9.18 |
... | ... | ... |
97 | 2.90 | 25.75 |
98 | 2.95 | 22.40 |
99 | 3.00 | 32.47 |
100 rows × 2 columns
X = sample_1[['x']] # DataFrame.
y = sample_1['y'] # Series.
# We don't have to choose 0.25.
# We also don't have to set a random_state;
# we've done this so that we get the same results in lecture every time.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)
- Before proceeding, let's check the sizes of `X_train` and `X_test`.
print('Rows in X_train:', X_train.shape[0])
display(X_train.head())
print('Rows in X_test:', X_test.shape[0])
display(X_test.head())
Rows in X_train: 80
 | x
---|---
85 | 2.29 |
28 | -0.59 |
8 | -1.60 |
11 | -1.44 |
63 | 1.18 |
Rows in X_test: 20
 | x
---|---
26 | -0.69 |
80 | 2.04 |
82 | 2.14 |
68 | 1.43 |
77 | 1.89 |
Example train-test split¶
- First, we'll fit a model on the training set.
- Here, we'll use a stand-alone `LinearRegression` model without a `Pipeline`, but this process would work the same if we were using a `Pipeline`.
from sklearn.metrics import mean_squared_error
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
- Let's check our model's performance on the training set first.
pred_train = model.predict(X_train)
mse_train = mean_squared_error(y_train, pred_train)
mse_train
20.85690274368493
- And the test set:
pred_test = model.predict(X_test)
mse_test = mean_squared_error(y_test, pred_test)
mse_test
20.786774387538337
- Since `mse_train` and `mse_test` are similar, it doesn't seem like our model is overfitting to the training data.
- If `mse_test` were much larger than `mse_train`, it would be evidence that our model is unable to generalize well.
Hyperparameters¶
Example: Polynomial regression¶
- We recently looked at an example of polynomial regression.
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_2, degs=[1, 3, 25], data_name='Sample 2')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 2')
- When building these models:
- We got to choose the degree of the polynomials – we chose 1, 3, and 25.
- We didn't get to choose the exact formulas for the three polynomials – their formulas were learned from data.
No matter what the data looked like, the left-most model had to look like a line, because we chose its degree in advance.
Parameters vs. hyperparameters¶
A parameter defines the relationship between variables in a model. We learn parameters from data.
- For instance, suppose we fit a degree 3 polynomial to data, and end up with:
$$H(x) = 1 - 2x + 13x^2 - 4x^3$$
- 1, -2, 13, and -4 are parameters.
- A hyperparameter is a parameter that we choose before our model is fit to the data.
- Think of hyperparameters as knobs 🎛 – we get to pick and tune them!
- Polynomial degree was a hyperparameter in the previous example, and we tried three different values: 1, 3, and 25.
- Question: How do we choose the "right" hyperparameter(s)?
Degree 3 was a better choice than degree 25, for example – but how do we systematically choose?
Training error vs. test error¶
- We know that a model's performance on a test set is a good estimate of its ability to generalize to unseen data.
- We want to find the hyperparameter that leads to the best test set performance.
- Idea:
- Come up with a list of hyperparameters to try.
- For each hyperparameter, train the model on the training set and compute its performance on the test set.
- Pick the hyperparameter with the best performance on the test set.
- Let's try this strategy on Sample 1 from our earlier example.
- We'll try to fit a polynomial model on the dataset; we'll choose the polynomial's degree from the list [1, 2, ..., 25].
Polynomial degree vs. train/test error¶
- We already performed a train-test split of `sample_1` a few slides ago.
X_train
 | x
---|---
85 | 2.29 |
28 | -0.59 |
8 | -1.60 |
... | ... |
73 | 1.69 |
40 | 0.02 |
83 | 2.19 |
80 rows × 1 columns
- Now, we'll create models with degree 1 through degree 25 polynomial features and compute their train and test errors.
train_errs = []
test_errs = []
for d in range(1, 26):
pl = make_pipeline(PolynomialFeatures(d), LinearRegression())
pl.fit(X_train, y_train)
train_errs.append(mean_squared_error(y_train, pl.predict(X_train)))
test_errs.append(mean_squared_error(y_test, pl.predict(X_test)))
errs = pd.DataFrame({'Train Error': train_errs, 'Test Error': test_errs})
- Let's look at the plots of training error vs. degree and test error vs. degree.
fig = px.line(errs.iloc[:-1])
fig.update_layout(showlegend=True, xaxis_title='Polynomial Degree', yaxis_title='Mean Squared Error')
- Training error appears to decrease as polynomial degree increases.
- Test error appears to decrease until a "valley", and then increases again.
- Here, we'd choose a degree of 3, since that degree has the lowest test error.
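As a small sketch, we could also read the best degree off of `errs` programmatically; row $i$ of `errs` corresponds to degree $i + 1$:

# Degree whose test error is smallest (row i of errs corresponds to degree i + 1).
best_degree = errs['Test Error'].idxmin() + 1
best_degree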
Training error vs. test error¶
- The pattern we saw in the previous example is true more generally.
- We pick the hyperparameter(s) at the "valley" of test error.
- Note that training error tends to underestimate test error, but it doesn't have to – i.e., it is possible for test error to be lower than training error (say, if the test set is "easier" to predict than the training set).
- The results – and the bias-variance tradeoff more generally – hold true for "classic" machine learning models, like the ones we're studying here. But in deep neural networks, this pattern is often violated; extremely complex models can have low test error as well.
This phenomenon is known as "double descent"; learn more here.
Conducting train-test splits¶
- Recall, training data is used to fit our model, and test data is used to evaluate our model.
- Question: How should we split?
- `sklearn`'s `train_test_split` splits randomly, which usually works well.
- However, if there is some element of time in the training data (say, when predicting the future price of a stock), a better split is "past" and "future"; see the sketch after this list.
- Question: How large should the split be, e.g. 90%-10% vs. 75%-25%?
- There's a tradeoff – a larger training set should lead to a "better" model, while a larger test set should lead to a better estimate of our model's ability to generalize.
- There's no "right" choice, but we usually reserve between 10% and 25% of the data for the test set.
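For illustration, here's a minimal sketch of what a chronological ("past"/"future") split could look like on our commute times data, reserving the most recent 25% of rows for testing; the `'parsed_date'` helper column is something we're adding just for this sketch:

# Sort by date, train on the earliest 75% of rows, and test on the most recent 25%.
ordered = df.assign(parsed_date=pd.to_datetime(df['date'])).sort_values('parsed_date')
cutoff = int(len(ordered) * 0.75)
train_chrono, test_chrono = ordered.iloc[:cutoff], ordered.iloc[cutoff:]
train_chrono.shape[0], test_chrono.shape[0]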
But wait...¶
- With our current strategy, we are choosing the hyperparameter that creates the model that performs best on the test set.
- As such, we are overfitting to the test set – the best hyperparameter for the test set might not be the best hyperparameter for a totally unseen dataset!
- It seems like we need another split.
- On Thursday, we'll cover the more robust solution to the problem of selecting hyperparameters: cross-validation.