from lec_utils import *
import lec17_util as util
Lecture 17
Pipelines
EECS 398: Practical Data Science, Winter 2025
practicaldsc.org • github.com/practicaldsc/wn25 • See latest announcements here on Ed
Agenda
- Brief recap: standardization.
- `OneHotEncoder` and multicollinearity.
- Pipelines.
- Generalization.
Question (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
Brief recap: standardization (see annotated slides)
`OneHotEncoder` and multicollinearity
Example: Commute times
- Let's reload our trusty commute times dataset.
df = pd.read_csv('data/commute-times.csv')
df['day_of_month'] = pd.to_datetime(df['date']).dt.day
df['month'] = pd.to_datetime(df['date']).dt.month_name()
df.head()
date | day | home_departure_time | home_departure_mileage | ... | work_departure_time_hr | mileage_to_home | day_of_month | month | |
---|---|---|---|---|---|---|---|---|---|
0 | 5/15/2023 | Mon | 2023-05-15 10:49:00 | 15873.0 | ... | 17.17 | 53.0 | 15 | May |
1 | 5/16/2023 | Tue | 2023-05-16 07:45:00 | 15979.0 | ... | NaN | NaN | 16 | May |
2 | 5/22/2023 | Mon | 2023-05-22 08:27:00 | 50407.0 | ... | 15.90 | 54.0 | 22 | May |
3 | 5/23/2023 | Tue | 2023-05-23 07:08:00 | 50535.0 | ... | NaN | NaN | 23 | May |
4 | 5/30/2023 | Tue | 2023-05-30 09:09:00 | 50664.0 | ... | 17.12 | 54.0 | 30 | May |
5 rows × 20 columns
- We'll focus specifically on the `'day'` and `'month'` columns.
df[['day', 'month']]
day | month | |
---|---|---|
0 | Mon | May |
1 | Tue | May |
2 | Mon | May |
... | ... | ... |
62 | Mon | March |
63 | Tue | March |
64 | Thu | March |
65 rows × 2 columns
Example transformer: `OneHotEncoder`
- Last class, we had to manually one hot encode the `'day'` column. Let's figure out how to one hot encode it automatically, along with the new `'month'` column.
df[['day', 'month']]
day | month | |
---|---|---|
0 | Mon | May |
1 | Tue | May |
2 | Mon | May |
... | ... | ... |
62 | Mon | March |
63 | Tue | March |
64 | Thu | March |
65 rows × 2 columns
- First, we need to import the relevant class from `sklearn.preprocessing`.
from sklearn.preprocessing import OneHotEncoder
- Like with `StandardScaler`, we need to instantiate and fit our `OneHotEncoder` instance before it can transform anything.
ohe = OneHotEncoder()
ohe.fit(df[['day', 'month']])
OneHotEncoder()
- Once we've fit, when we use the `transform` method, we get a result we might not expect.
ohe.transform(df[['day', 'month']])
<Compressed Sparse Row sparse matrix of dtype 'float64' with 130 stored elements and shape (65, 16)>
- Since the resulting matrix is sparse (most of its elements are 0), `sklearn` uses a more efficient representation than a regular `numpy` array. We can convert it to a regular (dense) array:
ohe.transform(df[['day', 'month']]).toarray()
array([[0., 1., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 1., 0., ..., 0., 0., 0.], ..., [0., 1., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 1., ..., 0., 0., 0.]])
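- If you'd rather avoid the sparse representation entirely, newer versions of `sklearn` (1.2 and later) also accept a `sparse_output=False` argument; here's a minimal sketch, assuming one of those versions is installed.
# A sketch, assuming sklearn >= 1.2, where the argument is named sparse_output
# (older versions used sparse=False instead).
ohe_dense = OneHotEncoder(sparse_output=False)
ohe_dense.fit(df[['day', 'month']])
ohe_dense.transform(df[['day', 'month']])  # A regular (dense) numpy array; no .toarray() needed.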
- The column names from `df[['day', 'month']]` don't appear in the output above. We can use the `get_feature_names_out` method on `ohe` to access the names and order of the one hot encoded columns, though:
ohe.get_feature_names_out()
array(['day_Fri', 'day_Mon', 'day_Thu', 'day_Tue', 'day_Wed', 'month_August', 'month_December', 'month_February', 'month_January', 'month_July', 'month_June', 'month_March', 'month_May', 'month_November', 'month_October', 'month_September'], dtype=object)
pd.DataFrame(ohe.transform(df[['day', 'month']]).toarray(),
             columns=ohe.get_feature_names_out())  # If we need a DataFrame back, for some reason.
day_Fri | day_Mon | day_Thu | day_Tue | ... | month_May | month_November | month_October | month_September | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
62 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
63 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
64 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
65 rows × 16 columns
- Usually, we won't perform all of these intermediate steps, since the `OneHotEncoder` will be part of a larger Pipeline.
Example: Heights and weights
- We now know how to use `OneHotEncoder`.
- To illustrate a mathematical issue involving one hot encoding, let's load in another dataset, this time containing the weights and heights of 25,000 18 year olds, taken from here.
people = pd.read_csv('data/heights-weights.csv').drop(columns=['Index'])
people.head()
Height (Inches) | Weight (Pounds) | |
---|---|---|
0 | 65.78 | 112.99 |
1 | 71.52 | 136.49 |
2 | 69.40 | 153.03 |
3 | 68.22 | 142.34 |
4 | 67.79 | 144.30 |
people.plot(kind='scatter', x='Height (Inches)', y='Weight (Pounds)',
title='Weight vs. Height for 25,000 18 Year Olds')
Motivating example
- Suppose we fit a simple linear regression model that uses height in inches, $x$, to predict weight in pounds, $y$.
X = people[['Height (Inches)']]
y = people['Weight (Pounds)']
from sklearn.linear_model import LinearRegression
people_one_feat = LinearRegression()
people_one_feat.fit(X, y)
LinearRegression()
- $w_0^*$ and $w_1^*$ are shown below, along with the model's MSE on the data we used to train it.
We call this the model's training MSE.
people_one_feat.intercept_, people_one_feat.coef_
(-82.57574306454093, array([3.08]))
from sklearn.metrics import mean_squared_error
mean_squared_error(y, people_one_feat.predict(X))
101.58853248632849
An added feature
- Now, suppose we fit another regression model that uses height in inches AND height in feet to predict weight.
people['Height (Feet)'] = people['Height (Inches)'] / 12 # 12 inches = 1 foot.
X2 = people[['Height (Inches)', 'Height (Feet)']]
X2
Height (Inches) | Height (Feet) | |
---|---|---|
0 | 65.78 | 5.48 |
1 | 71.52 | 5.96 |
2 | 69.40 | 5.78 |
... | ... | ... |
24997 | 64.70 | 5.39 |
24998 | 67.53 | 5.63 |
24999 | 68.88 | 5.74 |
25000 rows × 2 columns
people_two_feat = LinearRegression()
people_two_feat.fit(X2, y)
LinearRegression()
- What are $w_0^*$, $w_1^*$, $w_2^*$, and the model's MSE?
people_two_feat.intercept_, people_two_feat.coef_
(-82.59155502376602, array([-2.32e+11, 2.78e+12]))
mean_squared_error(y, people_two_feat.predict(X2))
101.58844271417476
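- As a quick sanity check (a sketch, not part of the original notebook), we can confirm that the two models' predictions are essentially identical, even though their coefficients look nothing alike.
# Sketch: the largest absolute difference between the two models' predictions
# should be tiny (limited only by floating point precision).
import numpy as np
np.abs(people_one_feat.predict(X) - people_two_feat.predict(X2)).max()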
- Observation: The intercept is roughly the same as before (about $-82.6$), as is the MSE. However, the coefficients on `'Height (Inches)'` and `'Height (Feet)'` are massive in size!
- It should be unsurprising that the MSE is the same, because the span of the design matrix is the same. So, the best predictions should be the same, too.
- But what's going on with the coefficients?
Redundant features
- Suppose in the first model, $w_0^* = -80$ and $w_1^* = 3$.
- In the second model, we have:
$$\text{predicted weight}_i = w_0 + w_1 \cdot (\text{height in inches}_i) + w_2 \cdot (\text{height in feet}_i)$$
- But, since $\text{height in feet}_i = \frac{\text{height in inches}_i}{12}$:
$$\text{predicted weight}_i = w_0 + \left( w_1 + \frac{w_2}{12} \right) \cdot (\text{height in inches}_i)$$
- In the first model, we already found the "best" intercept ($-80$) and slope ($3$) in a linear model that uses height in inches to predict weight.
- So, as long as $w_1^* + \frac{w_2^*}{12} = 3$ in the second model, the second model's predictions will be the same as the first, and hence they will also minimize MSE.
Infinitely many parameter choices
- Issue: There are an infinite number of $w_1^*$ and $w_2^*$ that satisfy $w_1^* + \frac{w_2^*}{12} = 3$!
- The two hypothesis functions evaluated below look very different, but actually make the same predictions. `model.coef_` could return either set of coefficients, or any other of the infinitely many options.
- But neither set of coefficients has any meaning!
-80 + 5 * people['Height (Inches)'] - 24 * people['Height (Feet)']
0 117.35 1 134.55 2 128.20 ... 24997 114.10 24998 122.59 24999 126.63 Length: 25000, dtype: float64
-80 - 1 * people['Height (Inches)'] + 48 * people['Height (Feet)']
0 117.35 1 134.55 2 128.20 ... 24997 114.10 24998 122.59 24999 126.63 Length: 25000, dtype: float64
Multicollinearity
- Multicollinearity occurs when features in a regression model are highly correlated with one another.
In other words, multicollinearity occurs when a feature can be predicted using a linear combination of other features, fairly accurately (a quick way to check for this is sketched after this list).
- When multicollinearity is present in the features, the coefficients in the model are uninterpretable; they have no meaning.
A "slope" represents "the rate of change of $y$ with respect to a feature", when all other features are held constant. But if there's multicollinearity, you can't hold other features constant.
- Note: Multicollinearity doesn't impact a model's predictions!
- It doesn't impact a model's ability to generalize to unseen data.
- If features are multicollinear in the data we've seen, they will probably be multicollinear in the data we haven't seen, drawn from the same distribution.
- Solutions:
- Manually remove highly correlated features.
- Use a dimensionality reduction technique (such as PCA) to automatically reduce dimensions.
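- Here's a hedged sketch (not from the lecture code) of two quick ways to check for multicollinearity, using the heights data defined above.
# A sketch of two quick multicollinearity checks.
import numpy as np

# 1. Pairwise correlations between features: values near +/-1 suggest redundancy.
print(X2.corr())

# 2. Rank check: with an intercept column added, the design matrix below has
#    3 columns but rank 2, so X^T X is not invertible.
design = np.column_stack([np.ones(len(X2)), X2.to_numpy()])
print(np.linalg.matrix_rank(design), design.shape[1])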
One hot encoding and multicollinearity
- One hot encoding will result in multicollinearity unless you drop one of the one hot encoded features.
- Suppose we have a fitted model of the following form, which uses a numerical feature $x_i$ along with one hot encoded $\text{weekend}$ columns:
$$H(x_i) = w_0 + w_1 x_i + w_2 \cdot (\text{weekend}_i == \text{Yes}) + w_3 \cdot (\text{weekend}_i == \text{No})$$
For illustration, assume `'weekend'` was originally a categorical feature with two possible values, `'Yes'` or `'No'`.
- This is equivalent to:
$$H(x_i) = (w_0 + w_3) + w_1 x_i + (w_2 - w_3) \cdot (\text{weekend}_i == \text{Yes})$$
- Note that for a particular row in the dataset, $(\text{weekend}_i == \text{Yes}) + (\text{weekend}_i == \text{No})$ is always equal to 1.
- What's the issue with the example design matrix above?
See the annotated slides.
One hot encoding and multicollinearity
- The columns of the design matrix $X$ above are not linearly independent!
The column of all 1s can be written as a linear combination of the $\text{weekend==Yes}$ and $\text{weekend==No}$ columns:
$$\text{column 1} = \text{column 3} + \text{column 4}$$
- This means that the design matrix is not full rank, which means that $X^TX$ is not invertible.
- This means that there are infinitely many possible solutions $\vec{w}^*$ to the normal equations, $(X^TX) \vec{w} = X^T\vec{y}$!
That's a problem, because we don't know which of these infinitely many solutions `model.coef_` will find for us, and it's impossible to interpret the resulting coefficients, as we saw two slides ago.
- Solution: Drop one of the one hot encoded columns. `OneHotEncoder` has an option to do this; the sketch below illustrates the underlying rank problem on a tiny example.
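- A hedged sketch, using a tiny made-up `'weekend'` column (not a column in `df`), of why keeping both one hot encoded columns alongside an intercept makes the design matrix rank-deficient.
# A sketch with a hypothetical 'weekend' column, illustrating why keeping both
# one hot encoded columns alongside an intercept breaks invertibility.
import numpy as np
import pandas as pd

toy = pd.DataFrame({'weekend': ['Yes', 'No', 'No', 'Yes', 'No']})
weekend_ohe = OneHotEncoder().fit_transform(toy).toarray()  # Columns: weekend_No, weekend_Yes.

# Design matrix: a column of 1s, plus both one hot encoded columns.
X_toy = np.column_stack([np.ones(len(toy)), weekend_ohe])

# The column of 1s equals the sum of the other two columns, so the rank is 2
# even though there are 3 columns, and hence X^T X is not invertible.
np.linalg.matrix_rank(X_toy), X_toy.shape[1]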
What `OneHotEncoder(drop='first')` returns
- Let's switch back to the commute times dataset, `df`.
df[['day', 'month']]
day | month | |
---|---|---|
0 | Mon | May |
1 | Tue | May |
2 | Mon | May |
... | ... | ... |
62 | Mon | March |
63 | Tue | March |
64 | Thu | March |
65 rows × 2 columns
- Let's try using `drop='first'` when instantiating a `OneHotEncoder`.
ohe_drop_one = OneHotEncoder(drop='first')
ohe_drop_one.fit(df[['day', 'month']])
OneHotEncoder(drop='first')
- How many features did the resulting transformer create?
len(ohe_drop_one.get_feature_names_out())
14
- Where did this number come from?
df['day'].nunique()
5
df['month'].nunique()
11
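- To connect the two counts above to the 14 features: with `drop='first'`, each categorical column contributes one fewer feature than it has unique values. A quick sketch of the arithmetic:
# Sketch: with drop='first', each column contributes (nunique - 1) one hot encoded features.
(df['day'].nunique() - 1) + (df['month'].nunique() - 1)  # (5 - 1) + (11 - 1) = 14.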
Key takeaways
- Multicollinearity is present in a linear model when one feature can be accurately predicted using one or more other features.
In other words, it is present when a feature is redundant.
- Multicollinearity doesn't pose an issue for prediction; it doesn't hinder a model's ability to generalize. Instead, it renders the coefficients of a linear model meaningless.
Pipelines
Recap: Commute times
(
    df
    .plot(kind='scatter', x='departure_hour', y='minutes')
    .update_layout(xaxis_title='Home Departure Time (AM)',
                   yaxis_title='Minutes',
                   title='Commuting Time vs. Home Departure Time')
)
- So far, our goal has been to predict commute time in `'minutes'`, given `'departure_hour'`.
- We just learned how to use `OneHotEncoder` to encode `'day'` and `'month'` as numerical columns.
We'll look at how we can easily use these columns (and more!) as inputs to a linear model that predicts commute times.
Pipelines in `sklearn`
- From `sklearn`'s documentation:
`Pipeline` allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling. Intermediate steps of the pipeline must be "transforms", that is, they must implement `fit` and `transform` methods. The final estimator only needs to implement `fit`.
- General template: `pl = make_pipeline(transformer_1, transformer_2, ..., model)`.
Note that the `model` is optional, meaning you can have Pipelines of just transformers.
- Once a Pipeline is instantiated, you can fit all steps (transformers and model) using `pl.fit(X, y)`.
- To make predictions using raw, untransformed data, use `pl.predict(X)`.
Our first Pipeline
- Let's build a Pipeline that:
- One hot encodes `'day'` and `'month'`.
- Fits a regression model on just the one hot encoded data.
# You can either use the Pipeline class constructor directly,
# or the make_pipeline helper function (my preference).
from sklearn.pipeline import Pipeline, make_pipeline
pl = make_pipeline(
    OneHotEncoder(drop='first'),
    LinearRegression()
)
pl
Pipeline(steps=[('onehotencoder', OneHotEncoder(drop='first')), ('linearregression', LinearRegression())])
- Now that `pl` is instantiated, we `fit` it the same way we would fit the individual steps.
pl.fit(X=df[['day', 'month']], y=df['minutes'])
Pipeline(steps=[('onehotencoder', OneHotEncoder(drop='first')), ('linearregression', LinearRegression())])
- Now, to make predictions using raw data, all we need to do is use `pl.predict`:
pl.predict([['Mon', 'November']])
array([68.61])
- `pl` performs both feature transformation and prediction with just a single call to `predict`!
- We can access individual "steps" of a `Pipeline` through the `named_steps` attribute.
# These names are automatically generated by make_pipeline.
# If you use the Pipeline() constructor,
# you can choose these names yourself.
pl.named_steps
{'onehotencoder': OneHotEncoder(drop='first'), 'linearregression': LinearRegression()}
pl.named_steps['onehotencoder'].transform(df[['day', 'month']]).toarray()
array([[1., 0., 0., ..., 0., 0., 0.], [0., 0., 1., ..., 0., 0., 0.], [1., 0., 0., ..., 0., 0., 0.], ..., [1., 0., 0., ..., 0., 0., 0.], [0., 0., 1., ..., 0., 0., 0.], [0., 1., 0., ..., 0., 0., 0.]])
pl.named_steps['onehotencoder'].get_feature_names_out()
array(['day_Mon', 'day_Thu', 'day_Tue', 'day_Wed', 'month_December', 'month_February', 'month_January', 'month_July', 'month_June', 'month_March', 'month_May', 'month_November', 'month_October', 'month_September'], dtype=object)
pl.named_steps['linearregression'].coef_
array([ 1.65, 8.35, 13.2 , 2.68, -2.1 , 6.06, -4.44, -3.08, 9.14, 8.62, 6.24, 2.98, -5.6 , 3.29])
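- As a hedged sanity check (not in the original notebook), `pl.predict` should be equivalent to chaining the fitted steps by hand: transform with the fitted encoder, then predict with the fitted regression model.
# Sketch: pl.predict is equivalent to manually chaining the fitted steps.
import numpy as np
manual_preds = pl.named_steps['linearregression'].predict(
    pl.named_steps['onehotencoder'].transform(df[['day', 'month']])
)
np.allclose(manual_preds, pl.predict(df[['day', 'month']]))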
More sophisticated Pipelines
- In the previous slide, we one hot encoded every input column, and didn't use any columns that were originally numeric, i.e. we didn't use `'departure_hour'`. That's not realistic or useful!
- What if we want to perform different transformations on different columns, or include some columns without transformation?
- Or, what if we want to perform multiple transformations to the same column?
- There are a variety of useful functions/classes we can use:
Name | Functionality |
---|---|
ColumnTransformer |
Allows us to transform different columns with different transformations. Instantiate a ColumnTransformer using a list of tuples, where: • the first element is a "name" we choose for the transformer, • the second element is a transformer instance (e.g. OneHotEncoder()), and • the third element is a list of relevant column names. |
FunctionTransformer |
Allows us to create a custom transformation (similar to using .apply on a DataFrame's columns). |
make_pipeline |
Helper function for creating a Pipeline (slightly less verbose).Note that you can make a pipeline of just transformations, if you want to use multiple transformations on the same column! |
make_column_transformer |
Helper function for creating a ColumnTransformer . |
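- For reference, here's a minimal sketch contrasting the two ways of building a column transformer described in the table above; the column names are from our commute dataset, and the name `'ohe'` in the first version is one we chose ourselves.
# Sketch: the same preprocessing written two ways.
from sklearn.compose import ColumnTransformer, make_column_transformer

# ColumnTransformer constructor: (name, transformer, columns) tuples; we pick the names.
ct_verbose = ColumnTransformer([
    ('ohe', OneHotEncoder(drop='first'), ['day', 'month']),
])

# make_column_transformer helper: (transformer, columns) tuples; names are auto-generated.
ct_short = make_column_transformer(
    (OneHotEncoder(drop='first'), ['day', 'month']),
)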
The plan
- Before writing any code, let's plan out how we want to transform our data.
df[['departure_hour', 'day', 'month', 'day_of_month']]
departure_hour | day | month | day_of_month | |
---|---|---|---|---|
0 | 10.82 | Mon | May | 15 |
1 | 7.75 | Tue | May | 16 |
2 | 8.45 | Mon | May | 22 |
... | ... | ... | ... | ... |
62 | 7.58 | Mon | March | 4 |
63 | 7.45 | Tue | March | 5 |
64 | 7.60 | Thu | March | 7 |
65 rows × 4 columns
- `'departure_hour'`: Create degree 2 and degree 3 polynomial features.
- `'day'`: One hot encode.
- `'month'`: One hot encode.
- `'day_of_month'`: Separate into five weeks, then one hot encode.
Days 1 to 7 are Week 1, days 8 to 14 are Week 2, and so on.
- After all of these transformations, we'll fit a `LinearRegression` object, i.e., fit a linear model.
- `'departure_hour'`: Create degree 2 and degree 3 polynomial features.
- `'day'`: One hot encode.
- `'month'`: One hot encode.
- `'day_of_month'`: Separate into five weeks, then one hot encode.
- Let's start with `'day_of_month'`, since it seems to involve the most complicated transformations.
- First, let's figure out how to extract the week number given the day of the month.
example_vals = df['day_of_month'].tail()
example_vals
60 27 61 29 62 4 63 5 64 7 Name: day_of_month, dtype: int32
# Expression to convert from day of month to Week #.
'Week ' + ((example_vals - 1) // 7 + 1).astype(str)
60 Week 4 61 Week 5 62 Week 1 63 Week 1 64 Week 1 Name: day_of_month, dtype: object
# The function that FunctionTransformer takes in
# itself takes in a Series/DataFrame, not a single element!
# Here, we're having that function return a new Series/DataFrame,
# depending on what's passed in to .transform (experiment on your own).
from sklearn.preprocessing import FunctionTransformer
week_converter = FunctionTransformer(lambda s: 'Week ' + ((s - 1) // 7 + 1).astype(str))
week_converter.transform(df[['day_of_month']])
day_of_month | |
---|---|
0 | Week 3 |
1 | Week 3 |
2 | Week 4 |
... | ... |
62 | Week 1 |
63 | Week 1 |
64 | Week 1 |
65 rows × 1 columns
- We need to apply two consecutive transformations to `'day_of_month'`, which calls for a Pipeline.
day_of_month_transformer = make_pipeline(week_converter, OneHotEncoder(drop='first'))
day_of_month_transformer
Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function <lambda> at 0x17635f9a0>)), ('onehotencoder', OneHotEncoder(drop='first'))])
day_of_month_transformer.fit_transform(df[['day_of_month']]).toarray()
array([[0., 1., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]])
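- To see which week each of the four output columns corresponds to, we can ask the fitted `OneHotEncoder` inside the mini-Pipeline for its feature names (a quick sketch; the step name `'onehotencoder'` is auto-generated by `make_pipeline`).
# Sketch: inspect the one hot encoded week columns inside day_of_month_transformer.
# 'Week 1' doesn't appear, since drop='first' drops the first category.
day_of_month_transformer.named_steps['onehotencoder'].get_feature_names_out()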
- So, `day_of_month_transformer` does everything we need to transform `'day_of_month'`.
- `'departure_hour'`: Create degree 2 and degree 3 polynomial features.
- `'day'`: One hot encode.
- `'month'`: One hot encode.
- `'day_of_month'`: Separate into five weeks, then one hot encode. ✅ Use `day_of_month_transformer`.
- Every other column only needs a single transformation.
To specify which transformations to apply to which columns, create a `ColumnTransformer`.
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import PolynomialFeatures
preprocessing = make_column_transformer(
    (PolynomialFeatures(3, include_bias=False), ['departure_hour']),
    (OneHotEncoder(drop='first'), ['day', 'month']),
    (day_of_month_transformer, ['day_of_month']),
    remainder='drop'
)
preprocessing
ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function <lambda> at 0x17635f9a0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])])
- Now, we're ready for a final Pipeline!
model = make_pipeline(preprocessing, LinearRegression())
model
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function <lambda> at 0x17635f9a0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])])), ('linearregression', LinearRegression())])
model.fit(X=df[['departure_hour', 'day', 'month', 'day_of_month']], y=df['minutes'])
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function <lambda> at 0x17635f9a0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])])), ('linearregression', LinearRegression())])
The punchline
- Now that our Pipeline is fit, we can use it to make predictions using raw data!
What's the predicted commute time if I leave at 8:30AM on a Wednesday in March, which happens to be the 19th of the month?
model.predict(pd.DataFrame([{
    'departure_hour': 8.5,
    'day': 'Wed',
    'month': 'March',
    'day_of_month': 19
}]))
array([64.5])
- Note that when calling `model.predict`, I didn't need to think about one hot encoding, or polynomial features, or any other aspects of the feature engineering process.
Activity
How many columns does the final design matrix that `model` creates have? If you write code to determine the answer, make sure you can walk through the steps over the past few slides to figure out why the answer is what it is.
model
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('polynomialfeatures', PolynomialFeatures(degree=3, include_bias=False), ['departure_hour']), ('onehotencoder', OneHotEncoder(drop='first'), ['day', 'month']), ('pipeline', Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function <lambda> at 0x17635f9a0>)), ('onehotencoder', OneHotEncoder(drop='first'))]), ['day_of_month'])])), ('linearregression', LinearRegression())])
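- If you do want to check your answer with code, one possible approach (a sketch, not necessarily how the lecture intends you to solve it) is to apply just the fitted `ColumnTransformer` to the training data and look at the shape of the result.
# Sketch: apply only the fitted preprocessing step, then count columns.
features = model.named_steps['columntransformer'].transform(
    df[['departure_hour', 'day', 'month', 'day_of_month']]
)
features.shape  # The second entry is the number of feature columns
                # (LinearRegression adds its own intercept on top of these).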
Question (Answer at practicaldsc.org/q)
What questions do you have?
Generalization
Motivation
- You and Billy are studying for an upcoming exam. You both decide to test your understanding by taking a practice exam.
Your logic: If you do well on the practice exam, you should do well on the real exam.
- You each take the practice exam once and look at the solutions afterwards.
- Your strategy: Memorize the answers to all practice exam questions, e.g. "Question 1: A; Question 2: C; Question 3: A."
- Billy's strategy: Learn high-level concepts from the solutions, e.g. "the TF-IDF of term $t$ in document $d$ is large when $t$ occurs often in $d$ but rarely overall."
- Who will do better on the practice exam? Who will probably do better on the real exam? š§
Evaluating the quality of a model
- So far, we've computed the MSE of our fit regression models on the data that we used to fit them, i.e. the training data.
This mean squared error is called the training MSE, or training error.
- We've said that Model A is better than Model B if Model A's MSE is lower than Model B's MSE.
- Remember, our training data is a sample from some population.
- Just because a model fits the training data well doesn't mean it will generalize and work well on similar, unseen samples from the same population!
Overfitting and underfitting
- Let's collect two samples $\{(x_i, y_i)\}$ from the same population.
np.random.seed(23) # For reproducibility.
def sample_from_pop(n=100):
    x = np.linspace(-2, 3, n)
    y = x ** 3 + (np.random.normal(0, 3, size=n))
    return pd.DataFrame({'x': x, 'y': y})
sample_1 = sample_from_pop()
sample_2 = sample_from_pop()
- For now, let's just look at Sample 1. The relationship between $x$ and $y$ is roughly cubic; that is, $y \approx x^3$.
Remember, in reality, you won't get to see the population distribution. If you could, there'd be no need to build a model!
px.scatter(sample_1, x='x', y='y', title='Sample 1')
Polynomial regression
- Let's fit three polynomial models on Sample 1: degree 1, degree 3, and degree 25.
Again, we'll use the `PolynomialFeatures` transformer.
# fit_transform fits and transforms the same input.
# We tell it not to add a column of 1s, because
# LinearRegression() does this automatically later on.
d3 = PolynomialFeatures(3, include_bias=False)
d3.fit_transform(np.array([1, 2, 3, 4, -2]).reshape(-1, 1))
array([[ 1., 1., 1.], [ 2., 4., 8.], [ 3., 9., 27.], [ 4., 16., 64.], [-2., 4., -8.]])
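- The plotting helper in `lec17_util.py` hides the model-fitting details. Roughly speaking (a sketch of what we assume the helper does, not its actual code), each curve comes from a Pipeline like the one below, and we can compute the MSEs on both samples ourselves.
# Sketch (assumed, not the actual lec17_util code): fit polynomial regression models
# on Sample 1, then evaluate their MSE on both Sample 1 and Sample 2.
for degree in [1, 3, 25]:
    poly_model = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
    poly_model.fit(sample_1[['x']], sample_1['y'])
    print(degree,
          mean_squared_error(sample_1['y'], poly_model.predict(sample_1[['x']])),
          mean_squared_error(sample_2['y'], poly_model.predict(sample_2[['x']])))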
- Below, we look at our three models' predictions on Sample 1, which they were trained on.
# Look at the definition of train_and_plot in lec17_util.py if you're curious as to how the plotting works.
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_1, degs=[1, 3, 25], data_name='Sample 1')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 1')
- The degree 25 polynomial has the lowest MSE on Sample 1.
- How do the same fit polynomials look on Sample 2?
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_2, degs=[1, 3, 25], data_name='Sample 2')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 2')
- The degree 3 polynomial has the lowest MSE on Sample 2.
- Note that we didn't get to see Sample 2 when fitting our models!
- As such, it seems that the degree 3 polynomial generalizes better to unseen data than the degree 25 polynomial does.
- What if we fit a degree 1, degree 3, and degree 25 polynomial on Sample 2 as well?
util.plot_multiple_models(sample_1, sample_2, degs=[1, 3, 25])
- Key idea: Degree 25 polynomials seem to vary more when trained on different samples than degree 3 and 1 polynomials do.
- More on this next class!