InĀ [1]:
from lec_utils import *
import lec17_util as util

Lecture 17¶

Pipelines¶

EECS 398: Practical Data Science, Winter 2025¶

practicaldsc.org • github.com/practicaldsc/wn25 • šŸ“£ See latest announcements here on Ed

Agenda šŸ“†¶

  • Brief recap: standardization.
  • OneHotEncoder and multicollinearity.
  • Pipelines🚰.
  • Generalization šŸ”­.

Question šŸ¤” (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

Brief recap: standardization (see annotated slides)¶
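  • As a quick refresher on the mechanics (a minimal sketch on a made-up array; the full discussion is in the annotated slides): StandardScaler learns each column's mean and standard deviation when we call fit, and transform then computes $z_i = \frac{x_i - \bar{x}}{\text{SD of } x}$ for every value.
In [ ]:
from sklearn.preprocessing import StandardScaler
import numpy as np

# fit learns the column's mean (here, 2.5) and standard deviation;
# transform standardizes each value using those statistics.
scaler = StandardScaler()
example = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler.fit(example)
scaler.transform(example) # The resulting column has mean 0 and standard deviation 1.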

OneHotEncoder and multicollinearity¶


Example: Commute times šŸš—Ā¶

  • Let's reload our trusty commute times dataset.
InĀ [2]:
df = pd.read_csv('data/commute-times.csv')
df['day_of_month'] = pd.to_datetime(df['date']).dt.day
df['month'] = pd.to_datetime(df['date']).dt.month_name()
df.head()
Out[2]:
date day home_departure_time home_departure_mileage ... work_departure_time_hr mileage_to_home day_of_month month
0 5/15/2023 Mon 2023-05-15 10:49:00 15873.0 ... 17.17 53.0 15 May
1 5/16/2023 Tue 2023-05-16 07:45:00 15979.0 ... NaN NaN 16 May
2 5/22/2023 Mon 2023-05-22 08:27:00 50407.0 ... 15.90 54.0 22 May
3 5/23/2023 Tue 2023-05-23 07:08:00 50535.0 ... NaN NaN 23 May
4 5/30/2023 Tue 2023-05-30 09:09:00 50664.0 ... 17.12 54.0 30 May

5 rows × 20 columns

  • We'll focus specifically on the 'day' and 'month' columns.
InĀ [3]:
df[['day', 'month']]
Out[3]:
day month
0 Mon May
1 Tue May
2 Mon May
... ... ...
62 Mon March
63 Tue March
64 Thu March

65 rows × 2 columns

Example transformer: OneHotEncoder¶

  • Last class, we had to manually one hot encode the 'day' column. Let's figure out how to one hot encode it automatically, along with the new 'month' column.
InĀ [4]:
df[['day', 'month']]
Out[4]:
day month
0 Mon May
1 Tue May
2 Mon May
... ... ...
62 Mon March
63 Tue March
64 Thu March

65 rows × 2 columns

  • First, we need to import the relevant class from sklearn.preprocessing.
InĀ [5]:
from sklearn.preprocessing import OneHotEncoder
  • As with StandardScaler, we need to instantiate and fit our OneHotEncoder instance before it can transform anything.
InĀ [6]:
ohe = OneHotEncoder()
InĀ [7]:
ohe.fit(df[['day', 'month']])
Out[7]:
OneHotEncoder()
  • Once we've fit the encoder, calling the transform method gives us a result we might not expect.
InĀ [8]:
ohe.transform(df[['day', 'month']])
Out[8]:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 130 stored elements and shape (65, 16)>
  • Since the resulting matrix is sparse – most of its elements are 0 – sklearn uses a more efficient representation than a regular numpy array. We can convert to a regular (dense) array:
InĀ [9]:
ohe.transform(df[['day', 'month']]).toarray()
Out[9]:
array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])
  • The column names from df[['day', 'month']] don't appear in the output above. We can use the get_feature_names_out method on ohe to access the names and order of the one hot encoded columns, though:
InĀ [10]:
ohe.get_feature_names_out()
Out[10]:
array(['day_Fri', 'day_Mon', 'day_Thu', 'day_Tue', 'day_Wed',
       'month_August', 'month_December', 'month_February',
       'month_January', 'month_July', 'month_June', 'month_March',
       'month_May', 'month_November', 'month_October', 'month_September'],
      dtype=object)
InĀ [11]:
pd.DataFrame(ohe.transform(df[['day', 'month']]).toarray(), 
             columns=ohe.get_feature_names_out()) # If we need a DataFrame back, for some reason.
Out[11]:
day_Fri day_Mon day_Thu day_Tue ... month_May month_November month_October month_September
0 0.0 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ...
62 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
63 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0
64 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0

65 rows × 16 columns

  • Usually, we won't perform all of these intermediate steps, since the OneHotEncoder will be part of a larger Pipeline.

Example: Heights and weights¶

  • We now know how to use OneHotEncoder.
  • To illustrate a mathematical issue involving one hot encoding, let's load in another dataset, this time containing the weights and heights of 25,000 18-year-olds, taken from here.
InĀ [12]:
people = pd.read_csv('data/heights-weights.csv').drop(columns=['Index'])
people.head()
Out[12]:
Height (Inches) Weight (Pounds)
0 65.78 112.99
1 71.52 136.49
2 69.40 153.03
3 68.22 142.34
4 67.79 144.30
InĀ [13]:
people.plot(kind='scatter', x='Height (Inches)', y='Weight (Pounds)', 
            title='Weight vs. Height for 25,000 18 Year Olds')

Motivating example¶

  • Suppose we fit a simple linear regression model that uses height in inches, $x$, to predict weight in pounds, $y$.
$$\text{predicted weight}_i = w_0 + w_1 \cdot \text{height in inches}_i$$
InĀ [14]:
X = people[['Height (Inches)']]
y = people['Weight (Pounds)']
InĀ [15]:
from sklearn.linear_model import LinearRegression
people_one_feat = LinearRegression()
people_one_feat.fit(X, y)
Out[15]:
LinearRegression()
  • $w_0^*$ and $w_1^*$ are shown below, along with the model's MSE on the data we used to train it.
    We call this the model's training MSE.
InĀ [16]:
people_one_feat.intercept_, people_one_feat.coef_
Out[16]:
(-82.57574306454093, array([3.08]))
InĀ [17]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y, people_one_feat.predict(X))
Out[17]:
101.58853248632849

An added feature¶

  • Now, suppose we fit another regression model that uses both height in inches AND height in feet to predict weight.
$$\text{predicted weight}_i = w_0 + w_1 \cdot \text{height in inches}_i + w_2 \cdot \text{height in feet}_i$$
InĀ [18]:
people['Height (Feet)'] = people['Height (Inches)'] / 12 # 12 inches = 1 foot.
InĀ [19]:
X2 = people[['Height (Inches)', 'Height (Feet)']]
X2
Out[19]:
Height (Inches) Height (Feet)
0 65.78 5.48
1 71.52 5.96
2 69.40 5.78
... ... ...
24997 64.70 5.39
24998 67.53 5.63
24999 68.88 5.74

25000 rows × 2 columns

InĀ [20]:
people_two_feat = LinearRegression()
people_two_feat.fit(X2, y)
Out[20]:
LinearRegression()
  • What are $w_0^*$, $w_1^*$, $w_2^*$, and the model's MSE?
InĀ [21]:
people_two_feat.intercept_, people_two_feat.coef_
Out[21]:
(-82.59155502376602, array([-2.32e+11,  2.78e+12]))
InĀ [22]:
mean_squared_error(y, people_two_feat.predict(X2))
Out[22]:
101.58844271417476
  • Observation: The intercept is essentially the same as before (roughly $-82.6$), as is the MSE. However, the coefficients on 'Height (Inches)' and 'Height (Feet)' are massive in size!
  • It should be unsurprising that the MSE is essentially the same, because adding 'Height (Feet)' doesn't change the span of the design matrix. So, the best predictions are the same, too (we verify this below).
  • But what's going on with the coefficients?
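  • As a quick check of that claim (a sketch, not in the original notebook), we can compare the two models' predictions directly; they agree up to tiny floating-point differences caused by the enormous coefficients.
In [ ]:
import numpy as np

# Largest absolute difference between the two models' predictions;
# it is negligible relative to the predictions themselves.
np.abs(people_one_feat.predict(X) - people_two_feat.predict(X2)).max()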

Redundant features¶

  • Suppose in the first model, $w_0^* = -80$ and $w_1^* = 3$.
$$\text{predicted weight}_i = -80 + 3 \cdot \text{height in inches}_i$$
  • In the second model, we have:
$$\begin{align*}\text{predicted weight}_i &= w_0^* + w_1^* \cdot \text{height in inches}_i + w_2^* \cdot \text{height in feet}_i \end{align*}$$
  • But, since $\text{height in feet}_i = \frac{\text{height in inches}_i}{12}$:
$$\begin{align*}\text{predicted weight}_i &= w_0^* + w_1^* \cdot \text{height in inches}_i + w_2^* \cdot \text{height in feet}_i \\ &= w_0^* + w_1^* \cdot \text{height in inches}_i + w_2^* \cdot \left( \frac{\text{height in inches}_i}{12} \right) \\ &= w_0^* + \left( w_1^* + \frac{w_2^*}{12} \right) \cdot \text{height in inches}_i \end{align*}$$
  • In the first model, we already found the "best" intercept ($-80$) and slope ($3$) in a linear model that uses height in inches to predict weight.
  • So, as long as $w_1^* + \frac{w_2^*}{12} = 3$ in the second model, the second model's predictions will be the same as the first, and hence they will also minimize MSE.

Infinitely many parameter choices¶

  • Issue: There are an infinite number of $w_1^*$ and $w_2^*$ that satisfy $w_1^* + \frac{w_2^*}{12} = 3$!
$$\begin{align*}\text{predicted weight}_i &= -80 + 5 \cdot \text{height in inches}_i - 24 \cdot \text{height in feet}_i \end{align*}$$
$$\begin{align*}\text{predicted weight}_i &= -80 - 1 \cdot \text{height in inches}_i + 48 \cdot \text{height in feet}_i \end{align*}$$
  • Both hypothesis functions look very different, but actually make the same predictions.
  • model.coef_ could return either set of coefficients, or any other of the infinitely many options.
  • But neither set of coefficients has any meaning!
InĀ [23]:
-80 + 5 * people['Height (Inches)'] - 24 * people['Height (Feet)']
Out[23]:
0        117.35
1        134.55
2        128.20
          ...  
24997    114.10
24998    122.59
24999    126.63
Length: 25000, dtype: float64
InĀ [24]:
-80 - 1 * people['Height (Inches)'] + 48 * people['Height (Feet)']
Out[24]:
0        117.35
1        134.55
2        128.20
          ...  
24997    114.10
24998    122.59
24999    126.63
Length: 25000, dtype: float64

Multicollinearity¶

  • Multicollinearity occurs when features in a regression model are highly correlated with one another.
    In other words, multicollinearity occurs when a feature can be predicted fairly accurately using a linear combination of other features.
  • When multicollinearity is present in the features, the coefficients in the model are uninterpretable – they have no meaning.
    A "slope" represents "the rate of change of $y$ with respect to a feature", when all other features are held constant – but if there's multicollinearity, you can't hold other features constant.
  • Note: Multicollinearity doesn't impact a model's predictions!
    • It doesn't impact a model's ability to generalize to unseen data.
    • If features are multicollinear in the data we've seen, they will probably be multicollinear in the data we haven't seen, drawn from the same distribution.
  • Solutions:
    • Manually remove highly correlated features.
    • Use a dimensionality reduction technique (such as PCA) to automatically reduce dimensions.
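  • One quick way to spot the issue in our example (a sketch, not from the original slides): inspect pairwise correlations between features. Since 'Height (Feet)' is an exact linear function of 'Height (Inches)', their correlation is 1.
In [ ]:
# Correlation matrix of the two height features from earlier.
# A correlation of exactly 1 (or -1) means one column is a linear
# function of the other, i.e. the columns are linearly dependent.
people[['Height (Inches)', 'Height (Feet)']].corr()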

One hot encoding and multicollinearity¶

  • One hot encoding will result in multicollinearity unless you drop one of the one hot encoded features.
  • Suppose we have the following fitted model:
    For illustration, assume 'weekend' was originally a categorical feature with two possible values, 'Yes' or 'No'.
$$ \begin{aligned} H(\vec x_i) = 1 - 3 \cdot \text{departure hour}_i + 2 \cdot (\text{weekend}_i==\text{Yes}) - 2 \cdot (\text{weekend}_i==\text{No}) \end{aligned} $$
  • This is equivalent to:
$$ \begin{aligned} H(\vec x_i) = 10 - 3 \cdot \text{departure hour}_i - 7 \cdot (\text{weekend}_i==\text{Yes}) - 11 \cdot (\text{weekend}_i==\text{No}) \end{aligned} $$
  • Note that for a particular row in the dataset, $(\text{weekend}_i==\text{Yes}) + (\text{weekend}_i==\text{No})$ is always equal to 1.
$$X = \begin{bmatrix} 1 & 8.45 & 0 & 1 \\ 1 & 11 & 0 & 1 \\ 1 & 7.39 & 1 & 0 \\ 1 & 9.98 & 1 & 0 \\ 1 & 10.45 & 0 & 1 \\\end{bmatrix}$$
A possible design matrix for this model.
  • What's the issue with the example design matrix above?
    See the annotated slides.

One hot encoding and multicollinearity¶

$$X = \begin{bmatrix} 1 & 8.45 & 0 & 1 \\ 1 & 11 & 0 & 1 \\ 1 & 7.39 & 1 & 0 \\ 1 & 9.98 & 1 & 0 \\ 1 & 10.45 & 0 & 1 \\\end{bmatrix}$$
A possible design matrix for this model.
  • The columns of the design matrix $X$ above are not linearly independent!

The column of all 1s can be written as a linear combination of the $\text{weekend==Yes}$ and $\text{weekend==No}$ columns.

$$\text{column 1} = \text{column 3} + \text{column 4}$$
  • This means that the design matrix is not full rank, which means that $X^TX$ is not invertible.
  • This means that there are infinitely many possible solutions $\vec{w}^*$ to the normal equations, $(X^TX) \vec{w} = X^T\vec{y}$!
    That's a problem, because we don't know which of these infinitely many solutions model.coef_ will find for us, and it's impossible to interpret the resulting coefficients, as we saw two slides ago.
  • Solution: Drop one of the one hot encoded columns. OneHotEncoder has an option to do this.

OneHotEncoder and drop='first'¶

  • Let's switch back to the commute times dataset, df.
InĀ [25]:
df[['day', 'month']]
Out[25]:
day month
0 Mon May
1 Tue May
2 Mon May
... ... ...
62 Mon March
63 Tue March
64 Thu March

65 rows × 2 columns

  • Let's try using drop='first' when instantiating a OneHotEncoder.
InĀ [26]:
ohe_drop_one = OneHotEncoder(drop='first')
InĀ [27]:
ohe_drop_one.fit(df[['day', 'month']])
Out[27]:
OneHotEncoder(drop='first')
  • How many features did the resulting transformer create?
InĀ [28]:
len(ohe_drop_one.get_feature_names_out())
Out[28]:
14
  • Where did this number come from?
InĀ [29]:
df['day'].nunique()
Out[29]:
5
InĀ [30]:
df['month'].nunique()
Out[30]:
11
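  • Putting those together: with drop='first', each categorical column contributes one fewer feature than it has unique values. A quick sketch of the arithmetic:
In [ ]:
# (5 - 1) 'day' features + (11 - 1) 'month' features = 14 features in total.
(df['day'].nunique() - 1) + (df['month'].nunique() - 1)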

Key takeaways¶

  • Multicollinearity is present in a linear model when one feature can be accurately predicted using one or more other features.
    In other words, it is present when a feature is redundant.
  • Multicollinearity doesn't pose an issue for prediction; it doesn't hinder a model's ability to generalize. Instead, it renders the coefficients of a linear model meaningless.

Pipelines🚰¶


Recap: Commute times šŸš—Ā¶

InĀ [31]:
(
    df
    .plot(kind='scatter', x='departure_hour', y='minutes')
    .update_layout(xaxis_title='Home Departure Time (AM)', 
                   yaxis_title='Minutes',
                   title='Commuting Time vs. Home Departure Time')
)
  • So far, our goal has been to predict commute time in 'minutes', given 'departure_hour'.
  • We just learned how to use OneHotEncoder to encode 'day' and 'month' as numerical columns.
    We'll look at how we can easily use these columns – and more! – as inputs to a linear model that predicts commute times.

Pipelines in sklearn¶

  • From sklearn's documentation:

Pipeline allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.

Intermediate steps of the pipeline must be "transforms", that is, they must implement fit and transform methods. The final estimator only needs to implement fit.

  • General template: pl = make_pipeline(transformer_1, transformer_2, ..., model).
    Note that the model is optional, meaning you can have Pipelines of just transformers.
  • Once a Pipeline is instantiated, you can fit all steps (transformers and model) using pl.fit(X, y).
  • To make predictions using raw, untransformed data, use pl.predict(X).

Our first Pipeline¶

  • Let's build a Pipeline that:
    1. One hot encodes 'day' and 'month'.
    2. Fits a regression model on just the one hot encoded data.
InĀ [32]:
# You can either use the Pipeline class constructor directly,
# or the make_pipeline helper function (my preference).
from sklearn.pipeline import Pipeline, make_pipeline
InĀ [33]:
pl = make_pipeline(
    OneHotEncoder(drop='first'),
    LinearRegression()
)
pl
Out[33]:
Pipeline(steps=[('onehotencoder', OneHotEncoder(drop='first')),
                ('linearregression', LinearRegression())])
  • Now that pl is instantiated, we fit it the same way we would fit the individual steps.
InĀ [34]:
pl.fit(X=df[['day', 'month']], y=df['minutes']) 
Out[34]:
Pipeline(steps=[('onehotencoder', OneHotEncoder(drop='first')),
                ('linearregression', LinearRegression())])
  • Now, to make predictions using raw data, all we need to do is use pl.predict:
InĀ [35]:
pl.predict([['Mon', 'November']]) 
Out[35]:
array([68.61])
  • pl performs both feature transformation and prediction with just a single call to predict!

Reference Slide¶

Pipeline internals¶

  • We can access individual "steps" of a Pipeline through the named_steps attribute.
InĀ [36]:
# These names are automatically generated by make_pipeline.
# If you use the Pipeline() constructor,
# you can choose these names yourself.
pl.named_steps
Out[36]:
{'onehotencoder': OneHotEncoder(drop='first'),
 'linearregression': LinearRegression()}
InĀ [37]:
pl.named_steps['onehotencoder'].transform(df[['day', 'month']]).toarray()
Out[37]:
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
InĀ [38]:
pl.named_steps['onehotencoder'].get_feature_names_out()
Out[38]:
array(['day_Mon', 'day_Thu', 'day_Tue', 'day_Wed', 'month_December',
       'month_February', 'month_January', 'month_July', 'month_June',
       'month_March', 'month_May', 'month_November', 'month_October',
       'month_September'], dtype=object)
InĀ [39]:
pl.named_steps['linearregression'].coef_
Out[39]:
array([ 1.65,  8.35, 13.2 ,  2.68, -2.1 ,  6.06, -4.44, -3.08,  9.14,
        8.62,  6.24,  2.98, -5.6 ,  3.29])
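  • To see which coefficient belongs to which feature (a small sketch, not part of the original notebook), we can pair coef_ with the encoder's feature names:
In [ ]:
# Align each one hot encoded feature name with its fitted coefficient.
# (pd comes from `from lec_utils import *` at the top of the notebook.)
pd.Series(
    pl.named_steps['linearregression'].coef_,
    index=pl.named_steps['onehotencoder'].get_feature_names_out()
)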

More sophisticated Pipelines¶

  • In the previous slide, we one hot encoded every input column, and didn't use any columns that were originally numeric, i.e. we didn't use 'departure_hour'.
    That's not realistic or useful!
  • What if we want to perform different transformations on different columns, or include some columns without transformation?
  • Or, what if we want to perform multiple transformations to the same column?
  • There are a variety of useful functions/classes we can use:
    • ColumnTransformer: Allows us to transform different columns with different transformations.
      Instantiate a ColumnTransformer using a list of tuples, where:
      • The first element is a "name" we choose for the transformer.
      • The second element is a transformer instance (e.g. OneHotEncoder()).
      • The third element is a list of relevant column names.
    • FunctionTransformer: Allows us to create a custom transformation (similar to using .apply on a DataFrame's columns).
    • make_pipeline: Helper function for creating a Pipeline (slightly less verbose).
      Note that you can make a pipeline of just transformations, if you want to use multiple transformations on the same column!
    • make_column_transformer: Helper function for creating a ColumnTransformer.
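  • For reference, here's a minimal sketch (with hypothetical step names and column choices) of the more verbose ColumnTransformer constructor, which lets you choose the names yourself:
In [ ]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Each tuple is (a name we choose, a transformer instance, the columns it applies to).
example_ct = ColumnTransformer(
    transformers=[
        ('ohe', OneHotEncoder(drop='first'), ['day', 'month']),
        ('scale', StandardScaler(), ['departure_hour']),
    ],
    remainder='drop', # Columns not listed above are dropped.
)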

The plan¶

  • Before writing any code, let's plan out how we want to transform our data.
InĀ [40]:
df[['departure_hour', 'day', 'month', 'day_of_month']]
Out[40]:
departure_hour day month day_of_month
0 10.82 Mon May 15
1 7.75 Tue May 16
2 8.45 Mon May 22
... ... ... ... ...
62 7.58 Mon March 4
63 7.45 Tue March 5
64 7.60 Thu March 7

65 rows × 4 columns

  • 'departure_hour': Create degree 2 and degree 3 polynomial features:
$$H(\vec x_i) = ... + w_1 \cdot \text{departure hour}_i + w_2 \cdot \left(\text{departure hour}_i\right)^2 + w_3 \cdot \left( \text{departure hour}_i \right)^3 + ...$$
  • 'day': One hot encode.
  • 'month': One hot encode.
  • 'day_of_month': Separate into five weeks, then one hot encode.

Days 1 to 7 are Week 1, Days 8 to 14 are Week 2, and so on.

  • After all of these transformations, we'll fit a LinearRegression object – i.e., fit a linear model.

'departure_hour': Create degree 2 and degree 3 polynomial features.
'day': One hot encode.
'month': One hot encode.
'day_of_month': Separate into five weeks, then one hot encode.

  • Let's start with 'day_of_month', since it seems to involve the most complicated transformations.
  • First, let's figure out how to extract the week number given the day of the month.
InĀ [41]:
example_vals = df['day_of_month'].tail()
example_vals
Out[41]:
60    27
61    29
62     4
63     5
64     7
Name: day_of_month, dtype: int32
InĀ [42]:
# Expression to convert from day of month to Week #.
'Week ' + ((example_vals - 1) // 7 + 1).astype(str) 
Out[42]:
60    Week 4
61    Week 5
62    Week 1
63    Week 1
64    Week 1
Name: day_of_month, dtype: object
InĀ [43]:
# The function that FunctionTransformer takes in
# itself takes in a Series/DataFrame, not a single element!
# Here, we're having that function return a new Series/DataFrame,
# depending on what's passed in to .transform (experiment on your own).
from sklearn.preprocessing import FunctionTransformer
week_converter = FunctionTransformer(lambda s: 'Week ' + ((s - 1) // 7 + 1).astype(str)) 
InĀ [44]:
week_converter.transform(df[['day_of_month']])
Out[44]:
day_of_month
0 Week 3
1 Week 3
2 Week 4
... ...
62 Week 1
63 Week 1
64 Week 1

65 rows × 1 columns

  • We need to apply two consecutive transformations to 'day_of_month', which calls for a Pipeline.
InĀ [45]:
day_of_month_transformer = make_pipeline(week_converter, OneHotEncoder(drop='first')) 
day_of_month_transformer
Out[45]:
Pipeline(steps=[('functiontransformer',
                 FunctionTransformer(func=<function <lambda> at 0x17635f9a0>)),
                ('onehotencoder', OneHotEncoder(drop='first'))])
InĀ [46]:
day_of_month_transformer.fit_transform(df[['day_of_month']]).toarray()
Out[46]:
array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
  • So, day_of_month_transformer does everything we need to transform 'day_of_month'.

'departure_hour': Create degree 2 and degree 3 polynomial features.
'day': One hot encode.
'month': One hot encode.
'day_of_month': Separate into five weeks, then one hot encode. āœ… Use day_of_month_transformer.

  • Every other column only needs a single transformation.
    To specify which transformations to apply to which columns, create a ColumnTransformer.
InĀ [47]:
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import PolynomialFeatures
InĀ [48]:
preprocessing = make_column_transformer(
    (PolynomialFeatures(3, include_bias=False), ['departure_hour']),
    (OneHotEncoder(drop='first'), ['day', 'month']),
    (day_of_month_transformer, ['day_of_month']),
    remainder='drop'
)
preprocessing
Out[48]:
ColumnTransformer(transformers=[('polynomialfeatures',
                                 PolynomialFeatures(degree=3,
                                                    include_bias=False),
                                 ['departure_hour']),
                                ('onehotencoder', OneHotEncoder(drop='first'),
                                 ['day', 'month']),
                                ('pipeline',
                                 Pipeline(steps=[('functiontransformer',
                                                  FunctionTransformer(func=<function <lambda> at 0x17635f9a0>)),
                                                 ('onehotencoder',
                                                  OneHotEncoder(drop='first'))]),
                                 ['day_of_month'])])
  • Now, we're ready for a final Pipeline!
InĀ [49]:
model = make_pipeline(preprocessing, LinearRegression())
model
Out[49]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('polynomialfeatures',
                                                  PolynomialFeatures(degree=3,
                                                                     include_bias=False),
                                                  ['departure_hour']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['day', 'month']),
                                                 ('pipeline',
                                                  Pipeline(steps=[('functiontransformer',
                                                                   FunctionTransformer(func=<function <lambda> at 0x17635f9a0>)),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['day_of_month'])])),
                ('linearregression', LinearRegression())])
InĀ [50]:
model.fit(X=df[['departure_hour', 'day', 'month', 'day_of_month']], y=df['minutes'])
Out[50]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('polynomialfeatures',
                                                  PolynomialFeatures(degree=3,
                                                                     include_bias=False),
                                                  ['departure_hour']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['day', 'month']),
                                                 ('pipeline',
                                                  Pipeline(steps=[('functiontransformer',
                                                                   FunctionTransformer(func=<function <lambda> at 0x17635f9a0>)),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['day_of_month'])])),
                ('linearregression', LinearRegression())])

The punchline¶

  • Now that our Pipeline is fit, we can use it to make predictions using raw data!
    What's the predicted commute time if I leave at 8:30AM on a Wednesday in March, which happens to be the 19th of the month?
InĀ [51]:
model.predict(pd.DataFrame([{
    'departure_hour': 8.5,
    'day': 'Wed',
    'month': 'March',
    'day_of_month': 19
}]))
Out[51]:
array([64.5])
  • Note that when calling model.predict, I didn't need to think about one hot encoding, or polynomial features, or any other aspects of the feature engineering process.

Activity

How many columns does the final design matrix that model creates have? If you write code to determine the answer, make sure you can walk through the steps over the past few slides to figure out why the answer is what it is.
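One way to check your answer (a sketch; recent versions of sklearn let you call get_feature_names_out on a fitted ColumnTransformer):
In [ ]:
# Number of columns in the design matrix produced by the fitted preprocessing step.
# Equivalently, transform the training data and inspect .shape[1].
len(model.named_steps['columntransformer'].get_feature_names_out())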

InĀ [52]:
model
Out[52]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('polynomialfeatures',
                                                  PolynomialFeatures(degree=3,
                                                                     include_bias=False),
                                                  ['departure_hour']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['day', 'month']),
                                                 ('pipeline',
                                                  Pipeline(steps=[('functiontransformer',
                                                                   FunctionTransformer(func=<function <lambda> at 0x17635f9a0>)),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['day_of_month'])])),
                ('linearregression', LinearRegression())])

Question šŸ¤” (Answer at practicaldsc.org/q)

What questions do you have?

Generalization šŸ”­Ā¶


Motivation¶

  • You and Billy are studying for an upcoming exam. You both decide to test your understanding by taking a practice exam.
    Your logic: If you do well on the practice exam, you should do well on the real exam.
  • You each take the practice exam once and look at the solutions afterwards.
  • Your strategy: Memorize the answers to all practice exam questions, e.g. "Question 1: A; Question 2: C; Question 3: A."
  • Billy's strategy: Learn high-level concepts from the solutions, e.g. "the TF-IDF of term $t$ in document $d$ is large when $t$ occurs often in $d$ but rarely overall."
  • Who will do better on the practice exam? Who will probably do better on the real exam? 🧐

Evaluating the quality of a model¶

  • So far, we've computed the MSE of our fit regression models on the data that we used to fit them, i.e. the training data.
    This mean squared error is called the training MSE, or training error.
  • We've said that Model A is better than Model B if Model A's training MSE is lower than Model B's.
    • Remember, our training data is a sample from some population.
    • Just because a model fits the training data well doesn't mean it will generalize and work well on similar, unseen samples from the same population!
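  • To make this concrete (a sketch reusing the fitted commute-time Pipeline from earlier): a model's training MSE is just its MSE on the rows it was fit on.
In [ ]:
from sklearn.metrics import mean_squared_error

# MSE on the same rows used to fit `model`: the training MSE.
X_train = df[['departure_hour', 'day', 'month', 'day_of_month']]
mean_squared_error(df['minutes'], model.predict(X_train))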

Overfitting and underfitting¶

  • Let's collect two samples $\{(x_i, y_i)\}$ from the same population.
InĀ [53]:
np.random.seed(23) # For reproducibility.
def sample_from_pop(n=100):
    x = np.linspace(-2, 3, n)
    y = x ** 3 + (np.random.normal(0, 3, size=n))
    return pd.DataFrame({'x': x, 'y': y})
sample_1 = sample_from_pop()
sample_2 = sample_from_pop()
  • For now, let's just look at Sample 1. The relationship between $x$ and $y$ is roughly cubic; that is, $y \approx x^3$.
    Remember, in reality, you won't get to see the population distribution. If you could, there'd be no need to build a model!
InĀ [54]:
px.scatter(sample_1, x='x', y='y', title='Sample 1')

Polynomial regression¶

  • Let's fit three polynomial models on Sample 1: degree 1, degree 3, and degree 25.
    Again, we'll use the PolynomialFeatures transformer.
InĀ [55]:
# fit_transform fits and transforms the same input.
# We tell it not to add a column of 1s, because
# LinearRegression() does this automatically later on.
d3 = PolynomialFeatures(3, include_bias=False)
d3.fit_transform(np.array([1, 2, 3, 4, -2]).reshape(-1, 1))
Out[55]:
array([[ 1.,  1.,  1.],
       [ 2.,  4.,  8.],
       [ 3.,  9., 27.],
       [ 4., 16., 64.],
       [-2.,  4., -8.]])
  • Below, we look at our three models' predictions on Sample 1, which they were trained on.
InĀ [56]:
# Look at the definition of train_and_plot in lec17_util.py if you're curious as to how the plotting works.
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_1, degs=[1, 3, 25], data_name='Sample 1')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 1')
  • The degree 25 polynomial has the lowest MSE on Sample 1.
  • How do the same fit polynomials look on Sample 2?
InĀ [57]:
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_2, degs=[1, 3, 25], data_name='Sample 2')
fig.update_layout(title='Trained on Sample 1, Performance on Sample 2')
  • The degree 3 polynomial has the lowest MSE on Sample 2.
  • Note that we didn't get to see Sample 2 when fitting our models!
  • As such, it seems that the degree 3 polynomial generalizes better to unseen data than the degree 25 polynomial does.
  • What if we fit a degree 1, degree 3, and degree 25 polynomial on Sample 2 as well?
InĀ [58]:
util.plot_multiple_models(sample_1, sample_2, degs=[1, 3, 25])
  • Key idea: Degree 25 polynomials seem to vary more when trained on different samples than degree 3 and 1 polynomials do.
  • More on this next class!