from lec_utils import *
Lecture 16 Supplementary Notebook¶
Regression using Linear Algebra¶
EECS 398-003: Practical Data Science, Fall 2024¶
Note: This notebook is only a supplementary notebook to the main lecture slides, which are in PDF form.
The main lecture slides can be found at practicaldsc.org under Lecture 16. (After the live lecture, an annotated version of the slides will be made available as well.)
Let's once again load in the commute times dataset as a DataFrame.
df = pd.read_csv('data/commute-times.csv')
df.head()
 | date | day | home_departure_time | home_departure_mileage | ... | mileage_to_work | minutes_to_home | work_departure_time_hr | mileage_to_home
---|---|---|---|---|---|---|---|---|---
0 | 5/15/2023 | Mon | 2023-05-15 10:49:00 | 15873.0 | ... | 53.0 | 72.0 | 17.17 | 53.0 |
1 | 5/16/2023 | Tue | 2023-05-16 07:45:00 | 15979.0 | ... | 54.0 | NaN | NaN | NaN |
2 | 5/22/2023 | Mon | 2023-05-22 08:27:00 | 50407.0 | ... | 54.0 | 82.0 | 15.90 | 54.0 |
3 | 5/23/2023 | Tue | 2023-05-23 07:08:00 | 50535.0 | ... | 54.0 | NaN | NaN | NaN |
4 | 5/30/2023 | Tue | 2023-05-30 09:09:00 | 50664.0 | ... | 54.0 | 76.0 | 17.12 | 54.0 |
5 rows × 18 columns
There are many columns in here, but the only ones we're interested in for now are 'departure_hour' and 'minutes'.
df[['departure_hour', 'minutes']]
 | departure_hour | minutes
---|---|---
0 | 10.82 | 68.0 |
1 | 7.75 | 94.0 |
2 | 8.45 | 63.0 |
... | ... | ... |
62 | 7.58 | 68.0 |
63 | 7.45 | 90.0 |
64 | 7.60 | 83.0 |
65 rows × 2 columns
fig = px.scatter(df,
x='departure_hour',
y='minutes',
size=np.ones(len(df)) * 50,
size_max=8)
fig.update_xaxes(title='Home Departure Time (AM)')
fig.update_yaxes(title='Minutes')
fig.update_layout(title='Commuting Time vs. Home Departure Time')
fig.update_layout(width=700)
Finding the Regression Line, Using the Old Formulas¶
Recall, the formulas for the optimal intercept ($w_0^*$) and slope ($w_1^*$) are:
$$w_1^* = r \frac{\sigma_y}{\sigma_x}$$
$$w_0^* = \bar{y} - w_1^* \bar{x}$$
def slope(x, y):
    return np.corrcoef(x, y)[0, 1] * np.std(y) / np.std(x)

def intercept(x, y):
    return np.mean(y) - slope(x, y) * np.mean(x)
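As a sanity check (not in the original lecture), these formulas can be verified against `np.polyfit`, which fits a degree-1 least-squares polynomial and returns its coefficients highest-degree first. The synthetic `x` and `y` below are made up purely for illustration:

```python
import numpy as np

def slope(x, y):
    # r * (sigma_y / sigma_x), matching the formula above.
    return np.corrcoef(x, y)[0, 1] * np.std(y) / np.std(x)

def intercept(x, y):
    return np.mean(y) - slope(x, y) * np.mean(x)

# Made-up data roughly shaped like the commute-times example.
rng = np.random.default_rng(42)
x = rng.uniform(6, 11, size=50)
y = 140 - 8 * x + rng.normal(0, 5, size=50)

# np.polyfit(x, y, 1) returns [slope, intercept] for the least-squares line.
w1, w0 = np.polyfit(x, y, 1)
print(np.isclose(w1, slope(x, y)), np.isclose(w0, intercept(x, y)))
```

Both checks come out `True`: the correlation-based formulas and the least-squares polynomial fit describe the exact same line.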
w0_star = intercept(df['departure_hour'], df['minutes'])
w1_star = slope(df['departure_hour'], df['minutes'])
# Just fancy printing – ignore these next two lines.
rule_string = ('$$\\text{Predicted Commute Time (in Minutes)} = ' +
f'{round(w0_star, 2)} + {round(w1_star, 2)}' +
'\\cdot \\left( \\text{Departure Hour} \\right)$$')
display(HTML(f'<h4>The best linear predictor for this dataset is</h4><br><center>{rule_string}</center>'))
The best linear predictor for this dataset is
$$\text{Predicted Commute Time (in Minutes)} = 142.45 - 8.19 \cdot \left( \text{Departure Hour} \right)$$
hline = px.line(x=[5.5, 11.5], y=[97.405, 48.265]).update_traces(line={'color': 'red', 'width': 4})
fline1 = go.Figure(fig.data + hline.data)
fline1.update_xaxes(title='Home Departure Time (AM)')
fline1.update_yaxes(title='Minutes')
fline1.update_layout(title='<span style="color:red">Predicted Commute Time</span> = 142.45 - 8.19 * Departure Hour')
fline1.update_layout(width=700, margin={'t': 60})
Now that we have $w_0^*$ and $w_1^*$, we can use them to make predictions.
# The predicted commute time if I leave at 8:30AM.
w0_star + w1_star * 8.5
72.85923693147151
Finding the Regression Line, Using the Normal Equations¶
Using our linear algebraic formulation, the optimal intercept and slope are given by the vector $\vec{w}^*$, where:
$$\vec{w}^* = ({X^TX})^{-1} X^T\vec{y}$$
Here:
- $X$ is an $n \times 2$ matrix, called the design matrix, defined as:
$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
- $\vec{y}$ is an $n$-dimensional vector, called the observation vector, defined as:
$$\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
Let's construct $X$ and $\vec{y}$ in code.
First, the design matrix.
# Create a new DataFrame by taking the 'departure_hour' column from df.
X = df[['departure_hour']].copy()
X
 | departure_hour
---|---
0 | 10.82 |
1 | 7.75 |
2 | 8.45 |
... | ... |
62 | 7.58 |
63 | 7.45 |
64 | 7.60 |
65 rows × 1 columns
# Add a column of all 1s to X.
X['1'] = 1
X
 | departure_hour | 1
---|---|---
0 | 10.82 | 1 |
1 | 7.75 | 1 |
2 | 8.45 | 1 |
... | ... | ... |
62 | 7.58 | 1 |
63 | 7.45 | 1 |
64 | 7.60 | 1 |
65 rows × 2 columns
# Change the order of the columns and convert to an array.
X = X[['1', 'departure_hour']].to_numpy()
X
array([[ 1.  , 10.82],
       [ 1.  ,  7.75],
       [ 1.  ,  8.45],
       ...,
       [ 1.  ,  7.58],
       [ 1.  ,  7.45],
       [ 1.  ,  7.6 ]])
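As an aside (not in the original lecture), the same design matrix can be built in one step with `np.column_stack`, which avoids the copy-add-reorder dance. The three-value `departure_hour` Series below is a made-up stand-in for `df['departure_hour']` so the sketch runs on its own:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df['departure_hour'].
departure_hour = pd.Series([10.82, 7.75, 8.45])

# Glue a column of 1s onto the feature column in one step;
# the result is an (n x 2) array with the intercept column first.
X_alt = np.column_stack([np.ones(len(departure_hour)), departure_hour])
print(X_alt.shape)  # (3, 2)
```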
$\vec{y}$ is already created for us: it's just the 'minutes' column in df.
y = df['minutes'].to_numpy()
y
array([68., 94., 63., ..., 68., 90., 83.])
Now, let's implement:
$$\vec{w}^* = ({X^TX})^{-1} X^T\vec{y}$$
# The @ symbol performs matrix multiplication!
w_star_linalg = np.linalg.inv(X.T @ X) @ X.T @ y
w_star_linalg
array([142.45, -8.19])
These numbers look familiar!
# Old formulas.
w0_star, w1_star
(142.44824158772875, -8.186941724265557)
Indeed, they're exactly the same as the w0_star and w1_star we found using our old formulas.
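A note on numerics (beyond the scope of the lecture): explicitly inverting $X^TX$ works fine here, but `np.linalg.solve` and `np.linalg.lstsq` are the standard, more numerically stable ways to get the same $\vec{w}^*$. A self-contained sketch on made-up data:

```python
import numpy as np

# Made-up data in the same shape as the commute-times problem.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.uniform(6, 11, size=40)])
y = X @ np.array([142.45, -8.19]) + rng.normal(0, 4, size=40)

# 1. Explicit inverse, as in the cell above.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# 2. Solve the linear system (X^T X) w = X^T y without forming an inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3. Least squares directly on X, without forming X^T X at all.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_inv, w_solve), np.allclose(w_inv, w_lstsq))
```

All three agree here; the differences only matter when $X^TX$ is ill-conditioned.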
# The predicted commute time if I leave at 8:30AM.
w0_star + w1_star * 8.5
72.85923693147151
How do we make predictions with the new formulas?
To find the predicted commute time for every departure hour in our dataset, we can multiply $X$ by the optimal parameter vector, $\vec{w}^*$.
$$\vec{h}^* = X \vec{w}^*$$
$\vec{h}^*$ above is the optimal hypothesis vector.
all_preds = X @ w_star_linalg
all_preds
array([53.89, 79. , 73.27, ..., 80.36, 81.46, 80.23])
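Once we have the prediction vector, a natural follow-up (not shown in the lecture cell above) is to measure the quality of the fit with mean squared error. A self-contained sketch, using the first three rows of the dataset as made-up stand-ins for the full `X`, `y`, and `w_star_linalg`:

```python
import numpy as np

# Made-up stand-ins: first three rows of the commute-times data.
X = np.array([[1., 10.82], [1., 7.75], [1., 8.45]])
y = np.array([68., 94., 63.])
w_star = np.array([142.45, -8.19])

all_preds = X @ w_star           # h* = X w*
residuals = y - all_preds        # errors of each prediction
mse = np.mean(residuals ** 2)    # mean squared error of the fit
print(mse)
```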
To make a prediction for a single data point, we must take the dot product of the optimal parameter vector, $\vec{w}^*$ (w_star_linalg), with a vector of the form $\begin{bmatrix} 1 & x_\text{new} \end{bmatrix}^T$, since this is what the rows of $X$ look like.
# Also the predicted commute time if I leave at 8:30AM.
np.dot(w_star_linalg, np.array([1, 8.5]))
72.85923693147129
This gives us the same prediction as before!
Multiple Linear Regression¶
Previously, with simple linear regression, our goal was to fit a hypothesis function of the form:
$$\begin{align*}\text{pred. commute} &= H(\text{departure hour}) \\ &= w_0 + w_1 \cdot \text{departure hour} \end{align*}$$
Now, we'll try to fit a multiple linear regression model of the form:
$$\begin{align*}\text{pred. commute} &= H(\text{departure hour}, \text{day of month}) \\ &= w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day of month} \end{align*}$$
df['date']
0     5/15/2023
1     5/16/2023
2     5/22/2023
        ...
62     3/4/2024
63     3/5/2024
64     3/7/2024
Name: date, Length: 65, dtype: object
df['day_of_month'] = pd.to_datetime(df['date']).dt.day
df[['departure_hour', 'day_of_month', 'minutes']]
 | departure_hour | day_of_month | minutes
---|---|---|---
0 | 10.82 | 15 | 68.0 |
1 | 7.75 | 16 | 94.0 |
2 | 8.45 | 22 | 63.0 |
... | ... | ... | ... |
62 | 7.58 | 4 | 68.0 |
63 | 7.45 | 5 | 90.0 |
64 | 7.60 | 7 | 83.0 |
65 rows × 3 columns
Let's create our new design matrix, $X$:
$$X = \begin{bmatrix} 1 & \text{departure hour}_1 & \text{day}_1 \\ 1 & \text{departure hour}_2 & \text{day}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{departure hour}_n & \text{day}_n \end{bmatrix}$$
X = df[['departure_hour', 'day_of_month']].copy()
X['1'] = 1
X = X[['1', 'departure_hour', 'day_of_month']].to_numpy()
X
array([[ 1.  , 10.82, 15.  ],
       [ 1.  ,  7.75, 16.  ],
       [ 1.  ,  8.45, 22.  ],
       ...,
       [ 1.  ,  7.58,  4.  ],
       [ 1.  ,  7.45,  5.  ],
       [ 1.  ,  7.6 ,  7.  ]])
w_star_multiple = np.linalg.inv(X.T @ X) @ X.T @ y
w_star_multiple
array([141.86, -8.22, 0.06])
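One way to read the tiny $w_2^*$: adding a column to the design matrix can never increase the training error, because the simpler model is just the special case $w_2 = 0$. So even a nearly useless feature gets a small nonzero weight. A self-contained sketch on made-up data (not the actual commute-times numbers):

```python
import numpy as np

# Made-up data resembling the commute-times setup.
rng = np.random.default_rng(1)
n = 60
hour = rng.uniform(6, 11, size=n)
day = rng.integers(1, 31, size=n).astype(float)
y = 142 - 8 * hour + 0.05 * day + rng.normal(0, 5, size=n)

X_simple = np.column_stack([np.ones(n), hour])          # intercept + hour
X_multi = np.column_stack([np.ones(n), hour, day])      # ... + day of month

def mse_of_fit(X, y):
    # Least-squares fit, then mean squared error on the training data.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ w) ** 2)

# The richer model's training MSE is at most the simpler model's.
print(mse_of_fit(X_multi, y) <= mse_of_fit(X_simple, y) + 1e-9)
```

Lower training error doesn't mean the extra feature helps on unseen data, of course; that question is left for later lectures on generalization.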
What do our predictions look like, for each row of the dataset?
XX, YY = np.mgrid[5:14:1, 0:31:1]
Z = w_star_multiple[0] + w_star_multiple[1] * XX + w_star_multiple[2] * YY
plane = go.Surface(x=XX, y=YY, z=Z, colorscale='Reds')
fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=df['departure_hour'],
y=df['day_of_month'],
z=df['minutes'], mode='markers', marker = {'color': '#656DF1'}))
fig.update_layout(scene=dict(xaxis_title='Departure Hour',
yaxis_title='Day of Month',
zaxis_title='Minutes'),
title='Commute Time vs. Departure Hour and Day of Month',
width=1000, height=500)
How do we make predictions for new datapoints?
# The predicted commute time if I leave at 8:30AM on the 15th of the month.
np.dot(w_star_multiple, np.array([1, 8.5, 15]))
72.80767679746616
# The predicted commute time if I leave at 8:30AM on the 30th of the month.
np.dot(w_star_multiple, np.array([1, 8.5, 30]))
73.65007448594321