In [1]:
from lec_utils import *

Lecture 17¶

Multiple Linear Regression and Feature Engineering¶

EECS 398-003: Practical Data Science, Fall 2024¶

practicaldsc.org • github.com/practicaldsc/fa24

Announcements 📣¶

  • Homework 8 is due on Friday (not Thursday!).
  • Homework 7 solutions are available at #259 on Ed.
  • Check out the new FAQs page on the course website.
    It has answers to frequently-asked theoretical questions.
  • The IA application is out for next semester and is due on Monday! See #238 on Ed for more details.

Agenda¶

  • Recap: Regression and linear algebra.
  • Multiple linear regression.
  • Regression in sklearn.
  • Feature engineering.
  • Example: Horsepower 🚗.

Today's lecture is back to being in a notebook, but is still quite math heavy.

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

Recap: Regression and linear algebra¶


Terminology recap¶

  • Define the design matrix $\color{#007aff} X \in \mathbb{R}^{n \times 2}$, observation vector $\color{orange} \vec{y} \in \mathbb{R}^n$, and parameter vector $\vec{w} \in \mathbb{R}^2$ as:
$${\color{#007aff} {X} = \begin{bmatrix} {\color{#007aff} 1} & {\color{#007aff} x_1} \\ {\color{#007aff} 1} & {\color{#007aff} x_2} \\ \vdots & \vdots \\ {\color{#007aff} 1} & {\color{#007aff} x_n} \end{bmatrix}} \qquad {\color{orange} \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}} \qquad \vec{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$$
  • Our goal is to find the $\vec w^*$ that minimizes $R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert {\color{orange} \vec{y}} - {\color{#007aff} X}\vec{w} \rVert^2$.
  • Last lecture, we showed that the best $\vec w^*$ is one that satisfies the normal equations:
$${\color{#007aff} X^TX} \vec{w}^* = {\color{#007aff} X^T} {\color{orange} \vec y}$$
  • If ${\color{#007aff} X^TX}$ is invertible, there's a unique solution for $\vec w^*$:
$$\vec w^* = ({\color{#007aff}{X^TX}})^{-1} {\color{#007aff}{X^T}}\color{orange}\vec{y}$$
  • We chose $\vec{w}^*$ so that ${\color{purple} \vec h^*} = {\color{#007aff} X} \vec{w}^*$ is the projection of $\color{orange} \vec{y}$ onto the span of the columns of the design matrix, $\color{#007aff}X$.

The optimal parameter vector, $\vec{w}^*$¶

  • To find the optimal model parameters for simple linear regression, $w_0^*$ and $w_1^*$, we previously minimized $R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i = 1}^n ({\color{orange} y_i} - (w_0 + w_1 {\color{#007aff} x_i}))^2$.
  • We found, using calculus, that:
    • $\boxed{w_1^* = \frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i - \bar{x})^2} = r \frac{\sigma_y}{\sigma_x}}$.
    • $\boxed{w_0^* = \bar{y} - w_1^* \bar{x}}$.
  • Another way of finding optimal model parameters for simple linear regression is to find the $\vec{w}^*$ that minimizes $R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert {\color{orange} \vec{y}} - {\color{#007aff} X}\vec{w} \rVert^2$.
  • The minimizer, if ${\color{#007aff}{X^TX}}$ is invertible, is the vector $\boxed{\vec{w}^* = ({\color{#007aff}{X^TX}})^{-1} {\color{#007aff}{X^T}}\color{orange}\vec{y}}$.
    If not, solve the normal equations from the previous slide.
  • These $\boxed{\text{formulas}}$ are equivalent!
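To see that these formulas really are equivalent, here's a minimal numerical check on synthetic data (the data below is made up purely for illustration; np comes from lec_utils):

# Synthetic data, for illustration only.
rng = np.random.default_rng(23)
x = rng.uniform(7, 11, size=50)                  # e.g., departure hours
y = 120 - 8 * x + rng.normal(0, 5, size=50)      # noisy linear relationship

# Calculus-based formulas.
w1_star = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0_star = y.mean() - w1_star * x.mean()

# Linear algebra: solve the normal equations X^T X w* = X^T y.
X = np.column_stack([np.ones_like(x), x])        # design matrix with a column of 1s
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The two approaches agree, up to floating point error.
(w0_star, w1_star), w_star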

Multiple linear regression¶



So far, we've fit simple linear regression models, which use only one feature ('departure_hour') for making predictions.

Incorporating multiple features¶

  • In the context of the commute times dataset, the simple linear regression model we fit was of the form:
$$\begin{align*}\text{pred. commute} &= H(\text{departure hour}) \\ &= w_0 + w_1 \cdot \text{departure hour} \end{align*}$$
  • Now, we'll try and fit a multiple linear regression model of the form:
$$\begin{align*}\text{pred. commute} &= H(\text{departure hour, day of month}) \\ &= w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day of month} \end{align*}$$
  • Linear regression with multiple features is called multiple linear regression.
  • How do we find $w_0^*$, $w_1^*$, and $w_2^*$?

Geometric interpretation¶

  • The hypothesis function:

    $$H(\text{departure hour}) = w_0 + w_1 \cdot \text{departure hour}$$

    looks like a line in 2D.

  • Questions:

    • How many dimensions do we need to graph the hypothesis function:

    $$H(\text{departure hour, day of month}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day of month}$$

    • What is the shape of the hypothesis function?

Our new hypothesis function is a plane in 3D!
Our goal is to find the plane of best fit that pierces through the cloud of points.

The hypothesis vector¶

  • When our hypothesis function is of the form:

    $$H(\text{departure hour, day of month}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day of month}$$

    the hypothesis vector $\color{purple} \vec{h} \in \mathbb{R}^n$ can be written as:

$${\color{purple} \vec{h}} = \begin{bmatrix} H(\text{departure hour}_1, \text{day}_1) \\ H(\text{departure hour}_2, \text{day}_2) \\ \vdots \\ H(\text{departure hour}_n, \text{day}_n) \end{bmatrix} = \begin{bmatrix} 1 & \text{departure hour}_1 & \text{day}_1 \\ 1 & \text{departure hour}_2 & \text{day}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{departure hour}_n & \text{day}_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$

Finding the optimal parameters¶

  • To find the optimal parameter vector, $\vec{w}^*$, we can use the design matrix $\color{#007aff} X \in \mathbb{R}^{n \times 3}$ and observation vector $\color{orange} \vec y \in \mathbb{R}^n$:
$${\color{#007aff} X = \begin{bmatrix} 1 & \text{departure hour}_1 & \text{day}_1 \\ 1 & \text{departure hour}_2 & \text{day}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{departure hour}_n & \text{day}_n \end{bmatrix}} \qquad {\color{orange} \vec{y} = \begin{bmatrix} \text{commute time}_1 \\ \text{commute time}_2 \\ \vdots \\ \text{commute time}_n \end{bmatrix}}$$
  • Then, all we need to do is solve the normal equations once again:

    $${\color{#007aff} X^TX} \vec{w}^* = {\color{#007aff} X^T} {\color{orange} \vec y}$$

    If ${\color{#007aff} X^TX}$ is invertible, we know the solution is:

$$\vec{w}^* = ({\color{#007aff}{X^TX}})^{-1} {\color{#007aff}{X^T}}\color{orange}\vec{y}$$
  • Let's generalize this notion beyond just two features.

Notation for multiple linear regression¶

  • We will need to keep track of multiple features for every individual in our dataset.
    In practice, we could have hundreds, millions, or billions of features!
  • As before, subscripts distinguish between individuals in our dataset. We have $n$ individuals, also called training examples.
  • Superscripts distinguish between features. We have $d$ features.

    $$\text{departure hour:} \:\: \color{#007aff}x^{(1)}$$

    $$\text{day of month:} \:\: \color{#007aff}x^{(2)}$$

    Think of $\color{#007aff} x^{(1)}$, $\color{#007aff}x^{(2)}$, ... as new variable names, like new letters.

Feature vectors¶

  • Suppose we have a small dataset of three days, with each day's departure hour and day of month recorded.
  • We can represent each day with a feature vector, $\color{#007aff} \vec{x} = \begin{bmatrix} x^{(1)} \\ x^{(2)} \end{bmatrix}$:
$$\color{#007aff} \vec x_1 = \begin{bmatrix} 8.45 \\ 22 \end{bmatrix} \qquad \vec x_2 = \begin{bmatrix} 8.90 \\ 28 \end{bmatrix} \qquad \vec x_3 = \begin{bmatrix} 8.72 \\ 18 \end{bmatrix}$$

Augmented feature vectors¶

  • The augmented feature vector $\text{Aug}({\color{#007aff}\vec x})$ is the vector obtained by adding a 1 to the front of feature vector $\color{#007aff} \vec x$:
$$ {\color{#007aff} \vec x = \begin{bmatrix} {\color{#007aff}x^{(1)}} \\ {\color{#007aff}x^{(2)}} \\ \vdots \\ {\color{#007aff}x^{(d)}} \end{bmatrix} \qquad \text{Aug}({\color{#007aff} \vec x}) = \begin{bmatrix} 1 \\ {\color{#007aff}x^{(1)}} \\ {\color{#007aff}x^{(2)}} \\ \vdots \\ {\color{#007aff} x^{(d)}} \end{bmatrix}} \qquad \vec w = \begin{bmatrix} w_0 \\ w_1 \\ w_2\\ \vdots \\ w_d \end{bmatrix} $$
  • For example, if $\color{#007aff} \vec x_1 = \begin{bmatrix} 8.45 \\ 22 \end{bmatrix}$, then $\text{Aug}({\color{#007aff} \vec x_1}) = {\color{#007aff} \begin{bmatrix} 1 \\ 8.45 \\ 22 \end{bmatrix}}$.
  • Then, our hypothesis function for a single data point is:
$$\begin{align*} H({\color{#007aff} \vec x}) &= w_0 + w_1 {\color{#007aff} x^{(1)}} + w_2 {\color{#007aff} x^{(2)}} + \ldots + w_d {\color{#007aff} x^{(d)}}\\ &= \vec w \cdot \text{Aug}({\color{#007aff} \vec x}) \end{align*}$$
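As a quick sketch of $H(\vec x) = \vec w \cdot \text{Aug}(\vec x)$ in code, using the feature vector above and made-up (not fitted) parameter values:

# The feature vector from the example above; the weights here are hypothetical.
x_1 = np.array([8.45, 22])             # [departure hour, day of month]
w = np.array([100.0, -5.0, 0.5])       # made-up [w0, w1, w2]

aug_x_1 = np.concatenate(([1], x_1))   # Aug(x_1) = [1, 8.45, 22]
H_x_1 = w @ aug_x_1                    # H(x_1) = w · Aug(x_1)
H_x_1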

The general problem¶

  • We have $n$ data points, $\left({\color{#007aff} \vec x_1}, {\color{orange}y_1}\right), \left({\color{#007aff} \vec x_2}, {\color{orange}y_2}\right), \ldots, \left({\color{#007aff} \vec x_n}, {\color{orange}y_n}\right)$,

where each $\color{#007aff} \vec x_i$ is a feature vector of $d$ features: $${\color{#007aff}\vec{x_i}} = \begin{bmatrix} {\color{#007aff}x^{(1)}_i} \\ {\color{#007aff}x^{(2)}_i} \\ \vdots \\ {\color{#007aff}x^{(d)}_i} \end{bmatrix}$$

  • We want to find a good linear hypothesis function:
$$\begin{align*} H({\color{#007aff} \vec x}) &= w_0 + w_1 {\color{#007aff} x^{(1)}} + w_2 {\color{#007aff} x^{(2)}} + \ldots + w_d {\color{#007aff} x^{(d)}}\\ &= \vec w \cdot \text{Aug}({\color{#007aff} \vec x}) \end{align*}$$

The general solution¶

  • Define the design matrix $ X \in \mathbb{R}^{n \times (d + 1)}$ and observation vector $\color{orange} \vec y \in \mathbb{R}^n$:
$${\color{#007aff} X = \begin{bmatrix} 1 & x^{(1)}_1 & x^{(2)}_1 & \dots & x^{(d)}_1 \\ 1 & x^{(1)}_2 & x^{(2)}_2 & \dots & x^{(d)}_2 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x^{(1)}_n & x^{(2)}_n & \dots & x^{(d)}_n \end{bmatrix} = \begin{bmatrix} \text{Aug}(\vec{x_1})^T \\ \text{Aug}(\vec{x_2})^T \\ \vdots \\ \text{Aug}(\vec{x_n})^T \end{bmatrix}} \qquad {\color{orange} \vec y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}$$
  • Then, solve the normal equations to find the optimal parameter vector, $\vec{w}^*$:

    $${\color{#007aff} X^TX} \vec{w}^* = {\color{#007aff} X^T} {\color{orange} \vec y}$$

Note on parameters¶

  • With $d$ features, $\vec w$ has $d+1$ entries.
  • Again, $\vec w^*$ represents our optimal parameter vector, that minimizes mean squared error.
  • $w_0^*$ is the bias, also known as the intercept.
  • $w_1^*, w_2^*, ... , w_d^*$ each give the weight, or coefficient, or slope, of a feature.

$$H^*({\color{#007aff} \vec x}) = w_0^* + w_1^* {\color{#007aff} x^{(1)}} + w_2^* {\color{#007aff} x^{(2)}} + \ldots + w_d^* {\color{#007aff} x^{(d)}}$$

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

Regression in sklearn¶

Loading the data¶

  • Run the cell below to load in our commute times dataset.
In [2]:
df = pd.read_csv('data/commute-times.csv')
df['day_of_month'] = pd.to_datetime(df['date']).dt.day
df.head()
Out[2]:
date day home_departure_time home_departure_mileage ... minutes_to_home work_departure_time_hr mileage_to_home day_of_month
0 5/15/2023 Mon 2023-05-15 10:49:00 15873.0 ... 72.0 17.17 53.0 15
1 5/16/2023 Tue 2023-05-16 07:45:00 15979.0 ... NaN NaN NaN 16
2 5/22/2023 Mon 2023-05-22 08:27:00 50407.0 ... 82.0 15.90 54.0 22
3 5/23/2023 Tue 2023-05-23 07:08:00 50535.0 ... NaN NaN NaN 23
4 5/30/2023 Tue 2023-05-30 09:09:00 50664.0 ... 76.0 17.12 54.0 30

5 rows × 19 columns

  • For now, the only relevant columns for us are 'departure_hour', 'day_of_month', and 'minutes'.
In [3]:
df[['departure_hour', 'day_of_month', 'minutes']]
Out[3]:
departure_hour day_of_month minutes
0 10.82 15 68.0
1 7.75 16 94.0
2 8.45 22 63.0
... ... ... ...
62 7.58 4 68.0
63 7.45 5 90.0
64 7.60 7 83.0

65 rows × 3 columns

sklearn¶

  • sklearn (scikit-learn) implements many common steps in the feature and model creation pipeline.
    It is widely used throughout industry and academia.
  • It interfaces with numpy arrays, and to an extent, pandas DataFrames.
  • Huge benefit: the documentation online is excellent.

The LinearRegression class¶

  • sklearn comes with several subpackages, including linear_model and tree, each of which contains several classes of models.
  • We'll start with the LinearRegression class from linear_model.
In [4]:
from sklearn.linear_model import LinearRegression
  • Important: From the documentation, we have:

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

In other words, LinearRegression minimizes mean squared error by default! (Per the documentation, it also includes an intercept term by default.)
In [5]:
LinearRegression?
Init signature:
LinearRegression(
    *,
    fit_intercept=True,
    copy_X=True,
    n_jobs=None,
    positive=False,
)
Docstring:     
Ordinary least squares Linear Regression.

LinearRegression fits a linear model with coefficients w = (w1, ..., wp)
to minimize the residual sum of squares between the observed targets in
the dataset, and the targets predicted by the linear approximation.

Parameters
----------
fit_intercept : bool, default=True
    Whether to calculate the intercept for this model. If set
    to False, no intercept will be used in calculations
    (i.e. data is expected to be centered).

copy_X : bool, default=True
    If True, X will be copied; else, it may be overwritten.

n_jobs : int, default=None
    The number of jobs to use for the computation. This will only provide
    speedup in case of sufficiently large problems, that is if firstly
    `n_targets > 1` and secondly `X` is sparse or if `positive` is set
    to `True`. ``None`` means 1 unless in a
    :obj:`joblib.parallel_backend` context. ``-1`` means using all
    processors. See :term:`Glossary <n_jobs>` for more details.

positive : bool, default=False
    When set to ``True``, forces the coefficients to be positive. This
    option is only supported for dense arrays.

    .. versionadded:: 0.24

Attributes
----------
coef_ : array of shape (n_features, ) or (n_targets, n_features)
    Estimated coefficients for the linear regression problem.
    If multiple targets are passed during the fit (y 2D), this
    is a 2D array of shape (n_targets, n_features), while if only
    one target is passed, this is a 1D array of length n_features.

rank_ : int
    Rank of matrix `X`. Only available when `X` is dense.

singular_ : array of shape (min(X, y),)
    Singular values of `X`. Only available when `X` is dense.

intercept_ : float or array of shape (n_targets,)
    Independent term in the linear model. Set to 0.0 if
    `fit_intercept = False`.

n_features_in_ : int
    Number of features seen during :term:`fit`.

    .. versionadded:: 0.24

feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.

    .. versionadded:: 1.0

See Also
--------
Ridge : Ridge regression addresses some of the
    problems of Ordinary Least Squares by imposing a penalty on the
    size of the coefficients with l2 regularization.
Lasso : The Lasso is a linear model that estimates
    sparse coefficients with l1 regularization.
ElasticNet : Elastic-Net is a linear regression
    model trained with both l1 and l2 -norm regularization of the
    coefficients.

Notes
-----
From the implementation point of view, this is just plain Ordinary
Least Squares (scipy.linalg.lstsq) or Non Negative Least Squares
(scipy.optimize.nnls) wrapped as a predictor object.

Examples
--------
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_
3.0...
>>> reg.predict(np.array([[3, 5]]))
array([16.])
File:           ~/miniforge3/envs/pds/lib/python3.10/site-packages/sklearn/linear_model/_base.py
Type:           ABCMeta
Subclasses:     

Fitting a multiple linear regression model¶

  • Let's aim to use sklearn to find the optimal parameters for the following model:

    $$\text{predicted commute time} = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day of month}$$

  • First, we must instantiate a LinearRegression object and fit it. By calling fit, we are saying "minimize mean squared error on this dataset and find $w^*$."
In [6]:
model_multiple = LinearRegression()
# Note that there are two arguments to fit – X and y!
# (It is not necessary to write X= and y=)
model_multiple.fit(X=df[['departure_hour', 'day_of_month']], y=df['minutes'])
Out[6]:
LinearRegression()
  • After fitting, we can access $w^*$ – that is, the best intercept and coefficients.
In [7]:
model_multiple.intercept_, model_multiple.coef_ 
Out[7]:
(141.86402699471932, array([-8.22,  0.06]))
  • These coefficients tell us that the "best way" (according to squared loss) to make commute time predictions using a linear model is using:
$$\text{predicted commute time} = 141.86 - 8.22 \cdot \text{departure hour} + 0.06 \cdot \text{day of month}$$
  • This is the plane of best fit given historical data; it is not a causal statement.
  • Let's visualize this model, as we did in Lecture 16.
In [8]:
XX, YY = np.mgrid[5:14:1, 0:31:1]
Z = model_multiple.intercept_ + model_multiple.coef_[0] * XX + model_multiple.coef_[1] * YY
plane = go.Surface(x=XX, y=YY, z=Z, colorscale='Reds')
fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=df['departure_hour'], 
                           y=df['day_of_month'], 
                           z=df['minutes'], mode='markers', marker = {'color': '#656DF1'}))
fig.update_layout(scene=dict(xaxis_title='Departure Hour',
                             yaxis_title='Day of Month',
                             zaxis_title='Minutes'),
                  title='Commute Time vs. Departure Hour and Day of Month',
                  width=1000, height=500)
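As a sanity check, the parameters sklearn found should satisfy the normal equations. Here's a sketch of that check (assuming the cells above have been run; the name X_design is ours):

# Build the design matrix by hand and solve the normal equations.
# The result should match (model_multiple.intercept_, model_multiple.coef_) up to floating point error.
X_design = np.column_stack([
    np.ones(df.shape[0]),
    df['departure_hour'],
    df['day_of_month'],
])
np.linalg.solve(X_design.T @ X_design, X_design.T @ df['minutes'])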

Making predictions¶

  • Fitted LinearRegression objects also have a predict method, which can be used to predict commute times given a 'departure_hour' and 'day_of_month'.
In [9]:
# What if I leave at 9:15AM on the 26th of the month?
model_multiple.predict([[9.25, 26]]) 
/Users/surajrampure/miniforge3/envs/pds/lib/python3.10/site-packages/sklearn/base.py:493: UserWarning:

X does not have valid feature names, but LinearRegression was fitted with feature names

Out[9]:
array([67.26])
In [10]:
# Since we trained on a DataFrame, the input to model_multiple.predict should also
# be a DataFrame with the same column names – this avoids the warning above.
# (Alternatively, we could have trained on arrays by using .to_numpy() when specifying X= and y=.)
model_multiple.predict(pd.DataFrame({'departure_hour': [9.25], 'day_of_month': [26]}))
Out[10]:
array([67.26])

Comparing models¶

  • Since we're going to start to fit lots of different models to the commute times dataset, it'll be useful to compare their mean squared errors.
  • sklearn has a built-in mean_squared_error function.
    Remember, the units of MSE are the units of $y$, squared. So the value below is 96.78 $\text{minutes}^2$.
In [11]:
from sklearn.metrics import mean_squared_error
In [12]:
mean_squared_error(df['minutes'], model_multiple.predict(df[['departure_hour', 'day_of_month']])) 
Out[12]:
96.78730488437492
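Under the hood, mean_squared_error is just the average squared residual; here's the same computation done by hand (assuming the cells above have been run):

# Same quantity, computed directly as the mean of the squared residuals.
residuals = df['minutes'] - model_multiple.predict(df[['departure_hour', 'day_of_month']])
np.mean(residuals ** 2)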
  • Let's construct a dictionary to keep track of the MSEs we've seen so far.
In [13]:
mse_dict = {}
mse_dict['departure_hour + day_of_month'] = mean_squared_error(df['minutes'], model_multiple.predict(df[['departure_hour', 'day_of_month']]))
  • To compare, let's also fit and measure a simple linear model and constant model.
In [14]:
# Simple linear model.
model_simple = LinearRegression()
model_simple.fit(X=df[['departure_hour']], y=df['minutes'])
mse_dict['departure_hour'] = mean_squared_error(df['minutes'], model_simple.predict(df[['departure_hour']]))
In [15]:
# Constant model.
model_constant = df['minutes'].mean()
mse_dict['constant'] = mean_squared_error(df['minutes'], np.ones(df.shape[0]) * model_constant)
  • As we can see, adding 'day_of_month' as a feature barely reduced our model's MSE.
    Next week, when we learn about generalization, we'll see why sometimes adding more features is a bad thing!
In [16]:
mse_dict
Out[16]:
{'departure_hour + day_of_month': 96.78730488437492,
 'departure_hour': 97.04687150819183,
 'constant': 167.535147928994}

The .score method of a LinearRegression object¶

  • Model objects in sklearn that have already been fit have a score method.
In [17]:
model_multiple.score(df[['departure_hour', 'day_of_month']], df['minutes']) 
Out[17]:
0.4222865704252339
  • That doesn't look like the MSE we just computed... what is it? 🤔

Aside: $R^2$¶

  • $R^2$, called the "coefficient of determination" or "multiple R-squared", is a measure of the quality of a linear fit.
  • There are a few equivalent ways of computing it, assuming your model is linear and has an intercept term:
$$R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$$

$$R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$$
  • Interpretation: $R^2$ is the proportion of variance in $y$ that the linear model explains.
  • In the simple linear regression case, it is the square of the correlation coefficient, $r$.
  • Key idea: $R^2$ ranges from 0 to 1. The closer it is to 1, the better the linear fit is.
    Just like $r$, $R^2$ has no units of measurement. This is unlike MSE, which has the units of $y$, squared.

Reference Slide¶

Calculating $R^2$¶

  • Let's calculate the $R^2$ for model_multiple's predictions in three different ways.
In [18]:
pred = df.assign(predicted=model_multiple.predict(df[['departure_hour', 'day_of_month']]))
pred
Out[18]:
date day home_departure_time home_departure_mileage ... work_departure_time_hr mileage_to_home day_of_month predicted
0 5/15/2023 Mon 2023-05-15 10:49:00 15873.0 ... 17.17 53.0 15 53.76
1 5/16/2023 Tue 2023-05-16 07:45:00 15979.0 ... NaN NaN 16 79.03
2 5/22/2023 Mon 2023-05-22 08:27:00 50407.0 ... 15.90 54.0 22 73.61
... ... ... ... ... ... ... ... ... ...
62 3/4/2024 Mon 2024-03-04 07:35:00 39120.0 ... 17.27 52.0 4 79.73
63 3/5/2024 Tue 2024-03-05 07:27:00 59161.0 ... 17.28 55.0 5 80.88
64 3/7/2024 Thu 2024-03-07 07:36:00 59270.0 ... NaN NaN 7 79.76

65 rows × 20 columns

  • Method 1: $R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$.
In [19]:
np.var(pred['predicted']) / np.var(pred['minutes'])
Out[19]:
0.42228657042523415
  • Method 2: $R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$.
    Note: By correlation here, we are referring to $r$, the same correlation coefficient we saw last week.
In [20]:
np.corrcoef(pred['predicted'], pred['minutes'])[0, 1] ** 2
Out[20]:
0.4222865704252346
  • Method 3: LinearRegression.score.
In [21]:
model_multiple.score(df[['departure_hour', 'day_of_month']], df['minutes'])
Out[21]:
0.4222865704252339
  • All three methods provide the same result!

Relationship between $R^2$ and MSE¶

  • For linear models with an intercept term,
$$R^2 = 1 - \frac{\text{MSE}}{\text{var}(\text{actual $y$ values})}$$
In [22]:
1 - mean_squared_error(pred['minutes'], pred['predicted']) / np.var(pred['minutes'])
Out[22]:
0.42228657042523376

LinearRegression class summary¶

| Property | Example | Description |
| --- | --- | --- |
| Initialize model parameters | lr = LinearRegression() | Create (empty) linear regression model |
| Fit the model to the data | lr.fit(X, y) | Determines regression coefficients |
| Use model for prediction | lr.predict(X_new) | Uses regression line to make predictions |
| Evaluate the model | lr.score(X, y) | Calculates the $R^2$ of the LR model |
| Access model attributes | lr.coef_, lr.intercept_ | Accesses the regression coefficients and intercept |

What's next?¶

  • So far, in our journey to predict 'minutes', we've only used two numerical features in our dataset, 'departure_hour' and 'day_of_month'.
In [23]:
df[['departure_hour', 'day_of_month', 'minutes']]
Out[23]:
departure_hour day_of_month minutes
0 10.82 15 68.0
1 7.75 16 94.0
2 8.45 22 63.0
... ... ... ...
62 7.58 4 68.0
63 7.45 5 90.0
64 7.60 7 83.0

65 rows × 3 columns

  • There's a lot of information in df that we didn't use – 'day', for example. We can't use 'day' in its current form, since it's non-numeric.
  • How do we use categorical features in a regression model?

Feature engineering ⚙️¶

The goal of feature engineering¶

  • Feature engineering is the act of finding transformations that turn raw data into effective quantitative variables.
  • A feature function $\phi$ (phi, pronounced "fee") is a mapping from raw data to $d$-dimensional space, i.e. $\phi: \text{raw data} \rightarrow \mathbb{R}^d$.
    If two observations $x_i$ and $x_j$ are "similar" in the raw data space, then $\phi(x_i)$ and $\phi(x_j)$ should also be "similar."
  • A "good" choice of features depends on many factors:
    • The kind of data, i.e. quantitative, ordinal, or nominal.
    • The relationship(s) being modeled.
    • The model type, e.g. linear models, decision tree models, neural networks.

One hot encoding¶

  • One hot encoding is a transformation that turns a categorical feature into several binary features.
  • Suppose a column has $N$ unique values, $A_1$, $A_2$, ..., $A_N$. For each unique value $A_i$, we define the following feature function:
$$\phi_i(x) = \left\{\begin{array}{ll}1 & {\rm if\ } x == A_i \\ 0 & {\rm if\ } x\neq A_i \\ \end{array}\right. $$
  • Note that 1 means "yes" and 0 means "no".
  • One hot encoding is also called "dummy encoding", and $\phi(x)$ may also be referred to as an "indicator variable".

Example: One hot encoding 'day'¶

  • For each unique value of 'day' in our dataset, we must create a column for just that 'day'.
In [24]:
df.head()
Out[24]:
date day home_departure_time home_departure_mileage ... minutes_to_home work_departure_time_hr mileage_to_home day_of_month
0 5/15/2023 Mon 2023-05-15 10:49:00 15873.0 ... 72.0 17.17 53.0 15
1 5/16/2023 Tue 2023-05-16 07:45:00 15979.0 ... NaN NaN NaN 16
2 5/22/2023 Mon 2023-05-22 08:27:00 50407.0 ... 82.0 15.90 54.0 22
3 5/23/2023 Tue 2023-05-23 07:08:00 50535.0 ... NaN NaN NaN 23
4 5/30/2023 Tue 2023-05-30 09:09:00 50664.0 ... 76.0 17.12 54.0 30

5 rows × 19 columns

In [25]:
df['day'].value_counts()
Out[25]:
day
Tue    25
Mon    20
Thu    15
Wed     3
Fri     2
Name: count, dtype: int64
In [26]:
(df['day'] == 'Tue').astype(int) 
Out[26]:
0     0
1     1
2     0
     ..
62    0
63    1
64    0
Name: day, Length: 65, dtype: int64
In [27]:
for val in df['day'].unique():
    df[f'day == {val}'] = (df['day'] == val).astype(int)
In [28]:
df.loc[:, df.columns.str.contains('day')] 
Out[28]:
day day_of_month day == Mon day == Tue day == Wed day == Thu day == Fri
0 Mon 15 1 0 0 0 0
1 Tue 16 0 1 0 0 0
2 Mon 22 1 0 0 0 0
... ... ... ... ... ... ... ...
62 Mon 4 1 0 0 0 0
63 Tue 5 0 1 0 0 0
64 Thu 7 0 0 0 1 0

65 rows × 7 columns

Using 'day' as a feature, along with 'departure_hour' and 'day_of_month'¶

  • Now that we've converted 'day' to a numerical variable, we can use it as input in a regression model. Here's the model we'll try to fit:
$$\begin{align*}\text{predicted commute time} = w_0 &+ w_1 \cdot \text{departure hour} \\ &+ w_2 \cdot \text{day of month} \\ &+ w_3 \cdot \text{day == Mon} \\ &+ w_4 \cdot \text{day == Tue} \\ &+ w_5 \cdot \text{day == Wed} \\ &+ w_6 \cdot \text{day == Thu} \end{align*}$$
  • Subtlety: Since there are only 5 values of 'day', we don't need to include 'day == Fri' as a feature. We know it's Friday if 'day == Mon', ..., 'day == Thu' are all 0.
    More on why next class!
In [29]:
X_for_ohe = df[['departure_hour', 
                'day_of_month',
                'day == Mon',
                'day == Tue',
                'day == Wed',
                'day == Thu']]
X_for_ohe
Out[29]:
departure_hour day_of_month day == Mon day == Tue day == Wed day == Thu
0 10.82 15 1 0 0 0
1 7.75 16 0 1 0 0
2 8.45 22 1 0 0 0
... ... ... ... ... ... ...
62 7.58 4 1 0 0 0
63 7.45 5 0 1 0 0
64 7.60 7 0 0 0 1

65 rows × 6 columns

In [30]:
model_with_ohe = LinearRegression()
model_with_ohe.fit(X=X_for_ohe, y=df['minutes'])
Out[30]:
LinearRegression()
  • The following cell gives us our $w^*$s:
In [31]:
model_with_ohe.intercept_, model_with_ohe.coef_
Out[31]:
(134.0430659240799, array([-8.42, -0.03,  5.09, 16.38,  5.12, 11.5 ]))
  • Thus, our trained linear model to predict commute time given 'departure_hour', 'day_of_month', and 'day' (Mon, Tue, Wed, or Thu) is:
$$\begin{align*}\text{predicted commute time} = 134 &- 8.42 \cdot \text{departure hour} \\ &- 0.03 \cdot \text{day of month} \\ &+ 5.09 \cdot \text{day == Mon} \\ &+ 16.38 \cdot \text{day == Tue} \\ &+ 5.12 \cdot \text{day == Wed} \\ &+ 11.5 \cdot \text{day == Thu} \end{align*}$$
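One consequence of one hot encoding is that, to make a prediction, we now have to supply the indicator columns ourselves. A sketch, with made-up input values:

# Predicted commute time if we leave at 8:30AM on the 15th, on a Tuesday.
# The column names (and order) must match those used when fitting on X_for_ohe.
model_with_ohe.predict(pd.DataFrame({
    'departure_hour': [8.5],
    'day_of_month': [15],
    'day == Mon': [0],
    'day == Tue': [1],
    'day == Wed': [0],
    'day == Thu': [0],
}))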

Visualizing our latest model¶

  • Our trained linear model to predict commute time given 'departure_hour', 'day_of_month', and 'day' (Mon, Tue, Wed, or Thu) is:
$$\begin{align*}\text{predicted commute time} = 134 &- 8.42 \cdot \text{departure hour} \\ &- 0.03 \cdot \text{day of month} \\ &+ 5.09 \cdot \text{day == Mon} \\ &+ 16.38 \cdot \text{day == Tue} \\ &+ 5.12 \cdot \text{day == Wed} \\ &+ 11.5 \cdot \text{day == Thu} \end{align*}$$
  • Since we have 6 features here, we'd need 7 dimensions to graph our model.
  • But, as we see in Homework 8, Question 6, our model is really a collection of five parallel planes in 3D, all with slightly different $z$-intercepts!
  • If we want to visualize in 2D, we need to pick a single feature to place on the $x$-axis.
In [32]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['departure_hour'], y=df['minutes'], 
                         mode='markers', name='Original Data'))
fig.add_trace(go.Scatter(x=df['departure_hour'], y=model_with_ohe.predict(X_for_ohe), 
                         mode='markers', name='Predicted Commute Times using Departure Hour, <br>Day of Month, and Day of Week'))
fig.update_layout(showlegend=True, title='Commute Time vs. Departure Hour',
                  xaxis_title='Departure Hour', yaxis_title='Minutes', width=1000)
  • Despite being a linear model, why doesn't this model look like a straight line?

Comparing our latest model to earlier models¶

  • Let's see how the inclusion of the day of the week impacts the quality of our predictions.
In [33]:
mse_dict['departure_hour + day_of_month + ohe day'] = mean_squared_error(
    df['minutes'],
    model_with_ohe.predict(X_for_ohe)
)
In [34]:
mse_dict
Out[34]:
{'departure_hour + day_of_month': 96.78730488437492,
 'departure_hour': 97.04687150819183,
 'constant': 167.535147928994,
 'departure_hour + day_of_month + ohe day': 70.21791287461917}
  • Adding the day of the week decreased our MSE significantly!

Reflection¶

In [35]:
df.head()
Out[35]:
date day home_departure_time home_departure_mileage ... day == Tue day == Wed day == Thu day == Fri
0 5/15/2023 Mon 2023-05-15 10:49:00 15873.0 ... 0 0 0 0
1 5/16/2023 Tue 2023-05-16 07:45:00 15979.0 ... 1 0 0 0
2 5/22/2023 Mon 2023-05-22 08:27:00 50407.0 ... 0 0 0 0
3 5/23/2023 Tue 2023-05-23 07:08:00 50535.0 ... 1 0 0 0
4 5/30/2023 Tue 2023-05-30 09:09:00 50664.0 ... 1 0 0 0

5 rows × 24 columns

  • We've one hot encoded 'day', but it required a for-loop.
  • Is there a way we could have encoded it without a for-loop?
  • Yes, using sklearn.preprocessing's OneHotEncoder. More on this soon!
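As a preview, here's a minimal sketch of what that could look like (drop='first' drops one redundant category, the same idea as omitting 'day == Fri' above):

from sklearn.preprocessing import OneHotEncoder

# One hot encode 'day' without a for-loop.
ohe = OneHotEncoder(drop='first')
day_features = ohe.fit_transform(df[['day']]).toarray()  # fit_transform returns a sparse matrix
ohe.get_feature_names_out(), day_features[:5]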

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

Example: Horsepower 🚗¶


Loading the (new) data¶

  • The following dataset, built into the seaborn plotting library, contains various information about (older) cars.
In [36]:
mpg = sns.load_dataset('mpg').dropna()
mpg.head()
Out[36]:
mpg cylinders displacement horsepower ... acceleration model_year origin name
0 18.0 8 307.0 130.0 ... 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 ... 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 ... 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 ... 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 ... 10.5 70 usa ford torino

5 rows × 9 columns

  • We really do mean old:
In [37]:
mpg['model_year'].value_counts()
Out[37]:
model_year
73    40
78    36
76    34
      ..
71    27
80    27
74    26
Name: count, Length: 13, dtype: int64
  • Let's investigate the relationship between 'horsepower' and 'mpg'.

The relationship between 'horsepower' and 'mpg'¶

In [38]:
px.scatter(mpg, x='horsepower', y='mpg')
  • It appears that there is a negative association between 'horsepower' and 'mpg', though it's not quite linear.
  • Let's try and fit a simple linear model that uses 'horsepower' to predict 'mpg' and see what happens.

Predicting 'mpg' using 'horsepower'¶

In [39]:
car_model = LinearRegression()
car_model.fit(mpg[['horsepower']], mpg['mpg'])
Out[39]:
LinearRegression()
  • What do our predictions look like?
In [40]:
hp_points = pd.DataFrame({'horsepower': [25, 225]})
fig = px.scatter(mpg, x='horsepower', y='mpg')
fig.add_trace(go.Scatter(
    x=hp_points['horsepower'],
    y=car_model.predict(hp_points),
    mode='lines',
    name='Predicted MPG using Horsepower'
))
  • Our regression line doesn't capture the curvature in the relationship between 'horsepower' and 'mpg'.
In [41]:
car_model.score(mpg[['horsepower']], mpg['mpg'])
Out[41]:
0.6059482578894348

Linear in the parameters¶

  • Using linear regression, we can fit rules like: $$ w_0 + w_1x+w_2x^2 \qquad \qquad w_1e^{-x^{{(1)}^2}} + w_2 \cos(x^{(2)}+\pi) +w_3 \frac{\log 2x^{(3)}}{x^{(2)}} $$
    • This includes arbitrary polynomials.
    • These are all linear combinations of (just) features.
  • For any of the above examples, we could express our model as a product of a design matrix and parameter vector, and that's all that LinearRegression in sklearn needs.
    What we put in the X argument to model.fit is up to us!
  • Using linear regression, we can't fit rules like: $$ w_0 + e^{w_1 x} \qquad \qquad w_0 + \sin (w_1 x^{(1)} + w_2 x^{(2)}) $$
    • These are not linear combinations of just features.
  • We can have any number of parameters, as long as our hypothesis function is linear in the parameters, or linear when we think of it as a function of the parameters.
$$w_0 + w_1 f_1(\vec{x}) + w_2 f_2(\vec{x}) + ... + w_d f_d(\vec{x})$$
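For example, here's a sketch of fitting the quadratic rule $w_0 + w_1x + w_2x^2$ with LinearRegression, using 'horsepower' from the mpg dataset loaded above purely for illustration (the names quad_features and quad_model are ours):

# H(x) = w0 + w1 * x + w2 * x^2 is linear in the parameters,
# so we can fit it by handing LinearRegression the columns x and x^2.
quad_features = pd.DataFrame({
    'hp': mpg['horsepower'],
    'hp^2': mpg['horsepower'] ** 2,
})
quad_model = LinearRegression()
quad_model.fit(quad_features, mpg['mpg'])
quad_model.intercept_, quad_model.coef_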

Linearization¶

  • The Tukey Mosteller Bulge Diagram helps us pick which transformations to apply to data in order to linearize it.
  • The bottom-left quadrant appears to match the shape of the scatter plot between 'horsepower' and 'mpg' the best – let's try taking the log of 'horsepower' ($X$).
In [42]:
mpg['log hp'] = np.log(mpg['horsepower'])
  • What does our data look like now?
In [43]:
px.scatter(mpg, x='log hp', y='mpg')

Predicting 'mpg' using log('horsepower')¶

  • Let's fit another linear model.
In [44]:
car_model_log = LinearRegression()
car_model_log.fit(mpg[['log hp']], mpg['mpg'])
Out[44]:
LinearRegression()
  • Note that implicitly, we defined the following design matrix:
$$X = \begin{bmatrix} 1 & \log(x_1) \\ 1 & \log(x_2) \\ \vdots & \vdots \\ 1 & \log(x_n) \end{bmatrix}$$
  • What do our predictions look like now?
In [45]:
log_hp_points = pd.DataFrame({'log hp': [3.7, 5.5]})
fig = px.scatter(mpg, x='log hp', y='mpg')
fig.add_trace(go.Scatter(
    x=log_hp_points['log hp'],
    y=car_model_log.predict(log_hp_points),
    mode='lines',
    name='Predicted MPG using log(Horsepower)'
))
  • The fit looks a bit better! How about the $R^2$?
In [46]:
car_model_log.score(mpg[['log hp']], mpg['mpg'])
Out[46]:
0.6683347641192137
  • Also a bit better!
  • What do our predictions look like on the original, non-transformed scatter plot? Let's see:
In [47]:
fig = px.scatter(mpg, x='horsepower', y='mpg')
fig.add_trace(
    go.Scatter(
        x=mpg['horsepower'], 
        y=car_model_log.intercept_ + car_model_log.coef_[0] * np.log(mpg['horsepower']),  
        mode='markers', name='Predicted MPG using log(Horsepower)'
    )
)
fig
  • Our predictions that used $\log(\text{Horsepower})$ as an input don't fall on a straight line. We shouldn't expect them to; the orange dots come from:
$$\text{predicted MPG} = 108.70 - 18.582 \cdot \log(\text{Horsepower})$$
In [48]:
car_model_log.intercept_, car_model_log.coef_
Out[48]:
(108.69970699574483, array([-18.58]))
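One practical note: to predict the 'mpg' of a new car, we need to apply the same log transformation to its 'horsepower' first. A sketch, using a hypothetical 100-horsepower car:

# The model expects log(horsepower), so we transform before predicting.
car_model_log.predict(pd.DataFrame({'log hp': [np.log(100)]}))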

Question 🤔 (Answer at practicaldsc.org/q)

Which hypothesis function is not linear in the parameters?

  • A. $H(\vec{x}) = w_1 (x^{(1)} x^{(2)}) + \frac{w_2}{x^{(1)}} \sin \left( x^{(2)} \right)$
  • B. $H(\vec{x}) = 2^{w_1} x^{(1)}$
  • C. $H(\vec{x}) = \vec{w} \cdot \text{Aug}(\vec{x})$
  • D. $H(\vec{x}) = w_1 \cos (x^{(1)}) + w_2 2^{x^{(2)} \log x^{(3)}}$
  • E. More than one of the above.

What's next?¶

  • Next class, we'll look at more examples of feature transformations.
  • We'll also look at how to implement more of the feature engineering process in sklearn, ultimately creating Pipeline objects that both:
    • Define a model.
    • Define the necessary feature transformations to fit the model.