In [1]:
from lec_utils import *
import lec23_util as util
from IPython.display import YouTubeVideo
from ipywidgets import interact
Announcements 📣¶
- The Portfolio Homework's checkpoint is due on Monday, November 25th – no slip days allowed!
  The full homework is due on Saturday, December 7th (no slip days!).
- Homework 10 will be out by tomorrow – sorry for the delay!
  We'll adjust the deadline accordingly.
- The Grade Report now includes scores and slip days through Homework 9 – make sure it's accurate!
Agenda¶
- Recap: Gradient descent for multivariate functions.
- Classification overview.
- Survey of classification methods.
- $k$-Nearest Neighbors 🏡🏠.
- Decision trees 🎄.
- Logistic regression 📈.
- Evaluating classifiers.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
Recap: Gradient descent for multivariate functions¶
Example: Gradient descent for simple linear regression¶
- To find optimal model parameters for the model $H(x) = w_0 + w_1 x$ and squared loss, we minimized empirical risk:
$$R_\text{sq}(\vec{w}) = \frac{1}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$$
- This is a differentiable function of multiple variables, so it has a gradient!
- Key idea: To find $\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \end{bmatrix}$, we could use gradient descent!
- Why would we, when closed-form solutions exist?

At any point, there are many directions in which you can go "up", but there's only one "steepest direction up", and that's the direction of the gradient!
Gradient descent for simple linear regression, visualized¶
In [2]:
YouTubeVideo('oMk6sP7hrbk')
Out[2]:
Gradient descent for simple linear regression, implemented¶
- Let's use gradient descent to fit a simple linear regression model to predict commute time in 'minutes' from 'departure_hour'.
In [3]:
df = pd.read_csv('data/commute-times.csv')
df[['departure_hour', 'minutes']]
util.make_scatter(df)
In [4]:
x = df['departure_hour']
y = df['minutes']
- First, let's remind ourselves what $w_0^*$ and $w_1^*$ are supposed to be.
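- For reference, the closed-form least squares solutions – which the next two cells compute directly – are:
$$w_1^* = r \cdot \frac{\sigma_y}{\sigma_x} \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$
where $r$ is the correlation coefficient between $x$ and $y$, and $\sigma_x, \sigma_y$ are their standard deviations.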
In [5]:
slope = np.corrcoef(x, y)[0, 1] * np.std(y) / np.std(x)
slope
Out[5]:
-8.186941724265557
In [6]:
intercept = np.mean(y) - slope * np.mean(x)
intercept
Out[6]:
142.44824158772875
Implementing partial derivatives¶
$$R_\text{sq}(\vec{w}) = \frac{1}{n} \sum_{i = 1}^n ( y_i - (w_0 + w_1 x_i ))^2$$
$$\nabla R_\text{sq}(\vec{w}) = \begin{bmatrix} \displaystyle -\frac{2}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i)) \\ \displaystyle -\frac{2}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i)) x_i \end{bmatrix}$$
In [7]:
def dR_w0(w0, w1):
    return -2 * np.mean(y - (w0 + w1 * x))

def dR_w1(w0, w1):
    return -2 * np.mean((y - (w0 + w1 * x)) * x)
Implementing gradient descent¶
- The update rule we'll follow is:
$$\vec{w}^{(t+1)} = \vec{w}^{(t)} - \alpha \nabla R_\text{sq}\left( \vec{w}^{(t)} \right)$$
- We can treat this as two separate update equations:
$$w_0^{(t+1)} = w_0^{(t)} - \alpha \frac{\partial R_\text{sq}}{\partial w_0}\left( \vec{w}^{(t)} \right) \qquad w_1^{(t+1)} = w_1^{(t)} - \alpha \frac{\partial R_\text{sq}}{\partial w_1}\left( \vec{w}^{(t)} \right)$$
- Let's initialize $w_0^{(0)} = 0$ and $w_1^{(0)} = 0$, and choose the step size $\alpha = 0.01$.
  These initial guesses are arbitrary – gradient descent will refine them.
In [8]:
# We'll store our guesses so far, so we can look at them later.
def gradient_descent_for_regression(w0_initial, w1_initial, alpha, threshold=0.0001):
    w0, w1 = w0_initial, w1_initial
    w0_history = [w0]
    w1_history = [w1]
    while True:
        w0 = w0 - alpha * dR_w0(w0, w1)
        w1 = w1 - alpha * dR_w1(w0, w1)
        w0_history.append(w0)
        w1_history.append(w1)
        if np.abs(w0_history[-1] - w0_history[-2]) <= threshold:
            break
    return w0_history, w1_history
In [9]:
w0_history, w1_history = gradient_descent_for_regression(0, 0, 0.01)
In [10]:
w0_history[-1]
Out[10]:
142.1051891023626
In [11]:
w1_history[-1]
Out[11]:
-8.146983792459055
- It seems that we converge to (roughly) the right values! But how many iterations did it take? What could we do to speed it up?
In [12]:
len(w0_history)
Out[12]:
20664
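- Over 20,000 iterations! One common remedy – sketched below as an aside, not something this lecture implements – is to standardize the feature first, which makes the empirical risk surface much better conditioned, so the same step size converges in far fewer iterations. Note that the resulting $w_1$ is the slope with respect to the standardized feature, not the original 'departure_hour'.
In [ ]:
# Hedged sketch (not from the lecture): rerun gradient descent after
# standardizing the feature. The fitted w1 is the slope for the
# *standardized* feature.
x_standardized = (x - np.mean(x)) / np.std(x)

def dR_w0_std(w0, w1):
    return -2 * np.mean(y - (w0 + w1 * x_standardized))

def dR_w1_std(w0, w1):
    return -2 * np.mean((y - (w0 + w1 * x_standardized)) * x_standardized)

def gradient_descent_standardized(w0, w1, alpha, threshold=0.0001):
    w0_history, w1_history = [w0], [w1]
    while True:
        w0 = w0 - alpha * dR_w0_std(w0, w1)
        w1 = w1 - alpha * dR_w1_std(w0, w1)
        w0_history.append(w0)
        w1_history.append(w1)
        if np.abs(w0_history[-1] - w0_history[-2]) <= threshold:
            break
    return w0_history, w1_history

w0_hist_std, w1_hist_std = gradient_descent_standardized(0, 0, 0.01)
len(w0_hist_std)  # Far fewer iterations than before.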
Classification overview¶
The taxonomy of machine learning¶
- So far, we've focused on building regression models.
- Regression is a form of supervised learning, in which the target variable (i.e., the $y$-values we're trying to predict) is numerical.
For example, a predicted commute time could technically be any real number.
- Next, we'll focus on classification, a form of supervised learning in which the target variable is categorical.
Example classification problems¶
- Does this person have diabetes?
This is an example of binary classification – there are only two possible classes, or categories. In binary classification, the two classes are typically 1 (yes) and 0 (no).
- Is this digit a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9?
This is an example of multi-class classification, where there are multiple possible classes.
- Will Michigan win this week?
- Is this picture of a dog, cat, zebra, or hamster?
The plan¶
- When we introduced regression, we started by understanding the theoretical foundations on paper, and then learned how to build models in sklearn.
- This time, we'll do the reverse: we'll start by learning how to use classifiers in sklearn, and then over the next few lectures, we'll dive deeper into the internals of a few:
  - $k$-Nearest Neighbors.
  - Decision trees.
  - Logistic regression.
Loading the data¶
- Our first classification example will involve predicting whether or not a patient has diabetes, given other information about their health.
In [13]:
diabetes = pd.read_csv('data/diabetes.csv')
display_df(diabetes, cols=9)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.63 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.35 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.67 | 32 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.24 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.35 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.32 | 23 | 0 |
768 rows × 9 columns
In [14]:
# 0 means no diabetes, 1 means yes diabetes.
diabetes['Outcome'].value_counts()
Out[14]:
Outcome
0    500
1    268
Name: count, dtype: int64
- 'Glucose' is measured in mg/dL (milligrams per deciliter).
- 'BMI' is calculated as $\text{BMI} = \frac{\text{weight (kg)}}{\left[ \text{height (m)} \right]^2}$.
- Let's start by using 'Glucose' and 'BMI' to predict whether or not a patient has diabetes ('Outcome').
- But first, a train-test split:
In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = (
    train_test_split(diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=1)
)
Visualizing the data¶
- Let's visualize the relationship between X_train and y_train. There are three numeric variables at play here – 'Glucose', 'BMI', and 'Outcome' – so we can use a 3D scatter plot.
In [16]:
px.scatter_3d(X_train.assign(Outcome=y_train),
              x='Glucose', y='BMI', z='Outcome',
              title='Relationship between Glucose, BMI, and Diabetes',
              width=800, height=600)
- Since there are only two possible 'Outcome's, we can draw a 2D scatter plot of 'BMI' vs. 'Glucose' and color each point by 'Outcome'. Below, class 0 (orange) is "no diabetes" and class 1 (blue) is "diabetes".
In [17]:
fig = (
    X_train.assign(Outcome=y_train.astype(str).replace({'0': 'no diabetes', '1': 'yes diabetes'}))
           .plot(kind='scatter', x='Glucose', y='BMI', color='Outcome',
                 color_discrete_map={'no diabetes': 'orange', 'yes diabetes': 'blue'},
                 title='Relationship between Glucose, BMI, and Diabetes')
           .update_layout(width=800)
)
fig
- Using this dataset, how can we classify whether someone (not already in the dataset) has diabetes, given their 'Glucose' and 'BMI'?
- Intuition: If a new person's feature vector is close to the blue points, we'll predict blue (diabetes); if they're close to the orange points, we'll predict orange (no diabetes).
Classifier 1: $k$-Nearest Neighbors 🏡🏠¶
$k$-Nearest Neighbors 🏡🏠¶
- Suppose we're given a new individual, $\vec{x}_\text{new} = \begin{bmatrix} \text{Glucose}_\text{new} \\ \text{BMI}_\text{new} \end{bmatrix}$.
- The $k$-Nearest Neighbors classifier ($k$-NN for short) classifies $\vec{x}_\text{new}$ by:
- Finding the $k$ closest points in the training set to $\vec{x}_\text{new}$.
- Predicting that $\vec{x}_\text{new}$ belongs to the most common class among those $k$ closest points.
In [18]:
fig
- Example: Suppose $k = 6$. If, among the 6 closest points to $\vec{x}_\text{new}$, there are 4 blue and 2 orange points, we'd predict blue (diabetes).
What if there are ties? Read here.
- $k$ is a hyperparameter that should be chosen through cross-validation.
As we've seen in Homework 9 (and 10!) in the context of $k$-NN regression, smaller values of $k$ tend to overfit significantly.
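- To make the rule above concrete, here's a minimal from-scratch sketch of the $k$-NN prediction rule for a single new point. It's illustrative only – the next subsection uses sklearn's built-in implementation – and knn_predict_one is a hypothetical helper name.
In [ ]:
# Illustrative sketch of the k-NN rule described above (not part of the
# original notebook; sklearn's KNeighborsClassifier is used below instead).
def knn_predict_one(x_new, X_train, y_train, k=6):
    # Euclidean distance from x_new to every training point.
    dists = np.sqrt(((X_train.to_numpy() - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))
    # Labels of the k closest training points.
    nearest_labels = y_train.to_numpy()[np.argsort(dists)[:k]]
    # Predict the most common class among those k labels.
    return pd.Series(nearest_labels).mode()[0]

# For example: knn_predict_one([125, 40], X_train, y_train, k=6)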
KNeighborsClassifier in sklearn¶
In [19]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
- Let's fit a KNeighborsClassifier by using cross-validation to choose a value of $k$ from 1 through 50.
  Note that KNeighborsClassifiers have several other hyperparameters. One of them is the metric used to measure distances; the default is the standard Euclidean (Pythagorean) distance, i.e. $\text{dist}(\vec u, \vec v) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + ... + (u_d - v_d)^2}$.
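- For instance (an illustrative aside, using made-up feature values), the Euclidean distance between two (Glucose, BMI) feature vectors can be computed with np.linalg.norm:
In [ ]:
# Distance between the hypothetical feature vectors [125, 40] and [130, 35].
np.linalg.norm(np.array([125, 40]) - np.array([130, 35]))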
In [20]:
model_knn = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': range(1, 51)}
)
model_knn.fit(X_train, y_train)
Out[20]:
GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': range(1, 51)})
In [21]:
model_knn.best_params_
Out[21]:
{'n_neighbors': 28}
- Cross-validation chose $k = 28$. With the resulting model, we can make predictions using the predict method, just like with regressors.
  Note that all of the work in making the prediction – finding the 28 nearest neighbors, for instance – is done when we call predict. "Training" does very little.
In [22]:
# To know what reasonable values for 'Glucose' and 'BMI' might be, let's look at the plot again.
fig
In [23]:
model_knn.predict(pd.DataFrame([{
    'Glucose': 125,
    'BMI': 40
}]))
Out[23]:
array([0])
- What does the resulting model look like? Can we visualize it?
Decision boundaries¶
- The decision boundaries of a classifier visualize the regions in the feature space that separate different predicted classes.
- The decision boundaries for model_knn are visualized below.
  If a new person's feature vector lies in the blue region, we'd predict they do have diabetes; otherwise, we'd predict they don't.
In [24]:
util.show_decision_boundary(model_knn, X_train, y_train, title='Decision Boundary when $k = 28$')
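- How is a plot like this produced? util.show_decision_boundary's internals aren't shown here, but a typical recipe – sketched below, assuming matplotlib is available – is to evaluate the classifier on a fine grid of ('Glucose', 'BMI') values and color each grid point by its predicted class.
In [ ]:
# Sketch of the usual decision-boundary recipe (illustrative; the helper's
# actual implementation may differ).
import matplotlib.pyplot as plt

gx, gy = np.meshgrid(
    np.linspace(X_train['Glucose'].min(), X_train['Glucose'].max(), 200),
    np.linspace(X_train['BMI'].min(), X_train['BMI'].max(), 200),
)
grid = pd.DataFrame({'Glucose': gx.ravel(), 'BMI': gy.ravel()})

# Predicted class (0 or 1) at every grid point, reshaped back into a grid.
preds = model_knn.predict(grid).reshape(gx.shape)

plt.contourf(gx, gy, preds, alpha=0.3)
plt.scatter(X_train['Glucose'], X_train['BMI'], c=y_train, s=10)
plt.xlabel('Glucose')
plt.ylabel('BMI');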
- What would the decision boundaries look like if $k$ increased or decreased?
Play with the slider below to find out!
In [25]:
from ipywidgets import interact
interact(lambda k: util.visualize_k(k, X_train, y_train), k=(1, 51));
- What if $k = n$, the number of points in the training set?
In [26]:
util.visualize_k(576, X_train, y_train)
Quantifying the performance of a classifier¶
- For regression models, our default evaluation metric was mean squared error.
  Error is bad, so lower values indicate better model performance.
- The most common evaluation metric in classification is accuracy:
$$\text{accuracy} = \frac{\text{\# data points classified correctly}}{\text{\# data points}}$$
  Accuracy ranges from 0 to 1, i.e. 0% to 100%. Higher values indicate better model performance.
In [27]:
# Equivalent to 75%.
(model_knn.predict(X_test) == y_test).mean()
Out[27]:
0.75
- This is the default metric that the score method of a classifier computes, too.
In [28]:
model_knn.score(X_test, y_test)
Out[28]:
0.75
In [29]:
# For future reference.
test_scores = pd.Series()
test_scores['knn with k = 28'] = model_knn.score(X_test, y_test)
test_scores
Out[29]:
knn with k = 28    0.75
dtype: float64
- Accuracy is not the only metric we care about, and can sometimes be misleading. More on this soon!
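- One quick illustration of why accuracy alone can mislead (an aside, not from the lecture): a "classifier" that always predicts the majority class, 0, never detects diabetes, yet its test accuracy isn't dramatically lower than the models above.
In [ ]:
# Accuracy of a baseline that always predicts "no diabetes" (class 0).
(y_test == 0).mean()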
Activity¶
It seems that a $k$-NN classifier that uses $k = 1$ should achieve 100% training accuracy. Why doesn't the model defined below have 100% training accuracy?
In [30]:
model_k1 = KNeighborsClassifier(n_neighbors=1)
model_k1.fit(X_train, y_train)
Out[30]:
KNeighborsClassifier(n_neighbors=1)
In [31]:
# Training accuracy – high, but not 100%.
model_k1.score(X_train, y_train)
Out[31]:
0.9913194444444444
In [32]:
# Accuracy on test set is lower than when k = 28!
model_k1.score(X_test, y_test)
Out[32]:
0.6822916666666666
In [33]:
test_scores['knn with k = 1'] = model_k1.score(X_test, y_test)
test_scores
Out[33]:
knn with k = 28    0.75
knn with k = 1     0.68
dtype: float64
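- One way to probe the activity question (a sketch, not the official answer): check whether the training set contains identical ('Glucose', 'BMI') pairs with different 'Outcome's – if it does, a $k = 1$ classifier can't possibly classify all of its own training points correctly.
In [ ]:
# Number of (Glucose, BMI) combinations in the training set whose rows
# don't all share the same Outcome.
(
    X_train.assign(Outcome=y_train)
           .groupby(['Glucose', 'BMI'])['Outcome']
           .nunique()
           .gt(1)
           .sum()
)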
Discussion¶
Why should we generally standardize features before using a $k$-NN classifier?
In [34]:
X_train_scaled = X_train.copy()
X_train_scaled['Glucose * 2'] = X_train_scaled['Glucose'] * 2
(
    X_train_scaled.assign(Outcome=y_train.astype(str).replace({'0': 'no diabetes', '1': 'yes diabetes'}))
                  .plot(kind='scatter', x='Glucose * 2', y='BMI', color='Outcome',
                        color_discrete_map={'no diabetes': 'orange', 'yes diabetes': 'blue'},
                        title='Relationship between Glucose * 2, BMI, and Diabetes')
                  .update_layout(width=1300)
                  .update_xaxes(tickvals=np.arange(0, 500, 100))
)
In [35]:
fig
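- In practice, we'd perform the standardization inside a Pipeline, so that the scaling is learned from the training data only. Below is a minimal sketch (not part of the original notebook; the name model_knn_scaled is made up, and $k = 28$ is reused from before rather than re-tuned).
In [ ]:
# Sketch: standardize both features, then apply k-NN. The scaler's means and
# standard deviations are fit on the training set only.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model_knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=28))
model_knn_scaled.fit(X_train, y_train)
model_knn_scaled.score(X_test, y_test)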
Parametric vs. non-parametric models¶
- The $k$-Nearest Neighbors classifier is an example of a non-parametric machine learning method.
- Linear regression, on the other hand, is parametric.
- Some differences between parametric and non-parametric models:
| Parametric | Non-Parametric |
|---|---|
| There's a fixed set of coefficients (parameters), $w_0, w_1, ..., w_d$, that we'll use for making predictions, and the number of coefficients is independent of the training set size. | No fixed set of parameters; model complexity increases as the training set size increases. |
| Parametric methods make assumptions about the shape of the data and/or its underlying probability distribution. For instance, linear models assume a linear relationship between the features $X$ and target $\vec{y}$. There's a connection between the squared loss function and maximum likelihood estimation, too. | Non-parametric methods make no assumptions about the shape of the data. |
Classifier 2: Decision trees 🎄¶
Decision trees 🎄¶
- Suppose we're given a new individual, $\vec{x}_\text{new} = \begin{bmatrix} \text{Glucose}_\text{new} \\ \text{BMI}_\text{new} \end{bmatrix}$.
- The decision tree classifier classifies $\vec{x}_\text{new}$ by:
  1. Asking a series of yes/no questions about $\text{Glucose}_\text{new}$ and $\text{BMI}_\text{new}$, e.g.:
     - Is $\text{Glucose}_\text{new} \leq 129.5$?
     - If so, is $\text{BMI}_\text{new} \leq 26.3$?
     - If not, is $\text{BMI}_\text{new} \leq 29.95$?
     - $\vdots$
  2. Once it runs out of questions to ask, it predicts that $\vec{x}_\text{new}$ belongs to the most common class among training set points that had the same answers as $\vec{x}_\text{new}$.
- Visually, a fit decision tree is a flowchart of yes/no questions like the ones above; we'll draw the fitted tree itself at the end of this section.
- Decision trees are also non-parametric!
DecisionTreeClassifier in sklearn¶
In [36]:
from sklearn.tree import DecisionTreeClassifier
- Let's fit a DecisionTreeClassifier.
  One of the main hyperparameters is max_depth, the number of questions to ask before making a prediction. Typically, we'd choose it with cross-validation, but for now we'll hard-code it.
In [37]:
model_tree = DecisionTreeClassifier(max_depth=3)
model_tree.fit(X_train, y_train)
Out[37]:
DecisionTreeClassifier(max_depth=3)
- The decision tree achieves a slightly higher test set accuracy than the cross-validated $k$-NN model.
In [38]:
model_tree.score(X_test, y_test)
Out[38]:
0.7708333333333334
In [39]:
test_scores['decision tree with depth = 3'] = model_tree.score(X_test, y_test)
test_scores
Out[39]:
knn with k = 28                 0.75
knn with k = 1                  0.68
decision tree with depth = 3    0.77
dtype: float64
- But what does it look like?
Decision boundaries for a decision tree classifier¶
In [40]:
util.show_decision_boundary(model_tree, X_train, y_train, title='Decision Boundary for a Tree of Depth 3')
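- To see the actual questions asked at each split, we can draw the fitted tree itself. The cell below is a sketch that isn't part of the original notebook; it assumes matplotlib is available.
In [ ]:
# Draw the fitted depth-3 tree: each internal node shows the question asked,
# and each leaf shows the predicted class.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plot_tree(model_tree, feature_names=['Glucose', 'BMI'],
          class_names=['no diabetes', 'diabetes'], filled=True);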