In [1]:
from lec_utils import *
import lec21_util as util
Lecture 21¶
Introduction to Classification¶
EECS 398: Practical Data Science, Winter 2025¶
practicaldsc.org • github.com/practicaldsc/wn25 • 📣 See latest announcements here on Ed
Agenda 📆¶
- Classification overview.
- Survey of classification methods.
- $k$-nearest neighbors 🏡🏠.
- Decision trees 🎄.
- Evaluating classifiers.
- Multiclass classification 🐧.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
Classification overview¶
The taxonomy of machine learning¶
- So far, we've focused on building regression models.
- Regression is a form of supervised learning, in which the target variable (i.e., the $y$-values we're trying to predict) is numerical.
For example, a predicted commute time could technically be any real number.
- Next, we'll focus on classification, a form of supervised learning in which the target variable is categorical.
Example classification problems¶
- Does this person have diabetes?
This is an example of binary classification – there are only two possible classes, or categories. In binary classification, the two classes are typically 1 (yes) and 0 (no).
- Is this digit a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9?
This is an example of multi-class classification, where there are multiple possible classes.
- Is this picture of a dog, cat, zebra, or hamster?
The plan¶
- When we introduced regression, we started by understanding the theoretical foundations on paper, and then learned how to build models in sklearn.
- This time, we'll do the reverse: we'll start by learning how to use classifiers in sklearn, and then over the next few lectures, we'll dive deeper into the internals of a few.
- Today: $k$-nearest neighbors and decision trees.
- Lectures 22-23: Logistic regression (and, potentially, Naïve Bayes).
Loading the data 🏥¶
- Our first classification example will involve predicting whether or not a patient has diabetes, given other information about their health.
In [2]:
diabetes = pd.read_csv('data/diabetes.csv')
display_df(diabetes, cols=9)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.63 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.35 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.67 | 32 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.24 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.35 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.32 | 23 | 0 |
768 rows × 9 columns
In [3]:
# 0 means no diabetes, 1 means yes diabetes.
diabetes['Outcome'].value_counts()
Out[3]:
Outcome
0    500
1    268
Name: count, dtype: int64
- 'Glucose' is measured in mg/dL (milligrams per deciliter); 'BMI' is calculated as $\text{BMI} = \frac{\text{weight (kg)}}{\left[ \text{height (m)} \right]^2}$.
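For instance, a hypothetical patient who weighs 70 kg and is 1.70 m tall would have $\text{BMI} = \frac{70}{1.70^2} \approx 24.2$.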
- Let's start by using these two features to predict whether or not a patient has diabetes ('Outcome').
- But first, a train-test split:
In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = (
train_test_split(diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=1)
)
X_train
Out[4]:
| | Glucose | BMI |
|---|---|---|
| 118 | 97 | 28.2 |
| 205 | 111 | 23.9 |
| 506 | 180 | 36.5 |
| ... | ... | ... |
| 72 | 126 | 43.4 |
| 235 | 171 | 43.6 |
| 37 | 102 | 32.9 |
576 rows × 2 columns
Visualizing the data¶
- Let's visualize the relationship between X_train and y_train. There are three numeric variables at play here – 'Glucose', 'BMI', and 'Outcome' – so we can use a 3D scatter plot.
In [5]:
px.scatter_3d(X_train.assign(Outcome=y_train),
x='Glucose', y='BMI', z='Outcome',
title='Relationship between Glucose, BMI, and Diabetes',
width=800, height=600)
- Since there are only two possible values of 'Outcome', we can draw a 2D scatter plot of 'BMI' vs. 'Glucose' and color each point by 'Outcome'. Below, class 0 (orange) is "no diabetes" and class 1 (blue) is "diabetes".
In [6]:
fig = util.create_base_scatter(X_train, y_train)
fig
- Using this dataset, how can we classify whether someone new (not already in the dataset) has diabetes, given their 'Glucose' and 'BMI'?
- Intuition: If a new person's feature vector is close to the blue points, we'll predict blue (diabetes); if they're close to the orange points, we'll predict orange (no diabetes).
Classifier 1: $k$-nearest neighbors 🏡🏠¶
$k$-nearest neighbors 🏡🏠¶
- Suppose we're given a new individual, $\vec{x}_\text{new} = \begin{bmatrix} \text{Glucose}_\text{new} \\ \text{BMI}_\text{new} \end{bmatrix}$.
- The $k$-nearest neighbors classifier ($k$-NN for short) classifies $\vec{x}_\text{new}$ by:
- Finding the $k$ closest points in the training set to $\vec{x}_\text{new}$.
- Predicting that $\vec{x}_\text{new}$ belongs to the most common class among those $k$ closest points.
In [7]:
fig
- Example: Suppose $k = 6$. If, among the 6 closest points to $\vec{x}_\text{new}$, there are 4 blue and 2 orange points, we'd predict blue (diabetes).
What if there are ties? Read here.
- $k$ is a hyperparameter that should be chosen through cross-validation.
As we've seen in Homework 8 and 9, in the context of $k$-NN regression, smaller values of $k$ tend to overfit significantly.
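To make the two-step procedure above concrete, here is a minimal from-scratch sketch of a single $k$-NN prediction. The function name and the use of raw numpy arrays are ours, not part of the lecture code; scikit-learn's KNeighborsClassifier, used in the next section, does the same thing far more efficiently.

```python
import numpy as np

def knn_predict(X_train_arr, y_train_arr, x_new, k):
    """Predict the class of x_new via a majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from x_new to every training point.
    dists = np.sqrt(((X_train_arr - x_new) ** 2).sum(axis=1))
    # Step 2: labels of the k closest training points.
    nearest_labels = y_train_arr[np.argsort(dists)[:k]]
    # Step 3: the most common label among those k neighbors.
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical new patient with Glucose 125 and BMI 40, using k = 6.
knn_predict(X_train.to_numpy(), y_train.to_numpy(), np.array([125, 40]), k=6)
```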
KNeighborsClassifier in sklearn¶
In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
- Let's fit a KNeighborsClassifier by using cross-validation to choose a value of $k$ from 1 through 50.
Note that KNeighborsClassifiers have several other hyperparameters. One of them is the metric used to measure distances; the default is the standard Euclidean ($L_2$) distance, i.e. $\text{dist}(\vec u, \vec v) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + ... + (u_d - v_d)^2}$.
In [9]:
model_knn = GridSearchCV(
KNeighborsClassifier(),
param_grid = {'n_neighbors': range(1, 51)}
)
model_knn.fit(X_train, y_train)
Out[9]:
GridSearchCV(estimator=KNeighborsClassifier(), param_grid={'n_neighbors': range(1, 51)})
In [10]:
model_knn.best_params_
Out[10]:
{'n_neighbors': 28}
- Cross-validation chose $k = 28$. With the resulting model, we can make predictions using the predict method, just like with regressors.
Note that all of the work in making the prediction – finding the 28 nearest neighbors, for instance – is done when we call predict; "training" does very little. (We sketch this computation just after the prediction below.)
In [11]:
# To know what reasonable values for 'Glucose' and 'BMI' might be, let's look at the plot again.
fig
In [12]:
model_knn.predict(pd.DataFrame([{
'Glucose': 125,
'BMI': 40
}]))
Out[12]:
array([0])
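As a sanity check (not part of the original notebook), we can use the fitted estimator's kneighbors method to pull out the 28 training points closest to this new person and confirm that the prediction above is just a majority vote among their labels:

```python
new_person = pd.DataFrame([{'Glucose': 125, 'BMI': 40}])
# kneighbors returns the distances to, and positional indices of, the nearest
# training points; model_knn.best_estimator_ is the refit KNeighborsClassifier.
distances, indices = model_knn.best_estimator_.kneighbors(new_person)
# Label counts among the 28 nearest neighbors; the most common label here
# should match the prediction returned by predict above.
y_train.iloc[indices[0]].value_counts()
```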
- What does the resulting model look like 👀? Can we visualize it?
Decision boundaries¶
- The decision boundaries of a classifier separate the regions of the feature space that correspond to different predicted classes.
- The decision boundaries for model_knn are visualized below.
If a new person's feature vector lies in the blue region, we'd predict that they do have diabetes; otherwise, we'd predict that they don't.
In [13]:
util.visualize_k(28, X_train, y_train)
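The plot above comes from a course helper (util.visualize_k), but a similar picture can be produced by hand: predict the class at every point of a fine grid over the feature space, then shade the grid by the predicted class. A minimal sketch of that idea (our own, with an arbitrary 200 × 200 grid resolution) might look like this:

```python
import numpy as np
import plotly.graph_objects as go

# Build a grid covering the observed range of 'Glucose' and 'BMI'.
glucose_grid = np.linspace(X_train['Glucose'].min(), X_train['Glucose'].max(), 200)
bmi_grid = np.linspace(X_train['BMI'].min(), X_train['BMI'].max(), 200)
GG, BB = np.meshgrid(glucose_grid, bmi_grid)

# Predict the class at every grid point, then reshape back into a 2D grid.
grid_points = pd.DataFrame({'Glucose': GG.ravel(), 'BMI': BB.ravel()})
preds = model_knn.predict(grid_points).reshape(GG.shape)

# Shade each grid cell by its predicted class; the color change is the decision boundary.
go.Figure(go.Heatmap(x=glucose_grid, y=bmi_grid, z=preds, showscale=False))
```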
What would the decision boundaries look like if $k$ increased or decreased?
Play with the slider below to find out!
In [14]:
util.show_slider()