In [1]:
from lec_utils import *
import lec21_util as util

Lecture 21¶

Introduction to Classification¶

EECS 398: Practical Data Science, Winter 2025¶

practicaldsc.org • github.com/practicaldsc/wn25 • 📣 See latest announcements here on Ed

Agenda 📆¶

  • Classification overview.
  • Survey of classification methods.
    • $k$-nearest neighbors 🏡🏠.
    • Decision trees 🎄.
  • Evaluating classifiers.
  • Multiclass classification 🐧.

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

Classification overview¶


The taxonomy of machine learning¶

  • So far, we've focused on building regression models.
  • Regression is a form of supervised learning, in which the target variable (i.e., the $y$-values we're trying to predict) is numerical.
    For example, a predicted commute time could technically be any real number.
[Diagram: the taxonomy of machine learning.]
  • Next, we'll focus on classification, a form of supervised learning in which the target variable is categorical.

Example classification problems¶

  • Does this person have diabetes?
    This is an example of binary classification – there are only two possible classes, or categories. In binary classification, the two classes are typically 1 (yes) and 0 (no).
  • Is this digit a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9?
    This is an example of multiclass classification, where there are more than two possible classes.
  • Is this picture of a dog, cat, zebra, or hamster?

The plan¶

  • When we introduced regression, we started by understanding the theoretical foundations on paper, and then learned how to build models in sklearn.
  • This time, we'll do the reverse: we'll start by learning how to use classifiers in sklearn, and then over the next few lectures, we'll dive deeper into the internals of a few.
    • Today: $k$-nearest neighbors and decision trees.
    • Lectures 22-23: Logistic regression (and, potentially, Naïve Bayes).

Loading the data 🏥¶

  • Our first classification example will involve predicting whether or not a patient has diabetes, given other information about their health.
In [2]:
diabetes = pd.read_csv('data/diabetes.csv')
display_df(diabetes, cols=9)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.63 50 1
1 1 85 66 29 0 26.6 0.35 31 0
2 8 183 64 0 0 23.3 0.67 32 1
... ... ... ... ... ... ... ... ... ...
765 5 121 72 23 112 26.2 0.24 30 0
766 1 126 60 0 0 30.1 0.35 47 1
767 1 93 70 31 0 30.4 0.32 23 0

768 rows × 9 columns

In [3]:
# 0 means no diabetes, 1 means yes diabetes.
diabetes['Outcome'].value_counts()
Out[3]:
Outcome
0    500
1    268
Name: count, dtype: int64
  • 'Glucose' is measured in mg/dL (milligrams per deciliter); 'BMI' is calculated as $\text{BMI} = \frac{\text{weight (kg)}}{\left[ \text{height (m)} \right]^2}$.
    Let's start by using these two features to predict whether or not a patient has diabetes ('Outcome').
  • But first, a train-test split:
In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = (
    train_test_split(diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=1)
)

X_train
Out[4]:
Glucose BMI
118 97 28.2
205 111 23.9
506 180 36.5
... ... ...
72 126 43.4
235 171 43.6
37 102 32.9

576 rows × 2 columns

Visualizing the data¶

  • Let's visualize the relationship between X_train and y_train. There are three numeric variables at play here – 'Glucose', 'BMI', and 'Outcome' – so we can use a 3D scatter plot.
In [5]:
px.scatter_3d(X_train.assign(Outcome=y_train), 
              x='Glucose', y='BMI', z='Outcome', 
              title='Relationship between Glucose, BMI, and Diabetes',
              width=800, height=600)
  • Since there are only two possible 'Outcome's, we can draw a 2D scatter plot of 'BMI' vs. 'Glucose' and color each point by 'Outcome'. Below, class 0 (orange) is "no diabetes" and class 1 (blue) is "diabetes".
In [6]:
fig = util.create_base_scatter(X_train, y_train)
fig
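  • For reference, a roughly equivalent plot can be made directly with plotly, without the course helper. A sketch (the exact colors and styling of util.create_base_scatter will differ):
In [ ]:
# A sketch of a similar plot without the course helper; styling will differ.
# Converting 'Outcome' to a string makes plotly treat it as a discrete color.
px.scatter(X_train.assign(Outcome=y_train.astype(str)),
           x='Glucose', y='BMI', color='Outcome',
           title='BMI vs. Glucose, colored by Outcome',
           width=800, height=600)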
  • Using this dataset, how can we classify whether someone new (not already in the dataset) has diabetes, given their 'Glucose' and 'BMI'?
  • Intuition: If a new person's feature vector is close to the blue points, we'll predict blue (diabetes); if they're close to the orange points, we'll predict orange (no diabetes).

Classifier 1: $k$-nearest neighbors 🏡🏠¶


$k$-nearest neighbors 🏡🏠¶

  • Suppose we're given a new individual, $\vec{x}_\text{new} = \begin{bmatrix} \text{Glucose}_\text{new} \\ \text{BMI}_\text{new} \end{bmatrix}$.
  • The $k$-nearest neighbors classifier ($k$-NN for short) classifies $\vec{x}_\text{new}$ by:
    1. Finding the $k$ closest points in the training set to $\vec{x}_\text{new}$.
    2. Predicting that $\vec{x}_\text{new}$ belongs to the most common class among those $k$ closest points.
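  • Here's a minimal from-scratch sketch of that two-step procedure, for intuition only; we'll use sklearn's implementation below. (The function name, the choice of Euclidean distance, and the example patient are ours.)
In [ ]:
import numpy as np

def knn_predict(X_train, y_train, x_new, k=6):
    # Step 1: distances from x_new to every training point (Euclidean distance).
    dists = np.sqrt(((np.asarray(X_train, dtype=float) - np.asarray(x_new, dtype=float)) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]  # Indices of the k closest training points.
    # Step 2: most common class among those k points.
    classes, counts = np.unique(np.asarray(y_train)[nearest], return_counts=True)
    return classes[np.argmax(counts)]

# For example, a hypothetical new patient with Glucose 125 and BMI 40:
knn_predict(X_train, y_train, [125, 40], k=6)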
In [7]:
fig
  • Example: Suppose $k = 6$. If, among the 6 closest points to $\vec{x}_\text{new}$, there are 4 blue and 2 orange points, we'd predict blue (diabetes).


What if there are ties? Read here.

  • $k$ is a hyperparameter that should be chosen through cross-validation.
    As we've seen in Homework 8 and 9, in the context of $k$-NN regression, smaller values of $k$ tend to overfit significantly.

KNeighborsClassifier in sklearn¶

In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
  • Let's fit a KNeighborsClassifier by using cross-validation to choose a value of $k$ from 1 through 50.
    Note that KNeighborsClassifiers have several other hyperparameters. One of them is the metric used to measure distances; the default is the standard Euclidean ($L_2$) distance, e.g. $\text{dist}(\vec u, \vec v) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + ... + (u_d - v_d)^2}$.
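  • For instance, here's how a different distance metric could be specified (just a sketch; we'll stick with the default Euclidean distance, and the value of n_neighbors here is arbitrary):
In [ ]:
# A k-NN classifier that uses Manhattan (L1) distance instead of Euclidean distance.
knn_manhattan = KNeighborsClassifier(n_neighbors=10, metric='manhattan')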
In [9]:
model_knn = GridSearchCV(
    KNeighborsClassifier(),
    param_grid = {'n_neighbors': range(1, 51)}
)
model_knn.fit(X_train, y_train)
Out[9]:
GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': range(1, 51)})
In [10]:
model_knn.best_params_
Out[10]:
{'n_neighbors': 28}
  • Cross-validation chose $k = 28$. With the resulting model, we can make predictions using the predict method, just like with regressors.
    Note that all of the work in making the prediction – finding the 28 nearest neighbors, for instance – is done when we call predict. "Training" does very little.
In [11]:
# To know what reasonable values for 'Glucose' and 'BMI' might be, let's look at the plot again.
fig
In [12]:
model_knn.predict(pd.DataFrame([{
    'Glucose': 125,
    'BMI': 40
}]))
Out[12]:
array([0])
  • What does the resulting model look like 👀? Can we visualize it?

Decision boundaries¶

  • The decision boundaries of a classifier visualize the regions in the feature space that separate different predicted classes.
  • The decision boundaries for model_knn are visualized below.
    If a new person's feature vector lies in the blue region, we'd predict they do have diabetes, otherwise, we'd predict they don't.
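  • One way such a plot can be built is by predicting the class at every point on a fine grid of feature values and coloring by the prediction. A sketch (util.visualize_k presumably does something similar; the grid ranges below are assumptions based on the scatter plot):
In [ ]:
import numpy as np

# Predict the class at every point on a fine grid of (Glucose, BMI) values.
glucose_grid, bmi_grid = np.meshgrid(np.linspace(40, 200, 200), np.linspace(15, 60, 200))
grid = pd.DataFrame({'Glucose': glucose_grid.ravel(), 'BMI': bmi_grid.ravel()})
grid_predictions = model_knn.predict(grid).reshape(glucose_grid.shape)
# Coloring the grid by grid_predictions (e.g. with a contour plot) shows the decision regions.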
In [13]:
util.visualize_k(28, X_train, y_train)

What would the decision boundaries look like if $k$ increased or decreased?
Play with the slider below to find out!

In [14]:
util.show_slider()

What if $k = n$, the number of points in the training set?

In [15]:
util.visualize_k(576, X_train, y_train)

Quantifying the performance of a classifier¶

  • For regression models, our default evaluation metric was mean squared error.
    Error is bad, so lower values indicate better model performance.
  • The most common evaluation metric in classification is accuracy:

    $$\text{accuracy} = \frac{\text{# points classified correctly}}{\text{# points}}$$

    Accuracy ranges from 0 to 1, i.e. 0% to 100%. Higher values indicate better model performance.
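  • For reference, sklearn also exposes this metric directly through sklearn.metrics.accuracy_score, which is equivalent to the manual computation in the next cell (a sketch):
In [ ]:
from sklearn.metrics import accuracy_score

# Equivalent to comparing predictions to y_test by hand.
accuracy_score(y_test, model_knn.predict(X_test))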

In [16]:
# Equivalent to 75%.
(model_knn.predict(X_test) == y_test).mean() 
Out[16]:
0.75
  • This is the default metric that the score method of a classifier computes, too.
In [17]:
model_knn.score(X_test, y_test) 
Out[17]:
0.75
In [18]:
# For future reference.
test_scores = pd.Series()
test_scores['knn with k = 28'] = model_knn.score(X_test, y_test) 
test_scores
Out[18]:
knn with k = 28    0.75
dtype: float64
  • Accuracy is not the only metric we care about, and can sometimes be misleading. More on this soon!

Activity¶

It seems that a $k$-NN classifier that uses $k = 1$ should achieve 100% training accuracy. Why doesn't the model defined below have 100% training accuracy?

In [19]:
fig
In [20]:
model_k1 = KNeighborsClassifier(n_neighbors=1)
model_k1.fit(X_train, y_train)
Out[20]:
KNeighborsClassifier(n_neighbors=1)
In [21]:
# Training accuracy – high, but not 100%.
model_k1.score(X_train, y_train)
Out[21]:
0.9913194444444444
In [22]:
# Accuracy on test set is lower than when k = 28!
model_k1.score(X_test, y_test)
Out[22]:
0.6822916666666666
In [23]:
test_scores['knn with k = 1'] = model_k1.score(X_test, y_test)
test_scores
Out[23]:
knn with k = 28    0.75
knn with k = 1     0.68
dtype: float64

Discussion¶

Why should we generally standardize features before using a $k$-NN classifier?

In [24]:
util.create_scaled_version(X_train, y_train)
In [25]:
fig
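One common approach is to standardize inside a Pipeline, so that the scaler is fit only on the training set. A sketch (reusing $k = 28$ from the earlier grid search purely for illustration):
In [ ]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize 'Glucose' and 'BMI' before running k-NN; the scaler is fit on the
# training set only, which avoids leaking test set information.
model_knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=28))
model_knn_scaled.fit(X_train, y_train)
model_knn_scaled.score(X_test, y_test)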

Parametric vs. non-parametric models¶

  • The $k$-nearest neighbors classifier is an example of a non-parametric machine learning method.
  • Linear regression, on the other hand, is parametric.
  • One intuitive difference:
    • Once we train a linear regression model, we don't need to look at the training set to make predictions – we just use the optimal parameters $w_0^*, w_1^*, ..., w_d^*$ we found.
    • Once we train a $k$-NN model, we still need to look at the training set each time we want to make predictions.
  • Other differences between parametric and non-parametric models:
Parametric:
  • There's a fixed set of parameters (weights/coefficients), $w_0^*, w_1^*, ..., w_d^*$, that we'll use for making predictions, and the number of parameters is independent of the training set size.
  • Parametric methods make assumptions about the shape of the data and/or its underlying probability distribution. For instance, linear models assume a linear relationship between the features $X$ and target $\vec{y}$. (There's a connection between the squared loss function and maximum likelihood estimation, too.)

Non-parametric:
  • No fixed set of parameters; model complexity increases as the training set size increases.
  • Non-parametric methods make no assumptions about the shape of the data.

Classifier 2: Decision trees 🎄¶


Decision trees 🎄¶

  • Suppose we're given a new individual, $\vec{x}_\text{new} = \begin{bmatrix} \text{Glucose}_\text{new} \\ \text{BMI}_\text{new} \end{bmatrix}$.
  • The decision tree classifier classifies $\vec{x}_\text{new}$ by:
    1. Asking a series of yes/no questions about $\text{Glucose}_\text{new}$ and $\text{BMI}_\text{new}$, e.g.:

    Is $\text{Glucose}_\text{new} \leq 129.5$?
    If so, is $\text{BMI}_\text{new} \leq 26.3$?
    If not, is $\text{BMI}_\text{new} \leq 29.95$?
    $\vdots$
    2. Once it runs out of questions to ask, it predicts that $\vec{x}_\text{new}$ belongs to the most common class among training set points that had the same answers as $\vec{x}_\text{new}$. (A small hand-coded sketch of this procedure appears just after this list.)
  • Visually, a fit decision tree may look like:
[Flowchart diagram of a fit decision tree.]
  • Decision trees are also non-parametric!
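  • Here's the small hand-coded sketch promised above. The thresholds come from the example questions; the predictions at the leaves are made up purely for illustration (a real tree learns both the questions and the leaf predictions from the training data).
In [ ]:
# Hypothetical: thresholds taken from the example above, leaf predictions made up.
def classify_by_hand(glucose, bmi):
    if glucose <= 129.5:
        if bmi <= 26.3:
            return 0  # "no diabetes" (hypothetical leaf)
        else:
            return 1  # "diabetes" (hypothetical leaf)
    else:
        if bmi <= 29.95:
            return 0  # hypothetical leaf
        else:
            return 1  # hypothetical leaf

classify_by_hand(125, 40)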

DecisionTreeClassifier in sklearn¶

In [26]:
from sklearn.tree import DecisionTreeClassifier
  • Let's fit a DecisionTreeClassifier.
    One of the main hyperparameters is max_depth, the maximum number of questions to ask before making a prediction. Typically, we'd choose it through cross-validation, but for now we'll hard-code it.
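  • For reference, the cross-validated version would look just like the k-NN grid search from earlier; a sketch (not run here, and the range of depths searched is an arbitrary choice):
In [ ]:
# Choosing max_depth by cross-validation, analogous to the k-NN grid search above.
model_tree_cv = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={'max_depth': range(1, 21)}
)
model_tree_cv.fit(X_train, y_train)
model_tree_cv.best_params_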
In [27]:
model_tree = DecisionTreeClassifier(max_depth=3)
model_tree.fit(X_train, y_train)
Out[27]:
DecisionTreeClassifier(max_depth=3)
  • The decision tree achieves a slightly higher test set accuracy than the cross-validated $k$-NN model.
In [28]:
model_tree.score(X_test, y_test)
Out[28]:
0.7708333333333334
In [29]:
test_scores['decision tree with depth = 3'] = model_tree.score(X_test, y_test)
test_scores
Out[29]:
knn with k = 28                 0.75
knn with k = 1                  0.68
decision tree with depth = 3    0.77
dtype: float64
  • But what does it look like?

Decision boundaries for a decision tree classifier¶

In [30]:
util.show_decision_boundary(model_tree, X_train, y_train, title='Decision Boundary for a Tree of Depth 3')
  • Observe that the decision boundaries – at least when we set max_depth to 3 – look less "jagged" than with the $k$-NN classifier.
    Decision trees partition the feature space into rectangles.

Visualizing decision trees¶

  • Our fit decision tree is like a "flowchart", made up of a series of questions.
    It turns out sklearn provides us with a convenient way of visualizing this flowchart.
  • As before, orange is "no diabetes" and blue is "diabetes".
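  • For reference, a similar figure can be drawn with sklearn.tree.plot_tree directly (a sketch; util.show_diabetes_decision_tree presumably wraps something like this, with extra styling):
In [ ]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Each node shows its splitting question, sample count, and per-class counts.
plt.figure(figsize=(12, 6))
plot_tree(model_tree, feature_names=['Glucose', 'BMI'],
          class_names=['no diabetes', 'diabetes'], filled=True);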
In [31]:
util.show_diabetes_decision_tree(model_tree, X_train);
[Flowchart visualization of the depth-3 decision tree.]
  • To classify a new data point, we start at the top and answer the first question (i.e. "Glucose <= 129.5").
  • If the answer is "Yes", we move to the left branch, otherwise we move to the right branch.
  • We repeat this process until we end up at a leaf node, at which point we predict the most common class in that node.
    Note that each node has a value attribute, which describes the number of training individuals of each class that fell in that node.
In [32]:
y_train[X_train[X_train['Glucose'] <= 129.5].index].value_counts()
Out[32]:
Outcome
0    304
1     78
Name: count, dtype: int64
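  • Like the $k$-NN model, the fit tree can make predictions for new individuals; here's the same hypothetical patient from before (following the flowchart by hand should give the same answer):
In [ ]:
model_tree.predict(pd.DataFrame([{
    'Glucose': 125,
    'BMI': 40
}]))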

Increasing tree depth¶

  • One of the many hyperparameters we can tune is tree depth.
  • What happens to the decision boundary of the resulting classifier if we increase max_depth?
In [33]:
interact(lambda depth: util.visualize_depth(depth, X_train, y_train), depth=(1, 51));
  • What happens to the flowchart representation of the resulting classifier if we increase max_depth?
In [34]:
# By default, there is no pre-specified maximum depth.
# The training algorithm keeps splitting until every leaf is pure (or can't be split further).
model_tree_no_max = DecisionTreeClassifier()
model_tree_no_max.fit(X_train, y_train)
Out[34]:
DecisionTreeClassifier()
In [36]:
util.show_diabetes_decision_tree(model_tree_no_max, X_train);
[Flowchart visualization of the unrestricted decision tree, which is very deep.]
  • The tree is extremely overfit to the training set, and very deep!
In [37]:
# Training accuracy. This number should look familiar!
model_tree_no_max.score(X_train, y_train)
Out[37]:
0.9913194444444444
In [38]:
model_tree_no_max.tree_.max_depth
Out[38]:
18
In [39]:
# Worse test set performance than when we used max_depth = 3!
test_scores['decision tree with no specified max depth'] = model_tree_no_max.score(X_test, y_test)
test_scores
Out[39]:
knn with k = 28                              0.75
knn with k = 1                               0.68
decision tree with depth = 3                 0.77
decision tree with no specified max depth    0.72
dtype: float64

Activity¶


[Activity prompt shown as an image.]

Classifier evaluation¶


Outcomes in binary classification¶

  • When performing binary classification, there are four possible outcomes.
    Note: A "positive prediction" is a prediction of 1, and a "negative prediction" is a prediction of 0.
Outcome of Prediction Definition True Class
True positive (TP) ✅ The predictor correctly predicts the positive class. P
False negative (FN) ❌ The predictor incorrectly predicts the negative class. P
True negative (TN) ✅ The predictor correctly predicts the negative class. N
False positive (FP) ❌ The predictor incorrectly predicts the positive class. N
  • We typically organize the four quantities above into a confusion matrix.
Predicted Negative Predicted Positive
Actually Negative TN ✅ FP ❌
Actually Positive FN ❌ TP ✅
  • Note that in the four acronyms – TP, FN, TN, FP – the first letter is whether the prediction is correct, and the second letter is what the prediction is.
  • Depending on the situation, false negatives may be worse than false positives (or vice versa!).
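  • In sklearn, a confusion matrix can be computed with sklearn.metrics.confusion_matrix. A sketch, using the depth-3 diabetes tree and test set from earlier:
In [ ]:
from sklearn.metrics import confusion_matrix

# For a 0/1 target, the output is laid out with actual classes as rows and
# predicted classes as columns, matching the table above:
# [[TN, FP],
#  [FN, TP]]
confusion_matrix(y_test, model_tree.predict(X_test))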

Example: Accuracy of COVID tests¶

  • The results of 100 Michigan Medicine COVID tests are given below.
Predicted Negative Predicted Positive
Actually Negative TN = 90 ✅ FP = 1 ❌
Actually Positive FN = 8 ❌ TP = 1 ✅
Michigan Medicine test results
  • 🤔 Question: What is the accuracy of the test?
$$ \text{accuracy} = \frac{\text{# points classified correctly}}{\text{# points}} $$
  • 🙋 Answer: $$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{1 + 90}{100} = 0.91$$
  • Followup: At first, the test seems good. But, suppose we build a classifier that predicts that nobody has COVID. What would its accuracy be?
  • Answer to followup: Also 0.91! There is severe class imbalance in the dataset, meaning that most of the data points are in the same class (no COVID). Accuracy doesn't tell the full story!

Recall¶

Predicted Negative Predicted Positive
Actually Negative TN = 90 ✅ FP = 1 ❌
Actually Positive FN = 8 ❌ TP = 1 ✅
Michigan Medicine test results
  • 🤔 Question: What proportion of individuals who actually have COVID did the test identify?
  • 🙋 Answer: $\frac{1}{1 + 8} = \frac{1}{9} \approx 0.11$.
  • More generally, the recall of a binary classifier is the proportion of actually positive instances that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.
$$\text{recall} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN}$$
  • To compute recall, look at the bottom (positive) row of the above confusion matrix.
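  • To double-check these numbers with sklearn, we can reconstruct the 100 test results from the table above and use accuracy_score and recall_score (a sketch):
In [ ]:
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 90 TN, 1 FP, 8 FN, 1 TP, reconstructed from the Michigan Medicine table.
y_true = np.array([0] * 90 + [0] * 1 + [1] * 8 + [1] * 1)
y_pred = np.array([0] * 90 + [1] * 1 + [0] * 8 + [1] * 1)

accuracy_score(y_true, y_pred), recall_score(y_true, y_pred)  # (0.91, 0.111...)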

Recall isn't everything, either!¶

$$\text{recall} = \frac{TP}{TP + FN}$$
  • 🤔 Question: Can you design a "COVID test" with perfect recall?
  • 🙋 Answer: Yes – just predict that everyone has COVID!
Predicted Negative Predicted Positive
Actually Negative TN = 0 ✅ FP = 91 ❌
Actually Positive FN = 0 ❌ TP = 9 ✅
everyone-has-COVID classifier
$$\text{recall} = \frac{TP}{TP + FN} = \frac{9}{9 + 0} = 1$$
  • Like accuracy, recall on its own is not a perfect metric. Even though the classifier we just created has perfect recall, it has 91 false positives!

Precision¶

Predicted Negative Predicted Positive
Actually Negative TN = 0 ✅ FP = 91 ❌
Actually Positive FN = 0 ❌ TP = 9 ✅
everyone-has-COVID classifier
  • The precision of a binary classifier is the proportion of predicted positive instances that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.
$$\text{precision} = \frac{TP}{\text{# predicted positive}} = \frac{TP}{TP + FP}$$
  • To compute precision, look at the right (positive) column of the above confusion matrix.
    Tip: A good way to remember the difference between precision and recall is that in the denominator for 🅿️recision, both terms have 🅿️ in them (TP and FP).
  • Note that the "everyone-has-COVID" classifier has perfect recall, but a precision of $\frac{9}{9 + 91} = 0.09$, which is quite low.
  • 🚨 Key idea: There is a "tradeoff" between precision and recall. Ideally, you want both to be high. For a particular prediction task, one may be more important than the other.

Precision and recall¶

[Diagram illustrating precision and recall.]
(source)

Discussion¶

$$\text{precision} = \frac{TP}{TP + FP} \: \: \: \: \: \: \: \: \text{recall} = \frac{TP}{TP + FN}$$
  • When might high precision be more important than high recall?
  • When might high recall be more important than high precision?

Activity

Consider the confusion matrix shown below.

Predicted Negative Predicted Positive
Actually Negative TN = 22 ✅ FP = 2 ❌
Actually Positive FN = 23 ❌ TP = 18 ✅

What is the accuracy of the above classifier? The precision? The recall?


After calculating all three on your own, click below to see the answers.

👉 Accuracy (22 + 18) / (22 + 2 + 23 + 18) = 40 / 65
👉 Precision 18 / (18 + 2) = 9 / 10
👉 Recall 18 / (18 + 23) = 18 / 41

Reference Slide¶

Combining precision and recall¶

  • If we care equally about a model's precision $PR$ and recall $RE$, we can combine the two using a single metric called the F1-score:
$$\text{F1-score} = \text{harmonic mean}(PR, RE) = 2\frac{PR \cdot RE}{PR + RE}$$
  • Both F1-score and accuracy are overall measures of a binary classifier's performance. But remember, accuracy is misleading in the presence of class imbalance, and doesn't take into account the kinds of errors the classifier makes.
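  • In sklearn, this is sklearn.metrics.f1_score. A sketch, applied to the "everyone-has-COVID" classifier from before (precision 0.09, recall 1):
In [ ]:
import numpy as np
from sklearn.metrics import f1_score

# 91 actually negative, 9 actually positive, and every prediction is positive.
# F1 = 2(0.09)(1) / (0.09 + 1) ≈ 0.17.
y_true = np.array([0] * 91 + [1] * 9)
y_pred = np.ones(100, dtype=int)
f1_score(y_true, y_pred)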

Reference Slide¶

Other evaluation metrics for binary classifiers¶

  • We just scratched the surface! This excellent table from Wikipedia summarizes the many other metrics that exist.
[Table from Wikipedia summarizing classification evaluation metrics.]
  • If you're interested in exploring further, a good next metric to look at is true negative rate (i.e. specificity), which is the analogue of recall for true negatives.

Multiclass classification 🐧¶

[Palmer Penguins artwork by @allison_horst.]

To illustrate multiclass classification, we'll revisit the Palmer Penguins dataset we saw earlier in the semester.

From binary to multiclass classification¶

  • In binary classification, there are only two possible classes, typically either 0 or 1.
$$y_i \in \{0, 1\}$$
  • In multiclass classification, there can be any finite number of classes, or labels. They need not be numbers, either.
$$y_i \in \{ \text{Adelie}, \text{Chinstrap}, \text{Gentoo} \}$$

Loading the data 🐧¶

In [40]:
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna().reset_index(drop=True)
X_train, X_test, y_train, y_test = train_test_split(penguins[['bill_length_mm', 'body_mass_g', 'bill_depth_mm']], 
                                                    penguins['species'], 
                                                    random_state=26)

display(X_train, y_train)
bill_length_mm body_mass_g bill_depth_mm
93 43.2 4100.0 18.5
103 43.2 4775.0 19.0
274 46.2 5300.0 14.9
... ... ... ...
262 45.2 5300.0 15.8
318 53.4 5500.0 15.8
309 46.9 4875.0 14.6

249 rows × 3 columns

93     Adelie
103    Adelie
274    Gentoo
        ...  
262    Gentoo
318    Gentoo
309    Gentoo
Name: species, Length: 249, dtype: object
  • Here, each row corresponds to a single penguin.
  • There are three 'species' of penguin: Adelie, Chinstrap, and Gentoo.
In [41]:
y_train.value_counts(normalize=True)
Out[41]:
species
Adelie       0.45
Gentoo       0.34
Chinstrap    0.21
Name: proportion, dtype: float64
  • Question: Suppose our goal is to predict the 'species' of a penguin, given other information.
    What accuracy would the best "constant" classifier achieve on this data?
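  • One way to check your answer in code is with sklearn's DummyClassifier, which ignores the features entirely (a sketch):
In [ ]:
from sklearn.dummy import DummyClassifier

# Always predicts the most common class in y_train, ignoring the features.
# Its training accuracy equals the proportion of the most common 'species'.
constant_clf = DummyClassifier(strategy='most_frequent')
constant_clf.fit(X_train, y_train)
constant_clf.score(X_train, y_train)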

Visualizing the data¶

  • Visually, it seems that the three 'species' of penguins are well separated based on their physical characteristics ('bill_depth_mm', 'bill_length_mm', and 'body_mass_g').
In [42]:
util.penguin_scatter_3d(X_train, y_train)
  • For simplicity, we'll work with just two features: 'bill_length_mm' and 'body_mass_g'.
In [43]:
util.penguin_scatter_2d(X_train, y_train)

Classifier 1: $k$-nearest neighbors 🏡🏠¶

  • Let's use the default of $k = 5$.
    Of course, in practice, we should cross-validate.
In [44]:
model_knn = KNeighborsClassifier(n_neighbors=5)
model_knn.fit(X_train.iloc[:, :-1], y_train)
Out[44]:
KNeighborsClassifier()
  • There are now three colors in the decision boundaries.
In [45]:
util.penguin_decision_boundary(model_knn, X_train, y_train, title="k-NN Decision Boundary when k = 5")

Classifier 2: Decision trees 🎄¶

  • Let's fix max_depth=2 so that we can visualize the resulting tree.
    Again, in practice, we should cross-validate.
In [46]:
model_tree = DecisionTreeClassifier(max_depth=2)
model_tree.fit(X_train.iloc[:, :-1], y_train)
Out[46]:
DecisionTreeClassifier(max_depth=2)
  • Note that colors below don't directly match the colors in the scatter plot earlier.
In [47]:
util.penguin_decision_boundary(model_tree, X_train.iloc[:, :-1], y_train, title="Decision Boundary for a Decision Tree of Depth 2")
In [50]:
util.show_penguin_decision_tree(model_tree, X_train);
[Flowchart visualization of the penguin decision tree.]