# Run this cell to get everything set up.
from lec_utils import *
import lec24_util as util
diabetes = pd.read_csv('data/diabetes.csv')
from sklearn.model_selection import train_test_split
diabetes = diabetes[(diabetes['Glucose'] > 0) & (diabetes['BMI'] > 0)]
X_train, X_test, y_train, y_test = (
train_test_split(diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=11)
)
from ipywidgets import interact
import warnings
warnings.simplefilter('ignore')
Announcements 📣¶
The Portfolio Homework's checkpoint is due on Monday, November 25th – no slip days allowed!
The full homework is due on Saturday, December 7th (no slip days!).Homework 10 is (finally!) out, and is due on Monday, December 2nd.
Plan to finish it earlier, since we won't be able to offer much help over Thanksgiving.
There will still be a Homework 11, but it'll be max 3 questions.Consider entering the Big Ten Data Viz Championship. Submissions are due on January 15th. Read more here.
Help Michigan defend its title!Enrollment begins today. Some suggested courses for next semester can be found in #306 on Ed.
And please help spread the word about 398!
Agenda¶
- Recap: Classification techniques and classifier evaluation.
- Predicting probabilities.
- Cross-entropy loss.
- From probabilities to decisions.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
Recap: Classification techniques and classifier evaluation¶
Classification¶
- A regression problem is one in which we're given a feature vector $\vec{x}$ and need to predict a real-valued target variable, $y$.
Example: Given today's temperature, precipitation, and wind chill, what will tomorrow's high temperature be?
- A classification problem is one in which we're given a feature vector $\vec{x}$ and need to predict a categorical target variable, $y$.
Example: Given today's temperature, precipitation, and wind chill, will it snow tomorrow?
- In binary classification, there are only two possible values of the target variable (typically 1 and 0); in multi-class classification, there can be more than two possible values of the target variable.
- Last class, we learned about two classification techniques:
- $k$-Nearest Neighbors 🏡🏠.
- Decision trees 🎄.
Accuracy of COVID tests¶
- The results of 100 Michigan Medicine COVID tests are given below.
Predicted Negative | Predicted Positive | |
---|---|---|
Actually Negative | TN = 90 ✅ | FP = 1 ❌ |
Actually Positive | FN = 8 ❌ | TP = 1 ✅ |
- 🤔 Question: What is the accuracy of the test?
- 🙋 Answer: $$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{1 + 90}{100} = 0.91$$
- Followup: At first, the test seems good. But, suppose we build a classifier that predicts that nobody has COVID. What would its accuracy be?
- Answer to followup: Also 0.91! There is severe class imbalance in the dataset, meaning that most of the data points are in the same class (no COVID). Accuracy doesn't tell the full story.
Recall¶
Predicted Negative | Predicted Positive | |
---|---|---|
Actually Negative | TN = 90 ✅ | FP = 1 ❌ |
Actually Positive | FN = 8 ❌ | TP = 1 ✅ |
- 🤔 Question: What proportion of individuals who actually have COVID did the test identify?
- 🙋 Answer: $\frac{1}{1 + 8} = \frac{1}{9} \approx 0.11$.
- More generally, the recall of a binary classifier is the proportion of actually positive instances that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.
- To compute recall, look at the bottom (positive) row of the above confusion matrix.
Recall isn't everything, either!¶
$$\text{recall} = \frac{TP}{TP + FN}$$- 🤔 Question: Can you design a "COVID test" with perfect recall?
- 🙋 Answer: Yes – just predict that everyone has COVID!
Predicted Negative | Predicted Positive | |
---|---|---|
Actually Negative | TN = 0 ✅ | FP = 91 ❌ |
Actually Positive | FN = 0 ❌ | TP = 9 ✅ |
- Like accuracy, recall on its own is not a perfect metric. Even though the classifier we just created has perfect recall, it has 91 false positives!
Precision¶
Predicted Negative | Predicted Positive | |
---|---|---|
Actually Negative | TN = 0 ✅ | FP = 91 ❌ |
Actually Positive | FN = 0 ❌ | TP = 9 ✅ |
- The precision of a binary classifier is the proportion of predicted positive instances that are correctly classified. We'd like this number to be as close to 1 (100%) as possible.
- To compute precision, look at the right (positive) column of the above confusion matrix.
Tip: A good way to remember the difference between precision and recall is that in the denominator for 🅿️recision, both terms have 🅿️ in them (TP and FP).
- Note that the "everyone-has-COVID" classifier has perfect recall, but a precision of $\frac{9}{9 + 91} = 0.09$, which is quite low.
- 🚨 Key idea: There is a "tradeoff" between precision and recall. Ideally, you want both to be high. For a particular prediction task, one may be important than the other.
- Later today, we'll see how to weigh this tradeoff in the context of selecting a threshold for classification in logistic regression.
Discussion¶
$$\text{precision} = \frac{TP}{TP + FP} \: \: \: \: \: \: \: \: \text{recall} = \frac{TP}{TP + FN}$$🤔 When might high precision be more important than high recall?
🤔 When might high recall be more important than high precision?
Activity
Consider the confusion matrix shown below.
Predicted Negative | Predicted Positive | |
---|---|---|
Actually Negative | TN = 22 ✅ | FP = 2 ❌ |
Actually Positive | FN = 23 ❌ | TP = 18 ✅ |
What is the accuracy of the above classifier? The precision? The recall?
After calculating all three on your own, click below to see the answers.
👉 Accuracy
(22 + 18) / (22 + 2 + 23 + 18) = 40 / 65👉 Precision
18 / (18 + 2) = 9 / 10👉 Recall
18 / (18 + 23) = 18 / 41Activity
After fitting a BillyClassifier
, we use it to make predictions on an unseen test set. Our results are summarized in the following confusion matrix.
Predicted Negative | Predicted Positive | |
---|---|---|
Actually Negative | ??? | 30 |
Actually Positive | 66 | 105 |
Part 1: What is the recall of our classifier? Give your answer as a fraction (it does not need to be simplified).
Part 2: The accuracy of our classifier is $\frac{69}{117}$. How many true negatives did our classifier have? Give your answer as an integer.
Part 3: True or False: In order for a binary classifier's precision and recall to be equal, the number of mistakes it makes must be an even number.
Part 4: Suppose we are building a classifier that listens to an audio source (say, from your phone’s microphone) and predicts whether or not it is Soulja Boy’s 2008 classic “Kiss Me thru the Phone." Our classifier is pretty good at detecting when the input stream is ”Kiss Me thru the Phone", but it often incorrectly predicts that similar sounding songs are also “Kiss Me thru the Phone."
Complete the sentence: Our classifier has...
- low precision and low recall.
- low precision and high recall.
- high precision and low recall.
- high precision and high recall.
Combining precision and recall¶
- If we care equally about a model's precision $PR$ and recall $RE$, we can combine the two using a single metric called the F1-score:
- Both F1-score and accuracy are overall measures of a binary classifier's performance. But remember, accuracy is misleading in the presence of class imbalance, and doesn't take into account the kinds of errors the classifier makes.
Other evaluation metrics for binary classifiers¶
- We just scratched the surface! This excellent table from Wikipedia summarizes the many other metrics that exist.
- If you're interested in exploring further, a good next metric to look at is true negative rate (i.e. specificity), which is the analogue of recall for true negatives.
Predicting probabilities¶
The New York Times maintained needles
that displayed the probabilities of various outcomes in the election.
Motivation: Predicting probabilities¶
- Often, we're interested in predicting the probability of an event occurring, given some other information.
what's the probability that Michigan wins?
In the context of weather apps, this is a nuanced question; here's a meme about it.
- If we're able to predict the probability of an event, we can classify the event by using a threshold.
For example, if we predict there's a 70% chance of Michigan winning, we could predict that Michigan will win. Here, we implicitly used a threshold of 50%.
- The two classification techniques we've seen so far – $k$-Nearest Neighbors and decision trees – don't directly use probabilities in their decision-making process.
But sometimes it's helpful to model uncertainty and to be able to state a level of confidence along with a prediction!
Recap: Predicting diabetes¶
- Let's try to predict whether or not a patient has diabetes (
'Outcome'
) given just their'Glucose'
level.
Last class, we used both'Glucose'
and'BMI'
; we'll start with just one feature for now.
- As before, class 0 (orange) is "no diabetes" and class 1 (blue) is "diabetes".
util.show_one_feature_plot(X_train, y_train)
- It seems that as a patient's
'Glucose'
value increases, the chances they have diabetes also increases.
- Can we model this probability directly, as a function of
'Glucose'
?
In other words, can we find some $f$ such that:
An attempt to predict probabilities¶
- Let's try and fit a simple linear model to the data from the previous slide.
util.show_one_feature_plot_with_linear_model(X_train, y_train)
- The simple linear model above predicts values greater than 1 and less than 0! This means we can't interpret the outputs as probabilities.
- We could, technically, clip the outputs of the linear model:
util.show_one_feature_plot_with_linear_model_clipped(X_train, y_train)
Bins and proportions¶
- Another approach we could try is to:
- Place
'Glucose'
values into bins, e.g. 50 to 55, 55 to 60, 60 to 65, etc. - Within each bin, compute the proportion of patients in the training set who had diabetes.
- Place
# Take a look at the source code in lec24_util.py to see how we did this!
# We've hidden a lot of the plotting code in the notebook to make it cleaner.
util.make_prop_plot(X_train, y_train)
- For example, the point near a
'Glucose'
value of 100 has a $y$-axis value of ~0.25. This means that about 25% of patients with a'Glucose'
value near 100 had diabetes in the training set.
- So, if a new person comes along with a
'Glucose'
value near 100, we'd predict there's a 25% chance they have diabetes (so they likely do not)!
- Notice that the points form an S-shaped curve!
Can we incorporate this S-shaped curve in how we predict probabilities?
The logistic function¶
The logistic function resembles an $S$-shape.
$$\sigma(t) = \frac{1}{1 + e^{-t}} = \frac{1}{1 + \text{exp}(-t)}$$
The logistic function is an example of a sigmoid function, which is the general term for an S-shaped function. Sometimes, we use the terms "logistic function" and "sigmoid function" interchangeably.
- Below, we'll look at the shape of $y = \sigma(w_0 + w_1 x)$ for different values of $w_0$ and $w_1$.
- $w_0$ controls the position of the curve on the $x$-axis.
- $w_1$ controls the "steepness" of the curve.
util.show_three_sigmoids()
- Notice that $0 < \sigma(t) < 1$, for all $t$, which means we can interpret the outputs of $\sigma(t)$ as probabilities!
- Below, interact with the sliders to change the values of $w_0$ and $w_1$.
interact(util.plot_sigmoid, w0=(-15, 15), w1=(-3, 3, 0.1));
interactive(children=(IntSlider(value=0, description='w0', max=15, min=-15), FloatSlider(value=0.0, descriptio…
Logistic regression¶
- Logistic regression is a linear classification technique that builds upon linear regression.
- It models the probability of belonging to class 1, given a feature vector:
- Note that the existence of coefficients, $w_0, w_1, ... w_d$, that we need to learn from the data, tells us that logistic regression is a parametric method!
LogisticRegression
in sklearn
¶
from sklearn.linear_model import LogisticRegression
- Let's fit a
LogisticRegression
classifier. Specifically, this means we're askingsklearn
to learn the optimal parameters $w_0^*$ and $w_1^*$ in:
model_logistic = LogisticRegression()
model_logistic.fit(X_train[['Glucose']], y_train)
LogisticRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression()
- We get a test accuracy that's roughly in line with the test accuracies of the two models we saw last class.
model_logistic.score(X_test[['Glucose']], y_test)
0.7287234042553191
- What does our fit model look like?
Visualizing a fit logistic regression model¶
- The values of $w_0^*$ and $w_1^*$
sklearn
found are below.
model_logistic.intercept_[0], model_logistic.coef_[0][0]
(-5.901585544378712, 0.04240496085762793)
- So, our fit model is:
util.show_one_feature_plot_with_logistic(X_train, y_train)
- So, if a patient has a
'Glucose'
level of 150, the model's predicted probability that they have diabetes is:
model_logistic.predict_proba([[150]])
array([[0.39, 0.61]])
sklearn
find $w_0^*$ and $w_1^*$?What loss function did it use?
Cross-entropy loss¶
The modeling recipe¶
- To train a parametric model, we always follow the same three steps.
$k$-Nearest Neighbors and decision trees didn't quite follow the same process.
- Choose a model.
- Choose a loss function.
- Minimize average loss to find optimal model parameters.
As we've now seen, average loss could also be regularized!
Attempting to use squared loss¶
- Our default loss function has always been squared loss, so we could try and use it here.
- Unfortunately, there's no closed form solution for $\vec{w}^*$, so we'll need to use gradient descent.
- Before doing so, let's visualize the loss surface in the case of our "simple" logistic model:
- Specifically, we'll visualize:
util.show_logistic_mse_surface(X_train, y_train)
- What do you notice?
Mean squared error doesn't work well with logistic regression!¶
- The following function is not convex:
- There are two flat "valleys" with gradients near 0, where gradient descent could get trapped.
Additionally, squared loss doesn't penalize bad predictions nearly enough. The largest possible value of:
$$\left( y_i - \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right) \right)^2$$
is 1, since both $y_i$ and $\sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$ are bounded between 0 and 1, and $(1 - 0)^2 = 1$.
- Suppose $y_i = 1$. Then, the graph of the squared loss of the prediction $p_i$ is below.
util.show_squared_loss_individual()
- Predicted $p_i$ values near 0 are really bad, since $y_i = 1$, but the loss for $p_i = 0$ is not very high.
- It seems like we need a loss function that more steeply penalizes incorrect probability predictions – and hopefully, one that is convex for the logistic regression model!
Cross-entropy loss¶
- A common loss function in this setting is log loss, i.e. cross-entropy loss.
The term "entropy" comes from information theory. Watch this short video for more details.
- We can define the cross-entropy loss function piecewise. If $y_i$ is an observed value and $p_i$ is a predicted probability, then:
- Note that in the two cases – $y_i = 1$ and $y_i = 0$ – the cross-entropy loss function resembles squared loss, but is unbounded when the predicted probabilities $p_i$ are far from $y_i$.
util.show_ce_loss_individual_1()
util.show_ce_loss_individual_0()
A non-piecewise definition of cross-entropy loss¶
- We can define the cross-entropy loss function piecewise. If $y_i$ is an observed value and $p_i$ is a predicted probability, then:
- An equivalent formulation of $L_\text{ce}$ that isn't piecewise is:
- This formulation is easier to work with algebraically!
Average cross-entropy loss¶
- Cross-entropy loss for an observed value $y_i$ and predicted probability $p_i = P(y = 1 | \vec{x}_i) = \sigma \left(\vec w \cdot \text{Aug}(\vec x_i) \right)$ is:
- To find $\vec{w}^*$, then, we minimize average cross-entropy loss:
- Cross-entropy loss is the default loss function used to find optimal parameters in logistic regression.
- There's still no closed-form solution for $\vec{w}^* = \underset{\vec{w}}{\text{argmin}} \: R_\text{ce}(\vec{w})$, so we'll need to use gradient descent, or some other numerical method.
But don't worry – we'll leave this tosklearn
!
- Fortunately, average cross-entropy loss is convex, too.
util.show_logistic_ce_surface(X_train, y_train)
- And, it can be regularized!
By default,sklearn
applies regularization when performing logistic regression.
util.show_logistic_ce_surface(X_train, y_train, reg_lambda=0.5)
The modeling recipe, revisited¶
- Choose a model.
- Choose a loss function.
Minimize average loss to find optimal model parameters.
As we've now seen, average loss could also be regularized!\begin{align*}R_\text{ce}(\vec{w}) &= - \frac{1}{n} \sum_{i = 1}^n \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right) \\ &= - \frac{1}{n} \sum_{i = 1}^n \left[ y_i \log \left( \sigma \left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right) \right) + (1 - y_i) \log \left(1 - \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)\right) \right]\end{align*}
The actual minimization here is done using numerical methods, through
sklearn
.
LogisticRegression
in sklearn
, revisited¶
- The
LogisticRegression
class insklearn
has a lot of hidden, default hyperparameters.
LogisticRegression?
- It performs $L_2$ regularization ("ridge logistic regression") by default. The hyperparameter for regularization strength, $C$, is the inverse of $\lambda$; by default, it sets $C = 1$.
- So, for a given value of $C$, it minimizes:
- It also specifies
solver='lbfgs'
, i.e. it doesn't use gradient descent per-se, but another more sophisticated numerical method.
Read more about LBFGS here.
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have?
From probabilities to decisions¶
Predicting probabilities vs. predicting classes¶
- 🤔 Question: Suppose our logistic regression model predicts the probability that someone has diabetes is 0.75. What do we predict – diabetes or no diabetes? What if the predicted probability is 0.3?
- 🙋 Answer: We have to pick a threshold (for example, 0.5)!
- If the predicted probability is above the threshold, we predict diabetes (1).
- Otherwise, we predict no diabetes (0).
Predicting probabilities vs. predicting classes¶
- By default, the
predict
method of a fitLogisticRegression
model predicts a class.
model_logistic.predict(pd.DataFrame([{
'Glucose': 150,
}]))
array([1])
- But, logistic regression is designed to predict probabilities. We can access these predicted probabilities using the
predict_proba
method, as we saw earlier.
model_logistic.predict_proba(pd.DataFrame([{
'Glucose': 150,
}]))
array([[0.39, 0.61]])
- The above is telling us that the model thinks this person has:
- A 39% chance of belonging to class 0 (no diabetes).
- A 61% chance of belonging to class 1 (diabetes).
- By default, it uses a threshold of 0.5, i.e. it predicts the larger probability.
As we'll soon discuss, this may not be what we want!
Unfortunately,sklearn
doesn't let us change the threshold ourselves. If we want a different threshold, we need to manually implement it using the results ofpredict_proba
.
Thresholding probabilities¶
- As we did with other classifiers, we can visualize the decision boundary of a fit logistic regression model.
If we pick a threshold of $T$, any patient with a
'Glucose'
value such that:$$\sigma(w_0^* + w_1^* \cdot \text{Glucose}) \geq T$$
is classified as having diabetes.
- For example, if $T = 0.5$:
util.show_one_feature_plot_with_logistic_and_y_threshold(X_train, y_train, 0.5)
- If we set $T = 0.5$, then patients with
'Glucose'
values above $\approx$ 140 are classified as having diabetes.
util.show_one_feature_plot_with_logistic_and_x_threshold(X_train, y_train, 0.5)
- How do we find the exact $x$-axis position of the decision boundary above?
If we can, then we'd be able to predict whether someone has diabetes just by looking at their'Glucose'
value.
Decision boundaries for logistic regression¶
- In our single feature model that predicts
'Outcome'
given just'Glucose'
, our predicted probabilities are of the form:
- Suppose we fix a threshold, $T$. Then, our decision boundary is of the form:
- If we can invert $\sigma(t)$, then we can re-arrange the above to solve for the
'Glucose'
value at the threshold:
- Important: If $p = \sigma(t)$, then $\sigma^{-1}({p}) = \log \left( \frac{p}{1-p} \right)$ is the inverse of $\sigma(t)$.
$\sigma^{-1}(p)$ is called the logit function.
Aside: Odds¶
- Suppose an event occurs with probability $p$.
- The odds of that event are:
For instance, if there's a $p = \frac{3}{4}$ chance that Michigan wins this week, then the odds that Michigan wins this week are:
$$\text{odds} \left( \frac{3}{4} \right) = \frac{\frac{3}{4}}{\frac{1}{4}} = 3$$
- Interpretation: it's 3 times more likely that Michigan wins than loses.
- We can interpret $\sigma^{-1}(p) = \log \left( \frac{p}{1-p} \right)$ as the "log odds" of $p$!
See the reference slides for more details.
Solving for the decision boundary¶
- Previously, we said that if we pick a threshold $T$, then:
- We re-arranged this for the
'Glucose'
value on the threshold, $\text{Glucose}_T$:
- Using the fact that $\sigma^{-1}(T) = \log \left( \frac{T}{1 - T} \right)$ gives us a closed-form formula for $\text{Glucose}_T$!
- This explains why $\text{Glucose} \geq 139.17$ is the decision boundary below!
w0_star = model_logistic.intercept_[0]
w1_star = model_logistic.coef_[0][0]
T = 0.5
glucose_threshold = (np.log(T / (1 - T)) - w0_star) / w1_star
glucose_threshold
139.1720549912291
util.show_one_feature_plot_with_logistic_and_x_threshold(X_train, y_train, 0.5)
The decision boundary in the feature space¶
- The decision boundary on the previous slide is:
- Let's visualize this in the feature space. We are just using $d = 1$ feature, so let's visualize our decision boundary with a 1D plot, i.e. a number line.
util.show_one_feature_plot_in_1D(X_train, y_train, 0.5)
Logistic regression with multiple features¶
- Now, as we did last class, let's use both
'Glucose'
and'BMI'
to predict diabetes.
util.make_two_feature_scatter(X_train, y_train)
- Specifically, our fit model will look like:
model_logistic_multiple = LogisticRegression()
model_logistic_multiple.fit(X_train, y_train)
LogisticRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression()
- After minimizing mean (regularized!) cross-entropy loss, we find that our fit model is of the form:
model_logistic_multiple.intercept_, model_logistic_multiple.coef_
(array([-8.17]), array([[0.04, 0.08]]))
Visualizing a fit logistic regression model with two features¶
- Recall, the logistic regression model is trained to predict the probability of class 1 (diabetes).
- The graph below shows the predicted probabilities of class 1 (diabetes) for different combinations of features.
util.show_logistic(model_logistic_multiple, X_train, y_train)
The decision boundary in the feature space¶
- What does the resulting decision boundary look like, in a $d = 2$ dimensional plot?
util.show_decision_boundary(model_logistic_multiple, X_train, y_train, title='Decision Boundary when Using Both Glucose and BMI \n and T = 0.5 (the default)')
- Note that unlike the decision boundaries for $k$-Nearest Neighbors and decision trees, this decision boundary is linear.
- Specifically, the decision boundary in the feature space is of the form:
- In the homework, you'll solve for $a$, $b$, and $c$ in a similar example!
It involves retracing the steps we followed in the single-feature case.
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have?
Lingering questions¶
- By default, a fit
LogisticRegression
object'spredict
method uses a threshold of $T = 0.5$ to decide when to predict class 1 vs. class 0. What if we want to use a different threshold?
- The logistic function, $\sigma(t)$, obeys several interesting properties.
- It is symmetric.
- Its derivative is conveniently calculated:
- But, most relevant to us right now, its inverse is:
- Let $p$ represent our predicted probability.
- Using the inverse of the logistic function, we have that:
- On the left, we have a linear function of $\text{Glucose}$.
- On the right, we have the log of the odds of $p$.
We call the "log of the odds" the "log odds".
- Important: The logistic regression model assumes that the log of the odds of $P(y = 1 | \vec{x})$ is linear!
- Suppose that $w_0^* = -6$ and $w_1^* = 0.05$. Then:
- It's hard to interpret the role of the coefficient $0.05$ directly. But, we know that:
- Example: Suppose my
'Glucose'
level increases by 1 unit. Then, the predicted log odds that I have diabetes increases by 0.05.
- But, since:
- And:
- We can say that if my
'Glucose'
level increases by 1 unit, then my predicted odds of diabetes increases by a factor of $e^{0.05}$, or more generally $e^{w_1^*}$.
- You'll need this interpretation in Homework 10, Question 6!