# Run this cell to get everything set up.
from lec_utils import *
import lec23_util as util
diabetes = pd.read_csv('data/diabetes.csv')
from sklearn.model_selection import train_test_split
diabetes = diabetes[(diabetes['Glucose'] > 0) & (diabetes['BMI'] > 0)]
X_train, X_test, y_train, y_test = (
train_test_split(diabetes[['Glucose', 'BMI']], diabetes['Outcome'], random_state=1)
)
Lecture 23¶
Logistic Regression, Continued¶
EECS 398: Practical Data Science, Winter 2025¶
practicaldsc.org • github.com/practicaldsc/wn25 • 📣 See latest announcements here on Ed
Agenda¶
- Recap: Logistic regression.
- Choosing a threshold.
- Linear separability.
- Softmax regression.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
Recap: Logistic regression¶
Logistic regression¶
- Logistic regression is a linear classification technique that builds upon linear regression.
- It models the probability of belonging to class 1, given a feature vector:
$$P(y_i = 1 | \vec{x}_i) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i)\right) = \frac{1}{1 + e^{-\vec{w} \cdot \text{Aug}(\vec{x}_i)}}$$
- Suppose we train a logistic regression model to predict the probability a patient has diabetes ($y = 1$) given their 'Glucose' and 'BMI'.
If our optimal parameters end up being $\vec{w}^* = \begin{bmatrix} -7.85 & 0.04 & 0.08 \end{bmatrix}^T$, we then predict probabilities using:
$$P(y_i = 1 | \vec{x}_i) = \sigma\left(-7.85 + 0.04 \cdot \text{Glucose}_i + 0.08 \cdot \text{BMI}_i\right)$$
- To find the optimal parameters $\vec{w}^*$, we minimize mean cross-entropy loss:
$$-\frac{1}{n} \sum_{i = 1}^n \left[ y_i \log \left( \sigma(\vec{w} \cdot \text{Aug}(\vec{x}_i)) \right) + (1 - y_i) \log \left( 1 - \sigma(\vec{w} \cdot \text{Aug}(\vec{x}_i)) \right) \right]$$
There's no closed-form solution for $\vec{w}^*$, so we use some numerical method (or, rather, sklearn does).
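- To make this concrete, here's a minimal sketch of that numerical minimization, using scipy.optimize.minimize on the training data from the setup cell. (This ignores the regularization that sklearn applies by default, so the resulting parameters will differ slightly from sklearn's.)
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid
X_aug = np.column_stack([np.ones(len(X_train)), X_train])  # Aug(x_i) for every training point
y = y_train.to_numpy()
def mean_cross_entropy(w):
    # sigma(w . Aug(x_i)) for every i, clipped so that log(0) never occurs.
    p = np.clip(expit(X_aug @ w), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
w_star = minimize(mean_cross_entropy, x0=np.zeros(X_aug.shape[1])).x
w_star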
LogisticRegression in sklearn¶
- To illustrate, let's re-fit a model to predict diabetes from 'Glucose' and 'BMI' in sklearn.
from sklearn.linear_model import LogisticRegression
model_logistic_multiple = LogisticRegression()
model_logistic_multiple.fit(X_train, y_train)
LogisticRegression()
- By default, the predict method of a fit LogisticRegression model predicts a class; it applies a threshold of $T = 0.5$ to the predicted probability.
model_logistic_multiple.predict(pd.DataFrame([{
'Glucose': 150,
'BMI': 25,
}]))
array([0])
- We can access the predicted probabilities using the predict_proba method.
model_logistic_multiple.predict_proba(pd.DataFrame([{
'Glucose': 150,
'BMI': 25,
}]))
array([[0.58, 0.42]])
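- As a sanity check, we can reproduce the class-1 probability above directly from the fit model's intercept_ and coef_; this is just the $\sigma(\vec{w}^* \cdot \text{Aug}(\vec{x}_i))$ formula, written out by hand.
# w* . Aug(x_i) for Glucose = 150 and BMI = 25.
z = model_logistic_multiple.intercept_[0] + model_logistic_multiple.coef_[0] @ np.array([150, 25])
1 / (1 + np.e ** (-z))  # Matches the second entry of predict_proba above, ~0.42.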
The decision boundary in the feature space¶
- After choosing $T = 0.5$, what does the resulting decision boundary look like, in a $d = 2$ dimensional plot?
util.show_decision_boundary(model_logistic_multiple, X_train, y_train, title='Logistic Regression Decision Boundary (T = 0.5)')
- Note that unlike the decision boundaries for $k$-Nearest Neighbors and decision trees, this decision boundary is linear. Specifically, it is the line:
$$\sigma\left(w_0^* + w_1^* \cdot \text{Glucose} + w_2^* \cdot \text{BMI}\right) = 0.5$$
- Important: Since $\sigma(0) = 0.5$, we can write the above as:
$$w_0^* + w_1^* \cdot \text{Glucose} + w_2^* \cdot \text{BMI} = 0$$
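- For instance, we can read the slope and intercept of this line (in the Glucose-BMI plane) off of the fit model, by solving the equation above for BMI. A quick sketch:
# The T = 0.5 boundary satisfies w0 + w1 * Glucose + w2 * BMI = 0,
# i.e. BMI = (-w1 / w2) * Glucose + (-w0 / w2).
w0 = model_logistic_multiple.intercept_[0]
w1, w2 = model_logistic_multiple.coef_[0]
slope, intercept = -w1 / w2, -w0 / w2
slope, intercept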
Question 🤔 (Answer at practicaldsc.org/q)
Which expression describes the odds ratio, $$\frac{P(y_i = 1 | \vec{x}_i)}{P(y_i = 0 | \vec{x}_i)}$$
in the logistic regression model?
- A. $\vec{w} \cdot \text{Aug}(\vec{x}_i)$
- B. $-\vec{w} \cdot \text{Aug}(\vec{x}_i)$
- C. $e^{\vec{w} \cdot \text{Aug}(\vec{x}_i)}$
- D. $\sigma(\vec{w} \cdot \text{Aug}(\vec{x}_i))$
- E. None of the above.
Question 🤔 (Answer at practicaldsc.org/q)
Which expression describes $P(y_i = \mathbf{0} | \vec{x}_i)$ in the logistic regression model?
- A. $\sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$
- B. $-\sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$
- C. $\sigma\left(- \vec{w} \cdot \text{Aug}(\vec{x}_i) \right)$
- D. $1 - \log \left( 1 + e^{\vec{w} \cdot \text{Aug}(\vec{x}_i)} \right)$
- E. $1 + \log \left( 1 + e^{- \vec{w} \cdot \text{Aug}(\vec{x}_i)} \right)$
Choosing a threshold¶
Thresholding¶
- As we've seen, in order to classify $\vec{x}_i$ as either yes ($y_i = 1$) or no ($y_i = 0$), we apply a threshold $T$ to the predicted probability.
For example, under a threshold of, say, $T = 0.6$, a predicted probability of 0.75 is classified as diabetes (class 1), and a predicted probability of 0.55 is classified as no diabetes (class 0).
More generally, if we pick a threshold of $T$, then any feature vector $\vec{x}_i$ such that:
$$\sigma(\vec{w}^* \cdot \text{Aug}(\vec{x}_i)) \geq T$$
is classified as class 1.
- Question: How do we choose the "right" threshold?
sklearn's default threshold of $T = 0.5$ is not guaranteed to yield the highest accuracy!
Remember, to find $\vec{w}^*$, we minimized mean cross-entropy loss (that is, we didn't "maximize" accuracy), and mean cross-entropy loss doesn't involve our threshold.
Choosing a custom threshold¶
- If we want to use a custom threshold, we'll need to implement the logic ourselves.
def predict_thresholded(X, T):
'''Calls model_logistic_multiple.predict_proba.
For each P(y_i = 1 | x_i), returns 1 if >= T and 0 if < T.'''
probs = model_logistic_multiple.predict_proba(X)[:, 1]
return (probs >= T).astype(int)
- Now, we can choose any threshold we'd like, and compute the accuracy of the resulting predictions.
predict_thresholded([[150, 25]], 0.5)
array([0])
predict_thresholded([[150, 25]], 0.4)
array([1])
predict_thresholded(X_train, 0.4)
array([0, 0, 1, ..., 0, 0, 0])
# Training accuracy for the threshold T = 0.4.
(predict_thresholded(X_train, 0.4) == y_train).mean()
0.7588652482269503
Accuracy vs. threshold¶
- Accuracy is defined as:
$$\text{accuracy} = \frac{\text{# points classified correctly}}{\text{# points total}}$$
- How does the model's training accuracy change as the threshold changes?
Note that we'd see a similar trend with test accuracy, too.
util.plot_vs_threshold(X_train, y_train, 'Accuracy')
- The threshold with the best training accuracy (among the thresholds we tried) is $T = 0.465$, which has a training accuracy of 77.3%.
- Remember that 64% of people in the training set don't have diabetes, so we can achieve a 64% training accuracy just by always predicting "no diabetes"! This means that a good model's accuracy should be much higher than 64%.
pd.Series(y_train).value_counts(normalize=True)
Outcome
0    0.64
1    0.36
Name: proportion, dtype: float64
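- To see where the $T = 0.465$ figure above comes from, here's a minimal sketch that sweeps a grid of candidate thresholds and keeps the one with the highest training accuracy. (The exact winner depends on the grid of thresholds tried.)
thresholds = np.arange(0, 1.001, 0.005)
accs = np.array([(predict_thresholded(X_train, t) == y_train).mean() for t in thresholds])
best = np.argmax(accs)
thresholds[best], accs[best]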
Metrics for binary classification¶
- A few lectures ago, we introduced other metrics for measuring the quality of a binary classifier's predictions.
- A binary classifier's confusion matrix displays its number of true positives ($TP$), false positives ($FP$), true negatives ($TN$), and false negatives ($FN$).
util.show_confusion(X_train, y_train, T=0.5)
- Remember, we're predicting whether or not patients have diabetes. Which is worse: a false positive or a false negative?
Observe how the values in the confusion matrix change as the threshold changes!
interact(lambda T: util.show_confusion(X_train, y_train, T), T=(0, 1, 0.01));
Precision vs. threshold¶
Precision is defined as:
$$\text{precision} = \frac{TP}{\text{# predicted positive}} = \frac{TP}{TP + FP}$$
Here, a false positive ($FP$) is when we predict that someone has diabetes when they do not.
- How does the model's training precision change as the threshold changes?
util.plot_vs_threshold(X_train, y_train, 'Precision')
- If the "bar" is higher to predict 1, then we will have fewer positives in general, and thus fewer false positives.
- As the threshold increases ⬆️, the denominator in $\text{precision} = \frac{TP}{TP + FP}$ will decrease, and so precision tends to increase ⬆️.
There are some cases where a slightly higher threshold led to a slightly lower precision; why?
Recall vs. threshold¶
Recall is defined as:
$$\text{recall} = \frac{TP}{\text{# actually positive}} = \frac{TP}{TP + FN}$$
Here, a false negative ($FN$) is when we predict that someone does not have diabetes, when they really do.
- How does the model's training recall change as the threshold changes?
util.plot_vs_threshold(X_train, y_train, 'Recall')
- Note that the denominator in $\text{recall} = \frac{TP}{\text{# actually positive}}$ is constant. As the threshold increases ⬆️:
    - true positives get converted to false negatives, so
    - the numerator of recall ($TP$) decreases, and so
    - recall decreases ⬇️.
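- We don't have to read precision and recall off of the plots; sklearn.metrics can compute them for any threshold's predictions. A quick sketch at a few thresholds:
from sklearn.metrics import precision_score, recall_score
for t in [0.3, 0.5, 0.7]:
    preds = predict_thresholded(X_train, t)
    print(f'T = {t}: precision = {precision_score(y_train, preds):.2f}, recall = {recall_score(y_train, preds):.2f}')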
Precision vs. recall¶
- We can visualize how precision and recall vary together.
util.pr_curve(X_train, y_train)
- The curve above is called a PR curve.
- Question: Given the information above, what threshold would you choose?
- Answer: The threshold whose point is closest to the top right corner of the plot above.
Why? The top right corner is where precision = 1 and recall = 1, and we want both to be high.
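- One way to operationalize "closest to the top right corner" is to measure each threshold's distance from the point where precision = 1 and recall = 1. Here's a sketch using sklearn.metrics.precision_recall_curve, which may evaluate a slightly different set of thresholds than util.pr_curve does:
from sklearn.metrics import precision_recall_curve
probs = model_logistic_multiple.predict_proba(X_train)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_train, probs)
# precision and recall have one more entry than thresholds, so drop the last point.
dists = np.sqrt((1 - precision[:-1]) ** 2 + (1 - recall[:-1]) ** 2)
thresholds[np.argmin(dists)]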
ROC curves¶
- A more popular variant of the PR curve is the ROC curve.
ROC stands for "receiver operating characteristic."
See here for a good discussion on the differences between PR curves and ROC curves.
- A ROC curve plots true positive rate (TPR) vs. false positive rate (FPR) for all possible thresholds, where:
$$\text{TPR} = \frac{TP}{TP + FN} \qquad \text{FPR} = \frac{FP}{FP + TN}$$
The ROC curve for our classifier looks like:
util.draw_roc_curve(X_train, y_train)
- If we care about TPR and FPR equally, the best threshold is the one whose point is closest to the top left corner in the plot above.
Why? The top left corner is where $TPR = 1$ and $FPR = 0$, and we want $TPR$ to be high and $FPR$ to be low.
- A common metric for the quality of a binary classifier is the area under the curve (AUC) for the ROC curve.
Larger values are better!
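- Here's a sketch of computing the ROC curve's ingredients, and its AUC, directly with sklearn.metrics:
from sklearn.metrics import roc_curve, roc_auc_score
probs = model_logistic_multiple.predict_proba(X_train)[:, 1]
fpr, tpr, thresholds = roc_curve(y_train, probs)
roc_auc_score(y_train, probs)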
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have about thresholds and logistic regression?
Linear separability¶
Feature space¶
- Suppose we're using $d$ features as inputs to our classifier. Consider a visualization of the features in $d$-dimensional space.
- Example: $d = 1$.
util.show_one_feature_plot_in_1D(X_train, y_train, thres=False)
- Example: $d = 2$.
util.create_base_scatter(X_train, y_train)
- Note that in both plots above, there are orange points mixed in with the blue points!
Linear separability¶
- A dataset is linearly separable if a line, plane, or hyperplane can be drawn in $d$-dimensional space that perfectly separates the two classes.
- Example: $d = 1$.
util.lin_sep_1D()
util.non_lin_sep_1D()
- Example: $d = 2$.
util.lin_sep_2D()
util.non_lin_sep_2D()
- Why is the dataset below not linearly separable?
util.bad_example_1D()
Linear separability and decision boundaries¶
- By definition, if a dataset is linearly separable, then there exists a linear decision boundary that achieves 100% training accuracy.
util.lin_sep_1D()
- Above, any value of $c$ in $(120, 150)$ would make the decision boundary $$\text{Glucose} = c$$
achieve 100% training accuracy.
- Question: How do we find this decision boundary?
Logistic regression and linear separability¶
- Logistic regression, without regularization, fails to converge on linearly separable data!
- Let's re-draw the plot below, but with diabetes status drawn on the $y$-axis.
util.lin_sep_1D()
- Why would the optimal $w_1^*$ below tend to $\infty$?
See the annotated slides for more details.
util.lin_sep_1D_elevated()
- To prevent this case, logistic regression should generally be regularized.
This is exactly why sklearn regularizes logistic regression by default.
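- To see this effect, here's a sketch on a tiny, made-up, linearly separable 1D dataset (not the diabetes data): as we weaken the regularization by increasing C, the fit coefficient keeps growing.
# A hypothetical, linearly separable 1D dataset: all 0s are left of all 1s.
X_toy = np.array([[1.], [2.], [3.], [5.], [6.], [7.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
for C in [1, 100, 10_000]:
    m = LogisticRegression(C=C, max_iter=10_000).fit(X_toy, y_toy)
    print(f'C = {C}: w1* = {m.coef_[0][0]:.2f}')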
Logistic regression for multiclass classification¶
From binary to multiclass classification¶
- In binary classification, there are only two possible classes, typically either 0 or 1.
- In multiclass classification, there can be any finite number of classes, or labels. They need not be numbers, either.
- Important: Let $C$ be the set of possible classes for our classification problem, and let $|C|$ be the number of classes total.
Loading the data 🐧¶
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna().reset_index(drop=True)
X_train, X_test, y_train, y_test = train_test_split(penguins[['bill_length_mm', 'body_mass_g']],
penguins['species'],
random_state=26)
display(X_train, y_train)
| | bill_length_mm | body_mass_g |
|---|---|---|
| 93 | 43.2 | 4100.0 |
| 103 | 43.2 | 4775.0 |
| 274 | 46.2 | 5300.0 |
| ... | ... | ... |
| 262 | 45.2 | 5300.0 |
| 318 | 53.4 | 5500.0 |
| 309 | 46.9 | 4875.0 |
249 rows × 2 columns
93     Adelie
103    Adelie
274    Gentoo
...
262    Gentoo
318    Gentoo
309    Gentoo
Name: species, Length: 249, dtype: object
- As we did two lectures ago, we'll aim to predict the 'species' of a penguin given their 'bill_length_mm' and 'body_mass_g'.
util.penguin_scatter_2d(X_train, y_train)
Recap: $k$-nearest neighbors¶
- Let's fit a $k$-NN classifier with $k=5$ to the training data.
from sklearn.neighbors import KNeighborsClassifier
model_knn = KNeighborsClassifier(n_neighbors=5)
model_knn.fit(X_train, y_train)
util.penguin_decision_boundary(model_knn, X_train, y_train, title="k-NN Decision Boundary when k = 5")
- Notice the vastly different scales of the features! What happens if we standardize?
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
model_knn_standardized = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model_knn_standardized.fit(X_train, y_train)
util.penguin_decision_boundary(model_knn_standardized, X_train, y_train, title="k-NN Decision Boundary when k = 5 and with Standardization")
Recap: Decision trees¶
- Let's fit a decision tree classifier with a maximum depth of 3 to the training data.
from sklearn.tree import DecisionTreeClassifier
model_tree = DecisionTreeClassifier(max_depth=3)
model_tree.fit(X_train, y_train)
util.penguin_decision_boundary(model_tree, X_train, y_train, title="Decision Boundary for a Decision Tree of Depth 3")
What about logistic regression?¶
- As we've seen, in binary classification, logistic regression models the probability of belonging to class 1, given a feature vector $\vec{x}_i$:
$$P(y_i = 1 | \vec{x}_i) = \sigma\left(\vec{w} \cdot \text{Aug}(\vec{x}_i)\right)$$
- In logistic regression, $C = \{0, 1\}$. But, in our current penguin classification problem, $C = \{ \text{Adelie}, \text{Chinstrap}, \text{Gentoo} \}$, so we can't use logistic regression directly.
- One idea: one-vs-rest. Fit $|C| = 3$ separate logistic regression models, one per class, and predict the class that has the highest probability.
- Penguin is Adelie vs. penguin is not Adelie.
- Penguin is Chinstrap vs. penguin is not Chinstrap.
- Penguin is Gentoo vs. penguin is not Gentoo.
- Another idea: one-vs-one. Fit ${3 \choose 2} = 3$ separate logistic regression models, one per pair of classes, and predict the class that wins the most pairwise matchups; both schemes are sketched in code after this list.
- Penguin is Adelie vs. penguin is Chinstrap.
- Penguin is Adelie vs. penguin is Gentoo.
- Penguin is Chinstrap vs. penguin is Gentoo.
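- For reference, sklearn can fit both of these schemes for us by wrapping a binary classifier. A minimal sketch (the max_iter bump just avoids convergence warnings on these unscaled features):
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
model_ovr = OneVsRestClassifier(LogisticRegression(max_iter=10_000)).fit(X_train, y_train)
model_ovo = OneVsOneClassifier(LogisticRegression(max_iter=10_000)).fit(X_train, y_train)
model_ovr.score(X_test, y_test), model_ovo.score(X_test, y_test)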
- Let's try something slightly different than what's listed above.
Multinomial logistic regression¶
- Multinomial logistic regression, also known as softmax regression, models the probability of belonging to any class, given a feature vector $\vec x_i$.
Think of it as a generalization of logistic regression.
- Instead of a single parameter vector $\vec{w}$, there are $|C|$ parameter vectors, one per class!
- Multinomial logistic regression models the probability of each class directly, and then predicts the most likely class.
Aside: The softmax function¶
- The softmax function is a generalization of the logistic function to multiple dimensions.
Suppose $\vec z \in \mathbb{R}^d$. Then, the softmax of $\vec z$ is defined element-wise as follows:
$$\sigma(\vec z)_i = \frac{e^{z_i}}{\sum_{j = 1}^d e^{z_j}}$$
- For example, suppose $\vec{z} = \begin{bmatrix} -5 \\ 2 \\ 4 \end{bmatrix}$. Then:
$$\sigma(\vec z) = \frac{1}{e^{-5} + e^{2} + e^{4}} \begin{bmatrix} e^{-5} \\ e^{2} \\ e^{4} \end{bmatrix} \approx \begin{bmatrix} 0.0001 \\ 0.12 \\ 0.88 \end{bmatrix}$$
- Why is it defined this way? It maps a vector of real numbers to a vector of probabilities!
Note that the denominator, $\sum_{j=1}^d e^{z_j}$, normalizes the $e^{z_i}$ terms so that the results sum to 1.
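- Here's a quick check of the example above, computed with numpy and with scipy.special.softmax (which implements the same formula):
from scipy.special import softmax
z = np.array([-5, 2, 4])
np.exp(z) / np.exp(z).sum(), softmax(z)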
Multinomial logistic regression, i.e. softmax regression, trains $|C|$ linear models of the form $\boxed{\vec w_k \cdot \text{Aug}(\vec x_i)}$, one per class $k$, and feeds the output of each through the softmax function, so the results can be interpreted as probabilities.
$$p_j = P(y_i = j | \vec{x}_i) = \frac{e^{\vec{w}_j \cdot \text{Aug}(\vec{x}_i)}}{\sum_{k \in C} e^{\vec w_k \cdot \text{Aug}(\vec x_i)}}$$
The $|C|$ optimal parameter vectors, $\vec w_\text{Adelie}^*$, $\vec w_\text{Chinstrap}^*$, and $\vec w_\text{Gentoo}^*$ in our case, are chosen to minimize mean cross-entropy loss, just like before!
Multinomial logistic regression in sklearn¶
- The LogisticRegression class supports multinomial logistic regression.
model_log = LogisticRegression(multi_class='multinomial')
model_log.fit(X_train, y_train)
LogisticRegression(multi_class='multinomial')
- In total, the fit model has $3 \times 2 = 6$ coefficients and $3 \times 1 = 3$ intercepts.
model_log.coef_
array([[-0.85, 0. ], [ 0.84, -0.01], [ 0.02, 0.01]])
model_log.intercept_
array([ 36.4 , -10.96, -25.43])
model_log.classes_
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
- When calling model_log.predict_proba, we get back an array of three predicted probabilities.
model_log.predict_proba(pd.DataFrame([{
'bill_length_mm': 45,
'body_mass_g': 4500
}]))
array([[0.14, 0.01, 0.85]])
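- And predict returns whichever class has the largest predicted probability; a quick sketch verifying this for the same penguin:
new_penguin = pd.DataFrame([{
    'bill_length_mm': 45,
    'body_mass_g': 4500
}])
# Both of these should be 'Gentoo', the class with probability 0.85 above.
model_log.predict(new_penguin)[0], model_log.classes_[np.argmax(model_log.predict_proba(new_penguin))]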
What does this model look like?¶
util.penguin_decision_boundary(model_log, X_train, y_train, title="Softmax Regression Decision Boundary")
Neural networks 🧠¶
- Softmax regression is an example of a neural network.
Our brains are made up of neurons connected by "links", called synapses. The model diagram below loosely resembles this structure, which is why the model is called a neural network.
- Each of the 9 diagonal lines connecting a value in the input layer with a value in the output layer represents a parameter, $w^*$.
model_log.intercept_
array([ 36.4 , -10.96, -25.43])
model_log.coef_
array([[-0.85, 0. ], [ 0.84, -0.01], [ 0.02, 0.01]])
- We can use the nine parameter values above to reproduce the network's calculations ourselves.
# Same values as shown by model_log.predict_proba, two slides ago!
softmax = lambda z: np.e ** z / sum(np.e ** z)
softmax(model_log.intercept_.reshape(-1, 1) + model_log.coef_ @ np.array([[45], [4500]]))
array([[0.14], [0.01], [0.85]])