In [1]:
from lec_utils import *
import lec20_util as util

Lecture 20¶

Gradient Descent¶

EECS 398: Practical Data Science, Spring 2025¶

practicaldsc.org • github.com/practicaldsc/sp25 • 📣 See latest announcements here on Ed

Agenda 📆¶

  • Intuition for gradient descent 🗻.
  • When is gradient descent guaranteed to work?
  • Gradient descent for functions of multiple variables.
What we're building towards today.

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

Intuition for gradient descent 🗻¶


Let's go hiking!¶

  • Suppose you're at the top of a mountain 🏔️ and need to get to the bottom.
  • Further, suppose it's really cloudy ☁️, meaning you can only see a few feet around you.
  • How would you get to the bottom?

Minimizing arbitrary functions¶

  • Assume $f(w)$ is some differentiable function.
    For now, we'll assume $f$ takes in a scalar, $w$, as input and returns a scalar as its output.
  • When tasked with minimizing $f(w)$, our general strategy has been to:
    1. Find $\frac{df}{dw}(w)$, the derivative of $f$.
    2. Find the input $w^*$ such that $\frac{df}{dw}(w^*) = 0$.
  • However, there are cases where we can find $\frac{df}{dw}(w)$, but it is either difficult or impossible to solve $\frac{df}{dw}(w^*) = 0$. Then what?
$$f(w) = 5w^4 - w^3 - 5w^2 + 2w - 9$$
In [2]:
util.draw_f()
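To make the example concrete, here is a sketch of this quartic and its derivative as plain Python functions. This assumes `util.f` and `util.df` compute the same thing; the hand-computed derivative below just applies the power rule term by term.

```python
def f(w):
    # The quartic from above: f(w) = 5w^4 - w^3 - 5w^2 + 2w - 9.
    return 5 * w**4 - w**3 - 5 * w**2 + 2 * w - 9

def df(w):
    # Its derivative, found by hand with the power rule.
    return 20 * w**3 - 3 * w**2 - 10 * w + 2
```

Note that setting $\frac{df}{dw}(w) = 0$ here means solving a cubic, which is already painful by hand; that's exactly the situation where a numerical method helps.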

What does the derivative of a function tell us?¶

  • Goal: Given a differentiable function $f(w)$, find the input $w^*$ that minimizes $f(w)$.
  • What does $\frac{d}{dw} f(w)$ mean?
In [3]:
from ipywidgets import interact
interact(util.show_tangent, w0=(-1.5, 1.5));
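Beyond the interactive picture, we can see numerically that the derivative is the limit of secant slopes: as $h$ shrinks, $\frac{f(w_0 + h) - f(w_0)}{h}$ approaches $\frac{df}{dw}(w_0)$. A quick sketch, assuming $f$ is the quartic from earlier (at $w_0 = 1$, the true slope is $20 - 3 - 10 + 2 = 9$):

```python
def f(w):
    # The quartic from earlier.
    return 5 * w**4 - w**3 - 5 * w**2 + 2 * w - 9

w0 = 1.0
for h in [0.1, 0.01, 0.001]:
    # Slope of the secant line through (w0, f(w0)) and (w0 + h, f(w0 + h)).
    slope = (f(w0 + h) - f(w0)) / h
    print(h, slope)
```

The printed slopes get closer and closer to 9 as $h$ shrinks.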

Searching for the minimum¶

  • Suppose we're given an initial guess for a value of $w$ that minimizes $f(w)$.
  • If the slope of the tangent line at $(w, f(w))$ is positive 📈:
    • Increasing $w$ increases $f$.
    • This means the minimum must be to the left of the point $(w, f(w))$.
    • Solution: Decrease $w$ ⬇️.
  • The steeper the slope is, the further we must be from the minimum – so, the steeper the slope, the quicker we should decrease $w$!

Searching for the minimum¶

  • Suppose we're given an initial guess for a value of $w$ that minimizes $f(w)$.
  • If the slope of the tangent line at $(w, f(w))$ is negative 📉:
    • Increasing $w$ decreases $f$.
    • This means the minimum must be to the right of the point $(w, f(w))$.
    • Solution: Increase $w$ ⬆️.
  • The steeper the slope is, the further we must be from the minimum – so, the steeper the slope, the quicker we should increase $w$!

Gradient descent¶

  • To minimize a differentiable function $f$:
    1. Pick a positive number, $\alpha$. This number is called the learning rate, or step size.
      Think of $\alpha$ as a hyperparameter of the minimization process.
    2. Pick an initial guess, $w^{(0)}$.
    3. Then, repeatedly update your guess using the update rule:
$$w^{(t+1)} = w^{(t)} - \alpha \frac{df}{dw}(w^{(t)})$$



  • Repeat this process until convergence – that is, when the difference between $w^{(t)}$ and $w^{(t+1)}$ is small.
  • This procedure is called gradient descent.
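The steps above can be sketched as a standalone function. The stopping tolerance and iteration cap below are arbitrary choices for illustration, not part of the algorithm's definition:

```python
def gradient_descent(df, w0, alpha=0.01, tol=1e-8, max_iter=10_000):
    """Minimize a differentiable function, given its derivative df."""
    w = w0
    for _ in range(max_iter):
        w_next = w - alpha * df(w)   # the update rule
        if abs(w_next - w) < tol:    # converged: updates are tiny
            return w_next
        w = w_next
    return w
```

For example, `gradient_descent(lambda w: 2 * w, w0=5.0)` drives $w$ toward 0, the minimizer of $w^2$, since $\frac{d}{dw} w^2 = 2w$.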

What is gradient descent?¶

  • Gradient descent is a numerical method for finding the input to a function $f$ that minimizes the function.
  • It is called gradient descent because the gradient is the extension of the derivative to functions of multiple variables.
  • A numerical method is a technique for approximating the solution to a mathematical problem, often by using the computer.
  • Gradient descent is widely used in machine learning, to train models from linear regression to neural networks and transformers (including ChatGPT)!
    In machine learning, we use gradient descent to minimize empirical risk when we can't minimize it by hand, which is the case for most sophisticated models.
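As a toy sketch of that idea, here is gradient descent minimizing empirical risk for a one-parameter model $H(x) = wx$ under squared loss. The dataset and learning rate below are made up for illustration:

```python
# Hypothetical data, perfectly described by w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def dR(w):
    # Derivative of the empirical risk R(w) = mean((y_i - w * x_i)^2).
    n = len(xs)
    return sum(-2 * x * (y - w * x) for x, y in zip(xs, ys)) / n

w = 0.0
for _ in range(1_000):
    w = w - 0.01 * dR(w)   # the same update rule as before
```

After these iterations, `w` is essentially 2, the slope that minimizes the mean squared error on this data.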

Implementing gradient descent¶

  • In practice, we typically don't implement gradient descent ourselves – we rely on existing implementations of it. But, we'll implement it here ourselves to understand what's going on.
  • Let's start with an initial guess $w^{(0)} = 0$ and a learning rate $\alpha = 0.01$.
$$w^{(t+1)} = w^{(t)} - \alpha \frac{df}{dw}(w^{(t)})$$
In [4]:
w = 0
for t in range(50):
    print(round(w, 4), round(util.f(w), 4))
    w = w - 0.01 * util.df(w)
0 -9
-0.02 -9.042
-0.042 -9.0927
-0.0661 -9.1537
-0.0925 -9.2267
-0.1214 -9.3135
-0.1527 -9.4158
-0.1866 -9.5347
-0.2229 -9.6708
-0.2615 -9.8235
-0.302 -9.9909
-0.344 -10.1687
-0.3867 -10.3513
-0.4293 -10.5311
-0.4709 -10.7001
-0.5104 -10.8511
-0.547 -10.9789
-0.58 -11.0811
-0.6089 -11.1586
-0.6335 -11.2141
-0.654 -11.2521
-0.6706 -11.277
-0.6839 -11.2927
-0.6943 -11.3023
-0.7023 -11.308
-0.7085 -11.3113
-0.7131 -11.3132
-0.7166 -11.3143
-0.7193 -11.3149
-0.7213 -11.3153
-0.7227 -11.3155
-0.7238 -11.3156
-0.7247 -11.3156
-0.7253 -11.3157
-0.7257 -11.3157
-0.726 -11.3157
-0.7263 -11.3157
-0.7265 -11.3157
-0.7266 -11.3157
-0.7267 -11.3157
-0.7268 -11.3157
-0.7268 -11.3157
-0.7269 -11.3157
-0.7269 -11.3157
-0.7269 -11.3157
-0.7269 -11.3157
-0.727 -11.3157
-0.727 -11.3157
-0.727 -11.3157
-0.727 -11.3157
  • We see that pretty quickly, $w^{(t)}$ converges to $-0.727$!
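One way to sanity-check this result: at a minimum, the derivative should be (near) zero. Below, `df` is the derivative of the quartic written out by hand, rather than taken from `util`:

```python
def df(w):
    # Derivative of f(w) = 5w^4 - w^3 - 5w^2 + 2w - 9.
    return 20 * w**3 - 3 * w**2 - 10 * w + 2

w_star = -0.727
print(df(w_star))   # should be very close to 0
```
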

Visualizing $w^{(0)} = 0, \alpha = 0.01$¶

In [5]:
util.minimizing_animation(w0=0, alpha=0.01)