from lec_utils import *
from lec03_utils import *
from PIL import Image
Announcements 📣¶
- Homework 1 is due on Thursday.
Post on Ed or come to Office Hours for help!
- The Welcome Survey was due yesterday; please fill it out if you haven't already!
EECS 370 has the same midterm time as us. If you're in 370, sign up to take their alternate midterm exam the following day. - We released a 🎥 walkthrough video of the "area codes" dictionary example from Lecture 2.
- We also posted the solutions to the exercises from Lecture 2 and Discussion 1 in our public GitHub repository. Starting today, I'll post "blank" versions of lecture notebooks before class, and "filled" versions of notebooks with the code that I write after class, so that you can code along with me.
- Check out the new Resources tab on the course website, with links to lots of supplementary resources and past exams from similar classes!
Agenda¶
- Recap:
for
-loops and dictionaries. numpy
arrays.- Multidimensional arrays and linear algebra.
- Randomness and simulation.
Aside: Python Tutor¶
Python Tutor, linked on the Resources tab of the course website, visualizes the execution of Python code.
# The mystery example from Lecture 2.
from IPython.display import IFrame
IFrame(width="800",
height="500",
frameborder="0",
src="https://pythontutor.com/iframe-embed.html#code=def%20mystery%28vals%29%3A%0A%20%20%20%20vals%5B-1%5D%20%3D%2015%0A%20%20%20%20return%20vals.append%28'BBB'%29%0A%20%20%20%20%0Acreature%20%3D%20%5B1,%202,%203%5D%0A%0Amystery%28creature%29%0Amystery%28creature%29%0Amystery%28creature%29&codeDivHeight=400&codeDivWidth=350&cumulative=false&curInstr=14&heapPrimitives=nevernest&origin=opt-frontend.js&py=311&rawInputLstJSON=%5B%5D&textReferences=false")
Question 🤔 (Answer at practicaldsc.org/q)
How much progress have you made on Homework 1?
No judgement!
- A. I've submitted it!
- B. I've finished more than half of it.
- C. I've finished a few questions.
- D. I've looked at it, but haven't written any code yet.
- E. Haven't started at all.
Recap: for
-loops and dictionaries¶
(source)
for
-loops in Python¶
- In Python, you can loop over any iterable. Strings, lists, and dictionaries are all examples of iterables.
- All of the following are valid ways to write a
for
-loop.
for value in "this is a string":
for element in lst: # Assume lst is a list.
for i in range(len(lst)):
- One of the more common
for
-loop examples you may have seen in earlier classes involved performing some operation to every element of a sequence, e.g. doubling the numbers in a list.
def double(vals):
new_vals = []
for val in vals:
new_vals.append(vals * 2)
return new_vals
- We are going to avoid ❌ these kinds of
for
-loops in this class, because there are much faster ways of achieving the same goal innumpy
andpandas
. We'll see these soon.
while
-loops will come up sparingly.
But conceptually, you should know how they work!
List comprehension¶
In the situations when we do want to perform some operation to every element in a list, a common pattern is the list comprehension.
vals = [2, -1, 9, 4, 3, 8]
[val ** 2 for val in vals]
[4, 1, 81, 16, 9, 64]
[val ** 2 for val in vals if val % 2 == 0]
[4, 16, 64]
[val ** 2 if val % 2 == 0 else val + 1 for val in vals]
[4, 0, 10, 16, 4, 64]
Dictionaries¶
- A dictionary stores a collection of key-value pairs.
They are the equivalent of a map in C++.
{
curly brackets}
denote the start and end of a dictionary, a colon:
is used to denote a single key value pair, and a comma,
is used to separate key-value pairs.
dog = {'name': 'Junior', 'age': 15, 4: ['kibble', 'treat']}
dog
{'name': 'Junior', 'age': 15, 4: ['kibble', 'treat']}
- We retrieve a value in a dictionary using its key.
dog['name']
'Junior'
dog['height']
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[9], line 1 ----> 1 dog['height'] KeyError: 'height'
- After creation, we can add or change key-value pairs.
dog['color'] = 'beige'
dog['tricks'] = {
'easy': ['roll over', 'paw'],
'medium': ['jump']
}
dog
{'name': 'Junior', 'age': 15, 4: ['kibble', 'treat'], 'color': 'beige', 'tricks': {'easy': ['roll over', 'paw'], 'medium': ['jump']}}
- A dictionary's keys must be immutable (numbers, strings, Booleans), while its values can be anything.
# Here, we're trying to add a value with a key of [1, 2].
# Since [1, 2] is mutable, it can't be used as a key.
dog[[1, 2]] = 'does this work?'
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[12], line 3 1 # Here, we're trying to add a value with a key of [1, 2]. 2 # Since [1, 2] is mutable, it can't be used as a key. ----> 3 dog[[1, 2]] = 'does this work?' TypeError: unhashable type: 'list'
Activity
Complete the implementation of the function find_anagrams
, which takes in words
, a list of strings, and returns a dictionary describing the anagrams present in words
. Example behavior is given below.
>>> find_anagrams(['dog', 'hello', 'enlist', 'silent', 'a gentleman', 'god', 'elegant man', 'listen'])
{'dgo': ['dog', 'god'],
'ehllo': ['hello'],
'eilnst': ['enlist', 'silent', 'listen'],
' aaeeglmnnt': ['a gentleman', 'elegant man']}
example_words = ['dog', 'hello', 'enlist', 'silent', 'a gentleman', 'god', 'elegant man', 'listen']
def find_anagrams(words):
out = {}
for word in words:
word_sorted = ''.join(sorted(word))
if word_sorted in out:
out[word_sorted].append(word)
else:
out[word_sorted] = [word]
return out
numpy
arrays¶
Import statements¶
- We use
import
statements to add the objects (values, functions, classes) defined in other modules to our programs. There are a few different ways toimport
.
Other terms I'll use for "module" are "library" and "package".
- Option 1:
import module
.
Now, everytime we want to use a name in module
, we must write module.<name>
.
import math
math.sqrt(15)
3.872983346207417
sqrt(15)
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[16], line 1 ----> 1 sqrt(15) NameError: name 'sqrt' is not defined
- Option 2:
import module as m
.
Now, everytime we want to use a name in module
, we can write m.<name>
instead of module.<name>
.
# This is the standard way that we will import numpy.
import numpy as np
np.pi
3.141592653589793
np.linalg.inv([[2, 1],
[3, 4]])
array([[ 0.8, -0.2], [-0.6, 0.4]])
- Option 3:
from module import ...
.
This way, we explicitly state the names we want to import from module
.
To import everything, write from module import *
.
# Importing a particular function from the requests module.
from requests import get
# This typically fills up the namespace with a lot of unnecessary names, so use sparingly.
from math import *
sqrt
<function math.sqrt(x, /)>
NumPy¶
- NumPy (pronounced "num pie") is a Python library (module) that provides support for arrays and operations on them.
- The
pandas
library, which we will use for tabular data manipulation, works in conjunction withnumpy
.
- To use
numpy
, we need to import it. It's usually imported asnp
(but doesn't have to be!)
We also had to install it on your computer first, but you already did that when you set up your environment.
import numpy as np
Arrays¶
- The core data structure in
numpy
is the array. Moving forward, "array" will always refer to anumpy
array.
- One way to instantiate an array is to pass a list as an argument to the function
np.array
.
np.array([4, 9, 1, 2])
array([4, 9, 1, 2])
- Arrays, unlike lists, must be homogenous – all elements must be of the same type.
# All elements are converted to strings!
np.array([1961, 'michigan'])
array(['1961', 'michigan'], dtype='<U21')
Array-number arithmetic¶
- Arrays make it easy to perform the same operation to every element without a
for
-loop. This behavior is formally known as "broadcasting", but we often say these operations are vectorized.
temps = [68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62]
temps
[68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62]
temp_array = np.array(temps)
# Increase all temperatures by 3 degrees.
temp_array + 3
array([71, 75, 68, 67, 65, 64, 62, 67, 67, 66, 68, 65])
# Halve all temperatures.
temp_array / 2
array([34. , 36. , 32.5, 32. , 31. , 30.5, 29.5, 32. , 32. , 31.5, 32.5, 31. ])
# Convert all temperatures to Celsius.
(5 / 9) * (temp_array - 32)
array([20. , 22.22, 18.33, 17.78, 16.67, 16.11, 15. , 17.78, 17.78, 17.22, 18.33, 16.67])
- Note: In none of the above cells did we actually modify
temp_array
! Each of those expressions created a new array. To actually changetemp_array
, we need to reassign it to a new array.
temp_array
array([68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62])
temp_array = (5 / 9) * (temp_array - 32)
# Now in Celsius!
temp_array
array([20. , 22.22, 18.33, 17.78, 16.67, 16.11, 15. , 17.78, 17.78, 17.22, 18.33, 16.67])
⚠️ The dangers of unnecessary for
-loops¶
- Under the hood,
numpy
is implemented in C and Fortran, which are compiled languages that are much faster than Python. As a result, these vectorized operations are much quicker than if we used a vanilla Pythonfor
-loop.
Also, the fact that arrays must be homogenous lend themselves to more efficient representations in memory.
- We can time code in a Jupyter Notebook. Let's try and square a long sequence of integers and see how long it takes with a Python loop:
%%timeit
squares = []
for i in range(1_000_000):
squares.append(i * i)
46.7 ms ± 790 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
- In vanilla Python, this takes about 0.04 seconds per loop. In
numpy
:
%%timeit
squares = np.arange(1_000_000) ** 2
1.51 ms ± 47.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
- Only takes about 0.001 seconds per loop, more than 40x faster!
Element-wise arithmetic¶
- We can apply arithmetic operations to multiple arrays, provided they have the same length.
- The result is computed element-wise, which means that the arithmetic operation is applied to one pair of elements from each array at a time.
a = np.array([4, 5, -1])
b = np.array([2, 3, 2])
a + b
array([6, 8, 1])
a / b
array([ 2. , 1.67, -0.5 ])
a ** 2 + b ** 2
array([20, 34, 5])
arr = np.array([3, 8, 4, -3.2])
(2 ** arr).sum()
280.108818820412
(2 ** arr).mean()
70.027204705103
(2 ** arr).max()
256.0
(2 ** arr).argmax()
1
# An attribute, not a method.
arr.shape
(4,)
Question 🤔 (Answer at practicaldsc.org/q)
What questions do we have about arrays so far?
Activity
🎉 Congrats! 🎉 You won the lottery 💰. Here's how your payout works: on the first day of September, you are paid \$0.01. Every day thereafter, your pay doubles, so on the second day you're paid \$0.02, on the third day you're paid \$0.04, on the fourth day you're paid \$0.08, and so on.
September has 30 days.
Write a one-line expression that uses the numbers 2
and 30
, along with the function np.arange
and at least one array method, that computes the total amount in dollars you will be paid in September. No for
-loops or list comprehensions allowed!
*Note*: We have a 🎥 walkthrough video of this problem, but don't watch it until you've tried it yourself!
(2 ** np.arange(30) / 100).sum()
10737418.23
Boolean filtering¶
- Comparisons with arrays yield Boolean arrays! These can be used to answer questions about the values in an array.
temp_array
array([20. , 22.22, 18.33, 17.78, 16.67, 16.11, 15. , 17.78, 17.78, 17.22, 18.33, 16.67])
temp_array >= 18
array([ True, True, True, False, False, False, False, False, False, False, True, False])
- How many values are greater than or equal to 18?
(temp_array >= 18).sum()
4
- What fraction of values are greater than or equal to 18?
(temp_array >= 18).mean()
0.3333333333333333
- Which values are greater than or equal to 18?
temp_array[temp_array >= 18]
array([20. , 22.22, 18.33, 18.33])
- Which values are between 18 and 20?
# Note the parentheses!
temp_array[(temp_array >= 18) & (temp_array <= 20)]
array([20. , 18.33, 18.33])
# WRONG!
temp_array[(temp_array >= 18) and (temp_array <= 20)]
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[53], line 2 1 # WRONG! ----> 2 temp_array[(temp_array >= 18) and (temp_array <= 20)] ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Note: & and | vs. and and or¶
- In Python, the standard symbols for "and" and "or" are, literally,
and
andor
.
if (5 > 3 and 'h' + 'i' == 'hi') or (-2 > 0):
print('success')
success
- But, when taking the element-wise and/or of two arrays, the standard operators don't work. Instead, use the bitwise operators:
&
for "and",|
for "or".
temp_array
array([20. , 22.22, 18.33, 17.78, 16.67, 16.11, 15. , 17.78, 17.78, 17.22, 18.33, 16.67])
# Don't forget parentheses when using multiple conditions!
temp_array[(temp_array % 2 == 0) | (temp_array == temp_array.min())]
array([20., 15.])
- Read more about this here.
Multidimensional arrays and linear algebra¶
Multidimensional arrays¶
- A matrix can be represented in code using a two dimensional (2D) array.
- 2D arrays also resemble tables, or DataFrames, so it's worthwhile to study how they work.
nums = np.array([
[5, 1, 9, 7],
[9, 8, 2, 3],
[2, 5, 0, 4]
])
nums
array([[5, 1, 9, 7], [9, 8, 2, 3], [2, 5, 0, 4]])
# nums has 3 rows and 4 columns.
nums.shape
(3, 4)
- In addition to creating 2D arrays from scratch, we can also create 2D arrays by reshaping other arrays.
# Here, we're asking to reshape np.arange(1, 7)
# so that it has 2 rows and 3 columns.
a = np.arange(1, 7).reshape((2, 3))
a
array([[1, 2, 3], [4, 5, 6]])
Operations along axes¶
- In 2D arrays (and DataFrames), axis 0 refers to the rows (up and down) and axis 1 refers to the columns (left and right).
a
array([[1, 2, 3], [4, 5, 6]])
- If we specify
axis=0
,a.sum
will "compress" along axis 0.
a.sum(axis=0)
array([5, 7, 9])
- If we specify
axis=1
,a.sum
will "compress" along axis 1.
a.sum(axis=1)
array([ 6, 15])
Selecting rows and columns from 2D arrays¶
- You can use
[
square brackets]
to slice rows and columns out of an array, too.
- The general convention is:
array[<row positions>, <column positions>]
a
array([[1, 2, 3], [4, 5, 6]])
# Accesses row 0 and all columns.
a[0, :]
array([1, 2, 3])
# Same as the above.
a[0]
array([1, 2, 3])
# Accesses all rows and column 1.
a[:, 1]
array([2, 5])
# Access all rows and columns 0 and 2.
a[:, [0, 2]]
array([[1, 3], [4, 6]])
# Accesses row 0 and columns 1 and onwards.
a[0, 1:]
array([2, 3])
Activity
Suppose we run the cell below.s = (5, 3)
grid = np.ones(s) * 2 * np.arange(1, 16).reshape(s)
grid[-1, 1:].sum()
What is the output of the cell? Try and answer without writing any code.
s = (5, 3)
grid = np.ones(s) * 2 * np.arange(1, 16).reshape(s)
grid[-1, 1:].sum()
58.0
Linear algebra review¶
- Arrays are used to perform computation in the context of linear algebra! For example, let's work through Practice Question 8.1 from LARDS, but this time using code.
Consider the vectors $\vec u$ and $\vec v$, defined below:
$$\vec u = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \qquad \vec v = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}$$
show_projection_plot()
We define $X \in \mathbb{R}^{3 \times 2}$ to be the matrix whose first column is $\vec u$ and whose second column is $\vec v$.
u = np.array([1, 0, 0]).reshape(-1, 1)
v = np.array([0, 1, 1]).reshape(-1, 1)
u
array([[1], [0], [0]])
v
array([[0], [1], [1]])
X = np.hstack([u, v])
X
array([[1, 0], [0, 1], [0, 1]])
Show that:
$$(X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}$$np.linalg.inv(X.T @ X) @ X.T
array([[1. , 0. , 0. ], [0. , 0.5, 0.5]])
Find scalars $a$ and $b$ such that $a \vec u + b \vec v$ is the vector in $\text{span}(\vec u, \vec v)$ that is as close to $\vec y = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix}$ as possible.
y = np.array([4, 2, 8])
w = np.linalg.inv(X.T @ X) @ X.T @ y
w
array([4., 5.])
Let $\vec e = \vec y - (a \vec u + b \vec v)$, where $a$ and $b$ are the values you found above. You should notice that the sum of the entries in $\vec e$ is 0. Why?
e = y - X @ w
e
array([ 0., -3., 3.])
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have about multidimensional arrays and our linear algebra review?
Example: Image processing¶
- It turns out that images can be represented as 3D
numpy
arrays.
- The color of each pixel can be described with three numbers under the RGB model – a red value, green value, and blue value. Each of these can vary from 0 to 1.
img = np.asarray(Image.open('imgs/junior.jpeg')) / 255
img
array([[[0.38, 0.24, 0.18], [0.35, 0.22, 0.15], [0.35, 0.22, 0.16], ..., [0.48, 0.31, 0.23], [0.49, 0.31, 0.24], [0.5 , 0.32, 0.25]], [[0.38, 0.24, 0.17], [0.35, 0.22, 0.15], [0.35, 0.22, 0.16], ..., [0.49, 0.31, 0.24], [0.49, 0.31, 0.24], [0.5 , 0.31, 0.24]], [[0.37, 0.23, 0.16], [0.35, 0.22, 0.15], [0.35, 0.22, 0.16], ..., [0.49, 0.31, 0.24], [0.49, 0.31, 0.24], [0.49, 0.31, 0.24]], ..., [[0.35, 0.2 , 0.04], [0.33, 0.18, 0.03], [0.32, 0.16, 0.01], ..., [0.37, 0.27, 0.13], [0.39, 0.29, 0.15], [0.42, 0.32, 0.18]], [[0.34, 0.19, 0.04], [0.33, 0.18, 0.04], [0.33, 0.18, 0.04], ..., [0.36, 0.26, 0.12], [0.39, 0.29, 0.15], [0.44, 0.33, 0.2 ]], [[0.37, 0.22, 0.07], [0.36, 0.22, 0.07], [0.37, 0.22, 0.08], ..., [0.35, 0.25, 0.11], [0.38, 0.28, 0.15], [0.44, 0.34, 0.2 ]]])
img.shape
(2048, 1536, 3)
plt.imshow(img)
plt.axis('off');
Applying a greyscale filter¶
- One way to convert an image to greyscale is to average its red, green, and blue values.
mean_2d = img.mean(axis=2)
mean_2d
array([[0.27, 0.24, 0.24, ..., 0.34, 0.34, 0.36], [0.26, 0.24, 0.25, ..., 0.34, 0.34, 0.35], [0.25, 0.24, 0.25, ..., 0.34, 0.34, 0.35], ..., [0.2 , 0.18, 0.16, ..., 0.25, 0.28, 0.31], [0.19, 0.18, 0.18, ..., 0.25, 0.28, 0.32], [0.22, 0.22, 0.23, ..., 0.24, 0.27, 0.33]])
# This is just a single red channel!
plt.imshow(mean_2d)
plt.axis('off');
- We need to repeat
mean_2d
three times along axis 2, to use the same values for the red, green, and blue channels.np.repeat
will help us here.
# np.newaxis is an alias for None.
# It helps us introduce an additional axis.
np.arange(5)[:, np.newaxis]
array([[0], [1], [2], [3], [4]])
np.repeat(np.arange(5)[:, np.newaxis], 3, axis=1)
array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])
mean_3d = np.repeat(mean_2d[:, :, np.newaxis], 3, axis=2)
mean_3d
array([[[0.27, 0.27, 0.27], [0.24, 0.24, 0.24], [0.24, 0.24, 0.24], ..., [0.34, 0.34, 0.34], [0.34, 0.34, 0.34], [0.36, 0.36, 0.36]], [[0.26, 0.26, 0.26], [0.24, 0.24, 0.24], [0.25, 0.25, 0.25], ..., [0.34, 0.34, 0.34], [0.34, 0.34, 0.34], [0.35, 0.35, 0.35]], [[0.25, 0.25, 0.25], [0.24, 0.24, 0.24], [0.25, 0.25, 0.25], ..., [0.34, 0.34, 0.34], [0.34, 0.34, 0.34], [0.35, 0.35, 0.35]], ..., [[0.2 , 0.2 , 0.2 ], [0.18, 0.18, 0.18], [0.16, 0.16, 0.16], ..., [0.25, 0.25, 0.25], [0.28, 0.28, 0.28], [0.31, 0.31, 0.31]], [[0.19, 0.19, 0.19], [0.18, 0.18, 0.18], [0.18, 0.18, 0.18], ..., [0.25, 0.25, 0.25], [0.28, 0.28, 0.28], [0.32, 0.32, 0.32]], [[0.22, 0.22, 0.22], [0.22, 0.22, 0.22], [0.23, 0.23, 0.23], ..., [0.24, 0.24, 0.24], [0.27, 0.27, 0.27], [0.33, 0.33, 0.33]]])
plt.imshow(mean_3d)
plt.axis('off');
Applying a sepia filter¶
We won't walk through this example in lecture, but it's here for your reference.
- Let's sepia-fy Junior!
- From here, we can apply this conversion to each pixel.
sepia_filter = np.array([
[0.393, 0.769, 0.189],
[0.349, 0.686, 0.168],
[0.272, 0.534, 0.131]
])
# Multiplies each pixel by the sepia_filter matrix.
# Then, clips each RGB value to be between 0 and 1.
filtered = (img @ sepia_filter.T).clip(0, 1)
filtered
array([[[0.37, 0.33, 0.26], [0.33, 0.3 , 0.23], [0.33, 0.3 , 0.23], ..., [0.47, 0.42, 0.32], [0.47, 0.42, 0.33], [0.49, 0.43, 0.34]], [[0.36, 0.32, 0.25], [0.33, 0.3 , 0.23], [0.34, 0.3 , 0.24], ..., [0.47, 0.42, 0.33], [0.47, 0.42, 0.33], [0.48, 0.43, 0.33]], [[0.35, 0.31, 0.24], [0.33, 0.3 , 0.23], [0.34, 0.3 , 0.24], ..., [0.47, 0.42, 0.33], [0.47, 0.42, 0.33], [0.48, 0.43, 0.33]], ..., [[0.3 , 0.26, 0.21], [0.27, 0.24, 0.19], [0.25, 0.23, 0.18], ..., [0.37, 0.33, 0.26], [0.41, 0.36, 0.28], [0.45, 0.4 , 0.31]], [[0.29, 0.25, 0.2 ], [0.27, 0.24, 0.19], [0.27, 0.24, 0.19], ..., [0.36, 0.32, 0.25], [0.41, 0.36, 0.28], [0.46, 0.41, 0.32]], [[0.33, 0.29, 0.23], [0.32, 0.29, 0.22], [0.33, 0.3 , 0.23], ..., [0.35, 0.31, 0.24], [0.4 , 0.35, 0.27], [0.47, 0.42, 0.33]]])
plt.imshow(filtered)
plt.axis('off');
Key takeaway: avoid for
-loops whenever possible!¶
You can do a lot without for
-loops, both in numpy
and in pandas
.
Randomness and simulation¶
np.random
¶
The submodule np.random
contains various functions that produce random results.
These use pseudo-random number generators to generate random-seeming sequences of results.
# Run this cell multiple times!
# Returns a random integer between 1 and 6, inclusive.
np.random.randint(1, 7)
1
# Returns a random real number between 0 and 1.
np.random.random()
0.45982206654530877
# Returns a randomly selected element from the provided list, 5 times.
np.random.choice(['H', 'T'], 5)
array(['H', 'T', 'T', 'T', 'H'], dtype='<U1')
# Returns the number of occurrences of each outcome
# in 12 trials of an experiment in which
# outcome 1 happens 60% of the time and
# outcome 2 happens 40% of the time.
np.random.multinomial(12, [0.6, 0.4])
array([8, 4])
Simulations¶
- Often, we'll want to estimate the probability of an event, but it may not be possible – or we may not know how – to calculate the probability exactly.
e.g., the probability that I see between 40 and 50 heads when I flip a fair coin 100 times.
- Or, we may have a theoretical answer, and want to validate it using another approach.
In such cases, we can use the power of simulation. We can:
- Figure out how to simulate one run of the experiment.
e.g., figure out how to get Python to flip a fair coin 100 times and count the number of heads. - Repeat the experiment many, many times.
- Compute the fraction of experiments in which our event occurs, and use this fraction as an estimate of the probability of our event.
This is the basis of Monte Carlo Methods.
- Figure out how to simulate one run of the experiment.
- Theory tells us that the more repetitions we perform of our experiment, the closer our fraction will be to the true probability of the event!
Specifically, the Law of Large Numbers tells us this.
Example: Coin flipping¶
- Question: What is the probability that I see between 40 and 50 heads, inclusive, when I flip a fair coin 100 times?
- Step 1: Figure out how to simulate one run of the experiment.
e.g., figure out how to get Python to flip a fair coin 100 times and count the number of heads.
(np.random.choice(['H', 'T'], 100) == 'H').sum()
47
np.random.multinomial(100, [0.5, 0.5])[0]
50
def num_heads():
return np.random.multinomial(100, [0.5, 0.5])[0]
num_heads()
49
- Step 2: Repeat the experiment many, many times.
outcomes = np.array([])
for _ in range(10_000):
# Note that with arrays, append is a FUNCTION,
# not a METHOD, and is NOT destructive,
# unlike with lists!
outcomes = np.append(outcomes, num_heads())
- Step 3: Compute the fraction of experiments in which our event occurs, and use this fraction as an estimate of the probability of our event.
px.histogram(outcomes)
((outcomes >= 40) & (outcomes <= 50)).mean()
0.5187
- This is remarkably close to the true, theoretical answer!
from scipy.stats import binom
binom.cdf(50, 100, 0.5) - binom.cdf(39, 100, 0.5)
0.5221945185847372
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have about our coin flipping simulation?
Can you think of a way to perform the same simulation without any for
-loops?
Example: The Birthday Paradox¶
- There are ~80 students in the room right now. What are the chances at least 2 students share the same birthday?
- In general, how many people must be in a room such that there's a 50% chance that at least 2 students share the same birthday?
- Let's define a function,
estimated_probability
, which takes in a class size,n
, and returns the probability that in a class ofn
students, at least 2 students share the same birthday.
def simulate_classroom(n):
# This helper function should take in a class size, n,
# and return True if a simulated classroom of size n
# has at least 2 students with the same birthday
# and False otherwise.
# This is not the most efficient solution, but works for now.
options = np.arange(1, 366)
chosen_options = np.random.choice(options, n, replace=True)
return len(chosen_options) != len(np.unique(chosen_options))
def estimated_probability(n):
return np.mean([simulate_classroom(n) for _ in range(10_000)])
estimated_probability(80)
0.9998
- With 80 students, it's almost certain that 2 share the same birthday!
What's the minimum class size we'd need for a 50% chance that at least 2 share the same birthday?
probs = [estimated_probability(n) for n in range(1, 51)]
(
px
.bar(x=range(1, 51),
y=probs,
title='Probability that at least 2 students share the<br>same birthday in a class of n students')
.update_xaxes(title='$n$')
.update_yaxes(title='Probability')
)
- Lower than you might think!
Next time¶
On Thursday, we'll begin our deep dive into pandas
DataFrames.