from lec_utils import *
Announcements 📣
- Homework 1 is due tonight, though note that you have 6 slip days to use during the semester, and you can use up to 2 slip days on any homework (see here for policy details).
Post on Ed or come to Office Hours for help! We're using a queue for office hours now – access it from practicaldsc.org/calendar.
- Homework 2 will be released tomorrow.
We'll make an Ed announcement anytime an assignment is released.
- In discussion tomorrow, we'll cover past exam problems on paper related to this week's material.
- Check out the Resources tab on the course website, with links to lots of supplementary resources.
New link: EECS 201: Computer Science Pragmatics. Look here for help with Terminal commands, git, etc.
Agenda
- Randomness and simulation.
- Introduction to pandas DataFrames.
- Selecting columns from a DataFrame.
- Selecting rows from a DataFrame.
Remember to follow along in lecture by accessing the "blank" lecture notebook in our public GitHub repository.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
When is your birthday?
Randomness and simulation
We'll start by exploring a useful application of numpy in the field of probability and statistics: simulation!
np.random
The submodule np.random contains various functions that produce random results.
These use pseudo-random number generators to generate random-seeming sequences of results.
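Because the generators are pseudo-random, the sequence they produce is fully determined by a starting value called a seed. The sketch below is one way to see this, using np.random.seed; the seed value 23 and the variable names are arbitrary choices for illustration, not something from the lecture.

```python
import numpy as np

# Seeding the global generator fixes the sequence of "random" results.
np.random.seed(23)  # 23 is an arbitrary seed chosen for illustration
first = np.random.randint(1, 7, 5)   # five simulated die rolls

# Resetting to the same seed replays exactly the same sequence.
np.random.seed(23)
second = np.random.randint(1, 7, 5)

same = np.array_equal(first, second)
print(first, second, same)
```

Setting a seed is handy when you want a simulation to give the same results every time the notebook is run.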
# Run this cell multiple times!
# Returns a random integer between 1 and 6, inclusive.
np.random.randint(1, 7)
3
# Returns a random real number in the interval [0, 1): 0 is possible, 1 is not.
np.random.random()
0.24976347964756174
# Returns a randomly selected element from the provided list, 5 times.
np.random.choice(['H', 'T'], 5)
array(['T', 'H', 'T', 'T', 'T'], dtype='<U1')
# Returns the number of occurrences of each outcome
# in 12 trials of an experiment in which
# outcome 1 happens 60% of the time and
# outcome 2 happens 40% of the time.
np.random.multinomial(12, [0.6, 0.4])
array([3, 9])
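As a quick sanity check on np.random.multinomial, the counts it returns always add up to the number of trials, and over many experiments the average counts should land near 12 × 0.6 = 7.2 and 12 × 0.4 = 4.8. The sketch below assumes the optional size argument to run many experiments at once; the variable names are made up for illustration.

```python
import numpy as np

# One experiment: 12 trials split between the two outcomes.
counts = np.random.multinomial(12, [0.6, 0.4])
total = counts.sum()  # always equals the number of trials, 12

# Many experiments at once: each row is one experiment's pair of counts.
many = np.random.multinomial(12, [0.6, 0.4], size=10_000)
avg = many.mean(axis=0)  # average count of each outcome across experiments

print(counts, total)
print(avg)
```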
Simulations
- Often, we'll want to estimate the probability of an event, but it may not be possible – or we may not know how – to calculate the probability exactly.
e.g., the probability that I see between 40 and 50 heads when I flip a fair coin 100 times.
- Or, we may have a theoretical answer, and want to validate it using another approach.
In such cases, we can use the power of simulation. We can:
- Figure out how to simulate one run of the experiment.
e.g., figure out how to get Python to flip a fair coin 100 times and count the number of heads.
- Repeat the experiment many, many times.
- Compute the fraction of experiments in which our event occurs, and use this fraction as an estimate of the probability of our event.
This is the basis of Monte Carlo Methods.
- Theory tells us that the more repetitions we perform of our experiment, the closer our fraction will be to the true probability of the event!
Specifically, the Law of Large Numbers tells us this.
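To get a feel for the Law of Large Numbers, here is a minimal sketch that estimates the probability of heads for a single fair coin flip (true value 0.5) using more and more repetitions. The helper name estimate_heads_probability is hypothetical, chosen just for this illustration.

```python
import numpy as np

def estimate_heads_probability(repetitions):
    # Hypothetical helper: flip a fair coin `repetitions` times and
    # return the fraction of flips that landed heads.
    flips = np.random.choice(['H', 'T'], repetitions)
    return (flips == 'H').mean()

# The estimates tend to get closer to 0.5 as the repetitions grow.
for n in [10, 1_000, 100_000]:
    print(n, estimate_heads_probability(n))
```

Since the results are random, the exact numbers change from run to run, but the estimate from 100,000 repetitions will typically be much closer to 0.5 than the estimate from 10.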
Example: Coin flipping
- Question: What is the probability that I see between 40 and 50 heads, inclusive, when I flip a fair coin 100 times?
- Step 1: Figure out how to simulate one run of the experiment.
e.g., figure out how to get Python to flip a fair coin 100 times and count the number of heads.
(np.random.choice(['H', 'T'], 100) == 'H').sum()
48
np.random.multinomial(100, [0.5, 0.5])[0]
55
def num_heads():
    # Simulates 100 flips of a fair coin and returns the number of heads.
    return np.random.multinomial(100, [0.5, 0.5])[0]
num_heads()
49
- Step 2: Repeat the experiment many, many times.
In other words, run the cell above lots of times and store the results somewhere.
outcomes = np.array([])
for _ in range(10_000):
    # Note that with arrays, append is a FUNCTION,
    # not a METHOD, and is NOT destructive,
    # unlike with lists!
    outcomes = np.append(outcomes, num_heads())
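Since np.append builds a brand-new array on every call, the loop above does a lot of copying. One possible alternative, sketched here under the assumption that we pass the optional size argument to np.random.multinomial, is to run all 10,000 experiments in a single call. The name outcomes_fast is made up so that it doesn't overwrite outcomes.

```python
import numpy as np

# size=10_000 asks for 10,000 independent experiments of 100 flips each.
# The result has one row per experiment: [number of heads, number of tails].
all_counts = np.random.multinomial(100, [0.5, 0.5], size=10_000)

# Column 0 holds the number of heads in each experiment.
outcomes_fast = all_counts[:, 0]

print(outcomes_fast[:10])
```

Both approaches simulate the same experiment; the loop makes the "repeat many times" step explicit, while this version avoids the loop entirely.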
- Step 3: Compute the fraction of experiments in which our event occurs, and use this fraction as an estimate of the probability of our event.
px.histogram(outcomes)
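One way to finish Step 3 is sketched below: check which simulated outcomes fall between 40 and 50, inclusive, and take the mean of the resulting Boolean array to get the fraction. So that this cell runs on its own, it regenerates outcomes in a single call rather than reusing the loop, and it also computes the exact probability with math.comb for comparison; the names estimate and exact are made up for illustration.

```python
import numpy as np
from math import comb

# Regenerate 10,000 simulated head counts so this cell is self-contained.
outcomes = np.random.multinomial(100, [0.5, 0.5], size=10_000)[:, 0]

# Fraction of experiments with between 40 and 50 heads, inclusive.
# The mean of a Boolean array is the proportion of True values.
estimate = ((outcomes >= 40) & (outcomes <= 50)).mean()

# Exact answer for comparison: add up P(exactly k heads) for k = 40, ..., 50.
exact = sum(comb(100, k) for k in range(40, 51)) / 2 ** 100

print(estimate, exact)
```

The simulated estimate changes slightly from run to run, but it should sit close to the exact value, which is a little over one half.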