🙋FAQs

Moving forward, we’re going to try and update this page each week to provide answers to questions asked (1) live in lecture, (2) at practicaldsc.org/q during lecture, and (3) on Ed. If you have other related questions, feel free to post them on Ed.

DataFrame Manipulation

Why does rounding 0.5 sometimes round down?

Question

Sometimes when I try to use Series.round() or np.round() on a number that’s exactly x.5, it rounds down. Why is this?

Answer

This is expected behavior in pandas and numpy (documentation), and Python 3’s built-in round() function rounds exact halves the same way, so round(2.5) returns 2:

For values exactly halfway between rounded decimal values, NumPy rounds to the nearest even value. Thus 1.5 and 2.5 round to 2.0, -0.5 and 0.5 round to 0.0, etc.
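To see the rule from the documentation in action, here is a minimal sketch. The Series name `halves` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Values that sit exactly halfway between two integers.
halves = pd.Series([0.5, 1.5, 2.5, 3.5])

# Series.round() sends each exact half to the nearest even integer.
rounded = halves.round()
print(rounded.tolist())     # [0.0, 2.0, 2.0, 4.0]

# np.round follows the same rule on a plain array.
np_rounded = np.round(halves.to_numpy())
print(np_rounded.tolist())  # [0.0, 2.0, 2.0, 4.0]
```

Note that 0.5 and 2.5 round down while 1.5 and 3.5 round up: the direction depends on which neighbor is even.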

[Figure: illustration of pandas Series rounding]

One reason to do this is to avoid biasing a dataset’s average upwards by always rounding up at 0.5. From a great StackOverflow answer:

This kind of rounding is called rounding to even (or banker’s rounding). It is the case because if we always round 0.5 up to the next largest number, then the average of a large data set rounded numbers is likely to be slightly larger than the average of the unrounded numbers: this bias or drift can have very bad effects on some numerical algorithms and make them inaccurate.
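The bias described above can be checked on a small, deliberately contrived dataset in which every value ends in .5. This is only a sketch: the "always round up" rule is emulated here with np.floor(values + 0.5), which is an illustrative stand-in and not something pandas or numpy does by default.

```python
import numpy as np

# 0.5, 1.5, ..., 99.5: every value is an exact half (contrived on purpose).
values = np.arange(100) + 0.5

# Emulate "always round halves up" for comparison.
half_up = np.floor(values + 0.5)

# NumPy's default: round halves to the nearest even integer.
half_even = np.round(values)

print(values.mean())     # 50.0
print(half_up.mean())    # 50.5, drifts upward
print(half_even.mean())  # 50.0, the ups and downs cancel
```

Real datasets rarely consist only of exact halves, so the drift is usually smaller than in this example, but it points in the same direction.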

Why do we pass in just iqr to agg?

Question

In lecture, we defined iqr as a function that takes in a Series. Why don’t we pass an argument explicitly here, as in agg(iqr(s)), where s is the Series we get from groupby('species')['body_mass_g']?

import numpy as np

def iqr(s):
    # s is a Series
    # return the interquartile range for s
    return np.percentile(s, 75) - np.percentile(s, 25)

# Here, the argument to agg is a function which
# takes in a Series and returns a scalar.
(
    penguins
    .groupby('species')
    ['body_mass_g']
    .agg(iqr)
)

Answer

There’s a subtle difference between .agg(iqr) and .agg(iqr(s)). If you actually tried .agg(iqr(s)), you’d get an error saying s is not defined, because Python evaluates iqr(s) before calling .agg, and in the global scope of your notebook, there (most likely) aren’t any variables named s. (There is an s, but it’s the parameter of iqr, which only exists inside the function.)

But also, .agg takes a function as input. iqr is a function, which is why we call .agg(iqr). Even if s were a Series defined in your notebook, so that iqr(s) ran and returned the difference between the 75th and 25th percentiles of this globally-defined s, .agg(iqr(s)) would end up being something like .agg(17.39). Then the input to .agg would be a number rather than the function it needs.
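The distinction can be seen in a short sketch. The DataFrame `toy_penguins` below is a hypothetical stand-in with made-up values, not the penguins data from lecture.

```python
import numpy as np
import pandas as pd

def iqr(s):
    # s is a Series; return its interquartile range.
    return np.percentile(s, 75) - np.percentile(s, 25)

# Hypothetical stand-in for the lecture's penguins DataFrame.
toy_penguins = pd.DataFrame({
    'species': ['Adelie'] * 4 + ['Gentoo'] * 4,
    'body_mass_g': [3000, 3500, 4000, 4500, 4500, 5000, 5500, 6000],
})

grouped = toy_penguins.groupby('species')['body_mass_g']

# Pass the function itself; agg calls it once per group's Series.
by_name = grouped.agg(iqr)

# Equivalent: the lambda makes the per-group Series explicit.
by_lambda = grouped.agg(lambda group: iqr(group))

# Calling iqr ourselves produces a number, not a function.
value = iqr(toy_penguins['body_mass_g'])
print(callable(iqr), callable(value))  # True False
```

In other words, .agg(iqr) hands over the recipe, and .agg supplies each group's Series as the argument s on our behalf.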