In [1]:
from lec_utils import *
def show_grouping_animation():
    src = "https://docs.google.com/presentation/d/1tBaFyHseIGsX5wmE3BdNLeVHnKksQtpzLhHge8Tzly0/embed?start=false&loop=false&delayms=60000&rm=minimal"
    width = 960
    height = 509
    display(IFrame(src, width, height))

Lecture 5¶

Querying and Grouping¶

EECS 398-003: Practical Data Science, Fall 2024¶

practicaldsc.org • github.com/practicaldsc/fa24

Announcements 📣¶

  • Homework 2 is due on Thursday.


Post on Ed or come to Office Hours for help! We're using a queue for office hours now – access it from practicaldsc.org/calendar.

  • study.practicaldsc.org contains our discussion worksheets (and solutions), which are made up of old exam problems. Use these problems to build your theoretical understanding of the material!

  • Homework 1 scores are available on Gradescope.

Agenda¶

  • Recap: Querying.
  • Adding and modifying columns.
  • pandas and numpy.
  • Introduction to the groupby method.
  • groupby's inner workings.

Remember to follow along in lecture by accessing the "blank" lecture notebook in our public GitHub repository.

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

Querying¶

How do we find rows that satisfy certain conditions?


Run the cell below to load in our dataset.

In [2]:
dogs = pd.read_csv('data/dogs43.csv').set_index('breed')
dogs.head()
Out[2]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
Shetland Sheepdog herding 21006.0 12.53 small 22.0 14.5

Recap: Querying¶

  • Querying is the act of selecting rows in a DataFrame that satisfy certain condition(s).
    We sometimes call this "filtering."
  • Example: How many breeds live to be over 10 years old?
In [3]:
dogs[dogs['longevity'] > 10].shape[0] 
Out[3]:
32
  • Example: Among all breeds with 'Retriever' in the name, which is the second tallest?
    Remember, we like to format our code this way when writing long, hard-to-read lines.
In [4]:
(
    dogs.loc[dogs.index.str.contains('Retriever'), 'height']
    .sort_values(ascending=False)
    .iloc[1]
)
Out[4]:
23.0
  • What is the distribution of 'size's among 'sporting' and 'working' breeds?
    Remember, you need parentheses around each condition. Also, you must use the bitwise operators & and | instead of the standard and and or keywords, as we saw in Lecture 3.
In [5]:
(
    dogs.loc[(dogs['kind'] == 'sporting') | (dogs['kind'] == 'working'), 'size']
    .value_counts()
)
Out[5]:
size
large     11
medium     7
small      1
Name: count, dtype: int64
In [6]:
# Equivalent to the above!
(
    dogs.loc[dogs['kind'].isin(['sporting', 'working']), 'size']
    .value_counts()
)
Out[6]:
size
large     11
medium     7
small      1
Name: count, dtype: int64
  • Show me all rows for 'medium'-sized dogs.
In [7]:
dogs[dogs['size'] == 'medium'] 
Out[7]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.00
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.00
Siberian Husky working 22049.0 12.58 medium 47.5 21.75
... ... ... ... ... ... ...
Clumber Spaniel sporting 18084.0 10.00 medium 70.0 18.50
Kerry Blue Terrier terrier 17240.0 9.40 medium 36.5 18.50
Bull Terrier terrier 18490.0 10.21 medium 60.0 21.50

12 rows × 6 columns

  • Show me all rows for 'Golden Retriever's.
    Note that because we set the index to 'breed' earlier, we can select rows based on dog breeds without having to query. If 'breed' was instead a column, then we'd need to query to access information about a particular breed.
In [8]:
dogs.loc['Golden Retriever'] 
Out[8]:
kind             sporting
lifetime_cost     21447.0
longevity           12.04
size               medium
weight               60.0
height              22.75
Name: Golden Retriever, dtype: object

Aside: Reference Slides

  • Moving forward, I'm going to try to spend a bit less time on syntax and a bit more time on conceptual problem-solving.
  • So, in each lecture, some slides will be called "Reference Slides".
  • We will skip these slides during live lecture, but they'll be present in the posted lecture notebooks, so that you can look at them when working on activities in class and on discussion and homework problems.
    The material in them is in-scope.
  • Reference slides will appear with red headers, like this one.
    There's a Reference Slide right after this slide, about the DataFrame query method!

Reference Slide¶

The query method¶

The DataFrame query method is a convenient way to query, since you don't need parentheses and you can use the and and or keywords.

In [9]:
dogs
Out[9]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [10]:
dogs.query('weight < 20 and kind == "terrier"')
Out[10]:
kind lifetime_cost longevity size weight height
breed
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
Miniature Schnauzer terrier 20087.0 11.81 small 15.5 13.0
Norfolk Terrier terrier 24308.0 13.07 small 12.0 9.5
In [11]:
dogs.query('kind in ["sporting", "terrier"] and lifetime_cost < 20000')
Out[11]:
kind lifetime_cost longevity size weight height
breed
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Chesapeake Bay Retriever sporting 16697.0 9.48 large 67.5 23.5
Gordon Setter sporting 19605.0 11.10 large 62.5 25.0
Clumber Spaniel sporting 18084.0 10.00 medium 70.0 18.5
Scottish Terrier terrier 17525.0 10.69 small 20.0 10.0
Kerry Blue Terrier terrier 17240.0 9.40 medium 36.5 18.5
Bull Terrier terrier 18490.0 10.21 medium 60.0 21.5

Reference Slide¶

More practice¶

In [12]:
jack = pd.DataFrame({1: ['fee', 'fi'], 
                     '1': ['fo', 'fum']})
jack
Out[12]:
1 1
0 fee fo
1 fi fum

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We won't cover these in class, but you should try them out yourself. Here's a Pandas Tutor link to visualize these!

In [13]:
# jack[1]
In [14]:
# jack[[1]]
In [15]:
# jack['1']
In [16]:
# jack[[1, 1]]
In [17]:
# jack.loc[1]
In [18]:
# jack.loc[jack[1] == 'fo']
In [19]:
# jack[1, ['1', 1]]
In [20]:
# jack.loc[1,1]

Adding and modifying columns¶


Adding and modifying columns, using a copy¶

  • To add a new column to a DataFrame, use the assign method.
    To change the values in a column, add a new column with the same name as the existing column.
  • Like most pandas methods, assign returns a new DataFrame.
    • Pro ✅: This doesn't inadvertently change any existing variables.
    • Con ❌: It is not very space efficient, as it creates a new copy each time it is called.
In [21]:
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity']) 
Out[21]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0 1628.90
... ... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5 1840.95
Mastiff working 13581.0 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20022.0 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

In [22]:
dogs
Out[22]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns
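The "same name" trick mentioned above can be seen on a toy DataFrame (hypothetical data, not our dogs dataset): passing an existing column name to assign replaces that column in the returned copy, while the original DataFrame stays unchanged.

```python
import pandas as pd

# A minimal sketch: assigning to an existing column name replaces
# that column in the *returned* DataFrame only.
df = pd.DataFrame({'cost': [100.0, 200.0]})
modified = df.assign(cost=df['cost'] / 100)

print(modified['cost'].tolist())  # [1.0, 2.0]
print(df['cost'].tolist())        # [100.0, 200.0] – original unchanged
```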

Reference Slide¶

assign for column names with special characters¶

You can also use assign when the desired column name has spaces (and other special characters) by unpacking a dictionary:

In [23]:
dogs.assign(**{'cost per year 💵': dogs['lifetime_cost'] / dogs['longevity']})
Out[23]:
kind lifetime_cost longevity size weight height cost per year 💵
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0 1628.90
... ... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5 1840.95
Mastiff working 13581.0 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20022.0 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

Adding and modifying columns, in-place¶

  • You can assign a new column to a DataFrame in-place using [].
    This works like dictionary assignment. Using [] modifies the underlying DataFrame, unlike assign, which returns a new DataFrame.
  • This is the more "common" way of adding/modifying columns.
    Exercise caution when using this approach, since it is destructive – it changes the values of existing variables.
In [24]:
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy, the original remains unmodified.
dogs_copy = dogs.copy() 
dogs_copy.head(2)
Out[24]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
In [25]:
dogs_copy['cost_per_year'] = dogs_copy['lifetime_cost'] / dogs_copy['longevity']
dogs_copy
Out[25]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0 1628.90
... ... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5 1840.95
Mastiff working 13581.0 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20022.0 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

  • Note that we never reassigned dogs_copy in the cell above – that is, we never wrote dogs_copy = ... – though it was still modified.

Mutability¶

  • DataFrames, like lists, arrays, and dictionaries, are mutable. As we saw in Lecture 2, this means they can be modified in-place after creation.
  • Not only does this explain the behavior on the previous slide, but it also explains the following:
In [26]:
dogs_copy
Out[26]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0 1628.90
... ... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5 1840.95
Mastiff working 13581.0 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20022.0 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

In [27]:
def cost_in_thousands():
    dogs_copy['lifetime_cost'] = dogs_copy['lifetime_cost'] / 1000
In [28]:
# What happens when we run this twice? Three times?
cost_in_thousands()
In [29]:
dogs_copy
Out[29]:
kind lifetime_cost longevity size weight height cost_per_year
breed
Brittany sporting 22.59 12.92 medium 35.0 19.0 1748.37
Cairn Terrier terrier 21.99 13.84 small 14.0 10.0 1589.02
English Cocker Spaniel sporting 18.99 11.66 medium 30.0 16.0 1628.90
... ... ... ... ... ... ... ...
Bullmastiff working 13.94 7.57 large 115.0 25.5 1840.95
Mastiff working 13.58 6.50 large 175.0 30.0 2089.38
Saint Bernard working 20.02 7.78 large 155.0 26.5 2573.52

43 rows × 7 columns

⚠️ Warning: Avoid mutation when possible!¶

  • Note that dogs_copy was modified, even though we didn't reassign it! These unintended consequences can influence the behavior of test cases on homeworks, among other things!
  • To avoid this, it's a good idea to avoid mutation when possible. If you must use mutation, include df = df.copy() as the first line in functions that take DataFrames as input.
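The df = df.copy() pattern described above might look like this in practice (a sketch; the function name and toy data are hypothetical, not from our dogs dataset):

```python
import pandas as pd

def cost_in_thousands_safe(df):
    # Work on a copy, so the caller's DataFrame is never mutated.
    df = df.copy()
    df['lifetime_cost'] = df['lifetime_cost'] / 1000
    return df

costs = pd.DataFrame({'lifetime_cost': [22589.0, 21992.0]})
result = cost_in_thousands_safe(costs)

print(result['lifetime_cost'].tolist())  # [22.589, 21.992]
print(costs['lifetime_cost'].tolist())   # original is unchanged
```

Unlike the cost_in_thousands function we'll see shortly, running this any number of times leaves the input untouched.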

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

pandas and numpy¶



pandas is built upon numpy!¶

  • A Series in pandas is a numpy array with an index.
  • A DataFrame is like a dictionary of columns, each of which is a numpy array.
  • Many operations in pandas are fast because they use numpy's implementations, which are written in fast, compiled languages like C.
  • If you need to access the array underlying a DataFrame or Series, use the to_numpy method.
In [30]:
dogs['lifetime_cost']
Out[30]:
breed
Brittany                  22589.0
Cairn Terrier             21992.0
English Cocker Spaniel    18993.0
                           ...   
Bullmastiff               13936.0
Mastiff                   13581.0
Saint Bernard             20022.0
Name: lifetime_cost, Length: 43, dtype: float64
In [31]:
dogs['lifetime_cost'].to_numpy() 
Out[31]:
array([22589., 21992., 18993., ..., 13936., 13581., 20022.])

Reference Slide¶

pandas data types¶

  • Each Series (column) has a numpy data type, which refers to the type of the values stored within. Access it using the dtypes attribute.
  • A column's data type determines which operations can be applied to it.
  • pandas tries to guess the correct data types for a given DataFrame, and is often wrong.
    • This can lead to incorrect calculations and poor memory/time performance.
  • As a result, you will often need to explicitly convert between data types.
In [32]:
dogs
Out[32]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [33]:
dogs.dtypes
Out[33]:
kind              object
lifetime_cost    float64
longevity        float64
size              object
weight           float64
height           float64
dtype: object

Reference Slide¶

pandas data types¶

  • Notice that Python str types are object types in numpy and pandas.
Pandas dtype | Python type | NumPy type | SQL type | Usage
int64 | int | int_, int8, ..., int64, uint8, ..., uint64 | INT, BIGINT | Integer numbers
float64 | float | float_, float16, float32, float64 | FLOAT | Floating point numbers
bool | bool | bool_ | BOOL | True/False values
datetime64 or Timestamp | datetime.datetime | datetime64 | DATETIME | Date and time values
timedelta64 or Timedelta | datetime.timedelta | timedelta64 | NA | Differences between two datetimes
category | NA | NA | ENUM | Finite list of text values
object | str | string, unicode | NA | Text
object | NA | object | NA | Mixed types
  • This article details how pandas stores different data types under the hood.
  • This article explains how numpy/pandas int64 operations differ from vanilla int operations.

Reference Slide¶

Type conversion¶

  • You can change the data type of a Series using the .astype Series method.
  • For example, we can change the data type of the 'lifetime_cost' column in dogs to be uint32:
In [34]:
dogs
Out[34]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

43 rows × 6 columns

In [35]:
# Gives the types as well as the space taken up by the DataFrame.
dogs.info()
<class 'pandas.core.frame.DataFrame'>
Index: 43 entries, Brittany to Saint Bernard
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   kind           43 non-null     object 
 1   lifetime_cost  43 non-null     float64
 2   longevity      43 non-null     float64
 3   size           43 non-null     object 
 4   weight         43 non-null     float64
 5   height         43 non-null     float64
dtypes: float64(4), object(2)
memory usage: 3.4+ KB
In [36]:
dogs['lifetime_cost'] = dogs['lifetime_cost'].astype('uint32')
  • Now, the DataFrame takes up less space! This may be insignificant in our DataFrame, but makes a difference when working with larger datasets.
In [37]:
dogs.info()
<class 'pandas.core.frame.DataFrame'>
Index: 43 entries, Brittany to Saint Bernard
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   kind           43 non-null     object 
 1   lifetime_cost  43 non-null     uint32 
 2   longevity      43 non-null     float64
 3   size           43 non-null     object 
 4   weight         43 non-null     float64
 5   height         43 non-null     float64
dtypes: float64(3), object(2), uint32(1)
memory usage: 3.2+ KB

Reference Slide¶

Setting dtypes in read_csv¶

Usually, we prefer to set the correct dtypes in read_csv, since it can help pandas load in files more quickly:

In [38]:
dogs_new = pd.read_csv('data/dogs43.csv', dtype={'lifetime_cost': 'uint32'})
dogs_new
Out[38]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993 11.66 medium 30.0 16.0
... ... ... ... ... ... ... ...
40 Bullmastiff working 13936 7.57 large 115.0 25.5
41 Mastiff working 13581 6.50 large 175.0 30.0
42 Saint Bernard working 20022 7.78 large 155.0 26.5

43 rows × 7 columns

In [39]:
dogs_new.dtypes
Out[39]:
breed             object
kind              object
lifetime_cost     uint32
longevity        float64
size              object
weight           float64
height           float64
dtype: object

Axes¶

  • The rows and columns of a DataFrame are both stored as Series.
  • The axis specifies the direction of a slice of a DataFrame.
  • Axis 0 refers to the rows and axis 1 refers to the columns.
    These are the same axes definitions that 2D numpy arrays have!
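The numpy analogy can be checked directly; a minimal sketch on a small 2D array:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# axis=0 aggregates *down* the rows, producing one value per column.
print(arr.sum(axis=0))  # [5 7 9]

# axis=1 aggregates *across* the columns, producing one value per row.
print(arr.sum(axis=1))  # [6 15]
```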

DataFrame methods with axis¶

  • Many Series methods work on DataFrames.
  • In such cases, the DataFrame method usually applies the Series method to every row or column.
  • Many of these methods accept an axis argument; the default is usually axis=0.
In [40]:
dogs
Out[40]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936 7.57 large 115.0 25.5
Mastiff working 13581 6.50 large 175.0 30.0
Saint Bernard working 20022 7.78 large 155.0 26.5

43 rows × 6 columns

In [41]:
# Max element in each column.
dogs.max() 
Out[41]:
kind             working
lifetime_cost      26686
longevity           16.5
size               small
weight             175.0
height              30.0
dtype: object
In [42]:
# Max element in each row – a little nonsensical, since the values in each column are on different scales.
# Note that we had to select the numeric columns first.
dogs[['lifetime_cost', 'longevity', 'weight', 'height']].max(axis=1)
Out[42]:
breed
Brittany                  22589.0
Cairn Terrier             21992.0
English Cocker Spaniel    18993.0
                           ...   
Bullmastiff               13936.0
Mastiff                   13581.0
Saint Bernard             20022.0
Length: 43, dtype: float64
In [43]:
# The number of unique values in each column.
dogs.nunique() 
Out[43]:
kind              7
lifetime_cost    43
longevity        40
size              3
weight           37
height           30
dtype: int64
In [44]:
# describe doesn't accept an axis argument; it works on every numeric column in the DataFrame it is called on.
dogs.describe() 
Out[44]:
lifetime_cost longevity weight height
count 43.00 43.00 43.00 43.00
mean 20532.84 11.34 49.35 18.34
std 3290.78 2.05 39.42 6.83
... ... ... ... ...
50% 21006.00 11.81 36.50 18.50
75% 22072.50 12.52 67.50 25.00
max 26686.00 16.50 175.00 30.00

8 rows × 4 columns

Activity

Pick a dog breed that you personally like or know the name of. Then:

  • Try to find a few other dog breeds that are similar in weight to yours in all_dogs.
  • Which similar breeds have the lowest and highest 'lifetime_cost'? 'intelligence_rank'?
  • Are there any similar breeds that you haven't heard of before?

For fun, look up these dog breeds on the AKC website to see what they look like!
In [45]:
all_dogs = pd.read_csv('data/all_dogs.csv')
all_dogs
Out[45]:
breed group datadog popularity_all ... megarank size weight height
0 Border Collie herding 3.64 45 ... 29.0 medium NaN 20.0
1 Border Terrier terrier 3.61 80 ... 1.0 small 13.5 NaN
2 Brittany sporting 3.54 30 ... 11.0 medium 35.0 19.0
... ... ... ... ... ... ... ... ... ...
169 Wire Fox Terrier terrier NaN 100 ... NaN small 17.5 15.0
170 Wirehaired Pointing Griffon sporting NaN 92 ... NaN medium NaN 22.0
171 Xoloitzcuintli non-sporting NaN 155 ... NaN medium NaN 16.5

172 rows × 18 columns

In [46]:
# There's no "right answer" here; you're supposed to explore!
fav_weight = all_dogs.loc[all_dogs['breed'] == 'English Cocker Spaniel', 'weight'].iloc[0]
similar_weight = all_dogs[(all_dogs['weight'] >= fav_weight - 5) & (all_dogs['weight'] <= fav_weight + 5)]
similar_weight
Out[46]:
breed group datadog popularity_all ... megarank size weight height
2 Brittany sporting 3.54 30 ... 11.0 medium 35.0 19.00
5 English Cocker Spaniel sporting 3.33 63 ... 5.0 medium 30.0 16.00
6 Cocker Spaniel sporting 3.30 27 ... 6.0 small 25.0 14.50
... ... ... ... ... ... ... ... ... ...
125 Irish Terrier terrier NaN 132 ... NaN medium 26.0 18.00
138 Norwegian Buhund herding NaN 165 ... NaN medium 33.0 17.25
159 Soft-Coated Wheaten Terrier terrier NaN 52 ... NaN medium 35.0 18.00

12 rows × 18 columns

In [47]:
similar_weight.sort_values('intelligence_rank')[['breed', 'lifetime_cost', 'intelligence_rank']] 
Out[47]:
breed lifetime_cost intelligence_rank
57 Pembroke Welsh Corgi 23978.0 11.0
5 English Cocker Spaniel 18993.0 18.0
2 Brittany 22589.0 19.0
... ... ... ...
73 French Bulldog 17266.0 58.0
117 Glen of Imaal Terrier NaN NaN
138 Norwegian Buhund NaN NaN

12 rows × 3 columns

Introduction to the groupby method¶


Example: Palmer Penguins¶

Artwork by @allison_horst

The dataset we'll work with for the rest of the lecture involves various measurements taken of three species of penguins in Antarctica.

In [48]:
IFrame('https://www.youtube-nocookie.com/embed/CCrNAHXUstU?si=-DntSyUNp5Kwitjm&start=11',
       width=560, height=315)
Out[48]:

Loading the data¶

In [49]:
penguins = sns.load_dataset('penguins').dropna().reset_index(drop=True)
penguins
Out[49]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
... ... ... ... ... ... ... ...
330 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
331 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
332 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

333 rows × 7 columns

  • Here, each row corresponds to a single penguin, and each column corresponds to a different attribute (or feature) we have for each penguin.
  • Data formatted in this way is sometimes called tidy data.

Visualizing the data¶

In [50]:
penguins.plot(kind='scatter', 
              x='bill_length_mm', 
              y='body_mass_g', 
              color='species', 
              title='Body Mass vs. Bill Length')

Granularity¶

  • Granularity refers to what each observation in a dataset represents.
    • Fine: small details.
    • Coarse: bigger picture.
  • If you can control how your dataset is created, you should opt for finer granularity, i.e. for more detail.
    • You can always remove details, but it's difficult to add detail that isn't already there.
    • But obtaining fine-grained data can take more time/money.
  • Today, we'll focus on how to remove details from fine-grained data, in order to help us understand bigger-picture trends in our data.

Aggregating¶

  • Aggregating is the act of combining many values into a single value.
  • What is the mean 'body_mass_g' for all penguins?
In [51]:
penguins['body_mass_g'].mean() 
Out[51]:
4207.057057057057
  • What is the mean 'body_mass_g' for each 'species'?

A naïve approach to finding the mean 'body_mass_g' per 'species'¶

  • First, we could identify all unique values in the 'species' column.
In [52]:
penguins['species'].unique() 
Out[52]:
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
  • Then, for each 'species', we could:
    1. Query for just that 'species'.
    2. Extract the 'body_mass_g' column and use the mean method on it.
In [53]:
penguins.loc[penguins['species'] == 'Adelie', 'body_mass_g'].mean() 
Out[53]:
3706.1643835616437
In [54]:
penguins.loc[penguins['species'] == 'Chinstrap', 'body_mass_g'].mean() 
Out[54]:
3733.0882352941176
In [55]:
penguins.loc[penguins['species'] == 'Gentoo', 'body_mass_g'].mean() 
Out[55]:
5092.436974789916
  • We could use a for-loop, but remember, we want to avoid Python for-loops.
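For completeness, the loop-based version of the naïve approach might look like the sketch below (on hypothetical toy data, so it runs on its own). We'll prefer groupby over this pattern.

```python
import pandas as pd

# Toy stand-in for the penguins DataFrame.
penguins = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'body_mass_g': [3750.0, 3800.0, 5400.0],
})

# One query + one mean per unique species – a Python-level loop.
means = {}
for species in penguins['species'].unique():
    means[species] = penguins.loc[
        penguins['species'] == species, 'body_mass_g'
    ].mean()

print(means)  # {'Adelie': 3775.0, 'Gentoo': 5400.0}
```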

Grouping¶

  • A better solution is to use the groupby method.
In [56]:
# To find the overall mean 'body_mass_g':
penguins['body_mass_g'].mean() 
Out[56]:
4207.057057057057
In [57]:
# To find the mean 'body_mass_g' for each 'species':
penguins.groupby('species')['body_mass_g'].mean() 
Out[57]:
species
Adelie       3706.16
Chinstrap    3733.09
Gentoo       5092.44
Name: body_mass_g, dtype: float64
  • Somehow, the groupby method computes what we're looking for in just one line. How?
  • We'll work through the internals, but remember this: if you need to calculate something for each group, use groupby!

An illustrative example: Pets 🐱 🐶🐹¶

  • Consider the DataFrame pets, shown below.
Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0
  • Let's see what happens under the hood when we use the groupby method on pets.
In [58]:
show_grouping_animation()

Let's try it out!¶

In [59]:
pets = pd.DataFrame().assign(
    Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
    Color=['black', 'golden', 'black', 'white', 'golden', 'golden'],
    Weight=[40, 15, 20, 80, 25, 1],
    Age=[5, 8, 9, 2, 0.5, 3]
)
pets
Out[59]:
Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0
In [60]:
# Why does this error?
pets.groupby('Species').mean()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1870, in GroupBy._agg_py_fallback(self, how, values, ndim, alt)
   1869 try:
-> 1870     res_values = self.grouper.agg_series(ser, alt, preserve_dtype=True)
   1871 except Exception as err:

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/groupby/ops.py:850, in BaseGrouper.agg_series(self, obj, func, preserve_dtype)
    848     preserve_dtype = True
--> 850 result = self._aggregate_series_pure_python(obj, func)
    852 npvalues = lib.maybe_convert_objects(result, try_float=False)

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/groupby/ops.py:871, in BaseGrouper._aggregate_series_pure_python(self, obj, func)
    870 for i, group in enumerate(splitter):
--> 871     res = func(group)
    872     res = extract_result(res)

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:2376, in GroupBy.mean.<locals>.<lambda>(x)
   2373 else:
   2374     result = self._cython_agg_general(
   2375         "mean",
-> 2376         alt=lambda x: Series(x).mean(numeric_only=numeric_only),
   2377         numeric_only=numeric_only,
   2378     )
   2379     return result.__finalize__(self.obj, method="groupby")

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/series.py:6226, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
   6218 @doc(make_doc("mean", ndim=1))
   6219 def mean(
   6220     self,
   (...)
   6224     **kwargs,
   6225 ):
-> 6226     return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/generic.py:11969, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  11962 def mean(
  11963     self,
  11964     axis: Axis | None = 0,
   (...)
  11967     **kwargs,
  11968 ) -> Series | float:
> 11969     return self._stat_function(
  11970         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  11971     )

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/generic.py:11926, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  11924 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 11926 return self._reduce(
  11927     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  11928 )

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/series.py:6134, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   6130     raise TypeError(
   6131         f"Series.{name} does not allow {kwd_name}={numeric_only} "
   6132         "with non-numeric dtypes."
   6133     )
-> 6134 return op(delegate, skipna=skipna, **kwds)

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
    719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/nanops.py:1693, in _ensure_numeric(x)
   1691 if isinstance(x, str):
   1692     # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1693     raise TypeError(f"Could not convert string '{x}' to numeric")
   1694 try:

TypeError: Could not convert string 'goldenblack' to numeric

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[60], line 2
      1 # Why does this error?
----> 2 pets.groupby('Species').mean()

(... traceback through pandas internals omitted ...)
TypeError: agg function failed [how->mean,dtype->object]
In [61]:
pets.groupby('Species')[['Weight', 'Age']].mean() 
Out[61]:
Weight Age
Species
cat 17.50 8.5
dog 48.33 2.5
hamster 1.00 3.0
In [62]:
pets.groupby('Species').max() 
Out[62]:
Color Weight Age
Species
cat golden 20 9.0
dog white 80 5.0
hamster golden 1 3.0

It takes several steps to go from the original pets DataFrame to this grouped DataFrame, but we don't get to see any of Python's inner workings, just the final output.

"Split-apply-combine" paradigm¶

  • The groupby method involves three steps: split, apply, and combine.
    This is the same terminology that the pandas documentation uses.
  • Split breaks up and "groups" the rows of a DataFrame according to the specified key.
    There is one "group" for every unique value of the key.
  • Apply uses a function (e.g. aggregation, transformation, filtration) within the individual groups.
  • Combine stitches the results of these operations into an output DataFrame.
  • The split-apply-combine pattern can be parallelized to work on multiple computers or threads, by sending computations for each group to different processors.
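The three steps can also be sketched by hand. Below is a minimal, purely conceptual illustration (on a hypothetical toy DataFrame, not one from lecture) of what groupby does for you automatically:

```python
import pandas as pd

# Toy DataFrame, for illustration only.
df = pd.DataFrame({'kind': ['a', 'b', 'a', 'b'],
                   'value': [1, 2, 3, 4]})

# Split: one sub-DataFrame per unique value of the key.
groups = {key: sub_df for key, sub_df in df.groupby('kind')}

# Apply: aggregate within each group separately.
means = {key: sub_df['value'].mean() for key, sub_df in groups.items()}

# Combine: stitch the per-group results into one output Series.
result = pd.Series(means)
```

In practice, `df.groupby('kind')['value'].mean()` performs all three steps at once (and much faster); the dictionary version is just to make the split and combine steps visible.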

More examples¶

Before we dive into the internals, let's look at a few more examples.

Activity

Which 'species' has the highest median 'bill_length_mm'?

In [63]:
(
    penguins
    .groupby('species')
    ['bill_length_mm']
    .median()
    .idxmax()
)
Out[63]:
'Chinstrap'
In [64]:
(
    penguins
    .groupby('species')
    ['bill_length_mm']
    .median()
    .plot(kind='barh', title='Median Bill Length of Each Species')
)
In [ ]:
 

Activity

What proportion of penguins of each 'species' live on 'Dream' island?

In [65]:
(
    penguins
    .assign(is_Dream=penguins['island'] == 'Dream')
    .groupby('species')
    ['is_Dream']
    .mean()
)
Out[65]:
species
Adelie       0.38
Chinstrap    1.00
Gentoo       0.00
Name: is_Dream, dtype: float64
In [ ]:
 

groupby's inner workings¶


How does groupby actually work?¶

  • We've just evaluated a few expressions of the following form.
In [66]:
penguins.groupby('species')['bill_length_mm'].mean()
Out[66]:
species
Adelie       38.82
Chinstrap    48.83
Gentoo       47.57
Name: bill_length_mm, dtype: float64
  • There are three "building blocks" in the above expression:
    1. penguins.groupby('species'): first, we specify which column we want to group on.
    2. ['bill_length_mm']: then, we select the other relevant columns for our calculations.
    3. .mean(): finally, we use an aggregation method.

  • Let's see what each block contributes to the output.

DataFrameGroupBy objects¶


penguins.groupby('species')['bill_length_mm'].mean()

  • If df is a DataFrame, then df.groupby(key) returns a DataFrameGroupBy object.
    This object represents the "split" in "split-apply-combine".
In [67]:
# Simplified DataFrame for demonstration:
penguins_small = penguins.iloc[[0, 150, 300, 1, 251, 151, 301], [0, 5, 6]]
penguins_small
Out[67]:
species body_mass_g sex
0 Adelie 3750.0 Male
150 Chinstrap 3725.0 Male
300 Gentoo 4875.0 Female
1 Adelie 3800.0 Female
251 Gentoo 4350.0 Female
151 Chinstrap 3950.0 Female
301 Gentoo 5550.0 Male
In [68]:
# Creates one group for each unique value in the species column.
penguins_small.groupby('species')
Out[68]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x309c06740>
  • DataFrameGroupBy objects have a groups attribute, which is a dictionary in which the keys are group names and the values are lists of row labels.
    We won't actually use this, but it's helpful in understanding how groupby works under the hood.
In [69]:
penguins_small.groupby('species').groups
Out[69]:
{'Adelie': [0, 1], 'Chinstrap': [150, 151], 'Gentoo': [300, 251, 301]}
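Relatedly, if you want to look at one group's rows directly, DataFrameGroupBy objects have a get_group method. A quick sketch, on a hypothetical mini-DataFrame standing in for penguins_small:

```python
import pandas as pd

# Hypothetical stand-in for penguins_small.
demo = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Adelie'],
                     'body_mass_g': [3750.0, 4875.0, 3800.0]})

# get_group returns the sub-DataFrame for a single key,
# i.e. one piece of the "split" step.
adelie = demo.groupby('species').get_group('Adelie')
```

Like .groups, this is mostly useful for inspection and debugging, not for everyday analysis.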

Column extraction¶


penguins.groupby('species')['bill_length_mm'].mean()

  • After creating a DataFrameGroupBy object, we typically select the relevant column(s) that we want to aggregate.
  • The result is either a SeriesGroupBy or DataFrameGroupBy object, depending on what's passed in.
In [70]:
penguins.groupby('species')['bill_length_mm'] 
Out[70]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x168927400>
In [71]:
penguins.groupby('species')[['bill_length_mm', 'bill_depth_mm']] 
Out[71]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x309bc25c0>
  • As we've seen already, you should select the columns you want to aggregate before using your aggregation method, or you may run into errors!
In [72]:
# This errors, because there are non-numeric columns in penguins
# that it's trying to take the "mean" of, like 'island'.
penguins.groupby('species').mean() 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
(... traceback through pandas internals omitted ...)

TypeError: Could not convert string 'TorgersenTorgersen...Dream' to numeric

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[72], line 3
      1 # This errors, because there are non-numeric columns in penguins
      2 # that it's trying to take the "mean" of, like 'island'.
----> 3 penguins.groupby('species').mean()

(... traceback through pandas internals omitted ...)
TypeError: agg function failed [how->mean,dtype->object]
In [73]:
penguins.groupby('species')[['bill_length_mm', 'bill_depth_mm']].mean() 
Out[73]:
bill_length_mm bill_depth_mm
species
Adelie 38.82 18.35
Chinstrap 48.83 18.42
Gentoo 47.57 15.00
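Another way to avoid the error above is the numeric_only=True argument, which tells the aggregation method to silently skip non-numeric columns. A sketch, on a hypothetical toy DataFrame (selecting columns explicitly, as above, is usually clearer about your intent):

```python
import pandas as pd

# Toy stand-in with one non-numeric column.
demo = pd.DataFrame({'species': ['Adelie', 'Adelie', 'Gentoo'],
                     'island': ['Torgersen', 'Biscoe', 'Biscoe'],
                     'body_mass_g': [3750.0, 3800.0, 4875.0]})

# numeric_only=True drops 'island' before averaging,
# so no TypeError is raised.
out = demo.groupby('species').mean(numeric_only=True)
```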

Aggregation¶


penguins.groupby('species')['bill_length_mm'].mean()

  • Once we create a DataFrameGroupBy or SeriesGroupBy object, we need to apply some function to each group, and combine the results.
  • The most common operation we apply to each group is an aggregation, but we'll see examples of filtrations and transformations soon.
    Remember, aggregation is the act of combining many values into a single value.
  • To perform an aggregation, use an aggregation method on the DataFrameGroupBy or SeriesGroupBy object, e.g. .mean(), .max(), or .median().

Let's look at some examples.

In [74]:
penguins_small
Out[74]:
species body_mass_g sex
0 Adelie 3750.0 Male
150 Chinstrap 3725.0 Male
300 Gentoo 4875.0 Female
1 Adelie 3800.0 Female
251 Gentoo 4350.0 Female
151 Chinstrap 3950.0 Female
301 Gentoo 5550.0 Male
In [75]:
penguins_small.groupby('species')['body_mass_g'].mean() 
Out[75]:
species
Adelie       3775.0
Chinstrap    3837.5
Gentoo       4925.0
Name: body_mass_g, dtype: float64
In [76]:
# Note that this worked on the entire DataFrame!
# But, if all we wanted were the sums of 'body_mass_g'
# for each species, this is slower than
# penguins_small.groupby('species')['body_mass_g'].sum().
penguins_small.groupby('species').sum() 
Out[76]:
body_mass_g sex
species
Adelie 7550.0 MaleFemale
Chinstrap 7675.0 MaleFemale
Gentoo 14775.0 FemaleFemaleMale
In [77]:
# Often used in conjunction with sort_values.
# Remember this when you work on the activity in a few slides!
penguins_small.groupby('species').last() 
Out[77]:
body_mass_g sex
species
Adelie 3800.0 Female
Chinstrap 3950.0 Female
Gentoo 5550.0 Male
In [78]:
penguins_small.groupby('species').max() 
Out[78]:
body_mass_g sex
species
Adelie 3800.0 Male
Chinstrap 3950.0 Male
Gentoo 5550.0 Male

Column independence¶

  • Within each group, the aggregation method is applied to each column independently.
In [79]:
penguins_small.groupby('species').max()
Out[79]:
body_mass_g sex
species
Adelie 3800.0 Male
Chinstrap 3950.0 Male
Gentoo 5550.0 Male
  • The above result is not telling us that there is a 'Male' 'Adelie' penguin with a 'body_mass_g' of 3800.0!
In [80]:
# This penguin is Female!
penguins_small.loc[(penguins['species'] == 'Adelie') & (penguins['body_mass_g'] == 3800.0)]
Out[80]:
species body_mass_g sex
1 Adelie 3800.0 Female
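If what you actually want is the entire row achieving the maximum within each group (rather than independent per-column maxes), one common pattern is groupby + idxmax, then .loc. A sketch on a hypothetical toy DataFrame:

```python
import pandas as pd

demo = pd.DataFrame({'species': ['Adelie', 'Adelie', 'Gentoo'],
                     'body_mass_g': [3750.0, 3800.0, 4875.0],
                     'sex': ['Male', 'Female', 'Female']})

# idxmax gives the row label of each group's maximum,
# so .loc recovers full rows, keeping columns consistent.
heaviest = demo.loc[demo.groupby('species')['body_mass_g'].idxmax()]
```

Here, unlike with .max(), each output row really exists in the original DataFrame.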

Activity

Find the 'species', 'island', and 'body_mass_g' of the heaviest 'Male' and 'Female' penguins in penguins (not penguins_small).

In [81]:
# General idea: Sort the penguins by mass in decreasing order.
# Then, the first male penguin that appears is the heaviest male penguin,
# and the first female penguin that appears is the heaviest female penguin.
# For each sex, take the first row.
(
    penguins
    .sort_values('body_mass_g', ascending=False)
    .groupby('sex')
    .first()
)
Out[81]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
sex
Female Gentoo Biscoe 45.2 14.8 212.0 5200.0
Male Gentoo Biscoe 49.2 15.2 221.0 6300.0

Beyond default aggregation methods¶

  • There are many built-in aggregation methods.
  • What if you want to apply different aggregation methods to different columns?
  • What if the aggregation method you want to use doesn't already exist in pandas?

The aggregate method¶

  • DataFrameGroupBy and SeriesGroupBy objects have a general aggregate method, which aggregates using one or more operations.
    Remember, aggregation is the act of combining many values into a single value.
  • There are many ways of using aggregate; refer to the documentation for a comprehensive list.
    Per the documentation, agg is an alias for aggregate.
  • Example arguments:
    • A single function.
    • A list of functions.
    • A dictionary mapping column names to functions.
  • We've attached a Reference Slide with examples.

Reference Slide¶

Examples¶

  • How many penguins are there of each 'species', and what is the mean 'body_mass_g' of each 'species'?
In [82]:
(
    penguins
    .groupby('species')
    ['body_mass_g']
    .aggregate(['count', 'mean'])
)
Out[82]:
count mean
species
Adelie 146 3706.16
Chinstrap 68 3733.09
Gentoo 119 5092.44
  • What is the maximum 'bill_length_mm' of each 'species', and which 'island'(s) is each 'species' found on?
In [83]:
(
    penguins
    .groupby('species')
    .agg({'bill_length_mm': 'max', 'island': 'unique'})
)
Out[83]:
bill_length_mm island
species
Adelie 46.0 [Torgersen, Biscoe, Dream]
Chinstrap 58.0 [Dream]
Gentoo 59.6 [Biscoe]
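agg also supports "named aggregation," in which each keyword argument names an output column and maps to a (column, aggregation) pair. A sketch on a hypothetical toy DataFrame:

```python
import pandas as pd

demo = pd.DataFrame({'species': ['Adelie', 'Adelie', 'Gentoo'],
                     'body_mass_g': [3750.0, 3800.0, 4875.0]})

# Each keyword becomes an output column name; its value is a
# (column to aggregate, aggregation method) pair.
summary = demo.groupby('species').agg(
    n=('body_mass_g', 'count'),
    mean_mass=('body_mass_g', 'mean'),
)
```

This is handy when you apply several aggregations to the same column and want readable output column names.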

Activity

What is the interquartile range of the 'body_mass_g' of each 'species'?

The interquartile range of a distribution is defined as:

$$\text{75th percentile} - \text{25th percentile}$$

*Hint*: Use np.percentile, and pass agg/aggregate a custom function.

In [84]:
# Here, the argument to agg is a function,
# which takes in a Series and returns a scalar.
def iqr(s):
    return np.percentile(s, 75) - np.percentile(s, 25)
(
    penguins
    .groupby('species')
    ['body_mass_g']
    .agg(iqr)
)
Out[84]:
species
Adelie       637.5
Chinstrap    462.5
Gentoo       800.0
Name: body_mass_g, dtype: float64

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

What questions do you have?

Split-apply-combine, revisited¶

  • When we introduced the split-apply-combine pattern, the "apply" step involved aggregation – our final DataFrame had one row for each group.
  • Instead of aggregating during the apply step, we could instead perform a filtration, in which we keep only the groups that satisfy some condition.
  • Or a transformation, in which we perform operations to every value within each group.

Grouping, then filtering¶

  • To keep only the groups that satisfy a particular condition, use the filter method on a DataFrameGroupBy/SeriesGroupBy object.
    The filter method takes in a function, which itself takes in a DataFrame/Series and returns a single Boolean. The result is a new DataFrame/Series with only the groups for which the filter function returned True.
  • For example, suppose we want only the 'species' whose average 'bill_length_mm' is above 39.
In [85]:
(
    penguins
    .groupby('species')
    .filter(lambda df: df['bill_length_mm'].mean() > 39)
)
Out[85]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
146 Chinstrap Dream 46.5 17.9 192.0 3500.0 Female
147 Chinstrap Dream 50.0 19.5 196.0 3900.0 Male
148 Chinstrap Dream 51.3 19.2 193.0 3650.0 Male
... ... ... ... ... ... ... ...
330 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
331 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
332 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

187 rows × 7 columns

  • No more 'Adelie's!

Activity

Create a new DataFrame with only the rows in penguins for popular 'species' – that is, 'species' with at least 100 penguins.

In [86]:
(
    penguins
    .groupby('species')
    .filter(lambda df: df.shape[0] >= 100)
)
Out[86]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
... ... ... ... ... ... ... ...
330 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
331 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
332 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

265 rows × 7 columns

In [87]:
# Note that to just find the 'species' with at least 100 penguins,
# we didn't need to group:
penguins['species'].value_counts()
Out[87]:
species
Adelie       146
Gentoo       119
Chinstrap     68
Name: count, dtype: int64

Reference Slide¶

Example: Z-Scoring¶

  • Suppose we want to convert the 'body_mass_g' column to z-scores (i.e. standard units):
$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$
In [88]:
def z_score(x):
    return (x - x.mean()) / x.std(ddof=0)
In [89]:
z_score(penguins['body_mass_g'])
Out[89]:
0     -0.57
1     -0.51
2     -1.19
       ... 
330    1.92
331    1.23
332    1.48
Name: body_mass_g, Length: 333, dtype: float64

Reference Slide¶

Grouping, then transforming¶

  • Now, what if we wanted the z-score within each group?
  • To do so, we can use the transform method on a DataFrameGroupBy object. The transform method takes in a function, which itself takes in a Series and returns a new Series.
  • A transformation produces a DataFrame or Series of the same size – it is not an aggregation!
In [90]:
z_mass = (penguins
          .groupby('species')
          ['body_mass_g']
          .transform(z_score))
z_mass
Out[90]:
0      0.10
1      0.21
2     -1.00
       ... 
330    1.32
331    0.22
332    0.62
Name: body_mass_g, Length: 333, dtype: float64
In [91]:
penguins.assign(z_mass=z_mass)
Out[91]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex z_mass
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male 0.10
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female 0.21
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female -1.00
... ... ... ... ... ... ... ... ...
330 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male 1.32
331 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female 0.22
332 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male 0.62

333 rows × 8 columns

In [92]:
display_df(penguins.assign(z_mass=z_mass), rows=8)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex z_mass
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male 0.10
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female 0.21
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female -1.00
3 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female -0.56
... ... ... ... ... ... ... ... ...
329 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female -0.49
330 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male 1.32
331 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female 0.22
332 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male 0.62

333 rows × 8 columns

  • Note that above, penguin 329 has a larger 'body_mass_g' than penguin 0, but a lower 'z_mass'.
    • Penguin 0 has an above average 'body_mass_g' among 'Adelie' penguins.
    • Penguin 329 has a below average 'body_mass_g' among 'Gentoo' penguins. Remember from earlier that the average 'body_mass_g' of 'Gentoo' penguins is much higher than for other species.

What's next?¶

  • Can we group on multiple columns at once?
  • What does pivot_table do?
  • How do we combine two DataFrames with information about similar individuals?
  • How do we deal with missing values?
  • How do we decide which type of visualization to create?