agg
, filter
, transform
, and apply
GroupBy Methods¶
EECS 398-003: Practical Data Science, Fall 2024¶
Live from Office Hours¶
In office hours on Tuesday, October 8th, we walked through examples of how agg
, filter
, transform
, and apply
work. Watch a walkthrough of the notebook here: https://www.loom.com/share/4171bfc0e71c4722b10a6c8472d08cb5?sid=2dde2a35-e2a1-4619-8a16-5efb8570362d.
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna()
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
agg¶
Average bill length per species.
penguins.groupby('species')['bill_length_mm'].mean()
species Adelie 38.823973 Chinstrap 48.833824 Gentoo 47.568067 Name: bill_length_mm, dtype: float64
Average bill length and average bill depth per species.
penguins.groupby('species')[['bill_length_mm', 'bill_depth_mm']].mean()
bill_length_mm | bill_depth_mm | |
---|---|---|
species | ||
Adelie | 38.823973 | 18.347260 |
Chinstrap | 48.833824 | 18.420588 |
Gentoo | 47.568067 | 14.996639 |
# penguins.groupby('species').mean() # Need to select relevant columns first!
penguins.groupby('species')['bill_length_mm'].agg(['min', 'max'])
min | max | |
---|---|---|
species | ||
Adelie | 32.1 | 46.0 |
Chinstrap | 40.9 | 58.0 |
Gentoo | 40.9 | 59.6 |
penguins.groupby('species').agg({'bill_length_mm': ['max', 'min'],
'bill_depth_mm': 'min',
'island': 'nunique'})
bill_length_mm | bill_depth_mm | island | ||
---|---|---|---|---|
max | min | min | nunique | |
species | ||||
Adelie | 46.0 | 32.1 | 15.5 | 3 |
Chinstrap | 58.0 | 40.9 | 16.4 | 1 |
Gentoo | 59.6 | 40.9 | 13.1 | 1 |
penguins.groupby('species')['bill_length_mm'].mean()
species Adelie 38.823973 Chinstrap 48.833824 Gentoo 47.568067 Name: bill_length_mm, dtype: float64
If giving agg
a function, the function should:
- Take in a Series.
- Return a single value.
(
penguins
.groupby('species')
['bill_length_mm']
.agg(lambda s: np.percentile(s, 75) - np.percentile(s, 25))
)
species Adelie 4.050 Chinstrap 4.725 Gentoo 4.250 Name: bill_length_mm, dtype: float64
filter¶
- Querying: Keeping rows that satisfy conditions.
- Filtering: Keeping groups that satisfy conditions.
Show me the penguins that have a bill length over 40 mm.
penguins[penguins['bill_length_mm'] > 40]
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
12 | Adelie | Torgersen | 41.1 | 17.6 | 182.0 | 3200.0 | Female |
17 | Adelie | Torgersen | 42.5 | 20.7 | 197.0 | 4500.0 | Male |
19 | Adelie | Torgersen | 46.0 | 21.5 | 194.0 | 4200.0 | Male |
26 | Adelie | Biscoe | 40.6 | 18.6 | 183.0 | 3550.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
237 rows × 7 columns
Show me the species with an average bill length over 40 mm.
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
filter
takes in a function, that:
- Takes in a DataFrame.
- Returns a BOOLEAN.
penguins.groupby('species').filter(lambda df: df['bill_length_mm'].mean() > 40)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
152 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | Female |
153 | Chinstrap | Dream | 50.0 | 19.5 | 196.0 | 3900.0 | Male |
154 | Chinstrap | Dream | 51.3 | 19.2 | 193.0 | 3650.0 | Male |
155 | Chinstrap | Dream | 45.4 | 18.7 | 188.0 | 3525.0 | Female |
156 | Chinstrap | Dream | 52.7 | 19.8 | 197.0 | 3725.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
187 rows × 7 columns
mean_lengths = penguins.groupby('species')['bill_length_mm'].mean()
mean_lengths
species Adelie 38.823973 Chinstrap 48.833824 Gentoo 47.568067 Name: bill_length_mm, dtype: float64
mean_lengths[mean_lengths > 40].index
Index(['Chinstrap', 'Gentoo'], dtype='object', name='species')
penguins[penguins['species'].isin(mean_lengths[mean_lengths > 40].index)]
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
152 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | Female |
153 | Chinstrap | Dream | 50.0 | 19.5 | 196.0 | 3900.0 | Male |
154 | Chinstrap | Dream | 51.3 | 19.2 | 193.0 | 3650.0 | Male |
155 | Chinstrap | Dream | 45.4 | 18.7 | 188.0 | 3525.0 | Female |
156 | Chinstrap | Dream | 52.7 | 19.8 | 197.0 | 3725.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
187 rows × 7 columns
Here, the function that filter
accepts:
- Takes in a Series (which we've still called
df
). - Returns a BOOLEAN.
penguins.groupby('species')['bill_length_mm'].filter(lambda df: df.mean() > 40)
152 46.5 153 50.0 154 51.3 155 45.4 156 52.7 ... 338 47.2 340 46.8 341 50.4 342 45.2 343 49.9 Name: bill_length_mm, Length: 187, dtype: float64
transform¶
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
penguins = sns.load_dataset('penguins')
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
penguins.isna().sum()
species 0 island 0 bill_length_mm 2 bill_depth_mm 2 flipper_length_mm 2 body_mass_g 2 sex 11 dtype: int64
def z_score(col):
return (col - col.mean()) / col.std()
z_score(penguins['bill_length_mm'])
0 -0.883205 1 -0.809939 2 -0.663408 3 NaN 4 -1.322799 ... 339 NaN 340 0.527159 341 1.186550 342 0.234097 343 1.094968 Name: bill_length_mm, Length: 344, dtype: float64
$z$-scoring the bill lengths, separately for each species. That is, when doing the transformation for a particular species, it only uses the mean and SD for that species, not all penguins overall.
penguins.groupby('species')['bill_length_mm'].transform(z_score)
0 0.115870 1 0.266054 2 0.566421 3 NaN 4 -0.785232 ... 339 NaN 340 -0.228719 341 0.939408 342 -0.747886 343 0.777168 Name: bill_length_mm, Length: 344, dtype: float64
transform
takes in a function, that:
- Takes in a Series.
- Returns a Series.
apply¶
penguins.groupby('species').apply(lambda df: df['bill_length_mm'].max())
species Adelie 46.0 Chinstrap 58.0 Gentoo 59.6 dtype: float64
equivalent to:
penguins.groupby('species')['bill_length_mm'].max()
species Adelie 46.0 Chinstrap 58.0 Gentoo 59.6 Name: bill_length_mm, dtype: float64
The bill length of the second heaviest penguin per species.
def bill_length_of_second_heaviest(df):
return df.sort_values('body_mass_g', ascending=False).iloc[1].loc['bill_length_mm']
bill_length_of_second_heaviest(penguins)
59.6
In this case, apply
is taking in a function, that:
- Takes in a DataFrame.
- Returns a number. (but it could also return something else!)
penguins.groupby('species').apply(bill_length_of_second_heaviest)
species Adelie 41.0 Chinstrap 52.8 Gentoo 59.6 dtype: float64
Find me all the rows for the three heaviest penguins per species.
def three_heaviest_penguins(penguins):
return penguins.sort_values('body_mass_g', ascending=False).head(3)
three_heaviest_penguins(penguins)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
237 | Gentoo | Biscoe | 49.2 | 15.2 | 221.0 | 6300.0 | Male |
253 | Gentoo | Biscoe | 59.6 | 17.0 | 230.0 | 6050.0 | Male |
297 | Gentoo | Biscoe | 51.1 | 16.3 | 220.0 | 6000.0 | Male |
In this case, apply
takes in a function that:
- Takes in a DataFrame.
- Returns a DataFrame.
penguins.groupby('species').apply(three_heaviest_penguins)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | ||
---|---|---|---|---|---|---|---|---|
species | ||||||||
Adelie | 109 | Adelie | Biscoe | 43.2 | 19.0 | 197.0 | 4775.0 | Male |
101 | Adelie | Biscoe | 41.0 | 20.0 | 203.0 | 4725.0 | Male | |
81 | Adelie | Torgersen | 42.9 | 17.6 | 196.0 | 4700.0 | Male | |
Chinstrap | 189 | Chinstrap | Dream | 52.0 | 20.7 | 210.0 | 4800.0 | Male |
181 | Chinstrap | Dream | 52.8 | 20.0 | 205.0 | 4550.0 | Male | |
191 | Chinstrap | Dream | 53.5 | 19.9 | 205.0 | 4500.0 | Male | |
Gentoo | 237 | Gentoo | Biscoe | 49.2 | 15.2 | 221.0 | 6300.0 | Male |
253 | Gentoo | Biscoe | 59.6 | 17.0 | 230.0 | 6050.0 | Male | |
297 | Gentoo | Biscoe | 51.1 | 16.3 | 220.0 | 6000.0 | Male |
Moral of the story: apply
is more general purpose than agg
/transform
(I believe filter
is an edge-case).
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
penguins['species'].str[0]
0 A 1 A 2 A 3 A 4 A .. 339 G 340 G 341 G 342 G 343 G Name: species, Length: 344, dtype: object
penguins.assign(first_letter_of_species = penguins['species'].str[0])
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | first_letter_of_species | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | A |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | A |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | A |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | A |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | A |
... | ... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN | G |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female | G |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | G |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | G |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male | G |
344 rows × 8 columns