agg, filter, transform, and apply GroupBy Methods¶
EECS 398-003: Practical Data Science, Fall 2024¶
Live from Office Hours¶
In office hours on Tuesday, October 8th, we walked through examples of how agg, filter, transform, and apply work. Watch a walkthrough of the notebook here: https://www.loom.com/share/4171bfc0e71c4722b10a6c8472d08cb5?sid=2dde2a35-e2a1-4619-8a16-5efb8570362d.
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna()
penguins
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
| 343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
agg¶
Average bill length per species.
penguins.groupby('species')['bill_length_mm'].mean()
species Adelie 38.823973 Chinstrap 48.833824 Gentoo 47.568067 Name: bill_length_mm, dtype: float64
Average bill length and average bill depth per species.
penguins.groupby('species')[['bill_length_mm', 'bill_depth_mm']].mean()
| bill_length_mm | bill_depth_mm | |
|---|---|---|
| species | ||
| Adelie | 38.823973 | 18.347260 |
| Chinstrap | 48.833824 | 18.420588 |
| Gentoo | 47.568067 | 14.996639 |
# penguins.groupby('species').mean() # Need to select relevant columns first!
penguins.groupby('species')['bill_length_mm'].agg(['min', 'max'])
| min | max | |
|---|---|---|
| species | ||
| Adelie | 32.1 | 46.0 |
| Chinstrap | 40.9 | 58.0 |
| Gentoo | 40.9 | 59.6 |
penguins.groupby('species').agg({'bill_length_mm': ['max', 'min'],
'bill_depth_mm': 'min',
'island': 'nunique'})
| bill_length_mm | bill_depth_mm | island | ||
|---|---|---|---|---|
| max | min | min | nunique | |
| species | ||||
| Adelie | 46.0 | 32.1 | 15.5 | 3 |
| Chinstrap | 58.0 | 40.9 | 16.4 | 1 |
| Gentoo | 59.6 | 40.9 | 13.1 | 1 |
penguins.groupby('species')['bill_length_mm'].mean()
species Adelie 38.823973 Chinstrap 48.833824 Gentoo 47.568067 Name: bill_length_mm, dtype: float64
If giving agg a function, the function should:
- Take in a Series.
- Return a single value.
(
penguins
.groupby('species')
['bill_length_mm']
.agg(lambda s: np.percentile(s, 75) - np.percentile(s, 25))
)
species Adelie 4.050 Chinstrap 4.725 Gentoo 4.250 Name: bill_length_mm, dtype: float64
filter¶
- Querying: Keeping rows that satisfy conditions.
- Filtering: Keeping groups that satisfy conditions.
Show me the penguins that have a bill length over 40 mm.
penguins[penguins['bill_length_mm'] > 40]
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 12 | Adelie | Torgersen | 41.1 | 17.6 | 182.0 | 3200.0 | Female |
| 17 | Adelie | Torgersen | 42.5 | 20.7 | 197.0 | 4500.0 | Male |
| 19 | Adelie | Torgersen | 46.0 | 21.5 | 194.0 | 4200.0 | Male |
| 26 | Adelie | Biscoe | 40.6 | 18.6 | 183.0 | 3550.0 | Male |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
| 343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
237 rows × 7 columns
Show me the species with an average bill length over 40 mm.
penguins
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
| 343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
filter takes in a function, that:
- Takes in a DataFrame.
- Returns a BOOLEAN.
penguins.groupby('species').filter(lambda df: df['bill_length_mm'].mean() > 40)
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 152 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | Female |
| 153 | Chinstrap | Dream | 50.0 | 19.5 | 196.0 | 3900.0 | Male |
| 154 | Chinstrap | Dream | 51.3 | 19.2 | 193.0 | 3650.0 | Male |
| 155 | Chinstrap | Dream | 45.4 | 18.7 | 188.0 | 3525.0 | Female |
| 156 | Chinstrap | Dream | 52.7 | 19.8 | 197.0 | 3725.0 | Male |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
| 343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
187 rows × 7 columns
mean_lengths = penguins.groupby('species')['bill_length_mm'].mean()
mean_lengths
species Adelie 38.823973 Chinstrap 48.833824 Gentoo 47.568067 Name: bill_length_mm, dtype: float64
mean_lengths[mean_lengths > 40].index
Index(['Chinstrap', 'Gentoo'], dtype='object', name='species')
penguins[penguins['species'].isin(mean_lengths[mean_lengths > 40].index)]
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 152 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | Female |
| 153 | Chinstrap | Dream | 50.0 | 19.5 | 196.0 | 3900.0 | Male |
| 154 | Chinstrap | Dream | 51.3 | 19.2 | 193.0 | 3650.0 | Male |
| 155 | Chinstrap | Dream | 45.4 | 18.7 | 188.0 | 3525.0 | Female |
| 156 | Chinstrap | Dream | 52.7 | 19.8 | 197.0 | 3725.0 | Male |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
| 343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
187 rows × 7 columns
Here, the function that filter accepts:
- Takes in a Series (which we've still called
df). - Returns a BOOLEAN.
penguins.groupby('species')['bill_length_mm'].filter(lambda df: df.mean() > 40)
152 46.5
153 50.0
154 51.3
155 45.4
156 52.7
...
338 47.2
340 46.8
341 50.4
342 45.2
343 49.9
Name: bill_length_mm, Length: 187, dtype: float64
transform¶
penguins
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
| 343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
penguins = sns.load_dataset('penguins')
penguins
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
| 343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
penguins.isna().sum()
species 0 island 0 bill_length_mm 2 bill_depth_mm 2 flipper_length_mm 2 body_mass_g 2 sex 11 dtype: int64
def z_score(col):
return (col - col.mean()) / col.std()
z_score(penguins['bill_length_mm'])
0 -0.883205
1 -0.809939
2 -0.663408
3 NaN
4 -1.322799
...
339 NaN
340 0.527159
341 1.186550
342 0.234097
343 1.094968
Name: bill_length_mm, Length: 344, dtype: float64
$z$-scoring the bill lengths, separately for each species. That is, when doing the transformation for a particular species, it only uses the mean and SD for that species, not all penguins overall.
penguins.groupby('species')['bill_length_mm'].transform(z_score)
0 0.115870
1 0.266054
2 0.566421
3 NaN
4 -0.785232
...
339 NaN
340 -0.228719
341 0.939408
342 -0.747886
343 0.777168
Name: bill_length_mm, Length: 344, dtype: float64
transform takes in a function, that:
- Takes in a Series.
- Returns a Series.
apply¶
penguins.groupby('species').apply(lambda df: df['bill_length_mm'].max())
species Adelie 46.0 Chinstrap 58.0 Gentoo 59.6 dtype: float64
equivalent to:
penguins.groupby('species')['bill_length_mm'].max()
species Adelie 46.0 Chinstrap 58.0 Gentoo 59.6 Name: bill_length_mm, dtype: float64
The bill length of the second heaviest penguin per species.
def bill_length_of_second_heaviest(df):
return df.sort_values('body_mass_g', ascending=False).iloc[1].loc['bill_length_mm']
bill_length_of_second_heaviest(penguins)
59.6
In this case, apply is taking in a function, that:
- Takes in a DataFrame.
- Returns a number. (but it could also return something else!)
penguins.groupby('species').apply(bill_length_of_second_heaviest)
species Adelie 41.0 Chinstrap 52.8 Gentoo 59.6 dtype: float64
Find me all the rows for the three heaviest penguins per species.
def three_heaviest_penguins(penguins):
return penguins.sort_values('body_mass_g', ascending=False).head(3)
three_heaviest_penguins(penguins)
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 237 | Gentoo | Biscoe | 49.2 | 15.2 | 221.0 | 6300.0 | Male |
| 253 | Gentoo | Biscoe | 59.6 | 17.0 | 230.0 | 6050.0 | Male |
| 297 | Gentoo | Biscoe | 51.1 | 16.3 | 220.0 | 6000.0 | Male |
In this case, apply takes in a function that:
- Takes in a DataFrame.
- Returns a DataFrame.
penguins.groupby('species').apply(three_heaviest_penguins)
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | ||
|---|---|---|---|---|---|---|---|---|
| species | ||||||||
| Adelie | 109 | Adelie | Biscoe | 43.2 | 19.0 | 197.0 | 4775.0 | Male |
| 101 | Adelie | Biscoe | 41.0 | 20.0 | 203.0 | 4725.0 | Male | |
| 81 | Adelie | Torgersen | 42.9 | 17.6 | 196.0 | 4700.0 | Male | |
| Chinstrap | 189 | Chinstrap | Dream | 52.0 | 20.7 | 210.0 | 4800.0 | Male |
| 181 | Chinstrap | Dream | 52.8 | 20.0 | 205.0 | 4550.0 | Male | |
| 191 | Chinstrap | Dream | 53.5 | 19.9 | 205.0 | 4500.0 | Male | |
| Gentoo | 237 | Gentoo | Biscoe | 49.2 | 15.2 | 221.0 | 6300.0 | Male |
| 253 | Gentoo | Biscoe | 59.6 | 17.0 | 230.0 | 6050.0 | Male | |
| 297 | Gentoo | Biscoe | 51.1 | 16.3 | 220.0 | 6000.0 | Male |
Moral of the story: apply is more general purpose than agg/transform (I believe filter is an edge-case).
penguins
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
| 343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
penguins['species'].str[0]
0 A
1 A
2 A
3 A
4 A
..
339 G
340 G
341 G
342 G
343 G
Name: species, Length: 344, dtype: object
penguins.assign(first_letter_of_species = penguins['species'].str[0])
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | first_letter_of_species | |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | A |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | A |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | A |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | A |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN | G |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female | G |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | G |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | G |
| 343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male | G |
344 rows × 8 columns