agg, filter, transform, and apply GroupBy Methods¶

EECS 398-003: Practical Data Science, Fall 2024¶

Live from Office Hours¶

In office hours on Tuesday, October 8th, we walked through examples of how agg, filter, transform, and apply work. Watch a walkthrough of the notebook here: https://www.loom.com/share/4171bfc0e71c4722b10a6c8472d08cb5?sid=2dde2a35-e2a1-4619-8a16-5efb8570362d.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
In [2]:
penguins = sns.load_dataset('penguins').dropna()
penguins
Out[2]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male
... ... ... ... ... ... ... ...
338 Gentoo Biscoe 47.2 13.7 214.0 4925.0 Female
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

333 rows × 7 columns

agg¶


Average bill length per species.

In [3]:
penguins.groupby('species')['bill_length_mm'].mean()
Out[3]:
species
Adelie       38.823973
Chinstrap    48.833824
Gentoo       47.568067
Name: bill_length_mm, dtype: float64

Average bill length and average bill depth per species.

In [4]:
penguins.groupby('species')[['bill_length_mm', 'bill_depth_mm']].mean()
Out[4]:
bill_length_mm bill_depth_mm
species
Adelie 38.823973 18.347260
Chinstrap 48.833824 18.420588
Gentoo 47.568067 14.996639
In [5]:
# penguins.groupby('species').mean() # Need to select relevant columns first!
In [6]:
penguins.groupby('species')['bill_length_mm'].agg(['min', 'max'])
Out[6]:
min max
species
Adelie 32.1 46.0
Chinstrap 40.9 58.0
Gentoo 40.9 59.6
In [7]:
penguins.groupby('species').agg({'bill_length_mm': ['max', 'min'],
                                 'bill_depth_mm': 'min',
                                 'island': 'nunique'})
Out[7]:
bill_length_mm bill_depth_mm island
max min min nunique
species
Adelie 46.0 32.1 15.5 3
Chinstrap 58.0 40.9 16.4 1
Gentoo 59.6 40.9 13.1 1
In [8]:
penguins.groupby('species')['bill_length_mm'].mean()
Out[8]:
species
Adelie       38.823973
Chinstrap    48.833824
Gentoo       47.568067
Name: bill_length_mm, dtype: float64

If giving agg a function, the function should:

  • Take in a Series.
  • Return a single value.
In [9]:
(
    penguins
    .groupby('species')
    ['bill_length_mm']
    .agg(lambda s: np.percentile(s, 75) - np.percentile(s, 25))
)
Out[9]:
species
Adelie       4.050
Chinstrap    4.725
Gentoo       4.250
Name: bill_length_mm, dtype: float64

filter¶


  • Querying: Keeping rows that satisfy conditions.
  • Filtering: Keeping groups that satisfy conditions.

Show me the penguins that have a bill length over 40 mm.

In [10]:
penguins[penguins['bill_length_mm'] > 40]
Out[10]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
12 Adelie Torgersen 41.1 17.6 182.0 3200.0 Female
17 Adelie Torgersen 42.5 20.7 197.0 4500.0 Male
19 Adelie Torgersen 46.0 21.5 194.0 4200.0 Male
26 Adelie Biscoe 40.6 18.6 183.0 3550.0 Male
... ... ... ... ... ... ... ...
338 Gentoo Biscoe 47.2 13.7 214.0 4925.0 Female
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

237 rows × 7 columns

Show me the species with an average bill length over 40 mm.

In [11]:
penguins
Out[11]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male
... ... ... ... ... ... ... ...
338 Gentoo Biscoe 47.2 13.7 214.0 4925.0 Female
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

333 rows × 7 columns

filter takes in a function, that:

  • Takes in a DataFrame.
  • Returns a BOOLEAN.
In [12]:
penguins.groupby('species').filter(lambda df: df['bill_length_mm'].mean() > 40)
Out[12]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
152 Chinstrap Dream 46.5 17.9 192.0 3500.0 Female
153 Chinstrap Dream 50.0 19.5 196.0 3900.0 Male
154 Chinstrap Dream 51.3 19.2 193.0 3650.0 Male
155 Chinstrap Dream 45.4 18.7 188.0 3525.0 Female
156 Chinstrap Dream 52.7 19.8 197.0 3725.0 Male
... ... ... ... ... ... ... ...
338 Gentoo Biscoe 47.2 13.7 214.0 4925.0 Female
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

187 rows × 7 columns

In [13]:
mean_lengths = penguins.groupby('species')['bill_length_mm'].mean()
In [14]:
mean_lengths
Out[14]:
species
Adelie       38.823973
Chinstrap    48.833824
Gentoo       47.568067
Name: bill_length_mm, dtype: float64
In [15]:
mean_lengths[mean_lengths > 40].index
Out[15]:
Index(['Chinstrap', 'Gentoo'], dtype='object', name='species')
In [16]:
penguins[penguins['species'].isin(mean_lengths[mean_lengths > 40].index)]
Out[16]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
152 Chinstrap Dream 46.5 17.9 192.0 3500.0 Female
153 Chinstrap Dream 50.0 19.5 196.0 3900.0 Male
154 Chinstrap Dream 51.3 19.2 193.0 3650.0 Male
155 Chinstrap Dream 45.4 18.7 188.0 3525.0 Female
156 Chinstrap Dream 52.7 19.8 197.0 3725.0 Male
... ... ... ... ... ... ... ...
338 Gentoo Biscoe 47.2 13.7 214.0 4925.0 Female
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

187 rows × 7 columns

Here, the function that filter accepts:

  • Takes in a Series (which we've still called df).
  • Returns a BOOLEAN.
In [17]:
penguins.groupby('species')['bill_length_mm'].filter(lambda df: df.mean() > 40)
Out[17]:
152    46.5
153    50.0
154    51.3
155    45.4
156    52.7
       ... 
338    47.2
340    46.8
341    50.4
342    45.2
343    49.9
Name: bill_length_mm, Length: 187, dtype: float64

transform¶


In [18]:
penguins
Out[18]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male
... ... ... ... ... ... ... ...
338 Gentoo Biscoe 47.2 13.7 214.0 4925.0 Female
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

333 rows × 7 columns

In [19]:
penguins = sns.load_dataset('penguins')
penguins
Out[19]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

In [20]:
penguins.isna().sum()
Out[20]:
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64
$$\text{z-score} = \frac{\text{value} - \text{mean of column}}{\text{SD of column}}$$
In [21]:
def z_score(col):
    return (col - col.mean()) / col.std()
In [22]:
z_score(penguins['bill_length_mm'])
Out[22]:
0     -0.883205
1     -0.809939
2     -0.663408
3           NaN
4     -1.322799
         ...   
339         NaN
340    0.527159
341    1.186550
342    0.234097
343    1.094968
Name: bill_length_mm, Length: 344, dtype: float64

$z$-scoring the bill lengths, separately for each species. That is, when doing the transformation for a particular species, it only uses the mean and SD for that species, not all penguins overall.

In [23]:
penguins.groupby('species')['bill_length_mm'].transform(z_score)
Out[23]:
0      0.115870
1      0.266054
2      0.566421
3           NaN
4     -0.785232
         ...   
339         NaN
340   -0.228719
341    0.939408
342   -0.747886
343    0.777168
Name: bill_length_mm, Length: 344, dtype: float64

transform takes in a function, that:

  • Takes in a Series.
  • Returns a Series.

apply¶


In [24]:
penguins.groupby('species').apply(lambda df: df['bill_length_mm'].max())
Out[24]:
species
Adelie       46.0
Chinstrap    58.0
Gentoo       59.6
dtype: float64

equivalent to:

In [25]:
penguins.groupby('species')['bill_length_mm'].max()
Out[25]:
species
Adelie       46.0
Chinstrap    58.0
Gentoo       59.6
Name: bill_length_mm, dtype: float64

The bill length of the second heaviest penguin per species.

In [26]:
def bill_length_of_second_heaviest(df):
    return df.sort_values('body_mass_g', ascending=False).iloc[1].loc['bill_length_mm']
In [27]:
bill_length_of_second_heaviest(penguins)
Out[27]:
59.6

In this case, apply is taking in a function, that:

  • Takes in a DataFrame.
  • Returns a number. (but it could also return something else!)
In [28]:
penguins.groupby('species').apply(bill_length_of_second_heaviest)
Out[28]:
species
Adelie       41.0
Chinstrap    52.8
Gentoo       59.6
dtype: float64

Find me all the rows for the three heaviest penguins per species.

In [29]:
def three_heaviest_penguins(penguins):
    return penguins.sort_values('body_mass_g', ascending=False).head(3)
In [30]:
three_heaviest_penguins(penguins)
Out[30]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
237 Gentoo Biscoe 49.2 15.2 221.0 6300.0 Male
253 Gentoo Biscoe 59.6 17.0 230.0 6050.0 Male
297 Gentoo Biscoe 51.1 16.3 220.0 6000.0 Male

In this case, apply takes in a function that:

  • Takes in a DataFrame.
  • Returns a DataFrame.
In [31]:
penguins.groupby('species').apply(three_heaviest_penguins)
Out[31]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
species
Adelie 109 Adelie Biscoe 43.2 19.0 197.0 4775.0 Male
101 Adelie Biscoe 41.0 20.0 203.0 4725.0 Male
81 Adelie Torgersen 42.9 17.6 196.0 4700.0 Male
Chinstrap 189 Chinstrap Dream 52.0 20.7 210.0 4800.0 Male
181 Chinstrap Dream 52.8 20.0 205.0 4550.0 Male
191 Chinstrap Dream 53.5 19.9 205.0 4500.0 Male
Gentoo 237 Gentoo Biscoe 49.2 15.2 221.0 6300.0 Male
253 Gentoo Biscoe 59.6 17.0 230.0 6050.0 Male
297 Gentoo Biscoe 51.1 16.3 220.0 6000.0 Male

Moral of the story: apply is more general purpose than agg/transform (I believe filter is an edge-case).

In [32]:
penguins
Out[32]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

In [33]:
penguins['species'].str[0]
Out[33]:
0      A
1      A
2      A
3      A
4      A
      ..
339    G
340    G
341    G
342    G
343    G
Name: species, Length: 344, dtype: object
In [34]:
penguins.assign(first_letter_of_species = penguins['species'].str[0])
Out[34]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex first_letter_of_species
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male A
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female A
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female A
3 Adelie Torgersen NaN NaN NaN NaN NaN A
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female A
... ... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN G
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female G
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male G
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female G
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male G

344 rows × 8 columns