In [1]:
from lec_utils import *
def show_grouping_animation():
    src = "https://docs.google.com/presentation/d/1tBaFyHseIGsX5wmE3BdNLeVHnKksQtpzLhHge8Tzly0/embed?start=false&loop=false&delayms=60000&rm=minimal"
    width = 960
    height = 509
    display(IFrame(src, width, height))

Lecture 5¶

Aggregation: Grouping and Pivoting¶

EECS 398: Practical Data Science, Winter 2025¶

practicaldsc.org • github.com/practicaldsc/wn25 • 📣 See latest announcements here on Ed

Agenda 📆¶

  • Introduction to the groupby method.
  • groupby's inner workings.
  • Advanced groupby usage.
  • Pivot tables using pivot_table.

Remember to follow along in lecture by accessing the "blank" lecture notebook in our public GitHub repository.

Read the guide!¶

  • We've posted a new guide about how DataFrames are stored in memory, linked here.
  • It covers:
    • How to (correctly) add new columns to DataFrames, and otherwise deal with the fact that they are mutable.
    • How DataFrames interface with numpy.
  • This content used to be part of lecture last semester, but we've moved it into a guide to spend lecture time on more conceptual material.
  • But you're still responsible for knowing it!

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

Introduction to the groupby method¶


What sorts of problems does groupby help us solve?

Example: Palmer Penguins¶

No description has been provided for this image Artwork by @allison_horst

The dataset we'll work with for the rest of the lecture involves various measurements taken of three species of penguins in Antarctica.

In [2]:
IFrame('https://www.youtube-nocookie.com/embed/CCrNAHXUstU?si=-DntSyUNp5Kwitjm&start=11',
       width=560, height=315)
Out[2]:

Loading the data¶

In [3]:
penguins = pd.read_csv('data/penguins.csv')
penguins
Out[3]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Dream 41.3 20.3 194.0 3550.0 Male
1 Adelie Torgersen 38.5 17.9 190.0 3325.0 Female
2 Adelie Dream 34.0 17.1 185.0 3400.0 Female
... ... ... ... ... ... ... ...
330 Chinstrap Dream 46.6 17.8 193.0 3800.0 Female
331 Adelie Dream 39.7 17.9 193.0 4250.0 Male
332 Gentoo Biscoe 45.1 14.5 207.0 5050.0 Female

333 rows × 7 columns

  • Here, each row corresponds to a single penguin, and each column corresponds to a different attribute (or feature) we have for each penguin.
  • Data formatted in this way is sometimes called tidy data.

Visualizing the data¶

In [4]:
penguins.plot(kind='scatter', 
              x='bill_length_mm', 
              y='body_mass_g', 
              color='species', 
              title='Body Mass vs. Bill Length')

Aggregating¶

  • Aggregating is the act of combining many values into a single value.
  • Aggregations "hide" some of the detail in the data, and help understand bigger-picture trends.
  • Example: What is the mean 'body_mass_g' for all penguins?
In [5]:
penguins['body_mass_g'].mean() 
Out[5]:
4207.057057057057
  • Example: What is the mean 'body_mass_g' for each 'species'?

A naïve approach to finding the mean 'body_mass_g' per 'species'¶

  • First, we could identify all unique values in the 'species' column.
In [6]:
penguins['species'].unique() 
Out[6]:
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
  • Then, for each 'species', we could:
    1. Query for just that 'species'.
    2. Extract the 'body_mass_g' column and use the mean method on it.
In [7]:
penguins.loc[penguins['species'] == 'Adelie', 'body_mass_g'].mean() 
Out[7]:
3706.1643835616437
In [8]:
penguins.loc[penguins['species'] == 'Chinstrap', 'body_mass_g'].mean() 
Out[8]:
3733.0882352941176
In [9]:
penguins.loc[penguins['species'] == 'Gentoo', 'body_mass_g'].mean() 
Out[9]:
5092.436974789916
  • We could use a for-loop, but remember, we want to avoid Python for-loops.

The magic of groupby 🪄¶

  • A better solution is to use the groupby method.
In [10]:
# To find the overall mean 'body_mass_g':
penguins['body_mass_g'].mean() 
Out[10]:
4207.057057057057
In [11]:
# To find the mean 'body_mass_g' for each 'species':
penguins.groupby('species')['body_mass_g'].mean() 
Out[11]:
species
Adelie       3706.16
Chinstrap    3733.09
Gentoo       5092.44
Name: body_mass_g, dtype: float64
  • Somehow, the groupby method computes what we're looking for in just one line. How?
  • We'll work through the internals, but remember this: if you need to calculate something for each group, use groupby!

An illustrative example: Pets 🐱 🐶🐹¶

  • Consider the DataFrame pets, shown below.
Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0
  • Let's see what happens under the hood when we use the groupby method on pets.
In [12]:
show_grouping_animation()

"Split-apply-combine" paradigm¶

  • The groupby method involves three steps: split, apply, and combine.
    This is the same terminology that the pandas documentation uses.
No description has been provided for this image
  • Split breaks up and "groups" the rows of a DataFrame according to the specified key.
    There is one "group" for every unique value of the key.
  • Apply uses a function (e.g. aggregation, transformation, filtration) within the individual groups.
  • Combine stitches the results of these operations into an output DataFrame.
  • The split-apply-combine pattern can be parallelized to work on multiple computers or threads, by sending computations for each group to different processors.

Activity

Which penguin 'species' has the highest median 'bill_length_mm'?

In [13]:
penguins
Out[13]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Dream 41.3 20.3 194.0 3550.0 Male
1 Adelie Torgersen 38.5 17.9 190.0 3325.0 Female
2 Adelie Dream 34.0 17.1 185.0 3400.0 Female
... ... ... ... ... ... ... ...
330 Chinstrap Dream 46.6 17.8 193.0 3800.0 Female
331 Adelie Dream 39.7 17.9 193.0 4250.0 Male
332 Gentoo Biscoe 45.1 14.5 207.0 5050.0 Female

333 rows × 7 columns

In [14]:
(
    penguins
    .groupby('species')
    ['bill_length_mm']
    .median()
    .idxmax()
)
Out[14]:
'Chinstrap'
In [15]:
(
    penguins
    .groupby('species')
    ['bill_length_mm']
    .median()
    .plot(kind='barh', title='Median Bill Length of Each Species')
)
In [ ]:
 

groupby's inner workings¶


How does groupby actually work?¶

  • We've just evaluated a few expressions of the following form.
In [16]:
penguins.groupby('species')['bill_length_mm'].mean()
Out[16]:
species
Adelie       38.82
Chinstrap    48.83
Gentoo       47.57
Name: bill_length_mm, dtype: float64
  • There are three "building blocks"

in the above expression: 1. penguins.groupby('species').
First, we specify which column we want to group on. 1. ['bill_length_mm'].
Then, we select the other relevant columns for our calculations. 1. .mean().
Finally, we use an aggregation method.

  • Let's see what each block contributes to the output.

DataFrameGroupBy objects¶


penguins.groupby('species')['bill_length_mm'].mean()
  • If df is a DataFrame, then df.groupby(key) returns a DataFrameGroupBy object.
    This object represents the "split" in "split-apply-combine".
In [17]:
penguins.groupby('species')
Out[17]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x16a301420>
In [18]:
# Creates one group for each unique value in the species column.
penguins.groupby('species').groups
Out[18]:
{'Adelie': [0, 1, 2, 4, 6, 11, 12, 15, 19, 20, 21, 23, 25, 26, 33, 34, 36, 37, 38, 42, 49, 51, 52, 54, 59, 60, 61, 63, 64, 69, 70, 71, 77, 78, 79, 81, 85, 87, 90, 91, 94, 95, 96, 102, 103, 106, 107, 111, 112, 115, 117, 118, 120, 121, 123, 126, 127, 131, 133, 136, 139, 141, 143, 147, 150, 153, 158, 160, 164, 165, 167, 168, 172, 173, 174, 177, 178, 184, 188, 193, 198, 201, 203, 207, 208, 210, 214, 216, 217, 218, 219, 220, 223, 227, 228, 229, 230, 232, 234, 237, ...], 'Chinstrap': [3, 5, 7, 8, 13, 18, 22, 24, 28, 30, 31, 39, 48, 50, 55, 58, 74, 84, 92, 93, 97, 99, 101, 105, 110, 125, 128, 130, 135, 140, 142, 146, 152, 162, 163, 176, 180, 183, 187, 190, 192, 195, 197, 206, 209, 211, 224, 235, 240, 241, 242, 246, 251, 260, 261, 265, 266, 271, 282, 302, 308, 311, 312, 317, 327, 328, 329, 330], 'Gentoo': [9, 10, 14, 16, 17, 27, 29, 32, 35, 40, 41, 43, 44, 45, 46, 47, 53, 56, 57, 62, 65, 66, 67, 68, 72, 73, 75, 76, 80, 82, 83, 86, 88, 89, 98, 100, 104, 108, 109, 113, 114, 116, 119, 122, 124, 129, 132, 134, 137, 138, 144, 145, 148, 149, 151, 154, 155, 156, 157, 159, 161, 166, 169, 170, 171, 175, 179, 181, 182, 185, 186, 189, 191, 194, 196, 199, 200, 202, 204, 205, 212, 213, 215, 221, 222, 225, 226, 231, 233, 236, 245, 247, 248, 252, 253, 257, 264, 267, 268, 275, ...]}
  • DataFrameGroupBy objects have a groups attribute, which is a dictionary in which the keys are group names and the values are lists of row labels.
    We won't actually use this, but it's helpful in understanding how groupby works under-the-hood.

Column extraction¶


penguins.groupby('species')['bill_length_mm'].mean()
  • After creating a DataFrameGroupBy object, we typically select the relevant column(s) that we want to aggregate.
    If we don't, we may run into errors if trying to use a numeric aggregation method on non-numeric columns, as we saw earlier.
    Also, this is more efficient, since the aggregations are only performed on the columns you care about, rather than all columns.
  • The result is either a SeriesGroupBy or DataFrameGroupBy object, depending on what's passed in.
In [19]:
penguins.groupby('species')['bill_length_mm'] 
Out[19]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x16a3015a0>
In [20]:
penguins.groupby('species')[['bill_length_mm', 'bill_depth_mm']] 
Out[20]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1048fc280>

Aggregation¶


penguins.groupby('species')['bill_length_mm'].mean()
  • Once we create a DataFrameGroupBy or SeriesGroupBy object, we need to apply some function separately to each group, and combine the results.
  • The most common operation we apply to each group is an aggregation, but we'll see examples of filtrations and transformations soon.
    Remember, aggregation is the act of combining many values into a single value.
  • To perform an aggregation, use an aggregation method on the DataFrameGroupBy or SeriesGroupBy object, e.g. .mean(), .max(), or .median().

Let's look at some examples.

In [21]:
# Note that this worked on the entire DataFrame!
# But, if all we wanted are the sums of `'body_mass_g'
# for each species, this is slower than
# penguins.groupby('species')['body_mass_g'].sum().
penguins.groupby('species').sum() 
Out[21]:
island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
species
Adelie DreamTorgersenDreamBiscoeTorgersenBiscoeTorger... 5668.3 2678.7 27755.0 541100.0 MaleFemaleFemaleMaleFemaleMaleFemaleMaleFemale...
Chinstrap DreamDreamDreamDreamDreamDreamDreamDreamDreamD... 3320.7 1252.6 13316.0 253850.0 FemaleFemaleMaleMaleFemaleFemaleFemaleFemaleFe...
Gentoo BiscoeBiscoeBiscoeBiscoeBiscoeBiscoeBiscoeBisc... 5660.6 1784.6 25851.0 606000.0 FemaleFemaleFemaleFemaleMaleFemaleFemaleMaleMa...
In [22]:
# Often used in conjunction with sort_values.
# Remember this when you work on the activity in a few slides!
penguins.groupby('species').last() 
Out[22]:
island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
species
Adelie Dream 39.7 17.9 193.0 4250.0 Male
Chinstrap Dream 46.6 17.8 193.0 3800.0 Female
Gentoo Biscoe 45.1 14.5 207.0 5050.0 Female
In [23]:
# Similar to value_counts, but not identical!
penguins.groupby('species').size() 
Out[23]:
species
Adelie       146
Chinstrap     68
Gentoo       119
dtype: int64
In [24]:
penguins['species'].value_counts() 
Out[24]:
species
Adelie       146
Gentoo       119
Chinstrap     68
Name: count, dtype: int64

Reminder: Column independence¶

  • As we've seen, within each group, the aggregation method is applied to each column independently.
In [25]:
penguins.groupby('species').max()
Out[25]:
island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
species
Adelie Torgersen 46.0 21.5 210.0 4775.0 Male
Chinstrap Dream 58.0 20.8 212.0 4800.0 Male
Gentoo Biscoe 59.6 17.3 231.0 6300.0 Male
  • The above result is not telling us that there is a 'Adelie' penguin with a 'body_mass_g' of 4775.0 that lived on 'Torgersen' island.
In [26]:
# This penguin lived on Biscoe island!
penguins.loc[(penguins['species'] == 'Adelie') & (penguins['body_mass_g'] == 4775.0)]
Out[26]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
255 Adelie Biscoe 43.2 19.0 197.0 4775.0 Male

Activity

Find the 'species', 'island', and 'body_mass_g' of the heaviest 'Male' and 'Female' penguins in penguins.

In [27]:
# General idea: Sort the penguibs by mass in decreasing order.
# Then, the first male penguin that appears is the heaviest male penguin,
# and the first female penguin that appears is the heaviest female penguin.
# For each sex, take the first row.
(
    penguins
    .sort_values('body_mass_g', ascending=False)
    .groupby('sex')
    .first()
)
Out[27]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
sex
Female Gentoo Biscoe 45.2 14.8 212.0 5200.0
Male Gentoo Biscoe 49.2 15.2 221.0 6300.0
In [ ]:
 

Activity

What proportion of penguins of each 'species' live on 'Dream' island?

*Hint*: If you've read the guide, this activity will be much easier!

In [28]:
(
    penguins
    .assign(is_Dream=penguins['island'] == 'Dream')
    .groupby('species')
    ['is_Dream']
    .mean()
)
Out[28]:
species
Adelie       0.38
Chinstrap    1.00
Gentoo       0.00
Name: is_Dream, dtype: float64
In [ ]:
 

Advanced groupby usage¶


Beyond default aggregation methods¶

  • There are many built-in aggregation methods, like .mean(), .max(), and .last().
  • What if the aggregation method you want to use doesn't already exist in pandas, or what if you want to apply different aggregation methods to different columns?
  • Or, what if you don't want to perform an aggregation, but want to perform some other operation separately on each group?

Grouping method 1: agg¶

  • After grouping, use the agg method if you want to aggregate with:

    • A single function (either built-in or one that you define!).
    • A list of functions.
    • A dictionary mapping column names to functions.

    Refer to the documentation for a comprehensive list.
    Per the documentation, agg is an alias for aggregate.

  • Example: How many penguins are there of each 'species', and what is the mean 'body_mass_g' of each 'species'?
In [29]:
(
    penguins
    .groupby('species')
    ['body_mass_g']
    .agg(['count', 'mean'])
)
Out[29]:
count mean
species
Adelie 146 3706.16
Chinstrap 68 3733.09
Gentoo 119 5092.44
  • Note that we call agg a "grouping method" because it comes after a call to groupby.

agg with different aggregation methods for different columns¶

  • Example: What is the maximum 'bill_length_mm' of each 'species', and which 'island's is each 'species' found on?
In [30]:
(
    penguins
    .groupby('species')
    .aggregate({'bill_length_mm': 'max', 'island': 'unique'})
)
Out[30]:
bill_length_mm island
species
Adelie 46.0 [Dream, Torgersen, Biscoe]
Chinstrap 58.0 [Dream]
Gentoo 59.6 [Biscoe]

agg with a custom aggregation method¶

  • Example: What is the second largest recorded 'body_mass_g' for each 'species'?
In [31]:
# Here, the argument to agg is a function,
# which takes in a Series and returns a scalar.
def second_largest(s):
    return s.sort_values().iloc[-2]
(
    penguins
    .groupby('species')
    ['body_mass_g']
    .agg(second_largest)
)
Out[31]:
species
Adelie       4725.0
Chinstrap    4550.0
Gentoo       6050.0
Name: body_mass_g, dtype: float64
  • Key idea: If you give agg a custom function, it should map $\texttt{Series} \rightarrow \texttt{number}$.

Grouping method 2: filter¶

  • As we saw last class, a query keeps rows that satisfy conditions.
    For instance, to see the individual penguins with a 'bill_length_mm' over 47 mm:
In [32]:
# This is a query, NOT a filter for the purposes of this slide.
penguins[penguins['bill_length_mm'] > 47]
Out[32]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
5 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female
7 Chinstrap Dream 51.4 19.0 201.0 3950.0 Male
8 Chinstrap Dream 51.3 19.2 193.0 3650.0 Male
... ... ... ... ... ... ... ...
327 Chinstrap Dream 49.3 19.9 203.0 4050.0 Male
328 Chinstrap Dream 50.7 19.7 203.0 4050.0 Male
329 Chinstrap Dream 51.3 19.9 198.0 3700.0 Male

106 rows × 7 columns

  • A filter, on the other hand, keeps entire groups that satisfy conditions.
  • For instance, to see the penguin 'species' with an average 'bill_length_mm' over 47 mm, use the filter method after groupby:
In [33]:
(
    penguins
    .groupby('species')
    .filter(lambda df: df['bill_length_mm'].mean() > 47)
)
Out[33]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
3 Chinstrap Dream 45.5 17.0 196.0 3500.0 Female
5 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female
7 Chinstrap Dream 51.4 19.0 201.0 3950.0 Male
... ... ... ... ... ... ... ...
329 Chinstrap Dream 51.3 19.9 198.0 3700.0 Male
330 Chinstrap Dream 46.6 17.8 193.0 3800.0 Female
332 Gentoo Biscoe 45.1 14.5 207.0 5050.0 Female

187 rows × 7 columns

  • Notice that the above DataFrame has 187 rows, fewer than the 333 in the full DataFrame. That's because there are no 'Adelie' penguins above!
In [34]:
# Since 'Adelie's have a mean 'bill_length_mm' below 47, they aren't included in the output above.
penguins.groupby('species')['bill_length_mm'].mean()
Out[34]:
species
Adelie       38.82
Chinstrap    48.83
Gentoo       47.57
Name: bill_length_mm, dtype: float64
  • Key idea: If you give filter a custom function, it should map $\texttt{DataFrame} \rightarrow \texttt{Boolean}$. The resulting DataFrame will only have the groups for which the custom function returned True.

Activity

There is only one penguins 'species' with:

  • At least 100 penguins.
  • At least 60 'Female' penguins.

Find the 'species' using a single expression (i.e. no intermediate variables). Use filter.

In [35]:
(
    penguins
    .groupby('species')
    .filter(lambda df: (df.shape[0] >= 100) and ((df['sex'] == 'Female').sum() >= 60))
    ['species']
    .unique()
    [0]
)
Out[35]:
'Adelie'
In [ ]:
 
In [ ]:
 
In [36]:
# Note that to just find the 'species' with at least 100 penguins,
# we didn't need to group:
penguins['species'].value_counts()
Out[36]:
species
Adelie       146
Gentoo       119
Chinstrap     68
Name: count, dtype: int64
In [37]:
penguins.loc[penguins['sex'] == 'Female', 'species'].value_counts()
Out[37]:
species
Adelie       73
Gentoo       58
Chinstrap    34
Name: count, dtype: int64

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

What questions do you have?

Grouping method 3: transform¶

  • Use the transform grouping method if you want to apply a function separately to each group.
  • Example: How much heavier is each penguin than the average penguin of their 'species'?
In [38]:
penguins.groupby('species')['body_mass_g'].transform(lambda s: s - s.mean()) 
Out[38]:
0     -156.16
1     -381.16
2     -306.16
        ...  
330     66.91
331    543.84
332    -42.44
Name: body_mass_g, Length: 333, dtype: float64
In [39]:
penguins['body_mass_g'] 
Out[39]:
0      3550.0
1      3325.0
2      3400.0
        ...  
330    3800.0
331    4250.0
332    5050.0
Name: body_mass_g, Length: 333, dtype: float64
  • Notice that penguin 332's transformed 'body_mass_g' is negative, even though their actual 'body_mass_g' is very large; this is because they have a below-average 'body_mass_g' for their 'species'.
  • Key idea: If you give transform a custom function, it should map $\texttt{Series} \rightarrow \texttt{Series}$.

Grouping method 4: apply¶

  • Sometimes, you want to apply an operation separately for each group, but the operation isn't possible using agg, filter, or transform.
  • The apply grouping method is like a swiss-army knife.
    It can do anything agg or transform can do, and more.
    Only use it if necessary, because it's slower than agg and transform!
  • Example: Find the two heaviest penguins per 'species'.
In [40]:
penguins.groupby('species').apply(lambda df: df.sort_values('body_mass_g', ascending=False).head(2)) 
Out[40]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
species
Adelie 255 Adelie Biscoe 43.2 19.0 197.0 4775.0 Male
174 Adelie Biscoe 41.0 20.0 203.0 4725.0 Male
Chinstrap 99 Chinstrap Dream 52.0 20.7 210.0 4800.0 Male
187 Chinstrap Dream 52.8 20.0 205.0 4550.0 Male
Gentoo 268 Gentoo Biscoe 49.2 15.2 221.0 6300.0 Male
189 Gentoo Biscoe 59.6 17.0 230.0 6050.0 Male
  • Example: Find the 'flipper_length_mm' of the heaviest penguin of each 'species'.
In [41]:
(
    penguins
    .groupby('species')
    .apply(
        lambda df: df.sort_values('body_mass_g', ascending=False)['flipper_length_mm'].iloc[0]
    )
)
Out[41]:
species
Adelie       197.0
Chinstrap    210.0
Gentoo       221.0
dtype: float64
  • Key idea: If you give apply a custom function, it should map $\texttt{DataFrame} \rightarrow \texttt{anything}$. The outputs of the custom function will be stitched together intelligently by pandas.

⭐️ The grouping method cheat sheet: agg, filter, transform, and apply ⭐️¶

After grouping, use:

  • The agg method if you want to aggregate values, separately for each group.
    If you give agg a custom function, it should map $\texttt{Series} \rightarrow \texttt{number}$.
In [42]:
penguins.groupby('species')['body_mass_g'].agg(lambda s: s.sort_values().iloc[-2])
Out[42]:
species
Adelie       4725.0
Chinstrap    4550.0
Gentoo       6050.0
Name: body_mass_g, dtype: float64
  • The filter method if you want to keep groups that satisfy certain conditions.
    If you give filter a custom function, it should map $\texttt{DataFrame} \rightarrow \texttt{Boolean}$. The resulting DataFrame will only have the groups for which the custom function returned True.
In [43]:
(
    penguins
    .groupby('species')
    .filter(lambda df: (df.shape[0] >= 100) and ((df['sex'] == 'Female').sum() >= 60))
)
Out[43]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Dream 41.3 20.3 194.0 3550.0 Male
1 Adelie Torgersen 38.5 17.9 190.0 3325.0 Female
2 Adelie Dream 34.0 17.1 185.0 3400.0 Female
... ... ... ... ... ... ... ...
325 Adelie Torgersen 35.2 15.9 186.0 3050.0 Female
326 Adelie Dream 41.1 18.1 205.0 4300.0 Male
331 Adelie Dream 39.7 17.9 193.0 4250.0 Male

146 rows × 7 columns

  • The transform method if you want to modify the values within each group separately.
    If you give transform a custom function, it should map $\texttt{Series} \rightarrow \texttt{Series}$.
In [44]:
penguins.groupby('species')['body_mass_g'].transform(lambda s: s - s.mean())
Out[44]:
0     -156.16
1     -381.16
2     -306.16
        ...  
330     66.91
331    543.84
332    -42.44
Name: body_mass_g, Length: 333, dtype: float64
  • The apply method if you want to perform some general operation on each group separately.
    If you give apply a custom function, it should map $\texttt{DataFrame} \rightarrow \texttt{anything}$.
In [45]:
penguins.groupby('species').apply(lambda df: df.sort_values('body_mass_g', ascending=False).head(2))
Out[45]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
species
Adelie 255 Adelie Biscoe 43.2 19.0 197.0 4775.0 Male
174 Adelie Biscoe 41.0 20.0 203.0 4725.0 Male
Chinstrap 99 Chinstrap Dream 52.0 20.7 210.0 4800.0 Male
187 Chinstrap Dream 52.8 20.0 205.0 4550.0 Male
Gentoo 268 Gentoo Biscoe 49.2 15.2 221.0 6300.0 Male
189 Gentoo Biscoe 59.6 17.0 230.0 6050.0 Male

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

What questions do you have?

Grouping with multiple columns¶

  • When we group with multiple columns, one group is created for every unique combination of elements in the specified columns.
    In the output below, why are there only 5 rows, rather than $3 \times 3 = 9$ rows, when there are 3 unique 'species' and 3 unique 'island's?
In [46]:
# Read this as:
species_and_island = (
    penguins.groupby(['species', 'island'])         # for every combination of 'species' and 'island' in the DataFrame,
    [['bill_length_mm', 'bill_depth_mm']].mean()    # calculate the mean 'bill_length_mm' and the mean 'bill_depth_mm'.
)
species_and_island
Out[46]:
bill_length_mm bill_depth_mm
species island
Adelie Biscoe 38.98 18.37
Dream 38.52 18.24
Torgersen 39.04 18.45
Chinstrap Dream 48.83 18.42
Gentoo Biscoe 47.57 15.00
  • Advice: When grouping on multiple columns, the result usually has a MultiIndex; use reset_index or set as_index=False in groupby to avoid this.
In [47]:
# Now, this looks like a regular DataFrame!
species_and_island.reset_index() 
Out[47]:
species island bill_length_mm bill_depth_mm
0 Adelie Biscoe 38.98 18.37
1 Adelie Dream 38.52 18.24
2 Adelie Torgersen 39.04 18.45
3 Chinstrap Dream 48.83 18.42
4 Gentoo Biscoe 47.57 15.00

Pivot tables using pivot_table¶


Pivot tables: An extension of grouping¶

  • Pivot tables are a compact way to display tables for humans to read.
sex Female Male
species
Adelie 3368.84 4043.49
Chinstrap 3527.21 3938.97
Gentoo 4679.74 5484.84
  • Notice that each value in the table is the average of 'body_mass_g' of penguins, for every combination of 'species' and 'sex'.
  • You can think of pivot tables as grouping using two columns, then "pivoting" one of the group labels into columns.

pivot_table¶

  • The pivot_table DataFrame method aggregates a DataFrame using two columns. To use it:



df.pivot_table(index=index_col,
                       columns=columns_col,
                       values=values_col,
                       aggfunc=func)
  • The resulting DataFrame will have:
    • One row for every unique value in index_col.
    • One column for every unique value in columns_col.
    • Values determined by applying func on values in values_col.
  • Example: Find the average 'body_mass_g' for every combination of 'species' and 'sex'.
In [48]:
penguins.pivot_table(
    index='species',
    columns='sex',
    values='body_mass_g',
    aggfunc='mean'
)
Out[48]:
sex Female Male
species
Adelie 3368.84 4043.49
Chinstrap 3527.21 3938.97
Gentoo 4679.74 5484.84
In [49]:
# Same information as above, but harder to read!
(
    penguins
    .groupby(['species', 'sex'])
    [['body_mass_g']]
    .mean()
)
Out[49]:
body_mass_g
species sex
Adelie Female 3368.84
Male 4043.49
Chinstrap Female 3527.21
Male 3938.97
Gentoo Female 4679.74
Male 5484.84

Example: Finding the number of penguins per 'island' and 'species'¶

In [50]:
penguins
Out[50]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Dream 41.3 20.3 194.0 3550.0 Male
1 Adelie Torgersen 38.5 17.9 190.0 3325.0 Female
2 Adelie Dream 34.0 17.1 185.0 3400.0 Female
... ... ... ... ... ... ... ...
330 Chinstrap Dream 46.6 17.8 193.0 3800.0 Female
331 Adelie Dream 39.7 17.9 193.0 4250.0 Male
332 Gentoo Biscoe 45.1 14.5 207.0 5050.0 Female

333 rows × 7 columns

  • Suppose we want to find the number of penguins in penguins per 'island' and 'species'. We can do so without pivot_table:
In [51]:
penguins.value_counts(['island', 'species']) 
Out[51]:
island     species  
Biscoe     Gentoo       119
Dream      Chinstrap     68
           Adelie        55
Torgersen  Adelie        47
Biscoe     Adelie        44
Name: count, dtype: int64
In [52]:
penguins.groupby(['island', 'species']).size() 
Out[52]:
island     species  
Biscoe     Adelie        44
           Gentoo       119
Dream      Adelie        55
           Chinstrap     68
Torgersen  Adelie        47
dtype: int64
  • But the data is arguably easier to interpret when we do use pivot_table:
In [53]:
penguins.pivot_table(
    index='species', 
    columns='island', 
    values='bill_length_mm', # Choice of column here doesn't actually matter! Why?
    aggfunc='count',
)
Out[53]:
island Biscoe Dream Torgersen
species
Adelie 44.0 55.0 47.0
Chinstrap NaN 68.0 NaN
Gentoo 119.0 NaN NaN
  • Note that there is a NaN at the intersection of 'Biscoe' and 'Chinstrap', because there were no Chinstrap penguins on Biscoe Island.
    NaN stands for "not a number." It is numpy and pandas' version of a null value (the regular Python null value is None). We'll learn more about how to deal with these soon.
  • We can either use the fillna method afterwards or the fill_value argument to fill in NaNs.
In [54]:
penguins.pivot_table(
    index='species', 
    columns='island', 
    values='bill_length_mm', 
    aggfunc='count',
    fill_value=0,
)
Out[54]:
island Biscoe Dream Torgersen
species
Adelie 44 55 47
Chinstrap 0 68 0
Gentoo 119 0 0

Granularity¶

  • Each row of the original penguins DataFrame represented a single penguin, and each column represented features of the penguins.
In [55]:
penguins
Out[55]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Dream 41.3 20.3 194.0 3550.0 Male
1 Adelie Torgersen 38.5 17.9 190.0 3325.0 Female
2 Adelie Dream 34.0 17.1 185.0 3400.0 Female
... ... ... ... ... ... ... ...
330 Chinstrap Dream 46.6 17.8 193.0 3800.0 Female
331 Adelie Dream 39.7 17.9 193.0 4250.0 Male
332 Gentoo Biscoe 45.1 14.5 207.0 5050.0 Female

333 rows × 7 columns

  • What is the granularity of the DataFrame below?
    That is, what does each row represent?
In [56]:
penguins.pivot_table(
    index='species', 
    columns='island', 
    values='bill_length_mm', 
    aggfunc='count',
    fill_value=0,
)
Out[56]:
island Biscoe Dream Torgersen
species
Adelie 44 55 47
Chinstrap 0 68 0
Gentoo 119 0 0

Reshaping¶

  • pivot_table reshapes DataFrames from "long" to "wide".
  • Other DataFrame reshaping methods:
    • melt: Un-pivots a DataFrame. Very useful in data cleaning.
    • pivot: Like pivot_table, but doesn't do aggregation.
    • stack: Pivots multi-level columns to multi-indices.
    • unstack: Pivots multi-indices to columns.
  • Google, the documentation, and ChatGPT are your friends!

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

What questions do you have?