In [1]:
from lec_utils import *
def show_chaining_slides():
    src = 'https://docs.google.com/presentation/d/e/2PACX-1vRYAERthJoyVnD1JymDK6JAtufCJmA5AYb5H2NwlegJHm04WhBfxnA0zQO3vKbEYQbqOJ8XJPZtfoxF/embed?start=false&loop=false&rm=minimal'
    width = 1000
    height = 590
    display(IFrame(src, width, height))

Lecture 4¶

DataFrame Fundamentals¶

EECS 398: Practical Data Science, Spring 2025¶

practicaldsc.org • github.com/practicaldsc/sp25 • 📣 See latest announcements here on Ed

Agenda 📆¶

  • Introduction to pandas DataFrames.
  • Selecting columns.
  • Selecting slices 🍰.
  • Querying 🔎.

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

Introduction to pandas DataFrames¶


Let's finally start working with real datasets! 🎉

Note that we're going to cover a lot of code quickly. The point of lecture is to expose you to what's possible; you can look at the notebook later for the details.

pandas¶

No description has been provided for this image
  • pandas is the Python library for tabular data manipulation.
    It, roughly, stands for "Python data analysis".
  • Before pandas was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.
  • Wes McKinney, the original developer of pandas, wanted a library which would allow everything to be done in Python.
    Python is faster to develop in than Java or C++, and is more general-purpose than R.
  • pandas is almost always imported in conjunction with numpy.
In [2]:
import pandas as pd
import numpy as np

pandas data structures¶

  • There are three key data structures at the core of pandas.
No description has been provided for this image An example DataFrame.
  • DataFrame: 2 dimensional tables. These have rows and columns.
  • Series: 1 dimensional array-like object, representing a row or column.
    Like arrays, Series contain data of the same type. The plural of Series is also Series.
  • Index: Sequence of row or column labels. When we say "the index", we're referring to the sequence of row labels.
    The index – 'lebronja', 'obammich', 'carpents', and 'timapplec' in the example above – is not a column!
    Column names – 'name', 'program', and 'year' in the example above – are stored as strings, and the sequence of column names is also an index.

Example: Dog breeds 🐶¶

  • The dataset we'll work comes from the American Kennel Club. Here's a cool plot made using our dataset.
No description has been provided for this image
  • We'll usually work with data stored in the CSV format. CSV stands for "comma-separated values."
  • We can read in a CSV using pd.read_csv(path). The path should be relative to your notebook; if the file is in the same folder as your notebook, this is just the name of the file (as a string).
  • Today's dataset is stored 'data/dogs42.csv' – open it up and see what it looks like!
In [3]:
# The "cat" shell command shows you the contents of a file.
!cat data/dogs42.csv 
breed,kind,lifetime_cost,longevity,size,weight,height
Brittany,sporting,22589.0,12.92,medium,35.0,19.0
Cairn Terrier,terrier,21992.0,13.84,small,14.0,10.0
English Cocker Spaniel,sporting,18993.0,11.66,medium,30.0,16.0
Cocker Spaniel,sporting,24330.0,12.5,small,25.0,14.5
Shetland Sheepdog,herding,21006.0,12.53,small,22.0,14.5
Siberian Husky,working,22049.0,12.58,medium,47.5,21.75
Miniature Schnauzer,terrier,20087.0,11.81,small,15.5,13.0
Chihuahua,toy,26250.0,16.5,small,5.5,5.0
English Springer Spaniel,sporting,21946.0,12.54,medium,45.0,19.5
German Shorthaired Pointer,sporting,25842.0,11.46,large,62.5,24.0
Pointer,sporting,24445.0,12.42,large,59.5,25.5
Tibetan Spaniel,non-sporting,25549.0,14.42,small,12.0,10.0
Labrador Retriever,sporting,21299.0,12.04,medium,67.5,23.0
Maltese,toy,19084.0,12.25,small,5.0,9.0
Shih Tzu,toy,21152.0,13.2,small,12.5,9.75
Irish Setter,sporting,20323.0,11.63,large,65.0,26.0
Golden Retriever,sporting,21447.0,12.04,medium,60.0,22.75
Chesapeake Bay Retriever,sporting,16697.0,9.48,large,67.5,23.5
Tibetan Terrier,non-sporting,20336.0,12.31,small,24.0,15.5
Gordon Setter,sporting,19605.0,11.1,large,62.5,25.0
Pug,toy,18527.0,11.0,medium,16.0,16.0
Norfolk Terrier,terrier,24308.0,13.07,small,12.0,9.5
English Toy Spaniel,toy,17521.0,10.1,small,11.0,10.0
Cavalier King Charles Spaniel,toy,18639.0,11.29,small,15.5,12.5
Basenji,hound,22096.0,13.58,medium,23.0,16.5
Staffordshire Bull Terrier,terrier,21650.0,12.05,medium,31.0,15.0
Pembroke Welsh Corgi,herding,23978.0,12.25,small,26.0,11.0
Clumber Spaniel,sporting,18084.0,10.0,medium,70.0,18.5
Dandie Dinmont Terrier,terrier,21633.0,12.17,small,21.0,9.0
Giant Schnauzer,working,26686.0,10.0,large,77.5,25.5
Scottish Terrier,terrier,17525.0,10.69,small,20.0,10.0
Kerry Blue Terrier,terrier,17240.0,9.4,medium,36.5,18.5
Afghan Hound,hound,24077.0,11.92,large,55.0,26.0
Newfoundland,working,19351.0,9.32,large,125.0,27.0
Rhodesian Ridgeback,hound,16530.0,9.1,large,77.5,25.5
Borzoi,hound,16176.0,9.08,large,82.5,28.0
Bull Terrier,terrier,18490.0,10.21,medium,60.0,21.5
Alaskan Malamute,working,21986.0,10.67,large,80.0,24.0
Bloodhound,hound,13824.0,6.75,large,85.0,25.0
Bullmastiff,working,13936.0,7.57,large,115.0,25.5
Mastiff,working,13581.0,6.5,large,175.0,30.0
Saint Bernard,working,20022.0,7.78,large,155.0,26.5
In [4]:
dogs = pd.read_csv('data/dogs42.csv') 
dogs
Out[4]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ... ...
39 Bullmastiff working 13936.0 7.57 large 115.0 25.5
40 Mastiff working 13581.0 6.50 large 175.0 30.0
41 Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 7 columns

Exploring our first DataFrame¶

  • The shape attribute returns the DataFrame's number of rows and columns.
    Sure, we can see 👀 that it says 42 rows x 7 columns below, but the shape attribute allows us to write code involving the number of rows/columns.
In [5]:
dogs
Out[5]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ... ...
39 Bullmastiff working 13936.0 7.57 large 115.0 25.5
40 Mastiff working 13581.0 6.50 large 175.0 30.0
41 Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 7 columns

In [6]:
# Note that the index – 0, 1, 2, ... – does **not** count as a column!
dogs.shape 
Out[6]:
(42, 7)
  • To extract the first or last few rows of a DataFrame, use the head or tail methods.
    Like most DataFrame methods, head and tail don't modify the original DataFrame!
In [7]:
dogs.head(3) 
Out[7]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
In [8]:
dogs.tail(2) 
Out[8]:
breed kind lifetime_cost longevity size weight height
40 Mastiff working 13581.0 6.50 large 175.0 30.0
41 Saint Bernard working 20022.0 7.78 large 155.0 26.5
  • To sort by a column, use the sort_values method.
    ascending=False is a keyword argument, meaning you need to specify the name of the argument to use it.
    You've seen some examples of this in the plotly part of Homework 1.
In [9]:
# Note that the index is no longer 0, 1, 2, ...!
dogs.sort_values('height', ascending=False) 
Out[9]:
breed kind lifetime_cost longevity size weight height
40 Mastiff working 13581.0 6.50 large 175.0 30.0
35 Borzoi hound 16176.0 9.08 large 82.5 28.0
33 Newfoundland working 19351.0 9.32 large 125.0 27.0
... ... ... ... ... ... ... ...
13 Maltese toy 19084.0 12.25 small 5.0 9.0
28 Dandie Dinmont Terrier terrier 21633.0 12.17 small 21.0 9.0
7 Chihuahua toy 26250.0 16.50 small 5.5 5.0

42 rows × 7 columns

Setting the index¶

  • Think of each row's index as its unique identifier or name.
    The default index when we create a DataFrame using pd.read_csv is 0, 1, 2, 3, ...


Think of the index of a DataFrame like a "key" in a dictionary (Python) or map (C++).

In [10]:
dogs
Out[10]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ... ...
39 Bullmastiff working 13936.0 7.57 large 115.0 25.5
40 Mastiff working 13581.0 6.50 large 175.0 30.0
41 Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 7 columns

In [11]:
dogs.index
Out[11]:
RangeIndex(start=0, stop=42, step=1)
  • Often, we like to set the index of a DataFrame to a unique identifier if we have one available.
    We can do so with the set_index method.
    We'll see the real benefit of this shortly.
In [12]:
dogs.set_index('breed') 
Out[12]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [13]:
# The above cell didn't involve an assignment statement, so dogs was unchanged.
dogs
Out[13]:
breed kind lifetime_cost longevity size weight height
0 Brittany sporting 22589.0 12.92 medium 35.0 19.0
1 Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
2 English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ... ...
39 Bullmastiff working 13936.0 7.57 large 115.0 25.5
40 Mastiff working 13581.0 6.50 large 175.0 30.0
41 Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 7 columns

In [14]:
# By reassigning dogs, our changes will persist.
# Note that we can't run this cell twice! Try it and see what happens.
dogs = dogs.set_index('breed')
dogs
Out[14]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [15]:
# There used to be 7 columns, but now there are only 6!
# The index is **not** a column!
dogs.shape
Out[15]:
(42, 6)
In [16]:
dogs.index
Out[16]:
Index(['Brittany', 'Cairn Terrier', 'English Cocker Spaniel', 'Cocker Spaniel',
       'Shetland Sheepdog', 'Siberian Husky', 'Miniature Schnauzer',
       'Chihuahua', 'English Springer Spaniel', 'German Shorthaired Pointer',
       'Pointer', 'Tibetan Spaniel', 'Labrador Retriever', 'Maltese',
       'Shih Tzu', 'Irish Setter', 'Golden Retriever',
       'Chesapeake Bay Retriever', 'Tibetan Terrier', 'Gordon Setter', 'Pug',
       'Norfolk Terrier', 'English Toy Spaniel',
       'Cavalier King Charles Spaniel', 'Basenji',
       'Staffordshire Bull Terrier', 'Pembroke Welsh Corgi', 'Clumber Spaniel',
       'Dandie Dinmont Terrier', 'Giant Schnauzer', 'Scottish Terrier',
       'Kerry Blue Terrier', 'Afghan Hound', 'Newfoundland',
       'Rhodesian Ridgeback', 'Borzoi', 'Bull Terrier', 'Alaskan Malamute',
       'Bloodhound', 'Bullmastiff', 'Mastiff', 'Saint Bernard'],
      dtype='object', name='breed')

Activity

Assign tallest_breed to the name, as a string, of the tallest breed in the dataset. Answer using pandas code, i.e. don't look at the dataset and hard-code the answer.

In [17]:
tallest_breed = dogs.sort_values('height', ascending=False).index[0] 
tallest_breed
Out[17]:
'Mastiff'
In [ ]:
 

Reference Slide¶

Displaying more rows/columns¶

  • Sometimes, you just want pandas to display a lot of rows and columns. You can use this helper function to do that.
In [18]:
def display_df(df, rows=pd.options.display.max_rows, cols=pd.options.display.max_columns):
    """Displays n rows and cols from df."""
    with pd.option_context("display.max_rows", rows,
                           "display.max_columns", cols):
        display(df)
In [19]:
display_df(dogs.sort_values('weight', ascending=False), rows=42)
kind lifetime_cost longevity size weight height
breed
Mastiff working 13581.0 6.50 large 175.0 30.00
Saint Bernard working 20022.0 7.78 large 155.0 26.50
Newfoundland working 19351.0 9.32 large 125.0 27.00
Bullmastiff working 13936.0 7.57 large 115.0 25.50
Bloodhound hound 13824.0 6.75 large 85.0 25.00
Borzoi hound 16176.0 9.08 large 82.5 28.00
Alaskan Malamute working 21986.0 10.67 large 80.0 24.00
Rhodesian Ridgeback hound 16530.0 9.10 large 77.5 25.50
Giant Schnauzer working 26686.0 10.00 large 77.5 25.50
Clumber Spaniel sporting 18084.0 10.00 medium 70.0 18.50
Labrador Retriever sporting 21299.0 12.04 medium 67.5 23.00
Chesapeake Bay Retriever sporting 16697.0 9.48 large 67.5 23.50
Irish Setter sporting 20323.0 11.63 large 65.0 26.00
German Shorthaired Pointer sporting 25842.0 11.46 large 62.5 24.00
Gordon Setter sporting 19605.0 11.10 large 62.5 25.00
Bull Terrier terrier 18490.0 10.21 medium 60.0 21.50
Golden Retriever sporting 21447.0 12.04 medium 60.0 22.75
Pointer sporting 24445.0 12.42 large 59.5 25.50
Afghan Hound hound 24077.0 11.92 large 55.0 26.00
Siberian Husky working 22049.0 12.58 medium 47.5 21.75
English Springer Spaniel sporting 21946.0 12.54 medium 45.0 19.50
Kerry Blue Terrier terrier 17240.0 9.40 medium 36.5 18.50
Brittany sporting 22589.0 12.92 medium 35.0 19.00
Staffordshire Bull Terrier terrier 21650.0 12.05 medium 31.0 15.00
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.00
Pembroke Welsh Corgi herding 23978.0 12.25 small 26.0 11.00
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.50
Tibetan Terrier non-sporting 20336.0 12.31 small 24.0 15.50
Basenji hound 22096.0 13.58 medium 23.0 16.50
Shetland Sheepdog herding 21006.0 12.53 small 22.0 14.50
Dandie Dinmont Terrier terrier 21633.0 12.17 small 21.0 9.00
Scottish Terrier terrier 17525.0 10.69 small 20.0 10.00
Pug toy 18527.0 11.00 medium 16.0 16.00
Miniature Schnauzer terrier 20087.0 11.81 small 15.5 13.00
Cavalier King Charles Spaniel toy 18639.0 11.29 small 15.5 12.50
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.00
Shih Tzu toy 21152.0 13.20 small 12.5 9.75
Tibetan Spaniel non-sporting 25549.0 14.42 small 12.0 10.00
Norfolk Terrier terrier 24308.0 13.07 small 12.0 9.50
English Toy Spaniel toy 17521.0 10.10 small 11.0 10.00
Chihuahua toy 26250.0 16.50 small 5.5 5.00
Maltese toy 19084.0 12.25 small 5.0 9.00

Selecting columns¶


In order to answer questions like:

What is the average 'longevity' of all breeds in the dataset?

we'll need to be able to access individual columns in the DataFrame.

Selecting columns with []¶

  • The most common way to select a subset of the columns in a DataFrame is by using the [] operator.


This is just like when we accessed values in a dictionary based on their key.

In [20]:
dogs
Out[20]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [21]:
# Returns a Series. Note the index appears again on the left!
dogs['kind'] 
Out[21]:
breed
Brittany                  sporting
Cairn Terrier              terrier
English Cocker Spaniel    sporting
                            ...   
Bullmastiff                working
Mastiff                    working
Saint Bernard              working
Name: kind, Length: 42, dtype: object
In [22]:
# Returns a DataFrame.
dogs[['kind', 'size']] 
Out[22]:
kind size
breed
Brittany sporting medium
Cairn Terrier terrier small
English Cocker Spaniel sporting medium
... ... ...
Bullmastiff working large
Mastiff working large
Saint Bernard working large

42 rows × 2 columns

  • As we've seen above, specifying a single column name returns the column as a Series; specifying a list of column names returns a DataFrame.
In [23]:
# 🤔
dogs[['kind']] 
Out[23]:
kind
breed
Brittany sporting
Cairn Terrier terrier
English Cocker Spaniel sporting
... ...
Bullmastiff working
Mastiff working
Saint Bernard working

42 rows × 1 columns

  • As an aside: when you get an error message in Python, the most informative part is usually at the bottom!
    So, if you're posting about your error on Ed, or debugging with us in office hours, show us the bottom first.
In [24]:
# Breeds are stored in the index, which is not a column!
dogs['breed'] 
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key)
   3789 try:
-> 3790     return self._engine.get_loc(casted_key)
   3791 except KeyError as err:

File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'breed'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[24], line 2
      1 # Breeds are stored in the index, which is not a column!
----> 2 dogs['breed']

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/frame.py:3896, in DataFrame.__getitem__(self, key)
   3894 if self.columns.nlevels > 1:
   3895     return self._getitem_multilevel(key)
-> 3896 indexer = self.columns.get_loc(key)
   3897 if is_integer(indexer):
   3898     indexer = [indexer]

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/indexes/base.py:3797, in Index.get_loc(self, key)
   3792     if isinstance(casted_key, slice) or (
   3793         isinstance(casted_key, abc.Iterable)
   3794         and any(isinstance(x, slice) for x in casted_key)
   3795     ):
   3796         raise InvalidIndexError(key)
-> 3797     raise KeyError(key) from err
   3798 except TypeError:
   3799     # If we have a listlike key, _check_indexing_error will raise
   3800     #  InvalidIndexError. Otherwise we fall through and re-raise
   3801     #  the TypeError.
   3802     self._check_indexing_error(key)

KeyError: 'breed'
In [25]:
dogs.index 
Out[25]:
Index(['Brittany', 'Cairn Terrier', 'English Cocker Spaniel', 'Cocker Spaniel',
       'Shetland Sheepdog', 'Siberian Husky', 'Miniature Schnauzer',
       'Chihuahua', 'English Springer Spaniel', 'German Shorthaired Pointer',
       'Pointer', 'Tibetan Spaniel', 'Labrador Retriever', 'Maltese',
       'Shih Tzu', 'Irish Setter', 'Golden Retriever',
       'Chesapeake Bay Retriever', 'Tibetan Terrier', 'Gordon Setter', 'Pug',
       'Norfolk Terrier', 'English Toy Spaniel',
       'Cavalier King Charles Spaniel', 'Basenji',
       'Staffordshire Bull Terrier', 'Pembroke Welsh Corgi', 'Clumber Spaniel',
       'Dandie Dinmont Terrier', 'Giant Schnauzer', 'Scottish Terrier',
       'Kerry Blue Terrier', 'Afghan Hound', 'Newfoundland',
       'Rhodesian Ridgeback', 'Borzoi', 'Bull Terrier', 'Alaskan Malamute',
       'Bloodhound', 'Bullmastiff', 'Mastiff', 'Saint Bernard'],
      dtype='object', name='breed')

Useful Series methods¶

  • A Series is like a 1D array, but with an index.
  • There are a variety of useful methods that work on Series; you can see the entire list here.
    Many methods that work on a Series will also work on entire DataFrames, as we'll soon see.
In [26]:
dogs
Out[26]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [27]:
# What are the unique kinds of dogs?
dogs['kind'].unique() 
Out[27]:
array(['sporting', 'terrier', 'herding', 'working', 'toy', 'non-sporting',
       'hound'], dtype=object)
In [28]:
# How many unique kinds of dogs are there?
dogs['kind'].nunique() 
Out[28]:
7
In [29]:
# What's the distribution of kinds?
# value_counts is super useful – and I love asking exam questions about it!
dogs['kind'].value_counts() 
Out[29]:
kind
sporting        12
terrier          8
working          7
toy              6
hound            5
herding          2
non-sporting     2
Name: count, dtype: int64
In [30]:
# What's the mean of the 'longevity' column?
dogs['longevity'].mean() 
Out[30]:
11.279285714285715
In [31]:
# Tell me more about the 'weight' column.
dogs['weight'].describe() 
Out[31]:
count     42.00
mean      50.17
std       39.52
          ...  
50%       40.75
75%       67.50
max      175.00
Name: weight, Length: 8, dtype: float64
In [32]:
# Sort the 'lifetime_cost' column. Note that here we're using sort_values on a Series, not a DataFrame!
dogs['lifetime_cost'].sort_values() 
Out[32]:
breed
Mastiff                       13581.0
Bloodhound                    13824.0
Bullmastiff                   13936.0
                               ...   
German Shorthaired Pointer    25842.0
Chihuahua                     26250.0
Giant Schnauzer               26686.0
Name: lifetime_cost, Length: 42, dtype: float64
In [33]:
# Gives us the index of the largest value, not the largest value itself.
# Note that this makes our Activity from a few slides ago way easier!
dogs['height'].idxmax() 
Out[33]:
'Mastiff'

Method chaining 🔗¶

  • When using pandas, it's common to use several method calls, one after another.
    We call this method chaining.
  • For instance, the line of code below finds the most common 'size' among the 10 heaviest breeds.
In [34]:
dogs.sort_values('weight', ascending=False).head(10)['size'].value_counts().idxmax()
Out[34]:
'large'
  • When writing code like this, it's important to:
    1. Write one method call at a time.
    2. Understand the data structures involved!
In [35]:
show_chaining_slides()
  • One tip: if a line of pandas code gets too long, spread it out over multiple lines by wrapping it in (parentheses).
In [36]:
(
    dogs
    .sort_values('weight', ascending=False)
    .head(10)
    ['size']
    .value_counts()
    .idxmax()
)
Out[36]:
'large'

Series support vectorized operations¶

  • Series operations are vectorized, just like with arrays.
  • When performing elementwise-operations involving multiple Series, pandas aligns the Series by their index.
In [37]:
x = pd.Series({'a': 1, 'b': 2})
x
Out[37]:
a    1
b    2
dtype: int64
In [38]:
x * 5
Out[38]:
a     5
b    10
dtype: int64
In [39]:
y = pd.Series({'b': 5, 'c': -1, 'a': 10})
y
Out[39]:
b     5
c    -1
a    10
dtype: int64
In [40]:
# If x and y were regular numpy arrays, this would error because of the size mismatch.
x + y
Out[40]:
a    11.0
b     7.0
c     NaN
dtype: float64
  • Example: If I adopt a puppy today, when should I expect them to live until?
In [41]:
dogs
Out[41]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [42]:
2025 + dogs['longevity'] 
Out[42]:
breed
Brittany                  2037.92
Cairn Terrier             2038.84
English Cocker Spaniel    2036.66
                           ...   
Bullmastiff               2032.57
Mastiff                   2031.50
Saint Bernard             2032.78
Name: longevity, Length: 42, dtype: float64
  • Example: What is the Body Mass Index (BMI) of each breed?
    Remember, 1 kg = 2.2 pounds and 1 inch = 2.54 cm = 0.0254 m.
In [43]:
dogs
Out[43]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [44]:
weight_kg = dogs['weight'] / 2.2
height_m = dogs['height'] * 2.54 / 100
bmis = weight_kg / (height_m ** 2)
bmis
Out[44]:
breed
Brittany                   68.31
Cairn Terrier              98.64
English Cocker Spaniel     82.56
                           ...  
Bullmastiff               124.60
Mastiff                   137.00
Saint Bernard             155.51
Length: 42, dtype: float64

Aside: Visualization 📊¶

  • We'll spend more time talking about when to create which types of visualizations in a few lectures.
  • But for now, you can start exploring how the DataFrame plot method works!
In [45]:
dogs.plot(kind='scatter', x='weight', y='longevity')
In [46]:
# Hover over a point and see what happens!
(
    dogs
    .reset_index()
    .plot(kind='scatter', x='weight', y='longevity', color='size', hover_name='breed',
          title='Longevity vs. Weight for 42 Dog Breeds')
)
In [47]:
(
    dogs['kind']
    .value_counts()
    .sort_values()
    .plot(kind='barh', title='Distribution of Dog Kinds')
)

Selecting slices 🍰¶


Now that we know how to access specific columns in a dataset, how do we access specific rows? Or even individual values?

No description has been provided for this image

Use loc to slice rows and columns using labels¶

  • loc stands for "location".
    I pronounce it "loke", like "broke", not "lock".
  • The loc indexer works similarly to slicing 2D arrays, but it uses row labels and column labels, not positions.
    Remember, the "index" refers to the row labels.
In [48]:
dogs
Out[48]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [49]:
# The first argument is the row label, i.e. the index value.
#        ↓
dogs.loc['Pug', 'longevity']
#                  ↑
# The second argument is the column label.
Out[49]:
11.0
  • loc is not a method – it's an indexer.
In [50]:
type(dogs.loc)
Out[50]:
pandas.core.indexing._LocIndexer
In [51]:
type(dogs.sort_values)
Out[51]:
method

loc is flexible 🧘¶

  • You can provide a sequence (list, array, Series) as either argument to loc.
In [52]:
dogs
Out[52]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [53]:
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'size']
Out[53]:
breed
Cocker Spaniel         small
Labrador Retriever    medium
Name: size, dtype: object
In [54]:
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], ['kind', 'size', 'height']]
Out[54]:
kind size height
breed
Cocker Spaniel sporting small 14.5
Labrador Retriever sporting medium 23.0
In [55]:
# Note that the 'weight' column is included!
# loc, per the pandas documentation, is inclusive of both slicer endpoints.
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'lifetime_cost': 'weight']
Out[55]:
lifetime_cost longevity size weight
breed
Cocker Spaniel 24330.0 12.50 small 25.0
Labrador Retriever 21299.0 12.04 medium 67.5
In [56]:
dogs.loc[['Cocker Spaniel', 'Labrador Retriever']]
Out[56]:
kind lifetime_cost longevity size weight height
breed
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
Labrador Retriever sporting 21299.0 12.04 medium 67.5 23.0

Use iloc to slice rows and columns using positions¶

  • iloc stands for "integer location."
  • iloc is like loc, but it selects rows and columns based off of integer positions only, just like with 2D arrays.
In [57]:
dogs
Out[57]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [58]:
# Try removing the iloc and see what happens!
dogs.iloc[1:15, :-2] 
Out[58]:
kind lifetime_cost longevity size
breed
Cairn Terrier terrier 21992.0 13.84 small
English Cocker Spaniel sporting 18993.0 11.66 medium
Cocker Spaniel sporting 24330.0 12.50 small
... ... ... ... ...
Labrador Retriever sporting 21299.0 12.04 medium
Maltese toy 19084.0 12.25 small
Shih Tzu toy 21152.0 13.20 small

14 rows × 4 columns

  • iloc is often most useful when we sort first. For instance, to find the weight of the longest-living breed in the dataset:
In [59]:
dogs.sort_values('longevity', ascending=False)['weight'].iloc[0] 
Out[59]:
5.5
In [60]:
# Finding the breed itself involves sorting, but not iloc, since breeds are stored in the index.
dogs.sort_values('longevity', ascending=False).index[0] 
Out[60]:
'Chihuahua'

Activity

Among just the following four breeds, what is the height of the second tallest breed?

  • Cocker Spaniel.
  • Labrador Retriever.
  • Irish Setter.
  • Newfoundland.

Assign your answer to second_tallest_height below. Answer using pandas code, i.e. don't look at the dataset and hard-code the answer.

If we don't have time to try this activity in lecture, the answer is posted in lec04-filled.ipynb and in the "filled html" link on the course website.

In [61]:
second_tallest_height = (
    dogs
    .loc[['Cocker Spaniel', 'Labrador Retriever', 'Newfoundland', 'Irish Setter'], 'height']
    .sort_values(ascending=False)
    .iloc[1]
)
second_tallest_height
Out[61]:
26.0
In [ ]:
 

Querying 🔎¶


Okay, but what if we don't know anything about the position or index of a row we're looking for? How do we find rows that satisfy certain conditions?

Reflection¶

  • So far, all of the questions we've been able to answer involved all of the rows in the dataset.


What's the weight of the longest-living breed? What's the average lifetime cost of all breeds? Which breed is third heaviest?

  • We don't yet have a mechanism to answer questions about a specific subset of the dataset.


How many terriers are there? What's the average longevity of medium-sized breeds?

Querying¶

  • Querying is the act selecting rows in a DataFrame that satisfy certain condition(s).
    We sometimes call this "filtering", but filtering also has another meaning, which we'll see next week.
  • As we saw in Lecture 3, comparisons with arrays result in Boolean arrays.
    The same is true for Series – make a comparison with a Series, and the result is a Boolean Series!
  • We can use comparisons along with the loc operator to select specific rows from a DataFrame.
In [62]:
dogs
Out[62]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [63]:
dogs['kind'] == 'terrier' 
Out[63]:
breed
Brittany                  False
Cairn Terrier              True
English Cocker Spaniel    False
                          ...  
Bullmastiff               False
Mastiff                   False
Saint Bernard             False
Name: kind, Length: 42, dtype: bool
In [64]:
dogs.loc[dogs['kind'] == 'terrier'] 
Out[64]:
kind lifetime_cost longevity size weight height
breed
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
Miniature Schnauzer terrier 20087.0 11.81 small 15.5 13.0
Norfolk Terrier terrier 24308.0 13.07 small 12.0 9.5
... ... ... ... ... ... ...
Scottish Terrier terrier 17525.0 10.69 small 20.0 10.0
Kerry Blue Terrier terrier 17240.0 9.40 medium 36.5 18.5
Bull Terrier terrier 18490.0 10.21 medium 60.0 21.5

8 rows × 6 columns

  • Example: How many breeds live to be at least 10 years old?
In [65]:
dogs.loc[dogs['longevity'] >= 10].shape[0] 
Out[65]:
33
  • Since querying is so common, there's a shortcut – loc isn't necessary.
In [66]:
dogs[dogs['longevity'] >= 10] 
Out[66]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Afghan Hound hound 24077.0 11.92 large 55.0 26.0
Bull Terrier terrier 18490.0 10.21 medium 60.0 21.5
Alaskan Malamute working 21986.0 10.67 large 80.0 24.0

33 rows × 6 columns

  • Example: Among all breeds with 'Retriever' in the name, which is the second tallest?
In [67]:
# Since we're selecting both rows AND columns, we do need loc here.
(
    dogs.loc[dogs.index.str.contains('Retriever'), 'height']
    .sort_values(ascending=False)
    .index[1]
)
Out[67]:
'Labrador Retriever'
  • Example: Are there any 'beaver' kinds?
In [68]:
# Empty DataFrame – not an error!
dogs[dogs['kind'] == 'beaver'] 
Out[68]:
kind lifetime_cost longevity size weight height
breed
  • Example: Show me all rows for 'Golden Retriever's.
    Note that because we set the index to 'breed' earlier, we can select rows based on dog breeds without having to query. If 'breed' was instead a column, then we'd need to query to access information about a particular breed.
In [69]:
dogs.loc['Golden Retriever'] 
Out[69]:
kind             sporting
lifetime_cost     21447.0
longevity           12.04
size               medium
weight               60.0
height              22.75
Name: Golden Retriever, dtype: object
  • Example: Show me all of the rows for 'sporting' or 'working' breeds.
    If using multiple conditions, you need parentheses around each condition!
    Also, you must use the bitwise operators & and | instead of the standard and and or keywords, as we saw in Lecture 3.
In [70]:
dogs[(dogs['kind'] == 'sporting') | (dogs['kind'] == 'working')]
Out[70]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

19 rows × 6 columns

In [71]:
# Equivalent to the above!
dogs[dogs['kind'].isin(['sporting', 'working'])]
Out[71]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Cocker Spaniel sporting 24330.0 12.50 small 25.0 14.5
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

19 rows × 6 columns

Reference Slide¶

The query method¶

The DataFrame query method is a convenient way to query, since you don't need parentheses and you can use the and and or keywords.

In [72]:
dogs
Out[72]:
kind lifetime_cost longevity size weight height
breed
Brittany sporting 22589.0 12.92 medium 35.0 19.0
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
... ... ... ... ... ... ...
Bullmastiff working 13936.0 7.57 large 115.0 25.5
Mastiff working 13581.0 6.50 large 175.0 30.0
Saint Bernard working 20022.0 7.78 large 155.0 26.5

42 rows × 6 columns

In [73]:
dogs.query('weight < 20 and kind == "terrier"')
Out[73]:
kind lifetime_cost longevity size weight height
breed
Cairn Terrier terrier 21992.0 13.84 small 14.0 10.0
Miniature Schnauzer terrier 20087.0 11.81 small 15.5 13.0
Norfolk Terrier terrier 24308.0 13.07 small 12.0 9.5
In [74]:
dogs.query('kind in ["sporting", "terrier"] and lifetime_cost < 20000')
Out[74]:
kind lifetime_cost longevity size weight height
breed
English Cocker Spaniel sporting 18993.0 11.66 medium 30.0 16.0
Chesapeake Bay Retriever sporting 16697.0 9.48 large 67.5 23.5
Gordon Setter sporting 19605.0 11.10 large 62.5 25.0
Clumber Spaniel sporting 18084.0 10.00 medium 70.0 18.5
Scottish Terrier terrier 17525.0 10.69 small 20.0 10.0
Kerry Blue Terrier terrier 17240.0 9.40 medium 36.5 18.5
Bull Terrier terrier 18490.0 10.21 medium 60.0 21.5

Reference Slide¶

More practice¶

  • For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We won't cover these in class, but you should try them out yourself. Here's a Pandas Tutor link to visualize these!
In [75]:
jack = pd.DataFrame({1: ['fee', 'fi'], 
                     '1': ['fo', 'fum']})
jack
Out[75]:
1 1
0 fee fo
1 fi fum
In [76]:
jack[1]
Out[76]:
0    fee
1     fi
Name: 1, dtype: object
In [77]:
jack[[1]]
Out[77]:
1
0 fee
1 fi
In [78]:
jack['1']
Out[78]:
0     fo
1    fum
Name: 1, dtype: object
In [79]:
jack[[1, 1]]
Out[79]:
1 1
0 fee fee
1 fi fi
In [80]:
jack.loc[1]
Out[80]:
1     fi
1    fum
Name: 1, dtype: object
In [81]:
jack.loc[jack[1] == 'fo']
Out[81]:
1 1
In [82]:
jack[1, ['1', 1]]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key)
   3789 try:
-> 3790     return self._engine.get_loc(casted_key)
   3791 except KeyError as err:

File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:158, in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(1, ['1', 1])' is an invalid key

During handling of the above exception, another exception occurred:

InvalidIndexError                         Traceback (most recent call last)
Cell In[82], line 1
----> 1 jack[1, ['1', 1]]

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/frame.py:3896, in DataFrame.__getitem__(self, key)
   3894 if self.columns.nlevels > 1:
   3895     return self._getitem_multilevel(key)
-> 3896 indexer = self.columns.get_loc(key)
   3897 if is_integer(indexer):
   3898     indexer = [indexer]

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key)
   3797     raise KeyError(key) from err
   3798 except TypeError:
   3799     # If we have a listlike key, _check_indexing_error will raise
   3800     #  InvalidIndexError. Otherwise we fall through and re-raise
   3801     #  the TypeError.
-> 3802     self._check_indexing_error(key)
   3803     raise

File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/indexes/base.py:5974, in Index._check_indexing_error(self, key)
   5970 def _check_indexing_error(self, key):
   5971     if not is_scalar(key):
   5972         # if key is not a scalar, directly raise an error (the code below
   5973         # would convert to numpy arrays and raise later any way) - GH29926
-> 5974         raise InvalidIndexError(key)

InvalidIndexError: (1, ['1', 1])
In [83]:
jack.loc[1, 1]
Out[83]:
'fi'

What's next?¶

  • Suppose we want to find the average 'longevity' of each 'kind'.
In [84]:
dogs['kind'].unique()
Out[84]:
array(['sporting', 'terrier', 'herding', 'working', 'toy', 'non-sporting',
       'hound'], dtype=object)
  • Using our current toolkit, we'd:
    1. Find all unique values in the 'kind' column.
    2. For each one, query the rows for just that 'kind', extract the 'longevity' column, and take the mean.
In [85]:
dogs.loc[dogs['kind'] == 'sporting', 'longevity'].mean()
Out[85]:
11.649166666666668
In [86]:
dogs.loc[dogs['kind'] == 'terrier', 'longevity'].mean()
Out[86]:
11.655
  • This can get time-consuming! How can we compute all 7 of these means with a single line of code?