from lec_utils import *
def show_chaining_slides():
src = 'https://docs.google.com/presentation/d/e/2PACX-1vRYAERthJoyVnD1JymDK6JAtufCJmA5AYb5H2NwlegJHm04WhBfxnA0zQO3vKbEYQbqOJ8XJPZtfoxF/embed?start=false&loop=false&rm=minimal'
width = 1000
height = 590
display(IFrame(src, width, height))
Lecture 4¶
DataFrame Fundamentals¶
EECS 398: Practical Data Science, Winter 2025¶
practicaldsc.org • github.com/practicaldsc/wn25 • 📣 See latest announcements here on Ed
Agenda 📆¶
- Introduction to
pandas
DataFrames. - Selecting columns.
- Selecting slices 🍰.
- Querying 🔎.
Remember to follow along in lecture by accessing the "blank" lecture notebook in our public GitHub repository.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!Have you started Homework 2?
Introduction to pandas
DataFrames¶
Let's finally start working with real datasets! 🎉
Note that we're going to cover a lot of code quickly. The point of lecture is to expose you to what's possible; you can look at the notebook later for the details.
pandas
¶
pandas
is the Python library for tabular data manipulation.
It, roughly, stands for "Python data analysis".
- Before
pandas
was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.
- Wes McKinney, the original developer of
pandas
, wanted a library which would allow everything to be done in Python.
Python is faster to develop in than Java or C++, and is more general-purpose than R.
pandas
is almost always imported in conjunction withnumpy
.
import pandas as pd
import numpy as np
pandas
data structures¶
- There are three key data structures at the core of
pandas
.
- DataFrame: 2 dimensional tables. These have rows and columns.
- Series: 1 dimensional array-like object, representing a row or column.
Like arrays, Series contain data of the same type. The plural of Series is also Series.
- Index: Sequence of row or column labels. When we say "the index", we're referring to the sequence of row labels.
The index –'lebronja'
,'obammich'
,'carpents'
, and'timapplec'
in the example above – is not a column!
Column names –'name'
,'program'
, and'year'
in the example above – are stored as strings, and the sequence of column names is also an index.
Example: Dog breeds 🐶¶
- The dataset we'll work comes from the American Kennel Club. Here's a cool plot made using our dataset.
- We'll usually work with data stored in the CSV format. CSV stands for "comma-separated values."
- We can read in a CSV using
pd.read_csv(path)
. The path should be relative to your notebook; if the file is in the same folder as your notebook, this is just the name of the file (as a string).
- Today's dataset is stored
'data/dogs42.csv'
– open it up and see what it looks like!
# The "cat" shell command shows you the contents of a file.
!cat data/dogs42.csv
breed,kind,lifetime_cost,longevity,size,weight,height Brittany,sporting,22589.0,12.92,medium,35.0,19.0 Cairn Terrier,terrier,21992.0,13.84,small,14.0,10.0 English Cocker Spaniel,sporting,18993.0,11.66,medium,30.0,16.0 Cocker Spaniel,sporting,24330.0,12.5,small,25.0,14.5 Shetland Sheepdog,herding,21006.0,12.53,small,22.0,14.5 Siberian Husky,working,22049.0,12.58,medium,47.5,21.75 Miniature Schnauzer,terrier,20087.0,11.81,small,15.5,13.0 Chihuahua,toy,26250.0,16.5,small,5.5,5.0 English Springer Spaniel,sporting,21946.0,12.54,medium,45.0,19.5 German Shorthaired Pointer,sporting,25842.0,11.46,large,62.5,24.0 Pointer,sporting,24445.0,12.42,large,59.5,25.5 Tibetan Spaniel,non-sporting,25549.0,14.42,small,12.0,10.0 Labrador Retriever,sporting,21299.0,12.04,medium,67.5,23.0 Maltese,toy,19084.0,12.25,small,5.0,9.0 Shih Tzu,toy,21152.0,13.2,small,12.5,9.75 Irish Setter,sporting,20323.0,11.63,large,65.0,26.0 Golden Retriever,sporting,21447.0,12.04,medium,60.0,22.75 Chesapeake Bay Retriever,sporting,16697.0,9.48,large,67.5,23.5 Tibetan Terrier,non-sporting,20336.0,12.31,small,24.0,15.5 Gordon Setter,sporting,19605.0,11.1,large,62.5,25.0 Pug,toy,18527.0,11.0,medium,16.0,16.0 Norfolk Terrier,terrier,24308.0,13.07,small,12.0,9.5 English Toy Spaniel,toy,17521.0,10.1,small,11.0,10.0 Cavalier King Charles Spaniel,toy,18639.0,11.29,small,15.5,12.5 Basenji,hound,22096.0,13.58,medium,23.0,16.5 Staffordshire Bull Terrier,terrier,21650.0,12.05,medium,31.0,15.0 Pembroke Welsh Corgi,herding,23978.0,12.25,small,26.0,11.0 Clumber Spaniel,sporting,18084.0,10.0,medium,70.0,18.5 Dandie Dinmont Terrier,terrier,21633.0,12.17,small,21.0,9.0 Giant Schnauzer,working,26686.0,10.0,large,77.5,25.5 Scottish Terrier,terrier,17525.0,10.69,small,20.0,10.0 Kerry Blue Terrier,terrier,17240.0,9.4,medium,36.5,18.5 Afghan Hound,hound,24077.0,11.92,large,55.0,26.0 Newfoundland,working,19351.0,9.32,large,125.0,27.0 Rhodesian Ridgeback,hound,16530.0,9.1,large,77.5,25.5 Borzoi,hound,16176.0,9.08,large,82.5,28.0 Bull Terrier,terrier,18490.0,10.21,medium,60.0,21.5 Alaskan Malamute,working,21986.0,10.67,large,80.0,24.0 Bloodhound,hound,13824.0,6.75,large,85.0,25.0 Bullmastiff,working,13936.0,7.57,large,115.0,25.5 Mastiff,working,13581.0,6.5,large,175.0,30.0 Saint Bernard,working,20022.0,7.78,large,155.0,26.5
dogs = pd.read_csv('data/dogs42.csv')
dogs
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... |
39 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
40 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
41 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 7 columns
Exploring our first DataFrame¶
- The
shape
attribute returns the DataFrame's number of rows and columns.
Sure, we can see 👀 that it says 42 rows x 7 columns below, but theshape
attribute allows us to write code involving the number of rows/columns.
dogs
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... |
39 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
40 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
41 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 7 columns
# Note that the index – 0, 1, 2, ... – does **not** count as a column!
dogs.shape
(42, 7)
- To extract the first or last few rows of a DataFrame, use the
head
ortail
methods.
Like most DataFrame methods,head
andtail
don't modify the original DataFrame!
dogs.head(3)
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
dogs.tail(2)
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
40 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
41 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
- To sort by a column, use the
sort_values
method.ascending=False
is a keyword argument, meaning you need to specify the name of the argument to use it.
You've seen some examples of this in theplotly
part of Homework 1.
# Note that the index is no longer 0, 1, 2, ...!
dogs.sort_values('height', ascending=False)
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
40 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
35 | Borzoi | hound | 16176.0 | 9.08 | large | 82.5 | 28.0 |
33 | Newfoundland | working | 19351.0 | 9.32 | large | 125.0 | 27.0 |
... | ... | ... | ... | ... | ... | ... | ... |
13 | Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.0 |
28 | Dandie Dinmont Terrier | terrier | 21633.0 | 12.17 | small | 21.0 | 9.0 |
7 | Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.0 |
42 rows × 7 columns
Setting the index¶
- Think of each row's index as its unique identifier or name.
The default index when we create a DataFrame usingpd.read_csv
is 0, 1, 2, 3, ...
Think of the index of a DataFrame like a "key" in a dictionary (Python) or map (C++).
dogs
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... |
39 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
40 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
41 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 7 columns
dogs.index
RangeIndex(start=0, stop=42, step=1)
- Often, we like to set the index of a DataFrame to a unique identifier if we have one available.
We can do so with theset_index
method.
We'll see the real benefit of this shortly.
dogs.set_index('breed')
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
# The above cell didn't involve an assignment statement, so dogs was unchanged.
dogs
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... |
39 | Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
40 | Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
41 | Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 7 columns
# By reassigning dogs, our changes will persist.
# Note that we can't run this cell twice! Try it and see what happens.
dogs = dogs.set_index('breed')
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
# There used to be 7 columns, but now there are only 6!
# The index is **not** a column!
dogs.shape
(42, 6)
dogs.index
Index(['Brittany', 'Cairn Terrier', 'English Cocker Spaniel', 'Cocker Spaniel', 'Shetland Sheepdog', 'Siberian Husky', 'Miniature Schnauzer', 'Chihuahua', 'English Springer Spaniel', 'German Shorthaired Pointer', 'Pointer', 'Tibetan Spaniel', 'Labrador Retriever', 'Maltese', 'Shih Tzu', 'Irish Setter', 'Golden Retriever', 'Chesapeake Bay Retriever', 'Tibetan Terrier', 'Gordon Setter', 'Pug', 'Norfolk Terrier', 'English Toy Spaniel', 'Cavalier King Charles Spaniel', 'Basenji', 'Staffordshire Bull Terrier', 'Pembroke Welsh Corgi', 'Clumber Spaniel', 'Dandie Dinmont Terrier', 'Giant Schnauzer', 'Scottish Terrier', 'Kerry Blue Terrier', 'Afghan Hound', 'Newfoundland', 'Rhodesian Ridgeback', 'Borzoi', 'Bull Terrier', 'Alaskan Malamute', 'Bloodhound', 'Bullmastiff', 'Mastiff', 'Saint Bernard'], dtype='object', name='breed')
Activity
Assign tallest_breed
to the name, as a string, of the tallest breed in the dataset. Answer using pandas
code, i.e. don't look at the dataset and hard-code the answer.
tallest_breed = dogs.sort_values('height', ascending=False).index[0]
tallest_breed
'Mastiff'
- Sometimes, you just want
pandas
to display a lot of rows and columns. You can use this helper function to do that.
def display_df(df, rows=pd.options.display.max_rows, cols=pd.options.display.max_columns):
"""Displays n rows and cols from df."""
with pd.option_context("display.max_rows", rows,
"display.max_columns", cols):
display(df)
display_df(dogs.sort_values('weight', ascending=False), rows=42)
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.00 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.50 |
Newfoundland | working | 19351.0 | 9.32 | large | 125.0 | 27.00 |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.50 |
Bloodhound | hound | 13824.0 | 6.75 | large | 85.0 | 25.00 |
Borzoi | hound | 16176.0 | 9.08 | large | 82.5 | 28.00 |
Alaskan Malamute | working | 21986.0 | 10.67 | large | 80.0 | 24.00 |
Rhodesian Ridgeback | hound | 16530.0 | 9.10 | large | 77.5 | 25.50 |
Giant Schnauzer | working | 26686.0 | 10.00 | large | 77.5 | 25.50 |
Clumber Spaniel | sporting | 18084.0 | 10.00 | medium | 70.0 | 18.50 |
Labrador Retriever | sporting | 21299.0 | 12.04 | medium | 67.5 | 23.00 |
Chesapeake Bay Retriever | sporting | 16697.0 | 9.48 | large | 67.5 | 23.50 |
Irish Setter | sporting | 20323.0 | 11.63 | large | 65.0 | 26.00 |
German Shorthaired Pointer | sporting | 25842.0 | 11.46 | large | 62.5 | 24.00 |
Gordon Setter | sporting | 19605.0 | 11.10 | large | 62.5 | 25.00 |
Bull Terrier | terrier | 18490.0 | 10.21 | medium | 60.0 | 21.50 |
Golden Retriever | sporting | 21447.0 | 12.04 | medium | 60.0 | 22.75 |
Pointer | sporting | 24445.0 | 12.42 | large | 59.5 | 25.50 |
Afghan Hound | hound | 24077.0 | 11.92 | large | 55.0 | 26.00 |
Siberian Husky | working | 22049.0 | 12.58 | medium | 47.5 | 21.75 |
English Springer Spaniel | sporting | 21946.0 | 12.54 | medium | 45.0 | 19.50 |
Kerry Blue Terrier | terrier | 17240.0 | 9.40 | medium | 36.5 | 18.50 |
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.00 |
Staffordshire Bull Terrier | terrier | 21650.0 | 12.05 | medium | 31.0 | 15.00 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.00 |
Pembroke Welsh Corgi | herding | 23978.0 | 12.25 | small | 26.0 | 11.00 |
Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.50 |
Tibetan Terrier | non-sporting | 20336.0 | 12.31 | small | 24.0 | 15.50 |
Basenji | hound | 22096.0 | 13.58 | medium | 23.0 | 16.50 |
Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 | 14.50 |
Dandie Dinmont Terrier | terrier | 21633.0 | 12.17 | small | 21.0 | 9.00 |
Scottish Terrier | terrier | 17525.0 | 10.69 | small | 20.0 | 10.00 |
Pug | toy | 18527.0 | 11.00 | medium | 16.0 | 16.00 |
Miniature Schnauzer | terrier | 20087.0 | 11.81 | small | 15.5 | 13.00 |
Cavalier King Charles Spaniel | toy | 18639.0 | 11.29 | small | 15.5 | 12.50 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.00 |
Shih Tzu | toy | 21152.0 | 13.20 | small | 12.5 | 9.75 |
Tibetan Spaniel | non-sporting | 25549.0 | 14.42 | small | 12.0 | 10.00 |
Norfolk Terrier | terrier | 24308.0 | 13.07 | small | 12.0 | 9.50 |
English Toy Spaniel | toy | 17521.0 | 10.10 | small | 11.0 | 10.00 |
Chihuahua | toy | 26250.0 | 16.50 | small | 5.5 | 5.00 |
Maltese | toy | 19084.0 | 12.25 | small | 5.0 | 9.00 |
Selecting columns¶
In order to answer questions like:
What is the average
'longevity'
of all breeds in the dataset?
we'll need to be able to access individual columns in the DataFrame.
Selecting columns with []
¶
- The most common way to select a subset of the columns in a DataFrame is by using the
[]
operator.
This is just like when we accessed values in a dictionary based on their key.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
# Returns a Series. Note the index appears again on the left!
dogs['kind']
breed Brittany sporting Cairn Terrier terrier English Cocker Spaniel sporting ... Bullmastiff working Mastiff working Saint Bernard working Name: kind, Length: 42, dtype: object
# Returns a DataFrame.
dogs[['kind', 'size']]
kind | size | |
---|---|---|
breed | ||
Brittany | sporting | medium |
Cairn Terrier | terrier | small |
English Cocker Spaniel | sporting | medium |
... | ... | ... |
Bullmastiff | working | large |
Mastiff | working | large |
Saint Bernard | working | large |
42 rows × 2 columns
- As we've seen above, specifying a single column name returns the column as a Series; specifying a list of column names returns a DataFrame.
# 🤔
dogs[['kind']]
kind | |
---|---|
breed | |
Brittany | sporting |
Cairn Terrier | terrier |
English Cocker Spaniel | sporting |
... | ... |
Bullmastiff | working |
Mastiff | working |
Saint Bernard | working |
42 rows × 1 columns
- As an aside: when you get an error message in Python, the most informative part is usually at the bottom!
So, if you're posting about your error on Ed, or debugging with us in office hours, show us the bottom first.
# Breeds are stored in the index, which is not a column!
dogs['breed']
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key) 3789 try: -> 3790 return self._engine.get_loc(casted_key) 3791 except KeyError as err: File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc() File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc() File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item() File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'breed' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) Cell In[24], line 2 1 # Breeds are stored in the index, which is not a column! ----> 2 dogs['breed'] File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/frame.py:3896, in DataFrame.__getitem__(self, key) 3894 if self.columns.nlevels > 1: 3895 return self._getitem_multilevel(key) -> 3896 indexer = self.columns.get_loc(key) 3897 if is_integer(indexer): 3898 indexer = [indexer] File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/indexes/base.py:3797, in Index.get_loc(self, key) 3792 if isinstance(casted_key, slice) or ( 3793 isinstance(casted_key, abc.Iterable) 3794 and any(isinstance(x, slice) for x in casted_key) 3795 ): 3796 raise InvalidIndexError(key) -> 3797 raise KeyError(key) from err 3798 except TypeError: 3799 # If we have a listlike key, _check_indexing_error will raise 3800 # InvalidIndexError. Otherwise we fall through and re-raise 3801 # the TypeError. 3802 self._check_indexing_error(key) KeyError: 'breed'
dogs.index
Index(['Brittany', 'Cairn Terrier', 'English Cocker Spaniel', 'Cocker Spaniel', 'Shetland Sheepdog', 'Siberian Husky', 'Miniature Schnauzer', 'Chihuahua', 'English Springer Spaniel', 'German Shorthaired Pointer', 'Pointer', 'Tibetan Spaniel', 'Labrador Retriever', 'Maltese', 'Shih Tzu', 'Irish Setter', 'Golden Retriever', 'Chesapeake Bay Retriever', 'Tibetan Terrier', 'Gordon Setter', 'Pug', 'Norfolk Terrier', 'English Toy Spaniel', 'Cavalier King Charles Spaniel', 'Basenji', 'Staffordshire Bull Terrier', 'Pembroke Welsh Corgi', 'Clumber Spaniel', 'Dandie Dinmont Terrier', 'Giant Schnauzer', 'Scottish Terrier', 'Kerry Blue Terrier', 'Afghan Hound', 'Newfoundland', 'Rhodesian Ridgeback', 'Borzoi', 'Bull Terrier', 'Alaskan Malamute', 'Bloodhound', 'Bullmastiff', 'Mastiff', 'Saint Bernard'], dtype='object', name='breed')
Useful Series methods¶
- A Series is like a 1D array, but with an index.
- There are a variety of useful methods that work on Series; you can see the entire list here.
Many methods that work on a Series will also work on entire DataFrames, as we'll soon see.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
# What are the unique kinds of dogs?
dogs['kind'].unique()
array(['sporting', 'terrier', 'herding', 'working', 'toy', 'non-sporting', 'hound'], dtype=object)
# How many unique kinds of dogs are there?
dogs['kind'].nunique()
7
# What's the distribution of kinds?
# value_counts is super useful – and I love asking exam questions about it!
dogs['kind'].value_counts()
kind sporting 12 terrier 8 working 7 toy 6 hound 5 herding 2 non-sporting 2 Name: count, dtype: int64
# What's the mean of the 'longevity' column?
dogs['longevity'].mean()
11.279285714285715
# Tell me more about the 'weight' column.
dogs['weight'].describe()
count 42.00 mean 50.17 std 39.52 ... 50% 40.75 75% 67.50 max 175.00 Name: weight, Length: 8, dtype: float64
# Sort the 'lifetime_cost' column. Note that here we're using sort_values on a Series, not a DataFrame!
dogs['lifetime_cost'].sort_values()
breed Mastiff 13581.0 Bloodhound 13824.0 Bullmastiff 13936.0 ... German Shorthaired Pointer 25842.0 Chihuahua 26250.0 Giant Schnauzer 26686.0 Name: lifetime_cost, Length: 42, dtype: float64
# Gives us the index of the largest value, not the largest value itself.
# Note that this makes our Activity from a few slides ago way easier!
dogs['height'].idxmax()
'Mastiff'
Method chaining 🔗¶
- When using
pandas
, it's common to use several method calls, one after another.
We call this method chaining.
- For instance, the line of code below finds the most common
'size'
among the 10 heaviest breeds.
dogs.sort_values('weight', ascending=False).head(10)['size'].value_counts().idxmax()
'large'
- When writing code like this, it's important to:
- Write one method call at a time.
- Understand the data structures involved!
show_chaining_slides()
- One tip: if a line of
pandas
code gets too long, spread it out over multiple lines by wrapping it in(
parentheses)
.
(
dogs
.sort_values('weight', ascending=False)
.head(10)
['size']
.value_counts()
.idxmax()
)
'large'
Series support vectorized operations¶
- Series operations are vectorized, just like with arrays.
- When performing elementwise-operations involving multiple Series,
pandas
aligns the Series by their index.
x = pd.Series({'a': 1, 'b': 2})
x
a 1 b 2 dtype: int64
x * 5
a 5 b 10 dtype: int64
y = pd.Series({'b': 5, 'c': -1, 'a': 10})
y
b 5 c -1 a 10 dtype: int64
# If x and y were regular numpy arrays, this would error because of the size mismatch.
x + y
a 11.0 b 7.0 c NaN dtype: float64
- Example: If I adopt a puppy today, when should I expect them to live until?
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
2025 + dogs['longevity']
breed Brittany 2037.92 Cairn Terrier 2038.84 English Cocker Spaniel 2036.66 ... Bullmastiff 2032.57 Mastiff 2031.50 Saint Bernard 2032.78 Name: longevity, Length: 42, dtype: float64
- Example: What is the Body Mass Index (BMI) of each breed?
Remember, 1 kg = 2.2 pounds and 1 inch = 2.54 cm = 0.0254 m.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
weight_kg = dogs['weight'] / 2.2
height_m = dogs['height'] * 2.54 / 100
bmis = weight_kg / (height_m ** 2)
bmis
breed Brittany 68.31 Cairn Terrier 98.64 English Cocker Spaniel 82.56 ... Bullmastiff 124.60 Mastiff 137.00 Saint Bernard 155.51 Length: 42, dtype: float64
Aside: Visualization 📊¶
- We'll spend more time talking about when to create which types of visualizations in a few lectures.
- But for now, you can start exploring how the DataFrame
plot
method works!
dogs.plot(kind='scatter', x='weight', y='longevity')
# Hover over a point and see what happens!
(
dogs
.reset_index()
.plot(kind='scatter', x='weight', y='longevity', color='size', hover_name='breed',
title='Longevity vs. Weight for 42 Dog Breeds')
)
(
dogs['kind']
.value_counts()
.sort_values()
.plot(kind='barh', title='Distribution of Dog Kinds')
)
Selecting slices 🍰¶
Now that we know how to access specific columns in a dataset, how do we access specific rows? Or even individual values?
Use loc
to slice rows and columns using labels¶
loc
stands for "location".
I pronounce it "loke", like "broke", not "lock".
- The
loc
indexer works similarly to slicing 2D arrays, but it uses row labels and column labels, not positions.
Remember, the "index" refers to the row labels.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
# The first argument is the row label, i.e. the index value.
# ↓
dogs.loc['Pug', 'longevity']
# ↑
# The second argument is the column label.
11.0
loc
is not a method – it's an indexer.
type(dogs.loc)
pandas.core.indexing._LocIndexer
type(dogs.sort_values)
method
loc
is flexible 🧘¶
- You can provide a sequence (list, array, Series) as either argument to
loc
.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'size']
breed Cocker Spaniel small Labrador Retriever medium Name: size, dtype: object
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], ['kind', 'size', 'height']]
kind | size | height | |
---|---|---|---|
breed | |||
Cocker Spaniel | sporting | small | 14.5 |
Labrador Retriever | sporting | medium | 23.0 |
# Note that the 'weight' column is included!
# loc, per the pandas documentation, is inclusive of both slicer endpoints.
dogs.loc[['Cocker Spaniel', 'Labrador Retriever'], 'lifetime_cost': 'weight']
lifetime_cost | longevity | size | weight | |
---|---|---|---|---|
breed | ||||
Cocker Spaniel | 24330.0 | 12.50 | small | 25.0 |
Labrador Retriever | 21299.0 | 12.04 | medium | 67.5 |
dogs.loc[['Cocker Spaniel', 'Labrador Retriever']]
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
Labrador Retriever | sporting | 21299.0 | 12.04 | medium | 67.5 | 23.0 |
Use iloc
to slice rows and columns using positions¶
iloc
stands for "integer location."
iloc
is likeloc
, but it selects rows and columns based off of integer positions only, just like with 2D arrays.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
# Try removing the iloc and see what happens!
dogs.iloc[1:15, :-2]
kind | lifetime_cost | longevity | size | |
---|---|---|---|---|
breed | ||||
Cairn Terrier | terrier | 21992.0 | 13.84 | small |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium |
Cocker Spaniel | sporting | 24330.0 | 12.50 | small |
... | ... | ... | ... | ... |
Labrador Retriever | sporting | 21299.0 | 12.04 | medium |
Maltese | toy | 19084.0 | 12.25 | small |
Shih Tzu | toy | 21152.0 | 13.20 | small |
14 rows × 4 columns
iloc
is often most useful when we sort first. For instance, to find the weight of the longest-living breed in the dataset:
dogs.sort_values('longevity', ascending=False)['weight'].iloc[0]
5.5
# Finding the breed itself involves sorting, but not iloc, since breeds are stored in the index.
dogs.sort_values('longevity', ascending=False).index[0]
'Chihuahua'
Activity
Among just the following four breeds, what is the height of the second tallest breed?
- Cocker Spaniel.
- Labrador Retriever.
- Irish Setter.
- Newfoundland.
Assign your answer to second_tallest_height
below. Answer using pandas
code, i.e. don't look at the dataset and hard-code the answer.
If we don't have time to try this activity in lecture, the answer is posted in lec04-filled.ipynb
and in the "filled html" link on the course website.
second_tallest_height = (
dogs
.loc[['Cocker Spaniel', 'Labrador Retriever', 'Newfoundland', 'Irish Setter'], 'height']
.sort_values(ascending=False)
.iloc[1]
)
second_tallest_height
26.0
Querying 🔎¶
Okay, but what if we don't know anything about the position or index of a row we're looking for? How do we find rows that satisfy certain conditions?
Reflection¶
- So far, all of the questions we've been able to answer involved all of the rows in the dataset.
What's the weight of the longest-living breed? What's the average lifetime cost of all breeds? Which breed is third heaviest?
- We don't yet have a mechanism to answer questions about a specific subset of the dataset.
How many terriers are there? What's the average longevity of medium-sized breeds?
Querying¶
- Querying is the act selecting rows in a DataFrame that satisfy certain condition(s).
We sometimes call this "filtering", but filtering also has another meaning, which we'll see next week.
- As we saw in Lecture 3, comparisons with arrays result in Boolean arrays.
The same is true for Series – make a comparison with a Series, and the result is a Boolean Series!
- We can use comparisons along with the
loc
operator to select specific rows from a DataFrame.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
dogs['kind'] == 'terrier'
breed Brittany False Cairn Terrier True English Cocker Spaniel False ... Bullmastiff False Mastiff False Saint Bernard False Name: kind, Length: 42, dtype: bool
dogs.loc[dogs['kind'] == 'terrier']
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
Miniature Schnauzer | terrier | 20087.0 | 11.81 | small | 15.5 | 13.0 |
Norfolk Terrier | terrier | 24308.0 | 13.07 | small | 12.0 | 9.5 |
... | ... | ... | ... | ... | ... | ... |
Scottish Terrier | terrier | 17525.0 | 10.69 | small | 20.0 | 10.0 |
Kerry Blue Terrier | terrier | 17240.0 | 9.40 | medium | 36.5 | 18.5 |
Bull Terrier | terrier | 18490.0 | 10.21 | medium | 60.0 | 21.5 |
8 rows × 6 columns
- Example: How many breeds live to be at least 10 years old?
dogs.loc[dogs['longevity'] >= 10].shape[0]
33
- Since querying is so common, there's a shortcut –
loc
isn't necessary.
dogs[dogs['longevity'] >= 10]
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Afghan Hound | hound | 24077.0 | 11.92 | large | 55.0 | 26.0 |
Bull Terrier | terrier | 18490.0 | 10.21 | medium | 60.0 | 21.5 |
Alaskan Malamute | working | 21986.0 | 10.67 | large | 80.0 | 24.0 |
33 rows × 6 columns
- Example: Among all breeds with
'Retriever'
in the name, which is the second tallest?
# Since we're selecting both rows AND columns, we do need loc here.
(
dogs.loc[dogs.index.str.contains('Retriever'), 'height']
.sort_values(ascending=False)
.index[1]
)
'Labrador Retriever'
- Example: Are there any
'beaver'
kinds?
# Empty DataFrame – not an error!
dogs[dogs['kind'] == 'beaver']
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed |
- Example: Show me all rows for
'Golden Retriever'
s.
Note that because we set the index to'breed'
earlier, we can select rows based on dog breeds without having to query. If'breed'
was instead a column, then we'd need to query to access information about a particular breed.
dogs.loc['Golden Retriever']
kind sporting lifetime_cost 21447.0 longevity 12.04 size medium weight 60.0 height 22.75 Name: Golden Retriever, dtype: object
- Example: Show me all of the rows for
'sporting'
or'working'
breeds.
If using multiple conditions, you need parentheses around each condition!
Also, you must use the bitwise operators&
and|
instead of the standardand
andor
keywords, as we saw in Lecture 3.
dogs[(dogs['kind'] == 'sporting') | (dogs['kind'] == 'working')]
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
19 rows × 6 columns
# Equivalent to the above!
dogs[dogs['kind'].isin(['sporting', 'working'])]
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
19 rows × 6 columns
The DataFrame query
method is a convenient way to query, since you don't need parentheses and you can use the and
and or
keywords.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
42 rows × 6 columns
dogs.query('weight < 20 and kind == "terrier"')
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
Miniature Schnauzer | terrier | 20087.0 | 11.81 | small | 15.5 | 13.0 |
Norfolk Terrier | terrier | 24308.0 | 13.07 | small | 12.0 | 9.5 |
dogs.query('kind in ["sporting", "terrier"] and lifetime_cost < 20000')
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
Chesapeake Bay Retriever | sporting | 16697.0 | 9.48 | large | 67.5 | 23.5 |
Gordon Setter | sporting | 19605.0 | 11.10 | large | 62.5 | 25.0 |
Clumber Spaniel | sporting | 18084.0 | 10.00 | medium | 70.0 | 18.5 |
Scottish Terrier | terrier | 17525.0 | 10.69 | small | 20.0 | 10.0 |
Kerry Blue Terrier | terrier | 17240.0 | 9.40 | medium | 36.5 | 18.5 |
Bull Terrier | terrier | 18490.0 | 10.21 | medium | 60.0 | 21.5 |
- For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We won't cover these in class, but you should try them out yourself. Here's a Pandas Tutor link to visualize these!
jack = pd.DataFrame({1: ['fee', 'fi'],
'1': ['fo', 'fum']})
jack
1 | 1 | |
---|---|---|
0 | fee | fo |
1 | fi | fum |
jack[1]
0 fee 1 fi Name: 1, dtype: object
jack[[1]]
1 | |
---|---|
0 | fee |
1 | fi |
jack['1']
0 fo 1 fum Name: 1, dtype: object
jack[[1, 1]]
1 | 1 | |
---|---|---|
0 | fee | fee |
1 | fi | fi |
jack.loc[1]
1 fi 1 fum Name: 1, dtype: object
jack.loc[jack[1] == 'fo']
1 | 1 |
---|
jack[1, ['1', 1]]
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key) 3789 try: -> 3790 return self._engine.get_loc(casted_key) 3791 except KeyError as err: File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc() File index.pyx:158, in pandas._libs.index.IndexEngine.get_loc() TypeError: '(1, ['1', 1])' is an invalid key During handling of the above exception, another exception occurred: InvalidIndexError Traceback (most recent call last) Cell In[82], line 1 ----> 1 jack[1, ['1', 1]] File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/frame.py:3896, in DataFrame.__getitem__(self, key) 3894 if self.columns.nlevels > 1: 3895 return self._getitem_multilevel(key) -> 3896 indexer = self.columns.get_loc(key) 3897 if is_integer(indexer): 3898 indexer = [indexer] File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key) 3797 raise KeyError(key) from err 3798 except TypeError: 3799 # If we have a listlike key, _check_indexing_error will raise 3800 # InvalidIndexError. Otherwise we fall through and re-raise 3801 # the TypeError. -> 3802 self._check_indexing_error(key) 3803 raise File ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/indexes/base.py:5974, in Index._check_indexing_error(self, key) 5970 def _check_indexing_error(self, key): 5971 if not is_scalar(key): 5972 # if key is not a scalar, directly raise an error (the code below 5973 # would convert to numpy arrays and raise later any way) - GH29926 -> 5974 raise InvalidIndexError(key) InvalidIndexError: (1, ['1', 1])
jack.loc[1, 1]
'fi'
What's next?¶
- Suppose we want to find the average
'longevity'
of each'kind'
.
dogs['kind'].unique()
array(['sporting', 'terrier', 'herding', 'working', 'toy', 'non-sporting', 'hound'], dtype=object)
- Using our current toolkit, we'd:
- Find all unique values in the
'kind'
column. - For each one, query the rows for just that
'kind'
, extract the'longevity'
column, and take the mean.
- Find all unique values in the
dogs.loc[dogs['kind'] == 'sporting', 'longevity'].mean()
11.649166666666668
dogs.loc[dogs['kind'] == 'terrier', 'longevity'].mean()
11.655
- This can get time-consuming! How can we compute all 7 of these means with a single line of code?