from lec_utils import *
def show_grouping_animation():
src = "https://docs.google.com/presentation/d/1tBaFyHseIGsX5wmE3BdNLeVHnKksQtpzLhHge8Tzly0/embed?start=false&loop=false&delayms=60000&rm=minimal"
width = 960
height = 509
display(IFrame(src, width, height))
Announcements 📣¶
- Homework 2 is due on Thursday.
Post on Ed or
come to Office Hours for help! We're using a queue for office hours now – access it from practicaldsc.org/calendar.
study.practicaldsc.org contains our discussion worksheets (and solutions), which are made up of old exam problems. Use these problems to build your theoretical understanding of the material!
Homework 1 scores are available on Gradescope.
Agenda¶
- Recap: Querying.
- Adding and modifying columns.
pandas
andnumpy
.- Introduction to the
groupby
method. groupby
's inner workings.
Remember to follow along in lecture by accessing the "blank" lecture notebook in our public GitHub repository.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
Run the cell below to load in our dataset.
dogs = pd.read_csv('data/dogs43.csv').set_index('breed')
dogs.head()
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 | 14.5 |
Recap: Querying¶
- Querying is the act selecting rows in a DataFrame that satisfy certain condition(s).
We sometimes call this "filtering."
- Example: How many breeds live to be over 10 years old?
dogs[dogs['longevity'] > 10].shape[0]
32
- Example: Among all breeds with
'Retriever'
in the same, which is the second tallest?
Remember, we like to format our code this way when writing long, hard-to-read lines.
(
dogs.loc[dogs.index.str.contains('Retriever'), 'height']
.sort_values(ascending=False)
.iloc[1]
)
23.0
- What is the distribution of breeds
'size'
s among'sporting'
and'working'
breeds?
Remember, you need parentheses around each condition. Also, you must use the bitwise operators&
and|
instead of the standardand
andor
keywords, as we saw in Lecture 3.
(
dogs.loc[(dogs['kind'] == 'sporting') | (dogs['kind'] == 'working'), 'size']
.value_counts()
)
size large 11 medium 7 small 1 Name: count, dtype: int64
# Equivalent to the above!
(
dogs.loc[dogs['kind'].isin(['sporting', 'working']), 'size']
.value_counts()
)
size large 11 medium 7 small 1 Name: count, dtype: int64
- Show me all rows for
'medium'
-sized dogs.
dogs[dogs['size'] == 'medium']
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.00 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.00 |
Siberian Husky | working | 22049.0 | 12.58 | medium | 47.5 | 21.75 |
... | ... | ... | ... | ... | ... | ... |
Clumber Spaniel | sporting | 18084.0 | 10.00 | medium | 70.0 | 18.50 |
Kerry Blue Terrier | terrier | 17240.0 | 9.40 | medium | 36.5 | 18.50 |
Bull Terrier | terrier | 18490.0 | 10.21 | medium | 60.0 | 21.50 |
12 rows × 6 columns
- Show me all rows for
'Golden Retriever'
s.
Note that because we set the index to'breed'
earlier, we can select rows based on dog breeds without having to query. If'breed'
was instead a column, then we'd need to query to access information about a particular breed.
dogs.loc['Golden Retriever']
kind sporting lifetime_cost 21447.0 longevity 12.04 size medium weight 60.0 height 22.75 Name: Golden Retriever, dtype: object
Aside: Reference Slides
- Moving forward, I'm going to try and spend a bit less time on syntax and a bit more time on conceptual problem-solving.
- So, in each lecture, some slides will be called "Reference Slides".
- We will skip these slides during live lecture, but they'll be present in the posted lecture notebooks, so that you can look at them when working on activities in class and on discussion and homework problems.
The material in them is in-scope.
- Reference slides will appear with red headers, like this one.
There's a Reference Slide right after this slide, about the DataFramequery
method!
The DataFrame query
method is a convenient way to query, since you don't need parentheses and you can use the and
and or
keywords.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
dogs.query('weight < 20 and kind == "terrier"')
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
Miniature Schnauzer | terrier | 20087.0 | 11.81 | small | 15.5 | 13.0 |
Norfolk Terrier | terrier | 24308.0 | 13.07 | small | 12.0 | 9.5 |
dogs.query('kind in ["sporting", "terrier"] and lifetime_cost < 20000')
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
Chesapeake Bay Retriever | sporting | 16697.0 | 9.48 | large | 67.5 | 23.5 |
Gordon Setter | sporting | 19605.0 | 11.10 | large | 62.5 | 25.0 |
Clumber Spaniel | sporting | 18084.0 | 10.00 | medium | 70.0 | 18.5 |
Scottish Terrier | terrier | 17525.0 | 10.69 | small | 20.0 | 10.0 |
Kerry Blue Terrier | terrier | 17240.0 | 9.40 | medium | 36.5 | 18.5 |
Bull Terrier | terrier | 18490.0 | 10.21 | medium | 60.0 | 21.5 |
jack = pd.DataFrame({1: ['fee', 'fi'],
'1': ['fo', 'fum']})
jack
1 | 1 | |
---|---|---|
0 | fee | fo |
1 | fi | fum |
For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself. We won't cover these in class, but you should try them out yourself. Here's a Pandas Tutor link to visualize these!
# jack[1]
# jack[[1]]
# jack['1']
# jack[[1, 1]]
# jack.loc[1]
# jack.loc[jack[1] == 'fo']
# jack[1, ['1', 1]]
# jack.loc[1,1]
Adding and modifying columns¶
Adding and modifying columns, using a copy¶
- To add a new column to a DataFrame, use the
assign
method.
To change the values in a column, add a new column with the same name as the existing column.
- Like most
pandas
methods,assign
returns a new DataFrame.- Pro ✅: This doesn't inadvertently change any existing variables.
- Con ❌: It is not very space efficient, as it creates a new copy each time it is called.
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
You can also use assign
when the desired column name has spaces (and other special characters) by unpacking a dictionary:
dogs.assign(**{'cost per year 💵': dogs['lifetime_cost'] / dogs['longevity']})
kind | lifetime_cost | longevity | size | weight | height | cost per year 💵 | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
Adding and modifying columns, in-place¶
- You can assign a new column to a DataFrame in-place using
[]
.
This works like dictionary assignment. Using[]
modifies the underlying DataFrame, unlikeassign
, which returns a new DataFrame.
- This is the more "common" way of adding/modifying columns.
Exercise caution when using this approach, since it is destructive – it changes the values of existing variables.
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy, the original remains unmodified.
dogs_copy = dogs.copy()
dogs_copy.head(2)
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
dogs_copy['cost_per_year'] = dogs_copy['lifetime_cost'] / dogs_copy['longevity']
dogs_copy
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
- Note that we never reassigned
dogs_copy
in the cell above – that is, we never wrotedogs_copy = ...
– though it was still modified.
Mutability¶
- DataFrames, like lists, arrays, and dictionaries, are mutable. As we saw in Lecture 2, this means they can be modified in-place after creation.
- Not only does this explain the behavior on the previous slide, but it also explains the following:
dogs_copy
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
def cost_in_thousands():
dogs_copy['lifetime_cost'] = dogs_copy['lifetime_cost'] / 1000
# What happens when we run this twice? Three times?
cost_in_thousands()
dogs_copy
kind | lifetime_cost | longevity | size | weight | height | cost_per_year | |
---|---|---|---|---|---|---|---|
breed | |||||||
Brittany | sporting | 22.59 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
Cairn Terrier | terrier | 21.99 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
English Cocker Spaniel | sporting | 18.99 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
... | ... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13.94 | 7.57 | large | 115.0 | 25.5 | 1840.95 |
Mastiff | working | 13.58 | 6.50 | large | 175.0 | 30.0 | 2089.38 |
Saint Bernard | working | 20.02 | 7.78 | large | 155.0 | 26.5 | 2573.52 |
43 rows × 7 columns
⚠️ Warning: Avoid mutation when possible!¶
- Note that
dogs_copy
was modified, even though we didn't reassign it! These unintended consequences can influence the behavior of test cases on homeworks, among other things!
- To avoid this, it's a good idea to avoid mutation when possible. If you must use mutation, include
df = df.copy()
as the first line in functions that take DataFrames as input.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
pandas
and numpy
¶
pandas
is built upon numpy
!¶
- A Series in
pandas
is anumpy
array with an index.
- A DataFrame is like a dictionary of columns, each of which is a
numpy
array.
- Many operations in
pandas
are fast because they usenumpy
's implementations, which are written in fast, compiled languages like C.
- If you need to access the array underlying a DataFrame or Series, use the
to_numpy
method.
dogs['lifetime_cost']
breed Brittany 22589.0 Cairn Terrier 21992.0 English Cocker Spaniel 18993.0 ... Bullmastiff 13936.0 Mastiff 13581.0 Saint Bernard 20022.0 Name: lifetime_cost, Length: 43, dtype: float64
dogs['lifetime_cost'].to_numpy()
array([22589., 21992., 18993., ..., 13936., 13581., 20022.])
- Each Series (column) has a
numpy
data type, which refers to the type of the values stored within. Access it using thedtypes
attribute.
- A column's data type determines which operations can be applied to it.
pandas
tries to guess the correct data types for a given DataFrame, and is often wrong.- This can lead to incorrect calculations and poor memory/time performance.
- As a result, you will often need to explicitly convert between data types.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
dogs.dtypes
kind object lifetime_cost float64 longevity float64 size object weight float64 height float64 dtype: object
- Notice that Python
str
types areobject
types innumpy
andpandas
.
Pandas dtype | Python type | NumPy type | SQL type | Usage |
---|---|---|---|---|
int64 | int | int_, int8,...,int64, uint8,...,uint64 | INT, BIGINT | Integer numbers |
float64 | float | float_, float16, float32, float64 | FLOAT | Floating point numbers |
bool | bool | bool_ | BOOL | True/False values |
datetime64 or Timestamp | datetime.datetime | datetime64 | DATETIME | Date and time values |
timedelta64 or Timedelta | datetime.timedelta | timedelta64 | NA | Differences between two datetimes |
category | NA | NA | ENUM | Finite list of text values |
object | str | string, unicode | NA | Text |
object | NA | object | NA | Mixed types |
- This article details how
pandas
stores different data types under the hood.
- This article explains how
numpy
/pandas
int64
operations differ from vanillaint
operations.
- You can change the data type of a Series using the
.astype
Series method.
- For example, we can change the data type of the
'lifetime_cost'
column indogs
to beuint32
:
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936.0 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581.0 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022.0 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# Gives the types as well as the space taken up by the DataFrame.
dogs.info()
<class 'pandas.core.frame.DataFrame'> Index: 43 entries, Brittany to Saint Bernard Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 kind 43 non-null object 1 lifetime_cost 43 non-null float64 2 longevity 43 non-null float64 3 size 43 non-null object 4 weight 43 non-null float64 5 height 43 non-null float64 dtypes: float64(4), object(2) memory usage: 3.4+ KB
dogs['lifetime_cost'] = dogs['lifetime_cost'].astype('uint32')
- Now, the DataFrame takes up less space! This may be insignificant in our DataFrame, but makes a difference when working with larger datasets.
dogs.info()
<class 'pandas.core.frame.DataFrame'> Index: 43 entries, Brittany to Saint Bernard Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 kind 43 non-null object 1 lifetime_cost 43 non-null uint32 2 longevity 43 non-null float64 3 size 43 non-null object 4 weight 43 non-null float64 5 height 43 non-null float64 dtypes: float64(3), object(2), uint32(1) memory usage: 3.2+ KB
Usually, we prefer to set the correct dtypes in read_csv
, since it can help pandas
load in files more quickly:
dogs_new = pd.read_csv('data/dogs43.csv', dtype={'lifetime_cost': 'uint32'})
dogs_new
breed | kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|---|
0 | Brittany | sporting | 22589 | 12.92 | medium | 35.0 | 19.0 |
1 | Cairn Terrier | terrier | 21992 | 13.84 | small | 14.0 | 10.0 |
2 | English Cocker Spaniel | sporting | 18993 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... |
40 | Bullmastiff | working | 13936 | 7.57 | large | 115.0 | 25.5 |
41 | Mastiff | working | 13581 | 6.50 | large | 175.0 | 30.0 |
42 | Saint Bernard | working | 20022 | 7.78 | large | 155.0 | 26.5 |
43 rows × 7 columns
dogs_new.dtypes
breed object kind object lifetime_cost uint32 longevity float64 size object weight float64 height float64 dtype: object
Axes¶
- The rows and columns of a DataFrame are both stored as Series.
- The axis specifies the direction of a slice of a DataFrame.
- Axis 0 refers to the rows and axis 1 refers to the columns.
These are the same axes definitions that 2Dnumpy
arrays have!
DataFrame methods with axis
¶
- Many Series methods work on DataFrames.
- In such cases, the DataFrame method usually applies the Series method to every row or column.
- Many of these methods accept an
axis
argument; the default is usuallyaxis=0
.
dogs
kind | lifetime_cost | longevity | size | weight | height | |
---|---|---|---|---|---|---|
breed | ||||||
Brittany | sporting | 22589 | 12.92 | medium | 35.0 | 19.0 |
Cairn Terrier | terrier | 21992 | 13.84 | small | 14.0 | 10.0 |
English Cocker Spaniel | sporting | 18993 | 11.66 | medium | 30.0 | 16.0 |
... | ... | ... | ... | ... | ... | ... |
Bullmastiff | working | 13936 | 7.57 | large | 115.0 | 25.5 |
Mastiff | working | 13581 | 6.50 | large | 175.0 | 30.0 |
Saint Bernard | working | 20022 | 7.78 | large | 155.0 | 26.5 |
43 rows × 6 columns
# Max element in each column.
dogs.max()
kind working lifetime_cost 26686 longevity 16.5 size small weight 175.0 height 30.0 dtype: object
# Max element in each row – a little nonsensical, since the values in each column are on different scales.
# Note that we had to select the numeric columns first.
dogs[['lifetime_cost', 'longevity', 'weight', 'height']].max(axis=1)
breed Brittany 22589.0 Cairn Terrier 21992.0 English Cocker Spaniel 18993.0 ... Bullmastiff 13936.0 Mastiff 13581.0 Saint Bernard 20022.0 Length: 43, dtype: float64
# The number of unique values in each column.
dogs.nunique()
kind 7 lifetime_cost 43 longevity 40 size 3 weight 37 height 30 dtype: int64
# describe doesn't accept an axis argument; it works on every numeric column in the DataFrame it is called on.
dogs.describe()
lifetime_cost | longevity | weight | height | |
---|---|---|---|---|
count | 43.00 | 43.00 | 43.00 | 43.00 |
mean | 20532.84 | 11.34 | 49.35 | 18.34 |
std | 3290.78 | 2.05 | 39.42 | 6.83 |
... | ... | ... | ... | ... |
50% | 21006.00 | 11.81 | 36.50 | 18.50 |
75% | 22072.50 | 12.52 | 67.50 | 25.00 |
max | 26686.00 | 16.50 | 175.00 | 30.00 |
8 rows × 4 columns
Activity
Pick a dog breed that you personally like or know the name of. Then:
- Try to find a few other dog breeds that are similar in weight to yours in
all_dogs
. - Which similar breeds have the lowest and highest
'lifetime_cost'
?'intelligence_rank'
? - Are there any similar breeds that you haven't heard of before?
For fun, look up these dog breeds on the AKC website to see what they look like!
all_dogs = pd.read_csv('data/all_dogs.csv')
all_dogs
breed | group | datadog | popularity_all | ... | megarank | size | weight | height | |
---|---|---|---|---|---|---|---|---|---|
0 | Border Collie | herding | 3.64 | 45 | ... | 29.0 | medium | NaN | 20.0 |
1 | Border Terrier | terrier | 3.61 | 80 | ... | 1.0 | small | 13.5 | NaN |
2 | Brittany | sporting | 3.54 | 30 | ... | 11.0 | medium | 35.0 | 19.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
169 | Wire Fox Terrier | terrier | NaN | 100 | ... | NaN | small | 17.5 | 15.0 |
170 | Wirehaired Pointing Griffon | sporting | NaN | 92 | ... | NaN | medium | NaN | 22.0 |
171 | Xoloitzcuintli | non-sporting | NaN | 155 | ... | NaN | medium | NaN | 16.5 |
172 rows × 18 columns
# There's no "right answer" here; you're supposed to explore!
fav_weight = all_dogs.loc[all_dogs['breed'] == 'English Cocker Spaniel', 'weight'].iloc[0]
similar_weight = all_dogs[(all_dogs['weight'] >= fav_weight - 5) & (all_dogs['weight'] <= fav_weight + 5)]
similar_weight
breed | group | datadog | popularity_all | ... | megarank | size | weight | height | |
---|---|---|---|---|---|---|---|---|---|
2 | Brittany | sporting | 3.54 | 30 | ... | 11.0 | medium | 35.0 | 19.00 |
5 | English Cocker Spaniel | sporting | 3.33 | 63 | ... | 5.0 | medium | 30.0 | 16.00 |
6 | Cocker Spaniel | sporting | 3.30 | 27 | ... | 6.0 | small | 25.0 | 14.50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
125 | Irish Terrier | terrier | NaN | 132 | ... | NaN | medium | 26.0 | 18.00 |
138 | Norwegian Buhund | herding | NaN | 165 | ... | NaN | medium | 33.0 | 17.25 |
159 | Soft-Coated Wheaten Terrier | terrier | NaN | 52 | ... | NaN | medium | 35.0 | 18.00 |
12 rows × 18 columns
similar_weight.sort_values('intelligence_rank')[['breed', 'lifetime_cost', 'intelligence_rank']]
breed | lifetime_cost | intelligence_rank | |
---|---|---|---|
57 | Pembroke Welsh Corgi | 23978.0 | 11.0 |
5 | English Cocker Spaniel | 18993.0 | 18.0 |
2 | Brittany | 22589.0 | 19.0 |
... | ... | ... | ... |
73 | French Bulldog | 17266.0 | 58.0 |
117 | Glen of Imaal Terrier | NaN | NaN |
138 | Norwegian Buhund | NaN | NaN |
12 rows × 3 columns
Introduction to the groupby
method¶
Example: Palmer Penguins¶
The dataset we'll work with for the rest of the lecture involves various measurements taken of three species of penguins in Antarctica.
IFrame('https://www.youtube-nocookie.com/embed/CCrNAHXUstU?si=-DntSyUNp5Kwitjm&start=11',
width=560, height=315)
Loading the data¶
penguins = sns.load_dataset('penguins').dropna().reset_index(drop=True)
penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
330 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
331 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
332 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
- Here, each row corresponds to a single penguin, and each column corresponds to a different attribute (or feature) we have for each penguin.
- Data formatted in this way is sometimes called tidy data.
Visualizing the data¶
penguins.plot(kind='scatter',
x='bill_length_mm',
y='body_mass_g',
color='species',
title='Body Mass vs. Bill Length')