DataFrame Internals
Overview
In Lecture 4, we discussed the three main data structures in `pandas`:
- DataFrame: 2-dimensional tables with rows and columns.
- Series: 1-dimensional array-like object, representing a row or column.
- Index: Sequence of row or column labels. When we say “the index”, we’re referring to the sequence of row labels.
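All three structures show up whenever you work with a DataFrame. As a minimal sketch, using a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['a', 'b'])

type(df)          # DataFrame: the 2-dimensional table itself
type(df['x'])     # Series: a single column
type(df.index)    # Index: the sequence of row labels, 'a' and 'b'
type(df.columns)  # Index: the sequence of column labels, 'x' and 'y'
```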
We were exposed to several DataFrame methods, like `head`, `tail`, `sort_values`, and `set_index`, and Series methods, like `mean`, `value_counts`, and `argmax` (to name a few; see the `pandas` documentation for more).
We also learned about a few key techniques for accessing data in DataFrames; in particular, we learned that:
- `loc` is used to extract a subset of rows and columns from a DataFrame, given their index values.
- `iloc` is used to extract a subset of rows and columns from a DataFrame, given their integer positions.
- The square brackets operator, used without `loc` or `iloc`, can either be used to select an individual column or to perform a query, where we select a subset of rows that satisfy particular conditions.
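For a quick refresher on the difference, here's a sketch using the `dogs` DataFrame from Lecture 4 (shown in the next section); the specific labels are just examples:

```python
dogs.loc['Brittany', 'lifetime_cost']  # by label: row 'Brittany', column 'lifetime_cost'
dogs.iloc[0, 1]                        # by position: first row, second column
dogs['lifetime_cost']                  # [] with a column name: selects that column
dogs[dogs['size'] == 'small']          # [] with a boolean Series: performs a query
```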
In this guide, we’ll discuss some of the details of how DataFrames are stored in memory, and how that might influence the way you interact with them in homeworks. In previous semesters, we’d show this material in lecture, but we felt that it’s better served as reference notes, and that lecture time is better spent on activities that involve conceptual problem solving.
Consequences of mutability
DataFrames, like lists, arrays, dictionaries, Series, and indexes, are mutable, meaning that they can be modified in-place after instantiation. A consequence of this is that DataFrames can be inadvertently modified in unexpected ways. To illustrate, we'll look at a common operation: adding a new column. We'll continue using the `dogs` DataFrame from Lecture 4.
dogs.head()
| | kind | lifetime_cost | longevity | size | weight | height |
| --- | --- | --- | --- | --- | --- | --- |
| breed | | | | | | |
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 | 14.5 |
Note that the DataFrame above only has 6 columns – `'kind'`, `'lifetime_cost'`, `'longevity'`, `'size'`, `'weight'`, and `'height'`. `'breed'` is not a column; rather, it's stored in the index. The string `breed` appears above the index because it was carried over from when `'breed'` was initially a column; recall that in lecture, we manually set the index of `dogs` to `'breed'`. The fact that it says `breed` above the index does not change the fact that `'breed'` is not one of the columns.
Adding columns using `[]`
The most common way to add or modify columns in a DataFrame is using the `[]` operator, which we also use to access columns. This works like dictionary assignment. (By “modify” a column, we mean replace an existing column with a new column of the same name.)

Using `[]` to add or modify columns is destructive, meaning it changes the underlying DataFrame in memory. Let's demonstrate. Suppose we add a `'cost_per_year'` column to `dogs` using the `[]` operator.
dogs['cost_per_year'] = dogs['lifetime_cost'] / dogs['longevity']
Note that we never explicitly re-assigned `dogs` in the cell above, i.e. we never wrote `dogs = ...`. But `dogs` was still modified! Scroll to the right below and you'll see that `dogs` now has a `'cost_per_year'` column.
dogs.head()
| | kind | lifetime_cost | longevity | size | weight | height | cost_per_year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| breed | | | | | | | |
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 | 1946.40 |
| Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 | 14.5 | 1676.46 |
In and of itself, this doesn't seem like the worst thing in the world. But the fact that DataFrames are mutable can have even more unintended consequences. For instance, consider the function `cost_in_thousands`, defined below.
# Note that cost_in_thousands doesn't return anything.
def cost_in_thousands(df):
df['lifetime_cost'] = df['lifetime_cost'] / 1000
Suppose `dogs` is now a fresh copy of the `dogs` DataFrame, i.e. one without a `'cost_per_year'` column. And suppose we run the following cell twice:
cost_in_thousands(dogs)
Even though we never explicitly re-assigned `dogs`, it was modified – twice.
| | kind | lifetime_cost | longevity | size | weight | height |
| --- | --- | --- | --- | --- | --- | --- |
| breed | | | | | | |
| Brittany | sporting | 0.02 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 0.02 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 0.02 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 0.02 | 12.50 | small | 25.0 | 14.5 |
| Shetland Sheepdog | herding | 0.02 | 12.53 | small | 22.0 | 14.5 |
The values in the `'lifetime_cost'` column above aren't just the original values divided by \(1000\); they're the original values divided by \(1000 \cdot 1000 = 1000000\)! You may ask, why would we ever run the cell `cost_in_thousands(dogs)` twice? On purpose, we probably wouldn't, but when you have hundreds of cells in front of you, it's common to inadvertently run a cell multiple times, and it's best practice to structure your code such that no matter how many times it's run, the results and consequences are the same.
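One way to make a cell like this idempotent is to derive a new column from the original one, rather than overwriting `'lifetime_cost'` in place. (The column name below is just an illustrative choice.)

```python
# Safe to re-run: 'lifetime_cost' itself is never overwritten, so running
# this cell once or ten times produces the same result.
dogs['lifetime_cost_thousands'] = dogs['lifetime_cost'] / 1000
```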
Moral of the story: avoid mutation when possible!
We're about to show you another, non-destructive technique for adding or modifying columns. But since the `[]` technique is so common, we'll leave you with a word of advice: when implementing functions that take in DataFrames, it's a good idea to include `df = df.copy()` as the first line. `df.copy()` returns a deep copy of `df`, which is a new DataFrame with the same content as `df` but a different memory address. This way, the changes you make to `df` within the body of the function won't impact any input DataFrames in your global scope.
# Better than before! Now, when cost_in_thousands is called,
# it won't make destructive modifications to the DataFrame it is called on.
def cost_in_thousands(df):
    # Work on a deep copy, so the caller's DataFrame is left untouched.
    df = df.copy()
    df['lifetime_cost'] = df['lifetime_cost'] / 1000
    # Return the copy; without this, the function would have no effect at all.
    return df
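With the `return` statement added, a call now produces a new DataFrame and leaves the original alone:

```python
dogs_in_thousands = cost_in_thousands(dogs)
dogs_in_thousands['lifetime_cost'].max()  # 26.686
dogs['lifetime_cost'].max()               # still 26686.0 -- dogs is untouched
```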
Adding columns using `assign`
A non-destructive way to add a new column to a DataFrame is the `assign` method. Like most other DataFrame methods we saw in Lecture 4 (and will see in subsequent lectures), `assign` returns a new DataFrame and leaves the original DataFrame untouched. The downside is that it's not very space efficient, as it creates a new copy of the DataFrame in memory each time it's called.
The `assign` method creates new columns using Python's keyword argument syntax. If the column name we're trying to add already exists, `assign` will replace the old column with the new one. In general, the pattern is:
df.assign(name_of_new_column=values_in_new_column)
To illustrate, suppose `dogs` is now a fresh copy of the original DataFrame loaded in at the start of this guide. The following expression returns a new DataFrame with all of the columns in `dogs`, plus a new `'cost_per_year'` column.
dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
| | kind | lifetime_cost | longevity | size | weight | height | cost_per_year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| breed | | | | | | | |
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 | 1946.40 |
| Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 | 14.5 | 1676.46 |
But `dogs` itself is still unchanged:
dogs
| | kind | lifetime_cost | longevity | size | weight | height |
| --- | --- | --- | --- | --- | --- | --- |
| breed | | | | | | |
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 |
| Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 | 14.5 |
You can also use `assign` when the desired column name has spaces (or other special characters) by unpacking a dictionary:
dogs.assign(**{'cost per year 💵': dogs['lifetime_cost'] / dogs['longevity']})
| | kind | lifetime_cost | longevity | size | weight | height | cost per year 💵 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| breed | | | | | | | |
| Brittany | sporting | 22589.0 | 12.92 | medium | 35.0 | 19.0 | 1748.37 |
| Cairn Terrier | terrier | 21992.0 | 13.84 | small | 14.0 | 10.0 | 1589.02 |
| English Cocker Spaniel | sporting | 18993.0 | 11.66 | medium | 30.0 | 16.0 | 1628.90 |
| Cocker Spaniel | sporting | 24330.0 | 12.50 | small | 25.0 | 14.5 | 1946.40 |
| Shetland Sheepdog | herding | 21006.0 | 12.53 | small | 22.0 | 14.5 | 1676.46 |
To have any of these changes persist, we'd need to explicitly re-assign `dogs` to the new DataFrame.
dogs = dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
Admittedly, this syntax is a bit convoluted – especially when adding columns whose names have spaces – and the `[]` operator is more convenient. It just comes with side effects that you should be aware of.
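Because `assign` returns a new DataFrame, it also composes nicely with other non-destructive methods via method chaining (not to be confused with the “chained indexing” discussed next). An illustrative sketch:

```python
# Each method returns a new DataFrame, so dogs itself is never modified.
(
    dogs
    .assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
    .sort_values('cost_per_year')
    .head()
)
```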
Chained indexing
Consider the following example DataFrame, `A`.
A = pd.DataFrame().assign(
x=[1, 2, 3],
y=[4, 4, 5]
)
A
| | x | y |
| --- | --- | --- |
| 0 | 1 | 4 |
| 1 | 2 | 4 |
| 2 | 3 | 5 |
Suppose we need to set the values of `'y'` to 3 wherever `'x'` is greater than 1. There are a few ways to attempt this:
A[A['x'] > 1]['y'] = 3
A.loc[A['x'] > 1, 'y'] = 3
It turns out that only the second line works, not the first. In short, the reason is that `A[A['x'] > 1]` returns a new DataFrame, detached from `A`, and the `['y'] = 3` piece sets the corresponding `'y'` values of that new DataFrame to 3. It doesn't change `A`, like the second line does.
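To make the difference concrete, here's a sketch of what each line does to `A`, with the expected values shown as comments:

```python
# Chained indexing: modifies a temporary copy, and pandas typically emits
# a SettingWithCopyWarning (or raises an error with copy-on-write enabled).
A[A['x'] > 1]['y'] = 3
A['y'].tolist()              # [4, 4, 5] -- A is unchanged

# .loc with a boolean mask and a column label: modifies A in place.
A.loc[A['x'] > 1, 'y'] = 3
A['y'].tolist()              # [4, 3, 3]
```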
The first line uses what is called “chained indexing”, which should generally be avoided. If all we were trying to do was access the `'y'` values for which `'x'` is greater than 1, then both `A[A['x'] > 1]['y']` and `A.loc[A['x'] > 1, 'y']` would give us equivalent results. But because the former technique can lead to problems when used for assignment, it's best to avoid it in general.
To read more about this nuance, and the famed `SettingWithCopyWarning` in `pandas`, see the `pandas` documentation's discussion of returning a view versus a copy.
Pandas and NumPy
`pandas` is built upon `numpy`, and both `pandas` and `numpy` are part of a larger Python scientific computing ecosystem of packages. This rich ecosystem is part of why Python is so popular in different domains.
In particular, a Series is a (`numpy`) array with an index, and a DataFrame can be thought of as a dictionary of columns, each of which is an array.
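Here's a rough sketch of that mental model, with hypothetical data:

```python
import numpy as np
import pandas as pd

# A Series is, conceptually, a numpy array paired with an index of labels.
s = pd.Series(np.array([22589.0, 21992.0]), index=['Brittany', 'Cairn Terrier'])

# A DataFrame is, conceptually, a dictionary mapping column names to arrays.
df = pd.DataFrame({'x': np.array([1, 2, 3]), 'y': np.array([4, 4, 5])})
df['x']  # like a dictionary lookup: returns the 'x' column as a Series
```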
Many operations in `pandas` are fast because they use `numpy`'s implementations, which are written in fast, compiled languages like C and Fortran. If you need to access the array underlying a DataFrame or Series, use the `to_numpy` method.
dogs['lifetime_cost']
breed
Brittany 22589.0
Cairn Terrier 21992.0
English Cocker Spaniel 18993.0
...
Bullmastiff 13936.0
Mastiff 13581.0
Saint Bernard 20022.0
Name: lifetime_cost, Length: 42, dtype: float64
dogs['lifetime_cost'].to_numpy()
array([22589., 21992., 18993., ..., 13936., 13581., 20022.])
`pandas` data types
Each Series has a `numpy` data type, which refers to the type of the values stored within. Remember that each column (and row) of a DataFrame is stored as a Series.
The data type of `dogs['lifetime_cost']` is `float64`, as you can see in the last line of the Series preview above. This is the `pandas`/`numpy` version of the vanilla Python `float` data type, and you'll run into it frequently in your autograder tests.
You can programmatically access the data type of a Series using the `dtype` attribute:
dogs['lifetime_cost'].dtype
dtype('float64')
The DataFrame `dtypes` attribute shows you the data types of all columns in the DataFrame:
dogs.dtypes
kind object
lifetime_cost float64
longevity float64
size object
weight float64
height float64
dtype: object
The `object` data type usually refers to strings, but you can think of it as a catch-all for values that don't fit one of the more specific data types.
A Series' data type determines which operations can be applied to it. When creating a DataFrame using `pd.read_csv`, `pandas` tries to guess the correct data type for each column, and is often wrong. This can lead to incorrect calculations and poor memory/time performance. As a result, you may need to explicitly convert between data types.
To change the data type of a Series, use the `astype` method. For instance, if `s` is a Series of strings that you want to convert to integers, use `s.astype(int)`.
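For instance, a minimal sketch with a hypothetical Series `s` of numeric strings:

```python
s = pd.Series(['1', '2', '3'])
s.dtype        # dtype('O'), i.e. object, since the values are strings
s.astype(int)  # a new Series with integer values; astype is non-destructive
```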
Sometimes, changing the data type of a Series can lead to performance benefits. The DataFrame `info` method shows us all of the information that the `dtypes` attribute does, as well as the space taken up by the DataFrame.
dogs.info()
<class 'pandas.core.frame.DataFrame'>
Index: 42 entries, Brittany to Saint Bernard
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 kind 42 non-null object
1 lifetime_cost 42 non-null float64
2 longevity 42 non-null float64
3 size 42 non-null object
4 weight 42 non-null float64
5 height 42 non-null float64
dtypes: float64(4), object(2)
memory usage: 2.4+ KB
The `'lifetime_cost'` column currently has a data type of `float64`, but all of the values are integers between \(13581\) and \(26686\). We could set the data type of `'lifetime_cost'` to `uint16`, which stands for unsigned 16-bit integer; `uint16` values can range from \(0\) to \(2^{16} - 1 = 65535\). In practice, we probably wouldn't do this, because any operations involving the new `'lifetime_cost'` would be at risk of numerical overflow, but we could still try, because doing so saves some space:
dogs['lifetime_cost'] = dogs['lifetime_cost'].astype('uint16') # Notice the quotes!
dogs.info()
<class 'pandas.core.frame.DataFrame'>
Index: 42 entries, Brittany to Saint Bernard
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 kind 42 non-null object
1 lifetime_cost 42 non-null uint16
2 longevity 42 non-null float64
3 size 42 non-null object
4 weight 42 non-null float64
5 height 42 non-null float64
dtypes: float64(3), object(2), uint16(1)
memory usage: 2.1+ KB
Here, we saved 0.3 KB, or \(\left( 1 - \frac{2.1}{2.4} \right) \cdot 100\% = 12.5\%\) of the space. These memory savings are insignificant in our humble 42-row DataFrame, but such changes can have a serious impact when working with large-scale datasets.
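To see why the overflow concern mentioned earlier is real: `uint16` arithmetic that exceeds \(65535\) silently wraps around (`numpy` may also emit an overflow warning).

```python
import numpy as np

np.uint16(60000) + np.uint16(60000)  # 54464, not 120000, since 120000 % 65536 = 54464
```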
You can also set the data type of a column when loading it in, to prevent ever instantiating the inefficient representation:
pd.read_csv(file_path, dtype={'lifetime_cost': 'uint16'})
The table below summarizes key data types in Python, `numpy`, and `pandas` (the latter two have subtle differences). Most importantly, notice that Python `str` values are stored using the `object` data type in `numpy` and `pandas`.
Pandas data type | Python type | NumPy data type | SQL type | Usage |
---|---|---|---|---|
int64 | int | int_, int8,…,int64, uint8,…,uint64 | INT, BIGINT | Integer numbers |
float64 | float | float_, float16, float32, float64 | FLOAT | Floating point numbers |
bool | bool | bool_ | BOOL | True/False values |
datetime64 or Timestamp | datetime.datetime | datetime64 | DATETIME | Date and time values |
timedelta64 or Timedelta | datetime.timedelta | timedelta64 | NA | Differences between two datetimes |
category | NA | NA | ENUM | Finite list of text values |
object | str | string, unicode | NA | Text |
object | NA | object | NA | Mixed types |
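For instance, a Series of strings is reported with the `object` data type:

```python
pd.Series(['sporting', 'terrier']).dtype  # dtype('O'), i.e. object
```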
For extra reading:
- This article details how `pandas` stores different data types under the hood.
- This article explains how `numpy`/`pandas` `int64` operations differ from vanilla `int` operations.
Axes
As we've mentioned before, the rows and columns of a DataFrame are both stored as Series. In Lecture 4, we learned about several Series methods, like `mean`, `value_counts`, and `argmax`.
It turns out that many of these methods work on DataFrames, too, with one caveat – we need to specify whether we want the method to be applied to:
- every column of the DataFrame, i.e. across `axis=0`, or
- every row of the DataFrame, i.e. across `axis=1`.
These axis definitions are the same ones that 2D `numpy` arrays have, which we first saw in Lecture 3.
The Series methods that also work on DataFrames usually use `axis=0` as a default. Let's illustrate with a few examples. To find the maximum element in each column, we can use:
dogs.max()
kind working
lifetime_cost 26686
longevity 16.5
size small
weight 175.0
height 30.0
dtype: object
Adding `axis=0` in the above would not have changed the result.
On the other hand, if we want the maximum element in each row, we'd use `axis=1`. However, `dogs.max(axis=1)` won't work directly, since that'd require us to compare numbers (like `15.0`) to strings (like `'working'`). So, we'll first select the numeric columns out of `dogs`. Note that while this works, the results are a little nonsensical, since the values in each column are on different scales.
dogs[['lifetime_cost', 'longevity', 'weight', 'height']].max(axis=1)
breed
Brittany 22589.0
Cairn Terrier 21992.0
English Cocker Spaniel 18993.0
...
Bullmastiff 13936.0
Mastiff 13581.0
Saint Bernard 20022.0
Length: 42, dtype: float64
A shortcut to `dogs[['lifetime_cost', 'longevity', 'weight', 'height']]` is `dogs.select_dtypes(include='number')`.
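Using that shortcut, the row-wise maximum from above can be computed without listing the numeric columns by hand:

```python
dogs.select_dtypes(include='number').max(axis=1)  # same result as above
```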
As another example, to find the number of unique values in each column:
dogs.nunique()
kind 7
lifetime_cost 42
longevity 40
size 3
weight 37
height 30
dtype: int64
or row:
dogs.nunique(axis=1)
breed
Brittany 6
Cairn Terrier 6
English Cocker Spaniel 6
..
Bullmastiff 6
Mastiff 6
Saint Bernard 6
Length: 42, dtype: int64
And, finally, to get summary statistics for each numeric column:
dogs.describe()
| | lifetime_cost | longevity | weight | height |
| --- | --- | --- | --- | --- |
| count | 42.00 | 42.00 | 42.00 | 42.00 |
| mean | 20532.84 | 11.34 | 49.35 | 18.34 |
| std | 3290.78 | 2.05 | 39.42 | 6.83 |
| min | 13581.00 | 6.50 | 5.00 | 5.00 |
| 25% | 18508.50 | 10.05 | 18.00 | 11.75 |
| 50% | 21006.00 | 11.81 | 36.50 | 18.50 |
| 75% | 22072.50 | 12.52 | 67.50 | 25.00 |
| max | 26686.00 | 16.50 | 175.00 | 30.00 |
Give this guide a read, and refer back to it if you run into behavior you weren't expecting while working on a homework.