DataFrame Internals

Table of contents

  1. Overview
  2. Consequences of mutability
    1. Adding columns using []
    2. Adding columns using assign
    3. Chained indexing
  3. Pandas and NumPy
    1. pandas data types
    2. Axes

Overview

In Lecture 4, we discussed the three main data structures in pandas:

  • DataFrame: 2 dimensional tables. These have rows and columns.
  • Series: 1 dimensional array-like object, representing a row or column.
  • Index: Sequence of row or column labels. When we say “the index”, we’re referring to the sequence of row labels.

We were exposed to several DataFrame methods, like head, tail, sort_values, and set_index, and Series methods, like mean, value_counts, and argmax (to name a few; see the pandas documentation for more).

We also learned about a few key techniques for accessing data in DataFrames; in particular, we learned that:

  • loc is used to extract a subset of rows and columns from a DataFrame, given their index values.
  • iloc is used to extract a subset of rows and columns from a DataFrame, given their integer positions.
  • The square brackets operator, used without loc or iloc, can either be used to select an individual column or to perform a query, where we select a subset of rows that satisfy particular conditions.

In this guide, we’ll discuss some of the details of how DataFrames are stored in memory, and how that might influence the way you interact with them in homeworks. In previous semesters, we’d show this material in lecture, but we felt that it’s better served as reference notes, and that lecture time is better spent on activities that involve conceptual problem solving.


Consequences of mutability

DataFrames, like lists, arrays, dictionaries, Series, and indexes, are mutable, meaning that they can be modified in-place after instantiation. A consequence of this is that DataFrames can be inadvertently modified in unexpected ways. To illustrate, we’ll look at a common operation: adding a new column. We’ll continue using the dogs DataFrame from Lecture 4.

dogs.head()
                            kind  lifetime_cost  longevity    size  weight  height
breed
Brittany                sporting        22589.0      12.92  medium    35.0    19.0
Cairn Terrier            terrier        21992.0      13.84   small    14.0    10.0
English Cocker Spaniel  sporting        18993.0      11.66  medium    30.0    16.0
Cocker Spaniel          sporting        24330.0      12.50   small    25.0    14.5
Shetland Sheepdog        herding        21006.0      12.53   small    22.0    14.5

Note that the DataFrame above only has 6 columns – 'kind', 'lifetime_cost', 'longevity', 'size', 'weight', and 'height'. 'breed' is not a column; rather, it’s stored in the index. The string 'breed' appears above the index because it was carried over from when 'breed' was initially a column; recall that in lecture, we manually set the index of dogs to 'breed'. The fact that it says 'breed' above the index does not change the fact that 'breed' is not one of the columns.
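You can confirm this by inspecting the columns and index attributes directly:

dogs.columns
Index(['kind', 'lifetime_cost', 'longevity', 'size', 'weight', 'height'], dtype='object')

dogs.index.name
'breed'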


Adding columns using []

The most common way to add new columns to a DataFrame, or modify existing ones, is the [] operator, which we also use to access columns. It works like dictionary assignment. (By “modify” a column, we mean replace an existing column with a new column of the same name.)

Using [] to add or modify columns is destructive, meaning it changes the underlying DataFrame in memory. Let’s demonstrate. Suppose we add a 'cost_per_year' column to dogs using the [] operator.

dogs['cost_per_year'] = dogs['lifetime_cost'] / dogs['longevity']

Note that we never explicitly re-assigned dogs in the cell above, i.e. we never wrote dogs = .... But, dogs was still modified! Scroll to the right below and you’ll see that dogs now has a 'cost_per_year' column.

dogs.head()
                            kind  lifetime_cost  longevity    size  weight  height  cost_per_year
breed
Brittany                sporting        22589.0      12.92  medium    35.0    19.0        1748.37
Cairn Terrier            terrier        21992.0      13.84   small    14.0    10.0        1589.02
English Cocker Spaniel  sporting        18993.0      11.66  medium    30.0    16.0        1628.90
Cocker Spaniel          sporting        24330.0      12.50   small    25.0    14.5        1946.40
Shetland Sheepdog        herding        21006.0      12.53   small    22.0    14.5        1676.46


In and of itself, this doesn’t seem like the worst thing in the world. But, the fact that DataFrames are mutable can have even more unintended consequences. For instance, consider the function cost_in_thousands, defined below.

# Note that cost_in_thousands doesn't return anything.
def cost_in_thousands(df):
    df['lifetime_cost'] = df['lifetime_cost'] / 1000

Suppose dogs is now a fresh copy of the dogs DataFrame, i.e. one without a 'cost_per_year' column. And suppose we run the following cell twice:

cost_in_thousands(dogs)

Even though we never explicitly re-assigned dogs, it was modified – twice.

                            kind  lifetime_cost  longevity    size  weight  height
breed
Brittany                sporting           0.02      12.92  medium    35.0    19.0
Cairn Terrier            terrier           0.02      13.84   small    14.0    10.0
English Cocker Spaniel  sporting           0.02      11.66  medium    30.0    16.0
Cocker Spaniel          sporting           0.02      12.50   small    25.0    14.5
Shetland Sheepdog        herding           0.02      12.53   small    22.0    14.5

The values in the 'lifetime_cost' column above aren’t just their original values divided by \(1000\); they’re their original values divided by \(1000 \cdot 1000 = 1000000\)! You may ask, why would we ever run cost_in_thousands(dogs) twice? On purpose, we probably wouldn’t. But when you have hundreds of cells in front of you, it’s common to inadvertently run a cell multiple times, and it’s best practice to structure your code so that no matter how many times it’s run, the results and consequences are the same.
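One way to make a cell safe to re-run is to recompute its result from a source the cell itself never modifies. Here’s a minimal sketch of that pattern, assuming (hypothetically) that the raw data lives in a file named 'dogs.csv':

# Safe to run any number of times: dogs is rebuilt from the raw
# file before it's modified, so re-running this cell always
# produces the same result.
dogs = pd.read_csv('dogs.csv', index_col='breed')  # 'dogs.csv' is a hypothetical path
dogs['lifetime_cost'] = dogs['lifetime_cost'] / 1000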

Moral of the story: avoid mutation when possible!


We’re about to show you a non-destructive technique for adding or modifying columns. But, since the [] technique is so common, we’ll leave you with a word of advice: when implementing functions that take in DataFrames, it’s a good idea to include df = df.copy() as the first line. df.copy() returns a deep copy of df – a new DataFrame with the same contents as df, but a different memory address. This way, the changes you make to df within the body of the function won’t impact any input DataFrames in your global scope.

# Better than before!
# Now, when cost_in_thousands is called, it won't
# destructively modify the DataFrame it's called on.
# But since it still doesn't return anything, calling it has no effect at all.
def cost_in_thousands(df):
    df = df.copy()
    df['lifetime_cost'] = df['lifetime_cost'] / 1000
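
Of course, for cost_in_thousands to be useful, it should also return the modified copy, and we’d re-assign dogs if we wanted the change to stick. A minimal sketch of the full pattern:

# Copy, modify the copy, and return it – the input DataFrame is untouched.
def cost_in_thousands(df):
    df = df.copy()
    df['lifetime_cost'] = df['lifetime_cost'] / 1000
    return df

# To persist the change, re-assign explicitly.
dogs = cost_in_thousands(dogs)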

Adding columns using assign

A non-destructive way to add a new column to a DataFrame is to use the assign DataFrame method. Like most other DataFrame methods we saw in Lecture 4 (and will see in the subsequent lectures), assign returns a new DataFrame, and leaves the original DataFrame untouched. The downside to this is that it is not very space efficient, as it creates a new copy in memory each time it’s called.

The assign method creates new columns using Python’s keyword argument syntax. If the column name we’re trying to add already exists, assign replaces the old column with the new one. In general, the syntax is:

df.assign(name_of_new_column=values_in_new_column)

To illustrate, suppose dogs is now a fresh copy of the original DataFrame loaded in at the start of this guide. The following expression returns a new DataFrame with all of the columns in dogs, plus a new 'cost_per_year' column.

dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])
                            kind  lifetime_cost  longevity    size  weight  height  cost_per_year
breed
Brittany                sporting        22589.0      12.92  medium    35.0    19.0        1748.37
Cairn Terrier            terrier        21992.0      13.84   small    14.0    10.0        1589.02
English Cocker Spaniel  sporting        18993.0      11.66  medium    30.0    16.0        1628.90
Cocker Spaniel          sporting        24330.0      12.50   small    25.0    14.5        1946.40
Shetland Sheepdog        herding        21006.0      12.53   small    22.0    14.5        1676.46

But, dogs itself is still unchanged:

dogs
                            kind  lifetime_cost  longevity    size  weight  height
breed
Brittany                sporting        22589.0      12.92  medium    35.0    19.0
Cairn Terrier            terrier        21992.0      13.84   small    14.0    10.0
English Cocker Spaniel  sporting        18993.0      11.66  medium    30.0    16.0
Cocker Spaniel          sporting        24330.0      12.50   small    25.0    14.5
Shetland Sheepdog        herding        21006.0      12.53   small    22.0    14.5

You can also use assign when the desired column name has spaces (and other special characters) by unpacking a dictionary:

dogs.assign(**{'cost per year 💵': dogs['lifetime_cost'] / dogs['longevity']})
                            kind  lifetime_cost  longevity    size  weight  height  cost per year 💵
breed
Brittany                sporting        22589.0      12.92  medium    35.0    19.0           1748.37
Cairn Terrier            terrier        21992.0      13.84   small    14.0    10.0           1589.02
English Cocker Spaniel  sporting        18993.0      11.66  medium    30.0    16.0           1628.90
Cocker Spaniel          sporting        24330.0      12.50   small    25.0    14.5           1946.40
Shetland Sheepdog        herding        21006.0      12.53   small    22.0    14.5           1676.46

To have any of these changes persist, we’d need to explicitly re-assign dogs to this DataFrame.

dogs = dogs.assign(cost_per_year=dogs['lifetime_cost'] / dogs['longevity'])

Admittedly, this syntax is a bit convoluted – especially when adding columns whose names have spaces – and the [] operator is more convenient. It just comes with side effects that you should be aware of.


Chained indexing

Consider the following example DataFrame, A.

A = pd.DataFrame().assign(
    x=[1, 2, 3],
    y=[4, 4, 5]
)
A
   x  y
0  1  4
1  2  4
2  3  5

Suppose we need to set 'y' to 3 in every row where 'x' is greater than 1. There are a few ways to attempt this:

  1. A[A['x'] > 1]['y'] = 3
  2. A.loc[A['x'] > 1, 'y'] = 3

It turns out that only the second line works, not the first. The reason: in the first method, A[A['x'] > 1] returns a new DataFrame, detached from A, and the ['y'] = 3 piece sets the 'y' values of that new, temporary DataFrame to 3. It doesn’t change A, unlike the second line.
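To see the difference concretely, here’s A after running the second line, the one that actually works:

A.loc[A['x'] > 1, 'y'] = 3
A
   x  y
0  1  4
1  2  3
2  3  3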

The first line uses what is called “chained indexing,” which should generally be avoided. If all we were trying to do was access the 'y' values for which 'x' is greater than 1, then A[A['x'] > 1]['y'] and A.loc[A['x'] > 1, 'y'] would give us equivalent results. But because the former technique can lead to problems when used for assignment, it’s best to avoid it altogether.

To read more about this nuance and the famed SettingWithCopyWarning, see the pandas documentation.


Pandas and NumPy

pandas is built upon numpy, and both pandas and numpy are part of a larger Python scientific computing ecosystem of packages. This rich ecosystem is part of why Python is so popular in different domains.

In particular, a Series is a (numpy) array with an index, and a DataFrame can be thought of as a dictionary of columns, each of which is an array.

Many operations in pandas are fast because they use numpy’s implementations, which are written in fast, compiled languages like C and Fortran. If you need to access the array underlying a DataFrame or Series, use the to_numpy method.

dogs['lifetime_cost']
breed
Brittany                  22589.0
Cairn Terrier             21992.0
English Cocker Spaniel    18993.0
                           ...   
Bullmastiff               13936.0
Mastiff                   13581.0
Saint Bernard             20022.0
Name: lifetime_cost, Length: 42, dtype: float64


dogs['lifetime_cost'].to_numpy()
array([22589., 21992., 18993., ..., 13936., 13581., 20022.])


pandas data types

Each Series has a numpy data type, which refers to the type of the values stored within. Remember that each column (and row) in a DataFrame is stored as a Series.

The data type of dogs['lifetime_cost'] is float64, as you can see in the last line of the Series’ preview above. This is the pandas/numpy version of the vanilla Python float data type, and you’ll run into it frequently in your autograder tests.
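For instance, individual values pulled out of this Series are numpy floats, not vanilla Python floats. Conveniently, numpy’s float64 is a subclass of Python’s built-in float, so comparisons against plain floats still behave as expected:

first = dogs['lifetime_cost'].iloc[0]  # The first entry of the Series.
type(first)                            # numpy.float64
isinstance(first, float)               # True – float64 subclasses float.
first == 22589.0                       # True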

You can programmatically access the data type of a Series using the dtype attribute:

dogs['lifetime_cost'].dtype
dtype('float64')

The DataFrame dtypes attribute shows you the data types of all columns in the DataFrame:

dogs.dtypes
kind              object
lifetime_cost    float64
longevity        float64
size              object
weight           float64
height           float64
dtype: object

The object data type usually refers to strings, but can be thought of as a catch-all data type that refers to anything that’s non-numeric.


A Series’ data type determines which operations can be applied to it. When creating a DataFrame using pd.read_csv, pandas tries to guess the correct data types for a given DataFrame, and is often wrong. This can lead to incorrect calculations and poor memory/time performance. As a result, you may need to explicitly convert between data types.

To change the data type of a Series, use the astype method. For instance, if s were a Series of strings that you wanted to convert to integers, you’d use s.astype(int).
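For example, here’s a quick sketch (note that on most platforms, astype(int) produces the int64 data type):

s = pd.Series(['1', '2', '3'])  # A Series of strings.
s.astype(int)
0    1
1    2
2    3
dtype: int64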

Sometimes, changing the data type of a Series can lead to performance benefits. The DataFrame info method shows us all of the information that the dtypes attribute does, as well as the space taken up by the DataFrame.

dogs.info()
<class 'pandas.core.frame.DataFrame'>
Index: 42 entries, Brittany to Saint Bernard
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   kind           42 non-null     object 
 1   lifetime_cost  42 non-null     float64
 2   longevity      42 non-null     float64
 3   size           42 non-null     object 
 4   weight         42 non-null     float64
 5   height         42 non-null     float64
dtypes: float64(4), object(2)
memory usage: 2.4+ KB


The 'lifetime_cost' column currently has a data type of float64, but all of its values are integers between \(13581\) and \(26686\). We could set the data type of 'lifetime_cost' to uint16, short for unsigned 16-bit integer, which can represent integers from \(0\) to \(2^{16} - 1 = 65535\). In practice, we probably wouldn’t do this, because any operations involving the new 'lifetime_cost' are at risk of numerical overflow, but we could still try, because doing so saves some space:

dogs['lifetime_cost'] = dogs['lifetime_cost'].astype('uint16') # Notice the quotes!
dogs.info()
<class 'pandas.core.frame.DataFrame'>
Index: 42 entries, Brittany to Saint Bernard
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   kind           42 non-null     object 
 1   lifetime_cost  42 non-null     uint16 
 2   longevity      42 non-null     float64
 3   size           42 non-null     object 
 4   weight         42 non-null     float64
 5   height         42 non-null     float64
dtypes: float64(3), object(2), uint16(1)
memory usage: 2.1+ KB

Here, we saved 0.3 KB, or \(\left( 1 - \frac{2.1}{2.4} \right) \cdot 100\% = 12.5\%\) of space. These memory savings are insignificant in our humble 42 row DataFrame, but such changes can have a serious impact when working with large-scale datasets.
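To make the overflow risk concrete, here’s a minimal sketch: uint16 arithmetic silently wraps around once a result exceeds 65535.

s = pd.Series([60000], dtype='uint16')
s + s  # 120000 doesn't fit in 16 bits, so the result wraps to 120000 - 65536 = 54464.
0    54464
dtype: uint16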

You can also set the data type of a column when loading it in, to prevent ever instantiating the inefficient representation:

pd.read_csv(file_path, dtype={'lifetime_cost': 'uint16'})


The table below summarizes key data types in Python, numpy, and pandas (the latter two have subtle differences). Most importantly, notice that Python str types are object types in numpy and pandas.

Pandas data type          Python type         NumPy data type                             SQL type     Usage
int64                     int                 int_, int8, …, int64, uint8, …, uint64      INT, BIGINT  Integer numbers
float64                   float               float_, float16, float32, float64           FLOAT        Floating point numbers
bool                      bool                bool_                                       BOOL         True/False values
datetime64 or Timestamp   datetime.datetime   datetime64                                  DATETIME     Date and time values
timedelta64 or Timedelta  datetime.timedelta  timedelta64                                 NA           Differences between two datetimes
category                  NA                  NA                                          ENUM         Finite list of text values
object                    str                 string, unicode                             NA           Text
object                    NA                  object                                      NA           Mixed types

For extra reading:

  • This article details how pandas stores different data types under the hood.
  • This article explains how numpy/pandas int64 operations differ from vanilla int operations.


Axes

As we’ve mentioned before, the rows and columns of a DataFrame are both stored as Series. In Lecture 4, we learned about several Series methods, like mean, value_counts, and argmax.

It turns out that many of these methods also work on DataFrames, with one caveat – we need to specify whether we want the method applied to:

  • every column of the DataFrame, i.e. across axis=0, or
  • every row of the DataFrame, i.e. across axis=1.

These axis definitions are the same ones that 2D numpy arrays have, that we first saw in Lecture 3.


The Series methods that also work with DataFrames usually use axis=0 as a default. Let’s illustrate with a few examples. To find the maximum element in each column, we can use:

dogs.max()
kind             working
lifetime_cost      26686
longevity           16.5
size               small
weight             175.0
height              30.0
dtype: object

Adding axis=0 in the above would not have changed the result.


On the other hand, if we want the maximum element in each row, we’d use axis=1. However, dogs.max(axis=1) won’t work directly, since that’d require us to compare numbers (like 15.0) to strings (like 'working'). So, we’ll first select the numeric columns out of dogs. Note that while this works, the results are a little nonsensical, since the values in each column are on different scales.

dogs[['lifetime_cost', 'longevity', 'weight', 'height']].max(axis=1)
breed
Brittany                  22589.0
Cairn Terrier             21992.0
English Cocker Spaniel    18993.0
                           ...   
Bullmastiff               13936.0
Mastiff                   13581.0
Saint Bernard             20022.0
Length: 42, dtype: float64

A shortcut to dogs[['lifetime_cost', 'longevity', 'weight', 'height']] is dogs.select_dtypes(include='number').


As another example, to find the number of unique values in each column:

dogs.nunique()
kind              7
lifetime_cost    42
longevity        40
size              3
weight           37
height           30
dtype: int64

or row:

dogs.nunique(axis=1)
breed
Brittany                  6
Cairn Terrier             6
English Cocker Spaniel    6
                         ..
Bullmastiff               6
Mastiff                   6
Saint Bernard             6
Length: 42, dtype: int64


And, finally, to get summary statistics for each numeric column:

dogs.describe()
       lifetime_cost  longevity  weight  height
count          42.00      42.00   42.00   42.00
mean        20532.84      11.34   49.35   18.34
std          3290.78       2.05   39.42    6.83
min         13581.00       6.50    5.00    5.00
25%         18508.50      10.05   18.00   11.75
50%         21006.00      11.81   36.50   18.50
75%         22072.50      12.52   67.50   25.00
max         26686.00      16.50  175.00   30.00

Give this guide a read, and refer back to it if you run into behavior you weren’t expecting while working on a homework.