from lec_utils import *

def multiple_kdes(ser_map, title=""):
    values = [ser_map[key].dropna() for key in ser_map]
    labels = list(ser_map.keys())
    fig = ff.create_distplot(
        hist_data=values,
        group_labels=labels,
        show_rug=False,
        show_hist=False,
        colors=px.colors.qualitative.Dark2[: len(ser_map)],
    )
    return fig.update_layout(title=title, width=1000).update_xaxes(title="child")
Announcements 📣¶
Homework 3 is due tonight. See this post on Ed for an important clarification.
We've slightly adjusted the Office Hours schedule – take a look, and please come by.
I have office hours right after lecture today!
study.practicaldsc.org contains our discussion worksheets (and solutions), which are made up of old exam problems. Use these problems to build your theoretical understanding of the material, and come to discussion!
Agenda¶
- Recap: Types of visualizations.
- Visualization best practices.
- Handling missing values.
Recap: Types of visualizations¶
Dataset setup¶
- Run the cell below to load in our dataset and clean it, using the functions defined in the last lecture.
def clean_term_column(df):
    return df.assign(
        term=df['term'].str.split().str[0].astype(int)
    )

def clean_date_column(df):
    return (
        df
        .assign(date=pd.to_datetime(df['issue_d'], format='%b-%Y'))
        .drop(columns=['issue_d'])
    )

loans = (
    pd.read_csv('data/loans.csv')
    .pipe(clean_term_column)
    .pipe(clean_date_column)
)
- Each time you run the cell below, you'll see a different sample of rows in loans.
loans.sample(5)
id | loan_amnt | term | int_rate | ... | fico_range_high | hardship_flag | mths_since_last_delinq | date | |
---|---|---|---|---|---|---|---|---|---|
3969 | 44897054 | 4800.0 | 36 | 13.33 | ... | 704.0 | N | 56.0 | 2015-04-01 |
957 | 15240229 | 30000.0 | 60 | 18.92 | ... | 704.0 | N | NaN | 2014-05-01 |
3550 | 140840563 | 20000.0 | 36 | 18.94 | ... | 709.0 | N | 31.0 | 2018-10-01 |
3997 | 38607756 | 10000.0 | 36 | 8.19 | ... | 734.0 | N | NaN | 2015-01-01 |
2736 | 130395447 | 15000.0 | 36 | 9.43 | ... | 694.0 | N | NaN | 2018-04-01 |
5 rows × 20 columns
Choosing the correct type of visualization¶
- The type of visualization we create depends on the types of features we're visualizing.
- We'll directly learn how to produce the bolded visualizations below, but the others are also options.
See more examples here.
Feature types | Options |
---|---|
Single categorical feature | Bar charts, pie charts, dot plots |
Single numerical feature | Histograms, box plots, density curves, rug plots, violin plots |
Two numerical features | Scatter plots, line plots, heat maps, contour plots |
One categorical and one numerical feature (it really depends on the nature of the features themselves!) | Side-by-side histograms, box plots, or bar charts; overlaid line plots or density curves |
- Note that we use the words "plot", "chart", and "graph" to mean the same thing.
Bar charts¶
- Bar charts are used to show:
- The distribution of a single categorical feature, or
- The relationship between one categorical feature and one numerical feature.
- Usage: px.bar/px.barh or df.plot(kind='bar')/df.plot(kind='barh'). The 'h' stands for "horizontal."
- Example: What is the distribution of 'addr_state's in loans?
# Here, we're using the .plot method on loans['addr_state'], which is a Series.
# We prefer horizontal bar charts, since they're easier to read.
(
loans['addr_state']
.value_counts()
.plot(kind='barh')
)
# A little formatting goes a long way!
(
loans['addr_state']
.value_counts()
.head(10)
.sort_values()
.plot(kind='barh', title='States of Residence for Successful Loan Applicants')
.update_layout()
)
- Example: What is the average 'int_rate' for each 'home_ownership' status?
(
loans
.groupby('home_ownership')
['int_rate']
.mean()
.plot(kind='barh', title='Average Interest Rate by Home Ownership Status')
)
# The "ANY" category seems to be an outlier.
loans['home_ownership'].value_counts()
home_ownership MORTGAGE 2810 RENT 2539 OWN 950 ANY 1 Name: count, dtype: int64
Side-by-side bar charts¶
- Instead of just looking at 'int_rate's for different 'home_ownership' statuses, we could also group by loan 'term'. As we'll see, 'term' impacts 'int_rate' far more than 'home_ownership' does.
(
loans
.groupby('home_ownership')
.filter(lambda df: df.shape[0] > 1) # Gets rid of the "ANY" category.
.groupby(['home_ownership', 'term'])
[['int_rate']]
.mean()
)
home_ownership | term | int_rate |
---|---|---|
MORTGAGE | 36 | 11.42 |
MORTGAGE | 60 | 15.27 |
OWN | 36 | 11.75 |
OWN | 60 | 16.14 |
RENT | 36 | 12.23 |
RENT | 60 | 16.43 |
- A side-by-side bar chart, which we can create by setting the color and barmode arguments, makes the pattern clear:
# Annoyingly, the side-by-side bar chart doesn't work properly
# if the column that separates colors (here, 'term')
# isn't made up of strings.
(
loans
.assign(term=loans['term'].astype(str) + ' months')
.groupby('home_ownership')
.filter(lambda df: df.shape[0] > 1)
.groupby(['home_ownership', 'term'])
[['int_rate']]
.mean()
.reset_index()
.plot(kind='bar',
y='int_rate',
x='home_ownership',
color='term',
barmode='group',
title='Average Interest Rate by Home Ownership Status and Loan Term',
width=800)
)
- Why do longer loans have higher 'int_rate's on average?
Histograms¶
- The previous slide showed the average 'int_rate' for different combinations of 'home_ownership' status and 'term'. But what if we want to visualize more about 'int_rate's than just their average?
- Histograms are used to show the distribution of a single numerical feature.
- Usage: px.histogram or df.plot(kind='hist').
- Example: What is the distribution of 'int_rate'?
(
loans
.plot(kind='hist', x='int_rate', title='Distribution of Interest Rates')
)
- With fewer bins, we see less detail (and less noise) in the shape of the distribution.
Play with the slider that appears when you run the cell below!
from ipywidgets import interact

def hist_bins(nbins):
    (
        loans
        .plot(kind='hist', x='int_rate', nbins=nbins, title='Distribution of Interest Rates')
        .show()
    )

interact(hist_bins, nbins=(1, 51));
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
Based on the histogram below, what is the relationship between the mean and median interest rate?
- A. Mean > median.
- B. Mean $\approx$ median.
- C. Mean < median.
(
loans
.plot(kind='hist', x='int_rate', title='Distribution of Interest Rates')
)
Box plots and violin plots¶
- Box plots and violin plots are alternatives to histograms, in that they also are used to show the distribution of quantitative features.
Learn more about box plots here.
- The benefit to them is that they're easily stacked side-by-side to compare distributions.
- Example: What is the distribution of 'int_rate'?
(
loans
.plot(kind='box', x='int_rate', title='Distribution of Interest Rates')
)
- Example: What is the distribution of 'int_rate', separately for each loan 'term'?
(
loans
.plot(kind='box', y='int_rate', color='term', orientation='v',
title='Distribution of Interest Rates by Loan Term')
)
(
loans
.plot(kind='violin', y='int_rate', color='term', orientation='v',
title='Distribution of Interest Rates by Loan Term')
)
- Overlaid histograms can be used to show the distributions of multiple numerical features, too.
(
loans
.plot(kind='hist', x='int_rate', color='term', marginal='box', nbins=20,
title='Distribution of Interest Rates by Loan Term')
)
Scatter plots¶
- Scatter plots are used to show the relationship between two quantitative features.
- Usage: px.scatter or df.plot(kind='scatter').
- Example: What is the relationship between 'int_rate' and debt-to-income ratio, 'dti'?
(
loans
.sample(200, random_state=23)
.plot(kind='scatter', x='dti', y='int_rate', title='Interest Rate vs. Debt-to-Income Ratio')
)
- Scatter plots can be customized in many ways: we can color points by group, size them based on another numeric column, give them hover labels, and more.
(
loans
.assign(term=loans['term'].astype(str))
.sample(200, random_state=23)
.plot(kind='scatter', x='dti', y='int_rate', color='term',
hover_name='id', size='loan_amnt',
title='Interest Rate vs. Debt-to-Income Ratio')
)
Line charts¶
- Line charts are used to show how one quantitative feature changes over time.
- Usage: px.line or df.plot(kind='line').
- Example: How many loans were given out each year in our dataset?
This is likely not true of the market in general, or even LendingClub in general, but just a consequence of where our dataset came from.
(
loans
.assign(year=loans['date'].dt.year)
['year']
.value_counts()
.sort_index()
.plot(kind='line', title='Number of Loans Given Per Year')
)
- Example: How has the average 'int_rate' changed over time?
(
loans
.resample('6M', on='date')
['int_rate']
.mean()
.plot(kind='line', title='Average Interest Rate over Time')
)
- Example: How has the average 'int_rate' changed over time, separately for 36-month and 60-month loans?
(
loans
.groupby('term')
.resample('6M', on='date')
['int_rate']
.mean()
.reset_index()
.plot(kind='line', x='date', y='int_rate', color='term',
title='Average Interest Rate over Time')
)
Even more examples¶
- We've uploaded a supplementary notebook, lec08-extra.ipynb, that contains even more examples of visualizations created in plotly. There's also an HTML version, posted on the course website.
- Refer to it, along with the plotly examples library, for inspiration and reference!
Visualization best practices¶
Why visualize?¶
- In this lecture, we created several visualizations using just a single dataset.
- The visualizations we want to create will often dictate the data cleaning steps we take.
For example, the plot of average interest rate over time wouldn't have been possible if we hadn't converted dates to timestamp objects!
- One reason to create visualizations is for us to better understand our data.
- Another reason is to accurately communicate a message to other people!
Example: Populations of continents¶
- Below, we load in a dataset with information about various countries over time, maintained by Gapminder.
world = px.data.gapminder() # The dataset is built into plotly.express.
world
country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num | |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.80 | 8425333 | 779.45 | AFG | 4 |
1 | Afghanistan | Asia | 1957 | 30.33 | 9240934 | 820.85 | AFG | 4 |
2 | Afghanistan | Asia | 1962 | 32.00 | 10267083 | 853.10 | AFG | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1701 | Zimbabwe | Africa | 1997 | 46.81 | 11404948 | 792.45 | ZWE | 716 |
1702 | Zimbabwe | Africa | 2002 | 39.99 | 11926563 | 672.04 | ZWE | 716 |
1703 | Zimbabwe | Africa | 2007 | 43.49 | 12311143 | 469.71 | ZWE | 716 |
1704 rows × 8 columns
- Let's suppose we're interested in understanding the distribution of Earth's population by continent.
pop_by_cont = (
world[world['year'] == world['year'].max()]
.groupby('continent')
['pop']
.sum()
)
pop_by_cont
continent Africa 929539692 Americas 898871184 Asia 3811953827 Europe 586098529 Oceania 24549947 Name: pop, dtype: int64
- We've been taught to draw a bar chart:
(
pop_by_cont
.sort_values()
.plot(kind='barh', title='Distribution of Population by Continent')
)
- Or, we could draw a pie chart:
px.pie(
pop_by_cont.reset_index(),
values='pop',
names='continent',
title='Distribution of Population by Continent'
).update_traces(textinfo='label')
- In which plot is it easier to see that Africa's population is larger than that of the Americas?
Takeaway: Make comparisons easy!¶
- In the mid-1980s, statisticians ran experiments comparing how easily human subjects were able to tell apart changes in length, angle, area, volume, color, and other visual encodings.
Read this article for more details.
- Your job is to make comparisons easy! Avoid pie charts and other visual representations that make it difficult to understand the data.
In the women's heights example, the area of the India figure is tiny compared to the area of the Latvia figure, despite only representing a value 5 inches smaller.
Aside: What is a distribution?¶
- Question: Why does the following bar chart not contain a distribution?
- Answer: Because individuals can be in multiple categories, and the frequencies don't add to 100%.
Read the fine print!
- The distribution of a feature tells us the unique values of a feature, and how often they occur.
If using counts, they should add up to the number of data points; if using percentages, they should add up to 100%.
# Actually a distribution!
(
pop_by_cont
.sort_values()
.plot(kind='barh', title='Distribution of Population by Continent')
)
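- As a quick sanity check (a rough sketch) that pop_by_cont describes a distribution: converting the counts to proportions, they should sum to 1.
# In a distribution, the proportions of the unique values sum to 1.
proportions = pop_by_cont / pop_by_cont.sum()
proportions.sum()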
Example: Number of Walmarts¶
- Below, we load in a dataset describing each Walmart location in the US as of 2006.
wm = pd.read_csv('data/walmart.csv')
wm
storenum | OPENDATE | date_super | conversion | ... | LON | MONTH | DAY | YEAR | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7/1/62 | 3/1/97 | 1.0 | ... | -94.07 | 7 | 1 | 1962 |
1 | 2 | 8/1/64 | 3/1/96 | 1.0 | ... | -93.09 | 8 | 1 | 1964 |
2 | 4 | 8/1/65 | 3/1/02 | 1.0 | ... | -94.50 | 8 | 1 | 1965 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2989 | 5485 | 1/27/06 | NaN | NaN | ... | -87.70 | 1 | 27 | 2006 |
2990 | 3425 | 1/27/06 | 1/27/06 | 0.0 | ... | -95.22 | 1 | 27 | 2006 |
2991 | 5193 | 1/31/06 | NaN | NaN | ... | -117.17 | 1 | 31 | 2006 |
2992 rows × 16 columns
wm.columns
Index(['storenum', 'OPENDATE', 'date_super', 'conversion', 'st', 'county', 'STREETADDR', 'STRCITY', 'STRSTATE', 'ZIPCODE', 'type_store', 'LAT', 'LON', 'MONTH', 'DAY', 'YEAR'], dtype='object')
- Let's visualize the number of Walmarts per state. We could use a bar chart, as in the continents example.
wm_per_state = wm['STRSTATE'].value_counts()
wm_per_state
STRSTATE TX 315 FL 175 CA 159 ... WY 9 ND 8 DE 8 Name: count, Length: 41, dtype: int64
wm_per_state.head(10).sort_values().plot(kind='barh', title='Number of Walmarts Per State')
- But, perhaps a more interesting visualization is a choropleth.
choro = px.choropleth(wm_per_state.reset_index(),
locations='STRSTATE',
color='count',
locationmode='USA-states',
scope='usa',
title='Number of Walmarts Per State')
choro
Takeaway: Choose an appropriate color scheme!¶
- Question: What is different between the following two choropleths?
choro
- Answer: In the top choropleth, the feature being compared across states is categorical (political party). In the bottom choropleth, the feature being compared across states is numerical (number of Walmarts).
- When comparing categories, use very different colors for each category, ideally choosing from a known color-blind friendly color palette.
- When comparing numbers, choose an appropriate continuous color scheme.
There are two types: sequential, where larger values are more intense and smaller values are less intense; or diverging, where both large and small values are equally intense, but in different colors.
More examples¶
- Here's another example of a sequential continuous color scale in action:
px.choropleth(wm_per_state.reset_index(),
locations='STRSTATE',
color='count',
locationmode='USA-states',
scope='usa',
title='Number of Walmarts Per State',
color_continuous_scale='greens')
- Here's a diverging color scale, where dark blue means "large" and dark red means "small." Here, it feels unnatural that states with very few Walmarts and very many Walmarts are similarly "intense."
px.choropleth(wm_per_state.reset_index(),
locations='STRSTATE',
color='count',
locationmode='USA-states',
scope='usa',
title='Number of Walmarts Per State',
color_continuous_scale='rdbu')
- But, diverging color schemes like the one above make sense in other cases, e.g. in political choropleths that show voting margins.
More resources¶
- Entire courses are dedicated to data visualization.
Unfortunately, we don't have an entire semester to dedicate to it ourselves!
- We've just provided you with a few high-level considerations to be aware of when making plots.
- For more resources, look at:
- This lecture I taught at another university that discusses some of these ideas in more depth.
- This visualization course at UC San Diego.
- This visualization course at the University of Washington.
- This visualization course at UC Berkeley.
Handling missing values¶
Intentionally missing values and default replacements¶
- Sometimes, values are missing intentionally, or by design. In these cases, we can't fill in the missing values.
For instance, if we survey students and ask "if you're from Michigan, what high school did you go to?", the students not from Michigan will have missing responses. But, there's nothing to fill in, since they're not from Michigan!
- Other times, missing values have a default replacement.
For instance, you automatically get a 0 for all assignments you don't submit in this class. So, when calculating your grades, I'll need to fill in all of your NaNs with 0. The DataFrame/Series fillna method helps with this; see the short sketch below.
- Most situations are more complicated than this, though!
Don't get in the habit of just automatically filling all null values with 0.
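- For the default-replacement case above, here's a minimal sketch of fillna, using a hypothetical Series of assignment scores (the scores and index names are made up for illustration):
# Hypothetical assignment scores; NaN means the assignment wasn't submitted.
scores = pd.Series([95.0, np.nan, 88.0, np.nan], index=['hw1', 'hw2', 'hw3', 'hw4'])

# Since unsubmitted assignments earn a 0 by default, fillna(0) is appropriate here.
scores.fillna(0)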
Generally, what do we do with missing data?¶
- Consider a feature, $Y$.
Imagine $Y$ is a column in a DataFrame.
- Some of its values, $Y_\text{present}$, are present, while others, $Y_\text{missing}$, are missing.
- Issue: $Y_\text{present}$ may look different from the full dataset, $Y$.
Remember, we don't get to see $Y$.
- That is, the mean, median, and variance of $Y_\text{present}$ may differ from those of $Y$.
- Furthermore, the correlations between $Y_\text{present}$ and other features may differ from the correlations between $Y$ and other features.
Example: Heights¶
- Below, we load in a dataset containing the heights of parents and their children. Some of the 'child' heights are missing.
- Aside: The dataset was collected by Sir Francis Galton, who developed many key ideas in statistics (including correlation and regression) for the purposes of eugenics, a field he also originated.
heights = pd.read_csv('data/heights-missing-2.csv')
heights.head()
father | mother | gender | child | |
---|---|---|---|---|
0 | 78.5 | 67.0 | male | NaN |
1 | 78.5 | 67.0 | female | 69.2 |
2 | 78.5 | 67.0 | female | 69.0 |
3 | 78.5 | 67.0 | female | 69.0 |
4 | 75.5 | 66.5 | male | NaN |
- Goal: Try to fill the missing values in heights['child'] using the information we do have.
- Plan: Discuss several ideas on how to solve this problem.
In practice, the approach you use depends on the situation.
Aside: Kernel density estimates¶
- In this section, we'll need to visualize the distributions of many numerical features.
- To do so, we'll use yet another visualization, a kernel density estimate (KDE). Think of a KDE as a smoothed version of a histogram.
heights['child'].plot(kind='hist', nbins=30)
# You can see how multiple_kdes is defined at the top of the notebook.
multiple_kdes({'Before Imputation': heights['child']})
Idea: Dropping missing values¶
- One solution is to "drop" all rows with missing values, and do calculations with just the values that we have.
- This is called listwise deletion.
heights
father | mother | gender | child | |
---|---|---|---|---|
0 | 78.5 | 67.0 | male | NaN |
1 | 78.5 | 67.0 | female | 69.2 |
2 | 78.5 | 67.0 | female | 69.0 |
... | ... | ... | ... | ... |
931 | 62.0 | 66.0 | female | 61.0 |
932 | 62.5 | 63.0 | male | 66.5 |
933 | 62.5 | 63.0 | female | 57.0 |
934 rows × 4 columns
heights.dropna()
father | mother | gender | child | |
---|---|---|---|---|
1 | 78.5 | 67.0 | female | 69.2 |
2 | 78.5 | 67.0 | female | 69.0 |
3 | 78.5 | 67.0 | female | 69.0 |
... | ... | ... | ... | ... |
931 | 62.0 | 66.0 | female | 61.0 |
932 | 62.5 | 63.0 | male | 66.5 |
933 | 62.5 | 63.0 | female | 57.0 |
765 rows × 4 columns
- Issue: We went from 934 to 765 rows, which means we lost 18% of rows for all columns, even columns in which no values were originally missing.
- Most numpy/pandas methods already ignore missing values when performing calculations, so we don't need to do anything extra to ignore the missing values when calculating the mean and standard deviation.
heights['child'].mean()
67.10339869281046
heights['child'].std()
3.5227776335950374
Idea: Mean imputation¶
- Suppose we need all of the missing values to be filled in, or imputed, for our future analyses, meaning we can't just drop them.
- A terrible idea would be to impute all of the missing values with 0. Why?
heights['child']
0 NaN 1 69.2 2 69.0 ... 931 61.0 932 66.5 933 57.0 Name: child, Length: 934, dtype: float64
# DON'T do this!
heights['child'].fillna(0)
0 0.0 1 69.2 2 69.0 ... 931 61.0 932 66.5 933 57.0 Name: child, Length: 934, dtype: float64
- A better idea is to impute missing values with the mean of the observed values.
heights['child']
0 NaN 1 69.2 2 69.0 ... 931 61.0 932 66.5 933 57.0 Name: child, Length: 934, dtype: float64
heights['child'].mean()
67.10339869281046
mean_imputed = heights['child'].fillna(heights['child'].mean())
mean_imputed
0 67.1 1 69.2 2 69.0 ... 931 61.0 932 66.5 933 57.0 Name: child, Length: 934, dtype: float64
- The mean of mean_imputed is the same as the mean of 'child' before we imputed. Why?
# Mean before imputation:
heights['child'].mean()
67.10339869281046
# Mean after imputation:
mean_imputed.mean()
67.10339869281046
- What do you think a histogram of mean_imputed would look like?
mean_imputed.value_counts()
child 67.1 169 70.0 54 68.0 50 ... 63.2 1 62.2 1 59.0 1 Name: count, Length: 64, dtype: int64
Mean imputation destroys spread!¶
- Let's look at the distribution of heights['child'] before we filled in missing values, along with the distribution of mean_imputed, after we filled in missing values.
multiple_kdes({'Before Imputation': heights['child'],
'After Mean Imputation': mean_imputed})
- The standard deviation after imputing with the mean is much lower! The true distribution of 'child' likely does not look like the distribution after mean imputation.
heights['child'].std()
3.5227776335950374
mean_imputed.std()
3.187800157298298
- This makes it harder to use the imputed 'child' column in analyses with other columns.
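- One way to see this (a rough sketch): compare the correlation between 'child' and 'father' heights before and after mean imputation.
# pandas' .corr ignores rows with missing values, so the first number uses only
# the observed pairs. After mean imputation, the imputed rows contribute nothing
# to the covariance, which typically shrinks the correlation toward 0.
heights['child'].corr(heights['father']), mean_imputed.corr(heights['father'])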
Mean imputation and listwise deletion introduce bias!¶
- What if the values that are missing, $Y_\text{missing}$, are not a representative sample of the full dataset, $Y$?
Equivalently, what if the values that are present are not a representative sample of the full dataset?
For example, if shorter heights are more likely to be missing than taller heights, then:
- The mean of the present values will be too big.
$$\text{mean}(Y_\text{present}) > \text{mean}(Y)$$
- So, by replacing missing values with the mean, our estimates will all be too big.
- Instead of filling all missing values with the same one value, can we do something to prevent this added bias?
Idea: Conditional mean imputation¶
- If we have reason to believe the chance that a 'child' height is missing depends on another feature, we can use that other feature to inform how we fill the missing value!
- For example, if we have reason to believe that heights are more likely to be missing for 'female' children than 'male' children, we could fill in the missing 'female' and 'male' heights separately.
# Here, we're computing the proportion of 'child' heights that are missing
# per gender.
(
heights
.groupby('gender')
['child']
.apply(lambda s: s.isna().mean())
)
gender female 0.33 male 0.04 Name: child, dtype: float64
- That seems to be the case here, so let's try it. We can use the groupby transform method.
# The mean 'female' observed 'child' height is 64.03, while
# the mean 'male' observed 'child' height is 69.13.
heights.groupby('gender')['child'].mean()
gender female 64.03 male 69.13 Name: child, dtype: float64
heights
father | mother | gender | child | |
---|---|---|---|---|
0 | 78.5 | 67.0 | male | NaN |
1 | 78.5 | 67.0 | female | 69.2 |
2 | 78.5 | 67.0 | female | 69.0 |
... | ... | ... | ... | ... |
931 | 62.0 | 66.0 | female | 61.0 |
932 | 62.5 | 63.0 | male | 66.5 |
933 | 62.5 | 63.0 | female | 57.0 |
934 rows × 4 columns
# Note that the first missing 'child' height is filled in with
# 69.13, the mean of the observed 'male' heights, since
# that row corresponds to a 'male' child!
conditional_mean_imputed = ...

conditional_mean_imputed = (
    heights
    .groupby('gender')
    ['child']
    .transform(lambda s: s.fillna(s.mean()))
)
conditional_mean_imputed
0 69.13 1 69.20 2 69.00 ... 931 61.00 932 66.50 933 57.00 Name: child, Length: 934, dtype: float64
Pros and cons of conditional mean imputation¶
- Instead of having a single "spike", the conditionally-imputed distribution has two smaller "spikes".
In this case, one at the observed 'female' mean and one at the observed 'male' mean.
multiple_kdes({'Before Imputation': heights['child'],
'After Mean Imputation': mean_imputed,
'After Conditional Mean Imputation': conditional_mean_imputed})
- Pro ✅: The conditionally-imputed column's mean is likely to be closer to the true mean than if we just dropped all missing values, since we attempted to account for the imbalance in missingness.
# The mean of just our present values.
heights['child'].mean()
67.10339869281046
# Lower than above, reflecting the fact that we are missing
# more 'female' heights and 'female' heights
# tend to be lower.
conditional_mean_imputed.mean()
66.65591770566121
- Con ❌: The conditionally-imputed column likely still has a lower standard deviation than the true 'child' column.
The true 'child' column likely doesn't look like the conditionally-imputed distribution above.
- Con ❌: The chance that 'child' heights are missing may depend on other columns, too, and we didn't account for those. There may still be bias.
Idea: Regression imputation¶
- A common solution is to fill in missing values by using other features to predict what the missing value would have been.
# There's nothing special about the values passed into .iloc below;
# they're just for illustration.
heights.iloc[[0, 2, 919, 11, 4, 8, 9]]
father | mother | gender | child | |
---|---|---|---|---|
0 | 78.5 | 67.0 | male | NaN |
2 | 78.5 | 67.0 | female | 69.0 |
919 | 64.0 | 64.0 | female | NaN |
11 | 75.0 | 64.0 | male | 68.5 |
4 | 75.5 | 66.5 | male | NaN |
8 | 75.0 | 64.0 | male | 71.0 |
9 | 75.0 | 64.0 | female | 68.0 |
- We'll learn how to make such predictions in the second half of the course.
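- As a preview, here's a rough sketch of what regression imputation might look like, using scikit-learn's LinearRegression (a tool we haven't covered yet); it assumes 'father', 'mother', and 'gender' have no missing values in this dataset.
from sklearn.linear_model import LinearRegression

# Use 'father', 'mother', and an indicator for 'gender' to predict 'child' heights.
features = heights.assign(is_female=(heights['gender'] == 'female').astype(int))[
    ['father', 'mother', 'is_female']
]
observed = heights['child'].notna()

# Fit the model on rows where 'child' is observed...
model = LinearRegression()
model.fit(features[observed], heights.loc[observed, 'child'])

# ...and predict the 'child' heights for rows where it's missing.
regression_imputed = heights['child'].copy()
regression_imputed[~observed] = model.predict(features[~observed])
regression_imputed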
Idea: Probabilistic imputation¶
- Since we don't know what the missing values would have been, one could argue our technique for filling in missing values should incorporate this uncertainty.
- We could fill in missing values using a random sample of observed values.
This avoids the key issue with mean imputation, where we fill all missing values with the same one value. It also limits the bias present if the missing values weren't a representative sample, since we're filling them in with a range of different values.
# impute_prob should take in a Series with missing values
# and return an imputed Series.
def impute_prob(s):
    s = s.copy()
    # Find the number of missing values.
    num_missing = s.isna().sum()
    # Take a sample of size num_missing from the present values.
    sample = np.random.choice(s.dropna(), num_missing)
    # Fill in the missing values with our random sample.
    s.loc[s.isna()] = sample
    return s
- Each time we run the cell below, the missing values in heights['child'] are filled in with a different sample of the present values in heights['child']!
# The number at the very top is constantly changing!
prob_imputed = impute_prob(heights['child'])
print('Mean:', prob_imputed.mean())
prob_imputed
Mean: 67.17526766595289
0 70.0 1 69.2 2 69.0 ... 931 61.0 932 66.5 933 57.0 Name: child, Length: 934, dtype: float64
- To account for the fact that each run is slightly different, a common strategy is multiple imputation.
This involves performing probabilistic imputation many (> 5) times, performing further analysis on each new dataset (e.g. building a regression model), and aggregating the results.
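- A rough sketch of that idea, using the mean as a stand-in for the "further analysis" step:
# Impute several times, compute a statistic on each imputed Series,
# then aggregate (here, average) the results.
estimates = [impute_prob(heights['child']).mean() for _ in range(10)]
np.mean(estimates)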
- Probabilistic imputation can even be done conditionally!
Now, missing 'male' heights are filled in using a sample of observed 'male' heights, and missing 'female' heights are filled in using a sample of observed 'female' heights!
conditional_prob_imputed = ...
conditional_prob_imputed = (
heights
.groupby('gender')
['child']
.transform(impute_prob)
)
conditional_prob_imputed
0 68.5 1 69.2 2 69.0 ... 931 61.0 932 66.5 933 57.0 Name: child, Length: 934, dtype: float64
Visualizing imputation strategies¶
multiple_kdes({'Before Imputation': heights['child'],
'After Mean Imputation': mean_imputed,
'After Conditional Mean Imputation': conditional_mean_imputed,
'After Probabilistic Imputation': prob_imputed,
'After Conditional Probabilistic Imputation': conditional_prob_imputed})
Activity
Work on UCSD DSC 80 Spring 2022 Final Exam, Problem 5 (Parts 1-4).
Missingness mechanisms¶
There are three key missingness mechanisms, which describe how data in a column can be missing.
- Missing completely at random (MCAR): Data are MCAR if the chance that a value is missing is completely independent of other columns and the actual missing value.
Example: Suppose that after the Midterm Exam, I randomly choose 5 scores to delete on Gradescope, meaning that 5 students have missing grades. MCAR is ideal, but rare!
- Missing at random (MAR): Data are MAR if the chance that a value is missing depends on other columns.
Example: Suppose that after the Midterm Exam, I randomly choose 5 scores to delete on Gradescope among sophomore students. Now, scores are missing at random dependent on class standing.
- Not missing at random (NMAR): Data are NMAR if the chance that a value is missing depends on the actual missing value itself.
Example: Suppose that after the Midterm Exam, I randomly delete 5 of the 10 lowest scores on Gradescope. Now, scores are not missing at random, since the chance a value is missing depends on how large it is.
- Statistical imputation packages usually assume data are MAR.
MCAR is usually unrealistic to assume. If data are NMAR, you can't impute missing values, since the other features in your data can't explain the missingness.
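- To make the three mechanisms concrete, here's a small simulation sketch on a hypothetical Series of exam scores (the scores and class standings below are made up for illustration).
rng = np.random.default_rng(23)
scores = pd.Series(rng.normal(75, 10, size=500))
standing = pd.Series(rng.choice(['sophomore', 'junior', 'senior'], size=500))

# MCAR: every score is equally likely to be missing.
mcar = scores.mask(rng.random(500) < 0.1)

# MAR: sophomores' scores are more likely to be missing (depends on another column).
mar = scores.mask((standing == 'sophomore') & (rng.random(500) < 0.3))

# NMAR: low scores are more likely to be missing (depends on the value itself).
nmar = scores.mask((scores < scores.quantile(0.2)) & (rng.random(500) < 0.5))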
How do we know if data are MCAR?¶
- It seems that if our data are MCAR, there is no risk to dropping missing values.
In the MCAR setting, just imagine we're being given a large, random sample of the true dataset.
- If the data are not MCAR, though, then dropping the missing values will introduce bias.
For instance, suppose we asked people "How much do you give to charity?" People who give little are less likely to respond, so the average response is biased high.
- There is no perfect procedure for determining if our data are MCAR, MAR, or NMAR; we mostly have to use our understanding of how the data is generated.
- But, we can try to determine whether $Y_\text{missing}$ is similar to $Y$, using the information we do have in other columns.
We did this earlier, when looking at the proportion of missing 'child' heights for each 'gender'.
Summary of imputation techniques¶
- Consider whether values are missing intentionally, or whether there's a default replacement.
- Listwise deletion.
Drop, or ignore, missing values.
- (Conditional) mean imputation.
Fill in missing values with the mean of observed values. If there's a reason to believe the missingness depends on another categorical column, fill in missing values with the observed mean separately for each category.
- (Conditional) probabilistic imputation.
Fill in missing values with a random sample of observed values. If there's a reason to believe the missingness depends on another categorical column, fill in missing values with a random sample drawn separately for each category.
- Regression imputation.
Predict missing values using other features.
What's next?¶
- So far, our data has just been given to us as a CSV.
Sometimes it's messy, and we need to clean it.
- But, what if the data we want is somewhere on the internet?