In [1]:
from lec_utils import *
def multiple_kdes(ser_map, title=""):
    values = [ser_map[key].dropna() for key in ser_map]
    labels = list(ser_map.keys())
    fig = ff.create_distplot(
        hist_data=values,
        group_labels=labels,
        show_rug=False,
        show_hist=False,
        colors=px.colors.qualitative.Dark2[: len(ser_map)],
    )
    return fig.update_layout(title=title, width=1000).update_xaxes(title="child")
Announcements 📣¶
Homework 3 is due tonight. See this post on Ed for an important clarification.
We've slightly adjusted the Office Hours schedule – take a look, and please come by.
I have office hours right after lecture today!
study.practicaldsc.org contains our discussion worksheets (and solutions), which are made up of old exam problems. Use these problems to build your theoretical understanding of the material, and come to discussion!
Agenda¶
- Recap: Types of visualizations.
- Visualization best practices.
- Handling missing values.
Recap: Types of visualizations¶
Dataset setup¶
- Run the cell below to load in our dataset and clean it, using the functions defined in the last lecture.
In [2]:
def clean_term_column(df):
    return df.assign(
        term=df['term'].str.split().str[0].astype(int)
    )
def clean_date_column(df):
    return (
        df
        .assign(date=pd.to_datetime(df['issue_d'], format='%b-%Y'))
        .drop(columns=['issue_d'])
    )
In [3]:
loans = (
    pd.read_csv('data/loans.csv')
    .pipe(clean_term_column)
    .pipe(clean_date_column)
)
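To see what the two cleaning steps do, here is a minimal sketch on a hypothetical two-row DataFrame. It assumes the raw 'term' values look like " 36 months" and the raw 'issue_d' values look like "Apr-2015"; the actual formatting in data/loans.csv may differ.

```python
import pandas as pd

# Hypothetical toy rows; the real values in data/loans.csv may be formatted differently.
raw = pd.DataFrame({
    "term": [" 36 months", " 60 months"],
    "issue_d": ["Apr-2015", "Oct-2018"],
})

# Same logic as clean_term_column: keep the first whitespace-separated token, as an int.
term_clean = raw["term"].str.split().str[0].astype(int)

# Same logic as clean_date_column: parse "Mon-YYYY" strings into timestamps.
date_clean = pd.to_datetime(raw["issue_d"], format="%b-%Y")

cleaned = raw.assign(term=term_clean, date=date_clean).drop(columns=["issue_d"])
print(cleaned)
print(cleaned.dtypes)
```

Since no day is given in the strings, each parsed date falls on the first of the month, which is consistent with the values in the date column of loans.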
- Each time you run the cell below, you'll see a different sample of rows in loans.
In [4]:
loans.sample(5)
Out[4]:
| | id | loan_amnt | term | int_rate | ... | fico_range_high | hardship_flag | mths_since_last_delinq | date |
---|---|---|---|---|---|---|---|---|---|
3969 | 44897054 | 4800.0 | 36 | 13.33 | ... | 704.0 | N | 56.0 | 2015-04-01 |
957 | 15240229 | 30000.0 | 60 | 18.92 | ... | 704.0 | N | NaN | 2014-05-01 |
3550 | 140840563 | 20000.0 | 36 | 18.94 | ... | 709.0 | N | 31.0 | 2018-10-01 |
3997 | 38607756 | 10000.0 | 36 | 8.19 | ... | 734.0 | N | NaN | 2015-01-01 |
2736 | 130395447 | 15000.0 | 36 | 9.43 | ... | 694.0 | N | NaN | 2018-04-01 |
5 rows × 20 columns
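If you want the same sample every time (say, to share a reproducible result), one option is the random_state argument of sample. This is a small sketch on a hypothetical stand-in DataFrame named toy, and the seed 23 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for loans; any DataFrame behaves the same way.
toy = pd.DataFrame({
    "loan_amnt": [4800.0, 30000.0, 20000.0, 10000.0, 15000.0, 8000.0],
})

# With a fixed random_state, repeated calls return the same rows in the same order.
first = toy.sample(3, random_state=23)
second = toy.sample(3, random_state=23)
print(first.equals(second))
```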
Choosing the correct type of visualization¶
- The type of visualization we create depends on the types of features we're visualizing.
- We'll directly learn how to produce the bolded visualizations below, but the others are also options.
See more examples here.
Feature types | Options |
---|---|
Single categorical feature | Bar charts, pie charts, dot plots |
Single numerical feature | Histograms, box plots, density curves, rug plots, violin plots |
Two numerical features | Scatter plots, line plots, heat maps, contour plots |
One categorical and one numerical feature (it really depends on the nature of the features themselves!) | Side-by-side histograms, box plots, or bar charts, overlaid line plots or density curves |
- Note that we use the words "plot", "chart", and "graph" to mean the same thing.
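As a rough first pass at deciding which row of the table applies, you can check each column's dtype and its number of distinct values. The sketch below uses a hypothetical toy DataFrame with column names borrowed from loans. Treat the result as a starting point only: a column stored as a number is not necessarily best treated as a numerical feature.

```python
import pandas as pd

# Hypothetical toy data that borrows a few column names from loans.
toy = pd.DataFrame({
    "addr_state": ["CA", "MI", "CA", "NY"],
    "int_rate": [13.33, 18.92, 18.94, 8.19],
    "term": [36, 60, 36, 36],
})

# Columns stored as numbers vs. everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)

# 'term' is stored as an integer but takes only two distinct values here,
# so it may be more natural to visualize it as a categorical feature.
n_unique = toy.nunique()
print(n_unique)
```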
Bar charts¶
- Bar charts are used to show:
  - The distribution of a single categorical feature, or
  - The relationship between one categorical feature and one numerical feature.
- Usage: px.bar (with orientation='h' for horizontal bars) or df.plot(kind='bar') / df.plot(kind='barh'). The 'h' stands for "horizontal."
- Example: What is the distribution of 'addr_state's in loans?
In [5]:
# Here, we're using the .plot method on loans['addr_state'], which is a Series.
# We prefer horizontal bar charts, since they're easier to read.
(
    loans['addr_state']
    .value_counts()
    .plot(kind='barh')
)