Visualization Tips and Examples
Overview
In Lecture 7, we provided an overview of plotly syntax and discussed how to decide which type of chart to create, be it a bar chart, histogram, line chart, box plot, or scatter plot.
The purpose of this guide is twofold:
- First, we’ll discuss visualization “best practices”, and how to avoid common mistakes.
- Then, we’ll show you several examples of other plots you can create in plotly, drawing from rich historical examples.
As a reminder, the plotly examples library is excellent; you should use it as a reference and as inspiration when developing plots on your own (say, for the Final Project).
The plots on this website are not interactive because of a rendering limitation with the course website. If you run the code below on your own, you’ll be able to interact with the resulting plots.
Best practices
Perception
As we discussed in Lecture 7, one reason to create visualizations is for us to better understand our data. But another reason is to accurately communicate a message to other people. And, as it turns out, the world around us is filled with examples of visualizations that are difficult to accurately interpret, or perceive.
We’ll start with a few examples from the internet.
data:image/s3,"s3://crabby-images/17a7c/17a7c122d8dbc2048786349491a95e524acf4887" alt=""
data:image/s3,"s3://crabby-images/96054/9605401672b56864b4c72c9c9149c6b7cab919e1" alt=""
Something seems “wrong” about the two visualizations above, but describing specifically what is wrong can be challenging without the right vocabulary. To illustrate, let’s pivot to a dataset of our own. Below, we load in a dataset with information about various countries over time, maintained by Gapminder.
import pandas as pd
import plotly.express as px

world = px.data.gapminder() # The dataset is built into plotly.express.
world
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.80 | 8425333 | 779.45 | AFG | 4 |
1 | Afghanistan | Asia | 1957 | 30.33 | 9240934 | 820.85 | AFG | 4 |
2 | Afghanistan | Asia | 1962 | 32.00 | 10267083 | 853.10 | AFG | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1701 | Zimbabwe | Africa | 1997 | 46.81 | 11404948 | 792.45 | ZWE | 716 |
1702 | Zimbabwe | Africa | 2002 | 39.99 | 11926563 | 672.04 | ZWE | 716 |
1703 | Zimbabwe | Africa | 2007 | 43.49 | 12311143 | 469.71 | ZWE | 716 |
1704 rows × 8 columns
Let’s suppose we’re interested in understanding the distribution of Earth’s population by continent, in the most recent year we have data for (which, in this dataset, happens to be 2007).
pop_by_cont = (
world[world['year'] == world['year'].max()]
.groupby('continent')
['pop']
.sum()
)
pop_by_cont
continent
Africa 929539692
Americas 898871184
Asia 3811953827
Europe 586098529
Oceania 24549947
Name: pop, dtype: int64
In Lecture 7, we saw that the “default” way to visualize such a distribution is to draw a bar chart:
(
pop_by_cont
.sort_values()
.plot(kind='barh', title='Distribution of Population by Continent')
)
data:image/s3,"s3://crabby-images/0ef52/0ef527e408f96628954e7299cab754903073c90e" alt=""
But, we could also draw a pie chart:
px.pie(
pop_by_cont.reset_index(),
values='pop',
names='continent',
title='Distribution of Population by Continent'
).update_traces(textinfo='label')
data:image/s3,"s3://crabby-images/0bef1/0bef1c7bdd5902ceaafba2fe93f027b00ff82b3b" alt=""
Note that Africa’s population is larger than that of the Americas. But, that difference is only visually obvious in the bar chart. It’s easy to distinguish the lengths of bars that start at the same baseline; judging differences in angles or areas – as the pie chart asks us to – is more difficult.
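To put a number on how subtle the pie chart comparison is, here is a minimal sketch that converts the continent totals printed above into wedge angles. It uses a plain dictionary rather than the pop_by_cont Series so that it runs on its own; the variable names angles and angle_gap are ours.

```python
# Continent populations for 2007, copied from the pop_by_cont output above.
pop_by_cont = {
    'Africa': 929539692,
    'Americas': 898871184,
    'Asia': 3811953827,
    'Europe': 586098529,
    'Oceania': 24549947,
}

total = sum(pop_by_cont.values())

# A pie chart gives each category a wedge whose angle is its share of 360 degrees.
angles = {cont: 360 * pop / total for cont, pop in pop_by_cont.items()}

for cont, angle in angles.items():
    print(f'{cont:<10}{angle:6.1f} degrees')

# The angular gap a reader must perceive to rank Africa above the Americas.
angle_gap = angles['Africa'] - angles['Americas']
print(f'Africa minus Americas: {angle_gap:.1f} degrees')
```

The two wedges differ by less than 2 degrees, while the corresponding bars differ by roughly 30 million people on a shared axis.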
There is science to back up this phenomenon. In the mid-1980s, statisticians ran experiments comparing how easily human subjects were able to tell apart changes in length, angle, area, volume, color, and other visual encodings. Read this article for more details.
data:image/s3,"s3://crabby-images/2cef7/2cef79f65d8d496df7d6eb4e4fb34b5f6a883652" alt=""
As a data scientist, your job is to make comparisons easy! Avoid pie charts and other visual representations that make it difficult to understand the data. Going back to the women’s heights example, the area of the India figure is tiny compared to the area of the Latvia figure, even though it represents a value only 5 inches smaller.
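One way a 5-inch difference can turn into a huge difference in area is sketched below. The specific heights and the truncated baseline are hypothetical values picked for illustration; we are not claiming they match the original graphic. The helper drawn_area_ratio is ours as well.

```python
# Hypothetical values for illustration; the real graphic's numbers may differ.
shorter, taller = 60, 65   # average heights in inches, 5 inches apart
baseline = 58              # assumed value at which a truncated axis starts


def drawn_area_ratio(small, large, axis_start=0):
    """Ratio of icon areas when each icon's drawn height is its distance
    above axis_start and its width scales in proportion to its height."""
    height_ratio = (large - axis_start) / (small - axis_start)
    return height_ratio ** 2


value_ratio = taller / shorter                                   # the data itself
from_zero = drawn_area_ratio(shorter, taller)                    # axis starts at 0
truncated = drawn_area_ratio(shorter, taller, axis_start=baseline)

print(f'Ratio of the values:             {value_ratio:.2f}')
print(f'Area ratio, axis starting at 0:  {from_zero:.2f}')
print(f'Area ratio, axis starting at {baseline}: {truncated:.2f}')
```

Under these assumptions, scaling both dimensions of an icon already exaggerates the difference, and truncating the axis exaggerates it much further.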
Aside: What is a distribution?
The term “distribution” is often misused. For example, the following bar chart does not show a distribution. Why not?
data:image/s3,"s3://crabby-images/d02f4/d02f46eeef752390581e5bc5d7c2f215ef6b2417" alt=""
The answer is that individuals can be in multiple categories – as told to us in the fine print – so the frequencies don’t add to 100%.
By definition, the distribution of a column tells us the unique values in that column, and how often they occur. If using counts, the counts should add up to the number of data points; if using percentages, they should add up to 100%.
# Actually a distribution!
(
pop_by_cont
.sort_values()
.plot(kind='barh', title='Distribution of Population by Continent')
)
data:image/s3,"s3://crabby-images/e09fc/e09fcd227a3d1531b5a8254513a3b89df77826bf" alt=""
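The definition of a distribution given above can be checked mechanically. Here is a minimal sketch with two made-up survey datasets (both hypothetical): one where each person gives exactly one answer, and one where each person can select several.

```python
from collections import Counter

# Hypothetical single-answer data: each person belongs to exactly one category.
favorite = ['bar', 'bar', 'line', 'pie', 'bar', 'line']
counts = Counter(favorite)
proportions = {chart: n / len(favorite) for chart, n in counts.items()}

# Counts add up to the number of data points, and proportions add up to 1.
print(sum(counts.values()) == len(favorite))
print(round(sum(proportions.values()), 10))

# Hypothetical multi-answer data: each person may select several categories.
used = [{'bar', 'line'}, {'bar'}, {'bar', 'pie', 'line'}, {'line'}]
multi_counts = Counter(chart for person in used for chart in person)
multi_percents = {chart: n / len(used) for chart, n in multi_counts.items()}

# These frequencies exceed 100% in total, so they do not form a distribution.
print(round(sum(multi_percents.values()), 10))
```

The first dataset passes both checks; the second does not, which is the situation in the misleading bar chart above.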
Color scales
Let’s switch gears and investigate the role of color in our graphs. We’ll start by loading in a dataset describing each Walmart location in the US as of 2006. Download it from here.
wm = pd.read_csv('data/walmart.csv')
wm
| | storenum | OPENDATE | date_super | conversion | ... | LON | MONTH | DAY | YEAR |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7/1/62 | 3/1/97 | 1.0 | ... | -94.07 | 7 | 1 | 1962 |
1 | 2 | 8/1/64 | 3/1/96 | 1.0 | ... | -93.09 | 8 | 1 | 1964 |
2 | 4 | 8/1/65 | 3/1/02 | 1.0 | ... | -94.50 | 8 | 1 | 1965 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2989 | 5485 | 1/27/06 | NaN | NaN | ... | -87.70 | 1 | 27 | 2006 |
2990 | 3425 | 1/27/06 | 1/27/06 | 0.0 | ... | -95.22 | 1 | 27 | 2006 |
2991 | 5193 | 1/31/06 | NaN | NaN | ... | -117.17 | 1 | 31 | 2006 |
2992 rows × 16 columns
To visualize the number of Walmarts per state, we could use a bar chart, as in the continents example.
wm_per_state = wm['STRSTATE'].value_counts()
wm_per_state
STRSTATE
TX 315
FL 175
CA 159
...
WY 9
ND 8
DE 8
Name: count, Length: 41, dtype: int64
wm_per_state.head(10).sort_values().plot(kind='barh', title='Number of Walmarts Per State')
data:image/s3,"s3://crabby-images/ffd64/ffd64eb13b9aabc4ef6856f5ce295587f477032d" alt=""
But, perhaps a more interesting visualization is a choropleth – the kind you created in Homework 3, when visualizing the party with the most votes per state in 2024.
choro = px.choropleth(wm_per_state.reset_index(),
locations='STRSTATE',
color='count',
locationmode='USA-states',
scope='usa',
title='Number of Walmarts Per State')
choro
data:image/s3,"s3://crabby-images/cc7e5/cc7e5833a986a34621223cdc1f484de3a27d9546" alt=""
Now, you may notice the choropleth above is colored differently from the one you had to create in Homework 3:
data:image/s3,"s3://crabby-images/79811/7981190f39f4d2019f4c40d2a7f8684c186f5d83" alt=""
Why? In the bottom, political choropleth, the feature being compared across states is categorical (political party). In the top, Walmart choropleth, the feature being compared across states is numerical (number of Walmarts).
So:
- When comparing categories, use very different colors for each category, ideally choosing from a known color-blind friendly color palette.
- When comparing numbers, choose an appropriate continuous color scheme. There are two types: sequential, where larger values are more intense and smaller values are less intense; or diverging, where both large and small values are equally intense, but in different colors.
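To make the sequential versus diverging distinction concrete, here is a minimal plain-Python sketch of how a number might be mapped to a color under each scheme. This is not how plotly implements its color scales; the function names and RGB endpoints are our own choices for illustration. The range 8 to 315 comes from the smallest and largest counts in wm_per_state.

```python
def lerp_color(start, end, t):
    """Linearly interpolate between two RGB tuples, with t in [0, 1]."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


# Illustrative RGB endpoints (assumptions, not plotly's exact colors).
WHITE = (255, 255, 255)
GREEN = (0, 109, 44)
RED = (178, 24, 43)
BLUE = (33, 102, 172)


def sequential(value, vmin, vmax):
    """One hue: intensity grows steadily from vmin to vmax."""
    t = (value - vmin) / (vmax - vmin)
    return lerp_color(WHITE, GREEN, t)


def diverging(value, vmin, vmid, vmax):
    """Two hues: intensity grows with distance from vmid in either direction."""
    if value < vmid:
        return lerp_color(WHITE, RED, (vmid - value) / (vmid - vmin))
    return lerp_color(WHITE, BLUE, (value - vmid) / (vmax - vmid))


# Sequential: the fewest Walmarts (8) is palest, the most (315) is darkest.
print(sequential(8, 8, 315), sequential(315, 8, 315))

# Diverging: both extremes are intense, and the midpoint is neutral.
print(diverging(-1, -1, 0, 1), diverging(0, -1, 0, 1), diverging(1, -1, 0, 1))
```

A diverging scale only makes sense when the data has a meaningful midpoint, which a count of stores does not.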
Here’s another example of a sequential continuous color scale in action:
px.choropleth(wm_per_state.reset_index(),
locations='STRSTATE',
color='count',
locationmode='USA-states',
scope='usa',
title='Number of Walmarts Per State',
color_continuous_scale='greens')
data:image/s3,"s3://crabby-images/efc69/efc69f21b1fc8325b6421017ea6eec437c96941a" alt=""
Here’s a diverging color scale, where dark blue means “large” and dark red means “small.” Here, it feels unnatural that states with very few Walmarts and very many Walmarts are similarly “intense.”
px.choropleth(wm_per_state.reset_index(),
locations='STRSTATE',
color='count',
locationmode='USA-states',
scope='usa',
title='Number of Walmarts Per State',
color_continuous_scale='rdbu')
data:image/s3,"s3://crabby-images/3873a/3873aa978fccfd069be475b5af36257add45f90e" alt=""
But, diverging color schemes like the one above make sense in other cases, e.g. in political choropleths that show voting margins.
Key takeaways
The Gapminder and Walmart examples should have made two points clear. In your visualizations:
- Make comparisons easy.
- Choose an appropriate color scheme.
More examples
Next, we’ll look at several example visualizations, to serve as further inspiration. Some of these use chart types we saw in Lecture 7; others are new.
Historical examples
William Playfair is known as the “father of data visualization”, and is the creator of line charts, bar charts, and pie charts, among other things. We’ll start by recreating some of his historical charts using plotly!
First, we’ll recreate the very first known example of a bar chart, which depicts Scotland’s imports from and exports to various countries in 1781.
data:image/s3,"s3://crabby-images/e08dc/e08dc0c2bb1ccf2925ed17e8060ef5baf7f32be5" alt=""
scotland = pd.read_csv('data/playfair-scotland.csv')
scotland
| | country | imports | exports |
---|---|---|---|
0 | Ireland | 195685 | 305167 |
1 | America | 49826 | 183620 |
2 | West Indies | 169375 | 141220 |
... | ... | ... | ... |
13 | Greenland | 8291 | 0 |
14 | Isle of Man & Jersey | 802 | 1818 |
15 | Denmark and Norway | 28118 | 35011 |
16 rows × 3 columns
The reproduction code is quite long, but here it is in full.
fig = px.bar(scotland.sort_values('imports', ascending=False),
x=['exports', 'imports'],
y='country',
barmode='group',
orientation='h',
color_discrete_map={
'exports': '#151EA6',
'imports': '#FCB305',
},
title='Exports and Imports of <b>Scotland</b> to and from different parts for one Year'
)
fig.update_layout(
font_family="Arial",
title_font_family="Arial",
paper_bgcolor='#FFFFFF',
plot_bgcolor='#FFFFFF',
legend = {
'title': '',
'orientation': 'h'
}
)
fig.add_annotation( # add a text callout with arrow
text="no exports to Greenland!", x=10000, y="Greenland", ax=125,
arrowhead=2, showarrow=True
)
fig.update_xaxes(title_text='',
side='top',
showline=True,
linewidth=2,
linecolor='black',
mirror=True,
showgrid=True,
gridwidth=1,
gridcolor='#EEEEEE',
tick0=0,
dtick=25000,
tickangle=0)
fig.update_yaxes(title_text='',
side='right',
showline=True,
linewidth=2,
linecolor='black',
mirror=True,
showgrid=True,
gridwidth=1,
gridcolor='#EEEEEE',
tickson='boundaries')
data:image/s3,"s3://crabby-images/f161e/f161e1283c68f17fd726043469ae185ca71138de" alt=""
As an aside – what if we want to export this chart to HTML, to put on a website? (Say, for the Final Project?)
The .to_html() method will come in handy. Assuming fig is a plotly Figure, then we could use:
with open('scotland.html', 'w') as f:
    f.write(fig.to_html())  # The with block closes the file automatically.
This next plot shows the relationship between weekly labor wages and the cost of a “quarter” of wheat, along with a timeline of English monarchs, from 1565 to 1821.
wheat = pd.read_csv('data/Wheat.csv').drop(columns=['Unnamed: 0']).iloc[:-1]
wheat.head()
| | Year | Wheat | Wages |
---|---|---|---|
0 | 1565 | 41.0 | 5.00 |
1 | 1570 | 45.0 | 5.05 |
2 | 1575 | 42.0 | 5.08 |
3 | 1580 | 49.0 | 5.12 |
4 | 1585 | 41.5 | 5.15 |
This task is a bit different, since it involves two different types of visualizations – a line chart and a bar chart.
px.line(wheat, x='Year', y='Wages')
data:image/s3,"s3://crabby-images/fe7e9/fe7e94f12c398d6c8ead906685ff19b65ed7a259" alt=""
px.bar(wheat, x='Year', y='Wages')
data:image/s3,"s3://crabby-images/fdc85/fdc851d09a389893923e60f9705375aedb64f626" alt=""
Instead of using plotly.express, which is a “lite” version of plotly, we will use plotly’s graph_objects module.
import plotly.graph_objects as go
wheat_fig = go.Figure()
# Add bar chart
wheat_fig.add_trace(
go.Bar(
x=wheat['Year'],
y=wheat['Wheat'],
name='Wheat',
marker={'color': '#AAAAAA'},
width=5
)
)
# Add line chart
wheat_fig.add_trace(
go.Scatter(
x=wheat['Year'],
y=wheat['Wages'],
name='Wages',
marker={'color': 'red'},
fill='tozeroy',
fillcolor='rgba(135,206,235,0.65)'
)
)
# Adjust overall attributes
wheat_fig.update_layout(
font_family="Arial",
title_font_family="Arial",
paper_bgcolor='#FFFFFF',
plot_bgcolor='#FFFFFF',
showlegend=False
)
# Adjust x-axis
wheat_fig.update_xaxes(title_text='<i>5 Years each division</i>',
tickmode='array',
tickvals=[1565, 1600, 1650, 1700, 1750, 1800, 1820],
tickangle=0,
showgrid=False,
showline=True,
linewidth=2,
linecolor='black',
mirror=True)
# Adjust y-axis
wheat_fig.update_yaxes(title_text='<i>Price of the Quarter of Wheat in Shillings</i>',
side='right',
tick0=0,
dtick=5,
gridcolor='#EEEEEE',
gridwidth=1,
showline=True,
linewidth=2,
linecolor='black',
mirror=True)
# Add annotations
wheat_fig.add_annotation( # add a text label without an arrow
text="<i>Weekly Wages of a Good Mechanic</i>",
x=1640,
y=9,
showarrow=False,
font = {
'size': 10,
'color': 'white'
}
)
# Add the chart title as a boxed annotation
title_text = 'CHART,<br><i>Showing at One View</i><br><i>The Price of The Quarter of Wheat</i><br>& Wages of Labour by the Week,<br>-- from --<br><i>The Year 1565 to 1821</i><br>-- by --<br><i>William Playfair</i>'
wheat_fig.add_annotation(
text=title_text,
x=1640,
y=70,
font = {
'size': 10,
'color': 'black'
},
bordercolor="black",
borderwidth=2,
borderpad=4,
bgcolor="#FFFFFF",
opacity=1
)
wheat_fig.add_annotation(
text="<i>Weekly Wages of a Good Mechanic</i>",
x=1640,
y=9,
showarrow=False,
font = {
'size': 10,
'color': 'black'
}
)
data:image/s3,"s3://crabby-images/f12ea/f12eaba269f133cad3bfb2ea88ad00fd3c5a9ea2" alt=""
Finally, we’ll look at Playfair’s first pie chart, describing the land distribution of the Turkish Empire.
data:image/s3,"s3://crabby-images/b1fae/b1fae1bf57f693dbe53779c8da3b7d39e80986ea" alt=""
dist = pd.DataFrame().assign(
continent=['African', 'European', 'Asiatic'],
proportion=[0.2, 0.25, 0.55]
)
dist
| | continent | proportion |
---|---|---|
0 | African | 0.20 |
1 | European | 0.25 |
2 | Asiatic | 0.55 |
The code here is effectively the same as the code we used to create our earlier pie chart.
px.pie(dist,
values='proportion',
names='continent',
width=400,
height=300,
title='Land Distribution of the Turkish Empire')
data:image/s3,"s3://crabby-images/9c1a0/9c1a0b6bf3ba322b208455ca3c3f168e911598ef" alt=""
Other plot types
Let’s wrap up by looking at other plot types.
Gantt charts (i.e. timelines)
phases = [
['Newborn', '1998-11-26', '1999-11-26', 'Canada'],
['Toddler, Preschooler', '1999-11-26', '2005-09-03', 'US'],
['Elementary School Student', '2005-09-03', '2009-06-30', 'Canada'],
['Middle School Student', '2009-09-15', '2012-06-15', 'Canada'],
['High School Student', '2012-09-05', '2016-05-30', 'Canada'],
['Undergrad @ UC Berkeley', '2016-08-22','2020-05-15', 'US'],
['Masters @ UC Berkeley', '2020-08-25', '2021-05-14', 'Canada'],
['Lecturer @ UCSD', '2021-09-01', '2024-06-30', 'US'],
['Lecturer @ UMich', '2024-08-26', '2025-04-28', 'US']]
phases_df = pd.DataFrame(phases, columns=['Phase', 'Start', 'End', 'Location'])
phases_df
| | Phase | Start | End | Location |
---|---|---|---|---|
0 | Newborn | 1998-11-26 | 1999-11-26 | Canada |
1 | Toddler, Preschooler | 1999-11-26 | 2005-09-03 | US |
2 | Elementary School Student | 2005-09-03 | 2009-06-30 | Canada |
... | ... | ... | ... | ... |
6 | Masters @ UC Berkeley | 2020-08-25 | 2021-05-14 | Canada |
7 | Lecturer @ UCSD | 2021-09-01 | 2024-06-30 | US |
8 | Lecturer @ UMich | 2024-08-26 | 2025-04-28 | US |
9 rows × 4 columns
tim = px.timeline(phases_df,
x_start = 'Start',
x_end = 'End',
y = 'Phase',
text = 'Location',
title = 'My Life Trajectory',
width=700,
height=400)
tim.update_yaxes(autorange='reversed')
data:image/s3,"s3://crabby-images/a86e2/a86e2bb73abb54e2d96787f5f8851a5725a6bd26" alt=""
Animated scatter plots
world = px.data.gapminder()
world
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.80 | 8425333 | 779.45 | AFG | 4 |
1 | Afghanistan | Asia | 1957 | 30.33 | 9240934 | 820.85 | AFG | 4 |
2 | Afghanistan | Asia | 1962 | 32.00 | 10267083 | 853.10 | AFG | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1701 | Zimbabwe | Africa | 1997 | 46.81 | 11404948 | 792.45 | ZWE | 716 |
1702 | Zimbabwe | Africa | 2002 | 39.99 | 11926563 | 672.04 | ZWE | 716 |
1703 | Zimbabwe | Africa | 2007 | 43.49 | 12311143 | 469.71 | ZWE | 716 |
1704 rows × 8 columns
px.scatter(world,
x = 'gdpPercap',
y = 'lifeExp',
hover_name = 'country',
color = 'continent',
size = 'pop',
size_max = 60,
log_x = True,
range_y = [30, 90],
animation_frame = 'year',
title = 'Life Expectancy, GDP Per Capita, and Population over Time'
)
data:image/s3,"s3://crabby-images/4a170/4a170f2a06781a3f966f811d9e9c25bcce55bcd4" alt=""
Again, our website doesn’t have interactive versions of these plots, but if you run this code yourself you’ll be able to click the “▶️ Play” button to see the points move over time, in the style of this classic video.
Animated histograms
px.histogram(world,
x = 'lifeExp',
animation_frame = 'year',
range_x = [20, 90],
range_y = [0, 50],
title = 'Distribution of Life Expectancy over Time')
data:image/s3,"s3://crabby-images/4b688/4b6881ece8a33bbb493d10a19043677b6f371d25" alt=""
3D scatter plots
import seaborn as sns
penguins = sns.load_dataset('penguins')
penguins
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
px.scatter_3d(penguins,
x = 'bill_length_mm',
y = 'bill_depth_mm',
z = 'flipper_length_mm',
color = 'species',
hover_name = 'island',
title = 'Flipper Length vs. Bill Depth vs. Bill Length')
data:image/s3,"s3://crabby-images/4e095/4e09594cd248204d74508e009ceaf1cd37210077" alt=""
Again, the last few plots would be interactive if you produced them in your notebook.
More resources
Entire courses are dedicated to data visualization. Unfortunately, we don’t have an entire semester to dedicate to it ourselves!
We’ve just provided you with a few high-level considerations to be aware of when making plots. For more resources, look at:
- [This lecture](https://ds100.org/su20/lecture/lec10) I taught at another university that discusses some of these ideas in more depth.
- [This visualization course at UC San Diego](https://dsc-courses.github.io/dsc106-wi24).
- [This visualization course at the University of Washington](https://courses.cs.washington.edu/courses/cse442/23au/).
- [This visualization course at UC Berkeley](https://peteraldhous.com/ucb/2016/dataviz/).