Visualization Tips and Examples

Overview
Best practices
More examples
1. Historical examples
2. Other plot types
More resources

Overview

In Lecture 7, we provided an overview of plotly syntax, and discussed how to decide which type of chart to create, be it a bar chart, histogram, line chart, box plot, or scatter plot.

The purpose of this guide is twofold:

First, we’ll discuss visualization “best practices”, and how to avoid common mistakes.
Then, we’ll show you several examples of other plots you can create in plotly, drawing from rich historical examples.

As a reminder, the plotly examples library is excellent; you should use it as a reference and as inspiration when developing plots on your own (say, for the Final Project).

The plots on this website are not interactive, but only because of a rendering limitation with the course website. If you run the code below on your own, you’ll be able to interact with the resulting plots.

Best practices

Perception

As we discussed in Lecture 7, one reason to create visualizations is for us to better understand our data. But another reason is to accurately communicate a message to other people. And, as it turns out, the world around us is filled with examples of visualizations that are difficult to accurately interpret, or perceive.

We’ll start with a few examples from the internet.

What's wrong with this visualization?

Something seems “wrong” about the two visualizations above, but describing specifically what is wrong can be challenging without the right vocabulary. To illustrate, let’s pivot to a dataset of our own. Below, we load in a dataset with information about various countries over time, maintained by Gapminder.

import pandas as pd
import plotly.express as px
world = px.data.gapminder() # The dataset is built into plotly.express.
world

	country	continent	year	lifeExp	pop	gdpPercap	iso_alpha	iso_num
0	Afghanistan	Asia	1952	28.80	8425333	779.45	AFG	4
1	Afghanistan	Asia	1957	30.33	9240934	820.85	AFG	4
2	Afghanistan	Asia	1962	32.00	10267083	853.10	AFG	4
...	...	...	...	...	...	...	...	...
1701	Zimbabwe	Africa	1997	46.81	11404948	792.45	ZWE	716
1702	Zimbabwe	Africa	2002	39.99	11926563	672.04	ZWE	716
1703	Zimbabwe	Africa	2007	43.49	12311143	469.71	ZWE	716

1704 rows × 8 columns

Let’s suppose we’re interested in understanding the distribution of Earth’s population by continent, in the most recent year we have data for (which, in this dataset, happens to be 2007).

pop_by_cont = (
    world[world['year'] == world['year'].max()]
    .groupby('continent')
    ['pop']
    .sum()
)
pop_by_cont

continent
Africa       929539692
Americas     898871184
Asia        3811953827
Europe       586098529
Oceania       24549947
Name: pop, dtype: int64

In Lecture 7, we saw that the “default” way to visualize such a distribution is to draw a bar chart:

(
    pop_by_cont
    .sort_values()
    .plot(kind='barh', title='Distribution of Population by Continent')
)

But, we could also draw a pie chart:

px.pie(
    pop_by_cont.reset_index(), 
    values='pop',
    names='continent',
    title='Distribution of Population by Continent'
).update_traces(textinfo='label')

Note that Africa’s population is larger than that of the Americas. But, that difference is only visually obvious in the bar chart. It’s easy to distinguish the lengths of bars that start at the same baseline; judging differences in angles or areas – as the pie chart asks us to – is more difficult.
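
To make the comparison concrete, here is a minimal sketch that converts the 2007 totals into shares and pie-wedge angles. It hard-codes the numbers printed above instead of reloading the dataset, and the names shares, angles, angle_gap, and length_ratio are ours, chosen for illustration.

```python
import pandas as pd

# 2007 population totals by continent, copied from the output above.
pop_by_cont = pd.Series(
    {
        'Africa': 929539692,
        'Americas': 898871184,
        'Asia': 3811953827,
        'Europe': 586098529,
        'Oceania': 24549947,
    },
    name='pop',
)

# Share of the world total, and the angle (in degrees) of each pie wedge.
shares = pop_by_cont / pop_by_cont.sum()
angles = 360 * shares

# Africa vs. the Americas: the gap a reader has to perceive in each chart.
angle_gap = angles['Africa'] - angles['Americas']
length_ratio = pop_by_cont['Africa'] / pop_by_cont['Americas']

print(angles.round(1))
print(f'Wedge angles differ by about {angle_gap:.1f} degrees')
print(f'Bar lengths differ by a factor of about {length_ratio:.3f}')
```

The relative gap is the same in both charts. What changes is the encoding: the two bars share a baseline, so their ends can be compared directly, while the two wedges sit at different rotations.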

There is science to back up this phenomenon. In the mid-1980s, statisticians ran experiments comparing how easily human subjects were able to tell apart changes in length, angle, area, volume, color, and other visual encodings. Read this article for more details.

As a data scientist, your job is to make comparisons easy! Avoid pie charts and other visual representations that make it difficult to understand the data. Going back to the women’s heights example, the area of the India figure is tiny compared to the area of the Latvia figure, despite only representing a value 5 inches smaller.

Aside: What is a distribution?

The term “distribution” is often misused. For example, the following bar chart does not show a distribution. Why not?

The answer is that individuals can be in multiple categories – as told to us in the fine print – so the frequencies don’t add to 100%.

By definition, the distribution of a column tells us the unique values in that column, and how often they occur. If using counts, the counts should add up to the number of data points; if using percentages, they should add up to 100%.
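
One way to check this definition in code is to add up the proportions. The sketch below uses two small made-up datasets (both hypothetical, not from the chart above): one where each person falls in exactly one category, and one “select all that apply” survey where a person can be counted several times.

```python
import pandas as pd

# Hypothetical single-choice data: each person belongs to exactly one category.
continent = pd.Series(['Asia', 'Asia', 'Africa', 'Europe', 'Asia', 'Africa'])
dist = continent.value_counts(normalize=True)
print(dist.sum())  # proportions of a true distribution total 1 (up to rounding)

# Hypothetical "select all that apply" survey: one row per person,
# one True/False column per option, so a person can count more than once.
survey = pd.DataFrame({
    'uses_python': [True, True, False, True],
    'uses_r':      [True, False, True, True],
    'uses_sql':    [False, True, True, True],
})
pct_per_option = survey.mean()  # fraction of people who picked each option
print(pct_per_option.sum())     # exceeds 1, so this is not a distribution
```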

# Actually a distribution!
(
    pop_by_cont
    .sort_values()
    .plot(kind='barh', title='Distribution of Population by Continent')
)

Color scales

Let’s switch gears and investigate the role of color in our graphs. We’ll start by loading in a dataset describing each Walmart location in the US as of 2006. Download it from here.

wm = pd.read_csv('data/walmart.csv')
wm

	storenum	OPENDATE	date_super	conversion	...	LON	MONTH	DAY	YEAR
0	1	7/1/62	3/1/97	1.0	...	-94.07	7	1	1962
1	2	8/1/64	3/1/96	1.0	...	-93.09	8	1	1964
2	4	8/1/65	3/1/02	1.0	...	-94.50	8	1	1965
...	...	...	...	...	...	...	...	...	...
2989	5485	1/27/06	NaN	NaN	...	-87.70	1	27	2006
2990	3425	1/27/06	1/27/06	0.0	...	-95.22	1	27	2006
2991	5193	1/31/06	NaN	NaN	...	-117.17	1	31	2006

2992 rows × 16 columns

To visualize the number of Walmarts per state, we could use a bar chart, as in the continents example.

wm_per_state = wm['STRSTATE'].value_counts()
wm_per_state

STRSTATE
TX    315
FL    175
CA    159
     ... 
WY      9
ND      8
DE      8
Name: count, Length: 41, dtype: int64

wm_per_state.head(10).sort_values().plot(kind='barh', title='Number of Walmarts Per State')

But, perhaps a more interesting visualization is a choropleth – the kind you created in Homework 3, when visualizing the party with the most votes per state in 2024.

choro = px.choropleth(wm_per_state.reset_index(),
             locations='STRSTATE',
             color='count',
             locationmode='USA-states',
             scope='usa',
             title='Number of Walmarts Per State')
choro

Now, you may notice the choropleth above is colored differently than the one you had to create in Homework 3:

Why? In the bottom, political choropleth, the feature being compared across states is categorical (political party). In the top, Walmart choropleth, the feature being compared across states is numerical (number of Walmarts).

So:

When comparing categories, use very different colors for each category, ideally choosing from a known color-blind friendly color palette.
When comparing numbers, choose an appropriate continuous color scheme. There are two types: sequential, where larger values are more intense and smaller values are less intense; or diverging, where both large and small values are equally intense, but in different colors.

Here’s another example of a sequential continuous color scale in action:

px.choropleth(wm_per_state.reset_index(),
             locations='STRSTATE',
             color='count',
             locationmode='USA-states',
             scope='usa',
             title='Number of Walmarts Per State',
             color_continuous_scale='greens')

Here’s a diverging color scale, where dark blue means “large” and dark red means “small.” Here, it feels unnatural that states with very few Walmarts and very many Walmarts are similarly “intense.”

px.choropleth(wm_per_state.reset_index(),
             locations='STRSTATE',
             color='count',
             locationmode='USA-states',
             scope='usa',
             title='Number of Walmarts Per State',
             color_continuous_scale='rdbu')

But, diverging color schemes like the one above make sense in other cases, e.g. in political choropleths that show voting margins.

Key takeaways

The Gapminder and Walmart examples should have made two points clear. In your visualizations:

Make comparisons easy.
Choose an appropriate color scheme.

More examples

Next, we’ll look at several example visualizations, to serve as further inspiration. Some of these use chart types we saw in Lecture 7; others are new.

Historical examples

William Playfair is known as the “father of data visualization”, and is the creator of line charts, bar charts, and pie charts, among other things. We’ll start by recreating some of his historical charts using plotly!

First, we’ll recreate the very first known example of a bar chart, which depicts Scotland’s imports from and exports to various countries and regions in 1781.

scotland = pd.read_csv('data/playfair-scotland.csv')
scotland

	country	imports	exports
0	Ireland	195685	305167
1	America	49826	183620
2	West Indies	169375	141220
...	...	...	...
13	Greenland	8291	0
14	Isle of Man & Jersey	802	1818
15	Denmark and Norway	28118	35011

16 rows × 3 columns

The reproduction code is quite long; it is shown in full below.

fig = px.bar(scotland.sort_values('imports', ascending=False), 
             x=['exports', 'imports'], 
             y='country', 
             barmode='group', 
             orientation='h',
             color_discrete_map={
                 'exports': '#151EA6',
                 'imports': '#FCB305',
              },      
             title='Exports and Imports of <b>Scotland</b> to and from different parts for one Year'
            )

fig.update_layout(
    font_family="Arial",
    title_font_family="Arial",
    paper_bgcolor='#FFFFFF',
    plot_bgcolor='#FFFFFF',
    legend = {
        'title': '',
        'orientation': 'h'
    }
)

fig.add_annotation( # add a text callout with arrow
    text="no exports to Greenland!", x=10000, y="Greenland", ax=125,
    arrowhead=2, showarrow=True
)

fig.update_xaxes(title_text='',
                 side='top',
                 showline=True, 
                 linewidth=2, 
                 linecolor='black',
                 mirror=True,
                 showgrid=True, 
                 gridwidth=1, 
                 gridcolor='#EEEEEE',
                 tick0=0, 
                 dtick=25000,
                 tickangle=0)

fig.update_yaxes(title_text='',
                 side='right',
                 showline=True, 
                 linewidth=2, 
                 linecolor='black',
                 mirror=True,
                 showgrid=True, 
                 gridwidth=1, 
                 gridcolor='#EEEEEE',
                 tickson='boundaries')

As an aside – what if we want to export this chart to HTML, to put on a website? (Say, for the Final Project?)

The .to_html() method will come in handy. Assuming fig is a plotly Figure, then we could use:

with open('scotland.html', 'w') as f:
    f.write(fig.to_html()) # the with block closes the file automatically

This next plot shows the relationship between weekly labor wages and the cost of a “quarter” of wheat, along with a timeline of English monarchs, from 1565 to 1821.

wheat = pd.read_csv('data/Wheat.csv').drop(columns=['Unnamed: 0']).iloc[:-1]
wheat.head()

	Year	Wheat	Wages
0	1565	41.0	5.00
1	1570	45.0	5.05
2	1575	42.0	5.08
3	1580	49.0	5.12
4	1585	41.5	5.15

This task is a bit different, since it involves two different types of visualizations – a line chart and a bar chart.

px.line(wheat, x='Year', y='Wages')

px.bar(wheat, x='Year', y='Wages')

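Instead of using plotly.express, which is a high-level interface built on top of plotly’s lower-level figure objects, we will use plotly’s graph_objects module directly, which lets us place a bar trace and a line trace on the same figure.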

import plotly.graph_objects as go

wheat_fig = go.Figure()

# Add bar chart
wheat_fig.add_trace(
    go.Bar(
        x=wheat['Year'],
        y=wheat['Wheat'],
        name='Wheat',
        marker={'color': '#AAAAAA'},
        width=5
    )
)

# Add line chart
wheat_fig.add_trace(
    go.Scatter(
        x=wheat['Year'],
        y=wheat['Wages'],
        name='Wages',
        marker={'color': 'red'},
        fill='tozeroy',
        fillcolor='rgba(135,206,235,0.65)'
    )
)

# Adjust overall attributes
wheat_fig.update_layout(
    font_family="Arial",
    title_font_family="Arial",
    paper_bgcolor='#FFFFFF',
    plot_bgcolor='#FFFFFF',
    showlegend=False
)

# Adjust x-axis
wheat_fig.update_xaxes(title_text='<i>5 Years each division</i>', 
                       tickmode='array',
                       tickvals=[1565, 1600, 1650, 1700, 1750, 1800, 1820], 
                       tickangle=0,
                       showgrid=False,
                       showline=True, 
                       linewidth=2, 
                       linecolor='black',
                       mirror=True)

# Adjust y-axis
wheat_fig.update_yaxes(title_text='<i>Price of the Quarter of Wheat in Shillings</i>',
                       side='right',
                       tick0=0, 
                       dtick=5, 
                       gridcolor='#EEEEEE',
                       gridwidth=1,
                       showline=True, 
                       linewidth=2, 
                       linecolor='black',
                       mirror=True)

# Add a text label for the wages area (no arrow, since showarrow=False)
wheat_fig.add_annotation(
    text="<i>Weekly Wages of a Good Mechanic</i>", 
    x=1640, 
    y=9, 
    showarrow=False, 
    font = {
        'size': 10,
        'color': 'white'
    }
    
)

# Add the boxed title block as an annotation
title_text = 'CHART,<br><i>Showing at One View</i><br><i>The Price of The Quarter of Wheat</i><br>& Wages of Labour by the Week,<br>-- from --<br><i>The Year 1565 to 1821</i><br>-- by --<br><i>William Playfair</i>'

wheat_fig.add_annotation(
    text=title_text, 
    x=1640, 
    y=70, 
    font = {
        'size': 10,
        'color': 'black'
    },
    bordercolor="black",
    borderwidth=2,
    borderpad=4,
    bgcolor="#FFFFFF",
    opacity=1
    
)

wheat_fig.add_annotation(
    text="<i>Weekly Wages of a Good Mechanic</i>", 
    x=1640, 
    y=9, 
    showarrow=False, 
    font = {
        'size': 10,
        'color': 'black'
    }
    
)

Finally, we’ll look at Playfair’s first pie chart, describing the land distribution of the Turkish Empire.

dist = pd.DataFrame().assign(
    continent=['African', 'European', 'Asiatic'],
    proportion=[0.2, 0.25, 0.55]
)

dist

	continent	proportion
0	African	0.20
1	European	0.25
2	Asiatic	0.55

The code here is effectively the same as the code we used to create our earlier pie chart.

px.pie(dist,
       values='proportion',
       names='continent',
       width=400,
       height=300,
       title='Land Distribution of the Turkish Empire')

Other plot types

Let’s wrap up by looking at other plot types.

Gantt charts (i.e. timelines)

phases = [
 ['Newborn', '1998-11-26', '1999-11-26', 'Canada'],
 ['Toddler, Preschooler', '1999-11-26', '2005-09-03', 'US'],
 ['Elementary School Student', '2005-09-03', '2009-06-30', 'Canada'],
 ['Middle School Student', '2009-09-15', '2012-06-15', 'Canada'],
 ['High School Student', '2012-09-05', '2016-05-30', 'Canada'],
 ['Undergrad @ UC Berkeley', '2016-08-22','2020-05-15', 'US'],
 ['Masters @ UC Berkeley', '2020-08-25', '2021-05-14', 'Canada'],
 ['Lecturer @ UCSD', '2021-09-01', '2024-06-30', 'US'],
 ['Lecturer @ UMich', '2024-08-26', '2025-04-28', 'US']]

phases_df = pd.DataFrame(phases, columns=['Phase', 'Start', 'End', 'Location'])
phases_df

	Phase	Start	End	Location
0	Newborn	1998-11-26	1999-11-26	Canada
1	Toddler, Preschooler	1999-11-26	2005-09-03	US
2	Elementary School Student	2005-09-03	2009-06-30	Canada
...	...	...	...	...
6	Masters @ UC Berkeley	2020-08-25	2021-05-14	Canada
7	Lecturer @ UCSD	2021-09-01	2024-06-30	US
8	Lecturer @ UMich	2024-08-26	2025-04-28	US

9 rows × 4 columns

tim = px.timeline(phases_df,
                  x_start = 'Start',
                  x_end = 'End',
                  y = 'Phase',
                  text = 'Location',
                  title = 'My Life Trajectory',
                  width=700,
                  height=400)

tim.update_yaxes(autorange='reversed')

Animated scatter plots

world = px.data.gapminder()
world

	country	continent	year	lifeExp	pop	gdpPercap	iso_alpha	iso_num
0	Afghanistan	Asia	1952	28.80	8425333	779.45	AFG	4
1	Afghanistan	Asia	1957	30.33	9240934	820.85	AFG	4
2	Afghanistan	Asia	1962	32.00	10267083	853.10	AFG	4
...	...	...	...	...	...	...	...	...
1701	Zimbabwe	Africa	1997	46.81	11404948	792.45	ZWE	716
1702	Zimbabwe	Africa	2002	39.99	11926563	672.04	ZWE	716
1703	Zimbabwe	Africa	2007	43.49	12311143	469.71	ZWE	716

1704 rows × 8 columns

px.scatter(world,
           x = 'gdpPercap',
           y = 'lifeExp', 
           hover_name = 'country',
           color = 'continent',
           size = 'pop',
           size_max = 60,
           log_x = True,
           range_y = [30, 90],
           animation_frame = 'year',
           title = 'Life Expectancy, GDP Per Capita, and Population over Time'
          )

Again, our website doesn’t have interactive versions of these plots, but if you run this code yourself you’ll be able to click the “▶️ Play” button to see the points move over time, in the style of this classic video.

Animated histograms

px.histogram(world,
            x = 'lifeExp',
            animation_frame = 'year',
            range_x = [20, 90],
            range_y = [0, 50],
            title = 'Distribution of Life Expectancy over Time')

3D scatter plots

import seaborn as sns
penguins = sns.load_dataset('penguins')
penguins

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
...	...	...	...	...	...	...	...
341	Gentoo	Biscoe	50.4	15.7	222.0	5750.0	Male
342	Gentoo	Biscoe	45.2	14.8	212.0	5200.0	Female
343	Gentoo	Biscoe	49.9	16.1	213.0	5400.0	Male

344 rows × 7 columns

px.scatter_3d(penguins,
             x = 'bill_length_mm',
             y = 'bill_depth_mm',
             z = 'flipper_length_mm',
             color = 'species',
             hover_name = 'island',
             title = 'Flipper Length vs. Bill Depth vs. Bill Length')

Again, the last few plots would be interactive if you produced them in your notebook.

More resources

Entire courses are dedicated to data visualization. Unfortunately, we don’t have an entire semester to dedicate to it ourselves!

We’ve just provided you with a few high-level considerations to be aware of when making plots. For more resources, look at:

- [This lecture](https://ds100.org/su20/lecture/lec10) I taught at another university that discusses some of these ideas in more depth.
- [This visualization course at UC San Diego](https://dsc-courses.github.io/dsc106-wi24).
- [This visualization course at the University of Washington](https://courses.cs.washington.edu/courses/cse442/23au/).
- [This visualization course at UC Berkeley](https://peteraldhous.com/ucb/2016/dataviz/).