SQL and Spreadsheets

Table of contents

  1. Overview
    1. Representations of tabular data
    2. Relational algebra
    3. Dataset: Top 200 Streams
  2. Round 1: pandas
  3. Round 2: SQL
    1. Overview and setup
    2. Aside: DuckDB
  4. Round 3: Spreadsheets
    1. Overview and setup
    2. Walkthrough video

Overview

Representations of tabular data

In this class, weโ€™ve worked with DataFrames in pandas.

  • When we say pandas DataFrame, weโ€™re talking about the pandas API for its DataFrame objects.
  • When we say โ€œDataFrameโ€, weโ€™re referring to a general way to represent data. Specifically, DataFrames organize data into rows and columns, with labels for both rows and columns.

There many other ways to work with data tables, each of which have their pros and cons.
Some examples include R data frames, SQL databases, spreadsheets in Google Sheets/Excel, or even matrices from linear algebra.

Below, we discuss pros and cons of (some of) the ways we can work with tabular data.

PlatformPros โœ…Cons โŒ
pandas DataFramesWorks well with the Python ecosystem (for visualization, machien learning, domain-specifc purposes, etc.), extremely flexible, reproducible stepsSteep learning curve (need to know Python too) and messy code, easy to make destructive in-place modifications, no persistence (everything starts from a CSV file)
R data framesDesigned specifically for data science so statistics and visualizations are easy; reproducible stepsR isnโ€™t as general-purpose as Python, no persistence
SQLScalable, good for maintaining many large, important datasets with many concurrent usersRequires lots of infrastructure, advanced operations can be challenging
SpreadsheetsWidespread use, very easy to get started, easy for sharingSteps arenโ€™t reproducible, advanced operations can be challenging

A common workflow is to load a subset of data in from a database system into pandas, then do further cleaning and visualization. Another is to load and clean data in pandas, then store it in a database system for others to use.

Relational algebra

Relational algebra captures common data operations between many data table systems. For example, the following expression describes a calculation in relational algebra:

\[\pi_{\text{int_rate}}(\sigma_{\text{state} = \text{"MI"}}(\text{loans}))\]

\(\pi\) stands for โ€œproject,โ€ i.e. selecting columns; \(\sigma\) stands for โ€œselect,โ€ i.e. selecting rows. We wonโ€™t test you on relational algebra syntax, but if youโ€™d like to learn more, take EECS 484: Database Management Systems.

In pandas, weโ€™d implement this expression as follows:

loans.loc[loans['state'] == 'MI', 'int_rate']

Question: How would we implement it in SQL? Or a spreadsheet?

To show you how the tabular manipulation skills youโ€™ve learned in pandas generalize to other systems, we will answer a few questions in all three of the above platforms. First, weโ€™ll work through our analysis in pandas, and then in SQL, and finally, in a Google Sheets document.

Dataset: Top 200 Streams

Our dataset contains the number of streams for the top 200 songs on Spotify, for the week leading up to September 19th. We downloaded it from here in Fall 2024, and you can find a copy of it (to repeat the analysis) here.

charts = pd.read_csv('data/regional-global-weekly-2024-09-19.csv')
charts
rankuriartist_namestrack_name...peak_rankprevious_rankweeks_on_chartstreams
01spotify:track:2plbrEY59IikOBgBGLjaoeLady Gaga, Bruno MarsDie With A Smile...11576502673
12spotify:track:6dOtVTDdiauQNBQEDOtlABBillie EilishBIRDS OF A FEATHER...121854756808
23spotify:track:5G2f63n7IPVPPjfNIGih7QSabrina CarpenterTaste...33448790619
..............................
197198spotify:track:3EaJDYHA0KnX88JvDhL9oaSteve LacyDark Red...662001188901006
198199spotify:track:0gEyKnHvgkrkBM6fbeHdwKThe CranberriesLinger...199-128867800
199200spotify:track:7iUtQNMRB8ZkKC4AmEuCJCMyke TowersLA FALDA...24185368848312

200 rows ร— 9 columns

Scroll through the columns above. A songโ€™s 'uri' is its uniform resource identifier. Given a songโ€™s 'uri', we can play it!

def show_spotify(uri):
    code = uri[uri.rfind(':')+1:]
    src = f"https://open.spotify.com/embed/track/{code}"
    width = 400
    height = 75
    display(IFrame(src, width, height))

my_uri = charts.loc[charts['track_name'] == 'Yellow', 'uri'].iloc[0] 
my_uri
'spotify:track:3AJwUDP919kvQ9QcozQPxg'
show_spotify(my_uri)

Round 1: pandas

Below, we present six tasks and their solutions. A good way to practice your pandas skills is to download the CSV linked above and attempt the six tasks on your own, without looking at the solutions.

Task 1: Find the total number of streams of songs by Sabrina Carpenter.

Click here to see the solution.
charts.loc[charts['artist_names'] == 'Sabrina Carpenter', 'streams'].sum()
212960137


Task 2: Find the total number of streams per artist, sorted by number of streams in descending order. Only show the top 5 artists.

Click here to see the solution.
(
    charts
    .groupby('artist_names')
    ['streams']
    .sum()
    .sort_values(ascending=False)
    .head(5)
)
artist_names
Sabrina Carpenter        212960137
Billie Eilish            107797821
Chappell Roan             92990557
Linkin Park               89935206
Lady Gaga, Bruno Mars     76502673
Name: streams, dtype: int64


Task 3: Find the artist with the lowest average number of streams, among artists with at least 5 songs in the Top 200.

Click here to see the solution.
(
    charts
    .groupby('artist_names')
    .filter(lambda df: df.shape[0] >= 5)
    .groupby('artist_names')
    ['streams']
    .mean()
    .sort_values()
    .head(1)
)
artist_names
Bruno Mars    1.24e+07
Name: streams, dtype: float64


Task 4: Find the number of songs with a higher ranking this week than last week.

Click here to see the solution.
charts[charts['rank'] > charts['previous_rank']].shape[0]
119


Task 5: charts_old contains the Top 200 songs in the previous week. You can download it from here.

charts_old = pd.read_csv('data/regional-global-weekly-2024-09-12.csv')
charts_old.head()
rankuriartist_namestrack_name...peak_rankprevious_rankweeks_on_chartstreams
01spotify:track:2plbrEY59IikOBgBGLjaoeLady Gaga, Bruno MarsDie With A Smile...11474988392
12spotify:track:6dOtVTDdiauQNBQEDOtlABBillie EilishBIRDS OF A FEATHER...121756949755
23spotify:track:5G2f63n7IPVPPjfNIGih7QSabrina CarpenterTaste...33349847514
34spotify:track:7tI8dRuH2Yc6RuoTjxo4dUJiminWho...14845431448
45spotify:track:2qSkIjg1o9h3YT9RAgYN75Sabrina CarpenterEspresso...152244982843

5 rows ร— 9 columns

Find the song with the largest increase in streams between last week and this week, among songs that were in the Top 200 in both weeks.

Click here to see the solution.
with_old = (
    charts[['uri', 'track_name', 'artist_names', 'streams']]
    .merge(charts_old[['uri', 'streams',]], on='uri', suffixes=('_new', '_old'))
)

(
    with_old
    .assign(change=with_old['streams_new'] - with_old['streams_old'])
    .sort_values('change', ascending=False)
    [['track_name', 'artist_names', 'change']]
    .head(1)
)
track_nameartist_nameschange
5Good Luck, Babe!Chappell Roan3217539


Task 6: Find the 4 songs with the most artists.

Click here to see the solution.
(
    charts
    .assign(num_artists=charts['artist_names'].str.count(', '))
    .sort_values('num_artists', ascending=False)
    .head(4)
)
rankuriartist_namestrack_name...previous_rankweeks_on_chartstreamsnum_artists
4546spotify:track:1BJJbSX6muJVF2AK7uH1x4Adam Port, Stryv, Keinemusik, Orso, MalachiiiMove...3114159870474
186187spotify:track:22skzmqfdWrjJylampe0ktMacklemore & Ryan Lewis, Macklemore, Ryan Lewi...Can't Hold Us (feat. Ray Dalton)...16313290509843
194195spotify:track:28drn6tQo95MRvO0jQEo5CFuture, Metro Boomin, Travis Scott, Playboi CartiType Shit...-12589424753
177178spotify:track:4QNpBfC0zvjKqPJcyqBy9WPitbull, AFROJACK, Ne-Yo, NayerGive Me Everything (feat. Nayer)...1651992656963

4 rows ร— 10 columns


Round 2: SQL

Overview and setup

SQL stands for โ€œStructured Query Languageโ€. It is the standard language for database manipulation. Each database system โ€“ for instance, MySQL, SQLite, DuckDB, or PostgreSQL โ€“ has its own slightly different dialect of SQL with unique features.

Of note, SQL is a declarative language. That is, in SQL, you just describe what you want calculated, not how you want it calculated. Itโ€™s the database engineโ€™s job to figure out how to process your query.

        SELECT artist_names, SUM(streams) AS total_streams FROM charts
        GROUP BY artist_names
        ORDER BY total_streams DESC
        LIMIT 5;
One of the SQL queries we'll write shortly.

We bolded the word โ€œqueryโ€ above because all code we write in SQL is referred to as a query. All SQL queries follow the same general template:


(source)

W3Schools has examples of SQL syntax, as does the example above.

In this guide, weโ€™ll demonstrate the usage of two SQL database platforms: SQLite and DuckDB. SQLite comes pre-installed on most systems, while DuckDB is open-source, and integrates with pandas really well. Weโ€™ll answer our first few tasks by working with sqlite3 in the command-line. To follow along:

  1. Download spotify.db from here.
  2. Open your Terminal and cd to the directory where you downloaded spotify.db.
  3. Run sqlite3 spotify.db.

These steps will open a sqlite3 interpreter, connected to the spotify.db database. A database file can contain multiple tables; this one contains two: charts and charts_old. Note that tables are also known as relations in database terminology.

Now, on with the tasks. Here, weโ€™ll show the solutions directly. That said, this walkthrough video also covers the answers, and develops them step-by-step. (Itโ€™s taken from a lecture recording in Fall 2024.)

Task 1: Find the total number of streams of songs by Sabrina Carpenter.

Task 2: Find the total number of streams per artist, sorted by number of streams in descending order. Only show the top 5 artists.

Note that the values above match the outputs in the pandas implementations of the tasks!

Aside: DuckDB

Instead of having to run all of our SQL queries in the Terminal using a .db file, we can use DuckDB, which allows us to execute queries in a notebook using a pandas DataFrame!

To use DuckDB, pip install it in your Terminal using pip install duckdb. Then, in your notebook, run import duckdb. That will allow you to use the run_sql function, defined below, to execute SQL queries using the DataFrames in your notebook as SQL tables.

def run_sql(query_str, as_df=False):
    out = duckdb.query(query_str)
    if as_df:
        return out.to_df()
    else:
        return out

Letโ€™s revisit Task 2:

run_sql('''
SELECT artist_names, SUM(streams) AS total_streams FROM charts
GROUP BY artist_names
ORDER BY total_streams DESC
LIMIT 5;
''')
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚     artist_names      โ”‚ total_streams โ”‚
โ”‚        varchar        โ”‚    int128     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Sabrina Carpenter     โ”‚     212960137 โ”‚
โ”‚ Billie Eilish         โ”‚     107797821 โ”‚
โ”‚ Chappell Roan         โ”‚      92990557 โ”‚
โ”‚ Linkin Park           โ”‚      89935206 โ”‚
โ”‚ Lady Gaga, Bruno Mars โ”‚      76502673 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

We can even ask for the output back as a DataFrame by setting as_df to True! This means that you can combine both SQL operations and pandas operations.

run_sql('''
SELECT artist_names, SUM(streams) AS total_streams FROM charts
GROUP BY artist_names
ORDER BY total_streams DESC
LIMIT 5;
''', as_df=True)
artist_namestotal_streams
0Sabrina Carpenter2.13e+08
1Billie Eilish1.08e+08
2Chappell Roan9.30e+07
3Linkin Park8.99e+07
4Lady Gaga, Bruno Mars7.65e+07

Back to the tasks.

Task 3: Find the artist with the lowest average number of streams, among artists with at least 5 songs in the Top 200.

run_sql('''
SELECT artist_names, AVG(streams) as avg_streams FROM charts
GROUP BY artist_names
HAVING COUNT(*) >= 5
ORDER BY avg_streams
LIMIT 1;
''')
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ artist_names โ”‚ avg_streams โ”‚
โ”‚   varchar    โ”‚   double    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Bruno Mars   โ”‚  12411488.8 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Task 4: Find the number of songs with a higher ranking this week than last week.

run_sql('''
SELECT COUNT(*) as num_songs FROM (
    SELECT * FROM charts
    WHERE rank > previous_rank
)
''')
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ num_songs โ”‚
โ”‚   int64   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚       119 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Task 5: charts_old contains the Top 200 songs in the previous week.

Find the song with the largest increase in streams between last week and this week, among songs that were in the Top 200 in both weeks.

run_sql('''
SELECT track_name, artist_names, (new_streams - old_streams) AS change
FROM (
    SELECT charts.uri, 
           charts.track_name, 
           charts.artist_names, 
           charts.streams AS new_streams, 
           charts_old.uri, 
           charts_old.streams AS old_streams
    FROM charts
    INNER JOIN charts_old ON charts.uri = charts_old.uri
) AS merged
ORDER BY change DESC
LIMIT 1;
''', as_df=True)
track_nameartist_nameschange
0Good Luck, Babe!Chappell Roan3217539

Task 6: Find the 4 songs with the most artists.

Note that the syntax used in our solution doesnโ€™t exist in SQLite, but does exist in DuckDB.

run_sql('''
SELECT track_name, artist_names, array_length(str_split(artist_names, ', ')) AS num_artists
FROM charts
ORDER BY num_artists DESC
LIMIT 4;
''', as_df=True)
track_nameartist_namesnum_artists
0MoveAdam Port, Stryv, Keinemusik, Orso, Malachiii5
1Give Me Everything (feat. Nayer)Pitbull, AFROJACK, Ne-Yo, Nayer4
2Can't Hold Us (feat. Ray Dalton)Macklemore & Ryan Lewis, Macklemore, Ryan Lewi...4
3Type ShitFuture, Metro Boomin, Travis Scott, Playboi Carti4

In all of the examples above, the results matched our pandas implementations, as weโ€™d expect.


Round 3: Spreadsheets

Overview and setup

While weโ€™re big fans of writing code in this class, in the business world, spreadsheets are (still) the most common tool for tabular data manipulation. Spreadsheets are great for showing information directly to someone else, and are easy to share.

Here, weโ€™ll replicate our prior analyses using Google Sheets. Excel may be more industry-standard, but Google Sheets are widely popular as well, and are easy for sharing.

This Google Sheet has both the charts and charts_old datasets that weโ€™ve referred to above. You can try your hand at the analyses by creating a copy of it; do so by opening it and going to File > Make a copy.

Walkthrough video

Since Google Sheets operations are slightly more complicated to share via text, weโ€™ve produced a walkthrough video. It covers the same six tasks that weโ€™ve already completed in pandas and SQL.

If following along, at each stage, make sure you see the same results as in the pandas and SQL sections!

Task 1: Find the total number of streams of songs by Sabrina Carpenter.

Relevant functions: FILTER, SUMIF.

Task 2: Find the total number of streams per artist, sorted by number of streams in descending order. Only show the top 5 artists.

Relevant functions: UNIQUE, SUMIF.

Task 3: Find the artist with the lowest average number of streams, among artists with at least 5 songs in the Top 200.

Relevant functions: AVERAGEIF.

Task 4: Find the number of songs with a higher ranking this week than last week.

Relevant functions: IF.

Task 5: charts_old contains the Top 200 songs in the previous week.

Find the song with the largest increase in streams between last week and this week, among songs that were in the Top 200 in both weeks.

Relevant functions: FILTER.

Task 6: Find the 4 songs with the most artists.

Relevant functions: SPLIT, LEN, SUBSTITUTE.


As youโ€™ve seen, the DataFrame manipulation techniques weโ€™ve learned about over the past month generalize to other systems that you might be exposed to. This is important to realize, since in 20 years, pandas may not exist, but grouping, pivoting, querying, etc. are all concepts that still will be useful.