from lec_utils import *
import re
Announcements 📣¶
- Homework 5 is due tonight. It includes a required Pre-Midterm Survey.
- Homework 6 will be released later this week, but won't be due until after Fall Break.
- The Midterm Exam is on Wednesday, October 9th from 7-9PM.
- Lectures 1-12 and Homeworks 1-6 are in scope.
- The lecture before the exam will be review, and the TAs will run a review session on Monday from 6-8PM in FXB 1109 too.
- You can bring one double-sided 8.5"x11" notes sheet that you handwrite yourself (no printing, no using an iPad, etc.).
- Work through old exam problems here.
- Looking for sources of data, or other supplemental resources? Look at our updated Resources page!
Aside: Following along with lecture¶
- I've read the feedback, and I'll try to type more slowly and have slightly more of the code pre-filled in the notebook before presenting.
- But, remember that all of the code I write live is posted before lecture – click the 📝 filled html buttons to see these (or open lecXX-filled.ipynb).
Agenda¶
- From text to numbers.
- Bag of words 💰.
- TF-IDF.
- Example: State of the Union addresses 🎤.
Activity
This is an old exam question!
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
From text to numbers¶
- How do we represent a text document using numbers?
- Computers and mathematical formulas are designed to work with numbers, not words.
- So, if we can convert documents into numbers, we can:
- summarize documents by finding their most important words (today).
- quantify the similarity of two documents (today).
- use a document as input in a regression or classification model (second half of the semester).
Example: State of the Union addresses 🎤¶
- Each year, the sitting US President delivers a "State of the Union" address. The 2024 State of the Union (SOTU) address was on March 7th, 2024.
"Address" is another word for "speech."
from IPython.display import YouTubeVideo
YouTubeVideo('cplSUhU2avc')
- The file 'data/stateoftheunion1790-2024.txt' contains the transcript of every SOTU address since 1790.
with open('data/stateoftheunion1790-2024.txt') as f:
sotu = f.read()
# The file is over 10 million characters long!
len(sotu) / 1_000_000
10.616051
Terminology¶
- In text analysis, each piece of text we want to analyze is called a document.
Here, each speech is a document.
- Documents are made up of terms, i.e. words.
- A collection of documents is called a corpus.
Here, the corpus is the set of all SOTU speeches from 1790-2024.
Extracting speeches¶
- In the string sotu, each document is separated by '***'.
speeches_lst = sotu.split('\n***\n')[1:]
len(speeches_lst)
234
- Note that each "speech" currently contains other information, like the name of the president and the date of the address.
print(speeches_lst[-1][:1000])
State of the Union Address Joseph R. Biden Jr. March 7, 2024 Good evening. Mr. Speaker. Madam Vice President. Members of Congress. My Fellow Americans. In January 1941, President Franklin Roosevelt came to this chamber to speak to the nation. He said, “I address you at a moment unprecedented in the history of the Union.” Hitler was on the march. War was raging in Europe. President Roosevelt’s purpose was to wake up the Congress and alert the American people that this was no ordinary moment. Freedom and democracy were under assault in the world. Tonight I come to the same chamber to address the nation. Now it is we who face an unprecedented moment in the history of the Union. And yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. Not since President Lincoln and the Civil War have freedom and democracy been under assault here at home as they are today. What makes our moment rare is th
- Let's extract just the text of each speech and put it in a DataFrame.
Along the way, we'll use our new knowledge of regular expressions to remove capitalization and punctuation, so we can just focus on the content itself.
def create_speeches_df(speeches_lst):
def extract_struct(speech):
L = speech.strip().split('\n', maxsplit=3)
L[3] = re.sub(r"[^A-Za-z' ]", ' ', L[3]).lower() # Replaces anything OTHER than letters with ' '.
L[3] = re.sub(r"it's", 'it is', L[3])
return dict(zip(['president', 'date', 'text'], L[1:]))
speeches = pd.DataFrame(list(map(extract_struct, speeches_lst)))
speeches.index = speeches['president'].str.strip() + ': ' + speeches['date']
speeches = speeches[['text']]
return speeches
speeches = create_speeches_df(speeches_lst)
speeches
text | |
---|---|
George Washington: January 8, 1790 | fellow citizens of the senate and house of re... |
George Washington: December 8, 1790 | fellow citizens of the senate and house of re... |
George Washington: October 25, 1791 | fellow citizens of the senate and house of re... |
... | ... |
Joseph R. Biden Jr.: March 1, 2022 | madam speaker madam vice president and our ... |
Joseph R. Biden Jr.: February 7, 2023 | mr speaker madam vice president our firs... |
Joseph R. Biden Jr.: March 7, 2024 | good evening mr speaker madam vice presi... |
234 rows × 1 columns
Quantifying speeches¶
- Our goal is to produce a DataFrame that contains the most important terms in each speech, i.e. the terms that best summarize each speech:
most important terms | |
---|---|
George Washington: January 8, 1790 | your, proper, regard, ought, object |
George Washington: December 8, 1790 | case, established, object, commerce, convention |
... | ... |
Joseph R. Biden Jr.: February 7, 2023 | americans, down, percent, jobs, tonight |
Joseph R. Biden Jr.: March 7, 2024 | jobs, down, get, americans, tonight |
- To do so, we will need to come up with a way of assigning a numerical score to each term in each speech.
We'll come up with a score for each term such that terms with higher scores are more important!
jobs | down | commerce | ... | convention | americans | tonight | |
---|---|---|---|---|---|---|---|
George Washington: January 8, 1790 | 0.00e+00 | 0.00e+00 | 3.55e-04 | ... | 0.00e+00 | 0.00e+00 | 0.00e+00 |
George Washington: December 8, 1790 | 0.00e+00 | 0.00e+00 | 1.10e-03 | ... | 1.18e-03 | 0.00e+00 | 0.00e+00 |
... | ... | ... | ... | ... | ... | ... | ... |
Joseph R. Biden Jr.: February 7, 2023 | 2.73e-03 | 1.78e-03 | 0.00e+00 | ... | 0.00e+00 | 1.56e-03 | 3.34e-03 |
Joseph R. Biden Jr.: March 7, 2024 | 1.77e-03 | 1.96e-03 | 5.93e-05 | ... | 0.00e+00 | 2.37e-03 | 3.90e-03 |
- In doing so, we will represent each speech as a vector.
Bag of words 💰¶
Counting frequencies¶
- Idea: The most important terms in a document are the terms that occur most often.
- So, let's count the number of occurrences of each term in each document.
In other words, let's count the frequency of each term in each document.
- For example, consider the following three documents:
big big big big data class
data big data science
science big data
- Let's construct a matrix, where:
- there is one row per document,
- one column per unique term, and
- the value in row $d$ and column $t$ is the number of occurrences of term $t$ in document $d$.
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 4 | 1 | 1 | 0 |
data big data science | 1 | 2 | 0 | 1 |
science big data | 1 | 1 | 0 | 1 |
Bag of words¶
- The bag of words model represents documents as vectors of word counts, i.e. term frequencies.
The matrix below was created using the bag of words model.
- Each row in the bag of words matrix is a vector representation of a document.
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 4 | 1 | 1 | 0 |
data big data science | 1 | 2 | 0 | 1 |
science big data | 1 | 1 | 0 | 1 |
- For example, we can represent document 2, data big data science, with the vector $\vec{d_2} = \begin{bmatrix} 1 \\ 2 \\ 0 \\ 1 \end{bmatrix}$, whose entries are the counts of big, data, class, and science, respectively.
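- As an aside, the same matrix can be built programmatically. Below is a minimal sketch using scikit-learn's CountVectorizer (not something we use in this lecture; note that it orders its columns alphabetically, unlike the table above).
from sklearn.feature_extraction.text import CountVectorizer
docs = ['big big big big data class', 'data big data science', 'science big data']
# CountVectorizer counts the occurrences of each term in each document.
cv = CountVectorizer()
counts_matrix = cv.fit_transform(docs)
pd.DataFrame(counts_matrix.toarray(), columns=cv.get_feature_names_out(), index=docs)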
Applications of the bag of words model¶
- Now that we have a matrix of word counts, what can we do with it?
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 4 | 1 | 1 | 0 |
data big data science | 1 | 2 | 0 | 1 |
science big data | 1 | 1 | 0 | 1 |
- Application: We could interpret the term with the largest value – that is, the most frequent term – in each document as being the most important.
This is imperfect. What if a document has the term "the" as its most frequent term?
- Application: We could use the vector representations of documents to measure the similarity of two documents.
This would enable us to find, for example, the SOTU speeches that are most similar to one another!
Recall: The dot product¶
$$\require{color}$$- Recall, if $\color{purple} \vec{u} = \begin{bmatrix} u_1 \\ u_2 \\ ... \\ u_n \end{bmatrix}$ and $\color{#007aff} \vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ ... \\ v_n \end{bmatrix}$ are two vectors, then their dot product ${\color{purple}\vec{u}} \cdot {\color{#007aff}\vec{v}}$ is defined as:
$${\color{purple}\vec{u}} \cdot {\color{#007aff}\vec{v}} = {\color{purple}u_1}{\color{#007aff}v_1} + {\color{purple}u_2}{\color{#007aff}v_2} + ... + {\color{purple}u_n}{\color{#007aff}v_n}$$
- The dot product also has an equivalent geometric definition, which says that:
$${\color{purple}\vec{u}} \cdot {\color{#007aff}\vec{v}} = \lVert {\color{purple}\vec{u}} \rVert \lVert {\color{#007aff}\vec{v}} \rVert \cos \theta$$
where $\theta$ is the angle between $\color{purple} \vec u$ and $\color{#007aff} \vec v$, and $\lVert {\color{purple}\vec{u}} \rVert = \sqrt{{\color{purple} u_1}^2 + {\color{purple} u_2}^2 + ... + {\color{purple} u_n}^2}$ is the length of $\color{purple} \vec u$.
The two definitions are equivalent! This equivalence allows us to find the angle $\theta$ between two vectors.
For review, see LARDS, Section 2.
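- As a quick numerical check of the equivalence, here's a small sketch (with two arbitrary example vectors) that computes the dot product algebraically and uses the geometric definition to recover the angle between the vectors.
u = np.array([4, 1, 1, 0])
v = np.array([1, 2, 0, 1])
# Algebraic definition: sum of elementwise products.
dot = np.dot(u, v)
# Geometric definition: solve for theta using the lengths of u and v.
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))
dot, np.degrees(np.arccos(cos_theta))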
Angles and similarity¶
- Key idea: The more similar two vectors are, the smaller the angle $\theta$ between them is.
- The smaller the angle $\theta$ between two vectors is, the larger $\cos \theta$ is.
- The maximum value of $\cos \theta$ is 1, achieved when $\theta = 0$.
- Key idea: The more similar two vectors are, the larger $\cos \theta$ is!
Cosine similarity¶
- To measure the similarity between two documents, we can compute the cosine similarity of their vector representations:
$$\cos \theta = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert \, \lVert \vec{v} \rVert}$$
- If all elements in $\vec{u}$ and $\vec{v}$ are non-negative, then $\cos \theta$ ranges from 0 to 1.
- Key idea: The more similar two vectors are, the larger $\cos \theta$ is!
- Given a collection of documents, to find the most similar pair, we can:
- Find the vector representation of each document.
- Find the cosine similarity of each pair of vectors.
- Return the documents whose vectors had the largest cosine similarity.
Activity
Consider the matrix of word counts we found earlier, using the bag of words model:
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 4 | 1 | 1 | 0 |
data big data science | 1 | 2 | 0 | 1 |
science big data | 1 | 1 | 0 | 1 |
- Which two documents have the highest dot product?
- Which two documents have the highest cosine similarity?
Normalizing¶
- Why can't we just use the dot product – that is, why must we divide by $\lVert \vec{u} \rVert \, \lVert \vec{v} \rVert$ when computing cosine similarity?
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 4 | 1 | 1 | 0 |
data big data science | 1 | 2 | 0 | 1 |
science big data | 1 | 1 | 0 | 1 |
- Consider the following two pairs of documents:
Pair | Dot Product | Cosine Similarity |
---|---|---|
big big big big data class and data big data science | 6 | 0.577 |
science big data and data big data science | 4 | 0.943 |
- "big big big big data class" has a large dot product with "data big data science" just because the former has the term "big" four times. But intuitively, "data big data science" and "science big data" should be much more similar, since they have almost the exact same terms.
- So, make sure to compute the cosine similarity – don't just use the dot product!
If you don't normalize by the lengths of the vectors, documents with more terms will have artificially high similarities with other documents.
- Sometimes, you will see the cosine distance being used. It is the complement of cosine similarity:
$$\text{dist}(\vec{u}, \vec{v}) = 1 - \cos \theta$$
- If $\text{dist}(\vec{u}, \vec{v})$ is small, the two vector representations are similar.
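- The sketch below reproduces the dot products and cosine similarities in the table above, using the bag of words vectors from earlier.
def cosine_similarity(u, v):
    # Dot product, divided by the product of the two vectors' lengths.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
d1 = np.array([4, 1, 1, 0])  # big big big big data class
d2 = np.array([1, 2, 0, 1])  # data big data science
d3 = np.array([1, 1, 0, 1])  # science big data
# Dot products of 6 and 4, but cosine similarities of ~0.577 and ~0.943.
np.dot(d1, d2), np.dot(d3, d2), cosine_similarity(d1, d2), cosine_similarity(d3, d2)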
Issues with the bag of words model¶
- Recall, the bag of words model encodes a document as a vector containing word frequencies.
- It doesn't consider the order of the terms.
"big data science" and "data science big" have the same vector representation, but mean different things.
- It doesn't consider the meaning of terms.
"I really really hate data" and "I really really love data" have nearly identical vector representations, but very different meanings.
- It treats all words as being equally important. This is the issue we'll address today.
In "I am a student" and "I am a teacher", it's clear to us humans that the most important terms are "student" and "teacher", respectively. But in the bag of words model, "student" and "I" appear the same number of times in the first document.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
What questions do you have?
TF-IDF¶
What makes a word important?¶
- Issue: The bag of words model doesn't know which terms are "important" in a document.
- Consider the following document:
"my brother has a friend named billy who has an uncle named billy"
- "has" and "billy" both appear the same number of times in the document above. But "has" is an extremely common term overall, while "billy" isn't.
- Observation: If a term is important in a document, it will appear frequently in that document but not frequently in other documents.
Let's try and find a way of giving scores to terms that keeps this in mind. If we can do this, then the terms with the highest scores can be used to summarize the document!
Term frequency¶
- The term frequency of a term $t$ in a document $d$, denoted $\text{tf}(t, d)$, is the proportion of words in document $d$ that are equal to $t$.
- Example: What is the term frequency of "billy" in the following document?
"my brother has a friend named billy who has an uncle named billy"
- Answer: $\frac{2}{13}$.
- Intuition: Terms that occur often within a document are important to the document's meaning.
- Issue: "has" also has a TF of $\frac{2}{13}$, but it seems less important than "billy".
Inverse document frequency¶
- The inverse document frequency of a term $t$ in a set of documents $d_1, d_2, ...$ is:
$$\text{idf}(t) = \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}}\right)$$
- Example: What is the inverse document frequency of "billy" in the following three documents?
- "my brother has a friend named billy who has an uncle named billy"
- "my favorite artist is named jilly boel"
- "why does he talk about someone named billy so often"
- Answer: $\log \left(\frac{3}{2}\right) \approx 0.4055$.
Here, we used the natural logarithm. It doesn't matter which log base we use, as long as we keep it consistent throughout all of our calculations.
- Intuition: If a word appears in every document (like "the" or "has"), it is probably not a good summary of any one document.
- Think of $\text{idf}(t)$ as the "rarity factor" of $t$ across documents – the larger $\text{idf}(t)$ is, the more rare $t$ is.
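- The sketch below verifies this IDF computation on the three example documents.
docs = [
    'my brother has a friend named billy who has an uncle named billy',
    'my favorite artist is named jilly boel',
    'why does he talk about someone named billy so often',
]
# idf('billy') = log(total # of documents / # of documents containing 'billy') = log(3 / 2).
num_containing = sum('billy' in doc.split() for doc in docs)
np.log(len(docs) / num_containing)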
Intuition¶
- Goal: Measure how important term $t$ is to document $d$.
Equivalent goal: Find the terms that best summarize $d$.
- If $\text{tf}(t, d)$ is small, then $t$ doesn't occur very often in $d$, so $t$ can't be very important to $d$.
- If $\text{idf}(t)$ is small, then $t$ occurs often amongst all documents, and so it can't be very important to $d$ specifically.
- If $\text{tf}(t, d)$ and $\text{idf}(t)$ are both large, then $t$ occurs often in $d$ but rarely overall. This makes $t$ important to $d$, i.e. a good "summary" of $d$.
Term frequency-inverse document frequency¶
- The term frequency-inverse document frequency (TF-IDF) of term $t$ in document $d$ is the product:
$$\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t) = \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}}\right)$$
- If $\text{tfidf}(t, d)$ is large, then $t$ is important to $d$, because $t$ occurs often in $d$ but rarely across all documents.
This means $t$ is a good summary of $d$!
- Note: TF-IDF is a heuristic method – there's no "proof" that it performs well.
- To know if $\text{tfidf}(t, d)$ is large for one particular term $t$, we need to compare it to $\text{tfidf}(t_i, d)$, for several different terms $t_i$.
Computing TF-IDF¶
- Question: What is the TF-IDF of "science" in "data big data science"?
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 4 | 1 | 1 | 0 |
data big data science | 1 | 2 | 0 | 1 |
science big data | 1 | 1 | 0 | 1 |
- Answer: $\text{tfidf}(\text{"science"}, d_2) = \text{tf}(\text{"science"}, d_2) \cdot \text{idf}(\text{"science"}) = \frac{1}{4} \cdot \log \left(\frac{3}{2}\right) \approx 0.101$.
- Question: Is this big or small? Is "science" the best summary of "data big data science"?
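- As a quick arithmetic check of the answer above:
# tf('science', 'data big data science') = 1 / 4; 'science' appears in 2 of the 3 documents.
(1 / 4) * np.log(3 / 2)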
TF-IDF of all terms in all documents¶
- On its own, the TF-IDF of one term in one document doesn't really tell us anything. We must compare it to TF-IDFs of other terms in that same document.
- Let's start with a DataFrame version of our bag of words matrix. It already contains the numerators for term frequency, i.e. $\text{tf}(t, d) = \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}}$.
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 4 | 1 | 1 | 0 |
data big data science | 1 | 2 | 0 | 1 |
science big data | 1 | 1 | 0 | 1 |
bow = pd.DataFrame([[4, 1, 1, 0], [1, 2, 0, 1], [1, 1, 0, 1]],
index=['big big big big data class', 'data big data science', 'science big data'],
columns=['big', 'data', 'class', 'science'])
bow
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 4 | 1 | 1 | 0 |
data big data science | 1 | 2 | 0 | 1 |
science big data | 1 | 1 | 0 | 1 |
- To convert the term counts to term frequencies, we'll divide by the sum of each row.
Each row corresponds to the terms in one document; the sum of a row is the total number of terms in the document, which is the denominator in $\text{tf}(t, d) = \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}}$.
# Verify that each row sums to 1!
tfs = bow.apply(lambda s: s / s.sum(), axis=1)
tfs
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 0.67 | 0.17 | 0.17 | 0.00 |
data big data science | 0.25 | 0.50 | 0.00 | 0.25 |
science big data | 0.33 | 0.33 | 0.00 | 0.33 |
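- As the comment above suggests, we can check that each row of tfs sums to 1.
# Each row should sum to 1, since we divided each row by its total number of terms.
tfs.sum(axis=1)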
- Next, we need to find the inverse document frequency of each term, $t$, where $\text{idf}(t) = \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right)$.
def idf(term):
term_column = tfs[term]
return np.log(term_column.shape[0] / (term_column > 0).sum())
idf('class') == np.log(3 / 1)
True
all_idfs = [idf(c) for c in tfs.columns]
all_idfs
[0.0, 0.0, 1.0986122886681098, 0.4054651081081644]
all_idfs = pd.Series(all_idfs, index=tfs.columns)
all_idfs
big        0.00
data       0.00
class      1.10
science    0.41
dtype: float64
- Finally, let's multiply tfs, the DataFrame with the term frequencies of each term in each document, by all_idfs, the Series of inverse document frequencies of each term.
tfidfs = tfs * all_idfs
tfidfs
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 0.0 | 0.0 | 0.18 | 0.00 |
data big data science | 0.0 | 0.0 | 0.00 | 0.10 |
science big data | 0.0 | 0.0 | 0.00 | 0.14 |
Interpreting TF-IDFs¶
tfidfs
big | data | class | science | |
---|---|---|---|---|
big big big big data class | 0.0 | 0.0 | 0.18 | 0.00 |
data big data science | 0.0 | 0.0 | 0.00 | 0.10 |
science big data | 0.0 | 0.0 | 0.00 | 0.14 |
- The TF-IDF of 'class' in the first sentence is $\approx 0.18$.
- The TF-IDF of 'data' in the second sentence is 0.
- Note that there are two ways that $\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)$ can be 0:
- If $t$ appears in every document, because then $\text{idf}(t) = \log (\frac{\text{# documents}}{\text{# documents}}) = \log(1) = 0$.
- If $t$ does not appear in document $d$, because then $\text{tf}(t, d) = \frac{0}{\text{len}(d)} = 0$.
- The term that best summarizes a document is the term with the highest TF-IDF for that document:
tfidfs.idxmax(axis=1)
big big big big data class      class
data big data science         science
science big data              science
dtype: object
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
What questions do you have?
Example: State of the Union addresses 🎤¶
Overview¶
- Now that we have a robust technique for assigning scores to terms – that is, TF-IDF – we can use it to find the most important terms in each State of the Union address.
speeches
text | |
---|---|
George Washington: January 8, 1790 | fellow citizens of the senate and house of re... |
George Washington: December 8, 1790 | fellow citizens of the senate and house of re... |
George Washington: October 25, 1791 | fellow citizens of the senate and house of re... |
... | ... |
Joseph R. Biden Jr.: March 1, 2022 | madam speaker madam vice president and our ... |
Joseph R. Biden Jr.: February 7, 2023 | mr speaker madam vice president our firs... |
Joseph R. Biden Jr.: March 7, 2024 | good evening mr speaker madam vice presi... |
234 rows × 1 columns
- To recap, we'll need to compute:
for each term t:
for each speech d:
compute tfidf(t, d)
- This time, we don't already have a bag of words matrix, so we'll need to start from scratch.
Finding all unique terms¶
- First, we need to find the unique terms used across all SOTU speeches.
These words will form the columns of our TF-IDF matrix.
all_unique_terms = speeches['text'].str.split().explode().value_counts() # Faster than .sum() from last lecture!
all_unique_terms
text
the        147338
of          94505
to          60827
            ...
palacio         1
not'            1
isn             1
Name: count, Length: 24259, dtype: int64
- Since there are over 20,000 unique terms, computing TF-IDFs for all of them would take too long. For speed, let's keep just the 500 most frequent terms across all speeches.
unique_terms = all_unique_terms.iloc[:500].index
unique_terms
Index(['the', 'of', 'to', 'and', 'in', 'a', 'that', 'for', 'be', 'our', ... 'months', 'call', 'increasing', 'desire', 'submitted', 'throughout', 'point', 'trust', 'set', 'object'], dtype='object', name='text', length=500)
Finding term frequencies¶
- Next, let's find the bag of words matrix, i.e. a matrix that tells us the number of occurrences of each term in each document.
- What's the difference between the following two expressions?
speeches['text'].str.count('the')
George Washington: January 8, 1790       120
George Washington: December 8, 1790      160
George Washington: October 25, 1791      302
                                         ...
Joseph R. Biden Jr.: March 1, 2022       514
Joseph R. Biden Jr.: February 7, 2023    507
Joseph R. Biden Jr.: March 7, 2024       398
Name: text, Length: 234, dtype: int64
# Remember, the \b special character matches **word boundaries**!
# This makes sure that we don't count instances of "the" that are part of other words,
# like "thesaurus".
speeches['text'].str.count(r'\bthe\b')
George Washington: January 8, 1790        97
George Washington: December 8, 1790      122
George Washington: October 25, 1791      242
                                         ...
Joseph R. Biden Jr.: March 1, 2022       357
Joseph R. Biden Jr.: February 7, 2023    338
Joseph R. Biden Jr.: March 7, 2024       296
Name: text, Length: 234, dtype: int64
- Let's repeat the above calculation for every unique term. This code will take a while to run, so we'll use the tqdm package to track its progress.
Install with mamba install tqdm if needed.
from tqdm.notebook import tqdm
counts_dict = {}
for term in tqdm(unique_terms):
counts_dict[term] = speeches['text'].str.count(fr'\b{term}\b')
counts = pd.DataFrame(counts_dict, index=speeches.index)
counts
the | of | to | and | ... | point | trust | set | object | |
---|---|---|---|---|---|---|---|---|---|
George Washington: January 8, 1790 | 97 | 69 | 56 | 41 | ... | 0 | 1 | 0 | 3 |
George Washington: December 8, 1790 | 122 | 89 | 49 | 45 | ... | 0 | 0 | 0 | 2 |
George Washington: October 25, 1791 | 242 | 159 | 88 | 73 | ... | 0 | 2 | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Joseph R. Biden Jr.: March 1, 2022 | 357 | 189 | 272 | 316 | ... | 3 | 3 | 3 | 0 |
Joseph R. Biden Jr.: February 7, 2023 | 338 | 166 | 261 | 232 | ... | 1 | 6 | 0 | 0 |
Joseph R. Biden Jr.: March 7, 2024 | 296 | 132 | 216 | 234 | ... | 1 | 2 | 1 | 0 |
234 rows × 500 columns
- The above DataFrame does not contain term frequencies. To convert the values above to term frequencies, we need to normalize by the sum of each row.
tfs = counts.apply(lambda s: s / s.sum(), axis=1)
tfs
the | of | to | and | ... | point | trust | set | object | |
---|---|---|---|---|---|---|---|---|---|
George Washington: January 8, 1790 | 0.12 | 0.09 | 0.07 | 0.05 | ... | 0.00e+00 | 1.24e-03 | 0.00e+00 | 3.73e-03 |
George Washington: December 8, 1790 | 0.12 | 0.09 | 0.05 | 0.04 | ... | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.97e-03 |
George Washington: October 25, 1791 | 0.15 | 0.10 | 0.05 | 0.04 | ... | 0.00e+00 | 1.20e-03 | 1.20e-03 | 1.20e-03 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Joseph R. Biden Jr.: March 1, 2022 | 0.07 | 0.04 | 0.05 | 0.06 | ... | 5.87e-04 | 5.87e-04 | 5.87e-04 | 0.00e+00 |
Joseph R. Biden Jr.: February 7, 2023 | 0.07 | 0.04 | 0.06 | 0.05 | ... | 2.12e-04 | 1.27e-03 | 0.00e+00 | 0.00e+00 |
Joseph R. Biden Jr.: March 7, 2024 | 0.07 | 0.03 | 0.05 | 0.06 | ... | 2.37e-04 | 4.74e-04 | 2.37e-04 | 0.00e+00 |
234 rows × 500 columns
Finding TF-IDFs¶
- Finally, we'll need to find the inverse document frequencies (IDF) of each term.
- Using
apply
, we can find the IDFs of each term and multiply them by the term frequencies in one step.
tfidfs = tfs.apply(lambda s: s * np.log(s.shape[0] / (s > 0).sum()))
tfidfs
the | of | to | and | ... | point | trust | set | object | |
---|---|---|---|---|---|---|---|---|---|
George Washington: January 8, 1790 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.00e+00 | 5.78e-04 | 0.00e+00 | 2.78e-03 |
George Washington: December 8, 1790 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.47e-03 |
George Washington: October 25, 1791 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.00e+00 | 5.58e-04 | 4.79e-04 | 8.95e-04 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Joseph R. Biden Jr.: March 1, 2022 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.27e-04 | 2.73e-04 | 2.34e-04 | 0.00e+00 |
Joseph R. Biden Jr.: February 7, 2023 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 8.19e-05 | 5.91e-04 | 0.00e+00 | 0.00e+00 |
Joseph R. Biden Jr.: March 7, 2024 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 9.16e-05 | 2.20e-04 | 9.46e-05 | 0.00e+00 |
234 rows × 500 columns
- Why are the TF-IDFs of many common words 0?
Summarizing speeches¶
- By using idxmax, we can find the term with the highest TF-IDF in each speech.
summaries = tfidfs.idxmax(axis=1)
summaries
George Washington: January 8, 1790           object
George Washington: December 8, 1790      convention
George Washington: October 25, 1791       provision
                                             ...
Joseph R. Biden Jr.: March 1, 2022          tonight
Joseph R. Biden Jr.: February 7, 2023       tonight
Joseph R. Biden Jr.: March 7, 2024          tonight
Length: 234, dtype: object
- What if we want to see the 5 terms with the highest TF-IDFs, for each speech?
def five_largest(row):
return ', '.join(row.index[row.argsort()][-5:])
keywords = tfidfs.apply(five_largest, axis=1).to_frame().rename(columns={0: 'most important terms'})
keywords
most important terms | |
---|---|
George Washington: January 8, 1790 | your, proper, regard, ought, object |
George Washington: December 8, 1790 | case, established, object, commerce, convention |
George Washington: October 25, 1791 | community, upon, lands, proper, provision |
... | ... |
Joseph R. Biden Jr.: March 1, 2022 | let, jobs, americans, get, tonight |
Joseph R. Biden Jr.: February 7, 2023 | down, percent, let, jobs, tonight |
Joseph R. Biden Jr.: March 7, 2024 | jobs, down, get, americans, tonight |
234 rows × 1 columns
- Uncomment the cell below to see every single row of keywords. Cool!
# display_df(keywords, rows=234)
Cosine similarity, revisited¶
- Each row of tfidfs contains a vector representation of a speech. This means that we can compute the cosine similarities between any two speeches!
The only difference now is that we used TF-IDF to find $\vec u$ and $\vec v$, rather than the bag of words model.
tfidfs
the | of | to | and | ... | point | trust | set | object | |
---|---|---|---|---|---|---|---|---|---|
George Washington: January 8, 1790 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.00e+00 | 5.78e-04 | 0.00e+00 | 2.78e-03 |
George Washington: December 8, 1790 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.00e+00 | 0.00e+00 | 0.00e+00 | 1.47e-03 |
George Washington: October 25, 1791 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.00e+00 | 5.58e-04 | 4.79e-04 | 8.95e-04 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Joseph R. Biden Jr.: March 1, 2022 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.27e-04 | 2.73e-04 | 2.34e-04 | 0.00e+00 |
Joseph R. Biden Jr.: February 7, 2023 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 8.19e-05 | 5.91e-04 | 0.00e+00 | 0.00e+00 |
Joseph R. Biden Jr.: March 7, 2024 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 9.16e-05 | 2.20e-04 | 9.46e-05 | 0.00e+00 |
234 rows × 500 columns
def sim(speech_1, speech_2):
v1 = tfidfs.loc[speech_1]
v2 = tfidfs.loc[speech_2]
return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
sim('George Washington: January 8, 1790', 'George Washington: December 8, 1790')
0.5229932678775406
sim('George Washington: January 8, 1790', 'Joseph R. Biden Jr.: March 7, 2024')
0.0692680683399537
- We can also find the most similar pair of speeches:
from itertools import combinations
sims_dict = {}
# For every pair of speeches, find the similarity and store it in
# the sims_dict dictionary.
for pair in combinations(tfidfs.index, 2):
sims_dict[pair] = sim(pair[0], pair[1])
# Turn the sims_dict dictionary into a DataFrame.
sims = (
pd.Series(sims_dict)
.reset_index()
.rename(columns={'level_0': 'speech 1', 'level_1': 'speech 2', 0: 'cosine similarity'})
.sort_values('cosine similarity', ascending=False)
)
sims
speech 1 | speech 2 | cosine similarity | |
---|---|---|---|
27171 | Barack Obama: January 25, 2011 | Barack Obama: February 12, 2013 | 0.93 |
11685 | James Polk: December 8, 1846 | James Polk: December 7, 1847 | 0.93 |
27183 | Barack Obama: January 24, 2012 | Barack Obama: February 12, 2013 | 0.92 |
... | ... | ... | ... |
6714 | James Monroe: December 7, 1819 | Ronald Reagan: January 26, 1982 | 0.04 |
18192 | Grover Cleveland: December 6, 1887 | George W. Bush: September 20, 2001 | 0.04 |
6733 | James Monroe: December 7, 1819 | George W. Bush: February 27, 2001 | 0.03 |
27261 rows × 3 columns
- For instance, we can find the most similar pairs of speeches by different Presidents:
sims[sims['speech 1'].str.split(':').str[0] != sims['speech 2'].str.split(':').str[0]]
speech 1 | speech 2 | cosine similarity | |
---|---|---|---|
27191 | Barack Obama: January 24, 2012 | Joseph R. Biden Jr.: April 28, 2021 | 0.88 |
27239 | Donald J. Trump: February 27, 2017 | Joseph R. Biden Jr.: March 7, 2024 | 0.87 |
27243 | Donald J. Trump: January 30, 2018 | Joseph R. Biden Jr.: March 1, 2022 | 0.87 |
... | ... | ... | ... |
6714 | James Monroe: December 7, 1819 | Ronald Reagan: January 26, 1982 | 0.04 |
18192 | Grover Cleveland: December 6, 1887 | George W. Bush: September 20, 2001 | 0.04 |
6733 | James Monroe: December 7, 1819 | George W. Bush: February 27, 2001 | 0.03 |
26624 rows × 3 columns
Aside: What if we remove the $\log$ from $\text{idf}(t)$?¶
- Let's try it and see what happens.
Below is another, quicker implementation of how we might find TF-IDFs.
tfidfs_nl_dict = {}
tf_denom = speeches['text'].str.split().str.len()
for word in tqdm(unique_terms):
re_pat = fr' {word} ' # Imperfect pattern for speed.
tf = speeches['text'].str.count(re_pat) / tf_denom
idf_nl = len(speeches) / speeches['text'].str.contains(re_pat).sum()
tfidfs_nl_dict[word] = tf * idf_nl
tfidfs_nl = pd.DataFrame(tfidfs_nl_dict)
tfidfs_nl.head()
the | of | to | and | ... | point | trust | set | object | |
---|---|---|---|---|---|---|---|---|---|
George Washington: January 8, 1790 | 0.09 | 0.06 | 0.05 | 0.04 | ... | 0.0 | 1.46e-03 | 0.00e+00 | 5.81e-03 |
George Washington: December 8, 1790 | 0.09 | 0.06 | 0.03 | 0.03 | ... | 0.0 | 0.00e+00 | 0.00e+00 | 3.01e-03 |
George Washington: October 25, 1791 | 0.11 | 0.07 | 0.04 | 0.03 | ... | 0.0 | 1.38e-03 | 1.29e-03 | 1.83e-03 |
George Washington: November 6, 1792 | 0.09 | 0.07 | 0.04 | 0.03 | ... | 0.0 | 2.28e-03 | 0.00e+00 | 2.02e-03 |
George Washington: December 3, 1793 | 0.09 | 0.07 | 0.04 | 0.02 | ... | 0.0 | 8.10e-04 | 0.00e+00 | 1.07e-03 |
5 rows × 500 columns
keywords_nl = tfidfs_nl.apply(five_largest, axis=1)
keywords_nl
George Washington: January 8, 1790        a, and, to, of, the
George Washington: December 8, 1790      in, and, to, of, the
George Washington: October 25, 1791       a, and, to, of, the
                                                  ...
Joseph R. Biden Jr.: March 1, 2022       we, of, to, and, the
Joseph R. Biden Jr.: February 7, 2023     a, of, and, to, the
Joseph R. Biden Jr.: March 7, 2024        a, of, to, and, the
Length: 234, dtype: object
- What do you notice?
The role of $\log$ in $\text{idf}(t)$¶
- Remember, for any positive input $x$, $\log(x)$ is (much) smaller than $x$.
- In $\text{idf}(t)$, the $\log$ "dampens" the impact of the ratio $\frac{\text{# documents}}{\text{# documents with $t$}}$. If a word is very common, the ratio will be close to 1. The log of the ratio will be close to 0.
(1000 / 999)
1.001001001001001
np.log(1000 / 999)
0.001000500333583622
- If a word is very common (e.g. "the"), and we didn't have the $\log$, we'd be multiplying the term frequency by a large factor.
- If a word is very rare, the ratio $\frac{\text{# documents}}{\text{# documents with $t$}}$ will be very large. However, for instance, a word being seen in 2 out of 50 documents is not very different than being seen in 2 out of 500 documents (it is very rare in both cases), and so $\text{idf}(t)$ should be similar in both cases.
(50 / 2)
25.0
(500 / 2)
250.0
np.log(50 / 2)
3.2188758248682006
np.log(500 / 2)
5.521460917862246
Activity
This is an old exam question! Nishant decides to look at reviews for the Catamaran Resort Hotel and Spa. TripAdvisor has 96 reviews for the hotel; of those 96, Nishant's favorite review was:
"close to the beach but far from the beach beach"
- What is the TF of "beach" in Nishant’s favorite review? Give your answer as a simplified fraction.
- The TF-IDF of "beach" in Nishant’s favorite review is $\frac{9}{10}$, when using a base-2 logarithm to compute the IDF. How many of the reviews on TripAdvisor for this hotel contain the term "beach"?
TF-IDF, implemented¶
- In Homework 6, we may ask you to implement TF-IDF to further your own understanding.
- But, in practical projects, you'd use an existing implementation of it, like TfidfVectorizer in sklearn.
See the documentation for more details.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(speeches['text'])
tfidfs_sklearn = pd.DataFrame(X.toarray(),
columns=vectorizer.get_feature_names_out(),
index=speeches.index)
tfidfs_sklearn
aaa | aaron | abandon | abandoned | ... | zones | zoological | zooming | zuloaga | |
---|---|---|---|---|---|---|---|---|---|
George Washington: January 8, 1790 | 0.0 | 0.0 | 0.00e+00 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
George Washington: December 8, 1790 | 0.0 | 0.0 | 0.00e+00 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
George Washington: October 25, 1791 | 0.0 | 0.0 | 0.00e+00 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Joseph R. Biden Jr.: March 1, 2022 | 0.0 | 0.0 | 2.97e-03 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
Joseph R. Biden Jr.: February 7, 2023 | 0.0 | 0.0 | 0.00e+00 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
Joseph R. Biden Jr.: March 7, 2024 | 0.0 | 0.0 | 0.00e+00 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 |
234 rows × 23739 columns
tfidfs_sklearn[tfidfs_sklearn['zuloaga'] != 0]
aaa | aaron | abandon | abandoned | ... | zones | zoological | zooming | zuloaga | |
---|---|---|---|---|---|---|---|---|---|
James Buchanan: December 19, 1859 | 0.0 | 0.0 | 0.0 | 0.00e+00 | ... | 0.0 | 0.0 | 0.0 | 1.34e-02 |
James Buchanan: December 3, 1860 | 0.0 | 0.0 | 0.0 | 1.30e-03 | ... | 0.0 | 0.0 | 0.0 | 2.91e-03 |
2 rows × 23739 columns
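- As a rough sanity check, we can reuse our five_largest helper on sklearn's matrix. Keep in mind that TfidfVectorizer's default formula differs slightly from ours (it smooths the IDF and normalizes each row), so the terms it picks may not match keywords exactly.
# The 5 terms with the highest sklearn TF-IDFs in each speech.
# This may take a moment, since tfidfs_sklearn has over 23,000 columns.
tfidfs_sklearn.apply(five_largest, axis=1)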
Summary¶
- One way to turn text, like 'big big big big data class', into numbers is to count the number of occurrences of each word in the document, ignoring order. This is done using the bag of words model.
- Term frequency-inverse document frequency (TF-IDF) is a statistic that tries to quantify how important a term (word) is to a document. It balances:
- how often a term appears in a particular document, $\text{tf}(t, d)$, with
- how often a term appears across documents, $\text{idf}(t)$.
- For a given document, the word with the highest TF-IDF is thought to "best summarize" that document.
- Both the bag of words model and TF-IDF are ways of converting texts to vector representations.
- To measure the similarity of two texts, convert the texts to their vector representations, and use cosine similarity.