In [1]:
from lec_utils import *

Lecture 10ΒΆ

Text as DataΒΆ

EECS 398: Practical Data Science, Spring 2025ΒΆ

practicaldsc.org β€’ github.com/practicaldsc/sp25 β€’ πŸ“£ See latest announcements here on Ed

Agenda πŸ“†ΒΆ

  • From text to numbers.
  • Bag of words πŸ’°.
  • TF-IDF.
  • Example: State of the Union addresses 🎀.

Question πŸ€” (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

From text to numbersΒΆ


From text to numbersΒΆ

  • How do we represent a text document using numbers?
  • Computers and mathematical formulas are designed to work with numbers, not words.
  • So, if we can convert documents into numbers, we can:
    • summarize documents by finding their most important words (today).
    • quantify the similarity of two documents (today).
    • use a document as input in a regression or classification model (starting next lecture).

Example: State of the Union addresses 🎤¶

  • Each year, the sitting US President delivers a "State of the Union" address.
    The 2025 State of the Union (SOTU) address was on March 4th, 2025.
    "Address" is another word for "speech."
In [2]:
from IPython.display import YouTubeVideo
YouTubeVideo('XkFKNkAEzQ8')
Out[2]:
  • The file 'data/stateoftheunion1790-2025.txt' contains the transcript of every SOTU address since 1790.
    Source: The American Presidency Project.
In [3]:
with open('data/stateoftheunion1790-2025.txt') as f:
    sotu = f.read()
In [4]:
# The file is over 10 million characters long!
len(sotu) / 1_000_000
Out[4]:
10.675837

TerminologyΒΆ

  • In text analysis, each piece of text we want to analyze is called a document.
    Here, each speech is a document.
  • Documents are made up of terms, i.e. words.
  • A collection of documents is called a corpus.
    Here, the corpus is the set of all SOTU speeches from 1790-2025.

Extracting speechesΒΆ

  • In the string sotu, each document is separated by '***'.
InΒ [5]:
speeches_lst = sotu.split('\n***\n')[1:]
len(speeches_lst)
Out[5]:
235
  • Note that each "speech" currently contains other information, like the name of the president and the date of the address.
In [6]:
print(speeches_lst[-1][:1000])
Address Before a Joint Session of Congress 
Donald J. Trump
March 4, 2025

The President. Thank you. Thank you very much. Thank you very much. It's a great honor. Thank you very much.

Speaker Johnson, Vice President Vance, the First Lady of the United States, Members of the United States Congress: Thank you very much.

And to my fellow citizens, America is back.

Audience members. U.S.A.! U.S.A.! U.S.A.!

The President. Six weeks ago, I stood beneath the dome of this Capitol and proclaimed the dawn of the golden age of America. From that moment on, it has been nothing but swift and unrelenting action to usher in the greatest and most successful era in the history of our country.

We have accomplished more in 43 days than most administrations accomplished in 4 years or 8 years, and we are just getting started. [Applause] Thank you.

I return to this Chamber tonight to report that America's momentum is back, our spirit is back, our pride is back, our confidence is back, and the America
  • Let's extract just the text of each speech and put it in a DataFrame.
    Along the way, we'll use our new knowledge of regular expressions to remove capitalization and punctuation, so we can just focus on the content itself.
In [7]:
def create_speeches_df(speeches_lst):
    def extract_struct(speech):
        L = speech.strip().split('\n', maxsplit=3)
        L[3] = re.sub(r"[^A-Za-z' ]", ' ', L[3]).lower() # Replaces anything OTHER than letters with ' '.
        L[3] = re.sub(r"it's", 'it is', L[3]).replace(' s ', '')
        return dict(zip(['president', 'date', 'text'], L[1:]))
    speeches = pd.DataFrame(list(map(extract_struct, speeches_lst)))
    speeches.index = speeches['president'].str.strip() + ': ' + speeches['date']
    speeches = speeches[['text']]
    return speeches
In [8]:
speeches = create_speeches_df(speeches_lst)
speeches
Out[8]:
text
George Washington: January 8, 1790 fellow citizens of the senate and house of re...
George Washington: December 8, 1790 fellow citizens of the senate and house of re...
George Washington: October 25, 1791 fellow citizens of the senate and house of re...
... ...
Joseph R. Biden Jr.: February 7, 2023 mr speaker madam vice president our firs...
Joseph R. Biden Jr.: March 7, 2024 good evening mr speaker madam vice presi...
Donald J. Trump: March 4, 2025 the president thank you thank you very much...

235 rows × 1 columns

Quantifying speeches¶

  • Our goal is to produce a DataFrame that contains the most important terms in each speech, i.e. the terms that best summarize each speech:
most important terms
George Washington: January 8, 1790 your, proper, regard, ought, object
George Washington: December 8, 1790 case, established, object, commerce, convention
... ...
Joseph R. Biden Jr.: February 7, 2023 americans, down, percent, jobs, tonight
Joseph R. Biden Jr.: March 7, 2024 jobs, down, get, americans, tonight
  • To do so, we will need to come up with a way of assigning a numerical score to each term in each speech.
    We'll come up with a score for each term such that terms with higher scores are more important!
jobs down commerce ... convention americans tonight
George Washington: January 8, 1790 0.00e+00 0.00e+00 3.55e-04 ... 0.00e+00 0.00e+00 0.00e+00
George Washington: December 8, 1790 0.00e+00 0.00e+00 1.10e-03 ... 1.18e-03 0.00e+00 0.00e+00
... ... ... ... ... ... ... ...
Joseph R. Biden Jr.: February 7, 2023 2.73e-03 1.78e-03 0.00e+00 ... 0.00e+00 1.56e-03 3.34e-03
Joseph R. Biden Jr.: March 7, 2024 1.77e-03 1.96e-03 5.93e-05 ... 0.00e+00 2.37e-03 3.90e-03
  • In doing so, we will represent each speech as a vector.

Bag of words πŸ’°ΒΆ


Counting frequenciesΒΆ

  • Idea: The most important terms in a document are the terms that occur most often.
  • So, let's count the number of occurrences of each term in each document.
    In other words, let's count the frequency of each term in each document.
  • For example, consider the following three documents:
big big big big data class
data big data science
science big data
  • Let's construct a matrix, where:
    • there is one row per document,
    • one column per unique term, and
    • the value in row $d$ and column $t$ is the number of occurrences of term $t$ in document $d$.
big data class science
big big big big data class 4 1 1 0
data big data science 1 2 0 1
science big data 1 1 0 1
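  • As an aside, here's a minimal sketch of how you might build such a matrix programmatically, using Python's Counter and pandas. (The three documents are the ones above; counts_matrix is a name chosen just for illustration.)
In [ ]:
from collections import Counter
import pandas as pd

docs = ['big big big big data class', 'data big data science', 'science big data']
# One row per document, one column per unique term; entries are term counts.
# Counter(doc.split()) maps each term in a document to its number of occurrences.
counts_matrix = (
    pd.DataFrame([Counter(doc.split()) for doc in docs], index=docs)
    .fillna(0)
    .astype(int)
)
counts_matrix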

Bag of wordsΒΆ

  • The bag of words model represents documents as vectors of word counts, i.e. term frequencies.
    The matrix below was created using the bag of words model.
  • Each row in the bag of words matrix is a vector representation of a document.
big data class science
big big big big data class 4 1 1 0
data big data science 1 2 0 1
science big data 1 1 0 1
  • For example, we can represent the document 2, data big data science, with the vector $\vec{d_2}$:
$$\vec{d_2} = \begin{bmatrix} 1 \\ 2 \\ 0 \\ 1 \end{bmatrix}$$

It is called "bag of words" because it doesn't consider order.

Aside: Interactive bag of words demo¶

Check this site out – it automatically generates a bag of words matrix for you!


Applications of the bag of words model¶

  • Now that we have a matrix of word counts, what can we do with it?
big data class science
big big big big data class 4 1 1 0
data big data science 1 2 0 1
science big data 1 1 0 1
  • Application: We could interpret the term with the largest value – that is, the most frequent term – in each document as being the most important.
    This is imperfect. What if a document has the term "the" as its most frequent term?
  • Application: We could use the vector representations of documents to measure the similarity of two documents.
    This would enable us to find, for example, the SOTU speeches that are most similar to one another!

Recall: The dot productΒΆ

$$\require{color}$$
  • Recall, if $\color{purple} \vec{u} = \begin{bmatrix} u_1 \\ u_2 \\ ... \\ u_n \end{bmatrix}$ and $\color{#007aff} \vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ ... \\ v_n \end{bmatrix}$ are two vectors, then their dot product ${\color{purple}\vec{u}} \cdot {\color{#007aff}\vec{v}}$ is defined as:
$${\color{purple}\vec{u}} \cdot {\color{#007aff}\vec{v}} = \sum_{i = 1}^n {\color{purple}u_i} {\color{#007aff} v_i} = {\color{purple}u_1} {\color{#007aff} v_1} + {\color{purple}u_2} {\color{#007aff} v_2} + ... + {\color{purple}u_n} {\color{#007aff} v_n}$$
  • The dot product also has an equivalent geometric definition, which says that:
$${\color{purple}\vec{u}} \cdot {\color{#007aff}\vec{v}} = \lVert {\color{purple}\vec{u}} \rVert \lVert {\color{#007aff}\vec{v}} \rVert \cos \theta$$
where $\theta$ is the angle between $\color{purple}\vec{u}$ and ${\color{#007aff}\vec{v}}$, and
$\lVert {\color{purple}\vec{u}} \rVert = \sqrt{{\color{purple} u_1}^2 + {\color{purple} u_2}^2 + ... + {\color{purple} u_n}^2}$
is the length of $\color{purple} \vec u$.
  • The two definitions are equivalent! This equivalence allows us to find the angle $\theta$ between two vectors.

  • For review, see the Linear Algebra Guides on the course website.

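  • As a quick numerical check of this equivalence, here's a minimal sketch using numpy (the two vectors are arbitrary examples):
In [ ]:
import numpy as np

u = np.array([4, 1, 1, 0])
v = np.array([1, 2, 0, 1])
# Algebraic definition: the sum of elementwise products.
dot = u @ v
# Geometric definition, rearranged to recover the angle between u and v.
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))
theta = np.arccos(cos_theta)  # In radians.
dot, cos_theta, theta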

Angles and similarityΒΆ

  • Key idea: The more similar two vectors are, the smaller the angle $\theta$ between them is.
  • The smaller the angle $\theta$ between two vectors is, the larger $\cos \theta$ is.
  • The maximum value of $\cos \theta$ is 1, achieved when $\theta = 0$.
  • Key idea: The more similar two vectors are, the larger $\cos \theta$ is!

Cosine similarityΒΆ

  • To measure the similarity between two documents, we can compute the cosine similarity of their vector representations:
$$\text{cosine similarity}(\vec u, \vec v) = \cos \theta = \boxed{\frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert \lVert \vec{v} \rVert}}$$
  • If all elements in $\vec{u}$ and $\vec{v}$ are non-negative, then $\cos \theta$ ranges from 0 to 1.
  • Key idea: The more similar two vectors are, the larger $\cos \theta$ is!
  • Given a collection of documents, to find the most similar pair, we can:
    1. Find the vector representation of each document.
    2. Find the cosine similarity of each pair of vectors.
    3. Return the documents whose vectors had the largest cosine similarity.

Activity

Consider the matrix of word counts we found earlier, using the bag of words model:

big data class science
big big big big data class 4 1 1 0
data big data science 1 2 0 1
science big data 1 1 0 1
  1. Which two documents have the highest dot product?
  2. Which two documents have the highest cosine similarity?

NormalizingΒΆ

  • Why can't we just use the dot product – that is, why must we divide by $\lVert \vec{u} \rVert \lVert \vec{v} \rVert$ when computing cosine similarity?
$$\text{cosine similarity}(\vec u, \vec v) = \cos \theta = \boxed{\frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert \lVert \vec{v} \rVert}}$$
big data class science
big big big big data class 4 1 1 0
data big data science 1 2 0 1
science big data 1 1 0 1
  • Consider the following two pairs of documents:
Pair Dot Product Cosine Similarity
big big big big data class and data big data science 6 0.577
science big data and data big data science 4 0.943
  • "big big big big data class" has a large dot product with "data big data science" just because the former has the term "big" four times. But intuitively, "data big data science" and "science big data" should be much more similar, since they have almost the exact same terms.
  • So, make sure to compute the cosine similarity – don't just use the dot product!
    If you don't normalize by the lengths of the vectors, documents with more terms will have artificially high similarities with other documents.
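  • As a sanity check, here's a small sketch that verifies both columns of the table above, using the bag of words vectors from the matrix:
In [ ]:
import numpy as np

d1 = np.array([4, 1, 1, 0])  # big big big big data class
d2 = np.array([1, 2, 0, 1])  # data big data science
d3 = np.array([1, 1, 0, 1])  # science big data

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Dot products: 6 and 4. Cosine similarities: ~0.577 and ~0.943.
d1 @ d2, d3 @ d2, cosine_similarity(d1, d2), cosine_similarity(d3, d2)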

Reference SlideΒΆ

Cosine distanceΒΆ

  • Sometimes, you will see the cosine distance being used. It is the complement of cosine similarity:
$$\text{dist}(\vec{u}, \vec{v}) = 1 - \text{cosine similarity}(\vec u, \vec v) = 1 - \cos \theta$$
  • If $\text{dist}(\vec{u}, \vec{v})$ is small, the two vector representations are similar.

Issues with the bag of words model¶

  • Recall, the bag of words model encodes a document as a vector containing word frequencies.
  • It doesn't consider the order of the terms.
    "big data science" and "data science big" have the same vector representation, but mean different things. (See the sketch after this list.)

  • It doesn't consider the meaning of terms.
    "I really really hate data" and "I really really love data" have nearly identical vector representations, but very different meanings.
  • It treats all words as being equally important. This is the issue we'll address today.
    In "I am a student" and "I am a teacher", it's clear to us humans that the most important terms are "student" and "teacher", respectively. But in the bag of words model, "student" and "I" appear the same number of times in the first document.
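  • To see the first issue concretely, here's a minimal sketch using sklearn's CountVectorizer, an off-the-shelf implementation of the bag of words model:
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer()
counts = cv.fit_transform(['big data science', 'data science big'])
# Both rows are identical: the bag of words model ignores order entirely.
pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out())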

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

What questions do you have?

TF-IDFΒΆ


What makes a word important?ΒΆ

  • Issue: The bag of words model doesn't know which terms are "important" in a document.
  • Consider the following document:
"my brother has a friend named billy who has an uncle named billy"
  • "has" and "billy" both appear the same number of times in the document above. But "has" is an extremely common term overall, while "billy" isn't.
  • Observation: If a term is important in a document, it will appear frequently in that document but not frequently in other documents.
    Let's try and find a way of giving scores to terms that keeps this in mind. If we can do this, then the terms with the highest scores can be used to summarize the document!

Term frequencyΒΆ

  • The term frequency of a term $t$ in a document $d$, denoted $\text{tf}(t, d)$, is the proportion of words in document $d$ that are equal to $t$.
$$\text{tf}(t, d) = \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}}$$
  • Example: What is the term frequency of "billy" in the following document?
"my brother has a friend named billy who has an uncle named billy"
  • Answer: $\frac{2}{13}$.
  • Intuition: Terms that occur often within a document are important to the document's meaning.
$$\text{tf}(t, d) \text{ large} \implies t \text{ occurs often in $d$} \\ \text{tf}(t, d) \text{ small} \implies t \text{ occurs rarely in $d$}$$
  • Issue: "has" also has a TF of $\frac{2}{13}$, but it seems less important than "billy".

Inverse document frequencyΒΆ

  • The inverse document frequency of a term $t$ in a set of documents $d_1, d_2, ...$ is:
$$\text{idf}(t) = \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right)$$
  • Example: What is the inverse document frequency of "billy" in the following three documents?
    • "my brother has a friend named billy who has an uncle named billy"
    • "my favorite artist is named jilly boel"
    • "why does he talk about someone named billy so often"
  • Answer: $\log \left(\frac{3}{2}\right) \approx 0.4055$.
    Here, we used the natural logarithm. It doesn't matter which log base we use, as long as we keep it consistent throughout all of our calculations.
  • Intuition: If a word appears in every document (like "the" or "has"), it is probably not a good summary of any one document.
  • Think of $\text{idf}(t)$ as the "rarity factor" of $t$ across documents – the larger $\text{idf}(t)$ is, the more rare $t$ is.
$$\text{idf}(t) \text{ large} \implies t \text{ rare across all documents} \\ \text{idf}(t) \text{ small} \implies t \text{ common across all documents}$$
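  • Here's a short sketch that verifies the numbers above; the corpus is the three documents from this slide:
In [ ]:
import numpy as np

corpus = [
    'my brother has a friend named billy who has an uncle named billy',
    'my favorite artist is named jilly boel',
    'why does he talk about someone named billy so often',
]
tf_billy = corpus[0].split().count('billy') / len(corpus[0].split())
idf_billy = np.log(len(corpus) / sum('billy' in doc.split() for doc in corpus))
tf_billy, idf_billy  # 2/13 ≈ 0.154 and log(3/2) ≈ 0.4055.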

IntuitionΒΆ

  • Goal: Measure how important term $t$ is to document $d$.
    Equivalent goal: Find the terms that best summarize $d$.
$$\text{tf}(t, d) = \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}}$$
$$\text{idf}(t) = \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right)$$
  • If $\text{tf}(t, d)$ is small, then $t$ doesn't occur very often in $d$, so $t$ can't be very important to $d$.
  • If $\text{idf}(t)$ is small, then $t$ occurs often amongst all documents, and so it can't be very important to $d$ specifically.
  • If $\text{tf}(t, d)$ and $\text{idf}(t)$ are both large, then $t$ occurs often in $d$ but rarely overall. This makes $t$ important to $d$, i.e. a good "summary" of $d$.

Term frequency-inverse document frequency¶

  • The term frequency-inverse document frequency (TF-IDF) of term $t$ in document $d$ is the product:
$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right) \end{align*} $$
  • If $\text{tfidf}(t, d)$ is large, then $t$ is important to $d$, because $t$ occurs often in $d$ but rarely across all documents.
    This means $t$ is a good summary of $d$!
  • Note: TF-IDF is a heuristic method – there's no "proof" that it performs well.
  • To know if $\text{tfidf}(t, d)$ is large for one particular term $t$, we need to compare it to $\text{tfidf}(t_i, d)$, for several different terms $t_i$.

Computing TF-IDFΒΆ

  • Question: What is the TF-IDF of "science" in "data big data science"?
big data class science
big big big big data class 4 1 1 0
data big data science 1 2 0 1
science big data 1 1 0 1
  • Answer:
$$\begin{align*}\text{tfidf}(t, d) &= \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right) \\ &= \frac{1}{4} \cdot \log\left( \frac{3}{2} \right)\end{align*} $$
  • Question: Is this big or small? Is "science" the best summary of "data big data science"?
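  • Evaluating the expression above gives a concrete number, though the question of whether it's big or small still stands:
In [ ]:
import numpy as np

(1 / 4) * np.log(3 / 2)  # ≈ 0.101.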

TF-IDF of all terms in all documents¶

  • On its own, the TF-IDF of one term in one document doesn't really tell us anything. We must compare it to TF-IDFs of other terms in that same document.
  • Let's start with a DataFrame version of our bag of words matrix. It already contains the numerators for term frequency, i.e. $\text{tf}(t, d) = \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}}$.
big data class science
big big big big data class 4 1 1 0
data big data science 1 2 0 1
science big data 1 1 0 1
In [9]:
bow = pd.DataFrame([[4, 1, 1, 0], [1, 2, 0, 1], [1, 1, 0, 1]],
                   index=['big big big big data class', 'data big data science', 'science big data'],
                   columns=['big', 'data', 'class', 'science'])
bow
Out[9]:
big data class science
big big big big data class 4 1 1 0
data big data science 1 2 0 1
science big data 1 1 0 1
  • The following cell converts the bag of words matrix above to a matrix with TF-IDF scores. See the comments for details.
In [10]:
# To convert the term counts to term frequencies, we'll divide by the sum of each row.
# Each row corresponds to the terms in one document; the sum of a row is the total number of terms in the document, 
# which is the denominator in the formula for term frequency, (# of occurrences of t in d) / (total # of terms in d).
tfs = bow.apply(lambda s: s / s.sum(), axis=1)
# Next, we need to find the inverse document frequency of each term, t, 
# where idf(t) = log(total # of documents / # of documents in which t appears).
def idf(term):
    term_column = tfs[term]
    return np.log(term_column.shape[0] / (term_column > 0).sum())
all_idfs = [idf(c) for c in tfs.columns]
all_idfs = pd.Series(all_idfs, index=tfs.columns)
# Finally, let's multiply `tfs`, the DataFrame with the term frequencies of each term in each document, 
# by `all_idfs`, the Series of inverse document frequencies of each term.
tfidfs = tfs * all_idfs
tfidfs
Out[10]:
big data class science
big big big big data class 0.0 0.0 0.18 0.00
data big data science 0.0 0.0 0.00 0.10
science big data 0.0 0.0 0.00 0.14

Interpreting TF-IDFsΒΆ

InΒ [11]:
tfidfs
Out[11]:
big data class science
big big big big data class 0.0 0.0 0.18 0.00
data big data science 0.0 0.0 0.00 0.10
science big data 0.0 0.0 0.00 0.14
  • The TF-IDF of 'class' in the first sentence is $\approx 0.18$.
  • The TF-IDF of 'data' in the second sentence is 0.
  • Note that there are two ways that $\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)$ can be 0:
    • If $t$ appears in every document, because then $\text{idf}(t) = \log (\frac{\text{# documents}}{\text{# documents}}) = \log(1) = 0$.
    • If $t$ does not appear in document $d$, because then $\text{tf}(t, d) = \frac{0}{\text{len}(d)} = 0$.
  • The term that best summarizes a document is the term with the highest TF-IDF for that document:
In [12]:
tfidfs.idxmax(axis=1) 
Out[12]:
big big big big data class      class
data big data science         science
science big data              science
dtype: object

Question 🤔 (Answer at practicaldsc.org/q)

Remember that you can always ask questions anonymously at the link above!

What questions do you have?

ActivityΒΆ

Work on Problem 13 in the "Regular Expressions and Text Features" worksheet.

Example: State of the Union addresses 🎤¶


Overview¶

  • Now that we have a robust technique for assigning scores to terms – that is, TF-IDF – we can use it to find the most important terms in each State of the Union address.
In [13]:
speeches
Out[13]:
text
George Washington: January 8, 1790 fellow citizens of the senate and house of re...
George Washington: December 8, 1790 fellow citizens of the senate and house of re...
George Washington: October 25, 1791 fellow citizens of the senate and house of re...
... ...
Joseph R. Biden Jr.: February 7, 2023 mr speaker madam vice president our firs...
Joseph R. Biden Jr.: March 7, 2024 good evening mr speaker madam vice presi...
Donald J. Trump: March 4, 2025 the president thank you thank you very much...

235 rows × 1 columns

  • To recap, we'll need to compute:
        for each term t:
            for each speech d:
                compute tfidf(t, d)
  • We've provided the implementation details in reference slides so you can refer to them in Homework 6.

Reference SlideΒΆ

Finding all unique termsΒΆ

  • First, we need to find the unique terms used across all SOTU speeches.
    These words will form the columns of our TF-IDF matrix.
In [14]:
all_unique_terms = speeches['text'].str.split().explode().value_counts()
all_unique_terms
Out[14]:
text
the           147744
of             94765
and            61192
               ...  
pathos             1
desirables         1
skylines           1
Name: count, Length: 24528, dtype: int64
  • Since there are over 20,000 unique terms, computing TF-IDF scores for all of them would take too long. For speed, let's keep just the 500 most frequent terms across all speeches.
In [15]:
unique_terms = all_unique_terms.iloc[:500].index
unique_terms
Out[15]:
Index(['the', 'of', 'and', 'to', 'in', 'a', 'that', 'for', 'be', 'our',
       ...
       'abroad', 'demand', 'call', 'old', 'think', 'throughout', 'increasing',
       'desire', 'submitted', 'building'],
      dtype='object', name='text', length=500)

Reference SlideΒΆ

Finding term frequenciesΒΆ

  • Next, let's find the bag of words matrix, i.e. a matrix that tells us the number of occurrences of each term in each document.
  • What's the difference between the following two expressions?
In [16]:
speeches['text'].str.count('the')
Out[16]:
George Washington: January 8, 1790       120
George Washington: December 8, 1790      160
George Washington: October 25, 1791      302
                                        ... 
Joseph R. Biden Jr.: February 7, 2023    507
Joseph R. Biden Jr.: March 7, 2024       399
Donald J. Trump: March 4, 2025           673
Name: text, Length: 235, dtype: int64
In [17]:
# Remember, the \b special character matches **word boundaries**!
# This makes sure that we don't count instances of "the" that are part of other words,
# like "thesaurus".
speeches['text'].str.count(r'\bthe\b')
Out[17]:
George Washington: January 8, 1790        97
George Washington: December 8, 1790      122
George Washington: October 25, 1791      242
                                        ... 
Joseph R. Biden Jr.: February 7, 2023    338
Joseph R. Biden Jr.: March 7, 2024       293
Donald J. Trump: March 4, 2025           411
Name: text, Length: 235, dtype: int64
  • Let's repeat the above calculation for every unique term. This code will take a while to run, so we'll use the tqdm package to track its progress.
    Install it with mamba install tqdm if needed.
In [18]:
from tqdm.notebook import tqdm
counts_dict = {}
for term in tqdm(unique_terms):
    counts_dict[term] = speeches['text'].str.count(fr'\b{term}\b')        
counts = pd.DataFrame(counts_dict, index=speeches.index)
counts
Out[18]:
the of and to ... increasing desire submitted building
George Washington: January 8, 1790 97 69 41 56 ... 1 0 0 0
George Washington: December 8, 1790 122 89 45 49 ... 0 0 1 0
George Washington: October 25, 1791 242 159 73 88 ... 1 0 1 0
... ... ... ... ... ... ... ... ... ...
Joseph R. Biden Jr.: February 7, 2023 338 166 232 261 ... 2 0 0 6
Joseph R. Biden Jr.: March 7, 2024 293 132 234 216 ... 1 0 0 5
Donald J. Trump: March 4, 2025 411 260 418 296 ... 0 0 0 6

235 rows × 500 columns

  • The DataFrame above contains raw counts, not term frequencies. To convert the counts to term frequencies, we normalize each row by its sum. (Since we only kept 500 terms, each row's sum is an approximation of the total number of terms in that speech.)
In [19]:
tfs = counts.apply(lambda s: s / s.sum(), axis=1)
tfs
Out[19]:
the of and to ... increasing desire submitted building
George Washington: January 8, 1790 0.12 0.09 0.05 0.07 ... 1.25e-03 0.0 0.00e+00 0.00e+00
George Washington: December 8, 1790 0.12 0.09 0.04 0.05 ... 0.00e+00 0.0 9.86e-04 0.00e+00
George Washington: October 25, 1791 0.15 0.10 0.04 0.05 ... 6.02e-04 0.0 6.02e-04 0.00e+00
... ... ... ... ... ... ... ... ... ...
Joseph R. Biden Jr.: February 7, 2023 0.07 0.04 0.05 0.06 ... 4.23e-04 0.0 0.00e+00 1.27e-03
Joseph R. Biden Jr.: March 7, 2024 0.07 0.03 0.06 0.05 ... 2.42e-04 0.0 0.00e+00 1.21e-03
Donald J. Trump: March 4, 2025 0.06 0.04 0.06 0.04 ... 0.00e+00 0.0 0.00e+00 8.76e-04

235 rows × 500 columns

Reference Slide¶

Finding TF-IDFs¶

  • Finally, we'll need to find the inverse document frequency (IDF) of each term.
  • Using apply, we can find the IDFs of each term and multiply them by the term frequencies in one step.
$$\begin{align*}\text{tfidf}(t, d) &= \underbrace{\frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}}}_{\text{we already computed these}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right)\end{align*} $$
In [20]:
tfidfs = tfs.apply(lambda s: s * np.log(s.shape[0] / (s > 0).sum()))
tfidfs
Out[20]:
the of and to ... increasing desire submitted building
George Washington: January 8, 1790 0.0 0.0 0.0 0.0 ... 5.20e-04 0.0 0.00e+00 0.00e+00
George Washington: December 8, 1790 0.0 0.0 0.0 0.0 ... 0.00e+00 0.0 6.15e-04 0.00e+00
George Washington: October 25, 1791 0.0 0.0 0.0 0.0 ... 2.51e-04 0.0 3.75e-04 0.00e+00
... ... ... ... ... ... ... ... ... ...
Joseph R. Biden Jr.: February 7, 2023 0.0 0.0 0.0 0.0 ... 1.76e-04 0.0 0.00e+00 6.04e-04
Joseph R. Biden Jr.: March 7, 2024 0.0 0.0 0.0 0.0 ... 1.01e-04 0.0 0.00e+00 5.76e-04
Donald J. Trump: March 4, 2025 0.0 0.0 0.0 0.0 ... 0.00e+00 0.0 0.00e+00 4.17e-04

235 rows × 500 columns

  • Why are the TF-IDFs of many common words 0?

Summarizing speechesΒΆ

  • The DataFrame tfidfs now has the TF-IDF of every term in every speech.
    Why are the TF-IDFs of many common terms 0?
In [21]:
tfidfs
Out[21]:
the of and to ... increasing desire submitted building
George Washington: January 8, 1790 0.0 0.0 0.0 0.0 ... 5.20e-04 0.0 0.00e+00 0.00e+00
George Washington: December 8, 1790 0.0 0.0 0.0 0.0 ... 0.00e+00 0.0 6.15e-04 0.00e+00
George Washington: October 25, 1791 0.0 0.0 0.0 0.0 ... 2.51e-04 0.0 3.75e-04 0.00e+00
... ... ... ... ... ... ... ... ... ...
Joseph R. Biden Jr.: February 7, 2023 0.0 0.0 0.0 0.0 ... 1.76e-04 0.0 0.00e+00 6.04e-04
Joseph R. Biden Jr.: March 7, 2024 0.0 0.0 0.0 0.0 ... 1.01e-04 0.0 0.00e+00 5.76e-04
Donald J. Trump: March 4, 2025 0.0 0.0 0.0 0.0 ... 0.00e+00 0.0 0.00e+00 4.17e-04

235 rows × 500 columns

  • By using idxmax, we can find the term with the highest TF-IDF in each speech.
In [22]:
summaries = tfidfs.idxmax(axis=1) 
summaries
Out[22]:
George Washington: January 8, 1790            ought
George Washington: December 8, 1790      convention
George Washington: October 25, 1791       provision
                                            ...    
Joseph R. Biden Jr.: February 7, 2023       tonight
Joseph R. Biden Jr.: March 7, 2024          tonight
Donald J. Trump: March 4, 2025              tonight
Length: 235, dtype: object
  • What if we want to see the 5 terms with the highest TF-IDFs, for each speech?
In [23]:
def five_largest(row):
    return ', '.join(row.index[row.argsort()][-5:])
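  • As an aside, an equivalent version of five_largest uses pandas' nlargest. Since nlargest sorts in descending order, we reverse its result so that the highest-TF-IDF term comes last, matching five_largest (up to tie-breaking):
In [ ]:
def five_largest_alt(row):
    # nlargest(5) returns the 5 biggest values, largest first;
    # reversing puts the highest-TF-IDF term last.
    return ', '.join(row.nlargest(5).index[::-1])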
In [24]:
keywords = tfidfs.apply(five_largest, axis=1).to_frame().rename(columns={0: 'most important terms'})
keywords
Out[24]:
most important terms
George Washington: January 8, 1790 your, opinion, proper, regard, ought
George Washington: December 8, 1790 welfare, case, established, commerce, convention
George Washington: October 25, 1791 community, upon, lands, proper, provision
... ...
Joseph R. Biden Jr.: February 7, 2023 down, percent, let, jobs, tonight
Joseph R. Biden Jr.: March 7, 2024 jobs, down, get, americans, tonight
Donald J. Trump: March 4, 2025 you, want, get, million, tonight

235 rows × 1 columns

  • Run the cell below to see every single row of keywords.
    Cool!
In [25]:
display_df(keywords, rows=235)
most important terms
George Washington: January 8, 1790 your, opinion, proper, regard, ought
George Washington: December 8, 1790 welfare, case, established, commerce, convention
George Washington: October 25, 1791 community, upon, lands, proper, provision
George Washington: November 6, 1792 subject, upon, information, proper, provision
George Washington: December 3, 1793 territory, vessels, executive, shall, ought
George Washington: November 19, 1794 laws, army, let, ought, constitution
George Washington: December 8, 1795 representatives, information, prevent, provisi...
George Washington: December 7, 1796 establishment, republic, treaty, britain, ought
John Adams: November 22, 1797 spain, british, claims, treaty, vessels
John Adams: December 8, 1798 st, minister, treaty, spain, commerce
John Adams: December 3, 1799 civil, period, british, minister, treaty
John Adams: November 11, 1800 experience, protection, navy, commerce, ought
Thomas Jefferson: December 8, 1801 revenue, consideration, shall, vessels, subject
Thomas Jefferson: December 15, 1802 shall, debt, naval, duties, vessels
Thomas Jefferson: October 17, 1803 debt, vessels, sum, millions, friendly
Thomas Jefferson: November 8, 1804 received, having, convention, due, friendly
Thomas Jefferson: December 3, 1805 families, convention, sum, millions, vessels
Thomas Jefferson: December 2, 1806 due, consideration, millions, shall, spain
Thomas Jefferson: October 27, 1807 whether, army, british, vessels, shall
Thomas Jefferson: November 8, 1808 shall, british, millions, commerce, her
James Madison: November 29, 1809 cases, having, due, british, minister
James Madison: December 5, 1810 provisions, view, minister, commerce, british
James Madison: November 5, 1811 britain, provisions, commerce, minister, british
James Madison: November 4, 1812 nor, subject, provisions, britain, british
James Madison: December 7, 1813 number, having, naval, britain, british
James Madison: September 20, 1814 naval, vessels, britain, his, british
James Madison: December 5, 1815 debt, treasury, millions, establishment, sum
James Madison: December 3, 1816 annual, constitution, sum, treasury, british
James Monroe: December 12, 1817 improvement, territory, indian, millions, lands
James Monroe: November 16, 1818 revenue, minister, territory, her, spain
James Monroe: December 7, 1819 parties, friendly, minister, treaty, spain
James Monroe: November 14, 1820 amount, minister, extent, vessels, spain
James Monroe: December 3, 1821 powers, duties, revenue, spain, vessels
James Monroe: December 3, 1822 duties, proper, vessels, spain, convention
James Monroe: December 2, 1823 powers, th, department, minister, spain
James Monroe: December 7, 1824 commerce, spain, governments, convention, parties
John Quincy Adams: December 6, 1825 establishment, commerce, condition, upon, impr...
John Quincy Adams: December 5, 1826 commercial, upon, vessels, british, duties
John Quincy Adams: December 4, 1827 lands, british, receipts, upon, th
John Quincy Adams: December 2, 1828 duties, revenue, upon, commercial, britain
Andrew Jackson: December 8, 1829 attention, subject, her, upon, duties
Andrew Jackson: December 6, 1830 general, subject, vessels, character, upon
Andrew Jackson: December 6, 1831 indian, commerce, claims, treaty, minister
Andrew Jackson: December 4, 1832 general, subject, duties, lands, commerce
Andrew Jackson: December 3, 1833 treasury, convention, minister, spain, duties
Andrew Jackson: December 1, 1834 subject, minister, treaty, claims, upon
Andrew Jackson: December 7, 1835 treaty, upon, claims, subject, minister
Andrew Jackson: December 5, 1836 upon, treasury, duties, revenue, banks
Martin van Buren: December 5, 1837 price, subject, upon, banks, lands
Martin van Buren: December 3, 1838 subject, upon, indian, banks, court
Martin van Buren: December 2, 1839 treasury, duties, extent, institutions, banks
Martin van Buren: December 5, 1840 general, revenue, having, upon, extent
John Tyler: December 7, 1841 consideration, britain, amount, duties, treasury
John Tyler: December 6, 1842 claims, minister, thus, amount, treasury
John Tyler: December 6, 1843 subject, british, her, minister, mexico
John Tyler: December 3, 1844 minister, upon, treaty, her, mexico
James Polk: December 2, 1845 british, convention, territory, duties, mexico
James Polk: December 8, 1846 army, territory, minister, her, mexico
James Polk: December 7, 1847 amount, treaty, her, army, mexico
James Polk: December 5, 1848 tariff, upon, bill, constitution, mexico
Zachary Taylor: December 4, 1849 territory, treaty, recommend, minister, mexico
Millard Fillmore: December 2, 1850 recommend, claims, upon, mexico, duties
Millard Fillmore: December 2, 1851 department, annual, fiscal, subject, mexico
Millard Fillmore: December 6, 1852 duties, navy, mexico, subject, her
Franklin Pierce: December 5, 1853 commercial, regard, construction, upon, subject
Franklin Pierce: December 4, 1854 character, duties, naval, minister, property
Franklin Pierce: December 31, 1855 constitution, british, territory, convention, ...
Franklin Pierce: December 2, 1856 constitution, property, condition, thus, terri...
James Buchanan: December 8, 1857 treaty, territory, constitution, convention, b...
James Buchanan: December 6, 1858 shall, mexico, minister, constitution, territory
James Buchanan: December 19, 1859 republic, th, fiscal, mexico, june
James Buchanan: December 3, 1860 minister, duties, claims, convention, constitu...
Abraham Lincoln: December 3, 1861 army, claims, labor, capital, court
Abraham Lincoln: December 1, 1862 upon, population, shall, per, sum
Abraham Lincoln: December 8, 1863 upon, receipts, subject, navy, naval
Abraham Lincoln: December 6, 1864 condition, secretary, treasury, naval, navy
Andrew Johnson: December 4, 1865 form, commerce, powers, general, constitution
Andrew Johnson: December 3, 1866 thus, june, constitution, mexico, condition
Andrew Johnson: December 3, 1867 june, value, department, upon, constitution
Andrew Johnson: December 9, 1868 millions, amount, expenditures, june, per
Ulysses S. Grant: December 6, 1869 subject, upon, receipts, per, spain
Ulysses S. Grant: December 5, 1870 her, convention, vessels, spain, british
Ulysses S. Grant: December 4, 1871 navy, powers, desire, treaty, recommend
Ulysses S. Grant: December 2, 1872 territory, line, her, britain, treaty
Ulysses S. Grant: December 1, 1873 consideration, subject, amount, banks, claims
Ulysses S. Grant: December 7, 1874 duties, upon, attention, claims, convention
Ulysses S. Grant: December 7, 1875 parties, territory, court, spain, claims
Ulysses S. Grant: December 5, 1876 court, subject, per, commission, claims
Rutherford B. Hayes: December 3, 1877 upon, sum, fiscal, commercial, value
Rutherford B. Hayes: December 2, 1878 per, fiscal, june, secretary, indian
Rutherford B. Hayes: December 1, 1879 subject, territory, june, commission, indian
Rutherford B. Hayes: December 6, 1880 office, subject, relations, attention, commercial
Chester A. Arthur: December 6, 1881 spain, international, british, relations, frie...
Chester A. Arthur: December 4, 1882 territory, mexico, establishment, internationa...
Chester A. Arthur: December 4, 1883 claims, convention, mexico, commission, treaty
Chester A. Arthur: December 1, 1884 treaty, territory, commercial, secretary, vessels
Grover Cleveland: December 8, 1885 duties, vessels, treaty, condition, upon
Grover Cleveland: December 6, 1886 mexico, claims, subject, convention, fiscal
Grover Cleveland: December 6, 1887 condition, sum, thus, price, tariff
Grover Cleveland: December 3, 1888 treaty, upon, secretary, per, june
Benjamin Harrison: December 3, 1889 general, commission, indian, upon, lands
Benjamin Harrison: December 1, 1890 receipts, subject, upon, per, tariff
Benjamin Harrison: December 9, 1891 court, tariff, indian, upon, per
Benjamin Harrison: December 6, 1892 tariff, secretary, upon, value, per
William McKinley: December 6, 1897 conditions, international, upon, territory, spain
William McKinley: December 5, 1898 commission, navy, naval, june, spain
William McKinley: December 5, 1899 treaty, officers, commission, international, c...
William McKinley: December 3, 1900 settlement, civil, shall, convention, commission
Theodore Roosevelt: December 3, 1901 army, commercial, conditions, navy, man
Theodore Roosevelt: December 2, 1902 man, upon, navy, conditions, tariff
Theodore Roosevelt: December 7, 1903 june, lands, territory, property, treaty
Theodore Roosevelt: December 6, 1904 cases, conditions, indian, labor, man
Theodore Roosevelt: December 5, 1905 upon, conditions, commission, cannot, man
Theodore Roosevelt: December 3, 1906 upon, navy, tax, court, man
Theodore Roosevelt: December 3, 1907 conditions, navy, upon, army, man
Theodore Roosevelt: December 8, 1908 man, officers, labor, control, banks
William H. Taft: December 7, 1909 convention, banks, court, department, tariff
William H. Taft: December 6, 1910 department, court, commercial, international, ...
William H. Taft: December 5, 1911 commission, department, per, tariff, court
William H. Taft: December 3, 1912 republic, upon, army, per, department
Woodrow Wilson: December 2, 1913 how, shall, upon, mexico, ought
Woodrow Wilson: December 8, 1914 shall, convention, ought, matter, upon
Woodrow Wilson: December 7, 1915 her, millions, navy, economic, cannot
Woodrow Wilson: December 5, 1916 commerce, upon, shall, bill, commission
Woodrow Wilson: December 4, 1917 desire, her, know, settlement, shall
Woodrow Wilson: December 2, 1918 go, shall, men, back, upon
Woodrow Wilson: December 2, 1919 america, her, budget, labor, conditions
Woodrow Wilson: December 7, 1920 expenditures, receipts, budget, treasury, upon
Warren Harding: December 6, 1921 ought, capital, problems, conditions, tariff
Warren Harding: December 8, 1922 responsibility, republic, problems, ought, per
Calvin Coolidge: December 6, 1923 conditions, production, commission, ought, court
Calvin Coolidge: December 3, 1924 international, navy, desire, economic, court
Calvin Coolidge: December 8, 1925 upon, budget, economic, ought, court
Calvin Coolidge: December 7, 1926 banks, federal, reduction, tariff, ought
Calvin Coolidge: December 6, 1927 construction, banks, per, program, property
Calvin Coolidge: December 4, 1928 federal, production, department, program, per
Herbert Hoover: December 3, 1929 federal, commission, construction, tariff, per
Herbert Hoover: December 2, 1930 about, budget, economic, per, construction
Herbert Hoover: December 8, 1931 upon, construction, federal, economic, banks
Herbert Hoover: December 6, 1932 health, june, value, economic, banks
Franklin D. Roosevelt: January 3, 1934 labor, permanent, problems, cannot, banks
Franklin D. Roosevelt: January 4, 1935 private, work, local, program, cannot
Franklin D. Roosevelt: January 3, 1936 world, shall, let, say, today
Franklin D. Roosevelt: January 6, 1937 powers, convention, needs, help, problems
Franklin D. Roosevelt: January 3, 1938 budget, business, economic, today, income
Franklin D. Roosevelt: January 4, 1939 labor, cannot, capital, income, billion
Franklin D. Roosevelt: January 3, 1940 world, domestic, cannot, economic, today
Franklin D. Roosevelt: January 6, 1941 freedom, problems, cannot, program, today
Franklin D. Roosevelt: January 6, 1942 today, shall, know, forces, production
Franklin D. Roosevelt: January 7, 1943 get, pacific, cannot, americans, production
Franklin D. Roosevelt: January 11, 1944 individual, total, know, economic, cannot
Franklin D. Roosevelt: January 6, 1945 cannot, production, forces, army, jobs
Harry S. Truman: January 21, 1946 fiscal, program, billion, million, dollars
Harry S. Truman: January 6, 1947 commission, budget, economic, labor, program
Harry S. Truman: January 7, 1948 tax, billion, today, program, economic
Harry S. Truman: January 5, 1949 economic, price, program, cannot, production
Harry S. Truman: January 4, 1950 income, today, program, programs, economic
Harry S. Truman: January 8, 1951 help, program, production, strength, economic
Harry S. Truman: January 9, 1952 defense, working, program, help, production
Harry S. Truman: January 7, 1953 republic, free, cannot, world, economic
Dwight D. Eisenhower: February 2, 1953 federal, labor, budget, economic, programs
Dwight D. Eisenhower: January 7, 1954 federal, programs, economic, budget, program
Dwight D. Eisenhower: January 6, 1955 problems, federal, economic, programs, program
Dwight D. Eisenhower: January 5, 1956 billion, federal, problems, economic, program
Dwight D. Eisenhower: January 10, 1957 programs, human, program, economic, today
Dwight D. Eisenhower: January 9, 1958 effort, today, strength, programs, economic
Dwight D. Eisenhower: January 9, 1959 growth, help, billion, programs, economic
Dwight D. Eisenhower: January 7, 1960 cannot, freedom, economic, today, help
Dwight D. Eisenhower: January 12, 1961 million, percent, billion, program, programs
John F. Kennedy: January 30, 1961 development, programs, problems, economic, pro...
John F. Kennedy: January 11, 1962 billion, help, program, jobs, cannot
John F. Kennedy: January 14, 1963 today, cannot, tax, percent, billion
Lyndon B. Johnson: January 8, 1964 help, billion, americans, budget, million
Lyndon B. Johnson: January 4, 1965 americans, man, programs, tonight, help
Lyndon B. Johnson: January 12, 1966 program, percent, help, billion, tonight
Lyndon B. Johnson: January 10, 1967 programs, americans, billion, tonight, percent
Lyndon B. Johnson: January 17, 1968 programs, million, budget, tonight, billion
Lyndon B. Johnson: January 14, 1969 program, billion, budget, think, tonight
Richard Nixon: January 22, 1970 billion, percent, america, today, programs
Richard Nixon: January 22, 1971 america, americans, tonight, budget, let
Richard Nixon: January 20, 1972 program, america, programs, help, today
Richard Nixon: February 2, 1973 economic, help, americans, working, programs
Richard Nixon: January 30, 1974 americans, america, today, energy, tonight
Gerald R. Ford: January 15, 1975 program, percent, billion, programs, energy
Gerald R. Ford: January 19, 1976 federal, americans, budget, jobs, programs
Gerald R. Ford: January 12, 1977 programs, today, percent, jobs, energy
Jimmy Carter: January 19, 1978 tax, cannot, economic, tonight, jobs
Jimmy Carter: January 25, 1979 help, cannot, budget, tonight, americans
Jimmy Carter: January 21, 1980 economic, help, energy, tonight, america
Jimmy Carter: January 16, 1981 administration, economic, energy, program, pro...
Ronald Reagan: January 26, 1982 jobs, help, program, billion, programs
Ronald Reagan: January 25, 1983 problems, programs, americans, economic, percent
Ronald Reagan: January 25, 1984 percent, budget, help, americans, tonight
Ronald Reagan: February 6, 1985 growth, help, tax, jobs, tonight
Ronald Reagan: February 4, 1986 families, cannot, america, budget, tonight
Ronald Reagan: January 27, 1987 percent, america, budget, tonight, let
Ronald Reagan: January 25, 1988 americans, america, budget, let, tonight
George H.W. Bush: February 9, 1989 ask, america, let, budget, tonight
George H.W. Bush: January 31, 1990 percent, america, budget, today, tonight
George H.W. Bush: January 29, 1991 jobs, budget, americans, know, tonight
George H.W. Bush: January 28, 1992 jobs, know, get, tonight, help
William J. Clinton: February 17, 1993 tax, budget, percent, tonight, jobs
William J. Clinton: January 25, 1994 care, americans, health, get, jobs
William J. Clinton: January 24, 1995 jobs, americans, let, get, tonight
William J. Clinton: January 23, 1996 tonight, families, working, americans, children
William J. Clinton: February 4, 1997 america, children, budget, americans, tonight
William J. Clinton: January 27, 1998 ask, americans, children, help, tonight
William J. Clinton: January 19, 1999 today, budget, help, americans, tonight
William J. Clinton: January 27, 2000 families, help, americans, children, tonight
George W. Bush: February 27, 2001 help, tax, percent, tonight, budget
George W. Bush: September 20, 2001 freedom, america, ask, americans, tonight
George W. Bush: January 29, 2002 americans, budget, tonight, america, jobs
George W. Bush: January 28, 2003 america, help, million, americans, tonight
George W. Bush: January 20, 2004 children, america, americans, help, tonight
George W. Bush: February 2, 2005 freedom, tonight, help, social, americans
George W. Bush: January 31, 2006 reform, jobs, americans, america, tonight
George W. Bush: January 23, 2007 america, health, americans, tonight, help
George W. Bush: January 29, 2008 ask, americans, america, tonight, help
Barack Obama: February 24, 2009 banks, know, budget, jobs, tonight
Barack Obama: January 27, 2010 families, get, tonight, americans, jobs
Barack Obama: January 25, 2011 percent, americans, get, tonight, jobs
Barack Obama: January 24, 2012 energy, americans, tonight, get, jobs
Barack Obama: February 12, 2013 let, families, get, tonight, jobs
Barack Obama: January 28, 2014 get, americans, tonight, help, jobs
Barack Obama: January 20, 2015 let, families, americans, tonight, jobs
Barack Obama: January 12, 2016 tonight, america, jobs, americans, get
Donald J. Trump: February 27, 2017 down, america, jobs, americans, tonight
Donald J. Trump: January 30, 2018 tax, america, get, americans, tonight
Donald J. Trump: February 5, 2019 get, america, jobs, americans, tonight
Donald J. Trump: February 4, 2020 america, jobs, americans, percent, tonight
Joseph R. Biden Jr.: April 28, 2021 america, get, americans, percent, jobs
Joseph R. Biden Jr.: March 1, 2022 percent, jobs, get, americans, tonight
Joseph R. Biden Jr.: February 7, 2023 down, percent, let, jobs, tonight
Joseph R. Biden Jr.: March 7, 2024 jobs, down, get, americans, tonight
Donald J. Trump: March 4, 2025 you, want, get, million, tonight

Cosine similarity, revisitedΒΆ

  • Each row of tfidfs contains a vector representation of a speech. This means that we can compute the cosine similarities between any two speeches!
    The only difference now is that we used TF-IDF to find $\vec u$ and $\vec v$, rather than the bag of words model.
$$\text{cosine similarity}(\vec u, \vec v) = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert \lVert \vec{v} \rVert}$$
In [26]:
tfidfs
Out[26]:
the of and to ... increasing desire submitted building
George Washington: January 8, 1790 0.0 0.0 0.0 0.0 ... 5.20e-04 0.0 0.00e+00 0.00e+00
George Washington: December 8, 1790 0.0 0.0 0.0 0.0 ... 0.00e+00 0.0 6.15e-04 0.00e+00
George Washington: October 25, 1791 0.0 0.0 0.0 0.0 ... 2.51e-04 0.0 3.75e-04 0.00e+00
... ... ... ... ... ... ... ... ... ...
Joseph R. Biden Jr.: February 7, 2023 0.0 0.0 0.0 0.0 ... 1.76e-04 0.0 0.00e+00 6.04e-04
Joseph R. Biden Jr.: March 7, 2024 0.0 0.0 0.0 0.0 ... 1.01e-04 0.0 0.00e+00 5.76e-04
Donald J. Trump: March 4, 2025 0.0 0.0 0.0 0.0 ... 0.00e+00 0.0 0.00e+00 4.17e-04

235 rows × 500 columns

In [27]:
def sim(speech_1, speech_2):
    v1 = tfidfs.loc[speech_1]
    v2 = tfidfs.loc[speech_2]
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
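  • As an aside, scipy implements cosine distance, the complement of cosine similarity, so an equivalent sketch of sim is:
In [ ]:
from scipy.spatial.distance import cosine

def sim_scipy(speech_1, speech_2):
    # scipy's cosine computes cosine *distance*, i.e. 1 - cosine similarity.
    return 1 - cosine(tfidfs.loc[speech_1], tfidfs.loc[speech_2])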
In [28]:
sim('George Washington: January 8, 1790', 'George Washington: December 8, 1790')
Out[28]:
0.4808392552136325
In [29]:
sim('George Washington: January 8, 1790', 'Donald J. Trump: March 4, 2025')
Out[29]:
0.09111229184834736
  • We can also find the most similar pair of speeches:
In [30]:
from itertools import combinations
sims_dict = {}
# For every pair of speeches, find the similarity and store it in
# the sims_dict dictionary.
for pair in combinations(tfidfs.index, 2):
    sims_dict[pair] = sim(pair[0], pair[1])
# Turn the sims_dict dictionary into a DataFrame.
sims = (
    pd.Series(sims_dict)
    .reset_index()
    .rename(columns={'level_0': 'speech 1', 'level_1': 'speech 2', 0: 'cosine similarity'})
    .sort_values('cosine similarity', ascending=False)
)
sims
Out[30]:
speech 1 speech 2 cosine similarity
11742 James Polk: December 8, 1846 James Polk: December 7, 1847 0.93
27391 Barack Obama: January 25, 2011 Barack Obama: February 12, 2013 0.93
27404 Barack Obama: January 24, 2012 Barack Obama: February 12, 2013 0.92
... ... ... ...
6744 James Monroe: December 7, 1819 Ronald Reagan: January 26, 1982 0.04
18290 Grover Cleveland: December 6, 1887 George W. Bush: September 20, 2001 0.04
6763 James Monroe: December 7, 1819 George W. Bush: February 27, 2001 0.03

27495 rows × 3 columns

  • Or even the most similar pairs of speeches by different Presidents:
In [31]:
sims[sims['speech 1'].str.split(':').str[0] != sims['speech 2'].str.split(':').str[0]]
Out[31]:
speech 1 speech 2 cosine similarity
27412 Barack Obama: January 24, 2012 Joseph R. Biden Jr.: April 28, 2021 0.88
27470 Donald J. Trump: January 30, 2018 Joseph R. Biden Jr.: March 1, 2022 0.88
27465 Donald J. Trump: February 27, 2017 Joseph R. Biden Jr.: March 7, 2024 0.87
... ... ... ...
6744 James Monroe: December 7, 1819 Ronald Reagan: January 26, 1982 0.04
18290 Grover Cleveland: December 6, 1887 George W. Bush: September 20, 2001 0.04
6763 James Monroe: December 7, 1819 George W. Bush: February 27, 2001 0.03

26854 rows × 3 columns

Aside: What if we remove the $\log$ from $\text{idf}(t)$?¶

  • Let's try it and see what happens.
    Below is another, quicker implementation of how we might find TF-IDFs.
In [32]:
tfidfs_nl_dict = {}
tf_denom = speeches['text'].str.split().str.len()
for word in tqdm(unique_terms):
    re_pat = fr' {word} ' # Imperfect pattern for speed.
    tf = speeches['text'].str.count(re_pat) / tf_denom
    idf_nl = len(speeches) / speeches['text'].str.contains(re_pat).sum()
    tfidfs_nl_dict[word] = tf * idf_nl
In [33]:
tfidfs_nl = pd.DataFrame(tfidfs_nl_dict)
tfidfs_nl.head()
Out[33]:
the of and to ... increasing desire submitted building
George Washington: January 8, 1790 0.09 0.06 0.04 0.05 ... 1.39e-03 0.00e+00 0.00e+00 0.0
George Washington: December 8, 1790 0.09 0.06 0.03 0.03 ... 0.00e+00 0.00e+00 1.33e-03 0.0
George Washington: October 25, 1791 0.11 0.07 0.03 0.04 ... 6.58e-04 0.00e+00 8.09e-04 0.0
George Washington: November 6, 1792 0.09 0.07 0.03 0.04 ... 0.00e+00 8.78e-04 0.00e+00 0.0
George Washington: December 3, 1793 0.09 0.07 0.02 0.04 ... 0.00e+00 1.87e-03 0.00e+00 0.0

5 rows × 500 columns

In [34]:
keywords_nl = tfidfs_nl.apply(five_largest, axis=1)
keywords_nl
Out[34]:
George Washington: January 8, 1790        a, and, to, of, the
George Washington: December 8, 1790      in, and, to, of, the
George Washington: October 25, 1791       a, and, to, of, the
                                                 ...         
Joseph R. Biden Jr.: February 7, 2023     a, of, and, to, the
Joseph R. Biden Jr.: March 7, 2024        a, of, to, and, the
Donald J. Trump: March 4, 2025            a, of, to, the, and
Length: 235, dtype: object
  • What do you notice?

The role of $\log$ in $\text{idf}(t)$¶

$$ \begin{align*}\text{tfidf}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t) \\ &= \frac{\text{# of occurrences of $t$ in $d$}}{\text{total # of terms in $d$}} \cdot \log \left(\frac{\text{total # of documents}}{\text{# of documents in which $t$ appears}} \right) \end{align*} $$
  • Remember, for any positive input $x$, $\log(x)$ is (much) smaller than $x$.
  • In $\text{idf}(t)$, the $\log$ "dampens" the impact of the ratio $\frac{\text{# documents}}{\text{# documents with $t$}}$. If a word is very common, the ratio will be close to 1. The log of the ratio will be close to 0.
In [35]:
(1000 / 999)
Out[35]:
1.001001001001001
In [36]:
np.log(1000 / 999)
Out[36]:
0.001000500333583622
  • If a word is very common (e.g. "the"), and we didn't have the $\log$, we'd be multiplying the term frequency by a large factor.
  • If a word is very rare, the ratio $\frac{\text{# documents}}{\text{# documents with $t$}}$ will be very large. However, a word seen in 2 out of 50 documents is not very different from one seen in 2 out of 500 documents (it is very rare in both cases), so $\text{idf}(t)$ should be similar in both cases.
In [37]:
(50 / 2)
Out[37]:
25.0
In [38]:
(500 / 2)
Out[38]:
250.0
In [39]:
np.log(50 / 2)
Out[39]:
3.2188758248682006
In [40]:
np.log(500 / 2)
Out[40]:
5.521460917862246

TF-IDF in practiceΒΆ

  • In Homework 6, we will ask you to implement TF-IDF to further your own understanding.
  • But, in practical projects – like the Final Project – you'd use an existing implementation of it, like TfidfVectorizer in sklearn.
    See the documentation for more details.
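  • Note that sklearn's values won't exactly match the TF-IDFs we computed by hand: by default, TfidfVectorizer uses a smoothed IDF, $\ln \left( \frac{1 \, + \, \text{total # of documents}}{1 \, + \, \text{# of documents in which $t$ appears}} \right) + 1$, and normalizes each row to have length 1 (see the smooth_idf and norm parameters in the documentation). Here's a sketch on our toy corpus from earlier:
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

toy = ['big big big big data class', 'data big data science', 'science big data']
toy_vec = TfidfVectorizer()
toy_tfidfs = pd.DataFrame(toy_vec.fit_transform(toy).toarray(),
                          columns=toy_vec.get_feature_names_out(),
                          index=toy)
# Unlike in our manual calculation, 'big' no longer scores 0,
# since sklearn adds 1 to each IDF.
toy_tfidfs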
In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [42]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(speeches['text'])
tfidfs_sklearn = pd.DataFrame(X.toarray(), 
                              columns=vectorizer.get_feature_names_out(), 
                              index=speeches.index)
In [43]:
tfidfs_sklearn
Out[43]:
aaa aaron abandon abandoned ... zones zoological zooming zuloaga
George Washington: January 8, 1790 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
George Washington: December 8, 1790 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
George Washington: October 25, 1791 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ...
Joseph R. Biden Jr.: February 7, 2023 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Joseph R. Biden Jr.: March 7, 2024 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Donald J. Trump: March 4, 2025 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

235 rows × 23998 columns

In [44]:
tfidfs_sklearn[tfidfs_sklearn['zuloaga'] != 0]
Out[44]:
aaa aaron abandon abandoned ... zones zoological zooming zuloaga
James Buchanan: December 19, 1859 0.0 0.0 0.0 0.00e+00 ... 0.0 0.0 0.0 1.34e-02
James Buchanan: December 3, 1860 0.0 0.0 0.0 1.30e-03 ... 0.0 0.0 0.0 2.92e-03

2 rows × 23998 columns

What's next?¶

  • The remainder of the semester will be more mathematical in nature.
  • But, that mathematical understanding will enable us to perform more interesting analyses!
In [45]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import PCA
pipeline = make_pipeline(
    TfidfVectorizer(),
    PCA(n_components=2),
)
pipeline.fit(speeches['text'])
scores = pipeline.transform(speeches['text'])
fig = px.scatter(x=scores[:, 0], 
           y=scores[:, 1], 
           hover_name=speeches['text'].index, 
           color=speeches['text'].index.str.split(', ').str[-1].astype(int),
           color_continuous_scale='Turbo',
           size_max=12,
           size=np.ones(len(scores)))
fig.update_layout(xaxis_title='PC 1', 
                  yaxis_title='PC 2', 
                  title='PC 2 vs. PC 1 of TF-IDF-encoded<br>Presidential Speeches',
                  width=1000, height=600)

SummaryΒΆ

  • One way to turn text, like 'big big big big data class', into numbers is to count the number of occurrences of each word in the document, ignoring order. This is done using the bag of words model.
  • Term frequency-inverse document frequency (TF-IDF) is a statistic that tries to quantify how important a term (word) is to a document. It balances:
    • how often a term appears in a particular document, $\text{tf}(t, d)$, with
    • how often a term appears across documents, $\text{idf}(t)$.
    • For a given document, the word with the highest TF-IDF is thought to "best summarize" that document.
  • Both the bag of words model and TF-IDF are ways of converting texts to vector representations.
  • To measure the similarity of two texts, convert the texts to their vector representations, and use cosine similarity.