Finding meaningful multi-word phrases in discourse
Published
2026-01-25 11:57:14
1 Learning objectives
By the end of this lab, you will be able to:
Extract bigrams (two-word sequences) from text with proper filtering
Understand why stopword and POS filtering are essential for meaningful phrases
Use Pointwise Mutual Information (PMI) to identify statistically significant collocations
Compare collocations across political groups and time periods
Interpret PMI values for substantive research findings
Visualize collocation differences using bar charts
2 A note on the dataset and our goals
When you examine the results, focus on understanding:
How the extraction process works
Why filtering improves output quality
How PMI identifies meaningful associations
How to compare collocations across groups
The same methods apply to any text corpus: product reviews, social media posts, interview transcripts, or documents from your own research domain. The techniques you learn here transfer directly to your own data.
Lab 01 introduced word frequency counting, and now we count phrase frequency. Lab 02 showed corpus comparison, and now we compare phrases across groups. Lab 03 introduced PMI for word-category associations, and now we use PMI for word-word associations. Lab 04 used TF-IDF to identify important words, and we can use this to filter important words before extracting collocations.
3 Introduction: the multi-word phrase problem
In Labs 01-04, we’ve treated words as independent units. We’ve counted them, compared their frequencies across corpora, measured their associations, and compared entire documents using vector representations.
But language doesn’t work in isolated words. Consider “climate change”: this phrase conveys a distinct concept that neither “climate” nor “change” alone captures. Similarly, “health care” means something different from the simple combination of “health” and “care.” These multi-word expressions pose a challenge for text analysis methods that treat words independently.
3.1 Why word counting fails
Consider this sentence from a State of the Union address:
“We must strengthen our economy, create new jobs, and protect working families.”
Word frequency analysis would note that “economy,” “jobs,” “families,” and “working” each appear once. But it would miss the phrase “working families” as a political talking point, and the connection between “new” and “jobs” as “new jobs.” We need methods that capture word sequences instead of individual words.
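A small sketch makes the gap concrete. It uses plain whitespace splitting instead of spaCy (an assumption made only to keep the example self-contained) and compares the sentence above with an alphabetically sorted copy of the same words:

```python
from collections import Counter

sentence = "we must strengthen our economy create new jobs and protect working families"
words = sentence.split()
scrambled = sorted(words)  # same words, original order destroyed


def adjacent_pairs(tokens):
    """Return the list of consecutive word pairs."""
    return list(zip(tokens[:-1], tokens[1:]))


# Word counts cannot tell the two versions apart...
same_counts = Counter(words) == Counter(scrambled)

# ...but consecutive pairs can.
in_original = ("working", "families") in adjacent_pairs(words)
in_scrambled = ("working", "families") in adjacent_pairs(scrambled)

print(same_counts, in_original, in_scrambled)  # True True False
```

Any analysis built only on word counts treats the speech and its scrambled version as identical, which is why the rest of this lab works with word sequences.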
3.2 What are collocations?
A collocation is a sequence of words that habitually appear together and form a meaningful unit. Examples from political speech include “middle class,” “national security,” and “climate change.” These differ from arbitrary word sequences like “the nation” (a grammatical artifact) or “’s economy” (a tokenization artifact from possessive constructions like “America’s economy”).
3.3 Validation with PMI
Frequency alone doesn’t tell us whether a word pair forms a true collocation. If “health care” appears 200 times, is this significant? It depends on how often “health” and “care” appear independently. If both words are very common, 200 co-occurrences might be expected by chance. But if both words are relatively rare, 200 co-occurrences would be surprisingly high.
We need a measure that accounts for:
How often the phrase appears together
How often each word appears separately
Whether the co-occurrence is more frequent than chance would predict
Pointwise Mutual Information (PMI), which we learned in Lab 03, measures exactly this: how much more often two words appear together than we’d expect by chance.
4 Setup: loading packages
# Data manipulation
import pandas as pd
import numpy as np

# Text processing
import spacy
from collections import Counter

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

print("Packages loaded successfully")
Packages loaded successfully
We continue using the same packages from previous labs: pandas for data manipulation, spacy for text processing with part-of-speech tagging, and matplotlib/seaborn for visualization.
4.1 Loading spaCy model
# Load English language model for text processing
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 1530000  # Increase limit for long documents

print(f"spaCy model loaded: {nlp.meta['name']}")
spaCy model loaded: core_web_sm
5 Loading and preparing the data
We’ll continue working with State of the Union addresses:
# Load the data
speeches = pd.read_csv("data/transcripts.csv")
speeches['date'] = pd.to_datetime(speeches['date'])
speeches['year'] = speeches['date'].dt.year

print(f"Total speeches: {len(speeches)}")
print(f"Date range: {speeches['year'].min()} to {speeches['year'].max()}")
print(f"Presidents: {speeches['president'].nunique()}")

speeches.head()
Total speeches: 244
Date range: 1790 to 2018
Presidents: 42
   date        president        title                                              url                                                transcript                                         year
0  2018-01-30  Donald J. Trump  Address Before a Joint Session of the Congress...  https://www.cnn.com/2018/01/30/politics/2018-s...  \nMr. Speaker, Mr. Vice President, Members of ...  2018
1  2017-02-28  Donald J. Trump  Address Before a Joint Session of the Congress     http://www.presidency.ucsb.edu/ws/index.php?pi...  Thank you very much. Mr. Speaker, Mr. Vice Pre...  2017
2  2016-01-12  Barack Obama     Address Before a Joint Session of the Congress...  http://www.presidency.ucsb.edu/ws/index.php?pi...  Thank you. Mr. Speaker, Mr. Vice President, Me...  2016
3  2015-01-20  Barack Obama     Address Before a Joint Session of the Congress...  http://www.presidency.ucsb.edu/ws/index.php?pi...  The President. Mr. Speaker, Mr. Vice President...  2015
4  2014-01-28  Barack Obama     Address Before a Joint Session of the Congress...  http://www.presidency.ucsb.edu/ws/index.php?pi...  The President. Mr. Speaker, Mr. Vice President...  2014
5.1 Creating party labels
For comparing collocations across political groups, we need party affiliation labels:
# Define party affiliations
# This is a simplified mapping for the SOTU dataset
party_map = {
    'Harry S. Truman': 'Democrat',
    'Dwight D. Eisenhower': 'Republican',
    'John F. Kennedy': 'Democrat',
    'Lyndon B. Johnson': 'Democrat',
    'Richard Nixon': 'Republican',
    'Gerald Ford': 'Republican',
    'Jimmy Carter': 'Democrat',
    'Ronald Reagan': 'Republican',
    'George Bush': 'Republican',
    'William J. Clinton': 'Democrat',
    'George W. Bush': 'Republican',
    'Barack Obama': 'Democrat',
    'Donald J. Trump': 'Republican'
}

speeches['party'] = speeches['president'].map(party_map)

print("Speeches by party:")
print(speeches['party'].value_counts())
Speeches by party:
party
Republican 44
Democrat 40
Name: count, dtype: int64
6 Part 1: extracting bigrams properly
6.1 What are bigrams?
A bigram is a sequence of two consecutive words. From the text “We need affordable health care,” we would extract: (“we”, “need”), (“need”, “affordable”), (“affordable”, “health”), (“health”, “care”). Bigrams capture two-word phrases and collocations that single words miss.
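The extraction described above can be sketched in a few lines. This version tokenizes with `str.split()` rather than spaCy, which is an assumption made for brevity and only works because the example sentence has no punctuation:

```python
def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens[:-1], tokens[1:]))


tokens = "We need affordable health care".lower().split()
pairs = bigrams(tokens)

print(pairs)
# [('we', 'need'), ('need', 'affordable'), ('affordable', 'health'), ('health', 'care')]
```

A text with n tokens always yields n - 1 bigrams, so the number of candidate phrases grows with the size of the corpus and most of them need to be filtered out.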
6.2 The naive approach and its problems
Let’s start with a simple bigram extractor:
def extract_bigrams_naive(text):
    """Extract bigrams without filtering."""
    doc = nlp(text.lower())
    tokens = [token.text for token in doc
              if not token.is_punct and not token.is_space]
    bigrams = list(zip(tokens[:-1], tokens[1:]))
    return bigrams

# Test with sample sentence
sample = "We must strengthen the economy and protect working families"
sample_bigrams = extract_bigrams_naive(sample)

print("Naive bigrams:")
for bg in sample_bigrams:
    print(f" {bg[0]} → {bg[1]}")
Naive bigrams:
we → must
must → strengthen
strengthen → the
the → economy
economy → and
and → protect
protect → working
working → families
This approach extracts all consecutive word pairs, but most are not meaningful collocations. The output includes grammatical artifacts like “the economy” and “and protect”, function word combinations that provide no conceptual content. Only “working families” looks like an actual phrase.
6.3 Problem 1: stopwords create noise
Stopwords are common function words like “the,” “a,” “of,” and “to” that provide grammatical structure but little meaning. Without filtering, we get bigrams like “the nation,” “of the,” and “a new”—these are grammatical artifacts, not meaningful phrases.
A common mistake is to only remove bigrams where both words are stopwords:
# This approach is too permissive
if not (is_stop(w1) and is_stop(w2)):
    bigrams.append((w1, w2))  # keep bigram
This allows “the nation” and “a new” (one stopword, one content word) to pass through. We need to filter if either word is a stopword:
# Better approach
if not (is_stop(w1) or is_stop(w2)):
    bigrams.append((w1, w2))  # keep bigram
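To see how much the choice between the two rules matters, here is a minimal sketch that applies both to the sample sentence. The `STOPWORDS` set and the `is_stop()` helper are hypothetical stand-ins for illustration; the lab itself relies on spaCy's `token.is_stop`.

```python
# Tiny hand-picked stopword set for illustration only (not spaCy's list)
STOPWORDS = {"we", "must", "the", "and"}


def is_stop(word):
    """Hypothetical helper: True if the word is in our toy stopword set."""
    return word in STOPWORDS


tokens = "we must strengthen the economy and protect working families".split()
pairs = list(zip(tokens[:-1], tokens[1:]))

# Permissive rule: drop a pair only when BOTH words are stopwords
keep_if_not_both = [(w1, w2) for w1, w2 in pairs
                    if not (is_stop(w1) and is_stop(w2))]

# Strict rule: drop a pair when EITHER word is a stopword
keep_if_neither = [(w1, w2) for w1, w2 in pairs
                   if not (is_stop(w1) or is_stop(w2))]

print(len(pairs), len(keep_if_not_both), len(keep_if_neither))  # 8 7 2
print(keep_if_neither)  # [('protect', 'working'), ('working', 'families')]
```

The permissive rule removes only one of the eight pairs, while the strict rule leaves two.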
6.4 Problem 2: possessive markers create artifacts
When spaCy tokenizes “America’s economy,” it produces [“America”, “’s”, “economy”], creating uninformative bigrams like (“America”, “’s”) and (“’s”, “economy”). The possessive marker “’s” is just grammatical structure (part-of-speech tag: PART), not a content word.
6.5 Problem 3: not all content words form meaningful phrases
Even after removing stopwords and possessives, we get bigrams like (“economy”, “protect”) from “strengthen the economy and protect families”—words that happen to be adjacent but don’t form a meaningful unit. We need part-of-speech (POS) filtering to keep only meaningful patterns like noun+noun (“health care”), adjective+noun (“working families”), or verb+noun (“create jobs”).
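Pattern-based POS filtering can be sketched as follows. The tags are assigned by hand here, which is an assumption for illustration; in the lab they come from spaCy's `token.pos_`, and the tagger may label a word such as “working” differently.

```python
# Hand-tagged content words left over after stopword removal (assumed tags)
tagged = [
    ("strengthen", "VERB"),
    ("economy", "NOUN"),
    ("protect", "VERB"),
    ("working", "ADJ"),
    ("families", "NOUN"),
]

# The three patterns named in the text above
ALLOWED_PATTERNS = {("NOUN", "NOUN"), ("ADJ", "NOUN"), ("VERB", "NOUN")}

kept = [
    (w1, w2)
    for (w1, pos1), (w2, pos2) in zip(tagged[:-1], tagged[1:])
    if (pos1, pos2) in ALLOWED_PATTERNS
]

print(kept)  # [('strengthen', 'economy'), ('working', 'families')]
```

The noun+verb pair (“economy”, “protect”) is rejected, as intended. A looser alternative is to accept any pair in which both tags are content-word tags; that keeps more candidates at the cost of admitting pairs such as (“protect”, “working”).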
6.6 The proper bigram extractor
Let’s implement comprehensive filtering:
def extract_bigrams(text, remove_stopwords=True, pos_filter=True):
    """
    Extract bigrams with proper filtering.

    Returns list of (word1, word2) tuples.
    """
    doc = nlp(text.lower())

    # Extract tokens with POS information
    # Filter: no punctuation, no spaces, no numbers, no possessives
    tokens = [
        (token.text, token.pos_, token.is_stop)
        for token in doc
        if not token.is_punct
        and not token.is_space
        and not token.like_num
        and token.pos_ != 'PART'  # Remove possessive markers like 's
        and len(token.text) > 1   # Remove single characters
    ]

    # Create bigrams
    bigrams_raw = list(zip(tokens[:-1], tokens[1:]))

    # Apply filters
    bigrams = []
    for (w1, pos1, stop1), (w2, pos2, stop2) in bigrams_raw:
        # Filter: remove if either word is a stopword
        if remove_stopwords and (stop1 or stop2):
            continue

        # Filter: keep only content word combinations
        if pos_filter:
            content_pos = {'NOUN', 'PROPN', 'ADJ', 'VERB'}
            if pos1 not in content_pos or pos2 not in content_pos:
                continue

        bigrams.append((w1, w2))

    return bigrams

# Test with same sample sentence
sample_bigrams_filtered = extract_bigrams(sample, remove_stopwords=True, pos_filter=True)

print("Filtered bigrams:")
for bg in sample_bigrams_filtered:
    print(f" {bg[0]} → {bg[1]}")
Filtered bigrams:
protect → working
working → families
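Now only two bigrams survive: “protect working” and “working families.” Both pass because each word is a non-stopword tagged as a content word. Pairs like “the economy” and “we must” are gone, and possessive markers such as “’s” are dropped before pairing. Note what the extractor does not do: bigrams are formed from adjacent tokens before the stopword check, so a pair that spans a removed word, such as “strengthen economy” (separated by “the”), is never created. Note also that “protect working” is not a meaningful phrase. Filtering narrows the candidates, but we still need a statistical test (PMI, in Part 2) to separate true collocations from words that merely happen to be adjacent.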
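The two-line output above follows from the order of operations in `extract_bigrams()`: adjacent pairs are formed first, and pairs containing a stopword are dropped afterwards. The sketch below contrasts that with the alternative of dropping stopwords first. The stopword flags are hand-assigned to match the output shown (an assumption; spaCy supplies them in the lab), and both function names are made up for this example.

```python
# (word, is_stopword) pairs for the sample sentence; flags assigned by hand
tokens = [
    ("we", True), ("must", True), ("strengthen", False), ("the", True),
    ("economy", False), ("and", True), ("protect", False),
    ("working", False), ("families", False),
]


def pair_then_filter(tokens):
    """Form adjacent pairs first, then drop pairs containing a stopword."""
    pairs = zip(tokens[:-1], tokens[1:])
    return [(w1, w2) for (w1, s1), (w2, s2) in pairs if not (s1 or s2)]


def filter_then_pair(tokens):
    """Drop stopwords first, then pair whatever ends up adjacent."""
    content = [word for word, is_stop in tokens if not is_stop]
    return list(zip(content[:-1], content[1:]))


print(pair_then_filter(tokens))
# [('protect', 'working'), ('working', 'families')]
print(filter_then_pair(tokens))
# [('strengthen', 'economy'), ('economy', 'protect'), ('protect', 'working'), ('working', 'families')]
```

Pairing first keeps only words that were truly adjacent in the text. Filtering first recovers “strengthen economy” but also manufactures “economy protect,” a pair that never occurred side by side. Which trade-off is preferable depends on the research question.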
6.7 Before and after comparison
Let’s compare naive vs. filtered extraction on a real speech:
# Get a sample speech
sample_speech = speeches.iloc[0]['transcript'][:1000]  # First 1000 chars

# Extract both ways
naive_bigrams = extract_bigrams_naive(sample_speech)
filtered_bigrams = extract_bigrams(sample_speech, remove_stopwords=True, pos_filter=True)

# most_common() yields ((word1, word2), count) pairs, so unpack the bigram
# and ignore the count
print("NAIVE EXTRACTION:")
print(f"Total bigrams: {len(naive_bigrams)}")
print("Top 10:", [f"{w1} {w2}" for (w1, w2), _ in Counter(naive_bigrams).most_common(10)])

print("\nFILTERED EXTRACTION:")
print(f"Total bigrams: {len(filtered_bigrams)}")
print("Top 10:", [f"{w1} {w2}" for (w1, w2), _ in Counter(filtered_bigrams).most_common(10)])
Filtering dramatically improves the quality of extracted phrases by removing grammatical noise and keeping only meaningful word combinations.
6.8 Extracting bigrams from all speeches
Now let’s extract bigrams from our full dataset. Be aware that this can take a while, because spaCy has to process every speech in full.
# Extract all bigrams
all_bigrams = []

print("Extracting bigrams from all speeches...")
for text in speeches['transcript']:
    bigrams = extract_bigrams(text, remove_stopwords=True, pos_filter=True)
    all_bigrams.extend(bigrams)

print(f"Total bigrams extracted: {len(all_bigrams):,}")

# Count frequencies
bigram_counts = Counter(all_bigrams)

# Convert to DataFrame
bigram_df = pd.DataFrame(
    bigram_counts.most_common(100),
    columns=['bigram', 'count']
)
bigram_df['bigram_str'] = bigram_df['bigram'].apply(lambda x: f"{x[0]} {x[1]}")

print("\nMost common bigrams:")
bigram_df.head(20)
Extracting bigrams from all speeches...
Total bigrams extracted: 564,724
Most common bigrams:
    bigram                 count  bigram_str
0   (united, states)        9365  united states
1   (fiscal, year)          1696  fiscal year
2   (federal, government)   1041  federal government
3   (great, britain)        1026  great britain
4   (american, people)       863  american people
5   (past, year)             627  past year
6   (fellow, citizens)       562  fellow citizens
7   (public, debt)           544  public debt
8   (year, ending)           519  year ending
9   (health, care)           508  health care
10  (public, lands)          508  public lands
11  (social, security)       474  social security
12  (past, years)            449  past years
13  (postmaster, general)    421  postmaster general
14  (post, office)           399  post office
15  (annual, message)        392  annual message
16  (ending, june)           379  ending june
17  (civil, service)         378  civil service
18  (united, nations)        353  united nations
19  (soviet, union)          346  soviet union
6.9 Visualizing top bigrams
# Plot top 20 bigrams
fig, ax = plt.subplots(figsize=(10, 8))

top_20 = bigram_df.head(20)

# Assign the y variable to hue (with legend=False) so that palette can be
# used without triggering seaborn's deprecation warning
sns.barplot(data=top_20, y='bigram_str', x='count',
            hue='bigram_str', palette='viridis', legend=False, ax=ax)

ax.set_xlabel('Frequency', fontsize=12)
ax.set_ylabel('Bigram', fontsize=12)
ax.set_title('Most frequent bigrams in State of the Union speeches\n(with stopword and POS filtering)',
             fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()
With proper filtering, the top bigrams reveal domain-specific multi-word terms rather than grammatical noise. Notice that many of these are proper names and institutional references—this is characteristic of political speech, where naming specific people, places, and organizations is common. The filtering methods work the same way on any corpus; a product review corpus would show different domain-specific phrases like “customer service” or “battery life.”
7 Part 2: finding meaningful collocations with PMI
Frequency alone doesn’t tell us if a bigram is a true collocation—a statistically significant phrase.
7.1 The frequency problem
Consider two bigrams: “health care” appears 450 times, and “opportunity economy” appears 50 times. Which is a stronger collocation? If “health” and “care” each appear thousands of times independently, their co-occurrence might be expected by chance. But if “opportunity” and “economy” rarely co-occur otherwise, 50 instances could be surprisingly high.
We need to compare observed co-occurrence against expected co-occurrence based on individual word frequencies.
7.2 What is PMI?
Pointwise Mutual Information (PMI) measures how much more often two words appear together than we’d expect by chance:
PMI = 0: Words appear together exactly as expected by chance (independent)
PMI > 0: Words appear together more than expected (positive association / collocation)
PMI < 0: Words appear together less than expected (negative association / avoid each other)
PMI = 3: Words appear 2³ = 8 times more often together than expected
Rule of thumb for collocation strength:
PMI value   Interpretation
PMI < 0     Not a collocation
PMI ≈ 0     Random co-occurrence
PMI > 0     Weak collocation
PMI > 3     Strong collocation
PMI > 6     Very strong collocation
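Because PMI is a base-2 logarithm, a score converts back to an “observed versus expected” ratio by raising 2 to that power. The sketch below does the conversion and applies the rule-of-thumb labels from the table. Both function names are made up for this example, and the `tolerance` that decides what counts as “≈ 0” is an assumed value, not something the table specifies.

```python
def fold_enrichment(pmi):
    """How many times more often the pair occurs than independence predicts."""
    return 2 ** pmi


def collocation_strength(pmi, tolerance=0.1):
    """Map a PMI score to the rule-of-thumb labels in the table above.

    `tolerance` (how close to 0 counts as "≈ 0") is an assumed value.
    """
    if pmi > 6:
        return "very strong collocation"
    if pmi > 3:
        return "strong collocation"
    if pmi > tolerance:
        return "weak collocation"
    if pmi >= -tolerance:
        return "random co-occurrence"
    return "not a collocation"


print(fold_enrichment(3))          # 8
print(collocation_strength(7.5))   # very strong collocation
print(collocation_strength(-1.2))  # not a collocation
```

Thinking in ratios helps with interpretation: the jump from PMI 3 to PMI 6 is a jump from 8 times to 64 times the expected rate.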
7.3 A concrete example
Let’s work through “health care” step by step.
Given:
Total bigrams in corpus: 500,000
“health care” appears: 450 times
“health” appears (in any bigram): 2,000 times
“care” appears (in any bigram): 1,500 times
PMI compares the probability of the bigram with the product of the individual word probabilities:
\[\text{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}\]
where:
\(P(w_1, w_2)\) = probability of seeing the bigram
\(P(w_1)\) = probability of seeing word1 (in any bigram)
\(P(w_2)\) = probability of seeing word2 (in any bigram)
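The logarithm converts multiplicative relationships to additive ones, which are easier to interpret. PMI is symmetric in its two events: PMI(A,B) = PMI(B,A). For bigrams, however, the events are positional (A as the first word, B as the second), so the ordered bigrams (A, B) and (B, A) are counted and scored separately. In Lab 03, we used PMI to measure word associations with categories (Democrat/Republican). Here, we measure associations between word pairs.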
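Plugging the “health care” numbers given above into PMI = log2(P(w1, w2) / (P(w1) · P(w2))) gives the following. The variable names are chosen for this sketch; the counts are the illustrative ones from the example, not the real corpus counts.

```python
import math

total_bigrams = 500_000
count_together = 450   # "health care"
count_w1 = 2_000       # "health" in any bigram
count_w2 = 1_500       # "care" in any bigram

# Co-occurrences we would expect if the two words were independent
expected = count_w1 * count_w2 / total_bigrams

# Observed-to-expected ratio; algebraically equal to P(w1, w2) / (P(w1) * P(w2))
ratio = count_together / expected

# PMI in bits
pmi = math.log2(ratio)

print(expected)       # 6.0
print(ratio)          # 75.0
print(round(pmi, 2))  # 6.23
```

Under independence we would expect only 6 co-occurrences, but we observe 450, which is 75 times as many. A PMI of about 6.23 falls in the “very strong collocation” band.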
7.4 Calculating PMI for bigrams
Let’s implement PMI calculation:
def calculate_pmi_for_bigrams(bigram_df, all_bigrams):
    """
    Calculate PMI for each bigram.

    Returns DataFrame with added PMI column.
    """
    # Total number of bigrams
    total_bigrams = len(all_bigrams)

    # Count individual word occurrences (from all bigrams)
    word1_counts = Counter([bg[0] for bg in all_bigrams])
    word2_counts = Counter([bg[1] for bg in all_bigrams])

    # Calculate PMI for each bigram
    pmi_values = []
    for _, row in bigram_df.iterrows():
        w1, w2 = row['bigram']
        count_together = row['count']

        # Get individual counts
        count_w1 = word1_counts[w1]
        count_w2 = word2_counts[w2]

        # Calculate probabilities
        p_together = count_together / total_bigrams
        p_w1 = count_w1 / total_bigrams
        p_w2 = count_w2 / total_bigrams

        # Calculate PMI (add small epsilon to avoid log(0))
        epsilon = 1e-10
        pmi = np.log2((p_together + epsilon) / ((p_w1 * p_w2) + epsilon))
        pmi_values.append(pmi)

    bigram_df['pmi'] = pmi_values
    return bigram_df

# Calculate PMI
bigram_df = calculate_pmi_for_bigrams(bigram_df, all_bigrams)

print("Bigrams with PMI scores:")
print("\nTop 20 by frequency:")
print(bigram_df.head(20)[['bigram_str', 'count', 'pmi']])

print("\nTop 20 by PMI:")
print(bigram_df.nlargest(20, 'pmi')[['bigram_str', 'count', 'pmi']])
Bigrams with PMI scores:
Top 20 by frequency:
bigram_str count pmi
0 united states 9365 5.628802
1 fiscal year 1696 6.871318
2 federal government 1041 4.779861
3 great britain 1026 6.499146
4 american people 863 5.213916
5 past year 627 5.782482
6 fellow citizens 562 8.157588
7 public debt 544 5.564640
8 year ending 519 7.502743
9 health care 508 7.514194
10 public lands 508 5.573711
11 social security 474 7.531967
12 past years 449 6.712649
13 postmaster general 421 8.760305
14 post office 399 8.438991
15 annual message 392 8.019959
16 ending june 379 9.653955
17 civil service 378 6.083541
18 united nations 353 3.656138
19 soviet union 346 8.791776
Top 20 by PMI:
bigram_str count pmi
78 merchant marine 152 10.714477
94 sinking fund 141 10.096041
77 panama canal 152 10.075506
42 vice president 217 9.900207
54 white house 193 9.770017
16 ending june 379 9.653955
30 middle east 248 9.560037
23 supreme court 301 9.368994
19 soviet union 346 8.791776
13 postmaster general 421 8.760305
46 attorney general 208 8.726112
29 interstate commerce 257 8.467558
14 post office 399 8.438991
58 low income 184 8.400325
60 treasury notes 180 8.396572
39 long term 222 8.386739
47 private sector 203 8.356946
85 nuclear weapons 148 8.159649
6 fellow citizens 562 8.157588
31 armed forces 245 8.075245
Notice the difference between ranking by frequency versus PMI. Frequent bigrams may have moderate PMI scores if the individual words are also common. Rarer bigrams may have very high PMI if the words strongly prefer each other.
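The next output comes from filtering for strong collocations; the cell that produced it is not shown here. A minimal sketch of such a filter, run on a toy DataFrame rather than the real `bigram_df`, might look like this. Three rows are copied from the output above, and the fourth is an invented placeholder so that the filter has something to remove.

```python
import pandas as pd

# Three rows copied from the output above, plus one invented low-PMI row
# (hypothetical, not from the corpus) so the filter has something to drop
toy_df = pd.DataFrame({
    "bigram_str": ["united nations", "merchant marine",
                   "federal government", "hypothetical pair"],
    "count": [353, 152, 1041, 120],
    "pmi": [3.656138, 10.714477, 4.779861, 2.1],
})

# Keep bigrams above the PMI > 3 threshold, strongest first
strong = toy_df[toy_df["pmi"] > 3].sort_values("pmi", ascending=False)

print(f"Strong collocations (PMI > 3): {len(strong)}")
print(strong[["bigram_str", "count", "pmi"]])
```

The threshold of 3 follows the rule of thumb from earlier in this part; raising it gives a shorter, more conservative list.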
Strong collocations (PMI > 3): 99
Top 30 strong collocations:
    bigram_str               count        pmi
78  merchant marine            152  10.714477
94  sinking fund               141  10.096041
77  panama canal               152  10.075506
42  vice president             217   9.900207
54  white house                193   9.770017
16  ending june                379   9.653955
30  middle east                248   9.560037
23  supreme court              301   9.368994
19  soviet union               346   8.791776
13  postmaster general         421   8.760305
46  attorney general           208   8.726112
29  interstate commerce        257   8.467558
14  post office                399   8.438991
58  low income                 184   8.400325
60  treasury notes             180   8.396572
39  long term                  222   8.386739
47  private sector             203   8.356946
85  nuclear weapons            148   8.159649
6   fellow citizens            562   8.157588
31  armed forces               245   8.075245
57  indian tribes              187   8.054296
15  annual message             392   8.019959
56  favorable consideration    192   8.014378
76  reason believe             152   8.012472
72  executive branch           160   7.902223
63  law enforcement            177   7.874163
91  taken place                145   7.789880
90  current fiscal             146   7.714495
43  natural resources          217   7.703547
32  office department          240   7.620863
7.6 Visualizing PMI distribution
# Plot PMI distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of PMI values
axes[0].hist(bigram_df['pmi'], bins=50, color='skyblue', edgecolor='black')
axes[0].axvline(x=3, color='red', linestyle='--', linewidth=2,
                label='PMI = 3 (strong collocation)')
axes[0].set_xlabel('PMI', fontsize=12)
axes[0].set_ylabel('Number of bigrams', fontsize=12)
axes[0].set_title('Distribution of PMI values', fontsize=13, fontweight='bold')
axes[0].legend()

# Scatter: frequency vs PMI
axes[1].scatter(bigram_df['count'], bigram_df['pmi'], alpha=0.5, s=30)
axes[1].axhline(y=3, color='red', linestyle='--', linewidth=2, label='PMI = 3')
axes[1].set_xlabel('Frequency', fontsize=12)
axes[1].set_ylabel('PMI', fontsize=12)
axes[1].set_title('Frequency vs PMI', fontsize=13, fontweight='bold')
axes[1].set_xscale('log')
axes[1].legend()

plt.tight_layout()
plt.show()
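The left plot shows the PMI values of the 100 most frequent bigrams. Almost all of them lie to the right of the PMI = 3 line (99 of 100 in the output above), and the strongest exceed 10. This is partly a selection effect: we are only looking at bigrams that were frequent enough to make the top 100. The right plot reveals that high frequency doesn’t guarantee high PMI. “United states” is by far the most frequent bigram but scores 5.6, while “merchant marine” appears only 152 times and scores 10.7. Conversely, a low-frequency bigram can have a very high PMI if its words are rare but almost always appear together. The sweet spot is moderate frequency combined with high PMI, indicating meaningful phrases that appear often enough to be notable.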
8 Part 3: comparing collocations across groups
Now we can answer: how do collocations differ across parties and time periods?
8.1 Democrat vs Republican collocations
Let’s extract collocations separately for each party:
def extract_party_collocations(speeches_df, party_name, min_count=10, min_pmi=3):
    """
    Extract collocations for a specific party.

    Returns DataFrame of collocations for this party.
    """
    # Filter speeches
    party_speeches = speeches_df[speeches_df['party'] == party_name]

    # Extract bigrams
    party_bigrams = []
    for text in party_speeches['transcript']:
        bigrams = extract_bigrams(text, remove_stopwords=True, pos_filter=True)
        party_bigrams.extend(bigrams)

    # Count and create DataFrame
    bigram_counts = Counter(party_bigrams)
    party_df = pd.DataFrame(
        bigram_counts.most_common(),
        columns=['bigram', 'count']
    )
    party_df = party_df[party_df['count'] >= min_count]

    # Calculate PMI
    party_df = calculate_pmi_for_bigrams(party_df, party_bigrams)

    # Filter by PMI
    party_df = party_df[party_df['pmi'] >= min_pmi].copy()
    party_df['bigram_str'] = party_df['bigram'].apply(lambda x: f"{x[0]} {x[1]}")
    party_df['party'] = party_name

    return party_df

# Extract for both parties
dem_collocations = extract_party_collocations(speeches, 'Democrat', min_count=15, min_pmi=2.5)
rep_collocations = extract_party_collocations(speeches, 'Republican', min_count=15, min_pmi=2.5)

print(f"Democrat collocations: {len(dem_collocations)}")
print(dem_collocations.head(15)[['bigram_str', 'count', 'pmi']])

print(f"\nRepublican collocations: {len(rep_collocations)}")
print(rep_collocations.head(15)[['bigram_str', 'count', 'pmi']])
Democrat collocations: 569
bigram_str count pmi
0 united states 769 6.742979
1 american people 345 5.359103
2 health care 342 6.566845
3 fiscal year 318 6.939028
4 social security 282 6.872091
5 soviet union 256 7.656043
6 federal government 245 5.248756
7 past years 202 6.793270
8 human rights 201 7.085960
9 united nations 184 5.408262
10 small business 154 7.204985
11 private sector 152 8.022914
12 middle east 132 8.412535
13 world war 130 5.357563
14 long term 118 7.957562
Republican collocations: 309
bigram_str count pmi
0 united states 461 7.007118
1 federal government 348 4.851698
2 american people 244 4.997661
3 health care 162 6.873490
4 social security 162 6.780482
5 fiscal year 150 7.043834
6 past years 148 6.545036
7 local governments 125 6.939865
8 economic growth 122 5.807923
9 united nations 113 5.918166
10 free world 108 5.969912
11 law enforcement 107 8.510486
12 middle east 104 9.119251
13 community development 94 7.164706
14 health insurance 90 6.522676
8.2 Finding party-distinctive collocations
Which collocations are distinctively Democratic vs Republican? We can use PMI difference to identify distinctive phrases:
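The cell that builds the comparison table is not shown here. Judging from the columns in the output that follows (`pmi_dem`, `pmi_rep`, `pmi_diff`, with 0 where a phrase is missing for one party), one plausible reconstruction is an outer merge followed by a subtraction. The sketch below is an assumption about that step, not the original code, and it runs on toy frames whose values are copied from the party output above.

```python
import pandas as pd

# Toy stand-ins for dem_collocations / rep_collocations
# (PMI values copied from the printed party output above)
dem = pd.DataFrame({
    "bigram_str": ["health care", "human rights", "middle east"],
    "pmi": [6.566845, 7.085960, 8.412535],
})
rep = pd.DataFrame({
    "bigram_str": ["health care", "law enforcement", "middle east"],
    "pmi": [6.873490, 8.510486, 9.119251],
})

# Outer merge keeps phrases that appear for only one party
comparison_df = pd.merge(dem, rep, on="bigram_str", how="outer",
                         suffixes=("_dem", "_rep"))

# A phrase missing for one party gets PMI 0 on that side (assumed convention)
comparison_df = comparison_df.fillna({"pmi_dem": 0.0, "pmi_rep": 0.0})

# Positive = more Democratic, negative = more Republican
comparison_df["pmi_diff"] = comparison_df["pmi_dem"] - comparison_df["pmi_rep"]

# Largest absolute differences first
comparison_df = comparison_df.sort_values("pmi_diff",
                                          key=lambda s: s.abs(),
                                          ascending=False)
print(comparison_df)
```

One consequence of filling with 0 is worth keeping in mind when reading the results: a phrase that one party never used often enough to pass the thresholds is treated as if its PMI were 0 for that party. The largest differences therefore mostly flag phrases that appear in only one party's list, rather than phrases both parties use with different association strength.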
Most party-distinctive collocations:
Top 15 (positive = Democratic, negative = Republican):
     bigram_str             pmi_dem    pmi_rep    pmi_diff
579  status quo           12.779820   0.000000   12.779820
640  viet nam             12.630079   0.000000   12.630079
308  humphrey hawkins     12.493105   0.000000   12.493105
330  iron curtain         12.491782   0.000000   12.491782
260  gramm rudman          0.000000  12.434724  -12.434724
119  counter cyclical     12.145276   0.000000   12.145276
479  prime minister       11.975929   0.000000   11.975929
624  undocumented aliens  11.857188   0.000000   11.857188
644  wall street          11.748956   0.000000   11.748956
331  item veto             0.000000  11.566370  -11.566370
134  d. eisenhower         0.000000  11.535189  -11.535189
455  panama canal         11.469806   0.000000   11.469806
66   canal treaties       11.452477   0.000000   11.452477
149  displaced persons    11.370174   0.000000   11.370174
346  lend lease           11.301328   0.000000   11.301328
8.3 Visualizing party differences
# Get top distinctive collocations for visualization
top_dem = comparison_df.nlargest(15, 'pmi_diff')
top_rep = comparison_df.nsmallest(15, 'pmi_diff')
top_distinctive = pd.concat([top_rep, top_dem])

# Create diverging bar chart
fig, ax = plt.subplots(figsize=(10, 10))

colors = ['red' if x < 0 else 'blue' for x in top_distinctive['pmi_diff']]
ax.barh(top_distinctive['bigram_str'], top_distinctive['pmi_diff'],
        color=colors, alpha=0.7)
ax.axvline(x=0, color='black', linewidth=1)

ax.set_xlabel('PMI Difference\n(negative = Republican, positive = Democrat)', fontsize=12)
ax.set_ylabel('Collocation', fontsize=12)
ax.set_title('Most party-distinctive collocations in SOTU addresses',
             fontsize=13, fontweight='bold')

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='blue', alpha=0.7, label='Democratic collocations'),
    Patch(facecolor='red', alpha=0.7, label='Republican collocations')
]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.show()
This visualization shows which collocations are distinctively associated with each party based on PMI differences. Many of these will be proper names (historical figures, legislation names) specific to different eras when each party held power. The method itself—comparing collocations across groups using PMI—applies to any comparison you want to make: positive vs negative product reviews, formal vs informal writing, different time periods, different authors, etc.
8.4 Temporal comparison: how collocations change over time
Let’s compare collocations across three eras:
def extract_era_collocations(speeches_df, start_year, end_year, era_name,
                             min_count=10, min_pmi=2.5):
    """Extract collocations for a specific time period."""
    era_speeches = speeches_df[
        (speeches_df['year'] >= start_year) &
        (speeches_df['year'] <= end_year)
    ]

    # Extract bigrams
    era_bigrams = []
    for text in era_speeches['transcript']:
        bigrams = extract_bigrams(text, remove_stopwords=True, pos_filter=True)
        era_bigrams.extend(bigrams)

    # Count and create DataFrame
    bigram_counts = Counter(era_bigrams)
    era_df = pd.DataFrame(
        bigram_counts.most_common(),
        columns=['bigram', 'count']
    )
    era_df = era_df[era_df['count'] >= min_count]

    # Calculate PMI
    era_df = calculate_pmi_for_bigrams(era_df, era_bigrams)

    # Filter by PMI
    era_df = era_df[era_df['pmi'] >= min_pmi].copy()
    era_df['bigram_str'] = era_df['bigram'].apply(lambda x: f"{x[0]} {x[1]}")
    era_df['era'] = era_name

    return era_df

# Define eras
eras = [
    (1945, 1975, 'Post-war (1945-1975)'),
    (1976, 2000, 'Late 20th century (1976-2000)'),
    (2001, 2018, '21st century (2001-2018)')
]

# Extract for each era
era_collocations = {}
for start, end, name in eras:
    era_df = extract_era_collocations(speeches, start, end, name, min_count=8, min_pmi=2.5)
    era_collocations[name] = era_df
    print(f"\n{name}: {len(era_df)} collocations")
    print(era_df.head(10)[['bigram_str', 'count', 'pmi']])
Post-war (1945-1975): 1261 collocations
bigram_str count pmi
0 united states 580 6.531445
1 fiscal year 416 6.539359
2 federal government 339 4.981652
3 united nations 261 5.533383
4 free world 204 5.740163
5 american people 176 5.407417
6 past years 158 6.376723
7 free nations 148 5.066070
8 soviet union 146 8.219514
9 local governments 145 6.929096
Late 20th century (1976-2000): 1387 collocations
bigram_str count pmi
0 united states 508 7.150882
1 american people 300 5.308773
2 health care 274 6.463104
3 federal government 250 5.032715
4 social security 216 6.778588
5 soviet union 196 7.468193
6 human rights 191 7.016411
7 private sector 168 8.060140
8 past years 160 6.832990
9 middle east 120 8.685531
21st century (2001-2018): 344 collocations
bigram_str count pmi
0 united states 196 6.991582
1 health care 168 6.531817
2 american people 149 4.595473
3 al qaida 116 8.141844
4 social security 112 6.997299
5 new jobs 74 3.633414
6 middle east 70 7.902651
7 middle class 67 7.669534
8 health insurance 66 6.492642
9 clean energy 62 6.925543
8.5 Visualizing temporal patterns
# Compare top collocations across eras
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for idx, (era_name, era_df) in enumerate(era_collocations.items()):
    ax = axes[idx]

    # Get top 15 by PMI
    top_era = era_df.nlargest(15, 'pmi')

    # Assign the y variable to hue (with legend=False) so that palette can be
    # used without triggering seaborn's deprecation warning
    sns.barplot(
        data=top_era,
        y='bigram_str',
        x='pmi',
        hue='bigram_str',
        palette='rocket',
        legend=False,
        ax=ax
    )
    ax.set_xlabel('PMI', fontsize=11)
    ax.set_ylabel('Collocation', fontsize=11)
    ax.set_title(era_name, fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()
This temporal comparison shows how collocations change across historical periods. In political speech, you’ll see many proper names and historical references specific to each era. In other domains, temporal analysis reveals different patterns: product reviews show evolving features (“app crashes” → “facial recognition”), social media shows changing slang and topics, scientific abstracts show emerging methodologies.
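A simple way to summarize change between two eras is to compare their phrase lists as sets. The sketch below uses only the top-10 lists printed in the era output above, so “entered” and “dropped” refer to those top-10 lists and not to the full set of collocations in each era.

```python
# Top-10 lists copied from the era output above
late_20th = {
    "united states", "american people", "health care", "federal government",
    "social security", "soviet union", "human rights", "private sector",
    "past years", "middle east",
}
early_21st = {
    "united states", "health care", "american people", "al qaida",
    "social security", "new jobs", "middle east", "middle class",
    "health insurance", "clean energy",
}

entered = sorted(early_21st - late_20th)  # only in the later top 10
dropped = sorted(late_20th - early_21st)  # only in the earlier top 10
shared = sorted(late_20th & early_21st)   # in both

print(entered)
# ['al qaida', 'clean energy', 'health insurance', 'middle class', 'new jobs']
print(dropped)
# ['federal government', 'human rights', 'past years', 'private sector', 'soviet union']
```

Applied to the full `era_collocations` DataFrames instead of the top-10 lists, the same set operations would give a more complete picture, since a phrase can fall out of the top 10 while still being in use.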
9 Conclusion: why collocations matter for text analysis
9.1 What we learned
In this lab, you learned to:
Extract bigrams with proper filtering (stopwords, POS tags, possessives)
Use PMI to identify true collocations (not just frequent bigrams)
Compare collocations across groups (parties, time periods)
Interpret PMI values for substantive research findings
Visualize collocation patterns with bar charts
9.2 Key insights about multi-word phrases
Multi-word phrases matter because language uses formulaic expressions to convey meaning. Single-word analysis misses these important patterns. Statistical validation is essential—frequency alone is misleading. PMI separates true collocations from coincidental co-occurrences. Strong collocations (PMI > 3) reveal meaningful semantic bonds between words.
Language changes over time. New issues emerge through new collocations, rhetorical patterns evolve with historical context, and tracking collocations reveals shifting priorities in discourse.
9.3 How this complements word-level analysis
Lab 01 introduced word frequency counting—now we count phrase frequency. Lab 02 showed corpus comparison—now we compare phrase usage across groups. Lab 03 introduced PMI for word-category association—now we measure word-word association. Lab 04 compared entire documents—now we identify the phrases that characterize documents.
9.4 When to use collocation analysis
Collocation analysis is useful when studying formulaic language and multi-word expressions, identifying domain-specific terminology, comparing rhetorical strategies across groups, and tracking emergence of new phrases over time.
Collocations won’t help with broader semantic relationships (words that co-occur but aren’t adjacent—use word embeddings or topic models), document-level patterns (use Lab 04’s similarity methods), or sentiment analysis (use Lab 03’s dictionaries).
9.5 Applying these methods to your own research
The techniques you learned here work on any text corpus. For product reviews, you might extract collocations like “battery life,” “customer service,” and “money back,” comparing positive vs negative reviews. For social media, you might track emerging slang and hashtag patterns over time. For interview transcripts, you might identify domain-specific terminology that characterizes different participant groups.
The filtering strategy (stopwords, POS tags, frequency thresholds, PMI cutoffs) may need adjustment for different genres, but the core methods remain the same.
9.6 Next steps
To extend this analysis:
Extract trigrams (three-word sequences) for longer formulaic phrases
Combine inflected forms through lemmatization (“create jobs” + “creating jobs”)
Extract collocations with named entities (people, places, organizations)
Model which factors predict collocation usage through regression analysis
Compare the collocations in your corpus with those from a reference corpus (e.g., general language)
10 Exercises
Trigram extraction: Modify the extract_bigrams() function to extract trigrams (three-word sequences). What are the most common trigrams? Do they reveal different patterns than bigrams?
PMI threshold sensitivity: Re-run the party comparison with different PMI thresholds (2.0, 3.0, 4.0). How does the threshold affect which collocations are identified as party-distinctive?
Verb + noun collocations: Filter bigrams to only include VERB + NOUN patterns. What verbs are most common? How do they differ by party?
Presidential comparison: Instead of comparing parties, compare collocations across specific presidents (Obama vs Trump, Reagan vs Clinton). What distinctive phrases characterize each president?
Sentiment collocations: Combine this lab with Lab 03’s sentiment dictionaries. Extract bigrams where one word is from a sentiment lexicon. Do different groups use different sentiment words in their collocations?
Temporal emergence: Track when specific collocations first appear. Create a timeline showing emergence of new phrases in your corpus.