Lab 02: Words as data points II

Comparing corpora, lemmatization, and statistical significance

Published

2026-01-25 11:57:22

1 Learning objectives

By the end of this lab, you will understand:

How to compare word usage across different groups or corpora
What lemmatization is and why it matters for text analysis
The difference between stemming and lemmatization
Why simple frequency comparisons can be misleading
How to measure statistical significance with log-likelihood (G²)
How to quantify effect size with log odds ratio
What named entities are and how to extract them
How to use spaCy for advanced NLP tasks in Python

2 Introduction: Why compare corpora?

In social and political science, texts often serve as proxies for social phenomena, sentiments, ideas, or discourses. A common research design involves collecting texts from different institutions, groups, or actors to create contrasting corpora. By comparing word usage across these corpora, we can infer something about the underlying social or political features of the entities they represent.

2.1 The research question

Consider this scenario: Do Democratic and Republican presidents talk differently? Not just in terms of political positions, but in the actual words they use?

In Lab 01, we compared authors based on pre-selected words (stop words, personal pronouns). This worked well for stylometry because function words are a closed class - we know all of them in advance.

But what about content words? If we want to compare the substance of what different groups talk about, how do we:

Avoid arbitrary word selection?
Distinguish meaningful differences from random variation?
Quantify both the significance and magnitude of differences?

This is where corpus comparison methods come in.

2.2 Our approach today

We’ll create two contrasting corpora:

Corpus A: State of the Union addresses by Democratic presidents (since 1917)
Corpus B: State of the Union addresses by Republican presidents (since 1917)

Then we’ll use statistical measures to identify which words are significantly over- or under-used in one corpus compared to the other.

Key insight: We’re not just looking for different words - we’re looking for statistically significant differences that reveal meaningful patterns.

3 Setup: Loading packages

# Data manipulation
import pandas as pd
import numpy as np

# Text processing
import spacy
from collections import Counter

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical functions
from scipy.stats import chi2

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Packages loaded successfully")

✓ Packages loaded successfully

About these packages

New packages in this lab:

numpy: Numerical computing (for log calculations)
scipy.stats: Statistical functions (for significance testing)

We’ll continue using pandas, spaCy, and visualization libraries from Lab 01.

3.1 Loading spaCy model

# Load English language model
nlp = spacy.load("en_core_web_sm")

nlp.max_length = 1530000        # https://github.com/explosion/spaCy/issues/13207#issuecomment-1865973378

print(f"✓ spaCy model loaded: {nlp.meta['name']}")
print(f"  Language: {nlp.meta['lang']}")
print(f"  Components: {nlp.pipe_names}")

✓ spaCy model loaded: core_web_sm
  Language: en
  Components: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

4 Loading and preparing the data

Let’s load our State of the Union dataset:

# Load the data
speeches = pd.read_csv("data/transcripts.csv")

print(f"Total speeches: {len(speeches)}")
print(f"Date range: {speeches['date'].min()} to {speeches['date'].max()}")
print(f"\nFirst few rows:")
speeches.head()

Total speeches: 244
Date range: 1790-01-08 to 2018-01-30

First few rows:

	date	president	title	url	transcript
0	2018-01-30	Donald J. Trump	Address Before a Joint Session of the Congress...	https://www.cnn.com/2018/01/30/politics/2018-s...	\nMr. Speaker, Mr. Vice President, Members of ...
1	2017-02-28	Donald J. Trump	Address Before a Joint Session of the Congress	http://www.presidency.ucsb.edu/ws/index.php?pi...	Thank you very much. Mr. Speaker, Mr. Vice Pre...
2	2016-01-12	Barack Obama	Address Before a Joint Session of the Congress...	http://www.presidency.ucsb.edu/ws/index.php?pi...	Thank you. Mr. Speaker, Mr. Vice President, Me...
3	2015-01-20	Barack Obama	Address Before a Joint Session of the Congress...	http://www.presidency.ucsb.edu/ws/index.php?pi...	The President. Mr. Speaker, Mr. Vice President...
4	2014-01-28	Barack Obama	Address Before a Joint Session of the Congress...	http://www.presidency.ucsb.edu/ws/index.php?pi...	The President. Mr. Speaker, Mr. Vice President...

4.1 Creating contrasting corpora

We’ll split speeches by party affiliation. First, let’s define which presidents belong to which party (since 1917):

# Democratic presidents since 1917
democrats = [
    "Woodrow Wilson", 
    "Franklin D. Roosevelt", 
    "Harry S. Truman", 
    "John F. Kennedy", 
    "Lyndon B. Johnson", 
    "Jimmy Carter",
    "William J. Clinton", 
    "Barack Obama"
]

# Filter speeches
speeches_after_1917 = speeches[speeches['date'] > '1917-10-25'].copy()

# Create party labels
speeches_after_1917['party'] = speeches_after_1917['president'].apply(
    lambda x: 'Democrat' if x in democrats else 'Republican'
)

# Split into two corpora
dem_speeches = speeches_after_1917[speeches_after_1917['party'] == 'Democrat']
rep_speeches = speeches_after_1917[speeches_after_1917['party'] == 'Republican']

print("Democratic presidents:")
print(dem_speeches['president'].value_counts())
print(f"\nTotal Democratic speeches: {len(dem_speeches)}")

print("\n" + "="*50)
print("\nRepublican presidents:")
print(rep_speeches['president'].value_counts())
print(f"\nTotal Republican speeches: {len(rep_speeches)}")

Democratic presidents:
president
Franklin D. Roosevelt    13
Barack Obama              8
Harry S. Truman           8
William J. Clinton        8
Jimmy Carter              7
Lyndon B. Johnson         6
Woodrow Wilson            4
John F. Kennedy           3
Name: count, dtype: int64

Total Democratic speeches: 57

==================================================

Republican presidents:
president
Richard Nixon           12
Dwight D. Eisenhower    10
Ronald Reagan            8
George W. Bush           8
Calvin Coolidge          6
Herbert Hoover           4
George Bush              4
Gerald R. Ford           3
Donald J. Trump          2
Warren G. Harding        2
Name: count, dtype: int64

Total Republican speeches: 59

Why start in 1917?

We chose 1917 as a cutoff to:

Ensure both parties have substantial representation
Focus on relatively modern political language
Avoid complications from 19th century political realignments

In your own research, such choices should be explicit and justified.

4.2 Combining texts by party

For corpus comparison, we’ll combine all speeches from each party into two large text collections:

# Combine all speeches by party
dem_corpus = " ".join(dem_speeches['transcript'].tolist())
rep_corpus = " ".join(rep_speeches['transcript'].tolist())

print(f"Democratic corpus: {len(dem_corpus):,} characters")
print(f"Republican corpus: {len(rep_corpus):,} characters")

Democratic corpus: 4,874,617 characters
Republican corpus: 4,176,938 characters

5 From wordforms to lemmas: Introduction to lemmatization

5.1 The problem with raw words

Consider these sentences:

“The government regulates industry.”
“These regulations affect small businesses.”
“The regulatory framework is complex.”

These three words - regulates, regulations, regulatory - are clearly related. They share the same root concept of “regulation.” But if we count them separately, we miss this connection.

This problem is especially acute in languages with rich inflection (Russian, German, Finnish), but it exists in English too:

Verbs: walk, walks, walked, walking
Nouns: cat, cats, mouse, mice
Adjectives: big, bigger, biggest

5.2 Two solutions: Stemming vs lemmatization

Stemming: Crudely chop off word endings

running → run
better → bet (⚠️ wrong!)
organization → organ (⚠️ wrong!)
Fast but imprecise

Lemmatization: Reduce words to their dictionary form (lemma)

running → run
better → good
mice → mouse
Slower but accurate
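To make the contrast concrete, here is a toy sketch, not a real stemmer or lemmatizer. The suffix list, `crude_stem`, `LEMMA_DICT`, and `toy_lemma` are all hypothetical helpers written for this illustration. The stemmer blindly strips a few hard-coded endings, while the dictionary lookup stands in for the linguistic knowledge a real lemmatizer brings.

```python
# Toy illustration only: real stemmers use many more rules, and real
# lemmatizers combine POS tags with large morphological dictionaries.

SUFFIXES = ("ization", "ing", "ed", "er", "s")  # hypothetical, hand-picked

def crude_stem(word):
    """Strip the first matching suffix, then undo a doubled final letter."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]  # e.g. "runn" -> "run"
            return stem
    return word

# Hypothetical mini-dictionary standing in for a lemmatizer's lookup tables
LEMMA_DICT = {"running": "run", "better": "good", "mice": "mouse"}

def toy_lemma(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMA_DICT.get(word, word)

for w in ["running", "better", "organization", "mice"]:
    print(f"{w:15} stem: {crude_stem(w):10} lemma: {toy_lemma(w)}")
```

The stemmer gets "running" right by luck of its rules but mangles "better" and "organization", and it cannot relate "mice" to "mouse" at all because no suffix is involved.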

5.3 How lemmatization works

Lemmatization requires:

Part-of-speech information: Is “running” a verb or a noun?
Morphological dictionary: What are all the forms of “run”?
Linguistic rules: How do irregular forms work?

Fortunately, spaCy does all this for us!

5.4 Lemmatization with spaCy

Let’s see lemmatization in action:

# Example text
example = "The regulations are regulating industries more effectively than previous regulatory frameworks."

# Process with spaCy
doc = nlp(example)

# Show original word, lemma, and part of speech
print("Word → Lemma (Part of Speech)")
print("-" * 40)
for token in doc:
    if not token.is_punct:
        print(f"{token.text:15} → {token.lemma_:15} ({token.pos_})")

Word → Lemma (Part of Speech)
----------------------------------------
The             → the             (DET)
regulations     → regulation      (NOUN)
are             → be              (AUX)
regulating      → regulate        (VERB)
industries      → industry        (NOUN)
more            → more            (ADV)
effectively     → effectively     (ADV)
than            → than            (ADP)
previous        → previous        (ADJ)
regulatory      → regulatory      (ADJ)
frameworks      → framework       (NOUN)

Notice how:

“regulations” → “regulation”
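“are regulating” → “be regulate” (the auxiliary and the main verb are lemmatized separately)
“regulatory” → “regulatory” (already base form)
Lemmatization only merges inflected forms of the same word: “regulation”, “regulate”, and “regulatory” remain three separate lemmas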

When to lemmatize?

Use lemmatization when:

Comparing content across documents
Building topic models
Working with inflected languages
Vocabulary is too large

Don’t lemmatize when:

Doing stylometry (exact forms matter)
Studying syntax or grammar
Tense/number/person is important
Training neural language models

6 Processing our corpora with spaCy

Now let’s lemmatize both of our political corpora. This will take a few minutes as spaCy processes all the text.

Processing time

Processing large texts with spaCy is computationally intensive. For very large corpora (millions of words), you might want to:

Use spaCy’s nlp.pipe() for batch processing
Disable unnecessary components (nlp.select_pipes(disable=[...]); the older nlp.disable_pipes() is deprecated in spaCy v3)
Save processed results to disk

For this lab, the processing should take 2-5 minutes.
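The batch-processing advice above could look roughly like the sketch below. `count_lemmas` and `SKIP_POS` are hypothetical names introduced here, and the function assumes `nlp` is a loaded spaCy pipeline whose tokens expose `lemma_`, `pos_`, `is_punct`, and `is_space`. Because each speech becomes its own Doc, every speech only needs to be shorter than `nlp.max_length` on its own.

```python
from collections import Counter

# POS tags excluded from the counts (numbers, symbols, proper nouns)
SKIP_POS = {"NUM", "SYM", "PROPN"}

def count_lemmas(nlp, texts, batch_size=8):
    """Stream texts through nlp.pipe() and tally lowercased lemmas."""
    counts = Counter()
    for doc in nlp.pipe(texts, batch_size=batch_size):
        for token in doc:
            if token.is_punct or token.is_space or token.pos_ in SKIP_POS:
                continue
            counts[token.lemma_.lower()] += 1
    return counts

# Possible usage with the objects defined in this lab (not run here).
# Disabling the parser and NER is an assumption that should be checked:
# it only helps if lemmas and POS tags in your model do not depend on them.
#   with nlp.select_pipes(disable=["parser", "ner"]):
#       dem_counts = count_lemmas(nlp, dem_speeches["transcript"])
```

Returning a Counter rather than a list of Doc objects keeps memory use low, at the cost of discarding everything except the lemma counts.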

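# Process Democratic speeches (this takes time!)
# Note: dem_corpus (~4.9M characters) is longer than the nlp.max_length set above (1,530,000),
# so nlp() raises an error unless you increase nlp.max_length or process the speeches one at a time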
print("Processing Democratic speeches...")
dem_doc = nlp(dem_corpus)
print("✓ Democratic corpus processed")

# Process Republican speeches
print("Processing Republican speeches...")
rep_doc = nlp(rep_corpus)
print("✓ Republican corpus processed")

For the purposes of this lab, let’s work with a sample to speed things up:

# Take a sample of each corpus for faster processing
dem_sample = " ".join(dem_speeches.sample(n=min(20, len(dem_speeches)), random_state=42)['transcript'].tolist())
rep_sample = " ".join(rep_speeches.sample(n=min(20, len(rep_speeches)), random_state=42)['transcript'].tolist())

# Process samples
print("Processing samples...")
dem_doc = nlp(dem_sample)
rep_doc = nlp(rep_sample)
print("✓ Processing complete")

print(f"\nDemocratic sample: {len(dem_doc)} tokens")
print(f"Republican sample: {len(rep_doc)} tokens")

Processing samples...
✓ Processing complete

Democratic sample: 295013 tokens
Republican sample: 267382 tokens

6.1 Extracting lemmas and filtering

We want to keep only content-bearing words. Let’s filter out:

Punctuation (., ,, !, etc.)
Numbers (1, 2020, million)
Symbols ($, %, @)
Proper nouns (specific names of people and places)
Stop words (optional - let’s keep them for now to see what happens)

# POS tags to drop: numbers, symbols, and proper nouns
EXCLUDED_POS = {'NUM', 'SYM', 'PROPN'}

def extract_lemmas(doc, party):
    """Return one record per kept token: lowercased lemma, POS tag, party label."""
    return [
        {'lemma': token.lemma_.lower(), 'pos': token.pos_, 'party': party}
        for token in doc
        if not token.is_punct and not token.is_space and token.pos_ not in EXCLUDED_POS
    ]

# Extract lemmas from each party's speeches
dem_lemmas = extract_lemmas(dem_doc, 'Democrat')
rep_lemmas = extract_lemmas(rep_doc, 'Republican')

# Combine into DataFrames
dem_df = pd.DataFrame(dem_lemmas)
rep_df = pd.DataFrame(rep_lemmas)

print(f"Democratic lemmas: {len(dem_df):,}")
print(f"Republican lemmas: {len(rep_df):,}")
print(f"\nSample of Democratic lemmas:")
print(dem_df.head(10))

Democratic lemmas: 247,919
Republican lemmas: 225,161

Sample of Democratic lemmas:
     lemma   pos     party
0    thank  VERB  Democrat
1      you  PRON  Democrat
2       of   ADP  Democrat
3       my  PRON  Democrat
4   fellow   ADJ  Democrat
5  tonight  NOUN  Democrat
6     mark  VERB  Democrat
7      the   DET  Democrat
8   eighth   ADJ  Democrat
9     year  NOUN  Democrat

6.2 Creating frequency tables

Now let’s count how often each lemma appears in each corpus:

# Count frequencies by party
dem_counts = dem_df['lemma'].value_counts().reset_index()
dem_counts.columns = ['lemma', 'democrat']

rep_counts = rep_df['lemma'].value_counts().reset_index()
rep_counts.columns = ['lemma', 'republican']

# Merge into one table
freq_table = dem_counts.merge(rep_counts, on='lemma', how='outer').fillna(0)

# Convert to integers
freq_table['democrat'] = freq_table['democrat'].astype(int)
freq_table['republican'] = freq_table['republican'].astype(int)

# Filter out very rare words (keep a lemma only if it appears more than 10 times in at least one corpus)
freq_table = freq_table[(freq_table['democrat'] > 10) | (freq_table['republican'] > 10)].copy()

print(f"Unique lemmas (after filtering): {len(freq_table):,}")
print(f"\nTop 20 by total frequency:")
freq_table['total'] = freq_table['democrat'] + freq_table['republican']
print(freq_table.sort_values('total', ascending=False).head(20))

Unique lemmas (after filtering): 2,072

Top 20 by total frequency:
     lemma  democrat  republican  total
6817   the     14722       14995  29717
4617    of      8965        9654  18619
395    and      9610        8446  18056
6915    to      9536        7944  17480
708     be      8286        8421  16707
3462    in      5361        5390  10751
7507    we      5757        4214   9971
94       a      4410        4106   8516
4700   our      4423        3595   8018
6816  that      3850        2843   6693
3187  have      3386        3183   6569
2805   for      2936        2765   5701
3364     i      2742        1921   4663
7583  will      2390        1964   4354
3797    it      1995        1740   3735
6849  this      1952        1738   3690
4548   not      1780        1372   3152
7604  with      1626        1383   3009
4640    on      1419        1445   2864
6842  they      1658        1065   2723

7 The problem with simple frequency comparisons

Looking at raw frequencies is tempting, but it can be misleading. Let’s see why.

7.1 Corpus size matters

# Total tokens per party (summed over the lemmas that survived the rare-word filter)
total_dem = freq_table['democrat'].sum()
total_rep = freq_table['republican'].sum()

print(f"Total Democratic tokens: {total_dem:,}")
print(f"Total Republican tokens: {total_rep:,}")
print(f"Ratio (Dem/Rep): {total_dem/total_rep:.2f}")

Total Democratic tokens: 234,922
Total Republican tokens: 212,286
Ratio (Dem/Rep): 1.11

If one corpus is larger, it will naturally have higher raw counts for most words. We need to account for this.

7.2 Example: The word “people”

# Look at a specific word
people_row = freq_table[freq_table['lemma'] == 'people']

if len(people_row) > 0:
    dem_count = people_row['democrat'].values[0]
    rep_count = people_row['republican'].values[0]
    
    # Raw counts
    print(f"Raw counts for 'people':")
    print(f"  Democrats: {dem_count}")
    print(f"  Republicans: {rep_count}")
    print(f"  Difference: {dem_count - rep_count}")
    
    # Normalized (per 1000 words)
    dem_rate = (dem_count / total_dem) * 1000
    rep_rate = (rep_count / total_rep) * 1000
    
    print(f"\nNormalized rates (per 1,000 words):")
    print(f"  Democrats: {dem_rate:.2f}")
    print(f"  Republicans: {rep_rate:.2f}")
    print(f"  Difference: {dem_rate - rep_rate:.2f}")

Raw counts for 'people':
  Democrats: 972
  Republicans: 687
  Difference: 285

Normalized rates (per 1,000 words):
  Democrats: 4.14
  Republicans: 3.24
  Difference: 0.90

The raw gap of 285 occurrences partly reflects the larger Democratic corpus. The normalized rates (4.14 vs 3.24 per 1,000 words) are the fairer comparison.

7.3 Two questions we need to answer

Is the difference statistically significant?
- Could this difference occur by chance?
- How confident can we be that it’s a real pattern?
- → We’ll use log-likelihood (G²) for this
How large is the effect?
- Is it a huge difference or a tiny one?
- Which words show the strongest contrast?
- → We’ll use log odds ratio for this

Statistical significance ≠ practical importance

A difference can be:

Statistically significant but tiny (large sample)
Large but not significant (small sample)

We need both measures to draw meaningful conclusions.

8 Measuring significance: Log-likelihood (G²)

8.1 The problem: When is a difference real?

Let’s say we’re comparing Republican and Democratic speeches, and we find that the word “freedom” appears:

100 times in Republican speeches
50 times in Democratic speeches

Should we conclude that Republicans talk twice as much about freedom?

Not necessarily. Here’s why: What if the Republican corpus contains 1,000,000 words total, while the Democratic corpus contains 500,000 words? Then both parties use “freedom” at exactly the same rate (100 per million words). The difference in raw counts is simply because we have more Republican text.

This is why we need a statistical test that accounts for corpus size.
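The arithmetic behind the “freedom” example can be checked in a few lines. `rate_per_million` is a hypothetical helper named here for illustration; the counts and corpus sizes are the hypothetical figures from the paragraph above.

```python
def rate_per_million(count, corpus_size):
    """Relative frequency expressed as occurrences per million words."""
    return count / corpus_size * 1_000_000

rep_rate = rate_per_million(100, 1_000_000)  # 100 hits in 1,000,000 words
dem_rate = rate_per_million(50, 500_000)     # 50 hits in 500,000 words

print(f"Republican rate: {rep_rate:.1f} per million words")
print(f"Democratic rate: {dem_rate:.1f} per million words")
print(f"Rates match: {abs(rep_rate - dem_rate) < 1e-9}")
```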

8.2 What is log-likelihood (G²)?

Log-likelihood, abbreviated as G², is a statistical test that answers one simple question:

“Given the sizes of my two corpora, how surprising is this word’s distribution?”

The logic:

If a word is distributed just as we’d expect (proportional to corpus size), G² is close to 0
If the distribution is very different from what we’d expect, G² is large
The larger G², the more confident we can be that the difference is real, not just random variation

Think of G² as a “surprise meter” - it measures how surprised we should be by what we observe.

8.3 How to read G² values

G² follows a well-known statistical distribution, which means we have standard thresholds for interpretation:

G² value	Confidence level	What it means
< 3.84	Not significant	Difference might be random chance
> 3.84	95% confident	Probably a real pattern (p < 0.05)
> 6.63	99% confident	Very likely a real pattern (p < 0.01)
> 10.83	99.9% confident	Almost certainly a real pattern (p < 0.001)

Rule of thumb: We typically use G² > 6.63 as our cutoff for trusting a difference.

What does “99% confident” mean?

It means: “If there were actually no real difference, we’d see a result this extreme less than 1% of the time.” In other words, we’re very confident the pattern is real, not just luck.
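The thresholds in the table can be related to p-values directly. The sketch below assumes that, when comparing two corpora, G² approximately follows a chi-squared distribution with 1 degree of freedom under the "no difference" hypothesis, which is the distribution these thresholds come from. `g2_p_value` is a hypothetical helper; it uses only the standard library, and `scipy.stats.chi2.sf(g2, df=1)` should return the same value.

```python
import math

def g2_p_value(g2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(g2 / 2))

for threshold in (3.84, 6.63, 10.83):
    print(f"G² = {threshold:5.2f}  ->  p ≈ {g2_p_value(threshold):.4f}")
```

A function like this lets you report an approximate p-value for any G² value rather than only checking it against three fixed cutoffs.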

For the mathematically curious: How G² is calculated

G² compares observed frequencies (what we actually see) to expected frequencies (what we’d see if words were distributed proportionally to corpus size).

The formula is:

\[G^2 = 2 \sum O \times \ln\left(\frac{O}{E}\right)\]

Where:

$O$ = observed frequency
$E$ = expected frequency
$\ln$ = natural logarithm

For two corpora, this expands to:

\[G^2 = 2 \times \left[ a \times \ln\left(\frac{a}{E_1}\right) + b \times \ln\left(\frac{b}{E_2}\right) \right]\]

Where:

$a$ = word count in Corpus A
$b$ = word count in Corpus B
$E_1$ = expected count in Corpus A
$E_2$ = expected count in Corpus B

The expected frequencies account for corpus size:

\[E_1 = C \times \frac{a + b}{C + D}\]

\[E_2 = D \times \frac{a + b}{C + D}\]

Where:

$C$ = total size of Corpus A
$D$ = total size of Corpus B

This test is based on Dunning (1993), a foundational paper in corpus linguistics. It’s preferred over chi-squared for text data because it handles sparse data (rare words) more reliably.
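As a worked example of the formula, the sketch below applies it by hand to a single word, using the counts for “people” and the corpus totals printed earlier in this lab. It is only a scalar illustration of the equations above, not a replacement for a vectorized function.

```python
import math

# Counts for "people" and corpus totals as reported earlier in this lab
a, b = 972, 687            # Democratic, Republican occurrences
C, D = 234_922, 212_286    # total tokens per corpus

# Expected counts if the word were spread in proportion to corpus size
E1 = C * (a + b) / (C + D)
E2 = D * (a + b) / (C + D)

# G² = 2 * sum(O * ln(O / E)) over the two corpora
g2 = 2 * (a * math.log(a / E1) + b * math.log(b / E2))

print(f"Expected counts: {E1:.1f} (Dem), {E2:.1f} (Rep)")
print(f"G² = {g2:.2f}")
```

Note that the two expected counts always add up to the observed total a + b; the test only asks how unevenly that total is split relative to the corpus sizes.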

8.4 Calculating G² in Python

We’ll create a function that does all the mathematical work for us:

def log_likelihood(a, b):
    """
    Calculate log-likelihood (G²) for word frequencies in two corpora.

    This function compares observed word frequencies to expected frequencies
    (based on corpus size) and returns a G² value indicating how surprising
    the observed distribution is.

    Parameters:
    -----------
    a : array-like
        Word counts in corpus A (e.g., Democratic speeches)
    b : array-like
        Word counts in corpus B (e.g., Republican speeches)

    Returns:
    --------
    array-like
        G² values for each word (higher = more surprising/significant)
    """
    # Total corpus sizes
    C = np.sum(a)  # Total tokens in corpus A
    D = np.sum(b)  # Total tokens in corpus B

    # Calculate expected frequencies (what we'd expect if words were distributed proportionally)
    E1 = C * ((a + b) / (C + D))
    E2 = D * ((a + b) / (C + D))

    # Calculate G² statistic
    # Note: We add a tiny constant (1e-10) to avoid mathematical errors when counts are zero
    g2 = 2 * ((a * np.log(a / E1 + 1e-10)) + (b * np.log(b / E2 + 1e-10)))

    return g2
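One possible variant, sketched here under stated assumptions, handles zero counts explicitly instead of adding a small constant, and lets you pass the true corpus sizes rather than inferring them from the counts you happen to pass in. `g2_zero_safe`, `total_a`, and `total_b` are hypothetical names; the convention that a zero count contributes nothing to the sum is the usual one (0 × ln 0 is treated as 0).

```python
import numpy as np

def g2_zero_safe(a, b, total_a=None, total_b=None):
    """G² per word; zero counts contribute 0 instead of relying on an epsilon."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Fall back to the sum of the supplied counts if no totals are given
    C = a.sum() if total_a is None else float(total_a)
    D = b.sum() if total_b is None else float(total_b)

    E1 = C * (a + b) / (C + D)
    E2 = D * (a + b) / (C + D)

    # Silence log(0) and 0/0 warnings; np.where replaces those entries with 0
    with np.errstate(divide="ignore", invalid="ignore"):
        term_a = np.where(a > 0, a * np.log(a / E1), 0.0)
        term_b = np.where(b > 0, b * np.log(b / E2), 0.0)
    return 2 * (term_a + term_b)

# "gun": 78 Democratic vs 0 Republican occurrences, with the lab's corpus totals
print(g2_zero_safe([78], [0], total_a=234_922, total_b=212_286))
```

Passing explicit totals matters when the table has been filtered, because the sum of the remaining counts is then smaller than the real corpus size.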

8.5 Using G² to find significant differences

# Calculate log-likelihood for all words
freq_table['g2'] = log_likelihood(
    freq_table['democrat'].values, 
    freq_table['republican'].values
)

# Sort by G² (most significant differences)
freq_table_sorted = freq_table.sort_values('g2', ascending=False).copy()

print("Words with highest G² (most significant differences):")
print(freq_table_sorted[['lemma', 'democrat', 'republican', 'g2']].head(20))

Words with highest G² (most significant differences):
            lemma  democrat  republican          g2
2095           do      1380         612  231.203248
2986          get       421         122  145.273635
4617           of      8965        9654  143.074502
7675          you       941         443  136.369324
7625         work      1034         505  135.944566
7507           we      5757        4214  108.982754
5829        right       503         198  108.139684
6817          the     14722       14995  106.400966
7343           us       174          28  103.171606
1226      college       174          29  100.740012
3123          gun        78           0  100.428162
3813          job       544         232   99.441736
1061         cent         6          86   91.520368
5163      present        86         238   90.417812
7555        which       805        1117   87.414119
7560          who       837         443   86.633621
1278      company       120          14   85.637924
4673           or       930         514   83.115776
2546  expenditure        28         130   82.156755
6803    terrorist        42         156   81.910170

Look at the G² values. Many are well above 6.63, meaning we can be very confident these differences are real.

8.6 How many significant differences did we find?

Let’s count how many words show statistically significant differences at different confidence levels:

# Count significant differences
sig_05 = (freq_table['g2'] > 3.84).sum()
sig_01 = (freq_table['g2'] > 6.63).sum()
sig_001 = (freq_table['g2'] > 10.83).sum()

print(f"Significant differences:")
print(f"  95% confident (G² > 3.84):   {sig_05} words")
print(f"  99% confident (G² > 6.63):   {sig_01} words")
print(f"  99.9% confident (G² > 10.83): {sig_001} words")
print(f"\nTotal words tested: {len(freq_table)}")

Significant differences:
  95% confident (G² > 3.84):   994 words
  99% confident (G² > 6.63):   737 words
  99.9% confident (G² > 10.83): 487 words

Total words tested: 2072

So we have hundreds of words with statistically significant differences. But are they all interesting?

8.7 The problem: Stop words dominate

Not all statistically significant differences are interesting. Let’s check what kinds of words have the highest G² values:

# Load stop words from spaCy
stop_words = nlp.Defaults.stop_words

# Check if top G² words are stop words
freq_table_sorted['is_stopword'] = freq_table_sorted['lemma'].isin(stop_words)

print("Top 20 by G² - are they stop words?")
print(freq_table_sorted[['lemma', 'g2', 'is_stopword']].head(20))

Top 20 by G² - are they stop words?
            lemma          g2  is_stopword
2095           do  231.203248         True
2986          get  145.273635         True
4617           of  143.074502         True
7675          you  136.369324         True
7625         work  135.944566        False
7507           we  108.982754         True
5829        right  108.139684        False
6817          the  106.400966         True
7343           us  103.171606         True
1226      college  100.740012        False
3123          gun  100.428162        False
3813          job   99.441736        False
1061         cent   91.520368        False
5163      present   90.417812        False
7555        which   87.414119         True
7560          who   86.633621         True
1278      company   85.637924        False
4673           or   83.115776         True
2546  expenditure   82.156755        False
6803    terrorist   81.910170        False

Notice that half of the top 20 words are stop words (words like “the”, “of”, and “we”).

Why does this happen?

Stop words appear thousands of times in our corpora
G² is sensitive to absolute frequencies - when a word appears 5,000 times, even a small proportional difference produces high G²
A word that’s 51% vs 49% between corpora can have higher G² than a word that’s 90% vs 10%, just because the first word is more common overall
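The last point can be checked numerically. The sketch below assumes, purely for simplicity, two corpora of equal size, so each corpus is expected to hold half of a word's occurrences. `g2_equal_corpora` and the example counts are hypothetical.

```python
import math

def g2_equal_corpora(a, b):
    """G² for one word, assuming two corpora of equal size."""
    expected = (a + b) / 2  # equal sizes -> each corpus expects half the total
    total = 0.0
    for observed in (a, b):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total

common = g2_equal_corpora(15_300, 14_700)  # 51% vs 49%, 30,000 occurrences
rare = g2_equal_corpora(9, 1)              # 90% vs 10%, 10 occurrences

print(f"51/49 split, very common word: G² = {common:.2f}")
print(f"90/10 split, rare word:        G² = {rare:.2f}")
```

The nearly balanced but very frequent word ends up with the larger G², even though the rare word is far more lopsided. G² rewards the amount of evidence, not the size of the contrast.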

The solution: Filter to focus on content words (nouns, verbs, adjectives) by removing stop words.

# Focus on content words by removing stop words
content_words = freq_table_sorted[~freq_table_sorted['is_stopword']].copy()

print("Top 20 content words by G²:")
print(content_words[['lemma', 'democrat', 'republican', 'g2']].head(20))

Top 20 content words by G²:
            lemma  democrat  republican          g2
7625         work      1034         505  135.944566
5829        right       503         198  108.139684
1226      college       174          29  100.740012
3123          gun        78           0  100.428162
3813          job       544         232   99.441736
1061         cent         6          86   91.520368
5163      present        86         238   90.417812
1278      company       120          14   85.637924
2546  expenditure        28         130   82.156755
6803    terrorist        42         156   81.910170
7477         want       326         119   80.271861
3164         hard       196          51   76.786385
1582        court        18         103   74.890387
3866         know       423         186   72.262874
5282     property         6          67   66.090068
776           big       120          25   58.448358
6891         tile         0          38   56.626794
5806      revenue        22          94   55.715059
4211      measure        66         166   55.275584
1830    dependent         2          46   54.495605

Much better! Now we’re seeing substantive words about policy, governance, and political issues.

8.8 What G² doesn’t tell us

G² tells us that a difference exists and how confident we can be about it. But it doesn’t tell us:

Which corpus uses the word more
How much more it’s used

For that, we need another measure: log odds ratio.

9 Measuring effect size: Log odds ratio

9.1 The problem: G² doesn’t tell us everything

Look back at the content words with high G² values. Can you quickly tell which party uses each word more once corpus size is taken into account? Is “present” more Democratic or Republican? What about “measure”?

G² told us that differences exist and that they’re statistically significant. But it doesn’t tell us:

Direction: Which corpus uses the word more?
Magnitude: Is it slightly more common, or dramatically more common?

For this, we need a different measure: log odds ratio.

9.2 What is log odds ratio?

Log odds ratio is a measure of effect size that answers:

“How much more is this word used in one corpus compared to the other?”

It gives us two pieces of information:

The sign (+ or -) tells us which corpus uses the word more
The number tells us how much more it’s used

9.3 How to read log odds values

In our analysis, we calculate log odds where:

Positive values = word is more common in Democratic speeches
Negative values = word is more common in Republican speeches
Zero = word is equally common in both

The magnitude tells us how big the difference is:

Log Odds	Meaning
+1.0	Word is 2× more common in Democratic speeches
+2.0	Word is 4× more common in Democratic speeches
+3.0	Word is 8× more common in Democratic speeches
-1.0	Word is 2× more common in Republican speeches
-2.0	Word is 4× more common in Republican speeches
0.0	Word is equally common in both

9.4 A concrete example

Let’s say the word “healthcare” appears:

200 times in Democratic speeches (out of 100,000 total Democratic words)
50 times in Republican speeches (out of 100,000 total Republican words)

The proportions are:

Democratic: 200/100,000 = 0.002 (0.2%)
Republican: 50/100,000 = 0.0005 (0.05%)

The ratio is 0.002/0.0005 = 4.0 (Democrats use it 4× more often).

Taking the base-2 logarithm: log₂(4.0) = 2.0

So this word would have a log odds ratio of +2.0, meaning Democrats use it 4× more than Republicans.

Why use logarithm?

Raw ratios are asymmetric and hard to interpret:

“2× more common” = ratio of 2.0
“2× less common” = ratio of 0.5

These don’t look symmetric even though they represent the same magnitude of difference.

Taking the logarithm makes them symmetric:

2× more common: log₂(2.0) = +1.0
2× less common: log₂(0.5) = -1.0

We use base-2 logarithm (log₂) because it’s easy to interpret:

Each +1 means “doubled”
Each -1 means “halved”

This makes effect sizes comparable across different words.
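The “healthcare” example and the symmetry argument can be verified in a few lines. The counts and corpus sizes are the hypothetical figures from the example above.

```python
import math

dem_count, dem_total = 200, 100_000
rep_count, rep_total = 50, 100_000

dem_prop = dem_count / dem_total   # 0.002
rep_prop = rep_count / rep_total   # 0.0005

# Democrats relative to Republicans
log_ratio = math.log2(dem_prop / rep_prop)

# Swapping the corpora flips the sign but keeps the magnitude
flipped = math.log2(rep_prop / dem_prop)

print(f"Dem vs Rep: {log_ratio:+.2f}")
print(f"Rep vs Dem: {flipped:+.2f}")
print(f"Fold change recovered from the log value: {2 ** log_ratio:.1f}x")
```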

The mathematical formula

Log odds ratio is calculated as:

\[\text{Log Odds Ratio} = \log_2\left(\frac{a/C}{b/D}\right)\]

Where:

$a$ = word count in Corpus A (Democrats)
$b$ = word count in Corpus B (Republicans)
$C$ = total size of Corpus A
$D$ = total size of Corpus B

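In other words, we compare the proportions (a/C vs b/D) of how often each corpus uses the word. Strictly speaking, this is a log ratio of relative frequencies: a true odds ratio would compare a/(C − a) with b/(D − b). For words that make up only a small share of each corpus, the two quantities are almost identical.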
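A strict odds ratio compares odds, a/(C − a) against b/(D − b), rather than proportions. The sketch below, with hypothetical function names, computes both versions for the “healthcare” example to show how close they are when a word is a small fraction of each corpus.

```python
import math

def log2_ratio(a, b, C, D):
    """log2 of the ratio of relative frequencies (the formula above)."""
    return math.log2((a / C) / (b / D))

def log2_odds_ratio(a, b, C, D):
    """log2 of the ratio of odds: a/(C - a) versus b/(D - b)."""
    return math.log2((a / (C - a)) / (b / (D - b)))

# "healthcare": 200 vs 50 occurrences in two corpora of 100,000 words each
ratio_version = log2_ratio(200, 50, 100_000, 100_000)
odds_version = log2_odds_ratio(200, 50, 100_000, 100_000)

print(f"Ratio of proportions: {ratio_version:.4f}")
print(f"Ratio of odds:        {odds_version:.4f}")
```

The two versions only diverge noticeably for words that account for a large share of a corpus, which is rare outside the most frequent function words.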

9.5 Calculating log odds ratio in Python

Let’s create a function to calculate log odds ratio for all our words:

def log_odds_ratio(a, b):
    """
    Calculate log odds ratio for word frequencies in two corpora.

    This function compares how often words appear in each corpus
    (accounting for corpus size) and returns a number telling us
    which corpus uses each word more and by how much.

    Positive values = more common in corpus A (Democrats)
    Negative values = more common in corpus B (Republicans)
    Magnitude = how much more (1 = 2×, 2 = 4×, 3 = 8×, etc.)

    Parameters:
    -----------
    a : array-like
        Word counts in corpus A (e.g., Democratic speeches)
    b : array-like
        Word counts in corpus B (e.g., Republican speeches)

    Returns:
    --------
    array-like
        Log odds ratios (base 2) for each word
    """
    # Total corpus sizes
    C = np.sum(a)  # Total words in corpus A
    D = np.sum(b)  # Total words in corpus B

    # Calculate proportions (what percentage of each corpus is this word?)
    prop_a = a / C
    prop_b = b / D

    # Calculate log odds ratio
    # Note: We add a tiny constant (1e-10) to avoid mathematical errors when counts are zero
    lor = np.log2((prop_a + 1e-10) / (prop_b + 1e-10))

    return lor
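Because the constant 1e-10 is added to proportions, a word that never occurs in one corpus gets a log value of roughly ±20 or more, and that size reflects the constant rather than the data. One common alternative, sketched here as an assumption and not as this lab's method, is to add a small pseudo-count to each word count before dividing. `log_ratio_smoothed`, `total_a`, `total_b`, and the default `k=0.5` are hypothetical choices.

```python
import numpy as np

def log_ratio_smoothed(a, b, total_a, total_b, k=0.5):
    """log2 ratio of relative frequencies with a pseudo-count k added to each count."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rate_a = (a + k) / total_a  # smoothed rate in corpus A
    rate_b = (b + k) / total_b  # smoothed rate in corpus B
    return np.log2(rate_a / rate_b)

# A word with 78 vs 0 occurrences, using the corpus totals from this lab
print(log_ratio_smoothed([78], [0], total_a=234_922, total_b=212_286))
```

With a pseudo-count, zero-count words still rank as strongly one-sided, but their values stay on the same scale as other words and grow with the amount of evidence. The choice of `k` is a judgment call and should be reported.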

9.6 Using log odds ratio to see which party uses each word

# Calculate log odds ratio
freq_table['log_odds'] = log_odds_ratio(
    freq_table['democrat'].values,
    freq_table['republican'].values
)

# Add to our content words table too
# (note: C and D are summed over content_words here, so the corpus sizes count content words only)
content_words['log_odds'] = log_odds_ratio(
    content_words['democrat'].values,
    content_words['republican'].values
)

print("Words most strongly associated with Democrats (positive log odds):")
print(content_words.nlargest(15, 'log_odds')[['lemma', 'democrat', 'republican', 'log_odds', 'g2']])

print("\nWords most strongly associated with Republicans (negative log odds):")
print(content_words.nsmallest(15, 'log_odds')[['lemma', 'democrat', 'republican', 'log_odds', 'g2']])

Words most strongly associated with Democrats (positive log odds):
          lemma  democrat  republican   log_odds          g2
3123        gun        78           0  23.022742  100.428162
4042   lobbyist        32           0  21.737340   41.201297
3706   internet        30           0  21.644231   38.626216
2085  diversity        28           0  21.544695   36.051135
6756       tech        24           0  21.322303   30.900973
6152     shrink        24           0  21.322303   30.900973
1397   conquest        22           0  21.196772   28.325892
4817  paperwork        22           0  21.196772   28.325892
4238     mental        20           0  21.059268   25.750811
7081         tv        20           0  21.059268   25.750811
6763       teen        18           0  20.907265   23.175730
91         96th        18           0  20.907265   23.175730
6734       tank        18           0  20.907265   23.175730
5617   reinvent        18           0  20.907265   23.175730
2887     french        18           0  20.907265   23.175730

Words most strongly associated with Republicans (negative log odds):
            lemma  democrat  republican   log_odds         g2
6891         tile         0          38 -22.100187  56.626794
53           11th         0          28 -21.659614  41.725006
3090        gross         0          26 -21.552699  38.744648
7036     tribunal         0          22 -21.311691  32.783933
3517        index         0          20 -21.174187  29.803576
244      advisory         0          20 -21.174187  29.803576
5256  prohibition         0          19 -21.100187  28.313397
3028     governor         0          16 -20.852259  23.842860
6279       solely         0          16 -20.852259  23.842860
6808      testing         0          16 -20.852259  23.842860
5564        refer         0          16 -20.852259  23.842860
5442     reaction         0          16 -20.852259  23.842860
6082      seventy         0          16 -20.852259  23.842860
7342     urgently         0          16 -20.852259  23.842860
5455      realism         0          15 -20.759150  22.352682

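Look closely at the output: every word in these two lists has a count of zero in one corpus. Their log odds of roughly ±21 to ±23 are artifacts of the tiny constant we added to avoid dividing by zero, not real effect sizes. For words that occur in both corpora, the values are much smaller and can be read like this: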

Positive log odds (e.g., +2.5) means Democrats use this word more (roughly 2^2.5 ≈ 5-6× more often)
Negative log odds (e.g., -1.8) means Republicans use this word more (roughly 2^1.8 ≈ 3-4× more often)

9.7 Reading the results: Putting it all together

For each word, we now have three key numbers:

Democrat count / Republican count: Raw frequencies (affected by corpus size)
Log odds ratio: Effect size - which party uses it more and by how much
G² value: Statistical significance - how unlikely a difference this large would be if it were due to chance alone

Example interpretation:

If you see a word with:

Log odds = +2.0
G² = 45.3

This means: “Democrats use this word about 4× more often than Republicans, and a difference this large would be very unlikely (p < 0.001) if both parties actually used the word at the same rate.”
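The step from a G² value to a p-value can be checked with a few lines of Python. This sketch assumes that G² approximately follows a chi-square distribution with 1 degree of freedom, the usual reference distribution when one word is compared across two corpora. The helper name `g2_to_p` is ours, not part of any library.

```python
import math

def g2_to_p(g2):
    """Approximate p-value for a G² statistic with 1 degree of freedom.

    For 1 df, the chi-square survival function equals erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(g2 / 2))

for g2 in (3.84, 6.63, 10.83, 45.3):
    print(f"G² = {g2:>5}: p ≈ {g2_to_p(g2):.2g}")

# Effect size: a log odds of +2.0 corresponds to a 2^2 = 4x difference
print(f"log odds +2.0 -> {2 ** 2.0:.0f}x more frequent")
```

The first three values are the familiar critical values for p < 0.05, p < 0.01, and p < 0.001. A G² of 45.3 lies far beyond all of them.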

Best practice: Filter for both significance AND effect size

Not every statistically significant difference is interesting. And not every large difference is reliable.

The most meaningful words are those that pass three tests:

Statistically significant: G² > 6.63 (p < 0.01)
Large effect: |log odds| > 0.5 (one corpus uses the word at least 2^0.5 ≈ 1.41× as often, i.e. about 40% more)
Not too rare: Appears more than 5 times in each corpus (reliable measurement)

Only words that pass all three tests are truly distinctive and reliable.

9.8 Finding the most meaningful differences

Let’s filter our results to find words that are both statistically significant AND show large effects:

# Find meaningful differences - must pass all three tests
meaningful = content_words[
    (content_words['g2'] > 6.63) &                    # Test 1: Statistically significant
    (np.abs(content_words['log_odds']) > 0.5) &       # Test 2: Large effect size
    (content_words['democrat'] > 5) &                 # Test 3: Not too rare
    (content_words['republican'] > 5)
].copy()

print(f"Words with significant AND large differences: {len(meaningful)}")
print("\nTop 10 most distinctively Democratic words:")
print(meaningful.nlargest(10, 'log_odds')[['lemma', 'democrat', 'republican', 'log_odds', 'g2']])

print("\nTop 10 most distinctively Republican words:")
print(meaningful.nsmallest(10, 'log_odds')[['lemma', 'democrat', 'republican', 'log_odds', 'g2']])

Words with significant AND large differences: 376

Top 10 most distinctively Democratic words:
           lemma  democrat  republican  log_odds          g2
1278     company       120          14  2.984616   85.637924
4296     minimum        56           8  2.692435   35.797119
2469   everybody        37           6  2.509570   21.825902
1226     college       174          29  2.470043  100.740012
3605  innovation        54          10  2.318039   28.953921
4071         lot        78          15  2.263592   40.605421
282   aggression        80          16  2.207008   40.338351
2955         gas        60          12  2.207008   30.253763
776          big       120          25  2.148115   58.448358
5008      planet        28           6  2.107472   13.304258

Top 10 most distinctively Republican words:
          lemma  democrat  republican  log_odds         g2
1061       cent         6          86 -3.956219  91.520368
5282   property         6          67 -3.596044  66.090068
3765      iraqi         6          48 -3.114917  41.579958
1263  commodity         6          42 -2.922272  34.142816
1743   decrease         6          40 -2.851883  31.708856
826       board         6          40 -2.851883  31.708856
5588     regime        10          61 -2.723727  46.054082
2921   function         8          48 -2.699880  35.895878
4907      pende         6          35 -2.659238  25.744069
1582      court        18         103 -2.631494  74.890387

These are the words that truly distinguish Democratic from Republican political rhetoric - they’re both statistically reliable and substantively important.

10 Visualizing differences

10.1 Bar chart of log odds ratios

# Get top 15 for each party
top_dem = meaningful.nlargest(15, 'log_odds')
top_rep = meaningful.nsmallest(15, 'log_odds')
top_both = pd.concat([top_dem, top_rep])

# Sort by log odds for plotting
top_both = top_both.sort_values('log_odds')

# Create plot
fig, ax = plt.subplots(figsize=(10, 8))

colors = ['#0015BC' if x > 0 else '#E81B23' for x in top_both['log_odds']]
ax.barh(range(len(top_both)), top_both['log_odds'], color=colors, alpha=0.7)
ax.set_yticks(range(len(top_both)))
ax.set_yticklabels(top_both['lemma'])
ax.axvline(0, color='black', linewidth=0.8, linestyle='--')
ax.set_xlabel('Log Odds Ratio (negative = Republican, positive = Democrat)', fontsize=12)
ax.set_title('Most Distinctive Words by Party', fontsize=14, fontweight='bold')

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#0015BC', alpha=0.7, label='More Democratic'),
    Patch(facecolor='#E81B23', alpha=0.7, label='More Republican')
]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.show()

10.2 Scatter plot: Significance vs effect size

A scatter plot helps us visualize the relationship between effect size (log odds ratio) and statistical significance (G²).

However, we need to be careful about very rare words. Words that appear only once or twice in one corpus but zero times in the other create extreme log odds ratios (dividing by near-zero) with low statistical significance. These are statistical artifacts, not meaningful patterns.

To avoid misleading visualizations, we’ll filter out words that don’t appear at least 5 times in both corpora:

# Create scatter plot
fig, ax = plt.subplots(figsize=(12, 8))

# Filter for plotting - remove very rare words that create artifacts
plot_data = content_words[
    (content_words['democrat'] >= 5) &
    (content_words['republican'] >= 5)
].copy()

# Color by which party uses word more
colors = ['#0015BC' if x > 0 else '#E81B23' for x in plot_data['log_odds']]

ax.scatter(plot_data['log_odds'], plot_data['g2'],
           c=colors, alpha=0.5, s=30)

# Add significance threshold line
ax.axhline(6.63, color='gray', linestyle='--', linewidth=1, alpha=0.7, label='p < 0.01')
ax.legend()  # Without this call the 'p < 0.01' label is never displayed

# Add effect size threshold lines
ax.axvline(-0.5, color='gray', linestyle=':', linewidth=1, alpha=0.5)
ax.axvline(0.5, color='gray', linestyle=':', linewidth=1, alpha=0.5)

ax.set_xlabel('Log Odds Ratio (negative = Republican, positive = Democrat)', fontsize=12)
ax.set_ylabel('Log-Likelihood (G²)', fontsize=12)
ax.set_title('Effect Size vs Statistical Significance', fontsize=14, fontweight='bold')

# Annotate the 10 meaningful words with the highest G²
# (head(10) would just pick the first 10 rows in table order)
for _, row in meaningful.nlargest(10, 'g2').iterrows():
    ax.annotate(row['lemma'],
                (row['log_odds'], row['g2']),
                fontsize=8, alpha=0.7,
                xytext=(5, 5), textcoords='offset points')

plt.tight_layout()
plt.show()

This scatter plot shows the relationship between:

X-axis: Effect size (how different?)
Y-axis: Statistical significance (how confident?)

The most interesting words are in the upper left and upper right corners - both statistically significant (high G²) and distinctive (large absolute log odds ratio). These are the words that show strong, reliable differences between the two parties.

Words near the bottom (low G²) may have large log odds ratios but aren’t statistically reliable - often because they’re too rare. The horizontal line at G² = 6.63 marks the p < 0.01 significance threshold.

11 Named entity recognition

So far we’ve analyzed individual words (lemmas). But sometimes we’re interested in references to real-world entities:

PERSON: Barack Obama, Hillary Clinton
ORG: United Nations, Department of Defense
GPE: America, Iraq, New York
DATE: tomorrow, 2020, next year
MONEY: $1 billion, five dollars

This is called Named Entity Recognition (NER), and spaCy does it automatically!

11.1 How NER works

NER is a classification task:

Identify spans of text that might be entities
Classify each span into entity types
Use machine learning models trained on annotated data

Modern NER systems use neural networks trained on large corpora of hand-labeled examples.
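To see the two steps (find a span, then assign it a type) in isolation, here is a deliberately naive sketch that uses a lookup table instead of a trained model. It is not how spaCy works internally. The gazetteer and the function name `toy_ner` are made up for illustration.

```python
import re

# Hypothetical mini-gazetteer: surface form -> entity type
GAZETTEER = {
    "Barack Obama": "PERSON",
    "United Nations": "ORG",
    "New York": "GPE",
    "Iraq": "GPE",
}

def toy_ner(text):
    """Return (start, end, span_text, label) for every known name in text."""
    entities = []
    for name, label in GAZETTEER.items():
        # Step 1: identify candidate spans (here: exact matches only)
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            # Step 2: classify the span (here: a simple lookup)
            entities.append((match.start(), match.end(), match.group(), label))
    return sorted(entities)

sentence = "Barack Obama spoke at the United Nations in New York about Iraq."
for start, end, span, label in toy_ner(sentence):
    print(f"{span:<16} {label:<7} chars {start}-{end}")
```

A lookup table like this cannot handle names it has never seen or words whose type depends on context, which is why real systems learn both steps from annotated data.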

11.2 Extracting entities with spaCy

Let’s look at entities in a sample speech:

# Get one speech
sample_speech = dem_speeches.iloc[0]['transcript'][:1000]  # First 1000 chars

# Process it
sample_doc = nlp(sample_speech)

# Display entities
print("Named entities found:\n")
print(f"{'Entity':<25} {'Type':<15} {'Explanation'}")
print("-" * 65)

for ent in sample_doc.ents:
    print(f"{ent.text:<25} {ent.label_:<15} {spacy.explain(ent.label_)}")

Named entities found:

Entity                    Type            Explanation
-----------------------------------------------------------------
Speaker                   PERSON          People, including fictional
Congress                  ORG             Companies, agencies, institutions, etc.
Americans                 NORP            Nationalities or religious or political groups
Tonight                   TIME            Times smaller than a day
the eighth year           DATE            Absolute or relative dates or periods
the State of the Union    ORG             Companies, agencies, institutions, etc.
Iowa                      GPE             Countries, cities, states
an election season        DATE            Absolute or relative dates or periods
this year                 DATE            Absolute or relative dates or periods
Speaker                   PERSON          People, including fictional
the end of last year      DATE            Absolute or relative dates or periods
this year                 DATE            Absolute or relative dates or periods
tonight                   TIME            Times smaller than a day
the year ahead            DATE            Absolute or relative dates or periods
Don                       PERSON          People, including fictional

Note that the model is not perfect. “Speaker” is a title rather than a person’s name, “the State of the Union” is a speech rather than an organization, and “Don” is probably a word fragment left over from cutting the text at 1000 characters. Always inspect a sample of the extracted entities before you start counting them.

11.3 Comparing entity usage across parties

Let’s extract all location entities (GPE = Geo-Political Entity) from both corpora:

# Extract GPE entities from both corpora
dem_locations = [ent.text.lower() for ent in dem_doc.ents if ent.label_ == 'GPE']
rep_locations = [ent.text.lower() for ent in rep_doc.ents if ent.label_ == 'GPE']

print(f"Democratic location mentions: {len(dem_locations)}")
print(f"Republican location mentions: {len(rep_locations)}")

# Count frequencies
dem_loc_counts = pd.Series(dem_locations).value_counts().reset_index()
dem_loc_counts.columns = ['location', 'democrat']

rep_loc_counts = pd.Series(rep_locations).value_counts().reset_index()
rep_loc_counts.columns = ['location', 'republican']

# Merge
location_freq = dem_loc_counts.merge(rep_loc_counts, on='location', how='outer').fillna(0)
location_freq['democrat'] = location_freq['democrat'].astype(int)
location_freq['republican'] = location_freq['republican'].astype(int)

# Keep locations mentioned at least 5 times in at least one of the corpora
location_freq = location_freq[
    (location_freq['democrat'] >= 5) | (location_freq['republican'] >= 5)
].copy()

print(f"\nLocations mentioned frequently:")
print(location_freq.head(15))

Democratic location mentions: 2133
Republican location mentions: 2007

Locations mentioned frequently:
       location  democrat  republican
3   afghanistan        40          57
5        alaska        10           2
7       america       554         701
19    australia         0           6
21      baghdad         0          18
28      belgium         6           2
29       berlin        20           2
38       brazil         4           6
42        burma         6           4
44   california        18          13
45       canada         2          16
52      chicago         4           6
54        china        68          31
57     colombia        14           0
61        congo         6           2

11.4 Statistical comparison of locations

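# Calculate G² and log odds for locations
# Note: the functions take the column sums as corpus sizes, so each location is
# compared against all (filtered) location mentions, not against all words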
location_freq['g2'] = log_likelihood(
    location_freq['democrat'].values,
    location_freq['republican'].values
)

location_freq['log_odds'] = log_odds_ratio(
    location_freq['democrat'].values,
    location_freq['republican'].values
)

# Find significant differences
sig_locations = location_freq[
    (location_freq['g2'] > 6.63) &
    (location_freq['democrat'] >= 3) &
    (location_freq['republican'] >= 3)
].copy()

print("Locations with significant usage differences:\n")
print("Most Democratic:")
print(sig_locations.nlargest(10, 'log_odds')[['location', 'democrat', 'republican', 'log_odds', 'g2']])

print("\nMost Republican:")
print(sig_locations.nsmallest(10, 'log_odds')[['location', 'democrat', 'republican', 'log_odds', 'g2']])

Locations with significant usage differences:

Most Democratic:
                         location  democrat  republican  log_odds         g2
96                        germany        24           4  2.539343  15.224309
62                           cuba        16           3  2.369418   9.359102
158                        mexico        16           3  2.369418   9.359102
113                         india        16           4  1.954381   7.335339
238                        russia        36          11  1.664874  13.230245
290  the united states of america        48          20  1.217415  11.011168
284              the soviet union        64          28  1.147026  13.355114
54                          china        68          31  1.087647  13.024438
314                       vietnam        48          22  1.079912   9.087790
7                         america       554         701 -0.385148  22.219891

Most Republican:
                         location  democrat  republican  log_odds         g2
119                          iraq        19         119 -2.692510  83.903263
261                        states        59          88 -0.622408   6.712539
288             the united states       116         163 -0.536366   9.511364
7                         america       554         701 -0.385148  22.219891
314                       vietnam        48          22  1.079912   9.087790
54                          china        68          31  1.087647  13.024438
284              the soviet union        64          28  1.147026  13.355114
290  the united states of america        48          20  1.217415  11.011168
238                        russia        36          11  1.664874  13.230245
113                         india        16           4  1.954381   7.335339

Note that the two lists overlap. Only 13 locations pass the filter, so nlargest(10) and nsmallest(10) each return most of the table. The “Most Democratic” list ends with america, which has a negative log odds, and only the first four entries of the “Most Republican” list (iraq, states, the united states, america) actually lean Republican. To get clean lists, filter by sign first, for example with sig_locations[sig_locations['log_odds'] > 0]. Also notice that america, the united states, and the united states of america are counted as separate entities even though they refer to the same country.

11.5 Visualizing location mentions

if len(sig_locations) > 0:
    # Get top locations for each party
    top_dem_loc = sig_locations.nlargest(10, 'log_odds')
    top_rep_loc = sig_locations.nsmallest(10, 'log_odds')
    top_loc = pd.concat([top_dem_loc, top_rep_loc]).drop_duplicates().sort_values('log_odds')
    
    # Plot
    fig, ax = plt.subplots(figsize=(10, 6))
    colors = ['#0015BC' if x > 0 else '#E81B23' for x in top_loc['log_odds']]
    
    ax.barh(range(len(top_loc)), top_loc['log_odds'], color=colors, alpha=0.7)
    ax.set_yticks(range(len(top_loc)))
    ax.set_yticklabels(top_loc['location'])
    ax.axvline(0, color='black', linewidth=0.8, linestyle='--')
    ax.set_xlabel('Log Odds Ratio (negative = Republican, positive = Democrat)', fontsize=12)
    ax.set_title('Geographic Focus: Location Mentions by Party', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
else:
    print("Not enough significant location differences in our sample.")

Other entity types

You can analyze other entity types the same way:

PERSON: Which individuals are mentioned?
ORG: What organizations are discussed?
DATE: How are temporal references used?
MONEY: How are financial amounts discussed?

Try exploring these in the exercises!

12 Summary and key takeaways

12.1 What we learned

Today we covered methods for statistically comparing corpora:

Lemmatization → Reducing words to dictionary forms
Corpus preparation → Creating contrasting text collections
Log-likelihood (G²) → Testing statistical significance
Log odds ratio → Measuring effect size
Named entity recognition → Extracting references to real-world entities

12.2 Key concepts

Lemmatization: Reducing wordforms to their base dictionary form (lemma)
Contrasting corpora: Collections of texts from different sources for comparison
Log-likelihood (G²): Statistical test for significance of frequency differences
Log odds ratio: Measure of effect size (how much more frequent)
Named entity: Reference to a real-world entity (person, place, organization)
Effect size vs significance: Significance = confidence; effect size = magnitude

12.3 Critical insights

Don’t trust p-values alone!

A word can be:

Highly significant but barely different (large sample)
Highly different but not significant (rare word)

Always report both significance and effect size.
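A small numerical experiment illustrates both cases. The sketch assumes the two-term form of G² described by Rayson & Garside (2000). All counts and corpus sizes are hypothetical.

```python
import math

def g2(a, b, C, D):
    """Two-term log-likelihood; a zero count contributes nothing."""
    total = 0.0
    for observed, size in ((a, C), (b, D)):
        expected = size * (a + b) / (C + D)
        if observed > 0:
            total += observed * math.log(observed / expected)
    return 2 * total

def log_ratio(a, b, C, D):
    """Log2 ratio of relative frequencies (undefined when b is 0)."""
    return math.log2((a / C) / (b / D))

# Same 10% difference in relative frequency, at two sample sizes
small = (110, 100, 100_000, 100_000)
large = (11_000, 10_000, 10_000_000, 10_000_000)

for name, case in (("small sample", small), ("100x larger", large)):
    print(f"{name:<13} log ratio = {log_ratio(*case):.3f}, G² = {g2(*case):.2f}")

# A rare word: 3 occurrences vs 0 looks like a huge difference,
# but G² stays below the 6.63 threshold
print(f"rare word     G² = {g2(3, 0, 100_000, 100_000):.2f}")
```

The effect size is identical in the first two rows, yet only the larger sample crosses the significance threshold. The rare word shows the opposite problem: an extreme ratio with too little evidence behind it.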

Preprocessing choices matter

Lemmatize or not?
Remove stop words or not?
Filter by part of speech or not?

Each choice affects your results. Make them explicit and justified.

12.4 Statistical comparison workflow

Prepare corpora → Split data into contrasting groups
Lemmatize → Reduce morphological variation (if appropriate)
Count frequencies → Create frequency table
Filter → Remove very rare words, stop words (if appropriate)
Calculate G² → Test significance
Calculate log odds → Measure effect size
Filter meaningful differences → Both significant AND large
Interpret → What do the differences tell us?
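The counting, filtering, and testing steps of this workflow can be sketched end to end in plain Python. The token lists below are hypothetical and stand in for two already-lemmatized corpora. The thresholds follow the ones discussed in this lab, and the G² function assumes the two-term form.

```python
import math
from collections import Counter

# Hypothetical, already-lemmatized tokens standing in for two corpora
dem_tokens = ["college"] * 40 + ["court"] * 5 + ["nation"] * 100 + ["planet"]
rep_tokens = ["college"] * 8 + ["court"] * 30 + ["nation"] * 100

# Count frequencies
dem_counts, rep_counts = Counter(dem_tokens), Counter(rep_tokens)
C, D = len(dem_tokens), len(rep_tokens)

def g2(a, b):
    """Two-term log-likelihood; a zero count contributes nothing."""
    total = 0.0
    for observed, size in ((a, C), (b, D)):
        expected = size * (a + b) / (C + D)
        if observed > 0:
            total += observed * math.log(observed / expected)
    return 2 * total

distinctive = {}
for word in sorted(set(dem_counts) | set(rep_counts)):
    a, b = dem_counts[word], rep_counts[word]
    if a < 5 or b < 5:                       # Filter: drop rare words
        continue
    significance = g2(a, b)                  # Calculate G²
    effect = math.log2((a / C) / (b / D))    # Calculate log odds
    if significance > 6.63 and abs(effect) > 0.5:  # Keep meaningful differences
        distinctive[word] = (round(effect, 2), round(significance, 2))

print(distinctive)
```

In this toy data, “nation” is frequent but evenly distributed and “planet” is too rare, so only “college” (Democratic-leaning) and “court” (Republican-leaning) survive all three filters.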

13 Exercises

13.1 Exercise 1: Full corpus analysis

We used samples for speed in this lab. Now process the full corpora:

Process all Democratic and all Republican speeches (not just samples)
Calculate G² and log odds ratio for all lemmas
Identify the 20 most distinctive content words for each party
Create visualizations

Note: This will take 10-15 minutes to process!

13.2 Exercise 2: Stop word investigation

Investigate whether stop words show political patterns:

Filter for only stop words in your frequency table
Calculate G² and log odds ratio
Which stop words differ most between parties?
Can you interpret why? (Think about formality, rhetorical style)

13.3 Exercise 3: Temporal comparison

Instead of comparing parties, compare time periods:

Split speeches into before/after 1970 (or another meaningful date)
Calculate distinctive words for each period
What changes in American political discourse can you observe?

13.4 Exercise 4: Named entity deep dive

Choose one entity type (PERSON, ORG, or DATE) and:

Extract all entities of that type from both corpora
Calculate frequency differences
Identify significant patterns
Interpret: What do these patterns reveal about political priorities?

13.5 Exercise 5: Part-of-speech patterns (Advanced)

Compare parts of speech:

Count how often each POS tag appears in each corpus
Do Democrats use more adjectives? Republicans more verbs?
Calculate significance and effect size
What might linguistic differences reveal about rhetorical style?

13.6 Exercise 6: Creating your own contrasting corpora

Think of another comparison that interests you in the State of the Union data:

War vs peace time presidents
First term vs second term speeches
19th vs 20th vs 21st century
High vs low approval ratings (you’d need to add this data)

Design and execute your own corpus comparison study.

14 References and further reading

14.1 Academic papers

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61-74. https://aclanthology.org/J93-1003.pdf
Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. Proceedings of the Workshop on Comparing Corpora, 1-6. https://doi.org/10.3115/1117729.1117730
Monroe, B. L., Colaresi, M. P., & Quinn, K. M. (2008). Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4), 372-403. https://doi.org/10.1093/pan/mpn018

14.2 Textbooks

Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed., draft). Chapter 2 (Regular Expressions, Text Normalization, Edit Distance). https://web.stanford.edu/~jurafsky/slp3/
Silge, J., & Robinson, D. (2017). Text Mining with R. Chapter 4 (Relationships between words). https://www.tidytextmining.com/ngrams.html

14.3 Tutorials

spaCy documentation on lemmatization: https://spacy.io/usage/linguistic-features#lemmatization
spaCy documentation on NER: https://spacy.io/usage/linguistic-features#named-entities
Log-likelihood calculator and explanation: http://ucrel.lancs.ac.uk/llwizard.html

14.4 Tools

spaCy: Industrial-strength NLP library - https://spacy.io
NLTK: Classic Python NLP toolkit - https://www.nltk.org
Lancaster Stats Tools: Log-likelihood calculator - http://ucrel.lancs.ac.uk/llwizard.html

End of Lab 02