Lab 01: Words as data points I

Zipf’s law, stop words and stylometry

Published

2026-01-25 11:57:21

1 Learning Objectives

By the end of this lab, you will understand:

  • How to transform unstructured text into structured data
  • What tokenization is and why it matters
  • How word frequencies reveal patterns in text
  • Zipf’s Law and its implications for text analysis
  • What stop words are and when to remove them
  • The basics of text preprocessing
  • What stylometry is and how function words reveal authorship
  • How to use Python and pandas for basic text analysis

2 Introduction: Why Text as Data?

Text is everywhere: social media posts, news articles, scientific papers, government documents, customer reviews. But text in its raw form is unstructured - it’s just sequences of characters. To analyze text computationally, we need to transform it into structured data that computers can process mathematically.

Think of it this way: if you wanted to compare two novels, you could read both and form an impression. But what if you had 1,000 novels? Or 100,000 tweets? This is where computational text analysis becomes essential.

2.1 Our Dataset: State of the Union Addresses

Today we’ll work with a collection of State of the Union addresses - speeches given by US presidents from the 18th century to 2018. These speeches are:

  • Historical: Spanning over 200 years
  • Political: Reflecting different eras and priorities
  • Comparable: Same genre, same occasion, different authors

This makes them perfect for learning text analysis techniques.

Data source: Brian Weinstein’s State of the Union corpus


3 Setup: Loading Packages

First, let’s load the Python packages we’ll need:

# you will need to run it once after installing spaCy
!python -m spacy download en_core_web_sm
# Data manipulation
import pandas as pd

# Text processing
import spacy
from collections import Counter

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For word clouds
from wordcloud import WordCloud

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Packages loaded successfully")
✓ Packages loaded successfully
Note📌 About these packages
  • pandas: Works with tabular data (like spreadsheets)
  • spaCy: Processes natural language text
  • matplotlib/seaborn: Create visualizations
  • wordcloud: Generate word cloud visualizations

4 Loading and Exploring the Data

Let’s load our dataset:

# Load the data
speeches = pd.read_csv("data/transcripts.csv")

# Display first few rows
print("Dataset shape:", speeches.shape)
print("\nFirst few rows:")
speeches.head()
Dataset shape: (244, 5)

First few rows:
date president title url transcript
0 2018-01-30 Donald J. Trump Address Before a Joint Session of the Congress... https://www.cnn.com/2018/01/30/politics/2018-s... \nMr. Speaker, Mr. Vice President, Members of ...
1 2017-02-28 Donald J. Trump Address Before a Joint Session of the Congress http://www.presidency.ucsb.edu/ws/index.php?pi... Thank you very much. Mr. Speaker, Mr. Vice Pre...
2 2016-01-12 Barack Obama Address Before a Joint Session of the Congress... http://www.presidency.ucsb.edu/ws/index.php?pi... Thank you. Mr. Speaker, Mr. Vice President, Me...
3 2015-01-20 Barack Obama Address Before a Joint Session of the Congress... http://www.presidency.ucsb.edu/ws/index.php?pi... The President. Mr. Speaker, Mr. Vice President...
4 2014-01-28 Barack Obama Address Before a Joint Session of the Congress... http://www.presidency.ucsb.edu/ws/index.php?pi... The President. Mr. Speaker, Mr. Vice President...

4.1 Understanding the Data Structure

Our data is in tabular format - like a spreadsheet with rows and columns:

# Check column names and types
print("Columns:")
print(speeches.dtypes)

# Basic statistics
print("\nNumber of speeches:", len(speeches))
print("Number of presidents:", speeches['president'].nunique())
print("\nDate range:")
print("  Earliest:", speeches['date'].min())
print("  Latest:", speeches['date'].max())
Columns:
date          object
president     object
title         object
url           object
transcript    object
dtype: object

Number of speeches: 244
Number of presidents: 42

Date range:
  Earliest: 1790-01-08
  Latest: 2018-01-30

Each row represents one speech. The columns contain:

  • date: When the speech was given
  • president: Who gave it
  • title: The speech’s official title
  • url: Where the transcript came from
  • transcript: The actual text of the speech

Let’s look at a short excerpt:

# Display a snippet of one speech
sample_text = speeches.loc[0, 'transcript'][:500]  # First 500 characters
print("Sample from first speech:")
print(sample_text)
print("...")
Sample from first speech:

Mr. Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States, and my fellow Americans:
Less than 1 year has passed since I first stood at this podium, in this majestic chamber, to speak on behalf of the American People -- and to address their concerns, their hopes, and their dreams.  That night, our new Administration had already taken swift action.  A new tide of optimism was already sweeping across our land.
Each day since, we have gone forward with a clear vision
...
Tip💡 Why This Structure?

Text data commonly comes with metadata (data about data):

  • Author, date, source, title, etc.
  • This metadata helps us compare and analyze texts
  • We keep text and metadata together in a table

5 From Text to Tokens - The Bag of Words Model

5.1 The Challenge: Text is Unstructured

Right now, our transcript column contains long strings of text. How do we measure or compare them? We can’t easily do math on text!

5.2 The Solution: Tokenization

Tokenization is the process of splitting text into smaller units called tokens. Usually, tokens are words, but they could also be:

  • Characters (letters)
  • Sentences
  • N-grams (sequences of N words)

Let’s see this in action:

# Simple example: split text into words
example = "Python is great for text analysis"
words = example.split()
print("Original text:", example)
print("Tokens (words):", words)
print("Number of tokens:", len(words))
Original text: Python is great for text analysis
Tokens (words): ['Python', 'is', 'great', 'for', 'text', 'analysis']
Number of tokens: 6

5.3 Tokenizing Our Speeches

Now let’s tokenize one full speech:

# Get one speech
speech_text = speeches.loc[0, 'transcript']

# Simple tokenization (just splitting on spaces)
simple_tokens = speech_text.split()

print(f"Speech length: {len(speech_text)} characters")
print(f"Number of tokens: {len(simple_tokens)}")
print(f"\nFirst 20 tokens:")
print(simple_tokens[:20])
Speech length: 30354 characters
Number of tokens: 5190

First 20 tokens:
['Mr.', 'Speaker,', 'Mr.', 'Vice', 'President,', 'Members', 'of', 'Congress,', 'the', 'First', 'Lady', 'of', 'the', 'United', 'States,', 'and', 'my', 'fellow', 'Americans:', 'Less']

5.4 The Bag of Words Model

Once we have tokens, we can treat text as a “bag of words” - we:

  1. Ignore word order (“cat dog” = “dog cat”)
  2. Count word frequencies
  3. Represent text as numbers

This might seem like we’re throwing away information (word order matters!), but this simple model is surprisingly powerful for many tasks.

Let’s create a bag of words for one speech:

# Count word frequencies
word_counts = Counter(simple_tokens)

# Show most common words
print("Top 15 most frequent words:")
for word, count in word_counts.most_common(15):
    print(f"  {word}: {count}")
Top 15 most frequent words:
  the: 215
  and: 184
  to: 175
  of: 121
  our: 95
  we: 88
  a: 79
  in: 72
  is: 61
  are: 48
  that: 45
  --: 44
  have: 40
  for: 34
  will: 34

5.5 Two Data Formats: Long vs Wide

We can represent tokenized text in two ways:

Long format: One row per word occurrence

president    word
Trump        the
Trump        economy
Trump        is
Trump        strong

Wide format: One row per document, one column per unique word

president    the    economy    is    strong
Trump        150    12         89    5

For now, we’ll work with long format because it’s easier to understand and manipulate.


6 Tokenizing All Speeches

Let’s tokenize all speeches and create a long-format dataset:

# Initialize empty list to store results
all_tokens = []

# Process each speech
for idx, row in speeches.iterrows():
    president = row['president']
    text = row['transcript']
    
    # Simple tokenization
    tokens = text.lower().split()  # Convert to lowercase
    
    # Add each token to our list
    for token in tokens:
        all_tokens.append({
            'president': president,
            'word': token
        })

# Convert to DataFrame
tokens_df = pd.DataFrame(all_tokens)

print(f"Total number of tokens: {len(tokens_df):,}")
print(f"\nFirst few rows:")
tokens_df.head(10)
Total number of tokens: 3,947,946

First few rows:
president word
0 Donald J. Trump mr.
1 Donald J. Trump speaker,
2 Donald J. Trump mr.
3 Donald J. Trump vice
4 Donald J. Trump president,
5 Donald J. Trump members
6 Donald J. Trump of
7 Donald J. Trump congress,
8 Donald J. Trump the
9 Donald J. Trump first

6.1 How many unique words?

unique_words = tokens_df['word'].nunique()
print(f"Number of unique words: {unique_words:,}")
Number of unique words: 68,356

That’s a lot of unique words! But are they all useful?


7 Word Frequencies

Now let’s count how often each word appears:

# Count word frequencies
word_freq = tokens_df['word'].value_counts().reset_index()
word_freq.columns = ['word', 'count']

print("Top 20 most frequent words:")
print(word_freq.head(20))
Top 20 most frequent words:
     word   count
0     the  326862
1      of  212531
2      to  135329
3     and  133944
4      in   85772
5       a   61992
6    that   47130
7     for   43079
8      be   40521
9     our   38759
10     is   36752
11     by   32429
12     it   29271
13     we   27276
14   have   26784
15   this   26594
16     as   26510
17   with   26400
18  which   26237
19   will   21913

7.1 Visualizing Frequency

# Plot top 15 words
top_15 = word_freq.head(15)

plt.figure(figsize=(10, 6))
sns.barplot(data=top_15, y='word', x='count', palette='viridis')
plt.title('Top 15 Most Frequent Words in State of the Union Addresses', fontsize=14, fontweight='bold')
plt.xlabel('Frequency (count)', fontsize=12)
plt.ylabel('Word', fontsize=12)
plt.tight_layout()
plt.show()
/tmp/ipykernel_1863094/2171022789.py:5: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(data=top_15, y='word', x='count', palette='viridis')

7.2 What do we notice?

The most frequent words are:

  • Articles: “the”, “a”, “an”
  • Prepositions: “of”, “to”, “in”, “for”
  • Conjunctions: “and”, “but”, “or”

These are grammatical words that appear in almost any English text. They don’t tell us much about the content of the speeches!

This observation leads us to an important concept…


8 Zipf’s Law - A Fundamental Pattern

8.1 What is Zipf’s Law?

If you count word frequencies in any large collection of text and rank words by frequency, you’ll observe something remarkable:

The most common word appears roughly twice as often as the second most common word, three times as often as the third most common, and so on.

This is called Zipf’s Law (named after linguist George Kingsley Zipf).

8.2 Why Does This Matter?

Zipf’s Law tells us that:

  1. A few words are very common (“the”, “of”, “and”)
  2. Most words are very rare (appear only once or twice)
  3. This pattern is universal - it appears in all languages and genres

This has practical implications:

  • Most words provide little statistical power (they’re too rare)
  • A small vocabulary covers most of any text
  • We need strategies to deal with this imbalance

8.3 Visualizing Zipf’s Law

Let’s see if our data follows Zipf’s Law:

# Add rank to our frequency table
word_freq['rank'] = range(1, len(word_freq) + 1)

# Create log-log plot
plt.figure(figsize=(10, 6))
plt.loglog(word_freq['rank'], word_freq['count'], 'b.')
plt.xlabel('Rank (log scale)', fontsize=12)
plt.ylabel('Frequency (log scale)', fontsize=12)
plt.title("Zipf's Law in State of the Union Addresses", fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

The nearly straight line on a log-log plot confirms Zipf’s Law!

8.4 The Long Tail

Let’s look at how many words appear only once or twice:

# Count words by frequency
freq_distribution = word_freq['count'].value_counts().sort_index()

print("Distribution of word frequencies:")
print(f"  Words appearing once: {freq_distribution.get(1, 0):,}")
print(f"  Words appearing twice: {freq_distribution.get(2, 0):,}")
print(f"  Words appearing 3-5 times: {freq_distribution.loc[3:5].sum():,}")
print(f"  Words appearing 6-10 times: {freq_distribution.loc[6:10].sum():,}")
print(f"  Words appearing > 100 times: {(word_freq['count'] > 100).sum():,}")
Distribution of word frequencies:
  Words appearing once: 9,835
  Words appearing twice: 27,310
  Words appearing 3-5 times: 8,157
  Words appearing 6-10 times: 7,957
  Words appearing > 100 times: 3,484

This is the “long tail” - many rare words, few common words.

Important🎯 Key Insight

Zipf’s Law means that word frequencies are highly skewed. This affects how we:

  • Build statistical models
  • Select features for machine learning
  • Preprocess text
  • Interpret results

9 Stop Words and Preprocessing

9.1 What are Stop Words?

Stop words are extremely common words that appear frequently in almost any text:

  • Articles: the, a, an
  • Prepositions: in, on, at, to, from
  • Pronouns: I, you, he, she, it
  • Conjunctions: and, but, or
  • Auxiliary verbs: is, are, was, were

9.2 Why Remove Stop Words?

There are two main perspectives:

Reasons to remove stop words:

  1. Content analysis: They don’t tell us about the topic
  2. Computational efficiency: Fewer words = faster processing
  3. Signal-to-noise: They can overwhelm more informative words

Reasons to keep stop words:

  1. Stylometry: They reveal personal writing style
  2. Syntax: Needed for parsing sentence structure
  3. Meaning: Sometimes they matter (“not good” ≠ “good”)

9.3 Stop Words in Our Data

Let’s see what Python’s spaCy library considers stop words:

# Load English language model (small version)
try:
    nlp = spacy.load("en_core_web_sm")
except:
    # If not installed, show installation instructions
    print("Please install spaCy model:")
    print("!python -m spacy download en_core_web_sm")
    # For now, use a simple stopword list
    from spacy.lang.en.stop_words import STOP_WORDS
    stopwords_set = STOP_WORDS
else:
    stopwords_set = nlp.Defaults.stop_words

print(f"Number of stop words: {len(stopwords_set)}")
print(f"\nFirst 30 stop words (alphabetically):")
print(sorted(list(stopwords_set))[:30])
Number of stop words: 326

First 30 stop words (alphabetically):
["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any']

9.4 Comparing with Our Most Frequent Words

# Check which top words are stop words
top_30 = word_freq.head(30).copy()
top_30['is_stopword'] = top_30['word'].isin(stopwords_set)

print("Top 30 words with stop word labels:")
print(top_30[['word', 'count', 'is_stopword']])
Top 30 words with stop word labels:
     word   count  is_stopword
0     the  326862         True
1      of  212531         True
2      to  135329         True
3     and  133944         True
4      in   85772         True
5       a   61992         True
6    that   47130         True
7     for   43079         True
8      be   40521         True
9     our   38759         True
10     is   36752         True
11     by   32429         True
12     it   29271         True
13     we   27276         True
14   have   26784         True
15   this   26594         True
16     as   26510         True
17   with   26400         True
18  which   26237         True
19   will   21913         True
20     on   20689         True
21      i   20687         True
22    has   19896         True
23    are   19619         True
24   been   19135         True
25    not   18597         True
26  their   16784         True
27   from   16055         True
28     at   14991         True
29    all   13608         True

Notice how most of the top frequent words are stop words!

9.5 Removing Stop Words

# Filter out stop words
content_words = word_freq[~word_freq['word'].isin(stopwords_set)].copy()

print(f"Words before removing stop words: {len(word_freq):,}")
print(f"Words after removing stop words: {len(content_words):,}")
print(f"\nTop 30 content words:")
print(content_words.head(30))
Words before removing stop words: 68,356
Words after removing stop words: 68,058

Top 30 content words:
            word  count  rank
35    government  11209    36
38        united  10158    39
42        states   9524    43
44      congress   8597    45
52           new   7120    53
56         great   6714    57
59        public   6520    60
62        people   6229    63
66          year   5918    67
69      american   5575    70
72          time   5162    73
76      national   4888    77
81       country   4367    82
82       present   4331    83
88       federal   4093    89
89         state   4073    90
90         shall   4006    91
91           war   3987    92
99          work   3501   100
100          act   3485   101
102      foreign   3369   103
103        years   3319   104
105        power   3206   106
106      general   3191   107
107          law   3171   108
108        world   3168   109
112       system   2931   113
116    necessary   2878   117
117     increase   2818   118
118  legislation   2790   119

Now we see more content-bearing words: government, states, congress, country, people, etc.

9.6 Visualizing Content Words

# Create word cloud from content words
wordcloud_dict = dict(zip(content_words['word'].head(100), 
                          content_words['count'].head(100)))

wordcloud = WordCloud(width=800, height=400, 
                      background_color='white',
                      colormap='viridis').generate_from_frequencies(wordcloud_dict)

plt.figure(figsize=(14, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Frequent Content Words in State of the Union Addresses', 
          fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

Tip💡 Preprocessing is a Choice

Whether to remove stop words depends on your research question:

  • Topic modeling: Usually remove
  • Sentiment analysis: Keep (negations matter!)
  • Stylometry: Keep (they reveal style!)
  • Document classification: Test both

10 Introduction to Stylometry

10.1 What is Stylometry?

Stylometry is the statistical analysis of writing style. It’s used to:

  • Attribute authorship: Who wrote this anonymous text?
  • Detect plagiarism: Did someone copy another’s style?
  • Study literary history: How did an author’s style evolve?
  • Forensics: Analyze threatening letters or disputed documents

10.2 The Surprising Power of Function Words

You might think content words (nouns, verbs) reveal writing style. But actually, function words (the stop words we just removed!) are more revealing because:

  1. Unconscious use: Authors don’t think about them
  2. High frequency: Provide strong statistical signal
  3. Stable patterns: Less affected by topic
  4. Individual variation: People use them differently

10.3 A Simple Example

Different presidents might discuss the same topics but use different grammatical structures:

  • We must ensure…” vs “I believe we must…”
  • “This is important” vs “This has been important”
  • The people” vs “Our people”

These tiny differences add up to distinctive “stylistic fingerprints.”

10.4 Analyzing Presidential Style

Let’s look at how often different presidents use personal pronouns:

# Define personal pronouns
personal_pronouns = ['i', 'me', 'my', 'we', 'us', 'our', 'you', 'your']

# Filter for pronouns only
pronoun_df = tokens_df[tokens_df['word'].isin(personal_pronouns)].copy()

# Count by president
pronoun_by_president = (pronoun_df.groupby(['president', 'word'])
                        .size()
                        .reset_index(name='count'))

# Calculate total words per president for normalization
words_per_president = tokens_df.groupby('president').size()

# Normalize (calculate rate per 1000 words)
pronoun_by_president = pronoun_by_president.merge(
    words_per_president.reset_index(name='total_words'),
    on='president'
)
pronoun_by_president['rate_per_1000'] = (
    (pronoun_by_president['count'] / pronoun_by_president['total_words']) * 1000
)

# Show some examples
print("Sample of pronoun usage rates (per 1000 words):")
print(pronoun_by_president[pronoun_by_president['president'].isin(
    ['Donald J. Trump', 'Barack Obama', 'George W. Bush']
)].head(15))
Sample of pronoun usage rates (per 1000 words):
          president  word  count  total_words  rate_per_1000
24     Barack Obama     i    857       106566       8.041965
25     Barack Obama    me    120       106566       1.126063
26     Barack Obama    my    180       106566       1.689094
27     Barack Obama   our   1811       106566      16.994163
28     Barack Obama    us    287       106566       2.693167
29     Barack Obama    we   1927       106566      18.082691
30     Barack Obama   you    349       106566       3.274966
31     Barack Obama  your    113       106566       1.060376
56  Donald J. Trump     i    105        15058       6.973038
57  Donald J. Trump    me      5        15058       0.332049
58  Donald J. Trump    my     37        15058       2.457166
59  Donald J. Trump   our    331        15058      21.981671
60  Donald J. Trump    us     55        15058       3.652543
61  Donald J. Trump    we    319        15058      21.184752
62  Donald J. Trump   you     35        15058       2.324346

10.5 Visualizing Style Differences

Let’s compare how different presidents use “I” vs “we”:

# Focus on 'I' and 'we'
i_we_df = pronoun_by_president[pronoun_by_president['word'].isin(['i', 'we'])].copy()

# Pivot for easier plotting
i_we_pivot = i_we_df.pivot(index='president', columns='word', values='rate_per_1000').fillna(0)

# Get presidents with most speeches for clearer visualization
top_presidents = (tokens_df['president']
                  .value_counts()
                  .head(10)
                  .index)

i_we_plot = i_we_pivot.loc[i_we_pivot.index.isin(top_presidents)]

# Create scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(i_we_plot['i'], i_we_plot['we'], s=100, alpha=0.6)

# Label points
for idx, row in i_we_plot.iterrows():
    # Shorten long names
    name = idx.split()[-1]  # Just last name
    plt.annotate(name, (row['i'], row['we']), 
                xytext=(5, 5), textcoords='offset points', fontsize=9)

plt.xlabel('Rate of "I" (per 1000 words)', fontsize=12)
plt.ylabel('Rate of "we" (per 1000 words)', fontsize=12)
plt.title('Presidential Pronoun Usage: "I" vs "We"', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Note📌 What This Tells Us

Presidents who use “I” more often might be:

  • Speaking in a more personal style
  • Taking individual responsibility
  • Modern era (contemporary style)

Presidents who use “we” more might be:

  • Emphasizing collective action
  • Speaking for the nation
  • Earlier era (formal style)

These patterns can distinguish authors even when topics overlap!

10.6 The Role of PCA (Principal Component Analysis)

In a full stylometric analysis, we would:

  1. Count many function words (not just pronouns)
  2. Have dozens or hundreds of features
  3. Need to reduce complexity to visualize patterns

PCA (Principal Component Analysis) helps by:

  • Finding the main “directions” of variation
  • Reducing many features to 2-3 dimensions
  • Allowing us to plot and compare texts

What PCA gives us (practically):

  • A 2D plot where similar authors cluster together
  • Ability to spot outliers or disputed authorship
  • Quantified measure of stylistic distance

We won’t dive into the mathematics here, but know that PCA is a standard tool for reducing complex data to interpretable patterns.


11 Summary and Key Takeaways

11.1 What We Learned

Today we covered the fundamental workflow of computational text analysis:

  1. Load text data → Working with structured formats (CSV, DataFrame)
  2. Tokenize → Split text into words (or other units)
  3. Count frequencies → Transform text into numbers
  4. Explore patterns → Zipf’s Law, frequency distributions
  5. Preprocess → Remove stop words (when appropriate)
  6. Analyze style → Use function words for stylometry

11.2 Key Concepts

Tokenization
Splitting text into units (words, characters, sentences)
Bag of Words
Treating text as unordered collection of words
Zipf’s Law
Word frequency follows power law distribution (few common, many rare)
Stop Words
High-frequency grammatical words with little content
Stylometry
Statistical analysis of writing style using function words
Preprocessing
Transforming raw text for analysis (lowercase, remove punctuation, etc.)

11.3 Tools in Your Toolkit

Task Python Tool
Load data pandas.read_csv()
Tokenize str.split() or spaCy
Count frequencies Counter() or value_counts()
Remove stop words spaCy stop word list
Visualize matplotlib, seaborn, wordcloud

11.4 Next Steps

In future labs, we’ll explore:

  • More sophisticated tokenization (handling punctuation, contractions)
  • N-grams (sequences of words)
  • TF-IDF weighting (smarter than raw counts)

12 Exercises

12.1 Exercise 1: Basic Frequency Analysis

Pick any president from the dataset and:

  1. Extract all their speeches
  2. Tokenize and count word frequencies
  3. Create a bar plot of their top 20 words
  4. Create a word cloud

12.2 Exercise 2: Stop Word Impact

Compare word frequencies with and without stop words:

  1. Calculate top 50 words with stop words
  2. Calculate top 50 words without stop words
  3. How many overlap? What changes?

12.3 Exercise 3: Pronoun Style

Choose three presidents and compare their use of:

  • First person singular (“I”, “me”, “my”)
  • First person plural (“we”, “us”, “our”)

Create a visualization showing the differences.

12.4 Exercise 4: Historical Change (Advanced)

Compare speeches before and after 1950:

  1. Split the dataset into two time periods
  2. Calculate word frequencies for each period
  3. Identify words that became more/less common
  4. Create a Zipf’s Law plot for each period

13 References and Further Reading

13.1 Academic Papers

13.2 Tutorials and Books

13.3 Documentation


End of Lab 01