Topic modeling I: Discovering themes in text

Latent Dirichlet Allocation (LDA)

Published

2026-01-25 11:57:17

1 The challenge of large text collections

Imagine you are a researcher studying consumer feedback for a popular product. You have collected 5,000 user reviews from an online platform. Your research question is: what themes appear in these reviews? What do users like? What frustrates them?

Reading all 5,000 reviews would take weeks. Even if you sampled 100 reviews, you might miss important patterns that appear elsewhere in the collection. You could use dictionary methods (from Lab 03) to measure sentiment, but that requires knowing in advance which words indicate positive or negative opinions.

This is where topic modeling becomes useful. It is a computational method that discovers themes (topics) in a collection of documents automatically, without requiring predefined word lists or categories.

1.1 What topic modeling discovers

Topic modeling identifies two key patterns:

Topics: Groups of words that tend to appear together across documents. Each topic represents a theme or discourse pattern.

Topic proportions: How much each document discusses each topic. Documents are not assigned to a single topic; instead, they are mixtures of multiple topics.

For example, in a collection of product reviews, topic modeling might discover:

  • Topic 0 (Technical Issues): “bug”, “crash”, “freeze”, “error”, “broken”
  • Topic 1 (Usability): “easy”, “intuitive”, “confusing”, “learn”, “interface”
  • Topic 2 (Features): “love”, “feature”, “missing”, “wish”, “add”
  • Topic 3 (Value): “price”, “expensive”, “worth”, “cheap”, “money”

For each review, the model estimates topic proportions. For instance: “This review is 10% Topic 0 (technical issues), 60% Topic 1 (usability), 20% Topic 2 (features), and 10% Topic 3 (value).”

This representation helps us understand not just what individual reviews say, but what themes structure the entire corpus.

2 What is Latent Dirichlet Allocation?

Latent Dirichlet Allocation (LDA) is the most widely used topic modeling algorithm. The name sounds intimidating, but the concept is intuitive.

2.1 The intuition

Imagine you are reading a document and trying to guess what topics it discusses. You notice certain words appear frequently: “island”, “villagers”, “crafting”, “design”. From these words, you infer the document discusses gameplay mechanics and creative aspects of a video game.

LDA reverses this process. It observes which words appear in documents and infers:

  1. What topics exist in the corpus
  2. Which words characterize each topic
  3. Which topics appear in each document

LDA makes two key assumptions:

  • Every document is a mixture of topics (e.g., 60% gameplay, 40% aesthetics)
  • Every topic is a mixture of words (e.g., gameplay topic has high probability for “play”, “fun”, “mechanics”)

By analyzing word co-occurrence patterns across all documents, LDA discovers topics that explain the data well.

2.2 A visual analogy

Think of documents as recipes and topics as cuisine types (Italian, Mexican, Thai). Each recipe contains ingredients (words) from multiple cuisines. A “fusion” recipe might be 70% Italian (pasta, tomato, basil) and 30% Thai (lemongrass, lime, chili).

Topic modeling is like reverse-engineering cuisine types from a cookbook where cuisine labels were removed. By observing which ingredients co-occur, you can infer that dishes with pasta, tomato, and basil represent “Italian cuisine,” while dishes with rice, soy sauce, and ginger represent “Asian cuisine.”

Note: Unsupervised learning

LDA is an unsupervised learning method. Unlike supervised learning (where we provide labeled examples), unsupervised learning discovers patterns in data without being told what to look for. We don’t tell LDA what topics exist; it infers them from word patterns.

LDA uses a probabilistic generative model. It assumes each document was created by:

  1. Choosing a distribution over topics (e.g., 60% Topic A, 40% Topic B)
  2. For each word in the document:
    • Choose a topic according to the distribution
    • Choose a word from that topic’s word distribution

LDA inverts this process: given the observed words, it infers the topic distributions that most likely generated the data. This inference relies on Bayesian statistics and approximate inference algorithms, such as Gibbs sampling or variational inference.

The mathematics involves Dirichlet distributions (hence the name), but understanding the intuition is sufficient for using topic modeling in research.
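The generative story above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two topics, their word probabilities, and the Dirichlet parameter `alpha` are made up, and the helper names `sample_dirichlet` and `generate_document` are hypothetical. It shows how a document would be generated, not how LDA is fitted.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one toy document following LDA's generative story."""
    # Step 1: choose this document's distribution over topics
    theta = sample_dirichlet(alpha, rng)
    topic_names = list(topics)
    words = []
    for _ in range(n_words):
        # Step 2a: choose a topic according to the document's distribution
        topic = rng.choices(topic_names, weights=theta, k=1)[0]
        # Step 2b: choose a word from that topic's word distribution
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return dict(zip(topic_names, theta)), words

# Hypothetical topics with made-up word probabilities (each sums to 1)
toy_topics = {
    "gameplay": {"play": 0.5, "fun": 0.3, "mechanics": 0.2},
    "aesthetics": {"design": 0.5, "music": 0.3, "cute": 0.2},
}

rng = random.Random(42)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], n_words=12, rng=rng)
print({name: round(p, 2) for name, p in theta.items()})
print(words)
```

Fitting LDA runs this story in reverse: only the words are observed, and both the topic mixture and the topics themselves have to be estimated.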

3 Setup and data

We will use the Animal Crossing: New Horizons user reviews dataset. This dataset contains reviews from Metacritic posted in March-May 2020 following the game’s release. You can find the dataset here: https://github.com/rfordatascience/tidytuesday/tree/main/data/2020/2020-05-05

Note: About this dataset

This dataset contains user reviews of Animal Crossing: New Horizons from its release in March 2020. You will notice that many early reviews focus on a specific complaint: Nintendo’s decision to limit players to one island per console.

This reflects “review bombing” - coordinated negative reviews about a specific issue. While this might seem problematic, it makes the dataset excellent for learning topic modeling:

  1. We can verify topics work (we KNOW this theme exists in the data)
  2. We see how corpus composition affects topic modeling results
  3. We can explore whether review patterns change over time
  4. We learn about limitations and appropriate use of computational methods

Important caveat: Because many reviews discuss the same complaint, this corpus is not ideal for topic modeling. You will see lower coherence scores (Part 2) than you would with a more diverse corpus. This is a feature, not a bug - it teaches you to recognize when topic modeling struggles due to data characteristics.

We will use Python’s gensim library for topic modeling: https://radimrehurek.com/gensim/.

Let’s load the necessary libraries and data:

import pandas as pd
import numpy as np
import spacy
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

print("Libraries loaded successfully")
Libraries loaded successfully

Now load and explore the data:

# Load reviews
df = pd.read_csv('data/animal-crossing/user_reviews.tsv', sep='\t')

print(f"Total reviews: {len(df):,}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few reviews:")
df.head()
Total reviews: 2,999

Columns: ['grade', 'user_name', 'text', 'date']

First few reviews:
grade user_name text date
0 4 mds27272 My gf started playing before me. No option to ... 2020-03-20
1 5 lolo2178 While the game itself is great, really relaxin... 2020-03-20
2 0 Roachant My wife and I were looking forward to playing ... 2020-03-20
3 0 Houndf We need equal values and opportunities for all... 2020-03-20
4 0 ProfessorFox BEWARE! If you have multiple people in your h... 2020-03-20

Let’s examine the rating distribution:

# Rating distribution
print("Rating distribution (0-10 scale):")
print(df['grade'].value_counts().sort_index())

# Create sentiment categories
df['sentiment'] = pd.cut(
    df['grade'],
    bins=[-1, 3, 7, 10],
    labels=['negative', 'neutral', 'positive']
)

print(f"\nSentiment categories:")
print(df['sentiment'].value_counts())
Rating distribution (0-10 scale):
grade
0     1158
1      255
2      131
3       98
4      105
5       78
6       44
7       34
8       91
9      253
10     752
Name: count, dtype: int64

Sentiment categories:
sentiment
negative    1642
positive    1096
neutral      261
Name: count, dtype: int64
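The bin edges `[-1, 3, 7, 10]` deserve a note: by default `pd.cut` builds right-closed intervals, here (-1, 3], (3, 7] and (7, 10], so the lowest edge must sit below 0 for grade 0 to land in the first bin. The plain-Python sketch below mimics that mapping with `bisect` and re-derives the category counts from the printed rating distribution. The function name `grade_to_sentiment` is hypothetical, and this is not how pandas implements binning.

```python
from bisect import bisect_left

# Upper edges of the three right-closed intervals (-1, 3], (3, 7], (7, 10]
UPPER_EDGES = [3, 7, 10]
LABELS = ["negative", "neutral", "positive"]

def grade_to_sentiment(grade):
    """Map a 0-10 grade to the label of the interval that contains it."""
    if not 0 <= grade <= 10:
        raise ValueError(f"grade out of range: {grade}")
    return LABELS[bisect_left(UPPER_EDGES, grade)]

# Grade counts copied from the rating distribution printed above
grade_counts = {0: 1158, 1: 255, 2: 131, 3: 98, 4: 105, 5: 78,
                6: 44, 7: 34, 8: 91, 9: 253, 10: 752}

sentiment_counts = {label: 0 for label in LABELS}
for grade, count in grade_counts.items():
    sentiment_counts[grade_to_sentiment(grade)] += count

print(sentiment_counts)
# {'negative': 1642, 'neutral': 261, 'positive': 1096}
```

The totals match the `value_counts()` output, which confirms that grades 0-3 are treated as negative, 4-7 as neutral, and 8-10 as positive.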

Show a few example reviews to understand the data:

# Sample reviews from different sentiment categories
print("="*70)
print("NEGATIVE REVIEW (Grade 0/10):")
print("="*70)
negative_sample = df[df['grade'] == 0].iloc[0]
print(f"{negative_sample['text'][:500]}...\n")

print("="*70)
print("POSITIVE REVIEW (Grade 10/10):")
print("="*70)
positive_sample = df[df['grade'] == 10].iloc[0]
print(f"{positive_sample['text'][:500]}...")
======================================================================
NEGATIVE REVIEW (Grade 0/10):
======================================================================
My wife and I were looking forward to playing this game when it released. I bought it, I let her play first she made an island and played for a bit. Then I decided to play only to discover that Nintendo only allows one island per switch! Not only that, the second player cannot build anything on the island and tool building is considerably harder to do. So, if you have more than one personMy wife and I were looking forward to playing this game when it released. I bought it, I let her play first s...

======================================================================
POSITIVE REVIEW (Grade 10/10):
======================================================================
Cant stop playing!...

4 Preprocessing for topic modeling

Topic modeling requires clean, meaningful tokens. We need to transform raw text into a format LDA can analyze. This involves several steps, each serving a specific purpose.

4.1 Why preprocessing matters

Raw text contains noise that obscures meaningful patterns. Consider these issues:

  • Stopwords: Common function words (“the”, “is”, “and”) appear in every document but don’t distinguish topics
  • Inflection: “running”, “runs”, “ran” are different forms of the same concept
  • Capitalization: “Game” and “game” should be treated identically
  • Length: Very short words (< 4 characters) are often abbreviations or noise

Preprocessing addresses these issues systematically.

Note: Preprocessing for online reviews

User-generated content on the internet often contains elements that need special handling:

  • Emojis (😭, ⭐, 💔)
  • Internet slang (“lol”, “tbh”, “ngl”, “gg”)
  • Game-specific abbreviations (“acnh”, “ac”, “nh”)

Our preprocessing handles these automatically:

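  • token.is_alpha removes emojis and numbers (only alphabetic characters)
  • len(token) > 3 removes abbreviations and slang of up to three characters (four-letter forms such as “acnh” still pass)
  • Lemmatization reduces inflected forms to a base form; it does not correct informal spellings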

Although we will not do it here, with this kind of data it can also be useful to run a spellchecker before lemmatization. However, automatic spellchecking is not error-free, so you risk losing community-specific language.

Tip: A note on “Gamerspeak”

We use token.is_alpha to remove numbers and emojis for simplicity. However, in a specialized study of gaming culture, you might want to keep alphanumeric terms like “PS4”, “3DS”, or “10/10”. In that case, you would adjust the filtering logic to allow mixed alphanumeric tokens.

This pattern works for analyzing social media posts, product reviews, forum discussions, and other online text.
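To see what the character and length filters do to typical review tokens, here is a minimal sketch on an already-split token list. It relies on Python's built-in `str.isalpha()`, which the spaCy documentation describes as equivalent to `token.is_alpha`. The function name `keep_token` and the `allow_alnum` flag are hypothetical, and stopword removal and lemmatization are deliberately left out.

```python
def keep_token(token, allow_alnum=False):
    """Return True if a raw token passes the character and length filters."""
    if token.isalpha():
        # Alphabetic tokens must be longer than three characters
        return len(token) > 3
    if allow_alnum:
        # Optional: keep mixed letter-digit tokens such as "PS4" or "3DS"
        has_digit = any(ch.isdigit() for ch in token)
        has_letter = any(ch.isalpha() for ch in token)
        return token.isalnum() and has_digit and has_letter
    return False

# Made-up tokens for illustration
tokens = ["The", "villagers", "are", "adorable", "but", "acnh",
          "crashes", "lol", "10/10", "😭", "PS4", "3DS"]

default_kept = [t.lower() for t in tokens if keep_token(t)]
gamer_kept = [t.lower() for t in tokens if keep_token(t, allow_alnum=True)]

print(default_kept)  # ['villagers', 'adorable', 'acnh', 'crashes']
print(gamer_kept)    # ['villagers', 'adorable', 'acnh', 'crashes', 'ps4', '3ds']
```

Note that the four-letter abbreviation "acnh" survives the default filter, so domain abbreviations of that length need a custom stopword list if you want them removed. "10/10" is dropped in both variants because it contains a slash.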

4.2 Calculating text length

Before preprocessing, let’s filter reviews by length. Very short reviews (< 50 words) often lack enough substance for topic modeling. LDA in particular also performs poorly on short texts such as tweets (search for ‘twitterLDA’).

# Calculate word count
df['word_count'] = df['text'].str.split().str.len()

print("Word count distribution:")
print(df['word_count'].describe())

# How many reviews are substantial?
print(f"\nReviews with >= 50 words: {(df['word_count'] >= 50).sum():,} ({(df['word_count'] >= 50).sum()/len(df)*100:.1f}%)")
print(f"Reviews with < 50 words: {(df['word_count'] < 50).sum():,} ({(df['word_count'] < 50).sum()/len(df)*100:.1f}%)")

# Filter for quality
df_filtered = df[df['word_count'] >= 50].copy()
print(f"\nFiltered corpus: {len(df_filtered):,} reviews")
Word count distribution:
count    2999.000000
mean      120.710237
std       129.331087
min         1.000000
25%        30.000000
50%        59.000000
75%       181.000000
max       995.000000
Name: word_count, dtype: float64

Reviews with >= 50 words: 1,738 (58.0%)
Reviews with < 50 words: 1,261 (42.0%)

Filtered corpus: 1,738 reviews

4.3 Deduplication

Next, we check for and remove duplicate reviews. We have not done this before, but with text collected online it is a useful step: duplication is a widespread problem, and many sophisticated tools exist for finding duplicates. If duplicates stay in the corpus, they can bias the model toward the repeated content, so that we fail to find smaller and more interesting topics. Duplicates can also tell you something about the data-generating process.

# Check for exact duplicates
duplicates = df_filtered.duplicated(subset=['text']).sum()
print(f"Exact duplicate reviews: {duplicates}")

# Remove duplicates
df_filtered = df_filtered.drop_duplicates(subset=['text'])
print(f"After deduplication: {len(df_filtered):,} reviews")
Exact duplicate reviews: 3
After deduplication: 1,735 reviews
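Exact matching misses reviews that differ only in capitalization, punctuation, or a few added words. One common approach to such near-duplicates, sketched here as an illustration rather than as part of this tutorial's pipeline, is to compare sets of word n-grams ("shingles") with Jaccard similarity. The example reviews, the function names, and the 0.7 threshold are all assumptions.

```python
import re

def word_shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate_pairs(texts, threshold=0.7):
    """Compare all pairs of texts; return (i, j, similarity) at or above the threshold."""
    shingle_sets = [word_shingles(t) for t in texts]
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            sim = jaccard(shingle_sets[i], shingle_sets[j])
            if sim >= threshold:
                pairs.append((i, j, round(sim, 2)))
    return pairs

# Made-up reviews for illustration only
reviews = [
    "Only one island per console, this is a terrible decision by Nintendo.",
    "only ONE island per console!! This is a terrible decision by Nintendo. Shame.",
    "Relaxing music and cute villagers, I play it every evening.",
]
print(near_duplicate_pairs(reviews))  # [(0, 1, 0.91)]
```

The all-pairs loop is quadratic in the number of documents, which is fine for a few thousand reviews. Larger corpora usually need an approximate technique such as MinHash.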

4.4 Tokenization and lemmatization

Now we process the text using spaCy. We will tokenize (split into words), lemmatize (reduce to base form), and filter:

# Load spaCy English model
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

def process_text(texts):
    """
    Clean and tokenize texts for topic modeling.
    
    Parameters:
        texts: List of text strings to process
    
    Returns:
        List of token lists (one list per document)
    """
    processed_docs = []
    
    # Custom stopwords for this domain
    # These words appear in almost every review and don't help distinguish topics
    custom_stops = {'game', 'play', 'nintendo', 'switch', 'island'}
    
    for doc in nlp.pipe(texts, batch_size=50):
        # Extract tokens with filtering:
        # - token.lemma_: Base form of word (running → run)
        # - token.is_stop: Remove common words (the, is, and)
        # - token.is_punct: Remove punctuation
        # - token.is_alpha: Remove numbers and emojis (only letters)
        # - len() > 3: Remove short words (abbreviations, slang)
        tokens = [token.lemma_.lower() for token in doc 
                  if not token.is_stop 
                  and not token.is_punct 
                  and token.is_alpha 
                  and len(token) > 3
                  and token.lemma_.lower() not in custom_stops]
        
        processed_docs.append(tokens)
    
    return processed_docs

# Process all reviews (this takes a minute or two)
print("Processing reviews (this may take 1-2 minutes)...")
texts_processed = process_text(df_filtered['text'].tolist())

# Show example transformation
print("\n" + "="*70)
print("EXAMPLE: Original vs Processed")
print("="*70)
print(f"Original (first 100 chars):\n{df_filtered.iloc[0]['text'][:100]}...\n")
print(f"Processed tokens (first 15):\n{texts_processed[0][:15]}")
print(f"\nTotal tokens in this review: {len(texts_processed[0])}")
Processing reviews (this may take 1-2 minutes)...

======================================================================
EXAMPLE: Original vs Processed
======================================================================
Original (first 100 chars):
My gf started playing before me. No option to create my own island and guys, being the 2nd player to...

Processed tokens (first 15):
['start', 'option', 'create', 'guy', 'player', 'start', 'console', 'suck', 'miss', 'player', 'get', 'term', 'activity', 'resource', 'absolutely']

Total tokens in this review: 22

4.5 Building dictionary and corpus

LDA requires two data structures:

Dictionary: Maps each unique word to a numeric ID. For example, “villager” might be ID 42, “craft” might be ID 108.

Corpus: Represents each document as a bag-of-words - a list of (word_id, count) tuples. Word order is ignored; only counts matter.

This representation is called “bag-of-words” because we treat documents like bags of tokens, ignoring grammar and word order.

# Create dictionary
id2word = corpora.Dictionary(texts_processed)

print(f"Initial vocabulary size: {len(id2word):,} unique words")

# Filter extremes:
# - no_below: Remove words appearing in < 5 documents (too rare)
# - no_above: Remove words appearing in > 80% of documents (too common)
id2word.filter_extremes(no_below=5, no_above=0.8)

print(f"After filtering extremes: {len(id2word):,} unique words")

# Create corpus (bag-of-words representation)
corpus = [id2word.doc2bow(text) for text in texts_processed]

print(f"\nCorpus size: {len(corpus):,} documents")
Initial vocabulary size: 7,820 unique words
After filtering extremes: 1,671 unique words

Corpus size: 1,735 documents

Let’s examine the bag-of-words representation:

# Show bag-of-words for first document
print("First document (bag-of-words):")
print("Format: (word_id, count)")
print(corpus[0][:10])  # First 10 word-count pairs

print("\nDecoded (showing actual words):")
for word_id, count in corpus[0][:10]:
    print(f"  {id2word[word_id]}: {count}")
First document (bag-of-words):
Format: (word_id, count)
[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]

Decoded (showing actual words):
  absolutely: 1
  activity: 1
  console: 2
  create: 1
  experience: 1
  get: 1
  guy: 1
  household: 1
  miss: 1
  option: 1
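To make the two data structures concrete, the sketch below rebuilds them in plain Python on a toy corpus: a document-frequency filter in the spirit of `no_below` / `no_above`, followed by a bag-of-words conversion. The toy documents and the function names are made up, and IDs are assigned alphabetically here, which is an arbitrary choice of this sketch. gensim's own `filter_extremes` additionally caps the vocabulary size through its `keep_n` parameter.

```python
from collections import Counter

def build_vocabulary(docs, no_below=2, no_above=0.8):
    """Keep tokens whose document frequency lies within the given bounds."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(tok for tok, df in doc_freq.items()
                  if no_below <= df <= no_above * n_docs)
    return {tok: idx for idx, tok in enumerate(kept)}

def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Toy corpus: "console" is in every document, "villager" and "typo" in only one
toy_docs = [
    ["console", "share", "console", "family"],
    ["console", "villager", "craft"],
    ["console", "share", "craft", "craft"],
    ["console", "family", "typo"],
]

token2id = build_vocabulary(toy_docs, no_below=2, no_above=0.8)
print(token2id)                       # {'craft': 0, 'family': 1, 'share': 2}
print(to_bow(toy_docs[2], token2id))  # [(0, 2), (2, 1)]
```

"console" is removed as too common (4 of 4 documents exceeds 80%), while "villager" and "typo" are removed as too rare. The third document then reduces to two counts: "craft" twice and "share" once.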

5 Building the topic model

Now we train the LDA model. We will use k=4 topics for this tutorial. In Part 2, you will learn how to choose the optimal number of topics using coherence scores.

Note: Why k=4?

The number of topics (k) is a modeling choice, not a property of the data. For this tutorial, we use k=4 to keep interpretation manageable. With too few topics, themes get mixed together; with too many, topics become redundant or incoherent.

In Part 2 (Evaluation and Validation), you will learn systematic methods for choosing k, including coherence scores and stability analysis.

# Train LDA model
print("Training LDA model (this may take a minute)...")

lda_model = gensim.models.LdaMulticore(
    corpus=corpus,
    id2word=id2word,
    num_topics=4,
    random_state=42,  # For reproducibility
    passes=10,        # Number of training passes through corpus
    workers=1         # Single worker for consistency
)

print("Model training complete!")
Training LDA model (this may take a minute)...
Model training complete!

6 Interpreting topics

The most important step in topic modeling is interpretation. We need to examine what the model discovered and assign meaningful labels to topics.

6.1 Topic-word distributions

Each topic is characterized by its top words - the words with highest probability in that topic:

print("="*70)
print("DISCOVERED TOPICS")
print("="*70)

for idx in range(4):
    print(f"\nTopic {idx}:")
    # Get top 15 words for this topic
    top_words = lda_model.show_topic(idx, topn=15)
    
    # Format as: word (probability)
    words_formatted = [f"{word} ({prob:.3f})" for word, prob in top_words]
    print("  " + ", ".join(words_formatted))
======================================================================
DISCOVERED TOPICS
======================================================================

Topic 0:
  player (0.021), people (0.019), experience (0.016), expand (0.014), console (0.014), animal (0.013), thing (0.012), review (0.012), like (0.012), crossing (0.012), want (0.010), share (0.009), think (0.009), family (0.009), get (0.008)

Topic 1:
  player (0.067), second (0.018), console (0.016), progress (0.015), want (0.015), expand (0.014), person (0.012), like (0.011), start (0.010), multiplayer (0.010), get (0.009), main (0.008), family (0.008), buy (0.007), experience (0.007)

Topic 2:
  time (0.021), animal (0.019), crossing (0.019), like (0.018), expand (0.012), thing (0.012), good (0.011), feel (0.009), want (0.009), craft (0.008), review (0.008), series (0.008), great (0.007), villager (0.007), item (0.006)

Topic 3:
  console (0.033), share (0.018), expand (0.016), save (0.016), want (0.014), animal (0.014), crossing (0.014), player (0.013), multiple (0.013), experience (0.012), people (0.012), account (0.012), buy (0.011), family (0.010), person (0.010)

6.2 Labeling topics

Based on the top words, we can assign interpretive labels to each topic. This requires domain judgment - what themes do these words represent?

To get a sense of a topic, you will need two things:

  1. A list of top words, ranked by their probability within the topic;
  2. A list of top documents, ranked by the proportion of the topic in each document.

For each topic, read the word list first to get an initial sense of the subject. Then read the top documents to test your initial ideas. About 100 words and 100 documents are usually enough to label a topic with confidence.

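Let’s analyze each topic. Treat the labels in the code below as initial placeholders and compare them with the top words: some do not fit (Topic 3’s top words, such as “console”, “share”, “save”, and “account”, point to console sharing rather than aesthetics), so revise them after reading the top documents.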

# Topic labels (you would refine these based on actual output)
topic_labels = {
    0: "Multiplayer/Island Limitations",
    1: "Gameplay and Enjoyment", 
    2: "Game Mechanics and Features",
    3: "Aesthetics and Design"
}

print("TOPIC LABELS:")
print("="*70)
for idx, label in topic_labels.items():
    print(f"Topic {idx}: {label}")
    top_words = [word for word, prob in lda_model.show_topic(idx, topn=10)]
    print(f"  Top words: {', '.join(top_words)}\n")
TOPIC LABELS:
======================================================================
Topic 0: Multiplayer/Island Limitations
  Top words: player, people, experience, expand, console, animal, thing, review, like, crossing

Topic 1: Gameplay and Enjoyment
  Top words: player, second, console, progress, want, expand, person, like, start, multiplayer

Topic 2: Game Mechanics and Features
  Top words: time, animal, crossing, like, expand, thing, good, feel, want, craft

Topic 3: Aesthetics and Design
  Top words: console, share, expand, save, want, animal, crossing, player, multiple, experience

Tip: What makes a “good” topic?

A coherent topic has words that:

  • Relate to a common theme or concept
  • Make semantic sense together
  • Are distinguishable from other topics

A “junk” topic has words that:

  • Are high-frequency but unrelated
  • Don’t form a clear theme
  • Overlap heavily with other topics

If you see junk topics, you may need to:

  • Adjust preprocessing (add custom stopwords)
  • Change the number of topics (k)
  • Filter your corpus differently
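One quick way to check the "overlap heavily" criterion is to count shared words between the topics' top-word lists. The sketch below does this for the four top-10 lists printed in section 6.2. The variable names are hypothetical, and the shared count is only a rough heuristic, not a substitute for reading documents.

```python
from itertools import combinations

# Top-10 words per topic, copied from the output in section 6.2
top_words = {
    0: ["player", "people", "experience", "expand", "console",
        "animal", "thing", "review", "like", "crossing"],
    1: ["player", "second", "console", "progress", "want",
        "expand", "person", "like", "start", "multiplayer"],
    2: ["time", "animal", "crossing", "like", "expand",
        "thing", "good", "feel", "want", "craft"],
    3: ["console", "share", "expand", "save", "want",
        "animal", "crossing", "player", "multiple", "experience"],
}

# Shared top words for every pair of topics
overlap = {}
for i, j in combinations(top_words, 2):
    overlap[(i, j)] = sorted(set(top_words[i]) & set(top_words[j]))
    print(f"Topic {i} vs Topic {j}: {len(overlap[(i, j)])} shared -> "
          f"{', '.join(overlap[(i, j)])}")

# Words that appear in every topic's top 10
in_every_topic = set.intersection(*(set(words) for words in top_words.values()))
print(f"In every topic's top 10: {sorted(in_every_topic)}")
```

Every pair of topics shares at least three of its ten top words, and Topics 0 and 3 share six. "expand" appears in all four lists, which makes it a candidate for the custom stopword list.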

6.3 Interactive exploration with pyLDAvis

pyLDAvis creates an interactive visualization for exploring topics. This is one of the most powerful tools for understanding topic models:

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Prepare visualization
vis_data = gensimvis.prepare(lda_model, corpus, id2word, sort_topics=False)

# Display (in Jupyter/Quarto this will be interactive)
pyLDAvis.display(vis_data)

Note: Reading pyLDAvis

Left panel (topic circles):

  • Each circle represents a topic
  • Circle size = topic prevalence (how common)
  • Distance between circles = topic similarity
  • Topics far apart are more distinct

Right panel (word bars):

  • Shows top words for selected topic
  • Red bar = estimated frequency of the word within the selected topic
  • Blue bar = overall frequency of the word in the whole corpus

Lambda (λ) slider:

  • λ = 1: Shows most frequent words in topic
  • λ = 0: Shows most exclusive words (unique to topic)
  • λ ≈ 0.6: Balanced view (recommended)

Slide lambda to ~0.6 to see words that are both frequent AND distinctive for each topic.
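The slider controls the relevance measure from the LDAvis method (Sievert and Shirley, 2014): relevance = λ · log p(word | topic) + (1 − λ) · log(p(word | topic) / p(word)). The sketch below computes it by hand to show how the ranking changes with λ. The three words are taken from Topic 1's list, but the corpus-wide probabilities p(word) are invented for illustration, and the function names are hypothetical.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a word to a topic for a given lambda in [0, 1]."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# word: (p(word | topic), p(word) in the whole corpus); the second value is made up
toy_words = {
    "player": (0.067, 0.040),    # frequent in the topic, but frequent everywhere
    "progress": (0.015, 0.004),  # rarer, but concentrated in this topic
    "want": (0.014, 0.013),      # about as common in the topic as in the corpus
}

def rank(words, lam):
    """Sort words from most to least relevant for the given lambda."""
    return sorted(words, key=lambda w: relevance(*words[w], lam), reverse=True)

for lam in (1.0, 0.6, 0.0):
    print(f"lambda={lam}: {rank(toy_words, lam)}")
```

With λ = 1 the ranking follows raw within-topic probability, so "player" comes first. With λ = 0 it follows lift, so the more exclusive "progress" moves to the top.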

7 Interpreting document-topic distributions

Topics describe word patterns, but we also need to understand how topics appear in documents. Each document has a distribution over topics.

7.1 Extracting topic proportions

Here is how to extract document-topic proportions. The table below shows each review’s grade next to its proportions; when labeling topics, include the column with the unprocessed text as well (or instead of grade).

def get_document_topics(corpus, lda_model, num_topics):
    """
    Extract topic proportions for all documents.
    
    Returns:
        DataFrame with one column per topic
    """
    doc_topics_list = []
    
    for doc_bow in corpus:
        # Get topic distribution for this document
        topic_dist = lda_model.get_document_topics(doc_bow, minimum_probability=0.0)
        
        # Extract probabilities (in topic order)
        probs = [prob for topic_id, prob in topic_dist]
        doc_topics_list.append(probs)
    
    # Create DataFrame
    topic_cols = [f'Topic_{i}' for i in range(num_topics)]
    df_topics = pd.DataFrame(doc_topics_list, columns=topic_cols)
    
    return df_topics

# Extract topic proportions
df_topics = get_document_topics(corpus, lda_model, num_topics=4)

# Merge with original metadata
df_analysis = pd.concat([df_filtered.reset_index(drop=True), df_topics], axis=1)

print("Document-topic matrix (first 5 documents):")
df_analysis[['grade', 'Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']].head()
Document-topic matrix (first 5 documents):
grade Topic_0 Topic_1 Topic_2 Topic_3
0 4 0.011414 0.965717 0.011129 0.011740
1 5 0.735381 0.005071 0.005293 0.254255
2 0 0.268592 0.718765 0.006276 0.006367
3 0 0.013864 0.958788 0.013654 0.013693
4 0 0.006553 0.036656 0.333053 0.623737

7.2 Worked example: Understanding topic mixtures

Let’s examine a specific review to understand what topic proportions mean:

# Select one review to inspect (try other indices to find a more evenly mixed review)
example_idx = 10
example_row = df_analysis.iloc[example_idx]

print("="*70)
print("WORKED EXAMPLE: Review Topic Mixture")
print("="*70)
print(f"\nGrade: {example_row['grade']}/10")
print(f"\nTopic proportions:")
for i in range(4):
    topic_prop = example_row[f'Topic_{i}']
    print(f"  Topic {i} ({topic_labels[i]}): {topic_prop:.3f} ({topic_prop*100:.1f}%)")

print(f"\nReview text (first 400 characters):")
print(f"{example_row['text'][:400]}...")

print("\n" + "="*70)
print("INTERPRETATION")
print("="*70)
print(f"This review is a MIXTURE of topics. The model estimates it is:")
dominant_topic = df_analysis.iloc[example_idx][['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']].idxmax()
dominant_prop = df_analysis.iloc[example_idx][['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']].max()
print(f"  - Primarily about {dominant_topic.replace('_', ' ')} ({dominant_prop*100:.1f}%)")
print(f"  - With contributions from other topics")
print(f"\nThis mixture reflects that real documents discuss multiple themes.")
======================================================================
WORKED EXAMPLE: Review Topic Mixture
======================================================================

Grade: 0/10

Topic proportions:
  Topic 0 (Multiplayer/Island Limitations): 0.004 (0.4%)
  Topic 1 (Gameplay and Enjoyment): 0.004 (0.4%)
  Topic 2 (Game Mechanics and Features): 0.004 (0.4%)
  Topic 3 (Aesthetics and Design): 0.987 (98.7%)

Review text (first 400 characters):
Only ONE island per console!You can't create a different island per user/account.Once you have created an island on your console, all users have to share it.Good luck if you have two or more kids...Nintendo got away with forcing us to buy one game per user for years and years with Pokemon and Animal Crossing, and now they want us to buy one CONSOLE per user. They had theOnly ONE island per console...

======================================================================
INTERPRETATION
======================================================================
This review is a MIXTURE of topics. The model estimates it is:
  - Primarily about Topic 3 (98.7%)
  - With contributions from other topics

This mixture reflects that real documents discuss multiple themes.
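This particular review is dominated by a single topic, so it helps to have a way to find documents that really are mixed. One option, offered here as a suggestion rather than as part of the tutorial's method, is the Shannon entropy of the topic proportions: it is close to 0 when one topic dominates and reaches log(k) when all k topics are equally represented. The sketch uses the five rows printed in section 7.1 plus the worked example, and the function name is hypothetical.

```python
import math

def topic_entropy(proportions):
    """Shannon entropy (in nats) of a topic mixture; higher means more mixed."""
    total = sum(proportions)
    probs = [p / total for p in proportions if p > 0]
    return -sum(p * math.log(p) for p in probs)

# Topic proportions of the first five documents (output of section 7.1)
doc_topics = [
    [0.011414, 0.965717, 0.011129, 0.011740],
    [0.735381, 0.005071, 0.005293, 0.254255],
    [0.268592, 0.718765, 0.006276, 0.006367],
    [0.013864, 0.958788, 0.013654, 0.013693],
    [0.006553, 0.036656, 0.333053, 0.623737],
]
# Rounded proportions of the worked example (document 10)
worked_example = [0.004, 0.004, 0.004, 0.987]

entropies = [topic_entropy(row) for row in doc_topics]
most_mixed = max(range(len(doc_topics)), key=lambda i: entropies[i])

for i, h in enumerate(entropies):
    print(f"Document {i}: entropy = {h:.3f}")
print(f"Worked example: entropy = {topic_entropy(worked_example):.3f}")
print(f"Most mixed of the first five: document {most_mixed}")
print(f"Maximum possible for 4 topics: {math.log(4):.3f}")
```

Among the first five documents, document 4 has the highest entropy, so it would be a better illustration of a mixture than the worked example.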

7.3 Visualizing dominant topics

Which topic is most prevalent in each review?

# Find dominant topic for each document
df_analysis['dominant_topic'] = df_analysis[['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']].idxmax(axis=1)
df_analysis['dominant_topic_prop'] = df_analysis[['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']].max(axis=1)

# Distribution of dominant topics
print("Distribution of dominant topics:")
dominant_counts = df_analysis['dominant_topic'].value_counts()
for topic, count in dominant_counts.items():
    topic_num = int(topic.split('_')[1])
    print(f"  {topic} ({topic_labels[topic_num]}): {count} reviews ({count/len(df_analysis)*100:.1f}%)")

# Visualize as horizontal bar chart, sorted by count
fig, ax = plt.subplots(figsize=(10, 6))

# Sort by count (ascending so largest is at top)
dominant_counts_sorted = dominant_counts.sort_values(ascending=True)

# Get topic labels
topic_names = [topic_labels[int(t.split('_')[1])] for t in dominant_counts_sorted.index]

# Create horizontal bars
y_pos = range(len(topic_names))
ax.barh(y_pos, dominant_counts_sorted.values, color='steelblue', alpha=0.7)

# Set labels
ax.set_yticks(y_pos)
ax.set_yticklabels(topic_names)
ax.set_xlabel('Number of Reviews')
ax.set_title('Distribution of Dominant Topics Across Reviews')

# Add count labels at end of each bar
for i, (count, topic) in enumerate(zip(dominant_counts_sorted.values, topic_names)):
    ax.text(count + 5, i, f'{count}', va='center', fontsize=10)

plt.tight_layout()
plt.show()
Distribution of dominant topics:
  Topic_2 (Game Mechanics and Features): 535 reviews (30.8%)
  Topic_3 (Aesthetics and Design): 455 reviews (26.2%)
  Topic_1 (Gameplay and Enjoyment): 414 reviews (23.9%)
  Topic_0 (Multiplayer/Island Limitations): 331 reviews (19.1%)

This is just one way of showing topic prevalence: for each topic, we counted the documents in which it has the highest probability. Another way is to sum each topic’s probabilities across documents. Within a single document, the topic probabilities always sum to 1 (100%), as do the word probabilities within a single topic. Summed over all documents, however, the totals differ between topics and give a similar ranking of topic prevalence in the entire corpus.
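The two prevalence measures can be compared side by side on a small example. The sketch below uses only the five document-topic rows printed in section 7.1, so the numbers illustrate the calculation and say nothing about the full corpus. The variable names are hypothetical.

```python
# Topic proportions of the first five documents (output of section 7.1)
doc_topics = [
    [0.011414, 0.965717, 0.011129, 0.011740],
    [0.735381, 0.005071, 0.005293, 0.254255],
    [0.268592, 0.718765, 0.006276, 0.006367],
    [0.013864, 0.958788, 0.013654, 0.013693],
    [0.006553, 0.036656, 0.333053, 0.623737],
]
n_topics = len(doc_topics[0])

# Method 1: count the documents in which each topic has the highest proportion
dominant = [row.index(max(row)) for row in doc_topics]
dominant_counts = [dominant.count(k) for k in range(n_topics)]

# Method 2: sum each topic's proportions over all documents, then normalize
topic_sums = [sum(row[k] for row in doc_topics) for k in range(n_topics)]
topic_share = [s / sum(topic_sums) for s in topic_sums]

for k in range(n_topics):
    print(f"Topic {k}: dominant in {dominant_counts[k]} docs, "
          f"share of total probability = {topic_share[k]:.3f}")
```

Topic 2 is never dominant in these five documents, yet it still holds a non-zero share of the total probability because it contributes about a third of document 4. Dominant-topic counts discard that kind of information, while the summed proportions keep it.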

8 Close reading validation

Computational metrics tell us topics are statistically coherent, but they cannot tell us if topics are meaningful. We must validate through close reading - examining actual documents.

Important: Why close reading matters

Topic modeling is exploratory. The computer finds word patterns, but only human judgment can determine if those patterns represent meaningful themes.

Always validate computational findings by reading documents. Ask:

  • Do high-loading documents actually discuss this theme?
  • Are the topics coherent when you read actual text?
  • Do topic labels accurately describe document content?

Let’s examine reviews where Topic 0 is dominant:

# Get reviews where Topic 0 is dominant (>50% of document)
topic_0_docs = df_analysis[df_analysis['Topic_0'] > 0.5].copy()
topic_0_docs = topic_0_docs.sort_values('Topic_0', ascending=False)

print("="*70)
print(f"CLOSE READING: Topic 0 ({topic_labels[0]})")
print("="*70)
print(f"\nTop words: {', '.join([w for w, p in lda_model.show_topic(0, topn=10)])}")
print(f"\nTop 3 reviews for this topic:\n")

for i in range(min(3, len(topic_0_docs))):
    row = topic_0_docs.iloc[i]
    print(f"--- Review {i+1} ---")
    print(f"Grade: {row['grade']}/10")
    print(f"Topic 0 proportion: {row['Topic_0']:.3f}")
    print(f"Text: {row['text'][:350]}...")
    print()
======================================================================
CLOSE READING: Topic 0 (Multiplayer/Island Limitations)
======================================================================

Top words: player, people, experience, expand, console, animal, thing, review, like, crossing

Top 3 reviews for this topic:

--- Review 1 ---
Grade: 10/10
Topic 0 proportion: 0.993
Text: Besides the fact that this game is really good, there's something I need to talk about. Don't trust any reviews for this game. It's honestly pathetic how this review bombing is turning out. If you take a look at hundreds of 0 reviews you'll find their logic extremely flawed. And most of all, I noticed a common pattern between these reviewers. It al...

--- Review 2 ---
Grade: 10/10
Topic 0 proportion: 0.992
Text: Almost every negative review I have seen for this game addresses the shared island problem. And it is causing the user score to plummet. However, in the midst of all the controversy, people seem to forget that this problem doesn't even affect everyone, and that there is a stellar game past this one issue. I would have rated this game 9/10 for its s...

--- Review 3 ---
Grade: 10/10
Topic 0 proportion: 0.991
Text: Let's begin by being extremely clear: this game is amazing. It deserves all the praise you've heard about it. It's fun, it's innovative for the series, it's really zen and, most importantly, it still conveys that sense of belonging to a community when playing.You probably are also reading a lot of angry people saying that the game only allows one i...

Now examine the topic that is most prevalent in positive reviews:

# Find the topic with the highest mean proportion among positive reviews (grade >= 8)
positive_reviews = df_analysis[df_analysis['grade'] >= 8]
topic_means = positive_reviews[['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']].mean()
positive_topic = topic_means.idxmax()
positive_topic_num = int(positive_topic.split('_')[1])

print("="*70)
print(f"CLOSE READING: {positive_topic} ({topic_labels[positive_topic_num]})")
print("="*70)
print(f"\nTop words: {', '.join([w for w, p in lda_model.show_topic(positive_topic_num, topn=10)])}")
print(f"\nThis topic is most prevalent in POSITIVE reviews (grade >= 8)")
print(f"Mean proportion in positive reviews: {topic_means.max():.3f}\n")

# Get top positive reviews for this topic
positive_topic_docs = df_analysis[
    (df_analysis['grade'] >= 8) & 
    (df_analysis[positive_topic] > 0.3)
].sort_values(positive_topic, ascending=False)

print(f"Top 3 positive reviews for this topic:\n")

for i in range(min(3, len(positive_topic_docs))):
    row = positive_topic_docs.iloc[i]
    print(f"--- Review {i+1} ---")
    print(f"Grade: {row['grade']}/10")
    print(f"{positive_topic} proportion: {row[positive_topic]:.3f}")
    print(f"Text: {row['text'][:350]}...")
    print()

======================================================================
CLOSE READING: Topic_2 (Game Mechanics and Features)
======================================================================

Top words: time, animal, crossing, like, expand, thing, good, feel, want, craft

This topic is most prevalent in POSITIVE reviews (grade >= 8)
Mean proportion in positive reviews: 0.584

Top 3 positive reviews for this topic:

--- Review 1 ---
Grade: 8/10
Topic_2 proportion: 0.996
Text: More of 7 1/2 than an 8...I've always loved the animal crossing series. But of course there are always some kind of problems. Its Nintendo, we all know that they make a lot of decisions because they want that $$$. We still give in and buy the games anyway. That being said here's some things I found wrong.1st, the one island per switch situation. We...

--- Review 2 ---
Grade: 8/10
Topic_2 proportion: 0.996
Text: #DashReviewsIntroAnimal Crossing: New Horizons is the console follow up to the critical apprised portable outing, New Leaf. Nintendo has hit most Switch outing out of the part, so hoping the Animal Crossing would go leaps and bounds beyond previous titles is to be expected. While Nintendo mostly pulls this off but there are a few things that bring ...

--- Review 3 ---
Grade: 9/10
Topic_2 proportion: 0.995
Text: Animal Crossing has been a series beloved by many, and for good reason. It is a feel-good, calm and open-ended game that allows the user to live a second day-to-day life among the animals of the world. Animal Crossing: New Horizons stays true to that formula - in some cases - while also breathing a much needed plume of fresh air into the series.New...
Tip: Validation checklist

When doing close reading, verify:

  1. Coherence: Do reviews discuss related themes?
  2. Accuracy: Do topic labels match actual content?
  3. Distinctiveness: Are topics distinguishable from each other?
  4. Quality: Are reviews real content (not spam)?

If validation fails, consider:

  • Refining preprocessing (add custom stopwords)
  • Changing number of topics (k)
  • Filtering corpus differently
  • Accepting that some topics may be “junk”
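The distinctiveness check in the list above can be roughed out numerically before reading any documents. The following is a minimal sketch, not part of the lab's pipeline: it takes the two top-10 word lists printed in the close-reading output and measures how much they overlap using Jaccard similarity. The helper name `jaccard` and the `top_words` dictionary are illustrative assumptions; in the lab you would build the lists with `[w for w, p in lda_model.show_topic(k, topn=10)]`.

```python
from itertools import combinations

# Top-10 words copied from the close-reading output for Topic 0 and Topic 2.
top_words = {
    "Topic_0": ["player", "people", "experience", "expand", "console",
                "animal", "thing", "review", "like", "crossing"],
    "Topic_2": ["time", "animal", "crossing", "like", "expand",
                "thing", "good", "feel", "want", "craft"],
}


def jaccard(words_a, words_b):
    """Shared words divided by all distinct words across both lists."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


# Compare every pair of topics and report the words they have in common.
overlaps = {}
for (name_a, words_a), (name_b, words_b) in combinations(top_words.items(), 2):
    shared = sorted(set(words_a) & set(words_b))
    overlaps[(name_a, name_b)] = jaccard(words_a, words_b)
    print(f"{name_a} vs {name_b}: Jaccard = {overlaps[(name_a, name_b)]:.2f}, shared = {shared}")
```

Here the two topics share five of their ten top words ("animal", "crossing", "expand", "like", "thing"). Words that recur across topics like this are natural candidates for the custom stopword list, although the final judgment about distinctiveness still comes from reading the documents.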

9 Limitations and appropriate use

Topic modeling is a powerful exploratory tool, but it has important limitations. Knowing when and how to use it appropriately is crucial for sound research.

9.1 Topics are computational constructs

Topics discovered by LDA are word co-occurrence patterns, not inherent properties of text. Different preprocessing choices, different k values, or different random seeds produce different topics.

There is no “true” set of topics waiting to be discovered. Topics are analytical constructs that help us understand patterns, but they are not facts about the world.

9.2 LDA is stochastic

LDA uses randomization during training, so running the same code with different random_state values produces somewhat different topics. Fixing random_state=42 makes a single run reproducible; it does not show that the topics are stable.

In Part 2 (Evaluation and Validation), you will learn how to assess topic stability - testing whether topics remain consistent across multiple runs.

9.3 Topic modeling is exploratory

Use topic modeling to:

  • Discover patterns and generate hypotheses
  • Explore themes in large corpora
  • Identify prevalent discourses
  • Guide close reading (which documents to examine)

Do not use topic modeling (LDA) to:

  • Test hypotheses (it is exploratory, not confirmatory)
  • Prove theories (findings need validation)
  • Replace close reading (computers cannot replace human interpretation)

9.4 When not to use topic modeling

Topic modeling is inappropriate when:

  • The corpus is very small (typically fewer than 100 documents)
  • Documents are very short (tweets, or reviews under 50 words)
  • Topics are predefined (use dictionary methods from Lab 03 or a different topic modeling algorithm instead)
  • You need precise classifications (use supervised learning instead)
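The first two criteria in this list are easy to check before training a model. The sketch below is one possible pre-flight check, under stated assumptions: `corpus_suitability` is a hypothetical helper (not a gensim or pandas function), word counts come from simple whitespace splitting rather than the lab's tokenizer, and the thresholds of 100 documents and 50 words are the rules of thumb from the list above, not hard limits.

```python
import statistics


def corpus_suitability(texts, min_docs=100, min_words=50):
    """Summarise corpus size and document length against rule-of-thumb thresholds."""
    # Rough word count per document: split on whitespace (an approximation).
    lengths = [len(text.split()) for text in texts]
    n_docs = len(lengths)
    n_short = sum(length < min_words for length in lengths)
    return {
        "n_docs": n_docs,
        "enough_docs": n_docs >= min_docs,
        "share_short": n_short / n_docs if n_docs else 0.0,
        "median_words": statistics.median(lengths) if n_docs else 0,
    }


# Hypothetical toy corpus: documents of 80, 2, and 60 words.
toy_texts = ["great game " * 40, "too short", "love the island " * 20]
report = corpus_suitability(toy_texts)
print(report)
```

In the lab you could pass the review column instead, for example `corpus_suitability(df_analysis['text'])`. A large share of short documents or a low document count is a signal to reconsider the method, not an automatic disqualification.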
Note: Building on previous labs

From Lab 03 (Dictionary Methods):

We used pre-defined word lists to measure sentiment (positive/negative lexicons). Topic modeling extends this by discovering word groups from data, without requiring predefined categories.

From Lab 04 (Document Similarity):

We represented documents as TF-IDF vectors to measure similarity. Topic modeling provides an alternative: represent documents as topic distributions (4 dimensions instead of 10,000+), which are easier to interpret.

In Part 2, we will explore topics by sentiment (negative vs positive reviews) and see how topic modeling reveals patterns in consumer feedback.
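To make the comparison with Lab 04 concrete, the sketch below treats topic proportions as short document vectors and compares them. It is illustrative only: cosine similarity is assumed here as the similarity measure, `review_a` reuses the example mixture from the introduction (10%, 60%, 20%, 10%), and `review_b` and `review_c` are hypothetical reviews invented for the comparison.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Topic proportions over Topic 0..3; each vector sums to 1.
review_a = [0.10, 0.60, 0.20, 0.10]  # example mixture from the introduction
review_b = [0.05, 0.70, 0.15, 0.10]  # hypothetical: also mostly Topic 1
review_c = [0.80, 0.05, 0.05, 0.10]  # hypothetical: mostly Topic 0

sim_ab = cosine_similarity(review_a, review_b)
sim_ac = cosine_similarity(review_a, review_c)
print(f"A vs B: {sim_ab:.3f}")
print(f"A vs C: {sim_ac:.3f}")
```

Reviews A and B, which are both dominated by Topic 1, come out far more similar than A and C. Because each of the four dimensions is a labeled topic, you can also read off why two reviews are similar, which is harder with thousands of TF-IDF term weights.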

9.5 Preview: Part 2

In the next lab (Topic Modeling II: Evaluation and Validation), you will learn:

  • Topic stability: How to test whether topics are consistent across runs
  • Choosing k: Using coherence scores to select number of topics
  • Metadata analysis: Comparing topics across sentiments (negative vs positive reviews)
  • Human validation: Inter-coder reliability and qualitative assessment
  • Reporting: How to document topic models in research papers

10 Exercises

10.1 Exercise 1: Custom stopword refinement

After exploring your topics in pyLDAvis, you may notice words that appear in EVERY topic but aren’t meaningful (e.g., “game”, “play”, “switch”).

Tasks:

  1. Identify 5-10 domain-specific stopwords from the pyLDAvis visualization
  2. Modify the process_text() function to exclude these words
  3. Re-run the model with your custom stopwords
  4. Compare the topics: are they more interpretable?
  5. Reflect: when should you add custom stopwords vs. keep them?

Hint: Look for words with high frequency but low topic specificity in pyLDAvis (words that appear in the top-right of every topic’s word panel).

10.2 Exercise 2: Close reading validation

Choose one topic from your model (Topic 0, 1, 2, or 3):

  1. Write a 1-sentence label for the topic based on its top 10 words
  2. Find the 5 reviews with highest probability for this topic
  3. Read all 5 reviews carefully
  4. Answer:
    • Do they match your label?
    • What proportion genuinely fit the topic?
    • What does this tell you about topic model reliability?
    • Would you trust this topic for research conclusions?

10.3 Exercise 3: Topic labeling practice

For each of the 4 topics in your model:

  1. Examine the top 15 words (use lda_model.show_topic(topic_id, topn=15))
  2. Create a descriptive label (not just “Topic 0” - describe the theme)
  3. Find 2 example reviews with high probability for this topic
  4. From pyLDAvis, estimate what proportion of the corpus this topic represents
  5. Assess: is this topic coherent? Why or why not?

10.4 Exercise 4: Comparing to dictionary methods

Thinking back to Lab 03 (dictionary methods for sentiment):

  1. How would you measure review sentiment using dictionary methods?
  2. How does topic modeling differ from dictionary-based sentiment analysis?
  3. What are advantages of topic modeling? What are advantages of dictionaries?
  4. When would you use each method?

10.5 Exercise 5: Reflection

Write a brief reflection (250-500 words) addressing:

  1. What surprised you about the topics discovered in this dataset?
  2. Did the topics match your expectations based on reading sample reviews?
  3. What are the biggest limitations of topic modeling you encountered?
  4. How might you apply topic modeling to your own research interests?