Topic modeling II: Evaluation and validation

Stability, coherence, and human validation

Published

2026-01-25 11:57:11

Prerequisites

This notebook assumes you have completed Part 1 (Introduction to Topic Modeling). You should be familiar with:

LDA conceptual model (documents as mixtures of topics)
Preprocessing for topic modeling
Interpreting topic-word distributions
Using pyLDAvis visualization

If you have not completed Part 1, start there first.

1 Why evaluation matters

In Part 1, you built a topic model and interpreted its results. But how do you know if your topics are reliable? How do you choose the number of topics (k)? How do you validate that topics represent meaningful patterns rather than statistical artifacts?

Topic models are analytical constructs. Different choices produce different results:

Different preprocessing (stopwords, filtering)
Different k values (number of topics)
Different random initializations (LDA is stochastic)

Systematic evaluation helps us assess:

Stability: Do topics remain consistent across runs?
Coherence: Do topics make semantic sense?
Validity: Do topics align with human interpretation?
Generalizability: Do findings hold across different subsets of data?

This lab covers methods for evaluating topic models rigorously.

2 Setup and data

We use the same Animal Crossing reviews dataset from Part 1. Let’s reload and preprocess:

import pandas as pd
import numpy as np
import spacy
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import mannwhitneyu

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

print("Libraries loaded successfully")

Libraries loaded successfully

# NOTE: This preprocessing code duplicates Part 1 for standalone use.
# In practice, you would save the corpus and model from Part 1 and load them here.
# We repeat for pedagogical clarity.

# Load data
df = pd.read_csv('data/animal-crossing/user_reviews.tsv', sep='\t')

# Filter and deduplicate
df['word_count'] = df['text'].str.split().str.len()
df_filtered = df[df['word_count'] >= 50].copy()
df_filtered = df_filtered.drop_duplicates(subset=['text'])

# Add sentiment categories
df_filtered['sentiment'] = pd.cut(
    df_filtered['grade'],
    bins=[-1, 3, 7, 10],
    labels=['negative', 'neutral', 'positive']
)

print(f"Filtered corpus: {len(df_filtered):,} reviews")

# Preprocess
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

def process_text(texts):
    """Clean and tokenize texts for topic modeling."""
    processed_docs = []
    
    # Custom stopwords (consistent with Part 1)
    custom_stops = {'game', 'play', 'nintendo', 'switch', 'island'}
    
    for doc in nlp.pipe(texts, batch_size=50):
        tokens = [token.lemma_.lower() for token in doc 
                  if not token.is_stop 
                  and not token.is_punct 
                  and token.is_alpha 
                  and len(token) > 3
                  and token.lemma_.lower() not in custom_stops]
        processed_docs.append(tokens)
    return processed_docs

print("Processing reviews...")
texts_processed = process_text(df_filtered['text'].tolist())

# Build dictionary and corpus
id2word = corpora.Dictionary(texts_processed)
id2word.filter_extremes(no_below=5, no_above=0.8)
corpus = [id2word.doc2bow(text) for text in texts_processed]

print(f"Vocabulary: {len(id2word):,} words")
print(f"Corpus: {len(corpus):,} documents")

Filtered corpus: 1,735 reviews
Processing reviews...
Vocabulary: 1,671 words
Corpus: 1,735 documents

3 Choosing the number of topics

One fundamental challenge in topic modeling: we must decide how many topics (k) exist in our corpus. There is no “correct” answer - this is a modeling choice.

We can use coherence scores as a heuristic. Coherence measures how often the top words in a topic appear together in the same documents. Higher coherence suggests the topic represents a meaningful theme.

3.1 What is coherence?

Coherence asks: do the top words in a topic actually co-occur in documents? If a topic has high coherence, its top words appear together frequently, suggesting they represent a real pattern.

Consider two hypothetical topics:

Topic A (high coherence):

Top words: “island”, “villager”, “design”, “custom”, “decoration”
These words co-occur in reviews about creative gameplay

Topic B (low coherence):

Top words: “game”, “play”, “thing”, “like”, “get”
These are high-frequency words that appear everywhere but don’t form a coherent theme

Coherence scores quantify this intuition.
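To make the intuition concrete, here is a minimal sketch of a co-occurrence score on a toy corpus. It is not the c_v metric used below, which is more involved. The function name `mean_pair_codoc_rate` and the toy documents are hypothetical and serve only as illustration.

```python
from itertools import combinations

def mean_pair_codoc_rate(top_words, docs):
    """Average over word pairs of: (docs containing both) / (docs containing either)."""
    doc_sets = [set(doc) for doc in docs]
    rates = []
    for w1, w2 in combinations(top_words, 2):
        both = sum(1 for d in doc_sets if w1 in d and w2 in d)
        either = sum(1 for d in doc_sets if w1 in d or w2 in d)
        rates.append(both / either if either else 0.0)
    return sum(rates) / len(rates) if rates else 0.0

# Toy tokenized documents (made up for this example)
toy_docs = [
    ["villager", "design", "custom", "decoration"],
    ["villager", "design", "decoration"],
    ["design", "custom", "decoration", "villager"],
    ["save", "console", "account"],
    ["console", "share", "save"],
]

coherent_topic = ["villager", "design", "decoration"]  # words that travel together
mixed_topic = ["villager", "console", "custom"]        # words from unrelated reviews

score_coherent = mean_pair_codoc_rate(coherent_topic, toy_docs)
score_mixed = mean_pair_codoc_rate(mixed_topic, toy_docs)
print(f"coherent: {score_coherent:.3f}, mixed: {score_mixed:.3f}")
```

The topic whose words share documents scores higher than the topic whose words are scattered across unrelated documents. Real coherence metrics refine this idea with better statistics.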

3.2 Interpreting coherence scores

We use the c_v coherence metric, which ranges from 0 to 1:

Coherence (c_v)	Interpretation
< 0.40	Poor topic quality; topics likely incoherent
0.40 - 0.50	Moderate; topics may be interpretable
0.50 - 0.60	Good; topics generally coherent
> 0.60	Excellent; topics well-defined

These are guidelines, not rules. Always examine actual topics (words) to verify they make semantic sense.

Expected coherence for this dataset

Due to review bombing (many reviews focusing on one complaint), this corpus is not ideal for topic modeling. You should see coherence scores around 0.35-0.40 - lower than typical well-structured corpora.

This demonstrates an important lesson: corpus composition affects topic quality. When most documents discuss the same theme, LDA struggles to find distinct topics.

Despite low coherence, topics may still be interpretable if you examine them carefully. The low scores reflect the limitation of the data, not the method.

3.3 Computing coherence for different k values

Let’s test k from 2 to 10 topics:

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    """
    Compute coherence scores for different numbers of topics.
    
    Parameters:
        dictionary: Gensim dictionary
        corpus: Gensim corpus (bag-of-words)
        texts: Tokenized texts (list of lists)
        limit: Upper bound on the number of topics to test (exclusive, as in range())
        start: Minimum number of topics
        step: Step size
    
    Returns:
        model_list: List of trained models
        coherence_values: List of coherence scores
    """
    coherence_values = []
    model_list = []
    
    for num_topics in range(start, limit, step):
        print(f"  Training model with k={num_topics}...")
        
        # Train LDA model
        model = gensim.models.LdaMulticore(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,
            workers=1
        )
        model_list.append(model)
        
        # Calculate coherence
        coherencemodel = CoherenceModel(
            model=model,
            texts=texts,
            dictionary=dictionary,
            coherence='c_v'
        )
        coherence_score = coherencemodel.get_coherence()
        coherence_values.append(coherence_score)
        
        print(f"    k={num_topics}: coherence = {coherence_score:.4f}")
    
    return model_list, coherence_values

# Run coherence analysis
print("Computing coherence scores (this takes a few minutes)...")
print("="*70)

model_list, coherence_values = compute_coherence_values(
    dictionary=id2word,
    corpus=corpus,
    texts=texts_processed,
    start=2,
    limit=11,
    step=1
)

print("\n" + "="*70)
print("Coherence scores summary:")
print("="*70)
for k, score in zip(range(2, 11), coherence_values):
    print(f"k={k}: {score:.4f}")

Computing coherence scores (this takes a few minutes)...
======================================================================
  Training model with k=2...
    k=2: coherence = 0.3569
  Training model with k=3...
    k=3: coherence = 0.3949
  Training model with k=4...
    k=4: coherence = 0.3939
  Training model with k=5...
    k=5: coherence = 0.4081
  Training model with k=6...
    k=6: coherence = 0.3944
  Training model with k=7...
    k=7: coherence = 0.3888
  Training model with k=8...
    k=8: coherence = 0.3942
  Training model with k=9...
    k=9: coherence = 0.3888
  Training model with k=10...
    k=10: coherence = 0.3704

======================================================================
Coherence scores summary:
======================================================================
k=2: 0.3569
k=3: 0.3949
k=4: 0.3939
k=5: 0.4081
k=6: 0.3944
k=7: 0.3888
k=8: 0.3942
k=9: 0.3888
k=10: 0.3704

3.4 Visualizing the elbow method

Plot coherence scores to identify the “elbow” - where adding more topics stops improving coherence significantly:

fig, ax = plt.subplots(figsize=(10, 6))

k_values = list(range(2, 11))
ax.plot(k_values, coherence_values, marker='o', linewidth=2, markersize=8, color='steelblue')
ax.set_xlabel('Number of Topics (k)', fontsize=12)
ax.set_ylabel('Coherence Score (c_v)', fontsize=12)
ax.set_title('Coherence Scores by Number of Topics', fontsize=14, fontweight='bold')
ax.set_xticks(k_values)
ax.grid(True, alpha=0.3)

# Highlight maximum
max_idx = np.argmax(coherence_values)
max_k = k_values[max_idx]
max_score = coherence_values[max_idx]
ax.plot(max_k, max_score, marker='*', markersize=20, color='red', 
        label=f'Maximum: k={max_k} ({max_score:.3f})')

ax.legend(fontsize=11)
plt.tight_layout()
plt.show()

print(f"\nMaximum coherence: k={max_k} with score {max_score:.4f}")


Maximum coherence: k=5 with score 0.4081

3.5 Choosing k in practice

The coherence plot helps identify reasonable k values, but the choice involves tradeoffs:

Higher coherence suggests semantically coherent topics, but doesn’t guarantee:

Topics are useful for your research question
Topics are interpretable by humans
Topics are stable across runs

Consider:

Elbow point: Where does coherence plateau?
Interpretability: Can you assign meaningful labels to all topics?
Coverage: Do topics capture important themes in your corpus?
Parsimony: Fewer topics (k=4-6) are easier to interpret than many topics (k=15-20)

For this corpus specifically:

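Looking at the coherence scores, they are all relatively low (roughly 0.36-0.41) with no clear peak. The highest score is at k=5 (0.408), but its margin over k=3, 4, 6, and 8 (all about 0.394) is small. The lowest score is at k=2 (0.357), and only two topics would oversimplify the corpus anyway.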

The plot shows coherence is relatively flat across k=2 to k=10, suggesting the review bombing pattern makes it difficult for LDA to find highly distinct topics regardless of k.

In this situation, we choose k=4 based on:

Interpretability: Four topics are manageable to interpret and validate
Balance: Captures main themes (complaints, gameplay, aesthetics, comparisons) without excessive granularity
Pedagogical value: Demonstrates that topic modeling doesn’t always produce high-coherence results

This teaches an important lesson: when coherence is uniformly low, examine the corpus composition. The review bombing phenomenon limits topic diversity, which affects coherence scores.

Coherence is a heuristic, not truth

High coherence does not guarantee topics are “correct” or meaningful for your research question. Always validate topics through:

Close reading of high-loading documents
Domain expertise
Alignment with theoretical expectations

Use coherence to narrow options, then choose k based on interpretability.
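One way to "narrow options" is to keep every k whose coherence falls within a small tolerance of the best score, then pick among those by interpretability. The sketch below applies this to the scores printed in Section 3.3. The helper name `candidate_ks` and the tolerance of 0.015 are assumptions for illustration, not an established rule.

```python
def candidate_ks(scores_by_k, tolerance=0.015):
    """Return all k whose coherence is within `tolerance` of the best score."""
    best = max(scores_by_k.values())
    return sorted(k for k, score in scores_by_k.items() if score >= best - tolerance)

# Coherence scores copied from the output in Section 3.3
scores = {
    2: 0.3569, 3: 0.3949, 4: 0.3939, 5: 0.4081, 6: 0.3944,
    7: 0.3888, 8: 0.3942, 9: 0.3888, 10: 0.3704,
}

candidates = candidate_ks(scores)
print(f"Candidate k values: {candidates}")
```

With this tolerance, k=4 stays in the candidate set alongside the maximum at k=5, which is consistent with treating the final choice as a matter of interpretability.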

Consider other coherence metrics on Palmetto: https://palmetto.demos.dice-research.org/

4 Understanding topic stability

Topic modeling uses probabilistic algorithms that incorporate randomness during training. This means running the same code twice on the same data can produce different results.

For researchers, this raises an important question: how stable are our topics? If topics change substantially between runs, can we trust our findings?

Stability becomes a concern once you have decided on the number of topics for your model. Although the coherence results above slightly favor 5 topics, we will stick with 4 for demonstration purposes. The size of your model (the number of topics) quadratically increases the amount of computation needed to find stable topics, because we have to compare every pair of models (with at least 5 models, that is 10 pairs) and, within each model pair, every topic of one model against every topic of the other. For 4 topics this means 10 × 4² = 160 comparisons, and for 5 topics 10 × 5² = 250. Here is a plot showing how the number of comparisons grows:

# 1. Define Parameters
num_models = 5
model_pairs = (num_models * (num_models - 1)) / 2

# 2. Generate Data (Topics 2 to 100)
topics_range = range(2, 101)
data = []

for t in topics_range:
    total_comparisons = model_pairs * (t ** 2)
    data.append({"Topics": t, "Comparisons": total_comparisons})

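# Use a separate name so the reviews DataFrame `df` from Section 2 is not overwritten
df_comparisons = pd.DataFrame(data)

# 3. Plot
sns.set_theme(style="whitegrid")
plt.figure(figsize=(10, 6))

sns.lineplot(data=df_comparisons, x="Topics", y="Comparisons", linewidth=2.5)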

plt.title(f"Growth of Comparisons for {num_models} Models ($10 \\times T^2$)", fontsize=14)
plt.ylabel("Total Comparisons")
plt.xlabel("Number of Topics per Model")
plt.show()
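As a quick arithmetic check on the counts quoted above (160 and 250), the formula behind the plot can be written as a small function. The name `total_comparisons` is ours, not part of any library.

```python
from math import comb

def total_comparisons(num_models, num_topics):
    """Model pairs times topic pairs: C(num_models, 2) * num_topics**2."""
    return comb(num_models, 2) * num_topics ** 2

for k in (4, 5, 10):
    print(f"5 models, k={k}: {total_comparisons(5, k)} comparisons")
```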

You may ignore topic stability if you use a topic model only to reduce your data for other purposes, such as training a classifier or doing preliminary exploration. In industry, topic stability is less of an issue because topic modeling often runs as a routine procedure behind a feature such as recommending similar texts (or products). These models are constantly updated with new data, and there is no reasonable way to keep track of stable topics. However, you should care about stability if you plan to draw conclusions from the discovered topics.

4.1 The challenge of stochastic algorithms

LDA uses random initialization and random sampling during training. Each run explores the space of possible topic solutions differently. Runs often converge to similar solutions, but topics can still differ between runs, sometimes substantially.

This is not a flaw - it reflects the nature of unsupervised learning. However, we must assess stability to ensure findings are robust.

4.2 The topic labeling problem

Here’s a crucial insight: topic IDs are arbitrary labels. When you train a topic model, the algorithm assigns numeric IDs (0, 1, 2, 3) to topics, but these numbers have no inherent meaning.

What this means:

In Model A, “gameplay mechanics” might be labeled Topic 0
In Model B, the same semantic topic might be labeled Topic 2
The numeric IDs are just arbitrary labels assigned during training

This creates a challenge: to assess stability, we cannot simply compare Topic 0 vs Topic 0 across models. We must find which topics from Model A correspond to which topics in Model B by comparing their content (top words).

4.3 Training multiple models

To properly assess stability, we need multiple models (typically 5) trained with different random seeds:

Technical detail: workers=1

We set workers=1 to train with a single worker process. With multiple workers (workers > 1), the order in which workers finish their chunks of documents can introduce variation even with a fixed random seed. For reproducibility and stability testing, single-worker execution is essential.

print("Training 5 models with different random seeds...")
print("This will take several minutes.\n")

models = []
seeds = [42, 100, 999, 2023, 5555]

for i, seed in enumerate(seeds):
    print(f"Training model {i+1}/5 (seed={seed})...")
    model = gensim.models.LdaMulticore(
        corpus=corpus,
        id2word=id2word,
        num_topics=4,
        random_state=seed,
        passes=10,
        workers=1
    )
    models.append(model)

print("\nAll models trained!")

Training 5 models with different random seeds...
This will take several minutes.

Training model 1/5 (seed=42)...
Training model 2/5 (seed=100)...
Training model 3/5 (seed=999)...
Training model 4/5 (seed=2023)...
Training model 5/5 (seed=5555)...

All models trained!

Let’s examine how topic labels differ across runs by looking at the top words:

# Compare Topic 0 across first two models
print("="*70)
print("Comparing 'Topic 0' across two different models")
print("="*70)
print("\nMODEL 1 - Topic 0:")
for word, prob in models[0].show_topic(0, topn=10):
    print(f"  {word}: {prob:.4f}")

print("\nMODEL 2 - Topic 0:")
for word, prob in models[1].show_topic(0, topn=10):
    print(f"  {word}: {prob:.4f}")

print("\n" + "="*70)
print("OBSERVATION")
print("="*70)
print("Notice: Topic 0 in Model 1 may represent a DIFFERENT semantic theme")
print("than Topic 0 in Model 2. The numeric IDs are arbitrary!")
print("\nTo find stable topics, we must compare EVERY topic from Model 1")
print("against EVERY topic from Model 2 to find the best matches.")

======================================================================
Comparing 'Topic 0' across two different models
======================================================================

MODEL 1 - Topic 0:
  player: 0.0210
  people: 0.0192
  experience: 0.0163
  expand: 0.0143
  console: 0.0138
  animal: 0.0125
  thing: 0.0123
  review: 0.0122
  like: 0.0121
  crossing: 0.0118

MODEL 2 - Topic 0:
  player: 0.0459
  console: 0.0186
  second: 0.0166
  progress: 0.0137
  expand: 0.0128
  like: 0.0116
  start: 0.0105
  want: 0.0103
  multiplayer: 0.0092
  person: 0.0089

======================================================================
OBSERVATION
======================================================================
Notice: Topic 0 in Model 1 may represent a DIFFERENT semantic theme
than Topic 0 in Model 2. The numeric IDs are arbitrary!

To find stable topics, we must compare EVERY topic from Model 1
against EVERY topic from Model 2 to find the best matches.

4.4 Cross-model topic matching

To properly assess stability, we need to:

Compare every topic from Model A against every topic from Model B
Calculate Jaccard similarity for all pairs
Find which topics match across models (Jaccard ≥ 0.7)
Count how many times each semantic topic appears across all 5 models
Topics appearing in 3+ models are considered “stable”

Here’s the implementation:

def jaccard_similarity(topic_words_a, topic_words_b):
    """
    Calculate Jaccard similarity between two sets of words.
    
    Parameters:
        topic_words_a: Set of words from topic A
        topic_words_b: Set of words from topic B
    
    Returns:
        Jaccard coefficient (intersection / union)
    """
    intersection = len(topic_words_a & topic_words_b)
    union = len(topic_words_a | topic_words_b)
    return intersection / union if union > 0 else 0


def get_top_words(model, topic_id, topn=100):
    """
    Extract top N words for a topic as a set.
    
    Parameters:
        model: LDA model
        topic_id: Topic index
        topn: Number of top words
    
    Returns:
        Set of top words
    """
    return set([word for word, prob in model.show_topic(topic_id, topn=topn)])


def compare_model_pair(model_a, model_b, num_topics=4, topn=100):
    """
    Compare all topics between two models.
    
    Parameters:
        model_a: First LDA model
        model_b: Second LDA model  
        num_topics: Number of topics (k)
        topn: Number of top words to compare
    
    Returns:
        DataFrame with Jaccard similarities for all topic pairs
    """
    results = []
    
    for topic_a in range(num_topics):
        words_a = get_top_words(model_a, topic_a, topn)
        
        for topic_b in range(num_topics):
            words_b = get_top_words(model_b, topic_b, topn)
            
            jaccard = jaccard_similarity(words_a, words_b)
            
            results.append({
                'topic_a': topic_a,
                'topic_b': topic_b,
                'jaccard': jaccard
            })
    
    return pd.DataFrame(results)


def analyze_stability(models, threshold=0.7, topn=100):
    """
    Analyze topic stability across multiple models.
    
    Parameters:
        models: List of LDA models
        threshold: Jaccard threshold for considering topics "matched"
        topn: Number of top words to compare
    
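    Returns:
        all_results: DataFrame of Jaccard similarities for every topic pair
            in each consecutive model pair
        strong_matches: Rows of all_results with Jaccard >= threshold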
    """
    num_topics = models[0].num_topics
    all_comparisons = []
    
    # Compare each pair of consecutive models
    for i in range(len(models) - 1):
        model_a_id = i
        model_b_id = i + 1
        
        comparison = compare_model_pair(models[i], models[i+1], num_topics, topn)
        comparison['model_a_id'] = model_a_id
        comparison['model_b_id'] = model_b_id
        
        all_comparisons.append(comparison)
    
    # Combine all comparisons
    all_results = pd.concat(all_comparisons, ignore_index=True)
    
    # Find strong matches (Jaccard >= threshold)
    strong_matches = all_results[all_results['jaccard'] >= threshold]
    
    return all_results, strong_matches


print("Stability analysis functions defined.")

Stability analysis functions defined.

Smarter use of Jaccard similarity

Jaccard similarity is very simple and, importantly for us, its complement (Jaccard distance, 1 minus the similarity) is a proper metric: it satisfies the triangle inequality. This allows us to use a common link to skip calculations. If we know how far Topic A is from Topic B, and we already know how far Topic A is from Topic C, we can use Topic A as a “bridge” to bound the distance between B and C, and skip the direct comparison whenever the bound already rules out a match. Note that similarity itself is not transitive: the bridge yields bounds, not the exact value.

Unfortunately, as far as we know, the Python ecosystem still lacks a package for this. To keep the Python code simple, we use Jaccard similarity the ‘dumb’ way and compare every pair directly.
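One way to make the bridging idea precise is through Jaccard distance (1 minus the similarity), which satisfies the triangle inequality. The sketch below uses made-up word sets to show how two known distances to a bridge topic bound a third distance, and when a direct comparison can be skipped. The topic sets and the helper name `jaccard_distance` are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Toy top-word sets (made up for this example)
topic_a = {"console", "save", "share", "account", "multiple"}
topic_b = {"console", "save", "share", "account", "buy"}
topic_c = {"console", "player", "progress", "second", "start"}

d_ab = jaccard_distance(topic_a, topic_b)
d_ac = jaccard_distance(topic_a, topic_c)

# Triangle inequality: |d_ab - d_ac| <= d_bc <= d_ab + d_ac
lower = abs(d_ab - d_ac)
upper = min(1.0, d_ab + d_ac)

# A match at similarity >= 0.7 needs distance <= 0.3.
# If even the lower bound exceeds that, B and C cannot match.
threshold = 0.7
can_skip = lower > 1.0 - threshold

d_bc = jaccard_distance(topic_b, topic_c)  # computed here only to verify the bounds
print(f"bounds: [{lower:.3f}, {upper:.3f}], actual: {d_bc:.3f}, skip: {can_skip}")
```

The bounds tell us only what the B-C similarity cannot be. When the lower bound is below the match cutoff, the direct comparison is still needed.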

Now let’s run the stability analysis:

print("Analyzing topic stability across 5 models...")
print("Comparing each model with the next one (4 comparisons total)\n")

all_results, strong_matches = analyze_stability(models, threshold=0.7, topn=100)

print("="*70)
print("STRONG TOPIC MATCHES (Jaccard >= 0.7)")
print("="*70)
print(f"\nFound {len(strong_matches)} strong matches across 4 model pairs:\n")

if len(strong_matches) > 0:
    # Group by model comparison
    for (model_a, model_b), group in strong_matches.groupby(['model_a_id', 'model_b_id']):
        print(f"\nModel {model_a} vs Model {model_b}:")
        for _, row in group.iterrows():
            # iterrows() upcasts the integer topic ids to float, so cast back for display
            print(f"  Topic {int(row['topic_a'])} → Topic {int(row['topic_b'])}: {row['jaccard']:.3f}")
else:
    print("No strong matches found. This suggests low topic stability.")
    print("Consider: more preprocessing, different k, or acknowledging instability.")

Analyzing topic stability across 5 models...
Comparing each model with the next one (4 comparisons total)

======================================================================
STRONG TOPIC MATCHES (Jaccard >= 0.7)
======================================================================

Found 2 strong matches across 4 model pairs:


Model 2 vs Model 3:
  Topic 2 → Topic 3: 0.869

Model 3 vs Model 4:
  Topic 3 → Topic 0: 0.770

4.5 Interpreting stability results

The analysis shows which topics match across different model runs. Here’s how to interpret the findings:

What constitutes a stable topic?

A topic is considered stable if it appears consistently across multiple models (typically at least 3 out of 5) with high Jaccard similarity (≥ 0.7).

For example, if we find:

Model 0 vs Model 1: Topic 2 → Topic 0 (Jaccard = 0.75)
Model 1 vs Model 2: Topic 0 → Topic 3 (Jaccard = 0.78)
Model 2 vs Model 3: Topic 3 → Topic 1 (Jaccard = 0.72)
Model 3 vs Model 4: Topic 1 → Topic 2 (Jaccard = 0.76)

This suggests one stable semantic topic that appears in all 5 models, though it gets different numeric IDs in each run.

Defining stability

There is no agreed-upon definition of a stable topic. In my own work, I follow this principle: if a given pair of topics has a Jaccard similarity of 0.9 or higher, I treat them as the same topic. If this topic appears in at least 3 models out of 5, it is a stable topic. If a topic is not stable, I discard it from the analysis.

Practice shows that this rule is strict and does not work with every dataset. It shows the best results on clean and easily distinguishable data like news items. It’s too strict for text like short social media posts or product reviews.

Although the approach is general, it was designed for LDA-type topic models. For example, it becomes useless in the case of STM with spectral initialization, which is deterministic, so repeated runs produce the same topics.
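The rule described above (same topic at Jaccard ≥ 0.9, stable if present in at least 3 of 5 models) can be sketched as follows. This is a minimal illustration on made-up word sets, not the exact procedure used in any published analysis. The function names and toy topics are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def count_models_with_match(reference_topic, models, threshold=0.9):
    """Count models that contain at least one topic with Jaccard >= threshold."""
    return sum(
        1 for topics in models
        if any(jaccard(reference_topic, topic) >= threshold for topic in topics)
    )

def is_stable(reference_topic, models, threshold=0.9, min_models=3):
    """Apply the 'at least min_models out of len(models)' rule."""
    return count_models_with_match(reference_topic, models, threshold) >= min_models

# Toy top-word sets; each model is a list of topics (made up for this example)
base = set("console save share account multiple buy family profile user switch".split())
other = set("time good feel series relax cute design villager music fun".split())
drift = {"player", "second", "progress"}

models = [
    [base, other],
    [other, base | {"island"}],   # one extra word: Jaccard 10/11 with base
    [drift, other],               # base is missing from this run
    [base, other],
    [other, {"player", "multiplayer", "person"}],
]

print("base stable:", is_stable(base, models))
print("drift stable:", is_stable(drift, models))
```

Here the count includes the model the reference topic came from. Whether to count it is a design choice you should state when reporting results.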

Jaccard similarity thresholds:

Jaccard Score	Interpretation
≥ 0.70	Strong match - topics are essentially the same
0.50 - 0.69	Moderate match - topics share core themes
0.30 - 0.49	Weak match - some overlap but different topics
< 0.30	No meaningful match

What if few or no matches are found?

If the analysis reveals few strong matches (Jaccard ≥ 0.7), this indicates low topic stability. Possible reasons:

k is too high: Too many topics for the corpus, causing artificial splits
Corpus is too homogeneous: Reviews discuss the same issues (review bombing)
Preprocessing needs refinement: Add custom stopwords, adjust filters
Small corpus: Not enough documents for stable patterns

What to do with unstable topics

If stability analysis shows poor results, you have three options:

Refine the model: Try different k values, improve preprocessing, filter corpus
Acknowledge instability: Report findings honestly and interpret cautiously
Consider other topic models: LDA is just one widely used implementation of the idea; for example, one promising new model is BERTopic https://maartengr.github.io/BERTopic/

Don’t ignore instability. If topics change substantially between runs, your findings are not robust.

Let’s visualize the stability pattern for one model pair (Model 0 vs Model 1):

# Create heatmap showing Jaccard similarities for one model pair
comparison_01 = all_results[
    (all_results['model_a_id'] == 0) & 
    (all_results['model_b_id'] == 1)
]

# Pivot to matrix form
similarity_matrix = comparison_01.pivot(
    index='topic_a',
    columns='topic_b', 
    values='jaccard'
)

# Create heatmap
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(
    similarity_matrix,
    annot=True,
    fmt='.3f',
    cmap='YlOrRd',
    vmin=0,
    vmax=1,
    cbar_kws={'label': 'Jaccard Similarity'},
    ax=ax
)

ax.set_xlabel('Model 1 Topics')
ax.set_ylabel('Model 0 Topics')
ax.set_title('Topic Similarity: Model 0 vs Model 1\n(Higher values = better match)')

plt.tight_layout()
plt.show()

How to read this heatmap:

Each cell shows Jaccard similarity between topic pairs
Darker red = stronger match
Look for high values (>0.7) to identify matching topics
The highest value in each ROW shows the best match for that topic
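Reading the best match row by row can assign two different topics to the same partner topic. If you want a one-to-one matching instead, a simple greedy pass over the similarity values is one option, sketched below on a made-up 3×3 matrix. The function name `greedy_one_to_one` and the numbers are hypothetical. An optimal assignment method (such as the Hungarian algorithm) is an alternative not shown here.

```python
def greedy_one_to_one(similarity):
    """
    Greedy one-to-one matching.

    similarity: dict mapping (topic_a, topic_b) -> Jaccard score.
    Returns a dict mapping topic_a -> (topic_b, score).
    """
    matches = {}
    used_b = set()
    # Visit pairs from most to least similar; keep a pair only if both sides are free
    for (a, b), score in sorted(similarity.items(), key=lambda kv: kv[1], reverse=True):
        if a not in matches and b not in used_b:
            matches[a] = (b, score)
            used_b.add(b)
    return matches

# Toy similarity matrix: rows = Model A topics, columns = Model B topics
toy_similarity = {
    (0, 0): 0.50, (0, 1): 0.20, (0, 2): 0.45,
    (1, 0): 0.55, (1, 1): 0.30, (1, 2): 0.25,
    (2, 0): 0.10, (2, 1): 0.68, (2, 2): 0.15,
}

matches = greedy_one_to_one(toy_similarity)
for a in sorted(matches):
    b, score = matches[a]
    print(f"Topic {a} → Topic {b} (Jaccard={score:.2f})")
```

In this toy matrix, rows 0 and 1 both have their highest value in column 0. The greedy pass gives column 0 to the stronger pair and falls back to the next-best column for the other row.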

4.6 Summary: Which topics are stable?

To determine which semantic topics are truly stable, we need to trace topic chains across all 5 models:

# For each model pair, find the best match for each topic
print("="*70)
print("TOPIC MATCHING CHAINS ACROSS 5 MODELS")
print("="*70)
print("\nFor each topic in each model, showing its best match in the next model:\n")

for i in range(len(models) - 1):
    comparison = all_results[
        (all_results['model_a_id'] == i) & 
        (all_results['model_b_id'] == i + 1)
    ]
    
    print(f"Model {i} → Model {i+1}:")
    
    # For each topic in model A, find best match in model B
    for topic_a in range(4):
        matches = comparison[comparison['topic_a'] == topic_a].sort_values('jaccard', ascending=False)
        best_match = matches.iloc[0]
        
        if best_match['jaccard'] >= 0.7:
            marker = "✓ STRONG"
        elif best_match['jaccard'] >= 0.5:
            marker = "~ moderate"
        else:
            marker = "✗ weak"
        
        print(f"  Topic {topic_a} → Topic {best_match['topic_b']:.0f} "
              f"(Jaccard={best_match['jaccard']:.3f}) {marker}")
    
    print()

======================================================================
TOPIC MATCHING CHAINS ACROSS 5 MODELS
======================================================================

For each topic in each model, showing its best match in the next model:

Model 0 → Model 1:
  Topic 0 → Topic 0 (Jaccard=0.504) ~ moderate
  Topic 1 → Topic 0 (Jaccard=0.550) ~ moderate
  Topic 2 → Topic 1 (Jaccard=0.681) ~ moderate
  Topic 3 → Topic 3 (Jaccard=0.562) ~ moderate

Model 1 → Model 2:
  Topic 0 → Topic 2 (Jaccard=0.562) ~ moderate
  Topic 1 → Topic 0 (Jaccard=0.667) ~ moderate
  Topic 2 → Topic 2 (Jaccard=0.538) ~ moderate
  Topic 3 → Topic 2 (Jaccard=0.681) ~ moderate

Model 2 → Model 3:
  Topic 0 → Topic 1 (Jaccard=0.587) ~ moderate
  Topic 1 → Topic 1 (Jaccard=0.562) ~ moderate
  Topic 2 → Topic 3 (Jaccard=0.869) ✓ STRONG
  Topic 3 → Topic 1 (Jaccard=0.493) ✗ weak

Model 3 → Model 4:
  Topic 0 → Topic 2 (Jaccard=0.471) ✗ weak
  Topic 1 → Topic 3 (Jaccard=0.626) ~ moderate
  Topic 2 → Topic 2 (Jaccard=0.587) ~ moderate
  Topic 3 → Topic 0 (Jaccard=0.770) ✓ STRONG

Identifying stable topics manually

To trace a stable topic across all 5 models:

Start with Topic X in Model 0
Find its best match in Model 1 (e.g., Topic Y)
Find Topic Y’s best match in Model 2 (e.g., Topic Z)
Continue through all models
If a chain exists with Jaccard ≥ 0.7 at each step, that’s a stable topic

For example, if you see:

Model 0 Topic 2 → Model 1 Topic 0 (0.75)
Model 1 Topic 0 → Model 2 Topic 3 (0.78)
Model 2 Topic 3 → Model 3 Topic 1 (0.72)
Model 3 Topic 1 → Model 4 Topic 2 (0.76)

This represents ONE stable semantic topic appearing in all 5 models (despite having different numeric IDs in each).
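The manual tracing steps can be automated with a short helper. The sketch below follows the best match from one model to the next and stops when a link falls below the threshold. The function name `trace_chain` and the input layout (one dict per consecutive model pair) are assumptions made for this example. The `observed` values are copied from the best-match output printed above.

```python
def trace_chain(start_topic, best_matches, threshold=0.7):
    """
    Follow best matches through consecutive model pairs.

    best_matches: list of dicts, one per model pair, mapping
                  topic_a -> (best topic_b, jaccard).
    Returns (chain of topic ids, True if every link met the threshold).
    """
    chain = [start_topic]
    current = start_topic
    for step in best_matches:
        next_topic, score = step[current]
        if score < threshold:
            return chain, False  # chain breaks at this link
        chain.append(next_topic)
        current = next_topic
    return chain, True

# The hypothetical example from the text
hypothetical = [{2: (0, 0.75)}, {0: (3, 0.78)}, {3: (1, 0.72)}, {1: (2, 0.76)}]
print(trace_chain(2, hypothetical))

# Best matches copied from the notebook output above
observed = [
    {0: (0, 0.504), 1: (0, 0.550), 2: (1, 0.681), 3: (3, 0.562)},
    {0: (2, 0.562), 1: (0, 0.667), 2: (2, 0.538), 3: (2, 0.681)},
    {0: (1, 0.587), 1: (1, 0.562), 2: (3, 0.869), 3: (1, 0.493)},
    {0: (2, 0.471), 1: (3, 0.626), 2: (2, 0.587), 3: (0, 0.770)},
]
for topic in range(4):
    print(f"Model 0 Topic {topic}:", trace_chain(topic, observed))

# Starting later: Model 2 Topic 2, followed through Models 3 and 4
print("Model 2 Topic 2:", trace_chain(2, observed[2:]))
```

On the observed values, no chain starting in Model 0 survives the first link at the 0.7 threshold. The two strong matches do connect Model 2 Topic 2 to Model 3 Topic 3 and Model 4 Topic 0, so one topic recurs across the last three models.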

Ensuring reproducibility

To make your results reproducible, always set random_state to a fixed value (e.g., random_state=42) and keep workers=1. With the same data, preprocessing, and library versions, this makes the same random decisions occur each time, so repeated runs should produce identical results.

For research papers, report:

The random seed used
Whether you tested stability (and results)
Any unstable topics that were excluded

5 Post-hoc metadata analysis

In Part 1, we built a topic model without considering metadata (review grades). Now we analyze how topics relate to sentiment - a post-hoc analysis.

This differs from Structural Topic Models (STM), which incorporate metadata during training. In gensim LDA, we extract topics first, then analyze patterns.

5.1 Hypotheses about sentiment and topics

Before examining data, let’s form hypotheses. Based on the Animal Crossing review bombing pattern:

Negative reviews (0-3) might focus on:

Multiplayer limitations (“one island per console”)
Missing features (“cloud save”, “limited content”)
Technical issues

Positive reviews (8-10) might focus on:

Aesthetics (“beautiful”, “cute”, “design”)
Gameplay (“relaxing”, “fun”, “enjoyable”)
Characters and villagers
Escapism and stress relief

Let’s test these hypotheses.

5.2 Extracting topic proportions

First, we need a single final model to analyze. Let’s use k=4 (from Part 1) with our standard seed and more training passes:

# Train final model (k=4, seed=42)
print("Training final model (k=4)...")
lda_final = gensim.models.LdaMulticore(
    corpus=corpus,
    id2word=id2word,
    num_topics=4,
    random_state=42,
    passes=20,  # More passes for stability
    workers=1
)

# Show topics
print("\nFinal model topics:")
print("="*70)
for idx in range(4):
    words = [word for word, prob in lda_final.show_topic(idx, topn=10)]
    print(f"Topic {idx}: {', '.join(words)}")

Training final model (k=4)...

Final model topics:
======================================================================
Topic 0: people, review, experience, player, console, expand, animal, crossing, like, thing
Topic 1: player, second, progress, console, expand, want, person, start, get, multiplayer
Topic 2: time, animal, crossing, like, expand, thing, good, feel, want, series
Topic 3: console, save, share, expand, want, multiple, animal, account, experience, buy

Extract topic proportions for all documents:

def get_document_topics(corpus, lda_model, num_topics):
    """Extract topic proportions for all documents."""
    doc_topics_list = []
    
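    for doc_bow in corpus:
        topic_dist = lda_model.get_document_topics(doc_bow, minimum_probability=0.0)
        # Fill by topic id: gensim can omit topics with near-zero probability,
        # which would otherwise misalign the columns
        probs = [0.0] * num_topics
        for topic_id, prob in topic_dist:
            probs[topic_id] = prob
        doc_topics_list.append(probs)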
    
    topic_cols = [f'Topic_{i}' for i in range(num_topics)]
    df_topics = pd.DataFrame(doc_topics_list, columns=topic_cols)
    
    return df_topics

# Extract and merge
df_topics = get_document_topics(corpus, lda_final, num_topics=4)
df_analysis = pd.concat([df_filtered.reset_index(drop=True), df_topics], axis=1)

print(f"Analysis dataframe: {len(df_analysis):,} reviews")
print(f"Columns: {list(df_analysis.columns)}")

Analysis dataframe: 1,735 reviews
Columns: ['grade', 'user_name', 'text', 'date', 'word_count', 'sentiment', 'Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']
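
Before using these columns downstream, a quick sanity check can catch alignment problems early. The sketch below runs on a small made-up frame (`toy`, not the real `df_analysis`) and assumes the same `Topic_i` column naming. It confirms that each row sums to roughly 1 and derives a hypothetical `dominant_topic` column, which is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_topics (values are made up for illustration)
toy = pd.DataFrame({
    'Topic_0': [0.70, 0.05, 0.25],
    'Topic_1': [0.10, 0.80, 0.25],
    'Topic_2': [0.10, 0.10, 0.25],
    'Topic_3': [0.10, 0.05, 0.25],
})
topic_cols = [c for c in toy.columns if c.startswith('Topic_')]

# Each document's proportions should sum to ~1 (small float error is expected)
row_sums = toy[topic_cols].sum(axis=1)
rows_ok = np.isclose(row_sums, 1.0, atol=1e-3).all()
print(f"All rows sum to 1: {rows_ok}")

# Hypothetical helper column: the topic with the largest share per document
# (on an exact tie, idxmax returns the first tied column)
toy['dominant_topic'] = toy[topic_cols].idxmax(axis=1)
print(toy['dominant_topic'].value_counts())
```

The same two checks should work on `df_analysis` by swapping it in for `toy`.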

5.3 Topic prevalence by sentiment

Compare topic usage in negative vs. positive reviews:

# Filter to negative and positive (exclude sparse neutral)
df_sentiment = df_analysis[df_analysis['sentiment'].isin(['negative', 'positive'])].copy()

# Remove unused category level
df_sentiment['sentiment'] = df_sentiment['sentiment'].cat.remove_unused_categories()

print(f"Negative reviews: {(df_sentiment['sentiment'] == 'negative').sum():,}")
print(f"Positive reviews: {(df_sentiment['sentiment'] == 'positive').sum():,}")

# Reshape for plotting
df_melt = df_sentiment.melt(
    id_vars=['sentiment', 'grade'],
    value_vars=['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3'],
    var_name='Topic',
    value_name='Proportion'
)

# Create horizontal boxplot (sorted by median difference)
fig, ax = plt.subplots(figsize=(10, 6))

# Calculate median difference for sorting
topic_medians = df_sentiment.groupby('sentiment', observed=True)[['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']].median()
median_diff = topic_medians.loc['positive'] - topic_medians.loc['negative']

# Sort by signed difference (Negative -> Positive) rather than absolute
topic_order = median_diff.sort_values(ascending=True).index.tolist()

# Create plot
sns.boxplot(
    data=df_melt,
    y='Topic',
    x='Proportion',
    hue='sentiment',
    order=topic_order,
    ax=ax,
    palette={'negative': 'salmon', 'positive': 'lightgreen'},
    showfliers=False            # outliers are hidden here for
                                # simplicity, but you should always
                                # inspect them in your own work
)

ax.set_ylabel('Topic')
ax.set_xlabel('Topic Proportion')
ax.set_title('Topic Prevalence by Review Sentiment')
ax.legend(title='Sentiment', loc='lower right')

plt.tight_layout()
plt.show()

Negative reviews: 1,003
Positive reviews: 531

5.4 Interpreting the sentiment boxplot

Each box shows the distribution of topic proportions across reviews:

Box: Contains middle 50% of values (25th to 75th percentile)
Line inside box: Median value
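Whiskers: Extend to the most extreme values within 1.5 × IQR of the box edges (seaborn’s default)
Dots: Outlier reviews beyond the whiskers (hidden in the plot above because showfliers=False)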

Interpreting differences:

Separated boxes: Strong sentiment difference (e.g., negative median 0.35, positive median 0.15)
Overlapping boxes: Topics used similarly across sentiments
Wide boxes: High variation within sentiment (some reviews emphasize topic, others don’t)

For each topic showing separation, examine the top words to understand what distinguishes negative from positive reviews.
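
To read the boxes numerically rather than by eye, you can compute the quartiles yourself. This is a minimal sketch on made-up proportions (the `toy` frame is an assumption, not the review data). It treats two boxes as “separated” when one group’s first quartile lies above the other’s third quartile, which is one reasonable reading of the guideline above rather than a formal test.

```python
import pandas as pd

# Made-up proportions for one topic in two sentiment groups
toy = pd.DataFrame({
    'sentiment': ['negative'] * 5 + ['positive'] * 5,
    'Topic_1': [0.30, 0.35, 0.40, 0.25, 0.45, 0.05, 0.10, 0.02, 0.08, 0.15],
})

# Quartiles per group: the box edges (Q1, Q3) and the median line
q = toy.groupby('sentiment')['Topic_1'].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ['Q1', 'median', 'Q3']
print(q.round(3))

# Boxes are "separated" when one group's Q1 lies above the other's Q3
neg, pos = q.loc['negative'], q.loc['positive']
separated = (neg['Q1'] > pos['Q3']) or (pos['Q1'] > neg['Q3'])
print(f"Boxes separated: {separated}")
```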

5.5 Statistical comparison

Let’s test whether topic prevalence differs significantly by sentiment:

print("Statistical tests (Mann-Whitney U):")
print("="*70)
print("Null hypothesis: Topic prevalence is the same in negative and positive reviews")
print()

for topic in ['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']:
    negative_vals = df_sentiment[df_sentiment['sentiment'] == 'negative'][topic]
    positive_vals = df_sentiment[df_sentiment['sentiment'] == 'positive'][topic]
    
    # Mann-Whitney U test (non-parametric, doesn't assume normal distribution)
    stat, pvalue = mannwhitneyu(negative_vals, positive_vals, alternative='two-sided')
    
    neg_median = negative_vals.median()
    pos_median = positive_vals.median()
    
    print(f"{topic}:")
    print(f"  Negative median: {neg_median:.3f}")
    print(f"  Positive median: {pos_median:.3f}")
    print(f"  p-value: {pvalue:.4f} {'***' if pvalue < 0.001 else '**' if pvalue < 0.01 else '*' if pvalue < 0.05 else 'ns'}")
    print()

print("Note on p-values: With large datasets (N > 1000), even small differences can be")
print("statistically significant. Look at the difference in medians (Effect Size) to")
print("judge if the difference is meaningful in practice.")

Statistical tests (Mann-Whitney U):
======================================================================
Null hypothesis: Topic prevalence is the same in negative and positive reviews

Topic_0:
  Negative median: 0.015
  Positive median: 0.098
  p-value: 0.0000 ***

Topic_1:
  Negative median: 0.273
  Positive median: 0.009
  p-value: 0.0000 ***

Topic_2:
  Negative median: 0.013
  Positive median: 0.638
  p-value: 0.0000 ***

Topic_3:
  Negative median: 0.185
  Positive median: 0.012
  p-value: 0.0000 ***

Note on p-values: With large datasets (N > 1000), even small differences can be
statistically significant. Look at the difference in medians (Effect Size) to
judge if the difference is meaningful in practice.

Correlation vs causation

These patterns show association between sentiment and topic usage, not causation. We cannot conclude “being negative causes using Topic X” from this analysis.

Both sentiment and topic usage reflect underlying review content. The relationship is:

Review content → Sentiment + Topic usage

Not: Sentiment → Topic usage

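In our case, every p-value prints as 0.0000 (that is, below 0.00005), so significance alone does not distinguish the topics. The medians do: Topic 0 shifts from 0.015 (negative) to 0.098 (positive), while Topic 2 shifts from 0.013 to 0.638, a far larger difference in prevalence.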
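
The note on p-values above suggests judging effect size from the medians. A standardized alternative that pairs naturally with the Mann-Whitney test is the rank-biserial correlation. It is not used in the original analysis, and the sketch below is one possible implementation on made-up numbers. The helper `rank_biserial` is a hypothetical name. It computes U from ranks directly so the result does not depend on which U a particular SciPy version returns.

```python
import numpy as np
from scipy.stats import rankdata

def rank_biserial(x, y):
    """Rank-biserial correlation between two independent samples.

    Ranges from -1 (every x is below every y) to +1 (every x is above
    every y); 0 means neither group tends to be larger.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    ranks = rankdata(np.concatenate([x, y]))        # ties get average ranks
    u_x = ranks[:n_x].sum() - n_x * (n_x + 1) / 2   # Mann-Whitney U for x
    return 2 * u_x / (n_x * n_y) - 1

# Made-up topic proportions, not the real review data
negative_toy = [0.30, 0.25, 0.40, 0.35, 0.28]
positive_toy = [0.02, 0.05, 0.30, 0.01, 0.03]
print(f"Rank-biserial r = {rank_biserial(negative_toy, positive_toy):.2f}")
```

Applied to the real data, you would pass `negative_vals` and `positive_vals` from the loop above. Values near ±1 indicate that almost every review in one group has a higher topic proportion than almost every review in the other.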

5.6 Temporal analysis

Do topics change over time? Let’s examine whether review bombing (negative complaints) decreases as positive reviews accumulate:

# Add time variable (weeks since release)
df_analysis['date'] = pd.to_datetime(df_filtered['date'].values)
df_analysis['week'] = ((df_analysis['date'] - df_analysis['date'].min()).dt.days // 7)

# Calculate weekly averages for each topic
weekly_topics = df_analysis.groupby('week')[['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']].mean()

# Plot
fig, ax = plt.subplots(figsize=(12, 6))

for topic in ['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']:
    ax.plot(weekly_topics.index, weekly_topics[topic], 
            marker='o', linewidth=2, label=topic)

ax.set_xlabel('Weeks Since Launch (March 20, 2020)', fontsize=12)
ax.set_ylabel('Mean Topic Proportion', fontsize=12)
ax.set_title('Topic Prevalence Over Time', fontsize=14, fontweight='bold')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print interpretation
print("\nTemporal pattern interpretation:")
print("="*70)
for topic in ['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']:
    week0 = weekly_topics.iloc[0][topic]
    week_last = weekly_topics.iloc[-1][topic]
    change = week_last - week0
    
    print(f"{topic}:")
    print(f"  Week 0: {week0:.3f}")
    print(f"  Week {weekly_topics.index[-1]}: {week_last:.3f}")
    print(f"  Change: {change:+.3f}")
    print()


Temporal pattern interpretation:
======================================================================
Topic_0:
  Week 0: 0.207
  Week 6: 0.188
  Change: -0.018

Topic_1:
  Week 0: 0.255
  Week 6: 0.204
  Change: -0.051

Topic_2:
  Week 0: 0.320
  Week 6: 0.331
  Change: +0.010

Topic_3:
  Week 0: 0.218
  Week 6: 0.277
  Change: +0.059
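
A caution when reading these changes: weekly means are only as reliable as the number of reviews behind them, and later weeks may contain far fewer reviews than launch week. The sketch below uses simulated data (the weekly counts of 200 and 8 are invented for illustration, not taken from the dataset) to show how reporting the count and standard error next to each mean exposes this.

```python
import numpy as np
import pandas as pd

# Simulated data: many reviews in week 0, few in week 6 (counts are made up)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'week': [0] * 200 + [6] * 8,
    'Topic_1': np.concatenate([rng.beta(2, 6, 200), rng.beta(2, 6, 8)]),
})

# Mean, number of reviews, and standard error of the mean per week
weekly = toy.groupby('week')['Topic_1'].agg(['mean', 'count', 'sem'])
print(weekly.round(3))
```

Both weeks are drawn from the same distribution here, so any gap between their means is noise. Running the same aggregation on `df_analysis` would show whether the real week-to-week changes are larger than their standard errors.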

6 Topic correlations

Do certain topics tend to appear together in the same reviews? Correlation analysis reveals co-occurrence patterns.

# Calculate correlation matrix
corr_matrix = df_analysis[['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']].corr()

# Visualize heatmap
fig, ax = plt.subplots(figsize=(8, 6))

sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    vmin=-0.5,
    vmax=0.5,
    center=0,
    square=True,
    cbar_kws={'label': 'Correlation'},
    ax=ax
)

ax.set_title('Topic Correlations', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

6.1 Interpreting correlation values

Correlation	Interpretation
> 0.30	Strong positive relationship; topics often co-occur
0.10 to 0.30	Moderate positive relationship
-0.10 to 0.10	Weak/no relationship; topics independent
-0.30 to -0.10	Moderate negative relationship
< -0.30	Strong negative relationship; topics rarely co-occur

Positive correlation suggests topics appear together (e.g., reviews discussing aesthetics also discuss relaxation). Negative correlation suggests topics tend not to share a review (e.g., complaint-focused reviews rarely discuss positive aspects).
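
One caveat applies to any correlation between topic proportions: each document’s proportions sum to 1, so a higher share for one topic forces lower shares elsewhere. This alone pushes correlations negative. The simulation below is an illustration under simplifying assumptions (a symmetric Dirichlet with four components, which is not the fitted model’s actual distribution). It shows the size of this built-in effect when topics have no other relationship.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated proportions for 4 "topics" with no built-in relationship
# other than the requirement that each row sums to 1
simulated = pd.DataFrame(
    rng.dirichlet(alpha=[1.0, 1.0, 1.0, 1.0], size=5000),
    columns=[f'Topic_{i}' for i in range(4)],
)

corr = simulated.corr()
off_diagonal = corr.values[~np.eye(4, dtype=bool)]
print(corr.round(2))
print(f"Mean off-diagonal correlation: {off_diagonal.mean():.2f}")
```

For a symmetric Dirichlet with K components the theoretical correlation is -1/(K-1), so about -0.33 here. Moderately negative values in the real heatmap are therefore expected with k=4 and do not by themselves show that two topics oppose each other.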

When to use correlation analysis

Topic correlation is useful for:

Understanding rhetorical combinations (which themes co-occur?)
Identifying conceptually related topics
Exploring discourse patterns

However, correlation does not reveal:

Causal relationships
Hierarchical structure (which topics are subtopics of others)

For network-like relationships, see Lab 05.2 (Semantic Networks).

7 Validation through human coding

Computational metrics (coherence, stability) assess internal consistency, but cannot tell us if topics are meaningful. For that, we need human judgment.

7.1 The gold standard: Inter-coder reliability

When we claim “Topic 1 represents gameplay discourse,” we make an interpretive claim. To validate, we ask: would other researchers agree?

Inter-coder reliability measures agreement between independent human coders. If multiple people interpret a topic the same way, we have evidence the topic is coherent and meaningful.

7.2 Designing a coding study

For a topic we labeled “Multiplayer Complaints” (based on top words), we could ask two independent coders to:

Read the top 20 reviews for this topic
For each review, answer: “Does this review primarily discuss multiplayer or island-sharing issues?” (Yes/No)
Compare their answers
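
Selecting the “top” reviews for a topic can be done by sorting on that topic’s proportion. A minimal sketch on a made-up frame follows. The column names match `df_analysis`, but the texts and values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'text': ['review a', 'review b', 'review c', 'review d', 'review e'],
    'Topic_1': [0.10, 0.85, 0.40, 0.92, 0.05],
})

# The n reviews with the highest proportion of the chosen topic
top_reviews = toy.nlargest(3, 'Topic_1')[['text', 'Topic_1']]
print(top_reviews)
```

On the real data, `df_analysis.nlargest(20, 'Topic_1')` would return the 20 reviews most dominated by Topic 1.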

Here is hypothetical coding data:

# Hypothetical inter-coder reliability data
coding_data = pd.DataFrame({
    'document_id': range(1, 21),
    'Coder_A': [1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    'Coder_B': [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1]
})

print("Inter-coder reliability example:")
print("(1 = review discusses multiplayer issues, 0 = does not)")
print(coding_data.head(10))

# Calculate simple agreement
agreements = (coding_data['Coder_A'] == coding_data['Coder_B']).sum()
total = len(coding_data)
simple_agreement = agreements / total

print(f"\nSimple agreement: {agreements}/{total} = {simple_agreement:.2f}")
print("\nNote: Simple agreement can be misleading because some agreement")
print("occurs by chance. We need a measure that corrects for chance agreement.")

Inter-coder reliability example:
(1 = review discusses multiplayer issues, 0 = does not)
   document_id  Coder_A  Coder_B
0            1        1        1
1            2        1        1
2            3        1        1
3            4        0        0
4            5        1        0
5            6        0        0
6            7        1        1
7            8        1        1
8            9        0        1
9           10        1        1

Simple agreement: 15/20 = 0.75

Note: Simple agreement can be misleading because some agreement
occurs by chance. We need a measure that corrects for chance agreement.

7.3 Calculating Krippendorff’s alpha

Simple percentage agreement can be misleading because some agreement occurs by chance. Krippendorff’s alpha corrects for chance agreement.

# Install if needed: uv pip install simpledorff
import simpledorff

# Prepare data in long format
coding_long = coding_data.melt(
    id_vars='document_id',
    var_name='coder',
    value_name='annotation'
)

# Calculate alpha
alpha = simpledorff.calculate_krippendorffs_alpha_for_df(
    coding_long,
    experiment_col='document_id',
    annotator_col='coder',
    class_col='annotation'
)

print("="*70)
print("Krippendorff's Alpha")
print("="*70)
print(f"Alpha = {alpha:.3f}")
print()

======================================================================
Krippendorff's Alpha
======================================================================
Alpha = 0.389

Installing simpledorff

If simpledorff is not installed, run:

uv pip install simpledorff

Alternatively, for simpler cases (two coders, binary coding), you can use Cohen’s kappa, a related chance-corrected statistic: sklearn.metrics.cohen_kappa_score
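
To see what “correcting for chance” means in practice, both statistics can be computed by hand for the two-coder example above. This is a sketch for the simplest case only (two coders, nominal codes, no missing values). The function names are hypothetical, and for real studies a tested library is the safer choice.

```python
import numpy as np

# Same hypothetical codes as in the example above
coder_a = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
coder_b = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1])

def cohens_kappa(a, b):
    """Cohen's kappa for two coders with nominal labels."""
    a, b = np.asarray(a), np.asarray(b)
    p_observed = np.mean(a == b)
    labels = np.union1d(a, b)
    # Chance agreement: each coder picks labels at their own base rates
    p_expected = sum(np.mean(a == lab) * np.mean(b == lab) for lab in labels)
    return (p_observed - p_expected) / (1 - p_expected)

def krippendorff_alpha_nominal(a, b):
    """Krippendorff's alpha: nominal data, two coders, no missing values."""
    a, b = np.asarray(a), np.asarray(b)
    n_values = 2 * len(a)                       # total number of codes
    observed_disagreement = np.mean(a != b)
    _, counts = np.unique(np.concatenate([a, b]), return_counts=True)
    # Chance that two codes drawn without replacement from the pool differ
    expected_disagreement = 1 - np.sum(counts * (counts - 1)) / (n_values * (n_values - 1))
    return 1 - observed_disagreement / expected_disagreement

print(f"Simple agreement:     {np.mean(coder_a == coder_b):.3f}")
print(f"Cohen's kappa:        {cohens_kappa(coder_a, coder_b):.3f}")
print(f"Krippendorff's alpha: {krippendorff_alpha_nominal(coder_a, coder_b):.3f}")
```

The hand calculation reproduces the alpha of 0.389 reported above, and kappa comes out close to it. Both sit far below the 0.75 simple agreement because most codes are 1, so two coders would often agree on 1 by chance alone.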

7.4 Interpreting alpha values

Alpha	Interpretation	Action
> 0.80	Excellent agreement	Topic interpretation reliable
0.67 - 0.80	Good agreement	Acceptable for most purposes
0.60 - 0.67	Marginal agreement	Use cautiously; consider refining
< 0.60	Poor agreement	Topic interpretation unreliable; discard or redefine

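In this example, alpha = 0.389, well below the 0.60 threshold, even though simple agreement was 0.75. The coders do not agree reliably, so the “Multiplayer Complaints” interpretation would need to be refined, or the coding question clarified, before relying on it.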

Limitations of inter-coder reliability

High agreement does not guarantee a topic is meaningful for your research question. It only means coders agree on the definition.

Two coders could reliably identify a nonsensical category. Inter-coder reliability should complement, not replace:

Domain expertise
Theoretical grounding
Close reading of documents

7.5 Sample size for validation

How many documents should be coded?

General guidelines:

Small topics (< 50 documents): Code 20-30 documents
Medium topics (50-500 documents): Code 50-100 documents
Large topics (> 500 documents): Code 100-200 documents

Alternatively, code a fixed proportion: 10-20% of the documents assigned to the topic, with a minimum of 20.

Tradeoff: More coding = higher precision, but more human time/cost.
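
The proportion rule can be written as a small helper. This is one possible reading of the guideline, using the lower bound of 10% by default. The function name and the cap at the topic’s size are assumptions added so that small topics never request more documents than exist.

```python
import math

def coding_sample_size(n_topic_docs, proportion=0.10, minimum=20):
    """Documents to code: a share of the topic's documents, at least
    `minimum`, and never more than the topic actually has."""
    target = max(minimum, math.ceil(proportion * n_topic_docs))
    return min(n_topic_docs, target)

for n_docs in (15, 120, 1500):
    print(n_docs, '->', coding_sample_size(n_docs))
```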

Blind coding

To ensure valid results, use blind coding: do not tell your coders which topic the computer assigned to each document. If they know the computer thinks a review is about “Multiplayer”, they might be biased to agree. Present the documents in random order without topic labels.
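
A blind coding sheet can be prepared along these lines. The sketch uses made-up reviews, and the `item_id`, `coding_sheet`, and `answer_key` names are illustrative. The idea is to shuffle the documents, hand coders only a neutral ID and the text, and keep the topic assignments in a separate key.

```python
import pandas as pd

# Made-up reviews with a model-assigned dominant topic
toy = pd.DataFrame({
    'text': [f'review text {i}' for i in range(6)],
    'dominant_topic': ['Topic_1', 'Topic_3', 'Topic_1', 'Topic_2', 'Topic_0', 'Topic_1'],
})

# Shuffle so order carries no information, then give each row a neutral ID
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled['item_id'] = range(1, len(shuffled) + 1)

# Coders see only the ID and the text
coding_sheet = shuffled[['item_id', 'text']]

# The researcher keeps the key private until coding is finished
answer_key = shuffled[['item_id', 'dominant_topic']]

print(coding_sheet)
```

After coding, merging the coders’ answers with `answer_key` on `item_id` recovers the topic assignment for each judgment.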

8 Reporting topic models in research

When publishing research using topic modeling, transparency about methods and evaluation is essential.

8.1 Minimum reporting standards

Model specification:

Algorithm (LDA, CTM, STM, etc.)
Number of topics (k) and how chosen
Preprocessing steps (lemmatization, stopwords, filtering)
Hyperparameters (alpha, eta/beta, passes, random_state)

Evaluation:

Coherence scores (if used to select k)
Stability assessment (e.g., “top 10 words stable across 5 runs”)
Validation approach (close reading, inter-coder reliability, expert review)

Interpretation:

Topic labels and how assigned
Representative documents or quotes for each topic
Prevalence (proportion of corpus each topic represents)

8.2 Writing about topics

Use tentative language that acknowledges topics are analytical constructs:

Good examples:

“Topic 1, which we interpret as gameplay discourse, comprises words like…”
“Documents assigned primarily to Topic 1 tend to discuss…”
“We identify a topic characterized by words such as ‘island’, ‘villager’, ‘design’”

Avoid:

“Topic 1 is the gameplay topic” (reifies the construct)
“LDA discovered that 30% of reviews are about gameplay” (implies objective truth)
“The complaint topic shows…” (without acknowledging interpretation)

8.3 Common reporting pitfalls

Not reporting k selection process: Readers need to know how you chose number of topics
Claiming topics are “discovered truths”: Topics are model outputs, not inherent document properties
Omitting failed topics: If some topics were incoherent, report this
Cherry-picking documents: Show representative documents, not just best examples
Ignoring stability: Report any stability issues encountered

Making research reproducible

Share when possible:

Preprocessed corpus (if no privacy/copyright restrictions)
Trained model file (lda_model.save('model.pkl'))
Full preprocessing code
Random seed used (random_state=42)

This allows others to verify and build on your work.
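
Alongside the model file, it can help to record the settings and library versions in a small machine-readable manifest. The sketch below is one way to do this with the standard library only. The function name and manifest layout are assumptions, and the parameter values are those of the final model in this lab.

```python
import json
import platform
from importlib import metadata

def build_run_manifest(params, packages=('gensim', 'spacy', 'pandas')):
    """Collect the settings and library versions needed to rerun a model."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        'python': platform.python_version(),
        'packages': versions,
        'model_params': params,
    }

# Settings taken from the final model trained in this lab
manifest = build_run_manifest({'num_topics': 4, 'random_state': 42, 'passes': 20})
print(json.dumps(manifest, indent=2))
```

Writing the same dictionary to a file with `json.dump` and sharing it with the code gives readers the exact configuration without having to search the notebook for it.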

9 Exercises

9.1 Exercise 1: Stability assessment

Using the 5 models already trained in this lab, conduct a deeper stability analysis.

Tasks:

Examine the strong_matches dataframe. Identify chains of matching topics across models (e.g., Topic 2 in Model 0 → Topic 0 in Model 1 → Topic 3 in Model 2). How many such chains can you find?
For the best-matching chain (highest average Jaccard), extract the top 20 words from each model’s version of that topic. How many words appear in all 5 versions?
Compare the stability heatmap (Model 0 vs Model 1) to a different pair (e.g., Model 2 vs Model 3). Do the same topics match, or do patterns differ?
Reflect: If a topic appears in only 2 out of 5 models with Jaccard > 0.7, would you trust research conclusions based on that topic? Why or why not?
What does low stability tell you about this corpus? (Hint: recall the “review bombing” note about this dataset)

9.2 Exercise 2: Sentiment hypothesis testing

Hypothesis: “Negative reviews focus on multiplayer issues; positive reviews focus on aesthetics and relaxation.”

Tasks:

Examine the boxplot from the sentiment analysis section
Identify which topic(s) represent “multiplayer issues” (look at top words)
Identify which topic(s) represent “aesthetics/relaxation”
Do the data support the hypothesis? Provide evidence
What alternative explanations could account for the patterns you observe?

9.3 Exercise 3: Choosing k

Using the coherence analysis results:

What value of k has the highest coherence score?
Does the plot show a clear “elbow” point? Where?
Compare k=3 vs k=6: examine the topics for each (you’ll need to retrain models)
Which k value is more interpretable? Why?
Does highest coherence always mean best interpretability?

9.4 Exercise 4: HCI application

Imagine you are a UX researcher at Nintendo analyzing user feedback:

Identify the “UX complaint” topic in your model (based on top words)
Extract the top 20 reviews for this topic
Read them and note:
- What specific design decision is criticized?
- How do users describe the impact on their experience?
- Do users suggest solutions?
Based on this analysis, what would you recommend to the design team?
Reflect: How does topic modeling help vs. hurt in UX research?

9.5 Exercise 5: Temporal patterns

Looking at the temporal analysis plot:

Which topic(s) decrease over time?
Which topic(s) increase over time?
What explains these patterns? (Hint: consider review bombing phenomenon)
Does topic modeling help understand how public discourse evolves?
What are limitations of this temporal analysis?

9.6 Exercise 6: Validation design

Design an inter-coder reliability study for your topic model:

Choose one topic to validate
How many documents would you have coders read?
What specific question would you ask coders?
What Krippendorff’s alpha threshold would you accept for publication?
Besides inter-coder reliability, what other validation steps would you take?