This notebook assumes you have completed Part 1 (Introduction to Topic Modeling). You should be familiar with:
LDA conceptual model (documents as mixtures of topics)
Preprocessing for topic modeling
Interpreting topic-word distributions
Using pyLDAvis visualization
If you have not completed Part 1, start there first.
1 Why evaluation matters
In Part 1, you built a topic model and interpreted its results. But how do you know if your topics are reliable? How do you choose the number of topics (k)? How do you validate that topics represent meaningful patterns rather than statistical artifacts?
Topic models are analytical constructs. Different choices produce different results:
Different preprocessing (stopwords, filtering)
Different k values (number of topics)
Different random initializations (LDA is stochastic)
Systematic evaluation helps us assess:
Stability: Do topics remain consistent across runs?
Coherence: Do topics make semantic sense?
Validity: Do topics align with human interpretation?
Generalizability: Do findings hold across different subsets of data?
This lab covers methods for evaluating topic models rigorously.
2 Setup and data
We use the same Animal Crossing reviews dataset from Part 1. Let’s reload and preprocess:
import pandas as pd
import numpy as np
import spacy
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import mannwhitneyu

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

print("Libraries loaded successfully")
Libraries loaded successfully
# NOTE: This preprocessing code duplicates Part 1 for standalone use.
# In practice, you would save the corpus and model from Part 1 and load them here.
# We repeat for pedagogical clarity.

# Load data
df = pd.read_csv('data/animal-crossing/user_reviews.tsv', sep='\t')

# Filter and deduplicate
df['word_count'] = df['text'].str.split().str.len()
df_filtered = df[df['word_count'] >= 50].copy()
df_filtered = df_filtered.drop_duplicates(subset=['text'])

# Add sentiment categories
df_filtered['sentiment'] = pd.cut(
    df_filtered['grade'],
    bins=[-1, 3, 7, 10],
    labels=['negative', 'neutral', 'positive']
)

print(f"Filtered corpus: {len(df_filtered):,} reviews")

# Preprocess
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

def process_text(texts):
    """Clean and tokenize texts for topic modeling."""
    processed_docs = []
    # Custom stopwords (consistent with Part 1)
    custom_stops = {'game', 'play', 'nintendo', 'switch', 'island'}
    for doc in nlp.pipe(texts, batch_size=50):
        tokens = [token.lemma_.lower() for token in doc
                  if not token.is_stop
                  and not token.is_punct
                  and token.is_alpha
                  and len(token) > 3
                  and token.lemma_.lower() not in custom_stops]
        processed_docs.append(tokens)
    return processed_docs

print("Processing reviews...")
texts_processed = process_text(df_filtered['text'].tolist())

# Build dictionary and corpus
id2word = corpora.Dictionary(texts_processed)
id2word.filter_extremes(no_below=5, no_above=0.8)
corpus = [id2word.doc2bow(text) for text in texts_processed]

print(f"Vocabulary: {len(id2word):,} words")
print(f"Corpus: {len(corpus):,} documents")
One fundamental challenge in topic modeling: we must decide how many topics (k) exist in our corpus. There is no “correct” answer - this is a modeling choice.
We can use coherence scores as a heuristic. Coherence measures how often the top words in a topic appear together in the same documents. Higher coherence suggests the topic represents a meaningful theme.
3.1 What is coherence?
Coherence asks: do the top words in a topic actually co-occur in documents? If a topic has high coherence, its top words appear together frequently, suggesting they represent a real pattern.
Consider two hypothetical topics:
Topic A (high coherence):
Top words: “island”, “villager”, “design”, “custom”, “decoration”
These words co-occur in reviews about creative gameplay
Topic B (low coherence):
Top words: “game”, “play”, “thing”, “like”, “get”
These are high-frequency words that appear everywhere but don’t form a coherent theme
Coherence scores quantify this intuition.
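To make the intuition concrete, here is a minimal sketch of a co-occurrence score based on normalized pointwise mutual information (NPMI) computed over whole documents. It is a simplified stand-in and not the c_v metric used below; the function name `npmi_coherence` and the toy corpus are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents):
    """Average document-level NPMI over all pairs of top words.

    Simplified illustration only: this is NOT gensim's c_v metric.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1 = sum(w1 in d for d in doc_sets) / n_docs
        p2 = sum(w2 in d for d in doc_sets) / n_docs
        p12 = sum(w1 in d and w2 in d for d in doc_sets) / n_docs
        if p12 == 0:
            scores.append(-1.0)   # the two words never co-occur
        elif p12 == 1:
            scores.append(0.0)    # both words are in every document: no information
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)

# Hypothetical toy corpus of tokenized reviews
toy_docs = [
    ["villager", "design", "custom", "like", "thing"],
    ["villager", "design", "custom", "like", "get"],
    ["villager", "design", "thing", "get", "like"],
    ["save", "console", "like", "thing", "get"],
    ["save", "console", "thing", "get", "like"],
    ["console", "share", "like", "get", "thing"],
]

topic_a = ["villager", "design", "custom"]   # words that cluster in some reviews
topic_b = ["like", "thing", "get"]           # words that appear almost everywhere

score_a = npmi_coherence(topic_a, toy_docs)
score_b = npmi_coherence(topic_b, toy_docs)
print(f"Topic A: {score_a:.2f}  Topic B: {score_b:.2f}")
```

In this toy corpus the thematic words score well above the ubiquitous ones, because NPMI rewards words that co-occur more often than their individual frequencies would predict.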
3.2 Interpreting coherence scores
We use the c_v coherence metric, which ranges from 0 to 1:
| Coherence (c_v) | Interpretation |
|---|---|
| < 0.40 | Poor topic quality; topics likely incoherent |
| 0.40 - 0.50 | Moderate; topics may be interpretable |
| 0.50 - 0.60 | Good; topics generally coherent |
| > 0.60 | Excellent; topics well-defined |
These are guidelines, not rules. Always examine actual topics (words) to verify they make semantic sense.
Note: Expected coherence for this dataset
Due to review bombing (many reviews focusing on one complaint), this corpus is not ideal for topic modeling. You should see coherence scores around 0.35-0.40 - lower than typical well-structured corpora.
This demonstrates an important lesson: corpus composition affects topic quality. When most documents discuss the same theme, LDA struggles to find distinct topics.
Despite low coherence, topics may still be interpretable if you examine them carefully. The low scores reflect the limitation of the data, not the method.
3.3 Computing coherence for different k values
Let’s test k from 2 to 10 topics:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    """
    Compute coherence scores for different numbers of topics.

    Parameters:
        dictionary: Gensim dictionary
        corpus: Gensim corpus (bag-of-words)
        texts: Tokenized texts (list of lists)
        limit: Upper bound on the number of topics (exclusive)
        start: Minimum number of topics
        step: Step size

    Returns:
        model_list: List of trained models
        coherence_values: List of coherence scores
    """
    coherence_values = []
    model_list = []

    for num_topics in range(start, limit, step):
        print(f"  Training model with k={num_topics}...")

        # Train LDA model
        model = gensim.models.LdaMulticore(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,
            workers=1
        )
        model_list.append(model)

        # Calculate coherence
        coherencemodel = CoherenceModel(
            model=model,
            texts=texts,
            dictionary=dictionary,
            coherence='c_v'
        )
        coherence_score = coherencemodel.get_coherence()
        coherence_values.append(coherence_score)
        print(f"  k={num_topics}: coherence = {coherence_score:.4f}")

    return model_list, coherence_values

# Run coherence analysis (limit=11 tests k = 2..10)
print("Computing coherence scores (this takes a few minutes)...")
print("="*70)

model_list, coherence_values = compute_coherence_values(
    dictionary=id2word,
    corpus=corpus,
    texts=texts_processed,
    start=2,
    limit=11,
    step=1
)

print("\n" + "="*70)
print("Coherence scores summary:")
print("="*70)
for k, score in zip(range(2, 11), coherence_values):
    print(f"k={k}: {score:.4f}")
Computing coherence scores (this takes a few minutes)...
======================================================================
Training model with k=2...
k=2: coherence = 0.3569
Training model with k=3...
k=3: coherence = 0.3949
Training model with k=4...
k=4: coherence = 0.3939
Training model with k=5...
k=5: coherence = 0.4081
Training model with k=6...
k=6: coherence = 0.3944
Training model with k=7...
k=7: coherence = 0.3888
Training model with k=8...
k=8: coherence = 0.3942
Training model with k=9...
k=9: coherence = 0.3888
Training model with k=10...
k=10: coherence = 0.3704
======================================================================
Coherence scores summary:
======================================================================
k=2: 0.3569
k=3: 0.3949
k=4: 0.3939
k=5: 0.4081
k=6: 0.3944
k=7: 0.3888
k=8: 0.3942
k=9: 0.3888
k=10: 0.3704
3.4 Visualizing the elbow method
Plot coherence scores to identify the “elbow” - where adding more topics stops improving coherence significantly:
fig, ax = plt.subplots(figsize=(10, 6))

k_values = list(range(2, 11))
ax.plot(k_values, coherence_values, marker='o', linewidth=2, markersize=8, color='steelblue')
ax.set_xlabel('Number of Topics (k)', fontsize=12)
ax.set_ylabel('Coherence Score (c_v)', fontsize=12)
ax.set_title('Coherence Scores by Number of Topics', fontsize=14, fontweight='bold')
ax.set_xticks(k_values)
ax.grid(True, alpha=0.3)

# Highlight maximum
max_idx = np.argmax(coherence_values)
max_k = k_values[max_idx]
max_score = coherence_values[max_idx]
ax.plot(max_k, max_score, marker='*', markersize=20, color='red',
        label=f'Maximum: k={max_k} ({max_score:.3f})')
ax.legend(fontsize=11)

plt.tight_layout()
plt.show()

print(f"\nMaximum coherence: k={max_k} with score {max_score:.4f}")
Maximum coherence: k=5 with score 0.4081
3.5 Choosing k in practice
The coherence plot helps identify reasonable k values, but the choice involves tradeoffs:
Higher coherence suggests semantically coherent topics, but doesn’t guarantee:
Topics are useful for your research question
Topics are interpretable by humans
Topics are stable across runs
Consider:
Elbow point: Where does coherence plateau?
Interpretability: Can you assign meaningful labels to all topics?
Coverage: Do topics capture important themes in your corpus?
Parsimony: Fewer topics (k=4-6) are easier to interpret than many topics (k=15-20)
For this corpus specifically:
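Looking at the coherence scores, they are all relatively low (roughly 0.36-0.41) with no pronounced peak. The highest score is at k=5 (0.408), but it is only about 0.01 above the scores for k=3, 4, 6, and 8. The lowest score is at k=2 (0.357), and only two topics would oversimplify the corpus anyway.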
The plot shows coherence is relatively flat across k=2 to k=10, suggesting the review bombing pattern makes it difficult for LDA to find highly distinct topics regardless of k.
In this situation, we choose k=4 based on:
Interpretability: Four topics are manageable to interpret and validate
Balance: Captures main themes (complaints, gameplay, aesthetics, comparisons) without excessive granularity
Pedagogical value: Demonstrates that topic modeling doesn’t always produce high-coherence results
This teaches an important lesson: when coherence is uniformly low, examine the corpus composition. The review bombing phenomenon limits topic diversity, which affects coherence scores.
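One way to make "no clear peak" operational is to list every k whose coherence lies within a small tolerance of the maximum, and then choose among those candidates on interpretability grounds. The sketch below uses the scores printed earlier in this notebook; the helper name `near_best_k` and the tolerance of 0.015 are assumptions made for illustration, not an established rule.

```python
def near_best_k(scores_by_k, tolerance=0.015):
    """Return all k whose coherence is within `tolerance` of the best score."""
    best = max(scores_by_k.values())
    return sorted(k for k, score in scores_by_k.items() if best - score <= tolerance)

# Coherence scores copied from the output earlier in this notebook
scores_by_k = {
    2: 0.3569, 3: 0.3949, 4: 0.3939, 5: 0.4081, 6: 0.3944,
    7: 0.3888, 8: 0.3942, 9: 0.3888, 10: 0.3704,
}

candidates = near_best_k(scores_by_k)
print(f"Candidate k values: {candidates}")
```

With this tolerance, k=4 is among the candidates, so preferring it over k=5 for interpretability costs very little coherence. A different tolerance would give a different candidate list, which is why the final choice still rests on reading the topics.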
Warning: Coherence is a heuristic, not truth
High coherence does not guarantee topics are “correct” or meaningful for your research question. Always validate topics through:
Close reading of high-loading documents
Domain expertise
Alignment with theoretical expectations
Use coherence to narrow options, then choose k based on interpretability.
Topic modeling uses probabilistic algorithms that incorporate randomness during training. This means running the same code twice on the same data can produce different results.
For researchers, this raises an important question: how stable are our topics? If topics change substantially between runs, can we trust our findings?
You start worrying about topic stability once you have decided on the number of topics for your model. Although our results above suggest that 5 topics is a slightly better choice, we will stick to 4 topics for demonstration purposes. The amount of computation needed to find stable topics grows quadratically with the number of topics, because we have to do pairwise comparisons between several models (at least 5) and between every pair of topics within each pair of models: for 4 topics we need 160 comparisons, and for 5 topics, 250. Here is a plot showing how the number of comparisons grows:
# 1. Define parameters
num_models = 5
model_pairs = num_models * (num_models - 1) // 2  # 10 pairs for 5 models

# 2. Generate data (topics 2 to 100)
topics_range = range(2, 101)
data = []
for t in topics_range:
    total_comparisons = model_pairs * (t ** 2)
    data.append({"Topics": t, "Comparisons": total_comparisons})

# Use a separate name so the reviews DataFrame `df` is not overwritten
df_comparisons = pd.DataFrame(data)

# 3. Plot
plt.figure(figsize=(10, 6))
sns.set_theme(style="whitegrid")
sns.lineplot(data=df_comparisons, x="Topics", y="Comparisons", linewidth=2.5)
plt.title(f"Growth of Comparisons for {num_models} Models (${model_pairs} \\times T^2$)", fontsize=14)
plt.ylabel("Total Comparisons")
plt.xlabel("Number of Topics per Model")
plt.show()
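The counts quoted above follow from a simple formula: every topic in one model is compared with every topic in another model, for every pair of models. As a worked check, writing $M$ for the number of models and $T$ for the number of topics per model (the function name $C$ is introduced here only for illustration):

```latex
C(M, T) = \binom{M}{2}\, T^{2} = \frac{M(M-1)}{2}\, T^{2},
\qquad
C(5, 4) = 10 \cdot 16 = 160,
\qquad
C(5, 5) = 10 \cdot 25 = 250.
```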
You may ignore topic stability if you use a topic model to reduce your data for other purposes, such as training a classifier or doing a preliminary exploration. In industry, topic stability is less of an issue because topic modeling is often a routine procedure powering a feature such as recommending similar texts (or products). These models are constantly updated with new data, and there is no reasonable way to keep track of stable topics. Still, you should care about stability if you plan to draw conclusions from the discovered topics.
4.1 The challenge of stochastic algorithms
LDA uses random initialization and random sampling during training. Each run explores the space of possible topic solutions differently. While the algorithm converges to similar solutions, topics may differ slightly between runs.
This is not a flaw - it reflects the nature of unsupervised learning. However, we must assess stability to ensure findings are robust.
4.2 The topic labeling problem
Here’s a crucial insight: topic IDs are arbitrary labels. When you train a topic model, the algorithm assigns numeric IDs (0, 1, 2, 3) to topics, but these numbers have no inherent meaning.
What this means:
In Model A, “gameplay mechanics” might be labeled Topic 0
In Model B, the same semantic topic might be labeled Topic 2
The numeric IDs are just arbitrary labels assigned during training
This creates a challenge: to assess stability, we cannot simply compare Topic 0 vs Topic 0 across models. We must find which topics from Model A correspond to which topics in Model B by comparing their content (top words).
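As a small illustration of matching by content rather than by ID, the sketch below aligns the topics of two hypothetical models by trying every one-to-one assignment and keeping the one with the highest total Jaccard overlap. Brute force over permutations is only practical for small k, and the helper name `best_alignment` and the toy word lists are assumptions for illustration; the notebook's own matching code appears later and works differently.

```python
from itertools import permutations

def jaccard(words_a, words_b):
    """Jaccard similarity between two collections of words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_alignment(topics_a, topics_b):
    """Find the one-to-one topic mapping with the highest total Jaccard.

    Returns (mapping, mean_similarity), where mapping[i] is the index in
    topics_b matched to topic i in topics_a. Assumes both models have the
    same number of topics.
    """
    k = len(topics_a)
    best_perm, best_total = None, -1.0
    for perm in permutations(range(k)):
        total = sum(jaccard(topics_a[i], topics_b[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_perm, best_total = perm, total
    return dict(enumerate(best_perm)), best_total / k

# Hypothetical top words from two runs; the same themes get different IDs
model_a = [
    ["player", "console", "progress", "multiplayer"],
    ["design", "villager", "decoration", "cute"],
    ["save", "account", "share", "cloud"],
]
model_b = [
    ["save", "cloud", "account", "console"],
    ["player", "progress", "multiplayer", "second"],
    ["villager", "design", "cute", "relaxing"],
]

mapping, mean_sim = best_alignment(model_a, model_b)
for a_id, b_id in mapping.items():
    print(f"Model A Topic {a_id} matches Model B Topic {b_id}")
print(f"Mean Jaccard of matched pairs: {mean_sim:.2f}")
```

Comparing Topic 0 with Topic 0 directly would have paired a multiplayer topic with a save-data topic here, which is exactly the mistake that content-based matching avoids.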
4.3 Training multiple models
To properly assess stability, we need multiple models (typically 5) trained with different random seeds:
Note: Technical detail: workers=1
We set workers=1 to force single-threaded execution. When using multiple threads (workers > 1), the order in which processor cores finish tasks can introduce variations even with a fixed random seed. For reproducibility and stability testing, single-threaded execution is essential.
print("Training 5 models with different random seeds...")
print("This will take several minutes.\n")

models = []
seeds = [42, 100, 999, 2023, 5555]

for i, seed in enumerate(seeds):
    print(f"Training model {i+1}/5 (seed={seed})...")
    model = gensim.models.LdaMulticore(
        corpus=corpus,
        id2word=id2word,
        num_topics=4,
        random_state=seed,
        passes=10,
        workers=1
    )
    models.append(model)

print("\nAll models trained!")
Training 5 models with different random seeds...
This will take several minutes.
Training model 1/5 (seed=42)...
Training model 2/5 (seed=100)...
Training model 3/5 (seed=999)...
Training model 4/5 (seed=2023)...
Training model 5/5 (seed=5555)...
All models trained!
Let’s examine how topic labels differ across runs by looking at the top words:
# Compare Topic 0 across first two models
print("="*70)
print("Comparing 'Topic 0' across two different models")
print("="*70)

print("\nMODEL 1 - Topic 0:")
for word, prob in models[0].show_topic(0, topn=10):
    print(f"  {word}: {prob:.4f}")

print("\nMODEL 2 - Topic 0:")
for word, prob in models[1].show_topic(0, topn=10):
    print(f"  {word}: {prob:.4f}")

print("\n" + "="*70)
print("OBSERVATION")
print("="*70)
print("Notice: Topic 0 in Model 1 may represent a DIFFERENT semantic theme")
print("than Topic 0 in Model 2. The numeric IDs are arbitrary!")
print("\nTo find stable topics, we must compare EVERY topic from Model 1")
print("against EVERY topic from Model 2 to find the best matches.")
======================================================================
Comparing 'Topic 0' across two different models
======================================================================
MODEL 1 - Topic 0:
player: 0.0210
people: 0.0192
experience: 0.0163
expand: 0.0143
console: 0.0138
animal: 0.0125
thing: 0.0123
review: 0.0122
like: 0.0121
crossing: 0.0118
MODEL 2 - Topic 0:
player: 0.0459
console: 0.0186
second: 0.0166
progress: 0.0137
expand: 0.0128
like: 0.0116
start: 0.0105
want: 0.0103
multiplayer: 0.0092
person: 0.0089
======================================================================
OBSERVATION
======================================================================
Notice: Topic 0 in Model 1 may represent a DIFFERENT semantic theme
than Topic 0 in Model 2. The numeric IDs are arbitrary!
To find stable topics, we must compare EVERY topic from Model 1
against EVERY topic from Model 2 to find the best matches.
4.4 Cross-model topic matching
To properly assess stability, we need to:
Compare every topic from Model A against every topic from Model B
Calculate Jaccard similarity for all pairs
Find which topics match across models (Jaccard > 0.7)
Count how many times each semantic topic appears across all 5 models
Topics appearing in 3+ models are considered “stable”
Here’s the implementation:
def jaccard_similarity(topic_words_a, topic_words_b):
    """
    Calculate Jaccard similarity between two sets of words.

    Parameters:
        topic_words_a: Set of words from topic A
        topic_words_b: Set of words from topic B

    Returns:
        Jaccard coefficient (intersection / union)
    """
    intersection = len(topic_words_a & topic_words_b)
    union = len(topic_words_a | topic_words_b)
    return intersection / union if union > 0 else 0

def get_top_words(model, topic_id, topn=100):
    """
    Extract top N words for a topic as a set.

    Parameters:
        model: LDA model
        topic_id: Topic index
        topn: Number of top words

    Returns:
        Set of top words
    """
    return set([word for word, prob in model.show_topic(topic_id, topn=topn)])

def compare_model_pair(model_a, model_b, num_topics=4, topn=100):
    """
    Compare all topics between two models.

    Parameters:
        model_a: First LDA model
        model_b: Second LDA model
        num_topics: Number of topics (k)
        topn: Number of top words to compare

    Returns:
        DataFrame with Jaccard similarities for all topic pairs
    """
    results = []
    for topic_a in range(num_topics):
        words_a = get_top_words(model_a, topic_a, topn)
        for topic_b in range(num_topics):
            words_b = get_top_words(model_b, topic_b, topn)
            jaccard = jaccard_similarity(words_a, words_b)
            results.append({
                'topic_a': topic_a,
                'topic_b': topic_b,
                'jaccard': jaccard
            })
    return pd.DataFrame(results)

def analyze_stability(models, threshold=0.7, topn=100):
    """
    Analyze topic stability across multiple models.

    Parameters:
        models: List of LDA models
        threshold: Jaccard threshold for considering topics "matched"
        topn: Number of top words to compare

    Returns:
        all_results: DataFrame with Jaccard similarities for all compared topic pairs
        strong_matches: Rows of all_results with Jaccard >= threshold
    """
    num_topics = models[0].num_topics
    all_comparisons = []

    # Compare each pair of consecutive models
    for i in range(len(models) - 1):
        model_a_id = i
        model_b_id = i + 1
        comparison = compare_model_pair(models[i], models[i+1], num_topics, topn)
        comparison['model_a_id'] = model_a_id
        comparison['model_b_id'] = model_b_id
        all_comparisons.append(comparison)

    # Combine all comparisons
    all_results = pd.concat(all_comparisons, ignore_index=True)

    # Find strong matches (Jaccard >= threshold)
    strong_matches = all_results[all_results['jaccard'] >= threshold]

    return all_results, strong_matches

print("Stability analysis functions defined.")
Stability analysis functions defined.
Note: Smarter use of Jaccard similarity
Jaccard similarity is very simple and, importantly for us, its complement (Jaccard distance, 1 minus the similarity) is a proper metric that satisfies the triangle inequality. This allows us to use a common link to skip calculations. If we know how Topic A relates to Topic B, and we already know how Topic A relates to Topic C, we can use Topic A as a “bridge” to bound the similarity between B and C without having to compare them directly. Strictly speaking, similarity itself is not transitive, so the bridge gives a bound rather than an exact value.
To our knowledge, the Python ecosystem still lacks a package for this. To keep the Python code simple, we use Jaccard similarity the ‘dumb’ way.
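A caveat on the bridging idea, sketched under the assumption that topics are compared as plain word sets: because Jaccard distance (1 minus the similarity) satisfies the triangle inequality, two known similarities to a shared Topic A give lower and upper bounds on the similarity between B and C, not its exact value. The helper name `bridge_bounds` and the toy word sets below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def bridge_bounds(j_ab, j_ac):
    """Bounds on J(B, C) implied by J(A, B) and J(A, C).

    Derived from the triangle inequality on Jaccard distance d = 1 - J:
    |d_ab - d_ac| <= d_bc <= d_ab + d_ac.
    """
    lower = max(0.0, j_ab + j_ac - 1.0)
    upper = 1.0 - abs(j_ab - j_ac)
    return lower, upper

# Hypothetical top-word sets for three topics
topic_a = {"player", "console", "progress", "second", "multiplayer"}
topic_b = {"player", "console", "progress", "second", "start"}
topic_c = {"player", "console", "progress", "multiplayer", "want"}

j_ab = jaccard(topic_a, topic_b)
j_ac = jaccard(topic_a, topic_c)
j_bc = jaccard(topic_b, topic_c)   # computed directly, only to check the bounds
lower, upper = bridge_bounds(j_ab, j_ac)

print(f"J(A,B)={j_ab:.3f}, J(A,C)={j_ac:.3f}")
print(f"Bounds on J(B,C): [{lower:.3f}, {upper:.3f}], actual: {j_bc:.3f}")
```

The bounds are only informative when the known similarities are high: two similarities of 0.9 guarantee at least 0.8 between B and C, whereas two similarities of 0.5 guarantee nothing.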
Now let’s run the stability analysis:
print("Analyzing topic stability across 5 models...")
print("Comparing each model with the next one (4 comparisons total)\n")

all_results, strong_matches = analyze_stability(models, threshold=0.7, topn=100)

print("="*70)
print("STRONG TOPIC MATCHES (Jaccard >= 0.7)")
print("="*70)
print(f"\nFound {len(strong_matches)} strong matches across 4 model pairs:\n")

if len(strong_matches) > 0:
    # Group by model comparison
    for (model_a, model_b), group in strong_matches.groupby(['model_a_id', 'model_b_id']):
        print(f"\nModel {model_a} vs Model {model_b}:")
        for _, row in group.iterrows():
            # iterrows() upcasts the integer topic IDs to float, so cast back for display
            print(f"  Topic {int(row['topic_a'])} → Topic {int(row['topic_b'])}: {row['jaccard']:.3f}")
else:
    print("No strong matches found. This suggests low topic stability.")
    print("Consider: more preprocessing, different k, or acknowledging instability.")
Analyzing topic stability across 5 models...
Comparing each model with the next one (4 comparisons total)
======================================================================
STRONG TOPIC MATCHES (Jaccard >= 0.7)
======================================================================
Found 2 strong matches across 4 model pairs:
Model 2 vs Model 3:
Topic 2 → Topic 3: 0.869
Model 3 vs Model 4:
Topic 3 → Topic 0: 0.770
4.5 Interpreting stability results
The analysis shows which topics match across different model runs. Here’s how to interpret the findings:
What constitutes a stable topic?
A topic is considered stable if it appears consistently across multiple models (typically 3 out of 5) with high Jaccard similarity (>0.7).
For example, if we find:
Model 0 vs Model 1: Topic 2 → Topic 0 (Jaccard = 0.75)
Model 1 vs Model 2: Topic 0 → Topic 3 (Jaccard = 0.78)
Model 2 vs Model 3: Topic 3 → Topic 1 (Jaccard = 0.72)
Model 3 vs Model 4: Topic 1 → Topic 2 (Jaccard = 0.76)
This suggests one stable semantic topic that appears in all 5 models, though it gets different numeric IDs in each run.
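The "appears in at least 3 of 5 models" criterion can be sketched as a simple count. The example below is a minimal illustration on hypothetical word sets, not the notebook's pipeline: the names `count_supporting_models` and `is_stable` are made up for this sketch, and the reference model is assumed to count as one appearance.

```python
def jaccard(a, b):
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def count_supporting_models(reference_topic, other_models, threshold=0.7):
    """Count models containing at least one topic that matches the reference."""
    return sum(
        any(jaccard(reference_topic, topic) >= threshold for topic in model_topics)
        for model_topics in other_models
    )

def is_stable(reference_topic, other_models, threshold=0.7, min_models=3):
    """A topic is stable if it appears in at least `min_models` models.

    The reference model itself counts as one appearance (an assumption).
    """
    appearances = 1 + count_supporting_models(reference_topic, other_models, threshold)
    return appearances >= min_models

# Hypothetical top-word sets: one reference topic and four other models
multiplayer = {"player", "console", "progress", "second", "multiplayer", "person"}
design = {"design", "villager", "cute", "decoration", "relax", "music"}
saving = {"save", "cloud", "account", "share", "buy", "multiple"}

other_models = [
    [{"player", "console", "progress", "second", "multiplayer", "start"}, design],
    [saving, multiplayer],
    [{"player", "console", "want", "thing", "like", "review"}, design],
    [{"time", "series", "feel", "good", "like", "thing"}, saving],
]

support = count_supporting_models(multiplayer, other_models)
print(f"Matched in {support} other models, stable: {is_stable(multiplayer, other_models)}")
```

Raising `threshold` or `min_models` makes the rule stricter; with a threshold of 0.9, the same toy topic would no longer count as stable.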
Note: Defining stability
There is no agreed-upon definition of a stable topic. In my own work, I follow this principle: if a given pair of topics has a Jaccard similarity of 0.9 or higher, it is the same topic. If this topic appears in at least 3 models out of 5, it is a stable topic. If a topic is not stable, I discard it from the analysis.
Practice shows that this rule is strict and does not work with every dataset. It shows the best results on clean and easily distinguishable data like news items. It’s too strict for text like short social media posts or product reviews.
Although the approach is general, it was designed for LDA-type topic models. For example, it becomes useless for STM with spectral initialization, which is deterministic and therefore returns the same topics on every run.
Jaccard similarity thresholds:
| Jaccard Score | Interpretation |
|---|---|
| ≥ 0.70 | Strong match - topics are essentially the same |
| 0.50 - 0.69 | Moderate match - topics share core themes |
| 0.30 - 0.49 | Weak match - some overlap but different topics |
| < 0.30 | No meaningful match |
What if few or no matches are found?
If the analysis reveals few strong matches (Jaccard ≥ 0.7), this indicates low topic stability. Possible reasons:
k is too high: Too many topics for the corpus, causing artificial splits
Corpus is too homogeneous: Reviews discuss the same issues (review bombing)
Small corpus: Not enough documents for stable patterns
Important: What to do with unstable topics
If stability analysis shows poor results, you have three options:
Refine the model: Try different k values, improve preprocessing, filter corpus
Acknowledge instability: Report findings honestly and interpret cautiously
Consider other topic models: LDA is just one widely used implementation of the idea; for example, one promising new model is BERTopic https://maartengr.github.io/BERTopic/
Don’t ignore instability. If topics change substantially between runs, your findings are not robust.
Let’s visualize the Jaccard similarities for one model pair (Model 0 vs Model 1):
# Create heatmap showing Jaccard similarities for one model pair
comparison_01 = all_results[
    (all_results['model_a_id'] == 0) &
    (all_results['model_b_id'] == 1)
]

# Pivot to matrix form
similarity_matrix = comparison_01.pivot(
    index='topic_a',
    columns='topic_b',
    values='jaccard'
)

# Create heatmap
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(
    similarity_matrix,
    annot=True,
    fmt='.3f',
    cmap='YlOrRd',
    vmin=0,
    vmax=1,
    cbar_kws={'label': 'Jaccard Similarity'},
    ax=ax
)
ax.set_xlabel('Model 1 Topics')
ax.set_ylabel('Model 0 Topics')
ax.set_title('Topic Similarity: Model 0 vs Model 1\n(Higher values = better match)')
plt.tight_layout()
plt.show()
How to read this heatmap:
Each cell shows Jaccard similarity between topic pairs
Darker red = stronger match
Look for high values (>0.7) to identify matching topics
The highest value in each ROW shows the best match for that topic
4.6 Summary: Which topics are stable?
To determine which semantic topics are truly stable, we need to trace topic chains across all 5 models:
# For each model pair, find the best match for each topic
print("="*70)
print("TOPIC MATCHING CHAINS ACROSS 5 MODELS")
print("="*70)
print("\nFor each topic in each model, showing its best match in the next model:\n")

for i in range(len(models) - 1):
    comparison = all_results[
        (all_results['model_a_id'] == i) &
        (all_results['model_b_id'] == i + 1)
    ]
    print(f"Model {i} → Model {i+1}:")

    # For each topic in model A, find best match in model B
    for topic_a in range(4):
        matches = comparison[comparison['topic_a'] == topic_a].sort_values('jaccard', ascending=False)
        best_match = matches.iloc[0]

        if best_match['jaccard'] >= 0.7:
            marker = "✓ STRONG"
        elif best_match['jaccard'] >= 0.5:
            marker = "~ moderate"
        else:
            marker = "✗ weak"

        print(f"  Topic {topic_a} → Topic {best_match['topic_b']:.0f} "
              f"(Jaccard={best_match['jaccard']:.3f}) {marker}")
    print()
======================================================================
TOPIC MATCHING CHAINS ACROSS 5 MODELS
======================================================================
For each topic in each model, showing its best match in the next model:
Model 0 → Model 1:
Topic 0 → Topic 0 (Jaccard=0.504) ~ moderate
Topic 1 → Topic 0 (Jaccard=0.550) ~ moderate
Topic 2 → Topic 1 (Jaccard=0.681) ~ moderate
Topic 3 → Topic 3 (Jaccard=0.562) ~ moderate
Model 1 → Model 2:
Topic 0 → Topic 2 (Jaccard=0.562) ~ moderate
Topic 1 → Topic 0 (Jaccard=0.667) ~ moderate
Topic 2 → Topic 2 (Jaccard=0.538) ~ moderate
Topic 3 → Topic 2 (Jaccard=0.681) ~ moderate
Model 2 → Model 3:
Topic 0 → Topic 1 (Jaccard=0.587) ~ moderate
Topic 1 → Topic 1 (Jaccard=0.562) ~ moderate
Topic 2 → Topic 3 (Jaccard=0.869) ✓ STRONG
Topic 3 → Topic 1 (Jaccard=0.493) ✗ weak
Model 3 → Model 4:
Topic 0 → Topic 2 (Jaccard=0.471) ✗ weak
Topic 1 → Topic 3 (Jaccard=0.626) ~ moderate
Topic 2 → Topic 2 (Jaccard=0.587) ~ moderate
Topic 3 → Topic 0 (Jaccard=0.770) ✓ STRONG
Tip: Identifying stable topics manually
To trace a stable topic across all 5 models:
Start with Topic X in Model 0
Find its best match in Model 1 (e.g., Topic Y)
Find Topic Y’s best match in Model 2 (e.g., Topic Z)
Continue through all models
If a chain exists with Jaccard > 0.7 at each step, that’s a stable topic
For example, if you see:
Model 0 Topic 2 → Model 1 Topic 0 (0.75)
Model 1 Topic 0 → Model 2 Topic 3 (0.78)
Model 2 Topic 3 → Model 3 Topic 1 (0.72)
Model 3 Topic 1 → Model 4 Topic 2 (0.76)
This represents ONE stable semantic topic appearing in all 5 models (despite having different numeric IDs in each).
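The manual tracing steps can also be automated. The following is a minimal sketch on hypothetical word sets: it follows the best match from one model to the next and stops as soon as a link falls below the threshold. The function name `trace_chain` is made up for this illustration, and it uses `>=` for the threshold, as the notebook's code does.

```python
def jaccard(a, b):
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def trace_chain(models, start_topic, threshold=0.7):
    """Follow a topic from models[0] through consecutive models.

    Returns a list of (model_index, topic_index, jaccard) steps. The chain
    is complete only if its length equals len(models).
    """
    chain = [(0, start_topic, 1.0)]
    current = start_topic
    for m in range(len(models) - 1):
        sims = [jaccard(models[m][current], topic) for topic in models[m + 1]]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < threshold:
            break  # no sufficiently similar topic in the next model
        chain.append((m + 1, best, sims[best]))
        current = best
    return chain

# Hypothetical models, each a list of top-word sets
multiplayer = {"player", "console", "progress", "second", "multiplayer", "person"}
multiplayer_v2 = {"player", "console", "progress", "second", "multiplayer", "start"}
design = {"design", "villager", "cute", "decoration", "relax", "music"}
design_v2 = {"design", "villager", "cute", "decoration", "relax", "season"}
other = {"time", "series", "feel", "good", "like", "thing"}

toy_models = [
    [multiplayer, design],
    [design_v2, multiplayer_v2],
    [multiplayer_v2, other],
]

chain_multiplayer = trace_chain(toy_models, start_topic=0)
chain_design = trace_chain(toy_models, start_topic=1)

for name, chain in [("multiplayer", chain_multiplayer), ("design", chain_design)]:
    complete = len(chain) == len(toy_models)
    print(f"{name}: topic IDs {[t for _, t, _ in chain]}, complete chain: {complete}")
```

In this toy data the multiplayer topic survives in all three models under different IDs, while the design topic's chain breaks at the third model. Note that a greedy chain depends on the order of the models, so it is a convenience for exploration and not a substitute for comparing all model pairs.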
Note: Ensuring reproducibility
To make your results reproducible, always set random_state to a fixed value (e.g., random_state=42) and keep workers=1. This ensures the same random decisions occur each time, producing identical results.
For research papers, report:
The random seed used
Whether you tested stability (and results)
Any unstable topics that were excluded
5 Post-hoc metadata analysis
In Part 1, we built a topic model without considering metadata (review grades). Now we analyze how topics relate to sentiment - a post-hoc analysis.
This differs from Structural Topic Models (STM), which incorporate metadata during training. In gensim LDA, we extract topics first, then analyze patterns.
5.1 Hypotheses about sentiment and topics
Before examining data, let’s form hypotheses. Based on the Animal Crossing review bombing pattern:
Negative reviews (0-3) might focus on:
Multiplayer limitations (“one island per console”)
Missing features (“cloud save”, “limited content”)
Technical issues
Positive reviews (8-10) might focus on:
Aesthetics (“beautiful”, “cute”, “design”)
Gameplay (“relaxing”, “fun”, “enjoyable”)
Characters and villagers
Escapism and stress relief
Let’s test these hypotheses.
5.2 Extracting topic proportions
First, we need one fixed model to work with. Let’s use k=4 (from Part 1) with our standard seed:
# Train final model (k=4, seed=42)
print("Training final model (k=4)...")
lda_final = gensim.models.LdaMulticore(
    corpus=corpus,
    id2word=id2word,
    num_topics=4,
    random_state=42,
    passes=20,  # More passes for stability
    workers=1
)

# Show topics
print("\nFinal model topics:")
print("="*70)
for idx in range(4):
    words = [word for word, prob in lda_final.show_topic(idx, topn=10)]
    print(f"Topic {idx}: {', '.join(words)}")
Training final model (k=4)...
Final model topics:
======================================================================
Topic 0: people, review, experience, player, console, expand, animal, crossing, like, thing
Topic 1: player, second, progress, console, expand, want, person, start, get, multiplayer
Topic 2: time, animal, crossing, like, expand, thing, good, feel, want, series
Topic 3: console, save, share, expand, want, multiple, animal, account, experience, buy
Extract topic proportions for all documents:
def get_document_topics(corpus, lda_model, num_topics):
    """Extract topic proportions for all documents."""
    doc_topics_list = []
    for doc_bow in corpus:
        # minimum_probability=0.0 returns a probability for every topic
        topic_dist = lda_model.get_document_topics(doc_bow, minimum_probability=0.0)
        probs = [prob for topic_id, prob in topic_dist]
        doc_topics_list.append(probs)

    topic_cols = [f'Topic_{i}' for i in range(num_topics)]
    df_topics = pd.DataFrame(doc_topics_list, columns=topic_cols)
    return df_topics

# Extract and merge
df_topics = get_document_topics(corpus, lda_final, num_topics=4)
df_analysis = pd.concat([df_filtered.reset_index(drop=True), df_topics], axis=1)

print(f"Analysis dataframe: {len(df_analysis):,} reviews")
print(f"Columns: {list(df_analysis.columns)}")
Compare topic usage in negative vs. positive reviews:
# Filter to negative and positive (exclude sparse neutral)
df_sentiment = df_analysis[df_analysis['sentiment'].isin(['negative', 'positive'])].copy()

# Remove unused category level
df_sentiment['sentiment'] = df_sentiment['sentiment'].cat.remove_unused_categories()

print(f"Negative reviews: {(df_sentiment['sentiment'] == 'negative').sum():,}")
print(f"Positive reviews: {(df_sentiment['sentiment'] == 'positive').sum():,}")

# Reshape for plotting
df_melt = df_sentiment.melt(
    id_vars=['sentiment', 'grade'],
    value_vars=['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3'],
    var_name='Topic',
    value_name='Proportion'
)

# Create horizontal boxplot (sorted by median difference)
fig, ax = plt.subplots(figsize=(10, 6))

# Calculate median difference for sorting
# (observed=True avoids the pandas FutureWarning for categorical groupers)
topic_medians = df_sentiment.groupby('sentiment', observed=True)[
    ['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']
].median()
median_diff = topic_medians.loc['positive'] - topic_medians.loc['negative']

# Sort by signed difference (Negative -> Positive) rather than absolute
topic_order = median_diff.sort_values(ascending=True).index.tolist()

# Create plot
sns.boxplot(
    data=df_melt,
    y='Topic',
    x='Proportion',
    hue='sentiment',
    order=topic_order,
    ax=ax,
    palette={'negative': 'salmon', 'positive': 'lightgreen'},
    showfliers=False  # we hide outliers in this notebook for simplicity,
                      # but you should always check them in your own work
)
ax.set_ylabel('Topic')
ax.set_xlabel('Topic Proportion')
ax.set_title('Topic Prevalence by Review Sentiment')
ax.legend(title='Sentiment', loc='lower right')
plt.tight_layout()
plt.show()
Negative reviews: 1,003
Positive reviews: 531
5.4 Interpreting the sentiment boxplot
Each box shows the distribution of topic proportions across reviews:
Box: Contains middle 50% of values (25th to 75th percentile)
Line inside box: Median value
Whiskers: Extend to the most extreme values that are not outliers
Dots: Outlier reviews (hidden in the plot above, which sets showfliers=False)
Interpreting differences:
Separated boxes: Strong sentiment difference (e.g., negative median 0.35, positive median 0.15)
Overlapping boxes: Topics used similarly across sentiments
Wide boxes: High variation within a sentiment group (some reviews emphasize the topic, others don’t)
For each topic showing separation, examine the top words to understand what distinguishes negative from positive reviews.
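The components listed above can be computed directly, which helps when reading a box against its raw numbers. The following is a minimal sketch using only the standard library; the function name `box_stats` and the nine proportions are hypothetical (illustrative values, not taken from the corpus), and it assumes the common 1.5 × IQR whisker rule.

```python
import statistics

def box_stats(values, whis=1.5):
    """Return the quantities a boxplot draws for one group of values."""
    data = sorted(values)
    # method='inclusive' interpolates linearly between the observed points
    q1, median, q3 = statistics.quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme points that lie inside the fences
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        'q1': q1,
        'median': median,
        'q3': q3,
        'whisker_low': min(inside),
        'whisker_high': max(inside),
        'outliers': [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical topic proportions for nine reviews (illustrative only)
example = [0.01, 0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.95]
summary = box_stats(example)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this toy example the review with proportion 0.95 falls outside the upper fence, so it would be drawn as a dot (or hidden when outliers are suppressed) while the upper whisker stops at 0.35.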
5.5 Statistical comparison
Let’s test whether topic prevalence differs significantly by sentiment:
print("Statistical tests (Mann-Whitney U):")
print("=" * 70)
print("Null hypothesis: Topic prevalence is the same in negative and positive reviews")
print()

for topic in ['Topic_0', 'Topic_1', 'Topic_2', 'Topic_3']:
    negative_vals = df_sentiment[df_sentiment['sentiment'] == 'negative'][topic]
    positive_vals = df_sentiment[df_sentiment['sentiment'] == 'positive'][topic]

    # Mann-Whitney U test (non-parametric, doesn't assume normal distribution)
    stat, pvalue = mannwhitneyu(negative_vals, positive_vals, alternative='two-sided')

    neg_median = negative_vals.median()
    pos_median = positive_vals.median()

    print(f"{topic}:")
    print(f"  Negative median: {neg_median:.3f}")
    print(f"  Positive median: {pos_median:.3f}")
    print(f"  p-value: {pvalue:.4f} {'***' if pvalue < 0.001 else '**' if pvalue < 0.01 else '*' if pvalue < 0.05 else 'ns'}")
    print()

print("Note on p-values: With large datasets (N > 1000), even small differences can be")
print("statistically significant. Look at the difference in medians (Effect Size) to")
print("judge if the difference is meaningful in practice.")
Statistical tests (Mann-Whitney U):
======================================================================
Null hypothesis: Topic prevalence is the same in negative and positive reviews
Topic_0:
Negative median: 0.015
Positive median: 0.098
p-value: 0.0000 ***
Topic_1:
Negative median: 0.273
Positive median: 0.009
p-value: 0.0000 ***
Topic_2:
Negative median: 0.013
Positive median: 0.638
p-value: 0.0000 ***
Topic_3:
Negative median: 0.185
Positive median: 0.012
p-value: 0.0000 ***
Note on p-values: With large datasets (N > 1000), even small differences can be
statistically significant. Look at the difference in medians (Effect Size) to
judge if the difference is meaningful in practice.
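A standardized effect size can be reported next to the medians. One option, which is not part of the original analysis, is Cliff's delta: the probability that a value from one group exceeds a value from the other, minus the reverse probability. It is rank-based like the Mann-Whitney U test. The sketch below is a plain-Python illustration on hypothetical values; the function name and the numbers are assumptions for demonstration only.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.

    Ranges from -1 (every x below every y) to +1 (every x above every y);
    0 means neither group tends to have larger values.
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Hypothetical topic proportions (illustrative only, not corpus values)
negative_example = [0.30, 0.25, 0.40, 0.10, 0.35]
positive_example = [0.02, 0.01, 0.15, 0.05]

delta = cliffs_delta(negative_example, positive_example)
print(f"Cliff's delta: {delta:.2f}")
```

With the lab's data you could pass `negative_vals.tolist()` and `positive_vals.tolist()`. The pairwise loop is quadratic, which is still manageable for groups of roughly 1,000 and 500 reviews.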
Note: Correlation vs causation
These patterns show association between sentiment and topic usage, not causation. We cannot conclude “being negative causes using Topic X” from this analysis.
Both sentiment and topic usage reflect underlying review content. The relationship is:
Review content → Sentiment + Topic usage
Not: Sentiment → Topic usage
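In our case, every p-value prints as 0.0000, which means p < 0.0001 rather than exactly zero. The medians show how large the differences are: Topic 0 rises from 0.015 (negative) to 0.098 (positive), and Topic 2 rises from 0.013 to 0.638. These are substantial shifts in prevalence, not just statistically detectable ones.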
5.6 Temporal analysis
Do topics change over time? Let’s examine whether review bombing (negative complaints) decreases as positive reviews accumulate:
Hierarchical structure (which topics are subtopics of others)
For network-like relationships, see Lab 05.2 (Semantic Networks).
7 Validation through human coding
Computational metrics (coherence, stability) assess internal consistency, but cannot tell us if topics are meaningful. For that, we need human judgment.
7.1 The gold standard: Inter-coder reliability
When we claim “Topic 1 represents gameplay discourse,” we make an interpretive claim. To validate, we ask: would other researchers agree?
Inter-coder reliability measures agreement between independent human coders. If multiple people interpret a topic the same way, we have evidence the topic is coherent and meaningful.
7.2 Designing a coding study
For a topic we labeled “Multiplayer Complaints” (based on top words), we could ask two independent coders to:
Read the top 20 reviews for this topic
For each review, answer: “Does this review primarily discuss multiplayer or island-sharing issues?” (Yes/No)
Compare their answers
Here is hypothetical coding data:
# Hypothetical inter-coder reliability data
coding_data = pd.DataFrame({
    'document_id': range(1, 21),
    'Coder_A': [1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    'Coder_B': [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1]
})

print("Inter-coder reliability example:")
print("(1 = review discusses multiplayer issues, 0 = does not)")
print(coding_data.head(10))

# Calculate simple agreement
agreements = (coding_data['Coder_A'] == coding_data['Coder_B']).sum()
total = len(coding_data)
simple_agreement = agreements / total

print(f"\nSimple agreement: {agreements}/{total} = {simple_agreement:.2f}")
print("\nNote: Simple agreement can be misleading because some agreement")
print("occurs by chance. We need a measure that corrects for chance agreement.")
Inter-coder reliability example:
(1 = review discusses multiplayer issues, 0 = does not)
document_id Coder_A Coder_B
0 1 1 1
1 2 1 1
2 3 1 1
3 4 0 0
4 5 1 0
5 6 0 0
6 7 1 1
7 8 1 1
8 9 0 1
9 10 1 1
Simple agreement: 15/20 = 0.75
Note: Simple agreement can be misleading because some agreement
occurs by chance. We need a measure that corrects for chance agreement.
7.3 Calculating Krippendorff’s alpha
Simple percentage agreement can be misleading because some agreement occurs by chance. Krippendorff’s alpha corrects for chance agreement.
# Install if needed: uv pip install simpledorff
import simpledorff

# Prepare data in long format
coding_long = coding_data.melt(
    id_vars='document_id',
    var_name='coder',
    value_name='annotation'
)

# Calculate alpha
alpha = simpledorff.calculate_krippendorffs_alpha_for_df(
    coding_long,
    experiment_col='document_id',
    annotator_col='coder',
    class_col='annotation'
)

print("=" * 70)
print("Krippendorff's Alpha")
print("=" * 70)
print(f"Alpha = {alpha:.3f}")
print()
Alternatively, for simpler cases (two coders, binary coding), you can use: sklearn.metrics.cohen_kappa_score
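To make the chance correction concrete, both statistics can be computed by hand for the two-coder, binary case. The sketch below re-enters the hypothetical coding data from above and uses only the standard library. The function names are illustrative, and the alpha formula is the nominal-data version that assumes every document was coded by both coders (no missing annotations).

```python
from collections import Counter

# Same hypothetical annotations as in the coding_data example above
coder_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
coder_b = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1]

def cohen_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), using each coder's own label rates."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(count_a[c] * count_b[c] for c in labels) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def krippendorff_alpha_nominal(a, b):
    """Krippendorff's alpha for two coders, nominal labels, no missing data."""
    n_values = 2 * len(a)                 # total number of annotations
    pooled = Counter(a) + Counter(b)      # label counts pooled over both coders
    # Observed disagreement: share of documents on which the coders differ
    d_o = sum(x != y for x, y in zip(a, b)) / len(a)
    # Expected disagreement: chance that two annotations drawn without
    # replacement from the pooled annotations carry different labels
    d_e = sum(pooled[c] * (n_values - pooled[c]) for c in pooled) / (
        n_values * (n_values - 1)
    )
    return 1 - d_o / d_e

kappa = cohen_kappa(coder_a, coder_b)
alpha_by_hand = krippendorff_alpha_nominal(coder_a, coder_b)
print(f"Cohen's kappa:        {kappa:.3f}")          # 0.375
print(f"Krippendorff's alpha: {alpha_by_hand:.3f}")  # 0.389
```

Both values sit far below the raw agreement of 0.75 because the coders answer "yes" for most reviews, so a large share of their agreement is expected by chance alone.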
7.4 Interpreting alpha values
| Alpha | Interpretation | Action |
|---|---|---|
| > 0.80 | Excellent agreement | Topic interpretation reliable |
| 0.67 - 0.80 | Good agreement | Acceptable for most purposes |
| 0.60 - 0.67 | Marginal agreement | Use cautiously; consider refining |
| < 0.60 | Poor agreement | Topic interpretation unreliable; discard or redefine |
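In this example, alpha = 0.389, well below the 0.60 threshold, even though simple agreement was 0.75. Once chance agreement is accounted for, the coders agree poorly, so the interpretation of this topic should be treated as unreliable and redefined before it is used.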
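If you validate several topics, a small helper keeps the labelling consistent with the bands in the table above. This is a sketch only: the function name is hypothetical, and the handling of values that fall exactly on a boundary (0.80, 0.67, 0.60) is an assumption, since the table does not specify it.

```python
def interpret_alpha(alpha):
    """Map a Krippendorff's alpha value to the bands used in this lab.

    Assumption: exact boundary values are assigned to the lower band for
    0.80 and to the upper band for 0.67 and 0.60.
    """
    if alpha > 0.80:
        return "Excellent agreement"
    if alpha >= 0.67:
        return "Good agreement"
    if alpha >= 0.60:
        return "Marginal agreement"
    return "Poor agreement"

for value in (0.85, 0.70, 0.63, 0.389):
    print(f"alpha = {value:.3f}: {interpret_alpha(value)}")
```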
Warning: Limitations of inter-coder reliability
High agreement does not guarantee a topic is meaningful for your research question. It only means coders agree on the definition.
Two coders could reliably identify a nonsensical category. Inter-coder reliability should complement, not replace:
Domain expertise
Theoretical grounding
Close reading of documents
7.5 Sample size for validation
How many documents should be coded?
General guidelines:
Small topics (< 50 documents): Code 20-30 documents
Medium topics (50-500 documents): Code 50-100 documents
Large topics (> 500 documents): Code 100-200 documents
Alternatively, code a proportion: 10-20% of the documents assigned to the topic, with a minimum of 20.
Tradeoff: More coding = higher precision, but more human time/cost.
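The proportional rule can be written as a short helper. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and capping the result at the number of available documents is an added assumption for topics with fewer than 20 documents.

```python
import math

def coding_sample_size(n_topic_docs, percent=10, minimum=20):
    """Number of documents to code for one topic under the proportional rule.

    Takes `percent` of the topic's documents (rounded up), enforces
    `minimum`, and never asks for more documents than the topic contains.
    """
    target = max(minimum, math.ceil(n_topic_docs * percent / 100))
    return min(target, n_topic_docs)

# Compare the 10% and 20% ends of the guideline for topics of different sizes
for n_docs in (15, 40, 300, 1200):
    low = coding_sample_size(n_docs, percent=10)
    high = coding_sample_size(n_docs, percent=20)
    print(f"{n_docs:>5} documents in topic: code {low} to {high}")
```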
Tip: Blind coding
To ensure valid results, use blind coding: do not tell your coders which topic the computer assigned to each document. If they know the computer thinks a review is about “Multiplayer”, they might be biased to agree. Present the documents in random order without topic labels.
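One way to prepare such a blind coding sheet is sketched below. The function name, record layout, and example reviews are all hypothetical. The idea is to shuffle the documents with a fixed seed, hand coders only an anonymous item number and the text, and keep the mapping back to document IDs and topics in a separate key.

```python
import random

def make_blind_sheet(documents, seed=42):
    """Split (doc_id, text, topic) records into a coder sheet and an answer key.

    The coder sheet holds only an anonymous item number and the text, in
    shuffled order. The key maps item numbers back to doc_id and topic and
    stays with the researcher.
    """
    rng = random.Random(seed)   # fixed seed so the ordering is reproducible
    shuffled = list(documents)
    rng.shuffle(shuffled)

    sheet, key = [], {}
    for item_no, (doc_id, text, topic) in enumerate(shuffled, start=1):
        sheet.append({'item': item_no, 'text': text})
        key[item_no] = {'doc_id': doc_id, 'topic': topic}
    return sheet, key

# Hypothetical records (illustrative only, not real reviews)
docs = [
    (101, "Only one island per console ruins it for families.", "Topic_1"),
    (102, "Relaxing and charming, I love decorating my house.", "Topic_2"),
    (103, "My kids cannot have their own save file.", "Topic_1"),
    (104, "Beautiful music and a calm daily routine.", "Topic_2"),
]

sheet, key = make_blind_sheet(docs)
for row in sheet:
    print(row)
```

Coders receive only `sheet`; after coding, `key` is used to join their answers back to the model's topic assignments.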
8 Reporting topic models in research
When publishing research using topic modeling, transparency about methods and evaluation is essential.
Prevalence (proportion of corpus each topic represents)
8.2 Writing about topics
Use tentative language that acknowledges topics are analytical constructs:
Good examples:
“Topic 1, which we interpret as gameplay discourse, comprises words like…”
“Documents assigned primarily to Topic 1 tend to discuss…”
“We identify a topic characterized by words such as ‘island’, ‘villager’, ‘design’”
Avoid:
“Topic 1 is the gameplay topic” (reifies the construct)
“LDA discovered that 30% of reviews are about gameplay” (implies objective truth)
“The complaint topic shows…” (without acknowledging interpretation)
8.3 Common reporting pitfalls
Not reporting k selection process: Readers need to know how you chose number of topics
Claiming topics are “discovered truths”: Topics are model outputs, not inherent document properties
Omitting failed topics: If some topics were incoherent, report this
Cherry-picking documents: Show representative documents, not just best examples
Ignoring stability: Report any stability issues encountered
Tip: Making research reproducible
Share when possible:
Preprocessed corpus (if no privacy/copyright restrictions)
Trained model file (lda_model.save('model.pkl'))
Full preprocessing code
Random seed used (random_state=42)
This allows others to verify and build on your work.
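Alongside the model file, it can help to store the settings needed to rerun the analysis. The sketch below writes a small JSON record using values that appear in this lab (random_state=42, the filter_extremes thresholds, the custom stopwords, and the 50-word minimum). The field names and file name are assumptions for illustration, and num_topics=4 is inferred from the four topic columns used in the sentiment analysis rather than stated explicitly.

```python
import json
import tempfile
from pathlib import Path

run_record = {
    'random_state': 42,
    'num_topics': 4,  # assumption: inferred from Topic_0 .. Topic_3
    'min_review_words': 50,
    'custom_stopwords': sorted(['game', 'play', 'nintendo', 'switch', 'island']),
    'filter_extremes': {'no_below': 5, 'no_above': 0.8},
    'spacy_model': 'en_core_web_sm',
}

# Written to a temporary directory here; use your project folder in practice
record_path = Path(tempfile.gettempdir()) / 'run_record.json'
with open(record_path, 'w', encoding='utf-8') as f:
    json.dump(run_record, f, indent=2)

# Reading it back confirms the record round-trips unchanged
with open(record_path, encoding='utf-8') as f:
    restored = json.load(f)

print(json.dumps(restored, indent=2))
```

A collaborator can combine this record with the shared preprocessing code to rebuild the corpus, and reload a saved gensim model with `gensim.models.LdaModel.load`.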
9 Exercises
9.1 Exercise 1: Stability assessment
Using the 5 models already trained in this lab, conduct a deeper stability analysis.
Tasks:
Examine the strong_matches dataframe. Identify chains of matching topics across models (e.g., Topic 2 in Model 0 → Topic 0 in Model 1 → Topic 3 in Model 2). How many such chains can you find?
For the best-matching chain (highest average Jaccard), extract the top 20 words from each model’s version of that topic. How many words appear in all 5 versions?
Compare the stability heatmap (Model 0 vs Model 1) to a different pair (e.g., Model 2 vs Model 3). Do the same topics match, or do patterns differ?
Reflect: If a topic appears in only 2 out of 5 models with Jaccard > 0.7, would you trust research conclusions based on that topic? Why or why not?
What does low stability tell you about this corpus? (Hint: recall the “review bombing” note about this dataset)
9.2 Exercise 2: Sentiment hypothesis testing
Hypothesis: “Negative reviews focus on multiplayer issues; positive reviews focus on aesthetics and relaxation.”
Tasks:
Examine the boxplot from the sentiment analysis section
Identify which topic(s) represent “multiplayer issues” (look at top words)
Identify which topic(s) represent “aesthetics/relaxation”
Do the data support the hypothesis? Provide evidence
What alternative explanations could account for the patterns you observe?
9.3 Exercise 3: Choosing k
Using the coherence analysis results:
What value of k has the highest coherence score?
Does the plot show a clear “elbow” point? Where?
Compare k=3 vs k=6: examine the topics for each (you’ll need to retrain models)
Which k value is more interpretable? Why?
Does highest coherence always mean best interpretability?
9.4 Exercise 4: HCI application
Imagine you are a UX researcher at Nintendo analyzing user feedback:
Identify the “UX complaint” topic in your model (based on top words)
Extract the top 20 reviews for this topic
Read them and note:
What specific design decision is criticized?
How do users describe the impact on their experience?
Do users suggest solutions?
Based on this analysis, what would you recommend to the design team?
Reflect: How does topic modeling help vs. hurt in UX research?
9.5 Exercise 5: Temporal patterns
Looking at the temporal analysis plot:
Which topic(s) decrease over time?
Which topic(s) increase over time?
What explains these patterns? (Hint: consider the review bombing phenomenon)
Does topic modeling help understand how public discourse evolves?
What are the limitations of this temporal analysis?
9.6 Exercise 6: Validation design
Design an inter-coder reliability study for your topic model:
Choose one topic to validate
How many documents would you have coders read?
What specific question would you ask coders?
What Krippendorff’s alpha threshold would you accept for publication?
Besides inter-coder reliability, what other validation steps would you take?