Find stable topics in STM with Jaccard similarity
Source:vignettes/find_stable_topics-jaccard.Rmd
This vignette shows how to use topicl to find stable (reproducible)
topics with Jaccard similarity. This method is presented in
(Koltsov et al., 2018), but can be briefly summarized as follows.
A stable topic is a topic that persists across repeated fits of a topic model to the same data. Thus, keeping model parameters fixed, a stable topic must appear in a majority of fits. Across fits, we compare topics to each other to test whether a given pair is the same. For comparison, we compute Jaccard similarity on the N top words of each topic. Papers by Koltsov and colleagues advise fitting at least five models and considering a given pair of topics identical if their Jaccard similarity is >= 90%. Then, if a topic persists across at least 3 out of 5 model runs, you can consider it stable.
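To make the comparison concrete, here is a minimal base R sketch of Jaccard similarity on two sets of top terms. The helper jaccard_sim() and the two toy term lists are hypothetical and only illustrate the measure; they are not part of topicl and do not reflect how the package computes it internally.

```r
# Jaccard similarity: size of the intersection divided by size of the union.
# jaccard_sim() is a hypothetical helper for illustration, not a topicl function.
jaccard_sim <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Toy top-term lists (made up for this example)
terms_a <- c("report", "time", "stori", "media", "news")
terms_b <- c("report", "time", "stori", "press", "news")

# 4 shared terms out of 6 distinct terms overall
jaccard_sim(terms_a, terms_b)
```

Because the measure works on sets, the order of the terms within each list does not affect the result.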
In this vignette, we will fit five STM models to poliblog5k
(see ?stm::poliblog5k for details) and use topicl to check how many
topics persist.
fit_stm specifies STM parameters. We increase the maximum number of
iterations for the EM algorithm and make sure that every iteration is
actually run by setting max_it = 1000 and emtol = 0 (see ?stm for
details). In my experience, with an increased number of iterations on
these data, topic similarity across fits also increases. I cannot say
if K = 25 is optimal for these data, but this number of topics is used
in the STM documentation.
With https://random.org, we generated five random seeds, one for each
model.
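library(stm)     # stm() and the poliblog5k data
library(furrr)   # future_map(), furrr_options()
library(future)  # plan()
library(tictoc)  # tic(), toc()
library(dplyr)   # arrange(), desc()
library(topicl)

# Fit one STM model with the given random seed
fit_stm <- function(seed) {
  K <- 25
  max_it <- 1000
  emtol <- 0  # no convergence check, so all max_it iterations are run
  mod <- stm(poliblog5k.docs,
             poliblog5k.voc, K = K,
             data = poliblog5k.meta,
             max.em.its = max_it,
             emtol = emtol,
             init.type = "Random",
             seed = seed,
             verbose = FALSE)
  return(mod)
}
seeds <- c(9934, 9576, 1563, 3379, 8505)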
With furrr::future_map(), we fit five models with different seeds in
parallel. Given the high number of iterations, it can take up to 40
minutes.
tic()
plan(multisession, workers = 5) # run five parallel R sessions
mods <- future_map(seeds, fit_stm,
                   .options = furrr_options(seed = TRUE))
plan(sequential) # switch back to sequential processing
toc()
#> 1077.603 sec elapsed
We use topicl::compare_solutions() to calculate topic similarity over
the depth = 10 top terms of each topic. By default, top terms are
produced by ranking term probabilities. Please note that
(Koltsov et al., 2018) advise using 100 top terms.
results <- compare_solutions(mods, depth = 10)
#> comparing solutions ■■■■■■■■■■■■■■ 42% | ETA: 1s
#> comparing solutions ■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 90% | ETA: 0s
#> comparing solutions ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 100% | ETA: 0s
compare_solutions() produces a table showing the results of the
comparisons. Here is how to read this output: each row shows a
comparison of topic A (column topic_id_A) from model A (column
model_id_A) to topic B from model B (columns topic_id_B and
model_id_B).
results |>
  arrange(desc(jaccard)) |>
  head(10)
#> # A tibble: 10 × 5
#> model_id_A topic_id_A model_id_B topic_id_B jaccard
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 mod_1 topic_1 mod_2 topic_5 1
#> 2 mod_1 topic_13 mod_3 topic_20 1
#> 3 mod_1 topic_13 mod_4 topic_1 1
#> 4 mod_1 topic_11 mod_5 topic_1 1
#> 5 mod_2 topic_23 mod_3 topic_3 1
#> 6 mod_2 topic_4 mod_5 topic_11 1
#> 7 mod_3 topic_20 mod_4 topic_1 1
#> 8 mod_3 topic_6 mod_5 topic_5 1
#> 9 mod_1 topic_16 mod_2 topic_18 0.818
#> 10 mod_1 topic_23 mod_2 topic_6 0.818
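The values 1 and 0.818 in this table follow directly from the arithmetic of the measure. As a small sketch, assuming both topics are compared on exactly depth = 10 distinct top terms, the only possible Jaccard values are shared / (2 * depth - shared), where shared is the number of terms the two topics have in common. The variable names below are illustrative only.

```r
depth <- 10
shared <- 0:depth                          # number of top terms two topics share
possible <- shared / (2 * depth - shared)  # intersection size / union size
names(possible) <- shared
round(possible, 3)
```

Nine shared terms give 9 / 11, which rounds to 0.818, and ten shared terms give 1. Under this assumption, no value falls between 0.818 and 1, so with depth = 10 a threshold of 90% is passed only by pairs whose top-term sets are identical.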
If model and topic IDs from the table above are confusing, they can
be matched back to mods, which is a list containing the models in the
order of the model IDs from compare_solutions().
print(mods)
#> [[1]]
#> A topic model with 25 topics, 5000 documents and a 2632 word dictionary.
#>
#> [[2]]
#> A topic model with 25 topics, 5000 documents and a 2632 word dictionary.
#>
#> [[3]]
#> A topic model with 25 topics, 5000 documents and a 2632 word dictionary.
#>
#> [[4]]
#> A topic model with 25 topics, 5000 documents and a 2632 word dictionary.
#>
#> [[5]]
#> A topic model with 25 topics, 5000 documents and a 2632 word dictionary.
Topic IDs match the IDs inside each model. To manually check the terms
used for the comparison of a given pair, you can use top_terms(). Let's
compare the terms from topic 13 of model 1 to topic 1 of model 4, which
showed 100% Jaccard similarity in the output of compare_solutions().
top_terms(mods[[1]], topic_id = 13, n_terms = 10)
#> # A tibble: 10 × 3
#> topic term beta
#> <int> <chr> <dbl>
#> 1 13 report 0.0400
#> 2 13 time 0.0320
#> 3 13 stori 0.0269
#> 4 13 new 0.0269
#> 5 13 media 0.0237
#> 6 13 news 0.0215
#> 7 13 post 0.0181
#> 8 13 york 0.0175
#> 9 13 press 0.0170
#> 10 13 articl 0.0111
top_terms(mods[[4]], topic_id = 1, n_terms = 10)
#> # A tibble: 10 × 3
#> topic term beta
#> <int> <chr> <dbl>
#> 1 1 report 0.0347
#> 2 1 time 0.0338
#> 3 1 stori 0.0264
#> 4 1 media 0.0251
#> 5 1 new 0.0224
#> 6 1 news 0.0191
#> 7 1 post 0.0189
#> 8 1 press 0.0166
#> 9 1 york 0.0152
#> 10 1 articl 0.0109
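The two lists rank some terms differently (for example, new and media swap places), yet the similarity is still 100%, because Jaccard similarity compares sets and ignores rank. A quick base R check, with the terms copied by hand from the two outputs above:

```r
# Top terms copied from the two top_terms() outputs above, in printed order
terms_m1_t13 <- c("report", "time", "stori", "new", "media",
                  "news", "post", "york", "press", "articl")
terms_m4_t1  <- c("report", "time", "stori", "media", "new",
                  "news", "post", "press", "york", "articl")

# Same set of terms, different order
setequal(terms_m1_t13, terms_m4_t1)
identical(terms_m1_t13, terms_m4_t1)

# Intersection size over union size
length(intersect(terms_m1_t13, terms_m4_t1)) /
  length(union(terms_m1_t13, terms_m4_t1))
```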
It is helpful to visualize similar topics that persist across fits
with a network plot. Given the output of topicl::compare_solutions()
and a threshold, topicl::viz_comparisons() will plot a topic-topic
network. Here, we plot connections between topics with >= 80%
similarity.
viz_comparisons(results, 0.8)
With topicl::filter_topics(), we can get a table of topic IDs together
with the number of distinct other models in which each topic has a
matching topic.
filter_topics(results, 0.5, 2)
#> # A tibble: 16 × 2
#> topic_1 distinct_models
#> <chr> <int>
#> 1 mod_1_topic_1 4
#> 2 mod_1_topic_11 3
#> 3 mod_1_topic_12 2
#> 4 mod_1_topic_13 4
#> 5 mod_1_topic_15 2
#> 6 mod_1_topic_16 3
#> 7 mod_1_topic_17 3
#> 8 mod_1_topic_2 3
#> 9 mod_1_topic_20 2
#> 10 mod_1_topic_21 3
#> 11 mod_1_topic_23 4
#> 12 mod_1_topic_24 3
#> 13 mod_1_topic_4 3
#> 14 mod_1_topic_6 3
#> 15 mod_1_topic_7 3
#> 16 mod_1_topic_9 2
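To illustrate the idea behind such a count, here is a rough base R sketch on a toy table shaped like the output of compare_solutions(). It is not topicl's implementation and makes no claim about how filter_topics() treats its arguments; the threshold, the toy object, and all variable names are assumptions made for this example. The rows are copied from the comparison table printed earlier.

```r
# Toy comparison table; rows copied from the compare_solutions() output above
toy <- data.frame(
  model_id_A = c("mod_1", "mod_1", "mod_1", "mod_1", "mod_1"),
  topic_id_A = c("topic_1", "topic_13", "topic_13", "topic_16", "topic_23"),
  model_id_B = c("mod_2", "mod_3", "mod_4", "mod_2", "mod_2"),
  topic_id_B = c("topic_5", "topic_20", "topic_1", "topic_18", "topic_6"),
  jaccard    = c(1, 1, 1, 0.818, 0.818)
)

threshold <- 0.9  # assumed cut-off, following the >= 90% rule described earlier

# Keep only pairs at or above the threshold
kept <- toy[toy$jaccard >= threshold, ]

# For each topic of model A, count the distinct other models with a match
topic_key <- paste(kept$model_id_A, kept$topic_id_A, sep = "_")
counts <- tapply(kept$model_id_B, topic_key, function(m) length(unique(m)))
counts
```

With five fits, a topic from the first model that has matches in at least 2 other models appears in at least 3 of the 5 fits, which corresponds to the majority rule described at the start of this vignette.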
References
- Koltsov, S., Pashakhin, S., & Dokuka, S. (2018). A Full-Cycle Methodology for News Topic Modeling and User Feedback Research. In S. Staab, O. Koltsova, & D. I. Ignatov (Eds.), Social Informatics (Vol. 11185, pp. 308–321). Springer International Publishing. https://doi.org/10.1007/978-3-030-01129-1_19
Appendix
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.2
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Berlin
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 furrr_0.3.1 future_1.34.0 tictoc_1.2.1
#> [5] stm_1.3.7 topicl_0.0.0.9000
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.45 bslib_0.7.0 ggplot2_3.5.1
#> [5] htmlwidgets_1.6.4 ggrepel_0.9.5 lattice_0.22-6 vctrs_0.6.5
#> [9] tools_4.4.2 generics_0.1.3 parallel_4.4.2 tibble_3.2.1
#> [13] highr_0.11 janeaustenr_1.0.0 pkgconfig_2.0.3 tokenizers_0.3.0
#> [17] Matrix_1.7-1 data.table_1.16.4 RColorBrewer_1.1-3 desc_1.4.3
#> [21] lifecycle_1.0.4 compiler_4.4.2 farver_2.1.2 stringr_1.5.1
#> [25] textshaping_0.4.0 munsell_0.5.1 ggforce_0.4.2 graphlayouts_1.2.1
#> [29] codetools_0.2-20 SnowballC_0.7.1 htmltools_0.5.8.1 sass_0.4.9
#> [33] yaml_2.3.8 tidytext_0.4.2 pillar_1.10.1 pkgdown_2.1.1
#> [37] jquerylib_0.1.4 tidyr_1.3.1 MASS_7.3-61 cachem_1.1.0
#> [41] viridis_0.6.5 parallelly_1.39.0 tidyselect_1.2.1 digest_0.6.37
#> [45] stringi_1.8.4 reshape2_1.4.4 purrr_1.0.2 listenv_0.9.1
#> [49] labeling_0.4.3 polyclip_1.10-7 fastmap_1.2.0 grid_4.4.2
#> [53] colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3 ggraph_2.2.1
#> [57] tidygraph_1.3.1 utf8_1.2.4 withr_3.0.2 scales_1.3.0
#> [61] rmarkdown_2.27 globals_0.16.3 igraph_2.0.3 gridExtra_2.3
#> [65] ragg_1.3.2 memoise_2.0.1 evaluate_0.24.0 knitr_1.47
#> [69] viridisLite_0.4.2 rlang_1.1.4 Rcpp_1.0.14 glue_1.8.0
#> [73] tweenr_2.0.3 jsonlite_1.8.8 R6_2.5.1 plyr_1.8.9
#> [77] systemfonts_1.1.0 fs_1.6.4