Publications

Modeling Cascade Growth: Predicting Content Diffusion on VKontakte

Published in Networks in the Global World V. NetGloW 2020. Lecture Notes in Networks and Systems, vol 181, 2021

Online social networks have become an essential communication channel for the broad and rapid sharing of information. Currently, the mechanics of such information sharing are captured by the notion of cascades, which are tree-like networks composed of (re)sharing actions. However, it is still unclear what factors drive cascade growth. Moreover, there is a lack of studies outside Western countries and platforms such as Facebook and Twitter. In this work, we aim to investigate what factors contribute to the scope of information cascading and how to predict this variation accurately. We examine six machine learning algorithms for their predictive and interpretative capabilities concerning cascades’ structural metrics (width, mass, and depth). To do so, we use data from VKontakte, a leading Russian-language online social network, capturing cascades of 4,424 messages posted by 14 news outlets over one year. The results show that the best models in terms of predictive power are the Gradient Boosting algorithm for width and depth and the Lasso Regression algorithm for the mass of a cascade, while depth is the least predictable. We find that the most potent factor associated with cascade size is the number of reposts on its origin level. We examine its role along with other factors such as content features and characteristics of sources and their audiences.

Recommended citation: Moroz A., Pashakhin S., Koltsov S. (2021) Modeling Cascade Growth: Predicting Content Diffusion on VKontakte. In: Antonyuk A., Basov N. (eds) Networks in the Global World V. NetGloW 2020. Lecture Notes in Networks and Systems, vol 181. Springer, Cham. https://doi.org/10.1007/978-3-030-64877-0_12
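
As an illustration of the structural metrics named in the abstract above, the following minimal sketch computes width, mass, and depth for a cascade stored as parent-child reshare pairs. The definitions used here are assumptions made for the example (mass as the number of reshares, width as the largest number of nodes on one level, depth as the number of levels below the original post); the paper's own operationalization may differ, and the function name and sample data are hypothetical.

```python
from collections import defaultdict, deque


def cascade_metrics(edges, root):
    """Return (width, mass, depth) of a cascade tree.

    edges: iterable of (parent, child) pairs, one per (re)share.
    Assumed definitions (not taken from the paper):
      mass  = number of reshares (all nodes except the root post),
      width = largest number of nodes found on any single level,
      depth = number of levels below the root.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    # Breadth-first traversal, counting how many nodes sit on each level.
    level_sizes = defaultdict(int)
    queue = deque([(root, 0)])
    while queue:
        node, level = queue.popleft()
        level_sizes[level] += 1
        for child in children[node]:
            queue.append((child, level + 1))

    mass = sum(level_sizes.values()) - 1
    width = max(level_sizes.values())
    depth = max(level_sizes)
    return width, mass, depth


# Hypothetical cascade: post "p" is reposted by a, b, c; b is reposted by d; d by e.
edges = [("p", "a"), ("p", "b"), ("p", "c"), ("b", "d"), ("d", "e")]
print(cascade_metrics(edges, "p"))  # (3, 5, 3)
```

In this toy cascade, the three first-level reposts correspond to what the abstract calls the number of reposts on the origin level.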

PolSentiLex: Sentiment Detection in Socio-Political Discussions on Russian Social Media

Published in AINL 2020, CCIS 1292, 2020

We present PolSentiLex, a freely available Russian-language sentiment lexicon designed to detect sentiment in user-generated content related to social and political issues. The lexicon was generated from a database of posts and comments of the top 2,000 LiveJournal bloggers posted during one year (∼1.5 million posts and 20 million comments). Following a topic modeling approach, we extracted 85,898 documents that were used to retrieve domain-specific terms. This term list was then merged with several external sources. Together, they formed a lexicon (16,399 units) marked up using a crowdsourcing strategy. A sample of Russian native speakers (n = 105) was asked to assess words’ sentiment given the context of their use (randomly paired) as well as the prevailing sentiment of the respective texts. In total, we received 59,208 complete annotations for both texts and words. Several versions of the marked-up lexicon were experimented with, and the final version was tested for quality against the only other freely available Russian-language lexicon and against three machine learning algorithms. All experiments were run on two different collections. They have shown that, in terms of Fmacro, lexicon-based approaches outperform machine learning by 11%, and our lexicon outperforms the alternative one by 11% on the first collection, and by 7% on the negative scale of the second collection while showing similar quality on the positive scale and being three times smaller. Our lexicon also outperforms or is similar to the best existing sentiment analysis results for other types of Russian-language texts.

Recommended citation: Koltsova O., Alexeeva S., Pashakhin S., Koltsov S. (2020) PolSentiLex: Sentiment Detection in Socio-Political Discussions on Russian Social Media. In: Filchenkov A., Kauttonen J., Pivovarova L. (eds) Artificial Intelligence and Natural Language. AINL 2020. Communications in Computer and Information Science, vol 1292. Springer, Cham. https://doi.org/10.1007/978-3-030-59082-6_1
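
The abstract above compares approaches in terms of Fmacro. Assuming this denotes the macro-averaged F1 score (the abstract does not define it), a minimal sketch of the metric is given below; the function name and the labels are hypothetical and are not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); a class with no support and no predictions scores 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold and predicted sentiment labels for six texts.
gold = ["neg", "neg", "neu", "pos", "pos", "neu"]
pred = ["neg", "neu", "neu", "pos", "neg", "neu"]
print(round(macro_f1(gold, pred), 3))  # 0.656
```

Because every class contributes equally to the average, a method that handles a rare sentiment class poorly is penalized more than it would be under plain accuracy.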

Agenda divergence in a developing conflict: Quantitative evidence from Ukrainian and Russian TV newsfeeds

Published in Media, War & Conflict, 2020

Although conflict representation in media has been widely studied, few attempts have been made to perform large-scale comparisons of agendas in the media of conflicting parties, especially for armed country-level confrontations. In this article, the authors introduce quantitative evidence of agenda divergence between the media of conflicting parties in the course of the Ukrainian crisis of 2013–2014. Using 45,000 messages from the online newsfeeds of a Russian and a Ukrainian TV channel, they perform topic modelling coupled with qualitative analysis to reveal crisis-related topics, assess their salience, and map the evolution of both channels' attention to each of those topics. They find that the two channels produce fundamentally different agenda sequences. Based on the Ukrainian case, they offer a typology of conflict media coverage stages.

Recommended citation: Koltsova, O., & Pashakhin, S. (2020). Agenda divergence in a developing conflict: Quantitative evidence from Ukrainian and Russian TV newsfeeds. Media, War & Conflict, 13(3), 237–257. https://doi.org/10.1177/1750635219829876

‘Alien Elections’: Neighboring State News on the 2018 Russian Presidential Elections

Published in Journal of Economic Sociology (in Russian), 2020

News media tend to reflect voices in the political establishment while covering international events. Is it still true when almost half of the national audience speak the language of the country featured in the coverage? In this paper, we present an analysis of 19.5k news messages collected from Russian-language Ukrainian news outlets covering the 2018 presidential elections in Russia. Using a mixed-method approach (topic modeling and qualitative reading), we identify key topics and stories and evaluate the extent of personalization in the election coverage. We find three central angles: the focus on polls and election results, election preparations in Crimea, and Vladimir Putin’s victory. The elections are linked predominantly to Crimean issues through the date of the elections, each candidate’s stance on the subject, the election management in the region, and other countries’ reactions to the results. Such coverage has an accusatory bias; it stresses the legal status of the Crimean referendum and the Russian authorities’ actions and reports the pressures on locals by authorities, especially the Crimean Tatars. Not linked directly to Crimea, other angles are less emotionally charged. Political personalization of the discussion has a contradictory nature. On one hand, the overwhelming majority of the messages mention public figures. On the other hand, the coverage of the figures is limited and omits their traits. Moreover, at times, public figures are replaced by non-personalized symbols (e.g., Kremlin, Russian invaders). However, if the former’s coverage is predominantly neutral, the latter’s coverage is more prone to negative and loaded statements.

Recommended citation: Kazun A., Pashakhin S. «Chuzhie vybory»: novosti sosednego gosudarstva o vyborakh prezidenta RF v 2018 g. [‘Alien Elections’: Neighboring State News on the 2018 Russian Presidential Elections]. Journal of Economic Sociology = Ekonomicheskaya sotsiologiya, vol. 22, no 1, pp. 71–91. doi: 10.17323/1726-3247-2021-1-71-91 (in Russian). http://doi.org/10.17323/1726-3247-2021-1-71-91

Fast Tuning of Topic Models: An Application of Rényi Entropy and Renormalization Theory

Published in Entropy and Its Applications Vol. 46. Issue 1, 2020

In practice, the critical step in building machine learning models of big data (BD) is parameter tuning with a grid search, a procedure that is costly in terms of time and computing resources. Due to their size, BD are comparable to mesoscopic physical systems. Hence, methods of statistical physics could be applied to BD. The paper shows that topic modeling demonstrates self-similar behavior under the condition of a varying number of clusters. Such behavior allows using a renormalization technique. The combination of a renormalization procedure with the Rényi entropy approach allows for fast searching of the optimal number of clusters. In this paper, the renormalization procedure is developed for the Latent Dirichlet Allocation (LDA) model with a variational Expectation-Maximization algorithm. The experiments were conducted on two document collections with a known number of clusters in two languages. The paper presents results for three versions of the renormalization procedure: (1) a renormalization with the random merging of clusters, (2) a renormalization based on minimal values of Kullback–Leibler divergence, and (3) a renormalization with merging clusters with minimal values of Rényi entropy. The paper shows that the renormalization procedure allows finding the optimal number of topics 26 times faster than grid search without significant loss of quality.

Recommended citation: Koltcov S, Ignatenko V, Pashakhin S. Fast Tuning of Topic Models: An Application of Rényi Entropy and Renormalization Theory. Proceedings. 2020; 46(1):5. https://doi.org/10.3390/ecea-5-06674
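
To make the two quantities named in the abstract above concrete, the sketch below computes the standard Rényi entropy of a probability vector and performs one merge step driven by Kullback–Leibler divergence. This is only an illustration under stated assumptions: it uses the textbook Rényi entropy rather than the paper's formulation for topic models, it symmetrizes the divergence, and it merges two topics by averaging their word distributions. The function names and the toy distributions are hypothetical, and the paper's renormalization procedure for LDA is not reproduced here.

```python
import math


def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha (alpha > 0, alpha != 1) of a probability vector p."""
    return math.log(sum(x ** alpha for x in p if x > 0)) / (1 - alpha)


def kl_divergence(p, q):
    """Kullback–Leibler divergence D(p || q); assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def merge_closest_pair(topics):
    """Merge the two topics with the smallest symmetrized KL divergence.

    Assumption for this sketch: the merged topic is the element-wise
    average of the two word distributions.
    """
    best = None
    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            d = kl_divergence(topics[i], topics[j]) + kl_divergence(topics[j], topics[i])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    merged = [(a + b) / 2 for a, b in zip(topics[i], topics[j])]
    rest = [t for k, t in enumerate(topics) if k not in (i, j)]
    return rest + [merged]


# Hypothetical topic-word distributions over a three-word vocabulary.
topics = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
coarser = merge_closest_pair(topics)
print(len(coarser))  # 2: the first two topics are the closest pair and are merged
print(renyi_entropy([0.25] * 4, alpha=2))  # equals log(4) for a uniform distribution
```

Repeating the merge step shrinks the number of topics by one each time, which is the general idea behind scanning many topic counts without retraining a model for each one.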

How Many Clusters? An Entropic Approach to Hierarchical Cluster Analysis

Published in Intelligent Computing. SAI 2020. Advances in Intelligent Systems and Computing, vol 1230, 2020

Clustering large and heterogeneous user-profile data from social media is problematic because finding the optimal number of clusters becomes more critical than for smaller and homogeneous data. We propose a new approach based on the deformed Rényi entropy for determining the optimal number of clusters in hierarchical clustering of user-profile data. Our results show that this approach allows us to estimate Rényi entropy for each level of a hierarchical model and find the entropy minimum (information maximum). Our approach also shows that solutions with the lowest and the highest number of clusters correspond to the entropy maxima (minima of information).

Recommended citation: Koltcov S., Ignatenko V., Pashakhin S. (2020) How Many Clusters? An Entropic Approach to Hierarchical Cluster Analysis. In: Arai K., Kapoor S., Bhatia R. (eds) Intelligent Computing. SAI 2020. Advances in Intelligent Systems and Computing, vol 1230. Springer, Cham. https://doi.org/10.1007/978-3-030-52243-8_40

A Full-Cycle Methodology for News Topic Modeling and User Feedback Research

Published in SocInfo 2018, LNCS 11185, 2018

Online social networks (OSNs) play an increasingly important role in news dissemination and consumption, attracting such traditional media outlets as TV channels with growing online audiences. Online news streams require appropriate instruments for analysis. One such tool is topic modeling (TM). However, TM has a set of limitations (the problem of topic number choice and the algorithm instability, among others) that must be addressed specifically for the task of sociological online news analysis. In this paper, we propose a full-cycle methodology for such a study: from choosing the optimal topic number to the extraction of stable topics and analysis of TM results. We illustrate it with an analysis of an online news stream of 164,426 messages formed by twelve national TV channels during a one-year period in a leading Russian OSN. We show that our method can easily reveal associations between news topics and user feedback, including sharing behavior. Additionally, we show how an uneven distribution of document quantities and lengths over classes (TV channels) could affect TM results.

Recommended citation: Koltsov, S., Pashakhin, S., & Dokuka, S. (2018). A Full-Cycle Methodology for News Topic Modeling and User Feedback Research. In S. Staab, O. Koltsova, & D. I. Ignatov (Eds.), Social Informatics (Vol. 11185, pp. 308–321). Springer International Publishing. https://doi.org/10.1007/978-3-030-01129-1_19

Topic Modeling for Frame Analysis of News Media

Published in Artificial Intelligence and Natural Language AINL FRUCT 2016 Conference, 2016

Media frames have traditionally been extracted via manual content and discourse analysis. Such an approach has a limited ability to deal with large text collections and is prone to subjectivity both in terms of text selection and interpretation. We illustrate the possibilities and limitations of topic modeling for frame detection by applying this method to a collection of 50,000 news items related to the Ukrainian crisis and retrieved from the websites of a Russian and a Ukrainian TV channel. We conclude that although topic modeling results allow one to make assumptions about how a topic is framed, the method is still not as precise as human reading of texts.

Recommended citation: Pashakhin S. Topic Modeling for Frame Analysis of News Media, in: Proceedings of the Artificial Intelligence and Natural Language AINL FRUCT 2016 Conference, Saint-Petersburg, Russia, 10-12 November 2016 / Eds.: S. I. Balandin, A. Filchenkov, L. Pivovarova, J. Zizka. FRUCT Oy, 2016. P. 103-106. https://fruct.org/publications/abstract-AINL-FRUCT-2016/files/Pas.pdf