Open access peer-reviewed chapter - ONLINE FIRST

Text as Data in Demography: Russian-Language Experience

Written By

Irina Kalabikhina, Natalia Loukachevitch, Eugeny Banin and Anton Kolotusha

Submitted: 06 March 2023 Reviewed: 15 May 2023 Published: 09 February 2024

DOI: 10.5772/intechopen.1003274

Population and Development in the 21st Century: Between the Anthropocene and Anthropocentrism. Edited by Parfait M Eloundou-Enyegue. IntechOpen.

Abstract

We review our experience of using Russian-language texts from social networks, electronic media, and search engines as data in demographic analysis. Experiments on the automatic classification of opinions were carried out, in most cases with Conversational RuBERT. We describe the following main results: (1) short-term forecasts of fertility dynamics based on Google Trends data, (2) automatic measurement of the demographic temperature of different demographic groups (pronatalists and antinatalists) in social networks, (3) sentiment analysis of reproductive behavior, of vital behavior during the pandemic, and of attitudes toward demographic and epidemiological policy based on social network data, (4) analysis of the arguments of social network users, and (5) analysis of media publications on demographic policy. We also describe the open databases created in these studies and reflect on the advantages and difficulties of using texts as data in demographic analysis.

Keywords

  • text as data
  • demographic values
  • demographic behaviour
  • Russian-language
  • digital data
  • conversational RuBERT
  • sentiment analysis
  • demographic temperature
  • stance and arguments
  • natural language processing
  • demographic policy
  • Google trend data
  • social network

1. Introduction and brief literature review

Only recently have texts begun to be used as data in social sciences. With the growing access to high-speed Internet and the creation of search engines and social networks, researchers can receive a large number of comments on issues in social sciences, including behavior, trends, phenomena, and politics.

In social sciences, the practical application of texts in research covers a variety of topics, including questions of authorship, stock price estimation, the impact of central bank communication, nowcasting, policy uncertainty, media slant, market definition and innovation impact, and topics in research, politics, and law [1], as well as different aspects of consumer behavior [2, 3, 4].

In demographic studies, issues of interest include many aspects of demographic behavior, demographic trends, and the assessment of demographic, epidemiological, and other types of population policy. For all these topics, the use of text as data (information on search platforms, social networks, and mass media) can prove useful. Research in the field of digital demography is a promising new direction. Researchers across the world are just beginning to apply algorithms for extracting, structuring, processing, and combining socio-demographic data about populations.

One simple way to use the new data is to measure the frequency of search queries on different demographic topics. Google data can be used to calculate the volume and geography of abortions in the United States and other countries [5], to forecast fertility trends ahead of the official data of statistical agencies [6, 7], to estimate migration trends between pairs of countries with virtually no time delay [8] (the Russia-Germany case), and to monitor the spread of diseases. For instance, the Google Flu Trends project aimed to predict epidemics from 2009 to 2014 based on weekly and monthly Google search query data in the United Kingdom [9], and the spread of the COVID-19 pandemic in different countries and regions has also been studied. The short-term forecast of fertility on Russian data [7] applies a SARIMA/ARIMA predictive model to a monthly time series of Google query data starting in 2004 and showed high accuracy with a 2-year forecast horizon. Search engines such as Google and Yandex1 have developed their own algorithms to popularize and simplify work with simple queries.

The digital traces of other platforms can be used to study population trends together with various frequency data (cellular operator data, among others). Russian data offer several good examples. The migration of the population of the Russian Arctic was investigated at the municipal level using the profiles of users of the social network VKontakte2 in combination with ticket service data [10]. Differences between Russian regions in the sensitivity of regional media to legislative events concerning maternity (family) capital were studied using the public.ru platform [11], and an open dataset with the sentiment of the texts of 0.5 million media publications on the use of maternity (family) capital is available [12]. Track record data from the resumes of participants in the LinkedIn professional social network were used [13] to determine the volume of the migration influx of educated personnel and students to the United States and Asian countries.

The second direction in using big unstructured data in demographic analysis involves collecting texts from social and professional networks and the media and applying sentiment analysis to them. Artificial intelligence algorithms allow researchers to explore the emotional background of texts on selected demographic topics. A demographic topic is identified in a social network (“becoming a parent”, “abortion”, “illness”, “vaccination”, etc.) or in the media, and the ratio of positive, negative, and neutral statements is assessed. Other assessments or opinions can be added (e.g., irony [14]). For example, an analysis of Twitter data with a variety of computational linguistic methods reveals the feelings of people who are about to have a child and how they plan to have children [15]. Sentiment analysis has been applied to demographic issues including abortion [16, 17, 18, 19, 20, 21], in particular the legalization of abortion [22], various aspects of parenthood [14, 15], health issues [5, 21, 23, 24], drivers of demographic processes (e.g., natural disasters [25]), the demographic structure and trend of telemedicine [26], and the COVID-19 pandemic and other infections [27, 28, 29, 30, 31, 32]. Topics closely related to demography have also been studied with sentiment analysis, including sexual harassment or violence [33, 34, 35], attitudes towards genetic testing [36], and processes of racial segregation [37]. Facebook data were used to investigate the assimilation of Mexican immigrants in the United States [38]. Ref. [39] used the example of migration analysis in Poland to demonstrate how big data from the Facebook social network can help to investigate narrow population groups that would normally be inaccessible to demographers.

The development of databases [40] and software (e.g., [41]) for sentiment analysis of Russian-language texts has encouraged the emergence of studies using sentiment analysis, including in the social sciences [42]. In the few demographic studies based on Russian-language data and sentiment analysis, the authors studied positions on the vaccination of children [43] (the dataset is composed of posts from the social network VKontakte, divided into two classes: “for” and “against”) as well as health issues [44] (DigitalFreud profile data, VKontakte user data, phone application data).

Deeper thematic analysis of texts involves the automated analysis of arguments: the analysis of opinions covers not only the stance but also the arguments that support it. For example, we can not only observe the share of positive and negative assessments of birth, having many children, abortion, the childfree phenomenon, unregistered marriage, etc., but also obtain a structured breakdown of positive and negative arguments, that is, why people choose a positive or negative reaction.

Modern methods of natural language processing do not provide a universal tool for the automatic determination of argumentation in texts. Researchers essentially analyze the data “manually”, answering the question of why people perceive a particular topic negatively or positively, or they repeat a time-consuming sentiment analysis on the arguments treated as topics.

The analysis of arguments from social media comments has gained popularity in environmental research in recent years due to the growing discussion of new topics such as COVID-19 vaccination [45, 46, 47] and quarantine [48]. In addition, comments are used to monitor and make operational decisions on COVID-19 and its consequences [49, 50]. However, most researchers analyzing social media content concentrate on assessing the emotional background of posts and comments [51, 53], without attempting to extract the arguments behind users’ positions. Studies on the thematic analysis of arguments are clearly fewer than studies on the thematic analysis of stances (compare, e.g., the analysis of stances on abortion [5, 16, 17, 18, 19, 21, 54] with the analysis of arguments on abortion [16]). In particular, for opinions about abortion, the authors of [16] identified the eight most common arguments for abortion and five arguments against it. There are fewer argument analysis studies because argument structures require additional theoretical processing before text processing methods can be applied. That is why it is important to learn how to identify the presence of argumentation and to classify the arguments found automatically.

In this chapter, we present the results of over three years of work on sentiment analysis of Russian-language texts in social networks as they pertain to demographic issues. We conduct a consistent sentiment analysis of comments in selected demographically oriented groups of the social network VKontakte, the so-called “pro-natalist” and “anti-natalist” groups with different reproductive attitudes. Our approach consists of three steps: (1) the automated measurement of the dynamics of the emotional background of groups in social networks over a long period of time in the context of demographic and family policy in Russia (demographic temperature of the first type), (2) the sentiment analysis of reproductive attitudes and opinions on pro-natalist demographic policy among social network users from the same groups (demographic temperature of the second type), and (3) the search for a general algorithm for the sentiment analysis of the arguments behind positive or negative opinions on reproductive issues and policy among social network users from the same groups.

2. Data and methodology

Our methodological approach is based on the sequential study of selected demographically oriented groups in a social network.

The research was based on a dataset of comments from users of the VKontakte social network, pre-selected for relevance to issues of reproductive behavior. These data are publicly available [55, 56]. The selected demographically oriented groups include pro-natalists (supporters of childbearing) and anti-natalists (supporters of refusing childbearing, child-free groups).

The data collection was carried out in several stages. First, using the built-in API (application programming interface), user groups dedicated to the topic of reproductive behavior, of both pro- and anti-natalist orientation, were selected. The search for groups was carried out by keywords (“mom”, “child”, “parents”, “childfree”, etc.). Then irrelevant groups, in particular advertising or inactive ones, were removed. As a result, of over 1000 groups selected at the first stage, about 350 remained after cleaning. It turned out that there are significantly more pro-natalist groups, and they usually have significantly more participants, whereas anti-natalist groups are more active, that is, they contain a greater number of posts and discussions. The final sample included 341 pro-natalist and 8 anti-natalist groups.

The number of groups was chosen based on the amount of text (using a threshold of more than 100 thousand comments). Supporters of the childless lifestyle (the keyword “childfree” and its variations) published significantly more posts and comments than representatives of other groups. The reproductive dataset includes only resonant comments (at least 5 likes) from groups that are thematically relevant (related to issues of childhood, motherhood, and pregnancy) and representative in size (at least 500 subscribers). The search depth was 5000 posts.
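
The two selection thresholds named above can be written down as a small filter. This is only an illustrative sketch: the `Comment` record and its field names are hypothetical and do not reflect the actual structure of the published dataset.

```python
from dataclasses import dataclass

MIN_LIKES = 5          # "at least 5 likes" in the text
MIN_SUBSCRIBERS = 500  # "at least 500 subscribers" in the text


@dataclass
class Comment:
    """Hypothetical record for one comment; field names are assumptions."""
    text: str
    likes: int
    group_subscribers: int


def is_resonant(comment: Comment) -> bool:
    """Keep a comment only if it passes both thresholds."""
    return (comment.likes >= MIN_LIKES
            and comment.group_subscribers >= MIN_SUBSCRIBERS)


def select_comments(comments):
    """Return the comments that satisfy the selection rule."""
    return [c for c in comments if is_resonant(c)]
```

Thematic relevance of a group, the third criterion, is a judgment about content and is not captured by this numeric filter.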

At the final stage, texts (user comments) were extracted from these groups and brought to a unified format using standard text preprocessing algorithms: conversion to lowercase, removal of stop words, numbers, and punctuation marks, lemmatization (reduction to the initial form), and stemming (removal of endings). A structured array (corpus) of texts was formed. Thematic clusters were identified using Latent Dirichlet Allocation (LDA). The collected texts were then additionally filtered by the presence of keywords related to reproductive topics in order to exclude most irrelevant comments.
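
A minimal sketch of part of this preprocessing, using only the Python standard library, might look as follows. The stop-word list and the keyword set are tiny hypothetical examples, not the lists used in the study. Lemmatization and stemming are deliberately left out because they require a morphological analyzer for Russian.

```python
import re

# Hypothetical, deliberately tiny lists for illustration only.
STOP_WORDS = {"и", "в", "на", "не", "что", "это", "я", "а"}
KEYWORDS = {"ребенок", "дети", "мама", "чайлдфри"}

# Word characters excluding digits and underscore; this drops numbers
# and punctuation in a single pass.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def preprocess(text):
    """Lowercase, tokenize, and remove stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def is_relevant(tokens):
    """Keyword filter: keep a comment if any token is a known keyword."""
    return any(t in KEYWORDS for t in tokens)
```

Without lemmatization the keyword filter only matches exact word forms, so an inflected form such as "детей" is not caught by the keyword "дети". This is one reason the reduction to the initial form precedes keyword filtering in the pipeline described above.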

Additionally, irrelevant comments were deleted during expert annotation at both the thematic analysis stage and the argument automation stage. For the subset of the reproductive data annotated with reproductive and pro-natalist policy topics for thematic analysis, see Ref. [57].

Moreover, we constructed a database of health-related behavior using VKontakte users’ comments discussing COVID-19 news texts [58]. We use this dataset to strengthen some conclusions on sentiment analysis of Russian-language text of social networks on demographic topics [59]. Altogether, about 11,000 reproductive-related comments and about 10,000 health-related comments were annotated by stance and arguments.

We applied the three-step methodology to the same data at every step. First, we examine the general emotional background of the pro-natalist and anti-natalist groups with different reproductive attitudes against demographic trends and developments in the field of population policy. This is the so-called demographic temperature of the first type.

The demographic temperature is the authors’ term for the emotional background, that is, the predominance of positive or negative statements on topics related to family values, the birth of children, and other topics in the field of reproductive behavior. Demographic temperature is measured as the difference (or ratio) between the number of positive and negative statements over a certain period [60].
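
The definition can be illustrated with a short sketch. The label names ("positive", "negative", "neutral"), the function names, and the choice to take the ratio as positive over negative are assumptions made for the example; the source only states that the measure is a difference or a ratio of the two counts.

```python
from collections import Counter


def demographic_temperature(labels, as_ratio=True):
    """Temperature for one period from a list of sentiment labels.

    Ratio form: positive / negative (assumed direction).
    Difference form: positive - negative.
    Neutral labels are ignored in both forms.
    """
    counts = Counter(labels)
    pos, neg = counts["positive"], counts["negative"]
    if as_ratio:
        # The ratio is undefined when there are no negative statements.
        return pos / neg if neg else None
    return pos - neg


def temperature_by_period(records, as_ratio=True):
    """records: iterable of (period, label) pairs, e.g. (2014, "negative")."""
    by_period = {}
    for period, label in records:
        by_period.setdefault(period, []).append(label)
    return {
        period: demographic_temperature(labels, as_ratio)
        for period, labels in sorted(by_period.items())
    }
```

With the ratio form, a period with equal numbers of positive and negative statements has a temperature of exactly 1.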

The demographic temperature of the first type is the difference (or ratio) between the number of any positive and negative statements of users of selected demographic groups with different reproductive attitudes. We assume that most of the comments of users of these groups are devoted to reproductive and pro-natalist policy topics. Monitoring the demographic temperature of the first type in different groups and superimposing it on a calendar of events gives us an understanding of the hypothetical impact of these events (economic crises, population policy, and so on) on the emotional background in different demographic groups.

Then we conduct a sentiment analysis of the social network comments of these selected demographic groups on demographic topics. This is the demographic temperature of the second type.

The demographic temperature of the second type is the difference (or ratio) between the number of reproductive positive and negative statements of users of our groups. Theoretically, we can implement the sentiment analysis using the reproductive statements of any users of social networks. However, we choose the same reproductive groups to reduce the challenge of representation (we gather the reproductive sentiments in the reproductive groups) and to get the statements from selected marginal groups of people with opposite reproductive attitudes. Using data from different groups of the VKontakte social network avoids data homogeneity which is one of the weak points of the sentiment analysis [61].

Finally, we analyze the arguments behind the positive and negative statements identified at the previous stage. This last step develops an algorithm for the thematic analysis of arguments with natural language processing methods, in order to monitor both the emotional background of social groups and the arguments behind the positive or negative demographic temperature of stances. The main idea of this step is to automate the thematic analysis of arguments.

As the subject of research in this chapter, we consider reproductive behavior and public opinion on demographic policy measures in the field of fertility. We conduct a comprehensive investigation of social network users’ opinions on reproductive attitudes and population policy: the measurement of the emotional background of demographic groups, the thematic analysis of users’ comments about demographic behavior and demographic policy, and argument analysis, which potentially helps us grasp the motivation behind the comments.

In addition, we constructed a dataset on health-related behavior based on VKontakte users’ comments discussing COVID-19 news texts [58]. The sentiment thematic analysis of health-related behavior used the same method as for the reproductive dataset but was based on this alternative dataset. We use the health-related behavior dataset in this chapter to strengthen some conclusions on sentiment analysis of Russian-language texts from social networks on demographic topics.

A more detailed description of the methods used to solve our demographic problems is presented in the appendix.

3. Results

3.1 Automated measurement of demographic temperature (temperature of the first type)

Our study was a pioneering attempt to analyze the sentiment of Russian-language comments on social networks in order to determine the demographic temperature (the ratio of positive and negative comments) in certain groups of social network users related to reproductive behavior. Using the data available for the two types of groups since 2012, we revealed an asynchronous structural shift in the comment corpora of the pro-natalist and anti-natalist thematic groups ([60], Figs. 11 and 12). We contrast two thematic groups that differ in their reproductive and family attitudes. This technique allowed us to identify asymmetric trends in the sentiment of anti-natalists and pro-natalists from 2012 to 2020, before and after the introduction in 2014 of a family policy with an emphasis on traditional family values.

The pro-natalists are more positive in general over the whole period. Their average demographic temperature for the period is slightly negative, at 0.951 (a value of “1” indicates a neutral demographic temperature, a value below “1” a negative one, and a value above “1” a positive one). The anti-natalists have a lower average demographic temperature over the period, 0.691 ([60], Figs. 15 and 16). The demographic temperature of both pro-natalists and anti-natalists has been declining from 2014 to the present. For anti-natalists the decline began in 2014; for pro-natalists it began later, in 2017, 3 years after the family policy started. We have found that the demographic temperature in the pro-natalist groups closely tracks the key events in demographic policy (indexation and extension of maternity capital, family and demographic policy initiatives).

The method-related result of this stage is that comments under posts are more suitable and relevant for analyzing the sentiment of statements than the texts of posts.

3.2 Sentiment analysis of reproductive attitudes and opinions about demographic policy (temperature of the second type)

Using the data from the social network VKontakte, the opinions of network users about the birth of children were examined. A dataset has been compiled in which opinions are annotated with three classes of positions on six topics related to the birth of children and pro-natalist policy. Experiments on the automatic classification of opinions were carried out.

The analysis revealed a predominance of positive assessments of childlessness, individualism, and permitting abortion, and a predominance of negative assessments of having many children, maternity capital, and parental leave.

The created dataset enables meaningful conclusions about the attitude of VKontakte users to issues of reproductive behavior. The topic whose relevance is easiest to determine is “Abortion”, since in the vast majority of cases the topic is signaled by the word “abortion” itself or by words derived from it. In this case, keywords alone also identify the topic with high quality. The prevalence of “for” markups on the topic “Abortion” (51.6%) is associated with the position of social network users that women should have the right to abortion and with the population’s attitude towards abortion as an acceptable means of birth control. The abortion rate has decreased significantly in Russia, but the attitude towards this method of birth control reflects not so much a readiness to act as a recognition of reproductive rights.

We find that the phenomenon of conscious childlessness is actively represented on the web, while having many children remains an uncommon behavior model. The predominance of negative opinions about having many children (44.9%) is due to the growing heterogeneity of the population in the number of children born. The increase in the share of large families among families with minor children during the period of demographic policy (from 7 to 9%, according to the 2010 census and 2015 micro-census) does not change the fact that the type of reproductive behavior oriented towards having many children remains rare.

On the topic “Childlessness”, positive assessments prevail (60.0%): supporters of this pattern are quite emotional, and the topic is apparently active during a period when the number of people following this model is growing. In addition, we must keep selection in mind: we work with the texts of antinatalist and pronatalist groups. Among antinatalists, childlessness is discussed predominantly positively, while among pronatalists the topic is less popular. This can also explain the prevalence of positive assessments on the topic “Individualism” (62.8%), which is interpreted as self-development and directing resources to one’s own pleasure. This topic often appears as a justification of a positive position on childlessness; our hypothesis about this justification was confirmed.

The predominance of negative over positive statements about parental leave (35.9%) is explained by the fact that maternity leave and childcare leave are perceived negatively by employers and colleagues, and by the fact that women themselves often complain about the increased workload during this period, the change of lifestyle during such leave, and the lack of free time. Opinions about childcare benefits are also rather negative (45.5%), since the benefits appear small to the recipients. In addition, childfree supporters do not want their taxes to go towards family benefits.

The reproductive dataset’s statistics by topics and positions are shown in Table 1.

Topic | “Relevant” | “For” | “Against” | “Other”
Having many children | 341 | 75 (22.0%)* | 153 (44.9%) | 113 (33.1%)
Childlessness | 1422 | 853 (60.0%) | 134 (9.4%) | 435 (30.6%)
Individualism | 739 | 464 (62.8%) | 119 (16.1%) | 156 (21.1%)
Abortion | 1374 | 709 (51.6%) | 161 (11.7%) | 504 (36.7%)
Maternity capital (Childcare benefits) | 813 | 184 (22.6%) | 370 (45.5%) | 259 (31.9%)
Parental leaves | 992 | 201 (20.3%) | 356 (35.9%) | 435 (43.9%)
Total | 5681** | 2486 (43.8%) | 1293 (22.7%) | 1902 (33.5%)

Table 1.

Distribution of markups on topics related to the birth of children.

Notes: (1) the share of markups (*) with the corresponding token from the total number of relevant markups for a given topic in the line is indicated in parentheses, (2) the total number of markups (**) in the table is greater than the total number of comments analyzed since some comments contained opinions on several topics. Source: compiled by the authors (see more Refs. [62, 63]).

For the health-related behavior dataset, the best results achieved in evaluation were 0.70 for stance detection and 0.74 for argument identification. These results were obtained with an approach that applied the RuBERT classifier to determine the relevance of the texts using the NLI method: to form an input example, a second sentence with the aspect (“masks”, “quarantine”, or “vaccination”) was added to each original sentence from the dataset [64]. For the stance classification task, the texts were pre-processed and then translated into English using the pre-trained seq2seq model (https://huggingface.co/Helsinki-NLP/opus-mt-ru-en). This made it possible to use a specialized BERT COVID model trained on texts discussing COVID (https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2).
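
The construction of NLI-style input examples can be sketched as follows. The exact wording of the second sentence used in the study is not given here, so the sketch simply uses the aspect word itself, and the function names are hypothetical. The string with `[CLS]` and `[SEP]` only illustrates the usual BERT sentence-pair layout; in practice a tokenizer inserts these special tokens itself.

```python
ASPECTS = ("masks", "quarantine", "vaccination")


def build_nli_examples(sentences, aspects=ASPECTS):
    """Pair every original sentence with every aspect.

    Each pair is one input example for a sentence-pair classifier,
    which then decides whether the sentence is relevant to the aspect.
    """
    return [(sentence, aspect) for sentence in sentences for aspect in aspects]


def as_bert_input(sentence, aspect):
    """Illustrative layout of a sentence pair as a BERT-style model sees it."""
    return f"[CLS] {sentence} [SEP] {aspect} [SEP]"
```

One consequence of this formulation is that a single comment yields as many examples as there are aspects, so a comment touching on several aspects can be marked relevant for each of them independently.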

The method-related result of this stage is that the best results were obtained using the BERT neural network model in the NLI (Natural Language Inference) formulation. Our results are similar to those of other authors. Ref. [27] compares three groups of methods for determining the author’s position on COVID aspects: methods based on LSTM networks, on CNN networks, and on the BERT model [65]. The BERT-based models give the best results by a large margin.

3.3 Analysis of the arguments of social network users: the search of general classification and automatic algorithm of monitoring of demographic issues

An algorithm for identifying arguments and automatically distinguishing certain types of arguments (based on the assessment of personal and public arguments) has been built using the Conversational RuBERT neural network model. The method used for automatic extraction and classification of the opinions of VKontakte users has shown that comments submitted as the model’s input can be accurately classified by the presence of argumentation and by the type of argument within the “personal-public” dichotomy, which makes it possible to identify personal and social attitudes, values, stories, and opinions in the study of reproductive behavior. For the task of detecting the presence of arguments, the comments of VKontakte users were classified into two classes: Class “1” if at least two out of three annotators agreed that the comment contained an argument, and Class “0” if only one annotator or none of them indicated the presence of an argument.
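
The labeling rule for the presence of an argument is a two-out-of-three majority vote, which can be sketched as below. The function name and the 0/1 encoding of the individual annotator judgments are assumptions made for the example.

```python
def argument_label(annotations):
    """Aggregate three annotator judgments into one class label.

    annotations: three values, 1 if the annotator marked an argument,
    0 otherwise. Returns 1 when at least two annotators marked an
    argument, and 0 when only one or none of them did.
    """
    if len(annotations) != 3:
        raise ValueError("expected exactly three annotator judgments")
    return 1 if sum(annotations) >= 2 else 0
```

For example, `argument_label([1, 0, 1])` gives class 1, while `argument_label([0, 1, 0])` gives class 0.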

As a result, six experiments were carried out, differing in the dataset and the classification option. In the first three experiments, classification was carried out only by the presence of an argument, but different types of data were used.

In Experiment 1, the reproductive data was used, consisting of comments from VKontakte users on reproductive behavior and the evaluation of demographic policy measures. The training, validation, and test samples comprised a total of 5410 comments. Standard pre-processing techniques were applied, including lowercasing and removing punctuation, stop words, and empty comments.

In Experiment 2, COVID-19-related comments from VKontakte users, on topics such as vaccination, mask-wearing, and quarantine restrictions, were added to the reproductive data. The new comments were included only in the training set; the additional data consisted of 6716 comments, which formed the COVID-19 part of the training sample.

In Experiment 3, training was conducted in two stages: pre-training on the COVID-19 data and fine-tuning and evaluation of the model on the reproductive data.

The results of training models with the indication of quality metrics are shown in Table 2.

Experiment | F-score – Class “0” (no argument) | F-score – Class “1” (argument present) | Accuracy
Experiment 1: reproductive data | 0.81 | 0.61 | 0.75
Experiment 2: reproductive and health-related data | 0.82 | 0.49 | 0.73
Experiment 3: pre-training on health-related data, fine-tuning on reproductive data | 0.79 | 0.48 | 0.70

Table 2.

Classification by the presence of an argument (class “0”: no argument, class “1”: presence of an argument).

Source: compiled by the authors (see more Ref. [66]).

The results in Table 2 indicate that the addition of COVID-19 data did not improve the results of argument extraction in the reproductive collection. The model trained on the reproductive data alone performed better in this task.
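
For readers who want to reproduce this kind of evaluation, the two metrics reported in the table can be computed as in the sketch below. It assumes that “F-score” means the balanced F1 measure (the harmonic mean of precision and recall), which the text does not state explicitly; the function names are hypothetical.

```python
def f_score(y_true, y_pred, cls):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
    fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
    fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Share of comments whose predicted class equals the annotated class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Reporting the F-score separately for each class matters here because the classes are unbalanced: a model can reach a reasonable accuracy while still performing noticeably worse on the smaller “argument” class.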

In subsequent experiments, the type of argument was taken into account. In the fourth experiment, the focus was on the search for personal arguments (class “1”), while public comments and comments without arguments were generally taken into account equally in class “0”. Similarly, in the fifth experiment, the focus was on public arguments (class “1”). In the last experiment, all three classes were evaluated separately (“0” – absence of an argument, “1” – public argument, “2” – personal argument). The results are presented in Table 3.

Experiment | F-score – Class “0” (no argument) | F-score – Class “1” (public argument) | F-score – Class “2” (personal argument) | Accuracy
Experiment 4: two classes with the emphasis on personal arguments | 0.86 | – | 0.71 | 0.81
Experiment 5: two classes with the emphasis on public arguments | 0.70 | 0.81 | – | 0.77
Experiment 6: three classes | 0.78 | 0.52 | 0.34 | 0.67

Table 3.

Classification according to the type of argument (“personal”: based on personal experience, “public”: based on public perceptions).

Source: compiled by the authors (see more Ref. [66]).
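
The way the three annotated classes are collapsed into two for Experiments 4 and 5 can be sketched as follows. The constant and function names are hypothetical; the numeric codes follow the three-class scheme of Experiment 6 (“0”: no argument, “1”: public argument, “2”: personal argument).

```python
NO_ARGUMENT, PUBLIC, PERSONAL = 0, 1, 2


def to_binary(labels, focus):
    """Collapse three-class labels into a binary task.

    The focus class becomes class 1; the other argument type and
    comments without arguments are merged into class 0.
    """
    return [1 if label == focus else 0 for label in labels]


three_class = [NO_ARGUMENT, PUBLIC, PERSONAL, PERSONAL]
personal_task = to_binary(three_class, PERSONAL)  # setup of Experiment 4
public_task = to_binary(three_class, PUBLIC)      # setup of Experiment 5
```

Under this scheme class 0 is a mixed class in the binary experiments, so its F-score is not directly comparable with the “no argument” F-score of the three-class experiment.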

The ratio of comments by class averaged 40:60 for comments with arguments versus those without arguments. The classification by the presence of an argument was the most accurate in terms of the ratio of quality metrics in the first experiment, where only reproductive data were used. From this, we can conclude that models trained on the data of the same subject show better results. The classification by argument type was more accurate when using two rather than three classes. In our static measurement, the share of personal arguments was around 40%. Public arguments prevailed in comments on VKontakte during the period under review in the demographic groups under consideration.

Based on the results of this work, we have confirmed that the reasoned opinions of users of the largest Russian social network can be extracted and analyzed automatically. The experiments achieved accuracy between 0.67 and 0.81 (Tables 2 and 3). In addition, the general approach based on the “personal-public” dichotomy is suitable for any type of demographic behavior, making it universal.

In the future, models trained in this way can be used to monitor public attitudes towards reproductive behavior and many other topics. Monitoring the “personal-public” dichotomy in future research will test our hypothesis that growth in the share of personal arguments is a signal for policy-makers. The speed of automated analysis of such data will enable quick monitoring and appropriate responses to changes.

4. Conclusion and discussion

We developed and tested a method for measuring the demographic temperature of the first type (the mood of specific demographic groups in social networks) and the second type (the extraction of opinions on demographic topics in social networks). We have identified differences in the demographic behavior of social network users in different socio-demographic groups and their opinions about parenthood and childbearing on the VKontakte social network. The search for a general structure of demographic arguments was conducted based on the “personal-public” dichotomy.

The use of such data in demographic analysis has limitations, which researchers have discussed. These include various biases; issues of representativeness and fragmentation; a tendency toward negative statements in comments; difficulties with data access; the presence of bots; the risk of false information; changes in the algorithms of the data platforms themselves; and the difficulty of interpreting data, which stems from the initially unstructured texts, the discrepancy between scientific terms and the language of social networks, the double meaning of some words, and the limited information on user characteristics.

The limitations of our research are similar to those mentioned above. The sentiment thematic analysis is sensitive to the following:

  • Representativeness and fragmentation of data. The socio-demographic characteristics of VKontakte users may not coincide with those of the population of the country. For the reproductive dataset, the bias is not so strong: the age-sex structure of VKontakte users is close to that of the population of reproductive age. Nevertheless, we can lose some socio-demographic groups.

  • The risk of a disconnect between the scope and name of a group and the demographic values of its commentators. We checked the connection between the scope of the reproduction-related groups of pro-natalists and anti-natalists, their socio-demographic profile, and the demographic values of the groups’ participants. The connection was found. Our groups are therefore relevant for the comparative analysis of groups with opposite demographic values [67].

  • The tendency toward negative statements in comments, which is a specific characteristic of social networks.

  • Limited access to data, presence of bots, risk of false information [68, 69], and changes in the algorithms of the data platforms themselves.

  • The complexity of data interpretation due to the use of initially unstructured texts, the discrepancy between scientific terms and the language of social networks, the double meaning of some words, and the limited visible characteristics of users [10, 70, 71].

  • Cross-cultural analysis can be limited by the fact that models are usually trained initially on mostly English texts and cannot easily be transferred to the Russian case. A review of the existing software systems for Russian text analysis reveals their low accuracy compared to those for English [40].

The limitations of textual data can be overcome through stratification and other methods of correcting text data biases [72]; the use of different data sources [73]; supervised machine learning; and the development of algorithms for collecting and processing digital data so that the research path can be discussed and repeated for comparative purposes [37].

Sentiment analysis and opinion extraction (stance and arguments) is an actively developing field of research that analyzes the opinions, moods, assessments, attitudes, arguments, and emotions of people based on written or spoken language [74, 75]. Such studies are conducted on corpora of comments under videos on the YouTube platform [76] and under news posts published on social media [77]. Castellano Parra, Meso Ayerdi and Pena Fernandez, who analyzed comments under news posts on news portals as well as comments on the social network pages of certain Spanish newspapers, concluded that both the level of engagement (the number of comments per reader) and the proportion of reasoned comments are higher on social networks than on news portals [78]. In our research, we use text data from social networks to extract opinions on demographic issues also because they provide much more data than comments on news portals would.

The prospects for using texts as data in demographic research are significant. Further research may extend the methodology to a wider range of demographic behaviors, including matrimonial (marriage), migration, and health behavior. The search for universal general types of arguments should continue. Their combination may eventually give a more complete picture of automatically extracted arguments on demographic issues, for example, in assessing private or public services in the field of healthcare, maternity care, partner search, infrastructure projects for families with children, and social support for certain demographic groups. Statements can be made about the quality of the service itself (the result of treatment) and about the quality of the environment (the attitude of the doctor, the availability of equipment, cleanliness, timing, etc.). We also expect that our experience can be extended to other linguistic and cultural spaces, at least within the countries where the demographic transition has been completed. We believe that this is possible because we have a comparable level of public Internet coverage and are at a similar stage of the natalistic transition and the second demographic transition. Differences in culture and institutions are inevitable in every country and constitute the essence of comparative analysis, which is possible when the same methodology and methods are applied to different language corpora. Cross-cultural analysis of demographic behavior using big data in Russia and other countries can also be an interesting subject for further research.

Another path of development is the application of large language models (LLMs), which are currently considered among the most powerful instruments for analyzing text data. Although this instrument is rather new, a number of successful applications can already be seen in different fields, including the social sciences. We expect that, in the future, large language models will provide opportunities for deeper analysis of demographic data.

The prospects of working with texts as data in demographic analysis are impressive: the speed and frequency of obtaining information, the ability to reveal information on issues that are inaccessible within the framework of official statistics, and the ability to combine different data within one research question.

5. Research support

The manuscript was prepared with the financial support of the Scientific and Educational School of Lomonosov Moscow State University “Brain, Cognitive Systems, Artificial Intelligence” (the sentiment analysis of reproductive attitudes and public opinion on demographic policy, dataset on reproductive behavior) and the Faculty of Economics of Moscow State University within the framework of research on the topic “Reproduction of the population in a socio-economic development” (the demographic temperature of the first type, dataset on pro-natalist and anti-natalist groups).

The work of Natalia Loukachevitch on annotating and experimenting with the health-related dataset (vital behavior during the COVID-19 pandemic) is supported by the Russian Science Foundation (project 21-71-3003).

Appendix

Appendix.1 method for measuring the emotional background of demographic groups: demographic temperature of the first type

We aimed to develop an algorithm that would quickly measure the emotional background, that is, the demographic temperature, in demographic groups with contrasting reproductive attitudes on the basis of big data over a long period of time. A thematic model was applied to the generated array of texts to implement the task of reverse thematic modeling. That is, for each set of words and texts, the most appropriate topic was determined based on the probabilistic distributions of the frequencies of words and topics. This allowed us to form thematic clusters. For each thematic cluster, the statements were analyzed. The TensorFlow (footnote 3) and tflearn (footnote 4) libraries were used for sentiment analysis. The neural network for sentiment analysis has a three-level architecture: the first level corresponds to the dimension of the commentary corpus dictionary, the second level consists of 125 neurons (fully connected layer, ReLU activation function; footnote 5), and the third level consists of 25 neurons (fully connected layer, ReLU activation function); the output level is binary (“0” – negative, “1” – positive; softmax activation function, footnote 6). Specification of the learning algorithm (footnote 7):

Rg = tflearn.regression(*neural network layer*, optimizer="sgd", loss="categorical_crossentropy")

The output-level signals sum to one; that is, the output is two numbers characterizing the probabilities that a comment is negative or positive. The neural network was trained by error backpropagation on a labeled database of short messages from Twitter [79]. Training was performed in the Google Colab environment using a graphics accelerator (GPU, graphics processing unit). See Kalabikhina et al. [60] for other model parameters and more details.
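The layer structure described above can be illustrated with a minimal forward-pass sketch in plain Python. It is not the authors' tflearn code: the weights are random rather than trained, the dictionary size `VOCAB_SIZE` and the bag-of-words input are hypothetical, and only the layer sizes (125, 25, 2) and activation functions come from the text.

```python
import math
import random


def relu(values):
    """ReLU activation applied element-wise."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Softmax with the usual max-shift for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights; a real model would learn them by backpropagation."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(vocab_size, rng):
    """Input of dictionary size -> 125 neurons -> 25 neurons -> 2 outputs."""
    sizes = [vocab_size, 125, 25, 2]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(network, bag_of_words):
    """Return [P(negative), P(positive)] for one comment vector."""
    activations = bag_of_words
    for index, (weights, biases) in enumerate(network):
        activations = dense(activations, weights, biases)
        if index < len(network) - 1:  # hidden layers use ReLU
            activations = relu(activations)
    return softmax(activations)  # output layer uses softmax


rng = random.Random(0)
VOCAB_SIZE = 50  # hypothetical dictionary size, for illustration only
network = build_network(VOCAB_SIZE, rng)
comment_vector = [1.0 if rng.random() < 0.1 else 0.0 for _ in range(VOCAB_SIZE)]
p_negative, p_positive = predict(network, comment_vector)
```

Because the output layer is a softmax, `p_negative + p_positive` equals one for any input, which is the property the text relies on when it reads the two outputs as probabilities.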

Appendix.2 method for a thematic analysis of social network users’ opinions on demographic behavior and demographic policy: the demographic temperature of the second type

The research directions for the analysis of rare reproductive patterns and attitudes toward the main measures of demographic policy were determined by experts. Stances in the reproductive dataset were analyzed on six topics: having many children, childlessness, individualism (an argument topic for childlessness), abortion, maternity capital, and parental leave. The first three topics concern reproductive attitudes; the last three concern opinions on fertility-related demographic policy (for more details, see Refs. [62, 63]).

The sentences from the collected sample were marked up by six annotators. The annotators’ group included professional linguists and demographers. Since several issues could be discussed in each sentence, the annotators marked each sentence for every topic with one of the following six labels:

“irrelevant” (it means that the text does not contain stance on the topic);

“for” (positive stance, which means that the speaker expresses his support for the topic);

“against” (negative stance, the topic of discussion is not endorsed by the speaker);

“neutral” (this label is used for factual sentences without any visible attitudes from the author);

contradictory stance (for such a label, evident positive and negative attitudes should be seen in a message);

“unclear” (the presence of a stance is seen, but the context of the sentence does not enable determining it).

We consider three main classes: “for”, “against”, and “others” (“neutral”, contradictory stance, “unclear”).
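The reduction from six annotation labels to three main classes can be sketched as a simple mapping. This is an illustrative sketch only: the key `"contradictory"` is a hypothetical name for the contradictory-stance label, and dropping “irrelevant” topics (rather than assigning them to a class) is an assumption, since the text lists only “neutral”, contradictory stance, and “unclear” under “others”.

```python
# Labels grouped under "others", as listed in the text.
# "contradictory" is a hypothetical key for the contradictory-stance label.
OTHERS = {"neutral", "contradictory", "unclear"}


def collapse_label(raw_label):
    """Map one raw annotation label to a main class, or None for "irrelevant"."""
    if raw_label == "irrelevant":
        # Assumption: sentences without a stance on the topic are left out here.
        return None
    if raw_label in ("for", "against"):
        return raw_label
    if raw_label in OTHERS:
        return "others"
    raise ValueError(f"unknown label: {raw_label!r}")


def collapse_annotation(annotation):
    """annotation maps topic -> raw label; keep only topics with a stance."""
    collapsed = {}
    for topic, raw_label in annotation.items():
        main_class = collapse_label(raw_label)
        if main_class is not None:
            collapsed[topic] = main_class
    return collapsed


# One sentence annotated for several topics (toy example, not real data).
example = {
    "having many children": "for",
    "childlessness": "unclear",
    "abortion": "irrelevant",
}
result = collapse_annotation(example)
```

Keeping the mapping per topic reflects the annotation design, in which a single sentence can carry different stances on different topics.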

The created dataset was divided into three sets: training, validation, and test. The neural network model BERT [65] was used as the main method, in its Conversational RuBERT version, which was created by further training the Russian-language model RuBERT [41] on Russian-language dialogs and texts of social networks. Three variants of BERT model training were used: classification of the target utterance, as well as the so-called NLI (Natural Language Inference) and QA (question-answering) approaches. In the NLI and QA approaches [80], the model received pairs (text, assumption). For NLI relevance classification and QA stance classification, the assumption was the aspect itself (“Abortions”, “Payments”, etc.); for NLI stance classification, the assumption also included the stance itself (“Negative to abortion”, “Neutral to payments”, etc.). To assess the quality of classification, the following measures were used: classification accuracy (Accuracy) and the F-measure.
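The construction of (text, assumption) pairs for the NLI and QA variants can be sketched as follows. This is a hedged illustration, not the authors' preprocessing code: only the aspects “Abortions” and “Payments” and the stance words “Negative” and “Neutral” are quoted in the text, so “Positive” and the helper names are assumptions.

```python
# Aspect names as quoted in the text, mapped to the wording used inside
# stance assumptions ("Negative to abortion", "Neutral to payments").
ASPECT_PHRASES = {"Abortions": "abortion", "Payments": "payments"}

# "Negative" and "Neutral" appear in the text; "Positive" is an assumed counterpart.
STANCE_WORDS = ["Positive", "Negative", "Neutral"]


def aspect_pairs(text):
    """Pairs for NLI relevance and QA stance classification:
    the assumption is the aspect itself."""
    return [(text, aspect) for aspect in ASPECT_PHRASES]


def nli_stance_pairs(text):
    """Pairs for NLI stance classification:
    the assumption names both the stance and the aspect."""
    return [(text, f"{stance} to {phrase}")
            for phrase in ASPECT_PHRASES.values()
            for stance in STANCE_WORDS]


comment = "Example comment text"  # placeholder, not taken from the dataset
relevance_inputs = aspect_pairs(comment)
stance_inputs = nli_stance_pairs(comment)
```

In this formulation the classifier scores each pair separately, so one comment yields one input per aspect for relevance and one input per aspect and stance for NLI stance classification.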

As we mentioned earlier, the sentiment thematic analysis of health-related behavior is based on VKontakte users’ comments discussing COVID-19 news and applies the same method of sentiment analysis to an alternative dataset [58]. The sentences were labeled in relation to the following claims: (1) “Vaccination is beneficial for society.”; (2) “The introduction and observance of quarantine is beneficial for society.”; (3) “Wearing masks is beneficial for society” (for more details, see [59]).

The annotation process included two stages: labelling by stance and labelling by premises. At the first stage, stance annotation, each sentence was labeled by several experts (three on average) [81]. The annotator had to indicate the stance the sentence expresses towards each of the above-mentioned aspects (or indicate that the sentence is not relevant to the aspect). The annotators’ group included professional linguists and psychologists. We consider four stance labels, namely: “irrelevant”, “for”, “against”, “other” (as in the reproductive case).

In the second stage of annotation, the dataset was also annotated by arguments for all three claims [59]. The following four classes (labels) were used: “irrelevant” (text does not contain stance and, consequently, premise on the topic), “for” (the stance is supported with an argument in favor of the topic), “against” (the argument explains the author’s negative outlook on the topic), “no argument” (no explanation is given for the topic). A sentence was considered as a premise if the annotator could use it to convince an opponent about the given claim, such as “Masks help prevent the spread of disease”. The task of premise annotation should be separated from stance detection and sentiment analysis tasks. For example, the following statement does not contain a premise in relation to masks, although there is an author’s stance “for”: “It is high time to involve the city of ‘brides’ in the production of protective masks”. It is also necessary to distinguish between sentiment polarity (positive and/or negative) and argumentation. In the following sentence, there is a negative polarity towards quarantine, a positive polarity towards Trump, but no rational premises “for” or “against” quarantine are given: “And the fact that Trump did not introduce a suffocating quarantine is well done!”

The created dataset was also divided into three sets (training, validation, and test) and was used for the RuArg-2022 shared task, which attracted 16 participants (footnote 8).

It is important to stress that the described thematic analysis of arguments in the reproductive-related and health-related comments followed the same approach as the thematic analysis of stances. In the health-related dataset, we annotated relevant comments (comments with arguments) with the “for”, “against”, and “other” marks. The context of these arguments is not clear to us in this case. In the reproductive-related dataset, we instead annotated an argument-type topic (“individualism”) with the “for”, “against”, and “other” marks. The numerical link between anti-natalist attitudes (“childfree”) and the argument-type topic (“individualism”) is not visible to us in that case.

Appendix.3 argument analysis method: search for an automatic general algorithm for demographic studies

The previous section describes our methods for studying the reproductive (and health) behavior of Russians based on the opinions of users of the largest social network in Russia, VKontakte, using natural language processing methods (supervised machine learning). Our next step is to identify the arguments in the statements of VKontakte users on reproductive behavior (or any demographic behavior).

We have proposed a dichotomous “personal-public” approach for such a broad classification of arguments. Arguments in which the author of the statement appeals to their own experience, personal preferences, or attitudes are considered personal. Public arguments are those in which the authors reflect on how the family should be arranged in society as a whole. This division was chosen because the ratio of personal to public arguments plays an important role in understanding the “acuteness” of public sentiment and the need for policy-makers to respond to an alarm signal if the proportion of personal arguments increases significantly. Thus, in addition to identifying arguments, a second level of classification was automatically carried out, attributing each argument to the “personal” or “public” type (for more details, see Kalabikhina et al. [66, 67]).

In demography, the “personal-public” dichotomy is not a new view. A well-known example of its use comes from fertility surveys: “What is the ideal number of children in your opinion?” (the public side); “How many children do you plan to have based on your capabilities?” (the personal side). Examples of comments with arguments in the categories “personal” and “public” in our study of arguments: “I have three children. They make a mess better than without children. There is no meaning to life without children” (the personal side); “In countries with a ban on abortion, the crime rate goes off scale” (the public side).

The total volume of the final sample for this stage of our investigation was 5412 comments. Classification according to two criteria, the presence and the type of argument, was chosen as the main method. The comments were marked up by experts according to the following principle: whether there is argumentation or not (footnote 9); if an argument is present, then it is determined whether it is personal or public. In different experiments, different classification options were used: only by the presence of an argument, by the presence of arguments of both types (personal and public), or by the presence of a specific type of argument.
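The two-level markup and the classification options derived from it can be sketched as follows. This is one possible reading of the three options, not the authors' code: the field name `argument_type`, the label strings, and the choice to compute the share of personal arguments among comments that contain an argument are all assumptions, and the comments are toy data.

```python
def presence_label(comment):
    """Option 1: classify only by the presence of an argument."""
    return "argument" if comment["argument_type"] is not None else "no_argument"


def three_class_label(comment):
    """Option 2: no argument, personal argument, or public argument."""
    return comment["argument_type"] or "no_argument"


def specific_type_label(comment, target):
    """Option 3: presence of one specific argument type versus everything else."""
    return target if comment["argument_type"] == target else "other"


def personal_share(comments):
    """Share of personal arguments among comments that contain an argument
    (the choice of denominator is an assumption)."""
    types = [c["argument_type"] for c in comments if c["argument_type"] is not None]
    return types.count("personal") / len(types) if types else 0.0


# Toy annotated comments: argument_type is None, "personal", or "public".
toy_comments = [
    {"argument_type": None},
    {"argument_type": "personal"},
    {"argument_type": "public"},
    {"argument_type": "public"},
    {"argument_type": "personal"},
    {"argument_type": "public"},
]
share = personal_share(toy_comments)
labels_option_1 = [presence_label(c) for c in toy_comments]
labels_option_2 = [three_class_label(c) for c in toy_comments]
labels_option_3 = [specific_type_label(c, "personal") for c in toy_comments]
```

Tracking `personal_share` over successive time windows is one way the monitoring idea from the text could be operationalized; any alert threshold would have to be chosen separately.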

The analysis of the annotated data was carried out using the Python programming language in the PyTorch environment and the Transformers and scikit-learn libraries. The Conversational RuBERT model was used for experiments (footnote 10). See Kalabikhina et al. [66, 67] for other model parameters and more details.

References

  1. 1. Gentzkow M, Kelly B, Taddy M. Text as data. Journal of Economic Literature. 2019;57(3):535-574
  2. 2. Kalpak KK, Arti DK, Dinesh S, Piyush S. A typology of viral ad sharers using sentiment analysis. Journal of Retailing and Consumer Services. 2020;53:101739
  3. 3. Dinesh KS, Fernandes S. Impact of e-WOM on consumer purchase behaviour through twitter sentiment analysis using Vader and machine learning. AIP Conference Proceedings. 2023;2523(1):30012
  4. 4. Karn AL, Karna RK, Kondamudi BR, et al. Customer centric hybrid recommendation system for E-commerce applications by integrating hybrid sentiment analysis. Electronic Commerce Research. 2023;23:279-314
  5. 5. Reis BY, Brownstein JS. Measuring the impact of health policies using internet search patterns: The case of abortion. BMC Public Health. 2010;10:1-5
  6. 6. Billari F, D'Amuri F, Marcucci J. Forecasting births using Google. In: Carma 2016: 1st International Conference on Advanced Research Methods in Analytics. Valencia: Editorial Universitat Politècnica de València; 2016. p. 119
  7. 7. Kalabikhina IE, Abduselimova IA, Arkhangelsky VN, Banin EP, Klimenko GA, Kolotusha AV, et al. Short-term forecasting of demographic trends based on Google trends data. Applied Computer Science. 2020;15(6):91-118. (In Russian)
  8. 8. Bronitsky G, Vakulenko E. Using Google trends for external migration prediction. Demographic Review. 2022;9(3):75-92. DOI: 10.17323/demreview.v9i3.16471 (in Russian)
  9. 9. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature. 2009;457(7232):1012-1014
  10. 10. Smirnov A. Digital traces of the population as a data source on migration flows in the Russian Arctic. Demographic Review. 2022a;9(2):42-64
  11. 11. Kalabikhina I, Kazbekova Z, Klimenko G, Kolotusha A. Demographic regional rankings by media activity on maternal (family) capital. Applied Econometrics. 2022b;67:46-73
  12. 12. Kalabikhina IE, Klimenko GA, Banin EP, Vorobyeva EK, Lameeva AD. Database of digital media publications on maternal (family) capital in Russia in 2006-2019. Population and Economics. 2021d;5(4):1-29
  13. 13. State B, Rodriguez M, Helbing D, Zagheni E. Migration of professionals to the US. In: Social Informatics: 6th International Conference, SocInfo 2014, Barcelona, Spain, November 11-13, 2014. Proceedings. Cham: Springer International Publishing; 2014. pp. 531-543
  14. 14. Mencarini L, Hernández-Farías DI, Lai M, Patti V, Sulis E, Vignoli D. Happy parents’ tweets. Demographic Research. 2019;40:693-724
  15. 15. Vignoli D, Farías DIH, Mencarini L, Lai M, Patti V, Sulis E, et al. Happy parents’ Tweet? An Exploration of 3 Milion Italian Twitter Data. In: 2017 International Population Conference. Cape Town, South Africa: IUSSP; 2017
  16. 16. Hasan KS, Ng V. Stance classification of ideological debates: Data, models, features, and constraints. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing. IJCNLP. Nagoya, Japan: Asian Federation of Natural Language Processing. 2013. pp. 1348-1356
  17. 17. Ntontis E, Hopkins N. Framing a ‘social problem': Emotion in anti-abortion activists' depiction of the abortion debate. British Journal of Social Psychology. 2018;57(3):666-683
  18. 18. Roldán-Robles PR, Umaquinga-Criollo AC, García-Santillán JA, Herrera-Granda ID, García-Santillán ID. A conceptual architecture for content analysis about abortion using the twitter platform. Revista Ibérica de Sistemas e Tecnologias de Informaçao. 2019;E22:363-374
  19. 19. Sharma E, Saha K, Ernala SK, Ghoshal S, De Choudhury M. Analyzing ideological discourse on social media: A case study of the abortion debate. In: Proceedings of the 2017 International Conference of the Computational Social Science Society of the Americas. New York, NY, United States: Association for Computing Machinery; 2017. pp. 1-8
  20. 20. Graells-Garrido E, Baeza-Yates R, Lalmas M. How representative is an abortion debate on twitter? In: Proceedings of the 10th ACM Conference on Web Science. New York, NY, United States: Association for Computing Machinery; 2019. pp. 133-134
  21. 21. LaRoche KJ, Jozkowski KN, Crawford BL, Haus KR. Attitudes of US adults toward using telemedicine to prescribe medication abortion during COVID-19: A mixed methods study. Contraception. 2021;104(1):104-110
  22. 22. Misra A, Oraby S, Tandon S, Ts S, Anand P, Walker M. Summarizing dialogic arguments from social media. arXiv preprint arXiv:1711.00092. 2017
  23. 23. Shah Z, Martin P, Coiera E, Mandl KD, Dunn AG. Modeling spatiotemporal factors associated with sentiment on twitter: Synthesis and suggestions for improving the identification of localized deviations. Journal of Medical Internet Research. 2019;21(5):e12881
  24. 24. Liu S, Li J, Liu J. Leveraging transfer learning to analyze opinions, attitudes, and behavioural intentions toward COVID-19 vaccines: Social media content and temporal analysis. Journal of Medical Internet Research. 2021;23(8):302-351
  25. 25. Mandel B, Culotta A, Boulahanis J, Stark D, Lewis B, Rodrigue J. A demographic analysis of online sentiment during hurricane Irene. In: Proceedings of the Second Workshop on Language in Social Media. 2012. pp. 27-36
  26. 26. Talpada H, Halgamuge MN, Tran Q , Vinh N. An analysis on use of deep learning and lexical-semantic based sentiment analysis method on twitter data to understand the demographic trend of telemedicine. In: 2019 11th International Conference on Knowledge and Systems Engineering (KSE), Da Nang, Vietnam: IEEE; 2019. pp. 1-9
  27. 27. Glandt K et al. Stance detection in COVID-19 tweets. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint 46 Conference on Natural Language Processing. Vol. 1: Long Papers. Stroudsburg, PA, USA: Association for Computational Linguistics; 2021. pp. 1596-1611
  28. 28. Liu S, Liu J. Public attitudes toward COVID-19 vaccines on English-language twitter: A sentiment analysis. Vaccine. 2021;39(39):5499-5505
  29. 29. Miao L, Last M, Litvak M. Twitter data augmentation for monitoring public opinion on COVID-19 intervention measures. In: Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020. Association for Computational Linguistics; 2020
  30. 30. Abosedra S, Laopodis NT, Fakih A. Dynamics and asymmetries between consumer sentiment and consumption in pre-and during-COVID-19 time: Evidence from the US. The Journal of Economic Asymmetries. 2021;24:e00227
  31. 31. Huerta DT, Hawkins JB, Brownstein JS, Hswen Y. Exploring discussions of health and risk and public sentiment in Massachusetts during COVID-19 pandemic mandate implementation: A twitter analysis. SSM-Population Health. 2021;15:100851
  32. 32. Alamoodi AH, Zaidan BB, Zaidan AA, Albahri OS, Mohammed KI, Malik RQ , et al. Sentiment analysis and its applications in fighting COVID-19 and infectious diseases: A systematic review. Expert Systems with Applications. 2021;167:114155
  33. 33. Andalibi N, Haimson OL, De Choudhury M, Forte A. Understanding social media disclosures of sexual abuse through the lenses of support seeking and anonymity. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. New York, NY, United States: Association for Computing Machinery; 2016. pp. 3906-3918
  34. 34. Al-Rawi A, Grepin K, Li X, Morgan R, Wenham C, Smith J. Investigating public discourses around gender and COVID-19: A social media analysis of twitter data. Journal of Healthcare Informatics Research. 2021;5(3):249-269
  35. 35. Xue J, Macropol K, Jia Y, Zhu T, Gelles RJ. Harnessing big data for social justice: An exploration of violence against women-related conversations on Twitter. Human Behavior and Emerging Technologies. 2019;1(3):269-279
  36. 36. Mittos A, Zannettou S, Blackburn J, Cristofaro ED. Analyzing genetic testing discourse on the web through the lens of twitter, reddit, and 4chan. ACM Transactions on the Web (TWEB). 2020;14(4):1-38
  37. 37. Cesare N, Lee H, McCormick T, Spiro E, Zagheni E. Promises and pitfalls of using digital traces for demographic research. Demography. 2018;55(5):1979-1999
  38. 38. Stewart I, Flores RD, Riffe T, Weber I, Zagheni E. Rock, rap, or reggaeton?: Assessing Mexican immigrants' cultural assimilation using Facebook data. In: The World Wide Web Conference. New York, NY, United States: Association for Computing Machinery; 2019. pp. 3258-3264
  39. 39. Pötzschke S, Braun M. Migrant sampling using Facebook advertisements: A case study of polish migrants in four European countries. Social Science Computer Review. 2016;35(5):633-653
  40. 40. Dvoynikova AA, Karpov AA. Analytical review of approaches to Russian text sentiment recognition. Information and Control Systems. 2020;4:20-30. (In Russian)
  41. 41. Kuratov Y, Arkhipov M. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv preprint arXiv:1905.07213. 2019
  42. 42. Smetanin S. The applications of sentiment analysis for Russian language texts: Current challenges and future perspectives. IEEE Access. 2020;8:110693-110719
  43. 43. Vychegzhanin SV, Kotelnikov EV. Stance detection based on ensembles of classifiers. Programming and Computer Software. 2019;45(5):228-240
  44. 44. Panicheva P, Mararitsa L, Sorokin S, et al. Predicting subjective well-being in a high-risk sample of Russian mental health app users. EPJ Data Science. 2022;11:21
  45. 45. Melton CA, Olusanya OA, Ammar N, Shaban-Nejad A. Public sentiment analysis and topic modeling regarding COVID-19 vaccines on the reddit social media platform: A call to action for strengthening vaccine confidence. Journal of Infection and Public Health. 2021;14(10):1505-1512
  46. 46. Wawrzuta D, Jaworski M, Gotlib J, Panczyk M. What arguments against COVID-19 vaccines run on facebook in Poland: Content analysis of comments. Vaccine. 2021;9(5):481-492
  47. 47. Wawrzuta D, Klejdysz J, Jaworski M, Gotlib J, Panczyk M. Attitudes toward COVID-19 vaccination on social media: A cross-platform analysis. Vaccine. 2022;10(8):1190
  48. 48. Karami A, Anderson M. Social media and COVID-19: Characterizing anti-quarantine comments on twitter. Proceedings of the Association for Information Science and Technology. 2020;57(1):349-353
  49. 49. Han X, Wang J, Zhang M, Wang X. Using social media to mine and analyze public opinion related to COVID-19 in China. International Journal of Environmental Research and Public Health. 2020;17(8):2788
  50. 50. Oyebode O, Ndulue C, Adib A, Mulchandani D, Suruliraj B, Orji FA, et al. Health, psychosocial, and social issues emanating from the COVID-19 pandemic based on social media comments: Text mining and thematic analysis approach. JMIR Medical Informatics. 2021;9(4):227-234
  51. 51. Donchenko D, Ovchar N, Sadovnikova N, Parygin D, Shabalina O, Ather D. Analysis of comments of users of social networks to assess the level of social tension. Procedia Computer Science. 2017;119:359-367
  52. 52. Sidorov N, Slastnikov S. Some features of sentiment analysis for Russian language posts and comments from social networks. Journal of Physics: Conference Series. IOP Publishing. 2021;1740(1):12-36
  53. 53. Smetanin S, Komarov M. Share of toxic comments among different topics: The case of Russian social networks. In: 2021 IEEE 23rd Conference on Business Informatics (CBI). Vol. 2. Bolzano, Italy: IEEE; 2021. pp. 65-70
  54. 54. Hopkins N, Zeedyk S, Raitt F. Visualising abortion: Emotion discourse and fetal imagery in a contemporary abortion debate. Social Science & Medicine. 2005;61(2):393-403
  55. 55. Kalabikhina IE, Banin EP. Database “pro-family (pro-natalist) communities in the social network VKontakte”. Population and Economics. 2020;4:98
  56. 56. Kalabikhina IE, Banin EP. Database “childfree (anti-natalist) communities in the social network VKontakte”. Population and Economics. 2021;5(2):92-96
  57. 57. Kalabikhina IE, Loukachevitch NV, Banin EP, Alibaeva KV, Rebrey SM. Automatic extraction of opinions of users of social networks on reproductive behaviour issues [dataset]. Zenodo. 2021b. DOI: 10.5281/zenodo.5561126
  58. 58. Chkhartishvili A, Gubanov D, Kozitsin I. Covid-19 information consumption and dissemination: A study of online social network VKontakte. In: 2021 14th International Conference Management of Large-Scale System Development (MLSD). Moscow, Russian Federation: IEEE; 2021. pp. 1-5
  59. 59. Kotelnikov E, Loukachevitch N, Nikishina I, Panchenko A. RuArg-2022: Argument mining evaluation. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2022”. Moscow: Dialogue; 2022
  60. 60. Kalabikhina IE, Banin EP, Abduselimova IA, Klimenko GA, Kolotusha AV. The measurement of demographic temperature using the sentiment analysis of data from the social network VKontakte. Mathematics. 2021c;9(9):987
  61. 61. Taj MN, Girisha GS. Insights of strength and weakness of evolving methodologies of sentiment analysis. Global Transitions Proceedings. 2021;2(2):157-162
  62. 62. Kalabikhina IE, Loukachevitch NV, Banin EP, Alibaeva KV, Rebrey SM. Automatic extraction of opinions of users of social networks on reproductive behaviour. Software Systems: Theory and Applications. 2021a;12(51):33-63. (In Russian)
  63. 63. Kalabikhina IE, Loukachevitch NV, Banin EP, Alibaeva KV. Automatic analysis of reproductive values of VKontakte network users. Intelligent Systems. Theory and Applications. 2022a;26(1):90-96. (In Russian)
  64. Alibaeva K, Loukachevitch N. Analyzing COVID-related stance and arguments using BERT-based natural language inference. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2022”. Moscow: Dialogue; 2022
  65. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018
  66. Kalabikhina I, Zubova E, Loukachevitch N, Kazbekova Z, Kolotusha A, Banin E, et al. Arguments on reproductive behaviour of users of social network by natural language processing method. Population and Economics. 2023a;7(2):40-59
  67. Kalabikhina IE, Kazbekova ZG, Banin EP, Klimenko GA. Demographic values and socio-demographic profile of VKontakte users: Is there a connection? In: Moscow University Bulletin. Series 6: Economy. 3. 2023b. pp. 157-180. (In Russian)
  68. Golder SA, Macy MW. Digital footprints: Opportunities and challenges for online social research. Annual Review of Sociology. 2014;40:129-152
  69. Lazer D, Radford J. Data ex machina: Introduction to big data. Annual Review of Sociology. 2017;43:19-39
  70. Loukachevitch N. Automatic Sentiment Analysis of Texts: The Case of Russian. In: Gritsenko D, Wijermars M, Kopotev M, editors. The Palgrave Handbook of Digital Russia Studies. Cham: Palgrave Macmillan; DOI: 10.1007/978-3-030-42855-6_28
  71. Rusnachenko N, Loukachevitch NV. Extracting sentiment attitudes from analytical texts. In: Computational Linguistics and Intelligent Technologies: Proceedings of the International Conference “Dialogue 2018”. May 30–June 2, 2018. Moscow: Lomonosov Moscow State University; 2018. pp. 459-468
  72. Hughes C, Zagheni E, Abel G, Wiśniowski A, Sorichetta A, Weber I, et al. Inferring Migrations: Traditional Methods and New Approaches Based on Mobile Phone, Social Media, and Other Big Data. Luxembourg: Publications Office of the European Union; 2016
  73. Alburez-Gutierrez D, Zagheni E, Aref S, Gil-Clavel S, Grow A, Negraia DV. Demography in the Digital Era: New Data Sources for Population Research. SocArXiv; 2019. DOI: 10.31235/osf.io/24jp7
  74. Liu B. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies. 2012;5(1):1-167
  75. Pozzi FA, Fersini E, Messina E, Liu B. Challenges of sentiment analysis in social networks: An overview. Sentiment Analysis in Social Networks. 2017:1-11. DOI: 10.1016/B978-0-12-804412-4.00001-2
  76. Sagredos C, Nikolova E. ‘Slut I hate you’: A critical discourse analysis of gendered conflict on YouTube. Journal of Language Aggression and Conflict. 2022;10(1):169-196
  77. Ehret K, Taboada M. Are online news comments like face-to-face conversation?: A multi-dimensional analysis of an emerging register. Register Studies. 2020;2(1):1-36
  78. Castellano Parra O, Meso Ayerdi K, Pena Fernandez S. Behind the Comments Section: The Ethics of Digital Native News Discussions. 2020
  79. Loukachevitch N, Rubtsova Y. Entity-oriented sentiment analysis of tweets: Results and problems. In: Text, Speech, and Dialogue: 18th International Conference, TSD 2015, Pilsen, Czech Republic, September 14-17, 2015, Proceedings 18. Pilsen, Czech Republic: Springer International Publishing; 2015. pp. 551-559
  80. Sun C, Huang L, Qiu X. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. arXiv preprint arXiv:1903.09588. 2019
  81. Nugamanov E, Loukachevitch N, Dobrov B. Extracting sentiments towards COVID-19 aspects. In: Supplementary 23rd International Conference on Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2021; 2021. pp. 299-312

Notes

  • Websites: https://trends.google.ru/, https://wordstat.yandex.ru/.
  • Website: https://vk.com/.
  • Official website of the project: tensorflow.org.
  • Official website of the project: tflearn.org.
  • Function description: https://en.wikipedia.org/wiki/Rectifier_(neural_networks).
  • Function description: https://ru.wikipedia.org/wiki/Softmax.
  • Description: http://tflearn.org/layers/estimator/.
  • https://www.dialog-21.ru/en/dialogue-evaluation/competitions/dialogue-evaluation-2022/ruarg-2022/.
  • Note that we also tested a method of finding arguments based on personal pronouns in statements. This method gave worse results than the markup.
  • The model is available from the DeepPavlov repository on Hugging Face: https://huggingface.co/DeepPavlov/rubert-base-cased-conversational.
