
Social Sciences » "Sociolinguistics - Interdisciplinary Perspectives", book edited by Xiaoming Jiang, ISBN 978-953-51-3334-6, Print ISBN 978-953-51-3333-9, Published: July 5, 2017 under CC BY 3.0 license. © The Author(s).

Chapter 7

Time-Series Analysis of Video Comments on Social Media

By Kazuyuki Matsumoto, Hayato Shimizu, Minoru Yoshida and Kenji Kita
DOI: 10.5772/intechopen.68636






In this study, we propose a method to detect rating manipulation caused by multiple comment postings, focusing on time-series analysis of the number of comments. We define videos that obtained many comments through such cheating as ‘unfair videos’ and videos that obtained their comments without cheating as ‘popular videos’. Specifically, the proposed method focuses on the difference in the chronological distribution of comments between popular videos and unfair videos. In the evaluation, the proposed method achieved higher accuracy than the baseline method.

Keywords: video comments, shared videos, comment analysis, time-series analysis

1. Introduction

With the recent proliferation of mobile devices such as smartphones and tablets, a wide variety of videos has become easily accessible on the Internet. Video sharing sites have been utilized for various purposes: to share private activities, to support the public relations activities of companies or to distribute the latest news. Nicovideo [1], one of the most popular video sharing sites in Japan, uses its original system to rank its videos according to the total number of plays, the total number of times the title has been added to the users’ favourites list (called ‘my list’) and the total number of comments annotated to the videos.

Unfortunately, evaluations under this system can be intentionally distorted by dishonest users who want to influence the evaluation of a specific video by using multiple accounts. Such manipulation can dishonestly seek to improve a video’s ranking in an effort to have it attract more attention. That is to say, it could be used to create a bandwagon effect that produces increased viewing. To combat such efforts, a new evaluation method, one that is more effective and goes beyond computing a simple numerical value, is necessary.

In this study, we propose a method to detect rating manipulation caused by dishonest multiple comment postings, focusing on a time-series analysis of the number of comments accumulated by the video. We define videos that receive a substantial number of comments as a result of unfair cheating as unfair videos; videos that appear to be relatively free of such manufactured comments are defined as popular videos. Our proposed detection method focuses on differences in the chronological distributions of comments for the two video classifications: popular videos versus unfair videos.

In Japanese Internet culture, there are terms or expressions specific to an Internet community that appear to be slowly spreading to the broader community of Internet users. For example, on the anonymous message board 2 channel, users often use 2 channel words, many of which have been adopted as general Internet slang. On the video sharing site Nicovideo, viewers can post comments on the videos. Various words related to animation, music or video games are often used in these comments, interpretations of which have appeared in Nikoniko daihyakka. Indeed, Nicovideo has developed an independent culture populated by individuals who use these expressions.

In our study, we pursued the idea that generalization of these expressions beyond their domain is required. As the services and communities involved are liquid, the expressions are constantly increasing and changing. Because of this, analysis of the content of user comments may be necessary, requiring the conversion of the comments into semantic primitive form in order to conduct an analysis of word distributed representation. However, because some domain-specific expressions are neologisms not defined in an existing dictionary, it is not always effective to treat these expressions as semantic primitives.

In the case of distributed representation, it is necessary to train on a sizeable corpus, which can be expensive since the training data must be updated repeatedly to adapt to new expressions. Our proposed unfair video detection method is independent of the culture specific to the video sharing site, as it analyses the quantitative variation of comments in viewers’ feedback to the video. This statistical treatment of language information resembles the analytic approaches used for Internet flaming or Internet trends. In the field of sociolinguistics, there are many studies focusing on language features by unit of community or attribute. However, on the Internet, cultural barriers fall year by year, and expressions imported from various cultures coexist in a disorderly way. In such a situation, an analysis model that uses a domain-specific dictionary or corpus can be effective; however, it is difficult to decide the parameters to be applied for each domain. Therefore, it is worth examining how effective it is to treat language activity simply as numerical values.

2. Related works

2.1. User-centric type content delivery using PageRank algorithm

To improve a video’s quality of experience (QoE), Yoshimura and Kamioka [2] proposed a method that uses a score based on the PageRank algorithm as a video evaluation parameter. The method uses link information such as author information and publicly opened favourites lists (my lists).

This method judges videos with higher relevance as more beneficial. However, both author information and publicly opened my lists, which are used to score a video, share the same problem: the author can manipulate the evaluation of the video by using multiple accounts.

2.2. Detecting videos with social novelty using Nicovideo log data

Hirasawa et al. [3] defined a video that ‘is not yet known socially but would interest many people if it is recommended’ as a video with ‘social novelty’. To discover these videos, the researchers focused on three features:

  • Video content: directly expresses the content of a video based on tags annotated to the video.

  • Comments: used to indicate viewer opinion based on comments posted on the video.

  • Viewing activity: represents users’ viewing actions based on the number of video plays, the number of comments and the number of my list registrations, etc.

The study’s authors proposed a method to estimate social-novelty videos by machine learning, using these three features. They focused on the tag: ‘the video should be evaluated higher’, which is annotated when ‘although the video’s quality is good and interesting, the video has not been played many times and not been widely known’ and evaluated the videos using a feature analysis of the videos with the tag. They confirmed that videos annotated with the ‘should be evaluated higher’ tag were commented on with many positive words. However, because the tags on Nicovideo can be edited by the video’s author himself/herself, it is possible for the author to remove appropriate tags or annotate non-related tags. In addition, because the video author can set the configuration to prevent viewers from changing the tags, a video’s evaluation can become improperly inflated.

An example of the Nicovideo interface is shown in Figure 1. In the video display area, comments posted for a scene in the video scroll from right to left. Comments are reflected and displayed in real time; comments older than a certain period of time, counted from their posting dates/times, are no longer displayed.


Figure 1.

Replay interface GUI of Nicovideo.

2.3. Content quality analysis on social media

Agichtein et al. [4] focused on analysing the quality of User Generated Content (UGC) on social media and proposed a method to automatically analyse the content’s quality by calculating a value defined as a ‘quality score’. In this study, the authors considered the interaction between the content creator and the users and estimated the quality of questions and answers on a question/answer (QA) site. On a QA site, both questions and answers are written in natural language, so both the questioner and the answerer can be content generators.

In contrast, our study deals with videos posted to a video sharing site and considers the comments posted by viewers. While creators post their videos to the site, users (viewers) are able to watch the videos, post comments and add their favourite videos to their my lists. To estimate content quality, the trend of such viewer actions can serve as a useful indicator. However, because it is difficult to improve content quality in the short term, cheating or dishonest manipulation is an ever-present temptation.

3. Comment data posted on video

3.1. Definition of unfair video

The unfair videos dealt with in our study are those which receive a remarkably high number of plays or comments, even though their contents are relatively poor (without any change in the picture displayed or the sound heard, as shown in Figure 2). Although it is not easy to evaluate the quality of a variety of video contents, having a tag on the video to identify whether the video is fair or unfair might allow us to use such a tag for the evaluation of content quality.


Figure 2.

The screen shot of unfair videos (in the 15th rank).

By being (undeservedly) highly ranked, unfair videos receive more attention from viewers than they otherwise would. This increased attention can sometimes result in many viewers submitting negative reports to the site manager, who will, in response, delete the offending video. Therefore, in devising our study, we believed that there would not be very many unfair videos among those in the higher ranks (based on comments and plays). In addition, we chose not to examine unfair videos whose contents violated copyright laws.

3.2. Relevance of popular videos and unfair videos

3.2.1. Transition of the comment numbers after the video was uploaded

We calculated the number of comments posted on both popular videos and unfair videos, using the Nicovideo dataset [5]. Figure 3 shows the ratio of the comments on each day since the video was uploaded. The horizontal axis indicates the elapsed days and the vertical axis indicates the comment rate. The comment rate ($CR_i$) is calculated by Eq. (1).


Figure 3.

Transition of the comment numbers on each day since the video was uploaded.


Here, $cf_i$ and $cf_t$ indicate the number of comments posted i days and t days, respectively, after the day the video was uploaded. N indicates the maximum number of elapsed days in the analysis. In Figure 3, we set the N value to 20. As shown in the graph, the comment rate of the popular videos begins to decrease after 20 days. This seems reasonable as users tend to submit their comments when they view the video for the first time. For this reason, we decided that it would be sufficient to acquire the comment data for the first 7 days after the video is uploaded.
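Reading the definitions above literally, the comment rate can be written as follows. This is a reconstruction under the assumption that the rate normalizes the day-i count by the total over the N-day window; the summation starting at t = 1 is also an assumption.

```latex
CR_i = \frac{cf_i}{\sum_{t=1}^{N} cf_t}
```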

3.2.2. Character string analysis of the video comments

Viewers on the Nicovideo site are able to post their comments using a maximum of 75 characters at any one time. To find basis for judging whether a video has been unfairly evaluated, we analysed videos that received a large number of comments, calculating the length of the character string of the comments posted for both the popular and the unfair videos. The results are shown in Table 1.

Comment length L (characters)    Popular videos    Unfair videos
0 < L ≤ 15                       83.6%             98.4%
15 < L ≤ 30                      14.3%             0.7%
30 < L ≤ 45                      1.6%              0.3%
45 < L ≤ 60                      0.4%              0.4%
60 < L ≤ 75                      0.1%              0.2%
Total number of comments         3,920,203         5,089,784

Table 1.

The rate of comments written with under 75 characters on the popular/unfair videos.

Overall, the comments tended to be written in under 15 characters. As indicated in the table, for the popular videos, over 83% of the comments were written in under 15 characters; for the unfair videos, over 98% of the comments were written in under 15 characters. Table 2 shows the breakdown of these under-15-character comments.

Comment length L (characters)    Popular videos    Unfair videos
0 < L ≤ 5                        35.0%             99.3%
5 < L ≤ 10                       42.2%             0.4%
10 < L ≤ 15                      22.8%             0.3%
Total number of comments         3,277,234         5,010,393

Table 2.

The rate of the comments written with under 15 characters on the popular/unfair videos.

Indeed, many comments were written in under five characters. For the popular videos, approximately 70% of the comments were written in under 10 characters. From these results, it is clear that comments tend to be written in short sentences, making it difficult to judge whether a video is ‘fair’ or ‘unfair’ solely from the meanings of the comments posted for it.
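A length breakdown of this kind can be sketched in a few lines. The function name, the `(lo, hi]` bin convention and the use of Python's `len` (Unicode code points) as the character count are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter


def length_distribution(comments, edges=(0, 15, 30, 45, 60, 75)):
    """Return the share of comments whose length L falls in each (lo, hi] bin.

    Comments with length 0 or above the last edge are ignored.
    """
    bins = list(zip(edges, edges[1:]))
    counts = Counter()
    for text in comments:
        length = len(text)  # length in characters (code points)
        for lo, hi in bins:
            if lo < length <= hi:
                counts[(lo, hi)] += 1
                break
    total = sum(counts.values())
    if total == 0:
        return {}
    return {b: counts[b] / total for b in bins}


if __name__ == "__main__":
    sample = ["www", "8888888888", "this comment is longer than fifteen"]
    for (lo, hi), share in length_distribution(sample).items():
        print(f"{lo} < L <= {hi}: {share:.1%}")
```

Passing `edges=(0, 5, 10, 15)` gives the finer breakdown used for the under-15-character comments.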

3.2.3. Entropy analysis of the comments

As mentioned in the previous section, we attempted to analyse the quality of the comments to demonstrate the difficulty of judging the ‘fairness’ of a video from the meanings of the comment sentences. We used information entropy as an index.

Information entropy [6] is defined in information theory. It measures the average amount of information produced by a source: an event that rarely happens carries more information than a common one, and a distribution spread over many different events has higher entropy than one dominated by a few events. If the entropy value of one set of comments is higher than that of another, it can be inferred that the higher valued set includes more information.

We calculated the information entropy value of the characters, words, parts of speeches and primitive semantics of the words used in the comments and analysed the difference between these values for the popular videos and the unfair videos. The entropy equation of video X is shown in Eq. (2). Pi indicates the appearance probability of the character, word, part of speech or primitive semantics.
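As an illustration, the character-level case can be sketched as below, assuming the standard Shannon form H(X) = -Σ P_i log2 P_i with P_i estimated as a relative frequency. The base-2 logarithm and the function names are assumptions. Word, part-of-speech and primitive-semantics entropy would use the same function on different units, but require a morphological analyser and dictionaries that are not shown here.

```python
import math
from collections import Counter


def shannon_entropy(units):
    """Entropy in bits of the empirical distribution over `units`."""
    counts = Counter(units)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def character_entropy(comments):
    """Entropy of the characters appearing in all comments of one video."""
    return shannon_entropy(ch for text in comments for ch in text)


if __name__ == "__main__":
    repetitive = ["wwww", "wwww", "wwww"]
    varied = ["nice song", "great video", "lol"]
    print(character_entropy(repetitive), character_entropy(varied))
```

A video whose comments repeat the same few characters yields a value near zero, while varied comments yield a higher value.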


To divide the comments into word units, we used MeCab, a Japanese morphological analyser [7]. To judge the primitive semantics, we used four kinds of dictionaries: Goi-Taikei (a Japanese lexicon) [8], Word-Classification Lexicon [9], Japanese WordNet [10] and the EDR Concept Dictionary [11].

As the analysis target, we manually prepared 100 popular and unfair videos, and calculated the entropy of character/word/part of speech/primitive semantics. The average/maximum/minimum entropy values are shown in Table 3.

Category    Avg/Max/Min    No. of comments    Avg. of comment length    Word entropy    Char. entropy    Pos. entropy    Sem. entropy

Table 3.

Analysis result of each kind of entropy.

As can be seen in the table, the entropy values were uneven, depending on the video. However, the entropy values of the unfair videos were lower, on average, than the popular video entropy values. Because the entropy of primitive semantics was calculated using an existing thesaurus, which means that it was unable to handle the latest buzz terms, it is possible that important or meaningful words were ignored. Given these results, we concluded that we can classify unfair/popular videos by using entropy values as a feature.

3.2.4. Relation between the comment posted date and the comment posting scene

Figure 4 shows the distribution of comments for the popular videos. Figure 5 shows the comment distribution for the unfair videos. In each case, the vertical axis indicates the date of the comment’s posting, while the horizontal axis indicates the position of the posted comment.


Figure 4.

Comment distribution of a popular video.


Figure 5.

Comment distribution of an unfair video.

There are clear differences between the two graphs:

  • For the popular videos, the number of comments decreased gradually, day-by-day. On the other hand, for the unfair videos, the number of comments decreased at a certain time of the day. Furthermore, the days with many comments and the days with few comments were distinctively distributed.

  • For the popular videos, the scenes at which the comments were posted showed little variation; on the other hand, for the unfair videos, the scenes at which the comments were posted varied widely.

Given these two tendencies, we based our method for detecting unfair videos on the chronological fluctuation feature of the comments.

4. Proposed method

4.1. Extraction of unfair detection basis

As already mentioned, we concluded that popular videos and unfair videos had distinctive differences in comment posting dates and the degree of variation in the scenes at which the comments were posted.

Based on the information obtained from the videos that we examined, we attempted to determine a method to extract unfair videos from a video database. Below are the steps we followed to determine a basis for identifying the unfair videos (Figure 6):

  1. We prepared training data for unfair/popular videos that received a large number of comments.

  2. We prepared comment data from the training data and extracted the dates when the comments were posted and the scenes where the comments were posted.

  3. We sought to find a basis to judge unfair/popular videos based on the posting dates and posting position.


Figure 6.

Flow of the training phase and test phase to judge unfairness.

4.2. Judgement by time-series data

We attempted to find useful information with which to judge whether a video is popular or unfair by examining the fluctuations in the number of comments on a day-to-day basis. Firstly, we split the video into various segments chronologically. We then counted the number of comments posted in each segment for each day and calculated the correlation coefficient between neighbouring dates.
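The counting step can be sketched as follows. The input format, a list of `(day, position)` pairs with a 1-based day index and the playback position in seconds, is a hypothetical representation chosen for the example, as are the function name and the defaults.

```python
def segment_counts(comments, video_length, n_segments=10, n_days=7):
    """Count comments per (day, segment).

    comments: iterable of (day, position) pairs, where `day` is the number
    of days since upload (1-based) and `position` is the playback time in
    seconds at which the comment appears (hypothetical field layout).
    Returns counts[day - 1][segment].
    """
    counts = [[0] * n_segments for _ in range(n_days)]
    for day, position in comments:
        if not (1 <= day <= n_days) or not (0 <= position <= video_length):
            continue  # outside the observation window or the video
        # Map the playback position to a segment index; the final instant
        # of the video is folded into the last segment.
        n = min(int(position / video_length * n_segments), n_segments - 1)
        counts[day - 1][n] += 1
    return counts


if __name__ == "__main__":
    data = [(1, 3.0), (1, 4.5), (1, 61.0), (2, 3.2), (2, 120.0)]
    for day, row in enumerate(segment_counts(data, video_length=120.0), start=1):
        print(day, row)
```

Each row of the result is the per-segment comment profile of one day, which is the input for the correlation between neighbouring days.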

4.3. Calculation method for correlation coefficient

Using the comment data, we calculated the correlation coefficient $r_{x,x+1}$ for $x = 1, 2, 3, \ldots, 6$, where $x$ is the number of days since the video was uploaded, as detailed below (see Figure 7):

  1. The comment data obtained $x$ days from the date that the video was uploaded was divided into $N$ equal segments according to the length of the video, as shown in Figure 8.

  2. The number of comments, $C_{x,n}$, was calculated for each segment $(x, n)$.

  3. The average of $C_{x,n}$ ($n = 1, 2, 3, \ldots, N$) is defined as $\bar{C}_x$; the correlation coefficient $r_{x,x+1}$ was calculated from Eq. (3).

  4. A threshold value to judge whether the video is popular or unfair was determined for the correlation coefficient for each neighbouring date, up to 7 days from the date the video was uploaded. A threshold is the value of the boundary line between two categories. In this study, we classify videos as either over-threshold or under-threshold by establishing the threshold of the coefficient of correlation.
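Steps 1 to 3 can be sketched as below, assuming that Eq. (3) is the usual Pearson product-moment correlation between the segment counts of day x and day x+1 (the use of the per-day mean in step 3 suggests this, but it is an assumption). Returning `None` when a day's counts are constant is also a choice made for this sketch; the chapter does not say how that case is handled.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None  # a constant row has no defined correlation
    return cov / math.sqrt(var_a * var_b)


def neighbouring_day_correlations(counts):
    """counts[x][n] holds the comments of day x+1 in segment n.

    Returns one coefficient per pair of neighbouring days.
    """
    return [pearson(counts[x], counts[x + 1]) for x in range(len(counts) - 1)]


if __name__ == "__main__":
    counts = [
        [40, 10, 5, 30],  # day 1
        [20, 6, 2, 14],   # day 2: similar shape, fewer comments
        [1, 9, 12, 2],    # day 3: different shape
    ]
    print(neighbouring_day_correlations(counts))
```

With seven days of counts the function returns six coefficients, one for each x = 1, ..., 6.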


Figure 7.

Calculation of correlation coefficient by analysing chronological data of the comment numbers.


Figure 8.

A method to extract comment distribution data chronologically.


Based on the comment data posted for the popular/unfair videos, we plotted the number of comments posted on the xth day on the vertical axis and the (x+1)th day on the horizontal axis, as shown in Figure 9. The colours of each point indicate the elapsed days (1–6). The closer the points lie to the diagonal line from upper right to lower left, the higher the correlation of the number of comments. As shown in the figure, in the case of the unfair video, there was very low correlation, while in the case of the popular video, there was clear positive correlation.


Figure 9.

Similarity of the comment numbers posted on the videos.

We proposed the following two types of judgment methods using features of the maximum/minimum/average of the correlation coefficient $r_{x,x+1}$ ($x = 1, 2, \ldots, 6$) obtained from each video comment set:

  • Calculate the correlation coefficients as described above. From the minimum values of the correlation coefficients, use the maximum value to determine the judgement threshold $T_u$ according to Eq. (4) and use it in judging the unfairness of the video. Eq. (5) shows the calculation of the judgment result $label_i$ for video $i$, where the minimum value of the correlation coefficient of video $i$ is $r^i_{\min}$. Video $i$ is an unfair video when $label_i = 0$; video $i$ is not an unfair video when $label_i = 1$.

    $$label_i = \begin{cases} 0 & \text{if } r^i_{\min} \le T_u \\ 1 & \text{if } r^i_{\min} > T_u \end{cases}$$

  • Convert the maximum/minimum/average values of the comment correlation coefficient of video $i$ in each of the elapsed days areas into the vector $R_i = (r^i_{\max}, r^i_{\min}, r^i_{\mathrm{avg}})$; use a support vector machine (SVM) [12] to classify popular and unfair videos. SVM is a machine learning algorithm that calculates the hyperplane dividing a feature space for binary classification. It generalizes well because it selects the hyperplane by maximum-margin learning, and it can treat non-linear relationships among the features through the kernel method. SVM is effective in high-dimensional spaces; in general, the more training data are available, the lower the bias and the better the chance that a good classification model can be constructed. This method is often used for pattern recognition or text classification.
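The two feature constructions above can be sketched as follows. The function names are hypothetical, the threshold value is passed in rather than derived (Eq. (4) is not reproduced here), and treating the boundary case where the minimum equals the threshold as unfair is an assumption. Training the SVM on the three-dimensional vectors is not shown.

```python
def correlation_features(rs):
    """Return (r_max, r_min, r_avg) for one video's neighbouring-day correlations."""
    return max(rs), min(rs), sum(rs) / len(rs)


def label_by_threshold(rs, t_u):
    """Return 0 (unfair) if the minimum correlation is at or below t_u, else 1.

    The 0/1 convention follows the text: label 0 marks an unfair video.
    """
    return 0 if min(rs) <= t_u else 1


if __name__ == "__main__":
    steady = [0.99, 0.98, 0.97, 0.95, 0.96, 0.94]   # stable comment profile
    erratic = [0.91, 0.42, 0.95, 0.88, 0.60, 0.93]  # profile changes between days
    for rs in (steady, erratic):
        print(correlation_features(rs), label_by_threshold(rs, t_u=0.5))
```

Using only the minimum makes the rule sensitive to a single day on which the comment profile breaks from the previous day, which is the pattern the method associates with manipulation.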

5. Experiment

5.1. Experimental data

The number of videos included in the experiment is shown in Table 4. The average number of comments posted to the videos is shown in Table 5.

                 Training data    Test data
Popular video    100              65
Unfair video     100              65

Table 4.

The number of experimental data.

                 Average number of comments
Popular video    18801
Unfair video     40573

Table 5.

The average number of comments.

5.2. Selection basis of unfair video data

The following conditions were used to select unfair videos for the experiment:

Condition 1. The number of comments is larger than the number of plays, and the number of comments is over 10,000.

Condition 2. Satisfying Condition 1, the number of the viewers who registered the video in my list is less than 1000.

Condition 3. Despite the fact that the video does not contain a motion picture or any sound, a large number of comments are posted.

Condition 4. Although the video shows only one still picture and the same sound is repeated, a large number of comments are posted.

When the number of comments is larger than the number of plays, more than one comment was posted per play on average, so it is highly likely that viewers posted multiple comments for one play; therefore, we set Condition 1. Since it is difficult to increase the number of my list registrations without at least one viewing, we set the threshold for the number of my list registrations at a relatively small value.
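Conditions 1 and 2 depend only on counters and can be written as a simple predicate. The record type and field names below are hypothetical, and how the four conditions were combined during selection is not stated in the text, so this sketch covers only the numeric part. Conditions 3 and 4 require inspecting the video content and are not modelled.

```python
from dataclasses import dataclass


@dataclass
class VideoStats:
    """Hypothetical per-video counters."""
    plays: int
    comments: int
    mylist_registrations: int


def meets_conditions_1_and_2(v, min_comments=10_000, max_mylist=1_000):
    """Check the two numeric selection conditions for candidate unfair videos."""
    # Condition 1: more comments than plays, and more than 10,000 comments.
    condition_1 = v.comments > v.plays and v.comments > min_comments
    # Condition 2: Condition 1 holds and fewer than 1,000 my list registrations.
    return condition_1 and v.mylist_registrations < max_mylist


if __name__ == "__main__":
    print(meets_conditions_1_and_2(VideoStats(5_000, 40_000, 12)))
    print(meets_conditions_1_and_2(VideoStats(50_000, 40_000, 12)))
```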

5.3. Selection basis of popular videos

Popular videos were defined as videos previously ranked in the top 50 on Nicovideo’s official site [1].

5.4. Baseline method: comment contents based method

As our baseline method, we chose a method to binary-classify a video as either popular or unfair from the contents of the comments. Existing studies on the document classification task often use a method that calculates word weights as a score based on term frequency or global document frequency, using them as a bag-of-words feature value in machine learning.

Ebata et al. [13] calculated the importance value of a video using term frequency-inverse document frequency (tf-idf). Our study also uses the morpheme as the minimum unit to express a comment’s content and uses tf-idf [14] as the feature value. This measure considers two word characteristics. First, a word that appears with high frequency in a document (a high TF word) is important to that document. Second, a word that appears in only a few documents is strongly related to the topic of those documents, whereas it is often difficult to distinguish a document’s topic by words which appear in many documents.

For our baseline method, we applied a machine learning method that binary-classified popular/unfair videos by training feature using a support vector machine (SVM). Because general document classification tasks often use the number of words of a particular kind as a dimension and importance value, we used the tf-idf-based machine learning method as our baseline method. The calculation of tf-idf follows Eq. (6) through Eq. (9).

Here, $d$ indicates the set of all comments posted to a video; $t$ indicates the words included in comment set $d$. The appearance frequency of word $t$ in document $d$ is expressed as $n_{t,d}$; $N$ indicates the total number of videos in the training data and $df(t)$ indicates the number of documents including word $t$. The feature vector $m_d$ for each video is calculated based on the tf-idf value $w(t,d)$. The feature vector $m_d$ is trained by SVM.

$m_d = (w(t_1,d), w(t_2,d), \ldots, w(t_N,d))$
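A minimal sketch of building such feature vectors follows. It assumes the common textbook weighting, relative term frequency multiplied by log(N / df(t)), which may differ in detail from the chapter's equations. Tokenization is assumed to have been done beforehand (for Japanese comments this would need a morphological analyser), and the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build tf-idf vectors for a list of tokenized documents.

    documents: list of token lists, one per video (all of its comments).
    Returns (vocabulary, vectors), where vectors[d][k] is the weight of
    vocabulary[k] in document d.
    """
    n_docs = len(documents)
    term_counts = [Counter(doc) for doc in documents]
    # df(t): number of documents that contain term t at least once
    df = Counter(t for counts in term_counts for t in counts)
    vocabulary = sorted(df)
    vectors = []
    for counts in term_counts:
        total = sum(counts.values())
        vector = []
        for t in vocabulary:
            tf = counts[t] / total if total else 0.0
            idf = math.log(n_docs / df[t])
            vector.append(tf * idf)
        vectors.append(vector)
    return vocabulary, vectors


if __name__ == "__main__":
    docs = [["w", "w", "nice"], ["w", "song"]]
    vocab, vecs = tfidf_vectors(docs)
    print(vocab)
    print(vecs)
```

A token that occurs in every document, such as "w" in the usage example, receives weight zero, which illustrates why very common short comments contribute little to this feature.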

To produce another baseline method using comment contents, we applied a method in which the feature quantity is obtained by using comment entropy analysis. Because the scales of the comment entropy values differ depending on the basis units (character/word/part of speech/primitive semantics), we used the feature quantity scaled in advance.

5.5. Preliminary experiment

We conducted a preliminary experiment to examine the judgment method using a correlation coefficient threshold. We calculated correlation coefficients from the training data shown in Table 4, then calculated the average value, the maximum average value and the minimum average value based on the correlation coefficients, as shown in Table 6.

Video type       Average    Averaged maximum value    Averaged minimum value
Popular video    0.979      0.993                     0.848
Unfair video     0.778      0.972                     0.496

Table 6.

Average correlation coefficient.

As the table indicates, there were substantial differences between the minimum values of the popular videos and those of the unfair videos. Table 7 shows the judgement result of the test data using the minimum value as the feature.

Method             Threshold    Detection rate    False detection rate
Proposed method    0.5          84.8%             0%
tf-idf + SVM       -            75.8%             1.0%

Table 7.

Unfair detection rate and false detection rate.

From this result, it was found that in the range of threshold values of 0.6–0.7, the unfair videos were detected in less than 10% of the cases. On the other hand, the popular videos were detected in over 90% of the cases. We also found that the proposed method using the correlation coefficients achieved a higher unfair detection rate than the comment content–based judgment method.

5.6. Evaluation experiment based on cross-validation

We used the 100 popular/unfair videos from the preliminary experiment and evaluated the following four methods:

  • Baseline method

    • tf-idf + SVM

    • Entropy feature + SVM

  • Proposed method

    • Method based on the threshold of the time-series correlation coefficient

    • Machine learning method using the maximum/minimum/average value of correlation coefficient as feature

The results of the leave-one-out cross-validation test are shown in Table 8. We used the Radial Basis Function (RBF) kernel as the SVM kernel. In the proposed method, the number of video segments used to calculate the correlation coefficient was set at 10.

Method                                        Unfair video detection rate    False detection rate
Baseline method    Tf-idf + SVM               0.726                          0.025
                   Entropy + SVM              0.970                          0.030
Proposed method    Coefficient + threshold    0.991                          0.015
                   Coefficient + SVM          0.890                          0.070

Table 8.

The result of cross validation test of unfair video detection.

We found that the unfair video detection rate for the baseline method was 72.3%, while the rate for the proposed method based on the correlation coefficient was approximately 99%. However, in the case of the baseline method, the false detection rate of popular videos was only 0.5%, while the false detection rate for the proposed method was 1.5%. This result confirmed that our proposed method was effective for unfair video detection.
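The two rates can be computed from true and predicted labels as sketched below. The definitions used are assumptions, since the chapter does not state them explicitly: the detection rate is taken as the share of unfair videos judged unfair, and the false detection rate as the share of popular videos judged unfair. Label 0 denotes an unfair video, following the convention of the threshold method.

```python
def detection_rates(true_labels, predicted_labels, unfair=0):
    """Return (unfair detection rate, false detection rate).

    Either value is None when the corresponding class is absent.
    """
    pairs = list(zip(true_labels, predicted_labels))
    unfair_total = sum(1 for t, _ in pairs if t == unfair)
    popular_total = len(pairs) - unfair_total
    detected = sum(1 for t, p in pairs if t == unfair and p == unfair)
    false_detected = sum(1 for t, p in pairs if t != unfair and p == unfair)
    detection = detected / unfair_total if unfair_total else None
    false_detection = false_detected / popular_total if popular_total else None
    return detection, false_detection


if __name__ == "__main__":
    truth = [0, 0, 0, 0, 1, 1, 1, 1]
    predicted = [0, 0, 0, 1, 1, 1, 1, 0]
    print(detection_rates(truth, predicted))
```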

6. Discussions

In the evaluation experiment using the cross-validation test, the baseline method (tf-idf + SVM) had an average detection rate of 72.3%, which was not particularly high. This might be caused by the fact that there were many comments in which the character strings were short and their meanings were nonsense. To improve the comment contents–based method, we would need a way to extract for analysis only those comments that make sense according to a certain standard.

On the other hand, the performance of the judgment method using the entropy feature was equivalent to the performance of the method using the correlation coefficient vector. This result indicated that the entropy feature was able to express the features of the comments on the unfair/popular videos without using time-series information.

In the preliminary experiment, which examined the judgment method based on the correlation coefficients, when the threshold of the correlation coefficient was set at 0.60, only 4.2% of the unfair videos were detected. To illustrate the problem, in cases where not-so-high and not-so-low correlation coefficient values were continuously calculated, when all the values were larger than 0.60 but smaller than 0.65, the video was suspected of being unfair, but the proposed method was unable to identify them as such.

Figure 10 shows additional experimental results for the SVM-based judgment method that used the maximum/minimum/average correlation coefficient as the feature. The value of N (number of segments) was set at 3, 5, 7, 10, 15, 20, 25 and 30. When the number of segments was small, the detection error rate was relatively large. When the number of segments was 20, the detection error rate reached its minimum value. On the other hand, when the segment number was 25, the unfair video detection rate reached its highest value, 0.97. It appears that the larger the segment number, the higher the unfair video detection rate becomes.


Figure 10.

Accuracy of detection and false detection rate.

In all, 1332 unclassified videos were classified into unfair/not unfair by the proposed method using the split number 30. Figure 11 shows the classification results for each category annotated to the video. The results show that some categories were more prone than others to being judged as unfair.


Figure 11.

Unfair/not unfair rate for each video category.

7. Conclusion

We proposed a method to detect unfair videos—videos whose rankings were falsely influenced—among the videos uploaded to a video sharing site. We focused on differences in the comment distribution tendencies between the unfair videos and what we termed popular videos (videos whose rankings were not falsely influenced). The play time of the video was divided into N segments, and the number of comments extracted from each segment was used as a feature. The correlation coefficient between neighbouring elapsed dates was calculated, and the unfair videos were detected by applying a judgment threshold for the minimum correlation coefficient value.

Our proposed method did not focus on the quality of a comment’s content; rather it focused on the specificity of comment postings by unfair (fraudulent) users. In our evaluation experiment, we found that the proposed method was able to detect unfair videos with 90% accuracy by extracting the video feature based on a time-series analysis of comment data, irrespective of the number of comments.

It should be noted that in this study, we targeted unfair videos that were obviously unfair. Therefore, it is not clear that the threshold of the proposed method can work for less obvious unfair videos or in-between videos that are not popular but not unfair, because their patterns of comment posting are likely to be different. In our experiment, we calculated the correlation coefficients from the time-series distribution of comments and determined threshold values to classify the videos. However, the proposed method lacks versatility because the method is specific to the comment distribution of Nicovideo.

It is likely that a more versatile method could be developed by using other features such as the quality of contents or comment content similarities. In future studies, we plan to develop an automatic video evaluation basis that focuses on the degree of the match between the sense of the video and the content of the comments.


This research was partially supported by JSPS KAKENHI Grant Numbers 15K00425, 15K00309, 15K16077.


1 - Nicovideo. Nikoniko Douga [Internet]. Available from:
2 - Yoshimura Y, Kamioka E. PageRank based user-centric content delivery system. IEICE Technical Report. 2012;111(476):7–12
3 - Hirasawa M, Ogawa Y, Suwa H, Ohta T. Finding and evaluating social-novelty videos in NicoNico douga. Journal of Information Processing Society of Japan. 2013;54(1):214–222
4 - Agichtein E, Castillo C, Donato D, Gionis A, Mishne G. Finding high-quality content in social media. In: Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM’08); 2008. pp. 183–194. DOI: 10.1145/1341531.1341557
5 - National Institute of Informatics. Nikoniko douga dataset [Internet]. Available from:
6 - Shannon CE. A mathematical theory of communication. Bell System Technical Journal. 1948;27:379-423 & 623–656
7 - MeCab: Yet Another Part-of-Speech and Morphological Analyzer [Internet]. Available from:
8 - Ikehara S, editor. GoiTaikei: A Japanese Lexicon (CD-ROM). NTT Communication Science Laboratories; 1999, Iwanami Publishers, Japan
9 - Bunrui Goihyo (Word List by Semantic Principles). Revised and Enlarged Edition ed. National Institute for Japanese Language and Linguistics; 2004, Dainippon Tosho Publishing Co., Ltd., Japan
10 - Bond F, Baldwin T, Fothergill R, Uchimoto K. Japanese SemCor: A sense-tagged corpus of Japanese. In: the 6th International Conference of the Global WordNet Association (GWC-2012); 2012; Matsue, Japan
11 - EDR Electronic Dictionary. Japan Electronic Dictionary Research Institute LTD; 1996, National Institute of Information and Communications Technology (NiCT), Japan
12 - Vapnik VN. Statistical Learning Theory. John Wiley & Sons; 1998, USA
13 - Kawamura H, Suzuki K. Interactive suggestion of related videos by analyzing users' comments based on tf-idf scheme. IEICE technical report. Artificial Intelligence and Knowledge-Based Processing. 2010;109(439):7–10
14 - Salton G, Yang CS. On the specification of term values in automatic indexing. Journal of Documentation. 1973;29(4):351–372