Open access peer-reviewed chapter

Classification Model for Bullying Posts Detection

By K. Nalini and L. Jabasheela

Submitted: June 14th 2019; Reviewed: July 17th 2019; Published: December 5th 2019

DOI: 10.5772/intechopen.88633

Abstract

Much current research on social media focuses on analyzing sentiment and opinion, political issues, marketing strategies, and more, and several text mining frameworks have been designed for these applications. Harassment is a form of social aggression, expressed in different forms and behaviors toward an individual or a group, that is intended to harm others. Survey results indicate that 7 out of 10 young people become victims of cyberbullying, and many prominent cases around the world have arisen from abusive communication over the Web. Suitable solutions are therefore needed, and the shortcomings of existing strategies for dealing with cyberbullying incidents must be addressed. The main aim of this work is to design a scheme that alerts users of social networks and protects them from bullying environments. A tweet corpus carries the message text together with metadata such as ID and time. The messages are written informally and vary considerably in dialect, so the raw tweets must pass through a series of filtering steps before feature extraction and frequency extraction. The idea is to regard each tweet as a finite mixture over an underlying set of topics, each of which is described by a distribution over words, and then to analyze tweets through these topic distributions. Intuitively, bullying topics are associated with higher probabilities for bullying words. A set of training tweets containing both bullying and non-bullying texts is required to learn a model that can infer topic distributions from tweets. Topic modeling is used to capture lexical collocation patterns in the offensive content and to generate meaningful topics for the model.

Keywords

  • cyberbullying
  • Twitter
  • LDA
  • SVM
  • TF-IDF

1. Introduction

The proposed methodology is a two-part method. It classifies posts into a “bullying” or “non-bullying” class, and it also uses link analysis to locate the most active users as predators and victims. Each step is explained in detail below. Feature selection is an essential phase in representing data in feature space for the classifiers. Data from social networks is mostly noisy, so pre-processing techniques must be applied to obtain better-quality research data before the subsequent steps. Moreover, sparsity in the feature space increases with the number of documents. The features used here are generated through the B-LDA topic model and the weighted B-TF-IDF scheme. In the initial step, semantic features are used to locate harassing, abusive, and offensive posts. In harassment detection, the presence of pronouns in the harassing post is captured. Three types of features are used in this work: (i) all second-person pronouns (“you,” “yourself,” and so forth) are treated as one term; (ii) all remaining pronouns (“he,” “she,” and so on) are grouped together as another feature; (iii) foul words such as “fr**k,” “shit,” and “moronic,” which make the post abusive, are grouped into another set of features. A new lexicon of harassing words was built from sources such as noswearing.com and Urban Dictionary. The rationale for combining these features is that doing so improves the effectiveness of classifying bullying posts. The classification outcomes are reported in the experiments.
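The three feature groups described above can be illustrated with a minimal Python sketch. The word sets and the function name `pronoun_lexicon_features` are hypothetical placeholders chosen for illustration; the chapter's actual lexicon is built from external word lists and is not reproduced here.

```python
import re

# Hypothetical mini word sets, for illustration only.
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
OTHER_PRONOUNS = {"he", "him", "his", "she", "her", "hers",
                  "they", "them", "their", "it"}
FOUL_WORDS = {"moronic", "stupid", "dumb"}  # placeholder lexicon entries


def pronoun_lexicon_features(post):
    """Count the three feature groups (i)-(iii) for a single post."""
    tokens = re.findall(r"[a-z']+", post.lower())
    return {
        "second_person": sum(t in SECOND_PERSON for t in tokens),
        "other_pronoun": sum(t in OTHER_PRONOUNS for t in tokens),
        "foul_word": sum(t in FOUL_WORDS for t in tokens),
        "n_tokens": len(tokens),
    }


features = pronoun_lexicon_features("You are so dumb, and she knows it")
```

Each group collapses to a single count, which mirrors the idea of treating all second-person pronouns as one term and all other pronouns as another.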

2. Review of literature

Rahat et al. [1] presented a multi-stage cyberbullying detection solution that drastically reduces classification time and raises warning signals. The system is highly scalable without sacrificing accuracy and highly responsive in raising alerts. It comprises a dynamic priority scheduler and an incremental classification procedure, and was evaluated on Vine data sets. The results show that the model improves the scalability of cyberbullying detection compared with a non-priority model and that the system could fully monitor Vine-scale networks. They also indicate that this detection approach is considerably more scalable and responsive than the existing state of the art. Zhong et al. [2] investigated cyberbullying on Instagram, working toward early-warning methods for detecting offensive images.

The study collected a large volume of images, together with their accompanying messages, from the Instagram image-sharing service. The authors studied new features derived from topics obtained from the image descriptions and trained neural network models on the combined images and texts. The results identified potential targets of harassment from the characteristics of the texts and images. Sherly [3] proposed supervised feature selection (SFS) to select features from tweets using a ranking method. An extreme learning machine (ELM) classifier is then applied to perform cyberbullying detection, improving accuracy and reducing execution time. An evaluation of the SFS-ELM model, implemented in MATLAB, found that accuracy improved by 13%. Micheline et al. [4] used an unsupervised approach, the Growing Hierarchical Self-Organizing Map, to identify harassing messages in social networks. The approach uses various features to capture the semantic and syntactic patterns typical of cyberbullies. They conducted several experiments on real-world datasets collected from FormSpring, Twitter, and YouTube. The results show that the model achieves good performance and supports continuous monitoring applications for mitigating the serious problem of harassment. Suchini et al. [5] applied a text classification model to categorize text as insulting or not. Feature selection is performed using the chi-square test, and classification algorithms are then used to separate comments into insulting and non-insulting. Of the algorithms applied (SVM, Naive Bayes, logistic regression, and random forest), SVM gave the best results.

Krishna et al. [6] proposed a model for detecting abusive text and images in social networks. This automated system finds offensive content in messages using a combination of the bag-of-visual-words method, local binary patterns, and an SVM classifier. Offensive text messages are detected with a bag-of-words method and a Naïve Bayes classifier, and a Boolean system is then applied to classify the content. Javier et al. [7] presented automatic methods for identifying sexual predation in chat rooms. They demonstrated that a learning-based approach is a feasible way to tackle this problem and proposed novel sets of features for classifying chat participants as predators or non-predators. They showed that the feature sets used and the relative weighting of the misclassification costs in the SVMs are two key factors that should be considered to improve performance.

Huang et al. [8] combined standard text analysis with social network features to classify bullying on Twitter and found that taking the social relationships between users into account improves classification results. Zhao et al. [9] proposed a set of features called EBoW, a natural language processing method that combines a bag-of-words model with latent semantic analysis and word embeddings computed as word vectors. They also used SVM to classify a collection of tweets containing keywords such as “bully” or “bullying.”

Chen et al. [10] examined existing text mining techniques for detecting offensive content in order to protect adolescents online. In particular, they proposed the Lexical Syntactical Feature (LSF) approach for detecting offensive content on the internet and for predicting a user’s potential to post offensive content. Their study makes several contributions. First, they conceptualize the notion of online offensive content, distinguish the contribution of pejoratives/profanities and obscenities in determining offensive content, and introduce hand-crafted syntactic rules for identifying name-calling harassment. Second, they improve on traditional machine learning methods by using not only lexical features to detect offensive language but also style features, structure features, and content-specific features to better predict a user’s potential to post offensive content in social media. The experimental results show that the LSF sentence offensiveness prediction and user offensiveness estimation algorithms outperform traditional learning-based approaches in terms of precision, recall, and F-score. LSF tolerates informal and misspelled content and can adapt to different English writing styles.

3. The Bully-latent Dirichlet allocation (B-LDA): model design

LDA is a well-known Bayesian multinomial mixture model for text analysis, valued for its ability to produce interpretable and semantically coherent topics. It uses the Dirichlet distribution to model the distribution of topics for each document. In LDA, each word is drawn from a multinomial distribution over words that is specific to its topic. Because LDA is highly modular and hierarchical, it can easily be extended, and various extensions to the basic LDA model have been proposed to incorporate document metadata. The simplest way of integrating metadata into generative topic models is to generate both the words and the metadata simultaneously given hidden topic variables. The Author-Topic (AT) model is a Bayesian network in which each author’s interests are modeled as a mixture of topics [11]. In this model, a set of authors $a_d$ is observed for each document. To generate each word, an author $x$ is chosen uniformly at random from this set, a topic $z$ is then selected from a topic distribution $\theta_x$ that is specific to the author, and finally a word $w$ is generated by sampling from a topic-specific multinomial distribution $\phi_z$.

The proposed Bully-LDA (B-LDA) model is used to identify bullying words used by authors. The model captures bullying topics that occur in social networks such as Twitter, where one person sends tweets to many followers. In this model, the sender is considered the Predator when he or she sends bullying words to followers, and the followers are the Victims. B-LDA is a generative model that captures both the topics and the communication network of Predators and Victims by conditioning the multinomial distribution over bullying topics separately on the Predator and a Victim of a bullying message. Unlike other models, B-LDA treats the predator and the victims separately. The motive of the predator is also included in this representation. Each motive is associated with a set of topics, and these topics may overlap. For example, the categories of motive can be racist, sexual, outrage, and irrelevant. The sexual motive covers topics such as crude language, implicit or ambiguous language, and indecent proposals. The racist category covers more abusive matters such as homophobia, extremism, and slurs. Outrage is a category for reactions that express contempt. Messages that do not contain any form of offensive language are considered irrelevant. Each predator has a multinomial distribution over motives. B-LDA is thus a clustering model in which occurrences of topics are the underlying data, and sets of correlated topics are grouped into clusters that denote motives. Predators and Victims are mapped to motive assignments, and a topic is then selected based on these motives. Each motive of a predator has a multinomial distribution over topics, and every topic has a multinomial distribution over words. Motive assignments are made separately for each word in a document, so the model allows a predator to change motive during an exchange of messages.

The Author-Topic (AT) model [11] has been extended by incorporating a new set of variables: authors in the roles of Predators and Victims, and the motive of an author. In this generative process, a Predator $p_d$ and a set of Victims $v_d$ are observed for each message. To generate each word, a victim $y$ is chosen uniformly at random from $v_d$, and a motive $x$ for the Predator is then chosen from the multinomial motive distribution $\psi_{p_d}$. Next, a topic $z$ is selected from a multinomial topic distribution $\theta_x$ that is specific to the predator motive $x$. Finally, the word $w$ is generated by sampling from a topic-specific multinomial distribution $\phi_z$.

Figure 1 is a schematic diagram of the B-LDA model.

Figure 1.

Graphical model for B-LDA.

The generative procedure of this strategy is as follows:

  1. for every motive m with m = 1, …, M, choose $\psi_m \sim \mathrm{Dir}(\gamma)$

  2. for each predator and victim pair (x, y) with x = 1, …, A and y = 1, …, A, choose $\theta_{x,y} \sim \mathrm{Dir}(\alpha)$

  3. for each topic t with t = 1, …, T, choose $\phi_t \sim \mathrm{Dir}(\beta)$

  4. for each message d

    1. observe motive $m_d$

    2. observe predator $p_d$ and the victims $v_d$

    3. for each word w in d

      1. choose topic $z_{dn} \sim \theta_{z_d}$

      2. choose word $w_{dn} \sim \phi_{z_{dn}}$
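The per-word sampling described in the prose above (a victim drawn uniformly, a motive from the predator's motive distribution, a topic from the motive-specific topic distribution, a word from the topic-specific word distribution) can be sketched with the Python standard library. This is only an illustrative sketch: the toy sizes, the value used for γ, and all function names are assumptions, and the indexing follows the prose description rather than any published implementation.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalising Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_message(predator, victims, n_words, psi, theta, phi, rng):
    """Return one (victim, motive, topic, word) tuple per generated word."""
    tokens = []
    for _ in range(n_words):
        y = rng.choice(victims)               # victim chosen uniformly from v_d
        x = sample_index(psi[predator], rng)  # motive from the predator's motive distribution
        z = sample_index(theta[x], rng)       # topic from the motive-specific distribution
        w = sample_index(phi[z], rng)         # word from the topic-specific distribution
        tokens.append((y, x, z, w))
    return tokens


rng = random.Random(0)
n_predators, n_motives, n_topics, vocab_size = 3, 4, 5, 20  # toy sizes (assumed)
psi = [sample_dirichlet(0.5, n_motives, rng) for _ in range(n_predators)]  # gamma = 0.5 is assumed
theta = [sample_dirichlet(1.0, n_topics, rng) for _ in range(n_motives)]   # alpha = 1
phi = [sample_dirichlet(0.01, vocab_size, rng) for _ in range(n_topics)]   # beta = 0.01
message = generate_message(0, [1, 2], 8, psi, theta, phi, rng)
```

Word and topic identifiers are plain integers here; in practice they would index a vocabulary and a fitted set of topics.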

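For a particular message $d$, given the hyperparameters $\alpha$, $\beta$, and $\gamma$, the predator $p_d$, and the set of victims $v_d$, the joint distribution of an author mixture $\theta$, a motive mixture $\psi$, a topic mixture $\phi$, a set of $N_d$ victims $y_d$, a set of $N_d$ predator motives $x_d$, a set of $N_d$ topics $z_d$, and a set of $N_d$ words $w_d$ is given by

$$p(\theta,\phi,\psi,y_d,x_d,z_d,w_d \mid \alpha,\beta,\gamma,p_d,v_d) = p(\psi\mid\gamma)\,p(\theta\mid\alpha)\,p(\phi\mid\beta)\prod_{n=1}^{N_d} p(y_{dn}\mid v_d)\,p(x_{dn}\mid p_d)\,p(z_{dn}\mid\theta_{x_{dn}})\,p(w_{dn}\mid\phi_{z_{dn}}) \tag{1}$$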

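Integrating over $\psi$, $\theta$, and $\phi$ and summing over $y_d$, $x_d$, and $z_d$, the marginal distribution of a document is calculated as follows:

$$p(w_d \mid \alpha,\beta,\gamma,p_d,v_d) = \iiint p(\psi\mid\gamma)\,p(\theta\mid\alpha)\,p(\phi\mid\beta)\prod_{n=1}^{N_d}\sum_{y_{dn}}\sum_{x_{dn}}\sum_{z_{dn}} p(y_{dn}\mid v_d)\,p(x_{dn}\mid p_d)\,p(z_{dn}\mid\theta_{x_{dn}})\,p(w_{dn}\mid\phi_{z_{dn}})\;d\psi\,d\phi\,d\theta \tag{2}$$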

The probability of the corpus is then the product of the marginal probabilities of the individual documents:

$$p(D \mid \alpha,\beta,\gamma,p,v) = \prod_{d=1}^{D} p(w_d \mid \alpha,\beta,\gamma,p_d,v_d) \tag{3}$$

3.1 Monte Carlo Gibbs sampling

Exact inference in models of the LDA family is intractable. Three standard approximations have been used to obtain practical results: variational methods [12], Gibbs sampling [13], and expectation propagation [14]. Gibbs sampling is applied here because it is easy to implement. This requires deriving a formula for $P(z_i, y_i, x_i \mid z_{-i}, y_{-i}, x_{-i})$, the conditional distribution of the topic, victim, and predator motive for a word $w_i$ given the topic, victim, and motive assignments of all other words, $z_{-i}$, $y_{-i}$, and $x_{-i}$. This in turn requires $P(z, y, x \mid w)$, the posterior distribution of the topic assignments, victim assignments, and predator motives given the words in the corpus.

The calculation begins with $P(w \mid z, y)$, using $P(w \mid z, y, \Phi)$ and then integrating out the unknown $\Phi$ distributions:

$$P(w \mid z, y, \Phi) = \prod_{i_w=1}^{W} \phi_{z_{i_w} w_{i_w}}$$

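Reorganizing the product over the $W$ word tokens in the corpus to group together words that are assigned to the same bullying topic gives

$$P(w \mid z, y, \Phi) = \prod_{z=1}^{T}\prod_{u=1}^{U} \phi_{z w_u}^{\,n_{z w_u}} \tag{4}$$

where $n_{z w_u}$ is the number of times that a bullying word $w_u$ was assigned to bullying topic $z$. The $\phi$ distributions are then integrated out using their Dirichlet priors:

$$P(w \mid z, y) = \prod_{z=1}^{T}\int \frac{\Gamma\left(\sum_{u=1}^{U}\beta_u\right)}{\prod_{u=1}^{U}\Gamma(\beta_u)}\prod_{u=1}^{U}\phi_{z w_u}^{\,n_{z w_u}+\beta_u-1}\,d\phi_z = \prod_{z=1}^{T}\frac{\Gamma\left(\sum_{u=1}^{U}\beta_u\right)}{\prod_{u=1}^{U}\Gamma(\beta_u)}\cdot\frac{\prod_{u=1}^{U}\Gamma(n_{z w_u}+\beta_u)}{\Gamma\left(\sum_{u=1}^{U}\beta_u+\sum_{u=1}^{U}n_{z w_u}\right)} \tag{5}$$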

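In the same manner, $P(z, y)$ is computed using a procedure analogous to that used for $P(w \mid z, y)$. Terms for bullying words assigned to the same topic and predator-victim pair are collected, and the $\Theta$ distributions corresponding to all $P$ predator-victim pairs are integrated out:

$$P(z, y) = \left[\prod_{i_w=1}^{W}\frac{1}{n_{R_{d_{i_w}}}}\right]\prod_{p=1}^{P}\frac{\Gamma\left(\sum_{z}\alpha_z\right)}{\prod_{z=1}^{T}\Gamma(\alpha_z)}\cdot\frac{\prod_{z}\Gamma(n_{pz}+\alpha_z)}{\Gamma\left(\sum_{z}\alpha_z+\sum_{z}n_{pz}\right)} \tag{6}$$

where $n_{R_{d_{i_w}}}$ is the number of victims corresponding to a word in a message.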

Similarly, $P(z, x)$ can be calculated using a procedure analogous to that used for $P(w \mid z, y)$. Collecting the bullying words assigned to the same topic and predator motive gives

$$P(z, x) = \left[\prod_{i_w=1}^{W}\frac{1}{n_{S_{d_{i_w}}}}\right]\prod_{m=1}^{M}\frac{\Gamma\left(\sum_{z}\gamma_z\right)}{\prod_{z=1}^{T}\Gamma(\gamma_z)}\cdot\frac{\prod_{z}\Gamma(n_{mz}+\gamma_z)}{\Gamma\left(\sum_{z}\gamma_z+\sum_{z}n_{mz}\right)} \tag{7}$$

where $n_{S_{d_{i_w}}}$ is the number of predators having bad motivation with respect to the bullying word in a message. An expression for $P(w, z, y, x)$ can be obtained by combining the expressions for $P(w \mid z, y)$, $P(z, y)$, and $P(z, x)$. This can be used to write the posterior distribution of $z$, $y$, and $x$ given the corpus:

$$P(z, y, x \mid w) = \frac{P(w, z, y, x)}{\sum_{z,y,x} P(w, z, y, x)} \tag{8}$$

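The denominator cannot be calculated directly. Instead, MCMC Gibbs sampling is run using the conditional distribution $P(z_i, y_i, x_i, w_i \mid z_{-i}, y_{-i}, x_{-i}, w_{-i})$:

$$
\begin{aligned}
P(z_i, y_i, x_i, w_i \mid z_{-i}, y_{-i}, x_{-i}, w_{-i}) &= \frac{P(z, y, x, w)}{P(z_{-i}, y_{-i}, x_{-i}, w_{-i})} \\
&= \frac{1}{n_R}\cdot
\frac{\dfrac{\Gamma(n_{mt}+\gamma_t)}{\Gamma\left(\sum_z n_{mz}+\sum_z\gamma_z\right)}}{\dfrac{\Gamma(n_{mt}-1+\gamma_t)}{\Gamma\left(\sum_z n_{mz}-1+\sum_z\gamma_z\right)}}\cdot
\frac{\dfrac{\Gamma(n_{pt}+\alpha_t)}{\Gamma\left(\sum_z n_{pz}+\sum_z\alpha_z\right)}\cdot\dfrac{\Gamma(n_{t w_u}+\beta_u)}{\Gamma\left(\sum_u n_{t w_u}+\sum_u\beta_u\right)}}{\dfrac{\Gamma(n_{pt}-1+\alpha_t)}{\Gamma\left(\sum_z n_{pz}-1+\sum_z\alpha_z\right)}\cdot\dfrac{\Gamma(n_{t w_u}-1+\beta_u)}{\Gamma\left(\sum_u n_{t w_u}-1+\sum_u\beta_u\right)}} \\
&= \frac{1}{n_R}\cdot\frac{n_{mt,-i}+\gamma_t}{\sum_z n_{mz,-i}+\sum_z\gamma_z}\cdot\frac{n_{pt,-i}+\alpha_t}{\sum_z n_{pz,-i}+\sum_z\alpha_z}\cdot\frac{n_{t w_u,-i}+\beta_u}{\sum_u n_{t w_u,-i}+\sum_u\beta_u}
\end{aligned}
\tag{9}
$$

where the victim $y$ is part of the Predator-Victim pair $p$, the $-i$ subscript denotes that the counts are taken excluding the assignment of word $i$ itself, and $n_R$ is the number of Victims of the message to which word $i$ belongs.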
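The sampling weight for one candidate assignment can be sketched in Python as a product of three smoothed count ratios (motive-topic, pair-topic, topic-word) divided by the number of victims. The sketch assumes symmetric scalar hyperparameters and counts that already exclude the word being resampled; the function names, the toy counts, and the value of γ are illustrative assumptions rather than the authors' implementation.

```python
def gibbs_weight(n_mt, n_m, n_pt, n_p, n_tw, n_t, n_r,
                 gamma, alpha, beta, n_topics, vocab_size):
    """Unnormalised conditional weight for one (topic, victim, motive) choice.

    n_mt / n_m : words of motive m assigned to topic t / to any topic
    n_pt / n_p : words of predator-victim pair p assigned to topic t / to any topic
    n_tw / n_t : occurrences of this word in topic t / all words in topic t
    n_r        : number of victims of the message containing the word
    All counts exclude the word currently being resampled.
    """
    motive_term = (n_mt + gamma) / (n_m + n_topics * gamma)
    pair_term = (n_pt + alpha) / (n_p + n_topics * alpha)
    word_term = (n_tw + beta) / (n_t + vocab_size * beta)
    return motive_term * pair_term * word_term / n_r


def normalise(weights):
    """Turn unnormalised weights into sampling probabilities."""
    total = sum(weights)
    return [w / total for w in weights]


# Toy counts for two candidate topics (hypothetical numbers).
shared = dict(n_m=10, n_p=6, n_r=2, gamma=0.5, alpha=1.0, beta=0.01,
              n_topics=2, vocab_size=100)
weights = [
    gibbs_weight(n_mt=3, n_pt=2, n_tw=4, n_t=20, **shared),
    gibbs_weight(n_mt=7, n_pt=4, n_tw=0, n_t=35, **shared),
]
probs = normalise(weights)
```

A full sampler would evaluate this weight for every candidate combination of topic, victim, and motive, draw one combination from the normalised probabilities, and then update the counts.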

3.2 Experiments and results

This section discusses the experimental results. The datasets used in these experiments are tweets from Twitter, and the experiments follow the architecture of an automatic cyberbullying detection system. The Twitter stream was searched for tweets containing offensive words in order to filter specifically for tweets related to bullying. In total, more than 100,000 tweets were gathered between Jan 1st, 2015 and Jan 30th, 2016. Only a limited number of tweets match the query, so approximately 300 tweets were retained per day. The statistics for the training and testing corpora are given in Table 1. After preprocessing, tweets were manually labeled as belonging to one of five motives: Sexual, Racist, Outrage, Irrelevant, and Unknown. Examples of harassing comments posted on Twitter are listed below and depicted in Figure 2(a) and (b), and the top extracted bullying words are given in Table 2 (Figures 3–5).

Date | Time | Tweet
01-13-15 | 12:16 | NefarioussNess Do not fuck with people’s hearts
09-18-15 | 11:51 | TittyCityClay it’s always been a self respect thing. Shit like this is stupid as fuck lol
05-13-15 | 10:11 | djkeneechi Nah kiss no one ass to stay in my life anymore im tired of that shit it’s time for me to man up

Item | Training corpus | Testing corpus
Tweets | 31,814,716 | 9,735,537
Retweets | 7,620,335 | 287,567
URLs | 8,545,112 | 476,234
Usernames | 9,702,445 | 1,420,554
Hashtags | 7,985,956 | 356,778

Table 1.

Statistics of training and testing corpus.
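The corpus statistics above count retweets, URLs, usernames, and hashtags, which are also the elements typically filtered out of raw tweets before feature extraction. The following Python sketch shows one way such counting and cleaning might be done; the regular expressions, the function names, and the sample tweet are illustrative assumptions, not the chapter's actual preprocessing pipeline.

```python
import re

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RETWEET_PREFIX_RE = re.compile(r"^RT\s+@\w+:?\s*")


def tweet_stats(text):
    """Count the elements tallied in the corpus statistics for one tweet."""
    return {
        "is_retweet": text.startswith("RT "),
        "urls": len(URL_RE.findall(text)),
        "usernames": len(USER_RE.findall(text)),
        "hashtags": len(HASHTAG_RE.findall(text)),
    }


def clean_tweet(text):
    """Strip retweet prefix, URLs, and usernames; keep hashtag words."""
    text = RETWEET_PREFIX_RE.sub("", text)
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    text = HASHTAG_RE.sub(lambda m: m.group(0)[1:], text)  # drop only the '#'
    return " ".join(text.lower().split())


sample = "RT @someone: Check this http://t.co/abc #NoBullying now"
stats = tweet_stats(sample)
cleaned = clean_tweet(sample)
```

Keeping the hashtag word while dropping the `#` is a design choice made here because hashtags often carry topical vocabulary; a different pipeline might remove them entirely.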

Figure 2.

(a) Bullying words with their probability, and (b) List of bullying words.

Word | Prob | Word | Prob | Word | Prob
Fuck | 0.0798 | Bitch | 0.0705 | Naked | 0.0588
Ass | 0.0767 | Freak | 0.0699 | Sexy | 0.0569
shit | 0.0752 | Fat | 0.0663 | Mood | 0.0547
Gay | 0.0738 | Dirty | 0.0643 | Lick | 0.0519
Dumb | 0.0722 | Bullshit | 0.0621 | Bed | 0.0508
Suck | 0.0711 | Kiss | 0.0604 | Piss | 0.0495

Table 2.

Extracted top bullying words.

Figure 3.

Word cloud for bullying words.

Figure 4.

Number of bullying tweets over time intervals.

Figure 5.

Distributions of tweets per motive.

3.3 Results and discussions

The Bully-Latent Dirichlet Allocation model is a graphical model of the words in a harassing message, given its predator and set of victims. The key extension in B-LDA is that the per-message topic distribution is conditioned jointly on the predator and the individual victims. Every topic has a multinomial distribution over words, and every Predator-Victim pair has a distribution over topics. Marginal distributions over bullying topics conditioned solely on a predator, or solely on a victim, can therefore be computed easily. The experiments used, for example, a corpus comprising 135 persons and 35 k bullying messages, and 5 months of messages sent and received by one predator, comprising 17 victims and 19 k messages. B-LDA discovers highly salient topics and provides evidence that it predicts predators’ motives. In the experiments, the hyperparameters α and β are fixed at 1 and 0.01, respectively. The number of topics T is also fixed at |T| = 5. For a 50-topic solution, the Twitter dataset took 150 hours for 2000 iterations (about 5 min per iteration).

B-LDA establishes the motive of the predator and tracks the predator’s activity with victims using the following steps. First, the proportion of each predator’s contribution to each of the bullying topics is determined. Next, the impact of the predators on the bullying topics over the time intervals is determined. The two user thresholds ε and λ are empirically set to 3.2106 and 2.0457, respectively. From each of the documents, B-LDA generates 5 topics with predators associated with each. The distribution of the different bullying topics from the documents is displayed in Table 3. From the table, predator p1 has a probability of 0.0547 for bullying topic t5. The bullying motive of the predator toward a victim must then be established over specific time intervals within bullying topics. This is characterized as follows. A tweet message is a triplet (a, μ, τ), representing a textual bullying message μ written by the predator a at time τ. A document, denoted by d, is a sequence of bullying messages ordered by τ. Under this definition, time $\tau_d$ is associated with both message $\mu_d$ and predator $a_d$.

MOTIVE = RACISM
TOPIC 5: EXTREMISM | Prob | TOPIC 10: HOMOPHOBIA | Prob | TOPIC 15: VIOLENCE | Prob | TOPIC 20: REF. TO HANDICAPS | Prob | TOPIC 25: SLURS | Prob
Incorrect | 0.0271 | ColdSweat | 0.0265 | Shit | 0.0752 | Fuck | 0.0798 | Pussi | 0.0321
Improper | 0.0242 | Dread | 0.0254 | Bullshit | 0.0621 | Ass | 0.0767 | Dog | 0.0312
Indecent | 0.0231 | Fearful | 0.0235 | Piss | 0.0506 | Dumb | 0.0722 | Filthy | 0.0304
Ineligible | 0.0225 | Horror | 0.0223 | Aggrieve | 0.0254 | Blind | 0.0342 | Crow | 0.0294
Unfit | 0.0214 | Panic | 0.0212 | Tee toe | 0.0232 | Cracy | 0.0212 | Nitchie | 0.0276
Unsuited | 0.021 | Phobia | 0.0203 | Nose | 0.0215 | Daft | 0.0203 | Peckerwood | 0.0253
Room | 0.0197 | Scare | 0.0194 | Gotoofar | 0.0201 | Autism | 0.0167 | Cameljockey | 0.0238
Raffish | 0.0193 | Terror | 0.0187 | Rufflesb’s feathers | 0.0176 | Freak | 0.0154 | Nigger | 0.0221
Square peg | 0.0184 | Alarm | 0.0176 | Aggravate | 0.0154 | Gimpy | 0.0132 | Peckerwood | 0.0213
Unworthy | 0.0173 | Fright | 0.0169 | Burn | 0.0132 | Windowlicker | 0.0121 | Wigger | 0.0201
Predators: Victims | Prob | Predators: Victims | Prob | Predators: Victims | Prob | Predators: Victims | Prob | Predators: Victims | Prob
P1: V1 | 0.0547 | P1: V1 | 0.0341 | P4: V4 | 0.0352 | P3: V5 | 0.0421 | P1: V6 | 0.0284
P2: V2 | 0.0367 | P2: V2 | 0.0288 | P1: V2 | 0.0254 | P2: V6 | 0.0325 | P5: V7 | 0.0257
P3: V3 | 0.0361 | P3: V3 | 0.0254 | P1: V3 | 0.0246 | P1: V3 | 0.0208 | P4: V5 | 0.0236

MOTIVE = SEXUAL
TOPIC 30: CRUDE LANGUAGE | Prob | TOPIC 35: IMPLICIT LANGUAGE | Prob | TOPIC 40: INDECENT PROPOSALS | Prob | TOPIC 45: UNREFINED LANGUAGE | Prob | TOPIC 50: SLANG WORDS | Prob
Gay | 0.0738 | Dirty | 0.0643 | Mood | 0.0547 | Bitch | 0.0705 | Pull | 0.0456
Suck | 0.0711 | Bed | 0.0508 | Lick | 0.0519 | Freak | 0.0699 | Bumpuglies | 0.0423
Naked | 0.0588 | Frequent | 0.0491 | Kiss | 0.0508 | Fat | 0.0663 | Fug | 0.0321
Sexy | 0.0569 | Sleep | 0.0282 | Hangnow | 0.0485 | Happyhappy | 0.0341 | Randy | 0.0307
Kickit | 0.0445 | Kneedeep | 0.0241 | Givebusiness | 0.0465 | Poundduck | 0.0324 | Juicy | 0.0284
FuckforOL’ | 0.0432 | Encounter | 0.0215 | Monkeylove | 0.0328 | Homerun | 0.0307 | Hempedup | 0.0245
Getdown dirty | 0.0421 | Donasty | 0.0208 | Sexytime | 0.0319 | Smack | 0.0284 | Jiffystiffy | 0.0209
Slap | 0.0316 | doublebag | 0.0165 | Intimacy | 0.0206 | Serve | 0.0271 | Ride | 0.0154
Hump | 0.0307 | Giveitup | 0.0154 | Cottage | 0.0191 | Jellosex | 0.0135 | Smush | 0.0124
Screw | 0.0201 | Getlucky | 0.0142 | Raunchy | 0.0147 | Score | 0.0104 | Trim | 0.0107
Predators: Victims | Prob | Predators: Victims | Prob | Predators: Victims | Prob | Predators: Victims | Prob | Predators: Victims | Prob
P1: V1 | 0.0737 | P4: V4 | 0.0541 | P3: V5 | 0.0452 | P1: V6 | 0.0595 | P3: V5 | 0.0354
P2: V2 | 0.0552 | P1: V2 | 0.0428 | P2: V6 | 0.0321 | P5: V7 | 0.0467 | P2: V6 | 0.0241
P3: V3 | 0.0324 | P1: V3 | 0.0367 | P1: V3 | 0.0276 | P4: V5 | 0.0354 | P1: V3 | 0.0211

MOTIVE = OUTRAGE, TOPIC 60 | Prob | MOTIVE = IRRELEVANT, TOPIC 70 | Prob | MOTIVE = UNKNOWN, TOPIC 90 | Prob
ANGER | | Make out | 0.0267 | Outhouse | 0.0246
Bitterness | 0.0365 | Marquee | 0.0235 | Pant | 0.0232
Hard | 0.0354 | Mate | 0.0223 | Pass out | 0.0214
Storm | 0.0321 | Minor | 0.0215 | Patient | 0.0208
Irritation | 0.0306 | Moot | 0.0209 | PC | 0.0179
Wrath | 0.0268 | MP | 0.0201 | Period | 0.0165
Fury | 0.0251 | MUM | 0.0189 | Plant | 0.0152
Resent | 0.0237 | Nappy | 0.0154 | POP | 0.0143
Rancor | 0.0209 | Natter | 0.0142 | Restroom | 0.0137
Grudge | 0.0192 | Nick | 0.0126 | Rider | 0.0129
Flap | 0.0163 | Nonce | 0.0118 | Sick | 0.0109
Predators: Victims | Prob | Predators: Victims | Prob | Predators: Victims | Prob
P1: V7 | 0.0241 | P5: V2 | 0.0207 | P2: V6 | 0.0175
P2: V4 | 0.0219 | P4: V5 | 0.0165 | P5: V8 | 0.0154
P3: V3 | 0.0147 | P1: V3 | 0.0126 | P4: V5 | 0.0132

Table 3.

The distribution for the different bullying topics from the documents.

The contribution of predators over a time interval is evaluated by:

$$F(a_d^t, T_s, T_f) = \begin{cases}\text{active} & \text{if } p(a_d^t, T_s, T_f) > \text{user's threshold and } F(t, T_s, T_f) \text{ is active}\\ \text{not active} & \text{otherwise}\end{cases} \tag{10}$$

A predator is said to be active, with a bullying motive, during the interval $[\tau_s, \tau_f]$ for topic $t$ if the probability of the predator participating in $t$ during that period exceeds the user-specified threshold and $F(t, T_s, T_f)$ is active within that period. The user-specified threshold is calculated as the average of $\vartheta_a^t$ over predators for $t$. To compute $p(a_{i,d}^t, T_s, T_f)$, the contribution of a predator $a_{i,d}^t$ within $[\tau_s, \tau_f]$ is first mapped per time instance $s$ using $P_{a,T_s}^{t} = p(a_{T_s} \mid d_{T_s}) \times p(t_{T_s} \mid d_{T_s})\,p(d_{T_s})$. Next, the total probability for predator $a^t$ during $[\tau_s, \tau_f]$ is calculated as $\sum_{T_s}^{T_f} P_{a,T_s}^{t}$. Figure 6 shows the activity of predators over time. For example, the activity of predators in bullying topic $t_{5,d_5}$ during [15:00, 21:00] can be analyzed as follows. First, the threshold is determined as 0.1770, the average of $\vartheta_a^t$. Then the mapping function is calculated for all predators. Consider, for example, predator $a_5$ and time instance s = 15:00. The mapping function is calculated as $P_{a_5,T_{15:00}}^{t_5} = 0.0547$, and the total probability of $a_5$ is then estimated as $\sum_{T_{15:00}}^{T_{21:00}} P_{a_5,T_s}^{t_5} = 0.2307$. Applying the transition function $F(a_d^t, T_s, T_f)$, the predators $(a_1, a_3)$ are active for bullying topic $t_{5,d_1}$ and the predators $(a_2, a_4, a_5)$ are not active.
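The activity rule can be sketched in Python as a sum of per-instance contributions compared against a threshold. The sketch assumes the threshold is the average of the predators' total contributions, which is one reading of the averaging step described above; the per-instance probabilities are hypothetical and do not reproduce the chapter's data.

```python
def total_contribution(instance_probs):
    """Sum a predator's per-time-instance contributions over the interval."""
    return sum(instance_probs)


def average_threshold(totals):
    """Threshold taken as the average total contribution over predators (assumed)."""
    return sum(totals.values()) / len(totals)


def active_predators(per_instance, topic_active=True, threshold=None):
    """Map each predator to True (active) or False (not active) for one topic."""
    totals = {a: total_contribution(p) for a, p in per_instance.items()}
    if threshold is None:
        threshold = average_threshold(totals)
    return {a: topic_active and total > threshold
            for a, total in totals.items()}


# Hypothetical per-instance probabilities for three predators.
per_instance = {
    "a1": [0.10, 0.12, 0.09],
    "a2": [0.02, 0.03, 0.01],
    "a3": [0.08, 0.07, 0.06],
}
status = active_predators(per_instance)
```

Passing an explicit `threshold` makes it possible to apply a fixed user-specified value instead of the computed average.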

Figure 6.

Predators activity for bullying topic 30.

3.4 Performance evaluation

The perplexity of the model on test documents is used to evaluate its performance. It is a standard measure for assessing probabilistic models and is widely used to check their quality. The trained models are compared by their perplexity on test datasets. The perplexity of a test document with words $w_d$ and predators $p_d$, for $d \in D^{test}$, is defined as the exponential of the negative normalized predictive likelihood under the model:

$$\mathrm{perplexity}(w_d \mid p_d) = \exp\left[-\frac{\ln p(w_d \mid p_d)}{N_d}\right] \tag{11}$$
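As a small worked illustration of the perplexity definition, the Python sketch below computes it from a document's log-likelihood and word count. The function name and the toy probabilities are assumptions for illustration; in practice the log-likelihood would come from the trained model.

```python
import math


def perplexity(log_likelihood, n_words):
    """exp(-ln p(w_d | p_d) / N_d) for one held-out document."""
    return math.exp(-log_likelihood / n_words)


# Toy document: four words, each assigned probability 0.1 by the model.
word_probs = [0.1, 0.1, 0.1, 0.1]
log_likelihood = sum(math.log(p) for p in word_probs)
value = perplexity(log_likelihood, len(word_probs))
```

When every word has the same probability q, the perplexity equals 1/q, which gives an intuitive reading of the score as an effective number of equally likely word choices.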

A lower perplexity on a held-out document indicates better generalization performance. Deriving the likelihood of a set of words given the predators is a straightforward computation in the Bully-LDA model:

$$p(w_d \mid p_d) = \int d\theta \int d\phi\; p(\theta \mid D^{train})\,p(\phi \mid D^{train})\prod_{m=1}^{N_d}\left[\frac{1}{A_d}\sum_{i \in p_d,\,j}\theta_{ij}\,\phi_{w_m j}\right] \tag{12}$$

The term in brackets is simply the probability of the word $w_m$ given the set of predators $p_d$. The detailed results are shown in Figure 7. They indicate that B-LDA generalizes better than ATM and LDA, which can be explained by its ability to model the data better than the LDA and ATM models. If a word has a small probability in the bullying topics of the training documents, it increases the perplexity. As the number of bullying topics increases, the probabilities assigned to words in each bullying topic get smaller. Although ATM models the roles of authors, it does not show promising results, since it was originally designed for the scenario in which each document has multiple authors. B-LDA achieves the best performance among all the models compared. The perplexities of LDA, ATM, and B-LDA are close to one another and decrease steadily as the number of topics increases. Perplexity does not correlate easily with human judgments, so the models are also compared using simple metrics: precision, recall, and the F1 measure. A standard supervised classifier, the Support Vector Machine (SVM), is used with B-LDA for classification. LibSVM was applied to the two-class classification problem using a linear kernel. Each post is an instance; the positive class contains bullying messages and the negative class contains non-bullying messages. A 10-fold cross-validation was performed in which the complete dataset was partitioned into 10 samples; in every round, nine portions were used for training and the remaining portion was used for testing (Figure 8).
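The 10-fold partitioning can be illustrated with a short standard-library Python sketch. It only produces index splits; the shuffling seed and the function name are assumptions, and the classifier training itself (done with LibSVM in the chapter) is left out.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal portions
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10, seed=1))
```

Each item appears in exactly one test portion, so averaging the scores over the ten rounds uses every post once for testing.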

Figure 7.

Comparisons of different models in terms of perplexity.

Figure 8.

Classifier performances based on different feature reduction methods.

The performance of the classifier was evaluated using precision, recall, and the F-1 measure, computed from the top-ranked features produced by the B-LDA method against the ground truth on the test datasets. Precision: the number of correctly identified true bullying posts out of all retrieved bullying cases. Recall: the number of correctly identified bullying cases out of the total number of true bullying cases. F-1 measure: the equally weighted harmonic mean of precision and recall. Table 4 shows the classifier performance.
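The three measures can be written down directly in Python. The function names and the confusion counts below are illustrative; the final line applies the harmonic-mean formula to the precision and recall reported for the DF + SVM row.

```python
def f1_score(precision, recall):
    """Equally weighted harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Compute the three measures from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical counts: 8 bullying posts found, 2 false alarms, 2 missed.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)

# F-measure from the precision and recall reported for DF + SVM.
df_svm_f1 = round(f1_score(0.8471, 0.7770), 4)
```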

Method | Precision | Recall | F-Measure
DF + SVM | 0.8471 | 0.7770 | 0.8105
PCA + SVM | 0.8397 | 0.7870 | 0.8125
LDA + SVM | 0.8846 | 0.8554 | 0.8724
B-LDA + SVM | 0.9121 | 0.8901 | 0.9003

Table 4.

Classifier performances based on different feature reduction methods.

3.5 Comparison of weighted B-TFIDF with baseline method

The weighted B-TFIDF method is compared with earlier work on web content analysis on four different datasets. The new feature selection method using weighted B-TFIDF performs better than the baseline. The outcomes are listed in Table 5 and show very high precision, recall, and F-1 measure on Twitter. On Kongregate, precision dropped at the top 2000 features. In most cases the classifier performed similarly, between 80 and 100%. On the MySpace dataset, recall is close to 1, whereas precision varies between 76 and 87%, except at feature value 18,000, where it reaches 91%. Unlike the other datasets, performance on Slashdot is very low. Although recall is moderate, precision and F-1 measure degraded when the feature set was small, and poor performance is also observed at feature value 18,000. Overall, weighted B-TFIDF shows the best result (Figure 9).

Method | Measure | Kongregate | Slashdot | MySpace | Twitter
Baseline | Precision | 0.35 | 0.32 | 0.42 | 0.62
Baseline | Recall | 0.60 | 0.28 | 0.25 | 0.53
Baseline | F-1 measure | 0.44 | 0.30 | 0.31 | 0.57
Weighted TFIDF | Precision | 0.87 | 0.78 | 0.86 | 0.87
Weighted TFIDF | Recall | 0.97 | 0.99 | 0.98 | 0.75
Weighted TFIDF | F-1 measure | 0.92 | 0.87 | 0.92 | 0.81
Weighted B-TFIDF | Precision | 0.95 | 0.96 | 0.96 | 0.98
Weighted B-TFIDF | Recall | 0.93 | 0.84 | 0.93 | 0.96
Weighted B-TFIDF | F-1 measure | 0.94 | 0.90 | 0.95 | 0.97

Table 5.

Comparison of weighted B-TFIDF with baseline method on other datasets.

Figure 9.

(a) Base line method, (b) weighted TFIDF method, and (c) weighted B-TFIDF method.

3.6 Victim and predator identification

To identify cyberbullying predators and victims, the most active predators and the most attacked users must be determined. Table 6 lists the most active predators and victims and shows how users are involved in bullying relationships. In some cases more than one user has the same rank, and users with the same rank are therefore grouped together. Notably, predators flagged at Rank I are also identified as victims at Rank II. Likewise, Rank II predators are also Rank VII victims (Figure 10).

Rank | I | II | III | IV | V | VI | VII | VIII
Number of users (predators) | 4 | 2 | 1 | 1 | 2 | 7 | 3 | 2
Number of users (victims) | 8 | 4 | 7 | 2 | 2 | 1 | 9 | 8

Table 6.

Performance of graph model: Predators and victims identification.

Figure 10.

Predators and victims identification.

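| Sender \ Recipient | U1 | U2 | U3 | U4 | U5 | … | UN |
|---|---|---|---|---|---|---|---|
| U1 | 0 | 3 | 0 | 1 | 3 | … | … |
| U2 | 1 | 0 | 0 | 0 | 0 | … | … |
| U3 | 1 | 2 | 0 | 1 | 0 | … | … |
| U4 | 0 | 1 | 0 | 0 | 1 | … | … |
| U5 | 0 | 1 | 1 | 1 | 0 | … | … |
| … | … | … | … | … | … | … | … |
| UN | … | … | … | … | … | … | … |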

Table 7.

Cyber bullying matrix (W).

3.6.1 Graph representation

The users' communication network is examined in order to identify predators and victims. Gephi [15], a graphical interface, is used to visualize the links between users involved in harassing posts in a network. Figure 11 depicts the bullying network, in which groups of users are obtained from the bullying messages by using modularity, a measure that quantifies the quality of a partition of a network into sub-graphs or communities. Modularity is defined as the total weight of the edges that fall inside the given groups minus the expected total if the edges were distributed at random in the graph.

Figure 11.

Bullying network.
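The modularity measure described above can be illustrated with a small calculation. The following is a minimal sketch, not the Gephi implementation: it assumes the usual Newman formulation normalized by the total edge weight, treats the graph as undirected for simplicity, and only scores a partition that is already given rather than searching for communities. The function name `modularity` and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a given partition of an undirected weighted graph.

    edges: iterable of (u, v, weight); community_of: dict node -> community label.
    """
    total_weight = 0.0
    internal = defaultdict(float)    # weight of edges inside each community
    degree_sum = defaultdict(float)  # summed weighted degree of each community
    for u, v, w in edges:
        total_weight += w
        degree_sum[community_of[u]] += w
        degree_sum[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    # Observed internal fraction minus the fraction expected at random.
    return sum(
        internal[c] / total_weight - (degree_sum[c] / (2 * total_weight)) ** 2
        for c in degree_sum
    )


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)]
community_of = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, community_of)  # 5/14, roughly 0.357
```

A higher value indicates that more edge weight falls inside the groups than a random placement of edges would produce.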

As shown in Figure 11, nine groups or communities, depicted in different colors, are formed by the modularity algorithm, which groups users that are more densely connected within a group than between groups. The density of a post indicates how much abusive content it carries and is calculated for each post as the total count of bullying words in the post divided by the total number of words in the post. The HITS algorithm is used to identify the predators and their victims and to calculate their scores. The idea behind HITS is that, in a network, good hub pages point to good authority pages, and good authority pages are pointed to by good hub pages. For a search query, web pages are scored to identify potential hub and authority pages. This concept is likewise used to rank predators and victims in a communication network.
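The post density described above is a simple ratio and can be sketched as follows. The tokenization rule and the two-word lexicon are assumptions made for illustration; the chapter does not specify either.

```python
import re


def post_density(post, bullying_words):
    """Fraction of the words in a post that appear in the bullying lexicon."""
    tokens = re.findall(r"[a-z']+", post.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in bullying_words)
    return hits / len(tokens)


# Hypothetical lexicon and post, for illustration only.
lexicon = {"idiot", "loser"}
density = post_density("You are such an idiot and a loser", lexicon)  # 2 of 8 words
```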

  1. Assumption: One bullying message is considered for each user.

  2. Predator: Person who has posted at least one bullying message.

  3. Victim: User who has received at least one bullying message.

  4. Objective: To identify and to rank the most dynamic user as Predator and Victim.

A ranking method based on HITS is then used to detect predators and victims. A user may be both a predator and a victim, depending on the harassing messages he/she sends or receives. Each user is therefore assigned both a predator score and a victim score, which are calculated by the following two equations.

$$p(u) = \sum_{u \rightarrow y} v(y) \tag{13}$$

$$v(u) = \sum_{y \rightarrow u} p(y) \tag{14}$$

Here, p(u) and v(u) are the predator and victim scores, respectively; u → y denotes the existence of a harassing post from u to y, whereas y → u denotes a bullying post from y to u. The two equations are used to evaluate the predator and victim scores and are applied repeatedly as a set of iterative update rules. They rest on the assumption that the most active predators are connected to the most active victims by sending harassing posts, and that the most active victims are connected to the most active predators by receiving bullying messages. A user's predator score therefore increases when user u sends bullying messages to a user with a high victim score. In the same manner, the user's victim score increases when user u receives bullying messages from a user with a high predator score. In every iteration the scores are computed from the in-degrees, the out-degrees and the associated scores, which can produce large values. The scores are therefore normalized to unit length, i.e., each predator score is divided by the sum of all predator scores and each victim score by the sum of all victim scores.

Next, the ranking method is applied to the predators and victims in the network of Figure 11. To illustrate a real scenario in a simple manner, only five users are selected as an example in Figure 12, which shows how the most active predators and victims in a bullying network are identified. The network is a weighted directed graph G = (U, A) with node set U and arc set A, where,

Figure 12.

Communication paths between predator and casualty.

Each node u_i ∈ U is a user involved in the bullying conversation,

Each arc (u_i, u_j) ∈ A is a bullying message sent from u_i to u_j,

The weight of arc (u_i, u_j), denoted w_ij, is the number of bullying messages sent from u_i to u_j (see Eq. (15)).

Predators and victims are identified from the weighted directed graph G. A victim is recognized by the many incoming arcs of its node and a predator by the many outgoing arcs of its node. This makes it possible to identify the most active predator or victim.
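The idea of reading predators from outgoing arcs and victims from incoming arcs can be sketched with weighted degrees. The arc list below is hypothetical and the function name `weighted_degrees` is illustrative; this is only a first approximation of the ranking, since it ignores the scores of the neighbours.

```python
from collections import defaultdict


def weighted_degrees(arcs):
    """Return (out_degree, in_degree) for arcs given as {(sender, recipient): weight}."""
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for (sender, recipient), weight in arcs.items():
        out_degree[sender] += weight    # messages sent: predator side
        in_degree[recipient] += weight  # messages received: victim side
    return dict(out_degree), dict(in_degree)


# Hypothetical arcs: A sent 2 bullying messages to B and 1 to C; C sent 1 to B.
arcs = {("A", "B"): 2, ("A", "C"): 1, ("C", "B"): 1}
out_degree, in_degree = weighted_degrees(arcs)
most_active_predator = max(out_degree, key=out_degree.get)  # "A"
most_attacked_victim = max(in_degree, key=in_degree.get)    # "B"
```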

3.6.2 Cyber bullying matrix

A cyber bullying matrix (W) is constructed to identify predators and victims from their individual scores; it is depicted in Table 7. It is a square adjacency matrix of the sub-network, representing the incoming and outgoing degrees of each node, with entries w_ij, where,

$$w_{ij} = \begin{cases} n & \text{if there are } n \text{ harassing posts from } u_i \text{ to } u_j, \\ 0 & \text{otherwise} \end{cases} \tag{15}$$
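The definition of the matrix entries in Eq. (15) can be sketched as a count over a message log. The input format, a list of (sender, recipient) pairs with one pair per harassing post, and the function name are assumptions made for illustration.

```python
def build_bullying_matrix(users, messages):
    """Build W where w[i][j] counts harassing posts from users[i] to users[j]."""
    index = {user: i for i, user in enumerate(users)}
    n = len(users)
    w = [[0] * n for _ in range(n)]
    for sender, recipient in messages:
        w[index[sender]][index[recipient]] += 1
    return w


# Hypothetical log: U1 harassed U2 twice, U3 harassed U1 once.
users = ["U1", "U2", "U3"]
messages = [("U1", "U2"), ("U1", "U2"), ("U3", "U1")]
w = build_bullying_matrix(users, messages)
# w == [[0, 2, 0], [0, 0, 0], [1, 0, 0]]
```

Row i of the result holds what user i sent and column i holds what user i received, matching the sender and recipient layout of Table 7.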

Since each user has both a victim and a predator score, the scores are represented as vectors of dimension N × 1, where the ith coordinates of the two vectors are the predator and victim scores of the ith user, p_i and v_i, respectively. To calculate the scores, Eqs. (13) and (14) are rewritten as matrix-vector multiplications that update the predator and victim vectors. For the first iteration, p_i and v_i are initialized to 1. For every user (i = 1 to N), the predator and victim scores are as follows:

$$p(u_i) = w_{i1} v_1 + w_{i2} v_2 + \cdots + w_{iN} v_N \tag{16}$$

$$v(u_i) = w_{1i} p_1 + w_{2i} p_2 + \cdots + w_{Ni} p_N \tag{17}$$

When these equations converge to a stable value (say k), they give the final predator and victim vectors over all users. Finally, the eigenvectors are computed to obtain the predator and victim scores.

Algorithm 1 gives a general framework for identifying the top-ranked, most active predators and victims. In the algorithm, N is the total number of users and Top is a threshold value, which is set manually.

Algorithm 1. Predator and victim identification.

Input: Set of users engaged in conversations containing harassing posts, N, Top.
Output: Set of Top victims and Top predators
  1. Extract senders and recipients from N;

  2. Initialize the predator and victim vectors for each of the N users;

  3. Generate the adjacency matrix W using Eq. (15);

  4. Compute the predator and victim vectors by iteratively updating Eqs. (16) and (17), normalizing after each update, until they converge to a stable value k;

  5. Compute the eigenvectors to obtain the predator and victim scores;

  6. Return the Top highest-ranked predators and victims.
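The steps of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the victim update sums over senders as in Eq. (14), i.e., over a column of W; the converged vectors are taken directly as the scores instead of running a separate eigenvector computation; the tolerance `tol` and the cap `max_iter` are hypothetical parameters; and the visible 5 × 5 block of Table 7 is treated as if it were a complete five-user network.

```python
def normalize(scores):
    """Divide each score by the sum of all scores (unchanged if the sum is 0)."""
    total = sum(scores)
    return [s / total for s in scores] if total else list(scores)


def rank_predators_and_victims(users, w, top, tol=1e-9, max_iter=1000):
    """Return the `top` highest-scoring predators and victims as (user, score) lists."""
    n = len(users)
    p = [1.0] * n  # predator scores, initialized to 1
    v = [1.0] * n  # victim scores, initialized to 1
    for _ in range(max_iter):
        # Predator update: weighted sum of the victim scores of the recipients.
        new_p = normalize([sum(w[i][j] * v[j] for j in range(n)) for i in range(n)])
        # Victim update: weighted sum of the predator scores of the senders.
        new_v = normalize([sum(w[j][i] * new_p[j] for j in range(n)) for i in range(n)])
        change = max(abs(a - b) for a, b in zip(new_p + new_v, p + v))
        p, v = new_p, new_v
        if change < tol:  # scores have stabilized
            break
    by_predator = sorted(zip(users, p), key=lambda pair: pair[1], reverse=True)
    by_victim = sorted(zip(users, v), key=lambda pair: pair[1], reverse=True)
    return by_predator[:top], by_victim[:top]


# Visible 5 x 5 block of Table 7 (rows: senders, columns: recipients).
users = ["U1", "U2", "U3", "U4", "U5"]
w = [
    [0, 3, 0, 1, 3],
    [1, 0, 0, 0, 0],
    [1, 2, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
top_predators, top_victims = rank_predators_and_victims(users, w, top=2)
```

On this sample, U1 is ranked as the most active predator and U2 as the most attacked victim.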

4. Summary

The new system rests on two contributions. The first is a novel statistical approach based on the new Bully-LDA with the weighted B-TFIDF scheme on bullying-like features; it efficiently and effectively finds latent bullying features, which improves the performance of the classifier and reduces feature sparsity. The second is a graph model that helps to pinpoint the attackers and victims in social networks. Such a system comprises the following functions: Tweets Crawling, Tweet Preprocessing and Tokenization, Feature Extraction and Frequency Extraction, Text Representation Model, Text Classification, Category of Texts, Performance Evaluation, and Results.

The Twitter corpus consists of text messages together with metadata such as user ID, posting time, etc. Tweet crawling uses a number of classes and methods to retrieve users' connection data and the details of the tweets through the Twitter application programming interface, accessed via “Twitter4j-core-4.02.jar.” Tweets are written in an entirely colloquial manner, with a large amount of noise and linguistic variation. For example, tweets contain a large number of novel words, interjections, repetitions, short forms such as acronyms, words with missing letters, words with phonetic spelling such as Gud for Good, and missing spaces between words, such as whatareyoudoing, which increases the tweet length. All of this places a heavy burden on the analysis of the text. The text preprocessing module performs word segmentation and word processing, and the subsequent steps include converting uppercase letters to lower case, stemming, and removing stop words, superfluous characters and hyperlinks.
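Some of the preprocessing steps listed above can be sketched as a small function. This is an illustrative sketch only: the stop-word list is a tiny hypothetical sample, stemming and word segmentation are omitted because they normally rely on an external library, and the regular expressions are assumptions rather than the rules used in the chapter.

```python
import re

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "and", "to"}


def preprocess_tweet(text):
    """Lowercase a tweet, strip hyperlinks and non-letters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits, punctuation, symbols
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = preprocess_tweet("You are SO rude!!! http://t.co/xyz @user1")
# tokens == ["so", "rude", "user"]
```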

The proposed framework, which combines Bully-Latent Dirichlet Allocation with a Support Vector Machine, has been evaluated on Twitter messages. The system is based on the novel concept of applying text mining techniques to tweets in order to detect bullying messages and to identify predators. The weighted B-TFIDF function, in which bullying-like features are measured, is used to improve classification performance. Overall, Bully-LDA + SVM with weighted B-TFIDF outperformed the other models. The model has several benefits, including higher accuracy, better noise reduction, faster speed and greater automation. The results were analyzed using a range of performance measures, for instance accuracy, recall and F1 measure. The analysis clearly shows that the system is effective in identifying bullying messages.

In this research, a methodology for cyber bullying detection and for recognizing the most active predators and victims is carried out effectively and efficiently. This chapter presents a framework for detecting cyber bullying on Twitter using Bully-Latent Dirichlet Allocation with a support vector machine. Preprocessing procedures are first applied to the tweets. Then Bully-LDA, a statistical topic model, is applied to a massive Twitter corpus with the help of the weighted B-TFIDF scheme to detect offensive words in tweets. Finally, a graph representation is used to identify the predators and victims on Twitter.

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

K. Nalini and L. Jabasheela (December 5th 2019). Classification Model for Bullying Posts Detection, Cyberspace, Evon Abu-Taieh, Abdelkrim El Mouatasim and Issam H. Al Hadid, IntechOpen, DOI: 10.5772/intechopen.88633. Available from:
