Centroid-Based Lexical Clustering

Conventional lexical-clustering algorithms treat text fragments as a mixed collection of words, with the semantic similarity between them calculated from how often particular words occur within the compared fragments. Whereas this technique is appropriate for clustering large textual collections, it performs poorly when clustering small texts such as sentences, because compared sentences may be linguistically similar despite having no words in common. This chapter presents a new version of the original k-means method for sentence-level text clustering that relies on related synonyms to construct rich semantic vectors. These vectors represent a sentence using linguistic information derived from a lexical database, which is used to determine the actual sense of a word based on the context in which it occurs. Therefore, while the traditional k-means method relies on calculating the distance between patterns, the proposed version operates by calculating the semantic similarity between sentences. This allows it to capture a higher degree of the semantic or linguistic information existing within the clustered sentences. Experimental results illustrate that the proposed clustering algorithm performs favorably against other well-known clustering algorithms on several standard datasets.

The sentence similarity measures proposed by Li et al. [1], Mihalcea et al. [2], and Wang et al. [18] have two major features in common. Firstly, rather than representing sentences in a vector space model [19] using all possible features from external textual collections, they use only the words appearing in the compared sentences, thus avoiding the data sparseness (i.e., high dimensionality) that results from a bag-of-words representation. Secondly, they use the semantic and linguistic information available in lexical resources to compensate for the lack of word co-occurrence. Sentence-level similarity measures such as that presented by Abdalgader and Skabar [10] (which we use in this chapter and describe in Section 2) rely on word-related synonyms to calculate the semantic similarity between words. Unlike existing measures of short-text semantic similarity, which use only the exact words that appear in the compared sentences, this method creates an expanded word set for each sentence using the related synonyms of the sense-disambiguated words in that sentence. This provides a richer and more highly connected semantic context for estimating sentence similarity through better use of the semantic information in available lexical resources such as WordNet [20,21]. For each of the sentences being compared, a word sense identification step is first applied to determine the correct sense of each word based on its surrounding context [22]. A synonym expansion step is then applied, resulting in a richer and fully connected semantic context from which to construct semantic vectors. The similarity between these vectors can then be calculated using a standard vector space similarity measure (e.g., the cosine measure).
Several text-clustering methods, however, exist in the literature [18, 23-37, 38-40, 42], and the majority of them take only the matrix of semantic similarities between words as input. The k-medoids method [30,31] is one of these; it is a variant of k-means in which centroids are restricted to being data patterns (i.e., points). However, a problem with the k-medoids method is that it is highly sensitive to the initial (i.e., random) selection of centroids, and in practice it often needs to be run many times with different initialization settings. To address this issue with k-medoids, Frey and Dueck [35] proposed Affinity Propagation, a graph-based algorithm that simultaneously considers all data points as possible centroids (i.e., exemplars). Treating each data point as a node in a graph, affinity propagation recursively transmits real-valued messages along the edges of the graph until a suitable set of centroids emerges.
Another graph-based clustering method, which depends on matrix decomposition techniques from linear algebra, is the spectral-clustering algorithm [18,36,37,39,41]. Rather than clustering data patterns in the traditional vector space model, it maps the data patterns into the space spanned by the eigenvectors associated with the top eigenvalues and then applies clustering in this transformed space, usually with the k-means method. One of the benefits of this method is its ability to identify non-convex clusters, which is challenging when applying k-means in the original feature space. Since the spectral-clustering method requires only a matrix of pairwise similarities as input, it is easy to apply to the sentence-level text-clustering task [18,29].
Erkan and Radev [43], Mihalcea and Tarau [44], and Fang et al. [46] have applied PageRank [45] as a centrality measure in the task of document summarization, in which the aim is to rank sentences according to their role in the document being summarized. Importantly, Skabar and Abdalgader [29] proposed a fuzzy sentence-level text-clustering method that also uses PageRank as a centrality measure and allows clustered sentences to belong to all classes with different degrees of similarity (i.e., membership). This notion of fuzzy clustering is needed in document summarization, in which a sentence may be linguistically similar or related to more than one topic [14,29,47].
The contribution presented in this chapter is a new version of the original k-means method for sentence-level text clustering that relies on related synonym sets to create rich and highly connected semantic vectors [42]. These vectors characterize a sentence using semantic information derived from WordNet, which is used to determine the actual sense of a word based on its surrounding context. Thus, while the original k-means method relies on calculating the distance between patterns, the new version operates by calculating the semantic similarity between sentences. This allows it to capture more of the semantic information available within the clustered sentences. The result is a centroid-based lexical-clustering method that can be used in any application in which the relationship between patterns is expressed in terms of pairwise semantic similarities. We apply the algorithm to several benchmark datasets and compare its performance with that of well-known clustering methods such as spectral clustering [36], affinity propagation [35], k-medoids [30,31], STC-LE [39], and k-means (TF-IDF) [40]. We claim that the satisfactory performance of the proposed centroid-based lexical-clustering method is due to its ability to better utilize and capture the semantic information available in the lexical resource used.
The remainder of this chapter is organized as follows. Section 2 presents a representation scheme for calculating sentence semantic similarity. Section 3 describes the proposed centroid-based variation of the original k-means clustering method. Empirical results are presented in Section 4, and Section 5 concludes the chapter.

Semantic similarity representation scheme
By far the most widely used text representation scheme in natural language processing is the vector space model (VSM), in which a text or a document is represented as a point in a high-dimensional (N_i) input space. Each dimension in this input space corresponds to a unique word [19]. That is, a document d_j is represented as a vector x_j = (word_1j, word_2j, word_3j, …), where word_ij is a weight that represents in some way the importance or relatedness of word_i in document d_j and depends on the frequency of occurrence of word_i in d_j. The semantic similarity between the compared documents is then measured using the corresponding vectors, and a commonly applied measure is the cosine of the angle between the two vectors.
The VSM has been effective in information retrieval (IR) because it is able to utilize much of the semantic information expressed in larger textual collections. Large documents are likely to share many words with each other and are therefore judged similar by well-known vector space similarity measures such as the cosine measure. In the case of sentence-level text (text fragments), however, this does not hold, since two sentences may carry the same meaning (i.e., be semantically similar) while sharing no words. For instance, consider the sentences "Some places in the country are now in torrent crisis" and "The current flood disaster affects the particular states." Obviously, these two sentences have the same meaning, yet the only word they have in common is "the", which is a stop word and carries no semantic information. Word co-occurrence may be rare or even absent in sentences because the flexibility of natural language allows humans to express the same meaning using sentences that differ greatly in structure and length [50]. Therefore, we need a sentence-level text representation scheme that is better able to capture all the available semantic information of sentences, thus enabling a more effective similarity method to be used.
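The word co-occurrence problem described above can be checked directly on the two example sentences. The following is a minimal sketch, assuming a plain bag-of-words representation with raw counts; the stop-word list and the helper names `bow_vector` and `cosine` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical minimal stop-word list, for illustration only.
STOP_WORDS = {"the", "in", "are", "now", "some"}

def bow_vector(sentence):
    """Count content words after lowercasing and stripping punctuation."""
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    return Counter(t for t in tokens if t and t not in STOP_WORDS)

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

s1 = "Some places in the country are now in torrent crisis"
s2 = "The current flood disaster affects the particular states."

# The sentences share no content words, so the bag-of-words cosine is zero
# even though a human reader would judge them to be close in meaning.
overlap = cosine(bow_vector(s1), bow_vector(s2))
print(overlap)
```

Any similarity measure that relies purely on shared surface words will behave this way on such sentence pairs, which motivates the synonym-based representation introduced next.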

Measuring sentence-level text similarity
To calculate the semantic similarity between two sentences, we use a sentence similarity method based on the synonym expansion sets of the compared sentences [10]. To demonstrate how this measure works, suppose that Sentence 1 and Sentence 2 are the two sentences being compared, W_1 and W_2 are the sets of sense-assigned words appearing in Sentence 1 and Sentence 2, respectively, sentence_1 and sentence_2 are the synonym expansion sets obtained from W_1 and W_2, and U = W_1 ∪ W_2. Semantic vectors v_1 and v_2 are then created from sentence_1 and sentence_2.
Let word_j be the j-th sense-assigned word in U and v_ij be the j-th element of v_i. There are two instances to take into account, depending on whether word_j appears in sentence_i or not:
Instance 1: If word_j exists in sentence_i, then set v_ij equal to 1, since identical words have maximal semantic similarity in WordNet.
Instance 2: If word_j does not exist in sentence_i, then compute the semantic similarity between word_j and each word in sentence_i using one of the WordNet-based word-to-word similarity measures (e.g., the J&C measure) [51]. The value assigned to v_ij is the highest of these scores.
Once the vectors v_1 and v_2 have been constructed, the semantic similarity between the two sentences can be determined using the cosine similarity measure between the two vectors as

sim(Sentence 1, Sentence 2) = (v_1 · v_2) / (‖v_1‖ ‖v_2‖).
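The vector construction and the cosine step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that word sense disambiguation and synonym expansion have already been performed upstream, and it replaces the WordNet-based word-to-word measure with a hypothetical toy lookup table (`TOY_SIM`, `toy_word_sim`). All function names are assumptions introduced here.

```python
import math

def semantic_vector(union_words, expansion, word_sim):
    """Build the semantic vector of one sentence over the union word list U."""
    vector = []
    for word in union_words:
        if word in expansion:
            vector.append(1.0)  # Instance 1: word found in the expansion set
        else:
            # Instance 2: highest word-to-word similarity to the expansion set
            vector.append(max((word_sim(word, other) for other in expansion),
                              default=0.0))
    return vector

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sentence_similarity(w1, w2, expansion_1, expansion_2, word_sim):
    """Cosine between the semantic vectors built over U = W1 union W2."""
    union_words = sorted(set(w1) | set(w2))
    v1 = semantic_vector(union_words, expansion_1, word_sim)
    v2 = semantic_vector(union_words, expansion_2, word_sim)
    return cosine(v1, v2)

# Hypothetical similarity scores standing in for a WordNet-based measure.
TOY_SIM = {frozenset({"flood", "torrent"}): 0.9,
           frozenset({"crisis", "disaster"}): 0.8}

def toy_word_sim(a, b):
    return 1.0 if a == b else TOY_SIM.get(frozenset({a, b}), 0.0)

w1 = {"torrent", "crisis"}
w2 = {"flood", "disaster"}
# For simplicity the expansion sets here equal the word sets themselves.
score = sentence_similarity(w1, w2, w1, w2, toy_word_sim)
print(round(score, 3))
```

Passing the word-to-word measure in as a function keeps the sketch independent of any particular lexical resource; a WordNet-based measure could be substituted without changing the vector construction.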

Sentence-level clustering algorithm
In this section, we first describe the proposed version of the original k-means clustering algorithm, which we call the centroid-based lexical-clustering (CBLC) algorithm. We then describe how a cluster centroid is defined and constructed. The remaining subsections discuss how the semantic similarity between a sentence and a cluster centroid is calculated, along with related technical issues such as empirical settings and space and time complexity.

Given k sets (i.e., clusters), all the data points (i.e., sentences) are first partitioned randomly among the sets (i.e., initialization), each of which has a centroid (mean) that acts as the representative of the cluster. An iterative process then rearranges these centroids by moving each sentence to the cluster whose centroid it is closest to (i.e., most semantically similar to) and redetermining the cluster centroids from the sentences now belonging to them. This iteration is repeated until the centroids no longer move (i.e., until convergence). These steps constitute the proposed version of the original k-means clustering algorithm.
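The iteration just described can be sketched as a short loop. This is a hedged illustration under simplifying assumptions rather than the CBLC algorithm itself: the similarity function and the centroid builder are passed in as parameters, a Jaccard overlap (`jaccard`) stands in for the semantic sentence similarity, and the refinement of excluding a sentence from its own cluster centroid, which the chapter discusses later, is omitted. All names (`centroid_clustering`, `union_centroid`, `initial_labels`) are hypothetical.

```python
import random

def centroid_clustering(sentences, k, similarity, build_centroid,
                        initial_labels=None, max_iter=100, seed=0):
    """k-means-style loop driven by a similarity instead of a distance."""
    rng = random.Random(seed)
    if initial_labels is not None:
        labels = list(initial_labels)
    else:
        # Initialization: random partition of the sentences into k clusters.
        labels = [rng.randrange(k) for _ in sentences]
    for _ in range(max_iter):
        # Redetermine each centroid from the sentences currently assigned to it.
        centroids = [
            build_centroid([s for s, lab in zip(sentences, labels) if lab == c])
            for c in range(k)
        ]
        # Move each sentence to the cluster with the most similar centroid.
        new_labels = [
            max(range(k), key=lambda c: similarity(s, centroids[c]))
            for s in sentences
        ]
        if new_labels == labels:  # convergence: no sentence changed cluster
            break
        labels = new_labels
    return labels

def union_centroid(members):
    """Centroid as the union of the word sets of the member sentences."""
    return set().union(*members)

def jaccard(a, b):
    """Stand-in similarity; the chapter uses a semantic measure instead."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

sentences = [frozenset(s) for s in (("x", "y"), ("x", "z"), ("y", "z"),
                                    ("p", "q"), ("p", "r"), ("q", "r"))]
# The third sentence starts in the wrong cluster and is moved by the loop.
labels = centroid_clustering(sentences, 2, jaccard, union_centroid,
                             initial_labels=[0, 0, 1, 1, 1, 1])
print(labels)
```

As with standard k-means, the outcome of such a loop depends on the initial partition, so in practice several runs with different initializations would be compared.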

Determining a clustering centroid
In the standard vector space model, in which a text such as a document is represented as a vector (i.e., its elements are the tf-idf scores), a cluster centroid can be determined by averaging the vectors of all text fragments belonging to that cluster. This is difficult under the text representation scheme discussed above, since the semantic vector of a sentence is not unique but depends on the context, and hence the vector length, defined by the sentence it is compared with. However, just as a context may be constructed from two sentences, it is straightforward to extend this notion to define a context over a collection of sentences. Since a cluster is just such a collection of text fragments, we can define the centroid of a cluster as the union of all associated synonyms of the disambiguated words in the sentences belonging to that cluster. Thus, if Sentence 1, Sentence 2, …, Sentence N are the sentences belonging to some cluster, the centroid of the cluster, which we denote as M_j, is just the union set {word_1, word_2, …, word_n}, where n is the number of distinct synonym words (from the expansion sets sentence_i) in Sentence 1 ∪ Sentence 2 ∪ … ∪ Sentence N. Figure 1 exemplifies the idea of determining a cluster centroid.

Calculating similarity between sentence and cluster centroid
When the CBLC algorithm calculates the semantic similarity between a sentence and a cluster centroid, there are two cases to take into account: first, the sentence belongs to the cluster; second, the sentence does not belong to the cluster. The second case is straightforward to implement. Since cluster centroids are represented in the same way as sentences, namely as a union set of synonyms, the similarity between a sentence and a cluster centroid can be calculated using the sentence similarity measure described earlier, exactly as for two sentences. There is, however, a subtlety in the first case that is not immediately apparent.
To demonstrate how this semantic similarity is calculated, assume that Sentence 1 = {word_1, word_2, word_3} and Sentence 2 = {word_4, word_5} are not semantically similar. Comparing these sentences (S_1 and S_2), we obtain the semantic vectors v_1 = {1,1,1,0,0} and v_2 = {0,0,0,1,1}, which obviously have a cosine value of zero, consistent with the fact that there is no semantic relation between them. Now suppose, however, that S_1 and S_2 are in the same cluster. If we create the cluster union set as described earlier (i.e., by taking the union of all synonym words appearing in all sentences in that cluster), we obtain M_j = {word_1, word_2, word_3, word_4, word_5}. If we now calculate the semantic similarity between M_j and S_1 using the cosine measure, we obtain the vectors v_j = {1,1,1,1,1} and v_1 = {1,1,1,0,0}, which have a similarity score of 0.77. A problem is clearly visible here: S_1 and S_2 are not similar, so such a high score carries no useful meaning. This unexpectedly high similarity arises because all of the words of S_1 already exist in the cluster centroid M_j. We can solve this problem by defining the centroid using all sentences in the cluster except the sentence with which the centroid is currently being compared. Therefore, assuming that we have a cluster containing Sentence 1, …, Sentence N and we want the similarity between this cluster and a sentence S_G appearing in the cluster, we determine the cluster centroid using only the words appearing in Sentence 1 ∪ Sentence 2 ∪ … ∪ Sentence G−1 ∪ Sentence G+1 ∪ … ∪ Sentence N; that is, we omit S_G when calculating the cluster centroid.
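The five-word example can be reproduced numerically. The sketch below assumes, as in the example, that distinct words have zero word-to-word similarity, so the semantic vectors reduce to binary membership vectors; the helper names `binary_vector` and `centroid_without` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def binary_vector(universe, words):
    """Membership vector; valid here because unrelated words score zero."""
    return [1.0 if w in words else 0.0 for w in universe]

def centroid_without(cluster, index):
    """Union of all word sets in the cluster except the one at `index`."""
    centroid = set()
    for j, words in enumerate(cluster):
        if j != index:
            centroid |= words
    return centroid

s1 = {"word1", "word2", "word3"}
s2 = {"word4", "word5"}
cluster = [s1, s2]
universe = sorted(s1 | s2)

# Centroid that includes S1 itself: the similarity is inflated.
full_centroid = s1 | s2
inflated = cosine(binary_vector(universe, full_centroid),
                  binary_vector(universe, s1))

# Centroid built without S1: the similarity drops to zero, as expected
# for two sentences with no semantic relation.
corrected = cosine(binary_vector(universe, centroid_without(cluster, 0)),
                   binary_vector(universe, s1))
print(round(inflated, 2), corrected)
```

The inflated value equals 3/√15, which rounds to the 0.77 quoted in the text.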

Space and time complexity of CBLC algorithm
The proposed algorithm has been found to be no more expensive than the basic k-means [52] and spectral-clustering [18,37] algorithms with regard to space complexity (i.e., the three algorithms require the storage of the same similarity scores). The time (i.e., computation) complexity of the new version of the standard k-means, however, far exceeds that of the basic k-means and spectral-clustering algorithms. The additional computation arises in the stage of calculating the similarity between each sentence and the corresponding centroid, owing to the text representation used by the sentence similarity measure applied within this clustering algorithm. To quantify this complexity, suppose that the time for calculating the semantic similarity between one sentence and one cluster centroid is SentSim, the time for recalculating one cluster centroid is ReTime, the total number of sentences in the dataset is tn, the number of clusters is k, and the number of iterations of the proposed algorithm is LoopI. Two computations are then required in every clustering iteration: (i) tn · k sentence-to-centroid similarity calculations and (ii) k centroid recalculations. As a result, the time complexity of the proposed version can be written as O_CBLC = (SentSim · tn · k + ReTime · k) · LoopI.
Since SentSim ≫ ReTime and tn ≫ k, the overall time complexity of the CBLC algorithm is O(tn), which means that the computational cost grows in proportion to the size of the dataset to be clustered.
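The linear dependence on tn can be illustrated by evaluating the cost expression directly. The function name `cblc_cost` and all numeric values below are hypothetical, chosen only to satisfy SentSim ≫ ReTime and tn ≫ k; they are not measurements from the chapter.

```python
def cblc_cost(sent_sim, re_time, tn, k, loop_i):
    """Evaluate (SentSim * tn * k + ReTime * k) * LoopI in arbitrary time units."""
    return (sent_sim * tn * k + re_time * k) * loop_i

# Hypothetical unit costs and sizes, for illustration only.
small = cblc_cost(sent_sim=50.0, re_time=1.0, tn=1_000, k=10, loop_i=20)
large = cblc_cost(sent_sim=50.0, re_time=1.0, tn=2_000, k=10, loop_i=20)

# Doubling the number of sentences roughly doubles the total cost, because
# the centroid-recalculation term is negligible by comparison.
ratio = large / small
print(ratio)
```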

Experiments and results
This section presents the performance of the CBLC algorithm on seven benchmark datasets and compares the results with those of other well-known clustering algorithms: spectral clustering [18,36], affinity propagation [35], k-medoids [30,31], STC-LE [39], and k-means (TF-IDF) [40]. We first describe the seven benchmark datasets, then discuss the cluster evaluation criteria, and finally report the experimental results (Figure 2).

Benchmark datasets
While the CBLC algorithm is obviously appropriate for tasks involving sentence clustering, it is generic in nature, and we therefore apply it to standard datasets such as the Reuters-21578 dataset [29], the Aural Sonar dataset [29,53], the Protein dataset [29,54], the Voting dataset [29,55], SearchSnippets [38,56], StackOverflow [38], and Biomedical [38]. Reuters-21578 is a commonly used dataset for the text classification task. It contains more than 20,000 documents from over 600 classes. The experimental results presented in this chapter use a subset containing only 1833 text fragments, each of which is labeled as belonging to one of 10 distinct classes. The numbers of text fragments in the 10 classes are 354, 333, 258, 210, 155, 134, 113, 100, 90, and 70, respectively.
In the Aural Sonar dataset [53], two randomly selected people were asked to assign a similarity score between 1 and 5 to all pairs of signals returned from a broadband active sonar system. The two scores obtained from the participants were added to produce a 100 × 100 similarity matrix with values ranging from 2 to 10.
The Protein dataset [54,57] consists of dissimilarity values for 226 samples over nine classes. We use the reduced set [57] of 213 proteins from four classes that result from removing classes with fewer than seven samples.
The Voting dataset is a two-class classification task with around 435 samples (text fragments). Similarity scores in the form of a matrix table were computed from the data in the categorical domain.
The SearchSnippets dataset covers eight predefined domains (i.e., classes) and was generated from the results of web search transactions.
The StackOverflow dataset consists of 3,370,528 samples collected over the period from July 31, 2012, to August 14, 2012 (https://www.kaggle.com). In this chapter, we randomly select 20,000 question titles from 20 different classes.
The Biomedical dataset is a challenge dataset published on the BioASQ official website; we randomly select 20,000 paper titles from 20 different MeSH major classes.

Clustering evaluation criteria
Since a complete clustering (i.e., all objects from a single class are assigned to a single cluster) and a homogeneous clustering (i.e., each cluster contains only objects from a single class) are rarely both achieved, we aim to reach a satisfactory balance between the two. We therefore apply five well-known clustering criteria to evaluate the performance of the proposed algorithm: Purity, Entropy, V-measure, Rand Index, and F-measure.
Entropy and Purity [58]. The entropy measure shows how the classes of sentences are distributed within each cluster; it is computed as the weighted average of the individual cluster entropies over all clusters C = {c_1, c_2, c_3, …, c_n}. The purity of a cluster is the fraction of the cluster size that is accounted for by the largest class of sentences assigned to that cluster. Overall purity is the weighted sum of the individual cluster purities. While purity and entropy are useful for comparing clusterings with the same number of clusters, they are not reliable when comparing clusterings with different numbers of clusters. This is because entropy and purity assess only how the sentences are distributed within each cluster, that is, they measure homogeneity alone. The highest purity scores and the lowest entropy scores are usually obtained when the total number of clusters is very large, which in turn leads to low completeness. The next measure we use considers both completeness and homogeneity.
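Purity and entropy as described in words above can be computed from two parallel label lists. The sketch below uses the common textbook forms and is an assumption about details the text does not fix: cluster entropy is taken in bits and is not normalized by the number of classes, which may differ from the normalization used in the chapter. The function name `purity_and_entropy` is hypothetical.

```python
import math
from collections import Counter

def purity_and_entropy(cluster_ids, class_ids):
    """Return (overall purity, overall entropy) for a clustering."""
    n = len(cluster_ids)
    # Class counts per cluster.
    clusters = {}
    for cluster, label in zip(cluster_ids, class_ids):
        clusters.setdefault(cluster, Counter())[label] += 1
    purity = 0.0
    entropy = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        # Weighted purity: (size / n) * (largest class / size).
        purity += max(counts.values()) / n
        # Unnormalized cluster entropy in bits (assumed convention).
        h = -sum((m / size) * math.log2(m / size) for m in counts.values())
        entropy += (size / n) * h
    return purity, entropy

# Two pure clusters versus a single cluster mixing two classes equally.
pure = purity_and_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = purity_and_entropy([0, 0, 0, 0], ["a", "a", "b", "b"])
print(pure, mixed)
```

Assigning every sentence to its own cluster would give perfect purity and zero entropy under these definitions, which illustrates why the two measures favor large numbers of clusters.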
V-measure [59]. This measure is the harmonic mean of homogeneity and completeness; that is, V = 2 * homogeneity * completeness / (homogeneity + completeness), where homogeneity and completeness are defined as homogeneity = 1 - H(C|L)/H(C) and completeness = 1 - H(L|C)/H(L). Here H(·) denotes entropy and H(·|·) denotes conditional entropy.
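A minimal sketch of the V-measure computation follows. It uses the standard formulation in which homogeneity is one minus the conditional entropy of the classes given the clusters, normalized by the class entropy, and completeness is the symmetric quantity; how the chapter's symbols C and L map onto classes and clusters is not assumed here. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((m / n) * math.log(m / n) for m in Counter(labels).values())

def conditional_entropy(target, given):
    """Conditional entropy H(target | given) from paired label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((m / n) * math.log(m / given_counts[g])
                for (g, _), m in joint.items())

def v_measure(class_ids, cluster_ids):
    """Harmonic mean of homogeneity and completeness."""
    h_class = entropy(class_ids)
    h_cluster = entropy(cluster_ids)
    # By convention, a degenerate (zero-entropy) labeling scores 1.
    homogeneity = (1.0 if h_class == 0
                   else 1.0 - conditional_entropy(class_ids, cluster_ids) / h_class)
    completeness = (1.0 if h_cluster == 0
                    else 1.0 - conditional_entropy(cluster_ids, class_ids) / h_cluster)
    if homogeneity + completeness == 0:
        return 0.0
    return 2.0 * homogeneity * completeness / (homogeneity + completeness)

classes = ["a", "a", "b", "b"]
perfect = v_measure(classes, [1, 1, 0, 0])     # clusters match classes
one_cluster = v_measure(classes, [0, 0, 0, 0])  # complete but not homogeneous
print(perfect, one_cluster)
```

Because the score is a harmonic mean, a clustering that is fully complete but not at all homogeneous (everything in one cluster) still receives a V-measure of zero.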
Rand Index and F-measure. These measures depend on a combinatorial approach that considers each possible pair of sentences. The Rand Index is defined as Rand Index = (TP + TN)/(TP + FP + FN + TN), where TP is the number of true positives (pairs of sentences in the same class and the same cluster), FP is the number of false positives (pairs in different classes but the same cluster), FN is the number of false negatives (pairs in the same class but different clusters), and TN is the number of true negatives (pairs in different classes and different clusters).
The F-measure is another method widely applied in the information retrieval domain and is defined as the harmonic mean of Precision (P) and Recall (R), that is, F-measure = 2*P*R/ (P + R), where P = TP/(TP + FP) and R = TP/(TP + FN).
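The pair-counting measures can be sketched as follows, using the standard definitions in which the Rand Index is the fraction of sentence pairs on which the clustering and the class labels agree, (TP + TN) divided by the total number of pairs. The function names `pair_counts` and `rand_index_and_f` are hypothetical.

```python
from itertools import combinations

def pair_counts(class_ids, cluster_ids):
    """Count TP, FP, FN, TN over all unordered pairs of sentences."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(class_ids)), 2):
        same_class = class_ids[i] == class_ids[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        if same_cluster and same_class:
            tp += 1  # same class, same cluster
        elif same_cluster:
            fp += 1  # different classes, same cluster
        elif same_class:
            fn += 1  # same class, different clusters
        else:
            tn += 1  # different classes, different clusters
    return tp, fp, fn, tn

def rand_index_and_f(class_ids, cluster_ids):
    """Return (Rand Index, F-measure) from the pair counts."""
    tp, fp, fn, tn = pair_counts(class_ids, cluster_ids)
    rand = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return rand, 0.0
    return rand, 2 * precision * recall / (precision + recall)

classes = ["a", "a", "b", "b"]
rand, f_measure = rand_index_and_f(classes, [0, 0, 0, 1])
print(rand, round(f_measure, 3))
```

The loop visits every pair, so its cost grows quadratically with the number of sentences; for large datasets the same counts are usually derived from a contingency table instead.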

Results
Since the CBLC algorithm is generic in nature and can in principle be applied to any lexical semantic clustering domain, Figure 3 shows the results of applying it to the Reuters-21578, Aural Sonar, Protein, Voting, SearchSnippets, StackOverflow, and Biomedical datasets, respectively, using the Purity, Entropy, V-measure, Rand Index, and F-measure evaluation measures. The CBLC algorithm, however, requires an initial number of clusters to be specified before the algorithm starts. This number was varied from 7 to 12 for the Reuters-21578, Aural Sonar, Protein, Voting, and SearchSnippets datasets, and from 17 to 23 for the StackOverflow and Biomedical datasets, because these ranges were found to give adequate clustering performance. Note that the values in the figure are averaged over 100 trials, and only the best performance according to each measure is presented. Figures 3-9 show the clustering performance of the CBLC algorithm compared with that of spectral clustering, affinity propagation, k-medoids, STC-LE, and k-means (TF-IDF), respectively, on the seven benchmark datasets using the five cluster evaluation criteria described earlier. For the baseline (i.e., compared) methods, the values of the evaluation measures (i.e., purity, entropy, V-measure, Rand Index, and F-measure) were obtained by exploring a range of 7 to 23 clusters and then taking the setting that gave the best overall clustering quality. The reported empirical results for our proposed version of standard k-means clustering and for the other compared algorithms correspond to the best performance obtained from 200 runs.
The empirical results demonstrate that the CBLC algorithm significantly outperforms the other baseline algorithms on all datasets used. In this experiment, however, we knew a priori what the real number of clusters was. In general, we would like the clustering algorithm to determine the actual number of clusters automatically, since this information would not normally be available. Even when run with a high initial number of clusters, the CBLC algorithm was able to converge to a solution containing no more than seven clusters (e.g., in the case of the Reuters-21578 dataset), and the figures again show that the evaluation of these clusterings is superior to that of the other baseline clustering algorithms.
Centroid-Based Lexical Clustering. http://dx.doi.org/10.5772/intechopen.75433

Concluding remarks
This chapter has presented a new version of the k-means clustering method that is able to cluster small text fragments. This variation measures the semantic similarity between patterns (i.e., sentences) by creating a synonym expansion set that is used to build the compared semantic vectors. Sentences are represented in these vectors using semantic information derived from WordNet, which is used to identify the actual sense of a word based on its surrounding context. The experimental results have shown that the method achieves satisfactory performance against the compared algorithms, namely spectral clustering, affinity propagation, k-medoids, STC-LE, and k-means (TF-IDF), as evaluated on several standard datasets.
An obvious application domain for the algorithm is text mining; however, the algorithm can also be used within more general text-processing settings such as text summarization. As with any clustering algorithm, the performance of CBLC ultimately depends on the text similarity values, and these values can be improved by defining a sentence-level text similarity measure that exploits more of the semantic information expressed in the compared sentences. Any such improvement will surely be reflected in the overall sentence-clustering performance.
Sentence-level text clustering is an exciting area of research within knowledge discovery and computational linguistics, and this chapter has proposed a new variation of k-means clustering that is capable of clustering sentences based on the semantic information available in those sentences. We are interested in several of the new research directions that we have encountered in this area; what we are most excited about, however, is applying the proposed clustering technique to text-mining activities. This is because human-written documents usually contain buried knowledge and information, whereas the technique developed in this work has so far been applied only to the clustering of text fragments. Therefore, one possible direction for future work is to apply these ideas of sentence clustering to the development of complete techniques for sentiment analysis of people's opinions.

Author details
Khaled Abdalgader