Some Classical Keyword Extraction Systems.
Link to this chapter Copy to clipboard
Cite this chapter Copy to clipboard
Embed this chapter on your site Copy to clipboard
Embed this code snippet in the HTML of your website to show this chapter
Open access peer-reviewed chapter
By Yuan-Chao Liu, Ming Liu and Xiao-Long Wang
Submitted: March 24th 2012Reviewed: June 12th 2012Published: November 21st 2012
Text clustering is one of the most important text mining research directions. Despite the loss of some details, clustering technology simplifies the structure of data set, so that people can observe the data from a macro point of view.
After clustering process, the text data set can be divided into some different clusters, making the distance between the individuals in the same cluster as small as possible, while the distance between the different categories as far away from each other as possible.
Similar as text classification, text clustering is also the technology of processing a large number of texts and gives their partition.What is different is that text clustering analysis of the text collection gives an optimal division of the category without the need for labeling the category of some documents by hand in advance, so it is an unsupervised machine learning method. By comparison, text clustering technology has strong flexibility and automatic processing capabilities, and has become an important means of effective organization and navigation of text information. Jardine and van Rijsbergen made the famous clustering hypothesis: closely associated documents belong to same category and the same request . Text clustering can also act as the basic research for many other applications. It is a preprocessing step for some natural language processing applications, e.g., automatic summarization, user preference mining, or be used to improve text classification results. YC Fang, S. Parthasarathy,  and Charu  use clustering techniques to cluster users’ frequent query and then the results to update the FAQ of search engine sites.
Although both text clustering and text classification are based on the idea of class, there are still some apparent differences: the classification is based on the taxonomy, the category distribution has been known beforehand. While the purpose of text clustering is to find the topic structure of documents       . Yasemin Kural  made a lot of experiments and compared the clustering mode and linear array mode for search engine, the results show that the former can indeed increase information access efficiency greatly.
Although there are many clustering methods, SOM has attracted many researchers in recent years. In this chapter, we reviewed the application of Self-Organizing Maps in Text Clustering. Our recent works on SOM based text clustering are also introduced briefly. The remaining of this chapter is organized as follows. Section 2 gives a review about the advances in text clustering and SOM; section 3 presents our recent work on application of self-organizing maps in text clustering. Then in section 4 some conclusions and discussions are given.
Text clustering is an unsupervised process that is not dependent on the prior knowledge of data collection, and based solely on the similarity relationship between documents in the collection to separate the document collection into some clusters. The general mathematical description of text clustering can be depicted as follows:
Suppose is a collection of documents to be clustered, each document can be represented as high-dimensional space vectorby the famous vector space model (VSM), where means the weight of on feature . The purpose of text clustering is to divide into , here . For hard clustering, each document can belong to only one class, i.e. . Whereas for soft clustering, one document may belong to multiple clusters. Membership degree can be used to denote how much belongs to cluster .
Compared with other data types, text data is semi-structured. This makes man database-based algorithms does not apply to text clustering.
One important preprocessing step for text clustering is to consider how the text content can be represented in the form of mathematical expression for further analysis and processing. The Common method is Salton's vector space model  (Vector Space Model, VSM). The basic idea is: one feature space are constructed firstly, each dimension means one term, which comes from the key words of each document. Then each document is represented as one vector in this feature space. The document vector is usually a sparse vector as the dimension is very huge.
Dimensionality reduction is an essential step in text clustering. There are several techniques to reduce the dimension of the high-dimensional feature vector. PCA (Principal Component Analysis) method is one of the widely used dimension reduction techniques. Given an n × m-order document-term matrix, the k eigenvectors of the PCA with an m × m-order covariance matrix is used to reduce the dimension of the word space, and ultimately resulted in a k-term space dimension, which is much smaller than m.
LSI (Latent Semantic the Indexing) method is also widely used in the field of information retrieval, dimensionality reduction. It is in essence similar with the PCA. LSI make singular value decomposition not on covariance matrix, but on the initial n × m-order document–term matrix, and then selecting these singular eigenvectors as representative, thereby reduces the dimension.
Another problem is how to extract important features from documents. Mark P. Sinka and David W. Corne  argue that stop word removal will improve the text clustering effect. They also pointed out that after obtaining all unique words in the collection, you can only keep some high-frequency words to construct the space. Anton V. Leouski and W. Bruce Crof demonstrated that for each document, it is necessary to select only some important words to represent the document, and can basically meet the needs of the cluster without impacting clustering results. Literature  proposed a method to extract the key words in the document as features Literature  use latent semantic indexing (LSI) method to compress the dimension of the clustering feature space. Besides, ZhengYu Niu  and STANISŁAW OSIŃSKI , etc also performed research on feature selection.
Assume there are five documents doc1 doc2, doc3, doc4, and doc5. For each document, the first steps are segmenting, stop word removal, and word frequency counting. In order to improve the clustering efficiency, only the words which frequency is above a certain threshold value are used to construct the feature space. Studies have shown that such a treatment will not have an adverse impact on the clustering quality. Then the feature space can be constructed by using the term set which comes from all these terms. Each document is represented as a vector in the feature space. Fig.2. depicts the preprocessing steps for text clustering.
Suppose the feature space is (apple, banana in the cat, window), and feature words frequency threshold is 2, then the following example document-term matrix can be formed:
As all documents are represented as the vector in the same feature space, thus it is more convenient for computing the document similarity. In fact, the similarity calculation is very frequent for most clustering algorithms. In addition, as there are usually many common words in different documents, the actual dimension of the feature space is less than the sum of the number of words selected from each document.
The evaluation of word importance. Take a science paper as an example, it is shown that about 65% to 90% author-marked keywords can be found in the main content in the original paper. This means that by importance evaluation, the key words can be extracted from documents to represent the main content. Basically, keyword extraction can be seen as a supervised machine learning problems; this idea is first proposed by Turney . Turney also make a comparative study based on genetic algorithms and decision tree-based keywords extraction algorithm. Factors which can denote the word importance includes word frequency, word location (title, caption and etc.). Many researches showed that high-frequency words are the more important words. Some typical keyword extraction system has been listed in table 1.
|Verity’s Search 97||http://www.verity.com/|
|Microsoft office 2003||http://www.microsoft.com/|
|Eric Brill’s Tagger||ftp://ftp.cs.jhu.edu/pub/brill/Programs/|
There are two common Clustering strategies, and both need to measure the similarity of the document.
The first strategy is the "complete" strategy, or called "static" strategy. During the clustering process, the documents collection did not change neither adding documents, nor removing documents. At the beginning of clustering, the documents in the collection are fixed. In the clustering Method based on this policy, an N*N similarity matrix can be generated from the beginning and there are similarity values in the matrix. As it will compare the similarity among any documents, the computation is very costly.
The second strategy is the strategy of "incremental". In many occasions, the document collection can be increased at any time in the clustering process. When adding a document, it will be merged into the existing cluster, or you can separate it as a new category. While increasing documents, it may be necessary to perform re-clustering.
There are some methods to calculate the similarity or distances between different clusters: 1) the shortest distance method (single link method). If are two different clusters, ; 2) the longest distance method. If are two different clusters, ;3) Group average method. ;4) The centric method. Mean Quantization Error (abbreviated as MQE) is adopted as convergence condition as performed by Ref. [10-12]. Since MQE can measure the average agglomeration degree of clustering results, when its value is less than a threshold such as 0.01 (which is adopted by Kohonen in Ref. ), this dynamic algorithm stops.
Where, C represents the quantity of clusters. Nj represents one neuron. Cj represents the cluster, which includes the data that are more similar to Nj than to other neurons. |Cj| represents the quantity of the data included by Cj. Di represents one datum among Cj.
Self-organizing map network (SOM, for abbreviation) is first proposed by T.Kohonen Professor in University of Helsinki in Finland, also known as the Kohonen network . Kohonen believes that a neural network will be divided into different corresponding regions while receiving outside input mode, and different regions have different response characteristics for corresponding input mode, and this process can be done automatically. SOM network has the following main properties: 1) The cluster center is the mathematical expectation of all the documents in this cluster; 2) "cluster" of input data, and maintaining the topological order. Fully trained SOM network can be viewed as a pattern classifier. By inputting a document, the neurons representing the pattern class-specific in the output layer will have the greatest response.
The self-organizing map is proposed based on this idea, which is similar to the self-organization clustering process in human brain . SOM clustering method has been successfully used in the field of digital libraries, text clustering and many other applications    .
The running process of the SOM network can be divided into two stages: training and mapping. In the training phase, the samples were input randomly. For a particular input pattern, there will be a winning node in the output layer, which produces the greatest response. At the beginning of the training phase, which node in the output layer will generate the maximum response is uncertain. When the category of the input pattern is changed, the winning node of the two-dimensional plane will also change. Due to the lateral mutual excitatory effects, Nodes around the winning node have a greater response, so all the nodes of the winning node and its neighborhood will both perform different levels of adjustment.
SOM adjust the weights of the output layer nodes with a large number of training samples, and finally each node in the output layer is sensitive to a specific pattern class. When the class characteristics of the two clusters are close, the nodes on behalf of these two clusters are also close in position.
After the training of the SOM network, the relation between output layer nodes and each input pattern can be determined, then all the input patterns can be mapped onto the nodes in the output layers, which is called mapping steps.
SOM method usually requires pre-defining the size and structure of the network. There are some methods which can achieve this purpose . The basic idea is to allow more rows or columns to be dynamically added to the network, make the network more suitable for the simulation of the real input space.
SOM method requires the definition of neighborhood function and learning rate function beforehand. There is no fixed pattern in Kohonen model on the choice of neighborhood function and learning rate function, they are generally selected based on the heuristic information . H.Yin proposed BSOM, which is SOM method based on Bayesian . The basic idea is to minimize the KL distance of the data density and neural models. KL distance can measure the distance or deviation between the environment probability density and real probability density, its value is generally a positive number. Learning process can be done within a fixed range of the winner neuron. The BSOM therefore gives a new perspective on the role of the conventional SOM neighborhood function. In addition, Filip, Mulier and Vladimir Cherkassky studied the learning rate function strategy in SOM . The experimental results show that the location of the neurons may be over affected by the last input data. Filip, Mulier, Vladimir Cherkassky has improved the learning rate function and neighborhood function, to make impact of the input training data on the neuron location more uniform.
Besides from SOM, There are also two widely used text clustering methods: AHC clustering method and K-means clustering method. The basic steps of AHC for text clustering method are as follows:
Calculate the document similarity matrix;
Each document is seen as a cluster firstly;
Merge the nearest two clusters into one;
Update the similarity matrix, i.e, re-calculating of the similarity of the new cluster with the current cluster; if there are only one cluster, then go to step 5), otherwise go to step 3);
Researchers often use two different methods to cut the hierarchical relationships. One is to use the number of clusters as segmentation standard; another method is using the similarity as the segmentation standard, that is, when the similarity between two clusters is lower than a given threshold, the clustering algorithm will stop. Besides, it has been shown that the clustering entropy  can be used as the termination conditions of the hierarchical clustering method:
The first expression in the right side of the formula is the intra-cluster entropy; the second means the inter-cluster entropy. When En is smallest, the clustering result achieves optimum value. is the center of all the samples. is the i documents for cluster j. is the center of the jth clusters. K is the number of clusters, is the number of documents in cluster j.
Randomly select K documents, which represent initial cluster centroids.
Assign each document to the cluster that has the closest centroid.
When all documents have been assigned, recalculate the K centroids.
Repeat Steps 2 and 3 until the centroids no longer change.
Output the separation of these documents, i.e. different clusters.
For K-means, if the k value selected is inappropriate or the choice of initial accumulation point is uneven, the clustering process will be delayed and the clustering results are also adversely affected. Traditionally, there are mainly two methods to select the initial cluster center: 1) randomly select k points; 2) use empirical method to select the initial cluster centers. In addition, the researchers also made some of the more complex but very effective method: 1) the gravity center method. The basic idea is: first calculate the gravity center of all the samples as the first point; then select a positive number as the minimum critical distance. Input all the samples in turn, if the input sample has distance greater than d, it will be deemed as a new clustering point; 2) the density method. Two positive numbers d1 and d2 are first set, form the ultra-dimensional ball using d1 as the radius, which density is calculated as the number of samples in that ball. Select the sample with the maximum density as the first center; select the sample with the second maximum density.
Generally, SOM has proven to be the most suitable document clustering method. It can map documents onto two-dimensional diagram to show the relationship between the different documents. SOM can depict text in more figurative and better visual way. High-dimensional space can be transformed into two-dimensional space, and the similarity between the input data in the multi-dimension space is well maintained in the two-dimensional discrete space, the degree of similarity between the high dimensional spatial data can also be transformed into the location proximity of representation space, which can maintain the topological order. SOM also has the following advantages: 1) noise immunity; 2) visualization; 3) parallel processing.
Text Clustering is a high-dimensional application and closely related to the semantic features. The above characteristics of SOM make it very suitable for text clustering.
Self-Organizing-Mapping (abbreviated as SOM) is one of the most extensively applied clustering algorithm for data analysis, because of its characteristic that its neuron topology is identical with the distribution of input data. However, the inconvenience, that it needs to predefine two parameters of cluster quantity and neuron topology, prevents it from prevailing in online situation.
As indicated by Ref. , many methods have been proposed to cluster dynamic data. For example, Dhillon et al.  proposed a dynamic clustering algorithm to help analyze the transfer of information. Unfortunately, this algorithm is time-consuming and impractical, since it needs to run several times. Ghaseminezhad and Karami  improve this algorithm by employing SOM structure, which forms an initial neuron topology at first and then dynamically tunes its topology once input data are updated. However, its neuron topology is fixed in advance and too rigid to be altered.
In order to enable neuron topology easily to be altered, some self-adaptive algorithms have been proposed. The prominent merit of them is that they don’t need to set any assumption about neuron topology in advance. For example, Melody in Ref.  initializes a neuron topology of small scale at first and then gradually expands it following the update of input data. Tseng et al in Ref.  improve this algorithm by tuning neuron topology in virtue of dynamically creating and deleting the arcs between different neurons.
Unfortunately, aforementioned self-adaptive algorithms have two defects. One is that, when neuron topology isn’t suitable for current input data, they will insert or split neurons, whereas, these newly created neurons may locate out of the area where input data distribute. The other is that, they fail to preserve topology order. Therefore, they can’t perform competitive learning as transitional SOM based algorithms, which will generate some dead neurons and they will never be tuned. The detailed discussions are indicated in Ref. .
For avoiding predefining cluster quantity, some scalable SOM based clustering algorithms are proposed, such as GSOM in Ref.  and GHSOM in Ref. . Nevertheless, neuron topologies of them are fixed as liner, cycle, square or rectangle in advance. These kinds of topologies are too rigid, and hardly to be altered.
In order to solve this problem, some topology adaptive algorithms have been proposed, such as GNG in Ref. , PSOM in Ref. , and DASH in Ref. . These algorithms free of predefining neuron topology and can automatically construct it to let it conform to the distribution of input data.
Most of the existing text clustering methods simply use word frequency vector to represent the document, with little regard to the language's own characteristics and ontological knowledge. When documents are clustered using conventional “SOM plus VSM” way, it is hard to grasp the underlying semantic knowledge and consequently the clustering quality may be adversely affected. However, we notice that the documents in same cluster are very relevant to each other even though there are few common words shared by these documents, so the relevance calculation among documents can be simplified by the relevance calculation of words in documents.
Y.C. Liu et al. have proposed a conceptional self-organizing map model (ConSOM) for text clustering, in which neurons and documents are represented by the vector in extended concept space and that in traditional feature space. It has been shown that by importing concept relevance knowledge, SOM can achieve better performance than traditional mode due to its semantic sensitivity. Figure 3 give the basic principle for ConSOM. After both extended concept space and traditional feature space are constructed, all documents and neurons are represented by two vectors: traditional vector VF purely formed by word frequency and extended concept vector VC, as shown in Fig. 3. Table 2.presents Concept Representation of Word in HowNet.
Conventional data clustering methods frequently perform unsatisfactorily for large text collections due to 3 factors:1) there are usually large number of documents to be processed; 2) the dimension is very huge for text clustering; 2) the computation complexity is very high. So it is very necessary to improve the computation speed.
As similarity computation is very crucial for text clustering, and has much impact on clustering efficiency, Y. liu and etc. propose one novel feature representation and similarity computation method to make SOM text clustering much faster. Each document is coded as the collection of some keywords extracted from the original document, and will directly be input to SOM, whereas each output layer node of SOM are coded as numerical vector as that of most Kohonen Networks.
In order to directly separate documents into different groups, ring topology is adopted as our SOM structure, thus the number of groups can be any integral values. Like Kohonen Networks, it consists of two layers, input layer and output layer; each node in output layer corresponds to one cluster. Only neurons need to be represented as high-dimension vector, whereas the document will be coded as indexes of keywords.
Figure 5 shows the ring output layer topology of V-SOM . The advantage of this topology is that sector number (node number) can be any integers, and it will be possible to reflect topic distribution of the input documents more finely and make full use of neurons. Besides, the number of neighboring neurons for each neuron is same, thus it can help avoid edge effect which usually happens by using rectangular or hexagonal topology. Neurons can be inserted gradually to avoid lack-of-use phenomenon of neurons. cluster criterion is used to find suitable network size which can reflect topic distribution of input documents.
In conclusion, SOM has obvious advantage in terms of topology preserving order, anti-noise ability. By using self-organizing map network as the main framework of the text clustering, semantic knowledge can also be easily incorporated so as to enhance the clustering effect.
First, SOM can better handle the dynamic clustering problem through various kinds of dynamic vari-structure model. E.g. V-SOM model, which combine the decomposition strategy and neuronal dynamic expansion, under the guidance of clustering criterion function, dynamically and adaptively adjust the network structure, thus the clustering results can better reflect the topic distribution of input documents.
Second, semantic knowledge can be easily integrated into the SOM. Due to the diversity and complexity of language, same concept may also have different forms of expression. The traditional “VSM+SOM” mode rely solely on the frequency of feature words, and cannot grasp and embody semantic information. We use HowNet as a source of conceptual knowledge and perform effective integration with statistical information in order to enhance the sensitive ability of the clustering. if there are clusters with hidden common concept, they will be merged into one cluster, even if they are less common words shared by these documents.
Finally, the SOM's unique training structure provides convenience for the realization of parallel clustering and incremental clustering, thus contributing to improve the efficiency of clustering. Incremental clustering also makes it more suitable for dynamic clustering of web documents.
2250total chapter downloads
Login to your personal dashboard for more detailed statistics on your publications.Access personal reporting
Edited by Magnus Johnsson
By Dimitris Kontogiorgos
Edited by Reza Fazel-Rezai
By Andrea Kübler, Elisa Holz, Tobias Kaufmann and Claudia Zickler
We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.More about us