Application of Self-Organizing Maps in Text Clustering: A Review

Yuan-Chao Liu; Ming Liu; Xiao-Long Wang

doi:10.5772/50618

Author Information

Show +

Yuan-Chao Liu*
- School of Computer Science and Technology, Harbin Institute of Technology, China
Ming Liu
- School of Computer Science and Technology, Harbin Institute of Technology, China
Xiao-Long Wang
- School of Computer Science and Technology, Harbin Institute of Technology, China

*Address all correspondence to: lyc@insun.hit.edu.cn

1. Introduction

Text clustering is one of the most important text mining research directions. Despite the loss of some details, clustering technology simplifies the structure of data set, so that people can observe the data from a macro point of view.

After clustering process, the text data set can be divided into some different clusters, making the distance between the individuals in the same cluster as small as possible, while the distance between the different categories as far away from each other as possible.

Similar as text classification, text clustering is also the technology of processing a large number of texts and gives their partition.What is different is that text clustering analysis of the text collection gives an optimal division of the category without the need for labeling the category of some documents by hand in advance, so it is an unsupervised machine learning method. By comparison, text clustering technology has strong flexibility and automatic processing capabilities, and has become an important means of effective organization and navigation of text information. Jardine and van Rijsbergen made the famous clustering hypothesis: closely associated documents belong to same category and the same request [1]. Text clustering can also act as the basic research for many other applications. It is a preprocessing step for some natural language processing applications, e.g., automatic summarization, user preference mining, or be used to improve text classification results. YC Fang, S. Parthasarathy, [2] and Charu [3] use clustering techniques to cluster users’ frequent query and then the results to update the FAQ of search engine sites.

Although both text clustering and text classification are based on the idea of class, there are still some apparent differences: the classification is based on the taxonomy, the category distribution has been known beforehand. While the purpose of text clustering is to find the topic structure of documents [4] [5] [6] [7] [8] [9] [10]. Yasemin Kural [11] made a lot of experiments and compared the clustering mode and linear array mode for search engine, the results show that the former can indeed increase information access efficiency greatly.

Although there are many clustering methods, SOM has attracted many researchers in recent years. In this chapter, we reviewed the application of Self-Organizing Maps in Text Clustering. Our recent works on SOM based text clustering are also introduced briefly. The remaining of this chapter is organized as follows. Section 2 gives a review about the advances in text clustering and SOM; section 3 presents our recent work on application of self-organizing maps in text clustering. Then in section 4 some conclusions and discussions are given.

2. The Advances In Text Clustering And SOM

2.1. Text Clustering And Its Recent Research And Development

Text clustering is an unsupervised process that is not dependent on the prior knowledge of data collection, and based solely on the similarity relationship between documents in the collection to separate the document collection into some clusters. The general mathematical description of text clustering can be depicted as follows:

Figure 1.
The main framework for text clustering system.

Suppose C = { d 1 , d 2 , … , d n } is a collection of documents to be clustered, each document d i can be represented as high-dimensional space vector d i = { w 1 , w 2 , … , w i } by the famous vector space model (VSM), where w i means the weight of d i on feature j . The purpose of text clustering is to divide C into C 1 , C 2 , … , C x , C 1 ∪ C 2 ∪ … ∪ C x = C , here 1 ≤ i ≠ j ≤ k . For hard clustering, each document can belong to only one class, i.e. C i ∩ C j = Φ . Whereas for soft clustering, one document may belong to multiple clusters. Membership degree μ i j can be used to denote how much d i belongs to cluster C j .

Compared with other data types, text data is semi-structured. This makes man database-based algorithms does not apply to text clustering.

One important preprocessing step for text clustering is to consider how the text content can be represented in the form of mathematical expression for further analysis and processing. The Common method is Salton's vector space model [12] (Vector Space Model, VSM). The basic idea is: one feature space are constructed firstly, each dimension means one term, which comes from the key words of each document. Then each document is represented as one vector in this feature space. The document vector is usually a sparse vector as the dimension is very huge.

Dimensionality reduction is an essential step in text clustering. There are several techniques to reduce the dimension of the high-dimensional feature vector. PCA (Principal Component Analysis) method is one of the widely used dimension reduction techniques. Given an n × m-order document-term matrix, the k eigenvectors of the PCA with an m × m-order covariance matrix is used to reduce the dimension of the word space, and ultimately resulted in a k-term space dimension, which is much smaller than m.

LSI (Latent Semantic the Indexing) method is also widely used in the field of information retrieval, dimensionality reduction. It is in essence similar with the PCA. LSI make singular value decomposition not on covariance matrix, but on the initial n × m-order document–term matrix, and then selecting these singular eigenvectors as representative, thereby reduces the dimension.

Another problem is how to extract important features from documents. Mark P. Sinka and David W. Corne [13] argue that stop word removal will improve the text clustering effect. They also pointed out that after obtaining all unique words in the collection, you can only keep some high-frequency words to construct the space. Anton V. Leouski and W. Bruce Crof demonstrated that for each document, it is necessary to select only some important words to represent the document, and can basically meet the needs of the cluster without impacting clustering results. Literature [14] proposed a method to extract the key words in the document as features Literature [15] use latent semantic indexing (LSI) method to compress the dimension of the clustering feature space. Besides, ZhengYu Niu [16] and STANISŁAW OSIŃSKI [17], etc also performed research on feature selection.

Assume there are five documents doc1 doc2, doc3, doc4, and doc5. For each document, the first steps are segmenting, stop word removal, and word frequency counting. In order to improve the clustering efficiency, only the words which frequency is above a certain threshold value are used to construct the feature space. Studies have shown that such a treatment will not have an adverse impact on the clustering quality. Then the feature space can be constructed by using the term set which comes from all these terms. Each document is represented as a vector in the feature space. Fig.2. depicts the preprocessing steps for text clustering.

Figure 2.
the preprocessing steps of text document for text clustering.

Suppose the feature space is (apple, banana in the cat, window), and feature words frequency threshold is 2, then the following example document-term matrix can be formed:

(Apple, banana, cat, window) doc1 = ( 5, 3, 0, 4 ) doc2 = ( 4, 6, 0, 3 ) doc3 = ( 0, 3, 7, 5 ) doc4 = ( 8, 0, 9, 0 ) doc5 = ( 5, 0, 0, 3 )

As all documents are represented as the vector in the same feature space, thus it is more convenient for computing the document similarity. In fact, the similarity calculation is very frequent for most clustering algorithms. In addition, as there are usually many common words in different documents, the actual dimension of the feature space is less than the sum of the number of words selected from each document.

The evaluation of word importance. Take a science paper as an example, it is shown that about 65% to 90% author-marked keywords can be found in the main content in the original paper[18]. This means that by importance evaluation, the key words can be extracted from documents to represent the main content. Basically, keyword extraction can be seen as a supervised machine learning problems; this idea is first proposed by Turney [19]. Turney also make a comparative study based on genetic algorithms and decision tree-based keywords extraction algorithm. Factors which can denote the word importance includes word frequency, word location (title, caption and etc.). Many researches showed that high-frequency words are the more important words. Some typical keyword extraction system has been listed in table 1.

name	websites
NRC’s Extractor	http://ai.iit.nrc.ca/II_public/extractor/
Verity’s Search 97	http://www.verity.com/
KEA	http://www.nzdl.org/Kea/
GenEX	http://extractor.iit.nrc.ca/
Microsoft office 2003	http://www.microsoft.com/
Eric Brill’s Tagger	ftp://ftp.cs.jhu.edu/pub/brill/Programs/

Table 1.

Some Classical Keyword Extraction Systems.

2.2. Two Clustering strategies in Text Clustering: whole clustering and incremental clustering

There are two common Clustering strategies, and both need to measure the similarity of the document.

The first strategy is the "complete" strategy, or called "static" strategy. During the clustering process, the documents collection did not change neither adding documents, nor removing documents. At the beginning of clustering, the documents in the collection are fixed. In the clustering Method based on this policy, an N*N similarity matrix can be generated from the beginning and there are N ( N − 1 ) / 2 similarity values in the matrix. As it will compare the similarity among any documents, the computation is very costly.

The second strategy is the strategy of "incremental"[20]. In many occasions, the document collection can be increased at any time in the clustering process. When adding a document, it will be merged into the existing cluster, or you can separate it as a new category. While increasing documents, it may be necessary to perform re-clustering.

There are some methods to calculate the similarity or distances between different clusters: 1) the shortest distance method (single link method). If G p , G q are two different clusters, D s ( p , q ) = min { d i j | i ∈ G p , j ∈ G q } ; 2) the longest distance method. If G p , G q are two different clusters, D s ( p , q ) = max { d i j | i ∈ G p , j ∈ G q } ;3) Group average method. D s 2 ( p , q ) = 1 n p n q ∑ i ∈ G p j ∈ G q d i j 2 ;4) The centric method. − x G = 1 L ∑ i = l L x i Mean Quantization Error (abbreviated as MQE) is adopted as convergence condition as performed by Ref. [10-12]. Since MQE can measure the average agglomeration degree of clustering results, when its value is less than a threshold such as 0.01 (which is adopted by Kohonen in Ref. [21]), this dynamic algorithm stops.

M Q E = ∑ j = 1 C ∑ D i ∈ C j | D i - N j | 2 | C j | C E1

Where, C represents the quantity of clusters. Nj represents one neuron. Cj represents the cluster, which includes the data that are more similar to Nj than to other neurons. |Cj| represents the quantity of the data included by Cj. Di represents one datum among Cj.

2.3. SOM And Its Application For Text Clustering

Self-organizing map network (SOM, for abbreviation) is first proposed by T.Kohonen Professor in University of Helsinki in Finland, also known as the Kohonen network [22]. Kohonen believes that a neural network will be divided into different corresponding regions while receiving outside input mode, and different regions have different response characteristics for corresponding input mode, and this process can be done automatically. SOM network has the following main properties: 1) The cluster center is the mathematical expectation of all the documents in this cluster; 2) "cluster" of input data, and maintaining the topological order. Fully trained SOM network can be viewed as a pattern classifier. By inputting a document, the neurons representing the pattern class-specific in the output layer will have the greatest response.

The self-organizing map is proposed based on this idea, which is similar to the self-organization clustering process in human brain[23] [24]. SOM clustering method has been successfully used in the field of digital libraries, text clustering and many other applications [25] [26] [27] [28].

The running process of the SOM network can be divided into two stages: training and mapping. In the training phase, the samples were input randomly. For a particular input pattern, there will be a winning node in the output layer, which produces the greatest response. At the beginning of the training phase, which node in the output layer will generate the maximum response is uncertain. When the category of the input pattern is changed, the winning node of the two-dimensional plane will also change. Due to the lateral mutual excitatory effects, Nodes around the winning node have a greater response, so all the nodes of the winning node and its neighborhood will both perform different levels of adjustment.

SOM adjust the weights of the output layer nodes with a large number of training samples, and finally each node in the output layer is sensitive to a specific pattern class. When the class characteristics of the two clusters are close, the nodes on behalf of these two clusters are also close in position.

After the training of the SOM network, the relation between output layer nodes and each input pattern can be determined, then all the input patterns can be mapped onto the nodes in the output layers, which is called mapping steps.

SOM method usually requires pre-defining the size and structure of the network. There are some methods which can achieve this purpose [29][30][31]. The basic idea is to allow more rows or columns to be dynamically added to the network, make the network more suitable for the simulation of the real input space.

SOM method requires the definition of neighborhood function and learning rate function beforehand. There is no fixed pattern in Kohonen model on the choice of neighborhood function and learning rate function, they are generally selected based on the heuristic information [32][33]. H.Yin proposed BSOM, which is SOM method based on Bayesian [34]. The basic idea is to minimize the KL distance of the data density and neural models. KL distance can measure the distance or deviation between the environment probability density and real probability density, its value is generally a positive number. Learning process can be done within a fixed range of the winner neuron. The BSOM therefore gives a new perspective on the role of the conventional SOM neighborhood function. In addition, Filip, Mulier and Vladimir Cherkassky studied the learning rate function strategy in SOM [35]. The experimental results show that the location of the neurons may be over affected by the last input data. Filip, Mulier, Vladimir Cherkassky has improved the learning rate function and neighborhood function, to make impact of the input training data on the neuron location more uniform.

2.4. The Comparison Of SOM With Other Text Clustering Methods

Besides from SOM, There are also two widely used text clustering methods: AHC clustering method and K-means clustering method. The basic steps of AHC for text clustering method are as follows:

Calculate the document similarity matrix;
Each document is seen as a cluster firstly;
Merge the nearest two clusters into one;
Update the similarity matrix, i.e, re-calculating of the similarity of the new cluster with the current cluster; if there are only one cluster, then go to step 5), otherwise go to step 3);
End.

Researchers often use two different methods to cut the hierarchical relationships. One is to use the number of clusters as segmentation standard; another method is using the similarity as the segmentation standard, that is, when the similarity between two clusters is lower than a given threshold, the clustering algorithm will stop. Besides, it has been shown that the clustering entropy [36] can be used as the termination conditions of the hierarchical clustering method:

E n = ( ∑ j − l k ∑ i − l n e ( p i ( j ) , p 0 ( j ) ) ) + ∑ j − l k e ( p 0 ( j ) , c 0 ) E2

The first expression in the right side of the formula is the intra-cluster entropy; the second means the inter-cluster entropy. When En is smallest, the clustering result achieves optimum value. c 0 is the center of all the samples. p i ( j ) is the i documents for cluster j. p 0 ( j ) is the center of the jth clusters. K is the number of clusters, n j is the number of documents in cluster j.

K-means clustering algorithm is the typical dynamic partition method [37] [38] [39] [40]. The basic steps [41] are as follows:

Randomly select K documents, which represent initial cluster centroids.
Assign each document to the cluster that has the closest centroid.
When all documents have been assigned, recalculate the K centroids.
Repeat Steps 2 and 3 until the centroids no longer change.
Output the separation of these documents, i.e. different clusters.

For K-means, if the k value selected is inappropriate or the choice of initial accumulation point is uneven, the clustering process will be delayed and the clustering results are also adversely affected. Traditionally, there are mainly two methods to select the initial cluster center: 1) randomly select k points; 2) use empirical method to select the initial cluster centers. In addition, the researchers also made some of the more complex but very effective method: 1) the gravity center method. The basic idea is: first calculate the gravity center of all the samples as the first point; then select a positive number as the minimum critical distance. Input all the samples in turn, if the input sample has distance greater than d, it will be deemed as a new clustering point; 2) the density method. Two positive numbers d1 and d2 ( d 1 d 2 ) are first set, form the ultra-dimensional ball using d1 as the radius, which density is calculated as the number of samples in that ball. Select the sample with the maximum density as the first center; select the sample with the second maximum density.

Generally, SOM has proven to be the most suitable document clustering method. It can map documents onto two-dimensional diagram to show the relationship between the different documents. SOM can depict text in more figurative and better visual way. High-dimensional space can be transformed into two-dimensional space, and the similarity between the input data in the multi-dimension space is well maintained in the two-dimensional discrete space, the degree of similarity between the high dimensional spatial data can also be transformed into the location proximity of representation space, which can maintain the topological order. SOM also has the following advantages: 1) noise immunity; 2) visualization; 3) parallel processing.

Text Clustering is a high-dimensional application and closely related to the semantic features. The above characteristics of SOM make it very suitable for text clustering.

2.5. Dynamic clustering of SOM

Self-Organizing-Mapping (abbreviated as SOM) is one of the most extensively applied clustering algorithm for data analysis, because of its characteristic that its neuron topology is identical with the distribution of input data. However, the inconvenience, that it needs to predefine two parameters of cluster quantity and neuron topology, prevents it from prevailing in online situation.

As indicated by Ref. [42][43][44], many methods have been proposed to cluster dynamic data. For example, Dhillon et al. [45] proposed a dynamic clustering algorithm to help analyze the transfer of information. Unfortunately, this algorithm is time-consuming and impractical, since it needs to run several times. Ghaseminezhad and Karami [46] improve this algorithm by employing SOM structure, which forms an initial neuron topology at first and then dynamically tunes its topology once input data are updated. However, its neuron topology is fixed in advance and too rigid to be altered.

In order to enable neuron topology easily to be altered, some self-adaptive algorithms have been proposed. The prominent merit of them is that they don’t need to set any assumption about neuron topology in advance. For example, Melody in Ref. [47] initializes a neuron topology of small scale at first and then gradually expands it following the update of input data. Tseng et al in Ref. [48] improve this algorithm by tuning neuron topology in virtue of dynamically creating and deleting the arcs between different neurons.

Unfortunately, aforementioned self-adaptive algorithms have two defects. One is that, when neuron topology isn’t suitable for current input data, they will insert or split neurons, whereas, these newly created neurons may locate out of the area where input data distribute. The other is that, they fail to preserve topology order. Therefore, they can’t perform competitive learning as transitional SOM based algorithms, which will generate some dead neurons and they will never be tuned. The detailed discussions are indicated in Ref. [49][50].

For avoiding predefining cluster quantity, some scalable SOM based clustering algorithms are proposed, such as GSOM in Ref. [51] and GHSOM in Ref. [52]. Nevertheless, neuron topologies of them are fixed as liner, cycle, square or rectangle in advance. These kinds of topologies are too rigid, and hardly to be altered.

In order to solve this problem, some topology adaptive algorithms have been proposed, such as GNG in Ref. [53], PSOM in Ref. [54], and DASH in Ref. [55]. These algorithms free of predefining neuron topology and can automatically construct it to let it conform to the distribution of input data.

3. Our Recent Work On Application Of Self-Organizing Maps In Text Clustering

3.1. The Conceptual SOM Model For Text Clustering

Most of the existing text clustering methods simply use word frequency vector to represent the document, with little regard to the language's own characteristics and ontological knowledge. When documents are clustered using conventional “SOM plus VSM” way, it is hard to grasp the underlying semantic knowledge and consequently the clustering quality may be adversely affected. However, we notice that the documents in same cluster are very relevant to each other even though there are few common words shared by these documents, so the relevance calculation among documents can be simplified by the relevance calculation of words in documents.

Y.C. Liu et al. have proposed a conceptional self-organizing map model (ConSOM) [56]for text clustering, in which neurons and documents are represented by the vector in extended concept space and that in traditional feature space. It has been shown that by importing concept relevance knowledge, SOM can achieve better performance than traditional mode due to its semantic sensitivity. Figure 3 give the basic principle for ConSOM. After both extended concept space and traditional feature space are constructed, all documents and neurons are represented by two vectors: traditional vector VF purely formed by word frequency and extended concept vector VC, as shown in Fig. 3. Table 2.presents Concept Representation of Word in HowNet.

Figure 3.
The basic principle of ConSOM.

Table 2.
Concept Representation of Word in HowNet.

3.2. Fast SOM Clustering Method For Large-Scale Text Clustering

Conventional data clustering methods frequently perform unsatisfactorily for large text collections due to 3 factors:1) there are usually large number of documents to be processed; 2) the dimension is very huge for text clustering; 2) the computation complexity is very high. So it is very necessary to improve the computation speed.

As similarity computation is very crucial for text clustering, and has much impact on clustering efficiency, Y. liu and etc[57]. propose one novel feature representation and similarity computation method to make SOM text clustering much faster. Each document is coded as the collection of some keywords extracted from the original document, and will directly be input to SOM, whereas each output layer node of SOM are coded as numerical vector as that of most Kohonen Networks.

In order to directly separate documents into different groups, ring topology is adopted as our SOM structure, thus the number of groups can be any integral values. Like Kohonen Networks, it consists of two layers, input layer and output layer; each node in output layer corresponds to one cluster. Only neurons need to be represented as high-dimension vector, whereas the document will be coded as indexes of keywords.

3.3. The Variant Of SOM Model For Dynamic Text Clustering

Figure 5 shows the ring output layer topology of V-SOM [58]. The advantage of this topology is that sector number (node number) can be any integers, and it will be possible to reflect topic distribution of the input documents more finely and make full use of neurons. Besides, the number of neighboring neurons for each neuron is same, thus it can help avoid edge effect which usually happens by using rectangular or hexagonal topology. Neurons can be inserted gradually to avoid lack-of-use phenomenon of neurons. R 2 cluster criterion is used to find suitable network size which can reflect topic distribution of input documents.

Figure 4.
The rectangular topology of GHSOM (N10, N11, N12 in Figure1. (b) are the newly inserted neurons).

Figure 5.
The Ring Topology of V-SOM. N10 Is The Inserted Node In Figure (b).

4. Conclusions and discission

In conclusion, SOM has obvious advantage in terms of topology preserving order, anti-noise ability. By using self-organizing map network as the main framework of the text clustering, semantic knowledge can also be easily incorporated so as to enhance the clustering effect.

First, SOM can better handle the dynamic clustering problem through various kinds of dynamic vari-structure model. E.g. V-SOM model, which combine the decomposition strategy and neuronal dynamic expansion, under the guidance of clustering criterion function, dynamically and adaptively adjust the network structure, thus the clustering results can better reflect the topic distribution of input documents.

Second, semantic knowledge can be easily integrated into the SOM. Due to the diversity and complexity of language, same concept may also have different forms of expression. The traditional “VSM+SOM” mode rely solely on the frequency of feature words, and cannot grasp and embody semantic information. We use HowNet as a source of conceptual knowledge and perform effective integration with statistical information in order to enhance the sensitive ability of the clustering. if there are clusters with hidden common concept, they will be merged into one cluster, even if they are less common words shared by these documents.

Finally, the SOM's unique training structure provides convenience for the realization of parallel clustering and incremental clustering, thus contributing to improve the efficiency of clustering. Incremental clustering also makes it more suitable for dynamic clustering of web documents.

References

1. Jardine N. van Rijsbergen C. J. 1971 The use of hierarchic clustering in information retrieval Information Storage and Retrieval 7 1 217 240
2. Fang Y. C. Parthasarathy S. Schwartz F. 2002 Using Clustering to Boost Text Classification In Proceedings of the IEEE ICDM Workshop on Text Mining 101 112
3. Charu 2004 On using Partial Supervision for Text Categorization IEEE Transactions On Knowledge And Data Engineering 16 2 245 258
4. Croft W. B. 1978 Organizing and searching large files of documents Ph.D. Thesis University of Cambridge
5. Hearst M. A. Pedersen J. O. 1996 Reexamining the cluster hypothesis: Scatter/Gather on retrieval results Proceedings of the 19th International ACM SIGIR Conference on Research and Development in InformationRetrieval (SIGIR’96) 76 84
6. Leouski A. V. Croft W. B. 1996 An evaluation of techniques for clustering search results Technical Report IR-76 Department of Computer Science, University of Massachusetts Amherst
7. Allen R. B. Obry P. Littman M. 1993 An interface for navigating clustered document sets returned by queries Proceedings of the ACM Conference on Organizational Computing Systems 166 171
8. Cutting Douglass R. Karger David R. etc. Scatter/Gather 1992 A Cluster-based Approach to Browsing Large Document Collections SIGIR’92 318 329
9. Voorhees E. M. 1986 The efficiency of Inverted Index and Cluster Searches In: Proceedings of the ACM Conference on R&D in IR. Pisa 1986 164 174
10. El -Hamdouchi A. Willett P. 1989 Comparison of Hierarchic Agglomerative Clustering Methods for Document Retrieval The Computer Journal 32 3 220 227
11. Kural Yasemin Robertson Steve Jones Susan 2001 Clustering Information Retrieval Search Outputs Information Processing & Management 1630 1700
12. Salton G. Wong A. Yang C. 1975 A vector space model for automatic indexing Communications of the ACM 18 11 613 620 .
13. Seung-Shik Kang 2003 Keyword-based Document Clustering The 6th International Workshop on Information Retrieval with Asian Languages 132 137
14. Lerman K. 2004 Document Clustering in Reduced Dimension Vector Space Proceedings of CSAW’04
15. Niu Z. Y. Ji D. H. Tan C. L. 2004 Document clustering based on cluster validation 13th Conference on Information and Knowledge Management. CIKM 2004 Washington DC, USA 501 506
16. Osiński S. 2004 Dimensionality Reduction Techniques for Search Results Clustering MSc. thesis University of Sheffield UK
17. Wang B. B. Mc Kay R. I. Hussein A. Abbass et al. 2003 A comparative study for domain ontology guided feature extraction1 In Proc of 26th Australian Computer Science Conference (ACSC2003) Darlinghurst, Australia Australian Computer Society Inc. 69 78
18. Witten I. H. Paynter G. W. Frank E. Gutwin C. Nevill-Manning C. G. 1999 KEA: Practical automatic keyphrase extraction Proceedings of DigitalLibraries 99 (DL’99) ACM Press 254 256
19. Turney P. Mining the Web for Lexical Knowledge to Improve Keyphase Extraction: Learning from Labeled and Unlabeled Data P. Source: NRC/ERB-1096 2002 July NRC Publication Number: NRC 44947.2002
20. Zamir O. Etzioni O. 1999 Grouper: A dynamic clustering interface to web search results Computer networks 31 1361 1374
21. Herbert J. P. Yao J. T. 2009 A Granular Computing Framework for Self-Organizing Maps Neurocomputing 72 2865 2872
22. Niklasson L. Bodén M. Ziemke 1998 Self-organization of very large document collections: State of the art Proceedings of ICANN98, the 8th International Conference on Artificial Neural Networks 65 74
23. Lin X. Soergel D. Marchionini G. 1991 A self-organizing semantic map for information retrieval Proceedings of the annual international ACM SIGIR conference on research and development in information retrieval 262 269
24. Su K. Lamu-Chun Chang Hsiao-te Chou Chien-hsing 1996 Approach to interactive exploration In proc int’l conf knowledge discovery and data mining(KDD’96) 238 243
25. Miikkulainen R. 1990 Script recognition with hierarchical feature maps Connection science 2 83 101
26. Merkl D. 1993 Structuring software for reuse: the case of self-organizing maps Int. Joint Conf. on Neural Networks III Piscataway, NJ, IEEE Service Center 1993 2468 2471
27. Roussinov D. Ramsey M. 1998 Information forage through adaptive visualization The Third ACM Conference on Digital Libraries 303 304
28. Rauber 1999 LabelSOM: On the labeling of self-ofganizing maps In proc int’l joint conf neural networks(IJCNN’99)
29. Martinetz T. M. Schulten K. J. 1991 A “neural-gas’’ network learns topologies’ in Kohonen Artificial neural networks 397 402
30. Fritzke B. 1995 Growing grid-a self-organising network with constant neighbourhood range and adaptation strength Neural Process. Letters 2 9 13
31. Bauer Ha. Villmann T. 1997 Growing a hypercubical output space in a self-organising feature map IEEE Transactions on Neural Networks NN-8 2 218 226
32. Ritter H. Martinetz T. Schulten K. 1992 Heurd Computation and Self-Organizinp Maps: Introduction Addison-Wesley
33. Kohonen T. 1990 The self-organizing map Proc. of the IEEE 9 1464 1479
34. Yin H. Allinson N. M. 1990 Bayesian self-organising map for gaussian mixtures IEEE Proceedings: Vision, Image and Signal Processing 148 4 234 240
35. Mulier F. Cherkassky V. 1994 Learning rate schedules for self-organizing maps In Proceedings of 12th International Conference on Pattern Recognition 2 224 228
36. Jung Yunjae 2001 Design and Evaluation of Clustering Criterion for Optimal Hierarchical Agglomerative Clustering Phd. thesis University of Minnesota
37. Steinbach M. Karypis G. Kumar V. 2000 A comparison of document clustering techniques In KDD Workshop on Text Mining
38. Larsen Bjornar Chinatsu Aone 1999 Fast and effective text mining using linear-time document clustering In Proc. of the Fifth ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining 16 22
39. Aggarwal Charu C. Gates Stephen C. Yu Philip S. 1999 On the merits of building categorization systems by supervised clustering In Proc.of the Fifth ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining 352 356
40. Cutting D. R. Pedersen J. O. Karger D. R. Tukey J. W. 1992 Scatter/gather: A cluster-based approach to browsing large document collections In Proceedings of the ACM SIGIR Copenhagen 318 329
41. Luke Brian T. 1999 K-Means Clustering http://fconyx.ncifcrf.gov/~lukeb/kmeans.html
42. Martin S. Detlef N. 2006 Towards the Automation of Intelligent Data Analysis Applied Soft Computing 6 348 356
43. Zhou X. Y. Sun Z. H. Zhang B. L. Yang Y. D. 2006 Research on Clustering and Evolution Analysis of High Dimensional Data Stream Journal of Computer Research and Development 43 2005 2011
44. Huang S. Chen Z. Yu Y. Ma W. Y. 2006 Multitype Features Coselection for Web Document Clustering IEEE Transactions on Knowledge and Data Engineering 18 448 459
45. Dhillon I. S. Guan Y. Q. Kogan J. 2002 Iterative Clustering of High Dimensional Text Data Augmented by Local Search In: Proceedings of the Second IEEE International Conference on Data Mining 131 138 IEEE Press Japan
46. Ghaseminezhad M. H. Karami A. 2011 A Novel Self-Organizing Map (SOM) Neural Network for Discrete Groups of Data Clustering Applied Soft Computing 11 3771 3778
47. Melody Y. K. 2001 Extending the Kohonen Self-Organizing Map Networks for Clustering Analysis Computational Statistics & Data Analysis 38 161 180
48. Tseng C. L. Chen Y. H. Xu Y. Y. Pao H. T. Fu H. C. 2004 A Self-Growing Probabilistic Decision-Based Neural Network with Automatic Data Clustering Neurocomputing 61 21 38
49. Tsai C. F. Tsai C. W. Wu H. C. Yang T. 2004 ACODF: A Novel Data Clustering Approach for Data Mining in Large Databases Journal of Systems and Software 73 133 145
50. Lee S. Kim G. Kim S. 2011 Self-Adaptive and Dynamic Clustering for Online Anomaly Detection Expert Systems with Applications 38 14891 14898
51. Alahakoon D. S. Halganmuge K. Srinivasan B. 2000 Dynamic self-organizing maps with controlled growth for knowledge discovery IEEE Transactions on Neural Networks 11 3 601 614
52. Merkl Rauber D. Dittenbach M. 2002 The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data IEEE Transactions on Neural Networks 13 6 1331 1341
53. Qin A.-K. Suganthan P.-N. 2004 Robust growing neural gas algorithm with application in cluster analysis Neural Networks 17 8-9 1135 1148
54. L. Robert K. Warwick K. 2002 The plastic self organising map Proceedings of the 2002 International Joint Conference on Neural Networks, IEEE Hawaii 727 732
55. Hung C. Wermter S. 2003 A dynamic adaptive self-organising hybrid model for text clustering Proceedings of the Third IEEE International Conference on Data Mining, IEEE Melbourne Florida, USA 75 82
56. Liu Yuanchao Wang Xiaolong Wu Chong ConSOM: A conceptional self-organizing map model for text clustering Neurocomputing 71 4-6 2008 January 857 862
57. Liu Yuan-chao Wu Chong Liu Ming Research of fast SOM clustering for text information Expert Systems with Applications 38 8 2011 August 9325 9333
58. Liu Yuanchao Wang Xiaolong Liu Ming 2009 V-SOM: A Text Clustering Method based on Dynamic SOM Model Journal of Computational Information Systems 5 1 141 145

[1] 1. Jardine N. van Rijsbergen C. J. 1971 The use of hierarchic clustering in information retrieval Information Storage and Retrieval 7 1 217 240

[2] 2. Fang Y. C. Parthasarathy S. Schwartz F. 2002 Using Clustering to Boost Text Classification In Proceedings of the IEEE ICDM Workshop on Text Mining 101 112

[3] 3. Charu 2004 On using Partial Supervision for Text Categorization IEEE Transactions On Knowledge And Data Engineering 16 2 245 258

[4] 4. Croft W. B. 1978 Organizing and searching large files of documents Ph.D. Thesis University of Cambridge

[5] 5. Hearst M. A. Pedersen J. O. 1996 Reexamining the cluster hypothesis: Scatter/Gather on retrieval results Proceedings of the 19th International ACM SIGIR Conference on Research and Development in InformationRetrieval (SIGIR’96) 76 84

[6] 6. Leouski A. V. Croft W. B. 1996 An evaluation of techniques for clustering search results Technical Report IR-76 Department of Computer Science, University of Massachusetts Amherst

[7] 7. Allen R. B. Obry P. Littman M. 1993 An interface for navigating clustered document sets returned by queries Proceedings of the ACM Conference on Organizational Computing Systems 166 171

[8] 8. Cutting Douglass R. Karger David R. etc. Scatter/Gather 1992 A Cluster-based Approach to Browsing Large Document Collections SIGIR’92 318 329

[9] 9. Voorhees E. M. 1986 The efficiency of Inverted Index and Cluster Searches In: Proceedings of the ACM Conference on R&D in IR. Pisa 1986 164 174

[10] 10. El -Hamdouchi A. Willett P. 1989 Comparison of Hierarchic Agglomerative Clustering Methods for Document Retrieval The Computer Journal 32 3 220 227

[11] 11. Kural Yasemin Robertson Steve Jones Susan 2001 Clustering Information Retrieval Search Outputs Information Processing & Management 1630 1700

[12] 12. Salton G. Wong A. Yang C. 1975 A vector space model for automatic indexing Communications of the ACM 18 11 613 620 .

[13] 13. Seung-Shik Kang 2003 Keyword-based Document Clustering The 6th International Workshop on Information Retrieval with Asian Languages 132 137

[14] 14. Lerman K. 2004 Document Clustering in Reduced Dimension Vector Space Proceedings of CSAW’04

[15] 15. Niu Z. Y. Ji D. H. Tan C. L. 2004 Document clustering based on cluster validation 13th Conference on Information and Knowledge Management. CIKM 2004 Washington DC, USA 501 506

[16] 16. Osiński S. 2004 Dimensionality Reduction Techniques for Search Results Clustering MSc. thesis University of Sheffield UK

[17] 17. Wang B. B. Mc Kay R. I. Hussein A. Abbass et al. 2003 A comparative study for domain ontology guided feature extraction1 In Proc of 26th Australian Computer Science Conference (ACSC2003) Darlinghurst, Australia Australian Computer Society Inc. 69 78

[18] 18. Witten I. H. Paynter G. W. Frank E. Gutwin C. Nevill-Manning C. G. 1999 KEA: Practical automatic keyphrase extraction Proceedings of DigitalLibraries 99 (DL’99) ACM Press 254 256

[19] 19. Turney P. Mining the Web for Lexical Knowledge to Improve Keyphase Extraction: Learning from Labeled and Unlabeled Data P. Source: NRC/ERB-1096 2002 July NRC Publication Number: NRC 44947.2002

[20] 20. Zamir O. Etzioni O. 1999 Grouper: A dynamic clustering interface to web search results Computer networks 31 1361 1374

[21] 21. Herbert J. P. Yao J. T. 2009 A Granular Computing Framework for Self-Organizing Maps Neurocomputing 72 2865 2872

[22] 22. Niklasson L. Bodén M. Ziemke 1998 Self-organization of very large document collections: State of the art Proceedings of ICANN98, the 8th International Conference on Artificial Neural Networks 65 74

[23] 23. Lin X. Soergel D. Marchionini G. 1991 A self-organizing semantic map for information retrieval Proceedings of the annual international ACM SIGIR conference on research and development in information retrieval 262 269

[24] 24. Su K. Lamu-Chun Chang Hsiao-te Chou Chien-hsing 1996 Approach to interactive exploration In proc int’l conf knowledge discovery and data mining(KDD’96) 238 243

[25] 25. Miikkulainen R. 1990 Script recognition with hierarchical feature maps Connection science 2 83 101

[26] 26. Merkl D. 1993 Structuring software for reuse: the case of self-organizing maps Int. Joint Conf. on Neural Networks III Piscataway, NJ, IEEE Service Center 1993 2468 2471

[27] 27. Roussinov D. Ramsey M. 1998 Information forage through adaptive visualization The Third ACM Conference on Digital Libraries 303 304

[28] 28. Rauber 1999 LabelSOM: On the labeling of self-ofganizing maps In proc int’l joint conf neural networks(IJCNN’99)

[29] 29. Martinetz T. M. Schulten K. J. 1991 A “neural-gas’’ network learns topologies’ in Kohonen Artificial neural networks 397 402

[30] 30. Fritzke B. 1995 Growing grid-a self-organising network with constant neighbourhood range and adaptation strength Neural Process. Letters 2 9 13

[31] 31. Bauer Ha. Villmann T. 1997 Growing a hypercubical output space in a self-organising feature map IEEE Transactions on Neural Networks NN-8 2 218 226

[32] 32. Ritter H. Martinetz T. Schulten K. 1992 Heurd Computation and Self-Organizinp Maps: Introduction Addison-Wesley

[33] 33. Kohonen T. 1990 The self-organizing map Proc. of the IEEE 9 1464 1479

[34] 34. Yin H. Allinson N. M. 1990 Bayesian self-organising map for gaussian mixtures IEEE Proceedings: Vision, Image and Signal Processing 148 4 234 240

[35] 35. Mulier F. Cherkassky V. 1994 Learning rate schedules for self-organizing maps In Proceedings of 12th International Conference on Pattern Recognition 2 224 228

[36] 36. Jung Yunjae 2001 Design and Evaluation of Clustering Criterion for Optimal Hierarchical Agglomerative Clustering Phd. thesis University of Minnesota

[37] 37. Steinbach M. Karypis G. Kumar V. 2000 A comparison of document clustering techniques In KDD Workshop on Text Mining

[38] 38. Larsen Bjornar Chinatsu Aone 1999 Fast and effective text mining using linear-time document clustering In Proc. of the Fifth ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining 16 22

[39] 39. Aggarwal Charu C. Gates Stephen C. Yu Philip S. 1999 On the merits of building categorization systems by supervised clustering In Proc.of the Fifth ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining 352 356

[40] 40. Cutting D. R. Pedersen J. O. Karger D. R. Tukey J. W. 1992 Scatter/gather: A cluster-based approach to browsing large document collections In Proceedings of the ACM SIGIR Copenhagen 318 329

[41] 41. Luke Brian T. 1999 K-Means Clustering http://fconyx.ncifcrf.gov/~lukeb/kmeans.html

[42] 42. Martin S. Detlef N. 2006 Towards the Automation of Intelligent Data Analysis Applied Soft Computing 6 348 356

[43] 43. Zhou X. Y. Sun Z. H. Zhang B. L. Yang Y. D. 2006 Research on Clustering and Evolution Analysis of High Dimensional Data Stream Journal of Computer Research and Development 43 2005 2011

[44] 44. Huang S. Chen Z. Yu Y. Ma W. Y. 2006 Multitype Features Coselection for Web Document Clustering IEEE Transactions on Knowledge and Data Engineering 18 448 459

[45] 45. Dhillon I. S. Guan Y. Q. Kogan J. 2002 Iterative Clustering of High Dimensional Text Data Augmented by Local Search In: Proceedings of the Second IEEE International Conference on Data Mining 131 138 IEEE Press Japan

[46] 46. Ghaseminezhad M. H. Karami A. 2011 A Novel Self-Organizing Map (SOM) Neural Network for Discrete Groups of Data Clustering Applied Soft Computing 11 3771 3778

[47] 47. Melody Y. K. 2001 Extending the Kohonen Self-Organizing Map Networks for Clustering Analysis Computational Statistics & Data Analysis 38 161 180

[48] 48. Tseng C. L. Chen Y. H. Xu Y. Y. Pao H. T. Fu H. C. 2004 A Self-Growing Probabilistic Decision-Based Neural Network with Automatic Data Clustering Neurocomputing 61 21 38

[49] 49. Tsai C. F. Tsai C. W. Wu H. C. Yang T. 2004 ACODF: A Novel Data Clustering Approach for Data Mining in Large Databases Journal of Systems and Software 73 133 145

[50] 50. Lee S. Kim G. Kim S. 2011 Self-Adaptive and Dynamic Clustering for Online Anomaly Detection Expert Systems with Applications 38 14891 14898

[51] 51. Alahakoon D. S. Halganmuge K. Srinivasan B. 2000 Dynamic self-organizing maps with controlled growth for knowledge discovery IEEE Transactions on Neural Networks 11 3 601 614

[52] 52. Merkl Rauber D. Dittenbach M. 2002 The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data IEEE Transactions on Neural Networks 13 6 1331 1341

[53] 53. Qin A.-K. Suganthan P.-N. 2004 Robust growing neural gas algorithm with application in cluster analysis Neural Networks 17 8-9 1135 1148

[54] 54. L. Robert K. Warwick K. 2002 The plastic self organising map Proceedings of the 2002 International Joint Conference on Neural Networks, IEEE Hawaii 727 732

[55] 55. Hung C. Wermter S. 2003 A dynamic adaptive self-organising hybrid model for text clustering Proceedings of the Third IEEE International Conference on Data Mining, IEEE Melbourne Florida, USA 75 82

[56] 56. Liu Yuanchao Wang Xiaolong Wu Chong ConSOM: A conceptional self-organizing map model for text clustering Neurocomputing 71 4-6 2008 January 857 862

[57] 57. Liu Yuan-chao Wu Chong Liu Ming Research of fast SOM clustering for text information Expert Systems with Applications 38 8 2011 August 9325 9333

[58] 58. Liu Yuanchao Wang Xiaolong Liu Ming 2009 V-SOM: A Text Clustering Method based on Dynamic SOM Model Journal of Computational Information Systems 5 1 141 145

Application of Self-Organizing Maps in Text Clustering: A Review

Applications of Self-Organizing Maps

Author Information

Yuan-Chao Liu*

Ming Liu

Xiao-Long Wang

1. Introduction

2. The Advances In Text Clustering And SOM

2.1. Text Clustering And Its Recent Research And Development

Figure 1.

Figure 2.

Table 1.

2.2. Two Clustering strategies in Text Clustering: whole clustering and incremental clustering

2.3. SOM And Its Application For Text Clustering

2.4. The Comparison Of SOM With Other Text Clustering Methods

2.5. Dynamic clustering of SOM

3. Our Recent Work On Application Of Self-Organizing Maps In Text Clustering

3.1. The Conceptual SOM Model For Text Clustering

Figure 3.

Table 2.

3.2. Fast SOM Clustering Method For Large-Scale Text Clustering

3.3. The Variant Of SOM Model For Dynamic Text Clustering

Figure 4.

Figure 5.

4. Conclusions and discission

References

Non-Linear Spatial Patterning in Cultural Site Formation Processes - The Evidence from Micro-Artefacts in Cores from a Neolithic Tell Site in Greece

Application of Self-Organizing Maps in Text Clustering: A Review

Applications of Self-Organizing Maps

Author Information

Yuan-Chao Liu*

Ming Liu

Xiao-Long Wang

1. Introduction

2. The Advances In Text Clustering And SOM

2.1. Text Clustering And Its Recent Research And Development

Figure 1.

Figure 2.

Table 1.

2.2. Two Clustering strategies in Text Clustering: whole clustering and incremental clustering

2.3. SOM And Its Application For Text Clustering

2.4. The Comparison Of SOM With Other Text Clustering Methods

2.5. Dynamic clustering of SOM

3. Our Recent Work On Application Of Self-Organizing Maps In Text Clustering

3.1. The Conceptual SOM Model For Text Clustering

Figure 3.

Table 2.

3.2. Fast SOM Clustering Method For Large-Scale Text Clustering

3.3. The Variant Of SOM Model For Dynamic Text Clustering

Figure 4.

Figure 5.

4. Conclusions and discission

References

Continue reading from the same book

Applications of Self-Organizing Maps