Information-Theoretic Clustering and Algorithms

Clustering is the task of partitioning objects into clusters on the basis of certain criteria so that objects in the same cluster are similar. Many clustering methods have been proposed over the past several decades. Since clustering results depend on criteria and algorithms, appropriate selection of both is an essential problem. Recently, large sets of users' behavior logs and text documents have become common. These are often represented as high-dimensional and sparse vectors. This chapter introduces information-theoretic clustering (ITC), which is appropriate and useful for analyzing such high-dimensional data, from both theoretical and experimental sides. Theoretically, the criterion, generative models, and novel algorithms are shown. Experimentally, the effectiveness and usefulness of ITC are demonstrated for text analysis as an important example.


Introduction
Clustering is the task of partitioning objects into clusters on the basis of certain criteria so that objects in the same cluster are similar. It is a fundamental procedure to analyze data [1,2].
Clustering is unsupervised and thus different from supervised classification. In supervised classification, we have a set of labeled data (belonging to predefined classes), train a classifier using the labeled data (training set), and use the classifier to judge which class a new object belongs to. In the case of clustering, we find meaningful clusters without using any labeled data and group a given collection of unlabeled data into them. Clustering can also help us find meaningful classes (labels) for supervised classification. Since it is more difficult to prepare training sets for larger data sets, unsupervised analysis of data such as clustering has recently become more important.
For example, the user-item matrix in Table 1 shows which items each user bought. When treating the data as a set of feature vectors for users, we can find many types of user behavior by clustering. It is also possible to analyze the data as a set of feature vectors for items. From the word-document matrix in Table 2, both document clusters and word clusters could be extracted.
Many clustering methods have been proposed over the past several decades. These include the k-means algorithm [3], competitive learning [4], spherical clustering [5], spectral clustering [6], and maximum margin clustering [7]. Since clustering results depend on criteria and algorithms, appropriate selection of both is an essential problem. Large sets of users' behavior logs and text documents (as shown in Tables 1 and 2) have recently become common. These are often represented as high-dimensional and sparse vectors. This chapter introduces information-theoretic clustering [8] and algorithms that are appropriate and useful for analyzing such high-dimensional data.
Information-theoretic clustering (ITC) uses Kullback-Leibler divergence and Jensen-Shannon divergence to determine its criterion, while the k-means algorithm uses the sum of squared error as its criterion. This chapter explains ITC by contrasting these two clustering techniques (criteria and algorithms), because there are a number of interesting similarities between them. A difficulty exists in designing algorithms for ITC; we explain it in detail and propose novel algorithms to overcome it.
Experimental results for text data sets are presented to show the effectiveness and usefulness of ITC and the novel algorithms for it. In the experiments, maximum margin clustering and spherical clustering are used for comparison. We also provide evidence to support the effectiveness of ITC through detailed analysis of the clustering results.

The sum-of-squared-error criterion and algorithms
Given a set of M-dimensional input vectors $X = \{x_i \mid x_i \in \mathbb{R}^M, i = 1, \dots, N\}$, where N is the number of vectors, clustering is the task of assigning each input vector $x_i$ a cluster label $k$ ($k = 1, \dots, K$) to partition them into K clusters $C = \{C_1, \dots, C_K\}$. The sum-of-squared-error criterion [9] is a simple and widely used criterion for clustering.
Let $\mu_k$ be the mean of the input vectors $x_i$ which belong to the cluster $C_k$ (see Figure 1). Then, the error in $C_k$ is the sum of squared lengths of the differential (= "error") vectors $\|x_i - \mu_k\|^2$, and the sum-of-squared-error criterion over all clusters (within-cluster sum of squares) is defined by

$$J_W = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2. \qquad (1)$$

$J_W$ is the objective function (criterion) to be minimized in clustering based on this criterion.
Also, we define the between-cluster sum of squares $J_B$ and the total sum of squares $J_T$ as

$$J_B = \sum_{k=1}^{K} N_k \|\mu_k - \mu\|^2, \qquad J_T = \sum_{i=1}^{N} \|x_i - \mu\|^2,$$

respectively, where $N_k$ is the number of input vectors $x_i$ in $C_k$ (i.e., $N = \sum_{k=1}^{K} N_k$) and $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$ is the mean of all the input vectors. It follows from these definitions that the total sum of squares is the sum of the within-cluster sum of squares and the between-cluster sum of squares:

$$J_T = J_W + J_B.$$

Since the mean $\mu$ of all the input vectors is constant for the given input vectors X, $J_T$ is also constant. Therefore, minimization of $J_W$ is equivalent to maximization of $J_B$. In this sense, clustering based on minimizing the criterion $J_W$ works to find clusters that are well separated from each other.
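As a concrete check of this decomposition, the following minimal NumPy sketch (variable and function names are illustrative, not from the chapter) computes $J_W$, $J_B$, and $J_T$ for arbitrary cluster assignments and verifies that $J_T = J_W + J_B$.

```python
import numpy as np

def sse_decomposition(X, labels, K):
    """Within-cluster (J_W), between-cluster (J_B), and total (J_T) sums of squares."""
    mu = X.mean(axis=0)                                   # mean of all input vectors
    J_W = J_B = 0.0
    for k in range(K):
        Xk = X[labels == k]
        if len(Xk) == 0:
            continue
        mu_k = Xk.mean(axis=0)                            # cluster mean
        J_W += float(np.sum((Xk - mu_k) ** 2))            # within-cluster sum of squares
        J_B += len(Xk) * float(np.sum((mu_k - mu) ** 2))  # between-cluster sum of squares
    J_T = float(np.sum((X - mu) ** 2))                    # total sum of squares
    return J_W, J_B, J_T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
labels = rng.integers(0, 3, size=100)
J_W, J_B, J_T = sse_decomposition(X, labels, K=3)
print(abs((J_W + J_B) - J_T) < 1e-9)                      # True: J_T = J_W + J_B
```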

Generative model
Behind clustering based on the objective function (criterion) $J_W$, there exists an assumption of a Gaussian distribution for the input vectors [10]. Suppose that there are clusters $C_k$ ($k = 1, \dots, K$), which generate input vectors according to the conditional probability density function

$$p(x_i \mid C_k) = \frac{1}{(2\pi\sigma_k^2)^{M/2}} \exp\left(-\frac{\|x_i - \mu_k\|^2}{2\sigma_k^2}\right),$$

where $\sigma_k$ is the standard deviation of the cluster $C_k$ and M is the dimensionality of $x_i$. In what follows, we assume that $\sigma_k$ takes a constant value $\sigma$ for all clusters $C_k$ ($k = 1, \dots, K$). Assuming that each generation is independent, the joint probability density function for the input vectors X becomes

$$p(X \mid C) = \prod_{k=1}^{K} \prod_{x_i \in C_k} p(x_i \mid C_k), \qquad (6)$$

where C indicates the cluster information that specifies which cluster $C_k$ each input vector $x_i$ belongs to. Taking the logarithm of Eq. (6) yields

$$\log p(X \mid C) = -\frac{1}{2\sigma^2} \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 - \frac{NM}{2}\log(2\pi\sigma^2). \qquad (7)$$

Since $\sigma$ is constant, the maximization of Eq. (7) is equivalent to the minimization of

$$\sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2, \qquad (8)$$

which is nothing more or less than the objective function (criterion) $J_W$. Therefore, under the assumption of a Gaussian distribution for the input vectors, clustering based on Eq. (8) works to find the most probable solution C.

k-means algorithm
k-means [3,11] is a well-known algorithm for clustering based on the sum-of-squared-error criterion. The main idea of this algorithm is as follows. In the objective function $J_W$ (1), the error for a vector x is calculated as $\|x - \mu_k\|^2$, where $\mu_k$ is the mean of the cluster $C_k$ to which x belongs. If $\|x - \mu_t\|^2 < \|x - \mu_k\|^2$, changing the cluster from $C_k$ to $C_t$ can reduce the objective function $J_W$.
We introduce weight vectors $w_k$ ($k = 1, \dots, K$) (collectively W) that represent the clusters $C_k$ to implement the idea mentioned above. The weight vector $w_k$ plays the roles of both the mean vector $\mu_k$ and the prototype vector of cluster $C_k$. As illustrated in Figure 2, the idea of k-means is the alternating repetition of two steps: "(a) Update weights" (calculating the mean $\mu_k$ as the weight vector $w_k$) and "(b) Update clusters" (allocating each input vector $x_i$ to a cluster $C_k$ on the basis of the minimum distance to the weight vectors $w_k$). Note that Figure 2b is a Voronoi tessellation determined by the weight vectors $w_k$, which are usually called prototype vectors in this context.
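The two alternating steps can be written compactly as follows. This is a minimal sketch assuming random-labeling initialization (discussed below); names and defaults are illustrative rather than the chapter's notation.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain k-means: alternate (a) update weights (cluster means) and (b) update clusters."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))               # random-labeling initialization
    W = np.empty((K, X.shape[1]))
    for _ in range(n_iter):
        for k in range(K):                                 # (a) update weights: w_k = mean of C_k
            members = X[labels == k]
            W[k] = members.mean(axis=0) if len(members) else X[rng.integers(len(X))]
        dist = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)                   # (b) update clusters: nearest weight
        if np.array_equal(new_labels, labels):             # converged to a local optimum
            break
        labels = new_labels
    return labels, W
```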

Figure 3a is a flow chart of the k-means algorithm to which initialization and termination steps are added. As a matter of fact, clustering is closely related to vector quantization. Vector quantization means mapping input vectors to a codebook, which is a set of weight vectors (prototype vectors). When using the quantization error

$$E_Q = \sum_{i=1}^{N} \min_k \|x_i - w_k\|^2,$$

the clusters C determined by a local optimal solution W of vector quantization form a local optimal solution of the clustering problem [12]. In this sense, clustering can be replaced by vector quantization and vice versa. We can write a flow chart for vector quantization as Figure 3b, but we also recognize chart (b) as the k-means algorithm. Furthermore, the LBG algorithm [13], which is well known for vector quantization, is based on an approach of Lloyd [3] (one of the original papers on the k-means algorithm). These facts show a close relationship between clustering and vector quantization.
Initialization is important, because the k-means algorithm converges to a local optimal solution that depends on the initial condition (a set of weights or clusters). If we initialize the weights W by randomly selecting them from the input vectors, it may converge to a very bad local optimal solution with high probability. Random labeling, which randomly assigns cluster labels C to input vectors, may lead to better solutions than random selection of weights. The random-labeling initialization can also be used for charts (b) and (c) in Figure 3 by replacing the "Initialize weights" step with the "Initialize clusters" and "Update weights" steps. For directly initializing weights, the splitting algorithm [13] and k-means++ [14] are known.

Competitive learning
Competitive learning [4,11] is a learning method for vector quantization and is also utilized for clustering. While the k-means algorithm updates all weights W by batch processing, competitive learning updates one weight w at a time to reduce a part of the quantization error $E_Q$ (see Figure 3c) as follows:

1. Select one input vector x randomly from X.

2. Decide a winner $w_c$ from W by $c = \arg\min_k \|x - w_k\|^2$ (if there are several candidates, choose the smallest k).

3. Update the winner's weight $w_c$ as $w_c \leftarrow w_c + \gamma (x - w_c)$, where $\gamma$ is a given learning rate (e.g., 0.01-0.1).
Though the winner-take-all update in Step 3 (Figure 4), which reduces the partial error $\|x - w_c\|^2$ in the steepest direction, does not always reduce the total quantization error $E_Q$, repetition of the update can reduce $E_Q$ on the basis of the stochastic gradient descent method [15,16]. As a termination condition, a maximum number of iterations $N_r$ (the number of maximum repetitions) can be used. After termination, the step of deciding clusters C as in Figure 2b is required for clustering purposes. Contrary to what one might expect, competitive learning outperforms k-means without any special contrivance in most cases. Furthermore, information obtained in the learning process allows us to improve its performance. The splitting rule [12] utilizes the number of times each weight w wins to estimate the density around it. As Figure 5a, b shows, a higher density of input vectors around a weight vector makes $w_a$ win more frequently than $w_b$.
The splitting rule in competitive learning [12] aims to overcome the problem of discrepancy between the distribution of input vectors X and that of weight vectors W. The discrepancy causes a few weight vectors w to monopolize X and leads to a solution of very poor quality, but it is impossible to figure out the distribution of input vectors beforehand. Accordingly, this splitting rule distributes the weight vectors w during the learning process as follows (see the sketch after this list):

1. Set one weight vector $w_1$ with a variable $\tau_1$. $\tau_1$ denotes how many times the weight vector $w_1$ wins and is initialized to 0.
2. Select one input vector x, decide winner w c , and update the winner's weight w c .
3. Add 1 to $\tau_c$. If $\tau_c = \theta$ and the current number of weights $K'$ is less than K, generate a new weight vector w that is the same as the winner $w_c$ and reset $\tau$ of both to 0, where $\theta$ is the threshold number of wins for splitting.
4. Repeat 2 and 3 until the termination condition is true.
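A minimal sketch of competitive learning with the splitting rule described above; the parameter values and names (gamma, theta, n_rep) are illustrative assumptions, not the chapter's settings.

```python
import numpy as np

def competitive_learning_split(X, K, gamma=0.05, theta=1000, n_rep=100_000, seed=0):
    """Competitive learning with the splitting rule: start from one weight vector and
    split the winner after it has won theta times, until K weight vectors exist."""
    rng = np.random.default_rng(seed)
    W = [X[rng.integers(len(X))].astype(float)]       # step 1: one weight vector w_1
    wins = [0]                                        # tau_k: how many times w_k has won
    for _ in range(n_rep):
        x = X[rng.integers(len(X))]                   # step 2: pick one input vector at random
        c = int(np.argmin([np.sum((x - w) ** 2) for w in W]))  # winner (smallest k on ties)
        W[c] = W[c] + gamma * (x - W[c])              # update the winner's weight
        wins[c] += 1                                  # step 3: count the win
        if wins[c] == theta and len(W) < K:           # split: copy the winner, reset both counters
            W.append(W[c].copy())
            wins[c] = 0
            wins.append(0)
    labels = np.array([int(np.argmin([np.sum((x - w) ** 2) for w in W])) for x in X])
    return labels, np.array(W)                        # step 4 handled by the fixed repetition count
```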

Information-theoretic clustering and algorithms
Information-theoretic clustering (ITC) [8] is closely related to works on distributional clustering [17][18][19] and uses Kullback-Leibler divergence and Jensen-Shannon divergence to determine its criterion. Though a difficulty exists in its algorithms, its definition and properties are similar to those of the sum-of-squared-error criterion, and it is effective for high-dimensional count data (e.g., text data). The main contributions of this chapter are to present a technique to overcome the difficulty and to show the effectiveness of ITC.
Let $X = \{x_i \mid x_i \in \mathbb{R}^M, i = 1, \dots, N\}$ be a set of M-dimensional input vectors (N denotes the number of input vectors), where the elements of the vectors $x_i$ are nonnegative real numbers. We define the $l_1$-norm of an input vector, $t_i = \sum_m |x_{im}|$, normalized input vectors $p_i = x_i / t_i$, and an input probability distribution $P_i$ whose mth random variable takes the mth element of $p_i$ ($= p_{im}$). Let $\mathcal{P} = \{P_1, \dots, P_N\}$ be a set of input distributions (input data).
Suppose that we assign each distribution $P_i$ a cluster label $k$ ($k = 1, \dots, K$) to partition them into K clusters $C = \{C_1, \dots, C_K\}$. Let $\bar{P}_k$ be the distribution on the mean of the input data $P_i$ which belong to the cluster $C_k$ (see Figure 6).
Then, the generalized Jensen-Shannon (JS) divergence to be minimized in $C_k$ is defined by

$$JS(C_k) = \sum_{P_i \in C_k} \pi_i \, D_{KL}(P_i \,\|\, \bar{P}_k), \qquad \bar{P}_k = \frac{1}{N_k} \sum_{P_i \in C_k} P_i, \qquad (12)$$

where $N_k$ is the number of distributions $P_i$ in cluster $C_k$ (i.e., $N = \sum_{k=1}^{K} N_k$), $D_{KL}(P_i \,\|\, \bar{P}_k)$ is the Kullback-Leibler (KL) divergence to the mean distribution $\bar{P}_k$ from $P_i$, and $\pi_i$ is the probability (weight) of $P_i$, taken to be uniform ($\pi_i = 1/N_k$) here. The within-cluster JS divergence, which considers all clusters $C_k$ ($k = 1, \dots, K$), is

$$JS_W = \sum_{k=1}^{K} \frac{N_k}{N} \, JS(C_k). \qquad (15)$$

The within-cluster JS divergence $JS_W$ is the objective function (criterion) of information-theoretic clustering (ITC) to be minimized [8]. We also define the between-cluster JS divergence $JS_B$ and the total JS divergence $JS_T$ as

$$JS_B = \sum_{k=1}^{K} \frac{N_k}{N} \, D_{KL}(\bar{P}_k \,\|\, \bar{P}), \qquad JS_T = \frac{1}{N} \sum_{i=1}^{N} D_{KL}(P_i \,\|\, \bar{P}),$$

where $\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$ is the distribution on the mean of all the input data. It follows from these definitions that the total JS divergence is the sum of the within-cluster JS divergence and the between-cluster JS divergence [8]:

$$JS_T = JS_W + JS_B.$$

Since $JS_T$ is constant for the given input distributions $\mathcal{P}$, minimization of $JS_W$ is equivalent to maximization of $JS_B$. In this sense, clustering based on minimizing the criterion $JS_W$ works to find clusters that are well separated from each other.
The definition and properties of ITC shown so far are similar to those of the sum-of-squared-error criterion. These similarities help us to understand ITC.
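To make the ITC criterion concrete, the following sketch computes $JS_W$ for a given partition, assuming uniform weights $\pi_i = 1/N_k$ within each cluster as above (the rows of P are the input distributions; names are illustrative).

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q); terms with p_m = 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_within(P, labels, K):
    """JS_W = (1/N) * sum_k sum_{P_i in C_k} D_KL(P_i || mean distribution of C_k)."""
    N = len(P)
    total = 0.0
    for k in range(K):
        Pk = P[labels == k]
        if len(Pk) == 0:
            continue
        mean_k = Pk.mean(axis=0)                     # mean distribution of cluster C_k
        total += sum(kl(p, mean_k) for p in Pk)      # finite: support(P_i) is inside support(mean_k)
    return total / N
```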

Generative model
Behind information-theoretic clustering (ITC), there also exists the bag-of-words assumption [20], which disregards the order of words in a document. (Since ITC is not limited to document clustering, "word" is just an example of a feature.) It means that the features in the data are conditionally independent and identically distributed, where the condition is a given probability distribution for an input vector. Based on this assumption, we describe a generative probabilistic model related to ITC and clarify the relationship between the model and the objective function (criterion) $JS_W$.
Under this assumption, the probability that cluster $C_k$ generates an input vector $x_i$ follows the multinomial distribution

$$P(x_i \mid C_k) = A_i \prod_{m=1}^{M} (\bar{p}_{km})^{x_{im}}, \qquad A_i = \frac{t_i!}{\prod_{m=1}^{M} x_{im}!},$$

where $A_i$ is the number of combinations of the observation and $\bar{p}_{km}$ is the mth element of the mean distribution $\bar{P}_k$. Assuming that each generation is independent, the joint probability function for the input vectors becomes

$$P(X \mid C) = \prod_{k=1}^{K} \prod_{x_i \in C_k} A_i \prod_{m=1}^{M} (\bar{p}_{km})^{x_{im}}, \qquad (20)$$

where C indicates the cluster information that specifies which cluster $C_k$ each input vector $x_i$ belongs to. Taking the logarithm of Eq. (20) yields

$$\log P(X \mid C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \left( \log A_i + \sum_{m=1}^{M} x_{im} \log \bar{p}_{km} \right). \qquad (22)$$

This is a generative probabilistic model related to ITC. If we assume that $t_i$ takes a constant value t for all input vectors, maximization of the probability (22) as well as minimization of the objective function $JS_W$ (15) come down to the minimization of

$$\sum_{k=1}^{K} N_k H(\bar{P}_k) = -\sum_{k=1}^{K} \sum_{m=1}^{M} N_k \, \bar{p}_{km} \log \bar{p}_{km} \qquad (23)$$

for the given input distributions $\mathcal{P}$. Here, the relationship $\sum_{P_i \in C_k} p_{im} = N_k \bar{p}_{km}$ is used. Since $t_i$ may not take a constant value t, the generative model (22) is not an equivalent model of ITC but a related model. This difference comes from the fact that the model treats each observation of a feature equally, while ITC treats each data item (input vector) equally. Though the additional assumption $t_i = t$ is required, ITC works to find the most probable solution C in the generative probabilistic model. Furthermore, Eq. (23) shows that the criterion also amounts to the minimization of entropy within clusters. Entropy (specifically, Shannon entropy) is the expected value of the information contained in each message, which is an input distribution here. The smaller the entropy becomes, the more compactly a model can explain the observations (input distributions). In this sense, the objective function $JS_W$ (15) represents the goodness of the generative model. The relationship (including the difference) between the probabilistic model and the objective function $JS_W$ is meaningful for improving the model and the objective function in the future.
The choice of an appropriate model for data is important when analyzing them. For example, large sets of text documents contain many kinds of words and are represented as high-dimensional vectors. Taking the extreme diversity of documents' topics into account, the feature vectors of documents are distributed almost uniformly in the vector space. As known from "the curse of dimensionality" [10], most of the volume of a sphere in high-dimensional space is concentrated near the surface, and it becomes inappropriate to choose a model based on the Gaussian distribution, which concentrates values around the mean. In contrast, ITC, which is based on the multinomial distribution, is a reasonable and useful tool to analyze such high-dimensional count data, because the generative model of ITC is consistent with them.
We introduce weight distributions $Q_k$ ($k = 1, \dots, K$) (collectively Q) that represent the clusters $C_k$ and that play the roles of both the mean distribution $\bar{P}_k$ and the prototype distribution of cluster $C_k$, in a manner similar to that of the sum-of-squared-error (SSE) criterion (see Section 2.2.1). Figure 7 shows the relationships between parameters in the generative models. Parameters are generated or estimated from other parameters to maximize the probability of the generative model. For example, clustering is the task of finding the most probable clusters C for given input vectors X or input distributions $\mathcal{P}$. In Figure 7b, constructing a classifier is the task of finding Q for given $\mathcal{P}$ and C (classes in this context) in the training process. Then, it estimates C for unknown $\mathcal{P}$ using the trained Q. The classifier using the multinomial distribution is known as the multinomial Naive Bayes classifier [21].
As this shows, ITC and the Naive Bayes classifier have a close relationship [18].

Algorithms
A difficulty exists in designing algorithms for ITC (see Appendix A). We show a novel idea to overcome it.

Competitive learning
When competitive learning decides a winner for an input distribution P, it easily faces the difficulty of calculating the KL divergence from P to the weight distributions Q (see Appendix A). To overcome this difficulty, we present the idea of changing the order of steps in competitive learning (CL). As shown in Figure 8b, CL updates all weights (= weight distributions) before deciding the winner by

$$Q_k \leftarrow (1 - \gamma) Q_k + \gamma P \quad (k = 1, \dots, K), \qquad (24)$$

where $\gamma$ is a learning rate. Since the updated weight distributions $Q_k$ ($k = 1, \dots, K$) include all words (features) of the input distribution P, it is possible to calculate the KL divergence $D_{KL}(P \,\|\, Q_k)$ for all k.
In the following steps, CL decides a winner $Q_c$ from Q by

$$c = \arg\min_k D_{KL}(P \,\|\, Q_k), \qquad (25)$$

then activates (keeps) the winner's update and discards the others. These steps satisfy CL's requirement that it partially reduces the value of the objective function $JS_W$ in the steepest direction with the given learning rate $\gamma$. Here, neither approximation nor distortion is added to the criterion of ITC. Note that the updates of the weight distributions $Q_k$ before activation are provisional (see Figure 8b).
Related work that avoids the difficulty in calculating KL divergence presented the skew divergence [22]. The skew divergence is defined as

$$s_\alpha(P, Q) = D_{KL}(P \,\|\, \alpha Q + (1 - \alpha) P), \qquad (26)$$

where $\alpha$ ($0 \le \alpha \le 1$) is the mixture ratio of the distributions. The skew divergence is exactly the KL divergence at $\alpha = 1$. When $\alpha = 1 - \gamma$, Eq. (26) coincides with the KL divergence to the provisionally updated weight distribution in Eq. (24). Then, we can rewrite the steps of CL above using the skew divergence as follows:

1. Select one input distribution P randomly from $\mathcal{P}$.

2. Decide a winner $Q_c$ from Q by

$$c = \arg\min_k s_\alpha(P, Q_k). \qquad (27)$$

3. Update the winner's weight distribution $Q_c$ as

$$Q_c \leftarrow (1 - \gamma) Q_c + \gamma P, \qquad (28)$$

where $\gamma$ is a learning rate and is usually equal to $1 - \alpha$ ($\alpha$ is the mixture ratio for $s_\alpha$).
Hence, we call this novel algorithm for ITC "competitive learning using skew divergence" (sdCL). In addition, the splitting rule in competitive learning [12] can also be applied to this algorithm.
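The sdCL steps above translate almost directly into code. The following is a minimal sketch under the chapter's criterion; the initialization (sampling K input distributions as initial weights) and the default repetition count are assumptions for illustration.

```python
import numpy as np

def skew_divergence(p, q, alpha):
    """s_alpha(P, Q) = D_KL(P || alpha*Q + (1 - alpha)*P); finite whenever alpha < 1."""
    mix = alpha * q + (1.0 - alpha) * p
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / mix[mask])))

def sdCL(P, K, gamma=0.01, n_rep=100_000, seed=0):
    """Competitive learning using skew divergence (sdCL), with alpha = 1 - gamma."""
    rng = np.random.default_rng(seed)
    Q = P[rng.choice(len(P), size=K, replace=False)].copy()         # assumed initialization
    alpha = 1.0 - gamma
    for _ in range(n_rep):
        p = P[rng.integers(len(P))]                                  # 1. pick one input distribution
        c = int(np.argmin([skew_divergence(p, q, alpha) for q in Q]))  # 2. decide the winner
        Q[c] = (1.0 - gamma) * Q[c] + gamma * p                      # 3. update the winner only
    labels = np.array([int(np.argmin([skew_divergence(p, q, alpha) for q in Q])) for p in P])
    return labels, Q
```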

k-means type algorithm
Dhillon et al. [8] proposed an information-theoretic divisive algorithm, which is a k-means type algorithm with a divisive mechanism that uses KL divergence. 1 However, the difficulty of using KL divergence directly still remains. In such a situation, we propose to use the skew divergence instead of KL divergence in a k-means type algorithm as follows:

1. Initialize the clusters C of the input distributions $\mathcal{P}$ randomly.
2. Update the weight distributions Q by

$$Q_k = \bar{P}_k = \frac{1}{N_k} \sum_{P_i \in C_k} P_i \quad (k = 1, \dots, K).$$

3. Update the cluster c of each input distribution $P_i$ by

$$c = \arg\min_k s_\alpha(P_i, Q_k),$$

where the mixture ratio $\alpha$ ($0 \le \alpha \le 1$) for the skew divergence $s_\alpha$ is, for example, 0.99.
4. Repeat 2 and 3 until the change ratio of the objective function $JS_W$ is less than a small value (e.g., $10^{-8}$).
The algorithm itself works well and obtains valuable clustering results. Further, if $\alpha$ is close to 1, the skew divergence $s_\alpha$ becomes a good approximation of the KL divergence. Therefore, restarting learning after termination with $\alpha$ closer to 1, such as 0.999, 0.9999, …, may lead to better clustering results.
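A minimal sketch of this k-means type algorithm (sdKM) is given below, reusing the skew divergence defined for sdCL; the handling of empty clusters and the surrogate objective used in the stopping test are assumptions made for illustration.

```python
import numpy as np

def skew_divergence(p, q, alpha):
    """s_alpha(P, Q) = D_KL(P || alpha*Q + (1 - alpha)*P)."""
    mix = alpha * q + (1.0 - alpha) * p
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / mix[mask])))

def sdKM(P, K, alpha=0.99, tol=1e-8, max_iter=100, seed=0, init_labels=None):
    """k-means type algorithm for ITC using skew divergence instead of KL divergence."""
    rng = np.random.default_rng(seed)
    labels = init_labels if init_labels is not None else rng.integers(0, K, size=len(P))
    prev = None
    for _ in range(max_iter):
        # 2. update weight distributions: the mean distribution of each cluster
        Q = np.array([P[labels == k].mean(axis=0) if np.any(labels == k)
                      else P[rng.integers(len(P))] for k in range(K)])
        # 3. reassign each P_i to the cluster with minimum skew divergence
        div = np.array([[skew_divergence(p, q, alpha) for q in Q] for p in P])
        labels = div.argmin(axis=1)
        obj = float(div.min(axis=1).mean())           # approximates JS_W for alpha close to 1
        # 4. stop when the change ratio of the objective is below tol
        if prev is not None and abs(prev - obj) / max(abs(prev), 1e-12) < tol:
            break
        prev = obj
    return labels, Q
```

Restarting sdKM with init_labels set to the previous result and $\alpha$ = 0.999, 0.9999, … implements the refinement described above.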

Other algorithms
Slonim and Tishby [23] proposed an agglomerative hierarchical clustering algorithm, which is a hard clustering version of the Information Bottleneck algorithm of Tishby et al. [24]. It is similar to the algorithm of Baker and McCallum [18] and merges just two clusters at every step based on the JS divergence of their distributions. A merit of the agglomerative algorithms is that they are not affected by the difficulty of calculating KL divergence, because they just use JS divergence. However, a merge of clusters at each step optimizes a local criterion but not a global criterion, as Dhillon et al. [8] pointed out. Therefore, clustering results may not be as good as those obtained by nonhierarchical algorithms (e.g., k-means and competitive learning) in the sense of optimizing the objective function of ITC. Additionally, hierarchical algorithms are computationally expensive when the number of inputs is large.
Note that many studies [8,18,23] aimed at improving the accuracy of text classification using feature/word clustering based on ITC or distributional clustering. If clustering is just a step toward a final goal, feature clustering is meaningful. However, features that characterize clusters should not be merged when we aim to find clusters (topics) from a set of documents. Actually, finding topics using clustering is the aim of this chapter.

Evaluation of clustering
Since clustering results depend on methods (criteria and algorithms), appropriate selection of them is important. So far, we have introduced two criteria for clustering. These are called internal criteria; they depend on their own models and are not sufficient for evaluation. If the criterion for clustering is common, we can compare clustering results by the objective function of the criterion. Under a certain model (in other words, an assumption), a more probable result can be regarded as a better result. However, it is not guaranteed that the model or the assumption is reasonable at all times. Moreover, good clustering results under a certain criterion can be bad results under different criteria. A view from outside is required.
This section introduces external criteria, namely purity, Rand index (RI), and normalized mutual information (NMI) [25], to evaluate clustering quality and to find better clustering methods. These criteria compare clusters with a set of classes, which are produced on the basis of human judgments. Here, each input datum belongs to one of the classes $A_j$ ($j = 1, \dots, J$) and one of the clusters $C_k$ ($k = 1, \dots, K$). Let $T(C_k, A_j)$ be the number of data that belong to both $C_k$ and $A_j$.
Purity is measured by counting the number of input data from the most frequent class in each cluster. Purity can be computed as

$$\text{Purity} = \frac{1}{N} \sum_{k=1}^{K} \max_j T(C_k, A_j),$$

where N is the total number of input data. Purity is close to 1 when each cluster has one dominant class.

Rand index (RI) checks all of the $N(N-1)/2$ pairs of input data and is defined by

$$RI = \frac{a + d}{a + b + c + d},$$

where a, b, c, and d are the numbers of pairs under the following conditions:

• "a": the cluster number (suffix) is the same and the class number is the same;
• "b": the cluster number is the same but the class number is different;
• "c": the cluster number is different but the class number is the same;
• "d": the cluster number is different and the class number is different.

Normalized mutual information (NMI) is defined by

$$NMI(C, A) = \frac{I(C; A)}{(H(C) + H(A))/2}, \qquad I(C; A) = \sum_{k} \sum_{j} P(C_k, A_j) \log \frac{P(C_k, A_j)}{P(C_k) \, P(A_j)},$$

where $I(C; A)$ is the mutual information and $H(\cdot)$ is the entropy, and where $P(C_k)$, $P(A_j)$, and $P(C_k, A_j)$ are the probabilities of a datum being in cluster $C_k$, in class $A_j$, and in the intersection of $C_k$ and $A_j$, respectively. Mutual information $I(C; A)$ measures the mutual dependence between the clusters C and the classes A. It quantifies the amount of information obtained about the classes through knowing the clusters. Hence, a high NMI shows a kind of goodness of clustering in the information-theoretic sense.
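The three external criteria can be computed from the contingency counts $T(C_k, A_j)$ as in the following sketch; the NMI normalization by the average of $H(C)$ and $H(A)$ follows the definition reconstructed above and should be checked against [25] if exact comparability matters.

```python
import numpy as np
from collections import Counter

def evaluate(clusters, classes):
    """Purity, Rand index, and NMI for equal-length lists of cluster and class labels."""
    N = len(clusters)
    T = Counter(zip(clusters, classes))                       # contingency counts T(C_k, A_j)
    n_cluster = Counter(clusters)
    n_class = Counter(classes)
    # Purity: fraction of data falling in the most frequent class of each cluster
    purity = sum(max(v for (ck, aj), v in T.items() if ck == k) for k in n_cluster) / N
    # Rand index over all N(N-1)/2 pairs: RI = (a + d) / (a + b + c + d)
    pairs = lambda n: n * (n - 1) // 2
    a = sum(pairs(v) for v in T.values())                     # same cluster and same class
    same_cluster = sum(pairs(v) for v in n_cluster.values())  # a + b
    same_class = sum(pairs(v) for v in n_class.values())      # a + c
    all_pairs = pairs(N)
    d = all_pairs - same_cluster - same_class + a             # different cluster and class
    ri = (a + d) / all_pairs
    # NMI = I(C;A) / ((H(C) + H(A)) / 2)
    entropy = lambda cnt: -sum(v / N * np.log(v / N) for v in cnt.values())
    mi = sum(v / N * np.log((v / N) / ((n_cluster[ck] / N) * (n_class[aj] / N)))
             for (ck, aj), v in T.items())
    nmi = mi / ((entropy(n_cluster) + entropy(n_class)) / 2)
    return purity, ri, nmi

print(evaluate([0, 0, 1, 1, 2, 2], ["a", "a", "a", "b", "b", "b"]))
```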

Experiments
This section provides experimental results that show the effectiveness and usefulness of ITC and the proposed algorithm (sdCL: competitive learning using skew divergence). Experiments consist of two parts, experiment1 and experiment2.
In experiment1, we applied sdCL to the same data sets as used in the paper of Wang et al. [26] and compared the performance of sdCL with the other clustering algorithms evaluated in it. The algorithms that the paper [26] evaluated are as follows.
As shown above, maximum margin clustering (MMC) [7] and related works have received much attention. These works extend the idea of the support vector machine (SVM) [30] to the unsupervised scenario. The experimental results obtained by the MMC technique are often better than those of conventional clustering methods. Among them, CPMMC and CPM3C (cutting plane multiclass maximum margin clustering) [26] are known as successful methods. The experimental results will show that the proposed algorithm sdCL outperforms CPM3C in text data clustering.
In experiment2, we focus on text data clustering and compare the performance of the algorithms sdCL, sdCLS (sdCL with splitting rule, see Sections 2.2.2 and 3.2.1), and spherical competitive learning (spCL). We also provide evidence to support the effectiveness of ITC through detailed analysis of the clustering results.
spCL is an algorithm for spherical clustering like the spherical k-means algorithm [5], which was proposed for clustering high-dimensional and sparse data such as text data. The objective function to be maximized for spherical clustering is the cosine similarity between input vectors and the mean vector of the cluster to which they belong. To implement spCL, we turn input and weight vectors (x, w) into unit vectors, decide the winner $w_c$ by $c = \arg\max_k \cos(x, w_k)$ (if there are several candidates, choose the smallest k), and update the winner's weight $w_c$ as $w_c \leftarrow (w_c + \gamma(x - w_c)) / \|w_c + \gamma(x - w_c)\|$, that is, the updated weight is renormalized to unit length. For all competitive learning algorithms, the learning rate $\gamma = 0.01$, the maximum number of repetitions for updating weights $N_r = 1{,}000{,}000$ (termination condition), and the threshold of wins for the splitting rule $\theta = 1000$ are used. After competitive learning (sdCL, sdCLS, or spCL) is terminated, we apply a k-means type algorithm as post-processing to remove fluctuations. Specifically, sdKM (the k-means type algorithm using skew divergence shown in Section 3.2.2) with $\alpha = 0.999, 0.9999, 0.99999$ is applied consecutively after sdCL and sdCLS. In each learning procedure including post-processing, an operation is iterated 50 times with different initial random seeds for a given set of parameters.

Data sets
We mainly use the same data sets as used in the paper of Wang et al. [26]. When applying algorithms for ITC, we use the probability distributions $P_i$ ($i = 1, \dots, N$) ($\mathcal{P}$) derived from the original data.
1. UCI data. From the UCI repository, 2 we use ionosphere, digits, letter, and satellite under the same setting as the paper [26]. The digits data (8 × 8 matrices) are generated from bitmaps of handwritten digits. Pairs (3 vs [32] are used. In experiment1, we follow the setting of the paper [26]. For the 20Newsgroups data set, the topic "rec", which contains four topics {autos, motorcycles, baseball, hockey}, is used. From the four topics, two two-class data sets {Text-1: autos vs motorcycles, Text-2: baseball vs hockey} are extracted. From the WebKB data sets, the four-universities data set (Cornell, Texas, Washington, and Wisconsin University), which has seven classes (student, faculty, staff, department, course, project, and other), is used. Note that the topic of the "other" class is ambiguous and may contain various topics (e.g., faculty), because it is a collection of pages that were not deemed the "main page" representing an instance of the other six classes, as pointed out in the web page of the data set. The Cora data set (Cora research paper classification) [31] is a set of information on research papers classified into a topic hierarchy. From this data set, papers in the subfields {data structure (DS), hardware and architecture (HA), machine learning (ML), operating system (OS), programming language (PL)} are used. We select papers that contain a title and an abstract. The RCV1 data set contains more than 800 thousand documents to which topic categories are assigned. The documents with the four highest topic codes (CCAT, ECAT, GCAT, and MCAT) in the topic code hierarchy in the training set are used. Multi-labeled instances are removed.
In experiment2, we use all of 20Newsgroups and RCV1 data sets. For RCV1 data set, we obtain 53 classes (categories) by mapping the data set to the second level of RCV1 topic hierarchy and remove multi-labeled instances. For WebKB data set, we remove "other" class due to ambiguity, use the other six classes, and do not use information of universities.
For all text data, we remove stop words using a stop list [32] and remove documents that become empty, if any. In experiment1, we follow the setting of the paper [26], but the properties of the data sets are slightly different (see Table 3). For the Cora data sets, the differences in data sizes are large. However, they should keep the same (or at least close) characteristics (e.g., distributions of words and topics), because they are extracted from the same source.
The properties of those data sets are listed in Table 3.

Results of experiment1
The clustering results are shown in Tables 4-7, where the values (except for sdCL) are the same as in the paper of Wang et al. [26] (accuracy in that paper is equivalent to purity by its definition).
In the two-class problems, CPMMC outperforms the other algorithms in purity and Rand index (RI) in most cases. The proposed algorithm sdCL shows stable performance except for ionosphere, to which sdCL cannot be applied. In the multiclass problems, sdCL outperforms the other algorithms for text data (Cora, 20Newsgroups-4, and Reuters-RCV1-4). The results show that ITC and the proposed algorithm sdCL are effective for text data sets. Note that CPM3C shows better results than sdCL for the WebKB data. However, the topic of the "other" class in WebKB is ambiguous (see Section 5.1). The occupation ratios of the "other" class are large {0.710, 0.689, 0.777, 0.739} and almost the same as the purity values of CPM3C and sdCL. This means that these algorithms failed to find meaningful clusters in terms of purity. Therefore, the WebKB data are not appropriate for evaluation without removing the "other" class.

Results of experiment2
In experiment2, we focus on text data clustering. Table 8 shows that the proposed algorithms for ITC (sdCL and sdCLS) outperform spCL in purity, RI, and NMI. Considering that spCL is an algorithm for spherical clustering [5], which was proposed to analyze high-dimensional data such as text documents, the criterion of information-theoretic clustering is worth using for this purpose. Table 8 also shows that sdCLS (sdCL with splitting rule, see Sections 2.2.2 and 3.2.1) is slightly better than sdCL in some cases. As far as Figure 9 (left, right) shows, the values of JS divergence for sdCLS are smaller ("better" in ITC) than those for sdCL, and sdCLS outperforms sdCL in purity on average. Nevertheless, the advantage of sdCLS over sdCL is not so obvious in this experiment.

In the following, we examine the clustering results obtained by ITC in detail to clarify whether ITC helps us to find meaningful clusters and candidates of classes for classification. Tables 9 and 10 show frequent words in the classes and in the clusters obtained by sdCL for the 20Newsgroups data set, respectively. The order of the clusters is arranged so that the clusters correspond to the classes. Table 11 is the cross table between clusters and classes. As shown in Table 10, the frequent words in some clusters remind us of their characteristics and distinguish them from others. For example, the words in cluster 2: "image graphics jpeg", cluster 6: "sale offer shipping", and cluster 11: "key encryption chip" remind us of the classes (comp.graphics), (misc.forsale), and (sci.crypt), respectively. We can also imagine the characteristics of clusters from the words in the 7th, 8th, 9th, 10th, 13th, 14th, and 16th clusters. These clusters have documents of one dominant class and can be regarded as candidates of classes. However, there are some exceptions. The 1st and 15th clusters share the word "god", while the classes (alt.atheism), (soc.religion.christian), and (talk.religion.misc) also share the word "god". Cluster 1 and the class (alt.atheism) have the common words "religion evidence", and cluster 1 has many documents of the dominant class (alt.atheism), while the documents of (talk.religion.misc) are mostly shared by clusters 1 and 15. Though there is a mismatch between clusters and classes, the clustering result is still acceptable, because the words in the class (talk.religion.misc) resemble those in the class (soc.religion.christian). We can also find that cluster 4 has many documents of the two classes (comp.sys.ibm.pc.hardware) and (comp.sys.mac.hardware). From Table 9, those classes have similar words except for "mac" and "apple." Thus, ITC missed the difference between the classes but found a cluster with their common features. In this sense, the clustering result is meaningful and useful. Another example is that documents in the class (talk.politics.mideast) are divided into clusters 17 and 18. This means that ITC found two topics in one class, and the frequent words in the clusters seem to be reasonable (see the 17th and 18th clusters in Table 10).
The characteristics of cluster 20, which has the words "mail list address email send," are different from all classes as well as from the other clusters, but cluster 20 has some documents from all classes (see Table 11). This cluster may reveal that all newsgroups include documents with such words. In summary, ITC helps us to find meaningful clusters, even when the clusters obtained by ITC sometimes seem not to be the same as the expected classes. The detailed analysis of the clustering results above can serve as evidence to support the effectiveness and usefulness of ITC.

Conclusion
In this chapter, we introduced information-theoretic clustering (ITC) from both theoretical and experimental sides. Theoretically, we have shown the criterion, generative model, and novel algorithms for ITC. Experimentally, we have shown the effectiveness and usefulness of ITC for text analysis as an important example.

Appendix A. Difficulty about KL divergence
Let P and Q be distributions whose mth random variables take the mth elements $p_m$ and $q_m$ of vectors p and q, respectively. The Kullback-Leibler (KL) divergence to Q from P is defined to be

$$D_{KL}(P \,\|\, Q) = \sum_{m} p_m \log \frac{p_m}{q_m}.$$

In this definition, it is assumed that the support set of P is a subset of the support set of Q (if $q_m$ is zero, $p_m$ must be zero). For a given cluster $C_k$, there is no problem in calculating the JS divergence of cluster $C_k$ by Eq. (12), because the support set of any distribution $P_i$ ($\in C_k$) is a subset of that of the mean distribution $\bar{P}_k$. However, it is not guaranteed that the KL divergence from $P_i$ ($\in C_k$) to $Q_t$ ($t \ne k$) (a weight distribution of another cluster $C_t$) is finite. This causes a serious problem when finding a similar weight distribution Q for an input distribution P. For example, the lack of even one word (feature) in a distribution Q is enough for P not to be judged similar to it (the divergence becomes infinite). Therefore, it is difficult to use a k-means type algorithm, 5 which updates weights or clusters by batch processing, in ITC.
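A small numeric illustration of the problem, and of how the skew divergence avoids it (the distributions here are made up for the example):

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q); becomes infinite when some q_m = 0 while p_m > 0."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2, 0.0])        # input distribution P
q = np.array([0.6, 0.4, 0.0, 0.0])        # weight distribution Q lacking P's third feature

print(kl(p, q))                           # inf: support of P is not a subset of support of Q
alpha = 0.99
print(kl(p, alpha * q + (1 - alpha) * p)) # finite: the skew divergence s_alpha(P, Q)
```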