 Open access peer-reviewed chapter

# Information‐Theoretic Clustering and Algorithms

Written By

Toshio Uchiyama

Submitted: May 2nd, 2016 Reviewed: October 26th, 2016 Published: April 26th, 2017

DOI: 10.5772/66588

From the Edited Volume

## Advances in Statistical Methodologies and Their Application to Real Problems

Edited by Tsukasa Hokimoto


## Abstract

Clustering is the task of partitioning objects into clusters on the basis of certain criteria so that objects in the same cluster are similar. Many clustering methods have been proposed over the past several decades. Since clustering results depend on criteria and algorithms, appropriate selection of them is an essential problem. Recently, large sets of users' behavior logs and text documents have become common. These are often represented as high‐dimensional and sparse vectors. This chapter introduces information‐theoretic clustering (ITC), which is appropriate and useful for analyzing such high‐dimensional data, from both theoretical and experimental perspectives. Theoretically, the criterion, generative models, and novel algorithms are shown. Experimentally, the chapter shows the effectiveness and usefulness of ITC for text analysis as an important example.

### Keywords

• information‐theoretic clustering
• competitive learning
• Kullback‐Leibler divergence
• Jensen‐Shannon divergence
• clustering algorithm
• text analysis

## 1. Introduction

Clustering is the task of partitioning objects into clusters on the basis of certain criteria so that objects in the same cluster are similar. It is a fundamental procedure to analyze data [1, 2].

Clustering is unsupervised and thus different from supervised classification. In supervised classification, we have a set of labeled data (belonging to predefined classes), train a classifier using the labeled data (training set), and judge which class a new object belongs to with the classifier. In the case of clustering, we find meaningful clusters without using any labeled data and group a given collection of unlabeled data into them. Clustering can also help us find meaningful classes (labels) for supervised classification. Since it is more difficult to prepare a training set for larger data sets, unsupervised analysis of data such as clustering has recently become more important.

For example, the user‐item matrix in Table 1 shows which items each user bought. When considering the data as a set of feature vectors for users, we can discover many types of user behavior by clustering. It is also possible to analyze the data as a set of feature vectors for items. From the word‐document matrix in Table 2, both document clusters and word clusters could be extracted.

|        | item1 | item2 | item3 | item4 | item5 |
|--------|-------|-------|-------|-------|-------|
| User1  | 3     | 0     | 0     | 5     | 0     |
| User2  | 0     | 1     | 2     | 0     | 0     |
| User3  | 1     | 0     | 0     | 0     | 2     |
| User4  | 0     | 0     | 1     | 0     | 0     |

### Table 1.

Consumption behavior of users.

|        | Document1 | Document2 | Document3 | Document4 | Document5 |
|--------|-----------|-----------|-----------|-----------|-----------|
| Word1  | 3         | 0         | 0         | 5         | 0         |
| Word2  | 0         | 1         | 2         | 0         | 0         |
| Word3  | 1         | 0         | 0         | 0         | 2         |
| Word4  | 0         | 0         | 1         | 0         | 0         |

### Table 2.

Word frequencies in documents (bag‐of‐words feature representation).

Many clustering methods have been proposed over the past several decades. These include the k‐means algorithm, competitive learning, spherical clustering, spectral clustering, and maximum margin clustering. Since clustering results depend on criteria and algorithms, appropriate selection of them is an essential problem. Large sets of users' behavior logs and text documents (as shown in Tables 1 and 2) are common nowadays. These are often represented as high‐dimensional and sparse vectors. This chapter introduces information‐theoretic clustering and algorithms that are appropriate and useful for analyzing such high‐dimensional data.

Information‐theoretic clustering (ITC) uses the Kullback‐Leibler divergence and the Jensen‐Shannon divergence to define its criterion, while the k‐means algorithm uses the sum of squared error as its criterion. This chapter explains ITC by contrasting these two clustering techniques (criteria and algorithms), because there are a number of interesting similarities between them. There exists a difficulty in algorithms for ITC; we explain its details and propose novel algorithms to overcome it.

Experimental results for text data sets are presented to show the effectiveness and usefulness of ITC and the novel algorithms for it. In the experiments, maximum margin clustering and spherical clustering are used for comparison. We also provide evidence supporting the effectiveness of ITC through detailed analysis of the clustering results.

## 2. The sum‐of‐squared‐error criterion and algorithms

Given a set of $M$-dimensional input vectors $X = \{x_i \mid x_i \in \mathbb{R}^M, i = 1, \ldots, N\}$, where $N$ is the number of vectors, clustering is the task of assigning each input vector $x_i$ a cluster label $k$ ($k = 1, \ldots, K$) to partition the vectors into $K$ clusters $C = \{C_1, \ldots, C_K\}$. The sum‐of‐squared‐error criterion is a simple and widely used criterion for clustering.

Let $\mu_k$ be the mean of the input vectors $x_i$ that belong to cluster $C_k$ (see Figure 1). Then the error in $C_k$ is the sum of the squared lengths of the difference (= "error") vectors, $\|x_i - \mu_k\|^2$, and the sum‐of‐squared‐error criterion over all clusters (the within‐cluster sum of squares) is defined by

$$J_W = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2. \tag{1}$$

$J_W$ is the objective function (criterion) to be minimized in clustering based on this criterion.

Also, we define the between‐cluster sum of squares $J_B$ and the total sum of squares $J_T$ as

$$J_B = \sum_{k=1}^{K} N_k \|\mu_k - \mu\|^2, \qquad J_T = \sum_{i=1}^{N} \|x_i - \mu\|^2, \tag{2}$$

respectively, where $N_k$ is the number of input vectors $x_i$ in $C_k$ (i.e., $N = \sum_{k=1}^{K} N_k$) and

$$\mu_k = \frac{1}{N_k} \sum_{x_i \in C_k} x_i, \qquad \mu = \frac{1}{N} \sum_{i=1}^{N} x_i. \tag{3}$$

It follows from these definitions that the total sum of squares is the sum of the within‐cluster sum of squares and the between‐cluster sum of squares:

$$J_T = J_W + J_B. \tag{4}$$

Since the mean of all input vectors $\mu$ is derived from $X = \{x_1, \ldots, x_N\}$ [see Eq. (3)], $J_T$ does not depend on the clusters $C$ [see Eq. (2)] and is constant for the given input vectors $X$. Therefore, minimization of $J_W$ is equivalent to maximization of $J_B$. In this sense, clustering based on minimizing the criterion $J_W$ works to find clusters that are well separated from each other.
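The decomposition in Eq. (4) is easy to verify numerically. The following is a minimal pure‐Python sketch (the function names, toy data, and cluster labels are illustrative, not from the chapter):

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(vectors):
    """Component-wise mean of a nonempty list of vectors."""
    n = len(vectors)
    return [sum(v[m] for v in vectors) / n for m in range(len(vectors[0]))]

def sse_decomposition(X, labels, K):
    """Return (J_W, J_B, J_T) of Eqs. (1) and (2) for vectors X and cluster labels."""
    mu = mean(X)                                        # mean of all inputs, Eq. (3)
    J_W = J_B = 0.0
    for k in range(K):
        members = [x for x, l in zip(X, labels) if l == k]
        if not members:
            continue
        mu_k = mean(members)                            # cluster mean, Eq. (3)
        J_W += sum(sq_dist(x, mu_k) for x in members)   # within-cluster, Eq. (1)
        J_B += len(members) * sq_dist(mu_k, mu)         # between-cluster, Eq. (2)
    J_T = sum(sq_dist(x, mu) for x in X)                # total, Eq. (2)
    return J_W, J_B, J_T

random.seed(0)
X = [[random.random() for _ in range(3)] for _ in range(20)]
labels = [random.randrange(2) for _ in range(20)]
J_W, J_B, J_T = sse_decomposition(X, labels, 2)
```

For any assignment of labels, $J_T = J_W + J_B$ holds up to floating‐point error.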

### 2.1. Generative model

In the background of clustering based on the objective function (criterion) $J_W$, there exists an assumption that the input vectors follow a Gaussian distribution.

Suppose that there are clusters $C_k$ ($k = 1, \ldots, K$) that generate input vectors by the conditional probability density function:

$$p(x_i \mid x_i \in C_k) = \frac{1}{(2\pi\sigma_k^2)^{M/2}} \exp\left(-\frac{\|x_i - \mu_k\|^2}{2\sigma_k^2}\right), \tag{5}$$

where $\sigma_k$ is the standard deviation of cluster $C_k$ and $M$ is the dimension of $x_i$. In the following, we assume that $\sigma_k$ takes a constant value $\sigma$ for all clusters $C_k$ ($k = 1, \ldots, K$). Considering the independence of each generation, the joint probability density function for the input vectors $X$ becomes

$$p(X \mid C) = \prod_{k=1}^{K} \prod_{x_i \in C_k} \frac{1}{(2\pi\sigma^2)^{M/2}} \exp\left(-\frac{\|x_i - \mu_k\|^2}{2\sigma^2}\right), \tag{6}$$

where $C$ indicates the cluster information that specifies which cluster $C_k$ each input vector $x_i$ belongs to. Taking the logarithm of Eq. (6) yields

$$\ln p(X \mid C) = -\frac{NM}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2. \tag{7}$$

Since $\sigma$ is constant, maximization of Eq. (7) is equivalent to minimization of

$$\sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2, \tag{8}$$

which is nothing more or less than the objective function (criterion) $J_W$. Therefore, under the assumption of a Gaussian distribution of the input vectors, clustering based on Eq. (8) works to find the most probable solution $C$.

### 2.2. Algorithms

#### 2.2.1. k‐means algorithm

k‐means [3, 11] is a well‐known algorithm for clustering based on the sum‐of‐squared‐error criterion. The main idea of this algorithm is as follows. In the objective function $J_W$ (1), the error for a vector $x$ is calculated by $\|x - \mu_k\|^2$, where $\mu_k$ is the mean of the cluster $C_k$ to which $x$ belongs. If $\|x - \mu_t\|^2 < \|x - \mu_k\|^2$, changing the cluster from $C_k$ to $C_t$ reduces the objective function $J_W$.

We introduce weight vectors $w_k$ ($k = 1, \ldots, K$) ($W$) that represent the clusters $C_k$ to implement the idea mentioned above. The weight vector $w_k$ serves as both the mean vector $\mu_k$ and the prototype vector of cluster $C_k$. As illustrated in Figure 2, the idea of k‐means is the alternating repetition of two steps: "(a) Update weights" (calculating the mean $\mu_k$ as weight vector $w_k$) and "(b) Update clusters" (allocating each input vector $x_i$ to a cluster $C_k$ on the basis of the minimum distance to the weight vectors $w_k$). Note that Figure 2b is a Voronoi tessellation determined by the weight vectors $w_k$, which are usually called prototype vectors in this context. Figure 2. Two steps in the k‐means algorithm. (a) Update weights. (b) Update clusters.

Figure 3a is a flow chart of the k‐means algorithm to which initialization and termination processes are added. In fact, clustering is closely related to vector quantization, which maps input vectors to a codebook, that is, a set of weight vectors (prototype vectors). When using the quantization error $E_Q$:

$$E_Q = \sum_{i=1}^{N} \min_k \|x_i - w_k\|^2, \tag{9}$$

Figure 3. Flow charts of algorithms based on the sum‐of‐squared‐error criterion. (a) k‐means1, (b) k‐means2, and (c) competitive learning.

the clusters $C$ determined by a locally optimal solution $W$ of vector quantization form a locally optimal solution of the clustering problem. In this sense, clustering can be replaced by vector quantization and vice versa. We can draw a flow chart for vector quantization as in Figure 3b, but this chart (b) also serves as the k‐means algorithm. Furthermore, the LBG algorithm, which is well known for vector quantization, is based on the approach of Lloyd (one of the original papers on the k‐means algorithm). These facts show the close relationship between clustering and vector quantization.

Initialization is important, because the k‐means algorithm converges to a locally optimal solution that depends on the initial condition (a set of weights or clusters). If we initialize the weights $W$ by randomly selecting them from the input vectors, the algorithm may converge to a very bad local optimum with high probability. Random labeling, which randomly assigns cluster labels $C$ to the input vectors, may lead to better solutions than random selection of weights. Random labeling can also be used for charts (b) and (c) in Figure 3 by replacing the "Initialize weights" step with "Initialize clusters" and "Update weights" steps. For directly initializing weights, the splitting algorithm and k‐means++ are known.
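The k‐means loop of Figure 3a with random‐labeling initialization can be sketched in pure Python as follows (function and variable names are my own, and the toy data below are illustrative):

```python
import random

def kmeans(X, K, n_iter=100, seed=0):
    """k-means with random-labeling initialization (cf. Figure 3a)."""
    rng = random.Random(seed)
    M = len(X[0])
    labels = [rng.randrange(K) for _ in X]   # random labeling initialization
    W = []
    for _ in range(n_iter):
        # (a) Update weights: each w_k is the mean of its cluster
        W = []
        for k in range(K):
            members = [x for x, l in zip(X, labels) if l == k]
            if members:
                W.append([sum(x[m] for x in members) / len(members) for m in range(M)])
            else:
                W.append(X[rng.randrange(len(X))])   # re-seed an empty cluster
        # (b) Update clusters: assign each x_i to its nearest weight vector
        new_labels = [min(range(K),
                          key=lambda k: sum((a - b) ** 2 for a, b in zip(x, W[k])))
                      for x in X]
        if new_labels == labels:   # converged to a local optimum
            break
        labels = new_labels
    return labels, W

# two well-separated groups of points (illustrative data)
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels, W = kmeans(X, 2)
```

On this toy data the algorithm separates the two groups into two clusters.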

#### 2.2.2. Competitive learning

Competitive learning [4, 11] is a learning method for vector quantization and is also utilized for clustering. While the k‐means algorithm updates all weights $W$ by batch processing, competitive learning updates one weight $w$ at a time to reduce a part of the quantization error $E_Q$ (see Figure 3c) as follows:

1. Select one input vector $x$ randomly from $X$.

2. Decide a winner $w_c$ from $W$ by

$$c = \arg\min_k \|x - w_k\|^2 \quad (\text{if there are several candidates, choose the smallest } k). \tag{10}$$

3. Update the winner's weight $w_c$ as

$$w_c \leftarrow (1 - \gamma) w_c + \gamma x, \tag{11}$$

where $\gamma$ is a given learning rate (e.g., 0.01–0.1).

Though the winner‐take‐all update in Step 3 (Figure 4), which reduces the partial error $\|x - w_c\|^2$ in the steepest direction, does not always reduce the total quantization error $E_Q$, repeating the update can reduce $E_Q$ on the basis of the stochastic gradient descent method [15, 16]. As a termination condition, a maximum number of iterations $N_r$ (the number of maximum repetitions) can be used. After termination, a step of deciding the clusters $C$ as in Figure 2b is required for clustering purposes.
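The three steps above can be sketched as follows (a pure‐Python illustration under my own naming; the toy data are not from the chapter):

```python
import random

def competitive_learning(X, K, gamma=0.05, n_rep=2000, seed=0):
    """Online competitive learning for vector quantization (cf. Figure 3c)."""
    rng = random.Random(seed)
    W = [list(rng.choice(X)) for _ in range(K)]   # initialize weights from inputs
    for _ in range(n_rep):
        x = rng.choice(X)                         # 1. select one input at random
        c = min(range(K),                         # 2. decide the winner, Eq. (10)
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, W[k])))
        # 3. winner-take-all update, Eq. (11)
        W[c] = [(1 - gamma) * w + gamma * a for w, a in zip(W[c], x)]
    # after termination, decide clusters as in Figure 2b
    labels = [min(range(K),
                  key=lambda k: sum((a - b) ** 2 for a, b in zip(x, W[k])))
              for x in X]
    return labels, W

X = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0], [5.1, 5.1], [5.2, 5.0]]
labels, W = competitive_learning(X, 2)
```

With enough repetitions the two weight vectors settle near the two groups, so the final cluster decision separates them.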

Contrary to natural expectations, competitive learning outperforms k‐means without any special contrivance in most cases. Furthermore, information obtained in the learning process allows us to improve its performance. The splitting rule utilizes the number of times each weight $w$ wins to estimate the density around it. As Figure 5a, b shows, a higher density of input vectors around a weight makes that weight vector $w_a$ win more frequently than $w_b$.

The splitting rule in competitive learning aims to overcome the problem of discrepancy between the distribution of input vectors $X$ and that of weight vectors $W$. The discrepancy causes a few weight vectors $w$ to monopolize $X$ and leads to a solution of very poor quality, but it is impossible to figure out the distribution of the input vectors beforehand. Accordingly, this splitting rule distributes the weight vectors $w$ during the learning process as follows:

1. One weight vector $w_1$ with a counter $\tau_1$ is set. $\tau_1$ denotes how many times weight vector $w_1$ wins and is initialized to 0.

2. Select one input vector $x$, decide the winner $w_c$, and update the winner's weight $w_c$.

3. Add 1 to $\tau_c$. If $\tau_c = \theta$ and the current number of weights $K'$ is less than $K$, generate a new weight vector $w$ that is the same as the winner $w_c$ and reset $\tau$ of both to 0, where $\theta$ is the threshold number of wins for splitting.

4. Repeat 2 and 3 until the termination condition is true.
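The splitting procedure above can be sketched by extending the plain competitive‐learning loop (a pure‐Python illustration assuming Euclidean winner selection as in Section 2.2.2; names and toy data are my own):

```python
import random

def cl_with_splitting(X, K, gamma=0.05, theta=50, n_rep=3000, seed=0):
    """Competitive learning with the splitting rule: start from a single weight
    and split the winner whenever it has won theta times (until K weights exist)."""
    rng = random.Random(seed)
    W = [list(rng.choice(X))]   # 1. a single weight vector w_1 ...
    tau = [0]                   #    ... with its win counter initialized to 0
    for _ in range(n_rep):
        x = rng.choice(X)       # 2. select an input, decide and update the winner
        c = min(range(len(W)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, W[k])))
        W[c] = [(1 - gamma) * w + gamma * a for w, a in zip(W[c], x)]
        tau[c] += 1             # 3. count the win; split at the threshold
        if tau[c] == theta and len(W) < K:
            W.append(list(W[c]))   # new weight = copy of the winner
            tau.append(0)
            tau[c] = 0
    labels = [min(range(len(W)),
                  key=lambda k: sum((a - b) ** 2 for a, b in zip(x, W[k])))
              for x in X]
    return labels, W

X = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0], [5.1, 5.1], [5.2, 5.0]]
labels, W = cl_with_splitting(X, 2)
```

After a split, ties between the identical copies are broken toward the smaller index, so the copy starts winning only once the original has moved, which drives the two weights apart.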

## 3. Information‐theoretic clustering and algorithms

Information‐theoretic clustering (ITC) is closely related to work on distributional clustering and uses the Kullback‐Leibler divergence and the Jensen‐Shannon divergence to define its criterion. Though its algorithms involve a difficulty, and it is effective for high‐dimensional count data (e.g., text data), its definition and properties are similar to those of the sum‐of‐squared‐error criterion. The main contributions of this chapter are to present a technique to overcome this difficulty and to demonstrate the effectiveness of ITC.

Let $X = \{x_i \mid x_i \in \mathbb{R}_+^M, i = 1, \ldots, N\}$ be a set of $M$-dimensional input vectors ($N$ denotes the number of input vectors), where the elements of the vectors $x$ are nonnegative real numbers. We define the $l_1$-norm of an input vector, $t_i$ $\left(= \sum_m |x_{mi}|\right)$, normalized input vectors $p_i = x_i / t_i$, and an input probability distribution $P_i$ whose $m$th random variable takes the $m$th element of $p_i$ $(= p_{mi})$. Let $\mathcal{P} = \{P_1, \ldots, P_N\}$ be the set of input distributions (input data).

Suppose that we assign each distribution $P_i$ a cluster label $k$ ($k = 1, \ldots, K$) to partition the distributions into $K$ clusters $C = \{C_1, \ldots, C_K\}$.

Let $\bar{P}_k$ be the distribution on the mean of the input data $P_i$ that belong to cluster $C_k$ (see Figure 6). Then the generalized Jensen‐Shannon (JS) divergence to be minimized in $C_k$ is defined by

$$D_{JS}(\{P_i \mid P_i \in C_k\}) = \sum_{P_i \in C_k} \pi_i D_{KL}(P_i \,\|\, \bar{P}_k), \qquad \bar{P}_k = \sum_{P_i \in C_k} \pi_i P_i, \tag{12}$$

where $N_k$ is the number of distributions $P_i$ in cluster $C_k$ (i.e., $N = \sum_{k=1}^{K} N_k$), $D_{KL}(P_i \| \bar{P}_k)$ is the Kullback‐Leibler (KL) divergence from $P_i$ to the mean distribution $\bar{P}_k$, and $\pi_i$ is the probability of $P_i$ $\left(\sum_{P_i \in C_k} \pi_i = 1\right)$. Here $\pi_i = 1/N_k$. Then we define the within‐cluster JS divergence $JS_W$, which considers all clusters $C_k$ ($k = 1, \ldots, K$), as

$$JS_W = \sum_{k=1}^{K} \frac{N_k}{N} D_{JS}(\{P_i \mid P_i \in C_k\}) \tag{13}$$
$$= \frac{1}{N} \sum_{k=1}^{K} \sum_{P_i \in C_k} D_{KL}(P_i \,\|\, \bar{P}_k) \tag{14}$$
$$= \frac{1}{N} \sum_{k=1}^{K} \sum_{P_i \in C_k} \sum_{m=1}^{M} p_{mi} \log \frac{p_{mi}}{\bar{p}_{mk}} = \frac{1}{N} \sum_{k=1}^{K} \sum_{P_i \in C_k} \sum_{m=1}^{M} \left( p_{mi} \log p_{mi} - p_{mi} \log \bar{p}_{mk} \right). \tag{15}$$

The within‐cluster JS divergence $JS_W$ is the objective function (criterion) of information‐theoretic clustering (ITC) to be minimized. We also define the between‐cluster JS divergence $JS_B$ and the total JS divergence $JS_T$ as

$$JS_B = D_{JS}(\{\bar{P}_k \mid k = 1, \ldots, K\}) = \sum_{k=1}^{K} \pi_k D_{KL}(\bar{P}_k \,\|\, \bar{P}), \qquad \pi_k = N_k / N, \tag{16}$$
$$JS_T = D_{JS}(\{P_i \mid i = 1, \ldots, N\}) = \sum_{i=1}^{N} \pi_i D_{KL}(P_i \,\|\, \bar{P}), \qquad \pi_i = 1/N, \tag{17}$$

where $\bar{P} = \sum_{i=1}^{N} \pi_i P_i = \frac{1}{N} \sum_{i=1}^{N} P_i$ is the distribution on the mean of all input data. It follows from these definitions that the total JS divergence is the sum of the within‐cluster JS divergence and the between‐cluster JS divergence:

$$JS_T = JS_W + JS_B. \tag{18}$$

Since $JS_T$ is constant for the given input distributions $\mathcal{P}$, minimization of $JS_W$ is equivalent to maximization of $JS_B$. In this sense, clustering based on minimizing the criterion $JS_W$ works to find clusters that are well separated from each other.
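Like Eq. (4), the decomposition in Eq. (18) can be checked numerically. A small pure‐Python sketch with arbitrary illustrative distributions (names are my own):

```python
import math
import random

def kl(P, Q):
    """D_KL(P || Q); assumes Q_m > 0 wherever P_m > 0."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def js_decomposition(P_list, labels, K):
    """Return (JS_W, JS_B, JS_T) of Eqs. (14), (16), and (17)."""
    N, M = len(P_list), len(P_list[0])
    P_bar = [sum(P[m] for P in P_list) / N for m in range(M)]  # mean of all inputs
    JS_W = JS_B = 0.0
    for k in range(K):
        members = [P for P, l in zip(P_list, labels) if l == k]
        if not members:
            continue
        Pk = [sum(P[m] for P in members) / len(members) for m in range(M)]  # cluster mean
        JS_W += sum(kl(P, Pk) for P in members) / N       # Eq. (14)
        JS_B += (len(members) / N) * kl(Pk, P_bar)        # Eq. (16)
    JS_T = sum(kl(P, P_bar) for P in P_list) / N          # Eq. (17)
    return JS_W, JS_B, JS_T

def rand_dist(rng, M=4):
    """A random probability distribution over M outcomes."""
    v = [rng.random() for _ in range(M)]
    s = sum(v)
    return [x / s for x in v]

rng = random.Random(1)
P_list = [rand_dist(rng) for _ in range(12)]
labels = [rng.randrange(3) for _ in range(12)]
JS_W, JS_B, JS_T = js_decomposition(P_list, labels, 3)
```

For any assignment of labels, $JS_T = JS_W + JS_B$ holds up to floating‐point error.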

The definition and properties of ITC shown so far are similar to those of the sum‐of‐squared‐error criterion, which helps us understand ITC.

### 3.1. Generative model

In the background of information‐theoretic clustering (ITC), there also exists the bag‐of‐words assumption, which disregards the order of words in a document. (Since ITC is not limited to document clustering, "word" is just an example of a feature.) It means that the features in the data are conditionally independent and identically distributed, where the condition is a given probability distribution for an input vector. Based on this assumption, we describe a generative probabilistic model related to ITC and clarify the relationship between the model and the objective function (criterion) $JS_W$.

Let an input vector $x = \{x_1, \ldots, x_m, \ldots, x_M\}$ represent the numbers of observations of the $m$th feature. Suppose that there are clusters $C_k$ ($k = 1, \ldots, K$) that generate $t_i$ features for a data point (= input vector) with the probability distribution $\bar{P}_k = \{\bar{p}_{1k}, \ldots, \bar{p}_{Mk}\}$, and the conditional probability of the set of observations of the features in an input vector $x_i$ is expressed by the multinomial distribution

$$p(x_i \mid x_i \in C_k) = A_i \prod_{m=1}^{M} (\bar{p}_{mk})^{x_{mi}}, \qquad A_i = \frac{t_i!}{x_{1i}! \, x_{2i}! \cdots x_{Mi}!}, \qquad t_i = \sum_{m=1}^{M} |x_{mi}|, \tag{19}$$

where $A_i$ is the number of combinations of the observations. Assuming independence of each generation, the joint probability function for the input vectors $X = \{x_1, \ldots, x_N\}$ becomes

$$p(X \mid C) = \prod_{k=1}^{K} \prod_{x_i \in C_k} A_i \prod_{m=1}^{M} (\bar{p}_{mk})^{x_{mi}}, \tag{20}$$

where $C$ indicates the cluster information that specifies which cluster $C_k$ each input vector $x_i$ belongs to. Taking the logarithm of Eq. (20) yields

$$\ln p(X \mid C) = \sum_{i=1}^{N} \log A_i + \sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{m=1}^{M} x_{mi} \log \bar{p}_{mk} \tag{21}$$
$$= \sum_{i=1}^{N} \log A_i + \sum_{k=1}^{K} \sum_{P_i \in C_k} \sum_{m=1}^{M} t_i \, p_{mi} \log \bar{p}_{mk}. \tag{22}$$

This is a generative probabilistic model related to ITC. If we assume that $t_i$ takes a constant value $t$ for all input vectors, maximization of the probability (22), like minimization of the objective function $JS_W$ (15), comes down to the minimization of

$$-\frac{1}{N} \sum_{k=1}^{K} \sum_{P_i \in C_k} \sum_{m=1}^{M} p_{mi} \log \bar{p}_{mk} = -\frac{1}{N} \sum_{k=1}^{K} N_k \sum_{m=1}^{M} \bar{p}_{mk} \log \bar{p}_{mk}, \tag{23}$$

for the given input distributions $\mathcal{P}$. Here the relationship $\sum_{P_i \in C_k} p_{mi} = N_k \bar{p}_{mk}$ is used. Since $t_i$ may not be a constant value $t$, the generative model (22) is not an equivalent model of ITC but a related model. This difference comes from the fact that the model treats each observation of a feature equally, while ITC treats each data point (input vector) equally. Though the additional assumption $t_i = t$ is required, ITC works to find the most probable solution $C$ in the generative probabilistic model. Furthermore, Eq. (23) is also the minimization of entropy within clusters, as the right‐hand side of Eq. (23) shows. Entropy (specifically, Shannon entropy) is the expected value of the information contained in each message, which here is an input distribution. The smaller the entropy becomes, the more compactly a model can explain the observations (input distributions). In this sense, the objective function $JS_W$ (15) represents the goodness of the generative model. The relationship (including the difference) between the probabilistic model and the objective function $JS_W$ is meaningful for improving the model and the objective function in the future.

Choosing an appropriate model for the data is important when analyzing them. For example, large sets of text documents contain many kinds of words and are represented as high‐dimensional vectors. Taking the extreme diversity of documents' topics into account, the feature vectors of documents are distributed almost uniformly in the vector space. As is known from "the curse of dimensionality," most of the volume of a sphere in high‐dimensional space is concentrated near the surface, and it becomes inappropriate to choose a model based on a Gaussian distribution, which concentrates values around the mean. In contrast, ITC, based on the multinomial distribution, is a reasonable and useful tool to analyze such high‐dimensional count data, because the generative model of ITC is consistent with them.

We introduce weight distributions $Q_k$ ($k = 1, \ldots, K$) ($Q$) that represent the clusters $C_k$ and serve as both the mean distribution $\bar{P}_k$ and the prototype distribution of cluster $C_k$, in a manner similar to that of the sum‐of‐squared‐error (SSE) criterion (see Section 2.2.1). Figure 7 shows the relationships between the parameters in the generative models. Parameters are generated or estimated from other parameters to maximize the probability of the generative model. For example, clustering is the task of finding the most probable clusters $C$ for given input vectors $X$ or input distributions $\mathcal{P}$. In Figure 7b, constructing a classifier is the task of finding $Q$ for given $\mathcal{P}$ and $C$ (classes in this context) in the training process. Then it estimates $C$ for unknown $\mathcal{P}$ using the trained $Q$. The classifier using the multinomial distribution is known as the multinomial naive Bayes classifier. As this shows, ITC and the naive Bayes classifier have a close relationship. Figure 7. Relationships between model parameters. (a) Clustering based on the SSE criterion. (b) Information‐theoretic clustering.

### 3.2. Algorithms

There exists a difficulty (Appendix A) in algorithms for ITC. We show a novel idea to overcome it.

#### 3.2.1. Competitive learning

When competitive learning decides a winner for an input distribution $P$, it directly faces the difficulty of calculating the KL divergence from $P$ to the weight distributions $Q$ (see Appendix A). To overcome this difficulty, we present the idea of changing the order of the steps in competitive learning (CL). As shown in Figure 8b, CL updates all weights (= weight distributions) before deciding the winner by

$$Q_k \leftarrow (1 - \gamma) Q_k + \gamma P, \tag{24}$$

where $\gamma$ is a learning rate. Since the updated weight distributions $Q_k$ ($k = 1, \ldots, K$) include all words (features) of the input distribution $P$, it is possible to calculate the KL divergence $D_{KL}(P \,\|\, Q_k)$ for all $k$. In the following steps, CL decides a winner $Q_c$ from $Q$ by

$$c = \arg\min_k D_{KL}(P \,\|\, Q_k) \quad (\text{if there are several candidates, choose the smallest } k), \tag{25}$$

Figure 8. Flow charts of competitive learning. (a) Competitive learning for SSE. (b) Competitive learning for ITC.

and activates the winner's update while discarding the others. These steps satisfy CL's requirement that it partially reduces the value of the objective function $JS_W$ in the steepest direction with the given learning rate $\gamma$. Here, neither approximation nor distortion is added to the criterion of ITC. Note that the updates of the weight distributions $Q_k$ before activation are provisional (see Figure 8b).

Related work that avoids the difficulty of calculating the KL divergence introduced the skew divergence, defined as

$$s_\alpha(P, Q) = D_{KL}(P \,\|\, \alpha Q + (1 - \alpha) P), \tag{26}$$

where $\alpha$ ($0 \le \alpha \le 1$) is the mixture ratio of the distributions. The skew divergence is exactly the KL divergence at $\alpha = 1$. When $\alpha = 1 - \gamma$, Eq. (26) becomes similar to Eq. (24). Then we can rewrite the CL steps above using the skew divergence as follows:

1. Select one input distribution $P$ randomly from $\mathcal{P}$.

2. Decide a winner $Q_c$ from $Q$ by

$$c = \arg\min_k s_\alpha(P, Q_k) \quad (\text{if there are several candidates, choose the smallest } k), \tag{27}$$

3. Update the winner's weight distribution $Q_c$ as

$$Q_c \leftarrow (1 - \gamma) Q_c + \gamma P, \tag{28}$$

where $\gamma$ is a learning rate, usually equal to $1 - \alpha$ ($\alpha$ is the mixture ratio for $s_\alpha$).

Hence, we call this novel algorithm for ITC "competitive learning using skew divergence" (sdCL). In addition, the splitting rule in competitive learning can also be applied to this algorithm.
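Under the description above, the sdCL loop can be sketched as follows (a pure‐Python illustration; the input is a list of probability distributions, and the names and toy data are my own):

```python
import math
import random

def skew_div(P, Q, alpha):
    """Skew divergence s_alpha(P, Q) = D_KL(P || alpha*Q + (1-alpha)*P), Eq. (26).
    Well defined even where Q has zeros, because (1-alpha)*P keeps the mixture positive."""
    return sum(p * math.log(p / (alpha * q + (1 - alpha) * p))
               for p, q in zip(P, Q) if p > 0)

def sdcl(P_list, K, gamma=0.05, n_rep=3000, seed=0):
    """Competitive learning using skew divergence (sdCL), with alpha = 1 - gamma."""
    rng = random.Random(seed)
    alpha = 1 - gamma
    Q = [list(rng.choice(P_list)) for _ in range(K)]  # initialize weight distributions
    for _ in range(n_rep):
        P = rng.choice(P_list)                        # 1. select one input distribution
        c = min(range(K),                             # 2. winner by skew div., Eq. (27)
                key=lambda k: skew_div(P, Q[k], alpha))
        # 3. update the winner's weight distribution, Eq. (28)
        Q[c] = [(1 - gamma) * q + gamma * p for q, p in zip(Q[c], P)]
    labels = [min(range(K), key=lambda k: skew_div(P, Q[k], alpha)) for P in P_list]
    return labels, Q

# two groups of distributions concentrated on different features (illustrative)
groupA = [[0.80, 0.10, 0.05, 0.05], [0.70, 0.20, 0.05, 0.05], [0.75, 0.15, 0.05, 0.05]]
groupB = [[0.05, 0.05, 0.10, 0.80], [0.05, 0.05, 0.20, 0.70], [0.05, 0.05, 0.15, 0.75]]
labels, Q = sdcl(groupA + groupB, 2)
```

The weight distributions settle near the means of the two groups, and the final labels separate them.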

#### 3.2.2. k‐means type algorithm

Dhillon et al. proposed the information‐theoretic divisive algorithm, a k‐means type algorithm with a divisive mechanism that uses the KL divergence.

The algorithm was proposed for feature/word clustering and applied to text classification. Since the algorithm uses document classes (labeled data), it cannot be applied to the general clustering problem.

Moreover, the difficulty of using the KL divergence directly still remains. In this situation, we propose to use the skew divergence instead of the KL divergence in a k‐means type algorithm as follows:

1. Initialize the clusters $C$ of the input distributions $\mathcal{P}$ randomly.

2. Update the weight distributions $Q$ by

$$Q_k = \frac{1}{N_k} \sum_{P_i \in C_k} P_i. \tag{29}$$

3. Update the cluster $c$ of each input distribution $P_i$ by

$$c = \arg\min_k s_\alpha(P_i, Q_k) \quad (\text{if there are several candidates, choose the smallest } k), \tag{30}$$

where the mixture ratio $\alpha$ ($0 \le \alpha \le 1$) for the skew divergence $s_\alpha$ is, for example, 0.99.

4. Repeat 2 and 3 until the change ratio of the objective function $JS_W$ is less than a small value (e.g., $10^{-8}$).

The algorithm itself works well to obtain valuable clustering results. Further, if $\alpha$ is close to 1, the skew divergence $s_\alpha$ becomes a good approximation of the KL divergence. Therefore, restarting the learning after termination with $\alpha$ closer to 1, such as $0.999, 0.9999, \ldots$, may lead to better clustering results.
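Steps 1–4 can be sketched as follows (a pure‐Python illustration with my own names and toy data; the empty‐cluster re‐seeding and the iteration cap are my additions for robustness, not part of the chapter's description):

```python
import math
import random

def skew_div(P, Q, alpha):
    """Skew divergence s_alpha(P, Q) = D_KL(P || alpha*Q + (1-alpha)*P), Eq. (26)."""
    return sum(p * math.log(p / (alpha * q + (1 - alpha) * p))
               for p, q in zip(P, Q) if p > 0)

def js_w(P_list, labels, K):
    """Within-cluster JS divergence, Eq. (14), used for the stopping test in step 4."""
    N, M = len(P_list), len(P_list[0])
    total = 0.0
    for k in range(K):
        members = [P for P, l in zip(P_list, labels) if l == k]
        if not members:
            continue
        Qk = [sum(P[m] for P in members) / len(members) for m in range(M)]
        total += sum(p * math.log(p / Qk[m])
                     for P in members for m, p in enumerate(P) if p > 0) / N
    return total

def sdkm(P_list, K, alpha=0.99, tol=1e-8, max_iter=100, seed=0):
    """k-means type algorithm using skew divergence (sdKM), steps 1-4 of Section 3.2.2."""
    rng = random.Random(seed)
    labels = [rng.randrange(K) for _ in P_list]      # 1. random initialization
    M = len(P_list[0])
    prev = None
    Q = []
    for _ in range(max_iter):
        Q = []
        for k in range(K):                           # 2. update weights, Eq. (29)
            members = [P for P, l in zip(P_list, labels) if l == k]
            if members:
                Q.append([sum(P[m] for P in members) / len(members) for m in range(M)])
            else:
                Q.append(list(rng.choice(P_list)))   # re-seed an empty cluster
        labels = [min(range(K), key=lambda k: skew_div(P, Q[k], alpha))
                  for P in P_list]                   # 3. update clusters, Eq. (30)
        cur = js_w(P_list, labels, K)
        if prev is not None and prev - cur <= tol * max(prev, 1e-12):
            break                                    # 4. stop when the change is tiny
        prev = cur
    return labels, Q

groupA = [[0.80, 0.10, 0.05, 0.05], [0.70, 0.20, 0.05, 0.05], [0.75, 0.15, 0.05, 0.05]]
groupB = [[0.05, 0.05, 0.10, 0.80], [0.05, 0.05, 0.20, 0.70], [0.05, 0.05, 0.15, 0.75]]
labels, Q = sdkm(groupA + groupB, 2)
```

Restarting with $\alpha$ closer to 1, as the text suggests, just means calling `sdkm`-style refinement again while reusing the previous labels.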

#### 3.2.3. Other algorithms

Slonim and Tishby proposed an agglomerative hierarchical clustering algorithm, which is a hard clustering version of the Information Bottleneck algorithm of Tishby et al. It is similar to the algorithm of Baker and McCallum and merges just two clusters at every step based on the JS divergence of their distributions. A merit of the agglomerative algorithms is that they are not affected by the difficulty of calculating the KL divergence, because they use only the JS divergence. However, a merge of clusters at each step optimizes a local criterion, not a global one, as Dhillon et al. pointed out. Therefore, the clustering results may not be as good as those obtained by nonhierarchical algorithms (e.g., k‐means and competitive learning) in the sense of optimizing the objective function of ITC. Additionally, hierarchical algorithms are computationally expensive when the number of inputs is large.

Note that many studies [8, 18, 23] aimed at improving the accuracy of text classification using feature/word clustering based on ITC or distributional clustering. If clustering is just a step toward a final goal, feature clustering is meaningful. However, features that characterize clusters should not be merged when we aim to find clusters (topics) in a set of documents. Indeed, finding topics using clustering is the aim of this chapter.

## 4. Evaluation of clustering

Since clustering results depend on methods (criteria and algorithms), appropriate selection of them is important. So far, we have introduced two criteria for clustering. These are called internal criteria; they depend on their own models and are not enough for evaluation. If the criterion for clustering is common, we can compare clustering results by the objective function of the criterion. Under a certain model, which is in other words an assumption, a more probable result can be regarded as a better result. However, it is not guaranteed that the model or the assumption is reasonable at all times. Moreover, good clustering results under one criterion can be bad results under different criteria. A view from outside is required.

This section introduces external criteria, namely purity, the Rand index (RI), and normalized mutual information (NMI), to evaluate clustering quality and to find better clustering methods. These criteria compare clusters with a set of classes, which are produced on the basis of human judgment. Here, each input data point belongs to one class $A_j$ ($j = 1, \ldots, J$) and one cluster $C_k$ ($k = 1, \ldots, K$). Let $T(C_k, A_j)$ be the number of data points that belong to both $C_k$ and $A_j$.

Purity is measured by counting the number of input data points from the most frequent class in each cluster. Purity can be computed as

$$\text{purity} = \frac{1}{N} \sum_{k=1}^{K} \max_j T(C_k, A_j), \tag{31}$$

where $N$ is the total number of input data points. Purity is close to 1 when each cluster has one dominant class.
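Eq. (31) translates directly into code; a minimal sketch (the label lists are illustrative):

```python
from collections import Counter

def purity(clusters, classes):
    """Purity, Eq. (31): clusters and classes are per-point label lists."""
    N = len(clusters)
    # T(C_k, A_j): count data points in each (cluster, class) pair
    T = Counter(zip(clusters, classes))
    best = {}
    for (k, j), n in T.items():
        best[k] = max(best.get(k, 0), n)   # max_j T(C_k, A_j) per cluster
    return sum(best.values()) / N

# cluster 0 is pure ('a' x3); cluster 1 has a dominant class 'b' (2 of 3)
clusters = [0, 0, 0, 1, 1, 1]
classes  = ['a', 'a', 'a', 'b', 'b', 'a']
score = purity(clusters, classes)
```

Here the dominant classes contribute $3 + 2$ points, so the purity is $5/6$.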

The Rand index (RI) checks all of the $N(N-1)/2$ pairs of input data points and is defined by

$$RI = \frac{a + b}{a + b + c + d}, \tag{32}$$

where $a$, $b$, $c$, and $d$ are the numbers of pairs in the following conditions:

• "$a$": the pair is in the same cluster and in the same class

• "$b$": the pair is in different clusters and in different classes

• "$c$": the pair is in the same cluster but in different classes

• "$d$": the pair is in different clusters but in the same class

The Rand index (RI) measures the percentage of agreements, $a + b$, between clusters and classes.
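A direct sketch of Eq. (32), enumerating all pairs (label lists are illustrative):

```python
from itertools import combinations

def rand_index(clusters, classes):
    """Rand index, Eq. (32), over all N(N-1)/2 pairs of data points."""
    a = b = c = d = 0
    for (k1, j1), (k2, j2) in combinations(zip(clusters, classes), 2):
        same_cluster, same_class = (k1 == k2), (j1 == j2)
        if same_cluster and same_class:
            a += 1      # agreement: grouped together in both
        elif not same_cluster and not same_class:
            b += 1      # agreement: separated in both
        elif same_cluster:
            c += 1      # same cluster, different classes
        else:
            d += 1      # different clusters, same class
    return (a + b) / (a + b + c + d)

clusters = [0, 0, 1, 1]
classes  = ['a', 'a', 'b', 'b']
ri = rand_index(clusters, classes)
```

When the clusters exactly reproduce the classes, every pair agrees ($c = d = 0$) and RI is 1.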

Normalized mutual information (NMI) is defined as

$$NMI = \frac{I(C; A)}{(H(C) + H(A))/2}, \tag{33}$$

where $I(C; A)$ is the mutual information, $H(\cdot)$ is the entropy, and

$$I(C; A) = \sum_{k=1}^{K} \sum_{j=1}^{J} P(C_k, A_j) \log \frac{P(C_k, A_j)}{P(C_k) P(A_j)} = \sum_{k=1}^{K} \sum_{j=1}^{J} \frac{T(C_k, A_j)}{N} \log \frac{T(C_k, A_j)\,N}{T(C_k)\,T(A_j)}, \tag{34}$$
$$H(C) = -\sum_{k=1}^{K} P(C_k) \log P(C_k) = -\sum_{k=1}^{K} \frac{T(C_k)}{N} \log \frac{T(C_k)}{N}, \tag{35}$$
$$H(A) = -\sum_{j=1}^{J} P(A_j) \log P(A_j) = -\sum_{j=1}^{J} \frac{T(A_j)}{N} \log \frac{T(A_j)}{N}, \tag{36}$$

where $P(C_k)$, $P(A_j)$, and $P(C_k, A_j)$ are the probabilities of a data point being in cluster $C_k$, in class $A_j$, and in the intersection of $C_k$ and $A_j$, respectively. The mutual information $I(C; A)$ measures the mutual dependence between the clusters $C$ and classes $A$. It quantifies the amount of information obtained about the classes through knowing the clusters. Hence, a high NMI shows a kind of goodness of clustering in terms of information theory.
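Eqs. (33)–(36) can be computed from the contingency counts $T(\cdot,\cdot)$ alone; a minimal sketch (label lists are illustrative):

```python
import math
from collections import Counter

def nmi(clusters, classes):
    """Normalized mutual information, Eqs. (33)-(36)."""
    N = len(clusters)
    T_kj = Counter(zip(clusters, classes))   # T(C_k, A_j)
    T_k = Counter(clusters)                  # T(C_k)
    T_j = Counter(classes)                   # T(A_j)
    I = sum(n / N * math.log(n * N / (T_k[k] * T_j[j]))
            for (k, j), n in T_kj.items())                      # Eq. (34)
    H_C = -sum(n / N * math.log(n / N) for n in T_k.values())   # Eq. (35)
    H_A = -sum(n / N * math.log(n / N) for n in T_j.values())   # Eq. (36)
    return I / ((H_C + H_A) / 2)                                # Eq. (33)

clusters = [0, 0, 1, 1]
classes  = ['a', 'a', 'b', 'b']
score = nmi(clusters, classes)
```

When the clusters match the classes exactly, $I(C;A) = H(C) = H(A)$ and the NMI is 1.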

## 5. Experiments

This section provides experimental results that show the effectiveness and usefulness of ITC and the proposed algorithm (sdCL: competitive learning using skew divergence). The experiments consist of two parts, experiment 1 and experiment 2.

In experiment 1, we applied sdCL to the same data sets as used in the paper of Wang et al. and compared the performance of sdCL with the other clustering algorithms evaluated there. The algorithms evaluated in that paper are as follows.

• k‐means (KM). The weights $W$ are initialized randomly.

• Normalized cut (NC).

• Maximum margin clustering (MMC).

• Generalized maximum margin clustering (GMMC).

• Iterative support vector regression (IterSVR).

• Cutting plane maximum margin clustering (CPMMC), CPM3C.

As shown above, maximum margin clustering (MMC) and related works have received much attention. These works extend the idea of the support vector machine (SVM) to the unsupervised scenario. The experimental results obtained by the MMC technique are often better than those of conventional clustering methods. Among these, CPMMC and CPM3C (cutting plane multiclass maximum margin clustering) are known as successful methods. The experimental results will show that the proposed algorithm sdCL outperforms CPM3C in text data clustering.

In experiment 2, we focus on text data clustering and compare the performance of the algorithms sdCL, sdCLS (sdCL with the splitting rule; see Sections 2.2.2 and 3.2.1), and spherical competitive learning (spCL). We also provide evidence supporting the effectiveness of ITC through detailed analysis of the clustering results.

spCL is an algorithm for spherical clustering, like the spherical k‐means algorithm, which was proposed for clustering high‐dimensional and sparse data such as text data. The objective function to be maximized in spherical clustering is the cosine similarity between the input vectors and the mean vector of the cluster to which they belong. To implement spCL, we normalize the input and weight vectors ($x$, $w$) to unit vectors and decide the winner $w_c$ by

$$c = \arg\max_k \cos(x, w_k) \quad (\text{if there are several candidates, choose the smallest } k), \tag{37}$$

and update the winner's weight $w_c$ as

$$w_c \leftarrow \frac{(1 - \gamma) w_c + \gamma x}{\|(1 - \gamma) w_c + \gamma x\|}. \tag{38}$$
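The winner selection of Eq. (37) and the renormalized update of Eq. (38) can be sketched as follows (pure Python; names and toy data are my own):

```python
import math
import random

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def spcl(X, K, gamma=0.01, n_rep=2000, seed=0):
    """Spherical competitive learning: cosine winner (Eq. (37)),
    renormalized winner update (Eq. (38))."""
    rng = random.Random(seed)
    X = [unit(x) for x in X]                 # turn inputs into unit vectors
    W = [list(rng.choice(X)) for _ in range(K)]
    for _ in range(n_rep):
        x = rng.choice(X)
        # for unit vectors, cos(x, w_k) is simply the dot product
        c = max(range(K), key=lambda k: sum(a * b for a, b in zip(x, W[k])))
        W[c] = unit([(1 - gamma) * w + gamma * a for w, a in zip(W[c], x)])
    labels = [max(range(K), key=lambda k: sum(a * b for a, b in zip(x, W[k])))
              for x in X]
    return labels, W

# two groups of vectors pointing in clearly different directions (illustrative)
groupA = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [1.0, 0.1, 0.1]]
groupB = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.1, 1.0, 0.1]]
labels, W = spcl(groupA + groupB, 2)
```

Because all vectors are normalized, only direction matters, which is the point of the spherical criterion for sparse text vectors.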

For all competitive learning algorithms, we use the learning rate $\gamma = 0.01$, the maximum number of repetitions for updating weights $N_r = 1{,}000{,}000$ (termination condition), and the threshold of wins for the splitting rule $\theta = 1000$. After competitive learning (sdCL, sdCLS, or spCL) terminates, we apply a k‐means type algorithm as post‐processing to remove fluctuation. Specifically, sdKM (the k‐means type algorithm using skew divergence shown in Section 3.2.2) with $\alpha = 0.999, 0.9999, 0.99999$ is applied consecutively after sdCL and sdCLS. In each learning procedure, including post‐processing, the operation is iterated 50 times with different initial random seeds for a given set of parameters.

### 5.1. Data sets

We mainly use the same data sets as used in the paper of Wang et al. When applying algorithms for ITC, we use the probability distributions $P_i$ ($i = 1, \ldots, N$) ($\mathcal{P}$) derived from the original data.

1. UCI data. From the UCI repository (http://archive.ics.uci.edu/ml/ [Accessed: 2016-10-25]), we use ionosphere, digits, letter, and satellite under the same settings as the paper. The digits data ($8 \times 8$ matrix) are generated from bitmaps of handwritten digits. The pairs (3 vs. 8, 1 vs. 7, 2 vs. 7, and 8 vs. 9) are focused on because of the difficulty of differentiating them. For the letter and satellite data sets, their first two classes are used. Since the ionosphere data contain negative values and cannot be transformed into probability distributions, we do not apply ITC to them.

2. Text data. Four text data sets are used: 20Newsgroups (http://qwone.com/∼jason/20Newsgroups/ [Accessed: 2016‐10‐25]), WebKB (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/ [Accessed: 2016-10-25]), Cora, and RCV1 (Reuters Corpus Volume 1). In experiment 1, we follow the settings of the paper. For the 20Newsgroups data set, the topic "rec," which contains four topics {autos, motorcycles, baseball, hockey}, is used. From these four topics, two two‐class data sets {Text‐1: autos vs. motorcycles, Text‐2: baseball vs. hockey} are extracted. From the WebKB data sets, the four‐universities data set (Cornell, Texas, Washington, and Wisconsin), which has seven classes (student, faculty, staff, department, course, project, and other), is used. Note that the topic of the "other" class is ambiguous and may contain various topics (e.g., faculty), because it is a collection of pages that were not deemed the "main page" representing an instance of the other six classes, as pointed out on the web page of the data set. The Cora data set (Cora research paper classification) is a set of information on research papers classified into a topic hierarchy. From this data set, papers in the subfields {data structures (DS), hardware and architecture (HA), machine learning (ML), operating systems (OS), programming languages (PL)} are used. We select the papers that contain a title and abstract. The RCV1 data set contains more than 800 thousand documents to which topic categories are assigned. We use the documents with the highest four topic codes (CCAT, ECAT, GCAT, and MCAT) in the topic‐code hierarchy in the training set. Multi‐labeled instances are removed.

In experiment2, we use all of the 20Newsgroups and RCV1 data sets. For the RCV1 data set, we obtain 53 classes (categories) by mapping the data set to the second level of the RCV1 topic hierarchy and removing multi‐labeled instances. For the WebKB data set, we remove the “other” class due to its ambiguity, use the remaining six classes, and do not use information about universities.

For all text data, we remove stop words using a stop list and remove documents that become empty. In experiment1, we follow the setting of the paper, but the properties of the data sets are slightly different (see Table 3). For the Cora data sets, the differences in data sizes are large. However, they should retain the same (or at least close) characteristics (e.g., distributions of words and topics), because they are extracted from the same source.
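The preprocessing above amounts to dropping stop-word columns of a document-term count matrix and L1-normalizing each remaining document into a probability distribution. A minimal sketch (the function and variable names are illustrative, not from the chapter's implementation):

```python
import numpy as np

def to_distributions(counts, stop_ids=()):
    """Turn raw term-count vectors into probability distributions:
    drop stop-word columns, remove empty documents, L1-normalize rows."""
    X = np.asarray(counts, dtype=float)
    stop = set(stop_ids)
    keep = [j for j in range(X.shape[1]) if j not in stop]
    X = X[:, keep]
    X = X[X.sum(axis=1) > 0]          # drop documents that became empty
    return X / X.sum(axis=1, keepdims=True)

# toy corpus: 3 documents, 4 terms; term 0 is a stop word
docs = [[5, 2, 0, 1], [9, 0, 3, 0], [4, 0, 0, 0]]
P = to_distributions(docs, stop_ids=[0])
print(P)   # each remaining row sums to 1; the all-stop-word document is dropped
```

Each row of `P` can then be treated as the distribution Pi used by the ITC criterion.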

3. Digits data. USPS (16×16) and MNIST (28×28) are data sets of handwritten digit images. For the USPS data set, images of digits 1, 2, 3, and 4 are used. For MNIST and the digits data from the UCI repository, all 45 pairs of digits 0–9 are used in two‐class problems.

The properties of those data sets are listed in Table 3.

| Data | Size (N) | Feature (M) | Class (K) |
|---|---|---|---|
| Ionosphere | 351 | 34 | 2 |
| Letter | 1555 | 16 | 2 |
| Digits | 1555 | 64 | 2 |
| Satellite | 2236 | 36 | 2 |
| Text‐1 | 1981 | 16,259 | 2 |
| Text‐2 | 1987 | 15,955 | 2 |
| 20Newsgroups‐4 | 3967 | 24,506 | 4 |
| 20Newsgroups‐20 | 18,772 | 60,698 | 20 |
| Cora‐DS | 2397 | 5745 | 9 |
| Cora‐HA | 913 | 3340 | 7 |
| Cora‐ML | 3569 | 6809 | 7 |
| Cora‐OS | 2084 | 5029 | 4 |
| Cora‐PL | 3026 | 6069 | 9 |
| WebKB‐Cornell | 835 | 5574 | 7 |
| WebKB‐Texas | 808 | 4482 | 7 |
| WebKB‐Washington | 1191 | 7779 | 7 |
| WebKB‐Wisconsin | 1218 | 8270 | 7 |
| WebKB6 | 4219 | 14,142 | 6 |
| Reuters‐RCV1‐4 | 19,806 | 44,214 | 4 |
| Reuters‐RCV1‐53 | 534,135 | 216,704 | 53 |
| MNIST | 70,000 | 784 | 2 |
| USPS | 3046 | 256 | 4 |

### Table 3.

Properties of data sets.

### 5.2. Results of experiment1

The clustering results are shown in Tables 4–7, where the values (except for sdCL) are the same as in the paper of Wang et al. (accuracy in that paper is equivalent to purity by definition). In two‐class problems, CPMMC outperforms the other algorithms in purity and Rand index (RI) in most cases. The proposed algorithm sdCL shows stable performance, except for ionosphere, to which sdCL cannot be applied. In multiclass problems, sdCL outperforms the other algorithms for text data (Cora, 20Newsgroups‐4, and Reuters‐RCV1‐4). These results show that ITC and the proposed algorithm sdCL are effective for text data sets. Note that CPM3C shows better results than sdCL for the WebKB data. However, the topic of the “other” class in WebKB is ambiguous (see Section 5.1). The occupation ratios of this class are large {0.710, 0.689, 0.777, 0.739} and almost the same as the purity values of CPM3C and sdCL, which means that both algorithms failed to find meaningful clusters in terms of purity. Therefore, the WebKB data are not appropriate for evaluation unless the “other” class is removed.
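Purity and RI in the tables below can be computed directly from cluster assignments and true class labels. A minimal sketch (not the evaluation code used in the experiments):

```python
from collections import Counter
from itertools import combinations

def purity(clusters, classes):
    """Fraction of points assigned to the majority class of their cluster."""
    total = 0
    for k in set(clusters):
        members = [c for z, c in zip(clusters, classes) if z == k]
        total += Counter(members).most_common(1)[0][1]
    return total / len(clusters)

def rand_index(clusters, classes):
    """Fraction of point pairs on which the clustering and the true
    classes agree (both together or both apart)."""
    agree = 0
    pairs = list(combinations(range(len(clusters)), 2))
    for i, j in pairs:
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        agree += (same_cluster == same_class)
    return agree / len(pairs)

z = [0, 0, 1, 1, 1]   # cluster assignments
y = [0, 0, 1, 1, 0]   # true classes
print(purity(z, y))      # 0.8
print(rand_index(z, y))  # 0.6
```

The pairwise loop is O(N²) and is meant only to make the definition explicit; for large N a contingency-table formulation is preferable.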

| Data | KM | NC | MMC | GMMC | IterSVR | CPMMC | sdCL |
|---|---|---|---|---|---|---|---|
| Iono | 0.5428 | 0.7500 | **0.7875** | 0.7650 | 0.7770 | 0.7548 | – |
| Letter | 0.8206 | 0.7680 | – | – | 0.9280 | **0.9502** | 0.9267 |
| Satellite | 0.9593 | 0.9579 | – | – | 0.9682 | **0.9879** | 0.9506 |
| Text‐1 | 0.5053 | 0.9379 | – | – | **0.9702** | 0.9500 | 0.9306 |
| Text‐2 | 0.5038 | 0.9135 | – | – | 0.9399 | **0.9721** | 0.9591 |
| Dig 3–8 | 0.9468 | 0.6500 | 0.9000 | 0.9440 | 0.9664 | **0.9688** | 0.9440 |
| Dig 1–7 | 0.9445 | 0.5500 | 0.6875 | 0.9780 | 0.9945 | **1.000** | **1.000** |
| Dig 2–7 | 0.9691 | 0.6600 | 0.9875 | 0.9950 | **1.000** | **1.000** | 0.9891 |
| Dig 8–9 | 0.9068 | 0.5200 | 0.9625 | 0.8400 | 0.9633 | **0.9812** | 0.8910 |
| UCI‐dig | 0.9638 | 0.9757 | – | – | 0.9818 | **0.9940** | 0.9516 |
| MNIST | 0.8921 | 0.8992 | – | – | 0.9241 | **0.9621** | 0.8812 |

### Table 4.

Purity comparisons for two‐class problems.

Bold fonts indicate the maximum purities for a given data set.

| Data | KM | NC | MMC | GMMC | IterSVR | CPMMC | sdCL |
|---|---|---|---|---|---|---|---|
| Iono | 0.56 | 0.63 | **0.67** | 0.64 | 0.56 | 0.65 | – |
| Letter | 0.71 | 0.64 | – | – | 0.87 | **0.92** | 0.86 |
| Satellite | 0.92 | 0.92 | – | – | 0.94 | **0.97** | 0.91 |
| Text‐1 | 0.50 | 0.88 | – | – | **0.94** | **0.94** | 0.87 |
| Text‐2 | 0.50 | 0.84 | – | – | 0.89 | **0.93** | 0.92 |
| Dig 3–8 | 0.90 | 0.55 | 0.82 | 0.90 | 0.94 | **0.95** | 0.89 |
| Dig 1–7 | 0.99 | 0.50 | 0.57 | 0.96 | 0.99 | **1.0** | **1.0** |
| Dig 2–7 | 0.94 | 0.55 | 0.98 | 0.99 | **1.0** | **1.0** | 0.98 |
| Dig 8–9 | 0.84 | 0.50 | 0.93 | 0.73 | 0.93 | **0.97** | 0.81 |
| UCI‐dig | 0.93 | 0.96 | – | – | 0.97 | **0.99** | 0.93 |
| MNIST | 0.81 | 0.82 | – | – | 0.86 | **0.94** | 0.83 |

### Table 5.

Rand Index (RI) comparisons for two‐class problems.

Bold fonts indicate the maximum Rand indices for a given data set.

| Data | KM | NC | MMC | CPM3C | sdCL |
|---|---|---|---|---|---|
| UCI‐digits 0689 | 0.4223 | 0.9313 | 0.9483 | **0.9674** | 0.9394 |
| UCI‐digits 1279 | 0.4042 | 0.9011 | 0.9191 | **0.9452** | 0.8300 |
| USPS | 0.9215 | 0.9011 | 0.9191 | 0.9452 | **0.9515** |
| Cora‐DS | 0.2824 | 0.3688 | – | 0.4415 | **0.5057** |
| Cora‐HA | 0.3402 | 0.4200 | – | 0.5980 | **0.6145** |
| Cora‐ML | 0.2708 | 0.3103 | – | 0.4549 | **0.5974** |
| Cora‐OS | 0.2387 | 0.2303 | – | 0.5916 | **0.6686** |
| Cora‐PL | 0.3380 | 0.3397 | – | 0.4721 | **0.4729** |
| WebKB‐Cornell | 0.5571 | 0.6143 | – | **0.7205** | 0.7192 |
| WebKB‐Texas | 0.4505 | 0.3538 | – | **0.6910** | 0.6895 |
| WebKB‐Washington | 0.5352 | 0.3285 | – | **0.7817** | 0.7767 |
| WebKB‐Wisconsin | 0.4953 | 0.3331 | – | **0.7425** | 0.7397 |
| 20Newsgroups‐4 | 0.3527 | 0.4189 | – | 0.7134 | **0.9360** |
| Reuters‐RCV1‐4 | 0.2705 | – | – | 0.6235 | **0.8064** |

### Table 6.

Purity comparisons for multiclass problems.

Bold fonts indicate the maximum purities for a given data set.

| Data | KM | NC | MMC | CPM3C | sdCL |
|---|---|---|---|---|---|
| UCI‐digits 0689 | 0.696 | 0.939 | 0.941 | **0.974** | 0.945 |
| UCI‐digits 1279 | 0.4042 | 0.9011 | 0.9191 | **0.945** | 0.868 |
| USPS | 0.932 | 0.938 | – | 0.950 | **0.958** |
| Cora‐DS | 0.589 | 0.744 | – | 0.746 | **0.823** |
| Cora‐HA | 0.385 | 0.659 | – | 0.695 | **0.767** |
| Cora‐ML | 0.514 | 0.720 | – | 0.761 | **0.802** |
| Cora‐OS | 0.518 | 0.522 | – | 0.730 | **0.735** |
| Cora‐PL | 0.643 | 0.675 | – | 0.712 | **0.819** |
| WebKB‐Cornell | 0.603 | 0.602 | – | **0.724** | 0.483 |
| WebKB‐Texas | 0.604 | 0.602 | – | **0.712** | 0.495 |
| WebKB‐Washington | 0.616 | 0.581 | – | **0.752** | 0.426 |
| WebKB‐Wisconsin | 0.581 | 0.509 | – | **0.761** | 0.464 |
| 20Newsgroups‐4 | 0.581 | 0.496 | – | 0.780 | **0.940** |
| Reuters‐RCV1‐4 | 0.471 | – | – | 0.703 | **0.800** |

### Table 7.

Rand Index (RI) comparisons for multiclass problems.

Bold fonts indicate the maximum Rand indices for a given data set.

### 5.3. Results of experiment2

In experiment2, we focus on text data clustering. Table 8 shows that the proposed algorithms for ITC (sdCL and sdCLS) outperform spCL in purity, RI, and NMI. Considering that spCL is an algorithm for spherical clustering, which was proposed to analyze high‐dimensional data such as text documents, the criterion of information‐theoretic clustering is worth using for this purpose.
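NMI (normalized mutual information), used alongside purity and RI below, can be sketched as follows; the arithmetic-mean normalization of the two entropies used here is one common convention and may differ in detail from the definition used in the experiments:

```python
import math
from collections import Counter

def nmi(clusters, classes):
    """Mutual information between the two labelings, normalized by the
    mean of their entropies (one common convention)."""
    n = len(clusters)
    pz = Counter(clusters)            # cluster marginals
    py = Counter(classes)             # class marginals
    pzy = Counter(zip(clusters, classes))   # joint counts
    mi = sum(c / n * math.log((c / n) / (pz[z] / n * py[y] / n))
             for (z, y), c in pzy.items())
    hz = -sum(c / n * math.log(c / n) for c in pz.values())
    hy = -sum(c / n * math.log(c / n) for c in py.values())
    return mi / ((hz + hy) / 2) if hz + hy > 0 else 0.0

print(nmi([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 for a perfect match
print(nmi([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0 for independent labelings
```

Unlike purity, NMI drops to zero for clusterings unrelated to the classes, which is why it is the most telling column in the dCL comparison below.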

| Data | sdCL Purity | sdCL RI | sdCL NMI | sdCLS Purity | sdCLS RI | sdCLS NMI | spCL Purity | spCL RI | spCL NMI |
|---|---|---|---|---|---|---|---|---|---|
| Cora‐DS | 0.506 | 0.823 | 0.322 | 0.514 | 0.826 | 0.327 | 0.361 | 0.796 | 0.162 |
| Cora‐HA | 0.614 | 0.767 | 0.363 | 0.618 | 0.767 | 0.360 | 0.518 | 0.728 | 0.251 |
| Cora‐ML | 0.597 | 0.802 | 0.384 | 0.602 | 0.801 | 0.385 | 0.431 | 0.757 | 0.181 |
| Cora‐OS | 0.669 | 0.735 | 0.313 | 0.685 | 0.741 | 0.325 | 0.582 | 0.673 | 0.187 |
| Cora‐PL | 0.473 | 0.819 | 0.275 | 0.484 | 0.820 | 0.274 | 0.414 | 0.802 | 0.175 |
| WebKB6 | 0.575 | 0.712 | 0.272 | 0.575 | 0.712 | 0.272 | 0.519 | 0.702 | 0.181 |
| 20News‐20 | 0.690 | 0.954 | 0.653 | 0.697 | 0.955 | 0.656 | 0.359 | 0.905 | 0.324 |
| RCV1‐53 | 0.731 | 0.910 | 0.586 | 0.738 | 0.906 | 0.580 | 0.568 | 0.895 | 0.384 |

### Table 8.

Comparison for text data sets.

Table 8 also shows that sdCLS (sdCL with the splitting rule, see Sections 2.2.2 and 3.2.1) is slightly better than sdCL in some cases. As Figure 9 (left, right) shows, the values of JS divergence for sdCLS are smaller (“better” in ITC) than those for sdCL, and sdCLS outperforms sdCL in purity on average. Nevertheless, the advantage of sdCLS over sdCL is not so obvious in this experiment.

Figure 9. Purity versus JS divergence for 20Newsgroups (left) and Reuters‐RCV1 (right) data sets.
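The JS-divergence values plotted in Figure 9 measure, within each cluster, the average KL divergence from member distributions to the cluster's mean distribution. A sketch assuming uniform weights over cluster members (the chapter's Eq. (12) may weight members differently):

```python
import numpy as np

def kl(p, q):
    """KL divergence; terms with p_m = 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_objective(P, labels):
    """Average over all points of the KL divergence from each
    distribution to its cluster mean (generalized JS per cluster)."""
    P = np.asarray(P, dtype=float)
    total = 0.0
    for k in set(labels):
        members = P[[i for i, l in enumerate(labels) if l == k]]
        mean = members.mean(axis=0)   # members' support lies within the mean's
        total += sum(kl(p, mean) for p in members)
    return total / len(P)

P = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.0, 0.1, 0.9]]
print(js_objective(P, [0, 0, 1]))    # small: clusters are internally coherent
```

Because each cluster mean contains every word that its members contain, this quantity is always finite, in contrast to the raw KL divergence discussed in Section 7.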

Note that the clustering result by dCL (competitive learning using KL divergence) is shown below. Since the values of NMI clearly illustrate that dCL converged to solutions unrelated to the classes, the use of skew divergence is an effective technique to overcome this problem.

| Data | Purity | RI | NMI |
|---|---|---|---|
| WebKB6 | 0.363 | 0.667 | 0.00629 |
| 20News‐20 | 0.071 | 0.905 | 0.00596 |

In the following, we examine the inside of the clustering results obtained by ITC to clarify whether ITC helps us to find meaningful clusters and candidate classes for classification. Tables 9 and 10 show frequent words in the classes and in the clusters obtained by sdCL for the 20Newsgroups data set, respectively. The order of clusters is arranged so that clusters correspond to classes. Table 11 is the cross table between clusters and classes. As shown in Table 10, the frequent words in some clusters suggest characteristics that distinguish them from the others. For example, the words in cluster 2: “image graphics jpeg,” cluster 6: “sale offer shipping,” and cluster 11: “key encryption chip” call to mind the classes (comp.graphics), (misc.forsale), and (sci.crypt), respectively. We can also infer the characteristics of clusters from the words in the 7, 8, 9, 10, 13, 14, and 16th clusters. These clusters have documents of one dominant class and can be regarded as candidate classes. However, there are some exceptions. The 1st and 15th clusters share the word “god,” while the classes (alt.atheism), (soc.religion.christian), and (talk.religion.misc) also share the word “god.” Cluster 1 and the class (alt.atheism) have the common words “religion evidence,” and cluster 1 has many documents of the dominant class (alt.atheism). Cluster 15 and the class (soc.religion.christian) have the common words “jesus bible christ church,” and cluster 15 has many documents of the dominant class (soc.religion.christian). On the other hand, there is no cluster in which documents of (talk.religion.misc) are dominant; the documents of (talk.religion.misc) are mostly shared by clusters 1 and 15. Though there is a mismatch between clusters and classes, the clustering result is still acceptable, because the words in the class (talk.religion.misc) resemble those in the class (soc.religion.christian).
We can also find that cluster 4 has many documents of the two classes (comp.sys.ibm.pc.hardware) and (comp.sys.mac.hardware). From Table 9, those classes have similar words except for “mac” and “apple.” Thus, ITC failed to detect the difference between the classes but found a cluster with their common features. In this sense, the clustering result is meaningful and useful. Another example is that documents in the class (talk.politics.mideast) are divided into clusters 17 and 18. This means that ITC found two topics in one class, and the frequent words in those clusters seem reasonable (see the 17th and 18th clusters in Table 10). The characteristic of cluster 20, which has the words “mail list address email send,” is different from all classes as well as from the other clusters, but cluster 20 has some documents in all classes (see Table 11). This cluster may reveal that all newsgroups include documents with such words. In summary, ITC helps us to find meaningful clusters, even when the clusters obtained by ITC are sometimes not the same as the expected classes. The detailed analysis of the clustering results above supports the effectiveness and usefulness of ITC.
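Frequent-word tables such as Tables 9 and 10 can be produced by summing term counts within each class or cluster and listing the top-ranked vocabulary entries. A minimal sketch with an illustrative toy vocabulary (names and data are not from the chapter):

```python
import numpy as np

def top_words(counts, labels, vocab, k=3):
    """For each group label, sum the term counts of its documents and
    return the k most frequent vocabulary words."""
    counts = np.asarray(counts)
    result = {}
    for g in sorted(set(labels)):
        rows = counts[[i for i, l in enumerate(labels) if l == g]]
        freq = rows.sum(axis=0)
        order = np.argsort(freq)[::-1][:k]   # indices of the k largest counts
        result[g] = [vocab[j] for j in order]
    return result

vocab = ["god", "game", "team", "space"]
counts = [[9, 0, 1, 0], [7, 1, 0, 2], [0, 5, 4, 0], [1, 6, 5, 0]]
print(top_words(counts, [0, 0, 1, 1], vocab, k=2))
# {0: ['god', 'space'], 1: ['game', 'team']}
```

Applied once with the true class labels and once with the cluster assignments, the same routine yields the two tables being compared.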

| Class | Frequent words |
|---|---|
| alt.atheism | god writes people article atheism religion time evidence |
| comp.graphics | image graphics jpeg file bit images software data files ftp |
| comp.os.ms‐windows.misc | windows file dos writes article files ms os problem win |
| comp.sys.ibm.pc.hardware | drive scsi card mb ide system controller bus pc writes |
| comp.sys.mac.hardware | mac apple writes drive system problem article mb monitor mhz |
| comp.windows.x | window file server windows program dos motif sun display widget |
| misc.forsale | sale shipping offer mail price drive condition dos st email |
| rec.autos | car writes article cars good engine apr ve people time |
| rec.motorcycles | writes bike article dod ca apr ve ride good time |
| rec.sport.baseball | writes year article game team baseball good games time hit |
| rec.sport.hockey | game team hockey writes play ca games article season year |
| sci.crypt | key encryption government chip writes clipper people article keys system |
| sci.electronics | writes article power good ve work ground time circuit ca |
| sci.med | writes article people medical health disease time cancer patients |
| sci.space | space writes nasa article earth launch orbit shuttle time system |
| soc.religion.christian | god people jesus church christ writes christian christians bible time |
| talk.politics.guns | gun people writes article guns fbi government fire time weapons |
| talk.politics.mideast | people israel armenian writes turkish jews article armenians israeli jewish |
| talk.politics.misc | people writes article president government mr stephanopoulos make time |
| talk.religion.misc | god writes people jesus article bible christian good christ life |

### Table 9.

Frequent words in classes of 20Newsgroups data set.

| Cluster | Frequent words |
|---|---|
| 1 | writes god article people religion evidence science moral objective time |
| 2 | image graphics jpeg file images ftp data bit files format |
| 3 | windows dos file system writes os files article program software |
| 4 | drive scsi mb card writes system mac article bit apple |
| 5 | window file server program motif sun widget set display output |
| 6 | sale offer shipping mail price condition st good email interested |
| 7 | car writes article cars engine good ve apr time oil |
| 8 | writes article bike dod apr ca ride ve good time |
| 9 | writes year article game team baseball good games time hit |
| 10 | game team hockey writes play ca games article year season |
| 11 | key encryption government chip writes clipper people article keys system |
| 12 | db writes article power good ground ca time ve circuit |
| 13 | writes article medical people disease health cancer patients time msg |
| 14 | space writes nasa article earth launch orbit time shuttle system |
| 15 | god jesus people bible writes christ christian church christians article |
| 16 | people writes gun article government fbi fire guns koresh batf |
| 17 | israel israeli writes jews article people arab jewish arabs state |
| 18 | people armenian turkish armenians president armenia war mr turks turkey |
| 19 | writes people article government cramer apr state make health optilink |
| 20 | mail list address email send information ca internet article writes |

### Table 10.

Frequent words in clusters of 20Newsgroups data set.

| Cluster \ Class | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 625 | 4 | 9 | 2 | 2 | 4 | 2 | 8 | 0 | 3 | 2 | 0 | 5 | 49 | 12 | 54 | 1 | 5 | 13 | 163 |
| 2 | 1 | 632 | 44 | 19 | 17 | 101 | 10 | 7 | 4 | 8 | 0 | 14 | 32 | 12 | 18 | 2 | 0 | 0 | 1 | 1 |
| 3 | 0 | 107 | 707 | 189 | 58 | 60 | 27 | 5 | 0 | 0 | 1 | 12 | 40 | 2 | 3 | 3 | 0 | 0 | 1 | 1 |
| 4 | 0 | 74 | 64 | 630 | 711 | 18 | 104 | 3 | 3 | 1 | 0 | 2 | 93 | 6 | 3 | 2 | 0 | 1 | 0 | 1 |
| 5 | 1 | 43 | 53 | 11 | 15 | 712 | 0 | 1 | 0 | 1 | 1 | 9 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 3 |
| 6 | 0 | 9 | 5 | 18 | 25 | 2 | 648 | 17 | 15 | 3 | 0 | 0 | 16 | 1 | 3 | 1 | 1 | 0 | 1 | 0 |
| 7 | 0 | 1 | 2 | 7 | 8 | 0 | 35 | 784 | 65 | 1 | 1 | 0 | 25 | 2 | 7 | 2 | 2 | 2 | 1 | 1 |
| 8 | 1 | 3 | 0 | 0 | 4 | 6 | 7 | 36 | 867 | 2 | 0 | 2 | 10 | 6 | 6 | 0 | 1 | 1 | 0 | 2 |
| 9 | 4 | 0 | 4 | 4 | 0 | 1 | 4 | 3 | 4 | 886 | 18 | 0 | 2 | 3 | 2 | 3 | 0 | 3 | 6 | 3 |
| 10 | 0 | 0 | 1 | 1 | 0 | 1 | 9 | 3 | 2 | 37 | 943 | 0 | 1 | 1 | 1 | 0 | 3 | 2 | 3 | 0 |
| 11 | 2 | 15 | 2 | 4 | 6 | 5 | 5 | 1 | 0 | 1 | 2 | 881 | 36 | 5 | 6 | 0 | 20 | 9 | 15 | 3 |
| 12 | 1 | 11 | 11 | 52 | 33 | 2 | 32 | 14 | 4 | 3 | 0 | 8 | 603 | 28 | 10 | 2 | 1 | 0 | 0 | 1 |
| 13 | 1 | 0 | 1 | 0 | 5 | 1 | 3 | 1 | 1 | 1 | 2 | 1 | 5 | 771 | 5 | 3 | 0 | 0 | 2 | 2 |
| 14 | 3 | 9 | 5 | 2 | 3 | 5 | 4 | 4 | 4 | 1 | 1 | 1 | 29 | 17 | 843 | 2 | 4 | 0 | 5 | 5 |
| 15 | 94 | 1 | 2 | 0 | 0 | 0 | 2 | 1 | 1 | 3 | 0 | 0 | 0 | 10 | 2 | 863 | 5 | 4 | 6 | 299 |
| 16 | 15 | 2 | 0 | 5 | 8 | 2 | 6 | 36 | 11 | 0 | 1 | 18 | 6 | 8 | 4 | 7 | 805 | 2 | 189 | 93 |
| 17 | 9 | 2 | 1 | 0 | 1 | 1 | 0 | 7 | 0 | 0 | 0 | 0 | 2 | 1 | 8 | 6 | 2 | 547 | 14 | 10 |
| 18 | 32 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 5 | 3 | 4 | 12 | 12 | 333 | 38 | 12 |
| 19 | 3 | 7 | 8 | 2 | 16 | 3 | 12 | 20 | 4 | 4 | 7 | 13 | 6 | 13 | 11 | 13 | 33 | 12 | 470 | 16 |
| 20 | 6 | 50 | 44 | 33 | 46 | 58 | 53 | 35 | 8 | 34 | 16 | 27 | 66 | 48 | 36 | 22 | 18 | 19 | 8 | 11 |

### Table 11.

The number of documents about each class in each cluster.
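A cross table such as Table 11 is a cluster-by-class contingency matrix; a minimal sketch:

```python
import numpy as np

def cross_table(clusters, classes):
    """Rows: clusters, columns: classes; entry (k, c) counts the
    documents in cluster k whose true class is c."""
    ks = sorted(set(clusters))
    cs = sorted(set(classes))
    table = np.zeros((len(ks), len(cs)), dtype=int)
    for z, y in zip(clusters, classes):
        table[ks.index(z), cs.index(y)] += 1
    return table

print(cross_table([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
# [[1 1]
#  [1 2]]
```

Row sums give cluster sizes and column sums give class sizes, so purity and NMI can also be read off this matrix directly.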

## 6. Conclusion

In this chapter, we introduced information‐theoretic clustering (ITC) from both the theoretical and experimental sides. Theoretically, we presented the criterion, generative models, and novel algorithms for ITC. Experimentally, we showed the effectiveness and usefulness of ITC for text analysis as an important example.

## 7. A difficulty with KL divergence

Let P and Q be distributions whose mth probabilities p_m and q_m take the mth elements of vectors p and q, respectively. The Kullback‐Leibler (KL) divergence to Q from P is defined as

$$D_{KL}(P \| Q) = \sum_m p_m \log \frac{p_m}{q_m}. \tag{39}$$

In this definition, it is assumed that the support set of P is a subset of the support set of Q (if q_m is zero, p_m must be zero). For a given cluster Ck, there is no problem in calculating the JS divergence of cluster Ck by Eq. (12), because the support set of any distribution Pi (Ck) is a subset of the support set of the mean distribution P¯k. However, it is not guaranteed that the KL divergence from Pi (Ck) to Qt (t ≠ k) (a weight distribution of another cluster Ct) is finite. This causes a serious problem in finding a similar weight distribution Q for an input distribution P: the lack of even one word (feature) in a distribution Q is enough to make it infinitely dissimilar. Therefore, it is difficult to use a k‐means type algorithm, which updates weights or clusters by batch processing, in ITC.
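The difficulty above, and the skew-divergence remedy mentioned in Section 5.3, can be illustrated numerically: mixing a little of P into Q before taking the KL divergence keeps the value finite even when Q lacks a word that P contains (the value of α below is illustrative, not the one used in the experiments):

```python
import numpy as np

def kl(p, q):
    """KL divergence; infinite when q lacks support that p has."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def skew(p, q, alpha=0.99):
    """Skew divergence: KL from p to a mixture of q and p; always
    finite because the mixture covers p's support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return kl(p, alpha * q + (1 - alpha) * p)

p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]       # q lacks a word that p contains
print(kl(p, q))           # inf: the raw KL divergence blows up
print(skew(p, q))         # finite
```

This is why competitive learning with the raw KL divergence (dCL) can fail to find distributions similar to an input, whereas the skew-divergence variant (sdCL) behaves stably.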

## Acknowledgments

This work was supported by JSPS KAKENHI Grant Number 26330259.

## Notes

• The algorithm was proposed for feature/word clustering and applied to text classification. Since the algorithm uses document class (labeled data), it cannot be applied to general clustering problem.
• http://archive.ics.uci.edu/ml/[Accessed: 2016-10-25].
• http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/[Accessed: 2016-10-25].
• http://www.kernel-machines.org/data [Accessed: 2016-10-25].
• k-means algorithm is known as algorithm for clustering based on the sum-of-squared-error criterion (see Section 2.2.1). Therefore we added word "type" for information-theoretic clustering.
