
Information‐Theoretic Clustering and Algorithms

By Toshio Uchiyama

Submitted: May 2nd 2016. Reviewed: October 26th 2016. Published: April 26th 2017.

DOI: 10.5772/66588


Abstract

Clustering is the task of partitioning objects into clusters on the basis of certain criteria so that objects in the same cluster are similar. Many clustering methods have been proposed over the past several decades. Since clustering results depend on criteria and algorithms, selecting them appropriately is an essential problem. Recently, large sets of user behavior logs and text documents have become common. These are often represented as high-dimensional and sparse vectors. This chapter introduces information-theoretic clustering (ITC), which is appropriate and useful for analyzing such high-dimensional data, from both the theoretical and the experimental side. Theoretically, the criterion, generative models, and novel algorithms are shown. Experimentally, the chapter shows the effectiveness and usefulness of ITC for text analysis as an important example.

Keywords

  • information‐theoretic clustering
  • competitive learning
  • Kullback‐Leibler divergence
  • Jensen‐Shannon divergence
  • clustering algorithm
  • text analysis

1. Introduction

Clustering is the task of partitioning objects into clusters on the basis of certain criteria so that objects in the same cluster are similar. It is a fundamental procedure to analyze data [1, 2].

Clustering is unsupervised and differs from supervised classification. In supervised classification, we have a set of labeled data (belonging to predefined classes), train a classifier using the labeled data (the training set), and use the classifier to judge which class a new object belongs to. In clustering, we find meaningful clusters without using any labeled data and group a given collection of unlabeled data into them. Clustering can also help us to find meaningful classes (labels) for supervised classification. Since preparing a training set becomes more difficult for larger data sets, unsupervised analysis such as clustering has recently become more important.

For example, the user-item matrix in Table 1 shows which items each user bought. Considering the data as a set of feature vectors for users, we can discover many types of user behavior by clustering. It is also possible to analyze the data as a set of feature vectors for items. From the word-document matrix in Table 2, both document clusters and word clusters could be extracted.

         item1   item2   item3   item4   item5
User1        3       0       0       5       0
User2        0       1       2       0       0
User3        1       0       0       0       2
User4        0       0       1       0       0

Table 1.

Consumption behavior of users.

         Document1   Document2   Document3   Document4   Document5
Word1            3           0           0           5           0
Word2            0           1           2           0           0
Word3            1           0           0           0           2
Word4            0           0           1           0           0

Table 2.

Word frequencies in documents (bag‐of‐words feature representation).

Many clustering methods have been proposed over the past several decades. These include the k-means algorithm [3], competitive learning [4], spherical clustering [5], spectral clustering [6], and maximum margin clustering [7]. Since clustering results depend on criteria and algorithms, selecting them appropriately is an essential problem. Large sets of user behavior logs and text documents (as shown in Tables 1 and 2) have recently become common. These are often represented as high-dimensional and sparse vectors. This chapter introduces information-theoretic clustering [8] and algorithms that are appropriate and useful for analyzing such high-dimensional data.

Information-theoretic clustering (ITC) uses the Kullback-Leibler divergence and the Jensen-Shannon divergence to define its criterion, whereas the k-means algorithm uses the sum of squared error as its criterion. This chapter explains ITC by contrasting these two clustering techniques (criteria and algorithms), because there are a number of interesting similarities between them. There is a difficulty in designing algorithms for ITC; we explain it in detail and propose novel algorithms to overcome it.

Experimental results on text data sets are presented to show the effectiveness and usefulness of ITC and the novel algorithms for it. In the experiments, maximum margin clustering and spherical clustering are used for comparison. We also provide evidence supporting the effectiveness of ITC through a detailed analysis of the clustering results.

2. The sum‐of‐squared‐error criterion and algorithms

Given a set of $M$-dimensional input vectors $X = \{x_i \mid x_i \in \mathbb{R}^M, i = 1, \ldots, N\}$, where $N$ is the number of vectors, clustering is the task of assigning each input vector $x_i$ a cluster label $k$ ($k = 1, \ldots, K$) to partition the vectors into $K$ clusters $C = \{C_1, \ldots, C_K\}$. The sum-of-squared-error criterion [9] is a simple and widely used criterion for clustering.

Let $\mu_k$ be the mean of the input vectors $x_i$ that belong to cluster $C_k$ (see Figure 1). The error in $C_k$ is then the sum of the squared lengths of the difference ("error") vectors, $\sum_{x_i \in C_k}\|x_i - \mu_k\|^2$, and the sum-of-squared-error criterion over all clusters (the within-cluster sum of squares) is defined by

$$J_W = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2. \qquad (1)$$

Figure 1.

Input vectors and the mean vector in $C_k$.

$J_W$ is the objective function (criterion) to be minimized in clustering based on this criterion.

We also define the between-cluster sum of squares $J_B$ and the total sum of squares $J_T$ as

$$J_B = \sum_{k=1}^{K} N_k \|\mu_k - \mu\|^2, \qquad J_T = \sum_{i=1}^{N} \|x_i - \mu\|^2, \qquad (2)$$

respectively, where $N_k$ is the number of input vectors $x_i$ in $C_k$ (i.e., $N = \sum_{k=1}^{K} N_k$) and

$$\mu_k = \frac{1}{N_k} \sum_{x_i \in C_k} x_i, \qquad \mu = \frac{1}{N} \sum_{i=1}^{N} x_i. \qquad (3)$$

It follows from these definitions that the total sum of squares is the sum of the within‐cluster sum of squares and the between‐cluster sum of squares:

$$J_T = J_W + J_B. \qquad (4)$$

Since the mean of all input vectors $\mu$ is derived from $X = \{x_1, \ldots, x_N\}$ [see Eq. (3)], $J_T$ does not depend on the clusters $C$ [see Eq. (2)] and is constant for the given input vectors $X$. Therefore, minimizing $J_W$ is equivalent to maximizing $J_B$. In this sense, clustering based on minimizing the criterion $J_W$ works to find clusters that are well separated from each other.
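The decomposition in Eq. (4) is easy to check numerically. The following is a minimal sketch (assuming NumPy and synthetic data; function and variable names are illustrative, not code from the chapter) that computes $J_W$, $J_B$, and $J_T$ for a random labeling and confirms Eq. (4).

```python
import numpy as np

def sse_terms(X, labels, K):
    """Within-cluster (J_W), between-cluster (J_B), and total (J_T) sums of squares."""
    mu = X.mean(axis=0)                                 # overall mean, Eq. (3)
    J_W = J_B = 0.0
    for k in range(K):
        Xk = X[labels == k]
        if len(Xk) == 0:
            continue
        mu_k = Xk.mean(axis=0)                          # cluster mean, Eq. (3)
        J_W += ((Xk - mu_k) ** 2).sum()                 # Eq. (1)
        J_B += len(Xk) * ((mu_k - mu) ** 2).sum()       # Eq. (2)
    J_T = ((X - mu) ** 2).sum()                         # Eq. (2)
    return J_W, J_B, J_T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
labels = rng.integers(0, 3, size=100)                   # a random labeling into K = 3 clusters
J_W, J_B, J_T = sse_terms(X, labels, 3)
print(np.isclose(J_W + J_B, J_T))                       # True, Eq. (4)
```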

2.1. Generative model

Behind clustering based on the objective function (criterion) $J_W$, there is an assumption that the input vectors follow Gaussian distributions [10].

Suppose that there are clusters $C_k$ ($k = 1, \ldots, K$), each of which generates input vectors according to the conditional probability density function

$$p(x_i \mid x_i \in C_k) = \frac{1}{(2\pi\sigma_k^2)^{M/2}} \exp\!\left(-\frac{\|x_i - \mu_k\|^2}{2\sigma_k^2}\right), \qquad (5)$$

where $\sigma_k$ is the standard deviation of cluster $C_k$ and $M$ is the dimension of $x_i$. In the following, we assume that $\sigma_k$ takes a constant value $\sigma$ for all clusters $C_k$ ($k = 1, \ldots, K$). Assuming that each vector is generated independently, the joint probability density function of the input vectors $X$ becomes

$$p(X \mid C) = \prod_{k=1}^{K} \prod_{x_i \in C_k} \frac{1}{(2\pi\sigma^2)^{M/2}} \exp\!\left(-\frac{\|x_i - \mu_k\|^2}{2\sigma^2}\right), \qquad (6)$$

where $C$ indicates the cluster information that specifies which cluster $C_k$ each input vector $x_i$ belongs to. Taking the logarithm of Eq. (6) yields

$$\ln p(X \mid C) = -\frac{NM}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{x_i \in C_k} \|x_i - \mu_k\|^2. \qquad (7)$$

Since $\sigma$ is constant, the maximization of Eq. (7) is equivalent to the minimization of

$$\sum_{k=1}^{K}\sum_{x_i \in C_k} \|x_i - \mu_k\|^2, \qquad (8)$$

which is exactly the objective function (criterion) $J_W$. Therefore, under the Gaussian assumption on the input vectors, clustering based on Eq. (8) finds the most probable solution $C$.

2.2. Algorithms

2.2.1. k‐means algorithm

k-means [3, 11] is a well-known algorithm for clustering based on the sum-of-squared-error criterion. Its main idea is as follows. In the objective function $J_W$ of Eq. (1), the error for a vector $x$ is $\|x - \mu_k\|^2$, where $\mu_k$ is the mean of the cluster $C_k$ to which $x$ belongs. If $\|x - \mu_t\|^2 < \|x - \mu_k\|^2$, changing the cluster of $x$ from $C_k$ to $C_t$ reduces the objective function $J_W$.

To implement this idea, we introduce weight vectors $w_k$ ($k = 1, \ldots, K$), denoted collectively by $W$, that represent the clusters $C_k$. The weight vector $w_k$ plays the role of both the mean vector $\mu_k$ and the prototype vector of cluster $C_k$. As illustrated in Figure 2, k-means alternates between two steps: "(a) Update weights" (set each weight vector $w_k$ to the cluster mean $\mu_k$) and "(b) Update clusters" (assign each input vector $x_i$ to the cluster $C_k$ whose weight vector $w_k$ is nearest). Note that Figure 2b is a Voronoi tessellation determined by the weight vectors $w_k$, which are usually called prototype vectors in this context.

Figure 2.

Two steps in k‐means algorithm. (a) Update weights. (b) Update clusters.

Figure 3a is a flow chart of the k-means algorithm, to which initialization and termination steps have been added. Clustering is, in fact, closely related to vector quantization. Vector quantization maps input vectors to a codebook, which is a set of weight vectors (prototype vectors). Using the quantization error $E_Q$:

$$E_Q = \sum_{i=1}^{N} \min_k \|x_i - w_k\|^2, \qquad (9)$$

Figure 3.

Flow charts of algorithms based on sum‐of‐squared‐error criterion. (a) k‐means1, (b) k‐means2, and (c) competitive learning.

the clusters $C$ determined by a locally optimal solution $W$ of vector quantization are a locally optimal solution of the clustering problem [12]. In this sense, clustering can be replaced by vector quantization and vice versa. A flow chart for vector quantization can be drawn as Figure 3b, and this chart (b) also describes the k-means algorithm. Furthermore, the LBG algorithm [13], which is well known in vector quantization, is based on the approach of Lloyd [3] (one of the original papers on k-means). These facts show the close relationship between clustering and vector quantization.

Initialization is important because the k-means algorithm converges to a locally optimal solution that depends on the initial condition (a set of weights or clusters). If we initialize the weights $W$ by randomly selecting them from the input vectors, the algorithm is likely to converge to a very poor local optimum. Random labeling, which randomly assigns cluster labels $C$ to the input vectors, tends to lead to better solutions than random selection of weights. Random-labeling initialization can also be used for charts (b) and (c) in Figure 3 by replacing the "Initialize weights" step with "Initialize clusters" and "Update weights" steps. For directly initializing weights, the splitting algorithm [13] and k-means++ [14] are known.
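As a concrete illustration, here is a minimal k-means sketch with random-labeling initialization (assuming NumPy; the function and variable names are illustrative, not code from the chapter).

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimize the sum-of-squared-error criterion J_W, Eq. (1), by alternating
    the 'update weights' and 'update clusters' steps of Figure 2."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))            # random labeling (initialization)
    for _ in range(n_iter):
        # (a) Update weights: each weight vector w_k becomes the mean of its cluster
        W = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                      else X[rng.integers(len(X))] for k in range(K)])
        # (b) Update clusters: assign each x_i to the nearest weight vector
        d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):           # converged: no assignment changed
            break
        labels = new_labels
    return labels, W
```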

2.2.2. Competitive learning

Competitive learning [4, 11] is a learning method for vector quantization that is also used for clustering. While the k-means algorithm updates all weights $W$ in a batch, competitive learning updates one weight $w$ at a time to reduce a part of the quantization error $E_Q$ (see Figure 3c), as follows:

  1. Select one input vector $x$ randomly from $X$.

  2. Decide a winner $w_c$ from $W$ by

$$c = \arg\min_k \|x - w_k\|^2 \quad \text{(if there are several candidates, choose the smallest } k\text{)}. \qquad (10)$$

  3. Update the winner's weight $w_c$ as

$$w_c \leftarrow (1 - \gamma)\, w_c + \gamma\, x, \qquad (11)$$

where $\gamma$ is a given learning rate (e.g., 0.01–0.1).

Although the winner-take-all update in Step 3 (Figure 4), which reduces the partial error $\|x - w_c\|^2$ in the steepest direction, does not always reduce the total quantization error $E_Q$, repeating the update reduces $E_Q$ in the sense of the stochastic gradient descent method [15, 16]. As a termination condition, a maximum number of iterations $N_r$ (the maximum number of repetitions) can be used. After termination, a step that decides the clusters $C$, as in Figure 2b, is required for clustering; a sketch of the whole procedure is given after Figure 4.

    Figure 4.

    Update of the winner's weight in competitive learning.
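The three steps above, together with the final cluster-decision step, can be sketched as follows (a minimal sketch assuming NumPy; names and the initialization are illustrative).

```python
import numpy as np

def competitive_learning(X, K, gamma=0.05, n_updates=100_000, seed=0):
    """Online vector quantization: repeat Steps 1-3 (select, decide winner, update)."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=K, replace=False)].copy()   # initial weights from inputs
    for _ in range(n_updates):                                # termination after N_r updates
        x = X[rng.integers(len(X))]                   # Step 1: random input vector
        c = np.argmin(((x - W) ** 2).sum(axis=1))     # Step 2: winner, Eq. (10)
        W[c] = (1 - gamma) * W[c] + gamma * x         # Step 3: winner update, Eq. (11)
    labels = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return labels, W                                  # clusters decided as in Figure 2b
```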

Perhaps contrary to expectation, plain competitive learning outperforms k-means in most cases. Furthermore, information obtained during the learning process allows us to improve its performance. The splitting rule [12] uses the number of times each weight $w$ wins to estimate the density of input vectors around it. As Figure 5a and b shows, a higher density of input vectors around a weight vector makes it win more frequently, so $w_a$ wins more often than $w_b$.

    Figure 5.

    Density of input vectors around a weight vector.

The splitting rule in competitive learning [12] aims to overcome the discrepancy between the distribution of the input vectors $X$ and that of the weight vectors $W$. This discrepancy lets a few weight vectors $w$ monopolize $X$ and leads to a solution of very poor quality, yet it is impossible to know the distribution of the input vectors beforehand. Accordingly, the splitting rule distributes weight vectors $w$ during the learning process as follows (see the sketch after this list):

1. Start with one weight vector $w_1$ and a counter $\tau_1$. $\tau_1$ denotes how many times $w_1$ has won and is initialized to 0.

2. Select one input vector $x$, decide the winner $w_c$, and update the winner's weight $w_c$.

3. Add 1 to $\tau_c$. If $\tau_c = \theta$ and the current number of weights is less than $K$, generate a new weight vector $w$ that is a copy of the winner $w_c$ and reset the counters $\tau$ of both to 0, where $\theta$ is the splitting threshold.

4. Repeat Steps 2 and 3 until the termination condition is satisfied.
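A minimal sketch of competitive learning with the splitting rule (assuming NumPy; the threshold handling follows Steps 1–4 above, while initialization and termination details are illustrative):

```python
import numpy as np

def competitive_learning_split(X, K, gamma=0.05, theta=1000, n_updates=100_000, seed=0):
    """Competitive learning that grows from one weight vector up to K by the splitting rule."""
    rng = np.random.default_rng(seed)
    W = [X[rng.integers(len(X))].copy()]   # Step 1: a single weight vector w_1
    tau = [0]                              # win counters
    for _ in range(n_updates):
        x = X[rng.integers(len(X))]        # Step 2: select input, decide winner, update
        c = int(np.argmin([((x - w) ** 2).sum() for w in W]))
        W[c] = (1 - gamma) * W[c] + gamma * x
        tau[c] += 1                        # Step 3: count wins; split when threshold is reached
        if tau[c] == theta and len(W) < K:
            W.append(W[c].copy())          # the new weight is a copy of the winner
            tau[c] = 0
            tau.append(0)
    W = np.array(W)
    labels = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return labels, W
```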

    3. Information‐theoretic clustering and algorithms

Information-theoretic clustering (ITC) [8] is closely related to work on distributional clustering [17–19] and uses the Kullback-Leibler divergence and the Jensen-Shannon divergence to define its criterion. Although it differs in the difficulty of its algorithms and in its effectiveness for high-dimensional count data (e.g., text data), its definition and properties are similar to those of the sum-of-squared-error criterion. The main contributions of this chapter are to present a technique to overcome that difficulty and to demonstrate the effectiveness of ITC.

Let $X = \{x_i \mid x_i \in \mathbb{R}_+^M, i = 1, \ldots, N\}$ be a set of $M$-dimensional input vectors ($N$ denotes the number of input vectors), where the elements of the vectors $x$ are nonnegative real numbers. We define the $l_1$-norm of an input vector $t_i = \sum_m |x_m^i|$, the normalized input vector $p_i = x_i / t_i$, and an input probability distribution $P_i$ whose $m$th probability is the $m$th element of $p_i$ ($= p_m^i$). Let $\mathcal{P} = \{P_1, \ldots, P_N\}$ be the set of input distributions (input data).

Suppose that we assign each distribution $P_i$ a cluster label $k$ ($k = 1, \ldots, K$) to partition the distributions into $K$ clusters $C = \{C_1, \ldots, C_K\}$.

Let $\bar{P}_k$ be the mean distribution of the input data $P_i$ that belong to cluster $C_k$ (see Figure 6). Then, the generalized Jensen-Shannon (JS) divergence to be minimized in $C_k$ is defined by

$$D_{JS}(\{P_i \mid P_i \in C_k\}) = \sum_{P_i \in C_k} \pi_i D_{KL}(P_i \,\|\, \bar{P}_k), \qquad \bar{P}_k = \sum_{P_i \in C_k} \pi_i P_i, \qquad (12)$$

    Figure 6.

Input distributions and the mean distribution in $C_k$.

where $N_k$ is the number of distributions $P_i$ in cluster $C_k$ (i.e., $N = \sum_{k=1}^{K} N_k$), $D_{KL}(P_i \,\|\, \bar{P}_k)$ is the Kullback-Leibler (KL) divergence from $P_i$ to the mean distribution $\bar{P}_k$, and $\pi_i$ is the weight of $P_i$ ($\sum_{P_i \in C_k} \pi_i = 1$). Here $\pi_i = 1/N_k$. We then define the within-cluster JS divergence $JS_W$, which considers all clusters $C_k$ ($k = 1, \ldots, K$), as

$$JS_W = \sum_{k=1}^{K} \frac{N_k}{N}\, D_{JS}(\{P_i \mid P_i \in C_k\}) \qquad (13)$$
$$= \frac{1}{N} \sum_{k=1}^{K} \sum_{P_i \in C_k} D_{KL}(P_i \,\|\, \bar{P}_k) \qquad (14)$$
$$= \frac{1}{N} \sum_{k=1}^{K} \sum_{P_i \in C_k} \sum_{m=1}^{M} p_m^i \log \frac{p_m^i}{\bar{p}_m^k} = \frac{1}{N} \sum_{k=1}^{K} \sum_{P_i \in C_k} \sum_{m=1}^{M} \left( p_m^i \log p_m^i - p_m^i \log \bar{p}_m^k \right). \qquad (15)$$

The within-cluster JS divergence $JS_W$ is the objective function (criterion) of information-theoretic clustering (ITC) to be minimized [8]. We also define the between-cluster JS divergence $JS_B$ and the total JS divergence $JS_T$ as

$$JS_B = D_{JS}(\{\bar{P}_k \mid k = 1, \ldots, K\}) = \sum_{k=1}^{K} \pi_k D_{KL}(\bar{P}_k \,\|\, \bar{P}), \qquad \pi_k = N_k / N, \qquad (16)$$
$$JS_T = D_{JS}(\{P_i \mid i = 1, \ldots, N\}) = \sum_{i=1}^{N} \pi_i D_{KL}(P_i \,\|\, \bar{P}), \qquad \pi_i = 1/N, \qquad (17)$$

where $\bar{P} = \sum_{i=1}^{N} \pi_i P_i = \frac{1}{N}\sum_{i=1}^{N} P_i$ is the mean distribution of all input data. It follows from these definitions that the total JS divergence is the sum of the within-cluster JS divergence and the between-cluster JS divergence [8]:

$$JS_T = JS_W + JS_B. \qquad (18)$$

Since $JS_T$ is constant for the given input distributions $\mathcal{P}$, minimizing $JS_W$ is equivalent to maximizing $JS_B$. In this sense, clustering based on minimizing the criterion $JS_W$ works to find clusters that are well separated from each other.
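To make the criterion concrete, the following minimal sketch (assuming NumPy; names are illustrative, not code from the chapter) computes the within-cluster JS divergence $JS_W$ of Eqs. (13)–(15) for a given partition of the input distributions.

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q); terms with p_m = 0 contribute 0 (q_m > 0 is assumed wherever p_m > 0)."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def within_cluster_js(P, labels, K):
    """JS_W of Eqs. (13)-(15): average KL divergence from each P_i to its cluster mean."""
    total = 0.0
    for k in range(K):
        Pk = P[labels == k]
        if len(Pk) == 0:
            continue
        mean_k = Pk.mean(axis=0)                  # \bar{P}_k of Eq. (12), with pi_i = 1/N_k
        total += sum(kl(p, mean_k) for p in Pk)   # Eq. (14)
    return total / len(P)
```

For count data such as Table 2, the input distributions would be obtained by row normalization, e.g., `P = X / X.sum(axis=1, keepdims=True)`; within a cluster, the mean distribution is positive wherever any member is, so the KL terms are always finite.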

The definitions and properties of ITC shown so far parallel those of the sum-of-squared-error criterion, which helps us understand ITC.

    3.1. Generative model

Behind information-theoretic clustering (ITC) there is also an assumption, the bag-of-words assumption [20], which disregards the order of words in a document. (Since ITC is not limited to document clustering, "word" is just one example of a feature.) It means that the features in the data are conditionally independent and identically distributed, where the condition is a given probability distribution for an input vector. Based on this assumption, we describe a generative probabilistic model related to ITC and clarify the relationship between the model and the objective function (criterion) $JS_W$.

Let an input vector $x = \{x_1, \ldots, x_m, \ldots, x_M\}$ represent the numbers of observations of each of the $M$ features. Suppose that there are clusters $C_k$ ($k = 1, \ldots, K$), each of which generates the $t_i$ features of a datum (input vector) with probability distribution $\bar{P}_k = \{\bar{p}_1^k, \ldots, \bar{p}_M^k\}$, so that the conditional probability of the set of feature observations in an input vector $x_i$ is given by the multinomial distribution

$$p(x_i \mid x_i \in C_k) = A_i \prod_{m=1}^{M} (\bar{p}_m^k)^{x_m^i}, \qquad A_i = \frac{t_i!}{x_1^i!\, x_2^i! \cdots x_M^i!}, \qquad t_i = \sum_{m=1}^{M} |x_m^i|, \qquad (19)$$

where $A_i$ is the number of possible orderings of the observations. Assuming that each datum is generated independently, the joint probability function for the input vectors $X = \{x_1, \ldots, x_N\}$ becomes

$$p(X \mid C) = \prod_{k=1}^{K} \prod_{x_i \in C_k} A_i \prod_{m=1}^{M} (\bar{p}_m^k)^{x_m^i}, \qquad (20)$$

where $C$ indicates the cluster information that specifies which cluster $C_k$ each input vector $x_i$ belongs to. Taking the logarithm of Eq. (20) yields

$$\ln p(X \mid C) = \sum_{i=1}^{N} \log A_i + \sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{m=1}^{M} x_m^i \log \bar{p}_m^k \qquad (21)$$
$$= \sum_{i=1}^{N} \log A_i + \sum_{k=1}^{K} \sum_{P_i \in C_k} \sum_{m=1}^{M} t_i\, p_m^i \log \bar{p}_m^k. \qquad (22)$$

This is a generative probabilistic model related to ITC. If we assume that $t_i$ takes a constant value $t$ for all input vectors, maximizing the probability (22), like minimizing the objective function $JS_W$ (15), amounts to minimizing

$$-\frac{1}{N} \sum_{k=1}^{K} \sum_{P_i \in C_k} \sum_{m=1}^{M} p_m^i \log \bar{p}_m^k = -\frac{1}{N} \sum_{k=1}^{K} N_k \sum_{m=1}^{M} \bar{p}_m^k \log \bar{p}_m^k, \qquad (23)$$

for the given input distributions $\mathcal{P}$. Here, the relationship $\sum_{P_i \in C_k} p_m^i = N_k \bar{p}_m^k$ is used. Since $t_i$ may not be a constant value $t$, the generative model (22) is not an equivalent model of ITC but a related one. The difference comes from the fact that the model treats each feature observation equally, whereas ITC treats each datum (input vector) equally. Although the additional assumption $t_i = t$ is required, ITC works to find the most probable solution $C$ under the generative probabilistic model. Furthermore, as the right-hand side of Eq. (23) shows, the criterion also amounts to minimizing the entropy within clusters. Entropy (specifically, Shannon entropy) is the expected value of the information contained in each message, which here is an input distribution. The smaller the entropy, the more compactly the model can explain the observations (input distributions). In this sense, the objective function $JS_W$ (15) measures the goodness of the generative model. The relationship (including the difference) between the probabilistic model and the objective function $JS_W$ is meaningful for improving the model and the objective function in the future.
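The relationship used above is easy to check numerically. The following minimal sketch (assuming NumPy; the synthetic counts and the partition are purely illustrative) verifies that $\sum_{P_i \in C_k} p_m^i = N_k \bar{p}_m^k$ and that the two sides of Eq. (23) agree for a random partition.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(1, 5, size=(200, 50)).astype(float)   # synthetic (strictly positive) count data
P = X / X.sum(axis=1, keepdims=True)                   # normalized input distributions
labels = rng.integers(0, 4, size=len(P))               # a random partition into K = 4 clusters

lhs = rhs = 0.0
for k in range(4):
    Pk = P[labels == k]
    mean_k = Pk.mean(axis=0)                           # \bar{P}_k
    assert np.allclose(Pk.sum(axis=0), len(Pk) * mean_k)       # sum_i p_m^i = N_k * pbar_m^k
    lhs += -np.sum(Pk * np.log(mean_k))                # -sum_i sum_m p_m^i log pbar_m^k
    rhs += len(Pk) * -np.sum(mean_k * np.log(mean_k))  # N_k times the entropy of \bar{P}_k
print(np.isclose(lhs / len(P), rhs / len(P)))          # True: both sides of Eq. (23) agree
```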

Choosing an appropriate model for the data is important when analyzing them. For example, large sets of text documents contain many kinds of words and are represented as high-dimensional vectors. Given the extreme diversity of document topics, the feature vectors of documents are distributed almost uniformly in the vector space. As the "curse of dimensionality" [10] tells us, most of the volume of a sphere in high-dimensional space is concentrated near its surface, so a model based on a Gaussian distribution, which concentrates values around the mean, becomes inappropriate. In contrast, ITC, which is based on the multinomial distribution, is a reasonable and useful tool for analyzing such high-dimensional count data, because its generative model is consistent with them.

We introduce weight distributions $Q_k$ ($k = 1, \ldots, K$), denoted collectively by $Q$, that represent the clusters $C_k$ and play the roles of both the mean distribution $\bar{P}_k$ and the prototype distribution of cluster $C_k$, in a manner similar to the sum-of-squared-error (SSE) criterion (see Section 2.2.1). Figure 7 shows the relationships between the parameters of the generative models. Parameters are generated or estimated from other parameters so as to maximize the probability under the generative model. For example, clustering is the task of finding the most probable clusters $C$ for given input vectors $X$ or input distributions $\mathcal{P}$. In Figure 7b, constructing a classifier is the task of finding $Q$ for given $\mathcal{P}$ and $C$ (classes in this context) in the training process; the classifier then estimates $C$ for unknown $\mathcal{P}$ using the trained $Q$. The classifier based on the multinomial distribution is known as the multinomial Naive Bayes classifier [21]. This shows that ITC and the Naive Bayes classifier are closely related [18].

    Figure 7.

Relationships between model parameters. (a) Clustering based on the SSE criterion. (b) Information-theoretic clustering.

    3.2. Algorithms

There is a difficulty (Appendix A) in designing algorithms for ITC. We present a novel idea to overcome it.

    3.2.1. Competitive learning

When competitive learning decides a winner for an input distribution $P$, it easily faces the difficulty of calculating the KL divergence from $P$ to the weight distributions $Q$ (see Appendix A). To overcome this difficulty, we present the idea of changing the order of steps in competitive learning (CL). As shown in Figure 8b, CL updates all weights (weight distributions) before deciding the winner, by

$$Q_k \leftarrow (1 - \gamma)\, Q_k + \gamma\, P, \qquad (24)$$

where $\gamma$ is a learning rate. Since the updated weight distributions $Q_k$ ($k = 1, \ldots, K$) include all words (features) of the input distribution $P$, it is possible to calculate the KL divergence $D_{KL}(P \,\|\, Q_k)$ for all $k$. In the following step, CL decides a winner $Q_c$ from $Q$ by

$$c = \arg\min_k D_{KL}(P \,\|\, Q_k) \quad \text{(if there are several candidates, choose the smallest } k\text{)}, \qquad (25)$$

    Figure 8.

Flow charts of competitive learning. (a) Competitive learning for SSE. (b) Competitive learning for ITC.

and then keeps (activates) the winner's update while discarding the others. These steps satisfy the CL requirement that the value of the objective function $JS_W$ be partially reduced in the steepest direction with the given learning rate $\gamma$. Here, neither approximation nor distortion is added to the ITC criterion. Note that the updates of the weight distributions $Q_k$ before activation are provisional (see Figure 8b).

Related work that avoids the difficulty of calculating the KL divergence introduced the skew divergence [22], defined as

$$s_\alpha(P, Q) = D_{KL}(P \,\|\, \alpha Q + (1 - \alpha) P), \qquad (26)$$

where $\alpha$ ($0 \le \alpha \le 1$) is the mixture ratio of the distributions. The skew divergence is exactly the KL divergence at $\alpha = 1$. When $\alpha = 1 - \gamma$, the mixture in Eq. (26) matches the update in Eq. (24). We can therefore rewrite the CL steps above using the skew divergence as

1. Select one input distribution $P$ randomly from $\mathcal{P}$.

2. Decide a winner $Q_c$ from $Q$ by

$$c = \arg\min_k s_\alpha(P, Q_k) \quad \text{(if there are several candidates, choose the smallest } k\text{)}, \qquad (27)$$

3. Update the winner's weight distribution $Q_c$ as

$$Q_c \leftarrow (1 - \gamma)\, Q_c + \gamma\, P, \qquad (28)$$

where $\gamma$ is a learning rate and usually equals $1 - \alpha$ ($\alpha$ is the mixture ratio of $s_\alpha$).

Hence, we call this novel algorithm for ITC "competitive learning using skew divergence" (sdCL). In addition, the splitting rule in competitive learning [12] can also be applied to this algorithm.
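A minimal sketch of sdCL (assuming NumPy; the skew-divergence winner selection follows Eqs. (26)–(28), while initialization and termination details are illustrative):

```python
import numpy as np

def skew_divergence(p, q, alpha):
    """s_alpha(P, Q) = D_KL(P || alpha*Q + (1-alpha)*P), Eq. (26); finite whenever alpha < 1."""
    mix = alpha * q + (1.0 - alpha) * p
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / mix[m])))

def sdCL(P, K, gamma=0.01, n_updates=100_000, seed=0):
    """Competitive learning for ITC using the skew divergence (sdCL)."""
    rng = np.random.default_rng(seed)
    alpha = 1.0 - gamma                                         # alpha = 1 - gamma
    Q = P[rng.choice(len(P), size=K, replace=False)].copy()     # initial weight distributions
    for _ in range(n_updates):
        p = P[rng.integers(len(P))]                             # Step 1: random input distribution
        c = int(np.argmin([skew_divergence(p, Q[k], alpha) for k in range(K)]))   # Step 2, Eq. (27)
        Q[c] = (1.0 - gamma) * Q[c] + gamma * p                 # Step 3: winner update, Eq. (28)
    labels = np.array([np.argmin([skew_divergence(p, q, alpha) for q in Q]) for p in P])
    return labels, Q
```

Adding the splitting rule of Section 2.2.2 to the weight set would give the sdCLS variant used in Section 5.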

    3.2.2. k‐means type algorithm

Dhillon et al. [8] proposed an information-theoretic divisive algorithm, which is a k-means type algorithm with a divisive mechanism that uses the KL divergence (see Note 1). However, the difficulty of using the KL divergence directly still remains. We therefore propose to use the skew divergence instead of the KL divergence in a k-means type algorithm as follows:

1. Initialize the clusters $C$ of the input distributions $\mathcal{P}$ randomly.

2. Update the weight distributions $Q$ by

$$Q_k = \frac{1}{N_k} \sum_{P_i \in C_k} P_i. \qquad (29)$$

3. Update the cluster $c$ of each input distribution $P_i$ by

$$c = \arg\min_k s_\alpha(P_i, Q_k) \quad \text{(if there are several candidates, choose the smallest } k\text{)}, \qquad (30)$$

where the mixture ratio $\alpha$ ($0 \le \alpha \le 1$) of the skew divergence $s_\alpha$ is, for example, 0.99.

4. Repeat Steps 2 and 3 until the change ratio of the objective function $JS_W$ is less than a small value (e.g., $10^{-8}$).

The algorithm itself works well and yields valuable clustering results. Further, if $\alpha$ is close to 1, the skew divergence $s_\alpha$ becomes a good approximation of the KL divergence. Therefore, restarting the learning after termination with $\alpha$ closer to 1, such as $0.999, 0.9999, \ldots$, may lead to better clustering results; a sketch is given below.
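A minimal sketch of this k-means type algorithm, called sdKM in Section 5 (assuming NumPy and the skew_divergence and within_cluster_js helpers from the earlier sketches; other details such as the empty-cluster handling are illustrative):

```python
import numpy as np

def sdKM(P, K, alpha=0.99, tol=1e-8, max_iter=1000, seed=0, labels=None):
    """k-means type clustering of input distributions using the skew divergence s_alpha."""
    rng = np.random.default_rng(seed)
    if labels is None:
        labels = rng.integers(0, K, size=len(P))       # Step 1: random initial clusters
    prev = np.inf
    for _ in range(max_iter):
        # Step 2: weight distributions are the cluster means, Eq. (29)
        Q = np.array([P[labels == k].mean(axis=0) if np.any(labels == k)
                      else P[rng.integers(len(P))] for k in range(K)])
        # Step 3: reassign each P_i to the nearest Q_k under s_alpha, Eq. (30)
        labels = np.array([np.argmin([skew_divergence(p, q, alpha) for q in Q]) for p in P])
        # Step 4: stop when the change ratio of the objective JS_W falls below tol
        js = within_cluster_js(P, labels, K)
        if np.isfinite(prev) and abs(prev - js) <= tol * max(prev, tol):
            break
        prev = js
    return labels, Q
```

Passing the returned labels back in with a larger $\alpha$ (0.999, 0.9999, ...) implements the restart strategy mentioned above.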

    3.2.3. Other algorithms

Slonim and Tishby [23] proposed an agglomerative hierarchical clustering algorithm, which is a hard-clustering version of the Information Bottleneck algorithm of Tishby et al. [24]. It is similar to the algorithm of Baker and McCallum [18] and merges exactly two clusters at every step, based on the JS divergence of their distributions. A merit of agglomerative algorithms is that they are not affected by the difficulty of calculating the KL divergence, because they use only the JS divergence. However, the merge at each step optimizes a local criterion rather than a global one, as Dhillon et al. [8] pointed out. Therefore, the clustering results may not be as good as those of nonhierarchical algorithms (e.g., k-means and competitive learning) in terms of optimizing the ITC objective function. Additionally, hierarchical algorithms are computationally expensive when the number of inputs is large.

Note that many studies [8, 18, 23] aimed at improving the accuracy of text classification by feature/word clustering based on ITC or distributional clustering. When clustering is just a step toward a final goal, feature clustering is meaningful. However, features that characterize clusters should not be merged when we aim to find clusters (topics) in a set of documents. Finding topics by clustering is, in fact, the aim of this chapter.

    4. Evaluation of clustering

Since clustering results depend on the methods (criteria and algorithms), selecting them appropriately is important. So far, we have introduced two criteria for clustering. These are internal criteria, which depend on their own models and are not sufficient for evaluation. If the criterion is shared, we can compare clustering results by the value of its objective function; under a certain model, which is in other words an assumption, a more probable result can be regarded as a better result. However, it is not guaranteed that the model or assumption is always reasonable. Moreover, results that are good under one criterion can be bad under another. A view from outside is therefore required.

This section introduces external criteria, namely purity, the Rand index (RI), and normalized mutual information (NMI) [25], to evaluate clustering quality and to find better clustering methods. These criteria compare clusters with a set of classes produced on the basis of human judgment. Here, each input datum belongs to one class $A_j$ ($j = 1, \ldots, J$) and one cluster $C_k$ ($k = 1, \ldots, K$). Let $T(C_k, A_j)$ be the number of data that belong to both $C_k$ and $A_j$.

    Purity is measured by counting the number of input data from the most frequent class in each cluster. Purity can be computed as

$$\text{purity} = \frac{1}{N} \sum_{k=1}^{K} \max_j T(C_k, A_j), \qquad (31)$$

where $N$ is the total number of input data. Purity is close to 1 when each cluster has one dominant class.

The Rand index (RI) checks all of the $N(N-1)/2$ pairs of input data and is defined by

$$RI = \frac{a + b}{a + b + c + d}, \qquad (32)$$

where $a$, $b$, $c$, and $d$ are the numbers of pairs satisfying the following conditions:

• $a$: the two members of the pair are in the same cluster and in the same class
• $b$: the two members are in different clusters and in different classes
• $c$: the two members are in the same cluster but in different classes
• $d$: the two members are in different clusters but in the same class

The Rand index (RI) thus measures the fraction of pairs, $a + b$, on which the clusters and classes agree.

    Normalized mutual information (NMI) is defined as

$$NMI = \frac{I(C; A)}{(H(C) + H(A))/2}, \qquad (33)$$

where $I(C; A)$ is the mutual information, $H(\cdot)$ is the entropy, and

$$I(C; A) = \sum_{k=1}^{K} \sum_{j=1}^{J} P(C_k, A_j) \log \frac{P(C_k, A_j)}{P(C_k) P(A_j)} = \sum_{k=1}^{K} \sum_{j=1}^{J} \frac{T(C_k, A_j)}{N} \log \frac{T(C_k, A_j)\, N}{T(C_k)\, T(A_j)}, \qquad (34)$$
$$H(C) = -\sum_{k=1}^{K} P(C_k) \log P(C_k) = -\sum_{k=1}^{K} \frac{T(C_k)}{N} \log \frac{T(C_k)}{N}, \qquad (35)$$
$$H(A) = -\sum_{j=1}^{J} P(A_j) \log P(A_j) = -\sum_{j=1}^{J} \frac{T(A_j)}{N} \log \frac{T(A_j)}{N}, \qquad (36)$$

where $P(C_k)$, $P(A_j)$, and $P(C_k, A_j)$ are the probabilities of a datum being in cluster $C_k$, in class $A_j$, and in the intersection of $C_k$ and $A_j$, respectively. The mutual information $I(C; A)$ measures the mutual dependence between the clusters $C$ and the classes $A$; it quantifies the amount of information obtained about the classes by knowing the clusters. Hence, a high $NMI$ indicates a kind of information-theoretic goodness of the clustering.
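Given integer label arrays for clusters and classes, all three external criteria can be computed from the contingency table $T(C_k, A_j)$. A minimal sketch (assuming NumPy; function names are illustrative):

```python
import numpy as np

def contingency(clusters, classes):
    """T[k, j] = number of data points assigned to cluster k and belonging to class j."""
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    T = np.zeros((clusters.max() + 1, classes.max() + 1))
    np.add.at(T, (clusters, classes), 1)
    return T

def purity(T):
    return T.max(axis=1).sum() / T.sum()                      # Eq. (31)

def rand_index(T):
    N = T.sum()
    pairs = lambda n: n * (n - 1) / 2
    a = pairs(T).sum()                                        # same cluster, same class
    c = pairs(T.sum(axis=1)).sum() - a                        # same cluster, different class
    d = pairs(T.sum(axis=0)).sum() - a                        # different cluster, same class
    b = pairs(N) - a - c - d                                  # different cluster, different class
    return (a + b) / (a + b + c + d)                          # Eq. (32)

def nmi(T):
    N = T.sum()
    pk, pj, pkj = T.sum(axis=1) / N, T.sum(axis=0) / N, T / N
    nz = pkj > 0
    I = np.sum(pkj[nz] * np.log(pkj[nz] / np.outer(pk, pj)[nz]))   # Eq. (34)
    H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))             # Eqs. (35) and (36)
    return I / ((H(pk) + H(pj)) / 2)                               # Eq. (33)
```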

    5. Experiments

    This section provides experimental results that show the effectiveness and usefulness of ITC and the proposed algorithm (sdCL: competitive learning using skew divergence). Experiments consist of two parts, experiment1 and experiment2.

In experiment1, we applied sdCL to the same data sets as used in the paper of Wang et al. [26] and compared the performance of sdCL with that of the other clustering algorithms evaluated there. The algorithms evaluated in the paper [26] are as follows.

• k-means (KM), with the weights $W$ initialized randomly [26].

    • Normalized cut (NC) [27].

    • Maximum margin clustering (MMC) [7].

    • Generalized maximum margin clustering (GMMC) [28].

    • Iterative support vector regression (IterSVR) [29].

    • Cutting plane maximum margin clustering (CPMMC) [26], CPM3C [26].

As the list above shows, maximum margin clustering (MMC) [7] and related works receive much attention. These works extend the idea of the support vector machine (SVM) [30] to the unsupervised scenario, and the experimental results obtained with MMC techniques are often better than those of conventional clustering methods. Among them, CPMMC and CPM3C (cutting plane multiclass maximum margin clustering) [26] are known as successful methods. The experimental results will show that the proposed algorithm sdCL outperforms CPM3C in text data clustering.

In experiment2, we focus on text data clustering and compare the performance of three algorithms: sdCL, sdCLS (sdCL with the splitting rule; see Sections 2.2.2 and 3.2.1), and spherical competitive learning (spCL). We also provide evidence supporting the effectiveness of ITC through a detailed analysis of the clustering results.

spCL is an algorithm for spherical clustering, like the spherical k-means algorithm [5], which was proposed for clustering high-dimensional and sparse data such as text. The objective function to be maximized in spherical clustering is the cosine similarity between the input vectors and the mean vector of the cluster to which they belong. To implement spCL, we normalize the input and weight vectors ($x$, $w$) to unit length and decide the winner $w_c$ by

$$c = \arg\max_k \cos(x, w_k) \quad \text{(if there are several candidates, choose the smallest } k\text{)}, \qquad (37)$$

and update the winner's weight $w_c$ as

$$w_c \leftarrow \frac{(1 - \gamma)\, w_c + \gamma\, x}{\|(1 - \gamma)\, w_c + \gamma\, x\|}. \qquad (38)$$
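A minimal sketch of spCL (assuming NumPy; names are illustrative), mirroring the competitive-learning sketch of Section 2.2.2 but with cosine winner selection, Eq. (37), and the renormalized update of Eq. (38):

```python
import numpy as np

def spCL(X, K, gamma=0.01, n_updates=100_000, seed=0):
    """Spherical competitive learning: winner by cosine similarity, update kept on the unit sphere."""
    rng = np.random.default_rng(seed)
    U = X / np.linalg.norm(X, axis=1, keepdims=True)          # unit-length input vectors
    W = U[rng.choice(len(U), size=K, replace=False)].copy()   # initial weight vectors
    for _ in range(n_updates):
        x = U[rng.integers(len(U))]
        c = int(np.argmax(W @ x))                             # cosine = dot product of unit vectors, Eq. (37)
        w = (1 - gamma) * W[c] + gamma * x
        W[c] = w / np.linalg.norm(w)                          # Eq. (38)
    labels = (U @ W.T).argmax(axis=1)
    return labels, W
```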

For all competitive learning algorithms, the learning rate $\gamma = 0.01$, the maximum number of weight updates $N_r = 1{,}000{,}000$ (termination condition), and the splitting threshold $\theta = 1000$ are used. After competitive learning (sdCL, sdCLS, or spCL) terminates, we apply a k-means type algorithm as post-processing to remove fluctuations. Specifically, sdKM (the k-means type algorithm using skew divergence shown in Section 3.2.2) with $\alpha = 0.999, 0.9999, 0.99999$ is applied consecutively after sdCL and sdCLS. In each learning procedure, including post-processing, the operation is repeated 50 times with different initial random seeds for a given set of parameters.

    5.1. Data sets

We mainly use the same data sets as the paper of Wang et al. [26]. When applying the algorithms for ITC, we use the probability distributions $P_i$ ($i = 1, \ldots, N$), i.e., $\mathcal{P}$, derived from the original data.

1. UCI data. From the UCI repository (see Note 2), we use ionosphere, digits, letter, and satellite under the same setting as the paper [26]. The digits data ($8 \times 8$ matrices) are generated from bitmaps of handwritten digits. The pairs (3 vs. 8, 1 vs. 7, 2 vs. 7, and 8 vs. 9) are focused on because they are difficult to differentiate. For the letter and satellite data sets, their first two classes are used. Since the ionosphere data contain negative values and cannot be transformed into probability distributions, we do not apply ITC to them.

2. Text data. Four text data sets are used: 20Newsgroups (http://qwone.com/∼jason/20Newsgroups/ [Accessed: 2016-10-25]), WebKB (see Note 3), Cora [31], and RCV1 (Reuters Corpus Volume 1) [32]. In experiment1, we follow the setting of the paper [26]. For the 20Newsgroups data set, the topic "rec", which contains the four topics {autos, motorcycles, baseball, hockey}, is used. From these four topics, two two-class data sets {Text-1: autos vs. motorcycles, Text-2: baseball vs. hockey} are extracted. From the WebKB data sets, the four-universities data set (Cornell, Texas, Washington, and Wisconsin), which has seven classes (student, faculty, staff, department, course, project, and other), is used. Note that the topic of the "other" class is ambiguous and may contain various topics (e.g., faculty), because it is a collection of pages that were not deemed the "main page" representing an instance of the other six classes, as pointed out on the web page of the data set. The Cora data set (Cora research paper classification) [31] is a set of research paper records classified into a topic hierarchy. From this data set, papers in the subfields {data structure (DS), hardware and architecture (HA), machine learning (ML), operating system (OS), programming language (PL)} are used; we select the papers that contain both a title and an abstract. The RCV1 data set contains more than 800 thousand documents to which topic categories are assigned. We use the documents in the training set labeled with the highest four topic codes (CCAT, ECAT, GCAT, and MCAT) in the topic-code hierarchy; multi-labeled instances are removed.

In experiment2, we use all of the 20Newsgroups and RCV1 data sets. For the RCV1 data set, we obtain 53 classes (categories) by mapping the data set to the second level of the RCV1 topic hierarchy and removing multi-labeled instances. For the WebKB data set, we remove the "other" class because of its ambiguity, use the remaining six classes, and do not use the university information.

For all text data, we remove stop words using a stop list [32] and remove data that become empty, if any. In experiment1, we follow the setting of the paper [26], but the properties of the data sets are slightly different (see Table 3). For the Cora data sets, the differences in data size are large. However, they should keep the same (or at least similar) characteristics (e.g., distributions of words and topics), because they are extracted from the same source.

3. Digits data. USPS ($16 \times 16$) and MNIST ($28 \times 28$) are data sets of handwritten digit images (see Note 4). For the USPS data set, the digit images 1, 2, 3, and 4 are used. For MNIST and the digits data from the UCI repository, all 45 pairs of the digits 0–9 are used in two-class problems.

      The properties of those data sets are listed in Table 3.

Data                 Size (N)   Feature (M)   Class (K)
Ionosphere                351            34           2
Letter                   1555            16           2
Digits                   1555            64           2
Satellite                2236            36           2
Text-1                   1981        16,259           2
Text-2                   1987        15,955           2
20Newsgroups-4           3967        24,506           4
20Newsgroups-20        18,772        60,698          20
Cora-DS                  2397          5745           9
Cora-HA                   913          3340           7
Cora-ML                  3569          6809           7
Cora-OS                  2084          5029           4
Cora-PL                  3026          6069           9
WebKB-Cornell             835          5574           7
WebKB-Texas               808          4482           7
WebKB-Washington         1191          7779           7
WebKB-Wisconsin          1218          8270           7
WebKB6                   4219        14,142           6
Reuters-RCV1-4         19,806        44,214           4
Reuters-RCV1-53       534,135       216,704          53
MNIST                  70,000           784           2
USPS                     3046           256           4

    Table 3.

    Properties of data sets.

    5.2. Results of experiment1

The clustering results are shown in Tables 4–7, where the values (except for sdCL) are taken from the paper of Wang et al. [26] (accuracy in that paper is equivalent to purity by its definition). In the two-class problems, CPMMC outperforms the other algorithms in purity and Rand index (RI) in most cases. The proposed algorithm sdCL shows stable performance, except on ionosphere, to which sdCL cannot be applied. In the multiclass problems, sdCL outperforms the other algorithms on the text data (Cora, 20Newsgroups-4, and Reuters-RCV1-4). The results show that ITC and the proposed algorithm sdCL are effective for text data sets. Note that CPM3C shows better results than sdCL for the WebKB data. However, the topic of the "other" class in WebKB is ambiguous (see Section 5.1). Its occupation ratio is large ({0.710, 0.689, 0.777, 0.739}) and almost the same as the purity values of CPM3C and sdCL, which means that, in terms of purity, neither algorithm found meaningful clusters there. Therefore, the WebKB data are not appropriate for evaluation without removing the "other" class.

Data        KM       NC       MMC      GMMC     IterSVR  CPMMC    sdCL
Iono        0.5428   0.7500   0.7875   0.7650   0.7770   0.7548   –
Letter      0.8206   0.7680   –        –        0.9280   0.9502   0.9267
Satellite   0.9593   0.9579   –        –        0.9682   0.9879   0.9506
Text-1      0.5053   0.9379   –        –        0.9702   0.9500   0.9306
Text-2      0.5038   0.9135   –        –        0.9399   0.9721   0.9591
Dig 3–8     0.9468   0.6500   0.9000   0.9440   0.9664   0.9688   0.9440
Dig 1–7     0.9445   0.5500   0.6875   0.9780   0.9945   1.000    1.000
Dig 2–7     0.9691   0.6600   0.9875   0.9950   1.000    1.000    0.9891
Dig 8–9     0.9068   0.5200   0.9625   0.8400   0.9633   0.9812   0.8910
UCI-dig     0.9638   0.9757   –        –        0.9818   0.9940   0.9516
MNIST       0.8921   0.8992   –        –        0.9241   0.9621   0.8812

    Table 4.

    Purity comparisons for two‐class problems.

Bold fonts indicate the maximum purities for a given data set.


Data        KM     NC     MMC    GMMC   IterSVR  CPMMC  sdCL
Iono        0.56   0.63   0.67   0.64   0.56     0.65   –
Letter      0.71   0.64   –      –      0.87     0.92   0.86
Satellite   0.92   0.92   –      –      0.94     0.97   0.91
Text-1      0.50   0.88   –      –      0.94     0.94   0.87
Text-2      0.50   0.84   –      –      0.89     0.93   0.92
Dig 3–8     0.90   0.55   0.82   0.90   0.94     0.95   0.89
Dig 1–7     0.99   0.50   0.57   0.96   0.99     1.0    1.0
Dig 2–7     0.94   0.55   0.98   0.99   1.0      1.0    0.98
Dig 8–9     0.84   0.50   0.93   0.73   0.93     0.97   0.81
UCI-dig     0.93   0.96   –      –      0.97     0.99   0.93
MNIST       0.81   0.82   –      –      0.86     0.94   0.83

    Table 5.

    Rand Index (RI) comparisons for two‐class problems.

Bold fonts indicate the maximum Rand indices for a given data set.


Data                KM       NC       MMC      CPM3C    sdCL
UCI-digits 0689     0.4223   0.9313   0.9483   0.9674   0.9394
UCI-digits 1279     0.4042   0.9011   0.9191   0.9452   0.8300
USPS                0.9215   0.9011   0.9191   0.9452   0.9515
Cora-DS             0.2824   0.3688   –        0.4415   0.5057
Cora-HA             0.3402   0.4200   –        0.5980   0.6145
Cora-ML             0.2708   0.3103   –        0.4549   0.5974
Cora-OS             0.2387   0.2303   –        0.5916   0.6686
Cora-PL             0.3380   0.3397   –        0.4721   0.4729
WebKB-Cornell       0.5571   0.6143   –        0.7205   0.7192
WebKB-Texas         0.4505   0.3538   –        0.6910   0.6895
WebKB-Washington    0.5352   0.3285   –        0.7817   0.7767
WebKB-Wisconsin     0.4953   0.3331   –        0.7425   0.7397
20Newsgroups-4      0.3527   0.4189   –        0.7134   0.9360
Reuters-RCV1-4      0.2705   –        –        0.6235   0.8064

    Table 6.

    Purity comparisons for multiclass problems.

Bold fonts indicate the maximum purities for a given data set.


Data                KM       NC       MMC      CPM3C    sdCL
UCI-digits 0689     0.696    0.939    0.941    0.974    0.945
UCI-digits 1279     0.4042   0.9011   0.9191   0.945    0.868
USPS                0.932    0.938    –        0.950    0.958
Cora-DS             0.589    0.744    –        0.746    0.823
Cora-HA             0.385    0.659    –        0.695    0.767
Cora-ML             0.514    0.720    –        0.761    0.802
Cora-OS             0.518    0.522    –        0.730    0.735
Cora-PL             0.643    0.675    –        0.712    0.819
WebKB-Cornell       0.603    0.602    –        0.724    0.483
WebKB-Texas         0.604    0.602    –        0.712    0.495
WebKB-Washington    0.616    0.581    –        0.752    0.426
WebKB-Wisconsin     0.581    0.509    –        0.761    0.464
20Newsgroups-4      0.581    0.496    –        0.780    0.940
Reuters-RCV1-4      0.471    –        –        0.703    0.800

    Table 7.

    Rand Index (RI) comparisons for multiclass problems.

Bold fonts indicate the maximum Rand indices for a given data set.


    5.3. Results of experiment2

In experiment2, we focus on text data clustering. Table 8 shows that the proposed algorithms for ITC (sdCL and sdCLS) outperform spCL in purity, RI, and NMI. Considering that spCL is an algorithm for spherical clustering [5], which was proposed to analyze high-dimensional data such as text documents, the criterion of information-theoretic clustering is worth using for this purpose.

                    sdCL                       sdCLS                      spCL
Data         Purity   RI      NMI      Purity   RI      NMI      Purity   RI      NMI
Cora-DS      0.506    0.823   0.322    0.514    0.826   0.327    0.361    0.796   0.162
Cora-HA      0.614    0.767   0.363    0.618    0.767   0.360    0.518    0.728   0.251
Cora-ML      0.597    0.802   0.384    0.602    0.801   0.385    0.431    0.757   0.181
Cora-OS      0.669    0.735   0.313    0.685    0.741   0.325    0.582    0.673   0.187
Cora-PL      0.473    0.819   0.275    0.484    0.820   0.274    0.414    0.802   0.175
WebKB6       0.575    0.712   0.272    0.575    0.712   0.272    0.519    0.702   0.181
20News-20    0.690    0.954   0.653    0.697    0.955   0.656    0.359    0.905   0.324
RCV1-53      0.731    0.910   0.586    0.738    0.906   0.580    0.568    0.895   0.384

    Table 8.

    Comparison for text data sets.

Table 8 also shows that sdCLS (sdCL with the splitting rule; see Sections 2.2.2 and 3.2.1) is slightly better than sdCL in some cases. As Figure 9 (left, right) shows, the JS divergence values of sdCLS are smaller ("better" in ITC) than those of sdCL, and sdCLS outperforms sdCL in purity on average. Nevertheless, the advantage of sdCLS over sdCL is not so obvious in this experiment.

    Figure 9.

    Purity versus JS divergence for 20Newsgroups (left) and Reuters‐RCV1 (right) data sets.

Note that clustering results obtained by dCL (competitive learning using the KL divergence directly) are shown below. Since the NMI values clearly show that dCL converged to solutions unrelated to the classes, the use of the skew divergence is an effective technique to overcome this problem.

Data         Purity   RI      NMI
WebKB6       0.363    0.667   0.00629
20News-20    0.071    0.905   0.00596

In the following, we examine the clustering results obtained by ITC in detail, to determine whether ITC helps us to find meaningful clusters and candidate classes for classification. Tables 9 and 10 show the frequent words in the classes of the 20Newsgroups data set and in the clusters obtained by sdCL, respectively. The order of the clusters is arranged so that clusters correspond to classes. Table 11 is the cross table between clusters and classes.

As Table 10 shows, the frequent words in some clusters suggest characteristics that distinguish them from other clusters. For example, the words in cluster 2 ("image graphics jpeg"), cluster 6 ("sale offer shipping"), and cluster 11 ("key encryption chip") call to mind the classes (comp.graphics), (misc.forsale), and (sci.crypt), respectively. We can likewise infer the characteristics of clusters 7, 8, 9, 10, 13, 14, and 16 from their words. These clusters have documents of one dominant class and can be regarded as candidate classes.

However, there are some exceptions. The 1st and 15th clusters share the word "god," just as the classes (alt.atheism), (soc.religion.christian), and (talk.religion.misc) do. Cluster 1 and the class (alt.atheism) have the common words "religion evidence," and cluster 1 contains many documents of its dominant class (alt.atheism). Cluster 15 and the class (soc.religion.christian) have the common words "jesus bible christ church," and cluster 15 contains many documents of its dominant class (soc.religion.christian). On the other hand, there is no cluster in which (talk.religion.misc) is dominant; its documents are mostly shared between clusters 1 and 15. Although this is a mismatch between clusters and classes, the clustering result is still acceptable, because the words of the class (talk.religion.misc) resemble those of the class (soc.religion.christian). We can also see that cluster 4 contains many documents of the two classes (comp.sys.ibm.pc.hardware) and (comp.sys.mac.hardware). From Table 9, these classes have similar words except for "mac" and "apple." Thus, ITC failed to detect the difference between these classes but found a cluster that captures their common features; in this sense, the clustering result is still meaningful and useful. Another example is that the documents of the class (talk.politics.mideast) are divided into clusters 17 and 18. This means that ITC found two topics within one class, and the frequent words of the two clusters seem reasonable (see the 17th and 18th clusters in Table 10). The characteristic of cluster 20, which has the words "mail list address email send," differs from all classes as well as from the other clusters, but cluster 20 contains some documents of every class (see Table 11). This cluster may reveal that all newsgroups include documents with such words.

In summary, ITC helps us to find meaningful clusters, even when the clusters obtained by ITC are sometimes not the same as the expected classes. The detailed analysis of the clustering results above provides evidence supporting the effectiveness and usefulness of ITC.

alt.atheism                 god writes people article atheism religion time evidence
comp.graphics               image graphics jpeg file bit images software data files ftp
comp.os.ms-windows.misc     windows file dos writes article files ms os problem win
comp.sys.ibm.pc.hardware    drive scsi card mb ide system controller bus pc writes
comp.sys.mac.hardware       mac apple writes drive system problem article mb monitor mhz
comp.windows.x              window file server windows program dos motif sun display widget
misc.forsale                sale shipping offer mail price drive condition dos st email
rec.autos                   car writes article cars good engine apr ve people time
rec.motorcycles             writes bike article dod ca apr ve ride good time
rec.sport.baseball          writes year article game team baseball good games time hit
rec.sport.hockey            game team hockey writes play ca games article season year
sci.crypt                   key encryption government chip writes clipper people article keys system
sci.electronics             writes article power good ve work ground time circuit ca
sci.med                     writes article people medical health disease time cancer patients
sci.space                   space writes nasa article earth launch orbit shuttle time system
soc.religion.christian      god people jesus church christ writes christian christians bible time
talk.politics.guns          gun people writes article guns fbi government fire time weapons
talk.politics.mideast       people israel armenian writes turkish jews article armenians israeli jewish
talk.politics.misc          people writes article president government mr stephanopoulos make time
talk.religion.misc          god writes people jesus article bible christian good christ life

    Table 9.

    Frequent words in classes of 20Newsgroups data set.

1    writes god article people religion evidence science moral objective time
2    image graphics jpeg file images ftp data bit files format
3    windows dos file system writes os files article program software
4    drive scsi mb card writes system mac article bit apple
5    window file server program motif sun widget set display output
6    sale offer shipping mail price condition st good email interested
7    car writes article cars engine good ve apr time oil
8    writes article bike dod apr ca ride ve good time
9    writes year article game team baseball good games time hit
10   game team hockey writes play ca games article year season
11   key encryption government chip writes clipper people article keys system
12   db writes article power good ground ca time ve circuit
13   writes article medical people disease health cancer patients time msg
14   space writes nasa article earth launch orbit time shuttle system
15   god jesus people bible writes christ christian church christians article
16   people writes gun article government fbi fire guns koresh batf
17   israel israeli writes jews article people arab jewish arabs state
18   people armenian turkish armenians president armenia war mr turks turkey
19   writes people article government cramer apr state make health optilink
20   mail list address email send information ca internet article writes

    Table 10.

    Frequent words in clusters of 20Newsgroups data set.

    16254922428032054912541513163
    216324419171011074801432121820011
    30107707189586027500112402330011
    4074646307111810433102936320101
    514353111571201011921101003
    6095182526481715300161311010
    70127803578465110252722211
    8130046736867202106601102
    940440143488618023230363
    1000110193237943011103230
    1121524655101288136560209153
    12111115233232144308603281021001
    131010513111215771530022
    14395235444111291784324055
    1594120002113000102863546299
    1615205826361101186847805218993
    17921011070000218625471410
    18320000011021153412123333812
    1937821631220447136131113331247016
    206504433465853358341627664836221819811

    Table 11.

    The number of documents about each class in each cluster.

    6. Conclusion

In this chapter, we introduced information-theoretic clustering (ITC) from both the theoretical and the experimental side. Theoretically, we presented the criterion, the generative model, and novel algorithms for ITC. Experimentally, we showed the effectiveness and usefulness of ITC for text analysis as an important example.

7. Appendix A: A difficulty with the KL divergence

Let $P$ and $Q$ be distributions whose $m$th probabilities $p_m$ and $q_m$ are the $m$th elements of vectors $p$ and $q$, respectively. The Kullback-Leibler (KL) divergence from $P$ to $Q$ is defined as

$$D_{KL}(P \,\|\, Q) = \sum_m p_m \log \frac{p_m}{q_m}. \qquad (39)$$

In this definition, it is assumed that the support set of $P$ is a subset of the support set of $Q$ (if $q_m$ is zero, $p_m$ must be zero). For a given cluster $C_k$, there is no problem in calculating the JS divergence of cluster $C_k$ by Eq. (12), because the support set of any distribution $P_i$ ($\in C_k$) is a subset of that of the mean distribution $\bar{P}_k$. However, it is not guaranteed that the KL divergence from $P_i$ ($\in C_k$) to $Q_t$ ($t \neq k$), a weight distribution of another cluster $C_t$, is finite. This causes a serious problem when searching for the weight distribution $Q$ most similar to an input distribution $P$: the lack of even one word (feature) in a distribution $Q$ is enough to make it maximally dissimilar. Therefore, it is difficult to use a k-means type algorithm (see Note 5), which updates weights or clusters by batch processing, directly in ITC.
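The problem, and the skew-divergence remedy of Section 3.2.1, can be seen in a small numerical example (a sketch assuming NumPy; the toy distributions are illustrative):

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) of Eq. (39); infinite when some p_m > 0 while q_m = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

p = np.array([0.5, 0.5, 0.0])        # input distribution: uses feature (word) 2
q = np.array([0.7, 0.0, 0.3])        # weight distribution: lacks feature (word) 2

print(kl(p, q))                      # inf -- the support of p is not contained in that of q
alpha = 0.99
print(kl(p, alpha * q + (1 - alpha) * p))   # finite: the skew divergence s_alpha(P, Q), Eq. (26)
```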

    Acknowledgments

    This work was supported by JSPS KAKENHI Grant Number 26330259.

    Notes

1. The algorithm was proposed for feature/word clustering and applied to text classification. Since the algorithm uses document classes (labeled data), it cannot be applied to the general clustering problem.
2. http://archive.ics.uci.edu/ml/ [Accessed: 2016-10-25].
3. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/ [Accessed: 2016-10-25].
4. http://www.kernel-machines.org/data [Accessed: 2016-10-25].
5. The k-means algorithm is known as an algorithm for clustering based on the sum-of-squared-error criterion (see Section 2.2.1); therefore, we add the word "type" when referring to such an algorithm for information-theoretic clustering.
