Fuzzy c-Means Clustering, Entropy Maximization, and Deterministic and Simulated Annealing

Makoto Yasuda

doi:10.5772/48659

Author Information

Show +

Makoto Yasuda
- Gifu National College of Technology, Japan

*Address all correspondence to:

1. Introduction

Many engineering problems can be formulated as optimization problems, and the deterministic annealing (DA) method [20] is known as an effective optimization method for such problems. DA is a deterministic variant of simulated annealing (SA) [1, 10]. The DA characterizes the minimization problem of cost functions as the minimization of Helmholtz free energy which depends on a (pseudo) temperature, and tracks the minimum of free energy while decreasing temperature and thus it can deterministically optimize the function at a given temperature [20]. Hence, the DA is more efficient than the SA, but does not guarantee a global optimal solution. The study on the DA in [20] addressed avoidance of the poor local minima of cost function of data clustering. Then it was extensively applied to various subjects such as combinational optimization problems [21], vector quantization [4], classifier design [13], pairwise data clustering [9] and so on.

On the other hand, clustering is a method which partitions a given set of data points into subgroups, and is one of major tools for data analysis. It is supposed that, in the real world, cluster boundaries are not so clear that fuzzy clustering is more suitable than crisp clustering. Bezdek[2] proposed the fuzzy c-means (FCM) which is now well known as the standard technique for fuzzy clustering.

Then, after the work of Li et al.[11] which formulated the regularization of the FCM with Shannon entropy, Miyamoto et al.[14] discussed the FCM within the framework of the Shannon entropy based clustering. From the historical point of view, however, it should be noted that Rose et al.[20] first studied the statistical mechanical analogy of the FCM with the maximum entropy method, which was basically probabilistic clustering.

To measure the “indefiniteness” of fuzzy set, DeLuca and Termini [6] defined fuzzy entropy after Shannon. Afterwards, some similar measures from the wider viewpoints of the indefiniteness were proposed [15, 16]. Fuzzy entropy has been used for knowledge retrieval from fuzzy database [3] and image processing [31], and proved to be useful.

Tsallis [24] achieved nonextensive extension of the Boltzmann-Gibbs statistics. Tsallis postulated a generalization form of entropy with a generalization parameterq, which, in a limit ofq→1,reaches the Shannon entropy. Later on, Menard et.al.[12] derived a membership function by regularizing FCM with the Tsallis entropy.

In this chapter, by maximizing the various entropies within the framework of FCM, the membership functions which take the familiar forms of the statitical mechanical distribution functions are derived. The advantage to use the statistical mechanical membership functions is that the fuzzy c-means clustering can be interpreted and analyzed from a statistical mechanical point of view [27, 28]

After that, we focus on the Fermi-Dirac like membership function, because, as compared to the Maxwell-Boltzmann-like membership function, the Fermi-Dirac-like membership function has extra parameters αks (αkcorresponds to a chemical potential in statistical mechanics[19], and kdenotes a data point), which make it possible to represent various cluster shapes like former clustering methods based on such as the Gaussian mixture[7], and the degree of fuzzy entropy[23]. αks strongly affect clustering results and they must be optimized under a normalization constraint of FCM. On the other hand, the DA method, though it is efficient, does not give appropriate values of αks by itself and the DA clustering sometimes fails if αks are improperly given. Accordingly, we introduce SA to optimize αks because, as pointed above, both of DA and SA contain the parameter corresponding to the system temperature and can be naturally combined as DASA.

Nevertheless, this approach causes a few problems. (1)How to estimate the initial values of αks under the normalization constraint ? (2)How to estimate the initial annealing temperature? (3)SA must optimize a real number αk[5, 26]. (4)SA must optimize many αks[22].

Linear approximations of the Fermi-Dirac-like membership function is useful in guessing the initial αks and the initial annealing temperature of DA.

In order to perform SA in a many variables domain, αks to be optimized are selected according to a selection rule. In an early annealing stages, most αks are optimized. In a final annealing stage, however, only αks of data which locate sufficiently away from all cluster centers are optimized because their memberships might be fuzzy. Distances between the data and the cluster centers are measured by using linear approximations of the Fermi-Dirac-like membership function.

However, DASA suffers a few disadvantages. One of them is that it is not necessarily easy to interpolate membership functions obtained by DASA, since their values are quite different each other. The fractal interpolation method [17] is suitable for these rough functions [30].

Numerical experiments show that DASA clusters data which distribute in various shapes more properly and stably than single DA. Also, the effectiveness of the fractal interpolation is examined.

2. Fuzzy c-means

Let X={x1,…,xn}(xk=(xk1,…,xkp)∈Rp)be a data set in a p-dimensional real space, which should be divided into cclustersC={C1,…,Cc}. Let V={v1,…,vc}(vi=(vi1,…,vip))be the centers of clusters and uik∈[0,1](i=1,…,c;k=1,…,n) be the membership function. Also let

J=∑k=1n‍∑i=1c‍uik(dik)m(m>1)E1

be the objective function of the FCM wheredik=∥xk-vi∥2. In the FCM, under the normalization constraint of

∑i=1c‍uik=1∀k,E2

the Lagrange function LFCM is given by

LFCM=J-∑k=1n‍ηk∑i=1c‍uik-1,E3

where ηk is the Lagrange multiplier. Bezdek[2] showed that the FCM approaches crisp clustering as mdecreases to+1.

3. Entropy maximization of FCM

3.1. Shannon entropy maximization

First, we introduce the Shannon entropy into the FCM clustering. The Shannon entropy is given by

S=-∑k=1n‍∑i=1c‍uikloguik.E4

Under the normalization constraint and setting mto1, the fuzzy entropy functional is given by

δS-∑k=1n‍αkδ∑i=1c‍uik-1-β∑k=1n‍∑i=1c‍δ(uikdik),E5

where αk and βare the Lagrange multipliers, and αk must be determined so as to satisfy Eq. (2). The stationary condition for Eq. (5) leads to the following membership function

uik=e-βdik∑j=1c‍e-βdjk.E6

and the cluster centers

vi=∑k=1n‍uikxk∑k=1n‍uik.E7

(7)

3.2. Fuzzy entropy maximization

We then introduce the fuzzy entropy into the FCM clustering.

The fuzzy entropy is given by

S^=-∑k=1n‍∑i=1c‍{u^iklogu^ik+(1-u^ik)log(1-u^ik)}.E8

The fuzzy entropy functional is given by

δS^-∑k=1n‍αkδ∑i=1c‍u^ik-1-β∑k=1n‍∑i=1c‍δ(u^ikdik),E9

where αk and βare the Lagrange multipliers[28]. The stationary condition for Eq. (9) leads to the following membership function

u^ik=1eαk+βdik+1,E10

and the cluster centers

vi=∑k=1n‍u^ikxk∑k=1n‍u^ik.E11

In Eq. (10), βdefines the extent of the distribution [27]. Equation (10) is formally normalized as

u^ik=1eαk+βdik+1/∑j=1c‍1eαk+βdjk+1.E12

(12)

3.3. Tsallis entropy maximization

Let v~i and u~ik be the centers of clusters and the membership functions, respectively.

The Tsallis entropy is defined as

S~=-1q-1∑k=1n‍∑i=1c‍u~iq-1,E13

where q∈Ris any real number. The objective function is rewritten as

U~=∑k=1n‍∑i=1c‍u~ikqd~ik,E14

where

d~ik=∥xk-v~i∥2E15

.

Accordingly, the Tsallis entropy functional is given by

δS~-∑k=1n‍αkδ∑i=1c‍u~ik-1-β∑k=1n‍∑i=1c‍δ(u~ikqd~ik).E16

The stationary condition for Eq. (15) yields to the following membership function

u~ik={1-β(1-q)d~ik}11-qZ~,E17

where

Z~=∑j=1c‍{1-β(1-q)d~jk}11-q.E18

In this case, the cluster centers are given by

v~i=∑k=1n‍u~ikqxk∑k=1n‍u~ikq.E19

In the limit ofq→1, the Tsallis entropy recovers the Shannon entropy [24] and u~ik approaches uik in Eq.(6).

4. Entropy maximization and statistical physics

4.1. Shannon entropy based FCM statistics

In the Shannon entropy based FCM, the sum of the states (the partition function) for the grand canonical ensemble of fuzzy clustering can be written as

Z=∏k=1n‍∑i=1c‍e-βdik.E20

By substituting Eq. (19) for F=-(1/β)(logZ)[19], the free energy becomes

F=-1β∑k=1n‍log∑i=1c‍e-βdik.E21

Stable thermal equilibrium requires a minimization of the free energy. By formulating deterministic annealing as a minimization of the free energy, ∂F/∂vi=0yields

vi=∑k=1n‍uikxk∑k=1n‍uik.E22

This cluster center is the same as that in Eq. (7).

4.2. Fuzzy entropy based FCM statistics

In a group of independent particles, the total energy and the total number of particles are given by E=∑l‍εlnlandN=∑l‍nl, respectively, where εl represents the energy level and nl represents the number of particles that occupyεl. We can write the sum of states, or the partition function, in the form:

ZN=∑∑l‍εlnl=E,∑l‍nl=N‍e-β∑l‍εlnlE23

where βis the product of the inverse of temperature Tand kB (Boltzmann constant). However, it is difficult to take the sums in (22) counting up all possible divisions. Accordingly, we make the number of particles nl a variable, and adjust the new parameter α(chemical potential) so as to make ∑l‍εlnl=Eand ∑l‍nl=Nare satisfied. Hence, this becomes the grand canonical distribution, and the sum of states (the grand partition function) Ξis given by[8, 19]

Ξ=∑N=0∞‍(e-α)NZN=∏l‍∑nl=0∞‍(e-α-βεl)nl.E24

For particles governed by the Fermi-Dirac distribution, Ξcan be rewritten as

Ξ=∏l‍(1+e-α-βεl).E25

Also, nlis averaged as

⟨nl⟩=1eα+βεl+1E26

where αis defined by the condition that N=∑l‍⟨nl⟩[19]. Helmholtz free energy Fis, from the relationshipF=-kBTlogZN,

F=-kBTlogΞ-α∂∂αlogΞ=-1β∑l‍log(1+e-α-βεl)+αN.E27

Taking that

E=∑l‍εleα+βεl+1E28

into account, the entropy S=(E-F)/Thas the form

S=-kB∑l‍⟨nl⟩log⟨nl⟩+(1-⟨nl⟩)log(1-⟨nl⟩).E29

If states are degenerated to the degree ofνl, the number of particles which occupy εl is

⟨Nl⟩=νl⟨nl⟩,E30

and we can rewrite the entropy Sas

S=-kB∑l‍{⟨Nl⟩νllog⟨Nl⟩νl+1-⟨Nl⟩νllog1-⟨Nl⟩νl},E31

which is similar to fuzzy entropy in (8). As a result, uikcorresponds to a grain density ⟨nl⟩ and the inverse of βin (10) represents the system or computational temperatureT.

In the FCM clustering, note that any data can belong to any cluster, the grand partition function can be written as

Ξ=∏k=1n‍∏i=1c‍(1+e-αk-βdik),E32

which, from the relationshipF=-(1/β)(logΞ-αk∂logΞ/∂αk), gives the Helmholtz free energy

F=-1β∑k=1n‍∑i=1c‍log(1+e-αk-βdik)+αk.E33

The inverse of βrepresents the system or computational temperatureT.

4.3. Correspondence between Fermi-Dirac statistics and fuzzy clustering

	Fermi-Dirac Statistics	Fuzzy Clustering
Constraints	(a)∑l‍nl=N (b)∑l‍εlnl=E	(a)∑i=1c‍uik=1
Distribution Function	⟨nl⟩=1eα+βεl+1	uik=1eαk+βdik+1
Entropy	S=-kB∑l‍⟨Nl⟩νllog⟨Nl⟩νl +1-⟨Nl⟩νllog1-⟨Nl⟩νl	SFE=-∑k=1n‍∑i=1c‍{uikloguik +(1-uik)log(1-uik)}
Temperature(T)	(given)	(given)
Partition Function(Ξ)	∏l‍1+e-α-βεl	∏k=1n‍∏i=1c‍1+e-αk-βdik
Free Energy(F)	-1β∑l‍log(1+e-α-βεl)+αN	-1β∑k=1n‍∑i=1c‍log(1+e-αk-βdik)+αk
Energy(E)	∑l‍εleα+βεl+1	∑k=1n‍∑i=1c‍dikeαk+βdik+1

Table 1.

Correspondence of Fermi-Dirac Statistics and Fuzzy Clustering.

In the previous subsection, we have formulated the fuzzy entropy regularized FCM as the DA clustering and showed that its mechanics was no other than the statistics of a particle system (the Fermi-Dirac statistics). The correspondences between fuzzy clustering (FC) and the Fermi-Dirac statistics (FD) are summarized in TABLE 1. The major difference between fuzzy clustering and statistical mechanics is the fact that data are distinguishable and can belong to multiple clusters, though particles which occupy a same energy state are not distinguishable. This causes a summation or a multiplication not only on ibut on kas well in fuzzy clustering. Thus, fuzzy clustering and statistical mechanics described in this paper are not mathematically equivalent.

Constraints: (a) Constraint that the sum of all particles Nis fixed in FD is correspondent with the normalization constraint in FC. Energy level lis equivalent to the cluster numberi. In addition, the fact that data can belong to multiple clusters leads to the summation onk. (b) There is no constraint in FC which corresponds to the constraint that the total energy equals Ein FD. We have to minimize ∑k=1n‍∑i=1c‍dikuik in FC.
Distribution Function: In FD, ⟨nl⟩gives an average particle number which occupies energy levell, because particles can not be distinguished from each other. In FC, however, data are distinguishable, and for that reason, uikgives a probability of data belonging to multiple clusters.
Entropy: ⟨Nl⟩is supposed to correspond to a cluster capacity. The meanings of Sand SFE will be discussed in detail in the next subsection.
Temperature: Temperature is given in both cases
In the FCM, however, temperature is determined as a result of clustering.
.
Partition Function: The fact that data can belong to multiple clusters simultaneously causes the product over kfor FC.
Free Energy: Helmholtz free energy Fis given by -T(logΞ-αk∂logΞ/∂αk) in FC. Both Sand SFE equal -∂F/∂Tas expected from statistical physics.
Energy: The relationship E=F+TSor E=F+TSFEholds betweenE, F, Tand SorSFE.

4.4. Meanings of Fermi-Dirac distribution function and fuzzy entropy

In the entropy function (28) or (30) for the particle system, we can consider the first term to be the entropy of electrons and the second to be that of holes. In this case, the physical limitation that only one particle can occupy an energy level at a time results in the entropy that formulates the state in which an electron and a hole exist simultaneously and exchanging them makes no difference. Meanwhile, what correspond to electron and hole in fuzzy clustering are the probability of fuzzy event that a data belongs to a cluster and the probability of its complementary event, respectively.

Fig.2 shows a two-dimensional virtual cluster density distribution model. A lattice can have at most one data. Let Ml be the total number of lattices and ml be the number of lattices which have a data in it (marked by a black box). Then, the number of available divisions of data to lattices is denoted by

W=∏l‍Ml!ml!(Ml-ml)!E34

which, from S=kBlogW(the Gibbs entropy), gives the form similar to (30)[8]. By extremizingS, we have the most probable distribution like (25). In this case, as there is no distinction between data, only the numbers of black and white lattices constitute the entropy. Fuzzy entropy in (8), on the other hand, gives the amount of information of weather a data belongs to a fuzzy set (or cluster) or not, averaged over independent dataxk.

Changing a viewpoint, the stationary entropy values for the particle system seems to be a request for giving the stability against the perturbation with collisions between particles. In fuzzy clustering, data reconfiguration between clusters with the move of cluster centers or the change of cluster shapes is correspondent to this stability. Let us represent data density by⟨⟩. If data transfer from clusters Ca and Cb to Cc and Cd as a magnitude of membership function, the transition probability from {…,Ca,…,Cb,…} to {…,Cc,…,Cd,…} will be proportional to ⟨Ca⟩⟨Cb⟩(1-⟨Cc⟩)(1-⟨Cd⟩) because a data enters a vacant lattice. Similarly, the transition probability from {…,Cc,…,Cd,…} to {…,Ca,…,Cb,…} will be proportional to⟨Cc⟩⟨Cd⟩(1-⟨Ca⟩)(1-⟨Cb⟩). In the equilibrium state, the transitions exhibit balance (this is known as the principle of detailed balance[19]). This requires

⟨Ca⟩⟨Cb⟩(1-⟨Ca⟩)(1-⟨Cb⟩)=⟨Cc⟩⟨Cd⟩(1-⟨Cc⟩)(1-⟨Cd⟩).E35

As a result, if energy di is conserved before and after the transition, ⟨Ci⟩must have the form

⟨Ci⟩1-⟨Ci⟩=e-α-βdiE36

or Fermi-Dirac distribution

⟨Ci⟩=1eα+βdi+1,E37

where αand βare constants.

Consequently, the entropy like fuzzy entropy is statistically caused by the system that allows complementary states. Fuzzy clustering handles a data itself, while statistical mechanics handles a large number of particles and examines the change of macroscopic physical quantity. Then it is concluded that fuzzy clustering exists in the extreme of Fermi-Dirac statistics, or the Fermi-Dirac statistics includes fuzzy clustering conceptually.

Figure 1.
Simple lattice model of clusters. M1,M2,…represent clusters. Black and white box represent whether a data exists or not.

4.5. Tsallis entropy based FCM statistics

On the other hand, U~and S~ satisfy

S~-βU~=∑k=1n‍Z~1-q-11-q,E38

which leads to

∂S~∂U~=β.E39

Equation (38) makes it possible to regard β-1 as an artificial system temperature T[19]. Then, the free energy can be defined as

F~=U~-TS~=-1β∑k=1n‍Z~1-q-11-q.E40

U~can be derived from F~ as

U~=-∂∂β∑k=1n‍Z~1-q-11-q.E41

∂F~/∂v~i=0also gives

v~i=∑k=1n‍u~ikqxk∑k=1n‍u~ikq.E42

(41)

5. Deterministic annealing

The DA method is a deterministic variant of SA. DA characterizes the minimization problem of the cost function as the minimization of the Helmholtz free energy which depends on the temperature, and tracks its minimum while decreasing the temperature and thus it can deterministically optimize the cost function at each temperature.

According to the principle of minimal free energy in statistical mechanics, the minimum of the Helmholtz free energy determines the distribution at thermal equilibrium [19]. Thus, formulating the DA clustering as a minimization of (32) leads to ∂F/∂vi=0 at the current temperature, and gives (10) and (11) again. Desirable cluster centers are obtained by calculating (10) and (11) repeatedly.

In this chapter, we focus on application of DA to the Fermi-Dirac-like distribution function described in the Section 4.2.

5.1. Linear approximation of Fermi-Dirac distribution function

Figure 2.
The Fermi-Dirac distribution function f(x)and its linear approximation functionsg(x).

Figure 3.
Decreasing of extent of the Fermi-Dirac distribution function from xto xnew with decreasing the temperature from TtoTnew.

The Fermi-Dirac distribution function can be approximated by linear functions. That is, as shown in Fig.1, the Fermi-Dirac distribution function of the form:

f(x)=1eα+βx2+1E43

is approximated by the linear functions

g(x)=1.0x≤-α-1κ-κ2x-α2+12-α-1κ≤x≤-α+1κ,0.0-α+1κ≤xE44

whereκ=-αβ. g(x)satisfiesg(-α/β)=0.5, and requires αto be negative.

In Fig.2, Δx=x-xnewdenotes a reduction of the extent of distribution with decreasing the temperature from Tto Tnew (T>Tnew). The extent of distribution also narrows with increasingα. αnew(α<αnew) which satisfies g(0.5)α-g(0.5)αnew=Δxis obtained as

αnew=--α+-αβnew1β-1βnew2,E45

where β=1/Tandβnew=1/Tnew. Thus, taking that Tto the temperature at which previous DA was executed and Tnew to the next temperature, a covariance ofαk’s distribution is defined as

Δα=αnew-α.E46

(45)

5.2. Initial estimation of αk and annealing temperature

Before executing DA, it is very important to estimate the initial values of αks and the initial annealing temperature in advance.

From Fig.1, distances between a data point and cluster centers are averaged as

Lk=1c∑i=1c‍∥xk-vi∥,E47

and this gives

αk=-β(Lk)2.E48

With given initial clusters distributing wide enough, (47) overestimatesαk, so that αk needs to be adjusted by decreasing its value gradually.

Still more, Fig.1 gives the width of the Fermi-Dirac distribution function as wide as2(-α+1)/(-αβ), which must be equal to or smaller than that of data distribution (=2R). This condition leads to

2-α+1-αβ=2R.E49

As a result, the initial value of βor the initial annealing temperature is roughly determined as

β≃4R2 T≃R24.E50

(49)

5.3. Deterministic annealing algorithm

The DA algorithm for fuzzy clustering is given as follows:

InitializeTrateδ0Thigh(=1/βlow)T=Thigh(β=βlow)cαk
Calculate uik by (12).
Calculate vi by (11).
Compare a difference between the current objective value Jm=1=∑k=1n‍∑i=1c‍dikuik and that obtained at the previous iterationJ^. If ∥Jm=1-J^∥/Jm=1<δ0⋅T/Thigh is satisfied, then return. Otherwise decrease the temperature asT=T*Trate, and go back to2.

6. Combinatorial algorithm of deterministic and simulated annealing

6.1. Simulated annealing

The cost function for SA is

E(αk)=Jm=1+SFE+K∑k=1n‍∑i=1c‍uik-12,E51

where Kis a constant.

In order to optimize each αk by SA, its neighbor αknew (a displacement from the currentαk) is generated by assuming a normal distribution with a mean 0 and a covariance Δαkdefined in (45).

The SA’s initial temperature T0(=1/β0) is determined so as to make an acceptance probability becomes

exp-β0{E(αk)-E(αknew)}=0.5E52

(E(αk)-E(αknew)≥0)E53

By selecting αks to be optimized from the outside of a transition region in which the membership function changes form 0 to1, computational time of SA can be shortened. The boundary of the transition region can be easily obtained with the linear approximations of the Fermi-Dirac-like membership function. From Fig.1, data which have distances bigger than -αk/β from each cluster centers are selected.

6.2. Simulated annealing algorithm

The SA algorithm is stated as follows:

InitializeT0(=1/β0)TT0t1Δαkαk
Select data to be optimized, if necessary.
Calculate neighbors of current αks.
Apply the Metropolis algorithm to the selected αks using (50) as the objective function.
max<tT=T0/log(t+1)t2

6.3. Combinatorial algorithm of deterministic and simulated annealing

The DA and SA algorithms are combined as follows:

Initializeδ1l1max0max1max2
Execute the DA algorithm.
max=max0
Compare a difference between the current objective value eand that obtained at the previous iteratione^. If ∥e-e^∥/e<δ1 or max2<lis satisfied, then go to5. Otherwise incrementl, and go back to2.
max=max1

7. Experiments 1

Figure 4.
Experimental result 1. (Fuzzy clustering result using DASA. Big circles indicate centers of clusters.)

Figure 5.
Experimental result 2-1. (Fuzzy clustering result by DASA and single DA. “Experimental Data” are given data distributions. “Selected Data” are data selected for final SA by the selection rule. (1)~ (6a) and (6b) are results using DASA. (6c) and (6d) are results using single DA (success and failure, respectively). Data plotted on the xy plane show the cross sections of uik at 0.2 and 0.8.)

To demonstrate effectiveness of the proposed algorithm, numerical experiments were carried out. DASA’s results were compared with those of DA (single DA).

We setδ0=0.5, δ1=0.01, Trate=0.8, max0=500, max1=20000, andmax2=10. We also set Rin (48) to 350.0 for experimental data 1~3, and 250.0 for experimental data 4

These parameters have not been optimized particularly for experimental data.

.

In experiment 1, 11,479 data points were generated as ten equally sized normal distributions. Fig.4 shows a fuzzy clustering result by DASA. Single DA similarly clusters these data.

In experiment 2-1, three differently sized normal distributions consist of 2,249 data points in Fig.5 were used. Fig.5(0) shows initial clusters obtained by the initial estimation of αks and the annealing temperature. Fig.5(1)~ (6a) shows a fuzzy clustering process of DASA. At the high temperature in Fig.5(1), as described in 4.3, the membership functions were widely distributed and clusters to which a data belongs were fuzzy. However, with decreasing of the temperature (from Fig.5(2) to Fig.5(5)), the distribution became less and less fuzzy. After executing DA and SA alternately, the clusters in Fig.5(6a) were obtained. Then, data to be optimized by SA were selected by the criterion stated in the section 4, and SA was executed. The final result of DASA in Fig.5(6b) shows that data were desirably clustered. On the contrary, because of randomness of the initial cluster positions and hardness of good estimation of the initial αks, single DA becomes unstable, and sometimes gives satisfactory results as shown in Fig.5(6c) and sometimes not as shown in Fig.5(6d). By comparing Fig.5(6b) to (6c), it is found that, due to the optimization of αks by SA, the resultant cluster shapes of DASA are far less smooth than those of single DA.

Changes of the costs of DASA (Jm=1+SFEfor DA stage and (50) for SA stage (Kwas set to 1×1015 in (50)), respectively) are plotted as a function of iteration in Fig.6, and the both costs decreases with increasing iteration. In this experiment, the total iteration of SA stage was about12,500, while that of DA stage was only7. Accordingly, the amount of simulation time DASA was mostly consumed in SA stage.

Figure 6.
Experimental result 2-1. (Change of the cost of DASA as a function of iteration. Jm=1+SFEfor DA stage and Jm=1+SFE+K∑k=1n‍∑i=1c‍uik-12 for SA stage, respectively.)

In experiment 2-2, in order to examine effectiveness of SA introduced in DASA, experiment 2 was re-conducted ten times as in Table 1, where ratiolisted in the first row is a ratio of data optimized at SA stage. “UP” means to increase ratioas 1.0-1.0/twhere tis a number of execution times of SA stage. Also, “DOWN” means to decrease ratioas1.0/t. Results are judged “Success” or “Failure” from a human viewpoint

No close case was observed in this experiment.

. From Table 1, it is concluded that DASA always clusters the data properly if ratiois large enough (0.6<ratio), whereas, as listed in the last column, single DA succeeds by 50%.

DASA
						DA
ratio	0.3	0.6	1.0	UP	DOWN
Success	6	9	10	6	7	5
Failure	4	1	0	4	3	5

Table 2.

Experimental result 2-2. (Comparison of numbers of successes and failures of fuzzy clustering using DASA for ratio=0.3,0.6,1.0,1.0-1.0/t(UP), 1.0/t(DOWN) and single DA. (tis a number of execution times of SA stage))

Figure 7.
Experimental result 3. (Fuzzy clustering result of elliptic distributions using DASA. Data plotted on the xy plane show the cross sections of uik at 0.2 and 0.8.)

Figure 8.
Experimental result 4. (Fuzzy clustering result of horseshoe-shaped distributions using DASA. Data plotted on the xy plane show the cross sections of uik at 0.2 and 0.8.)

In experiments 3 and 4, two elliptic distributions consist of 2,024 data points, and two horseshoe-shaped distributions consist of 1,380 data points were used, respectively. Fig.5, 6 and 7 show DASA’s clustering results. It is found that DASA can cluster these data properly. In experiment 3, a percentage of success of DASA is 90%, though that of single DA is 50%. In experiment 4, a percentage of success of DASA is 80%, though that of single DA is 40%.

These experimental results demonstrate the advantage of DASA over single DA. Nevertheless, DASA suffers two disadvantages. First, it takes so long to execute SA repeatedly that, instead of (10), it might be better to use its linear approximation functions as the membership function. Second, since αks differ each other, it is difficult to interpolate them.

8. Experiments 2

8.1. Interpolation of membership function

DASA suffers a few disadvantages. First, it is not necessarily easy to interpolate αk oruik, since they differ each other. Second, it takes so long to execute SA repeatedly.

A simple solution for the first problem is to interpolate membership functions. Thus, the following step was added to the DASA algorithm.

When a new data is given, some neighboring membership functions are interpolated at its position.

Figure 9.
Experimantal data and membership functions obtained by DASA.(Data plotted on the xy plane show the cross sections of uik at 0.2 and 0.8)

To examine an effectiveness of interpolation, the proposed algorighm was applied to experimental data shown in Fig.9(a). For simplicity, the data were placed on rectangular grids on the xy plane.

First, some regional data were randomly selected from the data. Then, Initial and final memberhip functions obtained by DASA are shown in Figs.9(b) and (c) respectively.

After that, remaining data in the region were used as test data, and at each data point, they were interpolated by their four nearest neighboring membership values. Linear, bicubic and fractal interpolation methods were compared.

Prediction error of linear interpolation was 6.8%, and accuracy was not enough. Bicubic interpolation[18] also gave a poor result, because its depends on good estimated gradient values of neighboring points. Accordingly, in this case, fractal interpolation[17] is more suitable than smooth interpolation methods such as bicubic or spline interpolation, because the membership functions in Figs.9(c) are very rough.

The well-known InterpolatedFM (Fractal motion via interpolation and variable scaling) algorithm [17] was used in this experiment. Fractal dimension was estimated by the standard box-counting method [25]. Figs.10(a) and 3(b) represent both the membership functions and their interpolation values. Prediction error (averaged over 10 trials) of fractal interpolation was 2.2%, and a slightly better result was obtained.

Figure 10.
Plotted lines show the membership functions obtained by DASA. The functions are interpolated by the InterpolatedFM algorithm. Crosses show the interpolated data.

9. Conclusion

In this article, by combining the deterministic and simulated annealing methods, we proposed the new statistical mechanical fuzzy c-means clustering algorithm (DASA). Numerical experiments showed the effectiveness and the stability of DASA.

However, as stated at the end of Experiments, DASA has problems to be considered. In addition, a major problem of the fuzzy c-means methodologies is that they do not give a number of clusters by themselves. Thus, a method such as [28] which can determine a number of clusters automatically should be combined with DASA.

Our future works also include experiments and examinations of the properties of DASA, especially on an adjustment of its parameters, its annealing scheduling problem, and its applications for fuzzy modeling[29].

However, DASA has problems to be considered. One of them is that it is difficult to interpolate membership functions, since their values are quite different. Accordingly, the fractal interpolation method (InterpolationFM algorithm) is introduced to DASA and examined its effectiveness.

Our future works include experiments and examinations of the properties of DASA, a comparison of results of interpolation methods (linear, bicubic, spline, fractal and so on), an interpolation of higher dimensional data, an adjustment of DASA’s parameters, and DASA’s annealing scheduling problem.

References

1. E. Aarts and J. Korst, “Simulated Annealing and Boltzmann Machines”, Chichester: John Wiley & Sons, 1989.
2. J.C. Bezdek, “Pattern Recognition with Fuzzy Objective Function Algorithms”, New York: Prenum Press, 1981.
3. B.P.Buckles and F.E.Petry, “Information-theoretical characterization of fuzzy relational database”, IEEE Trans.Systems,Man and Cybernetics, 13174771983
4. BuhmannJ.Ku°hnelH.Vector“.quantizationwith.complexitycosts”. I. E. E. E.TransInformation Theory, 394113311431993
5. A.Corana, M.Marchesi, C.Martini, and S.Ridella, “Minimizing multimodal functions of continuous variables with the simulated annealing algorithm”, ACM Trans.On Mathematical Software, 1332622801987
6. A. DeLuca and S. Termini, “A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory”, Information and Control, vol.20, pp.301-312, 1972.
7. A.P.Dempster, N.M.Laird, and D.B.Rubin, ”Maximum likelihood from incomplete data via the EM algorithms”, Journal of Royal Stat.Soc., Series B, 391381977
8. W. Greiner, L. Neise, and H. St°ocker, “Thermodynamics and Statistical Mechanics”, New York: Springer-Verlag, 1995.
9. HofmannT.BuhmannJ.Pairwise“.dataclustering.bydeterministic.annealing,”I. E. E. E.TransPattern Analysis and Machine Intelligence, 191141997
10. S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, “Optimization by simulated annealing”, Science, vol.220, pp.671-680, 1983.
11. R.LiP.MukaidonoM.Maximum“. A.entropyapproach.tofuzzy.clustering”Proc.of the 4th IEEE Int. Conf. Fuzzy Systems(FUZZ-IEEE/IFES’95), 22272232
12. M. Menard, V. Courboulay, and P. Dardignac, “Possibilistic and probabilistic fuzzy clustering: unification within the framework of the non-extensive thermostatistics”, Pattern Recognition, vol.36, pp.1325–1342, 2003
13. MillerD.RaoA. V.RoseK.GershoA.global“. A.optimizationtechnique.forstatistical.classifierdesign”. I. E. E. E.TransSignal Processing, 44310831221996
14. MiyamotoS.MukaidonoM.Fuzzy“.asc-meansa.regularizationmaximumentropy.approach”Proc.of the 7th Int. Fuzzy Systems Association World Congress, vol.II, 86921997
15. N.R. Pal and J.C. Bezdek, “Measuring fuzzy uncertainty”, IEEE Trans. Fuzzy Systems, vol.2, no.2,1071181994
16. N.R. Pal, “On quantification of different facets of uncertainty”, Fuzzy Sets and Systems, vol.107, pp.81-91, 1999.
17. H.-O. Peitgen, et.al., The science of fractal images, Springer-Verlag, 1988
18. W.H. Press, S.A. Teukolsky, W.T. Vetteriling, and B.P. Flannery, Numerical Recipes in C++, Cambridge University Press, 2002.
19. L. E. Reichl, A Modern Course in Statistical Physics, New York: John Wiley & Sons, 1998.
20. K. Rose, E. Gurewitz, and B.C. Fox, “A deterministic annealing approach to clustering”, Pattern Recognition Letters, vol.11, no.9, pp.589-594, 1990.
21. RoseK.GurewitzE.FoxG. C.Constrained“.asclusteringan.optimizationmethod”. I. E. E. E.TransPattern Analysis and Machine Intelligence, 1587857941993
22. P.Siarry, “Enhanced simulated annealing for globally minimizing functions of many-continuous variables”, ACM Trans. on Mathematical Software, 2322092281997
23. D.Tran and M.Wagner, ”Fuzzy entropy clustering”, Proc.of the 9th IEEE Int. Conf. Fuzzy Systems(FUZZ-IEEE2000), 11521572000
24. C. Tsallis, Possible generalization of Boltzmann-Gibbs statistics, Journal of Statistical Phys., vol.52, pp.479-487, 1988.
25. VossR.Randomfractals.characterizationmeasurementPlenum.Press, 1986
26. P.R.Wang, “Continuous optimization by a variant of simulated annealing”, Computational Optimization and Applications, vol.6, pp.59-71, 1996.
27. YasudaM.FuruhashiT.OkumaS.Statisticalmechanical.analysisof.fuzzyclustering.basedon.fuzzyentropy. I. E. I. C. E.TransInformation and Systems, ED90-D68838882007
28. YasudaM.FuruhashiT.Fuzzyentropy.basedfuzzy.c-meansclustering.withdeterministic.simulatedannealing.methodsI. E. I. C. E.TransInformation ans Systems, ED92-D6123212392009
29. YasudaM.FuruhashiT.Statisticalmechanical.fuzzyc-means.clusteringwith.deterministicsimulatedannealing.methodsProc.of the Joint 3rd Int. Conf. on Soft Computing and Intelligent Systems, in CD-ROM, 2006
30. YasudaM.Entropybased.annealingapproach.tofuzzy.c-meansclustering.itsinterpolation.Procof the 8th Int. Conf. on Fuzzy Sysmtes and Knowledge Discovery, 4244282011
31. ZenzoS. D.CinqueL.Image“.thresholding.usingfuzzy.entropies”I. E. E. E.TransSystems, Man and Cybernetics-Part B, 28115231998

Notes

In the FCM, however, temperature is determined as a result of clustering.
These parameters have not been optimized particularly for experimental data.
No close case was observed in this experiment.

[1] 1. E. Aarts and J. Korst, “Simulated Annealing and Boltzmann Machines”, Chichester: John Wiley & Sons, 1989.

[2] 2. J.C. Bezdek, “Pattern Recognition with Fuzzy Objective Function Algorithms”, New York: Prenum Press, 1981.

[3] 3. B.P.Buckles and F.E.Petry, “Information-theoretical characterization of fuzzy relational database”, IEEE Trans.Systems,Man and Cybernetics, 13174771983

[4] 4. BuhmannJ.Ku°hnelH.Vector“.quantizationwith.complexitycosts”. I. E. E. E.TransInformation Theory, 394113311431993

[5] 5. A.Corana, M.Marchesi, C.Martini, and S.Ridella, “Minimizing multimodal functions of continuous variables with the simulated annealing algorithm”, ACM Trans.On Mathematical Software, 1332622801987

[6] 6. A. DeLuca and S. Termini, “A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory”, Information and Control, vol.20, pp.301-312, 1972.

[7] 7. A.P.Dempster, N.M.Laird, and D.B.Rubin, ”Maximum likelihood from incomplete data via the EM algorithms”, Journal of Royal Stat.Soc., Series B, 391381977

[8] 8. W. Greiner, L. Neise, and H. St°ocker, “Thermodynamics and Statistical Mechanics”, New York: Springer-Verlag, 1995.

[9] 9. HofmannT.BuhmannJ.Pairwise“.dataclustering.bydeterministic.annealing,”I. E. E. E.TransPattern Analysis and Machine Intelligence, 191141997

[10] 10. S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, “Optimization by simulated annealing”, Science, vol.220, pp.671-680, 1983.

[11] 11. R.LiP.MukaidonoM.Maximum“. A.entropyapproach.tofuzzy.clustering”Proc.of the 4th IEEE Int. Conf. Fuzzy Systems(FUZZ-IEEE/IFES’95), 22272232

[12] 12. M. Menard, V. Courboulay, and P. Dardignac, “Possibilistic and probabilistic fuzzy clustering: unification within the framework of the non-extensive thermostatistics”, Pattern Recognition, vol.36, pp.1325–1342, 2003

[13] 13. MillerD.RaoA. V.RoseK.GershoA.global“. A.optimizationtechnique.forstatistical.classifierdesign”. I. E. E. E.TransSignal Processing, 44310831221996

[14] 14. MiyamotoS.MukaidonoM.Fuzzy“.asc-meansa.regularizationmaximumentropy.approach”Proc.of the 7th Int. Fuzzy Systems Association World Congress, vol.II, 86921997

[15] 15. N.R. Pal and J.C. Bezdek, “Measuring fuzzy uncertainty”, IEEE Trans. Fuzzy Systems, vol.2, no.2,1071181994

[16] 16. N.R. Pal, “On quantification of different facets of uncertainty”, Fuzzy Sets and Systems, vol.107, pp.81-91, 1999.

[17] 17. H.-O. Peitgen, et.al., The science of fractal images, Springer-Verlag, 1988

[18] 18. W.H. Press, S.A. Teukolsky, W.T. Vetteriling, and B.P. Flannery, Numerical Recipes in C++, Cambridge University Press, 2002.

[19] 19. L. E. Reichl, A Modern Course in Statistical Physics, New York: John Wiley & Sons, 1998.

[20] 20. K. Rose, E. Gurewitz, and B.C. Fox, “A deterministic annealing approach to clustering”, Pattern Recognition Letters, vol.11, no.9, pp.589-594, 1990.

[21] 21. RoseK.GurewitzE.FoxG. C.Constrained“.asclusteringan.optimizationmethod”. I. E. E. E.TransPattern Analysis and Machine Intelligence, 1587857941993

[22] 22. P.Siarry, “Enhanced simulated annealing for globally minimizing functions of many-continuous variables”, ACM Trans. on Mathematical Software, 2322092281997

[23] 23. D.Tran and M.Wagner, ”Fuzzy entropy clustering”, Proc.of the 9th IEEE Int. Conf. Fuzzy Systems(FUZZ-IEEE2000), 11521572000

[24] 24. C. Tsallis, Possible generalization of Boltzmann-Gibbs statistics, Journal of Statistical Phys., vol.52, pp.479-487, 1988.

[25] 25. VossR.Randomfractals.characterizationmeasurementPlenum.Press, 1986

[26] 26. P.R.Wang, “Continuous optimization by a variant of simulated annealing”, Computational Optimization and Applications, vol.6, pp.59-71, 1996.

[27] 27. YasudaM.FuruhashiT.OkumaS.Statisticalmechanical.analysisof.fuzzyclustering.basedon.fuzzyentropy. I. E. I. C. E.TransInformation and Systems, ED90-D68838882007

[28] 28. YasudaM.FuruhashiT.Fuzzyentropy.basedfuzzy.c-meansclustering.withdeterministic.simulatedannealing.methodsI. E. I. C. E.TransInformation ans Systems, ED92-D6123212392009

[29] 29. YasudaM.FuruhashiT.Statisticalmechanical.fuzzyc-means.clusteringwith.deterministicsimulatedannealing.methodsProc.of the Joint 3rd Int. Conf. on Soft Computing and Intelligent Systems, in CD-ROM, 2006

[30] 30. YasudaM.Entropybased.annealingapproach.tofuzzy.c-meansclustering.itsinterpolation.Procof the 8th Int. Conf. on Fuzzy Sysmtes and Knowledge Discovery, 4244282011

[31] 31. ZenzoS. D.CinqueL.Image“.thresholding.usingfuzzy.entropies”I. E. E. E.TransSystems, Man and Cybernetics-Part B, 28115231998

Fuzzy c-Means Clustering, Entropy Maximization, and Deterministic and Simulated Annealing

Simulated Annealing - Advances, Applications and Hybridizations

Author Information

Makoto Yasuda