Fuzzy c-Means Clustering, Entropy Maximization, and Deterministic and Simulated Annealing

Many engineering problems can be formulated as optimization problems, and the deterministic annealing (DA) method [20] is known as an effective optimization method for such problems. DA is a deterministic variant of simulated annealing (SA) [1, 10]. The DA characterizes the minimization problem of cost functions as the minimization of Helmholtz free energywhich depends on a (pseudo) temperature, and tracks the minimum of free energy while decreasing temperature and thus it can deterministically optimize the function at a given temperature [20]. Hence, the DA is more efficient than the SA, but does not guarantee a global optimal solution. The study on the DA in [20] addressed avoidance of the poor local minima of cost function of data clustering. Then it was extensively applied to various subjects such as combinational optimization problems [21], vector quantization [4], classifier design [13], pairwise data clustering [9] and so on.


Introduction
Many engineering problems can be formulated as optimization problems, and the deterministic annealing (DA) method [20] is known as an effective optimization method for such problems. DA is a deterministic variant of simulated annealing (SA) [1,10]. The DA characterizes the minimization problem of cost functions as the minimization of Helmholtz free energy which depends on a (pseudo) temperature, and tracks the minimum of free energy while decreasing temperature and thus it can deterministically optimize the function at a given temperature [20]. Hence, the DA is more efficient than the SA, but does not guarantee a global optimal solution. The study on the DA in [20] addressed avoidance of the poor local minima of cost function of data clustering. Then it was extensively applied to various subjects such as combinational optimization problems [21], vector quantization [4], classifier design [13], pairwise data clustering [9] and so on.
On the other hand, clustering is a method which partitions a given set of data points into subgroups, and is one of major tools for data analysis. It is supposed that, in the real world, cluster boundaries are not so clear that fuzzy clustering is more suitable than crisp clustering.
Bezdek [2] proposed the fuzzy c-means (FCM) which is now well known as the standard technique for fuzzy clustering.
Then, after the work of Li et al. [11] which formulated the regularization of the FCM with Shannon entropy, Miyamoto et al. [14] discussed the FCM within the framework of the Shannon entropy based clustering. From the historical point of view, however, it should be noted that Rose et al. [20] first studied the statistical mechanical analogy of the FCM with the maximum entropy method, which was basically probabilistic clustering.
To measure the "indefiniteness" of fuzzy set, DeLuca and Termini [6] defined fuzzy entropy after Shannon. Afterwards, some similar measures from the wider viewpoints of the indefiniteness were proposed [15,16]. Fuzzy entropy has been used for knowledge retrieval from fuzzy database [3] and image processing [31], and proved to be useful.
Tsallis [24] achieved nonextensive extension of the Boltzmann-Gibbs statistics. Tsallis postulated a generalization form of entropy with a generalization parameter q, which, in a limit of q → 1 ,reaches the Shannon entropy. Later on, Menard et.al. [12] derived a membership function by regularizing FCM with the Tsallis entropy.
In this chapter, by maximizing the various entropies within the framework of FCM, the membership functions which take the familiar forms of the statitical mechanical distribution functions are derived. The advantage to use the statistical mechanical membership functions is that the fuzzy c-means clustering can be interpreted and analyzed from a statistical mechanical point of view [27,28] After that, we focus on the Fermi-Dirac like membership function, because, as compared to the Maxwell-Boltzmann-like membership function, the Fermi-Dirac-like membership function has extra parameters α k s (α k corresponds to a chemical potential in statistical mechanics [19], and k denotes a data point), which make it possible to represent various cluster shapes like former clustering methods based on such as the Gaussian mixture [7], and the degree of fuzzy entropy [23]. α k s strongly affect clustering results and they must be optimized under a normalization constraint of FCM. On the other hand, the DA method, though it is efficient, does not give appropriate values of α k s by itself and the DA clustering sometimes fails if α k s are improperly given. Accordingly, we introduce SA to optimize α k s because, as pointed above, both of DA and SA contain the parameter corresponding to the system temperature and can be naturally combined as DASA.
Nevertheless, this approach causes a few problems. (1)How to estimate the initial values of α k s under the normalization constraint ? (2)How to estimate the initial annealing temperature? (3)SA must optimize a real number α k [5,26]. (4)SA must optimize many α k s [22].
Linear approximations of the Fermi-Dirac-like membership function is useful in guessing the initial α k s and the initial annealing temperature of DA.
In order to perform SA in a many variables domain, α k s to be optimized are selected according to a selection rule. In an early annealing stages, most α k s are optimized. In a final annealing stage, however, only α k s of data which locate sufficiently away from all cluster centers are optimized because their memberships might be fuzzy. Distances between the data and the cluster centers are measured by using linear approximations of the Fermi-Dirac-like membership function.
However, DASA suffers a few disadvantages. One of them is that it is not necessarily easy to interpolate membership functions obtained by DASA, since their values are quite different each other. The fractal interpolation method [17] is suitable for these rough functions [30].
Numerical experiments show that DASA clusters data which distribute in various shapes more properly and stably than single DA. Also, the effectiveness of the fractal interpolation is examined.

Fuzzy c-means
) be the centers of clusters and u ik ∈ [0, 1](i = 1, . . . , c; k = 1, . . . , n) be the membership function. Also let be the objective function of the FCM where d ik = x k − v i 2 . In the FCM, under the normalization constraint of the Lagrange function L FC M is given by where η k is the Lagrange multiplier.
Bezdek [2] showed that the FCM approaches crisp clustering as m decreases to +1.

Shannon entropy maximization
First, we introduce the Shannon entropy into the FCM clustering. The Shannon entropy is given by Under the normalization constraint and setting m to 1, the fuzzy entropy functional is given by where α k and β are the Lagrange multipliers, and α k must be determined so as to satisfy Eq.
(2). The stationary condition for Eq. (5) leads to the following membership function and the cluster centers

Fuzzy entropy maximization
We then introduce the fuzzy entropy into the FCM clustering.
The fuzzy entropy is given bŷ The fuzzy entropy functional is given by where α k and β are the Lagrange multipliers [28]. The stationary condition for Eq. (9) leads to the following membership functionû and the cluster centers In Eq. (10), β defines the extent of the distribution [27]. Equation (10) is formally normalized asû

Tsallis entropy maximization
Letṽ i andũ ik be the centers of clusters and the membership functions, respectively.
The Tsallis entropy is defined asS where q ∈ R is any real number. The objective function is rewritten as Accordingly, the Tsallis entropy functional is given by The stationary condition for Eq. (15) yields to the following membership functioñ In this case, the cluster centers are given bỹ In the limit of q → 1, the Tsallis entropy recovers the Shannon entropy [24] andũ ik approaches u ik in Eq.(6).

Shannon entropy based FCM statistics
In the Shannon entropy based FCM, the sum of the states (the partition function) for the grand canonical ensemble of fuzzy clustering can be written as By substituting Eq. (19) for [19], the free energy becomes Stable thermal equilibrium requires a minimization of the free energy. By formulating deterministic annealing as a minimization of the free energy, This cluster center is the same as that in Eq. (7).

Fuzzy entropy based FCM statistics
In a group of independent particles, the total energy and the total number of particles are given by E = ∑ l l n l and N = ∑ l n l , respectively, where l represents the energy level and n l represents the number of particles that occupy l . We can write the sum of states, or the partition function, in the form: where β is the product of the inverse of temperature T and k B (Boltzmann constant). However, it is difficult to take the sums in (22) counting up all possible divisions. Accordingly, we make the number of particles n l a variable, and adjust the new parameter α(chemical potential) so as to make ∑ l l n l = E and ∑ l n l = N are satisfied. Hence, this becomes the grand canonical distribution, and the sum of states (the grand partition function) Ξ is given by [8,19] For particles governed by the Fermi-Dirac distribution, Ξ can be rewritten as Also, n l is averaged as where α is defined by the condition that N = ∑ l n l [19]. Helmholtz free energy F is, from the relationship F = −k B T log Z N , Taking that into account, the entropy S = (E − F)/T has the form If states are degenerated to the degree of ν l , the number of particles which occupy l is and we can rewrite the entropy S as which is similar to fuzzy entropy in (8). As a result, u ik corresponds to a grain density n l and the inverse of β in (10) represents the system or computational temperature T.
In the FCM clustering, note that any data can belong to any cluster, the grand partition function can be written as which, from the relationship gives the Helmholtz free energy The inverse of β represents the system or computational temperature T.

Correspondence between Fermi-Dirac statistics and fuzzy clustering
In the previous subsection, we have formulated the fuzzy entropy regularized FCM as the DA clustering and showed that its mechanics was no other than the statistics of a particle system (the Fermi-Dirac statistics). The correspondences between fuzzy clustering (FC) and the Fermi-Dirac statistics (FD) are summarized in TABLE 1. The major difference between fuzzy clustering and statistical mechanics is the fact that data are distinguishable and can belong to multiple clusters, though particles which occupy a same energy state are not distinguishable. This causes a summation or a multiplication not only on i but on k as well in fuzzy clustering. Thus, fuzzy clustering and statistical mechanics described in this paper are not mathematically equivalent.
• Constraints: (a) Constraint that the sum of all particles N is fixed in FD is correspondent with the normalization constraint in FC. Energy level l is equivalent to the cluster number i. In addition, the fact that data can belong to multiple clusters leads to the summation on k. (b) There is no constraint in FC which corresponds to the constraint that the total energy equals E in FD. We have to minimize ∑ n k=1 ∑ c i=1 d ik u ik in FC. • Distribution Function: In FD, n l gives an average particle number which occupies energy level l, because particles can not be distinguished from each other. In FC, however, data are distinguishable, and for that reason, u ik gives a probability of data belonging to multiple clusters.
• Entropy: N l is supposed to correspond to a cluster capacity. The meanings of S and S FE will be discussed in detail in the next subsection.
• Temperature: Temperature is given in both cases 1 .
• Partition Function: The fact that data can belong to multiple clusters simultaneously causes the product over k for FC.
• Free Energy: Helmholtz free energy F is given by −T(log Ξ − α k ∂ log Ξ/∂α k ) in FC. Both S and S FE equal −∂F/∂T as expected from statistical physics.
• Energy: The relationship E = F + TS or E = F + TS FE holds between E, F, T and S or S FE .

Meanings of Fermi-Dirac distribution function and fuzzy entropy
In the entropy function (28) or (30) for the particle system, we can consider the first term to be the entropy of electrons and the second to be that of holes. In this case, the physical limitation that only one particle can occupy an energy level at a time results in the entropy that formulates the state in which an electron and a hole exist simultaneously and exchanging them makes no difference. Meanwhile, what correspond to electron and hole in fuzzy clustering are the probability of fuzzy event that a data belongs to a cluster and the probability of its complementary event, respectively.  a black box). Then, the number of available divisions of data to lattices is denoted by which, from S = k B log W (the Gibbs entropy), gives the form similar to (30) [8]. By extremizing S, we have the most probable distribution like (25). In this case, as there is no distinction between data, only the numbers of black and white lattices constitute the entropy. Fuzzy entropy in (8), on the other hand, gives the amount of information of weather a data belongs to a fuzzy set (or cluster) or not, averaged over independent data x k .
Changing a viewpoint, the stationary entropy values for the particle system seems to be a request for giving the stability against the perturbation with collisions between particles. In fuzzy clustering, data reconfiguration between clusters with the move of cluster centers or the change of cluster shapes is correspondent to this stability. Let us represent data density by . If data transfer from clusters C a and C b to C c and C d as a magnitude of membership function, the transition probability from {. . . , C a , . . . , In the equilibrium state, the transitions exhibit balance (this is known as the principle of detailed balance [19]). This requires As a result, if energy d i is conserved before and after the transition, C i must have the form or Fermi-Dirac distribution where α and β are constants.
Consequently, the entropy like fuzzy entropy is statistically caused by the system that allows complementary states. Fuzzy clustering handles a data itself, while statistical mechanics handles a large number of particles and examines the change of macroscopic physical quantity.
Then it is concluded that fuzzy clustering exists in the extreme of Fermi-Dirac statistics, or the Fermi-Dirac statistics includes fuzzy clustering conceptually.

Tsallis entropy based FCM statistics
On the other hand,Ũ andS satisfyS Equation (38) makes it possible to regard β −1 as an artificial system temperature T [19]. Then, the free energy can be defined as U can be derived fromF asŨ (41)

Deterministic annealing
The DA method is a deterministic variant of SA. DA characterizes the minimization problem of the cost function as the minimization of the Helmholtz free energy which depends on the temperature, and tracks its minimum while decreasing the temperature and thus it can deterministically optimize the cost function at each temperature.
According to the principle of minimal free energy in statistical mechanics, the minimum of the Helmholtz free energy determines the distribution at thermal equilibrium [19]. Thus, formulating the DA clustering as a minimization of (32) leads to ∂F/∂v i = 0 at the current temperature, and gives (10) and (11) again. Desirable cluster centers are obtained by calculating (10) and (11) repeatedly.
In this chapter, we focus on application of DA to the Fermi-Dirac-like distribution function described in the Section 4.2.

Linear approximation of Fermi-Dirac distribution function
The Fermi-Dirac distribution function can be approximated by linear functions. That is, as shown in Fig.1, the Fermi-Dirac distribution function of the form:
In Fig.2, Δx = x − x new denotes a reduction of the extent of distribution with decreasing the temperature from T to T new (T > T new ). The extent of distribution also narrows with increasing α. α new (α < α new ) which satisfies g(0.5) α − g(0.5) α new = Δx is obtained as where β = 1/T and β new = 1/T new . Thus, taking that T to the temperature at which previous DA was executed and T new to the next temperature, a covariance of α k 's distribution is defined as

Initial estimation of α k and annealing temperature
Before executing DA, it is very important to estimate the initial values of α k s and the initial annealing temperature in advance.
From Fig.1, distances between a data point and cluster centers are averaged as and this gives With given initial clusters distributing wide enough, (47) overestimates α k , so that α k needs to be adjusted by decreasing its value gradually.
Still more, Fig.1 gives the width of the Fermi-Dirac distribution function as wide as 2(−α + 1)/( −αβ), which must be equal to or smaller than that of data distribution (=2R). This condition leads to As a result, the initial value of β or the initial annealing temperature is roughly determined as

Deterministic annealing algorithm
The DA algorithm for fuzzy clustering is given as follows: 1 Initialize: Set a rate at which a temperature is lowered T rate , and a threshold of convergence test δ 0 . Calculate an initial temperature T high (= 1/β low ) by (49) and set a current temperature T = T high (β = β low ). Place c clusters randomly and estimate initial α k s by (47) and adjust them to satisfy the normalization constraint (2). 2 Calculate u ik by (12). Calculate v i by (11). 4 Compare a difference between the current objective value J m=1 = ∑ n k=1 ∑ c i=1 d ik u ik and that obtained at the previous iterationĴ. If J m=1 −Ĵ /J m=1 < δ 0 · T/T high is satisfied, then return. Otherwise decrease the temperature as T = T * T rate , and go back to 2.

Simulated annealing
The cost function for SA is where K is a constant.
In order to optimize each α k by SA, its neighbor α new k (a displacement from the current α k ) is generated by assuming a normal distribution with a mean 0 and a covariance Δα k defined in (45).
The SA's initial temperature T 0 (= 1/β 0 ) is determined so as to make an acceptance probability becomes By selecting α k s to be optimized from the outside of a transition region in which the membership function changes form 0 to 1, computational time of SA can be shortened. The boundary of the transition region can be easily obtained with the linear approximations of the Fermi-Dirac-like membership function. From Fig.1, data which have distances bigger than −α k /β from each cluster centers are selected.

Simulated annealing algorithm
The SA algorithm is stated as follows: 1 Initialize: Calculate an initial temperature T 0 (= 1/β 0 ) from (51). Set a current temperature T to T 0 . Set an iteration count t to 1. Calculate a covariance Δα k for each α k by (45). 2 Select data to be optimized, if necessary.
3 Calculate neighbors of current α k s. 4 Apply the Metropolis algorithm to the selected α k s using (50) as the objective function. 5 If max < t is satisfied, then return. Otherwise decrease the temperature as T = T 0 / log(t + 1), increment t, and go back to 2.

Combinatorial algorithm of deterministic and simulated annealing
The DA and SA algorithms are combined as follows: 1 Initialize: Set a threshold of convergence test δ 1 , and an iteration count l to 1. Set maximum iteration counts max 0 , max 1 , and max 2 . 2 Execute the DA algorithm. 3 Set max = max 0 , and execute the SA algorithm. 4 Compare a difference between the current objective value e and that obtained at the previous iterationê. If e −ê /e < δ 1 or max 2 < l is satisfied, then go to 5. Otherwise increment l, and go back to 2. 5 Set max = max 1 , and execute the SA algorithm finally, and then stop.
In experiment 1, 11,479 data points were generated as ten equally sized normal distributions. Fig.4 shows a fuzzy clustering result by DASA. Single DA similarly clusters these data.
In experiment 2-1, three differently sized normal distributions consist of 2,249 data points in Fig.5-1 were used. Fig.5-1(0) shows initial clusters obtained by the initial estimation of α k s and the annealing temperature. Fig.5-1(1)∼(6a) shows a fuzzy clustering process of DASA. At the high temperature in Fig.5-1(1), as described in 4.3, the membership functions were widely distributed and clusters to which a data belongs were fuzzy. However, with decreasing of the temperature (from Fig.5-1(2) to Fig.5-1(5)), the distribution became less and less fuzzy. After executing DA and SA alternately, the clusters in Fig.5-1(6a) were obtained. Then, data to be optimized by SA were selected by the criterion stated in the section 4, and SA was executed. The final result of DASA in Fig.5-1(6b) shows that data were desirably clustered. On the contrary, because of randomness of the initial cluster positions and hardness of good estimation of the initial α k s, single DA becomes unstable, and sometimes gives satisfactory  results as shown in Fig.5-1(6c) and sometimes not as shown in Fig.5-1(6d). By comparing Fig.5-1(6b) to (6c), it is found that, due to the optimization of α k s by SA, the resultant cluster shapes of DASA are far less smooth than those of single DA.
Changes of the costs of DASA (J m=1 + S FE for DA stage and (50) for SA stage (K was set to 1 × 10 15 in (50)), respectively) are plotted as a function of iteration in Fig.5-2, and the both costs decreases with increasing iteration. In this experiment, the total iteration of SA stage was about 12, 500, while that of DA stage was only 7. Accordingly, the amount of simulation time DASA was mostly consumed in SA stage.
In experiment 2-2, in order to examine effectiveness of SA introduced in DASA, experiment 2 was re-conducted ten times as in Table 1, where ratio listed in the first row is a ratio of data optimized at SA stage. "UP" means to increase ratio as 1.0 − 1.0/t where t is a number of execution times of SA stage. Also, "DOWN" means to decrease ratio as 1.0/t. Results are judged "Success" or "Failure" from a human viewpoint 3 . From Table 1, it is concluded that DASA always clusters the data properly if ratio is large enough (0.6 < ratio), whereas, as listed in the last column, single DA succeeds by 50%. In experiments 3 and 4, two elliptic distributions consist of 2,024 data points, and two horseshoe-shaped distributions consist of 1,380 data points were used, respectively. Fig.5 and  6 show DASA's clustering results. It is found that DASA can cluster these data properly. In experiment 3, a percentage of success of DASA is 90%, though that of single DA is 50%. In experiment 4, a percentage of success of DASA is 80%, though that of single DA is 40%. (a)Experimental data.  To examine an effectiveness of interpolation, the proposed algorighm was applied to experimental data shown in Fig.8(a). For simplicity, the data were placed on rectangular grids on the xy plane.
First, some regional data were randomly selected from the data. Then, Initial and final memberhip functions obtained by DASA are shown in Figs.8(b) and (c) respectively.
After that, remaining data in the region were used as test data, and at each data point, they were interpolated by their four nearest neighboring membership values. Linear, bicubic and fractal interpolation methods were compared.
Prediction error of linear interpolation was 6.8%, and accuracy was not enough. Bicubic interpolation [18] also gave a poor result, because its depends on good estimated gradient values of neighboring points. Accordingly, in this case, fractal interpolation [17] is more suitable than smooth interpolation methods such as bicubic or spline interpolation, because the membership functions in Figs.8(c) are very rough.
The well-known InterpolatedFM (Fractal motion via interpolation and variable scaling) algorithm [17] was used in this experiment. Fractal dimension was estimated by the standard box-counting method [25]. Figs.9(a) and 3(b) represent both the membership functions and their interpolation values. Prediction error (averaged over 10 trials) of fractal interpolation was 2.2%, and a slightly better result was obtained. [y] [u ]

Conclusion
In this article, by combining the deterministic and simulated annealing methods, we proposed the new statistical mechanical fuzzy c-means clustering algorithm (DASA). Numerical experiments showed the effectiveness and the stability of DASA.
However, as stated at the end of Experiments, DASA has problems to be considered. In addition, a major problem of the fuzzy c-means methodologies is that they do not give a number of clusters by themselves. Thus, a method such as [28] which can determine a number of clusters automatically should be combined with DASA.
Our future works also include experiments and examinations of the properties of DASA, especially on an adjustment of its parameters, its annealing scheduling problem, and its applications for fuzzy modeling [29].
However, DASA has problems to be considered. One of them is that it is difficult to interpolate membership functions, since their values are quite different. Accordingly, the fractal interpolation method (InterpolationFM algorithm) is introduced to DASA and examined its effectiveness.
Our future works include experiments and examinations of the properties of DASA, a comparison of results of interpolation methods (linear, bicubic, spline, fractal and so on), an interpolation of higher dimensional data, an adjustment of DASA's parameters, and DASA's annealing scheduling problem.

Makoto Yasuda
Gifu National College of Technology, Japan