Deterministic Annealing: A Variant of Simulated Annealing and its Application to Fuzzy Clustering

Deterministic annealing (DA) is a deterministic variant of simulated annealing. In this chapter, after briefly introducing DA, we explain how DA is combined with fuzzy c-means (FCM) clustering by employing the entropy maximization method, especially the Tsallis entropy maximization. The Tsallis entropy is a q-parameter extension of the Shannon entropy. We then focus on Tsallis-entropy-maximized FCM (Tsallis-DAFCM), and examine the effects of cooling functions for DA on accuracy and convergence. The shape of the membership function of Tsallis-DAFCM depends on both the system temperature and q. Accordingly, the relationship between the temperature and q is quantitatively investigated.


Introduction
Statistical mechanics investigates the macroscopic properties of a physical system consisting of many elements. Recently, there has been great research activity in applying statistical mechanical models or tools to information engineering problems. The entropy maximization method applied to fuzzy c-means clustering is a good example of such models. Simulated annealing (SA) [1,2] is one of the most commonly used optimization techniques and plays an important role in the field of engineering, because many engineering problems can be formulated as optimization problems. SA is a stochastic relaxation method that treats an objective function as a system energy and, by analogy with the annealing process of solids, searches for its minimum while decreasing the system temperature. SA searches randomly at high temperatures, but more deterministically at low temperatures. As long as a neighborhood can be defined and the temperature is lowered sufficiently slowly, SA is a global optimization technique for solving optimization problems. However, it requires a very long time to find a globally optimal solution because of its stochastic search at each temperature. Thus, SA is an approximation method in practice.

Deterministic annealing (DA) is a deterministic variant of SA, first proposed by Rose et al. [3] for a vector quantization algorithm. DA characterizes the minimization problem of the objective function as the minimization of the free energy of the system, which depends on the temperature. DA tracks the minimum of the free energy while decreasing the temperature. Thus, it can deterministically optimize the objective function at each temperature, and is more efficient than SA, but it does not guarantee the optimal solution. In addition, the effect of the cooling function on the quality of a DA solution is still unclear.
Membership functions of fuzzy c-means (FCM) clustering [4] maximized or regularized with entropy [5,6] have forms similar to the distribution functions that appear in statistical mechanics. For example, the membership function obtained by the Shannon entropy maximization has a form similar to the Boltzmann-Gibbs distribution function [3,5]. Similarly, the membership function obtained by the fuzzy entropy maximization [7] has a form similar to the Fermi-Dirac distribution function [8]. Annealing methods are applicable to these membership functions because they contain a parameter that can be identified with the system temperature. The advantage of applying the entropy maximization methods to FCM is that fuzzy clustering can be analyzed from both information processing and statistical physical points of view.
Tsallis [9], by extending Boltzmann-Gibbs statistics nonextensively with a generalization parameter q, postulated a generalized formulation of entropy. This entropy is now well known as the Tsallis entropy, which, in the limit q → 1, approaches the Shannon entropy. The Tsallis entropy has been applied in numerous fields, including physics, bioscience, chemistry, networks, and computer science [10-12]. Menard et al. [13,14] investigated fuzzy clustering in the framework of nonextensive thermostatistics, and derived the possibilistic membership function by taking the possibilistic constraint into account.
On the other hand, based on the Tsallis entropy, Yasuda [15] defined another form of entropy for FCM, and then derived the membership function by maximizing this entropy within FCM [15]. After that, by combining the membership function with DA, a new fuzzy clustering algorithm (the Tsallis-DAFCM algorithm) was developed. Tsallis-DAFCM has been shown to yield superior results in comparison with the conventional annealing methods.
Similarly to SA, the performance of Tsallis-DAFCM strongly depends on how the temperature is decreased. Among the cooling functions for SA, the very fast annealing (VFA) method decreases the temperature fastest. Thus, VFA has been applied to Tsallis-DAFCM to improve its performance, and has proved effective [16].
In spite of its performance, it remains unknown how an appropriate q value and initial annealing temperature T_high for Tsallis-DAFCM should be determined according to the data distribution.
One of the important characteristics of the membership function of Tsallis-DAFCM is that the centers of clusters are given as a function of the data weighted by the membership function to the power of q (u_ik^q). Furthermore, the membership function changes its shape in a similar way by decreasing the temperature or by increasing q. In order to reveal the relationship between the temperature and q, the membership function is quantitatively analyzed using the Iris Data Set [17]. The result shows that the temperature and q affect its shape almost inversely, suggesting that a q-incrementation algorithm is possible. This algorithm might be a solution to the initial-value problem of Tsallis-DAFCM [18,19].
This chapter is composed of five sections. Section 1 is this introduction. In Section 2, how DA works is explained in general terms, and its applications are summarized. In Section 3, each of the components of the entropy-maximized FCM clustering methods is explained. In Section 4, VFA is used as the cooling function of Tsallis-DAFCM, and its effects are experimentally investigated. Effects of the temperature and q values on the membership function of Tsallis-DAFCM are also examined. Section 5 gives the conclusion of this chapter.

Deterministic annealing
Vector quantization is a classification method for large data sets, which is widely used in the field of image compression. The Linde-Buzo-Gray [20] and self-organizing feature map [21] algorithms, for example, are well-known vector quantization algorithms. However, the classification results of these algorithms depend on the initial settings of the reference vectors, and can easily fall into local minima.
In order to overcome this problem, by analogy with statistical mechanics, the DA method was proposed. DA does not suffer from the initial-vector problem, and it usually performs better than other methods. However, it does not theoretically guarantee a globally optimal solution. Furthermore, its performance depends on the way the system temperature is decreased.
This section gives a brief introduction to the deterministic annealing method. In Section 2.1, the definition and major characteristics of DA are explained. In Section 2.2, DA is compared with SA. In Section 2.3, applications and modifications of DA are summarized.

Major characteristics of deterministic annealing
In DA, by analogy with statistical mechanics, the free energy F is derived from the objective function J of a problem. At a high temperature, F represents the global structure of J. As the temperature decreases, F gradually approaches J.
Based on these characteristics, at a high temperature, DA is able to find the global minimum of F by the steepest descent method, because F is smooth there and is not expected to have multiple local minima. When the temperature is lowered a little, F changes its shape only a little. Accordingly, by setting the previous global minimum as the initial value of the steepest descent method, DA searches for the next global minimum. This procedure continues until the temperature is lowered sufficiently and F reaches J. Consequently, at each temperature, DA searches for the local minimum of J deterministically.
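The procedure above can be sketched as a generic loop. The function name, the learning rate, and the geometric cooling schedule below are illustrative assumptions rather than choices taken from the chapter; the caller supplies the gradient of the free energy.

```python
import numpy as np

def deterministic_annealing(free_energy_grad, x0, t_high=100.0, t_low=0.01,
                            rate=0.9, lr=0.05, inner_steps=200):
    """Generic DA sketch: track the minimum of the free energy F by
    steepest descent while the temperature is lowered geometrically.
    `free_energy_grad(x, t)` must return dF/dx at temperature t."""
    x = np.asarray(x0, dtype=float)
    t = t_high
    while t > t_low:
        # Steepest descent at the current temperature; the previous
        # minimum serves as the initial value (the key idea of DA).
        for _ in range(inner_steps):
            x = x - lr * free_energy_grad(x, t)
        t *= rate  # lower the temperature a little
    return x
```

For a convex toy free energy the loop simply tracks the unique minimum; the interesting behavior appears when F deforms from a smooth surface into the multi-minimum J as t falls.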

Comparison with simulated annealing
While decreasing the temperature, SA searches for the minimum stochastically at each temperature and thus requires a very long time to find an optimal solution. Hence, though it is theoretically guaranteed to find the optimal solution, SA is practically an approximation method.
In contrast, DA consumes less computational time because it searches for the local minimum deterministically at each temperature. It should be noticed, however, that in case multiple local minima exist at some temperature, DA might not be able to find the global one. For this reason, even theoretically, DA is not guaranteed to find the optimal solution.
Approaches to speeding up SA are mainly based on improvements of the transition method and of the cooling function, including their parallelization. For example, adaptive SA (ASA) [22], which may belong to both categories, is an implementation of very fast simulated re-annealing (VFSA) [23]. Compared with the acceleration method called fast annealing (FA) [24], in which the temperature is lowered inversely linearly with the number of iterations, ASA is faster.
In addition, among the many features included in ASA, it can gain further speed-up by simulated quenching.
For DA, it seems that no comprehensive studies on this topic have been conducted.
The summary of comparison is shown in Table 1.

Search strategy. SA: stochastic search based on the Metropolis algorithm. DA: deterministic search based on the steepest descent algorithm.

Cooling function. SA: two categories of cooling functions are widely used: (1) functions based on statistical analysis, and (2) adaptive functions depending on the problem. DA: the cooling functions that appear in SA are used empirically.

Optimality. SA: the global minimum can be achieved if the temperature is decreased as slowly as T ∝ 1/log(number of iterations). DA: not guaranteed.

Table 1. Comparison of SA and DA.

Applications and modifications of deterministic annealing
The study of DA first addressed the avoidance of poor local minima in data clustering [25].
In order to cluster large amounts of data, research activities that attempt to parallelize DA have become popular. Kim et al. [32] discussed a parallelization method for DA using GPUs, and applied it to color image segmentation. In order to cluster large amounts of bioscientific data, Fox et al. [33,34] parallelized DA using MPI. Qiu et al. [35] compared DA's performance using the C# messaging runtime library CCR with that using MPI.

Application of deterministic annealing to fuzzy c-means clustering maximized with entropy
One of the important applications of DA is fuzzy clustering. In this section, we focus on fuzzy c-means (FCM) clustering. By maximizing various entropies under the FCM objective function, membership functions similar to the statistical mechanical distribution functions are obtained. These membership functions can easily be combined with DA, because they contain a parameter corresponding to the system temperature.
In this section, we first outline the formulation of FCM. Then, we describe how to apply the entropy maximization methods to FCM. In Section 3.1, the classical FCM clustering method is introduced. In Section 3.2, various entropy maximization methods are explained. The free energies for each entropy maximization method are derived in Section 3.3. In Section 3.4, representative annealing functions are presented.

Fuzzy c-means clustering
Let X = {x_1, ⋯, x_n} (x_k = (x_k1, ⋯, x_kp) ∈ R^p) be a data set in p-dimensional real space, which is to be divided into c clusters. Let V = {v_1, ⋯, v_c} (v_i ∈ R^p) be the centers of the clusters, and let u_ik ∈ [0, 1] (i = 1, ⋯, c; k = 1, ⋯, n) be the membership functions. Then, the objective function of FCM to be minimized is

J = Σ_{k=1}^n Σ_{i=1}^c (u_ik)^m d_ik,  d_ik = ||x_k − v_i||²  (m > 1). (1)

Under the normalization constraint

Σ_{i=1}^c u_ik = 1  (k = 1, ⋯, n), (2)

the Lagrange function L is given by

L = J − Σ_{k=1}^n η_k (Σ_{i=1}^c u_ik − 1), (3)

where η_k denotes the Lagrange multiplier. ∂L/∂u_ik = 0 gives the membership function of the form

u_ik = 1 / Σ_{j=1}^c (d_ik/d_jk)^{1/(m−1)}. (4)

Similarly, v_i can be determined by ∂L/∂v_i = 0 as follows:

v_i = Σ_{k=1}^n (u_ik)^m x_k / Σ_{k=1}^n (u_ik)^m. (5)

Desirable cluster centers can be obtained by calculating Eq. (4) and Eq. (5) repeatedly.
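The alternating updates of Eq. (4) and Eq. (5) can be sketched in a few lines of NumPy. The function name, the initialization from random data points, and the small constant guarding against zero distances are illustrative choices, not part of the original formulation.

```python
import numpy as np

def fcm(x, c, m=2.0, n_iter=100, v0=None, seed=0):
    """Minimal FCM sketch: alternate the membership update of Eq. (4)
    and the center update of Eq. (5). Rows of u index data points k,
    columns index clusters i."""
    rng = np.random.default_rng(seed)
    v = np.array(v0, dtype=float) if v0 is not None \
        else x[rng.choice(len(x), c, replace=False)]
    for _ in range(n_iter):
        d = ((x[:, None, :] - v[None, :, :]) ** 2).sum(-1) + 1e-12  # d_ik
        u = d ** (-1.0 / (m - 1.0))          # Eq. (4), up to normalization
        u /= u.sum(axis=1, keepdims=True)    # enforce Eq. (2)
        w = u ** m                           # fuzzified weights u_ik^m
        v = (w.T @ x) / w.sum(axis=0)[:, None]  # Eq. (5)
    return u, v
```

Passing explicit initial centers via `v0` makes the sketch deterministic, which is convenient for experiments; in practice random restarts are common because FCM, like vector quantization, can fall into local minima.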

Shannon entropy maximization
The Shannon entropy for FCM takes the form

S = −Σ_{k=1}^n Σ_{i=1}^c u_ik log u_ik. (6)

By setting m to 1 in Eq. (1), under the normalization constraint of Eq. (2), the Shannon entropy functional is given by

δS − Σ_{k=1}^n α_k δ(Σ_{i=1}^c u_ik − 1) − β δJ = 0, (7)

where α_k and β denote the Lagrange multipliers. The stationary condition for Eq. (7) leads to the following Gaussian membership function:

u_ik = exp(−β d_ik) / Σ_{j=1}^c exp(−β d_jk), (8)

and the same formula for v_i as in Eq. (5).
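The Boltzmann-Gibbs membership above is exactly a softmax over clusters with inverse temperature β. A small sketch (the function name and the max-shift stabilization are implementation choices, not from the chapter):

```python
import numpy as np

def shannon_membership(d, beta):
    """Boltzmann-Gibbs (softmax) membership from the Shannon entropy
    maximization, with beta = 1/T. `d` holds the squared distances
    d_ik, shape (n, c); rows index data, columns index clusters."""
    logits = -beta * d
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    u = np.exp(logits)
    return u / u.sum(axis=1, keepdims=True)
```

At small β (high temperature) the memberships are nearly uniform; at large β (low temperature) the nearest cluster takes almost all the mass, i.e., fuzzy clustering approaches hard clustering.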
Fuzzy entropy maximization

The fuzzy entropy for FCM takes the form

S_FE = −Σ_{k=1}^n Σ_{i=1}^c {u_ik log u_ik + (1 − u_ik) log(1 − u_ik)}. (9)

The fuzzy entropy functional is given by

δS_FE − Σ_{k=1}^n α_k δ(Σ_{i=1}^c u_ik − 1) − β δJ = 0. (10)

The stationary condition for Eq. (10) leads to the following membership function:

u_ik = 1 / (1 + e^{β(d_ik − α_k)}), (11)

and the same formula for v_i as in Eq. (5). Eq. (11) is similar to the Fermi-Dirac distribution function.

Tsallis entropy maximization
The Tsallis entropy for FCM is defined as [15]

S_q = −(1/(q − 1)) (Σ_{k=1}^n Σ_{i=1}^c (u_ik)^q − 1), (12)

where q ∈ R is a real number. In the case of the Tsallis entropy maximization, the objective function should be rewritten as

J_q = Σ_{k=1}^n Σ_{i=1}^c (u_ik)^q d_ik. (13)
Accordingly, the Tsallis entropy functional is given by

δS_q − Σ_{k=1}^n α_k δ(Σ_{i=1}^c u_ik − 1) − β δJ_q = 0. (14)

The stationary condition for Eq. (14) leads to the membership function of the form

u_ik = {1 − β(1 − q) d_ik}^{1/(1−q)} / Z_k, (15)

where

Z_k = Σ_{j=1}^c {1 − β(1 − q) d_jk}^{1/(1−q)}.

∂L/∂v_i = 0 gives the cluster centers weighted by (u_ik)^q:

v_i = Σ_{k=1}^n (u_ik)^q x_k / Σ_{k=1}^n (u_ik)^q. (16)
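Each numerator of Eq. (15) is a q-exponential, which reduces to the Boltzmann factor exp(−β d) in the limit q → 1. A minimal sketch for q ≠ 1 (the function name and the clipping of the base, which only matters for q < 1, are implementation choices):

```python
import numpy as np

def tsallis_membership(d, beta, q):
    """Membership of Eq. (15) for q != 1: each numerator is the
    q-exponential {1 - beta*(1-q)*d}^(1/(1-q)), which tends to
    exp(-beta*d) as q -> 1. `d` has shape (n, c)."""
    base = np.maximum(1.0 - beta * (1.0 - q) * d, 0.0)  # cutoff needed for q < 1
    u = base ** (1.0 / (1.0 - q))
    return u / u.sum(axis=1, keepdims=True)
```

For q > 1 the numerator decays as a power law rather than exponentially, which is the "gentle base slope" exploited later in Section 4.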

Free energy for entropy-maximized fuzzy c-means
In each of the entropy maximization methods introduced in Section 3.2, β can be regarded as the inverse of the system temperature, T^{−1}. This feature makes it possible to apply DA, and Shannon-entropy-maximized FCM with DA (Shannon-DAFCM, hereafter), fuzzy-entropy-maximized FCM with DA (fuzzy-DAFCM, hereafter), and Tsallis-entropy-maximized FCM with DA (Tsallis-DAFCM, hereafter) have been developed.

Free energy for Shannon-DAFCM
In Shannon-DAFCM, by analogy with statistical mechanics [37], the sum over states (the partition function) for the grand canonical ensemble of FCM can be expressed as

Z = Π_{k=1}^n Σ_{i=1}^c exp(−β d_ik). (17)

The free energy is derived as

F = −(1/β) log Z = −(1/β) Σ_{k=1}^n log Σ_{i=1}^c exp(−β d_ik). (18)

Stable thermal equilibrium requires a minimization of the free energy. By formulating deterministic annealing as a minimization of the free energy, ∂F/∂v_i = 0 yields the same expression for v_i as that in Eq. (5).
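Assuming the standard Boltzmann-Gibbs form F = −T Σ_k log Σ_i exp(−d_ik/T), the free energy is cheap to evaluate, which is useful for monitoring DA's descent. The log-sum-exp shift below is a stability device, not part of the derivation:

```python
import numpy as np

def shannon_free_energy(d, temp):
    """Free energy of Shannon-DAFCM under the Boltzmann-Gibbs form,
    F = -T * sum_k log sum_i exp(-d_ik / T), evaluated with a
    log-sum-exp shift for numerical stability."""
    a = -d / temp
    shift = a.max(axis=1, keepdims=True)
    lse = shift[:, 0] + np.log(np.exp(a - shift).sum(axis=1))
    return float(-temp * lse.sum())
```

As T → 0 the log-sum-exp tends to the minimum distance per datum, so F tends to the hard c-means objective, illustrating how F reaches J at the end of annealing.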

Free energy for fuzzy-DAFCM
Similarly, in fuzzy-DAFCM, the grand partition function for the grand canonical ensemble of FCM can be written as

Ξ = Π_{k=1}^n Π_{i=1}^c {1 + e^{−β(d_ik − α_k)}}. (19)

The free energy is calculated as

F = −(1/β) log Ξ = −(1/β) Σ_{k=1}^n Σ_{i=1}^c log{1 + e^{−β(d_ik − α_k)}}. (20)

∂F/∂v_i = 0 yields the same expression for v_i as that in Eq. (5).

Free energy for Tsallis-DAFCM
In Tsallis-DAFCM, the free energy can be derived as

F = U − T S_Tsallis, (21)

where T is the system temperature and U = Σ_{k=1}^n Σ_{i=1}^c (u_ik)^q d_ik is the internal energy. ∂F/∂v_i = 0 yields the same expression for v_i as that in Eq. (16).
Deterministic Annealing: A Variant of Simulated Annealing and its Application to Fuzzy Clustering http://dx.doi.org/10.5772/66072

Dependency of shape of membership function on temperature
While reducing the system temperature T, DA achieves thermal equilibrium at each temperature by minimizing the free energy. Thus, DA searches for a cluster distribution that minimizes the free energy at each temperature. When the temperature is high, the membership functions are distributed widely. This makes the clusters to which a datum belongs fuzzy. In the case of Tsallis-DAFCM, when q is nearly equal to 2, the width of the membership function is almost proportional to √T. At low temperatures, on the contrary, fuzzy clustering approaches hard clustering. The relationship F = U − T·S_Tsallis in Eq. (21) suggests that, at higher temperatures, the larger-entropy or more chaotic state is caused by a widening of the extent of the membership function.

Cooling function
In SA, the temperature decreases according to a cooling function or an annealing schedule.
The representative cooling functions for SA [38] are listed below, where T_high is the highest (initial) temperature, r is a parameter that defines the temperature reduction rate, and t is the number of iterations of temperature reduction.

(I) Exponential function

T(t) = T_high · r^t  (0 < r < 1).

(II) Inversely linear function

T(t) = T_high / (r t).

(III) Inversely logarithmic function

T(t) = T_high / log(t + 1).

(IV) Inversely exponential function

T(t) = T_high · e^{−r t}.

(V) Very fast annealing

Rosen [39] proposed another inversely exponential function known as VFA. VFA decreases the temperature exponentially in a similar way to ASA:

T(t) = T_high · e^{−r t^{1/D}},

where D is the dimension of the state space.
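The schedules can be collected into a single helper for experiments. The exact functional forms here are standard textbook shapes and should be treated as an illustrative reconstruction; the function and parameter names are ours:

```python
import numpy as np

def cooling(t_high, t, schedule="vfa", r=1.0, dim=2):
    """Illustrative cooling schedules (t >= 1 is the iteration count of
    temperature reduction). `dim` is the state-space dimension D used
    by very fast annealing (VFA)."""
    if schedule == "inv_linear":
        return t_high / (r * t)
    if schedule == "inv_log":
        return t_high / np.log(t + 1.0)
    if schedule == "inv_exp":
        return t_high * np.exp(-r * t)
    if schedule == "vfa":  # T = T_high * exp(-r * t**(1/D))
        return t_high * np.exp(-r * t ** (1.0 / dim))
    raise ValueError(schedule)
```

Note how VFA's exponent t^(1/D) flattens as D grows: in high-dimensional state spaces VFA cools more slowly per iteration, yet still far faster than the logarithmic schedule required for SA's optimality guarantee.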

Tsallis-entropy-maximized fuzzy c-means clustering with deterministic annealing
In this section, we focus on Tsallis-DAFCM, and its important experimental results are explained. In Section 4.1, we present the Tsallis-DAFCM clustering algorithm. In Section 4.2, how VFA affects Tsallis-DAFCM is experimentally investigated. In Section 4.3, the effects of the temperature and q values on the membership function are examined.

Tsallis-DAFCM clustering algorithm
The Tsallis-entropy-maximization method, fuzzy c-means clustering, and the deterministic annealing method can be combined as the following Tsallis-DAFCM clustering algorithm [15]:

1. Set the number of clusters c, the highest temperature T_high, the temperature reduction rate r, and the thresholds of the convergence tests δ_1 and δ_2;

2. Generate c clusters at random positions, and set the current temperature T to T_high;

3. Calculate the membership function u_ik by Eq. (15);

4. Calculate the cluster centers v_i by Eq. (16);

5. Compare the current cluster centers v_i with the cluster centers v_i' obtained in the previous iteration. If the convergence condition max_{1≤i≤c} ||v_i − v_i'|| < δ_1 is satisfied, then go to 6; otherwise go back to 3;

6. Compare the current cluster centers v_i with the cluster centers v_i'' obtained at the previous temperature. If the convergence condition max_{1≤i≤c} ||v_i − v_i''|| < δ_2 is satisfied, then stop; otherwise decrease the temperature using the cooling function, and go back to 3.
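The six steps above can be sketched as a double loop. The geometric cooling, the iteration caps, and the option to pass explicit initial centers are illustrative implementation choices; the membership and center updates follow Eq. (15) and Eq. (16).

```python
import numpy as np

def tsallis_dafcm(x, c, q=1.5, t_high=2.0, rate=0.8, d1=0.1, d2=0.01,
                  max_inner=100, max_outer=200, v0=None, seed=0):
    """Sketch of Tsallis-DAFCM: at each temperature, iterate the
    membership update (Eq. 15) and the u^q-weighted center update
    (Eq. 16) until centers move less than d1, then cool until the
    centers at successive temperatures move less than d2."""
    rng = np.random.default_rng(seed)
    v = np.array(v0, dtype=float) if v0 is not None \
        else x[rng.choice(len(x), c, replace=False)]
    t = t_high
    for _ in range(max_outer):
        v_temp = v.copy()                   # centers at the previous temperature
        for _ in range(max_inner):
            v_prev = v.copy()
            d = ((x[:, None, :] - v[None, :, :]) ** 2).sum(-1)
            base = np.maximum(1.0 - (1.0 - q) * d / t, 1e-12)
            u = base ** (1.0 / (1.0 - q))   # Eq. (15) numerators
            u /= u.sum(axis=1, keepdims=True)
            w = u ** q                      # centers are weighted by u_ik^q
            v = (w.T @ x) / w.sum(axis=0)[:, None]  # Eq. (16)
            if np.abs(v - v_prev).max() < d1:
                break
        if np.abs(v - v_temp).max() < d2:
            break
        t *= rate                           # geometric cooling (illustrative)
    return u, v
```

In practice the cooling step would be replaced by one of the schedules of Section 3.4, such as VFA.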

Effect of cooling functions
In general, the temperature should be reduced gradually in DA. However, this takes a long time to converge. If VFA is applicable to Tsallis-DAFCM, it is of great advantage to this method. Accordingly, VFA is tested as a cooling function of Tsallis-DAFCM.
In this subsection, both Shannon-and Tsallis-DAFCM are examined.

Experiment 1
In experiment 1, numerical data composed of five clusters and 2000 data points are used, as shown in the figure. However, when T_high = 1.0×10^6, the algorithm fails to converge with r = 100.0 and 10.0 (expressed by "Not converged" in Figure 3), because the initial distribution of u_ik is too wide. This result indicates that it is important to set both T_high and r properly. At the higher temperature, β is comparatively small and the clusters move near the center of gravity of the data, because the membership functions extend over the data area and become extremely uniform. As T is lowered, on the contrary, the membership functions become narrower and the associations of data with the clusters become less fuzzy. In this process, in Shannon-DAFCM, the clusters move to their nearest local data-distribution centers. In Tsallis-DAFCM, however, because the membership functions have gentle base slopes in the region far from the origin, the clusters can move a long distance to optimal positions stably. This feature makes it possible to reduce the temperature rapidly; thus, VFA is suitable as a cooling function of Tsallis-DAFCM.
On the other hand, final cluster positions obtained by Shannon-DAFCM tend to depend on their initial positions.

Experiment 2
In experiment 2, the Iris Data Set [17], consisting of 150 four-dimensional vectors of iris flowers, is used. The three clusters of flowers to be detected are Versicolor, Virginica, and Setosa. Each cluster consists of 50 vectors. VFA is used as the cooling function of DA. The parameters are set as follows: c = 3, δ_1 = 0.1, δ_2 = 0.01, q = 1.5, T_high = 2.0.

Dependencies of the membership function on temperature and q
In Eq. (16), it can be seen that (u_ik)^q plays an important role as a weight value for each x_k, and it determines v_i. For this reason, the dependencies of u_ik on T and q are to be investigated.
In this subsection, for simplicity, v_i is set to 0, because this makes the denominator of Eq. (15) become a sum of formulas of the same form as its numerator. Figure 6 illustrates the numerator of (u_ik)^q (expressed by U) as a function of the distance, parameterized by T and q, respectively. In order to plot the shape of U as a function of the distance between the cluster center and various data points, in this figure, the distance is replaced by a continuous variable x. Figure 6 confirms that the extent of U becomes narrower with increasing q. Likewise, as the temperature decreases, the distribution becomes narrower.

Quantitative relationship between temperature and q
As stated in the previous subsection, T and q inversely affect the extent of U, which changes in a similar way with decreasing T or increasing q. In order to examine the quantitative relationship between T and q in more detail, they are changed independently as follows. First, we define

U(x, T, q) = {1 − (1 − q) x²/T}^{q/(1−q)}. (27)
Then, Eq. (27) is calculated by fixing T and q to some constants T_0 and q_0. Next, by decreasing T, we search for the q values that minimize the sum of squares of the residuals of the following two functions:

S = Σ_{s=1}^{S_max} {U(x_s, T, q_0) − U(x_s, T_0, q)}². (28)

In the following calculations, the parameters are set as follows: T_high (= T_0) = 2.0; the domain of x is 0 ≤ x ≤ 100; the number of sampling points in the sum of residuals S_max = 10,000; Δx = 0.01.
For q (= q_0) values of 1.01, 2.0, 6.0, and 10.0, and for T decreasing from T_high, the q values that minimize Eq. (28) (expressed by q_min) are shown in Figure 7(a). Figure 7(b), on the other hand, shows the results of cases in which q is set to 2.0 and T is lowered from T_high = 2.0, 20.0, 100.0, and 200.0. The approximate curves plotted in Figure 7 are obtained by fitting the data to the following function:

q_min = a / T^b, (29)

where a and b denote the fitting parameters. Optimal values of these parameters obtained by the least-squares method are summarized in Tables 3 and 4. From these tables, it is concluded that b is nearly equal to 1.0, indicating that q_min is inversely proportional to T. In addition, it can be seen that, though b does not change its value much, a increases with increasing T_high. As a result, by using the approximate relationship between T and q_min, q-incrementation clustering, instead of annealing or T-reduction, might be possible [18].
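The search for q_min can be reproduced with a simple grid search. The shape function below is an assumed reconstruction of Eq. (27) (the numerator of u_ik^q with d = x² and β = 1/T); the function names, the q grid, and its bounds are our illustrative choices.

```python
import numpy as np

def u_shape(x, temp, q):
    """Assumed form of Eq. (27): the numerator of u_ik^q with d = x**2
    and beta = 1/T (an illustrative reconstruction)."""
    return np.maximum(1.0 - (1.0 - q) * x ** 2 / temp, 0.0) ** (q / (1.0 - q))

def q_min_for(temp, t0=2.0, q0=2.0,
              qs=np.arange(1.01, 20.0, 0.01),
              xs=np.arange(0.0, 100.0, 0.01)):
    """Grid-search the q at which U(x, t0, q) best reproduces
    U(x, temp, q0) in the least-squares sense of Eq. (28)."""
    target = u_shape(xs, temp, q0)
    resid = [((u_shape(xs, t0, q) - target) ** 2).sum() for q in qs]
    return float(qs[int(np.argmin(resid))])
```

When T equals T_0 the best q is q_0 itself; lowering T drives q_min upward, consistent with the approximately inverse relationship q_min ≈ a/T reported above.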

Conclusion
In this chapter, we first explained the major characteristics of DA and compared it with SA. DA is a variant of SA and searches for a minimum deterministically; thus, it is generally more efficient than SA. We then explained how DA can be applied to fuzzy c-means clustering by employing the entropy maximization method.

Figure 3. Increase of β under the inversely exponential cooling function.

Figures 5(a) and (b) illustrate three-dimensional plots of u_ik in the progress of Shannon- and Tsallis-DAFCM clustering combined with VFA. When the temperature is as high as 3.7×10^4, the roughness of u_ik of Tsallis-DAFCM is smaller than that of Shannon-DAFCM. After that, the shapes of both membership functions do not change greatly, because VFA reduces the temperature rapidly only at the early annealing stage. When the temperature is lowered to 1.3×10^4, both methods cluster the data desirably.

Figure 7. Plots of q_min as a function of T, parameterized by (a) q_0 and (b) T_high.

The minimum, maximum, and average numbers of misclassified data points over 100 trials are summarized in Table 2. It can be seen that Shannon-DAFCM gives slightly better results than Tsallis-DAFCM. However, it is also confirmed that Tsallis-DAFCM gives its best results when the temperature reduction rate r is set to 1 or 2, whereas the best result for Shannon-DAFCM is obtained only when r = 2. Furthermore, the variances of Tsallis-DAFCM are smaller than those of Shannon-DAFCM. These features indicate that a wide range of r values is applicable to Tsallis-DAFCM. On the other hand, with larger r values, Shannon-DAFCM becomes unstable.

Table 2. Misclassified data points of the Iris Data Set (100 trials).