Correspondence of Fermi-Dirac Statistics and Fuzzy Clustering.
Many engineering problems can be formulated as optimization problems, and the deterministic annealing (DA) method  is known as an effective optimization method for such problems. DA is a deterministic variant of simulated annealing (SA) [1, 10]. The DA characterizes the minimization problem of cost functions as the minimization of Helmholtz free energy which depends on a (pseudo) temperature, and tracks the minimum of free energy while decreasing temperature and thus it can deterministically optimize the function at a given temperature . Hence, the DA is more efficient than the SA, but does not guarantee a global optimal solution. The study on the DA in  addressed avoidance of the poor local minima of cost function of data clustering. Then it was extensively applied to various subjects such as combinational optimization problems , vector quantization , classifier design , pairwise data clustering  and so on.
On the other hand, clustering is a method which partitions a given set of data points into subgroups, and is one of major tools for data analysis. It is supposed that, in the real world, cluster boundaries are not so clear that fuzzy clustering is more suitable than crisp clustering. Bezdek proposed the fuzzy c-means (FCM) which is now well known as the standard technique for fuzzy clustering.
Then, after the work of Li et al. which formulated the regularization of the FCM with Shannon entropy, Miyamoto et al. discussed the FCM within the framework of the Shannon entropy based clustering. From the historical point of view, however, it should be noted that Rose et al. first studied the statistical mechanical analogy of the FCM with the maximum entropy method, which was basically probabilistic clustering.
To measure the “indefiniteness” of fuzzy set, DeLuca and Termini  defined fuzzy entropy after Shannon. Afterwards, some similar measures from the wider viewpoints of the indefiniteness were proposed [15, 16]. Fuzzy entropy has been used for knowledge retrieval from fuzzy database  and image processing , and proved to be useful.
Tsallis  achieved nonextensive extension of the Boltzmann-Gibbs statistics. Tsallis postulated a generalization form of entropy with a generalization parameter, which, in a limit of,reaches the Shannon entropy. Later on, Menard et.al. derived a membership function by regularizing FCM with the Tsallis entropy.
In this chapter, by maximizing the various entropies within the framework of FCM, the membership functions which take the familiar forms of the statitical mechanical distribution functions are derived. The advantage to use the statistical mechanical membership functions is that the fuzzy c-means clustering can be interpreted and analyzed from a statistical mechanical point of view [27, 28]
After that, we focus on the Fermi-Dirac like membership function, because, as compared to the Maxwell-Boltzmann-like membership function, the Fermi-Dirac-like membership function has extra parameters s (corresponds to a chemical potential in statistical mechanics, and denotes a data point), which make it possible to represent various cluster shapes like former clustering methods based on such as the Gaussian mixture, and the degree of fuzzy entropy. s strongly affect clustering results and they must be optimized under a normalization constraint of FCM. On the other hand, the DA method, though it is efficient, does not give appropriate values of s by itself and the DA clustering sometimes fails if s are improperly given. Accordingly, we introduce SA to optimize s because, as pointed above, both of DA and SA contain the parameter corresponding to the system temperature and can be naturally combined as DASA.
Nevertheless, this approach causes a few problems. (1)How to estimate the initial values of s under the normalization constraint ? (2)How to estimate the initial annealing temperature? (3)SA must optimize a real number [5, 26]. (4)SA must optimize many s.
Linear approximations of the Fermi-Dirac-like membership function is useful in guessing the initial s and the initial annealing temperature of DA.
In order to perform SA in a many variables domain, s to be optimized are selected according to a selection rule. In an early annealing stages, most s are optimized. In a final annealing stage, however, only s of data which locate sufficiently away from all cluster centers are optimized because their memberships might be fuzzy. Distances between the data and the cluster centers are measured by using linear approximations of the Fermi-Dirac-like membership function.
However, DASA suffers a few disadvantages. One of them is that it is not necessarily easy to interpolate membership functions obtained by DASA, since their values are quite different each other. The fractal interpolation method  is suitable for these rough functions .
Numerical experiments show that DASA clusters data which distribute in various shapes more properly and stably than single DA. Also, the effectiveness of the fractal interpolation is examined.
2. Fuzzy c-meansLet be a data set in a -dimensional real space, which should be divided into clusters. Let be the centers of clusters and be the membership function. Also let
be the objective function of the FCM where. In the FCM, under the normalization constraint of
the Lagrange function is given by
where is the Lagrange multiplier. Bezdek showed that the FCM approaches crisp clustering as decreases to.
3. Entropy maximization of FCM
3.1. Shannon entropy maximization
First, we introduce the Shannon entropy into the FCM clustering. The Shannon entropy is given by
Under the normalization constraint and setting to, the fuzzy entropy functional is given by
and the cluster centers
3.2. Fuzzy entropy maximization
We then introduce the fuzzy entropy into the FCM clustering.
The fuzzy entropy is given by
The fuzzy entropy functional is given by
and the cluster centers
3.3. Tsallis entropy maximizationLet and be the centers of clusters and the membership functions, respectively.
The Tsallis entropy is defined as
where is any real number. The objective function is rewritten as
Accordingly, the Tsallis entropy functional is given by
The stationary condition for Eq. (15) yields to the following membership function
In this case, the cluster centers are given by
4. Entropy maximization and statistical physics
4.1. Shannon entropy based FCM statistics
In the Shannon entropy based FCM, the sum of the states (the partition function) for the grand canonical ensemble of fuzzy clustering can be written as
Stable thermal equilibrium requires a minimization of the free energy. By formulating deterministic annealing as a minimization of the free energy, yields
This cluster center is the same as that in Eq. (7).
4.2. Fuzzy entropy based FCM statistics
In a group of independent particles, the total energy and the total number of particles are given by and, respectively, where represents the energy level and represents the number of particles that occupy. We can write the sum of states, or the partition function, in the form:
where is the product of the inverse of temperature and (Boltzmann constant). However, it is difficult to take the sums in (22) counting up all possible divisions. Accordingly, we make the number of particles a variable, and adjust the new parameter (chemical potential) so as to make and are satisfied. Hence, this becomes the grand canonical distribution, and the sum of states (the grand partition function) is given by[8, 19]
For particles governed by the Fermi-Dirac distribution, can be rewritten as
Also, is averaged as
where is defined by the condition that . Helmholtz free energy is, from the relationship,
into account, the entropy has the form
If states are degenerated to the degree of, the number of particles which occupy is
and we can rewrite the entropy as
which is similar to fuzzy entropy in (8). As a result, corresponds to a grain density and the inverse of in (10) represents the system or computational temperature.
In the FCM clustering, note that any data can belong to any cluster, the grand partition function can be written as
which, from the relationship, gives the Helmholtz free energy
The inverse of represents the system or computational temperature.
4.3. Correspondence between Fermi-Dirac statistics and fuzzy clustering
|Fermi-Dirac Statistics||Fuzzy Clustering|
In the previous subsection, we have formulated the fuzzy entropy regularized FCM as the DA clustering and showed that its mechanics was no other than the statistics of a particle system (the Fermi-Dirac statistics). The correspondences between fuzzy clustering (FC) and the Fermi-Dirac statistics (FD) are summarized in TABLE 1. The major difference between fuzzy clustering and statistical mechanics is the fact that data are distinguishable and can belong to multiple clusters, though particles which occupy a same energy state are not distinguishable. This causes a summation or a multiplication not only on but on as well in fuzzy clustering. Thus, fuzzy clustering and statistical mechanics described in this paper are not mathematically equivalent.
Constraints: (a) Constraint that the sum of all particles is fixed in FD is correspondent with the normalization constraint in FC. Energy level is equivalent to the cluster number. In addition, the fact that data can belong to multiple clusters leads to the summation on. (b) There is no constraint in FC which corresponds to the constraint that the total energy equals in FD. We have to minimize in FC.
Distribution Function: In FD, gives an average particle number which occupies energy level, because particles can not be distinguished from each other. In FC, however, data are distinguishable, and for that reason, gives a probability of data belonging to multiple clusters.
Entropy: is supposed to correspond to a cluster capacity. The meanings of and will be discussed in detail in the next subsection.
Temperature: Temperature is given in both cases -.
Partition Function: The fact that data can belong to multiple clusters simultaneously causes the product over for FC.
Free Energy: Helmholtz free energy is given by in FC. Both and equal as expected from statistical physics.
Energy: The relationship or holds between, , and or.
4.4. Meanings of Fermi-Dirac distribution function and fuzzy entropy
In the entropy function (28) or (30) for the particle system, we can consider the first term to be the entropy of electrons and the second to be that of holes. In this case, the physical limitation that only one particle can occupy an energy level at a time results in the entropy that formulates the state in which an electron and a hole exist simultaneously and exchanging them makes no difference. Meanwhile, what correspond to electron and hole in fuzzy clustering are the probability of fuzzy event that a data belongs to a cluster and the probability of its complementary event, respectively.
Fig.2 shows a two-dimensional virtual cluster density distribution model. A lattice can have at most one data. Let be the total number of lattices and be the number of lattices which have a data in it (marked by a black box). Then, the number of available divisions of data to lattices is denoted by
which, from (the Gibbs entropy), gives the form similar to (30). By extremizing, we have the most probable distribution like (25). In this case, as there is no distinction between data, only the numbers of black and white lattices constitute the entropy. Fuzzy entropy in (8), on the other hand, gives the amount of information of weather a data belongs to a fuzzy set (or cluster) or not, averaged over independent data.
Changing a viewpoint, the stationary entropy values for the particle system seems to be a request for giving the stability against the perturbation with collisions between particles. In fuzzy clustering, data reconfiguration between clusters with the move of cluster centers or the change of cluster shapes is correspondent to this stability. Let us represent data density by. If data transfer from clusters and to and as a magnitude of membership function, the transition probability from to will be proportional to because a data enters a vacant lattice. Similarly, the transition probability from to will be proportional to. In the equilibrium state, the transitions exhibit balance (this is known as the principle of detailed balance). This requires
As a result, if energy is conserved before and after the transition, must have the form
or Fermi-Dirac distribution
where and are constants.
Consequently, the entropy like fuzzy entropy is statistically caused by the system that allows complementary states. Fuzzy clustering handles a data itself, while statistical mechanics handles a large number of particles and examines the change of macroscopic physical quantity. Then it is concluded that fuzzy clustering exists in the extreme of Fermi-Dirac statistics, or the Fermi-Dirac statistics includes fuzzy clustering conceptually.
5. Deterministic annealing
The DA method is a deterministic variant of SA. DA characterizes the minimization problem of the cost function as the minimization of the Helmholtz free energy which depends on the temperature, and tracks its minimum while decreasing the temperature and thus it can deterministically optimize the cost function at each temperature.
According to the principle of minimal free energy in statistical mechanics, the minimum of the Helmholtz free energy determines the distribution at thermal equilibrium . Thus, formulating the DA clustering as a minimization of (32) leads to at the current temperature, and gives (10) and (11) again. Desirable cluster centers are obtained by calculating (10) and (11) repeatedly.
In this chapter, we focus on application of DA to the Fermi-Dirac-like distribution function described in the Section 4.2.
5.1. Linear approximation of Fermi-Dirac distribution function
The Fermi-Dirac distribution function can be approximated by linear functions. That is, as shown in Fig.1, the Fermi-Dirac distribution function of the form:
is approximated by the linear functions
where. satisfies, and requires to be negative.
In Fig.2, denotes a reduction of the extent of distribution with decreasing the temperature from to (). The extent of distribution also narrows with increasing. () which satisfies is obtained as
where and. Thus, taking that to the temperature at which previous DA was executed and to the next temperature, a covariance of’s distribution is defined as
5.2. Initial estimation of and annealing temperature
Before executing DA, it is very important to estimate the initial values of s and the initial annealing temperature in advance.
From Fig.1, distances between a data point and cluster centers are averaged as
and this gives
With given initial clusters distributing wide enough, (47) overestimates, so that needs to be adjusted by decreasing its value gradually.
Still more, Fig.1 gives the width of the Fermi-Dirac distribution function as wide as, which must be equal to or smaller than that of data distribution (=2R). This condition leads to
As a result, the initial value of or the initial annealing temperature is roughly determined as
5.3. Deterministic annealing algorithm
The DA algorithm for fuzzy clustering is given as follows:
Calculate by (12).
Calculate by (11).
Compare a difference between the current objective value and that obtained at the previous iteration. If is satisfied, then return. Otherwise decrease the temperature as, and go back to.
6. Combinatorial algorithm of deterministic and simulated annealing
6.1. Simulated annealing
The cost function for SA is
where is a constant.
In order to optimize each by SA, its neighbor (a displacement from the current) is generated by assuming a normal distribution with a mean and a covariance defined in (45).
The SA’s initial temperature is determined so as to make an acceptance probability becomes
By selecting s to be optimized from the outside of a transition region in which the membership function changes form to, computational time of SA can be shortened. The boundary of the transition region can be easily obtained with the linear approximations of the Fermi-Dirac-like membership function. From Fig.1, data which have distances bigger than from each cluster centers are selected.
6.2. Simulated annealing algorithm
The SA algorithm is stated as follows:
Select data to be optimized, if necessary.
Calculate neighbors of current s.
Apply the Metropolis algorithm to the selected s using (50) as the objective function.
6.3. Combinatorial algorithm of deterministic and simulated annealing
The DA and SA algorithms are combined as follows:
Execute the DA algorithm.
Compare a difference between the current objective value and that obtained at the previous iteration. If or is satisfied, then go to. Otherwise increment, and go back to.
7. Experiments 1
To demonstrate effectiveness of the proposed algorithm, numerical experiments were carried out. DASA’s results were compared with those of DA (single DA).
We set, , , , , and. We also set in (48) to for experimental data 13, and for experimental data 4 -.
In experiment 1, 11,479 data points were generated as ten equally sized normal distributions. Fig.4 shows a fuzzy clustering result by DASA. Single DA similarly clusters these data.
In experiment 2-1, three differently sized normal distributions consist of 2,249 data points in Fig.5 were used. Fig.5(0) shows initial clusters obtained by the initial estimation of s and the annealing temperature. Fig.5(1)(6a) shows a fuzzy clustering process of DASA. At the high temperature in Fig.5(1), as described in 4.3, the membership functions were widely distributed and clusters to which a data belongs were fuzzy. However, with decreasing of the temperature (from Fig.5(2) to Fig.5(5)), the distribution became less and less fuzzy. After executing DA and SA alternately, the clusters in Fig.5(6a) were obtained. Then, data to be optimized by SA were selected by the criterion stated in the section 4, and SA was executed. The final result of DASA in Fig.5(6b) shows that data were desirably clustered. On the contrary, because of randomness of the initial cluster positions and hardness of good estimation of the initial s, single DA becomes unstable, and sometimes gives satisfactory results as shown in Fig.5(6c) and sometimes not as shown in Fig.5(6d). By comparing Fig.5(6b) to (6c), it is found that, due to the optimization of s by SA, the resultant cluster shapes of DASA are far less smooth than those of single DA.
Changes of the costs of DASA (for DA stage and (50) for SA stage (was set to in (50)), respectively) are plotted as a function of iteration in Fig.6, and the both costs decreases with increasing iteration. In this experiment, the total iteration of SA stage was about, while that of DA stage was only. Accordingly, the amount of simulation time DASA was mostly consumed in SA stage.
In experiment 2-2, in order to examine effectiveness of SA introduced in DASA, experiment 2 was re-conducted ten times as in Table 1, where listed in the first row is a ratio of data optimized at SA stage. “UP” means to increase as where is a number of execution times of SA stage. Also, “DOWN” means to decrease as. Results are judged “Success” or “Failure” from a human viewpoint -. From Table 1, it is concluded that DASA always clusters the data properly if is large enough (), whereas, as listed in the last column, single DA succeeds by 50%.
In experiments 3 and 4, two elliptic distributions consist of 2,024 data points, and two horseshoe-shaped distributions consist of 1,380 data points were used, respectively. Fig.5, 6 and 7 show DASA’s clustering results. It is found that DASA can cluster these data properly. In experiment 3, a percentage of success of DASA is 90%, though that of single DA is 50%. In experiment 4, a percentage of success of DASA is 80%, though that of single DA is 40%.
These experimental results demonstrate the advantage of DASA over single DA. Nevertheless, DASA suffers two disadvantages. First, it takes so long to execute SA repeatedly that, instead of (10), it might be better to use its linear approximation functions as the membership function. Second, since s differ each other, it is difficult to interpolate them.
8. Experiments 2
8.1. Interpolation of membership function
DASA suffers a few disadvantages. First, it is not necessarily easy to interpolate or, since they differ each other. Second, it takes so long to execute SA repeatedly.
A simple solution for the first problem is to interpolate membership functions. Thus, the following step was added to the DASA algorithm.
When a new data is given, some neighboring membership functions are interpolated at its position.
To examine an effectiveness of interpolation, the proposed algorighm was applied to experimental data shown in Fig.9(a). For simplicity, the data were placed on rectangular grids on the xy plane.
First, some regional data were randomly selected from the data. Then, Initial and final memberhip functions obtained by DASA are shown in Figs.9(b) and (c) respectively.
After that, remaining data in the region were used as test data, and at each data point, they were interpolated by their four nearest neighboring membership values. Linear, bicubic and fractal interpolation methods were compared.
Prediction error of linear interpolation was 6.8%, and accuracy was not enough. Bicubic interpolation also gave a poor result, because its depends on good estimated gradient values of neighboring points. Accordingly, in this case, fractal interpolation is more suitable than smooth interpolation methods such as bicubic or spline interpolation, because the membership functions in Figs.9(c) are very rough.
The well-known InterpolatedFM (Fractal motion via interpolation and variable scaling) algorithm  was used in this experiment. Fractal dimension was estimated by the standard box-counting method . Figs.10(a) and 3(b) represent both the membership functions and their interpolation values. Prediction error (averaged over 10 trials) of fractal interpolation was 2.2%, and a slightly better result was obtained.
In this article, by combining the deterministic and simulated annealing methods, we proposed the new statistical mechanical fuzzy c-means clustering algorithm (DASA). Numerical experiments showed the effectiveness and the stability of DASA.
However, as stated at the end of Experiments, DASA has problems to be considered. In addition, a major problem of the fuzzy c-means methodologies is that they do not give a number of clusters by themselves. Thus, a method such as  which can determine a number of clusters automatically should be combined with DASA.
Our future works also include experiments and examinations of the properties of DASA, especially on an adjustment of its parameters, its annealing scheduling problem, and its applications for fuzzy modeling.
However, DASA has problems to be considered. One of them is that it is difficult to interpolate membership functions, since their values are quite different. Accordingly, the fractal interpolation method (InterpolationFM algorithm) is introduced to DASA and examined its effectiveness.
Our future works include experiments and examinations of the properties of DASA, a comparison of results of interpolation methods (linear, bicubic, spline, fractal and so on), an interpolation of higher dimensional data, an adjustment of DASA’s parameters, and DASA’s annealing scheduling problem.
- In the FCM, however, temperature is determined as a result of clustering.
- These parameters have not been optimized particularly for experimental data.
- No close case was observed in this experiment.