Learning Algorithms for Fuzzy Inference Systems Using Vector Quantization

Many studies have addressed the learning of fuzzy inference systems. In particular, learning methods that combine vector quantization (VQ) with the steepest descent method (SDM) are known to be superior to other methods. In these methods, VQ is used only to determine the initial parameters of the antecedent part of the fuzzy rules. To improve them, some methods that also determine the initial parameters of the consequent part have been proposed. For example, a learning method composed of three stages, VQ, the generalized inverse matrix (GIM), and SDM, was proposed in a previous paper. In this paper, we propose improved methods for the SDM learning process within learning methods that use VQ, GIM, and SDM, and show in numerical simulations that the proposed methods are superior to the conventional methods in the number of rules.


Introduction
There have been many studies on the learning of fuzzy systems [1][2][3][4][5][6][7][8]. Their aim is to construct learning methods based on SDM. Several novel methods have been developed that (1) generate fuzzy rules one by one starting from any number of rules, or reduce fuzzy rules one by one starting from a sufficiently large number of rules [2]; (2) use a genetic algorithm (GA) and particle swarm optimization (PSO) to determine fuzzy systems [3]; (3) use fuzzy inference systems composed of a small number of input rule modules, such as the single input rule modules (SIRMs) and double input rule modules (DIRMs) methods [9,10]; and (4) use a self-organization or a vector quantization technique to determine the initial assignment of parameters [11][12][13][14][15][19]. In particular, learning methods using vector quantization (VQ) and the steepest descent method (SDM) are known to be superior to other methods in the number of rules (parameters) [16,19]. Why is it effective to combine VQ with SDM in fuzzy modeling? First, consider the methods that combine SDM with techniques other than VQ. (1) The generation method has a short learning time but is known to have low test accuracy, while the reduction method has high test accuracy but a long learning time [2]. (2) Methods using GA and PSO show high accuracy when the input dimension and the number of rules are small, but they are known to scale poorly [3]. (3) SIRM and DIRM methods scale well, but their learning accuracy is not always sufficient [9]. As described above, many methods are not necessarily effective, either because learning becomes difficult as the input dimension and the number of rules increase or because their accuracy is low. In contrast, the method combining VQ with SDM can conduct SDM learning efficiently because VQ arranges the initial parameters of the fuzzy rules suitably [1,16].
However, since VQ is unsupervised learning, it easily reflects the input part of the learning data, but it is difficult for it to capture the output information. Among these studies, the first learning method uses VQ only to determine the initial parameters of the antecedent part of the fuzzy rules from the input part of the learning data [1][11][12][13][14]. The second method determines the same parameters using both the input and output parts of the learning data [15,19]. The third method iterates the learning process of VQ and SDM for the second method; Kishida and Pedrycz proposed methods based on the third one [13,15]. All of these methods determine only the antecedent parameters by VQ. Therefore, as the fourth method, we introduced the generalized inverse matrix (GIM) to determine the initial assignment of the weight parameters of the consequent part of the fuzzy rules and showed its effectiveness in previous papers [16,17]. In this paper, we introduce improved methods for the SDM learning process in learning methods using VQ, GIM, and SDM and show in numerical simulations that they are superior to other methods in the number of rules.

The conventional fuzzy inference model
The conventional fuzzy inference model using SDM is described [1]. Let $Z_j = \{1, \ldots, j\}$ and $Z_j^* = \{0, 1, \ldots, j\}$. Let $R$ be the set of real numbers. Let $x = (x_1, \ldots, x_m)$ and $y$ be input and output variables, respectively, where $x_j \in R$ for $j \in Z_m$ and $y \in R$. Then, the rule of the simplified fuzzy inference model is expressed as

$$R_i: \text{if } x_1 \text{ is } M_{i1} \text{ and } \cdots \text{ and } x_m \text{ is } M_{im} \text{ then } y \text{ is } w_i \qquad (1)$$

where $i \in Z_n$ is a rule number, $j \in Z_m$ is a variable number, $M_{ij}$ is a membership function of the antecedent part, and $w_i$ is the weight of the consequent part.
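A membership value $\mu_i$ of the antecedent part for input $x$ is expressed as

$$\mu_i = \prod_{j=1}^{m} M_{ij}(x_j) \qquad (2)$$

Then, the output $y^*$ of the fuzzy inference method is obtained as

$$y^* = \frac{\sum_{i=1}^{n} \mu_i w_i}{\sum_{i=1}^{n} \mu_i} \qquad (3)$$

If a Gaussian membership function is used, then $M_{ij}$ is expressed as

$$M_{ij}(x_j) = \exp\left(-\frac{1}{2}\left(\frac{x_j - c_{ij}}{b_{ij}}\right)^2\right) \qquad (4)$$

where $c_{ij}$ and $b_{ij}$ denote the center and the width values of $M_{ij}$, respectively.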
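The inference step described above (product of Gaussian memberships, then an activation-weighted average of the consequent weights) can be sketched in Python with NumPy. This is a minimal sketch rather than the authors' implementation: the function names `membership` and `infer`, the array layout, and the two-rule example values are assumptions made for illustration, and the Gaussian is assumed to have the form exp(-(x - c)^2 / (2 b^2)).

```python
import numpy as np

def membership(x, c, b):
    """Rule activations mu_i for one input x.

    x: (m,) input; c, b: (n, m) centers and widths. Returns shape (n,).
    """
    # Gaussian membership per rule and variable, combined by product.
    g = np.exp(-0.5 * ((x - c) / b) ** 2)
    return g.prod(axis=1)

def infer(x, c, b, w):
    """Output y* as the activation-weighted average of the weights w."""
    mu = membership(x, c, b)
    return float(mu @ w / mu.sum())

# Two rules on a one-dimensional input (illustrative values only).
c = np.array([[0.0], [1.0]])
b = np.array([[0.5], [0.5]])
w = np.array([0.0, 1.0])

# Halfway between the two centers both rules fire equally,
# so the output is the plain average of the two weights.
y_mid = infer(np.array([0.5]), c, b, w)
```

Because the output is a normalized average, it always lies between the smallest and largest consequent weight.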
The objective function $E$ is determined to evaluate the inference error between the desired output $y^r$ and the inference output $y^*$.
Let $D = \{(x_1^p, \ldots, x_m^p, y_p^r) \mid p \in Z_P\}$ and $D^* = \{(x_1^p, \ldots, x_m^p) \mid p \in Z_P\}$ be the set of learning data and the set of the input part of $D$, respectively. The objective of learning is to minimize the following mean square error (MSE):

$$E = \frac{1}{P} \sum_{p=1}^{P} \left(y_p^* - y_p^r\right)^2 \qquad (5)$$

where $y_p^*$ and $y_p^r$ mean the inference output and the desired output for the $p$th input $x^p$.
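In order to minimize the objective function $E$, each parameter $\alpha \in \{c_{ij}, b_{ij}, w_i\}$ is updated based on SDM using the following relation:

$$\alpha(t+1) = \alpha(t) - K_\alpha \frac{\partial E}{\partial \alpha}$$

where $t$ is the iteration time and $K_\alpha$ is a learning constant [1].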
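As a worked illustration of the SDM update, the partial derivatives can be written out explicitly. The expressions below assume the per-sample squared error $E_p = \frac{1}{2}(y^* - y^r)^2$ and Gaussian membership functions of the form $\exp(-(x_j - c_{ij})^2 / (2 b_{ij}^2))$; the factor 1/2 and the per-sample form are conveniences chosen here, so constant factors may differ from those in the source.

```latex
\begin{align*}
\frac{\partial E_p}{\partial w_i}
  &= \frac{\mu_i}{\sum_{k=1}^{n}\mu_k}\,(y^* - y^r),\\
\frac{\partial E_p}{\partial c_{ij}}
  &= \frac{\mu_i}{\sum_{k=1}^{n}\mu_k}\,(y^* - y^r)(w_i - y^*)\,
     \frac{x_j - c_{ij}}{b_{ij}^{2}},\\
\frac{\partial E_p}{\partial b_{ij}}
  &= \frac{\mu_i}{\sum_{k=1}^{n}\mu_k}\,(y^* - y^r)(w_i - y^*)\,
     \frac{(x_j - c_{ij})^{2}}{b_{ij}^{3}}.
\end{align*}
```

These follow from the chain rule, using $\partial y^* / \partial \mu_i = (w_i - y^*) / \sum_k \mu_k$ together with the derivatives of the Gaussian with respect to its center and width.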
The learning algorithm for the conventional fuzzy inference model is shown as follows:
Learning Algorithm A
Step A1: The threshold θ of inference error and the maximum number of learning time T max are set. Let n 0 be the initial number of rules. Let t = 1.
Step A2: The parameters b ij , c ij , and w i are set randomly.
Step A3: Let p = 1.
Step A4: A data (x p 1 , …, x p m , y r p ) ∈ D is given.
Step A5: The membership values μ i and the output y * are calculated from Eqs. (2) and (3).
Step A6: The parameters c ij , b ij , and w i are updated by the SDM update relation.
Step A7: If p = P, then go to Step A8, and if p < P, then go to Step A4 with p ← p + 1.
Step A8: Let E(t) be the inference error at step t calculated by Eq. (5). If E(t) > θ and t < T max , then go to Step A3 with t ← t + 1; else, if E(t) ≤ θ and t ≤ T max , then the algorithm terminates.
Step A9: If t > T max and E(t) > θ, then go to Step A2 with n ← n + 1 and t = 1.
In particular, Algorithm SDM is defined as follows:
Algorithm SDM (c, b, w)
Steps A3 to A8 of Algorithm A are performed.
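The inner loop of Algorithm SDM (one pass over the learning data, then an MSE check) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `sdm_epoch` and `mse`, the per-sample loss 0.5 (y* - y^r)^2, the Gaussian form exp(-(x - c)^2 / (2 b^2)), the learning constants, and the toy target function are all chosen here for demonstration.

```python
import numpy as np

def sdm_epoch(X, y, c, b, w, Kc=0.01, Kb=0.01, Kw=0.1):
    """One pass over the data, updating c, b, w in place."""
    for x, yr in zip(X, y):
        g = np.exp(-0.5 * ((x - c) / b) ** 2)   # (n, m) memberships
        mu = g.prod(axis=1)                      # (n,) rule activations
        s = mu.sum()
        ystar = mu @ w / s
        err = ystar - yr
        # Gradients of the assumed per-sample loss 0.5 * err**2.
        common = (err * (w - ystar) * mu / s)[:, None]
        grad_c = common * (x - c) / b ** 2
        grad_b = common * (x - c) ** 2 / b ** 3
        grad_w = err * mu / s
        c -= Kc * grad_c
        b -= Kb * grad_b
        w -= Kw * grad_w

def mse(X, y, c, b, w):
    """Mean square error over the whole data set."""
    g = np.exp(-0.5 * ((X[:, None, :] - c) / b) ** 2)  # (P, n, m)
    mu = g.prod(axis=2)                                 # (P, n)
    ystar = mu @ w / mu.sum(axis=1)
    return float(np.mean((ystar - y) ** 2))

# Toy problem: approximate sin(pi * x) on [0, 1] with four rules.
rng = np.random.default_rng(0)
X = rng.random((50, 1))
y = np.sin(np.pi * X[:, 0])
c = np.linspace(0.0, 1.0, 4).reshape(4, 1)
b = np.full((4, 1), 0.3)
w = np.zeros(4)

e0 = mse(X, y, c, b, w)
for _ in range(200):          # plays the role of the loop over t
    sdm_epoch(X, y, c, b, w)
e1 = mse(X, y, c, b, w)
```

In the full Algorithm A, this loop would stop as soon as the error falls below θ, and the number of rules would be increased when T max is exceeded; both are omitted here for brevity.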

Neural gas method
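Vector quantization techniques encode a data space $V \subseteq R^m$, utilizing only a finite set $C = \{c_i \mid i \in Z_r\}$ of reference vectors [18].
Let the winner vector $c_{i(v)}$ be defined for any vector $v \in V$ as

$$i(v) = \arg\min_{i \in Z_r} \|v - c_i\|$$

By using the finite set $C$, the space $V$ is partitioned as

$$V_i = \{v \in V \mid \|v - c_i\| \le \|v - c_j\| \text{ for } j \in Z_r\}$$

where $V = \bigcup_{i \in Z_r} V_i$ and $V_i \cap V_j = \emptyset$ for $i \ne j$. The evaluation function for the partition is defined by

$$E = \sum_{i=1}^{r} \frac{1}{n_i} \sum_{v \in V_i} \|v - c_{i(v)}\|^2$$

where $n_i = |V_i|$.
Let us introduce the neural gas method as follows [18]: For any input data vector $v$, the neighborhood ranking $c_{i_k}$ for $k \in Z_{r-1}^*$ is determined, $c_{i_k}$ being the reference vector for which there are $k$ vectors $c_j$ with

$$\|v - c_j\| < \|v - c_{i_k}\|$$

Let the number $k$ associated with each vector $c_i$ be denoted by $k_i(v, c_i)$. Then, the adaption step for adjusting the parameters is given by

$$\Delta c_i = \varepsilon \cdot h_\lambda(k_i(v, c_i)) \cdot (v - c_i), \qquad h_\lambda(k_i(v, c_i)) = \exp\left(-\frac{k_i(v, c_i)}{\lambda}\right)$$

where $\varepsilon \in [0, 1]$ and $\lambda > 0$.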
Let the probability of v selected from V be denoted by p(v).
The flowchart of the conventional neural gas algorithm is shown in Figure 1 [18], where ε init and ε fin are learning constants and T max2 is the maximum number of learning steps. The method is called learning algorithm NG.
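A rank-based neural gas update of this kind can be sketched as follows. This is a minimal sketch, not a reproduction of Figure 1: the function name `neural_gas`, the exponential decay of the step size from `eps_init` to `eps_fin`, the fixed `lam`, and the initialization from randomly chosen data points are all assumptions made here.

```python
import numpy as np

def neural_gas(V, r, T=2000, eps_init=0.1, eps_fin=0.01, lam=0.7,
               p=None, seed=0):
    """Return r reference vectors fitted to the rows of V.

    p: optional selection probabilities over the rows of V
       (uniform selection if None).
    """
    rng = np.random.default_rng(seed)
    # Start from r distinct data points (an assumed initialization).
    C = V[rng.choice(len(V), size=r, replace=False)].astype(float)
    for t in range(T):
        # Assumed schedule: exponential decay from eps_init to eps_fin.
        eps = eps_init * (eps_fin / eps_init) ** (t / T)
        v = V[rng.choice(len(V), p=p)]
        dist = np.linalg.norm(C - v, axis=1)
        # k[i] = number of reference vectors closer to v than C[i].
        k = np.argsort(np.argsort(dist))
        C += eps * np.exp(-k / lam)[:, None] * (v - C)
    return C

# Two well-separated clusters on the real line (toy data).
data_rng = np.random.default_rng(1)
V = np.concatenate([data_rng.normal(0.0, 0.5, (100, 1)),
                    data_rng.normal(10.0, 0.5, (100, 1))])
C = neural_gas(V, r=2)
```

The optional argument `p` is where a non-uniform selection probability over the data could be supplied. With a fixed `lam`, every reference vector is also pulled slightly toward data it did not win, so the vectors settle somewhat inside the cluster means rather than exactly on them.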
Using the set $D^*$, a decision procedure for the center and width parameters is given as follows:
Algorithm Center(p)
$p(x)$: the probability of $x$ being selected for $x \in D^*$.
Step 1: Learning algorithm NG is performed on $D^*$, with data selected according to $p(x)$. As a result, the set $C$ of reference vectors for $D^*$ is determined, where $|C| = n$.
Step 2: Each value for the center parameters is assigned to a reference vector. Let

$$b_{ij} = \frac{1}{n_i} \sum_{x_k \in C_i} (c_{ij} - x_{kj})^2 \qquad (15)$$

where $C_i$ and $n_i$ are the set and the number of learning data belonging to the $i$th cluster, $C = \bigcup_{i=1}^{r} C_i$, and $n = \sum_{i=1}^{r} n_i$.
As a result, the center and width parameters are determined from Algorithm Center(p).
The learning algorithm using Algorithm Center(p) is as follows (see Figure 2(b)):
Step 1: Initialize().
Step 2: Center and width parameters are determined from Algorithm Center(p) and the set $D^*$.
Step 3: Parameters c, b, and w are updated using Algorithm SDM (c, b, w).
Step 4: If E(t) ≤ θ, then the algorithm terminates; else go to Step 3 with n ← n + 1 and t ← t + 1.
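The width computation in Step 2 of the center procedure can be sketched as follows, once reference vectors are available. This sketch assumes that each width is the mean squared deviation of the data in a cluster from its center along each coordinate; that form, the function name `widths_from_clusters`, and the small `floor` that keeps widths away from zero are assumptions made here, not details confirmed by the source.

```python
import numpy as np

def widths_from_clusters(X, C, floor=1e-3):
    """Assign each row of X to its nearest center and compute widths.

    X: (P, m) input data; C: (n, m) centers. Returns (b, labels), where
    b has shape (n, m) and labels has shape (P,).
    """
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)  # (P, n)
    labels = d.argmin(axis=1)
    b = np.full(C.shape, floor)
    for i in range(len(C)):
        Xi = X[labels == i]
        if len(Xi) > 0:
            # Assumed width: mean squared deviation from the center.
            b[i] = np.maximum(((Xi - C[i]) ** 2).mean(axis=0), floor)
    return b, labels

X = np.array([[0.0], [2.0], [10.0], [12.0]])
C = np.array([[1.0], [11.0]])
b, labels = widths_from_clusters(X, C)
```

Empty clusters keep the floor value, which avoids a division by zero when the widths are later used inside the Gaussian membership functions.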

The probability distribution of input data based on the rate of change of output
It is known that, in fuzzy modeling, many rules are needed at or near the places where the output data change quickly. How, then, can we find the rate of output change? The probability $p_M(x)$ is one way to capture it. As shown in Eqs. (16) and (17), input data where the output changes quickly are selected with high probability, and input data where the output changes slowly are selected with low probability, where $M$ is the size of the range over which the output change is considered.
Based on the literature [13], the probability (distribution) is defined as follows:
Algorithm Prob($p_M(x)$)
Step 1: Given an input data $x^i \in D^*$, determine the neighborhood ranking $(x^{i_0}, x^{i_1}, \ldots, x^{i_{P-1}})$ of the data in $D^*$ in increasing order of distance from $x^i$, where $x^{i_0} = x^i$.
Step 2: Determine $H(x^i)$, which shows the rate of output change for input data $x^i$, by the following equation:

$$H(x^i) = \sum_{l=1}^{M} \frac{|y_i - y_{i_l}|}{\|x^i - x^{i_l}\|} \qquad (16)$$

where $x^{i_l}$ for $l \in Z_M$ means the $l$th neighborhood ranking of $x^i$, $i \in Z_P$, and $y_i$ and $y_{i_l}$ are the outputs for inputs $x^i$ and $x^{i_l}$, respectively. The number $M$ means the range considered in $H(x)$.
Step 3: Determine the probability $p_M(x^i)$ for $x^i$ by normalizing $H(x^i)$ as follows:

$$p_M(x^i) = \frac{H(x^i)}{\sum_{j=1}^{P} H(x^j)} \qquad (17)$$

See Ref. [19] for a detailed explanation using an example of $p_M(x)$.
Using $p_M(x)$, Kishida has proposed the following learning algorithm [13] (see Figure 2(c)):
Step 1: Initialize().
Step 2: The probability $p_M(x)$ is obtained from Algorithm Prob($p_M(x)$).
Step 3: Center and width parameters are determined using $p_M(x)$ from Algorithm Center(p) and the data set D.
Step 4: Parameters c, b, and w are updated using Algorithm SDM (c, b, w).
Step 5: If E(t) ≤ θ, then the algorithm terminates; else go to Step 3 with n ← n + 1 and t = 1.
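The selection probability based on the rate of output change can be sketched as follows. The sketch assumes that the rate for a data point is the sum, over its M nearest neighbours in input space, of the absolute output difference divided by the input distance, normalized over all data; the function name `output_change_probability` and the guard against zero distances are assumptions made here.

```python
import numpy as np

def output_change_probability(X, y, M):
    """Selection probability for each data point from local output change.

    X: (P, m) inputs; y: (P,) outputs; M: neighbourhood size.
    """
    P = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # (P, P)
    H = np.zeros(P)
    for i in range(P):
        order = np.argsort(d[i], kind="stable")
        order = order[order != i][:M]   # M nearest neighbours, excluding i
        # |output difference| / input distance, summed over the neighbours.
        H[i] = np.sum(np.abs(y[i] - y[order]) / np.maximum(d[i, order], 1e-12))
    # Note: H.sum() is zero when the output is constant everywhere.
    return H / H.sum()

# Step function on [0, 1]: the output jumps between x = 0.4 and x = 0.5.
X = np.linspace(0.0, 1.0, 11)[:, None]
y = (X[:, 0] >= 0.5).astype(float)
p = output_change_probability(X, y, M=2)
```

In this toy example all of the probability mass falls on the two points on either side of the jump, which is the behaviour the text describes: data near rapid output changes are selected most often.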

Determination of weight parameters using the generalized inverse method
The optimum values of the parameters c and b are determined by using p M (x). How, then, can we decide the weight parameters w? We can determine them by treating c, b, and w as an interpolation problem. That is, the membership values of the antecedent part of the rules are computed from c and b, and the weight parameters w are then determined by solving the interpolation problem. This approach has previously been used to determine the weight parameters of RBF networks [1].
Let us explain fuzzy inference systems and the interpolation problem using the generalized inverse method [1]. This problem can be stated mathematically as follows: Given $P$ points $\{x^p \mid p \in Z_P\}$ and $P$ real numbers $\{y_p^r \mid p \in Z_P\}$, find a function $f: R^m \to R$ such that the following conditions are satisfied:

$$f(x^p) = y_p^r, \quad p \in Z_P$$

In fuzzy modeling, this problem is solved as follows:

$$f(x^p) = \sum_{i=1}^{n} w_i \phi_{pi}, \qquad \phi_{pi} = \frac{\mu_i(x^p)}{\sum_{k=1}^{n} \mu_k(x^p)}$$

where $\mu_i$ and $M_{ij}$ are defined as Eqs. (2) and (4).
Let $P = n$ and $x^i = c_i$. The width parameters are determined by Eq. (15). Then, if $\phi_{pi}$ is suitably selected as a Gaussian function, the solution of weights $w$ is obtained as

$$w = \Phi^{-1} y^r$$

where $\Phi = (\phi_{pi})$ and $y^r = (y_1^r, \ldots, y_P^r)^T$.
Let us consider the case $n < P$. This is the realistic case. The optimum solution $w^*$ that minimizes $E = \|y^r - \Phi w\|^2$ can be obtained as follows:

$$w^* = \Phi^+ y^r, \qquad E_{\min} = (y^r)^T (I - \Psi)\, y^r \qquad (20)$$

where $\Phi^+ \triangleq [\Phi^T \Phi]^{-1} \Phi^T$, $\Psi \triangleq \Phi \Phi^+$, and $I$ is the identity matrix of size $P \times P$.
The matrix $\Phi^+$ is called the generalized inverse of $\Phi$. The method using $\Phi^+$ to determine the weights is called the generalized inverse method (GIM).
Using GIM, a decision procedure for the parameters is defined as follows:
Algorithm Weight(c, b)
Input: $D = \{(x^p, y_p^r) \mid p \in Z_P\}$
Output: The weight parameters $w$
Step 1: Calculate $\mu_i$ based on Eq. (2).
Step 2: Calculate the matrices $\Phi$ and $\Phi^+$ using Eq. (20).
Step 3: Determine the weight vector $w$ as follows:

$$w = \Phi^+ y^r$$
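The weight determination can be sketched with NumPy's pseudo-inverse. This is a minimal sketch under assumptions: the function names `design_matrix` and `gim_weights`, the use of normalized rule activations as the entries of the matrix, and the toy centers and widths are chosen here for illustration. `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse, which coincides with the product of the inverse of (Phi^T Phi) and Phi^T when Phi has full column rank.

```python
import numpy as np

def design_matrix(X, c, b):
    """Matrix of normalized rule activations, shape (P, n)."""
    g = np.exp(-0.5 * ((X[:, None, :] - c) / b) ** 2)   # (P, n, m)
    mu = g.prod(axis=2)                                  # (P, n)
    return mu / mu.sum(axis=1, keepdims=True)

def gim_weights(X, y, c, b):
    """Least-squares consequent weights for fixed centers and widths."""
    Phi = design_matrix(X, c, b)
    return np.linalg.pinv(Phi) @ y

# Toy check: generate outputs from known weights and recover them.
X = np.linspace(0.0, 1.0, 20)[:, None]
c = np.array([[0.2], [0.5], [0.8]])
b = np.full((3, 1), 0.2)
w_true = np.array([1.0, -2.0, 0.5])
y = design_matrix(X, c, b) @ w_true
w_hat = gim_weights(X, y, c, b)
```

Using the pseudo-inverse instead of forming the normal equations explicitly is a numerical convenience; it also returns the minimum-norm solution when the activation matrix is rank deficient.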

The relation between the proposed algorithm and related works
Let us explain the relation between the proposed method and related works using Figure 2.
1. The fundamental flow of Algorithm A is shown in Figure 2(a). The initial parameters c, b, and w are set randomly, and all parameters are updated using SDM until the inference error becomes sufficiently small (see Figure 2(a)) [1].
2. The first method using VQ is the one in which both the initial assignment of parameters and the assignment of parameters in each iterating step (see the outer loop of Figure 2(b)) are determined by NG using D * . That is, it is a learning method composed of two stages. The center parameters c are determined from D * by VQ, b is computed by Eq. (15) from the resulting center parameters, and the weight parameters w are set to the results of SDM, where the initial values of w are set randomly. Further, all parameters are updated using SDM for a fixed number of learning steps. In the iterating process, the parameters obtained by SDM are used as the initial ones of the next process. The outer iterating process is repeated until the inference error becomes sufficiently small (see Figure 2(b)).
3. The second method using VQ is the same as the first one except that learning data are selected based on p M (x) (see Figure 2(c)). That is, the center parameters c are determined by p M (x) using both input and output learning data.
4. The third learning method using VQ is the one in which the parameters w are determined using GIM after the parameters c and b are determined by VQ using p M (x), and all parameters are then updated based on SDM. That is, it is a learning method composed of three phases. In the first phase, the center parameters c are determined using the probability p M (x), and b is computed from the resulting center parameters. In the second phase, the weight parameters w are determined by solving the interpolation problem using GIM. In the third phase, all parameters are updated using SDM for a fixed number of learning steps. In the iterating process, the result of SDM is used as the initial parameters of the next process based on hill climbing. The outer process is repeated until the inference error becomes sufficiently small (see Figure 2(d)).
5. The fourth method is the same as the third method except that p M (x) is also used in the learning process of SDM (see Figure 2(d')). This is the method proposed in this paper.

The proposed learning method using VQ
Let us explain the detailed algorithm of Figure 2(d'). The method is called Learning Algorithm D'. It is composed of four techniques as follows:
1. Determine the initial assignment of c using the probability p M (x).
2. Determine the assignment of weight parameters w by solving the interpolation problem using GIM.
3. Iterate the processes (1) and (2) and the learning steps of SDM using p M (x).
4. Determine the optimum value of M by the hill climbing method [16].
The general scheme of the proposed method is shown in Figure 3, where c min , b min , and w min are the optimal parameters for c, b, and w. The proposed method of Figure 3 consists of five phases. In the first phase, all values for the algorithm are initialized. In the second phase, the probability p M (x) is determined for the size M of the range. In the third phase, the parameters c are determined by NG using p M (x), and the parameters b are computed from c. In the fourth phase, the parameters w are determined by Algorithm Weight(c, b). In the fifth phase, all parameters are updated by SDM using p M (x). The optimum number n * of rules and the optimum size M * of the range are determined as shown in Figure 4. That is, M is adjusted for a fixed number n, and the values n * and M * that give the minimum MSE are determined. Learning Algorithm D is the same as Learning Algorithm D' except for the step marked with the symbol "*" in Figure 3: in the learning steps of SDM for Learning Algorithm D, learning data are selected randomly (see Figure 2(d)).
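Likewise, we also propose improved methods for the algorithms of Figure 2(a), (b), and (c), denoted (a'), (b'), and (c'), in which the learning data in SDM are selected based on p M (x).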

Numerical simulations
In order to compare the ability of Learning Algorithms (a'), (b'), (c'), and (d') with Learning Algorithms (a), (b), (c), and (d), numerical simulations for function approximation and pattern classification are performed.

Function approximation
The systems are identified by fuzzy inference systems. This simulation uses four systems specified by the functions of Eqs. (25)-(28), with the two-dimensional input space $[0, 1]^2$ and one output with the range [0, 1]. In this simulation, T max1 = 100000 and T max2 = 50000 for (a), and T max1 = 10000 and T max2 = 5000 for (b), (c), and (d); θ = 1.0 × 10^{-4}, K 0 = 100, K max = 190, K = 10, K c = 0.01, K b = 0.01, and K w = 0.1; the number of learning data is 200, and the number of test data is 2500. Table 1 shows the results of the simulation: the number of rules, the MSEs for learning and test, and the learning time (seconds), where the number of rules is the one at which the threshold θ of inference error is achieved in learning. Each result is the average over 20 trials. The results of (a'), (b'), (c'), and (d') are almost the same as those of (a), (b), (c), and (d), as shown in Table 1. It seems that there is no difference in ability for the regression problem.

Classification problems for UCI database
The Iris, Wine, Sonar, and BCW data from the UCI database shown in Table 2 are used in the second numerical simulation [20]. In this simulation, fivefold cross validation is used. As the initial conditions for the classification problem, K c = 0.001, K b = 0.001, K w = 0.05, ε init = 0.1, ε fin = 0.01, and λ = 0.7 are used. Further, T max = 50000, M = 100, and θ = 1.0 × 10^{-2} are used for Iris and Wine; T max = 50000, M = 200, and θ = 2.0 × 10^{-2} for BCW; and T max = 5000, M = 100, and θ = 5.0 × 10^{-2} for Sonar. Table 3 shows the results for the classification problem: the number of rules and the RMs for learning and test data, where RM means the rate of misclassification. The results of (a'), (b'), (c'), and (d') are superior in the number of rules to those of (a), (b), (c), and (d), as shown in Table 3. It seems that there is a difference in ability for pattern classification.
Table 2
| | Iris | Wine | BCW | Sonar |
|---|---|---|---|---|
| The number of inputs | 4 | 13 | 9 | 60 |
| The number of classes | 3 | 3 | 2 | 2 |

Let us consider why good results are obtained by using the probability p M (x). In the conventional learning method, parameters are updated by data selected randomly from the set of learning data. In the proposed method, parameters are updated by data selected based on the probability p M (x). The probability p M (x) is determined based on the output change of the learning data, so many fuzzy rules are likely to be generated at or near the places where the output change is large for the set of learning data.
For example, if the number of learning steps is 100 and p M (x 0 ) = 0.5, then the learning data x 0 is selected about 50 times on average from the set of learning data. As a result, membership functions are likely to be generated at or near the places where the output change is large. The probability p M (x) is thus used as a way to improve the local search of SDM.
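The selection of learning data according to a probability distribution can be illustrated with NumPy's random generator. The probability vector below is an illustrative assumption (four data points, the first with probability 0.5), not data from the simulations; the count for the first point fluctuates around 50 rather than being exactly 50.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative selection probabilities over four learning data.
p = np.array([0.5, 0.2, 0.2, 0.1])

# 100 learning steps: each step draws one data index according to p.
draws = rng.choice(len(p), size=100, p=p)
counts = np.bincount(draws, minlength=len(p))

# With many more steps the selection frequency approaches p itself.
many = rng.choice(len(p), size=100_000, p=p)
freq = np.bincount(many, minlength=len(p)) / len(many)
```

Replacing uniform selection in the SDM loop by a draw of this kind is the only change needed to bias the updates toward data where the output changes quickly.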

Conclusion
In this paper, we proposed improved methods using VQ, GIM, and SDM. The features of the proposed methods are as follows:
1. In determining the initial assignment of parameters, both the input and output parts of the learning data are used.
2. The initial assignment of weight parameters is determined by GIM.
3. Hill climbing is used to determine the range of the rate of output change.
4. The learning data in SDM are selected based on the probability distribution p M (x), which considers both the input and output of the learning data.
As a result, it was shown in numerical simulations of pattern classification that the proposed methods, which use a probability distribution considering both the input and output parts of the learning data, are superior to other methods in the number of rules.
In future work, we will consider new ideas using VQ and apply the proposed method to control problems.