Firefly Meta-Heuristic Algorithm for Training the Radial Basis Function Network for Data Classification and Disease Diagnosis



Introduction
The radial basis function (RBF) network is a type of neural network that uses a radial basis function as its activation function (Ou, Oyang & Chen, 2005). Because of their good approximation capabilities, simple network structure and fast learning speed, RBF networks have attracted considerable attention in many science and engineering fields. Horng (2010) used the RBF network for multi-class classification of supraspinatus ultrasonic images. Korurek & Dogan (2010) used RBF networks for ECG beat classification. Wu, Warwick, Jonathan, Burgess, Pan & Aziz (2010) applied RBF networks to the prediction of Parkinson's disease tremor onset. Feng & Chou (2011) used the RBF network to predict financial time series data. Although the RBF network can be applied effectively, the number of neurons in its hidden layer affects both the network complexity and the generalization capability of the network. If the number of hidden neurons is insufficient, training fails to converge correctly; if it is too high, over-learning may occur. Furthermore, the position of the center of each hidden neuron and the spread parameter of its activation function also affect the network performance considerably. Determining these three hidden-layer parameters, namely the number of neurons, the center position of each neuron and the spread parameter of its activation function, is therefore very important.
Several algorithms have been proposed to train the parameters of the RBF network for classification. The gradient descent (GD) algorithm (Karayiannis, 1999) is the most popular method for training the RBF network. It is a derivative-based optimization algorithm that is used to search for a local minimum of a function. The algorithm takes steps proportional to the negative of the gradient of the function at the current point. Many global optimization methods have also been proposed to evolve RBF networks. The genetic algorithm (GA) is a popular method for finding approximate solutions to optimization and search problems. The three genetic operations of selection, crossover and mutation are the main aspects of the GA and evolve the optimal solution from an initial population. Barreto, Barbosa & Ebecken (2002) used the real-coded genetic algorithm to determine the centers of hidden neurons, spread and bias parameters by minimizing the mean square error between the desired outputs and actual outputs. Particle swarm optimization (PSO) is a swarm intelligence technique, first introduced by Kennedy & Eberhart (2007), inspired by the social behavior of bird flocks or fish schools. The computation of the PSO algorithm depends on each particle's local best solution (up to the point of evaluation) and the swarm's global best solution. Every particle has a fitness value, which is evaluated by the fitness function to be optimized, and a velocity, which directs the trajectory of the particle. Feng (2006) encoded the centers, the spread of each radial basis function and the connection weights as a particle, and then applied the PSO algorithm to search for the optimal solution for constructing the RBF network for classification. Kurban & Besdok (2009) proposed using the artificial bee colony (ABC) algorithm to estimate the weights, spread, bias and center parameters. Their study concluded that the ABC algorithm is superior to the GA, PSO and GD algorithms.
The firefly algorithm is a new swarm-based approach for optimization, in which the search algorithm is inspired by the social behavior of fireflies and the phenomenon of bioluminescent communication. There are two important issues in the firefly algorithm: the variation of light intensity and the formulation of attractiveness. Yang (2008) simplified the formulation by assuming that the attractiveness of a firefly is determined by its brightness, which in turn is associated with the encoded objective function. The attractiveness is proportional to the brightness. Furthermore, every member $x_i$ of the firefly swarm is characterized by its brightness $I_i$, which can be directly expressed as the inverse of a cost function for a minimization problem. Lukasik & Zak (2009) applied the firefly algorithm to continuous constrained optimization. Yang (2010) compared the firefly algorithm with other meta-heuristic algorithms, such as the genetic and particle swarm optimization algorithms, in multimodal optimization. These works reached the same conclusion: the firefly algorithm is superior to the two existing meta-heuristic algorithms.
In this chapter, a firefly algorithm for training the RBF network is introduced, and its performance is compared with that of the conventional GD, GA, PSO and ABC algorithms on classification problems from the UCI repository. Furthermore, receiver operating characteristic analysis is used to evaluate diagnostic performance on the medical datasets. Conclusions are drawn in the final section.

Radial basis function network
Neural networks are non-linear statistical data modeling tools that can be used to model complex relationships between inputs and outputs or to find patterns in a dataset. The radial basis function network is a popular type of network that is very useful for pattern classification (Bishop, 1995). A radial basis function (RBF) network can be considered a special three-layered network, as shown in Fig. 1.

Fig. 1. The structure of radial basis function network

The input nodes pass the input values x to the internal nodes that form the hidden layer. Each unit of the hidden layer implements a specific activation function called a radial basis function. The nonlinear responses of the hidden nodes are weighted in order to calculate the final outputs of the network in the output layer. The input layer of this network has m units for m-dimensional input vectors. The input units are fully connected to the I hidden layer units, which are in turn fully connected to the J output layer units, where J is the number of output units. Each neuron of the hidden layer has a mean vector parameter called its center.
Figure 1 shows the detailed structure of an RBF network. Each input x with m dimensions, $x = (x_1, x_2, \ldots, x_m)$, is presented to the input layer, which broadcasts it to the hidden layer. The hidden layer has I neurons, and each neuron computes the distance between its center and the input. The activation function of each hidden neuron is chosen to be Gaussian and is characterized by its mean vector $c_i$ and its spread parameter $\sigma_i$ (i = 1, 2, …, I). That is, the activation function $\phi_i(x)$ of the i-th hidden unit for an input vector x is given by:

$$\phi_i(x) = \exp\left(-\frac{\|x - c_i\|^2}{2\sigma_i^2}\right) \qquad (1)$$

The spread $\sigma_i$ affects the smoothness of the mapping. The output value $y_j$ of neuron j of the output layer for a training sample x is given by o(x) in (2):

$$y_j = o_j(x) = \sum_{i=1}^{I} w_{ij}\,\phi_i(x) + \beta_j \qquad (2)$$

The weight $w_{ij}$ (i = 1, 2, …, I; j = 1, 2, …, J) connects the output of the i-th hidden node to the j-th node of the output layer, and $\beta_j$ is the bias parameter of the j-th output node, determined by the RBF network training procedure. In practice, the training procedure of the RBF network is to find adequate parameters $w_{ij}$, $\sigma_i$, $\beta_j$ and $c_i$ such that an error metric such as the mean square error (MSE) is minimized.
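The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration rather than the chapter's implementation: it assumes the common Gaussian form exp(-||x - c_i||^2 / (2 sigma_i^2)) for the hidden activation, and the function names and the toy network values are hypothetical.

```python
import math

def gaussian_activation(x, center, spread):
    """Gaussian hidden-unit response (assumed form): exp(-||x - c||^2 / (2 * sigma^2))."""
    sq_dist = sum((xk - ck) ** 2 for xk, ck in zip(x, center))
    return math.exp(-sq_dist / (2.0 * spread ** 2))

def rbf_forward(x, centers, spreads, weights, biases):
    """Output values y_j for one input vector x.

    centers: I vectors of length m; spreads: I values;
    weights[i][j]: hidden unit i -> output unit j; biases: J values.
    """
    hidden = [gaussian_activation(x, c, s) for c, s in zip(centers, spreads)]
    return [
        sum(h * w_row[j] for h, w_row in zip(hidden, weights)) + biases[j]
        for j in range(len(biases))
    ]

# Toy network (illustrative values): m = 2 inputs, I = 2 hidden units, J = 1 output.
centers = [[0.0, 0.0], [1.0, 1.0]]
spreads = [1.0, 1.0]
weights = [[1.0], [-1.0]]
biases = [0.5]
y = rbf_forward([0.0, 0.0], centers, spreads, weights, biases)
```

For a classification task, the predicted class would typically be taken as the index of the largest output; that decision rule is an assumption here, not something the text specifies.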

Training algorithms: GD, GA, PSO, ABC and FA
This section gives brief descriptions of the training algorithms for the RBF network: the gradient descent (GD) algorithm, the genetic algorithm (GA), the particle swarm optimization (PSO) algorithm, the artificial bee colony (ABC) algorithm and the firefly algorithm (FA).

Gradient Descent (GD) algorithm
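GD is a derivative-based optimization algorithm (Karayiannis, 1999) that is used to search for a local minimum of a function. The algorithm takes steps proportional to the negative of the gradient of the function at the current point. In the output layer of the RBF network, the weight matrix is represented as W and the matrix of hidden-layer activations $\phi$ as H.
The GD algorithm can be implemented to minimize the MSE term defined in equation (3),

$$MSE = \frac{1}{N}\sum_{i=1}^{N} \left\| d(x_i) - o(x_i) \right\|^2 \qquad (3)$$

where $d(x_i)$ and $o(x_i)$ denote the desired output vector and the actual output vector for training sample $x_i$, and N is the number of training samples. Each trainable parameter $\theta \in \{w_{ij}, \beta_j, c_i, \sigma_i\}$ is updated along the negative gradient of the MSE:

$$\theta(t+1) = \theta(t) - \eta\,\frac{\partial MSE}{\partial \theta}$$

where $\eta$ is the learning rate.

Genetic Algorithm (GA)
The genetic algorithm (Goldberg, 1989), inspired by evolutionary biology, is a popular method for finding approximate solutions to optimization and search problems. In the genetic algorithm, a population of strings called chromosomes, which encode candidate solutions to an optimization problem, evolves toward better solutions. The evolution usually starts from a population of randomly generated individuals and proceeds in generations. In each generation, the fitness of every individual in the population is evaluated, and multiple individuals are stochastically selected from the current population based on their fitness and then modified (recombined and possibly randomly mutated) to form a new population. The new population is then used in the next iteration of the algorithm.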
Commonly, the algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population. If the algorithm has terminated because the maximum number of generations was reached, a satisfactory solution may or may not have been found. The three genetic operations of selection, crossover and mutation are the main aspects of the GA and evolve the optimal solution from an initial population. Barreto, Barbosa & Ebecken (2002) used the real-coded genetic algorithm to determine the centers of hidden neurons, spread and bias parameters by minimizing the MSE between the desired outputs and actual outputs.
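The generational loop of selection, crossover and mutation can be sketched as follows. This is only an illustrative real-coded GA, not the configuration used in the experiments: binary tournament selection, arithmetic crossover, Gaussian mutation and every parameter value are assumptions chosen for brevity, and `evolve` and `neg_sphere` are hypothetical names.

```python
import random

def evolve(fitness, dim, pop_size=20, generations=50,
           crossover_rate=0.8, mutation_rate=0.1, seed=0):
    """Real-coded GA sketch (maximization). Returns the best individual seen."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def select():
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.choice(pop), rng.choice(pop)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < crossover_rate:
                # Arithmetic crossover: random convex combination of the parents.
                lam = rng.random()
                child = [lam * a + (1.0 - lam) * b for a, b in zip(p1, p2)]
            else:
                child = list(p1)
            # Gaussian mutation applied gene by gene.
            child = [g + rng.gauss(0.0, 0.1) if rng.random() < mutation_rate else g
                     for g in child]
            new_pop.append(child)
        pop = new_pop
        best = max(pop + [best], key=fitness)
    return best

def neg_sphere(v):
    """Toy fitness: larger is better, maximum 0 at the origin."""
    return -sum(g * g for g in v)

initial_best = evolve(neg_sphere, dim=3, generations=0)
final_best = evolve(neg_sphere, dim=3, generations=50)
```

When training an RBF network, the chromosome would hold the network parameters and the fitness would be derived from the MSE; the toy fitness above only stands in for that.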

Particle Swarm Optimization (PSO) algorithm
The particle swarm optimization (PSO) algorithm, first introduced by Kennedy & Eberhart (1995), is a swarm optimization method that optimizes a problem by iteratively trying to improve candidate solutions called particles. The position of particle i in a D-dimensional search space is described as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$. Like the position, the velocity of particle i can be described as $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$. The movement of particle i at iteration t+1 follows Eqs. (8) and (9):

$$v_i^{t+1} = v_i^{t} + c_1 r_1 \left(p_i^{t} - x_i^{t}\right) + c_2 r_2 \left(g^{t} - x_i^{t}\right) \qquad (8)$$

$$x_i^{t+1} = x_i^{t} + v_i^{t+1} \qquad (9)$$

where $p_i^{t}$ is the local best solution of particle i, $g^{t}$ is the global best solution of the swarm, $c_1$ indicates the cognition learning factor, $c_2$ indicates the social learning factor, and $r_1$ and $r_2$ are random numbers in (0, 1). Feng (2006) encoded the centers, the spread of each radial basis function and the connection weights as a particle, and then applied the PSO algorithm to search for the optimal solution for constructing the RBF network for classification.
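The velocity and position updates can be sketched as below. This is a hedged illustration under stated assumptions: the update has no inertia weight (only the learning factors and random numbers named in the text), fresh random numbers are drawn per dimension, and the velocity clamp `v_max` is an added safeguard that the text does not mention. The names `pso_minimize` and `sphere` are hypothetical.

```python
import random

def pso_minimize(cost, dim, n_particles=15, iterations=100,
                 c1=2.0, c2=2.0, v_max=0.5, seed=0):
    """PSO sketch (minimization). Returns the swarm's global best position."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [list(x) for x in xs]            # each particle's local best
    gbest = list(min(pbest, key=cost))       # swarm's global best
    for _ in range(iterations):
        for i in range(n_particles):
            for k in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (vs[i][k]
                     + c1 * r1 * (pbest[i][k] - xs[i][k])   # cognition term
                     + c2 * r2 * (gbest[k] - xs[i][k]))     # social term
                vs[i][k] = max(-v_max, min(v_max, v))       # clamp (assumption)
                xs[i][k] += vs[i][k]                        # position update
            if cost(xs[i]) < cost(pbest[i]):
                pbest[i] = list(xs[i])
                if cost(pbest[i]) < cost(gbest):
                    gbest = list(pbest[i])
    return gbest

def sphere(v):
    """Toy cost function with minimum 0 at the origin."""
    return sum(g * g for g in v)

start = pso_minimize(sphere, dim=2, iterations=0)
found = pso_minimize(sphere, dim=2, iterations=100)
```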

www.intechopen.com
Theory and New Applications of Swarm Intelligence 120

Artificial Bee Colony (ABC) algorithm
The artificial bee colony (ABC) algorithm was applied by Kurban and Besdok (2009) to train the RBF network. In the ABC algorithm, the colony of artificial bees contains three groups of bees: employed bees, onlookers and scouts. The employed bees bring loads of nectar from the food source to the hive and share information about the food source, with a certain probability, by dancing in the dancing area of the hive. The onlooker bees wait in the dancing area and select a food source depending on the probability delivered by the employed bees. The computation of this probability is based on the nectar amounts of the food sources. The third kind of bee is the scout bee, which carries out random searches for new food sources. The employed bee of an abandoned food source becomes a scout, and as soon as it finds a new food source it becomes employed again. In other words, each search cycle of the ABC algorithm contains three steps. First, the employed bees are sent to their food sources and the amounts of nectar are evaluated. After this information about the nectar is shared, the onlooker bees select food source regions and evaluate the amount of nectar in them. The scout bees are then chosen and sent out to find new food sources.
In the ABC algorithm, the position of a food source $z_i$ represents a possible solution to the optimization problem, and the amount of nectar in a food source corresponds to the fitness $fit(z_i)$ of the corresponding solution $z_i$. In training the RBF network, a solution $z_i$ is made up of the weights, spread, bias and center vectors of the RBF network. The number of employed or onlooker bees is generally equal to the number of solutions in the population. Initially, the ABC algorithm randomly produces a distributed initial population P of SN solutions, where SN denotes the number of employed bees or onlooker bees. Each solution $z_i$ (i = 1, 2, …, SN) is a D-dimensional vector, where D is the number of optimization parameters. In each execution cycle C (C = 1, 2, …, MCN), the population of solutions is subjected to the search processes of the employed, onlooker and scout bees. An employed bee modifies its possible solution depending on the amount of nectar (fitness value) of the new source (new solution) by using Eq. (10):

$$v_{ij} = z_{ij} + \varphi_{ij}\,(z_{ij} - z_{kj}) \qquad (10)$$

where $k \in \{1, 2, \ldots, SN\}$ with $k \neq i$ and $j \in \{1, 2, \ldots, D\}$ are randomly chosen indices and $\varphi_{ij}$ is a random number in [-1, 1]. If there is more nectar in the new solution than in the previous one, the bee remembers the new position and forgets the old one; otherwise it retains the previous location. When all employed bees have finished this search process, they deliver the nectar information and the positions of the food sources to the onlooker bees, each of whom chooses a food source according to a probability proportional to the amount of nectar in that food source. The probability $p_i$ of selecting a food source $z_i$ is determined using Eq. (11):

$$p_i = \frac{fit(z_i)}{\sum_{n=1}^{SN} fit(z_n)} \qquad (11)$$

In practical terms, a random number in [0, 1] is generated sequentially for each food source $z_i$ (i = 1, 2, …, SN); if this number is less than $p_i$, an onlooker bee is sent to food source $z_i$ and produces a new solution based on Eq. (10). If the fitness of the new solution is higher than that of the old one, the onlooker memorizes the new solution and shares this information with the other onlooker bees. Otherwise, the new solution is discarded. The process is repeated until all onlookers have been distributed to the food sources and have produced the corresponding new solutions.
If the position of a food source cannot be improved within a predetermined number of trials, called the "limit", then the food source $z_i$ is abandoned and its employed bee becomes a scout. Assume that the abandoned source is $z_i$ and $j \in \{1, 2, \ldots, D\}$; the scout then discovers a new food source to replace $z_i$. This operation can be defined as in (12):

$$z_i^{j} = z_{min}^{j} + rand(0, 1)\,\left(z_{max}^{j} - z_{min}^{j}\right) \qquad (12)$$

where $z_{min}^{j}$ and $z_{max}^{j}$ are the lower bound and upper bound of the j-th component of all solutions. If the new solution is better than the abandoned one, the scout becomes an employed bee. The selection of employed bees, onlooker bees and scouts is repeated until the termination criteria have been satisfied.
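One way to read the three phases of a search cycle is the sketch below. It is a simplified illustration under explicit assumptions: the neighbour move perturbs one random component relative to another randomly chosen food source (the usual ABC move, assumed here), the onlooker phase makes a single pass over the sources, an abandoned source is replaced unconditionally, and the fitness is assumed to be positive so that the selection probabilities are valid. The names `abc_maximize` and `toy_fitness` are hypothetical.

```python
import random

def abc_maximize(fit, dim, sn=10, mcn=100, limit=20, lo=-1.0, hi=1.0, seed=0):
    """ABC sketch with SN food sources, MCN cycles and an abandonment limit."""
    rng = random.Random(seed)
    foods = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(sn)]
    trials = [0] * sn                      # failed improvement attempts per source
    best = list(max(foods, key=fit))

    def try_neighbour(i):
        # Perturb one random component j relative to a random other source k.
        k = rng.choice([n for n in range(sn) if n != i])
        j = rng.randrange(dim)
        cand = list(foods[i])
        cand[j] += rng.uniform(-1.0, 1.0) * (foods[i][j] - foods[k][j])
        if fit(cand) > fit(foods[i]):      # greedy selection: keep the richer source
            foods[i], trials[i] = cand, 0
        else:
            trials[i] += 1

    for _ in range(mcn):
        for i in range(sn):                # employed bee phase
            try_neighbour(i)
        total = sum(fit(z) for z in foods)
        probs = [fit(z) / total for z in foods]   # proportional to nectar amount
        for i in range(sn):                # onlooker bee phase (single pass)
            if rng.random() < probs[i]:
                try_neighbour(i)
        best = list(max(foods + [best], key=fit))
        for i in range(sn):                # scout bee phase
            if trials[i] > limit:
                foods[i] = [lo + rng.random() * (hi - lo) for _ in range(dim)]
                trials[i] = 0
    return best

def toy_fitness(z):
    """Positive toy fitness with maximum 1 at the origin."""
    return 1.0 / (1.0 + sum(g * g for g in z))

first = abc_maximize(toy_fitness, dim=2, mcn=0)
last = abc_maximize(toy_fitness, dim=2, mcn=50)
```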

Firefly Algorithm
The firefly algorithm (FA) was developed by Xin-She Yang at Cambridge University in 2008. In the firefly algorithm, there are three idealized rules: (1) all fireflies are unisex, so that one firefly will be attracted to other fireflies regardless of their sex; (2) attractiveness is proportional to brightness, so for any two flashing fireflies, the less bright one will move towards the brighter one, and if no firefly is brighter than a particular firefly, it will move randomly; (3) the brightness of a firefly is affected or determined by the landscape of the fitness function. For a maximization problem, the brightness I of a firefly at a particular location x can be chosen as I(x), proportional to the value of the fitness function.
As the firefly attractiveness, one should select any monotonically decreasing function of the distance $r_{i,j} = d(x_j, x_i)$ to the chosen j-th firefly, e.g. the exponential function

$$r_{i,j} = \|x_i - x_j\|, \qquad \beta = \beta_0\, e^{-\gamma r_{i,j}} \qquad (13)$$

where $\beta_0$ is the attractiveness at $r_{i,j} = 0$ and $\gamma$ is the light absorption coefficient at the source.
The movement of a firefly i that is attracted to another, more attractive firefly j is determined by

$$x_i \leftarrow (1 - \beta)\,x_i + \beta\,x_j + \alpha\left(rand_1 - \tfrac{1}{2}\right)$$

The particular firefly $x_{i^{max}}$ with maximum fitness will move randomly according to the following equation:

$$x_{i^{max}} \leftarrow x_{i^{max}} + \alpha\left(rand_2 - \tfrac{1}{2}\right)$$

where $\alpha$ is the randomization parameter and $rand_1$ and $rand_2$ are random vectors whose elements are drawn from the uniform distribution on the range from 0 to 1.
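The attraction step can be illustrated with two small helper functions. The sketch assumes an exponential attractiveness beta0 * exp(-gamma * r) in the Euclidean distance r and a move that blends the two positions as (1 - beta) * x_i + beta * x_j plus a uniform perturbation alpha * (rand - 1/2); these forms, the default parameter values and the function names are assumptions made for illustration.

```python
import math
import random

def attractiveness(xi, xj, beta0=1.0, gamma=1.0):
    """Attractiveness of firefly j as seen from firefly i (assumed exponential form)."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
    return beta0 * math.exp(-gamma * r)

def move_towards(xi, xj, alpha=0.1, beta0=1.0, gamma=1.0, rng=random):
    """Move firefly i toward the brighter firefly j with a small random perturbation."""
    beta = attractiveness(xi, xj, beta0, gamma)
    return [(1.0 - beta) * a + beta * b + alpha * (rng.random() - 0.5)
            for a, b in zip(xi, xj)]

# With gamma = 0 the attractiveness no longer decays with distance, and with
# alpha = 0 there is no random perturbation, so firefly i lands exactly on firefly j.
landed = move_towards([0.0, 0.0], [2.0, -1.0], alpha=0.0, gamma=0.0)
```

A larger gamma makes distant fireflies nearly invisible to each other, so the swarm tends to split into local groups; a smaller gamma pulls the whole swarm toward the globally brightest members.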

Training RBF network using firefly algorithm
Each firefly encodes the weights (w), the spread parameters ($\sigma$), the center vectors (c) and the bias parameters ($\beta$) of an RBF network. The mean vector $c_i$ of the i-th neuron of the hidden layer is defined by $c_i = (c_{i1}, c_{i2}, \ldots, c_{im})$, so each firefly is a parameter vector $t_i$ with IJ + I + mI + J elements.
In fact, each firefly can represent a specific RBF network for classification. In the proposed FF-based training algorithm, the optimum vector $t_i$ of a firefly, corresponding to a specific trained RBF network, maximizes the fitness function defined in Eq. (18), which is evaluated from the outputs, over the training samples x, of the RBF network designed by the parametric vector $t_i$. N is the number of training samples. Figure 2 shows the pseudo code of the proposed algorithm, and its steps are described in detail as follows.

Step 1. (Generate the initial solutions and given parameters)
In this step, an initial population of m solutions, each of dimension IJ + I + mI + J, is generated and denoted by the matrix $D = [t_1, t_2, \ldots, t_m]$, where the values of the weights (w) and centers (c) are assigned between -1 and 1, and the values of the spread and bias parameters $\sigma$ and $\beta$ range from 0 to 1. Furthermore, this step assigns the parameters of the firefly algorithm, namely $\alpha$, $\beta_0$, the maximum cycle number (MCL) and $\gamma$. Let the cycle number l be 0.

Step 2. (Firefly movement)

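In Step 2, each solution $t_i$ computes its fitness value $f(t_i)$ as the brightness of the corresponding firefly. For each solution $t_i$, this step randomly selects another, brighter solution $t_j$ and then moves $t_i$ toward $t_j$ by using the following equations:

$$r_{i,j} = \|t_i - t_j\| = \sqrt{\sum_{k=1}^{IJ+I+mI+J} \left(t_{i,k} - t_{j,k}\right)^2}$$

$$\beta = \beta_0\, e^{-\gamma r_{i,j}}$$

$$t_{i,k} \leftarrow (1 - \beta)\,t_{i,k} + \beta\,t_{j,k} + \alpha\left(u_{j,k} - \tfrac{1}{2}\right)$$

where $u_{j,k} \sim U(0, 1)$ is a random number in the range from 0 to 1 and $t_{i,k}$ is the k-th element of the solution $t_i$.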
Step 3. (Select the current best solution)
Step 3 selects the best of all the solutions and defines it as $x_{i^{max}}$, that is,

$$i^{max} = \arg\max_{i} f(t_i)$$

Step 4. (Check the termination criterion)
If the cycle number l is equal to the MCL, the algorithm finishes and outputs the best solution $x_{i^{max}}$. Otherwise, l is increased by one, the best solution $x_{i^{max}}$ takes a random walk, and the algorithm goes to Step 2. The random walk of the best solution $x_{i^{max}}$ is based on the following equation:

$$x_{i^{max},k} \leftarrow x_{i^{max},k} + \alpha\left(u_{i^{max},k} - \tfrac{1}{2}\right)$$

where $u_{i^{max},k} \sim U(0, 1)$ is a random number in the range from 0 to 1.
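Steps 1 to 4 can be put together in a compact sketch. It is an illustration under assumptions rather than the chapter's Visual C++ implementation: all elements are initialized in [-1, 1] for simplicity, the attractiveness is taken as beta0 * exp(-gamma * r), the best solution seen so far is stored separately before the random walk of Step 4, and the toy fitness stands in for the classification fitness of the RBF network. The names `firefly_maximize` and `neg_sphere` are hypothetical.

```python
import math
import random

def firefly_maximize(fitness, dim, n_fireflies=10, mcl=100,
                     alpha=0.1, beta0=1.0, gamma=1.0, seed=0):
    """Firefly algorithm sketch (maximization). Returns the best solution seen."""
    rng = random.Random(seed)
    # Step 1: random initial solutions (simplified: every element in [-1, 1]).
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_fireflies)]
    best = list(max(pop, key=fitness))
    for _ in range(mcl):
        # Step 2: each firefly moves toward a randomly chosen brighter firefly.
        light = [fitness(t) for t in pop]
        for i in range(n_fireflies):
            brighter = [j for j in range(n_fireflies) if light[j] > light[i]]
            if not brighter:
                continue                    # the brightest firefly is handled in Step 4
            tj = pop[rng.choice(brighter)]
            r = math.sqrt(sum((a - b) ** 2 for a, b in zip(pop[i], tj)))
            beta = beta0 * math.exp(-gamma * r)
            pop[i] = [(1.0 - beta) * a + beta * b + alpha * (rng.random() - 0.5)
                      for a, b in zip(pop[i], tj)]
        # Step 3: select the current best solution.
        i_max = max(range(n_fireflies), key=lambda i: fitness(pop[i]))
        if fitness(pop[i_max]) > fitness(best):
            best = list(pop[i_max])
        # Step 4: the best firefly takes a random walk before the next cycle.
        pop[i_max] = [a + alpha * (rng.random() - 0.5) for a in pop[i_max]]
    return best

def neg_sphere(v):
    """Toy fitness: larger is better, maximum 0 at the origin."""
    return -sum(g * g for g in v)

initial = firefly_maximize(neg_sphere, dim=3, mcl=0)
trained = firefly_maximize(neg_sphere, dim=3, mcl=50)
```

To train an RBF network with this loop, `dim` would be IJ + I + mI + J and the fitness would decode each vector into weights, spreads, centers and biases before evaluating the network on the training samples.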

Experimental results and discussion
The platform used to develop the five training algorithms, namely gradient descent (GD), the genetic algorithm (GA), particle swarm optimization (PSO), the artificial bee colony algorithm (ABC) and the firefly algorithm (FF), is a personal computer with the following features: Intel Pentium IV 3.0 GHz CPU, 2 GB RAM, the Windows XP operating system and the Visual C++ 6.0 development environment. In the experiments, the learning parameter of GD is selected as $\eta = 0.01$. The parameters used for the GA, PSO, ABC and FF algorithms are given in Tables 1, 2, 3 and 4, respectively. In order to obtain unbiased classification results, the following datasets are used: Iris, Wine, Glass, Heart SPECTF and Breast cancer (WDBC), listed in Table 5 and taken from the UCI machine learning repository (Asuncion, 2007).
In order to prevent feature values in greater numeric ranges from dominating those in smaller numeric ranges, feature scaling is used; that is, each feature value is linearly scaled to the range [-1, 1]. Furthermore, the 4-fold method is employed in the experiments: the dataset is split into 4 parts, with each part containing the same proportion of each class of data. Three parts are used in the training process, while the remaining one is used in the testing process. The program is run 4 times so that each part of the data takes a turn as the testing data. The percentage of correct classification for the experiment is computed by summing the individual accuracy rates of the testing runs and then dividing the total by 4.
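The preprocessing and data split described above can be sketched as follows. This is a minimal illustration with hypothetical helper names: constant features are mapped to the midpoint of the range (an assumption, since the text does not cover that case), and the class-preserving split deals the shuffled samples of each class round-robin into the folds.

```python
import random

def scale_features(rows, lo=-1.0, hi=1.0):
    """Linearly scale every feature (column) to the range [lo, hi]."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    scaled = []
    for row in rows:
        scaled.append([
            lo + (hi - lo) * (v - mn) / (mx - mn) if mx > mn else 0.5 * (lo + hi)
            for v, mn, mx in zip(row, mins, maxs)
        ])
    return scaled

def stratified_folds(labels, k=4, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)      # deal each class round-robin
    return folds

scaled = scale_features([[0, 10], [5, 20], [10, 30]])
labels = ["a"] * 8 + ["b"] * 4
folds = stratified_folds(labels, k=4)
```

Each fold then serves once as the testing data while the other three are used for training, and the four accuracy rates are averaged.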
The complexity index, shown in (27), is the sum of squared weights; it is based on the concept of regularization and represents the smoothness of the RBF network.
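The performance indices used in the experiments (percentage of correct classification, mean square error and complexity) are simple to compute. The sketch below is one plausible reading: the MSE is averaged over samples and summed over output units, and the complexity is taken as the plain sum of squared weights with no scaling factor; both conventions, like the function names, are assumptions.

```python
def percent_correct(predicted, actual):
    """Percentage of samples whose predicted class equals the actual class."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * hits / len(actual)

def mean_square_error(outputs, targets):
    """Squared error between actual and desired output vectors, averaged over samples."""
    n = len(targets)
    return sum(sum((o - d) ** 2 for o, d in zip(out, tgt))
               for out, tgt in zip(outputs, targets)) / n

def complexity(weights):
    """Sum of squared weights of the hidden-to-output connections."""
    return sum(w * w for row in weights for w in row)

pcc = percent_correct([0, 1, 1, 0], [0, 1, 0, 0])
mse = mean_square_error([[1.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
cpx = complexity([[1.0, -2.0], [0.5, 0.0]])
```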

Classification evaluation
One of the most important issues in designing the RBF network is the number of neurons in the hidden layer. Thus, we implement RBF networks with 1 to 8 hidden neurons for comparison, and each dataset is run 10 times based on 4-fold cross-validation. The average percentage of correct classification and the corresponding standard deviation, defined in Eq. (25), of the RBF networks designed by the different algorithms are listed in Tables 6-10. These tables reveal that GD is the worst, because the gradient descent algorithm is a traditional derivative-based method that becomes trapped in local minima. Furthermore, unlike the other four algorithms, the correct classification rates of the networks designed by the GD algorithm increase as the number of neurons increases. In other words, the bio-inspired algorithms are more robust than the traditional GD algorithm. Tables 6 and 7 give the classification results for the Iris and Wine datasets, which are three-class classification problems. In Table 6, the results of the RBF networks designed using PSO, ABC and FF are not significantly different from one another but are superior to the results obtained using GA. In Table 7, the results of the ABC and FF algorithms are better than those of the GA and PSO algorithms. These results may indicate that the GA and PSO algorithms need a larger initial population or more iterations to find the optimal parameters of the radial basis function network. Tables 8-10 give the classification results for the Glass, Heart SPECTF and WDBC datasets, which are two-class classification problems. Here too, the results obtained with the PSO, ABC and FF algorithms are better than those of the GA algorithm. The best results in each of the three tables are obtained with PSO, ABC and FF, but the differences among them are not distinct in these tables.


The analysis of complexity and mean square error
Generally speaking, a trained RBF network with a larger number of hidden nodes has a higher complexity but a smaller corresponding mean square error. In the experiments, Figs. 3-7 record the mean square error and complexity of each trained RBF network based on Eqs. (23) and (24). These figures clearly show that GD is the worst, because it has the largest mean square error at the same complexity among all algorithms.

Receiver operating characteristic analysis
Receiver operating characteristic analysis is a graphical tool for two-class classification problems that evaluates the predictive accuracy of a logistic model. The curve displays the relationship between the true positive rate (sensitivity) and the false positive rate (1 - specificity) over a range of cutoffs. The sensitivity is a measure of accuracy for predicting events and is equal to the number of true positives divided by the total number of actual positives; the specificity is a measure of accuracy for predicting nonevents and is equal to the number of true negatives divided by the total number of actual negatives. The area under the curve (AUC) is an important index for evaluating classification performance. In general, a high AUC represents good performance in a classification problem. The classifications of the Heart SPECTF and Breast WDBC datasets listed in Table 5 are two-class medical diagnosis problems that are suitable for this analysis. The SPECT dataset describes the diagnosis of cardiac single photon emission computed tomography (SPECT) images. The database of 267 SPECT image sets (patients) with 22 binary attributes was processed to extract features that summarize the original SPECT images, and each patient is classified into one of two categories: normal (negative) and abnormal (positive). The Wisconsin Diagnostic Breast Cancer (WDBC) dataset was obtained from Dr. William H. Wolberg of Wisconsin University. The dataset includes 567 data samples with 30 continuous attributes, divided into 357 benign (negative) and 210 malignant (positive). To further analyze the classification capability of the five algorithms, the average sensitivity and average specificity of the receiver operating characteristic (ROC) analysis on the SPECTF and WDBC datasets, for trained RBF networks with eight hidden nodes, are listed in Table 11; the corresponding AUC of the ROC analysis, obtained by varying the bias parameters, is also listed in this table. The table shows that the ABC algorithm has the better classification capability on the SPECT dataset, whereas the FF algorithm is best on the WDBC dataset. The average computation times for classifying the Heart SPECT dataset in 4-fold cross-validation using GD, GA, PSO, ABC and FF are 0.21, 429.67, 103.76, 123.67 and 98.21 seconds, respectively, whereas the average computation times for classifying the Breast dataset in 4-fold cross-validation using GD, GA, PSO, ABC and FF are 0.24, 513.23, 161.84, 189.59 and 134.91 seconds, respectively.
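Sensitivity, specificity and the AUC can be computed from classifier scores as sketched below. The sketch treats the network output as a score and sweeps a cutoff over it, which is assumed here to play the same role as varying the bias parameter of the output node; the AUC is obtained with the trapezoidal rule. Labels are 1 for positive (abnormal or malignant) and 0 for negative, and the function names and example scores are hypothetical.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """True positive rate and true negative rate when predicting 1 for score >= cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, tn / neg

def roc_auc(scores, labels):
    """Area under the ROC curve by the trapezoidal rule over all distinct cutoffs."""
    cutoffs = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]                   # cutoff above every score
    for c in cutoffs:
        sens, spec = sensitivity_specificity(scores, labels, c)
        points.append((1.0 - spec, sens))   # (false positive rate, true positive rate)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

separable = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
overlapping = roc_auc([0.9, 0.4, 0.8, 0.3], [1, 1, 0, 0])
```

In the second example one of the four positive-negative pairs is ranked in the wrong order, so the AUC drops from 1 to 0.75.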

Conclusions
In this chapter, the firefly algorithm has been applied to train the radial basis function network for data classification and disease diagnosis. The training procedure involves selecting the optimal values of the following parameters: the weights between the hidden layer and the output layer, the spread parameters and center vectors of the radial functions of the hidden nodes, and the bias parameters of the neurons of the output layer. Four other algorithms, namely gradient descent (GD), the genetic algorithm (GA), particle swarm optimization (PSO) and the artificial bee colony (ABC) algorithm, are also implemented for comparison. In the experiments, well-known classification problems, namely the Iris, Wine, Glass, Heart SPECT and WDBC datasets obtained from the UCI repository, were used to evaluate the classification capability of the five algorithms. Furthermore, the complexity and training error were discussed on the basis of the experiments conducted in this chapter. The experimental results show that the firefly algorithm obtains satisfactory results compared with the GD and GA algorithms, but it is not clearly superior to the PSO and ABC methods in the classification of the UCI datasets. To further examine the classification capability of the five algorithms, receiver operating characteristic (ROC) analysis was applied to the classification of the Heart SPECT and WDBC datasets. The experimental results also show that the firefly algorithm achieves satisfactorily high sensitivity, high specificity and a larger AUC in the corresponding ROC curves on the WDBC dataset; however, the differences between the ABC, PSO and firefly algorithms are not significant. The experimental results of this chapter reveal that swarm intelligence algorithms, such as particle swarm optimization, the artificial bee colony algorithm and the firefly algorithm, are good choices for searching for the parameters of the radial basis function neural network for classification and disease diagnosis.

Qasem & Shamsuddin (2011) use three indices to evaluate the performance of the trained RBF network under the different algorithms. The percent of correct classification (PCC) is used as the measure for evaluating the trained RBF network. The mean square error (MSE) on the data set is used as a performance index; it is computed from the actual outputs and the desired outputs, where N is the number of data pairs in the dataset.

Fig. 3. The mean square error versus complexity of the classification of the Iris dataset.

Fig. 4. The mean square error versus complexity of the Wine classification.

Fig. 5. The mean square error versus complexity of the Glass classification.

Fig. 6. The mean square error versus complexity of the Heart SPECT classification.


Table 4. The used parameters of the firefly algorithm. Each firefly is an (IJ + I + mI + J)-dimensional vector, and the given parameters are m, $\alpha$, $\beta_0$, the iteration number l and $\gamma$.

Table 5. The used datasets in this study

Table 10. Statistical average results of the WDBC dataset using different algorithms.

Table 11. Area under curve (AUC) of ROC analysis of RBF network with eight hidden nodes. (The best results are highlighted in bold)