Open access

# Efficient Estimation of Distribution Algorithms by Using the Empirical Selection Distribution

Written By

S. Ivvan Valdez, Arturo Hernández and Salvador Botello

Published: February 1st, 2010

DOI: 10.5772/8056

From the Edited Volume

## New Achievements in Evolutionary Computation

Edited by Peter Korosec

Chapter metrics overview

View Full Metrics

## 1. Introduction

Estimation of Distribution Algorithms (EDAs) (Mühlenbein et al., 1996, Mühlenbein & PaaB, 1996) are a promising area of research in evolutionary computation. EDAs propose to create models that can capture the dependencies among the decision variables. The widely known Genetic Algorithm could benefit from the available dependencies if the building blocks of the solution were correlated. However, it was proved that the building blocks of a genetic algorithm have a limited capacity for discovering and using complex relationships (correlations) among variables. EDAs instead, focus on learning probability distributions which serve as the vehicle to capture the data dependencies and the data structure as well.

In order to show how the proposed method unifies the theory for infinite sized population with the finite sized population case of practical EDAs, we explain them first. An EDA with infinite sized population would perform the steps shown in the algorithm in Table 1.

 EDA with an infinite size population 1 t=0 2 Initialize a probability model p(x,t) (usually a uniform distribution). 3 Generate an infinite sized sample X t from p(x,t) . 4 Evaluate X t in the objective function(s) and constraint(s). 5 Compute the selection distribution p S (x,t) , and use it as the new search distribution. Then: p(x,t+1) = p S (x,t) . (Selection and model rcomputation step) 6 t=t+1 7 If the stop criterion is not reached go to Step 3

### Table 1.

The estimation of distribution algorithm with an infinite size population: the selection distribution is equal to the search distribution.

The selection method introduces the search bias necessary to improve the current best solution, and in fact, “pushes” the population towards the optimum. For this case where the population is infinite, imagine a probability value could be computed for every point in the search space. Therefore, the “exact” probability distribution of the selected individuals (which is also an infinite set), can be used to sample the new population. Such distribution, the selection distribution, is the probability of any point in the search space of being selected. Four selection distribution formulas for an infinite population are shown in Table 2, left column: Truncation, Boltzmann, Proportional, and Binary Tournament selections. In the same table but right column, we show the proposed empirical selection distributions studied in this paper. The first objective of this paper is to show that the proposed distributions converge to the theoretical ones as the population grows. The second objective is to apply the empirical distributions in practical EDAs, and to compare their performance through the analysis of several experiments. It has been proved that the ideal EDAs with any of the four selection distributions shown, converges to the optimum after a large number of generations (Zhang & Mühlenbein, 2004).

 Selection Model Empirical Selection Distribution TruncationpS(x,t)={p(x,t)α(t)iff(x)≥θt0otherwise​ p^S(xi,t)={1|XtS|iff(xi)≥θt0otherwise​ BoltzmannpS(x,t)=eβ(t)f(x)p(x,t)Z(t) p^S(xi,t)=eβ(t)f(x)∑j=1|X|eβ(t)f(xj) ProportionalpS(x,t)={p(x,t)α(t)iff(x)≥θt0otherwise​ p^S(xi,t)=f(xi)∑j=1|X|f(xj) Binary TournamentpS(x,t)=2p(x,t)∫f(y)≤f(x)p(y,t)dy p^S(xi,t)=∑j=1|X|I(i,j)∑i=1|X|∑j=1|X|I(i,j).WhereI(i,j)=1,iff(xj)f(xi),and0otherwise.

### Table 2.

Left: Selection models for the EDA with an infinite sized population. Right: the respective empirical selection distribution.

Let us now contrast the infinite sized population case with the standard practical EDA, thereby, with a finite sized population. The practical EDA is presented in the algorithm in Table 3. Notice the selection operator returns a selected setXtSwhose size is a fraction of the total population. In the next step the standard EDA seeks to approximate the distribution of the selected set via a parametric distribution. The distribution parameters, and in some cases the data structure, are learned from the selected set. Such distribution, the search distribution, is the model (with a predefined structure), used for learning the underlying joint probability density of the selected set,p(x,t+1)p(XtS,t). Also, note that the next population is simulated from the search distribution. The search distribution and its learning algorithm are so important for an EDA that, in fact, gives name to the EDA version. Examples of search distributions are: Bayesian networks (Pelikan et al.., 1999), Polytrees (Soto & Ochoa, 2000), and Dependency Trees in discrete spaces. Also, Gaussian univariate and multivariate models (Bosman & Thierens, 2000; Larrañaga et al., 1999) in the continuous case, among others (Larrañaga & Lozano, 2001, Lozano et al., 2006, Pelikan et al., 2006).

 Standard EDA with a finite size population 1 t=0 2 Initialize a probability model p(x,t) ( search distribution ). 3 Generate an infinite sized sample X t from p(x,t) . 4 Evaluate X t in the objective function(s) and constraint(s). 5 St -SELECTION( X t ) (selection step) 6 Recompute the search distribution model p(x,t+1), such that, p(x,t+1)?p(S t ) (parametercomputation step). 7 t=t+1 7 If the stop criterion is not reached go to Step 3

### Table 3.

The estimation of distribution algorithm with a finite size population: the search distribution is computed by learning parameters from the selected set.

The approach introduced in this work is shown in the algorithm in Table 4. No selection operator is used, our approach firstly calculates the proposed empirical selection distribution to approximate to the theoretical selection distribution. The empirical selection distribution,p^S(xi,t), is the exact selection distribution when the population is thought as a model of the whole search space. Then, the search distribution model is created directly using the information from the empirical selection distribution, and used to simulate the new population. Remember, the empirical selection distribution equations are presented in Table 2, right column. The calculation of the empirical selection distribution is easy (as we shall explain later), however, its main advantage is to carry all the information needed to build the best approximating search distribution.

 EDA with the Empirical Selection Distribution 1 t=0 2 Initialize a probability model p(x,t) ( search distribution ). 3 Generate an infinite sized sample X t from p(x,t) . 4 Evaluate X t in the objective function(s) and constraint(s). 5 Compute the empirical selection distributionp^S(x,t)(selection step) 6 Recompute the search distribution model p(x,t+1) , withp^S(x,t)(parametercomputation step). 7 t=t+1 7 If the stop criterion is not reached go to Step 3

### Table 4.

The estimation of distribution algorithm with a finite size population: the search distribution is computed by learning parameters from the selected set.

The chapter is presented as follows: Section 2 briefly reviews the most common selection methods used in EDAs. Section 3 discusses about the convergence of the empirical selection distribution to the exact selection distribution. Section 4 introduces the general method to approximate the search distribution to the selection model. Section 5 is a comparison with related work. Section 6 presents well known EDAs, which have been modified to apply the proposed method. A set of experiments is presented in Section 7 and discussion about the performance of different selection methods. Finally, Section 8 shows the perspectives of future work and concludes.

## 2. Selection methods

The main goal of a selection operator is to bias the population towards promising regions of the search space. Truncation, Boltzmann, Proportional and Tournament selection operators discussed in this paper are introduced through an example. Assume the following objective function and an infinite sized population whose elements have as fitness the function value

at(x,y)=exp(|x|2)2+4cos(20x)+exp(|y|2)2+4. The plot of this function is shown in Figure 1(a). Most of the EDAs draw the initial population from a uniform distribution, which is shown in Figure 1(b). Hence, during the first iteration any point has the same sampling probability. Assuming an infinite sized population, the rest of the figures show the selection distribution function.
• The Truncation selection is shown in Figure 1(c). This selection is widely used by many EDAs. Here the select set would be the best population above a function threshold ofθ=30. The probability density of the truncation selection is shown in Figure 1(c). Observe the flat area; it means the selection probability for the population above the threshold value is the same. The truncation selection hides the roughness of the objective function above the threshold value by assigning to such area the same selection probability.

• The Boltzmann selection is shown in Figure 1(d). This operator exponentially favors the most promising regions (the zones with high function value), as shown in Figure 1(d). Most of the probability mass is condensed on a single peak which corresponds to the function optimum. Since the remaining region is quite flat, the Boltzmann selection will deliver a selected set clustered on the peak region.

• The Proportional selection is shown in Figure 1(e). This operator, proposed by John Holland for the standard genetic algorithm, selects points with some probability directly proportional to its objective value. The resulting probabilistic model of this method, shown in Figure 1(e), is very similar to the objective function (although in a different scale).

• The Tournament selection is shown in Figure 1(f). The tournament selection picks the best point found in a randomly chosen subset of the population. The usual size of the subset is 2. Figure 1(f) shows that the resulting probabilistic model acquires the roughness of the objective function. A larger subset would increase the selection pressure on the winning individual, therefore, flatting the density function shown.

Zhang and Mühlenbein have shown the selection models just illustrated can drive the population to the function optimum (Zhang & Mühlenbein, 2004). The main factor that makes convergence possible is the bias the selection operator introduces into the population. In the following section we provide simple proofs of the convergence of the empirical selection distribution to the selection distribution.

## 3. Convergence of the empirical selection distribution to the selection distribution

The empirical selection distribution equations for the four selection methods are shown in Table 2. In this section we provide simple proofs on the convergence of the empirical selection distribution to the selection distribution with large population size. In fact, the larger the population size, the better the approximation of the empirical selection distribution to the selection distribution.

### 3.1. Discrete variables

Assume an EDA with discrete variables and a search distribution denoted byp(x,t)(the continuous case will be further tackled).

Truncation selection. Say the search space is described bymbinary variables,{y1,y2,...,ym}. There aren=2mcombinations, and denote byxi,i={1,2,...,n}to each combination. Then, for a large sampleXwith size|X|n, every combinationxireceives a frequencyfreqiof instances. The empirical selection distribution at generationtis given by:
p^S(xi,t,θt)=freqi1|St|E1

Being|St|the number of instances ofXsuch thatf(x)θt. Now recall that for a infinite samplefreqi/|X|=p(xi,t), then forα(t)=|St|/|X|the empirical selection distribution is given by:

p^S(xi,t,θt)=p(xi,t)α(t)E2

which is exactly the expression given in Table 2 for the infinite sized population. Theα(t)value is the proportion of truncated solutions.

Boltzmann selection. The empirical selection model is given by:
p^S(xi,t)=eβ(t)f(xi)j=1|X|eβ(t)f(xj)E3

then, in an analogous way to the truncation method:

p^S(xi,t)=freqieβ(t)f(xi)j=1|X|freqkeβ(t)f(xk)E4

Substitutingfreqi=|X|p(xi,t), andfreqi=|X|p(xi,t),Z(t)=k=1np(xk,t)eβ(t)f(xk), then:

p^S(xi,t)=p(xi,t)eβ(t)f(xi)Z(t)E5

So, we have the exact Boltzmann selection distribution.

Proportional selection. The empirical selection model is given as follows:
p^S(xi,t)=f(xi)k=1|X|f(xk)E6

then:

p^S(xi,t)=freqif(xi)k=1|X|freqkf(xk)E7

Substituting,freqi=|X|p(xi,t)then:

p^S(xi,t)=p(xi,t)f(xi)k=1|X|p(xk,t)f(xk)E8
,andE(t)=k=1np(xk,t)f(xk), then the empirical selection for such large sampleXis given by:
p^S(xi,t)=p(xi,t)f(xi)E(t)E9

which is exactly the expression in Table 2.

Binary tournament selection. For the tournament selection method the exact selection distribution for the discrete case is given by:
p^S(xi,t)=p(xi,t)f(y)f(xi)p(y,t)CE10

being Cthe normalization constant, the number 2 in Table 2 is also a constant, so it is absorbed by C. Then the empirical selection distribution is given by:

p^S(xi,t)=j=1|X|I(i,j)i=1|X|j=1|X|I(i,j)E11

WhereI(i,j)=1,iff(xj)f(xi)and 0 otherwise. The termi=1|X|j=1|X|I(i,j)is a normalization constant. Then, considering that for any yvalue the number of instances in the large populationXarefreqy=|X|p(y,t), and the number of instances of a variable (vector) valuexiisfreqi=|X|p(xi,t), then substituting:

p^S(xi,t)=p(xi,t)f(y)f(xi)p(y,t)CE12

which is the exact selection distribution forC=i=1|X|p(xi,t)f(y)f(xi)p(y,t).

### 3.2. Continuous variables

For the continuous case, consider a univariate search space with domain in the interval [a,b], then a set of pointsxifor i=1,2,...,|X|define partitions. If we use these points as possible instances of a discrete variable, then, we have the equivalence with the discrete selections distribution previously shown. The partition size isΔi=xi+1xi, ifΔiε, the sums in Equation 8 can be written as Riemann sums. Even though this is not a proof for continuous cases, it is a strong argument to explain the convergence of the empirical selection distribution to the selection distribution. Figure 2 shows the similarities between the exact selection density function and the empirical selection distributions.

## 4. Computing the search distribution

Following the usual steps of an EDA, an arbitrary large sample should be obtained from the empirical selection distribution, and then used to learn the search distribution parameters. In our approach, sampling the empirical selection is avoided without diminishing the advantages of using all the information at hand. It is known that the relative frequency of a pointxifor an infinitely large sample is equal to the probabilityp^S(xi,t). Thus, the sampling process can be avoided andp^S(xi,t)can be used as the frequency of the pointxi.

For example, suppose that the search distribution is a Gaussian with meanμand variance v. These parameters must be computed to get the best approximation of the Gaussian to the selection model. Assume (for a moment) that a sampleS^of size|S^|is obtained from the empirical selection distribution. The only possible values sampled from the empirical selection distribution are those in the population. Hence, most of the pointsxiwould get more than one instance in the sampleS^. Denote the number of instances of a pointxiasfreqi, the estimator of the mean is:

μ=i=1|X|freqixi|S^|E13

Let us denotep^iS=p^S(xi,t). We know that when|S^|, then,freqi|S^|=p^iS. Substituting this term in Equation 13, the mean is simply computed as follows:

μ=i=1|X|p^iSxiE14

wherep^iSis the probability of a point computed with the empirical selection distribution. Therefore, sampling the empirical distribution is not necessary. Also important, note that the computation of the probabilitiesp^iSis independent of the search distribution parameters. This allows us to easily adapt an EDA to any of the four selection models. These models cover most EDA implementations and other selection methods can be also expressed as distributions. In addition, the computational cost ofp^iSin most of the cases is lower or equal to that of applying a selection operator. For example, the Proportional and Boltzmann selections must calculate the probabilities in Table 2 in order to obtain a sample which becomes the selected set. This sample is not necessary for the empirical selection distribution based EDA (ES-EDA). For the truncation method in the standard EDA as well as the ES-EDA, it is only necessary to know the individuals whose objective function is greater thanθt, thus, the computational cost is the same. The binary tournament selection requires comparisons of the order of the selected set size, while the empirical selection distribution requires comparisons of order |X|(|X|-1)/2, hence, it is the unique case in which an ES-EDA has a greater cost than the standard EDA. The parameter computation in the ES-EDA in general increases its computational cost only by multiplying each individual value xi by its corresponding frequencyp^iS, which is a minimal cost.

## 5. Related work

In order to show that this work is an extension of previous but less general approaches, we will show the equivalence of the resulting formulae when applying frequencies from the empirical selection distribution. Yunpeng et al. (Yunpeng et al., 2006) approximate the Boltzmann distribution with a Gaussian model by minimizing the Kullback-Leibeler divergence. The computation of the Gaussian distribution parameters is presented in Equations 15 and 16. Note that this approach which requires costly analytical work, gives exactly the same formulae as the Boltzmann empirical selection distribution presented in Table 2 with the UMDAc and the EMNAglobal.

μ=i=1|S|eΔβf(xi)xii=1|S|eΔβf(xi)E15
Γ=i=1|S|[eΔβf(xi)(xiμ)(xiμ)T]i=1|S|eΔβf(xi)E16

## 6. Modifying successful EDAs with the empirical selection distribution

This section presents several successful EDAs which have been modified to accept the relative frequencies given by the empirical selection distribution. Continuous and discrete EDAs with univariate and multivariate models are presented. Table 5 presents the notation used.

 p(x,t) Search distribution in generation t . n Number of variables. Xt Population. |X| Population size. p^jS Frequency ofxi, according top^S(xi,t)in Table 2. Ft Population objective values. Xbest An optimum approximation. I(x,k) Indicator, 1 ifxi=k, and 0 otherwise.

### Table 5.

Notation used for the empirical selection based EDAs.

A practical EDA based on the empirical selection distribution is presented in Table 6. Since the parameter computation in line 8 is the only section that should be changed according to the particular search distribution, a pseudo-code is separately presented for each algorithm. The selection procedure in line 6 must be understood as the computation of the empirical selection distribution (according to Equations in Table 2).

 Practical EDA based on the Empirical Selection Distribution 1 Initialize the search distribution model p(x, 0 ) 2 Sampling( X t , p(x, 0 ) ) 3 Evaluation( F t , X t ) 4 While the stopping criterion is not met do 5 Selection(p^S(x,t), X t , F t ) 6 Parameter_Computation ( p(x,t),p^S(x,t), X t , F t ) 7 Sampling( X t+1 , p(x, t ) ) 8 Evaluation( F t , X t +1 ) 9 Elitism( X best , F t +1 , X t+1 ) 10 End While Ensure: An optimum approximation X best .

### Table 6.

Pseudo-code a practical EDA based on the empirical selection distribution.

 1 For ( i=1 to n ) { 2 For ( k=1 to m i ) { 3 bi,k=∑j=1|X|I(xj,k)⋅p^jS 4 } 5 }

### Table 7.

Pseudo-code for the ES-UMDA parameter computation.

### 6.1. UMDA

The Univariate Marginal Distribution Algorithm (UMDA) was introduced by Mühlenbein and PaaB (Mühlenbein & PaaB, 1996). It is based on the estimation of univariate marginal probabilities. This model considers that all variables are statistically independent. It uses the simplest model for discrete distributions. Each variablexihas attached a probability vectorbi,k, that is, the probability ofxitaking the value k, is:bi,k=p(xi=k). Note that in the original UMDA the computation of the parameterbi,kis basically done by counting bits of a variable of the selected set.

The UMDA is quite simple to adapt to the relative frequencies given by the empirical selection distribution. The pseudo-code of the ES-UMDA (ES=Empirical Selection) parameter computation is shown in Table 7.

### 6.2. UMDAc

The Univariate Marginal Distribution Algorithm for continuous domains (UMDAc) was introduced by Larrañaga et al. (Larrañaga et al., 1999). It uses a univariate model, in the specific case of UMDAG c, there are nunivariate Gaussian distributions (for a n-dimensional problem). Two parameters are needed for the Gaussian at each dimension i, the meanμiand the standard deviation i. The computation of both parameters is simply done by weighting each point by its corresponding relative frequency (probability) as shown in Table 8.

 1 For ( i=1 to n ) { 2 μ=∑j=1|X|p^iS⋅xj,i 3 σi2=∑j=1|X|p^iS⋅(xj,i−μi)2 4 }

### Table 8.

Pseudo-code for the ES-UMDAc parameter computation.

### 6.3. K2-Bayesian-network based EDA

Bayesian networks have been successfully used in EDAs, for instance the Bayesian Optimization Algorithm (BOA) introduced by Pelikan et al. (Pelikan et al., 1999). A BOA-like algorithm based on the K2 algorithm (Cooper & Herskovits, 1992) is presented. The parameter computation has been modified to use the empirical selection distribution. The K2 is a greedy heuristic search method, for maximizing the probability P(BS,D) of the structure BS and the data D. For maximizing P(BS,D) the K2 maximizesg(i,πi), which is a measure related with the probability of xi given a set of parents πi.

g(i,πi)=j=1qi(ri1)(Nij+ri1)k=1riNijk!E17

where xi has ri possible discrete values. Each variable xi in Bs has a set of parents, which are represented by a list of variables πi. Nijk is the number of cases in D in which the variable xi has the value vik, and πi is instantiated as wij. wij denotes the j-th unique instantiation of πi relative to D, and qi is the number of such unique instantiations of πi.Nij=k=1riNijk. For a deeper explanation of the K2 algorithm the reader is directed to (Cooper & Herskovits, 1992).

Learning a Bayesian network with the K2 is a counting process of point instances. Then, in order to compute Equation 17 an integer frequency for each point is needed. For obtaining an integer frequency, let us define a sample size for the selection method, say|S^|. When using the empirical selection distribution we have a relative frequencyp^lSassociated with a point xl, if that point is such a case forNijk, then, instead of summing a single case we will suminteger(|S^|p^lS). As|S^|grows, the approximation of the K2 Bayesian network to the empirical selection distribution will be more accurate, but the computational cost of Equation 17 increases as well. Once the K2 Bayesian network structure has been learned, the frequencies|S^|p^lSmust be used to compute the conditional probabilities.

### 6.4. EMNAglobal

The EMNAglobal was introduced by Larrañaga et al. (Larrañaga et al., 2001). It is based on the estimation of a multivariate normal density function. This model can represent linear dependencies between normal variables in the covariance matrix. The pseudo-code of the ES-EMNAglobal parameter computation is shown in Table 9. Note that it is quite easy to insert the empirical selection distribution in univariate as well as multivariate algorithms, also it is inserted in discrete and continuous domains with a minimum analytical and computational effort.

 1 For ( i=1 to n ) { 2 μ=∑j=1|X|p^iS⋅xj,i 3 For ( k= 1 to i ) { σi,k=∑j=1|X|p^iS⋅(xj,i−μi)(xj,k−μi) σk,i=σi,k } 4 }

### Table 9.

Pseudo-code for the ES-EMNAglobal parameter computation.

## 7. Experiments and performance analysis

This section provides a set of experiments to address the performance and advantages of the empirical selection distribution. Two kinds of experiments are presented:

1. Graphically we show that the search distribution based on the empirical selection distribution constitutes a robust and better approximation to the selection model.

2. Using the EMNAglobal (Larrañaga et al., 2001) and the BMDA (Pelikan & Mühlenbein, 1999), we show the impact of the empirical selection distribution on the EDA performance.

### 7.1. Graphical comparison

For better plots, the unidimensional objective function shown in Figure 3 is used. This analysis contrasts the three concepts discussed in the introduction: the ideal EDA, the standard EDA, and the EDA with the empirical selection distribution, combined with the four discussed selection methods.

Even though the exact selection distribution can be computed for the ideal EDA, to perform the correct comparison we use a predefined Gaussian model for the three approaches. We show the Gaussian approximation with a very large population (considered as infinite), contrasted with the approximation computed by using the selected set from a standard sized population, and the approximation computed by using the empirical selection distribution. The experiments are designed as follows:

• 455θ=0.9

Truncation selection. The truncation method shown in Figure 4(a) is basically the same for the standard EDA and the empirical selection distribution, because the points used are the same in both approximations. Thus, in this case the empirical selection performs at least as well as the standard selection method.

Boltzmann selection. As shown in Figure 4(b) , the empirical selection method delivers a very good approximation to the exact model. The search models computed by the standard EDA approximation are a belt of Gaussians. The empirical selection approximation is at the middle of this belt, it is less affected by a single point or small set of points (robustness). It is possible that the randomness of the standard selection method guides the search to a better optimum approximation. In general, it is not the expected behaviour, given its difference with the exact model, the small size of the selected set (usually 50% of the population) and the consequent loss of information which favors the tendency of being biased to sub-optimal regions. Also, it is expected that the behaviour of the Boltzmann selection varies according to theβvalue. The empirical selection computes the same search model from the same population, thus, a more stable behaviour is expected in contrast with the standard EDA. This could be especially useful when designing annealing schedules, because with the empirical selection distribution a similar performance under similar conditions is expected.

Proportional selection. The proportional selection does not use extra parameters, therefore, no expensive parameter tuning is required. At the same time, however, there is no way to control the high selection pressure exerted over the fittest individuals that usually lead the population to premature convergence. Thus, a search distribution which better represents the selection method could be useful, also a more predictable search model is useful when tuning other controls such as the population size, because the expected behaviour could be better inferred. Figure 4(c) , shows the search densities (points) computed when applying the selection method 10 times on the same population. Notice that these models are easily biased to different regions, most of them suboptimal.

Binary tournament selection. The binary tournament selection seems to be the most robust selection method according to Figure 4(d). The plot shows that this selection delivers Gaussians more similar than those delivered by other selection methods. This selection can be used as a reference for the expected performance of an EDA, considering the robustness, and the good performance of this selection according to the experiments in the next section. The parameter tuning for other selection methods can be done trying to obtain at least the same performance as the binary tournament selection. It is parameter free, and it is less sensible to large objective values in the population. As shown in Figure 4(d) the empirical selection distribution computes an accurate approximation when compared with the exact method.

### 7.2. Performance comparison

It has been previously shown that the estimation of the search distribution according to the empirical selection distribution accurately approximates the exact search distribution. Other claim is that the empirical selection distribution preserves some building blocks that can not be preserved by the standard EDA, due to the fact that the last one only uses a part of the population, while the empirical selection based EDA use the whole population in most of the cases. To support these arguments, we present experiments using the bivariate marginal distribution algorithm (BMDA) (Pelikan & Mühlenbein, 1999), and the EMNAglobal (Larrañaga et al., 2001), as well as the counterparts based on the empirical selection distribution: the ES-BMDA and ES-EMNAglobal.

### 7.3. Test 1. The EMNAglobal

We present a comparison using the EMNAglobal and three problems widely used to test EDAs performance (Larrañaga et al., 2001): Sphere, Griewangk and Rosenbrock. In order to maximize the functions and convert them to have positive objective values, an objective function g(x) is adjusted as: f(x)=-g(x)+1.8e7, and f(x) is used as the fitness function.

Experiments description: 30 independent runs are performed with 50 variables, the domain for all variables is [-600,600], the population size is 2000, selecting 1000 individuals for the standard EMNA. Truncation is 50% of the population, and elitism according to the general EDA in Table 6. The annealing schedule for the Boltzmann selection is fixed by initializingβ0=0,Δβ=100/n, andβt=βt1+Δβ. For the Rosenbrock functionΔβ=50/n. For the Boltzmann and proportional selections the objective function is normalized by applying f(x)=g(x)-gmin/(gmax-gmin+1), in order to avoid numerical problems. Stopping criterion: 300 000 function evaluations. Results: Table 10 compares the results obtained by the standard EMNA (denoted by St.), and the Empirical Selection based EMNA (denoted by ES). We report the mean and standard deviation of the objective function as well as the result of a hypothesis test based
 n Truncation Boltzmann Proportional Binary Tournament Sphere 10 St 5.048e-9 (1.158e-9) 2.168e-6 (7.745e-7) 7.703e-2 (2.666e-2) 6.219e-9 (1.234e-9) ES 5.159e-9 (1.105e-9) 2.370e-6 (5.901e-7) 6.738e-2 (2.347e-2) 5.762e-9 (1.395e-9) Hyp. Test N N N N 20 St 3.767e-8 (7.306e-9) 9.924e-1 (1.356) 7.365 (2.314) 6.607e-8 (1.166e-8) ES 3.790e-8 (6.003e-9) 1.092 (2.298e+0) 6.661 (1.071) 7.628e-8 (1.074e-8) Hyp. Test N N N St 30 St 1.093e-7 (1.123e-8) 2.725e+1 (1.804e+1) 2.018e+2 (1.524e+2) 1.660e-5 (1.914e-6) ES 1.091e-7 (7.991e-9) 1.726e+1 (1.503e+1) 5.549e+1 (1.005e+1) 3.520e-5 (3.874e-6) Hyp. Test N ES ES St 40 St 2.213e-7 (1.673e-8) 1.060e+2 (4.624e+1) 1.283e+3 (6.554e+2) 1.889e-3 (2.983e-3) ES 2.239e-7 (1.682e-8) 6.394e+1 (4.757e+1) 2.142e+2 (2.060e+1) 1.094e-3 (1.099e-4) Hyp. Test N ES ES ES 50 St 1.999e-6 (2.266e-7) 4.289e+2 (1.779e+2) 4.735e+3 (1.379e+3) 3.543 (5.801) ES 1.712e-6 (2.082e-7) 1.826e+2 (9.569e+1) 5.357e+2 (6.471e+1) 1.111e-2 (1.032e-3) Hyp. Test ES ES ES ES Griewnagk 10 St 9.356e-10 (2.077e-10) 0.36098 (0.18955) 0.3194 (0.05671) 1.062e-9 (3.608e-10) ES 9.231e-10 (2.251e-10) 0.17173 (0.09324) 0.3343 (0.06324) 1.114e-9 (2.647e-10) Hyp. Test N ES N N 20 St 4.921e-9 (5.351e-10) 0.19738 (0.19323) 0.6325 (0.07413) 1.003e-8 (1.788e-9) ES 4.968e-9 (6.329e-10) 0.06382 (0.08591) 0.6553 (0.04763) 1.207e-8 (1.526e-9) Hyp. Test N ES ES St 30 St 1.044e-8 (1.007e-9) 0.13667 (0.11265) 1.0302 (0.06677) 1.125e-6 (1.846e-7) ES 1.004e-8 (9.885e-10) 0.03527 (0.05150) 0.9466 (0.03131) 2.223e-6 (3.029e-7 Hyp. Test N ES ES St 40 St 1.817e-8 (1.357e-9) 0.26945 (0.12897) 1.3946 (0.19001) 6.056e-4 (2.088e-3) ES 1.809e-8 (1.694e-9) 0.05683 (0.05088) 1.0543 (0.00679) 4.868e-5 (5.506e-6) Hyp. Test N ES ES ES 50 St 7.568e-8 (1.352e-8) 0.54303 (0.24312) 2.3571 (0.46186) 1.036e-1 (1.265e-1) ES 6.852e-8 (1.364e-8) 0.17259 (0.07979) 1.1365 (0.02057) 3.899e-4 (3.021e-5) Hyp. Test ES ES ES ES Rosenbrock 10 St 9.292 (7.714) 6.795 (0.5562) 1.125e+05 (65566) 7.647 (0.2958) ES 8.929 (3.113) 6.862 (0.4907) 1.133e+05 (56613) 7.880 (0.7487) Hyp. Test N N N St 20 St 20.131 (5.753) 19.404 (2.4900) 1.313e+07 (7154997) 17.915 (0.9468) ES 19.339 (4.762) 18.386 (1.1359) 9.138e+06 (2823051) 17.657 (0.8191) Hyp. Test N ES ES N 30 St 28.865 (1.218) 35.728 (7.9373) 1.656e+08 (154793875) 31.681 (8.1603) ES 30.695 (6.149) 35.246 (5.2851) 5.800e+07 (16388688) 27.852 (0.6024) Hyp. Test N ES ES ES 40 St 42.413 (4.465) 71.699 (18.6654) 5.469e+08 (269436266) 84.128 (52.2806) ES 41.709 (3.324) 53.984 (8.8081) 1.845e+08 (53383769) 40.544 (3.1687) Hyp. Test N ES ES ES 50 St 65.604 (13.422) 1109.840 (3713.7494) 1.183e+09 (641262869) 2594.724 (3675.9199) ES 60.987 (9.248) 93.034 (16.7544) 3.695e+08 (82888904) 61.753 (4.9202) Hyp. Test N ES ES ES

### Table 10.

Performance comparison between the standard EMNAglobal (St) and the empirical selection distribution based EMNAglobal (ES).

on the Bootstrap methodology with a significance of 5% (Efron & Tibshirani, 1993). We test that the mean of the ES-based EMNAglobal is less than the mean of the standard EMNAglobal, and vice versa. An N below the mean and standard deviation means that neither is better, St means that the standard EMNAglobal is the best, an ES that the empirical selection based EMNAglobal is the best. The conditions of the experiment are maintained while the dimensions are growing, the purpose is to show that the empirical selection in general performs better, and can use the information efficiently. Note that the standard EDA (St-EDA) and the Empirical Selection EDA (ES-EDA) are very similar in lower dimensions, but for higher dimensions the ES-EDA clearly outperforms the St-EDA. The explanation is that the population size is not growing with the dimensions, thus, for lower dimensions the selected set size is enough to adequately estimate the search distribution, but for higher dimensions the information is not sufficient, or it is not efficiently used. The runs where the St-EDA outperforms the ES-EDA (tournament 20 and 30 variables), can be explained by selection pressure issues. The St-EDA uses a subset of the best individuals in the population, no matter which selection method is used. On the other hand, the ES-EDA uses all the points, so for the same parameters the variance of the ES-EDA will be greater than the St-EDA, because the whole population covers a wider region than the selected set. So, the convergence of the St-EDA is faster, and the exploration is concentrated in a smaller region, resulting in a poor performance for higher dimensions as shown in Table 10. Notice that even the hypothesis test says that the St-EDA is better in 4 cases, the ES-EDA finds solutions very close to the optimum in all cases.

### 7.4. Test 2. The BMDA

The BMDA works in discrete domains, in this case a binary domain. It builds a graphical model with bivariate dependencies, the dependencies are considered according aχ2hypothesis test. Theχ2computation uses empirical probabilities, thus, the ES-BMDA computes these empirical probabilities by summing the relative frequenciesp^iSgiven by the empirical selection distribution formulae. For example, if we need to compute the marginal probability of the variablexjtakes the value of 1, sayp(xj=1), if an individual i in the population is an instance ofxj=1then we sump^iS, the standard procedure at the end divides the sum over the total number of individuals, when using the empirical selection this normalization is not necessary. This test is performed by using the deceptive function of order 3 shown in Equation 18, which was also used for the experiments in the original paper of the BMDA (Pelikan & Mühlenbein, 1999).

f3deceptive(x,πi)=i=1n3f3(xπ(3(i1)+1)+xπ(3(i1)+2)+xπ(3(i1)+3))E18

where x is a bit string,πis any permutation of order n, and f3 is defined in Equation 19.

f3(u)={0.9ifu=00.8ifu=10ifu=21otherwiseE19

For n = 30 we useπ= { 6, 27, 18, 20, 28, 16, 1, 23, 24, 3, 2, 13, 8, 5, 17, 11, 29, 15, 30, 9, 25, 12, 19, 22, 4, 10, 14, 21, 26). For n = 60,π= { 49, 31, 40, 59, 5, 23, 57, 37, 47, 19, 27, 30, 8, 56, 3, 36, 45, 17, 41, 33, 21, 53, 39, 51, 50, 29, 16, 10, 24, 55, 15, 32, 7, 13, 2, 52, 14, 60, 12, 22, 34, 25, 35, 1, 28, 18, 20, 9, 38, 26, 46, 58, 42, 43, 44, 48, 6, 4, 11}. The deceptive function correlates subsets of 3 variables. Pelikan and Mühlenbein (Pelikan & Mühlenbein, 1999) shown that the BMDA is capable of solving this problem that can not be solved by the simple GA. Our purpose is to show that the ES-BMDA is more efficient than the standard BMDA and can use the information in the whole population, thus it is capable of finding and using more correlations to approximate the optimum.

Experiments description: We reproduce the settings of the original BMDA paper (Pelikan & Mühlenbein, 1999). 30 independent runs are performed with 30 and 60 variables, the population size is 1300, selecting 1000 individuals for the standard EDA. 50% of the best individuals are preserved no matter which selection is used (as in the original paper) for the BMDA and the ES-BMDA. Truncation is 50% of the population, and elitism according to the general EDA in Table 6. The annealing schedule for the Boltzmann selection is fixed by initializingβ0=0,Δβ=1, andβt=βt1+Δβ. For the Boltzmann and proportional selections the objective function is normalized by applying f(x)=g(x)-gmin/(gmax-gmin+1), in order to avoid numerical problems. Stopping criterion: To reproduce the results of the original BMDA we use the ordering parameter as stopping criterion (Pelikan & Mühlenbein, 1999). The ordering parameter is defined in Equation 20, where p is the vector of univariate marginal frequencies pi(1). Whenχ(p)0.95we stop the algorithm that means that the univariate probabilities are almost 1 or 0.
χ(p)=i=1n(pi(1)12)2E20

Results: Table 11 compares the results obtained by the standard BMDA (denoted by St.), and the Empirical Selection based BMDA (denoted by ES). We report the mean and standard deviation of the evaluations and objective function, as well as the result of a hypothesis test based on the Bootstrap methodology with a significance of 5% (Efron & Tibshirani, 1993). We test that the mean of the ES-BMDA is less than the mean of the standard BMDA, and vice versa, for 30 and 60 variables. For the evaluations test we only use the runs in which the optimum is found. The evaluations comparison is not perform for 60 variables, because most of the runs did not find the optimum. The results bring out evidence about the arguments that the empirical selection based EDAs are more efficient, and that the use of the information in the whole population is an advantage to build optimal solutions, because as shown the ES-BMDA needs less evaluations than the St-BMDA, also, the ES-BMDA is more effective to find the optimum in the 60 dimension runs, so, with the same resources (population size) the ES-BMDA can deliver better results.

## 8. Perspectives, future work and conclusions

EDAs researchers have approximated the selection distribution since the first approaches (Mühlenbein, 1997, Bosman & Thierens, 2000). This chapter proposes a general method for this purpose. Most of the search distributions used by EDAs are parametric, and the

 n Truncation Boltzmann Proportional Bin.ary Tournament Deceptive Function of Order 3,Objective Function 30 St 10 (0) 9.993333(0.02537081) 9.997 (0.01825742) 9.997 (0.01825742) ES 10 (0) 10 (0) 9.997 (0.01825742) 9.997 (0.01825742) Hyp. Test N N N N 60 St 19.79 (0.1213431) 19.73 (0.1368362) 19.60667 (0.1818171) 19.81 (0.1028893) ES 19.84 (0.1101723) 19.85(0.1252584) 19.80667(0.1080655) 19.85333(0.1166585) Hyp. Test N ES ES N Evaluations 30 St 14841.67(783.7887) 14300(482.8079) 24526.67(1497.223) 15925(609.5151) ES 14755(619.0023) 14148.33(441.2919) 22923.33(869.2777) 15578.33(578.4467) Hyp. Test N N ES ES Number of times the optimum was reached (in 30 runs) 30 St 30 28 29 29 ES 30 30 29 29 60 St 3 2 0 3 ES 6 9 3 7

### Table 11.

Performance comparison between the standard BMDA (St) and the empirical selection distribution based BMDA (ES).

parameters are learned from a sample by counting frequencies. Thus, in addition to the presented algorithms, any other search distribution which uses frequencies can be modified to use the empirical selection distribution. For instance bivariate models (Pelikan & Mühlenbein, 1999) and histograms (Tsutsui et al., 2001) are completely based on frequencies. Another important line of study are clustering based algorithms (Larrañaga & Lozano, 2001), for example the k-means algorithm is based on distances. When using the empirical selection distribution in clustering, instead of using a single point in the positionxi, we use its relative frequencyp^S(xi,t). This measurement will move the mean of the cluster to the regions with highest fitness values, helping to perform a better search.

Important issues in EDAs such as diversity and premature convergence can be tackled using the empirical selection distribution. Studies on convergence phases (Grahl et al., 2007) have shown that the maximum likelihood estimated variance might not be the best to perform an optimum search. Thus, a possible line of work is: how to favor diversity using the empirical selection distribution? A possibility is by simply modifying the fitness function into the empirical selection distribution formulae. This line of work may be an alternative to recent proposals on variance scaling (Grahl et al., 2006; Bosman et al., 2007).

Multi-objective applications and how to insert the Pareto dominance in the selection distribution is another possible research line. With respect to this topic the Pareto ranking seems to be the most natural way of tackling this important issue.

Ever since the very first proposed EDAs (Baluja, 1994) to the most recent works (Pelikan et al., 2008), incremental learning has been applied to the learning distribution phase. Future work must contemplate how to insert the empirical selection distribution into incremental approaches, or how to use historical or previous empirical selection distributions.

The selection methods presented are just a sample of the possibilities, other methods such as combined truncation-proportional, truncation-Boltzmann, 3-tournament, etcetera must be explored.

Finally, the advantages of the presented method are summarized as follows:

• it is easy to implement.

• It has a wide range of applications.

• It has low computational as well as analytical cost.

• It avoids being wrongly biased by a single solution or a small set.

• It uses all the information from the population to accurately approximate the selection distribution.

• The perspectives, future use and applications are promising, and the possible lines of work are really extensive.

## References

1. 1. BalujaS.1994Population-Based Incremental Learning: A Method for Integrating Genetic Search Based Function Optimization and Competitive Learning.Technical report, Carnegie Mellon University, Pittsburgh, PA, USA, 1994.
2. 2. BosmanP. A. N.GrahlJ.RothlaufF.2007SDR: A Better Trigger for Adaptive Variance Scaling in Normal EDAs.In GECCO’07: Proceedings of the 8th annual conference on Genetic and evolutionary computation,516522,978-1-59593-697-4London, UK. July, 2007, ACM.
3. 3. BosmanP. A. N.ThierensD.2000Expanding from Discrete to Continuous Estimation of Distribution Algorithms: The IDEA.In PPSN VI: Proceedings of the 6th International Conference on Parallel Problem Solving from Nature,767776,3-54041-056-2France, 2000. Springer-Verlag.
4. 4. CooperG.HerskovitsE.1992A Bayesian Method for the Induction of Probabilistic Networks from Data.Machine Learning,94(october, 1992),309347,0885-6125
5. 5. EfronB.TibshiraniR. J.1993An Introduction to the Bootstrap. Chapman and Hall, Monographs on Statistics and Applied Probability,13York, 1993.
6. 6. GrahlJ.BosmanP. A.RothlaufF.2006The Correlation-Triggered Adaptive Variance Scaling IDEA.In GECCO’06: Proceedings of the 8th annual conference on Genetic and evolutionary computation,397404,1-59593-010-8Washington, USA, July, 2006. ACM.
7. 7. GrahlJ.BosmanP. A. N.MinnerS.2007Convergence Phases, Variance Trajectories, and Runtime Analysis of Continuos EDAs. In GECCO’07: Proceedings of the 8th annual conference on Genetic and evolutionary computation,516522,978-1-59593-697-4London, UK. July, 2007, ACM.
8. 8. LarrañagaP.EtxeberriaR.LozanoJ.PeñaJ.1999Optimization by Learning and Simulation of Bayesian and Gaussian Networks.Technical Report EHU-KZAA-IK-4/99, Department of Computer Science and Artificial Intelligence, University of the Basque Country, 1999.
9. 9. LarrañagaP.LozanoJ. A.2001Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers,978-0-79237-466-4Norwell, MA, USA, 2001.
10. 10. LozanoJ. A.LarrañagaP.InzaI.BengoetxeaE.2006Towards a New Evolutionary Computation,Studies in Fuzziness and Soft Computing.192978-3-54029-006-3Springer, 2006.
11. 11. MüehlenbeinH.BendischJ.VoightH. M.1996From recombination of genes to the estimation of distributions II. Continuous parameters.In PPSN IV: Proceedings of the 4th International Conference on Parallel Problem Solving from Nature,188197,978-3-54061-723-5Berlin, Germany, September, 1996, Springer-Verlag.
12. 12. MühlenbeinH.1997The Equation for Response to Selection and Its Use for Prediction.Evolutionary Computation,53September, 1997,303346,1063-6560
13. 13. MühlenbeinH.PaaB. G.1996From Recombination of Genes to the Estimation of Distributions I. Binary Parameters.In PPSN IV: Proceedings of the 4th International Conference on Parallel Problem Solving from Nature,178187,978-3-54061-723-5Berlin, Germany, September, 1996. Springer-Verlag.
14. 14. PelikanM.GoldbergD. E.Cantú-PazE.1999BOA: The Bayesian Optimization Algorithm. In W. Banzhaf, J. Daida, A. E. Eiben, M. H. Garzon, V. Honavar, M. Jakiela, and R. E. Smith, editors,Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, Vol. I,525532, Orlando, FL, USA, 1999. Morgan Kaufmann Publishers, San Fransisco, CA.
15. 15. PelikanM.MühlenbeinH.1999The Bivariate Marginal Distribution Algorithm.Advances in Soft Computing Engineering Design and Manufacturing,521535, 1999.
16. 16. PelikanM. .SastryK.Cantu-PazE.2006Scalable Optimization via Probabilistic Modeling,Studies in Computational Intelligence.33Springer,978-3-54034-953-2
17. 17. PelikanM.SastryK.GoldbergD. E.2008iBOA: The Incremental Bayesian Optimization Algorithm.In GECCO’08: Proceedings of the 2008 Conference on Genetic and Evolutionary Computation,455462,978-1-60558-131-6Atlanta, USA, July, 2008, ACM.
18. 18. SotoM.OchoaA.2000A Factorized Distribution Algorithm based on Polytrees.In Proceedings of the 2000 Congress on Evolutionary Computation (CEC 2000),232237,0-78036-375-2Jolla, CA, USA, July, 2000, IEEE.
19. 19. TsutsuiS.PelikanM.GoldbergD. E.2001Evolutionary Algorithm using Marginal Histogram Models in Continuous Domain.Technical Report 2001019, Illinois Genetic Algorithms Laboratory, 2001.
20. 20. YunpengC.XiaominS.PeifaJ.2006Probabilistic Modeling for Continuous EDA with Boltzmann selection and Kullback-Leibeler Divergence.In GECCO’06: Proceedings of the 8th annual conference on Genetic and evolutionary computation,389396,1-59593-010-8Washington, USA, July, 2006. ACM.
21. 21. ZhangQ.MühlenbeinH.2004On the Convergence of a Class of Estimation of Distribution Algorithms.IEEE Transactions on Evolutionary Computation,82(April, 2004),127136,0108-9778X.

Written By

S. Ivvan Valdez, Arturo Hernández and Salvador Botello

Published: February 1st, 2010