Programming with Annotated Grammar Estimation



Introduction
Evolutionary algorithms (EAs) mimic natural evolution to solve optimization problems. Because EAs do not require detailed assumptions about the problem, they can be applied to many real-world problems. In EAs, solution candidates are evolved using genetic operators such as crossover and mutation, which are analogous to mechanisms of natural evolution. In recent years, EAs have been considered from the viewpoint of distribution estimation, with estimation of distribution algorithms (EDAs) attracting much attention ([14]). Although genetic operators in EAs are inspired by natural evolution, EAs can also be regarded as algorithms that sample solution candidates from distributions of promising solutions. Since these distributions are generally unknown, approximation schemes are applied to perform the sampling. Genetic algorithms (GAs) and genetic programming (GP) approximate the sampling by randomly changing the promising solutions via genetic operators (mutation and crossover). In contrast, EDAs assume that the distributions of promising solutions can be expressed by parametric models, and they repeatedly perform model learning and sampling from the learnt models.
Although GA-type sampling (mutation or crossover) is easy to perform, it is valid only when two structurally similar individuals have similar fitness values (e.g. the one-max problem). GA and GP have shown poor search performance in deceptive problems ([6]), where this condition is not satisfied. EDAs, however, have been reported to show much better search performance for some problems that GA and GP do not handle well. As in GAs, EDAs usually employ fixed-length linear arrays to represent solution candidates (these EDAs are referred to as GA-EDAs in the present chapter). In the past decade, EDAs have been extended to handle programs and functions having tree structures (we refer to these as GP-EDAs in the present chapter). Since tree structures differ in their number of nodes, model learning is much more difficult than in GA-EDAs. From the viewpoint of modeling types, GP-EDAs can be broadly classified into two groups: probabilistic prototype tree (PPT) based methods and probabilistic context-free grammar (PCFG) based methods. PPT-based methods employ techniques devised in GA-EDAs by transforming variable-length tree structures into fixed-length linear arrays. PCFG-based methods employ PCFG to model tree structures. PCFG-based methods have an advantage over PPT-based methods in that they can estimate position-independent building blocks.
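The learn-then-sample loop that distinguishes EDAs from GA-type sampling can be illustrated with a minimal sketch. It is a univariate GA-EDA on the one-max problem, not PAGE; the function name `umda_onemax`, all parameter values and the probability clamping are assumptions made for this illustration.

```python
import random

def umda_onemax(n_bits=20, pop_size=60, n_select=20, generations=40, seed=0):
    """Univariate EDA sketch: learn per-bit probabilities, then resample."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits                      # initial model: uniform over bit strings
    best = None
    for _ in range(generations):
        # Sampling step: draw a population from the current model.
        pop = [[1 if rng.random() < p else 0 for p in probs] for _ in range(pop_size)]
        pop.sort(key=sum, reverse=True)         # one-max fitness = number of ones
        if best is None or sum(pop[0]) > sum(best):
            best = pop[0]
        # Learning step: re-estimate each bit probability from the selected individuals.
        selected = pop[:n_select]
        probs = [sum(ind[i] for ind in selected) / n_select for i in range(n_bits)]
        # Clamp probabilities (an assumption of this sketch) so no bit value dies out.
        probs = [min(max(p, 0.05), 0.95) for p in probs]
    return best, probs

best, probs = umda_onemax()
print(sum(best), "ones out of", len(best))
```

GP-EDAs follow the same loop, but the model must describe variable-size trees rather than fixed-length bit strings.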
The conventional PCFG adopts the context freedom assumption that the probabilities of production rules do not depend on their contexts, namely parent or sibling nodes. Although the context freedom assumption makes parameter estimation easier, it cannot in principle take interactions among nodes into account. In general, programs and functions have dependencies among nodes, and as a consequence, the conventional PCFG is not suitable as a baseline model of GP-EDAs. In the field of natural language processing (NLP), many approaches have been proposed in order to weaken the context freedom assumption of PCFG. For instance, vertical Markovization annotates symbols with their ancestor symbols and has been adopted as a baseline grammar of vectorial stochastic grammar based GP (vectorial SG-GP) or grammar transformation in an EDA (GT-EDA) ([4]) (see Section 2). Matsuzaki et al. ([17]) proposed the PCFG with latent annotations (PCFG-LA), which assumes that all annotations are latent and estimates the annotations from learning data. Because latent annotation models are much richer than fixed annotation models, GP-EDAs using PCFG-LA are expected to grasp the interactions among nodes more precisely than GP-EDAs based on fixed annotations. In GA-EDAs, EDAs with Bayesian networks or Markov networks have exhibited better search performance than simpler models such as a univariate model. In a similar way, GP-EDAs using PCFG-LA are expected to be more powerful than GP-EDAs using PCFG with heuristics-based annotations, because PCFG-LA is a much more flexible model. We have proposed a GP-EDA named programming with annotated grammar estimation (PAGE), which adopts PCFG-LA as a baseline grammar ([9,12]). In Section 4 of the present chapter, we explain the details of PAGE, including the parameter update formula.
As explained above, EDAs model promising solutions with parametric distributions. In multimodal problems, it is not sufficient to express promising solutions with only one model, because the dependencies are in general different for each optimal solution. For tree structures, this problem arises even in unimodal optimization problems, because the same solution can be expressed by many different trees. These problems can be tackled by considering the global context of each individual, that is, which optimum (e.g. one of the multiple solutions in a multimodal problem) the individual derives from. Consequently, we have proposed the PCFG-LA mixture model (PCFG-LAMM), which extends PCFG-LA into a mixture model, and have also proposed a new GP-EDA named unsupervised PAGE (UPAGE), which employs PCFG-LAMM as a baseline grammar ([11]). By using PCFG-LAMM, not only local dependencies but also global contexts behind individuals can be taken into account.
The main objectives of the proposed algorithms may be summarized as follows:
1. PAGE employs PCFG-LA to consider local dependencies among nodes.
2. UPAGE employs PCFG-LAMM to take into account global contexts behind individuals in addition to the local dependencies.
This chapter is structured as follows. Following a section on related work, we briefly introduce the basics of PCFG. We explain PAGE in Section 4, where details of PCFG-LA, forward-backward probabilities and a parameter update formula are provided. In Section 5, we propose UPAGE, which is a mixture model extension of PAGE. We describe PCFG-LAMM and also derive a parameter update formula for UPAGE. We then compare the performance of UPAGE and PAGE on three benchmark problems. We discuss the results obtained in these experiments in Section 6. Finally, we conclude the present chapter in Section 7.

Related work
Many GP-EDAs have been proposed, and these methods can be broadly classified into two groups: (i) PPT based methods and (ii) grammar model based methods.
Methods of type (i) employ techniques developed in GA-EDAs. This type of algorithm converts tree structures into the fixed-length chromosomes used in GA and applies the probabilistic models of GA-EDAs. Probabilistic incremental program evolution (PIPE) ([25]) is a univariate model, which can be considered to be a combination of population-based incremental learning (PBIL) ([3]) and GP. Because tree structures have explicit edges between parent and children nodes, estimation of distribution programming (EDP) ([37,38]) …
Methods of type (ii) are based on Whigham's grammar-guided genetic programming (GGGP) ([33]). GGGP expresses individuals using derivation trees (see Section 3), in contrast with the conventional GP. Whigham indicated the connection between PCFG and GP ([35]), and in fact the probability table learning in GGGP can be viewed as an EDA with local search. Stochastic grammar based GP (SG-GP) ([23]) applied the concept of PBIL to GGGP. The authors of SG-GP also proposed vectorial SG-GP, which considers depth in its grammar (simple SG-GP is then called scalar SG-GP). Program evolution with explicit learning (PEEL) ([28]) takes into account the positions (arguments) and depths of symbols. Unlike SG-GP and PEEL, which employ predefined grammars, grammar model based program evolution (GMPE) ([29]) learns not only parameters but also the grammar itself from promising solutions. GMPE starts from specialized production rules which exclusively generate the learning data and merges non-terminals to yield more general production rules using the MDL principle. Grammar transformation in an EDA (GT-EDA) ([4]) extracts good subroutines using the MDL principle. GT-EDA starts from general rules and expands non-terminals to yield more specialized production rules. Although the concept of GT-EDA is similar to that of GMPE, the learning procedure runs in the opposite direction [specialized to general (GMPE) versus general to specialized (GT-EDA)].
Tanev proposed GP based on a probabilistic context sensitive grammar ([31,32]). He used sibling nodes and a parent node as context information, and production rule probabilities are expressed as probabilities conditioned on this context information. Bayesian automatic programming (BAP) ([24]) uses a Bayesian network to consider relations among production rules in PCFG.
There are other GP-EDAs not belonging to either of the groups presented above. N-gram GP ([21]) is based on linear GP ([18]), which uses the assembly language of a register-based CPU, and learns sub-sequences using an N-gram model. The N-gram model, which is very popular in NLP, considers sub-sequences of N consecutive symbols when calculating the probabilities of symbols. AntTAG ([1]) also shares similar concepts with GP-EDAs, although AntTAG does not employ a statistical inference method for probability learning; instead, AntTAG employs the ant colony optimization method (ACO), where the pheromone matrix in ACO can be interpreted as a probability distribution.

Basics of PCFG
In this section, we explain basic concepts of PCFG. It is important to note that the terms "non-terminal" and "terminal" in CFG are different from those in GP (for example, in symbolic regression problems, not only the variables x and y but also sin and + are treated as terminals in CFG). In CFG, sentences are generated by applying production rules to non-terminal symbols. Production rules are generally given by
$$A \rightarrow \alpha \quad (A \in N,\ \alpha \in (N \cup T)^{*}). \tag{1}$$
In Equation 1, $N$ and $T$ are the sets of non-terminal and terminal symbols, respectively, and $(N \cup T)^{*}$ represents the set of possible sequences composed of elements of $(N \cup T)$. By applying production rules to the start symbol $B$, grammar $G$ generates sentences. The language generated by grammar $G$ is represented by $L(G)$. If $W \in L(G)$, then $W \in T^{*}$.
By applying a production rule, a non-terminal $A$ is replaced by another sequence of symbols. For instance, application of the production rule represented by Equation 1 to $\alpha_1 A \alpha_2$ ($\alpha_1, \alpha_2 \in (N \cup T)^{*}$, $A \in N$) yields $\alpha_1 \alpha \alpha_2$. In this case, it is said that "$\alpha_1 A \alpha_2$ derived $\alpha_1 \alpha \alpha_2$", and this process is represented as follows:
$$\alpha_1 A \alpha_2 \underset{G}{\Rightarrow} \alpha_1 \alpha \alpha_2.$$
Furthermore, if we have the following consecutive applications
$$\alpha_1 \underset{G}{\Rightarrow} \alpha_2 \underset{G}{\Rightarrow} \cdots \underset{G}{\Rightarrow} \alpha_n,$$
$\alpha_n$ is derived from $\alpha_1$, which is described by $\alpha_1 \overset{*}{\underset{G}{\Rightarrow}} \alpha_n$. This derivation process can be represented by a tree structure, which is known as a derivation tree. Derivation trees of grammar $G$ are defined as follows.
1. Node is an element of $(N \cup T)$
2. Root is $B$
3. Branch node is an element of $N$
4. If children of $A \in N$ are $\alpha_1 \alpha_2 \cdots \alpha_k$ ($\alpha_i \in (N \cup T)$) from left, production rule $A \rightarrow \alpha_1 \alpha_2 \cdots \alpha_k$ is an element of $R$
We next explain CFG with an example. We now consider a univariate function f(x) composed of sin, cos, exp, log and arithmetic operators (+, −, × and ÷). A grammar $G_{reg}$ can be defined for such functions together with a set of production rules.
Table: production rules of $G_{reg}$ (columns: #, Production rule).
In this case, the derived function is log x + x + C, and its derivation process is represented by the derivation tree in Figure 1(a).
Although functions and programs are represented with standard tree representations (S-expressions) in the conventional GP (Figure 1(b)), derivation trees can express the same functions and programs. Consequently, derivation trees can be used in program evolution, and GGGP ([33,34]) adopted derivation trees for its chromosome.
We next proceed to PCFG, which extends CFG by adding probabilities to each production rule. For example, the likelihood (probability) of the derivation tree in Fig. 1(a) is the product of the probability of its root symbol and the probabilities of the production rules applied in it:
$$P(W, T) = \pi(\langle expr \rangle) \prod_{(A \rightarrow \alpha) \in T} \beta(A \rightarrow \alpha),$$
where $W \in T^{*}$ is a sentence (i.e. $W$ corresponds to log x + x + C in $G_{reg}$), $T$ is a derivation tree, $\pi(\langle expr \rangle)$ is the probability of $\langle expr \rangle$, $\beta(A \rightarrow \alpha)$ is the probability of a production rule $A \rightarrow \alpha$, and the product runs over every application of a production rule in $T$. Furthermore, the probability $P(W)$ of sentence $W$ is given by calculating the marginal probability in terms of $T \in \Phi(W)$:
$$P(W) = \sum_{T \in \Phi(W)} P(W, T), \tag{2}$$
where $\Phi(W)$ is the set of all possible derivation trees which derive $W$. In NLP, inference of the production rule parameters $\beta(A \rightarrow \alpha)$ is carried out with learning data $\mathcal{W} = \{W_1, W_2, \cdots\}$, which is a set of sentences. The learning data does not have information about derivation processes. Because there are many possible derivations $\Phi(W)$ for long sentences, directly calculating $P(W)$ by marginalization over $\Phi(W)$ (Equation 2) is computationally intractable. Consequently, a computationally efficient method called the inside-outside algorithm is used to estimate the parameters. The inside-outside algorithm takes advantage of dynamic programming to reduce the computational cost. In contrast to the case of NLP, the derivation trees are observed in GP-EDAs, and the parameter estimation of production rules in GP-EDAs with PCFG is therefore very easy. However, when using more complicated grammars such as PCFG-LA, more advanced estimation methods (i.e. the expectation maximization (EM) algorithm ([5])) have to be used even when derivation trees are given.
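Because derivation trees are observed in GP-EDAs, the production rule probabilities of a plain PCFG can be estimated by relative frequency. The following is a minimal sketch of that idea; the tree encoding (a `(symbol, children)` pair with terminals as plain strings), the toy grammar with a single non-terminal `<expr>` and the function names are all assumptions of this illustration, and the root probability is omitted because the toy grammar has only one start symbol.

```python
from collections import Counter, defaultdict

def rules_of(tree):
    """Yield (lhs, rhs) for every rule application in a derivation tree."""
    lhs, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield lhs, rhs
    for c in children:
        if not isinstance(c, str):
            yield from rules_of(c)

def estimate_pcfg(trees):
    """Relative-frequency estimate of beta(A -> alpha) from observed trees."""
    counts = Counter(rule for t in trees for rule in rules_of(t))
    totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}

def tree_probability(tree, beta):
    """Product of the probabilities of all rules applied in the tree."""
    p = 1.0
    for rule in rules_of(tree):
        p *= beta.get(rule, 0.0)
    return p

# Two hypothetical derivation trees: "x + x" and "x".
t1 = ("<expr>", [("<expr>", ["x"]), "+", ("<expr>", ["x"])])
t2 = ("<expr>", ["x"])
beta = estimate_pcfg([t1, t2])
print(beta)
print(tree_probability(t1, beta))
```

In this toy data the rule `<expr> -> x` is applied three times out of four applications of `<expr>`, so it receives probability 0.75.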

PAGE
Our proposed algorithm PAGE is based on PCFG-LA. In PCFG-LA, latent annotations are estimated from promising solutions using the EM algorithm, and PCFG-LA takes advantage of forward-backward probabilities for computationally efficient estimation. In this section, we describe the details of PCFG-LA, forward-backward probabilities and a parameter update formula derived from the EM algorithm.

PCFG-LA
Although the PCFG-LA used in PAGE has been developed specifically for the present application, it is essentially identical to the conventional PCFG-LA. In this section, we describe the specialized version of PCFG-LA. For further details on PCFG-LA, the reader may refer to Ref. ([17]).
PCFG-LA assumes that every non-terminal is labeled with an annotation. In the complete form, non-terminals are represented by $A[x]$, where $A$ is the observed non-terminal symbol, $x \in H$ is an annotation (which is latent), and $H$ is a set of annotations (in this chapter, $H$ contains $h$ annotations, where $h$ is the annotation size). Fig. 2 shows an example of a tree with annotations (a), and the corresponding observed tree (b). The likelihood of an annotated tree (complete data) is given by
$$P(T_i, X_i; \beta, \pi) = \pi(S[x_i^1]) \prod_{r \in R[H]} \beta(r)^{c(r; T_i, X_i)},$$
where $T_i$ denotes the $i$th derivation tree; $X_i = \{x_i^1, x_i^2, \cdots\}$ is the set of latent annotations of $T_i$ ($x_i^j$ is the annotation of the $j$th non-terminal of $T_i$, with $j = 1$ denoting the root); $\pi(S[x])$ is the probability of $S[x]$ at the root; $\beta(r)$ is the probability of annotated production rule $r$; $R[H]$ is the set of annotated production rules; and $c(r; T_i, X_i)$ is the number of times $r$ is applied in $T_i$ annotated with $X_i$. The likelihood of an observed tree can be calculated by summing over annotations:
$$P(T_i; \beta, \pi) = \sum_{X_i} P(T_i, X_i; \beta, \pi).$$
PCFG-LA estimates $\beta$ and $\pi$ using the EM algorithm. Before explaining the estimation procedure, we should note the form of production rules. In PAGE, production rules are not in Chomsky normal form (CNF), as is assumed in the original PCFG-LA, for the sake of the understandability of GP programs. Any function which can be handled with traditional GP can be represented by
$$S \rightarrow g\, S \cdots S,$$
which is a subset of Greibach normal form (GNF). Here $S \in N$ and $g \in T$ ($N$ and $T$ are the sets of non-terminal and terminal symbols in CFG; see Section 3). A terminal symbol $g$ in CFG is a function node (+, −, sin, cos ∈ F) or a terminal (v, w ∈ T) in GP (F and T denote the sets of GP functions and terminals, respectively). Annotated production rules are
$$S[x] \rightarrow g\, S[z_1]\, S[z_2] \cdots S[z_{a_{max}}],$$
where $x, z_m \in H$ and $a_{max}$ is the arity of $g$ in GP. If $g$ has arity $a_{max}$, the number of parameters for the production rule $S \rightarrow g\, S \cdots S$ with annotations is $h^{a_{max}+1}$, which increases exponentially as the arity increases. In order to reduce the number of parameters, we assume that all the right-hand side non-terminal symbols have the same annotation, that is
$$S[x] \rightarrow g\, S[y]\, S[y] \cdots S[y] \quad (x, y \in H).$$
With this assumption, the number of parameters can be reduced to $h^2$, which is tractable.
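The effect of the shared-annotation assumption on the parameter count can be checked with a few lines of arithmetic. The helper name `annotated_rule_params` and the annotation size 8 used below are assumptions of this illustration.

```python
def annotated_rule_params(h, arity, shared_child_annotation):
    """Number of annotated variants of one rule S -> g S...S with `arity` children."""
    if arity == 0:
        return h                      # only the left-hand side carries an annotation
    # Free child annotations give h^(arity+1); one shared child annotation gives h^2.
    return h ** 2 if shared_child_annotation else h ** (arity + 1)

for arity in (1, 2, 3):
    free = annotated_rule_params(8, arity, False)
    shared = annotated_rule_params(8, arity, True)
    print(f"arity={arity}: free={free}, shared={shared}")
```

For a 3-arity function and an annotation size of 8, the count drops from 4096 to 64.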

Forward-backward probability
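We explain forward and backward probabilities for PCFG-LA in this section. PCFG-LA ([17]) adopted forward and backward probabilities to apply the EM algorithm ([5]). The backward probability $b_T^i(x; \beta, \pi)$ represents the probability that the tree beneath the $i$th non-terminal $S[x]$ is generated ($\beta$ and $\pi$ are parameters, Fig. 3(b)), and the forward probability $f_T^i(y; \beta, \pi)$ represents the probability that the tree above the $i$th non-terminal $S[y]$ is generated (Fig. 3(a)). Forward and backward probabilities can be recursively calculated as follows:
$$b_T^i(x; \beta, \pi) = \begin{cases} \beta(S[x] \rightarrow g_T^i) & (ch(i, T) = \emptyset) \\ \sum_{y \in H} \beta(S[x] \rightarrow g_T^i\, S[y] \cdots S[y]) \prod_{j \in ch(i, T)} b_T^j(y; \beta, \pi) & (\text{otherwise}) \end{cases}$$
$$f_T^i(y; \beta, \pi) = \begin{cases} \pi(S[y]) & (i = 1) \\ \sum_{x \in H} f_T^{pa(i,T)}(x; \beta, \pi)\, \beta(S[x] \rightarrow g_T^{pa(i,T)}\, S[y] \cdots S[y]) \prod_{j \in ch(pa(i,T), T),\, j \neq i} b_T^j(y; \beta, \pi) & (i \neq 1) \end{cases}$$
where $ch(i, T)$ is a function that returns the set of non-terminal children indices of the $i$th non-terminal in $T$, $pa(i, T)$ returns the parent index of the $i$th non-terminal in $T$, $g_T^i$ is the terminal symbol in CFG that is connected to the $i$th non-terminal symbol in $T$, and $i = 1$ denotes the root. For example, for the tree shown in Fig. 4, $ch(3, T) = \{5, 6\}$, $pa(5, T) = 3$, and $g_T^2 = \sin$. Using the forward-backward probabilities, $P(T; \beta, \pi)$ can be expressed by the following two equations:
$$P(T; \beta, \pi) = \sum_{x \in H} \pi(S[x])\, b_T^1(x; \beta, \pi),$$
$$P(T; \beta, \pi) = \sum_{x, y \in H} \beta(S[x] \rightarrow g\, S[y] \cdots S[y])\, f_T^i(x; \beta, \pi) \prod_{j \in ch(i, T)} b_T^j(y; \beta, \pi) \quad (i \in cover(g, T)).$$
Here, $cover(g, T_i)$ represents a function that returns the set of non-terminal indices at which the production rule generating $g$ without annotations is rooted in $T_i$. For example, if $g = +$ and $T$ is the tree represented in Fig. 4, then $cover(+, T) = \{1, 3\}$. For a symbol $g$ without non-terminal children, the sum over $y$ and the product over children in the second equation are dropped.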
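The backward recursion can be sketched in a few lines and checked against a brute-force sum over all annotation assignments. This is a minimal illustration under stated assumptions, not the authors' implementation: trees are encoded as `(g, children)` pairs, annotations are the integers `0..H-1`, rule probabilities are stored in a dictionary keyed by `(x, g, y)` (or `(x, g)` for symbols without children), and the symbols in `ARITY` are hypothetical.

```python
import random
from itertools import product

H = 2                                   # annotation size h (assumed)
ARITY = {"+": 2, "sin": 1, "x": 0}      # hypothetical GP symbols

def random_params(rng):
    """Random beta and uniform pi; beta is normalized per left-hand side S[x]."""
    beta = {}
    for x in range(H):
        keys = []
        for g, a in ARITY.items():
            keys += [(x, g)] if a == 0 else [(x, g, y) for y in range(H)]
        weights = [rng.random() + 0.1 for _ in keys]
        total = sum(weights)
        beta.update({k: w / total for k, w in zip(keys, weights)})
    return beta, [1.0 / H] * H

def backward(tree, beta):
    """b(x) for the root of `tree`: probability of the subtree below S[x]."""
    g, children = tree
    if not children:
        return [beta[(x, g)] for x in range(H)]
    child_b = [backward(c, beta) for c in children]
    result = []
    for x in range(H):
        total = 0.0
        for y in range(H):              # all children share the annotation y
            term = beta[(x, g, y)]
            for cb in child_b:
                term *= cb[y]
            total += term
        result.append(total)
    return result

def likelihood(tree, beta, pi):
    """P(T) = sum_x pi(S[x]) * b_root(x)."""
    b_root = backward(tree, beta)
    return sum(pi[x] * b_root[x] for x in range(H))

def brute_force(tree, beta, pi):
    """Same quantity by enumerating every annotation assignment (exponential cost)."""
    nodes = []
    def flatten(t):
        i = len(nodes)
        nodes.append((t[0], []))
        for c in t[1]:
            nodes[i][1].append(flatten(c))
        return i
    flatten(tree)
    total = 0.0
    for ann in product(range(H), repeat=len(nodes)):
        p = pi[ann[0]]
        for i, (g, ch) in enumerate(nodes):
            if not ch:
                p *= beta[(ann[i], g)]
            elif len({ann[j] for j in ch}) == 1:
                p *= beta[(ann[i], g, ann[ch[0]])]
            else:
                p = 0.0                 # siblings must share one annotation
        total += p
    return total

tree = ("+", [("sin", [("x", [])]), ("x", [])])
beta, pi = random_params(random.Random(1))
fast, slow = likelihood(tree, beta, pi), brute_force(tree, beta, pi)
print(fast, slow)
```

The recursion visits each node once per pair of annotations, whereas the enumeration grows exponentially with the number of nodes.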

Parameter update formula
We describe the parameter estimation in PCFG-LA. Because PCFG-LA contains latent variables $X$, the parameter estimation is carried out with the EM algorithm. Let $\beta$ and $\pi$ be the current parameters and $\bar{\beta}$ and $\bar{\pi}$ be the next-step parameters. The Q function to optimize in the EM algorithm can be expressed as follows:
$$Q(\bar{\beta}, \bar{\pi} | \beta, \pi) = \sum_{i=1}^{N} \sum_{X_i} P(X_i | T_i; \beta, \pi) \log P(T_i, X_i; \bar{\beta}, \bar{\pi}),$$
where $N$ is the number of learning data (promising solutions in EDA). The set of learning data is represented by $D \equiv \{T_1, T_2, \cdots, T_N\}$. Using the forward-backward probabilities and maximizing $Q$ under the normalization constraints on $\bar{\beta}$ and $\bar{\pi}$, we obtain the following update formula:
$$\bar{\beta}(S[x] \rightarrow g\, S[y] \cdots S[y]) \propto \sum_{i=1}^{N} \frac{1}{P(T_i; \beta, \pi)} \sum_{j \in cover(g, T_i)} \beta(S[x] \rightarrow g\, S[y] \cdots S[y])\, f_{T_i}^j(x; \beta, \pi) \prod_{k \in ch(j, T_i)} b_{T_i}^k(y; \beta, \pi), \tag{15}$$
$$\bar{\pi}(S[x]) \propto \sum_{i=1}^{N} \frac{1}{P(T_i; \beta, \pi)}\, \pi(S[x])\, b_{T_i}^1(x; \beta, \pi), \tag{16}$$
where $\bar{\beta}$ is normalized over all production rules with the same left-hand side $S[x]$ and $\bar{\pi}$ is normalized over $x \in H$. The EM algorithm maximizes the log-likelihood given by
$$\sum_{i=1}^{N} \log P(T_i; \beta, \pi).$$
By iteratively performing Equations 15-16, the log-likelihood monotonically increases and we obtain locally maximum likelihood estimates of the parameters. For the EM algorithm, the annotation size $h$ has to be given in advance. Because the EM algorithm is a point estimation method, it cannot estimate the optimum annotation size. For models that do not include latent variables, a model selection method such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) is often used. However, these methods take advantage of the asymptotic normality of estimators, which is not satisfied in models that include latent variables. In Ref. ([12]), we derived variational Bayesian (VB) ([2]) based inference for PCFG-LA, which can estimate the optimal annotation size. Because the derivation of the VB-based algorithm is much more complicated than that of the EM algorithm and is outside the scope of this chapter, we do not explain the details of the VB-based algorithm. For details of VB-based PAGE, please see Ref. ([12]).
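The monotone increase of the log-likelihood under EM can be observed on a toy example. The sketch below deliberately replaces the forward-backward computation with brute-force enumeration of annotation assignments, which is only feasible for very small trees, so it illustrates the E-step and M-step rather than the efficient algorithm. The tree encoding, the symbols in `ARITY`, the annotation size and the three training trees are assumptions of this illustration.

```python
import math
import random
from collections import defaultdict
from itertools import product

H = 2                                   # annotation size h (assumed)
ARITY = {"+": 2, "sin": 1, "x": 0}      # hypothetical GP symbols

def random_params(rng):
    beta = {}
    for x in range(H):
        keys = []
        for g, a in ARITY.items():
            keys += [(x, g)] if a == 0 else [(x, g, y) for y in range(H)]
        weights = [rng.random() + 0.1 for _ in keys]
        total = sum(weights)
        beta.update({k: w / total for k, w in zip(keys, weights)})
    return beta, [1.0 / H] * H

def flatten(tree):
    """Return a list of (g, child indices); index 0 is the root."""
    nodes = []
    def rec(t):
        i = len(nodes)
        nodes.append((t[0], []))
        for c in t[1]:
            nodes[i][1].append(rec(c))
        return i
    rec(tree)
    return nodes

def rule_keys(nodes, ann):
    """Annotated rules used by one assignment, or None if siblings disagree."""
    keys = []
    for i, (g, ch) in enumerate(nodes):
        if not ch:
            keys.append((ann[i], g))
        elif len({ann[j] for j in ch}) == 1:
            keys.append((ann[i], g, ann[ch[0]]))
        else:
            return None
    return keys

def em_step(trees, beta, pi):
    """One EM iteration; also returns the log-likelihood under the old parameters."""
    exp_rule, exp_root, loglik = defaultdict(float), [0.0] * H, 0.0
    for tree in trees:
        nodes = flatten(tree)
        weighted = []
        for ann in product(range(H), repeat=len(nodes)):
            keys = rule_keys(nodes, ann)
            if keys is None:
                continue
            p = pi[ann[0]]
            for k in keys:
                p *= beta[k]
            weighted.append((p, ann[0], keys))
        p_tree = sum(w[0] for w in weighted)
        loglik += math.log(p_tree)
        for p, root, keys in weighted:          # E-step: posterior expected counts
            exp_root[root] += p / p_tree
            for k in keys:
                exp_rule[k] += p / p_tree
    lhs_total = defaultdict(float)
    for k, c in exp_rule.items():
        lhs_total[k[0]] += c
    # M-step: normalize expected counts (keep old values for an unused left-hand side).
    new_beta = {k: exp_rule.get(k, 0.0) / lhs_total[k[0]] if lhs_total[k[0]] > 0 else beta[k]
                for k in beta}
    new_pi = [c / len(trees) for c in exp_root]
    return new_beta, new_pi, loglik

trees = [("+", [("x", []), ("x", [])]),
         ("sin", [("x", [])]),
         ("+", [("sin", [("x", [])]), ("x", [])])]
beta, pi = random_params(random.Random(0))
logliks = []
for _ in range(10):
    beta, pi, ll = em_step(trees, beta, pi)
    logliks.append(ll)
print(logliks)
```

Each printed value is the log-likelihood before the corresponding update, so the sequence should never decrease.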
The procedures of PAGE are listed below.
1. Generate initial population: The initial population $P_0$ is generated by randomly creating $M$ individuals.
2. Select promising solutions: $N$ individuals $D_g$ are selected from the population of the $g$th generation $P_g$. In our implementation, we use truncation selection.
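Truncation selection, as used in the selection step above, simply keeps the fittest individuals. A minimal sketch follows; the function name and the toy population of integers (where the fitness is the value itself) are assumptions of this illustration.

```python
def truncation_selection(population, fitness, n_select):
    """Return the n_select fittest individuals (higher fitness is better)."""
    return sorted(population, key=fitness, reverse=True)[:n_select]

# Hypothetical population of integers whose fitness is the value itself.
selected = truncation_selection([3, 1, 4, 1, 5, 9, 2, 6], lambda v: v, 3)
print(selected)
```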

Unsupervised PAGE
In this section, we introduce UPAGE ([11]), which is a mixture model extension of PAGE. UPAGE uses PCFG-LAMM as a baseline grammar, and we explain the details of PCFG-LAMM and a parameter update formula in this section.

PCFG-LAMM
Although PCFG-LA is suitable for estimating local dependencies among nodes, it cannot consider global contexts behind individuals. Suppose there are two optimal solutions represented by $F_1(x)$ and $F_2(x)$. In this case, a population includes solution candidates for $F_1(x)$ and $F_2(x)$ at the same time. Since the building blocks of the two optimal solutions are different, model and parameter learning with one model results in slow convergence due to the mixed learning data. Furthermore, in GP there are multiple optimal structures even if the problem to be solved is not multimodal. For instance, if an optimum includes a substructure represented by sin(2x), then both sin(2x) and 2 sin(x) cos(x), which are mathematically equivalent but have different tree representations, can be building blocks. When modeling such a mixed population, it is very difficult for PCFG-LA to estimate these multiple structures separately, as in the multimodal case. We have proposed PCFG-LAMM, which is a mixture model extension of PCFG-LA, and have also proposed UPAGE based on PCFG-LAMM.
PCFG-LAMM assumes that the probability distribution is a mixture of two or more PCFG-LA models. In PCFG-LAMM, each solution is considered to be sampled from one of the PCFG-LA models (Figure 5). We introduce a latent variable $z_i^k$, where $z_i^k$ is 1 when the $i$th derivation tree is generated from the $k$th model and 0 otherwise ($Z_i = \{z_i^1, z_i^2, \cdots, z_i^\mu\}$). The variables are summarized in Appendix B. As a consequence, PCFG-LAMM handles $X_i$ and $Z_i$ as latent variables. The likelihood of complete data is given by
$$P(T_i, X_i, Z_i; \beta, \pi, \zeta) = \prod_{k=1}^{\mu} \left\{ \zeta^k P(T_i, X_i; \beta^k, \pi^k) \right\}^{z_i^k}, \tag{18}$$
where $\zeta^k$ is the mixture ratio of the $k$th model ($\zeta = \{\zeta^1, \zeta^2, \cdots, \zeta^\mu\}$ where $\sum_k \zeta^k = 1$), and $P(T_i, X_i; \beta^k, \pi^k)$ is the likelihood of an annotated tree in PCFG-LA evaluated with the parameters of the $k$th model. $\beta^k(r)$ and $\pi^k(S[x])$ denote the probabilities of production rule $r$ and root $S[x]$ of the $k$th model, respectively. By calculating the marginal of Equation 18 with respect to $X_i$ and $Z_i$, the likelihood of observed tree $T_i$ is calculated as
$$P(T_i; \beta, \pi, \zeta) = \sum_{k=1}^{\mu} \zeta^k P(T_i; \beta^k, \pi^k).$$

Parameter update formula
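As in PCFG-LA, the parameter inference of PCFG-LAMM is carried out via the EM algorithm because PCFG-LAMM contains latent variables $X_i$ and $Z_i$. Let $\beta$, $\pi$ and $\zeta$ be the current parameters and $\bar{\beta}$, $\bar{\pi}$ and $\bar{\zeta}$ be the next-step parameters. The Q function of the EM algorithm is given by
$$Q(\bar{\beta}, \bar{\pi}, \bar{\zeta} | \beta, \pi, \zeta) = \sum_{i=1}^{N} \sum_{X_i} \sum_{Z_i} P(X_i, Z_i | T_i; \beta, \pi, \zeta) \log P(T_i, X_i, Z_i; \bar{\beta}, \bar{\pi}, \bar{\zeta}).$$
By maximizing $Q(\bar{\beta}, \bar{\pi}, \bar{\zeta} | \beta, \pi, \zeta)$ under the constraints ($\sum_k \bar{\zeta}^k = 1$, $\sum_x \bar{\pi}^k(S[x]) = 1$ and $\sum_\alpha \bar{\beta}^k(S[x] \rightarrow \alpha) = 1$), a parameter update formula can be obtained as follows (see Appendix B):
$$\bar{\beta}^k(S[x] \rightarrow g\, S[y] \cdots S[y]) \propto \sum_{i=1}^{N} \frac{\zeta^k}{P(T_i; \beta, \pi, \zeta)} \sum_{j \in cover(g, T_i)} \beta^k(S[x] \rightarrow g\, S[y] \cdots S[y])\, f_{T_i}^j(x; \beta^k, \pi^k) \prod_{l \in ch(j, T_i)} b_{T_i}^l(y; \beta^k, \pi^k), \tag{21}$$
$$\bar{\pi}^k(S[x]) \propto \sum_{i=1}^{N} \frac{\zeta^k}{P(T_i; \beta, \pi, \zeta)}\, \pi^k(S[x])\, b_{T_i}^1(x; \beta^k, \pi^k), \tag{22}$$
$$\bar{\zeta}^k \propto \sum_{i=1}^{N} \frac{\zeta^k P(T_i; \beta^k, \pi^k)}{P(T_i; \beta, \pi, \zeta)}. \tag{23}$$
The parameter inference starts from some initial values and converges to a local optimum using Equations 21-23. The log-likelihood is given by $\sum_{i=1}^{N} \log P(T_i; \beta, \pi, \zeta)$.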
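The mixture-ratio part of the update can be sketched independently of the grammar: given the likelihood of each tree under each component model, the posterior weight of each model for each tree is computed and then averaged. The function names and the likelihood values below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def responsibilities(zeta, per_model_lik):
    """per_model_lik[i][k] = P(T_i; beta^k, pi^k); returns P(z_i^k = 1 | T_i)."""
    resp = []
    for liks in per_model_lik:
        joint = [z * l for z, l in zip(zeta, liks)]
        total = sum(joint)              # mixture likelihood of the i-th tree
        resp.append([j / total for j in joint])
    return resp

def update_zeta(resp):
    """New mixture ratios: average responsibility of each model over all trees."""
    n = len(resp)
    return [sum(r[k] for r in resp) / n for k in range(len(resp[0]))]

zeta = [0.5, 0.5]
per_model_lik = [[0.2, 0.0], [0.1, 0.1], [0.3, 0.1]]   # hypothetical values
resp = responsibilities(zeta, per_model_lik)
new_zeta = update_zeta(resp)
print(resp, new_zeta)
```

With these values the first tree is attributed entirely to the first model, the second is split evenly, and the third is split 0.75 to 0.25, giving new mixture ratios of 0.75 and 0.25.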
The procedures of UPAGE are listed below.
1. Generate initial population: The initial population $P_0$ is generated by randomly creating $M$ individuals. In our implementation, the ratio between production rules of function nodes (e. …
2. Select promising solutions: $N$ individuals $D_g$ are selected from the population of the $g$th generation $P_g$. In our implementation, we used truncation selection.
Since the EM algorithm is a point estimation method, new individuals can be generated with probabilistic logic sampling, which is computationally cheap. The details of the sampling procedure are summarized below (note that a terminal node is selected unconditionally when the maximum depth limit is reached).
(a) Select a model following probability distribution $\zeta^{*} = \{\zeta^{1*}, \zeta^{2*}, \cdots, \zeta^{\mu*}\}$.
(b) Let the selected model index be $k$. A root node is selected following probability distribution $\pi^{k*} = \{\pi^{k*}(S[x]) | x \in H\}$.
(c) If there are non-terminal symbols $S[x]$ ($x \in H$) in the derivation tree, select a production rule following the probability distribution $\beta^{k*}$ over the production rules whose left-hand side is $S[x]$. Repeat (c) until there are no non-terminal symbols left in the derivation tree.
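Steps (a) to (c) can be sketched as a short recursive sampler. This is an illustration under stated assumptions rather than the authors' implementation: the symbols in `ARITY`, the uniform rule probabilities, the mixture ratios and the depth limit of 4 are hypothetical, and rule probabilities are stored in a dictionary keyed by `(x, g, y)` (or `(x, g)` for symbols without children).

```python
import random

ARITY = {"+": 2, "sin": 1, "x": 0}      # hypothetical GP symbols

def uniform_beta(h):
    """Uniform rule probabilities per left-hand side annotation (illustration only)."""
    beta = {}
    for x in range(h):
        keys = []
        for g, a in ARITY.items():
            keys += [(x, g)] if a == 0 else [(x, g, y) for y in range(h)]
        beta.update({k: 1.0 / len(keys) for k in keys})
    return beta

def expand(x, beta, depth, max_depth, rng):
    """Step (c): choose a rule for S[x] and expand its children recursively."""
    rules = [(k, p) for k, p in beta.items() if k[0] == x]
    if depth >= max_depth:              # at the depth limit only terminals are allowed
        rules = [(k, p) for k, p in rules if ARITY[k[1]] == 0]
    key = rng.choices([k for k, _ in rules], weights=[p for _, p in rules])[0]
    g = key[1]
    if ARITY[g] == 0:
        return (g, [])
    y = key[2]                          # all children share the annotation y
    return (g, [expand(y, beta, depth + 1, max_depth, rng) for _ in range(ARITY[g])])

def sample_tree(zeta, pis, betas, max_depth, rng):
    k = rng.choices(range(len(zeta)), weights=zeta)[0]        # (a) select a model
    x = rng.choices(range(len(pis[k])), weights=pis[k])[0]    # (b) select the root annotation
    return k, expand(x, betas[k], 1, max_depth, rng)

def depth(tree):
    return 1 + max((depth(c) for c in tree[1]), default=0)

rng = random.Random(0)
zeta = [0.3, 0.7]
pis = [[0.5, 0.5], [0.5, 0.5]]
betas = [uniform_beta(2), uniform_beta(2)]
samples = [sample_tree(zeta, pis, betas, 4, rng) for _ in range(50)]
print(samples[0])
```

The returned trees carry no annotations, because annotations are only needed while choosing rules.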

Computer experiments
In order to show the effectiveness of UPAGE, we analyze UPAGE from the viewpoint of the number of fitness evaluations. We applied UPAGE to three benchmark problems: the royal tree problem (Section 5.3.1), the bipolar royal tree problem and the DMAX problem. UPAGE and PAGE used the same population size, elite rate and selection rate. For the method-specific parameters of PAGE and UPAGE, we determined $h$ and $\mu$ so that the number of parameters to be estimated is almost the same in UPAGE and PAGE. In the three benchmark problems, we ran UPAGE and PAGE 30 times each to compare the number of fitness evaluations and also performed the Welch t-test (two-tailed) to determine the statistical significance.
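For readers unfamiliar with the Welch t-test, the following sketch computes its test statistic and the Welch-Satterthwaite degrees of freedom with the Python standard library only. The two samples are made-up numbers, not results from this chapter, and the P-value itself is left to a t-distribution routine (for example SciPy's `scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)   # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical samples (e.g. numbers of fitness evaluations in thousands).
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
t, df = welch_t(a, b)
print(t, df)
```

Unlike Student's t-test, this statistic does not assume that the two methods have equal variance.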

Royal tree problem
We apply UPAGE to the royal tree problem ([22]), which has only one optimal solution. The royal tree problem is a popular benchmark problem in GP. It is suitable for analyzing GP because the optimal structure of the royal tree is composed of smaller substructures (building blocks), and hence it reflects the behavior of GP well.
The royal tree problem defines a perfect tree at each level. The perfect tree at a given level is composed of perfect trees that are one level smaller than the given level. Thus, the perfect tree of level c is composed of perfect trees of level b. In perfect trees, the letters of the functions descend by one from the root to the leaves. The function a has a terminal x as its child. The fitness function of the royal tree problem is given by
$$Score(X_i) = wb_i \sum_j \left( wa_{ij} \times Score(X_{ij}) \right), \tag{25}$$
where $X_i$ is the $i$th node in the tree structure, and $X_{ij}$ denotes the $j$th child of $X_i$. The fitness value of the royal tree problem is calculated recursively from the root node. In Equation 25, $wb_i$ and $wa_{ij}$ are weights which are defined as follows:
• $wa_{ij}$
  • Full Bonus = 2: if the subtree rooted at $X_{ij}$ has a correct root and is a perfect tree.
  • Partial Bonus = 1: if the subtree rooted at $X_{ij}$ has a correct root but is not a perfect tree.
  • Penalty = 1/3: if $X_{ij}$ is not a correct root.
• $wb_i$
  • Complete Bonus = 2: if the subtree rooted at $X_i$ is a perfect tree.
  • Otherwise = 1.
Table 2. The number of fitness evaluations, standard deviation and P-value of t-test in the royal tree problem.
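The recursive scoring rule can be sketched as follows. This is an illustration of the weights as described above, assuming that function a has arity 1, b has arity 2 and so on, that the score of the terminal x is 1, and that the penalty weight is 1/3 and the weight $wb_i$ of a non-perfect subtree is 1 as in the standard royal tree definition ([22]). The tree encoding and function names are assumptions of this sketch.

```python
LETTERS = "abcdef"            # function a has arity 1, b has arity 2, ... (assumed)

def expected_child(symbol):
    """Root symbol a child of `symbol` must have to count as a correct root."""
    i = LETTERS.index(symbol)
    return "x" if i == 0 else LETTERS[i - 1]

def is_perfect(tree):
    symbol, children = tree
    if symbol == "x":
        return True
    return (len(children) == LETTERS.index(symbol) + 1
            and all(c[0] == expected_child(symbol) and is_perfect(c) for c in children))

def score(tree):
    """Score(X_i) = wb_i * sum_j wa_ij * Score(X_ij), with Score(x) = 1."""
    symbol, children = tree
    if symbol == "x":
        return 1.0
    total = 0.0
    for child in children:
        if child[0] != expected_child(symbol):
            wa = 1.0 / 3.0                    # Penalty: not a correct root
        elif is_perfect(child):
            wa = 2.0                          # Full Bonus
        else:
            wa = 1.0                          # Partial Bonus
        total += wa * score(child)
    wb = 2.0 if is_perfect(tree) else 1.0     # Complete Bonus, otherwise 1
    return wb * total

def perfect_tree(level):
    if level == 0:
        return ("x", [])
    return (LETTERS[level - 1], [perfect_tree(level - 1) for _ in range(level)])

for level in range(1, 5):
    print(LETTERS[level - 1], score(perfect_tree(level)))
print(score(("b", [("a", [("x", [])]), ("x", [])])))   # one child has a wrong root
```

Under these assumptions the perfect trees of levels a to d score 4, 32, 384 and 6144, which shows how quickly the reward grows once complete building blocks are assembled.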
The GP functions and terminals employed in the present chapter are denoted by F and T (the function and terminal sets, respectively, of GP). For details of the royal tree problem, please see Ref. ([22]). Table 2 shows the average number of fitness evaluations (along with their standard deviation) and the P-value of a t-test (Welch, two-tailed). As can be seen in Table 2, there is no noticeable difference between UPAGE and PAGE in the average number of fitness evaluations, which is confirmed by the P-value of the t-test. The royal tree problem is not multimodal, and its optimal solution has only one tree expression. Consequently, there is no need to consider global contexts behind optimal solutions, and the ability to do so, which is the advantage of UPAGE over PAGE, brings no benefit in this problem.

Bipolar royal tree problem
We next apply UPAGE to the bipolar royal tree problem. In the field of GA-EDAs, a mixture model based method UEBNA was proposed, and it was reported that UEBNA is especially effective in multimodal problems such as the two-max problem. Consequently, we apply UPAGE to a bipolar problem having two optimal solutions, which is a multimodal extension of the royal tree problem. In order to make the royal tree problem multimodal, we set T = {x, y} and Score(x) = Score(y) = 1. With this setting, the royal tree problem has two optimal solutions, one built on x (Fig. 7(a)) and one built on y (Fig. 7(b)). PAGE and UPAGE stop when either of the two optimal solutions is obtained. Table 3 shows the average number of fitness evaluations along with their standard deviation. We see that UPAGE can obtain an optimal solution with a smaller number of fitness evaluations than PAGE. Table 3 also gives the P-value of a t-test (Welch, two-tailed), which allows us to say that the difference between UPAGE and PAGE is statistically significant.
Because the bipolar royal tree problem has two optimal solutions (x and y), PAGE learns the production rule probabilities with learning data containing solution candidates of both the x and y optima. Let us consider the annotation size required to express the optimal solutions of the bipolar royal tree problem of depth 5. For PAGE, the minimum annotation size needed to learn the two optimal solutions separately is 10. In contrast, UPAGE can express the two optimal solutions with mixture size 2 and annotation size 5, which results in a smaller number of parameters. This consideration shows that a mixture model is more suitable for this class of problems. Figure 8 shows the increase in the log-likelihood for the bipolar royal tree problem, in particular, the transitions at generation 0 and generation 5. As can be seen from the figure, the log-likelihood converges after about 10 iterations. The log-likelihood improvement at generation 5 is larger than that at generation 0 because the tree structures have converged toward the end of the search.
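As a rough check of the parameter counts behind this comparison, the following sketch counts the annotated production rule probabilities and root probabilities under the shared-annotation assumption of Section 4. The symbols $n_F$ and $n_T$ (the numbers of GP function and terminal symbols) are introduced here for illustration only, and normalization constraints are ignored.

```latex
\begin{align*}
\text{PCFG-LA } (h = 10):\quad
  & n_F h^2 + n_T h + h = 100\,n_F + 10\,n_T + 10,\\
\text{PCFG-LAMM } (\mu = 2,\ h = 5):\quad
  & \mu\,(n_F h^2 + n_T h + h) + \mu = 50\,n_F + 10\,n_T + 12.
\end{align*}
```

Under these assumptions the mixture model needs $50\,n_F - 2$ fewer probabilities, so it is smaller whenever there is at least one function symbol.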

DMAX Problem
We apply UPAGE to the DMAX problem ([8,10]), which is deceptive when it is solved with GP. The main objective of the DMAX problem is identical to that of the original MAX problem: to find the functions that return the largest real value under the limitation of a maximum tree depth. However, the symbols used in the DMAX problem are different from those used in the MAX problem. The DMAX problem has three parameters, and the difficulty of the problem can be tuned using these three parameters. For the problem of interest in the present chapter, we selected m = 3 and r = 2, whose deceptiveness is of medium degree. In this setting, the GP terminals and functions are defined such that $+_3$ and $\times_3$ are 3-arity addition and multiplication operators, respectively. The optimal solution in the present setting is given by Equation 26. Table 4 shows the average number of fitness evaluations along with their standard deviation for the DMAX problem. We can see that UPAGE obtained the optimal solution with a smaller number of fitness evaluations than PAGE. Table 4 also gives the P-value of a t-test (Welch, two-tailed) and allows us to say that the difference in the averages of UPAGE and PAGE is statistically significant.
Table 3. The number of fitness evaluations, standard deviation and P-value of t-test in the bipolar royal tree problem.
In the bipolar royal tree problem, the expressions of the two optimal solutions (x or y) are different, and thus the building blocks of the optima are also different. In contrast, the DMAX problem has mathematically only one optimal solution, which is represented by Equation 26.
Although the DMAX problem is a unimodal problem, its optimal solution has different tree expressions due to commutative operators such as $+_3$ and $\times_3$. From this experiment, we see that UPAGE is superior to PAGE for this class of benchmark problems.
Table 5. Parameter settings for a multimodal problem.

Multimodal problem
In the preceding section, we evaluated the performance of UPAGE from the viewpoint of the average number of fitness evaluations. In this section, we show the effectiveness of UPAGE in terms of its capability for obtaining multiple solutions of a multimodal problem. Because there are two optimal solutions in the bipolar royal tree problem (see Fig. 7(a) and (b)), we show that UPAGE can obtain both optimal solutions in a single run. Parameter settings are shown in Table 5. Table 6 shows the number of successful runs in which both optimal solutions are obtained in a single run. As can be seen in Table 6, UPAGE succeeded in obtaining both optimal solutions in 10 out of 15 runs, whereas PAGE could not obtain them at all. Table 7 shows the production rule probabilities of UPAGE in a successful run. Although the mixture size is μ = 4, we present only the probabilities of Model = 0 and Model = 3, which are related to the optimal solutions of y (Fig. 7(b)) and x (Fig. 7(a)), respectively (i.e. Model = 1 and Model = 2 are not shown). Because the probabilities of the rules generating y are very high in Model = 0, we consider that the optimal solution of y was generated by Model = 0. Likewise, we infer that the optimal solution of x was generated by Model = 3. From this probability table, we can confirm that UPAGE successfully estimated the mixed population separately, because Model = 3 and Model = 0 can generate the optimal solutions of x and y, respectively, with relatively high probability. It is very difficult for PAGE to estimate multiple solutions because PCFG-LA is not a mixture model and it is almost impossible to learn the distributions separately. As was shown in Section 5.3, UPAGE is superior to PAGE in terms of the number of fitness evaluations. From Table 7, we consider that this superiority is due to UPAGE's capability of learning the distributions separately.
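The idea that each component model accounts for one optimum can be sketched as a posterior over mixture components. This is a generic finite-mixture calculation, not UPAGE's full inference (which also involves latent annotations); the function name `responsibilities`, the two-component setup, and all numbers are hypothetical.

```python
def responsibilities(weights, likelihoods):
    """Posterior P(model | tree) for one tree under a finite mixture."""
    joint = [w * lik for w, lik in zip(weights, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical mixture with two components and equal mixing weights.
weights = [0.5, 0.5]

# Hypothetical P(tree | model): component 0 favours the y optimum,
# component 1 favours the x optimum.
lik_x_tree = [0.001, 0.8]
lik_y_tree = [0.9, 0.002]

resp_x = responsibilities(weights, lik_x_tree)
resp_y = responsibilities(weights, lik_y_tree)
print(resp_x, resp_y)
```

When the components specialise in this way, each tree is attributed almost entirely to one component, so the parameters of each component are re-estimated mostly from individuals near a single optimum.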

Discussion
In the present chapter, we have introduced PAGE and UPAGE. PAGE is based on PCFG-LA, which takes into account latent annotations to weaken the context freedom assumption. By considering latent annotations, dependencies among nodes can be considered. We reported in Ref. ([12]) that PAGE is more powerful for several benchmark tests than other GP-EDAs, including GMPE and POLE.
Although PCFG-LA is suitable for estimating dependencies among local nodes, it cannot consider global contexts (contexts of entire tree structures) behind individuals. In many real-world problems, not only local dependencies but also global contexts have to be taken into account. In order to consider the global contexts, we have proposed UPAGE by extending PCFG-LA into a mixture model (PCFG-LAMM). In the bipolar royal tree problem, there are two optimal structures, x and y, and the global context represents which optimum (x or y) each tree structure comes from. Table 7 shows that the mixture model of UPAGE worked successfully and that UPAGE could estimate the mixed population separately. We have also shown that a mixture model is effective not only in multimodal problems but also in some unimodal problems, namely in the DMAX problem. Although the optimal solution of the DMAX problem is mathematically a single expression, its tree expressions are not unique because of the commutative operators (×_3 and +_3). Consequently, the mixture model is also effective in the DMAX problem (see Section 5.3.3), and such diversity of expressions often arises in real-world problems. When obtaining multiple optimal solutions in a single run, UPAGE succeeded in cases for which PAGE obtained only one of the optima. This result shows that UPAGE is more effective than PAGE not only quantitatively but also qualitatively. We also note that UPAGE is more powerful than PAGE in terms of computational time. In our computer experiments, we set the number of parameters in UPAGE and PAGE to be approximately the same. Figure 10 shows the relative computational time per generation of UPAGE and PAGE (the computational time of PAGE is normalized to 1), and we see that UPAGE required only sixty percent of the time required by PAGE.
Although we have shown in Section 5.3.1 that UPAGE and PAGE required approximately the same number of fitness evaluations to obtain the optimal solution in the royal tree problem, UPAGE is more effective even for the royal tree problem if the actual computational time is considered. Table 8 summarizes the functionalities of several GP-EDAs. SG-GP employs the conventional PCFG and hence cannot estimate dependencies among nodes. Although GT-EDA, GMPE and PAGE adopt different types of grammar models, they belong to the same class in the sense that all three methods can take into account dependencies among nodes, which is enabled by the use of specialized production rules that depend on contexts. However, these methods cannot consider global contexts, and consequently, they are not suitable for problems having complex distributions. In contrast, in addition to local dependencies among nodes, UPAGE can consider global contexts of tree structures. The model of UPAGE is the most flexible among these GP-EDAs, and this flexibility is reflected in the search performance.
In the present implementation of UPAGE, we had to set the mixture size μ and the annotation size h in advance because UPAGE employs the EM algorithm. However, it is desirable to estimate μ and h, as well as β, π and ζ, during the search. In the case of PAGE, we proposed PAGE-VB in Ref. ([12]), which adopts VB to estimate the annotation size h. In a similar fashion, it is possible to apply VB to UPAGE to enable the inference of μ and h.
We have shown the effectiveness of PAGE and UPAGE with benchmark problems that do not have intron structures. However, in real-world applications, problems generally include intron structures, which make the model and parameter inference much more difficult. For such problems, we consider that intron removal algorithms ([13,30]) are effective, and the application of such algorithms to GP-EDAs is left as a topic of future study.

Conclusion
We have introduced a probabilistic program evolution algorithm named PAGE and its extension UPAGE. PAGE takes advantage of latent annotations, which enable the consideration of dependencies among nodes, and UPAGE incorporates a mixture model to take global contexts into account. By applying UPAGE to computational experiments, we have confirmed that a mixture model is highly effective for obtaining solutions in terms of the number of fitness evaluations. At the same time, UPAGE is more advantageous than PAGE in the sense that UPAGE can obtain multiple solutions for multimodal problems. We hope that it will be possible to apply PAGE and UPAGE to a wide class of real-world problems, which is an intended area of future study.