Fast Computation of the EM Algorithm for Mixture Models

Mixture models have become increasingly popular due to their modeling flexibility and are applied to the clustering and classification of heterogeneous data. The EM algorithm is widely used for the maximum likelihood estimation of mixture models because it is stable in convergence and simple to implement. Despite these advantages, the EM algorithm has two main drawbacks: it converges only locally, and its convergence can be slow. To avoid local convergence, multiple runs from several different initial values are usually used; the algorithm may then take a large number of iterations and long computation time to find the maximum likelihood estimates. Speeding up the computation of the EM algorithm addresses these problems. We give algorithms that accelerate the convergence of the EM algorithm and apply them to mixture model estimation. Numerical experiments examine the performance of the acceleration algorithms in terms of the number of iterations and computation time.


Introduction
Mixture models have become increasingly popular due to their modeling flexibility and are applied to the clustering and classification of heterogeneous data; see [1][2][3]. The EM algorithm [4] is widely used for the maximum likelihood estimation of mixture models because it is stable in convergence and simple to implement. Despite these advantages, the EM algorithm converges only locally and often slowly, which is its main drawback.
To circumvent the slow convergence of the EM algorithm, various acceleration algorithms incorporating optimization methods have been proposed. These methods include the multivariate Aitken method [5], the conjugate gradient method [6], and the quasi-Newton method [7,8]. However, they require matrix computations such as matrix inversion or the evaluation of Hessian and Jacobian matrices, as well as a line search for step-length optimization. Therefore, the resulting acceleration algorithms tend to lose one or more of the nice properties of the EM algorithm, although they may converge faster.
As another approach, the ε-accelerated EM algorithm [9] was proposed to accelerate the convergence of the EM algorithm by using the vector ε (vε) algorithm [10], a vector extrapolation algorithm [11,12]. The vε algorithm can accelerate the convergence of the sequence of estimates from the EM algorithm, and therefore the ε-accelerated EM algorithm does not require any modification of the E- and M-steps of the EM algorithm. This is its advantage over the acceleration algorithms based on optimization methods. To reduce the number of iterations and computation time of the ε-accelerated EM algorithm, the εR-accelerated EM algorithm [13] was developed. It improves the computation speed of the ε-accelerated EM algorithm by embedding a restarting procedure, which finds a value for restarting the EM iterations such that a newly generated EM sequence from that value moves quickly into a neighborhood of a stationary point. We use the ε-accelerated EM and εR-accelerated EM algorithms for parameter estimation.
When the EM algorithm is applied to mixture models, it is sensitive to the choice of the initial value and may find estimates at a local maximum of the log-likelihood function. Several strategies have been proposed to initialize the EM algorithm efficiently so as to reach the global maximum of the log-likelihood function; see [14][15][16][17]. We use the emEM algorithm [14] for mixture model estimation and improve its computation speed with the ε-accelerated EM and εR-accelerated EM algorithms.
The chapter is organized as follows. Section 2 describes the EM algorithm for normal mixture models. In Section 3, we introduce the ε-accelerated EM and εR-accelerated EM algorithms. Section 4 presents numerical experiments to evaluate the performance of these acceleration algorithms. In Section 5, we provide an acceleration algorithm that applies the ε-accelerated EM and εR-accelerated EM algorithms to the emEM algorithm. Numerical experiments in Section 6 study the effects of these acceleration algorithms on the emEM algorithm. In Section 7, we present our concluding remarks.

The EM algorithm for normal mixture models
Let Y_1, …, Y_n be p-dimensional random vectors. Assume that an observed data vector y_i of Y_i arises from a mixture distribution of G components with density

f(y_i; θ) = Σ_{k=1}^G λ_k φ(y_i; μ_k, Σ_k),   (1)

where φ(y_i; μ_k, Σ_k) is the p-variate normal density with mean vector μ_k and covariance matrix Σ_k, and λ_1, …, λ_G are the mixing proportions satisfying λ_k > 0 and Σ_{k=1}^G λ_k = 1. The parameter vector is θ = (λ_1, …, λ_G, μ_1, …, μ_G, Σ_1, …, Σ_G). The log-likelihood function of θ for y = [y_1, …, y_n] is

ℓ(θ) = Σ_{i=1}^n log f(y_i; θ).   (2)

Direct maximization of the function (2) is complicated, and the maximum likelihood estimate (MLE) θ̂ of θ is therefore usually found via the EM algorithm [4].

In the setting of the EM algorithm, we regard y_i as incomplete data and introduce the component-label vector Z_i = [Z_i1, …, Z_iG]^⊤ of zero-one indicator variables such that Z_ik = 1 indicates that y_i arises from the k-th component of the mixture model and Z_ik = 0 otherwise. Assume that Z_i has a multinomial distribution Mu(1, λ) with parameter λ = [λ_1, …, λ_G]^⊤. In the mixture model, the complete data vector is

x_i = (y_i^⊤, z_i^⊤)^⊤, i = 1, …, n,   (6)

and the complete-data log-likelihood function of θ is

ℓ_c(θ) = Σ_{i=1}^n Σ_{k=1}^G z_ik {log λ_k + log φ(y_i; μ_k, Σ_k)}.   (7)

The EM algorithm finds θ̂ by iterating the expectation step (E-step) and the maximization step (M-step). Let θ^(t) be the t-th estimate of θ in the parameter space Θ. Mixture models treat z = [z_1, …, z_n] as missing data. The E-step calculates the Q function, the conditional expectation of ℓ_c(θ) given y and θ^(t):

Q(θ | θ^(t)) = E[ℓ_c(θ) | y, θ^(t)].   (8)

This requires the conditional expectation of Z_ik given y and θ^(t):

τ_ik^(t) = E[Z_ik | y, θ^(t)] = λ_k^(t) φ(y_i; μ_k^(t), Σ_k^(t)) / Σ_{j=1}^G λ_j^(t) φ(y_i; μ_j^(t), Σ_j^(t)).   (9)

The quantity τ_ik^(t) is the posterior probability that y_i belongs to the k-th component of the mixture. From Eq. (9), replacing z_ik in Eq. (7) with τ_ik^(t), the Q function (8) is given by

Q(θ | θ^(t)) = Σ_{i=1}^n Σ_{k=1}^G τ_ik^(t) {log λ_k + log φ(y_i; μ_k, Σ_k)}.   (10)

The M-step finds θ^(t+1) maximizing the Q function (10) with respect to θ over Θ. From Eqs. (6) and (7), we also have the updates

λ_k^(t+1) = (1/n) Σ_{i=1}^n τ_ik^(t),
μ_k^(t+1) = Σ_{i=1}^n τ_ik^(t) y_i / Σ_{i=1}^n τ_ik^(t),
Σ_k^(t+1) = Σ_{i=1}^n τ_ik^(t) (y_i − μ_k^(t+1))(y_i − μ_k^(t+1))^⊤ / Σ_{i=1}^n τ_ik^(t),

for k = 1, …, G. We describe the EM algorithm for the normal mixture model in Algorithm 1.

Algorithm 1:
The EM algorithm.
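Algorithm 1 is not reproduced here, but one E- and M-step pair for the normal mixture model can be sketched as follows. This is a minimal Python/NumPy illustration, not the chapter's R implementation; the function names (`normal_pdf`, `em_step`) and the array layout are our own.

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    # p-variate normal density evaluated at each row of y
    p = len(mu)
    diff = y - mu
    inv = np.linalg.inv(sigma)
    det = np.linalg.det(sigma)
    expo = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** p * det)

def em_step(y, lam, mu, sigma):
    """One iteration of the EM algorithm for a G-component normal mixture."""
    n = y.shape[0]
    G = len(lam)
    # E-step: posterior probabilities tau[i, k] of component membership (Eq. (9))
    dens = np.column_stack(
        [lam[k] * normal_pdf(y, mu[k], sigma[k]) for k in range(G)]
    )
    tau = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update mixing proportions, means, and covariance matrices
    nk = tau.sum(axis=0)
    lam_new = nk / n
    mu_new = [tau[:, k] @ y / nk[k] for k in range(G)]
    sigma_new = [
        ((tau[:, k] * (y - mu_new[k]).T) @ (y - mu_new[k])) / nk[k]
        for k in range(G)
    ]
    return lam_new, mu_new, sigma_new
```

Iterating `em_step` until the change in the log-likelihood (2) falls below a criterion δ yields the MLE; each iteration is guaranteed not to decrease the log-likelihood.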

Acceleration of the EM algorithm
In order to accelerate the convergence of the EM algorithm, we can use the ε-accelerated EM algorithm [9] and the εR-accelerated EM algorithm [13]. The ε-accelerated EM algorithm incorporates the vector ε (vε) algorithm [10] into the EM algorithm. The εR-accelerated EM algorithm improves the computation speed of the ε-accelerated EM algorithm by adding a restarting procedure.
We briefly introduce the vε algorithm. Let {θ^(t)}_{t≥0} be a linearly convergent vector sequence generated by an iterative computational procedure, converging to a stationary point θ̂ as t → ∞. The vε algorithm then generates a sequence {ψ^(t)}_{t≥0} by

ψ^(t−1) = θ^(t) + [(θ^(t−1) − θ^(t))^{−1} + (θ^(t+1) − θ^(t))^{−1}]^{−1},

where the inverse of a vector v is the Samelson inverse v^{−1} = v / ‖v‖². The algorithm accelerates the convergence of a slowly convergent vector sequence and is very effective for linearly convergent sequences.
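Under these definitions, a single vε extrapolation from three successive estimates can be sketched as below. This is a Python illustration with our own function names; the vector inverse is the Samelson inverse.

```python
import numpy as np

def samelson_inverse(v):
    # Samelson inverse of a vector: v^{-1} = v / ||v||^2
    return v / (v @ v)

def v_epsilon_step(theta_prev, theta_curr, theta_next):
    # vε extrapolation: psi^(t-1) = theta^(t)
    #   + [ (theta^(t-1) - theta^(t))^{-1} + (theta^(t+1) - theta^(t))^{-1} ]^{-1}
    return theta_curr + samelson_inverse(
        samelson_inverse(theta_prev - theta_curr)
        + samelson_inverse(theta_next - theta_curr)
    )
```

For a sequence that converges exactly linearly, θ^(t) = θ̂ + c ρ^t with |ρ| < 1, this extrapolation recovers the limit θ̂ in a single step, which explains why it works so well near convergence, where the EM sequence is approximately linear.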
We define the EM algorithm as a mapping θ ↦ M(θ) from Θ to Θ such that each iteration of the E- and M-steps is written as θ^(t+1) = M(θ^(t)).
The ε-accelerated EM algorithm is shown in Algorithm 2. Given a convergence criterion δ, the ε-accelerated EM algorithm iterates until

‖ψ^(t) − ψ^(t−1)‖² < δ.

Assume that the sequence {θ^(t)}_{t≥0} generated by the EM algorithm converges to a stationary point θ̂. Theorems on the convergence and acceleration of the algorithm are given in [18].
As shown in Algorithm 2, the ε-accelerated EM algorithm generates the two parallel sequences {θ^(t)}_{t≥0} and {ψ^(t)}_{t≥0}. The restarting procedure of the εR-accelerated EM algorithm checks whether restarting the EM step can increase the log-likelihood. When this occurs, the EM step is restarted with M(ψ^(t−1)) as the initial value, and the ε-acceleration step obtains ψ^(t) from ψ^(t−1). Notice that at the restarting point, we still generate the EM sequence using three estimates obtained from the same initial value ψ^(t−1). In this manner, the ε-acceleration is always applied to a sequence obtained by the EM mapping M from the same initial value.
In our experiments, the restarting procedure is performed almost every time when only the log-likelihood condition is used, and then it inefficiently takes much computation time. As one more condition for restarting the EM step, we require

‖ψ^(t−1) − ψ^(t−2)‖² < δ_Re,

and, after each restart, reduce δ_Re by the factor 10^{−k}, where k is an integer, such as one. By adding this condition, we can control the restarting frequency. For example, set δ = 10^{−12}, and initialize δ_Re = 1 and k = 1. Then the restarting procedure is performed at most 12 times.
The restarting conditions are summarized as follows:
(i) restarting increases the log-likelihood function, that is, ℓ(M(ψ^(t−1))) > ℓ(θ^(t+1));
(ii) ‖ψ^(t−1) − ψ^(t−2)‖² < δ_Re.
Condition (i) means that the log-likelihood function can be increased by restarting. Condition (ii) is used to reduce the frequency of restarting. This is the key idea of the restarting procedure. The εR-accelerated EM algorithm is the ε-accelerated EM algorithm with the restarting procedure using conditions (i) and (ii) and is given in Algorithm 3.
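As an illustration, the two restarting conditions can be coded as a simple predicate. This is a hedged Python sketch with our own function and argument names; the log-likelihood values are assumed to be computed elsewhere.

```python
import numpy as np

def should_restart(loglik_restarted, loglik_current, psi_prev, psi_prev2, delta_re):
    """Check restarting conditions (i) and (ii) of the εR-accelerated EM algorithm."""
    # (i) restarting from M(psi^(t-1)) would increase the log-likelihood
    cond_i = loglik_restarted > loglik_current
    # (ii) the accelerated sequence has nearly stabilized at the current level delta_re
    diff = psi_prev - psi_prev2
    cond_ii = float(diff @ diff) < delta_re
    return cond_i and cond_ii
```

After each restart, δ_Re would then be multiplied by 10^{−k}; with δ = 10^{−12}, δ_Re = 1, and k = 1, condition (ii) can therefore fire at most 12 times before the overall convergence criterion is reached.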
When both conditions hold, the algorithm restarts the EM step from M(ψ^(t−1)) and resets δ_Re to δ_Re · 10^{−k}. The εR-accelerated EM algorithm also gives θ̂ from the final value of {ψ^(t)}_{t≥0}. When the restarting step effectively finds values for restarting the EM step, the εR-accelerated EM algorithm greatly reduces the number of iterations and computation time for convergence. The advantage of the εR-accelerated EM algorithm over the ε-accelerated EM algorithm is that it restarts the EM step at a better current estimate while keeping the log-likelihood function increasing over the iterations.
Theoretical results on the convergence and the speed of convergence of the εR-accelerated EM algorithm are given in [13].

Numerical experiments for the acceleration of the EM algorithm
We investigate how much faster the ε-accelerated EM and εR-accelerated EM algorithms converge than the EM algorithm. All computations are performed with the statistical package R [19] on Windows with an Intel Core i5 3.00 GHz processor and 8 GB of memory.
The R package MixSim [17,20] is used to simulate a random data matrix y having a p-variate normal mixture distribution of G components. We generate y = [y_1, …, y_1000] and find the MLE of θ using the EM, ε-accelerated EM, and εR-accelerated EM algorithms. The procedure is replicated 100 times. Here, we consider p = 2, 3, 4, 5, 6 and G = 4. For all experiments, we set δ = 10^{−12} for convergence of the algorithms, and δ_Re = 1 and k = 1 for the restarting condition of the εR-accelerated EM algorithm. Initial values of the algorithms are obtained from the k-means method using the R function kmeans.
Tables 1 and 2 report the number of iterations and the CPU time of these algorithms for each p. The CPU times (in seconds) are measured by the R function proc.time, whose times are typically accurate to 10 milliseconds. For all computations, the acceleration algorithms found the same MLEs as the EM algorithm. We see from the tables that the EM algorithm requires a large number of iterations for convergence, whereas the two acceleration algorithms converge in a smaller number of iterations. In particular, the εR-accelerated EM algorithm greatly reduces both the number of iterations and the CPU time.
To measure the speed of convergence of the EM and the two acceleration algorithms, we calculate iteration and CPU time speedups. The iteration speedup of an acceleration algorithm over the EM algorithm is defined by

(the number of iterations of the EM algorithm) / (the number of iterations of the acceleration algorithm).

The CPU time speedup is calculated analogously. Tables 3 and 4 show the iteration and CPU time speedups of the two acceleration algorithms. We compare the mean values of the iteration and CPU time speedups. The ε-accelerated EM algorithm is about 1.5 times and 1.4 times faster than the EM algorithm in the number of iterations and CPU time, respectively, while the εR-accelerated EM algorithm is more than twice as fast as the EM algorithm in both. The boxplots of Figures 1 and 2 also show that the εR-accelerated EM algorithm is clearly much faster than the ε-accelerated EM algorithm. Table 3 and Figure 1 indicate that in 75 out of 100 replications, the number of iterations of the εR-accelerated EM algorithm is less than half that of the EM algorithm. For the CPU time in Table 4 and Figure 2, the εR-accelerated EM algorithm is more than twice as fast as the EM algorithm in 50 out of 100 replications. Figure 3 shows the boxplots of the iteration and CPU time speedups of the εR-accelerated EM algorithm for p = 6. Here, "more" ("less") means that the number of iterations of the EM algorithm is more (less) than the median. We can see from the figure that, for larger numbers of iterations of the EM algorithm ("more"), the εR-accelerated EM algorithm works well to speed up the convergence of {ψ^(t)}_{t≥0}. We observed similar results for the other values of p. Therefore, the algorithm is more powerful when the EM algorithm takes a larger number of iterations.
The results from the tables and figures demonstrate that the restarting step in the εR-accelerated EM algorithm enables a significant increase in computation speed at little additional computational cost.

Boxplots of the CPU time speedup of the ε-accelerated EM (ε) and εR-accelerated EM (εR) algorithms for 100 random data sets. Each data set is generated from a p-variate normal mixture distribution of four components.

Initial value selection for normal mixture models
It is well known that the log-likelihood function (2) may have numerous local maxima. The EM algorithm is not guaranteed to reach the global maximum of the log-likelihood function due to its local convergence. Thus, the performance of the EM algorithm depends heavily on the initial value of θ. Several methods for selecting the initial value have been proposed; for example, see [14][15][16][17]. These methods are based on multiple runs of the EM algorithm using different initial values and find θ̂ attaining the global maximum of the log-likelihood function.
We apply the emEM algorithm [14] to mixture model estimation. The algorithm is popular and usually provides excellent results when the number of components is not large [21]. The emEM algorithm selects an initial value in the em step, which consists of several short runs of the EM algorithm from different initial values under a lax convergence criterion, and then obtains θ̂ in the EM step, which runs the EM algorithm from the selected initial value under a strict convergence criterion.
The em step consists of three steps. The first step generates J initial values of θ. The second step runs the EM algorithm from these initial values with a lax convergence criterion; hence, we do not wait for full convergence of the EM algorithm and stop the iterations early. The third step selects the value giving the largest log-likelihood among the J trials.
Let δ_ini be a convergence criterion and T_max the maximum number of iterations. We present the emEM algorithm in Algorithm 4.
EM step: Given θ^(0) from the em step, find θ̂ using the EM algorithm.
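The em step described above can be sketched generically as follows. This Python sketch assumes user-supplied `em_update` and `loglik` functions and uses a simple relative-change stopping rule; the names and the exact form of the lax criterion are our assumptions, not the chapter's.

```python
import numpy as np

def em_initialization(y, inits, em_update, loglik, delta_ini=1e-3, t_max=1000):
    """em step of the emEM algorithm: run a short EM from each of J candidate
    initial values and return the candidate reaching the largest log-likelihood."""
    best_theta, best_ll = None, -np.inf
    for theta in inits:
        ll_old = loglik(y, theta)
        ll_new = ll_old
        for _ in range(t_max):  # at most T_max short-run iterations
            theta = em_update(y, theta)
            ll_new = loglik(y, theta)
            if abs(ll_new - ll_old) < delta_ini * abs(ll_new):  # lax criterion
                break
            ll_old = ll_new
        if ll_new > best_ll:
            best_theta, best_ll = theta, ll_new
    return best_theta, best_ll
```

The EM step then runs the EM algorithm from the returned value with the strict criterion δ.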
The em step performs multiple runs of the EM algorithm, and its computation may therefore be time-consuming. We replace the EM algorithm with the ε-accelerated EM algorithm in the em step and use the εR-accelerated EM algorithm to obtain θ̂ in the EM step. By applying these acceleration algorithms to the emEM algorithm, it is possible to reduce the number of iterations and the CPU time. The accelerated emEM algorithm is referred to as the εem-εREM algorithm and is shown in Algorithm 5.

Numerical experiments for the initial value selection
We evaluate the performance of the ε-accelerated EM and εR-accelerated EM algorithms in application to the emEM algorithm.
By using MixSim, we simulate y = [y_1, …, y_1000] having a p-variate normal mixture distribution of six components for p = 2, 3, 4, 5, 6. The values of δ, δ_Re, and k are the same as in the experiments of Section 4. Assume, conservatively, that the probability of not finding the global maximum of the log-likelihood function in a single run is 0.80. Then the probability of finding the global maximum at least once in 50 runs is 1 − 0.80^50 > 0.9999. In the em and ε-em steps, we therefore draw 50 initial values from kmeans and set δ_ini = 0.001 and T_max = 1000.
Tables 5 and 6 present the number of iterations and the CPU time for each p. We see from Table 5 that the number of iterations of the ε-em step is much smaller than that of the em step; the ε-accelerated EM algorithm effectively improves the computation speed of the em step. Comparing the εem-εREM algorithm with the emEM algorithm, both the number of iterations and the CPU time of the εem-εREM algorithm are less than about half of those of the emEM algorithm. The results illustrate that the ε-accelerated EM and εR-accelerated EM algorithms can substantially accelerate the convergence of the emEM algorithm.

Concluding remarks
In this chapter, we introduced the ε-accelerated EM and εR-accelerated EM algorithms. Both algorithms are given by very simple computational procedures and require only a small amount of extra computation per iteration, while they considerably accelerate the convergence of the EM algorithm.
When the EM algorithm is applied to normal mixture models, it may converge slowly and depend heavily on the initial value. The first problem is addressed by accelerating the EM algorithm; the numerical experiments demonstrated the effectiveness of the ε-accelerated EM and εR-accelerated EM algorithms. For the second problem, initial value selection is useful for initiating the EM algorithm. We applied the emEM algorithm to normal mixture model estimation and developed the εem-εREM algorithm to speed up the computation of the emEM algorithm: the ε-accelerated EM algorithm is used in the em step, and the εR-accelerated EM algorithm in the EM step. Numerical experiments showed that the εem-εREM algorithm converges in a smaller number of iterations and shorter CPU time than the emEM algorithm.

Table 6. CPU times of the emEM and εem-εREM algorithms. The em and ε-em steps generate 50 random initial values.
The ε-accelerated EM and εR-accelerated EM algorithms accelerate the convergence of the EM algorithm without any modification of its E- and M-steps. This means that these algorithms do not require deriving an acceleration formula for every statistical model. Thus, they can be applied to several mixture models: mixtures of factor analyzers, mixtures of multivariate t-distributions, mixtures of generalized hyperbolic distributions, and parsimonious Gaussian mixture models. We expect that the convergence of the EM algorithms used in these mixture models tends to be slow, and the results from our experiments suggest that the ε-accelerated EM and εR-accelerated EM algorithms are useful in such settings due to their fast convergence and ease of use.