Open access peer-reviewed chapter - ONLINE FIRST

# A Review on the Exact Monte Carlo Simulation

By Hongsheng Dai

Submitted: April 23rd 2019Reviewed: July 16th 2019Published: November 13th 2019

DOI: 10.5772/intechopen.88619

## Abstract

Perfect Monte Carlo sampling refers to sampling random realizations exactly from the target distributions (without any statistical error). Although many different methods have been developed and various applications have been implemented in the area of perfect Monte Carlo sampling, it is mostly referred by researchers to coupling from the past (CFTP) which can correct the statistical errors for the Monte Carlo samples generated by Markov chain Monte Carlo (MCMC) algorithms. This paper provides a brief review on the recent developments and applications in CFTP and other perfect Monte Carlo sampling methods.

### Keywords

• coupling from the past
• diffusion
• Monte Carlo
• perfect sampling

## 1. Introduction

In the past 30 years, substantial progress has been made in popularizing Bayesian methods for the statistical analysis of complex data sets. An important driving force has been the availability of different types of Bayesian computational methods, such as Markov chain Monte Carlo (MCMC), sequential Monte Carlo (SMC), approximate Bayesian computation (ABC) and so on. For many practical examples, these computational methods can provide rapid and reliable approximations to the posterior distributions for unknown parameters.

The basic idea that lies behind these methods is to obtain Monte Carlo samples from the distribution of interest, in particular the posterior distribution. In Bayesian analysis of complex statistical models, the calculation of posterior normalizing constants and the evaluation of posterior estimates are typically infeasible either analytically or by numerical quadrature. Monte Carlo simulation provides an alternative. One of the most popular Bayesian computational methods is MCMC, which is based on the idea of constructing a Markov chain with the desired distribution as its stationary distribution.

By running a Markov chain, MCMC methods generate statistically dependent and approximate realizations from the limiting (target) distribution. A potential weakness of these methods is that the simulated trajectory of a Markov chain will depend on its initial state. A common practical recommendation is to ignore the early stages, the so-called burn-in phase, before collecting realizations of the state of the chain. How to choose the length of the burn-in phase is an active research area. Many methods have been proposed for convergence diagnostics; [10] gave a comparative review. Rigorous application of diagnostic methods requires either substantial empirical analysis of the Markov chain or complex mathematical analysis. In practice, judgments about convergence are often made by visual inspection of the realized chain or the application of simple rules of thumb.

Concerns about the quality of the sampled realizations of the simulated Markov chains have motivated the search for Monte Carlo methods that can be guaranteed to provide samples from the target distribution. This is usually referred to as perfect sampling or exact sampling. In some cases, for example, the multivariate normal, perfect samples are readily available. For more complicated distributions, perfect sampling can be achieved, in principle, by the rejection method. This involves sampling from a density that bounds a suitable multiple of the target density, followed by acceptance or rejection of the sampled value. The difficulty here is in finding a bounding density that is amenable to rapid sampling while at the same time providing sample values that are accepted with high probability. In general this is a challenging task, although efficient rejection sampling methods have been developed for the special class of log-concave densities; see, for example, [20, 22].

A surprising breakthrough in the search for perfect sampling methods was made by [37]. The method is known as coupling from the past (CFTP). This is a sophisticated MCMC-based algorithm that produces realizations exactly from the target distribution. CFTP transfers the difficulty of running the Markov chain for extensive periods (to ensure convergence) to the difficulty of establishing whether a potentially large number of coupled Markov chains have coalesced. An efficient CFTP algorithm depends on finding an appropriate Markov chain construction that will ensure fast coalescence. There have been a few further novel theoretical developments following the breakthrough of CFTP, including [20, 21, 38]. Since then, perfect sampling methods have attracted great attention in various Bayesian computational problems and applied probability areas.

Apart from coupling from the past, many other perfect sampling methods were proposed for specific problems, for example, perfect sampling for random spanning trees [2, 47] and path-space rejection sampler for diffusion processes [3, 4, 5]. Very recently, a type of divide-and-conquer method has been developed in [15, 16]. These methods use the technique for the exact simulation of diffusions and samples from simple sub-densities to obtain perfect samples from the target distribution.

Perfect samples are useful in Bayesian applications either as a complete replacement for MCMC-generated values or as a source of initial values that will guarantee that a conventional MCMC algorithm runs in equilibrium. Perfect samples can also be used as a means of quality control in judging a proposed MCMC implementation when there are questions about the speed of convergence of the MCMC algorithm or whether the chain is capable of exploring all parts of the sample space. Of course, when perfect samples can be obtained speedily, they will be preferred to MCMC values, since they eliminate such doubts. In addition, sophisticated perfect sampling methodology often motivates efficient approximate algorithms and computational techniques. For example, [43] uses regenerated Markov chains to obtain simple standard error estimates for importance sampling under MCMC context. The condition required there will allow us to carry out perfect sampling via multigamma coupling approach [23].

This paper will present a brief review for perfect Monte Carlo sampling and explain the advantages and drawbacks of different types of methods. Section 2 will present rejection sampling techniques, and then CFTP will be covered in Section 3. Recent divide-and-conquer methods will be reviewed in Section 4. The paper ends with a discussion in Section 5.

## 2. Rejection sampling techniques

Rejection sampling, also known as ‘acceptance-rejection sampling’, generates realizations from a target probability density function fxby using a hat function Mgx, where fxMgxand gxare a probability density function from which samples can be readily simulated. The basic rejection sampling algorithm is as follows:

Algorithm 2.1 (Basic rejection sampling)

 Sample xfrom gxand Ufrom Unif01. 01 If U≤fxMgx, 02 accept xas a realisation of fxand stop; 03 else 04 reject the value of xand go back to step 01. 05

Many other perfect sampling methods are actually equivalent to rejection sampling. For example, ratio-of-uniform (RoU) method [45] may have to be implemented via a rejection sampling approach.

The efficiency of rejection sampling depends on the acceptance probability, which is 1/M. To perform rejection sampling efficiently, it is very important to find hat functions which provides higher acceptance probabilities. In other words, we shall choose Mas small as possible [39]; however for many complicated problems, there is no easy way to find Msmall enough to guarantee high acceptance probability. A number of sophisticated rejection sampling methods have been suggested for dealing with complex Bayesian posterior densities, which we discuss below.

### 2.1 Log-concave densities

A function hxis called log-concave if

loghλx+1λyλloghx+1λloghy,

for all x,yand λ01. For the special class of log-concave densities, Gilks and wild [22] developed the adaptive rejection sampling (ARS) method. The method constructs an envelope function for the logarithm of the target density, fx, by using tangents to logfxat an increasing number, n, of points. The envelope function unxis the piecewise linear upper hull formed from the tangents. Note that, the envelope can be easily constructed due to the concavity of logfx. The method also constructs a squeeze function lnxwhich is formed from the chords of the tangent points. The sampling steps of the ARS algorithm are as follows.

 Outputs a stream of perfect samples from fx. Initialise unxand lnxby using tangents at several points. 01 Sample x∗from density ∝expunxand U∼Unif01 02 If U≤explnx∗−unx∗, Output x∗; 03 else if U≤fx∗/expunx∗, 04 Output x∗; Update unlnto un+1ln+1using x∗; 05 Goto 02 06

The ARS algorithm is adaptive and the sampling density is modified whenever fxis evaluated. In this way, the method becomes more efficient as the sampling continues. Leydold [29] extends ARS to log-concave multivariate densities. Leydold’s algorithm is based on decomposing the domain of the density into cones and then computing tangent hyperplanes for these cones. Generic computer code for sampling from a multivariate log-concave density is available on the author’s website; it is only necessary to code a subroutine for the target density. Leydold’s algorithm works well for simple low-dimensional densities. The drawback of ARS algorithm is that it only works for log-concave densities, which is a very small class of posteriors in practice. Also computationally it is very challenging to implement ARS algorithm for high-dimensional densities [11].

### 2.2 Fill’s rejection sampling algorithm

We consider a discrete Markov chain with transition probability Pxyand stationary distribution πx,xS. Let P˜xy=πyPyx/πxbe the transition probability for the time-reversed chain. Suppose that there is a partial ordering on the states S, denoted by xy. Let 0̂and 1̂be unique extremal points of the partial ordering.

To demonstrate the algorithm given by [20], we will assume that there are update functions ϕand ϕ˜both mapping S×01to Ssuch that

Pxy=PϕxU=y,E1
P˜xy=Pϕ˜xU=y,E2

where UUnif01and

xyϕ˜xuϕ˜yua.e.u01.

The algorithm is as follows:

Algorithm 2.3 (Fill’s algorithm)

 1. Choose an integer t>0. Fix x0=0̂and y0=1̂. 2. Generate xk=ϕxk−1Uk, k=1,⋯,t, where Ukare i.i.d. Unif01. 3. Generate U˜kfrom the conditional distribution of Ugiven ϕ˜xt−k+1U=xt−k,k=1,⋯,t. 4. Generate yk=ϕ˜yk−1U˜k,k=1,⋯,t. 5. If yt=x0then accept xt; else double tand repeat from step 2.

In Algorithm 2.3 (step 2) zxtis sampled from Pt0̂. If we find an upper bound Mfor πz/Pt0̂z, then we can use rejection sampling. Fill [20] finds a bounding constant given by M=π0̂/P˜t1̂0̂and proves that steps from 3 to 5 of Algorithm 2.3 are to accept zwith probability πzMPt0̂z. The output of this algorithm is indeed a perfect sample from π.

From Algorithm 2.3, we can see that rejection sampling can still be possible, even if we do not have a closed form of the hat function. The first limitation of Algorithm 2.3 is that it works only if the time-reversed chain is monotone, but [21] has extended the algorithm theoretically for general chains. The second limitation is that step 3 of Algorithm 2.3 is difficult to perform [20]. For these reasons, Fill’s algorithm is not practical for realistic problems.

## 3. Coupling from the past

Coupling from the past was introduced in the landmark paper of [37] which showed how to provide perfect samples from the limiting distribution of a Markov chain.

### 3.1 Basic CFTP algorithms

Let Xtbe an ergodic Markov chain with state space X=1n, where the probability of going from ito jis pijand the stationary distribution is π. Suppose we design an updating function ϕU, which satisfies PϕiU=j=pij, where ϕis a deterministic function and Uis a random variable. To simulate the next state Yof the Markov chain, currently in state i, we draw a random variable Uand let Y=ϕiU.

Let fti=ϕiUt, and define the composition

Ft1t2=ft21ft22ft1+1ft1,E3

for t1<t2.

Proposition 3.1 [37] Suppose there exists a time t=T, the backward coupling time, such that chains starting from any state in X=1n, at time t=T, and with the same sequence Utt=T1, arrive at the same state X0. Then it must follow that X0comes from π.

This proposition is easy to prove. If we run an ergodic Markov chain from time t=and with the sequence Utt=T1after T, the Markov chain will arrive at X0. Then X0comes exactly from πsince it is collected at time 0and the Markov chain started from . The algorithm is as follows:

Algorithm 3.1 (Basic CFTP)

 t=0 01 repeat 02 t=t−1 03 generate Ut 04 Ft0=Ft+10∘ϕ⋅Ut 05 until Ft0⋅is a constant 06 return Ft0⋅ 07

Propp and Wilson [37] also proved that this algorithm is certain to terminate. The idea of simulating from the past is important. Note that if we collect F0Tas the result, where Tis the smallest value that makes F0Ta constant, we will get a biased sample. This is because Tis a stopping time, which is not independent of the Markov chain.

The basic CFTP algorithm needs to restart the Markov chains at some points in the past if they have not coalesced by time 0. We must use the same random sequence Ut1when we restart the Markov chains. In Monte Carlo simulations, we usually use pseudorandom number generators, which are deterministic algorithms. So if we give the same random seed, we will get the same random sequence. This means that the same sequence Utcan be regenerated in each coupling procedure.

If we can run the Markov chain forward starting at 0and collect a perfect sample in the future, we will not have to regenerate Ut. Wilson [48] developed a read-once CFTP method to implement the forward coupling idea. A simple example is provided by [6]. In fact, the multigamma coupler in [23] can be implemented via the more efficient read-once CFTP algorithm.

### 3.3 Improvement of CFTP algorithms

Propp and Wilson [37] showed that the computational cost of the algorithm can be reduced if there is a partial order for the state space Xthat is preserved by the update function ϕ, i.e. if xythen ϕxUϕyU. Their procedure is outlined in Algorithm 3.2, whereas before 0̂and 1̂are the unique extremals. Note that their algorithm needs a monotone update function ϕfor the Markov chain, while Algorithm 2.3 requires a monotone update function ϕ˜for the time-reversed chain.

Algorithm 3.2 (Monotone CFTP)

 T=1 01 Repeat 02 upper =1̂ 03 lower =0̂ 04 for t=−Tto t=−1 05 upper =ϕtupperUt 06 lower =ϕtlowerUt 07 T=2T 08 until upper =lower 09 return upper 10

Algorithm 3.2 is much simpler than Algorithm 3.1, since only two chains have to be run at the same time, but the requirement of monotonicity is very restrictive. Markov chains with transitions given by independent Metropolis-Hastings and perfect slice sampling have been shown to have this property, by [9, 32], respectively. However [32, 34] have also noticed that such independent M-H CFTP is equivalent to simple rejection sampler.

In general it is hard to code perfect slice samplers correctly. For example, Hörmann and Leydold [26] have pointed out that the perfect slice samplers in [7, 36] are incorrect. The challenge of monotone CFTP is usually to construct the detailed updating function with a guarantee of preserving the partial order.

Finding a partial order preserved by the Markov chains is a non-trivial task in many cases. An alternative improvement is to use CFTP with bounding chains, such as that in [27, 33]. If the bounding chains, which bound all the Markov chains, coalesce, then all Markov chains coalesce. Thus if only a few bounding chains are required, the efficiency of the CFTP algorithm can be improved significantly. Sometimes, it may be impossible to define an explicit bounding chain (the maximum of the state space may be infinity, and the upper bound chain cannot start from infinity), but it is possible to use a dominated process to bound all Markov chains [28].

### 3.4 Applications and challenges

Although CFTP is extremely challenging to be implemented for many practical problems, it did find a few applications in certain discrete state space problems, for example, the Ising model [37]. Also [18] applied CFTP to ancestral selection graph to simulate samples from population genetic models. Refs. [12, 14] applied CFTP to a class of fork-join type queuing system problems. Connor and Kendal [8] applied CFTP for the perfect simulation of M/G/c queues. CFTP also finds its application in signal processing [17].

CFTP for continuous state space Markov chains is very challenging, since a random map from an interval to a finite number of points is required. In recent years, many methods have been developed for unbounded continuous state space Markov chains, such as perfect slice sampler in [32], multigamma coupler and the bounded M-H coupler in [23, 34]. Wilson [49] developed a layered multi-shift coupling, which shifts states in an interval to a finite number of points. However, none of these methods can solve any practical problems.

## 4. Recent advances in perfect sampling

Recently, a new type of perfect Monte Carlo sampling method based on the decomposition of the target density f, as f=g1g2, was proposed in [15], where g1and g2are also (proportional to) proper density functions. Note that here g1and g2are continuous density functions which are easy to simulate from. Suppose that q-dimensional vector values x1and x2are independently drawn from g1and g2, respectively. If the two independent samples are equal, i.e. x1=x2=ythen we have ymust be from fg1g2. Note that such a naive approach may be practical for discrete random variables with low-dimensional state space, but for continuous random variables, it is impossible since Px1=x2=0. Dai [15] proposed a novel approach to deal with this, which is explained in the following subsection.

### 4.1 Perfect distributed Monte Carlo without using hat functions

First we introduce the following notations related to the logarithm of the sub-densities:

αx=α1αqtrx=logg1xE4

where is the partial derivative operator for each component of x. Then we consider a q-dimensional diffusion process Xtω,t0T(T<), defined on the q-dimensional continuous function space Ω, given by:

dXt=αXtdt+dBt,E5

where Btω=ωtis a Brownian motion and ω=ωtt0Tis a typical element of Ω. Let Wbe the probability measure for a Brownian motion with the initial probability distribution B0=w0f1=g12.

Clearly Xthas the invariant distribution f1x(using the Langevin diffusion results [24]). Let Qbe the probability law induced by Xt,t0T, with X0=ω0f1, i.e. under Qwe have Xtf1xfor any t0T.

The idea in [15] is to use a biased diffusion process X¯=X¯t0tTto simulate from the target function f. It is defined as follows.

Definition 4.1 (Biased Langevin diffusions) The joint density for the pair X¯0X¯T(the starting and ending points of the biased diffusion process), evaluated at point xy, is f1xtyxf2y, where tyxis the transition density for the diffusion process Xdefined in Eq. (6) from X0=xto XT=yand f2y=g2y/g1y.

Given X¯0X¯Tthe process X¯t0<t<Tis given by the diffusion bridge driven by Eq. (6).

The marginal distribution for X¯Tis fy. Therefore, to draw a sample from the target distribution fx, we need to simulate a process X¯t,t0Tfrom Q¯(the law induced by X¯) and then X¯Tfx.

Simulation from Q¯can be done via rejection sampling. We can use a biased Brownian motionB¯t0tTas the proposal diffusion:

Definition 4.2 (Biased Brownian motion) The starting and ending points B¯0B¯Tfollow a distribution with a density hxy, and B¯t0<t<Tis a Brownian bridge given B¯0B¯T.

Under certain mild conditions, Dai [15] proved the following lemma.

Lemma 4.1 Let Zbe the probability law induced by B¯t0tT. By letting

hω0ωT=g2ωTg1ω012πTeωTω022TE6

we have

dQ¯dZωexp120Tα2+divαωtdtE7

where div is the divergence operator.

Condition 4.1 There exists l>such that

12α2+divαxl0.E8

Under Condition 4.1 the ratio (8) can be rewritten as

dQ¯dZωexp0T12α2+divαωtldt,E9

which has a value no more than 1.

Therefore we can use rejection sampling to simulate from Q¯, with proposal measure Z. This acceptance probability (10) can be dealt with using similar methods as that in [3, 5]. The algorithm is presented below; see [13, 15] for more details.

Algorithm 4.1 (Simple distributed sampler)

 Simulate ω0ωTfrom density h 01 Simulate the biased Brownian bridge B¯tt∈0T 02 Accept ωTas a sample from f, with probability (6); If rejected, go back to step 01. 03

Such a method is a rejection sampling algorithm, but it does not require finding a hat function to bound the target density, which is usually the main challenge of the traditional rejection sampling for complicated target densities. The above algorithm uses g2as the proposal density function, which does not have to bound the target f. However, it requires a bound for the derivatives of the logarithm of the sub-densities (see Condition 4.1). This is usually easier to get in practice, since the logarithm of the posterior is usually easy to deal with. Also [15] noted that we should choose sub-densities g1and g2as similar as possible, in order to achieve high acceptance probability.

Dai [15] focused on the simple decomposition of f=g1g2, although it mentioned that for more general decomposition of f=g1g2gC, a recursive method can be used. Unfortunately, a naive recursive method is very inefficient. A more sophisticated method is introduced in the following section.

### 4.2 Monte Carlo fusion for distributed analysis

A more efficient and sophisticated methods were proposed recently in [16], named as Monte Carlo fusion. Suppose that we consider

fxg1xgCx,E10

where each gcx(c1C) is a density (up to a multiplicative constant). Here Cdenotes the number of parallel computing cores available in big data problems, and each gcxmeans the sub-posterior density based on a subset of the big data. In group decision problems, Cmeans the number of different decisions which should be combined and gcxstands for the decision from each group member.

Dai et al. [16] considered simulating from the following density on extended space,

gx1xCy=c=1Cgc2xcpcyxc1gcy,E11

which admits the marginal density ffor y. Here pcyxcis the transition density from xcto yfor the Langevin diffusion defined in Eq. (6) associated with each sub-density gc.

Dai et al. [16] considered a rejection sampling approach with proposal density proportional to the function

hx1xCy=c=1CgcxcexpCyx¯22T,E12

where x¯=C1c=1Cxcand Tis an arbitrary positive constant.

Simulation from the proposal hcan be achieved directly. In particular, x1,xCare first drawn independently from g1,,gC, respectively, and then yis simply a Gaussian random variable centred on x¯. This is a distributed analysis or divide-and-conquer approach. Detailed acceptance probabilities and rejection sampling algorithms can be found in [16].

The above fusion approach arises in modern statistical methodologies for ‘big data’. A full dataset will be artificially split into a large number of smaller data sets, and inference is then conducted on each smaller data set and combined (see, for instance, [1, 30, 31, 35, 40, 41, 42, 46]). The advantage for such an approach is that inference on each small data set can be conducted in parallel. Then the heavy computational cost of algorithms such as MCMC will not be a concern. Traditional methods suffer from the weakness that the fusion of the separately conducted inferences is inexact. However, the Monte Carlo fusion in [16] is an exact simulation algorithm and does not have any approximation weakness.

The above fusion approach also arises in a number of other settings, where distributed analysis came naturally. For example, in signal processing, distributed multi-sensor may be used for network fusion systems. Fusion approach arises naturally to combine results from different sensors [44].

## 5. Conclusion

Although perfect simulation usually refers to correcting the statistical errors for the samples drawn via MCMC, it actually covers a much wider area beyond CFTP. In fact for certain applications, it is often possible to construct other types of perfect sampling methods which are much more efficient than CFTP. For example, for the exact simulation of the posterior of simple mixture models, the geometric-arithmetic mean (GAM) method in [11] is much more efficient than CFTP in [25]. Details of GAM method is provided in Appendix. Also the random walk construction for exact simulation for random spanning trees [2] is much more efficient than the CFTP version.

Bayesian computational algorithms keep evolving, in particular under the current big data era. Although almost all newly developed algorithms are approximate simulation algorithms, perfect sampling is still one of the key wheel-driven forces for new Bayesian computational algorithms, and they usually can quickly motivate new class of ‘mainstream’ algorithms. More focus should be given to methods beyond CFTP, for example, the fusion type of algorithms.

The Monte Carlo fusion method has the potential to be used in many Bayesian big data applications. For example, for large car accident data, the response variable is usually a categorical variable representing the seriousness of the accident, and generalized linear regression model is often used. Under a Bayesian framework, we may estimate the posterior distribution for the regression parameters via such a fusion approach. Then the posterior mean, the posterior median, or other characteristics of the posterior distribution can be estimated using the Monte Carlo samples. Also such an algorithm is perfect sampling algorithm, and no convergence justification is needed, since it always provided realizations exactly from the target distribution.

## Appendix

Observations from a simple mixture model are assumed to be either discrete or continuous. The density function of an individual observation yhas the form

fyp=k=1Kpkfky,wherek=1Kpk=1,andpk>0,k=1,,K.E13

We assume that the component weights p=p1pKare unknown parameters and the number of components, K, and the component densities, fk, are all known. We focus on the perfect sampling from the posterior distribution of p.

Suppose that we have Nobservations, y1,,yN. Let Lnk=fkynand assume that the prior distribution of pis Dirichlet:

π0pk=1Kpkαk1,αk>0,k=1,,K.E14

Then the posterior distribution is given by

fpyπ0pn=1Nk=1KpkLnkIXp,E15

where X=pk=1Kpk=1pk>0k=1K.

There are several ways to carry out perfect sampling from Eq. (16). The first method is based on CFTP [25]. An alternative perfect sampling method for simple mixture models is introduced by [19]. The third approach is to use adaptive rejection sampling [22, 29], since the posterior is log-concave. We may also use the ratio-of-uniform method. However, none of these methods are more efficient than the geometric–arithmetic mean method in [11].

## A.1 Geometric-arithmetic mean method

Suppose that p, the MLE of pis unique and for simplicity, assume the prior π0pis uniform. Define ank=Lnk/pkLnk. The posterior density of pis then given by

fpyhpy=n=1Nk=1KpkankIXp,E16

where Xis defined in (16).

Let Inbe a random element of arg maxkLnk. Define Aj=n:In=jand let n=n1nkwhere njis the number of elements in Aj.

Define M=mjkwith mjk=nAjank/nj. If nj=0, then set mjj=1and mjk=0for jk. We now make two assumptions, which we will return to later on:

A:Mis invertible.

B: The elements of v=MT11are all positive.

Under these assumptions, we will show that the following rejection sampler generates simulated values from the posterior distribution of p. First we define Vto be the diagonal matrix with diagonal elements vT=v1vK.

Algorithm 6.1 (GAM sampler)

 Sample qfrom the Dirichlet distribution with parameter n+1. 01 Sample Ufrom Unif01. 02 Calculate pwith p=M−1V−1q. 03 If U≤hpy/∏j=1Kqj/vjnj, 04 Accept pand stop; 05 else 06 reject pand go to 01. 07

Proposition 6.1 Under assumptions A and B, Algorithm 6.1 samples pwith probability density (17).

Proof: Since the geometric average is no larger than the arithmetic average, for pX, we have

hpy=n=1Nk=1Kpkank=j=1KnAjk=1KpkankE17
j=1KnAjk=1Kpkanknjnj,E18

where in the case nj=0, the product term is taken as 1. So that, for pX, with mjkas previously defined, we have

hpyj=1Kk=1KpkmjknjE19
=j=1Kvjnjj=1Kvjk=1KpkmjknjE20
=j=1Kqj/vjnj,E21

where qj=vjk=1Kpkmjk,j=1,,Kor equivalently q=VMp.

Since vj>0and k=1Kpkmjk>0, it follows that qj>0for j=1,,K. Furthermore

j=1Kqj=j=1Kvjk=1Kpkmjk=pTMTv=pT1=1,

since MTv=1, from the definition of v. It follows that pXimplies qX, so that

hpyIXpj=1Kqj/vjnjIXq.E22

Note that the right-hand side of Eq. (22) is proportional to a Dirichlet distribution with parameters n1+1nK+1.

Rejection sampling then proceeds as usual:

• A sample qis drawn from Dirichletn+1.

• The value p=M1V1qis calculated.

• It is accepted with probability hpyIXp/j=1Kqj/vjnj.

We now return to assumptions A and B. Suppose that Mis invertible but the elements of v=MT11are not all positive. In this case, let

αk=1Nn=1Nank,α=maxkαk,v=αN1nandM˜=αMΔ,

where Δis a diagonal matrix with diagonal elements 1/α11/αK. Note that v=M˜T11>0. Algorithm 6.1 and its proof can then be modified by replacing Mby M˜.

Suppose now that assumption A does not hold, i.e. Mis not invertible. This can be remedied by adding positive quantities to the diagonal elements of M. This also provides an alternative way of ensuring that the elements of vare positive.

## A.2 Dirichlet priors and pseudo data

Suppose that the prior π0pis Dirichletα1+1αK+1, where αi:i=1,,Kare positive, integers and let A=j=1Kαj. The prior can be synthesized by introducing pseudo data, a˜mk,m=1,,A;k=1,,K, defined as follows:

a˜mk=1ifj=1k1αj+1mj=1kαj0otherwise,E23

since

π0pk=1Kpkαk=m=1Ak=1Ka˜mkpk.E24

With the Dirichlet prior, the posterior distribution given by Eq. (16) can be written as

fpyl=1N+Ak=1Kpka¯lkIXp,E25

where a¯lkl=1N+Acontains the real data ankn=1Nand the pseudo data a˜mkm=1A.

The posterior distribution (26) has the same form as Eq. (17). Therefore GAM can be used to sample realizations from the posterior distribution (26).

## A.3 Simulation results and discussion

We compare the running time of mixture models with sample sizes (N) and different number of components (K) in Table 1. The components have specified normal distributions with means μ=μ1μKand variances σ2=σ12σK2. The prior on pis uniform. We sample 10,000 realizations from the posterior of the models.

K334466
N400100040010004001000
Leydold’s1 s3.6 s****
RoU16:11 s28:16 s31:18 s68:33 s88:60 s152:76 s
GAM4 s9 s11 s16 s6 s11 s
GAM AP0.74720.75090.24330.30880.53250.5505

### Table 1.

Running times (in s).

Fearnhead’s algorithm, Leydold’s algorithm and ratio-of-uniform. GAM method. GAM acceptance probability. The * indicates that Fearnhead’s method, and Leydold’s method will not run on a standard desktop when K=4and K=5.

When K=3,4, we simulate Nobservations from a three-component normal mixture with μ=0,0,2, σ2=1,4,1and mixture weight p0=1/21/31/6. We then either sample from the posterior distribution of pusing the same distributional components in the case K=3or sample from the posterior distribution of pwith an additional component having mean μ4=4and variance σ42=4, in the case K=4.

When K=5, observations are simulated from the normal mixture distribution with components having means μ=2,0,4,2,3, variances σ2=1,1,4,1,4and p0=0.35,0.3,0.1,0.2,0.05. Samples from the posterior distribution of pare drawn assuming the same components. Similar calculations are carried out for K=6, where μ=0,3,2245, variances σ2=1,1,1,1,1,4and p0=0.05,0.3,0.3,0.1,0.08,0.17, again assuming that the component distributions are known.

From Table 1, we can see that the GAM algorithm, while using very little memory, is highly efficient in running time. The last row of the table is the estimated acceptance probability of the GAM algorithm. The algorithm is very efficient when the component densities are known. We can see this not only by simulation but also from theoretical considerations, as follows.

### A.3.1 Explanation of efficiency

When v=MT11>0, we are able to use Mdirectly without modification to construct the hat function, thereby speeding up the calculations. In the simulations of the previous section, this was always found to be the case. Now we explain why this should be so.

If the maximum likelihood estimator of pis consistent, then when N,

1njnAjLnkk=1KpkLnkLnkk=1KpkLnkp0.E26

Assuming sufficient regularity, we also have

mjk=nAjanknjE27
=1njnAjLnkk=1KpkLnkE28
pEfkYfYLjE29
=Ljfkydyγj,asN,E30

where Yhas density fy=k=1Kpkfky, Lj=yfjyfkyk=1Kand γj=Ljfydy.

Therefore

MTpWT,E31

where

W=wjk,wjk=Ljfkydyγj.E32

Let v¯=γ1γK, then

WTv¯=j=1KLjf1ydyj=1KLjfKydy=f1ydyfKydy=1,E33

where the second equal sign is because j=1KLj=. So, there exists v¯>0, satisfying WTv¯=1. Using Eq. (32), we can conclude that when Nis large enough, there also exists vv¯>0, satisfying MTv=1.

Since γj=Ljfydyand nj=#Aj, we have nj/Npγj.When each nj,j=1,,Kis large, if a random sample qis drawn from a Dirichlet distribution with parameter n+1, then each qjis approximately equal to nj/Nγj. Furthermore, vjγj, so qsatisfies

V1q1,E34

and then,

p=M1V1qM11=p.E35

If pis approximately equal to the mode p, the two sides of the inequality,

hpy=n=1Nk=1Kpkankj=1Kvjnjj=1Kqjnj,E36

are approximately equal as well. Thus, the closer the sampled realization pis to p, the larger the acceptance probability is. So the algorithm runs very rapidly, since the sampled values of pare always around the mode p.

This algorithm requires calculating the MLE, which can be performed very quickly since the likelihood function is log-concave. In fact an approximate guess for pwill suffice. The more accurate the guess is, the more efficient the algorithm will be.

The method performs well when the component densities are correctly specified, as explained in the previous section. For these same reasons, we would expect the algorithm to perform poorly under misspecification. Details of robustness to misspecification can be found in [11].

## How to cite and reference

### Cite this chapter Copy to clipboard

Hongsheng Dai (November 13th 2019). A Review on the Exact Monte Carlo Simulation [Online First], IntechOpen, DOI: 10.5772/intechopen.88619. Available from: