
A Review on the Exact Monte Carlo Simulation

Written By

Hongsheng Dai

Reviewed: 16 July 2019 Published: 13 November 2019

DOI: 10.5772/intechopen.88619

From the Edited Volume

Bayesian Inference on Complicated Data

Edited by Niansheng Tang


Abstract

Perfect Monte Carlo sampling refers to sampling random realizations exactly from the target distributions (without any statistical error). Although many different methods have been developed and various applications implemented in this area, researchers most often associate perfect sampling with coupling from the past (CFTP), which corrects the statistical errors of Monte Carlo samples generated by Markov chain Monte Carlo (MCMC) algorithms. This paper provides a brief review of recent developments and applications in CFTP and other perfect Monte Carlo sampling methods.

Keywords

  • coupling from the past
  • diffusion
  • Monte Carlo
  • perfect sampling

1. Introduction

In the past 30 years, substantial progress has been made in popularizing Bayesian methods for the statistical analysis of complex data sets. An important driving force has been the availability of different types of Bayesian computational methods, such as Markov chain Monte Carlo (MCMC), sequential Monte Carlo (SMC), approximate Bayesian computation (ABC) and so on. For many practical examples, these computational methods can provide rapid and reliable approximations to the posterior distributions for unknown parameters.

The basic idea that lies behind these methods is to obtain Monte Carlo samples from the distribution of interest, in particular the posterior distribution. In Bayesian analysis of complex statistical models, the calculation of posterior normalizing constants and the evaluation of posterior estimates are typically infeasible either analytically or by numerical quadrature. Monte Carlo simulation provides an alternative. One of the most popular Bayesian computational methods is MCMC, which is based on the idea of constructing a Markov chain with the desired distribution as its stationary distribution.

By running a Markov chain, MCMC methods generate statistically dependent and approximate realizations from the limiting (target) distribution. A potential weakness of these methods is that the simulated trajectory of a Markov chain will depend on its initial state. A common practical recommendation is to ignore the early stages, the so-called burn-in phase, before collecting realizations of the state of the chain. How to choose the length of the burn-in phase is an active research area. Many methods have been proposed for convergence diagnostics; [1] gave a comparative review. Rigorous application of diagnostic methods requires either substantial empirical analysis of the Markov chain or complex mathematical analysis. In practice, judgments about convergence are often made by visual inspection of the realized chain or the application of simple rules of thumb.

Concerns about the quality of the sampled realizations of the simulated Markov chains have motivated the search for Monte Carlo methods that can be guaranteed to provide samples from the target distribution. This is usually referred to as perfect sampling or exact sampling. In some cases, for example, the multivariate normal, perfect samples are readily available. For more complicated distributions, perfect sampling can be achieved, in principle, by the rejection method. This involves sampling from a density that bounds a suitable multiple of the target density, followed by acceptance or rejection of the sampled value. The difficulty here is in finding a bounding density that is amenable to rapid sampling while at the same time providing sample values that are accepted with high probability. In general this is a challenging task, although efficient rejection sampling methods have been developed for the special class of log-concave densities; see, for example, [2, 3].

A surprising breakthrough in the search for perfect sampling methods was made by [4]. The method is known as coupling from the past (CFTP). This is a sophisticated MCMC-based algorithm that produces realizations exactly from the target distribution. CFTP transfers the difficulty of running the Markov chain for extensive periods (to ensure convergence) to the difficulty of establishing whether a potentially large number of coupled Markov chains have coalesced. An efficient CFTP algorithm depends on finding an appropriate Markov chain construction that will ensure fast coalescence. There have been a few further novel theoretical developments following the breakthrough of CFTP, including [2, 5, 6]. Since then, perfect sampling methods have attracted great attention in various Bayesian computational problems and applied probability areas.

Apart from coupling from the past, many other perfect sampling methods were proposed for specific problems, for example, perfect sampling for random spanning trees [7, 8] and path-space rejection sampler for diffusion processes [9, 10, 11]. Very recently, a type of divide-and-conquer method has been developed in [12, 13]. These methods use the technique for the exact simulation of diffusions and samples from simple sub-densities to obtain perfect samples from the target distribution.

Perfect samples are useful in Bayesian applications either as a complete replacement for MCMC-generated values or as a source of initial values that guarantee that a conventional MCMC algorithm runs in equilibrium. Perfect samples can also be used as a means of quality control in judging a proposed MCMC implementation when there are questions about the speed of convergence of the MCMC algorithm or about whether the chain is capable of exploring all parts of the sample space. Of course, when perfect samples can be obtained speedily, they will be preferred to MCMC values, since they eliminate such doubts. In addition, sophisticated perfect sampling methodology often motivates efficient approximate algorithms and computational techniques. For example, [14] uses regenerated Markov chains to obtain simple standard error estimates for importance sampling in an MCMC context. The condition required there also allows perfect sampling via the multigamma coupling approach [15].

This paper will present a brief review for perfect Monte Carlo sampling and explain the advantages and drawbacks of different types of methods. Section 2 will present rejection sampling techniques, and then CFTP will be covered in Section 3. Recent divide-and-conquer methods will be reviewed in Section 4. The paper ends with a discussion in Section 5.


2. Rejection sampling techniques

Rejection sampling, also known as ‘acceptance-rejection sampling’, generates realizations from a target probability density function f(x) by using a hat function Mg(x), where f(x) ≤ Mg(x) and g(x) is a probability density function from which samples can be readily simulated. The basic rejection sampling algorithm is as follows:

Algorithm 2.1 (Basic rejection sampling)

1. Sample x from g(x) and U from Unif(0, 1).
2. If U ≤ f(x)/(Mg(x)),
3.   accept x as a realisation of f(x) and stop;
4. else
5.   reject the value of x and go back to step 1.
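
To make the steps concrete, the following is a minimal Python sketch of Algorithm 2.1 for a hypothetical unnormalised target f(x) ∝ exp(−x²/2)(1 + sin²(3x)) with a standard normal proposal g; the bound M = 2√(2π) is my own choice, valid because f(x)/g(x) = √(2π)(1 + sin²(3x)) ≤ 2√(2π).

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Unnormalised target density (an illustrative choice, not from the paper).
    return np.exp(-0.5 * x**2) * (1.0 + np.sin(3.0 * x)**2)

def g_pdf(x):
    # Proposal density: standard normal, easy to simulate from.
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

M = 2.0 * np.sqrt(2.0 * np.pi)   # f(x)/g(x) = sqrt(2*pi)*(1 + sin^2(3x)) <= M

def rejection_sample():
    while True:
        x = rng.standard_normal()          # step 1: sample x from g
        u = rng.uniform()                  # ... and U ~ Unif(0, 1)
        if u <= f(x) / (M * g_pdf(x)):     # step 2: accept with prob f/(Mg)
            return x                       # an exact draw from f

samples = np.array([rejection_sample() for _ in range(10000)])
print(samples.mean(), samples.std())
```

The looser the hat (the larger M), the more proposals are wasted before an acceptance, which is exactly the efficiency issue discussed next.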

Many other perfect sampling methods are actually equivalent to rejection sampling. For example, the ratio-of-uniforms (RoU) method [16] may have to be implemented via a rejection sampling approach.

The efficiency of rejection sampling depends on the acceptance probability, which is 1/M. To perform rejection sampling efficiently, it is very important to find hat functions which provide high acceptance probabilities. In other words, we should choose M as small as possible [17]; however, for many complicated problems there is no easy way to find an M small enough to guarantee a high acceptance probability. A number of sophisticated rejection sampling methods have been suggested for dealing with complex Bayesian posterior densities, which we discuss below.

2.1 Log-concave densities

A function h(x) is called log-concave if

log h(λx + (1 − λ)y) ≥ λ log h(x) + (1 − λ) log h(y),

for all x, y and λ ∈ (0, 1). For the special class of log-concave densities, Gilks and Wild [3] developed the adaptive rejection sampling (ARS) method. The method constructs an envelope function for the logarithm of the target density f(x) by using tangents to log f(x) at an increasing number, n, of points. The envelope function u_n(x) is the piecewise linear upper hull formed from the tangents. Note that the envelope can be easily constructed due to the concavity of log f(x). The method also constructs a squeeze function l_n(x), which is formed from the chords between the tangent points. The sampling steps of the ARS algorithm are as follows.

Algorithm 2.2 (Adaptive rejection sampling)

Outputs a stream of perfect samples from f(x).
1. Initialise u_n(x) and l_n(x) by using tangents at several points.
2. Sample x from the density ∝ exp(u_n(x)) and U ~ Unif(0, 1).
3. If U ≤ exp(l_n(x) − u_n(x)), output x;
4. else if U ≤ f(x)/exp(u_n(x)),
5.   output x; update (u_n, l_n) to (u_{n+1}, l_{n+1}) using x.
6. Go to step 2.
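
A full ARS implementation updates the envelope adaptively whenever f is evaluated; the following simplified, non-adaptive Python sketch (my own construction, not the code of [3] or [18]) illustrates the key ingredient: a piecewise-exponential envelope built from tangents to log f(x) at a fixed set of points, here for a standard normal target. The squeeze function and the adaptive update of step 5 are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

# Log-concave (unnormalised) target: the standard normal.
ell  = lambda x: -0.5 * x ** 2      # log f(x)
dell = lambda x: -x                 # (log f)'(x)

xs = np.array([-2.0, -0.5, 0.7, 2.0])   # fixed tangent points (nonzero slopes)
c, s = ell(xs), dell(xs)                # tangent i: c[i] + s[i]*(x - xs[i])

# Breakpoints where consecutive tangents intersect; the envelope u(x) is the
# tangent at xs[i] on (z[i], z[i+1]), with z[0] = -inf and z[-1] = +inf.
z = np.empty(len(xs) + 1)
z[0], z[-1] = -np.inf, np.inf
for i in range(len(xs) - 1):
    z[i + 1] = (c[i + 1] - c[i] + s[i] * xs[i] - s[i + 1] * xs[i + 1]) \
               / (s[i] - s[i + 1])

def expz(si, zz):
    return 0.0 if np.isinf(zz) else np.exp(si * zz)   # 0 at unbounded ends

# Mass of exp(u(x)) on each segment (a piecewise-exponential density).
w = np.array([np.exp(c[i] - s[i] * xs[i])
              * (expz(s[i], z[i + 1]) - expz(s[i], z[i])) / s[i]
              for i in range(len(xs))])
prob = w / w.sum()

def sample_envelope():
    i = rng.choice(len(xs), p=prob)     # pick a segment, then invert its CDF
    lo, hi = expz(s[i], z[i]), expz(s[i], z[i + 1])
    return np.log(lo + rng.uniform() * (hi - lo)) / s[i]

def u_env(x):
    i = int(np.searchsorted(z, x)) - 1
    return c[i] + s[i] * (x - xs[i])

def envelope_rejection_sample():
    while True:
        x = sample_envelope()
        if rng.uniform() <= np.exp(ell(x) - u_env(x)):  # f(x)/exp(u(x))
            return x

draws = np.array([envelope_rejection_sample() for _ in range(10000)])
print(draws.mean(), draws.std())   # roughly 0 and 1 for the standard normal
```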

The ARS algorithm is adaptive, and the sampling density is modified whenever f(x) is evaluated. In this way, the method becomes more efficient as the sampling continues. Leydold [18] extends ARS to log-concave multivariate densities. Leydold’s algorithm is based on decomposing the domain of the density into cones and then computing tangent hyperplanes for these cones. Generic computer code for sampling from a multivariate log-concave density is available on the author’s website; it is only necessary to code a subroutine for the target density. Leydold’s algorithm works well for simple low-dimensional densities. The drawback of the ARS algorithm is that it only works for log-concave densities, which form a very small class of posteriors in practice. Computationally, it is also very challenging to implement the ARS algorithm for high-dimensional densities [19].

2.2 Fill’s rejection sampling algorithm

We consider a discrete Markov chain with transition probability P(x, y) and stationary distribution π(x), x ∈ S. Let P̃(x, y) = π(y)P(y, x)/π(x) be the transition probability of the time-reversed chain. Suppose that there is a partial ordering on the states S, denoted by x ⪯ y. Let 0̂ and 1̂ be the unique extremal points of the partial ordering.

To demonstrate the algorithm given by [2], we will assume that there are update functions ϕ and ϕ̃, both mapping S × [0, 1] to S, such that

P(x, y) = P(ϕ(x, U) = y),  (1)
P̃(x, y) = P(ϕ̃(x, U) = y),  (2)

where U ~ Unif(0, 1) and

x ⪯ y ⟹ ϕ̃(x, u) ⪯ ϕ̃(y, u), for almost every u ∈ [0, 1].

The algorithm is as follows:

Algorithm 2.3 (Fill’s algorithm)

1. Choose an integer t > 0. Fix x_0 = 0̂ and y_0 = 1̂.
2. Generate x_k = ϕ(x_{k−1}, U_k), k = 1, …, t, where the U_k are i.i.d. Unif(0, 1).
3. Generate Ũ_k from the conditional distribution of U given ϕ̃(x_{t−k+1}, U) = x_{t−k}, k = 1, …, t.
4. Generate y_k = ϕ̃(y_{k−1}, Ũ_k), k = 1, …, t.
5. If y_t = x_0 then accept x_t; else double t and repeat from step 2.

In Algorithm 2.3 (step 2), z ≔ x_t is sampled from P^t(0̂, ·). If we can find an upper bound M for π(z)/P^t(0̂, z), then we can use rejection sampling. Fill [2] finds a bounding constant given by M = π(0̂)/P̃^t(1̂, 0̂) and proves that steps 3 to 5 of Algorithm 2.3 amount to accepting z with probability π(z)/(M P^t(0̂, z)). The output of this algorithm is indeed a perfect sample from π.

From Algorithm 2.3, we can see that rejection sampling can still be possible, even if we do not have a closed form of the hat function. The first limitation of Algorithm 2.3 is that it works only if the time-reversed chain is monotone, but [5] has extended the algorithm theoretically for general chains. The second limitation is that step 3 of Algorithm 2.3 is difficult to perform [2]. For these reasons, Fill’s algorithm is not practical for realistic problems.


3. Coupling from the past

Coupling from the past was introduced in the landmark paper of [4] which showed how to provide perfect samples from the limiting distribution of a Markov chain.

3.1 Basic CFTP algorithms

Let X_t be an ergodic Markov chain with state space 𝒳 = {1, …, n}, where the probability of going from i to j is p_ij and the stationary distribution is π. Suppose we design an updating function ϕ(·, U) which satisfies P(ϕ(i, U) = j) = p_ij, where ϕ is a deterministic function and U is a random variable. To simulate the next state Y of the Markov chain, currently in state i, we draw a random variable U and let Y = ϕ(i, U).

Let f_t(i) = ϕ(i, U_t), and define the composition

F_{t1}^{t2} = f_{t2−1} ∘ f_{t2−2} ∘ ⋯ ∘ f_{t1+1} ∘ f_{t1},  (3)

for t1 < t2.

Proposition 3.1 [4] Suppose there exists a time t = −T, the backward coupling time, such that the chains starting from every state in 𝒳 = {1, …, n} at time t = −T, driven by the same sequence {U_t}_{t=−T}^{−1}, all arrive at the same state X_0 at time 0. Then it must follow that X_0 comes from π.

This proposition is easy to prove. If we run an ergodic Markov chain from time t = −∞, using the sequence {U_t}_{t=−T}^{−1} after time −T, the Markov chain will also arrive at X_0. Then X_0 comes exactly from π, since it is collected at time 0 and the Markov chain started from −∞. The algorithm is as follows:

Algorithm 3.1 (Basic CFTP)

1. t = 0.
2. Repeat:
3.   t = t − 1;
4.   generate U_t;
5.   F_t^0 = F_{t+1}^0 ∘ ϕ(·, U_t);
6. until F_t^0 is a constant map.
7. Return F_t^0.

Propp and Wilson [4] also proved that this algorithm is certain to terminate. The idea of simulating from the past is important. Note that if we instead ran the chains forward and collected F_0^T as the result, where T is the smallest value that makes F_0^T a constant map, we would get a biased sample. This is because T is a stopping time, which is not independent of the Markov chain.
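
The following Python sketch implements Algorithm 3.1 for a small hypothetical three-state chain. The essential detail is that each U_t is generated once and reused whenever the algorithm restarts further in the past; composing the maps from time −T to 0 is done by simply running one chain from every starting state.

```python
import numpy as np

rng = np.random.default_rng(3)

# A small ergodic chain on {0, 1, 2} with a hypothetical transition matrix P.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
cum = np.cumsum(P, axis=1)

def phi(i, u):
    # Update function: P(phi(i, U) = j) = P[i, j] for U ~ Unif(0, 1).
    j = int(np.searchsorted(cum[i], u))
    return min(j, P.shape[1] - 1)        # guard against float round-off

def cftp():
    U = {}                  # U[t] is generated once and reused on every restart
    T = 1
    while True:
        for t in range(-T, 0):
            if t not in U:
                U[t] = rng.uniform()
        x = list(range(3))               # one chain per starting state
        for t in range(-T, 0):
            x = [phi(xi, U[t]) for xi in x]
        if len(set(x)) == 1:             # all chains coalesced by time 0
            return x[0]
        T *= 2                           # restart further in the past

draws = np.array([cftp() for _ in range(20000)])
print(np.bincount(draws) / draws.size)   # empirical distribution
print(np.linalg.matrix_power(P, 100)[0]) # stationary distribution pi
```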

3.2 Read-once CFTP

The basic CFTP algorithm needs to restart the Markov chains at some points in the past if they have not coalesced by time 0. We must use the same random sequence {U_t, t ≤ −1} when we restart the Markov chains. In Monte Carlo simulations we usually use pseudorandom number generators, which are deterministic algorithms, so if we supply the same random seed, we will get the same random sequence. This means that the same sequence {U_t} can be regenerated in each coupling procedure.

If we can run the Markov chain forward starting at 0 and collect a perfect sample in the future, we will not have to regenerate Ut. Wilson [20] developed a read-once CFTP method to implement the forward coupling idea. A simple example is provided by [21]. In fact, the multigamma coupler in [15] can be implemented via the more efficient read-once CFTP algorithm.

3.3 Improvement of CFTP algorithms

Propp and Wilson [4] showed that the computational cost of the algorithm can be reduced if there is a partial order on the state space 𝒳 that is preserved by the update function ϕ, i.e. if x ⪯ y then ϕ(x, U) ⪯ ϕ(y, U). Their procedure is outlined in Algorithm 3.2, where, as before, 0̂ and 1̂ are the unique extremal points. Note that their algorithm needs a monotone update function ϕ for the Markov chain, while Algorithm 2.3 requires a monotone update function ϕ̃ for the time-reversed chain.

Algorithm 3.2 (Monotone CFTP)

1. T = 1.
2. Repeat:
3.   upper = 1̂;
4.   lower = 0̂;
5.   for t = −T to −1:
6.     upper = ϕ(upper, U_t);
7.     lower = ϕ(lower, U_t);
8.   T = 2T;
9. until upper = lower.
10. Return upper.
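
As a minimal illustration, the sketch below runs monotone CFTP for a hypothetical reflecting random walk on {0, …, K}: the update moves every state up, down, or nowhere together, so x ≤ y implies ϕ(x, u) ≤ ϕ(y, u), and only the two extremal chains need to be tracked.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 10   # state space {0, ..., K} with the natural (total) order

def phi(x, u):
    # Monotone update: all states move up, down, or hold together.
    if u < 0.3:
        return min(K, x + 1)
    if u > 0.7:
        return max(0, x - 1)
    return x

def monotone_cftp():
    U = {}
    T = 1
    while True:
        for t in range(-T, 0):
            if t not in U:
                U[t] = rng.uniform()
        upper, lower = K, 0              # the 1-hat and 0-hat chains
        for t in range(-T, 0):
            upper = phi(upper, U[t])
            lower = phi(lower, U[t])
        if upper == lower:               # the bounding chains coalesced
            return upper
        T *= 2

draws = np.array([monotone_cftp() for _ in range(5000)])
# For this reflecting walk the stationary law is uniform on {0, ..., K}.
print(np.bincount(draws, minlength=K + 1) / draws.size)
```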

Algorithm 3.2 is much simpler than Algorithm 3.1, since only two chains have to be run at the same time, but the requirement of monotonicity is very restrictive. Markov chains with transitions given by independent Metropolis-Hastings and by perfect slice sampling have been shown to have this property by [22, 23], respectively. However, [23, 24] have also noticed that such independent M-H CFTP is equivalent to a simple rejection sampler.

In general it is hard to code perfect slice samplers correctly. For example, Hörmann and Leydold [25] have pointed out that the perfect slice samplers in [26, 27] are incorrect. The challenge of monotone CFTP is usually to construct the detailed updating function with a guarantee of preserving the partial order.

Finding a partial order preserved by the Markov chains is a non-trivial task in many cases. An alternative improvement is to use CFTP with bounding chains, such as those in [28, 29]. If the bounding chains, which bound all the Markov chains, coalesce, then all the Markov chains coalesce. Thus, if only a few bounding chains are required, the efficiency of the CFTP algorithm can be improved significantly. Sometimes it may be impossible to define an explicit bounding chain (the maximum of the state space may be infinite, and the upper bounding chain cannot start from infinity), but it is possible to use a dominating process to bound all the Markov chains [30].

3.4 Applications and challenges

Although CFTP is extremely challenging to implement for many practical problems, it has found a number of applications in certain discrete state space problems, for example, the Ising model [4]. Also, [31] applied CFTP to the ancestral selection graph to simulate samples from population genetic models. Refs. [32, 33] applied CFTP to a class of fork-join type queuing system problems. Connor and Kendall [34] applied CFTP to the perfect simulation of M/G/c queues. CFTP also finds application in signal processing [35].

CFTP for continuous state space Markov chains is very challenging, since a random map from an interval to a finite number of points is required. In recent years, several methods have been developed for unbounded continuous state space Markov chains, such as the perfect slice sampler in [23] and the multigamma coupler and bounded M-H coupler in [15, 24]. Wilson [36] developed a layered multi-shift coupling, which shifts states in an interval to a finite number of points. However, none of these methods can solve any practical problems.


4. Recent advances in perfect sampling

Recently, a new type of perfect Monte Carlo sampling method, based on a decomposition of the target density f as f = g₁g₂, was proposed in [12], where g₁ and g₂ are (proportional to) proper density functions. Note that here g₁ and g₂ are continuous density functions which are easy to simulate from. Suppose that q-dimensional vector values x₁ and x₂ are independently drawn from g₁ and g₂, respectively. If the two independent samples are equal, i.e. x₁ = x₂ = y, then y must be a draw from f ∝ g₁g₂. Such a naive approach may be practical for discrete random variables on a low-dimensional state space, but for continuous random variables it is impossible, since P(x₁ = x₂) = 0. Dai [12] proposed a novel approach to deal with this, which is explained in the following subsection.

4.1 Perfect distributed Monte Carlo without using hat functions

First we introduce the following notation for the gradient of the logarithm of the first sub-density:

α(x) = (α₁(x), …, α_q(x))ᵀ = ∇ log g₁(x),  (4)

where ∇ denotes the vector of partial derivative operators for the components of x. Then we consider a q-dimensional diffusion process X_t(ω), t ∈ [0, T] (T < ∞), defined on the q-dimensional continuous function space Ω, given by

dX_t = α(X_t)dt + dB_t,  (5)

where B_t(ω) = ω_t is a Brownian motion and ω = {ω_t}_{t∈[0,T]} is a typical element of Ω. Let W be the probability measure of a Brownian motion with initial probability distribution B₀ = ω₀ ~ f₁ ∝ g₁².

Clearly X_t has the invariant distribution f₁(x) (using the Langevin diffusion results [37]). Let Q be the probability law induced by {X_t, t ∈ [0, T]} with X₀ = ω₀ ~ f₁, i.e. under Q we have X_t ~ f₁(x) for any t ∈ [0, T].

The idea in [12] is to use a biased diffusion process X̄ = {X̄_t}_{0≤t≤T} to simulate from the target density f. It is defined as follows.

Definition 4.1 (Biased Langevin diffusion) The joint density of the pair (X̄₀, X̄_T) (the starting and ending points of the biased diffusion process), evaluated at the point (x, y), is f₁(x)t(y|x)f₂(y), where t(y|x) is the transition density of the diffusion process X defined in Eq. (5) from X₀ = x to X_T = y, and f₂(y) = g₂(y)/g₁(y).

Given (X̄₀, X̄_T), the process {X̄_t, 0 < t < T} is the diffusion bridge driven by Eq. (5).

The marginal distribution of X̄_T is f(y). Therefore, to draw a sample from the target distribution f, we need to simulate a process {X̄_t, t ∈ [0, T]} from Q̄ (the law induced by X̄), and then X̄_T ~ f.

Simulation from Q̄ can be done via rejection sampling. We can use a biased Brownian motion B̄ = {B̄_t}_{0≤t≤T} as the proposal diffusion:

Definition 4.2 (Biased Brownian motion) The starting and ending points (B̄₀, B̄_T) follow a distribution with density h(x, y), and {B̄_t, 0 < t < T} is a Brownian bridge given (B̄₀, B̄_T).

Under certain mild conditions, Dai [12] proved the following lemma.

Lemma 4.1 Let Z be the probability law induced by {B̄_t}_{0≤t≤T}. By letting

h(ω₀, ω_T) = g₂(ω_T)g₁(ω₀) (2πT)^{−1/2} exp{−(ω_T − ω₀)²/(2T)},  (6)

we have

dQ̄/dZ(ω) ∝ exp{−(1/2)∫_0^T (α² + div α)(ω_t)dt},  (7)

where div is the divergence operator.

Condition 4.1 There exists l > −∞ such that

(1/2)(α² + div α)(x) − l ≥ 0.  (8)

Under Condition 4.1 the ratio (7) can be rewritten as

dQ̄/dZ(ω) ∝ exp{−∫_0^T [(1/2)(α² + div α)(ω_t) − l]dt},  (9)

which takes values no greater than 1.

Therefore, we can use rejection sampling to simulate from Q̄ with proposal measure Z. The acceptance probability (9) can be dealt with using methods similar to those in [9, 11]. The algorithm is presented below; see [12, 38] for more details.

Algorithm 4.1 (Simple distributed sampler)

1. Simulate (ω₀, ω_T) from the density h.
2. Simulate the biased Brownian bridge {B̄_t}_{t∈[0,T]}.
3. Accept ω_T as a sample from f with probability (9); if rejected, go back to step 1.

Such a method is a rejection sampling algorithm, but it does not require finding a hat function to bound the target density, which is usually the main challenge of the traditional rejection sampling for complicated target densities. The above algorithm uses g2 as the proposal density function, which does not have to bound the target f. However, it requires a bound for the derivatives of the logarithm of the sub-densities (see Condition 4.1). This is usually easier to get in practice, since the logarithm of the posterior is usually easy to deal with. Also [12] noted that we should choose sub-densities g1 and g2 as similar as possible, in order to achieve high acceptance probability.
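
As a one-dimensional illustration, the sketch below implements Algorithm 4.1 under a decomposition of my own choosing: g₁(x) = sech(x)/π and g₂ the standard normal density, so that α(x) = −tanh(x), Condition 4.1 holds with l = −1/2, and (1/2)(α² + div α)(x) − l = tanh²(x) ∈ [0, 1]. The acceptance probability (9) is realised by Poisson thinning in the spirit of [9, 11], and the draw from h uses an inner rejection step that is also my own simplification.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 1.0   # time horizon of the auxiliary diffusion (a tuning constant)

# Illustrative decomposition (my choice, not from [12]):
#   g1(x) = sech(x)/pi, g2 = N(0,1) density, target f ∝ g1*g2.
# With l = -1/2, (1/2)(alpha^2 + div alpha)(x) - l = tanh(x)^2 in [0, 1].
phi_minus_l = lambda x: np.tanh(x) ** 2

def sample_h():
    # (omega_0, omega_T) ~ h of Eq. (6): omega_0 ~ g1, omega_T ~ N(omega_0, T),
    # reweighted by g2(omega_T) via an inner rejection step.
    while True:
        x = np.arcsinh(np.tan(np.pi * (rng.uniform() - 0.5)))  # inverse CDF of g1
        y = x + np.sqrt(T) * rng.standard_normal()
        if rng.uniform() <= np.exp(-0.5 * y ** 2):             # proportional to g2(y)
            return x, y

def path_accepted(x, y):
    # Accept with probability exp(-integral of (phi - l) dt) by Poisson
    # thinning: throw unit-rate Poisson points on [0,T] x [0,1]; accept iff
    # every point lies above the curve t -> phi(B_t) - l, where B is the
    # Brownian bridge from (0, x) to (T, y), revealed only at those times.
    n = rng.poisson(T)                 # rate bound is 1 since phi - l <= 1
    times = np.sort(rng.uniform(0.0, T, n))
    marks = rng.uniform(0.0, 1.0, n)
    t0, b0 = 0.0, x
    for t, v in zip(times, marks):
        mean = b0 + (t - t0) / (T - t0) * (y - b0)   # bridge mean at t
        var = (t - t0) * (T - t) / (T - t0)          # bridge variance at t
        b = mean + np.sqrt(var) * rng.standard_normal()
        if phi_minus_l(b) > v:
            return False
        t0, b0 = t, b
    return True

def exact_sample():
    while True:
        x, y = sample_h()
        if path_accepted(x, y):
            return y          # a perfect draw from f ∝ g1*g2

draws = np.array([exact_sample() for _ in range(5000)])
print(draws.mean(), draws.std())
```

Note that the path of the proposal bridge only ever needs to be revealed at the finitely many Poisson time points, which is what makes the rejection step implementable without discretisation error.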

Dai [12] focused on the simple decomposition f = g₁g₂, although it was mentioned that for the more general decomposition f = g₁g₂⋯g_C, a recursive method can be used. Unfortunately, a naive recursive method is very inefficient. A more sophisticated method is introduced in the following section.

4.2 Monte Carlo fusion for distributed analysis

A more efficient and sophisticated method, named Monte Carlo fusion, was proposed recently in [13]. Suppose that we consider

f(x) ∝ g₁(x)⋯g_C(x),  (10)

where each g_c(x) (c ∈ {1, …, C}) is a density (up to a multiplicative constant). Here C denotes the number of parallel computing cores available in big data problems, and each g_c(x) is the sub-posterior density based on a subset of the big data. In group decision problems, C is the number of different decisions to be combined, and g_c(x) stands for the decision of each group member.

Dai et al. [13] considered simulating from the following density on an extended space,

g(x^(1), …, x^(C), y) ∝ ∏_{c=1}^{C} [g_c²(x^(c)) p_c(y|x^(c)) g_c(y)^{−1}],  (11)

which admits the marginal density f for y. Here p_c(y|x^(c)) is the transition density from x^(c) to y of the Langevin diffusion of the form (5) associated with each sub-density g_c.

Dai et al. [13] considered a rejection sampling approach with proposal density proportional to the function

h(x^(1), …, x^(C), y) = ∏_{c=1}^{C} g_c(x^(c)) · exp{−C(y − x̄)²/(2T)},  (12)

where x̄ = C^{−1}Σ_{c=1}^{C} x^(c) and T is an arbitrary positive constant.

Simulation from the proposal h can be achieved directly. In particular, x^(1), …, x^(C) are first drawn independently from g₁, …, g_C, respectively, and then y is simply a Gaussian random variable centred at x̄ (with variance T/C). This is a distributed analysis, or divide-and-conquer, approach, as sketched below. Detailed acceptance probabilities and rejection sampling algorithms can be found in [13].
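
The following short sketch shows only this proposal stage, with hypothetical Gaussian sub-posteriors standing in for g₁, …, g_C; the rejection step that corrects the proposal involves path-space acceptance probabilities as in Section 4.1 and is detailed in [13].

```python
import numpy as np

rng = np.random.default_rng(6)
T, C = 1.0, 4   # tuning constant and number of sub-posteriors

# Hypothetical Gaussian sub-posteriors g_c = N(m_c, s_c^2), stand-ins for
# sub-posteriors computed on C shards of a data set.
m = np.array([0.3, -0.1, 0.5, 0.2])
s = np.array([1.0, 0.8, 1.2, 0.9])

def propose():
    # Simulation from the proposal h of Eq. (12):
    # x_c ~ g_c independently, then y | x ~ N(xbar, T/C).
    x = m + s * rng.standard_normal(C)
    xbar = x.mean()
    y = xbar + np.sqrt(T / C) * rng.standard_normal()
    return x, y

x, y = propose()
print(x, y)
# The pair (x, y) is then accepted or rejected, as in Section 4.1; see [13].
```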

The above fusion approach arises in modern statistical methodologies for ‘big data’. A full data set is artificially split into a large number of smaller data sets, inference is conducted on each smaller data set, and the results are combined (see, for instance, [39, 40, 41, 42, 43, 44, 45, 46]). The advantage of such an approach is that inference on each small data set can be conducted in parallel, so the heavy computational cost of algorithms such as MCMC is no longer a concern. Traditional methods suffer from the weakness that the fusion of the separately conducted inferences is inexact. The Monte Carlo fusion method in [13], however, is an exact simulation algorithm and does not have this approximation weakness.

The fusion approach also arises in a number of other settings where distributed analysis occurs naturally. For example, in signal processing, distributed multi-sensor networks may be used in fusion systems, and a fusion approach arises naturally to combine the results from different sensors [47].


5. Conclusion

Although perfect simulation usually refers to correcting the statistical errors of samples drawn via MCMC, it actually covers a much wider area beyond CFTP. In fact, for certain applications it is often possible to construct other types of perfect sampling methods which are much more efficient than CFTP. For example, for the exact simulation of the posterior of simple mixture models, the geometric-arithmetic mean (GAM) method in [19] is much more efficient than the CFTP sampler in [48]. Details of the GAM method are provided in the Appendix. Also, the random walk construction for the exact simulation of random spanning trees [7] is much more efficient than the CFTP version.

Bayesian computational algorithms keep evolving, particularly in the current big data era. Although almost all newly developed algorithms are approximate simulation algorithms, perfect sampling is still one of the key driving forces behind new Bayesian computational algorithms, and it can quickly motivate new classes of ‘mainstream’ algorithms. More attention should be given to methods beyond CFTP, for example, the fusion type of algorithms.

The Monte Carlo fusion method has the potential to be used in many Bayesian big data applications. For example, for large car accident data sets, the response variable is usually a categorical variable representing the seriousness of the accident, and a generalized linear regression model is often used. Under a Bayesian framework, we may estimate the posterior distribution of the regression parameters via such a fusion approach. The posterior mean, the posterior median, or other characteristics of the posterior distribution can then be estimated using the Monte Carlo samples. Moreover, such an algorithm is a perfect sampling algorithm, so no convergence justification is needed, since it always provides realizations exactly from the target distribution.

Appendix

Observations from a simple mixture model are assumed to be either discrete or continuous. The density function of an individual observation y has the form

f(y|p) = Σ_{k=1}^{K} p_k f_k(y), where Σ_{k=1}^{K} p_k = 1 and p_k > 0, k = 1, …, K.  (13)

We assume that the component weights p = (p₁, …, p_K) are unknown parameters, while the number of components, K, and the component densities, f_k, are all known. We focus on perfect sampling from the posterior distribution of p.

Suppose that we have N observations, y₁, …, y_N. Let L_nk = f_k(y_n) and assume that the prior distribution of p is Dirichlet:

π₀(p) ∝ ∏_{k=1}^{K} p_k^{α_k − 1}, α_k > 0, k = 1, …, K.  (14)

Then the posterior distribution is given by

f(p|y) ∝ π₀(p) ∏_{n=1}^{N} [Σ_{k=1}^{K} p_k L_nk] I_X(p),  (15)

where X = {p : Σ_{k=1}^{K} p_k = 1, p_k > 0, k = 1, …, K}.

There are several ways to carry out perfect sampling from Eq. (15). The first method is based on CFTP [48]. An alternative perfect sampling method for simple mixture models was introduced by [49]. The third approach is to use adaptive rejection sampling [3, 18], since the posterior is log-concave. We may also use the ratio-of-uniforms method. However, none of these methods is more efficient than the geometric-arithmetic mean method of [19].


A.1 Geometric-arithmetic mean method

Suppose that p̂, the MLE of p, is unique and, for simplicity, assume the prior π₀(p) is uniform. Define a_nk = L_nk / Σ_{k′=1}^{K} p̂_{k′} L_{nk′}. The posterior density of p is then given by

f(p|y) ∝ h(p|y) = ∏_{n=1}^{N} [Σ_{k=1}^{K} p_k a_nk] I_X(p),  (16)

where X is as defined in Eq. (15).

Let I_n be a random element of argmax_k L_nk. Define A_j = {n : I_n = j}, and let n = (n₁, …, n_K), where n_j is the number of elements in A_j.

Define M = (m_jk) with m_jk = Σ_{n∈A_j} a_nk / n_j. If n_j = 0, then set m_jj = 1 and m_jk = 0 for j ≠ k. We now make two assumptions, to which we will return later:

A: M is invertible.

B: The elements of v = (Mᵀ)^{−1}1 are all positive, where 1 is the vector of ones.

Under these assumptions, we will show that the following rejection sampler generates simulated values from the posterior distribution of p. First we define V to be the diagonal matrix with diagonal elements vᵀ = (v₁, …, v_K).

Algorithm 6.1 (GAM sampler)

1. Sample q from the Dirichlet distribution with parameter n + 1.
2. Sample U from Unif(0, 1).
3. Calculate p = M^{−1}V^{−1}q.
4. If U ≤ h(p|y) / ∏_{j=1}^{K} (q_j/v_j)^{n_j},
5.   accept p and stop;
6. else
7.   reject p and go back to step 1.
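
The sketch below implements Algorithm 6.1 in Python for a normal mixture with known components; the simulated data, the EM routine used for the MLE p̂, and all tuning choices are illustrative assumptions of mine. The acceptance ratio in step 4 is evaluated on the log scale to avoid underflow for large N.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: K = 3 known normal components, unknown weights p.
mu = np.array([0.0, 0.0, 2.0])
sd = np.array([1.0, 2.0, 1.0])
p0 = np.array([1/2, 1/3, 1/6])
N, K = 400, 3
comp = rng.choice(K, size=N, p=p0)
y = rng.normal(mu[comp], sd[comp])
L = np.exp(-0.5 * ((y[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# MLE p_hat by EM (the log-likelihood is concave in p).
p_hat = np.full(K, 1.0 / K)
for _ in range(500):
    w = p_hat * L
    w /= w.sum(axis=1, keepdims=True)
    p_hat = w.mean(axis=0)

a = L / (L @ p_hat)[:, None]            # a_nk = L_nk / sum_k p_hat_k L_nk
I = L.argmax(axis=1)                    # I_n, a maximiser of L_nk over k
n_j = np.bincount(I, minlength=K)       # n_j = #A_j
M = np.eye(K)                           # rows with n_j = 0 stay as in the text
for j in range(K):
    if n_j[j] > 0:
        M[j] = a[I == j].mean(axis=0)   # m_jk
v = np.linalg.solve(M.T, np.ones(K))    # v = (M^T)^{-1} 1, assumed > 0 (B)
Minv = np.linalg.inv(M)

def gam_sample():
    while True:
        q = rng.dirichlet(n_j + 1)              # step 1
        p = Minv @ (q / v)                      # step 3: p = M^{-1} V^{-1} q
        if np.all(p > 0):                       # the indicator I_X(p)
            # step 4 on the log scale: log h(p|y) - sum_j n_j log(q_j/v_j)
            log_ratio = np.log(a @ p).sum() - (n_j * np.log(q / v)).sum()
            if np.log(rng.uniform()) <= log_ratio:
                return p

draws = np.array([gam_sample() for _ in range(1000)])
print(draws.mean(axis=0), p_hat)
```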

Proposition 6.1 Under assumptions A and B, Algorithm 6.1 samples p with probability density (16).

Proof: Since a geometric average is no larger than the corresponding arithmetic average, for p ∈ X we have

h(p|y) = ∏_{n=1}^{N} Σ_{k=1}^{K} p_k a_nk = ∏_{j=1}^{K} ∏_{n∈A_j} Σ_{k=1}^{K} p_k a_nk  (17)
≤ ∏_{j=1}^{K} [Σ_{n∈A_j} Σ_{k=1}^{K} p_k a_nk / n_j]^{n_j},  (18)

where, in the case n_j = 0, the product term is taken as 1. So, for p ∈ X, with m_jk as previously defined, we have

h(p|y) ≤ ∏_{j=1}^{K} [Σ_{k=1}^{K} p_k m_jk]^{n_j}  (19)
= ∏_{j=1}^{K} v_j^{−n_j} ∏_{j=1}^{K} [v_j Σ_{k=1}^{K} p_k m_jk]^{n_j}  (20)
= ∏_{j=1}^{K} (q_j/v_j)^{n_j},  (21)

where q_j = v_j Σ_{k=1}^{K} p_k m_jk, j = 1, …, K, or equivalently q = VMp.

Since v_j > 0 and Σ_{k=1}^{K} p_k m_jk > 0, it follows that q_j > 0 for j = 1, …, K. Furthermore,

Σ_{j=1}^{K} q_j = Σ_{j=1}^{K} v_j Σ_{k=1}^{K} p_k m_jk = pᵀMᵀv = pᵀ1 = 1,

since Mᵀv = 1, from the definition of v. It follows that p ∈ X implies q ∈ X, so that

h(p|y) I_X(p) ≤ ∏_{j=1}^{K} (q_j/v_j)^{n_j} I_X(q).  (22)

Note that the right-hand side of Eq. (22) is proportional to a Dirichlet density with parameters (n₁ + 1, …, n_K + 1).

Rejection sampling then proceeds as usual:

  • A sample q is drawn from Dirichlet(n + 1).

  • The value p = M^{−1}V^{−1}q is calculated.

  • It is accepted with probability h(p|y) I_X(p) / ∏_{j=1}^{K} (q_j/v_j)^{n_j}.

We now return to assumptions A and B. Suppose that M is invertible but the elements of v = (Mᵀ)^{−1}1 are not all positive. In this case, let

α_k = (1/N) Σ_{n=1}^{N} a_nk, α* = max_k α_k, v = (α*N)^{−1}n and M̃ = α*MΔ,

where Δ is a diagonal matrix with diagonal elements (1/α₁, …, 1/α_K). Note that v = (M̃ᵀ)^{−1}1 > 0. Algorithm 6.1 and its proof can then be modified by replacing M with M̃.

Suppose now that assumption A does not hold, i.e. M is not invertible. This can be remedied by adding positive quantities to the diagonal elements of M. This also provides an alternative way of ensuring that the elements of v are positive.


A.2 Dirichlet priors and pseudo data

Suppose that the prior π₀(p) is Dirichlet(α₁ + 1, …, α_K + 1), where the α_i, i = 1, …, K, are positive integers, and let A = Σ_{j=1}^{K} α_j. The prior can be synthesized by introducing pseudo data ã_mk, m = 1, …, A; k = 1, …, K, defined as follows:

ã_mk = 1 if Σ_{j=1}^{k−1} α_j + 1 ≤ m ≤ Σ_{j=1}^{k} α_j, and ã_mk = 0 otherwise,  (23)

since

π₀(p) ∝ ∏_{k=1}^{K} p_k^{α_k} = ∏_{m=1}^{A} Σ_{k=1}^{K} ã_mk p_k.  (24)

With the Dirichlet prior, the posterior distribution given by Eq. (15) can be written as

f(p|y) ∝ ∏_{l=1}^{N+A} [Σ_{k=1}^{K} p_k ā_lk] I_X(p),  (25)

where {ā_lk}_{l=1}^{N+A} contains the real data {a_nk}_{n=1}^{N} and the pseudo data {ã_mk}_{m=1}^{A}.

The posterior distribution (25) has the same form as Eq. (16). Therefore, GAM can be used to sample realizations from the posterior distribution (25).


A.3 Simulation results and discussion

We compare running times for mixture models with different sample sizes (N) and different numbers of components (K) in Table 1. The components are normal distributions with specified means μ = (μ₁, …, μ_K) and variances σ² = (σ₁², …, σ_K²). The prior on p is uniform. We sample 10,000 realizations from the posterior of each model.

                 K = 3                K = 4                K = 6
             N = 400   N = 1000   N = 400   N = 1000   N = 400   N = 1000
Fearnhead’s  242 s     3610 s     *         *          *         *
Leydold’s    1 s       3.6 s      *         *          *         *
RoU          16:11 s   28:16 s    31:18 s   68:33 s    88:60 s   152:76 s
GAM          4 s       9 s        11 s      16 s       6 s       11 s
GAM AP       0.7472    0.7509     0.2433    0.3088     0.5325    0.5505

Table 1.

Running times (in s) for Fearnhead’s algorithm, Leydold’s algorithm, the ratio-of-uniforms (RoU) method and the GAM method, together with the estimated GAM acceptance probability (GAM AP). The * indicates that Fearnhead’s method and Leydold’s method will not run on a standard desktop when K = 4 and K = 5.

When K = 3, 4, we simulate N observations from a three-component normal mixture with μ = (0, 0, 2), σ² = (1, 4, 1) and mixture weights p₀ = (1/2, 1/3, 1/6). We then either sample from the posterior distribution of p using the same distributional components, in the case K = 3, or sample from the posterior distribution of p with an additional component having mean μ₄ = 4 and variance σ₄² = 4, in the case K = 4.

When K = 5, observations are simulated from the normal mixture distribution with components having means μ = (2, 0, 4, 2, 3), variances σ² = (1, 1, 4, 1, 4) and p₀ = (0.35, 0.3, 0.1, 0.2, 0.05). Samples from the posterior distribution of p are drawn assuming the same components. Similar calculations are carried out for K = 6, where μ = (0, 3, 2, 2, 4, 5), σ² = (1, 1, 1, 1, 1, 4) and p₀ = (0.05, 0.3, 0.3, 0.1, 0.08, 0.17), again assuming that the component distributions are known.

From Table 1, we can see that the GAM algorithm, while using very little memory, is highly efficient in running time. The last row of the table is the estimated acceptance probability of the GAM algorithm. The algorithm is very efficient when the component densities are known. We can see this not only by simulation but also from theoretical considerations, as follows.

A.3.1 Explanation of efficiency

When v = (Mᵀ)^{−1}1 > 0, we are able to use M directly, without modification, to construct the hat function, thereby speeding up the calculations. In the simulations of the previous section, this was always found to be the case. We now explain why this should be so.

If the maximum likelihood estimator p̂ of p is consistent, then as N → ∞,

(1/n_j) Σ_{n∈A_j} [L_nk / Σ_{k′=1}^{K} p̂_{k′} L_{nk′} − L_nk / Σ_{k′=1}^{K} p_{k′} L_{nk′}] →ᵖ 0,  (26)

where p here denotes the true weight vector.

Assuming sufficient regularity, we also have

m_jk = Σ_{n∈A_j} a_nk / n_j  (27)
= (1/n_j) Σ_{n∈A_j} L_nk / Σ_{k′=1}^{K} p̂_{k′} L_{nk′}  (28)
→ᵖ E[f_k(Y)/f(Y) | Y ∈ L_j]  (29)
= ∫_{L_j} f_k(y)dy / γ_j, as N → ∞,  (30)

where Y has density f(y) = Σ_{k=1}^{K} p_k f_k(y), L_j = {y : f_j(y) ≥ f_k(y), k = 1, …, K} and γ_j = ∫_{L_j} f(y)dy.

Therefore,

Mᵀ →ᵖ Wᵀ,  (31)

where

W = (w_jk), with w_jk = ∫_{L_j} f_k(y)dy / γ_j.  (32)

Let v̄ = (γ₁, …, γ_K)ᵀ. Then

Wᵀv̄ = (Σ_{j=1}^{K} ∫_{L_j} f₁(y)dy, …, Σ_{j=1}^{K} ∫_{L_j} f_K(y)dy)ᵀ = (∫f₁(y)dy, …, ∫f_K(y)dy)ᵀ = 1,  (33)

where the second equality holds because ∪_{j=1}^{K} L_j is the whole sample space. So there exists v̄ > 0 satisfying Wᵀv̄ = 1. Using Eq. (31), we can conclude that when N is large enough, there also exists v ≈ v̄ > 0 satisfying Mᵀv = 1.

Since γ_j = ∫_{L_j} f(y)dy and n_j = #A_j, we have n_j/N →ᵖ γ_j. When each n_j, j = 1, …, K, is large, if a random sample q is drawn from a Dirichlet distribution with parameter n + 1, then each q_j is approximately equal to n_j/N ≈ γ_j. Furthermore, v_j ≈ γ_j, so q satisfies

V^{−1}q ≈ 1,  (34)

and then

p = M^{−1}V^{−1}q ≈ M^{−1}1 = p̂.  (35)

If p is approximately equal to the mode p̂, the two sides of the inequality

h(p|y) = ∏_{n=1}^{N} Σ_{k=1}^{K} p_k a_nk ≤ ∏_{j=1}^{K} v_j^{−n_j} ∏_{j=1}^{K} q_j^{n_j},  (36)

are approximately equal as well. Thus, the closer the sampled realization p is to p̂, the larger the acceptance probability. So the algorithm runs very rapidly, since the sampled values of p are always around the mode p̂.

This algorithm requires calculating the MLE p̂, which can be done very quickly since the likelihood function is log-concave. In fact, an approximate guess for p̂ will suffice; the more accurate the guess, the more efficient the algorithm will be.

The method performs well when the component densities are correctly specified, as explained in the previous section. For these same reasons, we would expect the algorithm to perform poorly under misspecification. Details of robustness to misspecification can be found in [19].

References

  1. Cowles MK, Carlin BP. Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association. 1996;91:883-904
  2. Fill JA. An interruptible algorithm for perfect sampling via Markov chains. The Annals of Applied Probability. 1998;8:131-162
  3. Gilks WR, Wild P. Adaptive rejection sampling for Gibbs sampling. Applied Statistics. 1992;41:337-348
  4. Propp JG, Wilson DB. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms. 1996;9:223-252
  5. Fill JA, Machida M, Murdoch DJ, Rosenthal JS. Extension of Fill’s perfect rejection sampling algorithm to general chains. Random Structures and Algorithms. 2000;17:290-316
  6. Propp JG, Wilson DB. How to get an exact sample from a generic Markov chain and sample a random spanning tree from a directed graph, both within the cover time. Journal of Algorithms. 1998;27:170-217
  7. Aldous DJ. The random walk construction of uniform spanning trees and uniform labelled trees. SIAM Journal on Discrete Mathematics. 1990;3:450-465
  8. Wilson DB. Generating random spanning trees more quickly than the cover time. In: Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing. 1996. pp. 296-303
  9. Beskos A, Papaspiliopoulos O, Roberts GO, Fearnhead P. Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes. Journal of the Royal Statistical Society, Series B. 2006;68:333-382
  10. Beskos A, Roberts GO. Exact simulation of diffusions. The Annals of Applied Probability. 2005;15:2422-2444
  11. Beskos A, Papaspiliopoulos O, Roberts GO. A factorisation of diffusion measure and finite sample path constructions. Methodology and Computing in Applied Probability. 2008;10:85-104
  12. Dai H. A new rejection sampling method without using hat function. Bernoulli. 2017;23:2434-2465
  13. Dai H, Pollock M, Roberts G. Monte Carlo fusion. Journal of Applied Probability. 2019;56:174-191
  14. Tan A, Doss H, Hobert JP. Honest importance sampling with multiple Markov chains. Journal of Computational and Graphical Statistics. 2015;24(3):792-826
  15. Green PJ, Murdoch DJ. Exact sampling for Bayesian inference: Towards general purpose algorithms. Bayesian Statistics. 1998;6:301-321
  16. Wakefield JC, Gelfand AE, Smith AFM. Efficient generation of random variates via the ratio-of-uniforms method. Statistics and Computing. 1991;1:129-133
  17. Robert C, Casella G. Monte Carlo Statistical Methods. Springer Science & Business Media; 2013
  18. Leydold J. A rejection technique for sampling from log-concave multivariate distributions. ACM Transactions on Modeling and Computer Simulation. 1998;8(3):254-280. Available from: citeseer.nj.nec.com/leydold98rejection.html
  19. Dai H. Perfect simulation methods for Bayesian applications [PhD thesis]. University of Oxford; 2007
  20. Wilson DB. How to couple from the past using a read-once source of randomness. Random Structures and Algorithms. 2000;16:85-113
  21. Cai H. Exact sampling using auxiliary variables. In: Statistical Computing Section of ASA Proceedings. 1999
  22. Corcoran JN, Tweedie RL. Perfect sampling from independent Metropolis-Hastings chains. Journal of Statistical Planning and Inference. 2000;104(2):297-314
  23. Mira A, Møller J, Roberts GO. Perfect slice samplers. Journal of the Royal Statistical Society, Series B. 2001;63:593-606
  24. Murdoch DJ, Green PJ. Exact sampling from a continuous state space. Scandinavian Journal of Statistics. 1998;25:483-502
  25. Hörmann W, Leydold J. Improved perfect slice sampling. Technical Report. 2003
  26. Casella G, Mengersen KL, Robert CP, Titterington DM. Perfect samplers for mixtures of distributions. Journal of the Royal Statistical Society, Series B. 2002;64:777-790
  27. Philippe A, Robert CP. Perfect simulation of positive Gaussian distributions. Statistics and Computing. 2003;13:179-186
  28. Huber M. Perfect sampling using bounding chains. The Annals of Applied Probability. 2004;14:734-753
  29. Møller J. Perfect simulation of conditionally specified models. Journal of the Royal Statistical Society, Series B. 1999;61:251-264
  30. Kendall WS, Møller J. Perfect simulation using dominating processes on ordered spaces, with application to locally stable point processes. Advances in Applied Probability. 2000;32(3):844-865
  31. Fearnhead P. Perfect simulation from population genetic models with selection. Theoretical Population Biology. 2001;59(4):263-279
  32. Dai H. Exact Monte Carlo simulation for fork-join networks. Advances in Applied Probability. 2011;43(2):484-503
  33. Dai H. Exact simulation for fork-join networks with heterogeneous service. International Journal of Statistics and Probability. 2015;4(1):19-32
  34. Connor SB, Kendall WS. Perfect simulation of M/G/c queues. Advances in Applied Probability. 2015;47(4):1039-1063
  35. Djuric PM, Huang Y, Ghirmai T. Perfect sampling: A review and applications to signal processing. IEEE Transactions on Signal Processing. 2002;50(2):345-356
  36. Wilson DB. Layered multishift coupling for use in perfect sampling algorithms (with a primer on CFTP). In: Madras N, editor. Monte Carlo Methods. Vol. 26, Fields Institute Communications. American Mathematical Society; 2000. pp. 141-176
  37. Hansen NR. Geometric ergodicity of discrete-time approximations to multivariate diffusions. Bernoulli. 2003;9(4):725-743
  38. Dai H. Exact simulation for diffusion bridges: An adaptive approach. Journal of Applied Probability. 2014;51(2):346-358
  39. Agarwal A, Duchi JC. Distributed delayed stochastic optimization. In: 51st IEEE Conference on Decision and Control; Maui, Hawaii, USA. 2012
  40. Li C, Srivastava S, Dunson DB. Simple, scalable and accurate posterior interval estimation. Biometrika. 2017;104(3):665-680
  41. Minsker S, Srivastava S, Lin L, Dunson DB. Scalable and robust Bayesian inference via the median posterior. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014). 2014. pp. 1656-1664
  42. Neiswanger W, Wang C, Xing E. Asymptotically exact, embarrassingly parallel MCMC. In: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI 2014). 2014. pp. 623-632
  43. Scott SL, Blocker AW, Bonassi FV, Chipman HA, George EI, McCulloch RE. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management. 2016;11(2):78-88
  44. Srivastava S, Cevher V, Tran-Dinh Q, Dunson DB. WASP: Scalable Bayes via barycenters of subset posteriors. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2015). 2015. pp. 912-920
  45. Stamatakis A, Aberer AJ. Novel parallelization schemes for large-scale likelihood-based phylogenetic inference. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. 2013. DOI: 10.1109/IPDPS.2013.70
  46. Wang X, Dunson DB. Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605. 2013
  47. Uney M, Clark DE, Julier SJ. Distributed fusion of PHD filters via exponential mixture densities. IEEE Journal of Selected Topics in Signal Processing. 2013;7(3):521-531
  48. Hobert J, Robert C, Titterington D. On perfect simulation for some mixtures of distributions. Statistics and Computing. 1999;9:287-298
  49. Fearnhead P. Direct simulation for discrete mixture distributions. Statistics and Computing. 2005;15(2):125-133
