Running times (in s).
Perfect Monte Carlo sampling refers to sampling random realizations exactly from the target distributions (without any statistical error). Although many different methods have been developed and various applications have been implemented in the area of perfect Monte Carlo sampling, it is mostly referred by researchers to coupling from the past (CFTP) which can correct the statistical errors for the Monte Carlo samples generated by Markov chain Monte Carlo (MCMC) algorithms. This paper provides a brief review on the recent developments and applications in CFTP and other perfect Monte Carlo sampling methods.
- coupling from the past
- Monte Carlo
- perfect sampling
In the past 30 years, substantial progress has been made in popularizing Bayesian methods for the statistical analysis of complex data sets. An important driving force has been the availability of different types of Bayesian computational methods, such as Markov chain Monte Carlo (MCMC), sequential Monte Carlo (SMC), approximate Bayesian computation (ABC) and so on. For many practical examples, these computational methods can provide rapid and reliable approximations to the posterior distributions for unknown parameters.
The basic idea that lies behind these methods is to obtain Monte Carlo samples from the distribution of interest, in particular the posterior distribution. In Bayesian analysis of complex statistical models, the calculation of posterior normalizing constants and the evaluation of posterior estimates are typically infeasible either analytically or by numerical quadrature. Monte Carlo simulation provides an alternative. One of the most popular Bayesian computational methods is MCMC, which is based on the idea of constructing a Markov chain with the desired distribution as its stationary distribution.
By running a Markov chain, MCMC methods generate statistically dependent and approximate realizations from the limiting (target) distribution. A potential weakness of these methods is that the simulated trajectory of a Markov chain will depend on its initial state. A common practical recommendation is to ignore the early stages, the so-called
Concerns about the
A surprising breakthrough in the search for perfect sampling methods was made by . The method is known as
Apart from coupling from the past, many other perfect sampling methods were proposed for specific problems, for example, perfect sampling for random spanning trees [7, 8] and path-space rejection sampler for diffusion processes [9, 10, 11]. Very recently, a type of divide-and-conquer method has been developed in [12, 13]. These methods use the technique for the exact simulation of diffusions and samples from simple sub-densities to obtain perfect samples from the target distribution.
Perfect samples are useful in Bayesian applications either as a complete replacement for MCMC-generated values or as a source of initial values that will guarantee that a conventional MCMC algorithm runs in equilibrium. Perfect samples can also be used as a means of quality control in judging a proposed MCMC implementation when there are questions about the speed of convergence of the MCMC algorithm or whether the chain is capable of exploring all parts of the sample space. Of course, when perfect samples can be obtained speedily, they will be preferred to MCMC values, since they eliminate such doubts. In addition, sophisticated perfect sampling methodology often motivates efficient approximate algorithms and computational techniques. For example,  uses regenerated Markov chains to obtain simple standard error estimates for importance sampling under MCMC context. The condition required there will allow us to carry out perfect sampling via multigamma coupling approach .
This paper will present a brief review for perfect Monte Carlo sampling and explain the advantages and drawbacks of different types of methods. Section 2 will present rejection sampling techniques, and then CFTP will be covered in Section 3. Recent divide-and-conquer methods will be reviewed in Section 4. The paper ends with a discussion in Section 5.
2. Rejection sampling techniques
Rejection sampling, also known as ‘acceptance-rejection sampling’, generates realizations from a target probability density function by using a hat function , where and are a probability density function from which samples can be readily simulated. The basic rejection sampling algorithm is as follows:
|Sample from and from Unif.||01|
|accept as a realisation of and stop;||03|
|reject the value of and go back to step 01.||05|
Many other perfect sampling methods are actually equivalent to rejection sampling. For example, ratio-of-uniform (RoU) method  may have to be implemented via a rejection sampling approach.
The efficiency of rejection sampling depends on the acceptance probability, which is . To perform rejection sampling efficiently, it is very important to find hat functions which provides higher acceptance probabilities. In other words, we shall choose as small as possible ; however for many complicated problems, there is no easy way to find small enough to guarantee high acceptance probability. A number of sophisticated rejection sampling methods have been suggested for dealing with complex Bayesian posterior densities, which we discuss below.
2.1 Log-concave densities
A function is called log-concave if
for all and . For the special class of log-concave densities, Gilks and wild  developed the adaptive rejection sampling (ARS) method. The method constructs an envelope function for the logarithm of the target density, , by using tangents to at an increasing number, , of points. The envelope function is the piecewise linear
|Outputs a stream of perfect samples from .|
|Initialise and by using tangents at several points.||01|
|Sample from density and||02|
|If , Output ;||03|
|else if ,||04|
|Output ; Update to using ;||05|
The ARS algorithm is adaptive and the sampling density is modified whenever is evaluated. In this way, the method becomes more efficient as the sampling continues. Leydold  extends ARS to log-concave multivariate densities. Leydold’s algorithm is based on decomposing the domain of the density into cones and then computing tangent hyperplanes for these cones. Generic computer code for sampling from a multivariate log-concave density is available on the author’s website; it is only necessary to code a subroutine for the target density. Leydold’s algorithm works well for simple low-dimensional densities. The drawback of ARS algorithm is that it only works for log-concave densities, which is a very small class of posteriors in practice. Also computationally it is very challenging to implement ARS algorithm for high-dimensional densities .
2.2 Fill’s rejection sampling algorithm
We consider a discrete Markov chain with transition probability and stationary distribution . Let be the transition probability for the time-reversed chain. Suppose that there is a partial ordering on the states , denoted by . Let and be unique extremal points of the partial ordering.
To demonstrate the algorithm given by , we will assume that there are update functions and both mapping to such that
The algorithm is as follows:
|1. Choose an integer . Fix and .|
|2. Generate , , where are i.i.d. Unif.|
|3. Generate from the conditional distribution of given .|
|4. Generate .|
|5. If then accept ; else double and repeat from step 2.|
In Algorithm 2.3 (step 2) is sampled from . If we find an upper bound for , then we can use rejection sampling. Fill  finds a bounding constant given by and proves that steps from 3 to 5 of Algorithm 2.3 are to accept with probability . The output of this algorithm is indeed a perfect sample from .
From Algorithm 2.3, we can see that rejection sampling can still be possible, even if we do not have a closed form of the hat function. The first limitation of Algorithm 2.3 is that it works only if the time-reversed chain is monotone, but  has extended the algorithm theoretically for general chains. The second limitation is that step 3 of Algorithm 2.3 is difficult to perform . For these reasons, Fill’s algorithm is not practical for realistic problems.
3. Coupling from the past
Coupling from the past was introduced in the landmark paper of  which showed how to provide perfect samples from the limiting distribution of a Markov chain.
3.1 Basic CFTP algorithms
Let be an ergodic Markov chain with state space , where the probability of going from to is and the stationary distribution is . Suppose we design an updating function , which satisfies , where is a deterministic function and is a random variable. To simulate the next state of the Markov chain, currently in state , we draw a random variable and let .
Let , and define the composition
Proposition 3.1  Suppose there exists a time , the backward coupling time, such that chains starting from any state in , at time , and with the same sequence , arrive at the same state . Then it must follow that comes from .
This proposition is easy to prove. If we run an ergodic Markov chain from time and with the sequence after , the Markov chain will arrive at . Then comes exactly from since it is collected at time and the Markov chain started from . The algorithm is as follows:
|until is a constant||06|
Propp and Wilson  also proved that this algorithm is certain to terminate. The idea of simulating from the past is important. Note that if we collect as the result, where is the smallest value that makes a constant, we will get a biased sample. This is because is a stopping time, which is not independent of the Markov chain.
3.2 Read-once CFTP
The basic CFTP algorithm needs to restart the Markov chains at some points in the past if they have not coalesced by time . We must use the same random sequence when we restart the Markov chains. In Monte Carlo simulations, we usually use
If we can run the Markov chain forward starting at and collect a perfect sample in the future, we will not have to regenerate . Wilson  developed a read-once CFTP method to implement the forward coupling idea. A simple example is provided by . In fact, the multigamma coupler in  can be implemented via the more efficient read-once CFTP algorithm.
3.3 Improvement of CFTP algorithms
Propp and Wilson  showed that the computational cost of the algorithm can be reduced if there is a partial order for the state space that is preserved by the update function , i.e. if then . Their procedure is outlined in Algorithm 3.2, whereas before and are the unique extremals. Note that their algorithm needs a monotone update function for the Markov chain, while Algorithm 2.3 requires a monotone update function for the time-reversed chain.
|until upper lower||09|
Algorithm 3.2 is much simpler than Algorithm 3.1, since only two chains have to be run at the same time, but the requirement of monotonicity is very restrictive. Markov chains with transitions given by independent Metropolis-Hastings and perfect slice sampling have been shown to have this property, by [22, 23], respectively. However [23, 24] have also noticed that such independent M-H CFTP is equivalent to simple rejection sampler.
In general it is hard to code perfect slice samplers correctly. For example, Hörmann and Leydold  have pointed out that the perfect slice samplers in [26, 27] are incorrect. The challenge of monotone CFTP is usually to construct the detailed updating function with a guarantee of preserving the partial order.
Finding a partial order preserved by the Markov chains is a non-trivial task in many cases. An alternative improvement is to use CFTP with bounding chains, such as that in [28, 29]. If the bounding chains, which bound all the Markov chains, coalesce, then all Markov chains coalesce. Thus if only a few bounding chains are required, the efficiency of the CFTP algorithm can be improved significantly. Sometimes, it may be impossible to define an explicit bounding chain (the maximum of the state space may be infinity, and the upper bound chain cannot start from infinity), but it is possible to use a dominated process to bound all Markov chains .
3.4 Applications and challenges
Although CFTP is extremely challenging to be implemented for many practical problems, it did find a few applications in certain discrete state space problems, for example, the Ising model . Also  applied CFTP to ancestral selection graph to simulate samples from population genetic models. Refs. [32, 33] applied CFTP to a class of fork-join type queuing system problems. Connor and Kendal  applied CFTP for the perfect simulation of M/G/c queues. CFTP also finds its application in signal processing .
CFTP for continuous state space Markov chains is very challenging, since a random map from an interval to a finite number of points is required. In recent years, many methods have been developed for unbounded continuous state space Markov chains, such as perfect slice sampler in , multigamma coupler and the bounded M-H coupler in [15, 24]. Wilson  developed a layered multi-shift coupling, which shifts states in an interval to a finite number of points. However, none of these methods can solve any practical problems.
4. Recent advances in perfect sampling
Recently, a new type of perfect Monte Carlo sampling method based on the decomposition of the target density , as , was proposed in , where and are also (proportional to) proper density functions. Note that here and are continuous density functions which are easy to simulate from. Suppose that -dimensional vector values and are independently drawn from and , respectively. If the two independent samples are equal, i.e. then we have must be from . Note that such a naive approach may be practical for discrete random variables with low-dimensional state space, but for continuous random variables, it is impossible since . Dai  proposed a novel approach to deal with this, which is explained in the following subsection.
4.1 Perfect distributed Monte Carlo without using hat functions
First we introduce the following notations related to the logarithm of the sub-densities:
where is the partial derivative operator for each component of . Then we consider a -dimensional diffusion process (), defined on the -dimensional continuous function space , given by:
where is a Brownian motion and is a typical element of . Let be the probability measure for a Brownian motion with the initial probability distribution .
Clearly has the invariant distribution (using the Langevin diffusion results ). Let be the probability law induced by , with , i.e. under we have for any .
The idea in  is to use a
Definition 4.1 (Biased Langevin diffusions) The joint density for the pair (the starting and ending points of the biased diffusion process), evaluated at point , is , where is the transition density for the diffusion process defined in Eq. (6) from to and .
Given the process is given by the diffusion bridge driven by Eq. (6).
The marginal distribution for is . Therefore, to draw a sample from the target distribution , we need to simulate a process from (the law induced by ) and then .
Simulation from can be done via rejection sampling. We can use a
Definition 4.2 (Biased Brownian motion) The starting and ending points follow a distribution with a density , and is a Brownian bridge given .
Under certain mild conditions, Dai  proved the following lemma.
Lemma 4.1 Let be the probability law induced by . By letting
where div is the divergence operator.
Condition 4.1 There exists such that
Under Condition 4.1 the ratio (8) can be rewritten as
which has a value no more than .
Therefore we can use rejection sampling to simulate from , with proposal measure . This acceptance probability (10) can be dealt with using similar methods as that in [9, 11]. The algorithm is presented below; see [12, 38] for more details.
|Simulate from density||01|
|Simulate the biased Brownian bridge||02|
|Accept as a sample from , with probability (6); If rejected, go back to step 01.||03|
Such a method is a rejection sampling algorithm, but it does not require finding a hat function to bound the target density, which is usually the main challenge of the traditional rejection sampling for complicated target densities. The above algorithm uses as the proposal density function, which does not have to bound the target . However, it requires a bound for the derivatives of the logarithm of the sub-densities (see Condition 4.1). This is usually easier to get in practice, since the logarithm of the posterior is usually easy to deal with. Also  noted that we should choose sub-densities and as similar as possible, in order to achieve high acceptance probability.
Dai  focused on the simple decomposition of , although it mentioned that for more general decomposition of , a recursive method can be used. Unfortunately, a naive recursive method is very inefficient. A more sophisticated method is introduced in the following section.
4.2 Monte Carlo fusion for distributed analysis
A more efficient and sophisticated methods were proposed recently in , named as Monte Carlo fusion. Suppose that we consider
where each () is a density (up to a multiplicative constant). Here denotes the number of parallel computing cores available in big data problems, and each means the sub-posterior density based on a subset of the big data. In group decision problems, means the number of different decisions which should be combined and stands for the decision from each group member.
Dai et al.  considered simulating from the following density on extended space,
which admits the marginal density for . Here is the transition density from to for the Langevin diffusion defined in Eq. (6) associated with each sub-density .
Dai et al.  considered a rejection sampling approach with proposal density proportional to the function
where and is an arbitrary positive constant.
Simulation from the proposal can be achieved directly. In particular, are first drawn independently from , respectively, and then is simply a Gaussian random variable centred on . This is a distributed analysis or divide-and-conquer approach. Detailed acceptance probabilities and rejection sampling algorithms can be found in .
The above fusion approach arises in modern statistical methodologies for ‘big data’. A full dataset will be artificially split into a large number of smaller data sets, and inference is then conducted on each smaller data set and combined (see, for instance, [39, 40, 41, 42, 43, 44, 45, 46]). The advantage for such an approach is that inference on each small data set can be conducted in parallel. Then the heavy computational cost of algorithms such as MCMC will not be a concern. Traditional methods suffer from the weakness that the fusion of the separately conducted inferences is inexact. However, the Monte Carlo fusion in  is an exact simulation algorithm and does not have any approximation weakness.
The above fusion approach also arises in a number of other settings, where distributed analysis came naturally. For example, in signal processing, distributed multi-sensor may be used for network fusion systems. Fusion approach arises naturally to combine results from different sensors .
Although perfect simulation usually refers to correcting the statistical errors for the samples drawn via MCMC, it actually covers a much wider area beyond CFTP. In fact for certain applications, it is often possible to construct other types of perfect sampling methods which are much more efficient than CFTP. For example, for the exact simulation of the posterior of simple mixture models, the geometric-arithmetic mean (GAM) method in  is much more efficient than CFTP in . Details of GAM method is provided in Appendix. Also the random walk construction for exact simulation for random spanning trees  is much more efficient than the CFTP version.
Bayesian computational algorithms keep evolving, in particular under the current big data era. Although almost all newly developed algorithms are approximate simulation algorithms, perfect sampling is still one of the key wheel-driven forces for new Bayesian computational algorithms, and they usually can quickly motivate new class of ‘mainstream’ algorithms. More focus should be given to methods beyond CFTP, for example, the fusion type of algorithms.
The Monte Carlo fusion method has the potential to be used in many Bayesian big data applications. For example, for large car accident data, the response variable is usually a categorical variable representing the seriousness of the accident, and generalized linear regression model is often used. Under a Bayesian framework, we may estimate the posterior distribution for the regression parameters via such a fusion approach. Then the posterior mean, the posterior median, or other characteristics of the posterior distribution can be estimated using the Monte Carlo samples. Also such an algorithm is perfect sampling algorithm, and no convergence justification is needed, since it always provided realizations exactly from the target distribution.
Observations from a simple mixture model are assumed to be either discrete or continuous. The density function of an individual observation has the form
We assume that the component weights are unknown parameters and the number of components, , and the component densities, , are all known. We focus on the perfect sampling from the posterior distribution of .
Suppose that we have observations, . Let and assume that the prior distribution of is Dirichlet:
Then the posterior distribution is given by
There are several ways to carry out perfect sampling from Eq. (16). The first method is based on CFTP . An alternative perfect sampling method for simple mixture models is introduced by . The third approach is to use adaptive rejection sampling [3, 18], since the posterior is log-concave. We may also use the ratio-of-uniform method. However, none of these methods are more efficient than the geometric–arithmetic mean method in .
A.1 Geometric-arithmetic mean method
Suppose that , the MLE of is unique and for simplicity, assume the prior is uniform. Define . The posterior density of is then given by
where is defined in (16).
Let be a random element of arg . Define and let where is the number of elements in .
Define with . If , then set and for . We now make two assumptions, which we will return to later on:
Under these assumptions, we will show that the following rejection sampler generates simulated values from the posterior distribution of . First we define to be the diagonal matrix with diagonal elements .
|Sample from the Dirichlet distribution with parameter .||01|
|Sample from Unif.||02|
|Calculate with .||03|
|Accept and stop;||05|
|reject and go to 01.||07|
Proposition 6.1 Under assumptions
where in the case , the product term is taken as . So that, for , with as previously defined, we have
where or equivalently .
Since and , it follows that for . Furthermore
since , from the definition of . It follows that implies , so that
Note that the right-hand side of Eq. (22) is proportional to a Dirichlet distribution with parameters .
Rejection sampling then proceeds as usual:
A sample is drawn from Dirichlet.
The value is calculated.
It is accepted with probability .
We now return to assumptions
where is a diagonal matrix with diagonal elements . Note that . Algorithm 6.1 and its proof can then be modified by replacing by .
Suppose now that assumption
A.2 Dirichlet priors and
Suppose that the prior is Dirichlet, where are positive, integers and let . The prior can be synthesized by introducing
With the Dirichlet prior, the posterior distribution given by Eq. (16) can be written as
where contains the real data and the pseudo data .
A.3 Simulation results and discussion
We compare the running time of mixture models with sample sizes () and different number of components () in Table 1. The components have specified normal distributions with means and variances . The prior on is uniform. We sample 10,000 realizations from the posterior of the models.
|Fearnhead’s||242 s||3610 s||*||*||*||*|
|Leydold’s||1 s||3.6 s||*||*||*||*|
|RoU||16:11 s||28:16 s||31:18 s||68:33 s||88:60 s||152:76 s|
|GAM||4 s||9 s||11 s||16 s||6 s||11 s|
When , we simulate observations from a three-component normal mixture with , and mixture weight . We then either sample from the posterior distribution of using the same distributional components in the case or sample from the posterior distribution of with an additional component having mean and variance , in the case .
When , observations are simulated from the normal mixture distribution with components having means , variances and . Samples from the posterior distribution of are drawn assuming the same components. Similar calculations are carried out for , where , variances and , again assuming that the component distributions are known.
From Table 1, we can see that the GAM algorithm, while using very little memory, is highly efficient in running time. The last row of the table is the estimated acceptance probability of the GAM algorithm. The algorithm is very efficient when the component densities are known. We can see this not only by simulation but also from theoretical considerations, as follows.
A.3.1 Explanation of efficiency
When , we are able to use directly without modification to construct the hat function, thereby speeding up the calculations. In the simulations of the previous section, this was always found to be the case. Now we explain why this should be so.
If the maximum likelihood estimator of is consistent, then when ,
Assuming sufficient regularity, we also have
where has density , and .
Let , then
where the second equal sign is because . So, there exists , satisfying . Using Eq. (32), we can conclude that when is large enough, there also exists , satisfying .
Since and , we have When each is large, if a random sample is drawn from a Dirichlet distribution with parameter , then each is approximately equal to . Furthermore, , so satisfies
If is approximately equal to the mode , the two sides of the inequality,
are approximately equal as well. Thus, the closer the sampled realization is to , the larger the acceptance probability is. So the algorithm runs very rapidly, since the sampled values of are always around the mode .
This algorithm requires calculating the MLE, which can be performed very quickly since the likelihood function is log-concave. In fact an approximate
The method performs well when the component densities are correctly specified, as explained in the previous section. For these same reasons, we would expect the algorithm to perform poorly under misspecification. Details of robustness to misspecification can be found in .