Open access peer-reviewed chapter

Bayesian Methods and Monte Carlo Simulations

Written By

Pavel Loskot

Reviewed: 21 October 2022 Published: 19 November 2022

DOI: 10.5772/intechopen.108699

From the Edited Volume

Numerical Simulation - Advanced Techniques for Science and Engineering

Edited by Ali Soofastaei

Abstract

Bayesian methods provide the means for studying probabilistic models of linear as well as non-linear stochastic systems. They allow tracking changes in probability distributions by applying Bayes’s theorem and the chain rule for factoring the probabilities. However, an excessive complexity of resulting distributions often dictates the use of numerical methods when performing statistical and causal inferences over probabilistic models. In this chapter, the Bayesian methods for intractable distributions are first introduced as sampling, filtering, approximation, and likelihood-free methods. Their fundamental principles are explained, and the key challenges are identified. The concise survey of Bayesian methods is followed by outlining their applications. In particular, Bayesian experiment design aims at maximizing information gain or utility, and it is often combined with an optimum model selection. Bayesian hypothesis testing introduces optimality in the data-driven decision making. Bayesian machine learning assumes data labels to be random variables. Bayesian optimization is a powerful strategy for configuring and optimizing large-scale complex systems, for which conventional optimization techniques are usually ineffective. The chapter is concluded by examining Bayesian Monte Carlo simulations. It is proposed that augmented Monte Carlo simulations can achieve explainability and also provide much better information efficiency.

Keywords

  • Bayesian analysis
  • distribution
  • Monte Carlo
  • numerical method
  • machine learning
  • optimization
  • posterior
  • prior
  • simulation
  • statistical inference

1. Introduction

Many real-world systems exhibit some level or form of randomness. This is reflected in their mathematical models, which are often probabilistic. There are two basic strategies to interpret the randomness captured by probabilistic models. The frequency of occurrence of a random phenomenon is a measurable quantity that can be used to predict how likely the phenomenon is to be observed in the future. The other approach quantifies the uncertainty about the occurrence of a random phenomenon more subjectively, that is, as a degree of belief or expectation of observing the phenomenon, given domain knowledge and past experience. This latter approach has led to a broad area of general Bayesian methods [1], Bayesian data analyses [2], Bayesian signal processing [3], Bayesian regression [4], Bayesian machine learning [5], and Bayesian optimization [6]. In the tasks of statistical inference, the assumption of knowing (or not) a prior distribution crucially affects the feasibility as well as the structure of estimators [7]. There is also an intimate connection between Bayesian probabilistic models and making causal inferences [8], as will be discussed in Section 3.

Bayesian methods have found widespread applications in many probabilistic modeling frameworks. These methods are all rooted in the surprisingly simple Bayes's theorem, that is,

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \qquad (1)$$

where p denotes the probability density (for continuous random variables) or the probability mass (for discrete random variables). Eq. (1) quantifies how our belief about the parameter θ changes from its prior, p(θ), to the posterior, p(θ|x), after observing data x. The conditional term, p(x|θ), represents the likelihood of the parameter θ, given the observations, x. The scaling term, p(x), in Eq. (1) is usually referred to as the evidence. It should be noted that both the parameter and the data can be assumed to have an arbitrary number of dimensions.

The majority of tasks in Bayesian inference involves explicitly or implicitly solving one of the following integrals (or sums, for discrete random variables), that is,

$$\text{marginalization:}\quad p(x) = \int p(x, \theta)\, d\theta \qquad (2)$$
$$\text{summarization:}\quad \mathrm{E}[f(x) \mid \theta] = \int f(x)\, p(x \mid \theta)\, dx \qquad (3)$$
$$\text{prediction:}\quad p(x_{t+1}) = \int p(x_{t+1} \mid x_t)\, p(x_t)\, dx_t, \qquad p(x_{t+1} \mid x_t) = \int p(x_{t+1} \mid \theta)\, p(\theta \mid x_t)\, d\theta. \qquad (4)$$

Unfortunately, in all but a few real-world scenarios, the expressions (2)–(4) are not mathematically tractable. In particular, the distributions often involve multiple sums and/or integrals, and their closed-form expressions cannot be obtained. The distributions are sometimes only known up to a scaling constant. Then, even numerically computing Eq. (1) can be rather challenging, since many complex distributions are multi-modal with a large number of local minima and maxima. Moreover, in online data processing, the distributions must be updated continuously as soon as new data arrive.

The rest of this chapter is organized as follows. The strategies for performing Bayesian inferences with intractable distributions are outlined in Section 2 including sampling, filtering, approximation, and likelihood-free methods. The applications of Bayesian inferences are discussed in Section 3 including Bayesian experiment design, Bayesian hypothesis testing, Bayesian machine learning, and Bayesian optimization. Bayesian Monte Carlo simulations are reviewed in Section 4. Although the chapter mostly reviews known concepts and frameworks in Bayesian analysis of probabilistic models, Section 4 contributes a description of augmented Monte Carlo simulations. These simulations aim at providing explainability and improving information gain.

The references cited at the end of this chapter are by no means comprehensive; rather, they are suggestions of initial readings where to find further information about the topics discussed in this chapter.

2. Bayesian inference with intractable distributions

Bayesian inference assumes that prior knowledge is available, so it should be exploited to improve the estimation accuracy. Such knowledge is often represented as a prior distribution of the parameters to be estimated. It can be provided by the domain expertise or otherwise be subjectively selected. As already mentioned, non-linear and high-dimensional models are rarely tractable analytically. This section presents four basic strategies for dealing with intractable distributions in Bayesian inference. In particular, the sampling methods define a proposal distribution, which is much easier to sample from. Most of these methods form a Markov chain of samples, which gradually converges to the desired distribution. The filtering methods update the estimates iteratively, so they are particularly suitable for streaming data. These methods are also more efficient in higher dimensions than the pure sampling methods. The approximate methods strive to improve the efficiency, even at the cost of estimation accuracy. The last group is the likelihood-free methods, which are particularly suited for parameter estimation using simulations.

2.1 Sampling methods

Consider, for example, the expectation (3). The integral (3) may be not only mathematically intractable but also numerically intractable, if the distribution p(x) is difficult or even impossible to sample from. A common strategy then is to choose and sample from another, simpler distribution, q(x), such that q(x) > 0 whenever f(x)p(x) > 0. The resulting method is known as importance sampling (IS), and q(x) is referred to as the proposal distribution. Eq. (3) is then rewritten as

$$\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)\, \underbrace{\frac{p(x_i)}{q(x_i)}}_{w_i} \qquad (5)$$

where the samples x_i are drawn from the proposal q(x), and the sum is an unbiased and consistent estimator of the expectation due to the law of large numbers.

Although the estimate (5) is unbiased and consistent, the number of samples required grows exponentially with the number of dimensions. It is also desirable to make the weights, w_i, more uniform; that is, q(x) should approximate p(x) in order to reduce the variance of the estimator in (5). Alternatively, the distribution p(x) could be approximated by a histogram; the bins of such a histogram can be optimized for maximum efficiency. The histogram can be sampled directly using a uniform random number generator.
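
As a minimal numerical sketch of the IS estimator (5), the following Python snippet estimates E[x^2] under a standard normal target using a heavier-tailed Student-t proposal; the target, integrand, and proposal are purely illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Target density p(x): standard normal; integrand f(x) = x**2, so E[f] = 1.
p = stats.norm(loc=0.0, scale=1.0)
f = lambda x: x**2

# Proposal q(x): heavier-tailed Student-t, easy to sample from and
# strictly positive wherever f(x) p(x) > 0.
q = stats.t(df=3)

N = 100_000
x = q.rvs(size=N, random_state=rng)      # draw samples from the proposal
w = p.pdf(x) / q.pdf(x)                  # importance weights w_i = p(x_i)/q(x_i)
estimate = np.mean(w * f(x))             # Eq. (5): (1/N) * sum_i w_i f(x_i)

print(f"IS estimate of E[x^2] = {estimate:.4f} (exact value 1.0)")
```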

The sampling efficiency of the IS methods, particularly in high dimensions, can be improved by accounting for the likelihood of samples. The Markov chain Monte Carlo (MCMC) strategy represents a broad variety of sampling methods, such that the current sample is affected by the location of the previous sample. It allows learning complex target distributions in multiple dimensions. However, the MCMC sampling requires a burn-in period for achieving the convergence. Furthermore, the samples may get stuck in areas of small probability (requiring a random restart), and the correlation between consecutive samples increases the estimator variance.

Metropolis–Hastings sampling is the most commonly used MCMC algorithm. It can be succinctly described as the following three steps:

  1. Given the current state, x_t, sample a new state, x_{t+1}, from the proposal q(x_{t+1} | x_t);

  2. Calculate the acceptance probability α(x_{t+1}, x_t) of the new state x_{t+1};

  3. If the proposal is symmetric, q(x_t | x_{t+1}) = q(x_{t+1} | x_t), then α(x_{t+1}, x_t) = min{1, π(x_{t+1})/π(x_t)}; accept the new state x_{t+1} with probability α(x_{t+1}, x_t), and otherwise remain at the current state, that is, x_{t+1} ← x_t.

The challenge is to find a good proposal, q, with a bounded support to achieve fast convergence to the target distribution, π(x). The target distribution only needs to be known up to a proportionality constant. The generated samples tend to concentrate in the areas of large probability. The generated samples occasionally contain sub-sequences of the same value. More importantly, as with any other sampling methods, there is an exploration–exploitation trade-off. A large acceptance rate means slow exploration of the target distribution, π, as well as large correlations between the samples. On the other hand, a small acceptance rate means large jumps across the support of π. The acceptance rate can be controlled by assuming a parameterized proposal, q_φ. An acceptance rate of about 50% is recommended for samples in a few dimensions, and it should be reduced to about 25% for distributions in many dimensions.
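
The three steps above can be sketched in a few lines of Python; the bimodal target below is known only up to a scaling constant, and the Gaussian random-walk step size and burn-in length are illustrative tuning choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def unnormalized_target(x):
    # Bimodal target pi(x), known only up to a proportionality constant.
    return np.exp(-0.5 * (x - 2.0)**2) + 0.5 * np.exp(-0.5 * (x + 2.0)**2)

def metropolis_hastings(n_samples, step=1.0, burn_in=1000):
    x = 0.0                                 # arbitrary starting state
    samples = []
    for t in range(n_samples + burn_in):
        x_new = x + step * rng.normal()     # symmetric random-walk proposal
        # Acceptance probability alpha = min(1, pi(x_new)/pi(x)).
        alpha = min(1.0, unnormalized_target(x_new) / unnormalized_target(x))
        if rng.uniform() < alpha:
            x = x_new                       # accept the proposed state
        if t >= burn_in:                    # discard the burn-in period
            samples.append(x)
    return np.array(samples)

samples = metropolis_hastings(50_000)
print("posterior mean estimate:", samples.mean())
```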

In the literature, many variants of the Metropolis–Hastings sampling algorithm were proposed, for example, combining random walk sampling, adaptive rejection sampling, Langevin algorithm, and augmented estimator.

Gibbs sampling iteratively samples from the conditional densities

$$X_1^{(t)} \sim p_1(x_1 \mid x_2^{(t)}, x_3^{(t)}, \ldots, x_N^{(t)})$$
$$X_2^{(t)} \sim p_2(x_2 \mid x_1^{(t)}, x_3^{(t)}, \ldots, x_N^{(t)})$$
$$\vdots$$
$$X_N^{(t)} \sim p_N(x_N \mid x_1^{(t)}, x_2^{(t)}, \ldots, x_{N-1}^{(t)}). \qquad (6)$$

It can be represented as a set of N univariate Metropolis–Hastings samplers. The sampling dimensions can be chosen systematically or at random. Even though Gibbs sampling has 100% acceptance rate, it can still experience a slow convergence to the target distribution, so it is often combined with other sampling methods.
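
As a concrete illustration of the scheme (6), the sketch below runs a Gibbs sampler for a bivariate Gaussian whose full conditionals are available in closed form; the correlation value is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8                      # correlation of the illustrative bivariate Gaussian
T = 20_000

x1, x2 = 0.0, 0.0              # arbitrary initial state
samples = np.empty((T, 2))
for t in range(T):
    # Full conditionals of a standard bivariate Gaussian with correlation rho:
    # x1 | x2 ~ N(rho*x2, 1-rho^2) and x2 | x1 ~ N(rho*x1, 1-rho^2).
    x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho**2))
    samples[t] = (x1, x2)

print("sample correlation:", np.corrcoef(samples.T)[0, 1])  # should approach rho
```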

Gibbs sampling motivates the Rao–Blackwell estimators

$$\mathrm{E}[h(x_1)] \approx \frac{1}{T}\sum_{t=1}^{T} \mathrm{E}\big[h(x_1) \mid x_2^{(t)}, \ldots, x_N^{(t)}\big] \qquad (7)$$
$$p_i(x_i) \approx \frac{1}{T}\sum_{t=1}^{T} p_i\big(x_i \mid x_j^{(t)},\, j \ne i\big) \qquad (8)$$

of function expectation and of the marginal distribution, respectively. These estimators are unbiased and have lower variance, since

$$\mathrm{var}\big(\mathrm{E}\big[h(x_1) \mid x_2^{(t)}, \ldots, x_N^{(t)}\big]\big) \le \mathrm{var}\big(h(x_1)\big). \qquad (9)$$

An alternative strategy combines importance sampling with MCMC sampling assuming independent proposals, ∏_{i=1}^N q_i(x_i) or ∏_{i=1}^N q_t(x_i | x_i^{(t−1)}).

An interesting problem is how to design MCMC sampling when the target distribution is not stationary. The adaptive MCMC sampling can be trained online with the generation of new data. However, using past samples for adaptation invalidates the Markov assumption, so the convergence to the target distribution may be problematic, or the adaptation should cease after the burn-in period.

The Hamiltonian MCMC sampling tracks movements of a hypothetical ball under the potential and kinetic energy constraints. The new sample representing the updated ball location is obtained by integrating the ball's speed. The integration can be approximated, for example, by discretization. More importantly, the interval of integration trades off the acceptance rate against the sampling rate, and hence also the amount of correlation between subsequent samples and the waiting time. The main advantage of the Hamiltonian MCMC method is fast mixing; that is, the samples converge quickly to the target distribution.

Another sampling method motivated by models in statistical physics is the restricted Boltzmann machine (RBM). This method learns the target distribution as a configuration, h, of a bipartite graph. The target distribution is p(h) = exp(−E(h))/Z, where E(h) is the energy of the configuration h, and Z = Σ_h exp(−E(h)) is a partition (scaling) function.

The reversible-jump MCMC sampling can be used for distributions having an uncertain number of parameters (the model order). The idea is to increase the number of parameters by 1 with a certain probability every time the sample is taken from the proposal distribution.

The common pitfall of all sampling methods is how to recognize that the generated samples are not converging to the target distribution, for example, since the assumptions were violated, or the method has not been implemented correctly. These issues may not be easily detected by simply evaluating the samples or the simulation outputs.

2.2 Filtering methods

Processing time-series and streaming data must often be done recursively. The corresponding data model can usually be represented as a dynamic system, so the current state, x_t, is updated from the previous state, x_{t−1}, by an innovation random process, and the states, x_t, are observed as values, z_t. The states form a Markov chain, that is, p(x_t | x_{t−1}, …, x_1) = p(x_t | x_{t−1}), and the observations are assumed to be conditionally independent given the state, that is, p(z_i, z_j | x_t) = p(z_i | x_t) p(z_j | x_t). Applying Bayes's theorem,

$$p(x_{1:t} \mid z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_{1:t} \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}, \qquad (10)$$

the states can be estimated recursively from the incoming observations as a two-step procedure. The first step predicts the (distribution of) current state as

$$p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}. \qquad (11)$$

The predicted state is corrected in the second step after receiving the latest observation, z_t, as

$$p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})}{\int p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})\, dx_t}. \qquad (12)$$

For linear systems with Gaussian inputs, the processes remain Gaussian, so only their mean values, E[x_{1:t} | z_{1:t}], and covariances, cov(x_{1:t} | z_{1:t}), need to be tracked. In such a case, the recursive estimator implementing (11) and (12) is known as the Kalman filter. The Kalman filtering can be succinctly described as

$$\hat{x}_t = (\text{prediction of } x_t) + K_t\,\big(z_t - \text{prediction of } z_t\big) \qquad (13)$$

where K_t is the Kalman gain at time t. It can be shown that the Kalman filter is unbiased, and it is optimum in the sense of having the smallest possible variance; it belongs to the class of best linear unbiased estimators (BLUE). Practical implementations of the Kalman filter are usually concerned with numerical stability and process observability. The posterior distribution of the model states can be estimated using kernel and other smoothing methods.
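
A minimal scalar Kalman filter implementing the predict and correct recursion (11)–(13) is sketched below; the random-walk state model and the noise variances are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative scalar model: x_t = x_{t-1} + w_t, z_t = x_t + v_t.
q_var, r_var = 0.01, 1.0                   # process and measurement noise variances
T = 200
x_true = np.cumsum(rng.normal(0, np.sqrt(q_var), T))
z = x_true + rng.normal(0, np.sqrt(r_var), T)

x_hat, p_var = 0.0, 1.0                    # initial state estimate and its variance
estimates = []
for z_t in z:
    # Prediction step, Eq. (11): propagate the mean and variance through the model.
    x_pred, p_pred = x_hat, p_var + q_var
    # Correction step, Eqs. (12)-(13): the Kalman gain weighs the innovation.
    K = p_pred / (p_pred + r_var)
    x_hat = x_pred + K * (z_t - x_pred)
    p_var = (1.0 - K) * p_pred
    estimates.append(x_hat)

rmse = np.sqrt(np.mean((np.array(estimates) - x_true)**2))
print("filtering RMSE:", rmse)
```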

The Kalman filter can be modified to work with non-linear models. In particular, the extended Kalman filter performs local linearization at each iteration assuming the first-order Taylor expansion. This can, however, cause large estimation errors and stability issues. The unscented Kalman filter defines a set of sigma points, which are propagated through the non-linearity. The weighted and transformed sigma points are used to reconstruct the mean and the covariance of the Gaussian distribution, so the canonical Kalman filter can then be used. The ensemble Kalman filter has been adopted for large but sparse systems. Expectation maximization (EM) filtering is used when there are also unknown model parameters; see Eqs. (16) and (17) below.

In practice, having an approximate solution for a complex model is often much more useful than obtaining the exact solution of a simplified model. This has motivated the development of particle filters to estimate random signals under non-linear and non-Gaussian conditions. The key idea is to represent the posterior distribution by a set of random weighted samples (called particles), x̃_i, that is,

$$p(x) \approx \sum_{i=1}^{N} w_i\, \delta(x - \tilde{x}_i), \qquad \mathrm{E}[f(x)] \approx \sum_{i=1}^{N} w_i\, f(\tilde{x}_i) \qquad (14)$$

where δ denotes the Dirac-delta function. However, the number of required particles grows quickly with the number of dimensions.

The particles should be chosen to represent high-density regions by many samples with large weights. Provided that the particles are generated using some MCMC sampling strategy, the resulting recursive Bayesian estimator is referred to as the sequential MC (SMC) method. Since after only a few iterations most particles often have negligible weights (the so-called particle degeneracy problem), particle resampling (with replacement and proportionally to the weights) is necessary in order to recover a good representation of the underlying distribution while maintaining a constant number of particles.
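
The sketch below implements a bootstrap particle filter, that is, an SMC method using the state transition as the proposal and multinomial resampling at every step; the non-linear toy model and its noise levels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

T, N = 100, 2000                    # time steps and number of particles
# Illustrative non-linear model: x_t = 0.5*x_{t-1} + 8*cos(t) + w_t, z_t = x_t^2/20 + v_t.
x_true, z = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.5 * x_true[t-1] + 8.0 * np.cos(t) + rng.normal(0, 1.0)
    z[t] = x_true[t]**2 / 20.0 + rng.normal(0, 1.0)

particles = rng.normal(0, 1.0, N)   # initial particle cloud
estimates = np.zeros(T)
for t in range(1, T):
    # Propagate the particles through the state transition (proposal = prior).
    particles = 0.5 * particles + 8.0 * np.cos(t) + rng.normal(0, 1.0, N)
    # Weight the particles by the observation likelihood p(z_t | x_t).
    w = np.exp(-0.5 * (z[t] - particles**2 / 20.0)**2) + 1e-300  # guard against underflow
    w /= w.sum()
    estimates[t] = np.sum(w * particles)          # weighted mean, cf. Eq. (14)
    # Resample with replacement, proportionally to the weights.
    particles = particles[rng.choice(N, size=N, p=w)]

print("mean absolute error:", np.mean(np.abs(estimates - x_true)))
```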

Expectation maximization (EM) method has been developed to deal with missing data and to filter data under parameter uncertainty. Assuming the observed but incomplete data, x, as well as the missing data, y, the data likelihood conditioned on an unknown parameter, θ, can be computed as

$$p(x \mid \theta) = \int p(x, y \mid \theta)\, dy = \int p(x \mid y, \theta)\, p(y \mid \theta)\, dy. \qquad (15)$$

The integral (15) is evaluated iteratively by first computing the Q-function (the E-step) as

$$Q(\theta, \theta_{n-1}) = \mathrm{E}\big[\log p(x, y \mid \theta) \,\big|\, x, \theta_{n-1}\big] \qquad (16)$$

followed by estimating the parameter θ (the M-step) as

$$\theta_n = \arg\max_{\theta} Q(\theta, \theta_{n-1}). \qquad (17)$$

This procedure guarantees that the likelihood never decreases, since p(x|θ_n) ≥ p(x|θ_{n−1}), so the iterations converge to a (local) maximum. However, Eqs. (16) and (17) are normally analytically intractable, so they must be evaluated numerically.
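
The following sketch applies the E-step (16) and M-step (17) to a two-component Gaussian mixture with known unit variances and equal weights, where the missing data y are the component labels; all numerical values are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Observed data x: mixture of N(-2,1) and N(3,1) with equal weights (labels y are missing).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])

mu = np.array([-1.0, 1.0])                 # initial guesses for the two component means
for n in range(50):
    # E-step, Eq. (16): posterior responsibilities of the hidden labels y.
    lik = np.stack([stats.norm.pdf(x, mu[0], 1.0),
                    stats.norm.pdf(x, mu[1], 1.0)])
    resp = lik / lik.sum(axis=0)
    # M-step, Eq. (17): re-estimate the means by responsibility-weighted averages.
    mu = (resp * x).sum(axis=1) / resp.sum(axis=1)

print("estimated component means:", mu)    # should approach (-2, 3)
```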

2.3 Approximation methods

The sampling methods for performing Bayesian inference, which mostly involve MCMC sampling, are asymptotically exact. However, these methods are also numerically expensive, especially for high-dimensional models. When working with big data, or when there is a need to choose among many plausible models, having a less accurate but numerically cheaper solution is highly desirable. Variational inference deterministically approximates the log-evidence, log p(x), in Bayes's theorem (1) as

$$\log p(x) = \mathcal{L} + \mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid x)\big) \ge \mathcal{L} \qquad (18)$$

where L is the evidence lower bound (ELBO), since the Kullback–Leibler (KL) divergence is always non-negative. Therefore, given the evidence p(x), minimizing the KL divergence between the approximation, q(θ), and the posterior, p(θ|x), is equivalent to maximizing the ELBO, L. Alternatively, it is possible to approximate the prior, p(θ), instead of the posterior. These are functional optimization problems defined as dL(q)/dq = 0, s.t. ∫q(θ)dθ = 1, with the optimum solutions q(θ) = p(θ|x) or q(θ) = p(θ).

Any function, q(θ), can be assumed, but some choices are better than others. The mean-field approximation assumes that the dimensions are independent, that is, q(θ) = ∏_{i=1}^M q_i(θ_i). Then, each q_i(θ_i) can be optimized in turn until convergence is obtained, indicated by no further increases in the ELBO value. There is a performance penalty if the independence assumption is not satisfied. Another popular choice is to assume a class of distributions, q_φ(θ), parameterized by φ.
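
As a toy illustration of Eq. (18), the snippet below fits a Gaussian approximation q(θ) = N(m, s²) to an arbitrary unnormalized target by maximizing a Monte Carlo estimate of the ELBO over a small grid of variational parameters (m, s); a practical implementation would use gradient-based updates instead of a grid.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def log_joint(theta):
    # Unnormalized log p(x, theta) for an arbitrary, slightly heavy-tailed target.
    return -0.5 * (theta - 1.0)**2 - 0.1 * theta**4

def elbo(m, s, n_mc=5000):
    # Monte Carlo estimate of ELBO = E_q[log p(x, theta) - log q(theta)].
    theta = rng.normal(m, s, n_mc)
    return np.mean(log_joint(theta) - stats.norm.logpdf(theta, m, s))

# Crude grid search over the variational parameters (m, s); maximizing the
# ELBO is equivalent to minimizing KL(q || posterior), cf. Eq. (18).
grid_m = np.linspace(-2.0, 2.0, 41)
grid_s = np.linspace(0.1, 2.0, 20)
best = max((elbo(m, s), m, s) for m in grid_m for s in grid_s)
print(f"best ELBO {best[0]:.3f} at m = {best[1]:.2f}, s = {best[2]:.2f}")
```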

The mean-field assumption is implemented in the coordinate ascent variational inference (CAVI) algorithm. It is related to Gibbs sampling and message passing. However, this method is not computationally efficient and only works for distributions that are conditionally conjugate to the prior. Stochastic variational inference (SVI) improves the efficiency as well as the accuracy of CAVI by assuming a stochastic approximation of the ELBO gradient and by optimizing the parameters using only a subset of data in each iteration (similarly to the EM method). Unfortunately, both CAVI and SVI require deriving the approximation components, q_i, analytically, which is often impractical.

Motivated by the limitations of conventional variational inference methods, the black box variational inference (BBVI) algorithm samples from a family of distributions, q_φ(θ), assuming that the joint distribution p(x|θ)p(θ) = p(x, θ) is known. The objective is to avoid the need for analytical derivations. The BBVI method can be used to estimate a wide range of linear and non-linear models. Since it is also the most efficient method for performing variational inference, it has become available as a library in many programming languages.

The divergence between the prior or posterior and its approximation cannot be minimized directly, but only indirectly by maximizing the ELBO. Although, at each iteration, the ELBO is guaranteed not to decrease, the convergence can be slow. However, the efficiency of variational methods is still better than that of the sampling and EM methods. The KL divergence can be replaced with alternative measures such as the α divergence, the f divergence, and mutual information. In addition, there are only a few theoretical guarantees, and the overall estimation accuracy of variational methods is crucially affected by not knowing how well the posterior (or prior) has been approximated; moreover, approximating asymmetric distributions is more challenging.

2.4 Likelihood-free inference

In many practical scenarios, a mathematical model may be available only as a simulation. It is often easy to simulate the model for different parameter values. Thus, by generating and comparing many simulation outputs for different parameter values, the simulation run that best matches the observed data can be identified. The parameter values for this simulation run are then declared to be the estimate. The inference methods implementing this simple idea are known as likelihood-free methods. They can be found in the literature as indirect inference, bootstrap filter, synthetic likelihood, and other methods. However, the most popular among these methods is approximate Bayesian computation (ABC).

The ABC method only requires that a generative model for producing the data is available. There is otherwise no need to assume, know, or calculate the model posterior, parameter likelihoods, or importance weights. The basic ABC rejection sampling is performed by repeating the following three steps:

  1. Sample the parameter θ from its prior, p_0(θ);

  2. Simulate or otherwise generate the data x given θ;

  3. If the distance, ‖x − x̃‖ < ε, between the simulated data, x, and the observed data, x̃, is sufficiently small, record the parameter θ.

The posterior distribution, π_ε(θ|x), can be estimated after enough samples, θ and x, have been collected. It is clear that the number of samples required to accurately estimate π_ε(θ|x) grows quickly with the number of dimensions. The exact inference is obtained in the limit, lim_{ε→0} π_ε(θ|x) = p(θ|x). On the other hand, no learning occurs if lim_{ε→∞} π_ε(θ|x) = p_0(θ).

The efficiency of the ABC method can be improved by considering summary statistics, S(x), rather than directly comparing the data, x. If S(x) is a sufficient statistic for estimating the parameter θ, then π_ε(θ | S(x)) = π_ε(θ | x); however, in practice, an informative (but not necessarily sufficient) statistic is often used.
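
A minimal ABC rejection sampler is sketched below; the generative model (a Gaussian with unknown mean), the uniform prior, the sample-mean summary statistic, and the tolerance ε are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# "Observed" data generated with a true mean of 1.5 (unknown to the sampler).
observed = rng.normal(1.5, 1.0, 100)
s_obs = observed.mean()                 # summary statistic S(x)

eps = 0.05                              # tolerance epsilon
accepted = []
for _ in range(200_000):
    theta = rng.uniform(-5.0, 5.0)      # 1. sample theta from the prior p0(theta)
    x = rng.normal(theta, 1.0, 100)     # 2. simulate data x given theta
    if abs(x.mean() - s_obs) < eps:     # 3. accept if the summaries are close enough
        accepted.append(theta)

accepted = np.array(accepted)
print(f"{len(accepted)} accepted draws, "
      f"posterior mean ~ {accepted.mean():.3f}, sd ~ {accepted.std():.3f}")
```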

Even though the ABC method can be readily parallelized, it can still be very inefficient, especially in high dimensions or if the summary statistic has been poorly chosen. Since the default ABC method does not exploit information about the accepted and rejected values of θ, the efficiency can be improved by employing MCMC sampling. The resulting ABC-MCMC algorithm performs the following four steps:

  1. Augment the posterior distribution to p(θ, x | x̃);

  2. At the current state, θ̃, generate a new sample, θ ∼ q(θ | θ̃);

  3. Simulate the data, x, for the new state θ;

  4. Accept the new state, θ, with some probability, α(θ, x).

This algorithm can be fine-tuned by gradually reducing ε in order to balance rejections with the estimation accuracy, while the convergence requires ε to be sufficiently small. Unlike pure MCMC sampling, the ABC-MCMC method requires a small acceptance rate in order to achieve good accuracy.

3. Other topics in Bayesian analysis

The Bayesian frameworks find many real-world applications in data analysis and data-driven planning and decision making. Bayesian experiment design is about choosing experiments and models that can provide the largest information gain or utility. Bayesian hypothesis testing accounts for prior knowledge to bias the hypothesis rejection or acceptance. This framework can also be used to make statistically informed decisions. Bayesian machine learning has become increasingly popular in recent years. It considers data labels to be random variables. Both supervised and semi-supervised strategies are discussed. Bayesian optimization shows remarkable promise in optimizing complex, difficult-to-evaluate systems. It considers the observed system responses to be samples of a random process.

3.1 Bayesian experiment design

In Bayesian experiment design, the task is to choose the optimum experiment and possibly also a data model [9]. This can be probabilistically represented as the augmented posterior (1), that is,

$$p(\theta \mid x, d, m) = \frac{p(x \mid \theta, d, m)\, p(\theta \mid m)}{p(x \mid d, m)} \qquad (19)$$

where d and m, respectively, represent the experiment and the model choice. More specifically, θ denotes uncontrolled inputs to the experiment as well as unknown model parameters, whereas d are the experiment inputs that can be controlled, that is, designed by a selection. The objective is to specify the optimum design, d, in order to facilitate the estimation of θ as well as selection of the best model, m, for observations, x.

For every configuration of the experiment including the model selection, let U(d, x, m, θ) be the perceived utility, for instance, the amount of information that can be gained from the experiment. Since the observations are random and the data model is unknown, the optimum experiment maximizes the average utility,

$$d^{*} = \arg\max_{d \in \mathcal{D}} \bar{U}(d) = \arg\max_{d \in \mathcal{D}} \mathrm{E}_{x, m, \theta}\big[U(d, x, m, \theta)\big]. \qquad (20)$$

Alternatively, the model selection can be formulated as a hypothesis-testing problem. The observed data and models are combined to obtain and then sample the predictive distribution for each model candidate. Additional experiments can be performed as needed in order to maximize reduction in the model uncertainty. A mismatch between the supports of the model prior and the model likelihood indicates that the model has either too few or too many parameters, and there is not enough evidence to make the model selection with a high confidence [10].

Instead of choosing one best model, the predictions from multiple models can be combined. The joint prediction has potentially much larger discriminatory power than individual predictions [10]. Moreover, how well the model can describe the majority of outcomes from many random experiments under different experimental conditions can be as important as the model likelihood.

The information value of an experiment can be quantified as the Fisher information matrix with the entries

$$\big[\mathcal{I}(\theta)\big]_{i,j} = \mathrm{E}\!\left[\frac{\partial}{\partial\theta_i} \log p(X \mid \theta)\; \frac{\partial}{\partial\theta_j} \log p(X \mid \theta) \,\middle|\, \theta\right]. \qquad (21)$$

Since performing experiments is often costly, the following optimization objectives have been defined in the literature for linear models:

  • A-optimality: minimize the average variance of parameter estimates;

  • C-optimality: obtain the BLUE of linearly combined model parameters;

  • D-optimality: maximize the entropy or determinant of information matrix;

  • E-optimality: maximize the minimum eigenvalue of information matrix;

  • T-optimality: maximize the trace of information matrix.

These optimality objectives may include prior distributions about the model and experiment parameters and combine other objectives for model selection.
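
As a small illustration of the A- and D-optimality criteria, the snippet below compares two hypothetical design matrices for a linear model with unit-variance noise, for which the Fisher information matrix in Eq. (21) reduces to AᵀA.

```python
import numpy as np

# Two candidate designs (rows = experiments) for the linear model y = A @ theta + noise.
A1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])       # spread-out design
A2 = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 0.9]])       # nearly collinear design

for name, A in [("design 1", A1), ("design 2", A2)]:
    info = A.T @ A                       # Fisher information matrix for unit-variance noise
    print(name,
          "| A-optimality (avg. variance):", np.trace(np.linalg.inv(info)) / info.shape[0],
          "| D-optimality (determinant):", np.linalg.det(info))
```

The spread-out design yields a much smaller average parameter variance and a much larger information determinant, which is exactly what the A- and D-criteria reward.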

The main challenge in evaluating the experiment design (20) is the computational complexity involved in searching the whole design space. The search strategies include linearization, local search, discretization, enumeration, approximation by regression or surrogate models, random (e.g., MCMC) sampling, genetic algorithms, as well as using Bayesian optimization methods, as will be discussed in Section 3.4.

Furthermore, the design (20) yields the single best experiment. In batch-optimum experiment design, N experiments are performed simultaneously. The data from different experiments are conditionally independent given the parameters, θ. The expected utility from these N experiments will be, in general, different from the sum of utilities of individual experiments. A simpler objective is to minimize the predicted variance of observations, x, when the experiment designed as a single optimum is repeated N times.

In the sequential experiment design, the posterior, p(θ | x_t, d_t), from the t-th experiment is used as a prior for the (t+1)-th experiment, and then the experiment conditions, d_{t+1}, are optimized. This is a greedy (sub-optimum) approach; however, it may still outperform the batch design due to an inherent adaptation to x_t and d_t. The optimum sequential experiment design is more complex, and it leads to a problem of dynamic programming.

3.2 Bayesian hypothesis testing

Let θ_0 be the critical parameter value, so that the null hypothesis, H_0, can be accepted if the parameter value θ < θ_0, and it is rejected in favor of the alternative hypothesis, H_a, otherwise. Denoting the loss functions as L(θ|H_0) and L(θ|H_a) under the respective hypotheses, the Bayesian risk for the observation, x, is computed as

$$R(x, \theta_0) = \int_{\theta < \theta_0} L(\theta \mid H_0)\, \Pr(x \mid \theta)\, p(\theta)\, d\theta + \int_{\theta \ge \theta_0} L(\theta \mid H_a)\, \Pr(x \mid \theta)\, p(\theta)\, d\theta. \qquad (22)$$

The optimum decision threshold is then obtained by minimizing the average risk, that is, θ_0* = argmin_{θ_0} E_x[R(x, θ_0)]. The statistical power of the hypothesis test is determined by the sample size [11].

3.3 Bayesian machine learning

The general objective of machine learning is to learn how to label unseen data. In supervised learning, this objective is accomplished by training the model with labeled data. For instance, consider the minimum mean square estimation (MMSE) of label Y for data X [3], that is,

$$f^{*}(x) = \arg\min_{f} \mathrm{E}\big[(Y - f(X))^2 \mid X = x\big] = \mathrm{E}[Y \mid X = x]. \qquad (23)$$

The training data, (X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n), are samples from the distribution p(Y, X) = p(X|Y) p(Y), assuming the likelihood, p(X|Y), and the prior, p(Y), of the data labels, Y. The MMSE expectation (23) can then be approximated as

$$\mathrm{E}[Y \mid X = x] \approx \frac{1}{n}\sum_{i=1}^{n} Y_i\, \mathrm{I}(X_i = x) \qquad (24)$$

assuming the indicator function I.
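
For a discrete feature, the approximation (24) can be evaluated directly from the training pairs; the snippet below uses synthetic data with an arbitrary label model and normalizes the indicator weights by the number of matching samples.

```python
import numpy as np

rng = np.random.default_rng(10)

# Synthetic training pairs (X_i, Y_i): X is a discrete feature, Y = 2*X + noise.
n = 10_000
X = rng.integers(0, 5, size=n)
Y = 2.0 * X + rng.normal(0.0, 1.0, size=n)

def mmse_estimate(x, X, Y):
    # Eq. (24) with the indicator weights normalized over the matching samples:
    # E[Y | X = x] ~ sum_i Y_i I(X_i = x) / sum_i I(X_i = x).
    mask = (X == x)
    return Y[mask].mean()

for x in range(5):
    print(f"E[Y | X = {x}] ~ {mmse_estimate(x, X, Y):.3f} (true value {2.0 * x:.1f})")
```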

In semi-supervised learning, there are additional unlabeled data, X_{n+1}, X_{n+2}, …, which can be used to estimate the marginal distribution, p(X) = ∫ p(X, Y) dY. This allows formulating and improving the estimates of the conditional expectation, E[Y | X = x], and casting it as a missing-labels problem. More generally, automated data labeling involves the problem of finding the discriminative data model, p(Y|X), or p(X|Y). However, the full generative model, p(Y|X)p(X) = p(X|Y)p(Y), may be required by some of the Bayesian inference methods outlined in the previous section.

Assuming data labels as random variables with their prior and posterior distributions is useful in tolerating label errors as well as missing labels. However, the challenge is that the assumed distributions may be biased or even incorrect. In addition to automated data labeling, Bayesian machine learning can generate more training data and improve the quality of training data by correcting the labels.

There is also an interesting connection to causal machine learning. In particular, the label can be assumed to be a cause of the observed data (features) representing the effects. Since the probability, Pr(effect), and the conditional probability, Pr(cause|effect), are not independent, the data samples can be used to estimate their distribution, p(X), which in turn facilitates an anti-causal learning of causes (i.e., the data labels, Y). On the other hand, the probabilities Pr(cause) and Pr(effect|cause) are independent, and so, intuitively, they cannot be exploited for causal learning of data from their labels.

3.4 Bayesian optimization

Consider the task of minimizing or maximizing a function f(x) over some high-dimensional feasible set, A ⊂ ℝ^n. It is typically assumed that evaluating the function is numerically or otherwise very expensive, whereas testing whether x ∈ A is computationally cheap. Even though f is normally assumed to be continuous, its derivatives are not known. Furthermore, the problem is non-convex, as the function has no special structure, it usually contains many local optima, and its observations may be noisy.

The basic idea to solve such a difficult optimization problem is to learn a surrogate approximation of f, which is cheap to evaluate. The surrogate approximation is constructed in the context of Bayesian inference [12]. In particular, Bayesian optimization starts by evaluating the function f at a few randomly chosen points, x. The obtained values, f(x), are assumed to be samples of a random process. A Gaussian process (GP) is most commonly assumed, since it can yield closed-form expressions for the extrapolated values, even though inverting large covariance matrices is often numerically rather problematic. Then, the following steps are repeated as many times as can be practically afforded.

  1. The posterior of a candidate sample is obtained given the existing samples x_1, x_2, …, x_n;

  2. The new sample is chosen as the one maximizing the so-called acquisition function (to be discussed below);

  3. The function f is evaluated at the new sample, and the posterior is updated accordingly.

Assuming Bayesian regression over a Gaussian process, the n existing samples, f(x_1), …, f(x_n), are normally distributed and have the joint mean, μ_0(x_{1:n}), and the covariance matrix, Σ_0(x_{1:n}). The corresponding posterior distribution of a candidate sample, f(x), is also normal with the mean, μ_n(x), and the variance, σ_n²(x).

The acquisition function predicts the value of a new sample after already knowing the values, f(x_1), …, f(x_n). The values of the acquisition function tend to be larger when the calculated credible intervals are wider and when the posterior mean is larger. Since the mean, μ_n(x), is also a point estimate of f(x) after n observations, the value of f(x) is estimated by interpolating μ_n(x); this corresponds to Gaussian regression. In other words, the measured values, f(x_1), …, f(x_n), are interpolated to estimate the mean, μ_n(x). The covariance matrix, Σ_n(x_{1:n}), determines how fast the function values vary in between the already measured samples. It is usually required that the values de-correlate with their distance. The covariance matrix can also be estimated from the existing samples as η̂ = argmax_η p(f(x_{1:n}) | η) (maximum likelihood estimation, MLE) or as η̂ = argmax_η p(f(x_{1:n}) | η) p(η) (maximum a posteriori estimation, MAP). It is also possible to treat some parameters, η, as nuisance parameters and marginalize them from the likelihood assuming their prior, p(η), before employing one of the Bayesian methods for intractable distributions.

Expected improvement (EI) is the most commonly assumed acquisition function. Provided that f is observed without measurement noise, the current best choice is f_n* = max{f(x_1), …, f(x_n)}. The expected improvement is then defined as

$$\mathrm{EI}_n(x) = \mathrm{E}_n\big[(f(x) - f_n^{*})^{+}\big] \qquad (25)$$

where (a)^+ = max(a, 0), and the expectation is taken over the posterior of f(x) conditioned on the values known so far. The next best choice for sampling the function f is

$$x_{n+1} = \arg\max_{x} \mathrm{EI}_n(x). \qquad (26)$$

Unlike f, the function EI_n(x) is inexpensive to evaluate, and its derivatives can be used to effectively maximize Eq. (26). The best expected improvement occurs at points far away from the previously evaluated points (the points having a large posterior variance) and at points having large posterior means; this represents an exploration–exploitation trade-off.
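
The EI-based loop (25)–(26) can be sketched as follows, using a GP surrogate from scikit-learn (assumed to be available) and a dense candidate grid in one dimension; the objective function, kernel, and all numerical settings are illustrative. Maximizing EI over a fixed grid avoids an inner optimization and is adequate for a low-dimensional demonstration.

```python
import numpy as np
from scipy import stats
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(8)

def f(x):
    # "Expensive" black-box objective (cheap here, purely for illustration).
    return np.sin(3.0 * x) - 0.5 * x**2

# A few initial random evaluations of f.
X = rng.uniform(-2.0, 2.0, 5).reshape(-1, 1)
y = f(X).ravel()

candidates = np.linspace(-2.0, 2.0, 400).reshape(-1, 1)
for n in range(15):
    # Fit the GP surrogate (posterior mean mu_n(x) and std sigma_n(x)).
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5),
                                  alpha=1e-6, normalize_y=True)
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Expected improvement, Eq. (25), over the current best value f_n*.
    f_best = y.max()
    z = (mu - f_best) / np.maximum(sigma, 1e-12)
    ei = (mu - f_best) * stats.norm.cdf(z) + sigma * stats.norm.pdf(z)
    # Eq. (26): evaluate f at the candidate maximizing EI and update the data.
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))

print("best x found:", X[np.argmax(y)].item(), "with f =", y.max())
```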

The knowledge gradient (KG) acquisition function selects the sampling point x having the largest posterior mean, μ_n* = max_x μ_n(x). It allows considering the posterior across the full domain of f and how it is changed by a new sample. The knowledge gradient is computed as

$$\mathrm{KG}_n(x) = \mathrm{E}_n\big[\mu_{n+1}^{*} - \mu_n^{*} \mid x_{n+1} = x\big]. \qquad (27)$$

The next best sample corresponds to the maximum, that is,

$$x_{n+1} = \arg\max_{x} \mathrm{KG}_n(x). \qquad (28)$$

An efficient practical implementation of KG can assume multi-start stochastic gradient descent (or ascent, when the target function is to be maximized). This leads to a two-step maximization procedure when the maximum is first searched among a collection of candidate functions, and then, the selected function is maximized separately, for example, by differentiation.

In general, the KG acquisition function tends to significantly outperform the EI method, especially when the function observations are noisy.

The entropy search (ES) acquisition function assumes that the global optimum, x*, is a random variable implied by the Gaussian process, f(x). It performs the search for a new sample in order to obtain the largest decrease in differential entropy, corresponding to the largest reduction of uncertainty about the global optimum. The ES acquisition function is defined as

$$\mathrm{ES}_n(x) = H\big(P_n(x^{*})\big) - \mathrm{E}_{f(x)}\Big[H\big(P_n(x^{*} \mid x, f(x))\big)\Big] \qquad (29)$$

where H denotes entropy, so that the posterior across a full domain of f and how it is changed by a new sample are again accounted for. This is useful when the observations are noisy. However, unlike the KG method, stochastic gradients cannot be obtained for the ES acquisition function.

Finally, the predictive entropy search (PES) acquisition function rewrites Eq. (29) using mutual information. Conceptually, it is exactly the same as the ES method; however, numerical properties of these two algorithms are different.

The basic optimization problem can be augmented by considering a set of objectives, f(x, s), indexed by the "fidelity", s, such that lower values of s mean higher fidelity, and f(x, 0) ≡ f(x) represents the original objective. For example, fidelity can represent the model order or the modeling granularity. There is also a cost, c(s), associated with fidelity, s; the larger the required fidelity, the larger the cost. These problems are referred to as multi-fidelity source evaluation in the literature. The task is to maximize the function, f(x, 0), by observing a sequence of values of f(x, s) at n points, that is, (x_1, s_1), …, (x_n, s_n), subject to the total available budget, Σ_{i=1}^n c(s_i) ≤ C_total.

This problem can be further generalized by assuming that neither f(x, s) nor c(s) is monotonic in s; for example, keeping s constant, the observed f(x, s) across different regions of x can have varying accuracy.

The problem of random environmental conditions is closely related to multi-task Bayesian optimization. It maximizes the function ∫ f(x, w) p(w) dw by evaluating f(x, w) for each w at multiple values of x. It is assumed that evaluating f(x, w) at both x and w is expensive, but evaluating p(w) is cheap. The values w represent random environmental conditions, and they act as noise in the observations of f(x, w). For example, f can represent the average performance of a machine learning model with w-fold cross-validation. Another example is Bayesian neural networks having random coefficients with a certain probability distribution.

Comparing Bayesian optimization with other optimization methods, the latter usually work well only for specific problems or under specific conditions, have high computational cost, and suffer from the curse of dimensionality, leading to a low sample efficiency and slow convergence. Bayesian optimization, on the other hand, can work very well for moderate dimensionality (a maximum dimension of about 20 is recommended in the literature), and the sampling decisions appear to be optimum despite being sequential. In addition, Bayesian optimization is a derivative-free, black-box method.

Despite good performance, there is a need to provide better theoretical understanding of Bayesian optimization methods, define stopping rules, and improve learning of surrogate models, especially for non-Gaussian processes and for systems containing non-linear transformations. Bayesian optimization can be used to efficiently find hyperparameters of machine learning models [13]. Recently, maximization of a composite function, g(h(x)), has been considered in the literature, where the inner function, h, is expensive to evaluate and a black box, whereas the outer function, g, is easy to evaluate.

4. Bayesian Monte Carlo simulations

Monte Carlo simulations are a general class of random sampling methods. They can be used to solve problems such as (2)–(4) using numerical algorithms. Simulations are often used to solve problems involving exploration of high-dimensional parameter spaces in order to perform sensitivity analysis, performance analysis, and system optimization. Unlike analytical approaches, computer simulations are more flexible, require fewer assumptions, and allow for more complex and, thus, more realistic models to be considered. In silico experiments are much more cost-effective than laboratory experiments conducted in life sciences. Digital twins are now often built to monitor complex engineering systems. Synthetic data from simulations can be used to train machine learning models and to avoid expensive or hard-to-get real-world measurements. However, there are often problems with reproducibility and validation of simulation results, especially when the simulation procedures and models become more complex. In general, it is extremely difficult to distinguish among algorithmic errors; incorrect choice and use of algorithms; the errors in mathematical derivations; conceptual errors, for example, due to invalid assumptions; and the errors in algorithm implementations.

Vast amounts of generated simulation data are usually reduced into a few summary statistics. This results in a loss of potentially valuable information, unless the statistics are also sufficient within a given context. There is rarely a deeper analysis of simulation techniques to explain why the implemented algorithms work or do not work, under what conditions, and whether they could be improved further. Thus, there is a need to enhance traditional Monte Carlo simulations and define the best practices of how to plan and conduct computer experiments in order to maximize the information gain and extract maximum useful knowledge from these simulations [14].

Improving the usefulness of Monte Carlo simulations requires that they are interpreted as statistical estimators of quantities of interest, including identifying relevant causal associations [15, 16]. For instance, the sample statistics are point estimates, whereas it may be more useful to extract Bayesian posteriors or credible intervals of these statistics when performing simulations. Bayesian analysis also enables counterfactual reasoning in order to find how the performance or system behavior would change if the simulated model were altered. In particular, if the average performance of a system, f(x), is evaluated as ∫ f(x) p(x) dx over a distribution of factors, p(x), then the counterfactual performance of a modified system having the factor distribution p′(x) can be defined as

$$\int f(x)\, p'(x)\, dx = \int f(x)\, \underbrace{\frac{p'(x)}{p(x)}}_{w(x)}\, p(x)\, dx. \qquad (30)$$

Eq. (30) is akin to the IS Eq. (5); that is, the system response, f(x), is scaled by the weights, w(x), when assessing its hypothetical performance (a numerical sketch of this reweighting is given after the list below). Such an approach can yield the following systematic and general strategy for designing systems via computer simulations:

  1. Build a structural equation model (SEM) of the system using available performance data;

  2. Use the SEM to define points of interventions and measurements;

  3. Infer changes in the performance distribution under different hypothetical modifications;

  4. Select and validate the modification with the best hypothetical improvement by implementing it in a real-world system.
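
Continuing the reweighting idea in Eq. (30), the sketch below reuses samples generated under the original factor distribution p(x) to predict the average performance under a modified distribution p′(x); the system response and both distributions are illustrative assumptions, and the direct re-simulation is included only as a check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

def system_response(x):
    # Illustrative simulated system response f(x).
    return np.exp(-0.2 * x) * np.sin(x)

# Factors simulated under the original distribution p(x) = N(1, 1).
p = stats.norm(1.0, 1.0)
x = p.rvs(size=200_000, random_state=rng)
fx = system_response(x)
print("average performance under p :", fx.mean())

# Counterfactual: what if the factors followed p'(x) = N(2, 0.5) instead?
p_new = stats.norm(2.0, 0.5)
w = p_new.pdf(x) / p.pdf(x)                  # weights w(x) = p'(x)/p(x), Eq. (30)
print("reweighted estimate under p':", np.mean(w * fx))

# Direct re-simulation under p'(x), used only to verify the reweighted estimate.
x_new = p_new.rvs(size=200_000, random_state=rng)
print("direct re-simulation check  :", system_response(x_new).mean())
```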

A basic strategy for increasing the information gain of Monte Carlo simulations is to augment the number of observation points. In Figure 1A, the simulation inputs and outputs are represented by random variables, X and Y, respectively, and the augmented output is represented by a random variable, Z. The corresponding posteriors and likelihoods are then computed as

Figure 1.

The input–output causal relationships in augmented Monte Carlo simulations.

$$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}, \qquad p(Y \mid X) = \sum_{Z} p(Y \mid X, Z)\, p(Z \mid X) \qquad (31)$$
$$p(Z \mid Y) = \frac{p(Y \mid Z)\, p(Z)}{p(Y)}, \qquad p(Y \mid Z) = \sum_{X} p(Y \mid X, Z)\, p(X \mid Z) \qquad (32)$$
$$p(X \mid Z) = \frac{p(Z \mid X)\, p(X)}{p(Z)}, \qquad p(Y, Z, X) = p(Y \mid Z, X)\, p(Z \mid X)\, p(X). \qquad (33)$$

The causal relationships can be represented by probabilistic graphs, which are referred to as structural causal models (SCMs). These models are similar to SEMs and also allow capturing non-linear relationships. The SCMs are effectively Bayesian networks with directed edges indicating causal effects (since these effects are generally asymmetric) rather than statistical dependencies. Graph cycles are not allowed in order to avoid a variable being a cause of itself. The noise variables are not shown explicitly.

The fundamental idea of studying causal effects is by accommodating interventions. Such a framework has been devised by Pearl [8], and it is now known as Do-calculus. The intervention denoted as do(X) enforces a certain deterministic or random value from some distribution on the variable X. If this causes a change in the posterior, p(Y | do(X)), then X and Y are causally related. However, in general, observing p(Y | X) is not sufficient to determine causality.

  1. Observations can be inserted or deleted in conditional probabilities;

  2. Actions and observations can be exchanged in conditional probabilities;

  3. Actions can be inserted or deleted in conditional probabilities.

Moreover, if the causal effect is identifiable, then the causal effect statement can be transformed into a probability expression containing only the observed variables. Unknown causal dependencies can be replaced with conditional probabilities. The significance of these rules is that they allow performing causal inferences using traditional methods of statistical inference for Bayesian networks. This can be used for removing confounding bias, recovering from a selection bias, defining surrogate experiments, and extrapolating and transferring causal knowledge to other similar SCMs.

To illustrate the basic causal components of SCMs, consider three random variables, A, B, and C. The causal chain A → B → C and the causal fork A ← B → C imply that A and C are generally dependent; however, they are independent, conditioned on B. On the other hand, the causal collider A → B ← C implies that A and C are generally independent, but they are dependent, conditioned on B. Thus, conditioning on B in the case of a causal chain or fork blocks (or d-separates) the path between A and C, whereas conditioning on B opens up such a path in the case of a causal collider.

These rules can be used to discuss causality in different SCM representations of augmented Monte Carlo simulations. In particular, the SCM in Figure 1B contains the direct causal paths, X → Z, Z → Y, and X → Y. There is also a backdoor path between Z and Y, that is, Z ← X → Y. Thus, X is a common cause, or a confounder, so conditioning on X blocks the backdoor path and enables causal inference.

The SCM in Figure 1C contains an unmeasured or unobserved (e.g., exogenous, outside the model) variable, U. This is an example of so-called confounding by indication. There are two confounded associations, that is, X–Z–Y and U–X–Z–Y. Moreover, although conditioning on U is not possible, conditioning on X removes any unmeasured confounding. In the case of the SCM in Figure 1D, conditioning on X is sufficient to block the backdoor path.

The SCM in Figure 1E contains two unmeasured variables, U_1 and U_2. There is no bias without any conditioning. However, fixing X as do(X) will induce a selection bias by opening the backdoor path, Z ← U_2 → X ← U_1 → Y, between Z and Y. On the other hand, conditioning on X will create direct as well as backdoor associations between Z and Y. The causal relationships in the remaining SCMs in Figure 1F–H are trivial.

The augmented Monte Carlo simulation in Figure 1A may require more sophisticated processing of the observed inputs and outputs, as suggested in Figure 2, with the aim of improving the information gain and achieving explainability [17]. In particular, the summary statistics should be calculated between the variables X and Y as in traditional Monte Carlo simulations, and then also between X and Z and between Z and Y in augmented Monte Carlo simulations. This should also improve the detectability of events and anomalies. More importantly, it may be possible to define formal rules to automate building SEM and SCM models [18]. The models can then be fitted to observations and analyzed to discover causal relationships. The schematic of an SEM or SCM is depicted in Figure 3, where the U's represent the inferred latent variables. The associations, A_i, interconnect the identified events, E_j, for example, using Bayesian rules of statistical dependencies and causal inferences.

Figure 2.

A structure of explainable Monte Carlo simulations.

Figure 3.

A schematic SCM for the explainable Monte Carlo simulation in Figure 2.

5. Conclusions

This chapter discussed the most important Bayesian frameworks and their applications. These frameworks play a crucial role in statistical and causal inferences. Learning from observations is one of the most fundamental concepts. Learning should be unbiased, systematic, replicable, generalizable, efficient in samples and other resources, and it should also provide a sufficient information gain. Since learning problems are often mathematically intractable, Bayesian reasoning can guide the development of the corresponding numerical algorithms. The algorithms can be considered to be concise representations of the data, which can be generated by these algorithms. Bayesian frameworks allow building much more sophisticated models and carrying out more complex computer simulations.

Bayesian experiment design is especially important due to the widespread adoption of computer simulations in designing and analyzing various real-world systems. Research in machine learning is slowly moving from statistical correlation-based models to models exploiting causal relationships. As discussed in this chapter, the causal relationships in SCMs (or SEMs) can be analyzed by statistical inference methods, which were developed for Bayesian probabilistic models. Bayesian machine learning assigns a distribution to data labels. It improves the performance of semi-supervised learning and enables automated data labeling and data label corrections. Bayesian optimization is particularly useful for optimizing real-world complex systems, which are very expensive to evaluate. Both Bayesian optimization and Bayesian machine learning are very active areas of research with many exciting open research problems. Bayesian Monte Carlo simulations aim at providing a systematic, semi-automated, explainable simulation framework. They augment the set of observations and leverage an SEM or SCM in order to move from traditional descriptive simulations to predictive and even prescriptive simulations. This research direction is still little explored in the existing literature.

Acknowledgments

This research was funded by a research grant from Zhejiang University.

Abbreviations

ABC – approximate Bayesian computation
BBVI – black box variational inference
BLUE – best linear unbiased estimators
CAVI – coordinate ascent variational inference
ES – entropy search
ELBO – evidence lower bound
EM – expectation maximization
EI – expected improvement
GP – Gaussian process
IS – importance sampling
KG – knowledge gradient
KL – Kullback–Leibler
MC – Monte Carlo
MAP – maximum a posteriori estimation
MCMC – Markov chain Monte Carlo
MLE – maximum likelihood estimation
MMSE – minimum mean square estimation
PES – predictive entropy search
RBM – restricted Boltzmann machine
SMC – sequential Monte Carlo
SVI – stochastic variational inference
SCM – structural causal model
SEM – structural equation model

References

  1. Robert CP. The Bayesian Choice. 2nd ed. New York, NY, USA: Springer; 2007
  2. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd ed. Boca Raton, FL, USA: CRC Press, Taylor & Francis Group; 2014
  3. Kay SM. Fundamentals of Statistical Signal Processing: Estimation Theory. Vol. I. Upper Saddle River, NJ, USA: Prentice Hall; 1993
  4. Marin JM, Robert CP. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. New York, NY, USA: Springer; 2007
  5. Theodoridis S. Machine Learning: A Bayesian and Optimization Perspective. 2nd ed. Elsevier: Academic Press; 2020
  6. Zhigljavsky A, Žilinskas A. Bayesian and High-Dimensional Optimization. Cham, Switzerland: Springer; 2021
  7. Loskot P, Atitey K, Mihaylova L. Comprehensive review of models and methods for inferences in bio-chemical reaction networks. Frontiers in Genetics. 2019;10(549):1-29. DOI: 10.3389/fgene.2019.00549
  8. Pearl J, Glymour M, Jewell NP. Probabilistic Reasoning In Intelligent Systems. San Francisco, CA, USA: John Wiley & Sons; 2016
  9. Huan X, Marzouk YM. Simulation-based optimal Bayesian experimental design for nonlinear systems. Journal of Computational Physics. 2013;232(1):288-317. DOI: 10.1016/j.jcp.2012.08.013
  10. Vanlier J, Tiemann CA, Hilbers PAJ, van Riel NAW. Optimal experiment design for model selection in biochemical networks. BMC Systems Biology. 2014;8(20):1-15. DOI: 10.1186/1752-0509-8-20
  11. Sahu SK, Smith TMF. A Bayesian method of sample size determination with practical applications. Journal of the Royal Statistical Society A. 2006;169(Part 2):235-253. DOI: 10.1111/j.1467-985X.2006.00408.x
  12. Frazier PI. A tutorial on Bayesian optimization. 2018. arXiv:1807.02811 [stat.ML]
  13. Bergstra J, Yamins D, Cox DD. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In: Inter. Conf. Machine Learning. 2013;28:1-9
  14. Liepe J, Filippi S, Komorowski M, Stumpf MPH. Maximizing the information content of experiments in systems biology. PLOS Computational Biology. 2013;9(1):1-13. DOI: 10.1371/journal.pcbi.1002888
  15. Bocquet M, Brajard J, Carrassi A, Bertino L. Bayesian inference of chaotic dynamics by merging data assimilation, machine learning and expectation-maximization. Foundations of Data Science. 2020;2(1):55-80. DOI: 10.3934/fods.2020004
  16. Cranmer K, Brehmer J, Louppe G. The frontier of simulation-based inference. PNAS. 2020;117(48):30055-30062. DOI: 10.1073/pnas.1912789117
  17. Liang Y, Li S, Yan C, Li M, Jiang C. Explaining the black-box model: A survey of local interpretation methods for deep neural networks. Neurocomputing. 2021;419:168-182. DOI: 10.1016/j.neucom.2020.08.011
  18. Le TA, Baydin AG, Wood F. Inference compilation and universal probabilistic programming. In: Inter. Conf. on Artificial Intelligence and Statistics. 2017;54:1-11
