Introductory Chapter: Development of Bayesian Inference

Written By

Niansheng Tang and Ying Wu

Published: 23 November 2022

DOI: 10.5772/intechopen.108011

From the Edited Volume

Bayesian Inference - Recent Advantages

Edited by Niansheng Tang


1. Introduction

Bayesian inference derives from Bayes' theorem, which gives the probability of occurrence of an event given some prior information. Owing to huge advances in computational and modeling techniques, Bayesian inference has increasingly become an important tool for data analysis and has been widely applied in various fields, including social science, engineering, philosophy, medicine, sport, law, and psychology, for parametric/nonparametric estimation, hypothesis testing, and prediction.

Various Bayesian methods, including Markov chain Monte Carlo, objective Bayesian methods, subjective Bayesian methods, approximate Bayesian computation, and variational Bayesian methods, have been developed to make Bayesian inference on problems such as large-scale image classification and cluster analysis of microarray data, and on parametric, nonparametric, and semiparametric models as well as more complicated models such as joint models of survival and longitudinal data, graphical models, computer models, neural network models, and spatial econometric models. In particular, in the big data era, many Bayesian advances in theory, methods, and computational algorithms have been made in recent years to accommodate applications in AI and data science [1], for example, prior learning, Bayes factor evaluation, Bayesian variable selection, robust Bayesian inference, variational Bayesian inference, resampling, approximation of posterior distributions, approximate Bayesian computation, and debiasing methods for high-/ultrahigh-dimensional data, multisource heterogeneous data, imbalanced data, missing data, and data streams.

Nevertheless, several challenging problems remain to be addressed, for example, how to balance computational time against statistical efficiency; how to design efficient Bayesian computational algorithms and robust sampling schemes for big/massive data, distributed data, and streaming data under privacy protection and defense against malicious attacks; and how to build models that adapt to the development of AI and the requirements of data mining. In what follows, we introduce recent developments and some topics of interest in Bayesian inference.

2. Bayesian estimation

For statistical models, Bayesian estimates are usually obtained from the posterior distribution based on Bayes' theorem. In general, Bayesian estimation covers both parameters and nonparametric functions. For parametric Bayesian estimation, we first specify the prior distribution of the parameter and then evaluate its posterior mean or median (i.e., the Bayesian estimate of the parameter) from its posterior distribution, under the quadratic or mean-absolute loss function, respectively. For nonparametric Bayesian estimation, we first approximate the nonparametric function by some proper method such as B-splines or P-splines [2], i.e., a parameterized approximation to the nonparametric function, which leads to a parameterized model, and then employ the Bayesian approach for parametric models to evaluate Bayesian estimates of the parameters and the nonparametric function. In what follows, we introduce how to evaluate Bayesian estimates of parameters and nonparametric functions in a relatively complicated model (e.g., a random effects model or latent variable model).
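As a minimal illustration of the parametric case, the following Python sketch computes the posterior mean (the Bayes estimate under quadratic loss) for the mean of normal data with a known variance and a conjugate normal prior; the data, variance, and hyperparameters are illustrative assumptions and are not taken from the chapter.

```python
import numpy as np

def normal_posterior_mean(y, sigma2, prior_mean=0.0, prior_var=100.0):
    """Posterior mean (the Bayes estimate under quadratic loss) for the mean
    of y_i ~ N(theta, sigma2) with a conjugate prior theta ~ N(prior_mean, prior_var).
    Returns the posterior mean and posterior variance.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    post_prec = n / sigma2 + 1.0 / prior_var                      # posterior precision
    post_mean = (np.sum(y) / sigma2 + prior_mean / prior_var) / post_prec
    return post_mean, 1.0 / post_prec

# Example with simulated data and a known (illustrative) variance sigma2 = 4.
post_mean, post_var = normal_posterior_mean(
    np.random.default_rng(3).normal(1.5, 2.0, size=50), sigma2=4.0)
```

Under the mean-absolute loss, the Bayes estimate would instead be the posterior median, which coincides with the posterior mean in this symmetric example.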

In a latent variable model with missing response data, we assume the following form:

$$Y_i = X_i\beta + \Lambda\omega_i + \varepsilon_i, \qquad \eta_i = \Pi\eta_i + \Gamma\xi_i + \epsilon_i, \qquad i = 1, 2, \ldots, n, \tag{1}$$

where $Y_i$ is a $p\times 1$ vector of manifest variables including continuous and categorical variables [3], $X_i$ is a $p\times q$ matrix of covariates, $\omega_i$ is an $r\times 1$ vector of latent variables, $\varepsilon_i$ and $\epsilon_i$ are $p\times 1$ and $r_1\times 1$ vectors of measurement errors, respectively, $\beta$ is a $q\times 1$ vector of unknown parameters, $\Lambda$ is a $p\times r$ factor loading matrix, $\omega_i = (\eta_i^{\top}, \xi_i^{\top})^{\top}$ in which $\eta_i$ and $\xi_i$ are the $r_1\times 1$ and $r_2\times 1$ sub-vectors of $\omega_i$ with $r_1 + r_2 = r$, and $\Pi$ and $\Gamma$ are $r_1\times r_1$ and $r_1\times r_2$ matrices of unknown parameters. It is also assumed that $|I - \Pi| \neq 0$, that $\xi_i$ follows a normal distribution or an unknown distribution, and that the $y_{ij}$'s are subject to missingness, where $I$ is the $r_1\times r_1$ identity matrix and $y_{ij}$ is the $j$th component of $Y_i$ for $i = 1, \ldots, n$ and $j = 1, \ldots, p$.

In general, a simple and standard assumption is that the distributions of $\varepsilon_i$, $\epsilon_i$, and $\xi_i$ belong to some parametric density family, such as the skew-normal/skew-t/skew-normal-Cauchy/skew-symmetric-Laplace distributions [2], an exponential family distribution [3], a normal distribution [4], an autoregressive series with unequal time spacing, or their mixtures. But this assumption may be unreasonable or too restrictive. To this end, some alternative methods have been developed to relax the parametric distribution assumption. For example, one can let $\varepsilon_i$ or $\xi_i$ follow an unknown distribution, which is approximated by a Dirichlet process prior [2], a spiked Dirichlet process prior [5], a truncated centered Dirichlet process prior [6], or a class of smooth densities approximated by the semiparametric approach of Gallant and Nychka [7]; an unknown distribution with its quantiles specified, leading to the well-known quantile regression models; or a Bayesian neural network approach to learning the unknown distribution [8].
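To make the Dirichlet process alternative concrete, a draw from a Dirichlet process prior can be generated with the stick-breaking construction; the following Python sketch uses a finite truncation, and the truncation level, concentration parameter, and base measure are illustrative assumptions.

```python
import numpy as np

def truncated_dp_sample(alpha, base_sampler, K, rng=None):
    """Draw one random distribution from a Dirichlet process prior via a
    truncated stick-breaking construction with K components.

    alpha        : DP concentration parameter
    base_sampler : function rng -> one draw from the base measure G0
    K            : truncation level
    Returns (weights, atoms): mixing weights pi_k and atom locations.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Stick-breaking: v_k ~ Beta(1, alpha), pi_k = v_k * prod_{j<k} (1 - v_j)
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # close the stick so the truncated weights sum to one
    weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    atoms = np.array([base_sampler(rng) for _ in range(K)])
    return weights, atoms

# Example: a random error distribution with a standard normal base measure G0
# (an illustrative choice, not the chapter's specification).
weights, atoms = truncated_dp_sample(alpha=2.0, base_sampler=lambda r: r.normal(), K=50)
```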

To introduce missing data, let $\delta_{ij}$ be the missingness indicator of $y_{ij}$, i.e., $\delta_{ij} = 1$ if $y_{ij}$ is observed and $\delta_{ij} = 0$ if $y_{ij}$ is missing. In this case, we usually need to assume a missing data mechanism, for example, missing completely at random, missing at random, or missing not at random (also called nonignorable missingness), and then specify a parametric or semiparametric model for the assumed mechanism. For example, see Lee and Tang [4] for the logistic regression model, Kim and Yu [9] and Tang, Zhao, and Zhu [10] for the exponential tilting model, and Wang and Tang [11] for the probit regression model together with latent variables. Also, one can learn the missing data mechanism model via a data-driven machine learning method [12], which is a completely new and not yet studied topic, including its implementation, algorithms, and theory. We are working on this new and promising topic, which may lead to a new research field.
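As a concrete illustration of a nonignorable (missing not at random) mechanism, the following Python sketch simulates missingness indicators from a logistic model in which the probability of being missing depends on the possibly unobserved value itself; the coefficients are illustrative assumptions, and this is not the specific model of Lee and Tang [4].

```python
import numpy as np

def simulate_missingness(Y, gamma0=-1.0, gamma1=0.5, rng=None):
    """Simulate missingness indicators under a logistic missingness model:
    P(delta_ij = 0 | y_ij) = expit(gamma0 + gamma1 * y_ij).
    Because the probability of being missing depends on the (possibly
    unobserved) value y_ij itself, the mechanism is missing not at random.
    Returns delta with delta_ij = 1 if y_ij is observed and 0 if missing.
    """
    rng = np.random.default_rng() if rng is None else rng
    prob_missing = 1.0 / (1.0 + np.exp(-(gamma0 + gamma1 * Y)))
    delta = (rng.uniform(size=Y.shape) >= prob_missing).astype(int)
    return delta

# Illustration: n = 200 subjects, p = 4 manifest variables.
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 4))
delta = simulate_missingness(Y, rng=rng)
```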

To make Bayesian inference on the considered model (1), we need to specify a prior distribution for the unknown parameters, or for the coefficients used in approximating unknown nonparametric functions. A standard assumption for the unknown parameters is some proper parametric distribution family, such as the normal, gamma, inverse gamma, inverse Gaussian, Wishart, or Beta distribution, whose hyperparameters are prespecified by the user. Their misspecification or improper application may lead to unreasonable or even misleading parameter estimates. Moreover, Bayesian inference based on these assumptions does not utilize historical data, which limits its appeal because the use of historical data may improve the efficiency of parameter estimation. To address this issue, some more flexible priors have been considered, for example, the power prior, the g-prior, the normalized power prior [13], the calibrated power prior, the dynamic power prior, the power-expected-posterior prior, and the scale transformed power prior [14]. For a high-dimensional sparse parametric model, we can assume a spike and slab prior for the parameter, which can be hierarchically expressed as a mixture of a normal distribution and an exponential distribution.
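To fix ideas, the power prior of Ibrahim, Chen, and Sinha [13] builds a prior from a historical dataset $D_0$ by discounting the historical likelihood with a power $a_0$:

$$\pi(\theta \mid D_0, a_0) \propto L(\theta \mid D_0)^{a_0}\, \pi_0(\theta), \qquad 0 \le a_0 \le 1,$$

where $\pi_0(\theta)$ is the initial prior: $a_0 = 0$ discards the historical data and recovers $\pi_0(\theta)$, while $a_0 = 1$ pools the historical data fully with the current data. In the normalized power prior, $a_0$ is treated as random and the prior is renormalized over $\theta$ for each value of $a_0$.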

Let $\vartheta_1$ be the set of unknown parameters associated with the distribution of $\varepsilon_i$, $\vartheta_2$ be the set of unknown parameters associated with the distributions of $\epsilon_i$ and $\xi_i$, and $\vartheta_3$ be the set of unknown parameters associated with the distribution of $\delta_i$, and denote $\vartheta = \{\vartheta_1, \vartheta_2, \vartheta_3\}$. Let $\theta$ be the set of unknown parameters in $\{\beta, \Lambda, \Pi, \Gamma, \vartheta\}$. Denote $Y = \{Y_i\}_{i=1}^{n}$, $Y_{\mathrm{obs}} = \{Y_{i,\mathrm{obs}}\}_{i=1}^{n}$, $Y_{\mathrm{mis}} = \{Y_{i,\mathrm{mis}}\}_{i=1}^{n}$, $X = \{X_i\}_{i=1}^{n}$, and $F = \{\omega_i\}_{i=1}^{n}$, where $Y_{i,\mathrm{mis}}$ and $Y_{i,\mathrm{obs}}$ are the sub-vectors of $Y_i$ corresponding to the missing and observed values, respectively. Let $D = (Y_{\mathrm{obs}}, X)$ and $\delta = \{\delta_{ij}, i = 1, \ldots, n, j = 1, \ldots, p\}$. The marginal posterior distribution of $\theta$ given the dataset $D$ is

$$\pi(\theta \mid D) = \int \pi(\theta, F, Y_{\mathrm{mis}}, \delta \mid D)\, dF\, dY_{\mathrm{mis}} = \frac{\int \pi(\theta)\, \pi(Y \mid \theta, F, X)\, \pi(F \mid \vartheta_2)\, \pi(\delta \mid Y, \vartheta_3)\, dF\, dY_{\mathrm{mis}}}{\pi(Y_{\mathrm{obs}} \mid X)}, \tag{2}$$

and its posterior mean (i.e., the Bayesian estimate) can be evaluated as

$$\hat{\theta} = E(\theta \mid D) = \int \theta\, \pi(\theta \mid D)\, d\theta = \frac{\int \theta\, \pi(\theta)\, \pi(Y \mid \theta, F, X)\, \pi(F \mid \vartheta_2)\, \pi(\delta \mid Y, \vartheta_3)\, dF\, dY_{\mathrm{mis}}\, d\delta\, d\theta}{\pi(Y_{\mathrm{obs}} \mid X)}, \tag{3}$$

where $\pi(\theta)$ is the prior distribution of $\theta$, $\pi(Y \mid \theta, F, X)$ is the probability density function of $Y$ given $(X, F, \beta, \Lambda, \vartheta_1)$, i.e., the likelihood function of $\theta$ associated with the considered latent variable model, $\pi(F \mid \vartheta_2) = \prod_{i=1}^{n} \pi(\eta_i \mid \xi_i, \Pi, \Gamma, \vartheta_2)\, \pi(\xi_i \mid \vartheta_2)$ is the probability density of $F$, $\pi(Y_{\mathrm{obs}} \mid X) = \int \pi(\theta)\, \pi(Y \mid \theta, F, X)\, \pi(F \mid \vartheta_2)\, \pi(\delta \mid Y, \vartheta_3)\, dF\, dY_{\mathrm{mis}}\, d\delta\, d\theta$ is the marginal likelihood of $Y_{\mathrm{obs}}$ given $X$, and $\pi(\delta \mid Y, \vartheta_3)$ is the probability density of $\delta$ given $(Y, \vartheta_3)$.

From Eq. (3), it is easily seen that evaluating $\hat{\theta}$ directly is almost impossible because of the high-dimensional integral involved. To address this issue, the well-known Markov chain Monte Carlo (MCMC) algorithm is employed to approximate $\hat{\theta}$ by sequentially drawing observations from the posterior distributions of the components of $\theta$ and of $(F, Y_{\mathrm{mis}}, \delta)$ via the Gibbs sampler together with the Metropolis-Hastings algorithm. Denote by $\pi(\theta \mid Y, F, X, \delta)$, $\pi(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, F, X, \theta, \delta)$, and $\pi(F \mid Y, \theta, X)$ the conditional distributions of $\theta$ given $(Y, F, X, \delta)$, of $Y_{\mathrm{mis}}$ given $(Y_{\mathrm{obs}}, F, X, \delta)$, and of $F$ given $(Y, \theta, X)$, respectively. The Gibbs sampler is implemented as follows. At the $t$th iteration, with the current values $(\theta^{(t)}, F^{(t)}, Y_{\mathrm{mis}}^{(t)}, \delta^{(t)})$ of $(\theta, F, Y_{\mathrm{mis}}, \delta)$, we sequentially draw (i) $\beta^{(t+1)}$ from the conditional distribution $\pi(\beta \mid Y, F, X, \Lambda, \vartheta_1)$, (ii) $\Lambda^{(t+1)}$ from $\pi(\Lambda \mid Y, F, X, \beta, \vartheta_1)$, (iii) $\Pi^{(t+1)}$ from $\pi(\Pi \mid F, \Gamma, \vartheta_2)$, (iv) $\Gamma^{(t+1)}$ from $\pi(\Gamma \mid F, \Pi, \vartheta_2)$, (v) $\vartheta_1^{(t+1)}$ from $\pi(\vartheta_1 \mid Y, X, \beta, \Lambda, F)$, (vi) $\vartheta_2^{(t+1)}$ from $\pi(\vartheta_2 \mid F, \Pi, \Gamma)$, (vii) $Y_{\mathrm{mis}}^{(t+1)}$ from $\pi(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \delta, X, \beta, \Lambda, F, \vartheta_1)$, and (viii) $F^{(t+1)}$ from $\pi(F \mid Y, X, \theta)$. These conditional distributions may be familiar distributions from which observations can be drawn directly. In some cases, however, they are unfamiliar and rather complicated distributions from which observations cannot be drawn directly. In that case, alternative approaches, for example, the Metropolis-Hastings algorithm, rejection sampling, acceptance-rejection sampling, importance sampling, hybrid-jump-based sampling, and reversible jump sampling, can be employed to sample observations from these complicated distributions. The convergence of the Gibbs algorithm introduced above can be monitored by the estimated potential scale reduction (EPSR) values associated with the parameters [15], which are evaluated continuously as the iterations run. The Gibbs sampler is taken to have converged if the EPSR values of all unknown parameters are less than 1.2. Also, we can assess the convergence of the Gibbs sampler by plotting several parallel sequences of observations drawn from different starting values of the unknown parameters against the iteration number.
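To make the structure of such a hybrid Gibbs/Metropolis-Hastings scheme concrete, the following Python sketch targets a deliberately simple two-parameter posterior (a normal mean with a conjugate normal prior and a scale parameter with a $1/\sigma$ prior). It only mirrors steps (i)-(viii) in miniature; it is not the sampler for model (1), and the priors and proposal scale are illustrative assumptions.

```python
import numpy as np

def gibbs_mh_sketch(y, n_iter=5000, mu0=0.0, tau0=10.0, rw_scale=0.3, seed=0):
    """Toy hybrid Gibbs / Metropolis-Hastings sampler for y_i ~ N(mu, sigma^2)
    with priors mu ~ N(mu0, tau0^2) and p(sigma) proportional to 1/sigma.
    Step 1 draws mu from its exact (normal) full conditional; step 2 updates
    log(sigma) with a random-walk Metropolis-Hastings move.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    mu, sigma = np.mean(y), np.std(y) + 1e-6
    draws = np.empty((n_iter, 2))

    def log_post_sigma(sig, mu):
        # log p(sigma | y, mu) up to a constant: log-likelihood + log-prior (1/sigma)
        return -n * np.log(sig) - np.sum((y - mu) ** 2) / (2 * sig ** 2) - np.log(sig)

    for t in range(n_iter):
        # Gibbs step: the full conditional of mu is normal (conjugate update)
        prec = n / sigma ** 2 + 1.0 / tau0 ** 2
        mean = (np.sum(y) / sigma ** 2 + mu0 / tau0 ** 2) / prec
        mu = rng.normal(mean, 1.0 / np.sqrt(prec))

        # Metropolis-Hastings step: random walk on log(sigma)
        prop = sigma * np.exp(rw_scale * rng.normal())
        # include the Jacobian of the log transform: + log(prop) - log(sigma)
        log_ratio = (log_post_sigma(prop, mu) + np.log(prop)
                     - log_post_sigma(sigma, mu) - np.log(sigma))
        if np.log(rng.uniform()) < log_ratio:
            sigma = prop

        draws[t] = mu, sigma
    return draws

# Run on simulated data; discard an initial burn-in period before summarizing.
draws = gibbs_mh_sketch(np.random.default_rng(1).normal(2.0, 1.5, size=100))
```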

Let $\{\theta^{(m)}\}_{m=1}^{M}$, $\{F^{(m)}\}_{m=1}^{M}$, and $\{Y_{\mathrm{mis}}^{(m)}\}_{m=1}^{M}$ denote the $M$ observations sampled from the corresponding conditional distributions via the aforementioned Gibbs sampler after it has converged. Bayesian estimates of $\theta$, $F$, and $Y_{\mathrm{mis}}$ can then be computed as

$$\hat{\theta} = \frac{1}{M}\sum_{m=1}^{M} \theta^{(m)}, \qquad \hat{F} = \frac{1}{M}\sum_{m=1}^{M} F^{(m)}, \qquad \hat{Y}_{\mathrm{mis}} = \frac{1}{M}\sum_{m=1}^{M} Y_{\mathrm{mis}}^{(m)}, \tag{4}$$

respectively. The corresponding standard error estimates can be computed from the sample covariance matrices of the drawn observations; for details, see [7]. The above treatment of Bayesian inference is classical. However, for a high-dimensional parametric or nonparametric model, new approaches are needed to address the computing time as well as the efficiency and stability of the algorithm. In fact, when the dimension of the covariate matrix is large and the sample size is relatively small, i.e., the well-known "large p, small n" problem, the Gibbs sampler is computationally expensive and has poor stability.
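For completeness, here is a minimal Python sketch of the EPSR diagnostic [15] and of the posterior summaries in Eq. (4), assuming the draws are monitored scalar by scalar across $J$ parallel chains; the function names are ours, not from the chapter.

```python
import numpy as np

def epsr(chains):
    """Estimated potential scale reduction (EPSR) for one scalar parameter,
    following Gelman's formulation [15]. `chains` has shape (J, n): J parallel
    chains of n post-burn-in draws each. Values below about 1.2 are commonly
    taken to indicate convergence.
    """
    chains = np.asarray(chains, dtype=float)
    J, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    V_hat = (n - 1) / n * W + B / n          # pooled posterior variance estimate
    return np.sqrt(V_hat / W)

def posterior_summary(draws):
    """Posterior means and standard deviations from pooled draws of shape
    (n_draws, n_params), as in Eq. (4)."""
    draws = np.asarray(draws, dtype=float)
    return draws.mean(axis=0), draws.std(axis=0, ddof=1)
```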

To address this issue for high-dimensional regression models, some novel approaches have been developed for parameter/nonparametric function estimation in the Bayesian framework, for example, the Bayesian Lasso, the Bayesian adaptive Lasso, the Bayesian elastic net, and Bayesian L1/2 regularization. These approaches can be used to estimate model parameters or nonparametric functions and simultaneously select variables in a high-dimensional regression model; they have received considerable attention and have been extended to various models such as generalized linear models and linear mixed models. In particular, to reduce the computational cost, various variational Bayesian methods have been developed for a variety of models in recent years; for example, see linear mixed models [16] and references therein. However, many problems remain unsolved: for a complicated model, how do we find the optimal variational densities for approximating complicated posterior distributions? How do we relax or move beyond the mean-field assumption that underlies most variational analyses? How do we develop variational Bayesian theory based on divergence criteria other than the Kullback-Leibler divergence?
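To illustrate the mean-field idea, the following Python sketch implements coordinate-ascent variational inference for a toy conjugate normal model; the model, priors, and function name are illustrative assumptions and are unrelated to the high-dimensional linear mixed models of [16].

```python
import numpy as np

def cavi_normal(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Coordinate-ascent variational inference (mean field) for
    x_i ~ N(mu, 1/tau), mu | tau ~ N(mu0, 1/(lam0 * tau)), tau ~ Gamma(a0, b0),
    under the factorization q(mu, tau) = q(mu) q(tau), where
    q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n).
    The updates below are the standard closed-form mean-field updates
    for this conjugate model.
    """
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), np.mean(x)
    # These two quantities do not change across iterations.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    # Initialize E[tau]; any positive value works.
    e_tau = a0 / b0
    for _ in range(n_iter):
        lam_n = (lam0 + n) * e_tau                       # precision of q(mu)
        e_mu, e_mu2 = mu_n, mu_n ** 2 + 1.0 / lam_n      # E[mu], E[mu^2]
        # b_n = b0 + 0.5 * E_q[ lam0 (mu - mu0)^2 + sum_i (x_i - mu)^2 ]
        b_n = b0 + 0.5 * (lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0 ** 2)
                          + np.sum(x ** 2) - 2 * e_mu * np.sum(x) + n * e_mu2)
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

# Example: variational posterior for 300 simulated observations.
mu_n, lam_n, a_n, b_n = cavi_normal(np.random.default_rng(2).normal(1.0, 2.0, 300))
```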

3. Model comparison

Model comparison is widely used to select a plausible model for a given dataset among all of the candidate models considered. Various methods have been developed over the past years to perform Bayesian model comparison for many classes of models, such as linear/nonlinear regression models, structural equation models, multilevel models, machine learning models, and pattern recognition models.

To select a better model among all the candidate models, we can adopt well-known model selection criteria such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the deviance information criterion (DIC), the generalized information criterion (GIC), the minimum description length (MDL), the Hannan-Quinn information criterion (HIC), and the log scoring criterion (also called the conditional predictive ordinate, i.e., CPO), which trade off a measure of model plausibility against a measure of model complexity. Also, the Bayes factor [17] has been developed to conduct Bayesian model comparison and is widely utilized to quantify the strength of the evidence in favor of one of two candidate models. The Bayes factor for two competing models $H_0$ and $H_1$ is defined as

$$B_{10} = \frac{\pi(Y_{\mathrm{obs}}, \delta \mid H_1)}{\pi(Y_{\mathrm{obs}}, \delta \mid H_0)},$$

where $\pi(Y_{\mathrm{obs}}, \delta \mid H_k) = \int \pi(Y_{\mathrm{obs}}, \delta \mid \theta_k)\, \pi(\theta_k)\, d\theta_k$ is the marginal density under $H_k$ with parameter vector $\theta_k$, and $\pi(\theta_k)$ is the prior density of $\theta_k$ associated with model $H_k$ for $k = 0, 1$. In general, if the Bayes factor $B_{10} > 1$, the model $H_1$ is more plausible under the observed data than the model $H_0$, which leads to the following model comparison rule: a value of $B_{10}$ lying in the interval (3, 10), (10, 30), (30, 100), or (100, $\infty$) yields moderate, strong, very strong, or extreme evidence in favor of model $H_1$, respectively. It is rather difficult to compute $\pi(Y_{\mathrm{obs}}, \delta \mid \theta_k)$ because of the intractable high-dimensional integral involved, and thus computing the Bayes factor $B_{10}$ is challenging. Many methods have been proposed to compute the marginal likelihoods $\pi(Y_{\mathrm{obs}}, \delta \mid H_k)$ or Bayes factors [3], for example, importance sampling, path sampling, bridge sampling, the harmonic mean method, random weight importance sampling, sequential Monte Carlo methods, and Pareto-smoothed importance sampling leave-one-out cross-validation.
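In simple conjugate settings, the marginal likelihoods, and hence the Bayes factor, are available in closed form. The following Python sketch computes $B_{10}$ for a toy beta-binomial comparison in which the two hypotheses differ only in their Beta priors; the data and priors are illustrative assumptions unrelated to model (1).

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_marginal_beta_binomial(y, n, a, b):
    """Log marginal likelihood of y successes in n Bernoulli trials under
    p ~ Beta(a, b): m(y) = C(n, y) * B(y + a, n - y + b) / B(a, b)."""
    log_comb = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
    return log_comb + betaln(y + a, n - y + b) - betaln(a, b)

# H1: informative prior Beta(8, 2) (belief in a high success rate);
# H0: uniform prior Beta(1, 1). Both priors are illustrative assumptions.
y, n = 78, 100
log_B10 = (log_marginal_beta_binomial(y, n, 8, 2)
           - log_marginal_beta_binomial(y, n, 1, 1))
B10 = np.exp(log_B10)
print(f"Bayes factor B10 = {B10:.2f}")  # B10 > 3 would be moderate evidence for H1
```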

One serious defect of the Bayes factor for model comparison is that it is not well defined for improper priors of the $\theta_k$'s, and it is sensitive to the selection of the hyperparameters in the priors. In our experience, different priors together with different sampling methods lead to different values of the Bayes factor, i.e., to different model comparison results. To this end, some modifications of the Bayes factor have been proposed, for instance, the partial Bayes factor, the intrinsic Bayes factor, and the fractional Bayes factor, which are subject to a more or less arbitrary selection of training samples, of weights for averaging over training samples, and of fractions, respectively. Also, some robust methods have been developed to assess the sensitivity of the marginal likelihoods via simulation, known as automated prior robustness methods. Recently, some novel methods have been proposed to deal with improper priors in computing the Bayes factor. For example, a machine learning approach first uses part of the dataset under study to train the Bayes factor, i.e., to transform the improper prior into a proper prior, and then uses the remainder of the dataset for model comparison, which provides a new idea for computing the Bayes factor. The robustness of model comparison is a challenging topic that is worth further study.

References

  1. Tang N, Liu C, Shi JQ, Huang Y. Editorial: Bayesian inference and AI. Frontiers in Big Data. 2022;5:1-2
  2. Tang AM, Tang NS. Semiparametric Bayesian inference on skew-normal joint modeling of multivariate longitudinal and survival data. Statistics in Medicine. 2015;34:824-843
  3. Lee SY, Tang NS. Bayesian analysis of structural equation models with mixed exponential family and ordered categorical data. British Journal of Mathematical and Statistical Psychology. 2006;59:151-172
  4. Lee SY, Tang NS. Analysis of nonlinear structural equation models with nonignorable missing covariates and ordered categorical data. Statistica Sinica. 2006;16:1117-1141
  5. Kim S, Dahl DB, Vannucci M. Spiked Dirichlet process prior for Bayesian multiple hypothesis testing in random effects models. Bayesian Analysis. 2009;4:707-732
  6. Tang N, Wu Y, Chen D. Semiparametric Bayesian analysis of transformation linear mixed models. Journal of Multivariate Analysis. 2018;166:225-240
  7. Gallant AR, Nychka DW. Semi-nonparametric maximum likelihood estimation. Econometrica. 1987;55:363-390
  8. Wright WA. Bayesian approach to neural-network modeling with input uncertainty. IEEE Transactions on Neural Networks. 1999;10:1261-1270
  9. Kim JK, Yu CL. A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association. 2011;106:157-165
  10. Tang NS, Zhao PY, Zhu H. Empirical likelihood for estimating equations with nonignorable missing data. Statistica Sinica. 2014;24:723-747
  11. Wang ZQ, Tang NS. Bayesian quantile regression with mixed discrete and nonignorable missing covariates. Bayesian Analysis. 2020;15:579-604
  12. Liu M, Zhang Y, Zhou D. Double/debiased machine learning for logistic partially linear model. The Econometrics Journal. 2021;24:559-588
  13. Ibrahim JG, Chen MH, Sinha D. On optimality properties of the power prior. Journal of the American Statistical Association. 2003;98:204-213
  14. Nifong B, Psioda MA, Ibrahim JG. The scale transformed power prior for use with historical data from a different outcome model. arXiv preprint. DOI: 10.48550/arXiv.2105.05157
  15. Gelman A. Inference and monitoring convergence. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov Chain Monte Carlo in Practice. London: Chapman and Hall; 1996. pp. 131-143
  16. Yi JY, Tang N. Variational Bayesian inference in high-dimensional linear mixed models. Mathematics. 2022;10:463
  17. Kass RE, Raftery AE. Bayes factors. Journal of the American Statistical Association. 1995;90:773-795
