Open access peer-reviewed chapter

Bayesian Analysis for Hidden Markov Factor Analysis Models

Written By

Yemao Xia, Xiaoqian Zeng and Niansheng Tang

Reviewed: November 30th, 2017 Published: December 20th, 2017

DOI: 10.5772/intechopen.72837

Chapter metrics overview

1,210 Chapter Downloads

View Full Metrics


The purpose of this chapter is to provide an introduction to Bayesian approach within a general framework and develop a Bayesian procedure for analyzing multivariate longitudinal data within the hidden Markov factor analysis framework.


  • hidden Markov factor analysis model
  • Markov chain Monte Carlo sampling
  • cocaine use

1. Introduction

The Bayesian approach is now well recognized in the statistics literature as an attractive approach to analyzing a wide variety of models [1], and there is rich literature on this issue. Here, we are not going to present a full coverage on the general Bayesian theory, and readers may refer to excellent books, for example [2, 3], for more details for this general statistical method. This chapter provides an introduction to the Bayesian approach within a general framework and develops a specific Bayesian procedure for analyzing multivariate longitudinal data within the hidden Markov factor analysis framework. We begin with the basic ideas of the Bayesian approach and then describe the model under consideration in the second section. The following section considers Bayesian inferences including parameter estimation, model selection, and posterior density estimates. The final section demonstrates the practical value of proposed methodology to cocaine use data to get some Bayesian results. Some technical details are given in the Appendix.

Consider a data set Y with the probability model pYθ where θ is a univariate or multivariate population parameters vector, which quantifies the uncertainty of data. In the statistical literature, pYθ is called likelihood or sampling distribution and often represented as Lθ. From the frequency statistics point of view, statistical inferences are carried out based on Lθ. In this case, θ, though unknown, is treated as fixed. Unlike the frequency statistical inferences, the Bayesian approach for data analysis assumes that θ is random and has a distribution πθ. This distribution, which represents the knowledge about θ, is referred to as prior distribution or prior. When data are available, the information on θ is summarized within the posterior distribution or posterior, a conditional distribution θ given data, i.e.,


where pY=pYθπθ is the marginal distribution of Y. The right-hand-side term in (1) omits the factor pY since given Y it is a known constant. In Bayes literature, pYθpθ is also termed the unnormalized posterior. Analogous to the role of likelihood in frequency statistical inferences, posterior is the starting point of Bayesian inferences.

Selecting proper priors for parameters is fundamental to Bayesian analysis. Basically, there are two kinds of prior distributions, namely, the noninformative prior distributions and the informative prior distributions. Noninformative prior distributions associate with situations when the prior distributions have no population basis. They are used when we have little prior information on θ and desire that the prior distributions play a minimal role in the posterior distribution distribution. Informative prior distribution represents the distribution of possible parameter values, from which the parameter θ has been drawn. We may have prior knowledge about this distribution, either from closed related data or from the subjective knowledge of experts. A commonly used informative prior distribution in the general Bayesian approach to statistical problems is the conjugate prior distribution, a prior ensuring that the posterior distribution follows the same parametric form as the prior distribution [1, 3].

A potential difficulty underlying Bayesian inferences is the statistical computation when posterior distribution takes on the complicated form. This is particularly true in the situation where latent variables or other unobservable quantities are involved in the model, as discussed in this chapter. In such cases, statistical inferences usually recur to simulation-based methods. Among various sampling methods, Markov chains Monte Carlo methods (MCMC) provide powerful tools for simulating observations from posterior. The key to Markov chain simulation is to create a Markov sequence whose stationary distribution is a specified posterior pθY. Posterior inferences are carried out based on these simulated observations. There are many ways of constructing these Markov chains, but all of them, including the Gibbs sampler [4, 5], are special cases of the general framework of Metropolis et al. [6] and Hastings [7]. However, we do not intend to pursue this issue here, and details on simulation-based methods can be referenced to [2, 3, 8, 9].

In what follows, as an illustration, we will develop a Bayesian analysis procedure for multivariate data under longitudinal setting. Multivariate longitudinal or clustered data occur when multiple items are measured repeatedly over periods of time or across occasions. Under such setting, the primary interest is inference about the dependence of the multiple measurements and the temporal correlation resulting from the repeated measures on the same items. But more often, particular interest also focuses on exploring the potential heterogeneity of data and investigating its transition pattern over time. In these cases, hidden Markov latent variable model (HMLVM) [1013] provides a feasible and unified framework to address these issues. HMLVM assumes that the overall model constitutes the observed process and the underlying hidden state process. The state process, as the convention in the classic HMM (see for example, [1417]), is an univariate discrete process, which follows a first-order Markov chain, while the observed process, conditional on the state sequence, is an independent process with emission distribution specified via LVMs [18]. Hence, in this regard, HMLVM provides a unified way of describing the correlation of multiple items, temporal dependence, and heterogeneity among the data simultaneously. However, the current existing developments cited beforehand focus on the maximum likelihood analysis in which statistical inferences heavily depend on the asymptotic properties. As an illustration of Bayesian inferences on practical problems, in this chapter, we develop a Bayesian procedure to analyze cocaine use data within the hidden Markov factor analysis model framework. Compared to ML, a basic nice feature of a Bayesian approach is its flexibility to utilize useful prior information for achieving better results. Additionally, simulation-based Bayesian methods depend less on asymptotic theory and hence have the potential to produce more reliable results even with small samples.


2. Model description

2.1. Hidden Markov factor analysis model

Consider a set of multivariate longitudinal observations formed by p-dimensional observed vectors yit=yit1yitp, which are recorded on p items over periods of length T: t=1,,T across N subjects: i=1,,N. In the field of multivariate analysis, interest mainly focuses on exploring item dependence since measurements may be highly correlated arising from the multicollinearity problem. But more often, interest also concentrates on the heterogeneity resulting from the situation where the population of yit constitutes more than one component. This is particularly true in the situation where the data illustrate extreme behaviors such as multimodal and/or skewed characteristics. In these cases, a finite mixture factor analysis model (FMFAM) can provide a powerful tool to address these issues. Typically, FMFAM assumes that conditioning on an univariate discrete value state variable zit and an m-dimensional (m<p) continuous latent factor vector ωit, yit are independent and distributed with a p-dimensional multivariate normal distribution, and meanwhile, given zit, ωit also follows an m-dimensional normal distribution, that is,


where μr=μr1μrp is a p-dimensional intercept vector, which represents the baseline level of yit, Λr=Λr1Λrp is a p×m factor loading matrix, Ψϵr=diagΨϵkr1Ψϵkrp is a p×p diagonal matrix with the jth diagonal element Ψϵkrj>0, and Φr is an m×m positive definite matrix.

Formulation given in (2) has two basic features: one is to characterize heterogeneity of population of yit at the occasion level and the other is to establish the dependence among the multiple measurements. The heterogeneous population is specified via state-specific parameters contained in the model while the dependence between different measurements is identified via sharing the common factors in the manner of liner combinations. In particular, apart from explaining the idiosyncratic part of measurements, latent factors also characterize the association between any two measurements. As a matter of fact, one can show that the correlation coefficient between yitj and yitk at state zit is given by


in which λrjk is the jkth element of Λr and Φr,hk is the hkth element in Φ, respectively. The strength of correlation is identified by the factor loadings and covariance of factors together. In the case when ωit degenerates to zero (i.e., Φ=0) or Λ=0, the association among items disappears and model (2) reduces to p-independent mean-variance models within cluster r. Hence, latent factors play a dominant role in characterizing association of multiple items. Note that, in actual applications, latent factors, though unobservable, often have their own physical interpretations. In psychology, for example, latent factors are often used to identify concepts such as treatment, temper, and anxiety, which are important within the framework of theoretical models. The measurements are just proxies for these unobserved concepts of interest. We will provide further interpretations in the real example.

The primary reason for collecting information on multiple occasions for each subject is that it allows investigation of change and/or temporal dependence over time within the subject. There exist various constructs for characterizing dynamic characteristics. A commonly used method is to construct proper dynamic structures for latent factors and establish dynamic factor models, see for example, [1921]. An alternative choice we adopt here is specifying the joint distribution for state sequences. Following the common routine (see, for example, [22, 23]), we assume that each individual state sequence zi=zi1ziT satisfies the following first-order hidden Markov model


where pzi1 and pzitzi,t1 are, respectively, the initial distribution and transition probability given by


where S is a positive integer, δ=δ1δS is an S×1 vector satisfying δr0 and r=1Sδr=1.0, and Q=Qrs is an S×S transition matrix with the rsth entry being Qrs, that is, Qrs0 and s=1SQrs=1.0 for r=1,,S. Modeling state sequences into (5) allows us to explore the transition pattern of individuals across occasions exactly. For example, in the cocaine use data analysis, zit is often identified with the latent state of patient i at time t, then Qrs specifies how individual i being in state r transfers to state s on two successive occasions. Surely, we can relax the time-homogeneous assumption of transition probabilities by including relevant covariates to interpret the inhomogeneous transition behavior among observation data (see, for example, [12, 13, 16]) but at the expense of computational burden.

The current model defined in (2)(5) provides a comprehensive framework for modeling the multivariate longitudinal data with the latent variables. It accommodates the dynamic behavior of observed sequences, heterogeneity of observed data at the occasion level, and dependence among the multiple items simultaneously. In particular, it makes sense to measure effects of latent factors on the manifest variables quantitatively.

Let Y be the collection of all observations, and Ω be the set of corresponding factors. Denote Z=zit:1iN1tT be set of state variables. It follows from Eqs. (2), (4) and (5) that the joint sampling distribution of Y,Ω, and Z is given by


where θ is formed by free parameters in μr,Λr,Ψr, and Φr. Here, we write a2=aa and denote IA the indicator function of a set A. The observed data likelihood is then achieved by taking integration of pYΩZθδQ over Ω and Z, which involves high-dimensional integrations.


3. Posterior inferences

3.1. Prior specifications

Let μ=μr, Λ=Λr, Ψϵ=Ψϵr, and Φ=Φkr. For the Bayesian analysis, we need to assign priors to the unknown parameters involved for completing model specification. Since θ, δ, and Q are involved in different submodels, it is natural to assume that θ,δ, and Q are mutually independent and the components contained in θ are also mutually independent, that is,


For the convenience of conjugacy, we assume that the parameters are drawn from the following commonly used conjugate types prior distributions (see for example [24]).

pμ=r=1Spμr=Dr=1SNpμ0rΣ0r,Φr=1SpΦr=Dr=1SWm1ρ0rR0r1,pΛΨϵ=r=1SpΛrΨϵr×pΨϵr =Dr=1Sj=1pNmΛ0rjψϵrjHϵ0rjGa1αϵ0rjβϵ0rj,δδ0DirSγ0γ0,pQ=r=1SpQr=Dr=1SDirSν0ν0E8

where ‘Ga1ab’ denotes the inverse Gamma distributions with shape a>0 and scale b>0 and ‘Wm21ρ0rR0r1’ represents the q-dimensional inverse Wishart distribution with ρ0r degrees of freedom and m2×m2 scale matrix R0r; Qr is the rth row vector of Q. The scalars αϵ0rj, βϵ0rj, ρ0r, γ0, ν0, the vectors μ0r, Λ0rj, and the matrices R0r and Hϵ0rj are assumed to be known. Thus, standard conjugate priors were specified for all parametric components in the model. The conjugate type prior distributions are sufficiently flexible in most applications, and for situations with a reasonable amount of data available, the hyperparameters scarcely affect the analysis. It should be noted that although Eq. (8) allows different hyperparameters for different latent states, in practice, we choose identical priors for all s. Details of hyperparameter choices are discussed later when we present the empirical results.

3.2. Gibbs sampling scheme and posterior analysis

Combining the sampling distribution for the observable yit’s and the prior distribution specified in (8) yields the joint posterior distribution of θδQ given by


where we ignore the normalization constant pY. However, due to the latent factors and state variables present, the computation of pYθδQ is intractable since it involves high-dimensional integrals. Consequently, no closed form can be available for the posterior pθ,δ,Q|Y. This problem can be addressed via the data augmentation idea in Tanner and Wong [25]. Data augmentation technique treats the latent quantities ΩZ as the hypothetical missing data and augments them with the observed data to form complete data. The posterior analysis is now carried out based on the joint distribution pΩZθδQY, which is proportional to pYΩZθδQpθδQ, the product of likelihood of complete data and priors. Compared to the intractable observed data likelihood, the complete data likelihood has nice hierarchical structure based on conditional independent assumptions in (2) and (4) and hence is relatively easy to analyze. However, pΩZθδQY is still not in closed form and is thus difficult to deal with analytically. In this regard, simulation-based methods can be used to generate observations to carry out posterior analysis. In view of the multiple components involved, the usual independent sampling methods are not feasible. Note that, on the basis of complete data, the full conditional distributions of Ω,Z,θ,δ, and Q have closed forms. This provides a solid foundation for Markov chain Monte Carlo methods. Markov chain Monte Carlo sampling does not draw observations from pΩZθδQY directly. On the contrary, it generates observations from the full conditionals of each component alternatively, thus forming the dependent sample, i.e., Markov chains. Specifically, as pointed out in the introduction, we use Gibbs sampler [4, 5] to draw observations from this target distribution. Obviously, the sampling scheme in the Gibbs sampler includes two types of moves: updating the components involved in the factor analysis model and updating the components related to the hidden Markov model. We propose using the following Gibbs sampler which iteratively simulates from the conditional distributions, where variables are removed from the conditioning set either by explicit integration or by conditional independence. The steps involved in the Gibbs sampler are

  1. Step a: Generate Z from pZθδQΩY

  2. Step b: Generate Ω from pΩZθY

  3. Step c: Generate μΛΨϵ from pμΛΨϵZΩY

  4. Step d: Generate Φ from pΦZΩ

  5. Step e: Generate δ from pδZ

  6. Step f: Generate Q from pQZ

Under mild conditions and similar to [4] (see also, for example, [26]), one can show that for sufficiently large b, say B0, the joint distribution of ΩbZbθbδbQb converges at an exponential rate to the desired posterior distribution pΩZθδQY. Hence, pΩZθδQY can be approximated by the empirical distribution of ΩbZbθbδbQb:b=B0+1,,B0+B} where B is chosen to give sufficient precision to the empirical distribution. The convergence of the Gibbs sampler can be monitored by the ‘estimated potential scale reduction (EPSR)’ values as suggested by Gelman and Rubin [27] or by plotting the traces of estimates against iterations under different starting values.

Simulated observations obtained from the posterior can be used for statistical inferences via straightforward analysis procedures. For brevity, let θbδbQbΩbZb be the random observations generated by the Gibbs sampler from pθδQΩZY. The joint Bayesian estimate of θ and Ω can be obtained easily via the corresponding sample means of the generated observations as follows:


Clearly, these Bayesian estimates are consistent estimates of the corresponding posterior means, see [26]. The consistent estimates of covariance matrix of estimates can be obtained as follows:


Hence, the standard error estimates can be obtained conveniently by the Gibbs sampler algorithm. Other statistical inferences about θ and Ω such as deriving the confidence intervals and statistics for hypothesis testing can be achieved based on the simulative observations as well (see, for example, [28, 29]).

One important statistical inference beyond estimation is on testing of various hypotheses about the model. In the field of hidden Markov modeling, determining the proper number of states may be the first step towards data analysis. Too many states may overfit the observations, meaning that it can fit the training data accurately but may not be a good model for underlying data-generating process. On the other hand, too few states may not be flexible enough to approximate the underlying model. In the context of Bayesian model selection, Bayes factor (BF, [30]) is a popular choice for model comparison. BF is defined as the ratio of the marginal likelihoods of data under two competing models. However, the computation of BF is difficult since it often involves the high-dimensional integrations. It has also been shown that BF is sensitive to the choice of priors and will become infeasible when improper priors are used. A simple and more convenient alternative is the Lν-measures [3134] which is based on the posterior predictive density. It has been shown [34] that this approach is conceptually and computationally simple and is useful in model checking for wide varieties of complicated situations. Moreover, the required computation is a by-product of the common Bayesian simulation procedures such as the Gibbs sampler or its related algorithms. Specifically, let Yrep denotes future values of Y in a replicate experiment, that is, Yrep has the same sampling density as that of Y. The posterior predictive distribution pYrepY is defined as


Naturally, if the posited model under consideration is the true model in the sense that from which the data are generated, then Yrep would behave like data Y and its squared biases and covariances should be small. With this notion in mind, Ibrahim, Chen, and Sinha [34] proposed an L statistics to assess the fitness of posited models to the data by weighting the squared biases and covariance, which can be interpreted as a trade-off between them. Here, we extend it to the multivariate longitudinal setting. Let Yrep=y1repyNrep be a collection set of future responses in our proposal. For some 0ν<1, we consider the following multivariate version of Lν-measure:


where the expectation is taken with respect to the posterior predictive distribution. Clearly, small values of the Lν-measure indicate that the model gives predictions close to the observed values, and the variability in the predictions is low as well. Hence, the model with the smallest Lν-measure is selected from a collection of competing models. It has been shown that Lν-measure with ν=0.5 has nice theoretical properties [34]. Thus, this value of ν will be used in our empirical illustrations.


4. Cocaine use data analysis

In this section, a small portion of cocaine use data is analyzed to illustrate the practical value of the proposed methodology. The original data are collected from 321 cocaine use patients who were admitted in 1988–1989 to the West Los Angeles Veterans Affairs Medical Center. The whole data constitute 68 measurements of 17 items, which were recorded at four time points: at baseline, 1 year after the treatment, 2 years after the treatment, and 12 years after the treatment in 2002–2003. These measurements cover the information on the cocaine use, treatment received, psychological problems, social status, employments, and so on. As an illustration, three variables are selected to conduct data analysis: ‘y1: days of cocaine use per month at intake (CC)’, ‘y2: times per month in formal treatment (FT)’, and ‘y3: months in formal treatment (MFT)’, which, respectively, represent the severity of cocaine use and the levels of treatment received by a patient. Since these variables were measured in 0–120 points scale, to unify the scales, we take logarithms and standardize them. Among them, some measurements are missing. The missing proportion is about 8.4%. For brevity, we assume that the missing is missing at random [35]. A distinct characteristic underlying data are nonnormal and heavy tailed. Figure 1 gives the plots of histograms and the posterior predictive density estimates (see below) of logarithms of CC, FT, and MFT (with missing data removed) on four occasions. The histograms illustrate that the distributions of selected variables are deviated from normality in terms of multimodality and skewness. The skewness and kurtosis of CC on four occasions are 1.6315.031, 0.8473.354, 0.3281.476, and 0.4732.467, respectively. Data set also demonstrates dynamic characteristics. The distribution of CC, for instance, is skewed to the left at baseline and moves to the right gradually on the following two occasions and becomes right-skewed eventually. This implies that a single factor analysis model may not be appropriate to fit the data at each time point.

Figure 1.

Plots of histograms and posterior predictive density estimates of ‘CC’, ‘FT’ and ‘MFT’ under FA model and hidden Markov CFA model with seven states in the cocaine use data analysis: the dashed lines denote CFA and the solid lines represent the hidden Markov FA.

In this analysis, one of the objectives is to explore the effects of latent factors on the observed variables and assess the dependence among latent factors. Based on the nature of the problem under consideration, it is natural to group the single variable ‘CC’ to reflect one latent factor ‘cocaine use’ (η) and to group ‘FT’ and ‘MFT’ to represent another latent factor ‘treatment’ (ξ). Let yit=yit1yit2yit3 and ωit=ηitξit. To be convenient for interpretation and computation, Φr and Λr are restricted to be invariant across states but leave the baseline level μr varying with r. Further, the following non-overlapped structure for factor loading matrix is considered


where parameters with an asterisk are treated as fixed for identification. Note that fixing Λ11=1 indicates that η is identified with CC. This is similar to that in Λ22. Hence, in this case, Φ12 in Φ measures the magnitude of dependence of ξ on η.

Data set is fitted to the proposed models with 10 different transition models: S=1,,10. Although these state spaces are in nested forms, the corresponding models are not, since one cannot be reduced to another by constraining parameters in the interior of parameter space. This indicates that chi-square distribution may not be suitable for the classic likelihood ratio test statistic. We use L-measure to implement model selection. Obviously, if S1 is taken, then the proposed model reduces to common factor analysis model (CFA, [18]).

The following inputs are taken for the super-parameters involved in the prior distributions (8): for r=1,,S, μ0rj=minyitj+r/S, Σ0r=Syy/S, where Syy is the sample covariance matrix of data. The entries in Λ0 are set to be zeros, ρ0=10.0, R01=7.0×I2, which leads to the mean of Φ equal to I2, Hϵ0=I3, αϵ0j=9.0, βϵ0j=8.0, ν0=γ0=0.1. Note that these values are the standard inputs in the latent variable analysis (see [24]). We also took other values for these inputs and found that the resulting estimates are scarcely affected.

We implement the proposed algorithm given in Section 3 to conduct Bayesian analysis. Let Yobs be the collection of observed data and Ymis be the set of missing data. Due to the missing data, we need to draw Ymis from pYmisΩZθYobs in MCMC sampling. This can be implemented easily since conditioning on Ω, Z, and θ, pYmisΩZθYobs, independent of Yobs, has the normal distribution. Hence, drawing Ymis is rather straightforward and fast. To obtain some idea about the number of the Gibbs sampler iterations in getting convergence, we conducted a few test runs as a pilot study and found that in all these runs, the Gibbs sampler converged in about 1000–2000 iterations, where the EPSR values [27] are less than 1.2. So, for all cases under consideration, we collect 3000 random observations after initial 2000 iterations being removed for posterior analysis.

We calculate the values of L0.5 under each fitting. For computation, we use simulation-based method by drawing predictive values Yobsrep from pYobsrepYobs, where Yobsrep is the hypothetical replication of Yobs. Note that pYobsrepYobs=pYobsrepΩZθpΩZθYobsdΩdZdθ. Hence, drawing Yobsrep is rather easy when Ω,Z, and θ are available. Given that we have M simulations from the posterior of Ω,Z,θ via MCMC sampling discussed before, we just draw one Yobsrep from pYobsrepΩZθ for each Ω,Z, and θ and obtain M simulations in the end for Yobsrep. Based on these simulated observations, Lν measures can be estimated consistently via sample means. We draw 3000 observations after convergence of MCMC algorithm for calculating L0.5 and the results are reported in Table 1.


Table 1.

Summary of L0.5 under competing models in the analysis of cocaine use data.

Examination of Table 1 indicates that the proposed model with six to eight latent states seems to give better fits to the data. Furthermore, we calculate the posterior predictive density estimates of ytjt=14j=13 under one state and seven states, respectively (see Figure 1). It can be seen clearly that our proposed method is successful in capturing the skewness and modes of data while factor analysis model fails. For the computation details, we choose 60–100 equally spaced grids in the interval minyobs,itj1.0maxyobs,itj+1.0 and collect 3000 simulated observations from the Gibbs sampler at each point after removing initial 2000 iteration as burn-ins.

Table 2 presents the summary of Bayesian estimates of unknown parameters and their standard errors using the formula given in (11) with S=1 (denoted by FA) and S=7 (denoted by HMFA). For comparison, maximum likelihood estimates of unknown parameters with their standard deviations under HMFA are also presented in Table 2. The maximum likelihood analysis is conducted via MCECM algorithm [36] and the standard error estimates are calculated via Louis formula [37].


Table 2.

Summary statistics for Bayesian and ML estimates in the cocaine use data analysis.

Based on Table 2, we can find the following facts: First, three estimates of Λ32 give the positive effects of latent factor ξ on the ‘MFT’. This is not surprising since ξ is related to the treatment level of a patient received. But there are obvious differences in magnitudes among the three methods. For FA and HMFA, the former gives Λ̂32=0.001 associated with standard deviation 0.014, while the latter gives Λ̂32=0.752 with standard deviation 0.045. This reflects that the heterogeneity of data affects the estimates Λ̂32 seriously. Compared to the previous two methods, ML method produces that Λ̂32=0.196 with SD = 0.029, which are in between them. Second, the estimates of variance parameters Ψϵj under S=1 are larger than those under S=7. This indicates that factor analysis model accommodates heavy tails of data at the expense of variance inflation. Further investigations on the estimates of Φjj under FA and HMFA also reveal the same phenomenon as that of Ψϵj. However, we observe that the ML estimate of Ψϵ3, the unique variance corresponding to the third item, is equal to 0.008 with SD = NAN, an illogical number, which is very close to an improper Heywood case. As pointed out by Lee [18], Heywood cases in the ML estimation can be avoided by imposing an inequality constraint on Ψϵ3 with a penalty function. In the Bayesian approach, the conjugate prior distribution of Ψϵ31 specified Ψϵ3 in a region of positive values and hence has a similar effect as adding a penalty function. Hence, no Heywood cases are found in the Bayesian solution because of the penalty function induced by the prior distribution on Ψϵ31. Third, three estimates give the negative correlation between η and ξ, which is consistent with the fact that the improvement of treatment will decrease the intensity of cocaine use, thus leading to a decrease of cocaine use in days. ML estimates for Φjk are very close to those under HMFA. However, the estimate of Φ12 under S=1 is −0.018, which is quite different from 0.182 for S=7. Furthermore, the coefficients of correlation of ξ and η under S=1 and S=7 are −0.0204 and −0.6612, respectively. The former suggests that ξ and η are approximately independent while the latter implies stronger dependence between them.

Moreover, we computed the posterior probabilities Pzt=rYobs for r=1,,7 and t=1,,4 under S=7 based on 10,000 simulated observations drawn from pZYobs and found that the transition path corresponding to the maximum posterior probability is 7111. This implies that latent state of the patient being in is extremely serious at baseline and becomes moderate in the subsequent treatments. This also reflects a positive effect of intervention on the patient’s latent state. Note that unlike the common Viterbi algorithm in exploring the optimal transition path of states in ML analysis, calculating posterior probability Pzt=rYobs within Bayesian framework is a by-product of the estimation procedure. This voids the complex computation of marginal likelihood of the observed data and hence is very fast.


5. Discussion

This chapter reviews Bayesian inferences within a general framework and proposes a Bayesian procedure for analyzing hidden Markov factor analysis model under multivariate longitudinal setting. Compared to ML method, the pragmatic advantage of Bayesian framework is its flexibility and generality for coping with very complex problems. When good prior information can be available, results obtained from Bayesian method are more reliable and accurate than that under ML. With increased access to computation advances in simulation-based approaches, in particular the MCMC methodology, Bayesian inferences provide enormous scope for realistic statistical modeling.

Although we concentrate our attention on applications of the hidden Markov factor analysis model, the methodology developed in this chapter can be extended to the case where the LVM is nonlinear. Another possible extension is to consider a dynamic LVM, wherein model parameters vary over time. These extensions will raise theoretical and computational challenges and certainly require further investigation.



The authors are grateful to the editor’s valuable suggestions and comments which have greatly improved the manuscript. Xia’s work was fully supported by grant from National Nature Science Foundation of China (No. 11471161) and Tang’s work was supported by National Science Fund for Distinguished Young Scholars of China (No. 11225103).


  1. (a) pZθδQΩY

Let ωi denote the sequence of latent factors across T occasions for individual i. To draw state variables Z from pZδQθΩY, we first notice that


Hence, drawing Z can be accomplished via single-component method by drawing zi independently from pziωiδQθyi. Furthermore, notice that the sequences yiωizi are still the one-order Markov sequences. Hence, we can simulate zi through a well-known forward filtering-backward sampling algorithm (see, for example, [38]). For notation clarity, we suppress θ, δ, and Q in the following derivations.

Forward filtering-backward sampling (FFBS) consists of first forward filtering (FF) and then backward sampling (BS). The forward filtering step recursively updates


Here yi,1:t represents the set of observations of subject i up to time t and so are ωi,1:t and zi,1:t. The backward sampling is to draw zi from the joint distribution of the states given the data using

pzi,1:Tωi,1:Tyi,1:T=pziTωi,1:Tyi,1:T  pzi1zi,2:Tωi,1:Tyi,1:T.E18

That is, we first draw the last state given all the data and then work backwards in time drawing each state conditional on all the subsequent ones.

To implement forward filtering, let


Obviously, αi1r=δrpyi1ωi1zi1=r. Moreover, it can be shown that


The outputs αitt=1T from recursive Eq. (20) can be used to calculate the posterior probability


which leads to the forward filtering (FF) iteration.

The backward sampling step depends on the observation that

pzitzi,t+1:Tωi,1:Tyi,1:Tpzitzi,t+1:Tωi,1:Tyi,1:T= pzitωi,1:tyi,1:tpzi,t+1:Tωi,t+1:Tyi,t+1:Tzitωi,1:tyi,1:t= pzitωi,1:tyi,1:tpzi,t+1:Tzitωi,1:tyi,1:tpωi,t+1:Tyi,t+1:Tzi,t+1:Tzitωi,1:tyi,1:t= pzitωi,1:tyi,1:tpzi,t+1:Tzitpωi,t+1:Tyi,t+1:Tzi,t+1:TE22

The last equation holds since given zit, yi,t:Tωi,t:Tzi,t+1:T does not depend on the previous values due to the Markov Chain characteristics of yitωitzit. This leads to


Hence, FFBS algorithm for drawing zi is implemented by


  1. running the recursion αit and stored the conditional probabilities αi,tt for t=1,,T;

  2. sampling ziT from the filtered conditional probability αi,TT;

  3. for t=T1,,1, sampling zit from the conditional probability

  1. (b) pΩZθY

To draw Ω, we first note that


in which


with r=zit. Hence, similar to that in drawing Z, updating Ω can be achieved by drawing ωit independently from pωitzitθyit for i=1,,N and t=1,,T. It can be shown that


in which

  1. (c) pμΛΨϵZΩY

To draw μΛΨϵ, we can first draw μ from pμΛΨϵZΩY and then draw ΛΨϵ from pΛΨϵμZΩY. For this end, let n̂r=#zit=r be the size of cluster zit, and let witr=Izit=r. Denote


be the sample means and covariance matrices of Y and Ω within the rth cluster, respectively. By some algebra calculations, it can be shown that






in which y¯jr is the jth element in Y¯r, Syyjjr is the jth main diagonal element of Syyr, and Sωyjr is the jth column vector of Sωyr.

  1. (d) pΦΩZ

From the prior distribution of Φr1 and the distribution of Ω, it can be shown that


where n̂r and Sωω are given in (c). Hence, pΦrΩZ is the m-dimensional inverse Wishart distribution Wm1n̂r+ρ0rn̂rSωωr+R01. It can be shown from exactly the same reasoning as before that drawing Φ can be achieved by drawing Φr from pΦrΩZ independently.

  1. (e) pδZ and (f) pQZ

It can be verified directly that


in which n̂1r=i=1NIzi1=r. Similarly, it can be shown that


in which n̂rs=i=1Nt=2TIzit1=rzit=s.


  1. 1. Berger JO. Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag; 1985. DOI: 10.1007/978-1-4757-4286-2
  2. 2. Box GEP, Tiao GC. Bayesian Inference in Statistical Analysis. Reading, MA: Addison-Wesley; 1973. DOI: 10.1002/9781118033197
  3. 3. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. London: Chapman & Hall Ltd; 1995
  4. 4. Geman S, Geman D. Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1984;(6):721-741. DOI: 10.1109/TPAMI.1984.4767596
  5. 5. Gelfand AE, Smith AFM. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association. 1990;85:398-409. DOI: 10.1080/01621459.1990.10476213
  6. 6. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equations of state calculations by fast computing machine. Journal of Chemical Physics. 1953;21:1087-1091
  7. 7. Hastings WK. Monte Carlo sampling methods using Markov chains and their application. Biometrika. 1970;57(1):97-109. DOI: 10.1093/biomet/57.1.97
  8. 8. Robert CR, Casella G. Monte Carlo Statistical Methods. New York, Inc.: Springer-Verlag; 1999. DOI: 10.1007/978-1-4757-3071-5
  9. 9. Ross SM. Simulations. Amsterdam: Academic Press/Elsevier, Inc.; 2013. DOI: 10.1016/B978-0-12-375686-2.00001-7
  10. 10. Schmittmann VD, Dolan CV, Han LJ, van der Maas, Neale CM. Discrete latent Markov models for normally distributed response data. Multivariate Behavioral Research. 2005;40(4):461-488. DOI: 10.1207/s15327906mbr4004_4
  11. 11. Xia YM, Gou JW, Liu YA. Semi-parametric Bayesian analysis for factor analysis model mixed with hidden Markov model. Applied Mathematics A Journal of Chinese Universities, Series A. 2015;30(1):17-30
  12. 12. Song XY, Xia YM, Zhu HT. Hidden Markov latent variable models with multivariate longitudinal data. Biometrics. 2017;73(1):313-323. DOI: 10.1111/biom.12536
  13. 13. Xia YM, Tang NS, Gou JW. Generalized linear latent model for multivariate longitudinal measurements mixed with hidden Markov model. Journal of Multivariate Analysis. 2017;152:259-275. DOI: 10.1016/j.jmva.2016.09.001
  14. 14. Wiggings LM. Panel Analysis: Latent Probability Models for Attitude and Behavior Processes. San Francisco, CA: Elsevier Scientific; 1973
  15. 15. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989;77(2):257-284. DOI: 10.1109/5.18626
  16. 16. Altman RM. Mixed hidden Markov models: An extension of the hidden Markov mode to the longitudinal data setting. Journal of the American Statistical Association. 2007;102(477):201-210. DOI: 10.1198/016214506000001086
  17. 17. Maruotti A. Mixed hidden Markov models for longitudinal data: An overview. International Statistical Review. 2011;79(3):427-454. DOI: 10.1111/j.1751-5823.2011.00160.x
  18. 18. Lee SY. Structural Equation Modelling: A Bayesian Approach. New York: John Wiley & Sons; 2007
  19. 19. Dunson DB. Dynamic latent trait models for multidimensional longitudinal data. Journal of the American Statistical Association. 2003;98(463):555-563. DOI: 10.1198/016214503000000387
  20. 20. Zhang ZY, Hamaker EL, Nesselroade JR. Comparisons of four methods for estimating a dynamic factor model. Structural Equation Modeling: A Multidisciplinary Journal. 2008;(3, 377):377-402. DOI: 10.1080/10705510802154281
  21. 21. Chow SY, Tang NS, Yuan Y, Song XY, Zhu HT. Bayesian estimation of semiparametric nonlinear dynamic factor analysis model using the Dirichlet prior. British Journal of Mathematical and Statistical Psychology. 2011;64:69-106
  22. 22. Ebbes P, Grewal R, DeSarbo WS. Modeling strategic group dynamics: A hidden Markov approach. Quantitative Marketing and Economics. 2010;8:241-274
  23. 23. Marruotti A. Robust fitting of hidden Markov regression models under a longitudinal data. Journal of Statistical Computation and Simulation. 2014;84:1728-1747
  24. 24. Zhu HT, Lee SY. A Bayesian analysis of finite mixtures in the LISREL model. Psychometrika. 2001;66(1):133-152. DOI: 10.1007/BF02295737
  25. 25. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association. 1987;82(398):528-550. DOI: 10.1080/01621459.1987.10478458
  26. 26. Geyer CJ. Practical Markov chain Monte Carlo. Statistical Science. 1992;7(4):473-511. DOI: 10.1214/ss/1177011137
  27. 27. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences (with discussion). Statistical Science. 1992;7(4):457-511. DOI: 10.1214/ss/1177011136
  28. 28. Besage J, Green P, Higdon D, Mengersen K. Bayesian computation and stochastic system. Statistical Science. 1995;10(1):3-66. DOI: 10.1214/ss/1177010123
  29. 29. Gelman A. Inference and monitoring convergence. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov Chain Monte Carlo in Practice. London: Chapman and Hall; 1996. pp. 131-140
  30. 30. Kass RE, Raftery AE. Bayes factor (with discussion). Journal of the American Statistical Association. 1995;90(430):773-795. DOI: 10.1080/01621459.1995.10476572
  31. 31. Geisser S, Eddy W. A predictive approach to model selection. Journal of the American Statistical Association. 1979;74(365):1537-1160. DOI: 10.1080/01621459.1979.10481632
  32. 32. Laud PW, Ibrahim JG. Predictive model selection. Journal of the Royal Statistical Society, Series B. 1995;57(1):247-262. DOI: 10.2307/2346098
  33. 33. Gelfand AE, Ghosh SK. Model choice: A minimum posterior predictive loss approach. Biometrika. 1998;85(1):1C13. DOI: 10.1093/biomet/85.1.1
  34. 34. Ibrahim JG, Chen MH, Sinha D. Criterion based methods for Bayesian model assessment. Statistica Sinica. 2001;11:419-443
  35. 35. Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: Wiley; 1987. DOI: 10.1002/9781119013563
  36. 36. Wei GCG, Tanner MAA. Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association. 1990;85:699-704
  37. 37. Louis TA. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B. 1982;44:226-233
  38. 38. Capṕe O, Moulines E, Rydén T. Inference in Hidden Markov Models. New York: Springer Verlag; 2005

Written By

Yemao Xia, Xiaoqian Zeng and Niansheng Tang

Reviewed: November 30th, 2017 Published: December 20th, 2017