Open access peer-reviewed chapter

Bayesian Analysis for Hidden Markov Factor Analysis Models

By Yemao Xia, Xiaoqian Zeng and Niansheng Tang

Submitted: August 6th 2017Reviewed: November 30th 2017Published: December 20th 2017

DOI: 10.5772/intechopen.72837

Downloaded: 260

Abstract

The purpose of this chapter is to provide an introduction to Bayesian approach within a general framework and develop a Bayesian procedure for analyzing multivariate longitudinal data within the hidden Markov factor analysis framework.

Keywords

  • hidden Markov factor analysis model
  • Markov chain Monte Carlo sampling
  • cocaine use

1. Introduction

The Bayesian approach is now well recognized in the statistics literature as an attractive approach to analyzing a wide variety of models [1], and there is rich literature on this issue. Here, we are not going to present a full coverage on the general Bayesian theory, and readers may refer to excellent books, for example [2, 3], for more details for this general statistical method. This chapter provides an introduction to the Bayesian approach within a general framework and develops a specific Bayesian procedure for analyzing multivariate longitudinal data within the hidden Markov factor analysis framework. We begin with the basic ideas of the Bayesian approach and then describe the model under consideration in the second section. The following section considers Bayesian inferences including parameter estimation, model selection, and posterior density estimates. The final section demonstrates the practical value of proposed methodology to cocaine use data to get some Bayesian results. Some technical details are given in the Appendix.

Consider a data set Ywith the probability model pYθwhere θis a univariate or multivariate population parameters vector, which quantifies the uncertainty of data. In the statistical literature, pYθis called likelihood or sampling distribution and often represented as Lθ. From the frequency statistics point of view, statistical inferences are carried out based on Lθ. In this case, θ, though unknown, is treated as fixed. Unlike the frequency statistical inferences, the Bayesian approach for data analysis assumes that θis random and has a distribution πθ. This distribution, which represents the knowledge about θ, is referred to as prior distribution or prior. When data are available, the information on θis summarized within the posterior distribution or posterior, a conditional distribution θgiven data, i.e.,

pθY=pYθπθpYpYθpθE1

where pY=pYθπθis the marginal distribution of Y. The right-hand-side term in (1) omits the factor pYsince given Yit is a known constant. In Bayes literature, pYθpθis also termed the unnormalized posterior. Analogous to the role of likelihood in frequency statistical inferences, posterior is the starting point of Bayesian inferences.

Selecting proper priors for parameters is fundamental to Bayesian analysis. Basically, there are two kinds of prior distributions, namely, the noninformative prior distributions and the informative prior distributions. Noninformative prior distributions associate with situations when the prior distributions have no population basis. They are used when we have little prior information on θand desire that the prior distributions play a minimal role in the posterior distribution distribution. Informative prior distribution represents the distribution of possible parameter values, from which the parameter θhas been drawn. We may have prior knowledge about this distribution, either from closed related data or from the subjective knowledge of experts. A commonly used informative prior distribution in the general Bayesian approach to statistical problems is the conjugate prior distribution, a prior ensuring that the posterior distribution follows the same parametric form as the prior distribution [1, 3].

A potential difficulty underlying Bayesian inferences is the statistical computation when posterior distribution takes on the complicated form. This is particularly true in the situation where latent variables or other unobservable quantities are involved in the model, as discussed in this chapter. In such cases, statistical inferences usually recur to simulation-based methods. Among various sampling methods, Markov chains Monte Carlo methods (MCMC) provide powerful tools for simulating observations from posterior. The key to Markov chain simulation is to create a Markov sequence whose stationary distribution is a specified posterior pθY. Posterior inferences are carried out based on these simulated observations. There are many ways of constructing these Markov chains, but all of them, including the Gibbs sampler [4, 5], are special cases of the general framework of Metropolis et al. [6] and Hastings [7]. However, we do not intend to pursue this issue here, and details on simulation-based methods can be referenced to [2, 3, 8, 9].

In what follows, as an illustration, we will develop a Bayesian analysis procedure for multivariate data under longitudinal setting. Multivariate longitudinal or clustered data occur when multiple items are measured repeatedly over periods of time or across occasions. Under such setting, the primary interest is inference about the dependence of the multiple measurements and the temporal correlation resulting from the repeated measures on the same items. But more often, particular interest also focuses on exploring the potential heterogeneity of data and investigating its transition pattern over time. In these cases, hidden Markov latent variable model (HMLVM) [1013] provides a feasible and unified framework to address these issues. HMLVM assumes that the overall model constitutes the observed process and the underlying hidden state process. The state process, as the convention in the classic HMM (see for example, [1417]), is an univariate discrete process, which follows a first-order Markov chain, while the observed process, conditional on the state sequence, is an independent process with emission distribution specified via LVMs [18]. Hence, in this regard, HMLVM provides a unified way of describing the correlation of multiple items, temporal dependence, and heterogeneity among the data simultaneously. However, the current existing developments cited beforehand focus on the maximum likelihood analysis in which statistical inferences heavily depend on the asymptotic properties. As an illustration of Bayesian inferences on practical problems, in this chapter, we develop a Bayesian procedure to analyze cocaine use data within the hidden Markov factor analysis model framework. Compared to ML, a basic nice feature of a Bayesian approach is its flexibility to utilize useful prior information for achieving better results. Additionally, simulation-based Bayesian methods depend less on asymptotic theory and hence have the potential to produce more reliable results even with small samples.

2. Model description

2.1. Hidden Markov factor analysis model

Consider a set of multivariate longitudinal observations formed by p-dimensional observed vectors yit=yit1yitp, which are recorded on pitems over periods of length T: t=1,,Tacross Nsubjects: i=1,,N. In the field of multivariate analysis, interest mainly focuses on exploring item dependence since measurements may be highly correlated arising from the multicollinearity problem. But more often, interest also concentrates on the heterogeneity resulting from the situation where the population of yitconstitutes more than one component. This is particularly true in the situation where the data illustrate extreme behaviors such as multimodal and/or skewed characteristics. In these cases, a finite mixture factor analysis model (FMFAM) can provide a powerful tool to address these issues. Typically, FMFAM assumes that conditioning on an univariate discrete value state variable zitand an m-dimensional (m<p) continuous latent factor vector ωit, yitare independent and distributed with a p-dimensional multivariate normal distribution, and meanwhile, given zit, ωitalso follows an m-dimensional normal distribution, that is,

yitωitzit=rNpμr+ΛrωitΨεrωitzit=rNm0ΦrE2

where μr=μr1μrpis a p-dimensional intercept vector, which represents the baseline level of yit, Λr=Λr1Λrpis a p×mfactor loading matrix, Ψϵr=diagΨϵkr1Ψϵkrpis a p×pdiagonal matrix with the jth diagonal element Ψϵkrj>0, and Φris an m×mpositive definite matrix.

Formulation given in (2) has two basic features: one is to characterize heterogeneity of population of yitat the occasion level and the other is to establish the dependence among the multiple measurements. The heterogeneous population is specified via state-specific parameters contained in the model while the dependence between different measurements is identified via sharing the common factors in the manner of liner combinations. In particular, apart from explaining the idiosyncratic part of measurements, latent factors also characterize the association between any two measurements. As a matter of fact, one can show that the correlation coefficient between yitjand yitkat state zitis given by

Corryitjyitkzit=r==1mh=1mλrjλrkhΦrlh=1mh=1mλrjλrjhΦrh+Ψϵrj=1mh=1mλrkλrkhΦrh+ΨϵrkE3

in which λrjkis the jkth element of Λrand Φr,hkis the hkth element in Φ, respectively. The strength of correlation is identified by the factor loadings and covariance of factors together. In the case when ωitdegenerates to zero (i.e., Φ=0) or Λ=0, the association among items disappears and model (2) reduces to p-independent mean-variance models within cluster r. Hence, latent factors play a dominant role in characterizing association of multiple items. Note that, in actual applications, latent factors, though unobservable, often have their own physical interpretations. In psychology, for example, latent factors are often used to identify concepts such as treatment, temper, and anxiety, which are important within the framework of theoretical models. The measurements are just proxies for these unobserved concepts of interest. We will provide further interpretations in the real example.

The primary reason for collecting information on multiple occasions for each subject is that it allows investigation of change and/or temporal dependence over time within the subject. There exist various constructs for characterizing dynamic characteristics. A commonly used method is to construct proper dynamic structures for latent factors and establish dynamic factor models, see for example, [1921]. An alternative choice we adopt here is specifying the joint distribution for state sequences. Following the common routine (see, for example, [22, 23]), we assume that each individual state sequence zi=zi1ziTsatisfies the following first-order hidden Markov model

pzi=pzi1t=2Tpzitzi,t1E4

where pzi1and pzitzi,t1are, respectively, the initial distribution and transition probability given by

Pzi1=r=δr,Pzit=szi,t1=r=Qrsrs=1SE5

where Sis a positive integer, δ=δ1δSis an S×1vector satisfying δr0and r=1Sδr=1.0, and Q=Qrsis an S×Stransition matrix with the rsth entry being Qrs, that is, Qrs0and s=1SQrs=1.0for r=1,,S. Modeling state sequences into (5) allows us to explore the transition pattern of individuals across occasions exactly. For example, in the cocaine use data analysis, zitis often identified with the latent state of patient iat time t, then Qrsspecifies how individual ibeing in state rtransfers to state son two successive occasions. Surely, we can relax the time-homogeneous assumption of transition probabilities by including relevant covariates to interpret the inhomogeneous transition behavior among observation data (see, for example, [12, 13, 16]) but at the expense of computational burden.

The current model defined in (2)(5) provides a comprehensive framework for modeling the multivariate longitudinal data with the latent variables. It accommodates the dynamic behavior of observed sequences, heterogeneity of observed data at the occasion level, and dependence among the multiple items simultaneously. In particular, it makes sense to measure effects of latent factors on the manifest variables quantitatively.

Let Ybe the collection of all observations, and Ωbe the set of corresponding factors. Denote Z=zit:1iN1tTbe set of state variables. It follows from Eqs. (2), (4) and (5) that the joint sampling distribution of Y,Ω, and Zis given by

pYΩZθδQ=i=1Npyi1ωitzi1θpzi1δt=2Tpyitωitzitθpzitzi,t1Qi=1Nt=1T1Ψϵzit1/2exp12trΨϵzityitμzitΛzitωit2×1Φzit1/2exp12trΦzit1ωit2×i=1Nt=1Tr=1SδrIzi1=rs=1SQrsIzi,t1=rzit=sE6

where θis formed by free parameters in μr,Λr,Ψr, and Φr. Here, we write a2=aaand denote IAthe indicator function of a set A. The observed data likelihood is then achieved by taking integration of pYΩZθδQover Ωand Z, which involves high-dimensional integrations.

3. Posterior inferences

3.1. Prior specifications

Let μ=μr, Λ=Λr, Ψϵ=Ψϵr, and Φ=Φkr. For the Bayesian analysis, we need to assign priors to the unknown parameters involved for completing model specification. Since θ, δ, and Qare involved in different submodels, it is natural to assume that θ,δ, and Qare mutually independent and the components contained in θare also mutually independent, that is,

pθδQ=pμpΛΨϵpΦpδpQ.E7

For the convenience of conjugacy, we assume that the parameters are drawn from the following commonly used conjugate types prior distributions (see for example [24]).

pμ=r=1Spμr=Dr=1SNpμ0rΣ0r,Φr=1SpΦr=Dr=1SWm1ρ0rR0r1,pΛΨϵ=r=1SpΛrΨϵr×pΨϵr =Dr=1Sj=1pNmΛ0rjψϵrjHϵ0rjGa1αϵ0rjβϵ0rj,δδ0DirSγ0γ0,pQ=r=1SpQr=Dr=1SDirSν0ν0E8

where ‘Ga1ab’ denotes the inverse Gamma distributions with shape a>0and scale b>0and ‘Wm21ρ0rR0r1’ represents the q-dimensional inverse Wishart distribution with ρ0rdegrees of freedom and m2×m2scale matrix R0r; Qris the rth row vector of Q. The scalars αϵ0rj, βϵ0rj, ρ0r, γ0, ν0, the vectors μ0r, Λ0rj, and the matrices R0rand Hϵ0rjare assumed to be known. Thus, standard conjugate priors were specified for all parametric components in the model. The conjugate type prior distributions are sufficiently flexible in most applications, and for situations with a reasonable amount of data available, the hyperparameters scarcely affect the analysis. It should be noted that although Eq. (8) allows different hyperparameters for different latent states, in practice, we choose identical priors for all s. Details of hyperparameter choices are discussed later when we present the empirical results.

3.2. Gibbs sampling scheme and posterior analysis

Combining the sampling distribution for the observable yit’s and the prior distribution specified in (8) yields the joint posterior distribution of θδQgiven by

pθδQYpYθδQpθpδpQE9

where we ignore the normalization constant pY. However, due to the latent factors and state variables present, the computation of pYθδQis intractable since it involves high-dimensional integrals. Consequently, no closed form can be available for the posterior pθ,δ,Q|Y. This problem can be addressed via the data augmentation idea in Tanner and Wong [25]. Data augmentation technique treats the latent quantities ΩZas the hypothetical missing data and augments them with the observed data to form complete data. The posterior analysis is now carried out based on the joint distribution pΩZθδQY, which is proportional to pYΩZθδQpθδQ, the product of likelihood of complete data and priors. Compared to the intractable observed data likelihood, the complete data likelihood has nice hierarchical structure based on conditional independent assumptions in (2) and (4) and hence is relatively easy to analyze. However, pΩZθδQYis still not in closed form and is thus difficult to deal with analytically. In this regard, simulation-based methods can be used to generate observations to carry out posterior analysis. In view of the multiple components involved, the usual independent sampling methods are not feasible. Note that, on the basis of complete data, the full conditional distributions of Ω,Z,θ,δ,and Qhave closed forms. This provides a solid foundation for Markov chain Monte Carlo methods. Markov chain Monte Carlo sampling does not draw observations from pΩZθδQYdirectly. On the contrary, it generates observations from the full conditionals of each component alternatively, thus forming the dependent sample, i.e., Markov chains. Specifically, as pointed out in the introduction, we use Gibbs sampler [4, 5] to draw observations from this target distribution. Obviously, the sampling scheme in the Gibbs sampler includes two types of moves: updating the components involved in the factor analysis model and updating the components related to the hidden Markov model. We propose using the following Gibbs sampler which iteratively simulates from the conditional distributions, where variables are removed from the conditioning set either by explicit integration or by conditional independence. The steps involved in the Gibbs sampler are

  1. Step a: Generate Zfrom pZθδQΩY

  2. Step b: Generate Ωfrom pΩZθY

  3. Step c: Generate μΛΨϵfrom pμΛΨϵZΩY

  4. Step d: Generate Φfrom pΦZΩ

  5. Step e: Generate δfrom pδZ

  6. Step f: Generate Qfrom pQZ

Under mild conditions and similar to [4] (see also, for example, [26]), one can show that for sufficiently large b, say B0, the joint distribution of ΩbZbθbδbQbconverges at an exponential rate to the desired posterior distribution pΩZθδQY. Hence, pΩZθδQYcan be approximated by the empirical distribution of ΩbZbθbδbQb:b=B0+1,,B0+B}where Bis chosen to give sufficient precision to the empirical distribution. The convergence of the Gibbs sampler can be monitored by the ‘estimated potential scale reduction (EPSR)’ values as suggested by Gelman and Rubin [27] or by plotting the traces of estimates against iterations under different starting values.

Simulated observations obtained from the posterior can be used for statistical inferences via straightforward analysis procedures. For brevity, let θbδbQbΩbZbbe the random observations generated by the Gibbs sampler from pθδQΩZY. The joint Bayesian estimate of θand Ωcan be obtained easily via the corresponding sample means of the generated observations as follows:

θ̂=B11b=1Bθb,Ω̂=B11b=1BΩb,Ẑ=B11b=1BZb.E10

Clearly, these Bayesian estimates are consistent estimates of the corresponding posterior means, see [26]. The consistent estimates of covariance matrix of estimates can be obtained as follows:

Cov(θ^Y)=B11b=1Bθbθ̂θbθ̂E11
Cov(Ω̂Y)=B11b=1BΩbΩ̂ΩbΩ̂E12

Hence, the standard error estimates can be obtained conveniently by the Gibbs sampler algorithm. Other statistical inferences about θand Ωsuch as deriving the confidence intervals and statistics for hypothesis testing can be achieved based on the simulative observations as well (see, for example, [28, 29]).

One important statistical inference beyond estimation is on testing of various hypotheses about the model. In the field of hidden Markov modeling, determining the proper number of states may be the first step towards data analysis. Too many states may overfit the observations, meaning that it can fit the training data accurately but may not be a good model for underlying data-generating process. On the other hand, too few states may not be flexible enough to approximate the underlying model. In the context of Bayesian model selection, Bayes factor (BF, [30]) is a popular choice for model comparison. BF is defined as the ratio of the marginal likelihoods of data under two competing models. However, the computation of BF is difficult since it often involves the high-dimensional integrations. It has also been shown that BF is sensitive to the choice of priors and will become infeasible when improper priors are used. A simple and more convenient alternative is the Lν-measures [3134] which is based on the posterior predictive density. It has been shown [34] that this approach is conceptually and computationally simple and is useful in model checking for wide varieties of complicated situations. Moreover, the required computation is a by-product of the common Bayesian simulation procedures such as the Gibbs sampler or its related algorithms. Specifically, let Yrepdenotes future values of Yin a replicate experiment, that is, Yrephas the same sampling density as that of Y. The posterior predictive distribution pYrepYis defined as

pYrepY=pYrepθδQpθδQYdθdδdQE13

Naturally, if the posited model under consideration is the true model in the sense that from which the data are generated, then Yrepwould behave like data Yand its squared biases and covariances should be small. With this notion in mind, Ibrahim, Chen, and Sinha [34] proposed an Lstatistics to assess the fitness of posited models to the data by weighting the squared biases and covariance, which can be interpreted as a trade-off between them. Here, we extend it to the multivariate longitudinal setting. Let Yrep=y1repyNrepbe a collection set of future responses in our proposal. For some 0ν<1, we consider the following multivariate version of Lν-measure:

LνY=i=1NtrCovyirepY+νi=1NtrEyirepYyiEyirepYyiE14

where the expectation is taken with respect to the posterior predictive distribution. Clearly, small values of the Lν-measure indicate that the model gives predictions close to the observed values, and the variability in the predictions is low as well. Hence, the model with the smallest Lν-measure is selected from a collection of competing models. It has been shown that Lν-measure with ν=0.5has nice theoretical properties [34]. Thus, this value of νwill be used in our empirical illustrations.

4. Cocaine use data analysis

In this section, a small portion of cocaine use data is analyzed to illustrate the practical value of the proposed methodology. The original data are collected from 321 cocaine use patients who were admitted in 1988–1989 to the West Los Angeles Veterans Affairs Medical Center. The whole data constitute 68 measurements of 17 items, which were recorded at four time points: at baseline, 1 year after the treatment, 2 years after the treatment, and 12 years after the treatment in 2002–2003. These measurements cover the information on the cocaine use, treatment received, psychological problems, social status, employments, and so on. As an illustration, three variables are selected to conduct data analysis: ‘y1:days of cocaine use per month at intake (CC)’, ‘y2:times per month in formal treatment (FT)’, and ‘y3:months in formal treatment (MFT)’, which, respectively, represent the severity of cocaine use and the levels of treatment received by a patient. Since these variables were measured in 0–120 points scale, to unify the scales, we take logarithms and standardize them. Among them, some measurements are missing. The missing proportion is about 8.4%. For brevity, we assume that the missing is missing at random [35]. A distinct characteristic underlying data are nonnormal and heavy tailed. Figure 1 gives the plots of histograms and the posterior predictive density estimates (see below) of logarithms of CC, FT, and MFT (with missing data removed) on four occasions. The histograms illustrate that the distributions of selected variables are deviated from normality in terms of multimodality and skewness. The skewness and kurtosis of CC on four occasions are 1.6315.031, 0.8473.354, 0.3281.476, and 0.4732.467, respectively. Data set also demonstrates dynamic characteristics. The distribution of CC, for instance, is skewed to the left at baseline and moves to the right gradually on the following two occasions and becomes right-skewed eventually. This implies that a single factor analysis model may not be appropriate to fit the data at each time point.

Figure 1.

Plots of histograms and posterior predictive density estimates of ‘CC’, ‘FT’ and ‘MFT’ under FA model and hidden Markov CFA model with seven states in the cocaine use data analysis: the dashed lines denote CFA and the solid lines represent the hidden Markov FA.

In this analysis, one of the objectives is to explore the effects of latent factors on the observed variables and assess the dependence among latent factors. Based on the nature of the problem under consideration, it is natural to group the single variable ‘CC’ to reflect one latent factor ‘cocaine use’ (η) and to group ‘FT’ and ‘MFT’ to represent another latent factor ‘treatment’ (ξ). Let yit=yit1yit2yit3and ωit=ηitξit. To be convenient for interpretation and computation, Φrand Λrare restricted to be invariant across states but leave the baseline level μrvarying with r. Further, the following non-overlapped structure for factor loading matrix is considered

Λ=10001Λ32E15

where parameters with an asterisk are treated as fixed for identification. Note that fixing Λ11=1indicates that ηis identified with CC. This is similar to that in Λ22. Hence, in this case, Φ12in Φmeasures the magnitude of dependence of ξon η.

Data set is fitted to the proposed models with 10 different transition models: S=1,,10. Although these state spaces are in nested forms, the corresponding models are not, since one cannot be reduced to another by constraining parameters in the interior of parameter space. This indicates that chi-square distribution may not be suitable for the classic likelihood ratio test statistic. We use L-measure to implement model selection. Obviously, if S1is taken, then the proposed model reduces to common factor analysis model (CFA, [18]).

The following inputs are taken for the super-parameters involved in the prior distributions (8): for r=1,,S, μ0rj=minyitj+r/S, Σ0r=Syy/S, where Syyis the sample covariance matrix of data. The entries in Λ0are set to be zeros, ρ0=10.0, R01=7.0×I2, which leads to the mean of Φequal to I2, Hϵ0=I3, αϵ0j=9.0, βϵ0j=8.0, ν0=γ0=0.1. Note that these values are the standard inputs in the latent variable analysis (see [24]). We also took other values for these inputs and found that the resulting estimates are scarcely affected.

We implement the proposed algorithm given in Section 3 to conduct Bayesian analysis. Let Yobsbe the collection of observed data and Ymisbe the set of missing data. Due to the missing data, we need to draw Ymisfrom pYmisΩZθYobsin MCMC sampling. This can be implemented easily since conditioning on Ω, Z, and θ, pYmisΩZθYobs, independent of Yobs, has the normal distribution. Hence, drawing Ymisis rather straightforward and fast. To obtain some idea about the number of the Gibbs sampler iterations in getting convergence, we conducted a few test runs as a pilot study and found that in all these runs, the Gibbs sampler converged in about 1000–2000 iterations, where the EPSR values [27] are less than 1.2. So, for all cases under consideration, we collect 3000 random observations after initial 2000 iterations being removed for posterior analysis.

We calculate the values of L0.5under each fitting. For computation, we use simulation-based method by drawing predictive values Yobsrepfrom pYobsrepYobs, where Yobsrepis the hypothetical replication of Yobs. Note that pYobsrepYobs=pYobsrepΩZθpΩZθYobsdΩdZdθ. Hence, drawing Yobsrepis rather easy when Ω,Z, and θare available. Given that we have Msimulations from the posterior of Ω,Z,θvia MCMC sampling discussed before, we just draw one Yobsrepfrom pYobsrepΩZθfor each Ω,Z, and θand obtain Msimulations in the end for Yobsrep. Based on these simulated observations, Lνmeasures can be estimated consistently via sample means. We draw 3000 observations after convergence of MCMC algorithm for calculating L0.5and the results are reported in Table 1.

ModelL0.5ModelL0.5
S=12322.447S=6590.448
S=22107.514S=7572.172
S=31030.264S=8597.843
S=4941.230S=9932.763
S=5839.726S=101030.264

Table 1.

Summary of L0.5under competing models in the analysis of cocaine use data.

Examination of Table 1 indicates that the proposed model with six to eight latent states seems to give better fits to the data. Furthermore, we calculate the posterior predictive density estimates of ytjt=14j=13under one state and seven states, respectively (see Figure 1). It can be seen clearly that our proposed method is successful in capturing the skewness and modes of data while factor analysis model fails. For the computation details, we choose 60–100 equally spaced grids in the interval minyobs,itj1.0maxyobs,itj+1.0and collect 3000 simulated observations from the Gibbs sampler at each point after removing initial 2000 iteration as burn-ins.

Table 2 presents the summary of Bayesian estimates of unknown parameters and their standard errors using the formula given in (11) with S=1(denoted by FA) and S=7(denoted by HMFA). For comparison, maximum likelihood estimates of unknown parameters with their standard deviations under HMFA are also presented in Table 2. The maximum likelihood analysis is conducted via MCECM algorithm [36] and the standard error estimates are calculated via Louis formula [37].

FAMLHMFA
Para.Est.SDEst.SDEst.SD
Λ320.0010.0140.1960.0290.7520.045
Ψε11.4430.3150.5590.2970.4320.049
Ψε20.4390.0560.2040.0390.3390.034
Ψε30.3050.0300.008NAN0.0250.001
Φ110.7700.3150.5100.1320.3460.049
Φ12−0.0180.018−0.0530.041−0.1820.052
Φ221.0070.0800.3120.0530.2190.033

Table 2.

Summary statistics for Bayesian and ML estimates in the cocaine use data analysis.

Based on Table 2, we can find the following facts: First, three estimates of Λ32give the positive effects of latent factor ξon the ‘MFT’. This is not surprising since ξis related to the treatment level of a patient received. But there are obvious differences in magnitudes among the three methods. For FA and HMFA, the former gives Λ̂32=0.001associated with standard deviation 0.014, while the latter gives Λ̂32=0.752with standard deviation 0.045. This reflects that the heterogeneity of data affects the estimates Λ̂32seriously. Compared to the previous two methods, ML method produces that Λ̂32=0.196with SD = 0.029, which are in between them. Second, the estimates of variance parameters Ψϵjunder S=1are larger than those under S=7. This indicates that factor analysis model accommodates heavy tails of data at the expense of variance inflation. Further investigations on the estimates of Φjjunder FA and HMFA also reveal the same phenomenon as that of Ψϵj. However, we observe that the ML estimate of Ψϵ3, the unique variance corresponding to the third item, is equal to 0.008 with SD = NAN, an illogical number, which is very close to an improper Heywood case. As pointed out by Lee [18], Heywood cases in the ML estimation can be avoided by imposing an inequality constraint on Ψϵ3with a penalty function. In the Bayesian approach, the conjugate prior distribution of Ψϵ31specified Ψϵ3in a region of positive values and hence has a similar effect as adding a penalty function. Hence, no Heywood cases are found in the Bayesian solution because of the penalty function induced by the prior distribution on Ψϵ31. Third, three estimates give the negative correlation between ηand ξ, which is consistent with the fact that the improvement of treatment will decrease the intensity of cocaine use, thus leading to a decrease of cocaine use in days. ML estimates for Φjkare very close to those under HMFA. However, the estimate of Φ12under S=1is −0.018, which is quite different from 0.182for S=7. Furthermore, the coefficients of correlation of ξand ηunder S=1and S=7are −0.0204 and −0.6612, respectively. The former suggests that ξand ηare approximately independent while the latter implies stronger dependence between them.

Moreover, we computed the posterior probabilities Pzt=rYobsfor r=1,,7and t=1,,4under S=7based on 10,000 simulated observations drawn from pZYobsand found that the transition path corresponding to the maximum posterior probability is 7111. This implies that latent state of the patient being in is extremely serious at baseline and becomes moderate in the subsequent treatments. This also reflects a positive effect of intervention on the patient’s latent state. Note that unlike the common Viterbi algorithm in exploring the optimal transition path of states in ML analysis, calculating posterior probability Pzt=rYobswithin Bayesian framework is a by-product of the estimation procedure. This voids the complex computation of marginal likelihood of the observed data and hence is very fast.

5. Discussion

This chapter reviews Bayesian inferences within a general framework and proposes a Bayesian procedure for analyzing hidden Markov factor analysis model under multivariate longitudinal setting. Compared to ML method, the pragmatic advantage of Bayesian framework is its flexibility and generality for coping with very complex problems. When good prior information can be available, results obtained from Bayesian method are more reliable and accurate than that under ML. With increased access to computation advances in simulation-based approaches, in particular the MCMC methodology, Bayesian inferences provide enormous scope for realistic statistical modeling.

Although we concentrate our attention on applications of the hidden Markov factor analysis model, the methodology developed in this chapter can be extended to the case where the LVM is nonlinear. Another possible extension is to consider a dynamic LVM, wherein model parameters vary over time. These extensions will raise theoretical and computational challenges and certainly require further investigation.

Acknowledgments

The authors are grateful to the editor’s valuable suggestions and comments which have greatly improved the manuscript. Xia’s work was fully supported by grant from National Nature Science Foundation of China (No. 11471161) and Tang’s work was supported by National Science Fund for Distinguished Young Scholars of China (No. 11225103).

  1. (a) pZθδQΩY

Let ωidenote the sequence of latent factors across Toccasions for individual i. To draw state variables Zfrom pZδQθΩY, we first notice that

pZθΩY=i=1NpziωiδQθyiE16

Hence, drawing Zcan be accomplished via single-component method by drawing ziindependently from pziωiδQθyi. Furthermore, notice that the sequences yiωiziare still the one-order Markov sequences. Hence, we can simulate zithrough a well-known forward filtering-backward sampling algorithm (see, for example, [38]). For notation clarity, we suppress θ, δ, and Qin the following derivations.

Forward filtering-backward sampling (FFBS) consists of first forward filtering (FF) and then backward sampling (BS). The forward filtering step recursively updates

αi,tt=pzitωi,1:tyi,1:t,t=1,,T.E17

Here yi,1:trepresents the set of observations of subject iup to time tand so are ωi,1:tand zi,1:t. The backward sampling is to draw zifrom the joint distribution of the states given the data using

pzi,1:Tωi,1:Tyi,1:T=pziTωi,1:Tyi,1:T  pzi1zi,2:Tωi,1:Tyi,1:T.E18

That is, we first draw the last state given all the data and then work backwards in time drawing each state conditional on all the subsequent ones.

To implement forward filtering, let

αitr=Pyi,1:tωi,1:tzit=r,t=1,,TE19

Obviously, αi1r=δrpyi1ωi1zi1=r. Moreover, it can be shown that

αitr=s=1Sαit1sqsrpyitωitzit=r,t=2,,TE20

The outputs αitt=1Tfrom recursive Eq. (20) can be used to calculate the posterior probability

αi,ttr=Pzit=rωi,1:tyi,1:t=αitrs=1SαitsE21

which leads to the forward filtering (FF) iteration.

The backward sampling step depends on the observation that

pzitzi,t+1:Tωi,1:Tyi,1:Tpzitzi,t+1:Tωi,1:Tyi,1:T= pzitωi,1:tyi,1:tpzi,t+1:Tωi,t+1:Tyi,t+1:Tzitωi,1:tyi,1:t= pzitωi,1:tyi,1:tpzi,t+1:Tzitωi,1:tyi,1:tpωi,t+1:Tyi,t+1:Tzi,t+1:Tzitωi,1:tyi,1:t= pzitωi,1:tyi,1:tpzi,t+1:Tzitpωi,t+1:Tyi,t+1:Tzi,t+1:TE22

The last equation holds since given zit, yi,t:Tωi,t:Tzi,t+1:Tdoes not depend on the previous values due to the Markov Chain characteristics of yitωitzit. This leads to

Pzit=rzi,t+1:Tyi,1:Tωi,1:T=αi,ttrqrzi,t+1s=1Sαi,ttsqszi,t+1t=T1,,1.E23

Hence, FFBS algorithm for drawing ziis implemented by

Algorithm:

  1. running the recursion αitand stored the conditional probabilities αi,ttfor t=1,,T;

  2. sampling ziTfrom the filtered conditional probability αi,TT;

  3. for t=T1,,1, sampling zitfrom the conditional probability

Pzit=rωi,1:Tyi,1:Tzi,t+1:T.E24
  1. (b) pΩZθY

To draw Ω, we first note that

pΩZθY=i=1Nt=1TpωitzitθyitE25

in which

pωitzitθyitexp12yitμrΛrωitΨϵr1yitμkrΛrωit12ωitΦr1ωitE26

with r=zit. Hence, similar to that in drawing Z, updating Ωcan be achieved by drawing ωitindependently from pωitzitθyitfor i=1,,Nand t=1,,T. It can be shown that

pωitzit=rθyit=DNmm̂itΣ̂r.E27

in which

m̂it=Σ̂rΛrΨϵr1yitμr,Σ̂r=ΛrΨϵr1Λr+Φr11.E28
  1. (c) pμΛΨϵZΩY

To draw μΛΨϵ, we can first draw μfrom pμΛΨϵZΩYand then draw ΛΨϵfrom pΛΨϵμZΩY. For this end, let n̂r=#zit=rbe the size of cluster zit, and let witr=Izit=r. Denote

Y¯r=i=1Nt=1Twitryit/n̂r,Ω¯r=i=1Nt=1Twitrωit/n̂r,Syyr=i=1Nt=1Twitryityit/n̂r,Sωyr=i=1Nt=1Twitrωityit/n̂r,E29

be the sample means and covariance matrices of Yand Ωwithin the rth cluster, respectively. By some algebra calculations, it can be shown that

pμΛΨϵKZΩY=r=1SpμrΛrΨϵrZΩY,andpΛΨϵμKZΩY=r=1SpΛrΨϵrμrΩZY,E30

where

pμrΛrΨϵrZΩY=NpâμrΣ̂μr,pΛrΨϵrμrZΩY=j=1ppΨϵrjμrZΩYpΛrjΨϵrjμrZΩY=Dj=1pGa1α̂ϵrjβ̂ϵrjNmΛ̂rjΨϵrjĤrj,E31

with

âμr=Σ̂μrΣ0r1μ0r+n̂rΨϵr1Y¯rΛrΩ¯r,Σ̂μr=Σ0r1+n̂rΨϵr11,Λ̂rj=ĤrjHϵ0rj1Λ0rj+n̂rSωyjrμrjΩ¯r,Ĥrj1=H0rj1+n̂rSωωr,α̂ϵrj=αϵ0rj+n̂r/2,β̂ϵrj=βϵ0rj+Λ0rjH0rj1Λ0rj+n̂rSyyjjr2μrjy¯jr+μrj2Λ̂rjĤrj1Λ̂rj/2,E32

in which y¯jris the jth element in Y¯r, Syyjjris the jth main diagonal element of Syyr, and Sωyjris the jth column vector of Sωyr.

  1. (d) pΦΩZ

From the prior distribution of Φr1and the distribution of Ω, it can be shown that

pΦrΩZΦrn̂r+ρ0r+m+1/2exp12trΦ1(n̂rSωωr+R01E33

where n̂rand Sωωare given in (c). Hence, pΦrΩZis the m-dimensional inverse Wishart distribution Wm1n̂r+ρ0rn̂rSωωr+R01. It can be shown from exactly the same reasoning as before that drawing Φcan be achieved by drawing Φrfrom pΦrΩZindependently.

  1. (e) pδZand (f) pQZ

It can be verified directly that

pδZ=pδkZandpδZ=DDirSγ0+n̂11γ0+n̂1SE34

in which n̂1r=i=1NIzi1=r. Similarly, it can be shown that

pQZ=r=1SpQrZ,pQrZ=Dr=1SDirSν0+n̂r1ν0+n̂rS.E35

in which n̂rs=i=1Nt=2TIzit1=rzit=s.

© 2017 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Yemao Xia, Xiaoqian Zeng and Niansheng Tang (December 20th 2017). Bayesian Analysis for Hidden Markov Factor Analysis Models, New Insights into Bayesian Inference, Mohammad Saber Fallah Nezhad, IntechOpen, DOI: 10.5772/intechopen.72837. Available from:

chapter statistics

260total chapter downloads

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

Dynamic Process Model Parameter Estimation by Global System Analysis

By Shigeru Kashiwaya

Related Book

First chapter

Bayesian Networks for Supporting Model Based Predictive Control of Smart Buildings

By Alessandro Carbonari, Massimo Vaccarini and Alberto Giretti

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More About Us