Bayesian Analysis for Hidden Markov Factor Analysis Models Bayesian Analysis for Hidden Markov Factor Analysis Models

The purpose of this chapter is to provide an introduction to Bayesian approach within a general framework and develop a Bayesian procedure for analyzing multivariate longitudinal data within the hidden Markov factor analysis framework.


Introduction
The Bayesian approach is now well recognized in the statistics literature as an attractive approach to analyzing a wide variety of models [1], and there is rich literature on this issue. Here, we are not going to present a full coverage on the general Bayesian theory, and readers may refer to excellent books, for example [2,3], for more details for this general statistical method. This chapter provides an introduction to the Bayesian approach within a general framework and develops a specific Bayesian procedure for analyzing multivariate longitudinal data within the hidden Markov factor analysis framework. We begin with the basic ideas of the Bayesian approach and then describe the model under consideration in the second section. The following section considers Bayesian inferences including parameter estimation, model selection, and posterior density estimates. The final section demonstrates the practical value of proposed methodology to cocaine use data to get some Bayesian results. Some technical details are given in the Appendix.
Consider a data set Y with the probability model p Yjθ ð Þwhere θ is a univariate or multivariate population parameters vector, which quantifies the uncertainty of data. In the statistical literature, p Yjθ ð Þ is called likelihood or sampling distribution and often represented as L θ ð Þ. From the frequency statistics point of view, statistical inferences are carried out based on L θ ð Þ. In this case, θ, though unknown, is treated as fixed. Unlike the frequency statistical inferences, the Bayesian approach for data analysis assumes that θ is random and has a distribution π θ ð Þ. This distribution, which represents the knowledge about θ, is referred to as prior distribution or prior. When data are available, the information on θ is summarized within the posterior distribution or posterior, a conditional distribution θ given data, i.e., The right-hand-side term in (1) omits the factor p Y ð Þ since given Y it is a known constant. In Bayes literature, p Yjθ ð Þp θ ð Þ is also termed the unnormalized posterior. Analogous to the role of likelihood in frequency statistical inferences, posterior is the starting point of Bayesian inferences.
Selecting proper priors for parameters is fundamental to Bayesian analysis. Basically, there are two kinds of prior distributions, namely, the noninformative prior distributions and the informative prior distributions. Noninformative prior distributions associate with situations when the prior distributions have no population basis. They are used when we have little prior information on θ and desire that the prior distributions play a minimal role in the posterior distribution distribution. Informative prior distribution represents the distribution of possible parameter values, from which the parameter θ has been drawn. We may have prior knowledge about this distribution, either from closed related data or from the subjective knowledge of experts. A commonly used informative prior distribution in the general Bayesian approach to statistical problems is the conjugate prior distribution, a prior ensuring that the posterior distribution follows the same parametric form as the prior distribution [1,3].
A potential difficulty underlying Bayesian inferences is the statistical computation when posterior distribution takes on the complicated form. This is particularly true in the situation where latent variables or other unobservable quantities are involved in the model, as discussed in this chapter. In such cases, statistical inferences usually recur to simulation-based methods. Among various sampling methods, Markov chains Monte Carlo methods (MCMC) provide powerful tools for simulating observations from posterior. The key to Markov chain simulation is to create a Markov sequence whose stationary distribution is a specified posterior p θjY ð Þ. Posterior inferences are carried out based on these simulated observations. There are many ways of constructing these Markov chains, but all of them, including the Gibbs sampler [4,5], are special cases of the general framework of Metropolis et al. [6] and Hastings [7]. However, we do not intend to pursue this issue here, and details on simulation-based methods can be referenced to [2,3,8,9].
In what follows, as an illustration, we will develop a Bayesian analysis procedure for multivariate data under longitudinal setting. Multivariate longitudinal or clustered data occur when multiple items are measured repeatedly over periods of time or across occasions. Under such setting, the primary interest is inference about the dependence of the multiple measurements and the temporal correlation resulting from the repeated measures on the same items. But more often, particular interest also focuses on exploring the potential heterogeneity of data and investigating its transition pattern over time. In these cases, hidden Markov latent variable model (HMLVM) [10][11][12][13] provides a feasible and unified framework to address these issues. HMLVM assumes that the overall model constitutes the observed process and the underlying hidden state process. The state process, as the convention in the classic HMM (see for example, [14][15][16][17]), is an univariate discrete process, which follows a first-order Markov chain, while the observed process, conditional on the state sequence, is an independent process with emission distribution specified via LVMs [18]. Hence, in this regard, HMLVM provides a unified way of describing the correlation of multiple items, temporal dependence, and heterogeneity among the data simultaneously. However, the current existing developments cited beforehand focus on the maximum likelihood analysis in which statistical inferences heavily depend on the asymptotic properties. As an illustration of Bayesian inferences on practical problems, in this chapter, we develop a Bayesian procedure to analyze cocaine use data within the hidden Markov factor analysis model framework. Compared to ML, a basic nice feature of a Bayesian approach is its flexibility to utilize useful prior information for achieving better results. Additionally, simulation-based Bayesian methods depend less on asymptotic theory and hence have the potential to produce more reliable results even with small samples.

Model description 2.1. Hidden Markov factor analysis model
Consider a set of multivariate longitudinal observations formed by p-dimensional observed vectors y it ¼ y it1 ; …; y itp ⊺ , which are recorded on p items over periods of length T: t ¼ 1, ⋯, T across N subjects: i ¼ 1, ⋯, N. In the field of multivariate analysis, interest mainly focuses on exploring item dependence since measurements may be highly correlated arising from the multicollinearity problem. But more often, interest also concentrates on the heterogeneity resulting from the situation where the population of y it constitutes more than one component. This is particularly true in the situation where the data illustrate extreme behaviors such as multimodal and/or skewed characteristics. In these cases, a finite mixture factor analysis model (FMFAM) can provide a powerful tool to address these issues. Typically, FMFAM assumes that conditioning on an univariate discrete value state variable z it and an m-dimensional (m < p) continuous latent factor vector ω it , y it are independent and distributed with a p-dimensional multivariate normal distribution, and meanwhile, given z it , ω it also follows an m-dimensional normal distribution, that is, where μ r ¼ μ r1 ; …; μ rp ⊺ is a p-dimensional intercept vector, which represents the baseline a p Â p diagonal matrix with the jth diagonal element Ψ ekrj > 0, and Φ r is an m Â m positive definite matrix.
Formulation given in (2) has two basic features: one is to characterize heterogeneity of population of y it at the occasion level and the other is to establish the dependence among the multiple measurements. The heterogeneous population is specified via state-specific parameters contained in the model while the dependence between different measurements is identified via sharing the common factors in the manner of liner combinations. In particular, apart from explaining the idiosyncratic part of measurements, latent factors also characterize the association between any two measurements. As a matter of fact, one can show that the correlation coefficient between y itj and y itk at state z it is given by in which λ rjk is the j; k ð Þth element of Λ r and Φ r, hk is the h; k ð Þth element in Φ , respectively. The strength of correlation is identified by the factor loadings and covariance of factors together. In the case when ω it degenerates to zero (i.e., Φ ¼ 0 ) or Λ ¼ 0, the association among items disappears and model (2) reduces to p-independent mean-variance models within cluster r. Hence, latent factors play a dominant role in characterizing association of multiple items. Note that, in actual applications, latent factors, though unobservable, often have their own physical interpretations. In psychology, for example, latent factors are often used to identify concepts such as treatment, temper, and anxiety, which are important within the framework of theoretical models. The measurements are just proxies for these unobserved concepts of interest. We will provide further interpretations in the real example.
The primary reason for collecting information on multiple occasions for each subject is that it allows investigation of change and/or temporal dependence over time within the subject. There exist various constructs for characterizing dynamic characteristics. A commonly used method is to construct proper dynamic structures for latent factors and establish dynamic factor models, see for example, [19][20][21]. An alternative choice we adopt here is specifying the joint distribution for state sequences. Following the common routine (see, for example, [22,23]), we assume that each individual state sequence z i ¼ z i1 ; ⋯; z iT ð Þsatisfies the following first-order hidden Markov model where p z i1 ð Þ and p z it jz i, tÀ1 ð Þare, respectively, the initial distribution and transition probability given by where S is a positive integer, δ ¼ δ 1 ; ⋯; δ S ð Þis an S Â 1 vector satisfying δ r ≥ 0 and P S r¼1 δ r ¼ 1:0, and Q ¼ Q rs ð Þ is an S Â S transition matrix with the r; s ð Þth entry being Q rs , that is, Q rs ≥ 0 and P S s¼1 Q rs ¼ 1:0 for r ¼ 1, ⋯, S. Modeling state sequences into (5) allows us to explore the transition pattern of individuals across occasions exactly. For example, in the cocaine use data analysis, z it is often identified with the latent state of patient i at time t, then Q rs specifies how individual i being in state r transfers to state s on two successive occasions. Surely, we can relax the time-homogeneous assumption of transition probabilities by including relevant covariates to interpret the inhomogeneous transition behavior among observation data (see, for example, [12,13,16]) but at the expense of computational burden.
The current model defined in (2)-(5) provides a comprehensive framework for modeling the multivariate longitudinal data with the latent variables. It accommodates the dynamic behavior of observed sequences, heterogeneity of observed data at the occasion level, and dependence among the multiple items simultaneously. In particular, it makes sense to measure effects of latent factors on the manifest variables quantitatively.
Let Y be the collection of all observations, and Ω be the set of corresponding factors. Denote f g be set of state variables. It follows from Eqs. (2), (4) and (5) that the joint sampling distribution of Y, Ω , and Z is given by where θ is formed by free parameters in μ r , Λ r , Ψ r , and Φ r . Here, we write a ⊗ 2 ¼ aa ⊺ and denote I A ð Þ the indicator function of a set A. The observed data likelihood is then achieved by taking integration of p Y; Ω; Zjθ; δ; Q ð Þ over Ω and Z, which involves high-dimensional integrations.

Prior specifications
For the Bayesian analysis, we need to assign priors to the unknown parameters involved for completing model specification. Since θ, δ, and Q are involved in different submodels, it is natural to assume that θ, δ, and Q are mutually independent and the components contained in θ are also mutually independent, that is, For the convenience of conjugacy, we assume that the parameters are drawn from the following commonly used conjugate types prior distributions (see for example [24]).
where 'Ga À1 a; b ð Þ' denotes the inverse Gamma distributions with shape a > 0 and scale b > 0 and 'W À1 m2 r 0r ; R À1 0r À Á ' represents the q-dimensional inverse Wishart distribution with r 0r degrees of freedom and m 2 Â m 2 ð Þscale matrix R 0r ; Q r is the rth row vector of Q. The scalars α e0rj , β e0rj , r 0r , γ 0 , ν 0 , the vectors μ 0r , Λ 0rj , and the matrices R 0r and H e0rj are assumed to be known. Thus, standard conjugate priors were specified for all parametric components in the model. The conjugate type prior distributions are sufficiently flexible in most applications, and for situations with a reasonable amount of data available, the hyperparameters scarcely affect the analysis. It should be noted that although Eq. (8) allows different hyperparameters for different latent states, in practice, we choose identical priors for all s. Details of hyperparameter choices are discussed later when we present the empirical results.

Gibbs sampling scheme and posterior analysis
Combining the sampling distribution for the observable y it 's and the prior distribution specified in (8) yields the joint posterior distribution of θ; δ; Q f ggiven by where we ignore the normalization constant p Y ð Þ. However, due to the latent factors and state variables present, the computation of p Yj θ; δ; Q ð Þis intractable since it involves high-dimensional integrals. Consequently, no closed form can be available for the posterior p θ, δ, QjY ð Þ . This problem can be addressed via the data augmentation idea in Tanner and Wong [25]. Data augmentation technique treats the latent quantities Ω; Z f g as the hypothetical missing data and augments them with the observed data to form complete data. The posterior analysis is now carried out based on the joint distribution p Ω; Z; θ; δ; QjY ð Þ , which is proportional to p Y; Ω; Zjθ; δ; Q ð Þ p θ; δ; Q ð Þ , the product of likelihood of complete data and priors. Compared to the intractable observed data likelihood, the complete data likelihood has nice hierarchical structure based on conditional independent assumptions in (2) and (4) and hence is relatively easy to analyze. However, p Ω; Z; θ; δ; QjY ð Þ is still not in closed form and is thus difficult to deal with analytically. In this regard, simulation-based methods can be used to generate observations to carry out posterior analysis. In view of the multiple components involved, the usual independent sampling methods are not feasible. Note that, on the basis of complete data, the full conditional distributions of Ω, Z, θ, δ, and Q have closed forms. This provides a solid foundation for Markov chain Monte Carlo methods. Markov chain Monte Carlo sampling does not draw observations from p Ω; Z; θ; δ; QjY ð Þ directly. On the contrary, it generates observations from the full conditionals of each component alternatively, thus forming the dependent sample, i.e., Markov chains. Specifically, as pointed out in the introduction, we use Gibbs sampler [4,5] to draw observations from this target distribution. Obviously, the sampling scheme in the Gibbs sampler includes two types of moves: updating the components involved in the factor analysis model and updating the components related to the hidden Markov model. We propose using the following Gibbs sampler which iteratively simulates from the conditional distributions, where variables are removed from the conditioning set either by explicit integration or by conditional independence. The steps involved in the Gibbs sampler are Step a: Generate Z from p Zjθ; δ; Q; Ω; Y ð Þ Step b: Generate Ω from p Ω jZ; θ; Y ð Þ Step c: Generate μ; Λ; Ψ e f gfrom p μ; Λ; Ψ e jZ; Ω; Y ð Þ Step d: Generate Φ from p Φ jZ; Ω ð Þ Step e: Generate δ from p δjZ ð Þ Step f: Generate Q from p QjZ ð Þ Under mild conditions and similar to [4] (see also, for example, [26]), one can show that for be the random observations generated by the Gibbs sampler from p θ; δ; Q; Ω; ZjY ð Þ . The joint Bayesian estimate of θ and Ω can be obtained easily via the corresponding sample means of the generated observations as follows: Clearly, these Bayesian estimates are consistent estimates of the corresponding posterior means, see [26]. The consistent estimates of covariance matrix of estimates can be obtained as follows: Hence, the standard error estimates can be obtained conveniently by the Gibbs sampler algorithm. Other statistical inferences about θ and Ω such as deriving the confidence intervals and statistics for hypothesis testing can be achieved based on the simulative observations as well (see, for example, [28,29]).
One important statistical inference beyond estimation is on testing of various hypotheses about the model. In the field of hidden Markov modeling, determining the proper number of states may be the first step towards data analysis. Too many states may overfit the observations, meaning that it can fit the training data accurately but may not be a good model for underlying data-generating process. On the other hand, too few states may not be flexible enough to approximate the underlying model. In the context of Bayesian model selection, Bayes factor (BF, [30]) is a popular choice for model comparison. BF is defined as the ratio of the marginal likelihoods of data under two competing models. However, the computation of BF is difficult since it often involves the high-dimensional integrations. It has also been shown that BF is sensitive to the choice of priors and will become infeasible when improper priors are used. A simple and more convenient alternative is the L ν -measures [31][32][33][34] which is based on the posterior predictive density. It has been shown [34] that this approach is conceptually and computationally simple and is useful in model checking for wide varieties of complicated situations. Moreover, the required computation is a by-product of the common Bayesian simulation procedures such as the Gibbs sampler or its related algorithms. Specifically, let Y rep denotes future values of Y in a replicate experiment, that is, Y rep has the same sampling density as that of Y. The posterior predictive distribution p Y rep jY ð Þis defined as Naturally, if the posited model under consideration is the true model in the sense that from which the data are generated, then Y rep would behave like data Y and its squared biases and covariances should be small. With this notion in mind, Ibrahim, Chen, and Sinha [34] proposed an L statistics to assess the fitness of posited models to the data by weighting the squared biases and covariance, which can be interpreted as a trade-off between them. Here, we extend it to the multivariate longitudinal setting. Let Y rep ¼ y rep⊺ 1 ; ⋯; y rep⊺ N À Á ⊺ be a collection set of future responses in our proposal. For some 0 ≤ ν < 1, we consider the following multivariate version of L ν -measure: where the expectation is taken with respect to the posterior predictive distribution. Clearly, small values of the L ν -measure indicate that the model gives predictions close to the observed values, and the variability in the predictions is low as well. Hence, the model with the smallest L ν -measure is selected from a collection of competing models. It has been shown that L νmeasure with ν ¼ 0:5 has nice theoretical properties [34]. Thus, this value of ν will be used in our empirical illustrations.

Cocaine use data analysis
In this section, a small portion of cocaine use data is analyzed to illustrate the practical value of the proposed methodology. The original data are collected from 321 cocaine use patients who were admitted in 1988-1989 to the West Los Angeles Veterans Affairs Medical Center. The whole data constitute 68 measurements of 17 items, which were recorded at four time points: at baseline, 1 year after the treatment, 2 years after the treatment, and 12 years after the treatment in 2002-2003. These measurements cover the information on the cocaine use, treatment received, psychological problems, social status, employments, and so on. As an illustration, three variables are selected to conduct data analysis: 'y 1 : days of cocaine use per month at intake (CC)', 'y 2 : times per month in formal treatment (FT)', and 'y 3 : months in formal treatment (MFT)', which, respectively, represent the severity of cocaine use and the levels of treatment received by a patient. Since these variables were measured in 0-120 points scale, to unify the scales, we take logarithms and standardize them. Among them, some measurements are missing. The missing proportion is about 8:4%. For brevity, we assume that the missing is missing at random [35]. A distinct characteristic underlying data are nonnormal and heavy tailed. Figure 1 gives the plots of histograms and the posterior predictive density estimates (see below) of logarithms of CC, FT, and MFT (with missing data removed) on four occasions. The histograms illustrate that the distributions of selected variables are deviated from normality in terms of multimodality and skewness. The skewness and kurtosis of CC on four occasions are À1:631; 5:031 f g , À0:847; 3:354 f g , 0:328; 1:476 f g , and À0:473; 2:467 f g , respectively. Data set also demonstrates dynamic characteristics. The distribution of CC, for instance, is skewed to the left at baseline and moves to the right gradually on the following two occasions and becomes right-skewed eventually. This implies that a single factor analysis model may not be appropriate to fit the data at each time point.
In this analysis, one of the objectives is to explore the effects of latent factors on the observed variables and assess the dependence among latent factors. Based on the nature of the problem under consideration, it is natural to group the single variable 'CC' to reflect one latent factor 'cocaine use' (η) and to group 'FT' and 'MFT' to represent another latent factor 'treatment' (ξ).
Let y it ¼ y it1 ; y it2 ; y it3 À Á ⊺ and ω it ¼ η it ; ξ it À Á ⊺ . To be convenient for interpretation and computation, Φ r and Λ r are restricted to be invariant across states but leave the baseline level μ r varying with r. Further, the following non-overlapped structure for factor loading matrix is considered where parameters with an asterisk are treated as fixed for identification. Note that fixing Λ 11 ¼ 1 indicates that η is identified with CC. This is similar to that in Λ 22 . Hence, in this case, Φ 12 in Φ measures the magnitude of dependence of ξ on η.
Data set is fitted to the proposed models with 10 different transition models: S ¼ 1, ⋯, 10.
Although these state spaces are in nested forms, the corresponding models are not, since one cannot be reduced to another by constraining parameters in the interior of parameter space. This indicates that chi-square distribution may not be suitable for the classic likelihood ratio test statistic. We use L-measure to implement model selection. Obviously, if S 1 is taken, then the proposed model reduces to common factor analysis model (CFA, [18]).
The following inputs are taken for the super-parameters involved in the prior distributions (8): where S yy is the sample covariance matrix of data. The entries in Λ 0 are set to be zeros, r 0 ¼ 10:0, R À1 0 ¼ 7:0 Â I 2 , which leads to the mean of Φ equal to I 2 , H e0 ¼ I 3 , α e0j ¼ 9:0, β e0j ¼ 8:0, ν 0 ¼ γ 0 ¼ 0:1. Note that these values are the standard inputs in the latent variable analysis (see [24]). We also took other values for these inputs and found that the resulting estimates are scarcely affected. We implement the proposed algorithm given in Section 3 to conduct Bayesian analysis. Let Y obs be the collection of observed data and Y mis be the set of missing data. Due to the missing data, we need to draw Y mis from p Y mis j Ω; Z; θ; Y obs ð Þ in MCMC sampling. This can be implemented easily since conditioning on Ω, Z, and θ, p Y mis j Ω; Z; θ; Y obs ð Þ , independent of Y obs , has the normal distribution. Hence, drawing Y mis is rather straightforward and fast. To obtain some idea about the number of the Gibbs sampler iterations in getting convergence, we conducted a few test runs as a pilot study and found that in all these runs, the Gibbs sampler converged in about 1000-2000 iterations, where the EPSR values [27] are less than 1.2. So, for all cases under consideration, we collect 3000 random observations after initial 2000 iterations being removed for posterior analysis.  Table 1. Table 1 indicates that the proposed model with six to eight latent states seems to give better fits to the data. Furthermore, we calculate the posterior predictive density estimates of y tj t ¼ 1; ⋯; 4; j ¼ 1; ⋯; 3 ð Þ under one state and seven states, respectively (see  analysis is conducted via MCECM algorithm [36] and the standard error estimates are calculated via Louis formula [37].

Examination of
Based on Table 2, we can find the following facts: First, three estimates of Λ 32 give the positive effects of latent factor ξ on the 'MFT'. This is not surprising since ξ is related to the treatment level of a patient received. But there are obvious differences in magnitudes among the three methods. For FA and HMFA, the former gives b Λ 32 ¼ 0:001 associated with standard deviation 0.014, while the latter gives b Λ 32 ¼ 0:752 with standard deviation 0.045. This reflects that the heterogeneity of data affects the estimates b Λ 32 seriously. Compared to the previous two methods, ML method produces that b Λ 32 ¼ 0:196 with SD = 0.029, which are in between them. Second, the estimates of variance parameters Ψ ej under S ¼ 1 are larger than those under S ¼ 7. This indicates that factor analysis model accommodates heavy tails of data at the expense of variance inflation. Further investigations on the estimates of Φ jj under FA and HMFA also reveal the same phenomenon as that of Ψ ej . However, we observe that the ML estimate of Ψ e3 , the unique variance corresponding to the third item, is equal to 0.008 with SD = NAN, an illogical number, which is very close to an improper Heywood case. As pointed out by Lee [18], Heywood cases in the ML estimation can be avoided by imposing an inequality constraint on Ψ e3 with a penalty function. In the Bayesian approach, the conjugate prior distribution of Ψ À1 e3 specified Ψ e3 in a region of positive values and hence has a similar effect as adding a penalty function. Hence, no Heywood cases are found in the Bayesian solution because of the penalty function induced by the prior distribution on Ψ À1 e3 . Third, three estimates give the negative correlation between η and ξ, which is consistent with the fact that the improvement of treatment will decrease the intensity of cocaine use, thus leading to a decrease of cocaine use in days. ML estimates for Φ jk are very close to those under HMFA. However, the estimate of Φ 12 under S ¼ 1 is À0.018, which is quite different from À0:182 for S ¼ 7. Furthermore, the coefficients of correlation of ξ and η under S ¼ 1 and S ¼ 7 are À0.0204 and À0.6612, respectively. The former suggests that ξ and η are approximately independent while the latter implies stronger dependence between them. Moreover, we computed the posterior probabilities P z t ¼ rjY obs ð Þfor r ¼ 1, ⋯, 7 and t ¼ 1, ⋯, 4 under S ¼ 7 based on 10,000 simulated observations drawn from p ZjY obs ð Þand found that the transition path corresponding to the maximum posterior probability is 7 ! 1 ! 1 ! 1. This implies that latent state of the patient being in is extremely serious at baseline and becomes moderate in the subsequent treatments. This also reflects a positive effect of intervention on the patient's latent state. Note that unlike the common Viterbi algorithm in exploring the optimal transition path of states in ML analysis, calculating posterior probability P z t ¼ rjY obs ð Þwithin Bayesian framework is a by-product of the estimation procedure. This voids the complex computation of marginal likelihood of the observed data and hence is very fast.

Discussion
This chapter reviews Bayesian inferences within a general framework and proposes a Bayesian procedure for analyzing hidden Markov factor analysis model under multivariate longitudinal setting. Compared to ML method, the pragmatic advantage of Bayesian framework is its flexibility and generality for coping with very complex problems. When good prior information can be available, results obtained from Bayesian method are more reliable and accurate than that under ML. With increased access to computation advances in simulation-based approaches, in particular the MCMC methodology, Bayesian inferences provide enormous scope for realistic statistical modeling.
Although we concentrate our attention on applications of the hidden Markov factor analysis model, the methodology developed in this chapter can be extended to the case where the LVM is nonlinear. Another possible extension is to consider a dynamic LVM, wherein model parameters vary over time. These extensions will raise theoretical and computational challenges and certainly require further investigation.
Hence, drawing Z can be accomplished via single-component method by drawing z i independently from p z i j ω i ; δ; Q; θ; y i À Á . Furthermore, notice that the sequences y i ; ω i ; z i È É are still the one-order Markov sequences. Hence, we can simulate z i through a well-known forward filtering-backward sampling algorithm (see, for example, [38]). For notation clarity, we suppress θ , δ, and Q in the following derivations.
Forward filtering-backward sampling (FFBS) consists of first forward filtering (FF) and then backward sampling (BS). The forward filtering step recursively updates α i, t|t ¼ p z it j ω i, 1:t ; y i, 1:t , t ¼ 1, …, T: Here y i, 1:t represents the set of observations of subject i up to time t and so are ω i, 1:t and z i, 1:t . The backward sampling is to draw z i from the joint distribution of the states given the data using p z i, 1:T j ω i, 1:T ; y i, 1:T ¼ p z iT j ω i, 1:T ; y i, 1:T … p z i1 jz i, 2:T ; ω i, 1:T ; y i, 1:T : That is, we first draw the last state given all the data and then work backwards in time drawing each state conditional on all the subsequent ones.
To implement forward filtering, let α it r ð Þ ¼ P y i, 1:t ; ω i, 1:t ; z it ¼ r , t ¼ 1, ⋯, T Obviously, α i1 r ð Þ ¼ δ r p y i1 ; ω i1 jz i1 ¼ r À Á . Moreover, it can be shown that The outputs α it f g T t¼1 from recursive Eq. (20) can be used to calculate the posterior probability which leads to the forward filtering (FF) iteration.