Summary of under competing models in the analysis of cocaine use data.
The purpose of this chapter is to provide an introduction to Bayesian approach within a general framework and develop a Bayesian procedure for analyzing multivariate longitudinal data within the hidden Markov factor analysis framework.
- hidden Markov factor analysis model
- Markov chain Monte Carlo sampling
- cocaine use
The Bayesian approach is now well recognized in the statistics literature as an attractive approach to analyzing a wide variety of models , and there is rich literature on this issue. Here, we are not going to present a full coverage on the general Bayesian theory, and readers may refer to excellent books, for example [2, 3], for more details for this general statistical method. This chapter provides an introduction to the Bayesian approach within a general framework and develops a specific Bayesian procedure for analyzing multivariate longitudinal data within the hidden Markov factor analysis framework. We begin with the basic ideas of the Bayesian approach and then describe the model under consideration in the second section. The following section considers Bayesian inferences including parameter estimation, model selection, and posterior density estimates. The final section demonstrates the practical value of proposed methodology to cocaine use data to get some Bayesian results. Some technical details are given in the Appendix.
Consider a data set with the probability model where is a univariate or multivariate population parameters vector, which quantifies the uncertainty of data. In the statistical literature, is called likelihood or sampling distribution and often represented as . From the frequency statistics point of view, statistical inferences are carried out based on . In this case, , though unknown, is treated as fixed. Unlike the frequency statistical inferences, the Bayesian approach for data analysis assumes that is random and has a distribution . This distribution, which represents the knowledge about , is referred to as prior distribution or prior. When data are available, the information on is summarized within the posterior distribution or posterior, a conditional distribution given data, i.e.,
where is the marginal distribution of . The right-hand-side term in (1) omits the factor since given it is a known constant. In Bayes literature, is also termed the unnormalized posterior. Analogous to the role of likelihood in frequency statistical inferences, posterior is the starting point of Bayesian inferences.
Selecting proper priors for parameters is fundamental to Bayesian analysis. Basically, there are two kinds of prior distributions, namely, the noninformative prior distributions and the informative prior distributions. Noninformative prior distributions associate with situations when the prior distributions have no population basis. They are used when we have little prior information on and desire that the prior distributions play a minimal role in the posterior distribution distribution. Informative prior distribution represents the distribution of possible parameter values, from which the parameter has been drawn. We may have prior knowledge about this distribution, either from closed related data or from the subjective knowledge of experts. A commonly used informative prior distribution in the general Bayesian approach to statistical problems is the conjugate prior distribution, a prior ensuring that the posterior distribution follows the same parametric form as the prior distribution [1, 3].
A potential difficulty underlying Bayesian inferences is the statistical computation when posterior distribution takes on the complicated form. This is particularly true in the situation where latent variables or other unobservable quantities are involved in the model, as discussed in this chapter. In such cases, statistical inferences usually recur to simulation-based methods. Among various sampling methods, Markov chains Monte Carlo methods (MCMC) provide powerful tools for simulating observations from posterior. The key to Markov chain simulation is to create a Markov sequence whose stationary distribution is a specified posterior . Posterior inferences are carried out based on these simulated observations. There are many ways of constructing these Markov chains, but all of them, including the Gibbs sampler [4, 5], are special cases of the general framework of Metropolis et al.  and Hastings . However, we do not intend to pursue this issue here, and details on simulation-based methods can be referenced to [2, 3, 8, 9].
In what follows, as an illustration, we will develop a Bayesian analysis procedure for multivariate data under longitudinal setting. Multivariate longitudinal or clustered data occur when multiple items are measured repeatedly over periods of time or across occasions. Under such setting, the primary interest is inference about the dependence of the multiple measurements and the temporal correlation resulting from the repeated measures on the same items. But more often, particular interest also focuses on exploring the potential heterogeneity of data and investigating its transition pattern over time. In these cases, hidden Markov latent variable model (HMLVM) [10–13] provides a feasible and unified framework to address these issues. HMLVM assumes that the overall model constitutes the observed process and the underlying hidden state process. The state process, as the convention in the classic HMM (see for example, [14–17]), is an univariate discrete process, which follows a first-order Markov chain, while the observed process, conditional on the state sequence, is an independent process with emission distribution specified via LVMs . Hence, in this regard, HMLVM provides a unified way of describing the correlation of multiple items, temporal dependence, and heterogeneity among the data simultaneously. However, the current existing developments cited beforehand focus on the maximum likelihood analysis in which statistical inferences heavily depend on the asymptotic properties. As an illustration of Bayesian inferences on practical problems, in this chapter, we develop a Bayesian procedure to analyze cocaine use data within the hidden Markov factor analysis model framework. Compared to ML, a basic nice feature of a Bayesian approach is its flexibility to utilize useful prior information for achieving better results. Additionally, simulation-based Bayesian methods depend less on asymptotic theory and hence have the potential to produce more reliable results even with small samples.
2. Model description
2.1. Hidden Markov factor analysis model
Consider a set of multivariate longitudinal observations formed by -dimensional observed vectors , which are recorded on items over periods of length : across subjects: . In the field of multivariate analysis, interest mainly focuses on exploring item dependence since measurements may be highly correlated arising from the multicollinearity problem. But more often, interest also concentrates on the heterogeneity resulting from the situation where the population of constitutes more than one component. This is particularly true in the situation where the data illustrate extreme behaviors such as multimodal and/or skewed characteristics. In these cases, a finite mixture factor analysis model (FMFAM) can provide a powerful tool to address these issues. Typically, FMFAM assumes that conditioning on an univariate discrete value state variable and an -dimensional () continuous latent factor vector , are independent and distributed with a -dimensional multivariate normal distribution, and meanwhile, given , also follows an -dimensional normal distribution, that is,
where is a -dimensional intercept vector, which represents the baseline level of , is a factor loading matrix, is a diagonal matrix with the th diagonal element , and is an positive definite matrix.
Formulation given in (2) has two basic features: one is to characterize heterogeneity of population of at the occasion level and the other is to establish the dependence among the multiple measurements. The heterogeneous population is specified via state-specific parameters contained in the model while the dependence between different measurements is identified via sharing the common factors in the manner of liner combinations. In particular, apart from explaining the idiosyncratic part of measurements, latent factors also characterize the association between any two measurements. As a matter of fact, one can show that the correlation coefficient between and at state is given by
in which is the th element of and is the th element in , respectively. The strength of correlation is identified by the factor loadings and covariance of factors together. In the case when degenerates to zero (i.e., ) or , the association among items disappears and model (2) reduces to -independent mean-variance models within cluster . Hence, latent factors play a dominant role in characterizing association of multiple items. Note that, in actual applications, latent factors, though unobservable, often have their own physical interpretations. In psychology, for example, latent factors are often used to identify concepts such as treatment, temper, and anxiety, which are important within the framework of theoretical models. The measurements are just proxies for these unobserved concepts of interest. We will provide further interpretations in the real example.
The primary reason for collecting information on multiple occasions for each subject is that it allows investigation of change and/or temporal dependence over time within the subject. There exist various constructs for characterizing dynamic characteristics. A commonly used method is to construct proper dynamic structures for latent factors and establish dynamic factor models, see for example, [19–21]. An alternative choice we adopt here is specifying the joint distribution for state sequences. Following the common routine (see, for example, [22, 23]), we assume that each individual state sequence satisfies the following first-order hidden Markov model
where and are, respectively, the initial distribution and transition probability given by
where is a positive integer, is an vector satisfying and , and is an transition matrix with the th entry being , that is, and for . Modeling state sequences into (5) allows us to explore the transition pattern of individuals across occasions exactly. For example, in the cocaine use data analysis, is often identified with the latent state of patient at time , then specifies how individual being in state transfers to state on two successive occasions. Surely, we can relax the time-homogeneous assumption of transition probabilities by including relevant covariates to interpret the inhomogeneous transition behavior among observation data (see, for example, [12, 13, 16]) but at the expense of computational burden.
The current model defined in (2)–(5) provides a comprehensive framework for modeling the multivariate longitudinal data with the latent variables. It accommodates the dynamic behavior of observed sequences, heterogeneity of observed data at the occasion level, and dependence among the multiple items simultaneously. In particular, it makes sense to measure effects of latent factors on the manifest variables quantitatively.
Let be the collection of all observations, and be the set of corresponding factors. Denote be set of state variables. It follows from Eqs. (2), (4) and (5) that the joint sampling distribution of , and is given by
where is formed by free parameters in , and . Here, we write and denote the indicator function of a set . The observed data likelihood is then achieved by taking integration of over and , which involves high-dimensional integrations.
3. Posterior inferences
3.1. Prior specifications
Let , , , and . For the Bayesian analysis, we need to assign priors to the unknown parameters involved for completing model specification. Since , , and are involved in different submodels, it is natural to assume that , and are mutually independent and the components contained in are also mutually independent, that is,
For the convenience of conjugacy, we assume that the parameters are drawn from the following commonly used conjugate types prior distributions (see for example ).
where ‘’ denotes the inverse Gamma distributions with shape and scale and ‘’ represents the -dimensional inverse Wishart distribution with degrees of freedom and scale matrix ; is the th row vector of . The scalars , , , , , the vectors , , and the matrices and are assumed to be known. Thus, standard conjugate priors were specified for all parametric components in the model. The conjugate type prior distributions are sufficiently flexible in most applications, and for situations with a reasonable amount of data available, the hyperparameters scarcely affect the analysis. It should be noted that although Eq. (8) allows different hyperparameters for different latent states, in practice, we choose identical priors for all . Details of hyperparameter choices are discussed later when we present the empirical results.
3.2. Gibbs sampling scheme and posterior analysis
Combining the sampling distribution for the observable ’s and the prior distribution specified in (8) yields the joint posterior distribution of given by
where we ignore the normalization constant . However, due to the latent factors and state variables present, the computation of is intractable since it involves high-dimensional integrals. Consequently, no closed form can be available for the posterior . This problem can be addressed via the data augmentation idea in Tanner and Wong . Data augmentation technique treats the latent quantities as the hypothetical missing data and augments them with the observed data to form complete data. The posterior analysis is now carried out based on the joint distribution , which is proportional to , the product of likelihood of complete data and priors. Compared to the intractable observed data likelihood, the complete data likelihood has nice hierarchical structure based on conditional independent assumptions in (2) and (4) and hence is relatively easy to analyze. However, is still not in closed form and is thus difficult to deal with analytically. In this regard, simulation-based methods can be used to generate observations to carry out posterior analysis. In view of the multiple components involved, the usual independent sampling methods are not feasible. Note that, on the basis of complete data, the full conditional distributions of and have closed forms. This provides a solid foundation for Markov chain Monte Carlo methods. Markov chain Monte Carlo sampling does not draw observations from directly. On the contrary, it generates observations from the full conditionals of each component alternatively, thus forming the dependent sample, i.e., Markov chains. Specifically, as pointed out in the introduction, we use Gibbs sampler [4, 5] to draw observations from this target distribution. Obviously, the sampling scheme in the Gibbs sampler includes two types of moves: updating the components involved in the factor analysis model and updating the components related to the hidden Markov model. We propose using the following Gibbs sampler which iteratively simulates from the conditional distributions, where variables are removed from the conditioning set either by explicit integration or by conditional independence. The steps involved in the Gibbs sampler are
Step a: Generate from
Step b: Generate from
Step c: Generate from
Step d: Generate from
Step e: Generate from
Step f: Generate from
Under mild conditions and similar to  (see also, for example, ), one can show that for sufficiently large , say , the joint distribution of converges at an exponential rate to the desired posterior distribution . Hence, can be approximated by the empirical distribution of where is chosen to give sufficient precision to the empirical distribution. The convergence of the Gibbs sampler can be monitored by the ‘estimated potential scale reduction (EPSR)’ values as suggested by Gelman and Rubin  or by plotting the traces of estimates against iterations under different starting values.
Simulated observations obtained from the posterior can be used for statistical inferences via straightforward analysis procedures. For brevity, let be the random observations generated by the Gibbs sampler from . The joint Bayesian estimate of and can be obtained easily via the corresponding sample means of the generated observations as follows:
Clearly, these Bayesian estimates are consistent estimates of the corresponding posterior means, see . The consistent estimates of covariance matrix of estimates can be obtained as follows:
Hence, the standard error estimates can be obtained conveniently by the Gibbs sampler algorithm. Other statistical inferences about and such as deriving the confidence intervals and statistics for hypothesis testing can be achieved based on the simulative observations as well (see, for example, [28, 29]).
One important statistical inference beyond estimation is on testing of various hypotheses about the model. In the field of hidden Markov modeling, determining the proper number of states may be the first step towards data analysis. Too many states may overfit the observations, meaning that it can fit the training data accurately but may not be a good model for underlying data-generating process. On the other hand, too few states may not be flexible enough to approximate the underlying model. In the context of Bayesian model selection, Bayes factor (BF, ) is a popular choice for model comparison. BF is defined as the ratio of the marginal likelihoods of data under two competing models. However, the computation of BF is difficult since it often involves the high-dimensional integrations. It has also been shown that BF is sensitive to the choice of priors and will become infeasible when improper priors are used. A simple and more convenient alternative is the -measures [31–34] which is based on the posterior predictive density. It has been shown  that this approach is conceptually and computationally simple and is useful in model checking for wide varieties of complicated situations. Moreover, the required computation is a by-product of the common Bayesian simulation procedures such as the Gibbs sampler or its related algorithms. Specifically, let denotes future values of in a replicate experiment, that is, has the same sampling density as that of . The posterior predictive distribution is defined as
Naturally, if the posited model under consideration is the true model in the sense that from which the data are generated, then would behave like data and its squared biases and covariances should be small. With this notion in mind, Ibrahim, Chen, and Sinha  proposed an statistics to assess the fitness of posited models to the data by weighting the squared biases and covariance, which can be interpreted as a trade-off between them. Here, we extend it to the multivariate longitudinal setting. Let be a collection set of future responses in our proposal. For some , we consider the following multivariate version of -measure:
where the expectation is taken with respect to the posterior predictive distribution. Clearly, small values of the -measure indicate that the model gives predictions close to the observed values, and the variability in the predictions is low as well. Hence, the model with the smallest -measure is selected from a collection of competing models. It has been shown that -measure with has nice theoretical properties . Thus, this value of will be used in our empirical illustrations.
4. Cocaine use data analysis
In this section, a small portion of cocaine use data is analyzed to illustrate the practical value of the proposed methodology. The original data are collected from 321 cocaine use patients who were admitted in 1988–1989 to the West Los Angeles Veterans Affairs Medical Center. The whole data constitute 68 measurements of 17 items, which were recorded at four time points: at baseline, 1 year after the treatment, 2 years after the treatment, and 12 years after the treatment in 2002–2003. These measurements cover the information on the cocaine use, treatment received, psychological problems, social status, employments, and so on. As an illustration, three variables are selected to conduct data analysis: ‘days of cocaine use per month at intake (CC)’, ‘times per month in formal treatment (FT)’, and ‘months in formal treatment (MFT)’, which, respectively, represent the severity of cocaine use and the levels of treatment received by a patient. Since these variables were measured in 0–120 points scale, to unify the scales, we take logarithms and standardize them. Among them, some measurements are missing. The missing proportion is about . For brevity, we assume that the missing is missing at random . A distinct characteristic underlying data are nonnormal and heavy tailed. Figure 1 gives the plots of histograms and the posterior predictive density estimates (see below) of logarithms of CC, FT, and MFT (with missing data removed) on four occasions. The histograms illustrate that the distributions of selected variables are deviated from normality in terms of multimodality and skewness. The skewness and kurtosis of CC on four occasions are , , , and , respectively. Data set also demonstrates dynamic characteristics. The distribution of CC, for instance, is skewed to the left at baseline and moves to the right gradually on the following two occasions and becomes right-skewed eventually. This implies that a single factor analysis model may not be appropriate to fit the data at each time point.
In this analysis, one of the objectives is to explore the effects of latent factors on the observed variables and assess the dependence among latent factors. Based on the nature of the problem under consideration, it is natural to group the single variable ‘CC’ to reflect one latent factor ‘cocaine use’ () and to group ‘FT’ and ‘MFT’ to represent another latent factor ‘treatment’ (). Let and . To be convenient for interpretation and computation, and are restricted to be invariant across states but leave the baseline level varying with . Further, the following non-overlapped structure for factor loading matrix is considered
where parameters with an asterisk are treated as fixed for identification. Note that fixing indicates that is identified with CC. This is similar to that in . Hence, in this case, in measures the magnitude of dependence of on .
Data set is fitted to the proposed models with 10 different transition models: . Although these state spaces are in nested forms, the corresponding models are not, since one cannot be reduced to another by constraining parameters in the interior of parameter space. This indicates that chi-square distribution may not be suitable for the classic likelihood ratio test statistic. We use -measure to implement model selection. Obviously, if is taken, then the proposed model reduces to common factor analysis model (CFA, ).
The following inputs are taken for the super-parameters involved in the prior distributions (8): for , , , where is the sample covariance matrix of data. The entries in are set to be zeros, , , which leads to the mean of equal to , , , , . Note that these values are the standard inputs in the latent variable analysis (see ). We also took other values for these inputs and found that the resulting estimates are scarcely affected.
We implement the proposed algorithm given in Section 3 to conduct Bayesian analysis. Let be the collection of observed data and be the set of missing data. Due to the missing data, we need to draw from in MCMC sampling. This can be implemented easily since conditioning on , , and , , independent of , has the normal distribution. Hence, drawing is rather straightforward and fast. To obtain some idea about the number of the Gibbs sampler iterations in getting convergence, we conducted a few test runs as a pilot study and found that in all these runs, the Gibbs sampler converged in about 1000–2000 iterations, where the EPSR values  are less than 1.2. So, for all cases under consideration, we collect 3000 random observations after initial 2000 iterations being removed for posterior analysis.
We calculate the values of under each fitting. For computation, we use simulation-based method by drawing predictive values from , where is the hypothetical replication of . Note that . Hence, drawing is rather easy when , and are available. Given that we have simulations from the posterior of via MCMC sampling discussed before, we just draw one from for each , and and obtain simulations in the end for . Based on these simulated observations, measures can be estimated consistently via sample means. We draw 3000 observations after convergence of MCMC algorithm for calculating and the results are reported in Table 1.
Examination of Table 1 indicates that the proposed model with six to eight latent states seems to give better fits to the data. Furthermore, we calculate the posterior predictive density estimates of under one state and seven states, respectively (see Figure 1). It can be seen clearly that our proposed method is successful in capturing the skewness and modes of data while factor analysis model fails. For the computation details, we choose 60–100 equally spaced grids in the interval and collect 3000 simulated observations from the Gibbs sampler at each point after removing initial 2000 iteration as burn-ins.
Table 2 presents the summary of Bayesian estimates of unknown parameters and their standard errors using the formula given in (11) with (denoted by FA) and (denoted by HMFA). For comparison, maximum likelihood estimates of unknown parameters with their standard deviations under HMFA are also presented in Table 2. The maximum likelihood analysis is conducted via MCECM algorithm  and the standard error estimates are calculated via Louis formula .
Based on Table 2, we can find the following facts: First, three estimates of give the positive effects of latent factor on the ‘MFT’. This is not surprising since is related to the treatment level of a patient received. But there are obvious differences in magnitudes among the three methods. For FA and HMFA, the former gives associated with standard deviation 0.014, while the latter gives with standard deviation 0.045. This reflects that the heterogeneity of data affects the estimates seriously. Compared to the previous two methods, ML method produces that with SD = 0.029, which are in between them. Second, the estimates of variance parameters under are larger than those under . This indicates that factor analysis model accommodates heavy tails of data at the expense of variance inflation. Further investigations on the estimates of under FA and HMFA also reveal the same phenomenon as that of . However, we observe that the ML estimate of , the unique variance corresponding to the third item, is equal to 0.008 with SD = NAN, an illogical number, which is very close to an improper Heywood case. As pointed out by Lee , Heywood cases in the ML estimation can be avoided by imposing an inequality constraint on with a penalty function. In the Bayesian approach, the conjugate prior distribution of specified in a region of positive values and hence has a similar effect as adding a penalty function. Hence, no Heywood cases are found in the Bayesian solution because of the penalty function induced by the prior distribution on . Third, three estimates give the negative correlation between and , which is consistent with the fact that the improvement of treatment will decrease the intensity of cocaine use, thus leading to a decrease of cocaine use in days. ML estimates for are very close to those under HMFA. However, the estimate of under is −0.018, which is quite different from for . Furthermore, the coefficients of correlation of and under and are −0.0204 and −0.6612, respectively. The former suggests that and are approximately independent while the latter implies stronger dependence between them.
Moreover, we computed the posterior probabilities for and under based on 10,000 simulated observations drawn from and found that the transition path corresponding to the maximum posterior probability is . This implies that latent state of the patient being in is extremely serious at baseline and becomes moderate in the subsequent treatments. This also reflects a positive effect of intervention on the patient’s latent state. Note that unlike the common Viterbi algorithm in exploring the optimal transition path of states in ML analysis, calculating posterior probability within Bayesian framework is a by-product of the estimation procedure. This voids the complex computation of marginal likelihood of the observed data and hence is very fast.
This chapter reviews Bayesian inferences within a general framework and proposes a Bayesian procedure for analyzing hidden Markov factor analysis model under multivariate longitudinal setting. Compared to ML method, the pragmatic advantage of Bayesian framework is its flexibility and generality for coping with very complex problems. When good prior information can be available, results obtained from Bayesian method are more reliable and accurate than that under ML. With increased access to computation advances in simulation-based approaches, in particular the MCMC methodology, Bayesian inferences provide enormous scope for realistic statistical modeling.
Although we concentrate our attention on applications of the hidden Markov factor analysis model, the methodology developed in this chapter can be extended to the case where the LVM is nonlinear. Another possible extension is to consider a dynamic LVM, wherein model parameters vary over time. These extensions will raise theoretical and computational challenges and certainly require further investigation.
The authors are grateful to the editor’s valuable suggestions and comments which have greatly improved the manuscript. Xia’s work was fully supported by grant from National Nature Science Foundation of China (No. 11471161) and Tang’s work was supported by National Science Fund for Distinguished Young Scholars of China (No. 11225103).
Let denote the sequence of latent factors across occasions for individual . To draw state variables from , we first notice that
Hence, drawing can be accomplished via single-component method by drawing independently from . Furthermore, notice that the sequences are still the one-order Markov sequences. Hence, we can simulate through a well-known forward filtering-backward sampling algorithm (see, for example, ). For notation clarity, we suppress , , and in the following derivations.
Forward filtering-backward sampling (FFBS) consists of first forward filtering (FF) and then backward sampling (BS). The forward filtering step recursively updates
Here represents the set of observations of subject up to time and so are and . The backward sampling is to draw from the joint distribution of the states given the data using
That is, we first draw the last state given all the data and then work backwards in time drawing each state conditional on all the subsequent ones.
To implement forward filtering, let
Obviously, . Moreover, it can be shown that
The outputs from recursive Eq. (20) can be used to calculate the posterior probability
which leads to the forward filtering (FF) iteration.
The backward sampling step depends on the observation that
The last equation holds since given , does not depend on the previous values due to the Markov Chain characteristics of . This leads to
Hence, FFBS algorithm for drawing is implemented by
running the recursion and stored the conditional probabilities for ;
sampling from the filtered conditional probability ;
for , sampling from the conditional probability
To draw , we first note that
with . Hence, similar to that in drawing , updating can be achieved by drawing independently from for and . It can be shown that
To draw , we can first draw from and then draw from . For this end, let be the size of cluster , and let . Denote
be the sample means and covariance matrices of and within the th cluster, respectively. By some algebra calculations, it can be shown that
in which is the th element in , is the th main diagonal element of , and is the th column vector of .
From the prior distribution of and the distribution of , it can be shown that
where and are given in (c). Hence, is the -dimensional inverse Wishart distribution . It can be shown from exactly the same reasoning as before that drawing can be achieved by drawing from independently.
(e) and (f)
It can be verified directly that
in which . Similarly, it can be shown that
in which .