Bayesian Estimation of Multivariate Autoregressive Hidden Markov Model with Application to Breast Cancer Biomarker Modeling

Additional information is available at the end of the chapter Abstract In this work, a first-order autoregressive hidden Markov model (AR(1)HMM) is proposed. It is one of the suitable models to characterize a marker of breast cancer disease progression essentially the progression that follows from a reaction to a treatment or caused by natural developments. The model supposes we have observations that increase or decrease with relation to a hidden phenomenon. We would like to discover if the information about those observations can let us learn about the progression of the phenomenon and permit us to evaluate the transition between its states (supposed discrete here). The hidden states governed by the Markovian process would be the disease stages, and the marker observations would be the depending observations. The parameters of the autoregressive model are selected at the first level according to a Markov process, and at the second level, the next observation is generated from a standard autoregressive model of first order (unlike other models considering the successive observations are indepen-dents). A Markov Chain Monte Carlo (MCMC) method is used for the parameter estimation, where we develop the posterior density for each parameter and we use a joint estimation of the hidden states or block update of the states.


Introduction
The main motivation behind this work is to characterize progression in breast cancer. In fact, disease progression cannot be assessed correctly without the use of biomarkers, which would effectively monitor the evolution of the patient health state, and this is the case for breast cancer. The major challenge in this matter for researchers and clinicians is to unravel the stage of the disease, so as to tailor the treatment for each patient and to monitor the response of a patient to a treatment.
Currently, studies have shown that there is a correlation between the levels of certain markers such as cancer antigen CA15-3, carcinoembryonic antigen (CEA), and serum HER2 Neu with the stage of the disease [1]. This gives an opportunity of using a hidden Markov model (HMM) to predict the stage of the disease based on biomarker data and to address the effectiveness of the treatments in their influence on the transition of the cancer from one state to another. In HMM, we have two constituents: the Markovian hidden process suitable to represent the breast cancer stage and the observation process given by the biomarker data. By the way, we can learn about the disease transition rates and how it progresses from primary breast cancer to advanced cancer stage, for example.
Indeed, HMM is a useful tool for tackling numerous concrete problems in many fields but some possible applications of HMM are in speech processing [2], in biology [3], in disease progression [4], in economics [5,6], and in gene expression [7]. For a complete review of HMM, the reader is referred to Zucchini and MacDonald [8], in which properties and definitions of HMM are presented in a plausible way with both classical estimation by maximum likelihood method and expectation maximization (EM) algorithm and the new Bayesian inference is addressed.
The model we consider here is a variation of the regular hidden Markov model, since we will use extensions to incorporate dependence among successive observations, suggesting autoregressive dependence among continuous observations. Consequently, we have relaxed the conditional independence assumption from a standard HMM, because we would like to add some dynamics to the patient disease progression and because in reality the current patient biomarker observation is dependent on the past one. In fact, the autoregressive assumption in HMM has shown its advantage over regular HMM that cannot catch the strong dependence between successive observations (e.g., Ref. [9]). A similar model to ours can be found in Ref. [10]. This kind of models, which were first proposed in Ref. [11] to describe econometrics time series, is generalization of both HMM and autoregressive models, will be effective in representing multiple heterogeneous dynamics such as the disease progression dynamics, and can be even generalized to a regime-switching ARMA models such as in Ref. [12]. Moreover, Our model can also be viewed as an extension of the multivariate double-chain Markov model (DCMM) developed by Ref. [13], where there are two discrete Markov chains of first order: the first Markov chain is observed and the second one is hidden. In contrast to this DCMM, our multivariate first-order autoregressive hidden Markov model (MAR(1)HMM) will lead to continuous observations, where each observation conditional on the hidden process will depend on the previous observation according to an autoregressive process of first order. This dynamic is promising for continuous observed disease biomarkers.
Parameter estimation is very challenging for HMM family models since the likelihood is not available in a closed form most of the time. Thus, we call for a Markov Chain Monte Carlo (MCMC) procedure instead of a maximum likelihood-based approach. This choice rises from the fact that the Bayesian analysis uses prior knowledge about the process being measured, and it allows direct probability statements and an approximation of posterior distributions for the parameters. Instead in the maximum likelihood approach, we cannot have declared prior or have exact distribution for the parameters when the likelihood is untractable or when we have missing data (e.g., Refs. [14][15][16]).
Since the realization of HMM includes two separate entities: the parameters and the hidden states, the Bayesian computation is carried out after augmenting the likelihood by the missing hidden states [17]. The hidden states are sampled using a Gibbs sampler adopting a joint estimation of the hidden states or block update of the states (instead of a single update of each state separately) by means of a forward filtering/backward smoothing algorithm. Given the hidden states, we can compute the autoregressive parameters and the transition probabilities of the Markov chain by Gibbs sampler from their posterior densities after specifying conjugate priors for the parameters. Hence, the MCMC algorithm will alternate between simulating the hidden states and the parameters. Finally, we can obtain posteriors statistics such as the means, standard deviations and confidence intervals after assessing the convergence of the MCMC algorithm.
This chapter is organized as follows: after a preliminary on HMM, a description of the model is given in Section 3. In Section 4, we will give the Bayesian estimation of the parameters and the hidden states and provide the details of the MCMC algorithm, before presenting the results of a simulation studies in Section 5 and we will finish by a conclusion.

Preliminary
Since the model suggested is of the HMM type, we will describe HMM in more detail: an HMM is a stochastic process X t , Y t f g T t¼0 , where X t f g T t¼0 is a hidden Markov chain (unobservable) and Y t f g T t¼0 is a sequence of observable independent random variables such that Y t depends only on X t for the time t = 0,1,…T. Here the process X t f g T t¼0 evolves independently of Y t f g T t¼0 and is supposed to be a homogeneous finite Markov chain with probability transition matrix Π of dimension a Â a, where a indicates the number of the hidden states and Π 0 = (Π 01 ,…,Π 0a ) is the initial state distribution.
We denote the probability density function of Y t = y t given X t = k for k ∈ {1, …, a} with P xt ðy t , θ k Þ, where θ k refers to the parameters of P when X t = k. We suppose further that the processes Y t |X t and Y t 0 |X t 0 are independent for t 6 ¼ t 0 . Let Θ = (θ 1 ,…, θ a ) and θ = (Π 0 , Π, Θ), and then, the HMM can be described as follows: First, the likelihood of the observations and the hidden states can be decomposed to Pðy 0 , …, y T , x 0 , …, x T , θÞ ¼ Pðy 0 , …, y T jx 0 , …, x T , θÞPðx 0 , …, x T , θÞ: Πðx t jx tÀ1 Þ. Under the conditional independence of the observations given the hidden states, Pðy 0 , …, y T jx 0 , …, x T , θÞ ¼ P x0 ðy 0 jθ x0 Þ Y T t¼1 P xt ðy t jθ xt Þ.
Consequently, the likelihood function for the hidden states and the observations is given by Πðx t jx tÀ1 ÞP xt ðy t jθ xt Þ.

Model description and specification
The MAR(1)HMM model we consider in this work is a hidden Markov model, where conditionally on the latent states, the observations are not independent like it is the case for a regular hidden Markov model. Instead, the current observation is allowed to depend on the previous observation according to an autoregressive model of first order. As in an HMM model, the latent states evolve according to a discrete first-order time homogeneous Markov model. We consider data of n continuous random variables observed over time, each of potentially different lengths, i.e., for each individual i = 1,2, …, n, we observe a vector y i;: ¼ ðy i, ui , …, y i, mi Þ T , with u i < m i . We assume, for i = 1,2,…,n for integer time t = u i ,…,m i , that the random variable Y i,t taking nonnegative values depends only on the states X t and the previous observation Y i,t-1 , and based on the model developed by Farcomeni and Arima [10], we get the following model: The choice of the autoregressive part of the model is motivated by the fact that successive biomarker observations are most of the time correlated from many diseases unlike the hypothesis of independence between observations in HMMs.
We interpret x as the vector of the hidden health states of the patients; in the case of breast cancer, those states would be localized or advanced metastatic breast cancer for example, while y is the vector of the biomarkers observed and measured for the patients. The ε i , t are normal variables with mean 0 and variance σ 2 such that ε i , t and ε i 0 , t 0 are uncorrelated, (i, t) 6 ¼ (i 0 , t 0 ).
The parameters β ðxtÞ and μ ðxtÞ are parameters taking values in R for each hidden state and Similar to Ref. [13], the transition matrix of the Markov chain Π is time homogeneous with dimension a Â a where a is the number of hidden states, and Π = (Π gh , g = 1,…,a; h = 1,…,a) where Π gh = P(X t = h|X t-1 = g), for g = 1,2,…,a; h = 1,2,…, a; and t = u 0+1 ,…, M. We let the first state X u0 to be selected from a discrete distribution with vector of probabilities r = (r 1 ,…,r a ). Also we consider the time of initial observation u i , the initial observed state y i, ui , and the number of consecutive time points that were observed m iu i + 1. Let μ = (μ (1) ,…, μ (a) ), β = (β (1) ,…,β (a) ), and θ = (μ, β, σ 2 , r, Π) be the set of all parameters in the model. We suppose that the individuals, i.e., Y i , t , behave independently conditionally on X. Therefore, for i = 1,…,n, Pðy i;: Then, the likelihood density for the observations of all individuals y = (y 1 ,…,y n ) given first time vector of observations y 0 ¼ ðy 1;u1 , …, y n, un Þ, x, and θ is Pðy i;: jy i, ui , x, θÞ, This is due to the conditional independence of the y i , given x and θ. The joint mass of each y i ,. and x given y ui and θ can be written as follows: Pðy i;: , xjy i, ui , θÞ ¼ Pðy i;: jy i, ui , x, θÞ Â Pðxjy ui , θÞ: Using the Markov property of the hidden process, we have after simplification Pðxjy i, ui , θÞ ∝ Pðy i, ui jx, θÞPðxjθÞ ¼ Pðy i, ui jx ui , θÞr xu 0 Π xu 0 , xu 0 þ1 Â ⋯ Â Π xMÀ1, xM : In addition, Pðy i;: Pðy i, t jy i, tÀ1 , x, θÞ, and consequently, Pðy i, t jy i, tÀ1 , x, θÞ: Finally, under the hypothesis of normal error distribution for the autoregressive parameters of the model (Eq. (1)) and the Chapman-Kolmogorov property, the joint distribution of y i ,. and x given y i, ui and θ can be simplified to: where φ denotes the density of a standard normal distribution N ð0; 1Þ and χ A f g ðxÞ is the usual indicator function of a set A. Finally, the joint distribution of y and x has the following form: (2)

Bayesian estimation of the model parameters
We will use a Bayesian approach to estimate the model parameters. Inference in the Bayesian framework is obtained through the posterior density, which is proportional to the prior multiplied by the likelihood. The posterior distribution for our model, as in most cases, cannot be derived analytically, and we will approximate it through MCMC methods specifically designed for working with the augmented likelihood with the hidden states. In fact, MCMC methods start by specifying the prior density Π(θ) for the parameters. Since the data Y are available, the general sampling methods work recursively by alternating between simulating the full conditional distributions X given y and θ given x and y.

Prior distributions
Under the assumption of independence between the parameters θ ¼ ðμ, β, σ 2 , r, ΠÞ, the prior density could be written as PðθÞ ¼ PðrÞPðΠÞPðμÞPðβÞPðσ 2 Þ. r is the parameters of a multinomial distribution; hence, the natural choice for the prior would be a Dirichlet distribution r $ Dðα 01 , …, α 0a Þ. Later on, X a j¼1 Π ij ¼ 1, and we assume that Π i $ Dðδ i1 , …, δ ia Þ for each row i of the transition matrix. This choice of the Dirichlet prior can be even the default Dð1;…; 1Þ as recently discussed in Ref. [18]. In fact, a Dirichlet prior is justified because the posterior density of each row of the transition matrix is proportional to the density of a Dirichlet distribution, and hence, choosing a Dirichlet prior would give a posterior Dirichlet. This can be justified as follows for a given set of parameters λ ¼ ðλ 1 , …, λ a Þ from a discrete or from a multinomial density: This probability mass function can be expressed, using the gamma function Γ, as Γðxiþ1Þ Y a i¼1 λ xi i : This form shows its resemblance to the Dirichlet distribution, and by starting from supposing the prior λ ∝ Dðα 0 , …, α a Þ, the posterior is Furthermore, concerning the priors for parameters of the autoregressive model, we suppose for h ¼ 1;…, a: μ ðhÞ $ N ðα h , τ h Þ, β ðhÞ $ N ðb h , c h Þ, and inverse gamma (IG) prior for σ 2 $ IG ðε, ζÞ. α h , τ h , b h , c h , E, ζ are hyperparameters to be specified. For more details on Bayesian inference and prior selection in HMM, the reader is referred to Ref. [19]. In our case, prior distributions for the autoregressive parameters were proposed by Ref. [20] for a mixture autoregressive model, who points out that they are conventional prior choices for mixture models.

Sampling the posterior distribution for the hidden states
Chib [21] developed a method for the simulation of the hidden states from the full joint distribution for the univariate hidden Markov model case. We will describe his full Bayesian algorithm for the univariate hidden Markov model before a generalization to our MAR(1) HMM.

Chib's algorithm for the univariate hidden Markov model for estimation of the states
Suppose we have an observed process Y n ¼ ðy 1 , …, y n Þ and the hidden states X n ¼ ðx 1 , …, x n Þ, θ are the parameters of the model. We adopt for simplicity X t ¼ ðx 1 , …, x t Þ the history of the states up to time t and X tþ1 ¼ ðx tþ1 , …, x n Þ the future from t + 1 to n. We use the same notation for Y t and Y t+1 .
Chib [21] shows that it is preferable to simulate the full latent data X n ¼ ðx 1 , …, x n Þ from the joint distribution of x 1 , …, x n jY n , θ, in order to improve the convergence property of the MCMC algorithm because instead of n additional blocks if each state is simulated separately, only one additional block is required. First, we write the joint conditional density as PðX n jY n , θ, ΠÞ ¼ Pðx n jY n , θÞPðx nÀ1 jY n , x n , θ, ΠÞ Â ⋯ Â Pðx 1 jY n , X 2 , θ, ΠÞ: For sampling, it is sufficient to consider the sampling of x t from Pðx t jY n , X tþ1 , θ, ΠÞ. Moreover, Pðx t jY n , X tþ1 , θ, ΠÞ ∝ Pðx t jY t , θ, ΠÞPðx tþ1 jx t , ΠÞ. This expression has two ingredients: the first is Pðx tþ1 jx t , ΠÞ, which is the transition matrix from the Markov chain. The second is Pðx t jY t , θ, ΠÞ that would be obtained by recursively starting at t = 1.

Sampling Π
Under the prior assumption of Dirichlet prior for each row of the transition matrix PðΠ i Þ ∝ Dðδ i1 , …, δ ia Þ, and the independence assumption between those rows, the posterior distribution for Π i can be developed using Eq. (2) as follows: Let n ij denote the number of single transitions from state i to state j, so ∝ Dðδ i1 þ n i1 , …, δ ia þ n ia Þ:

Simulation study
In this section, we apply our results to the breast cancer model discussed earlier. The main reason behind our work is that the progression of breast cancer cannot be seen directly unless we use observations related to the disease that could characterize its progression; those observations here are quantities which could be measured; they are called biomarkers, where the word biomarker is used to designate any objective indication of a biological process or disease condition including during treatment and should be measurable. Furthermore, biomarkers are increasingly used in the management of breast cancer patients. One example is reported in Ref. [22], stating that there is correlation between elevation of CEA and/or CA15-3 and disease progression, in breast cancer patients. Also we use the autoregressive dependence among the observations to add more dynamics to the model unlike conventional HMMs where the successive observations given the Markov process are independent. We used the classification of breast cancer in three states: local where the disease is confined within the breast, the regional phase when the lymph nodes are involved, and the distant stage where the cancer is found in other parts of the body. We restrict ourselves to these three stages unlike other stage classifications that divide the progression in more than three stages such as the TNM (tumor, node, and metastasis) system. By lack of finding data about breast cancer biomarkers, we will confine ourselves to simulate an MAR(1)HMM model for observation time M = 24, and a number of individuals n = 210, a = 3 for Markov states number, with the length observation time for each individual selected uniformly between 2 and M. The simulation process supposes we have for the autoregressive means μ ¼ ðμ ð1Þ , μ ð2Þ , μ ð3Þ Þ ¼ ð12; 24; 36Þ, since markers such as CA15-3 increase as the disease advances toward metastatic breast cancer. In addition, CA15-3 increase rapidly between successive observations, and thus, we take in the simulation the parameters β ¼ ðβ ð1Þ , β ð2Þ , β ð3Þ Þ ¼ ð0:2; 0:4; 0:8Þ.
The algorithm of simulation works as follows: 1. For each individual i ¼ 1;…, n, choose m i the length of observation for that individual i.
3. Generate the observations y i,t for all individuals using our model 1.
We choose a prior σ 2 $ IGð0:001; 0:001Þ, a Dð1;…; 1Þ prior for each row of Π, and Gaussian noninformative priors for the μs and the βs. Having the hidden states and the observations, we ran our algorithm for 8000 MCMC iterations. MCMC algorithm convergence was assessed by analyzing MCMC iterations mixing plots that are shown in Figure 1, autocorrelation sample graphs checking as illustrated in Figure 2, and inspecting histograms of posterior densities for the parameters of the models in Figure 3. All parameters show good mixing of chains, autocorrelations that decay immediately after a few lags, and perfect posterior densities fitting. Also the Gelman [23] potential scale reduction factor (PSRF) was plot. The PSRF is measured for more than two MCMC chains (three chains in this works are considered), and it is measured for each parameter of the model; it should show how the chains have forgotten their initial values and that the output from all chains is indistinguishable. It is based on a comparison of within-chain and between-chains variances and is similar to a classical analysis of variance; when the PSRF is high (perhaps greater than 1.1 or 1. 2), then we should run our chains out longer to improve convergence to the stationary distribution.
Each PSRF declines to 1 as the number of iterations approaches infinity to confirm convergence. All the parameters have shown a PSRF less than 1.1 as the number of iteration increases and by the way a good sign of convergence ( Figure 4). Moreover, we should point out that in the family of Markov switching model there is the so-called label switching problem (e.g., Ref. [24]) which arises identifiability problem, and hence, we would not estimate perfectly the parameters. In addition, the posterior densities could show evidence of multimodality. some authors postprocess the output of the MCMC to deal with the issue (e.g., [25]), while other uses a random permutation of the parameters in each iteration of the MCMC algorithm (e.g., [26]) or one can call for an invariant loss function method (e.g., [27]). In our case, no identifiability issue is noticed since we used well-separated prior hyperparameters. Even when we start from different initial values for the parameters, our algorithm converges immediately after a few iterations.
Finally and before giving our results, we should report that the simulation of the Dirichlet posterior was carried out following ( [28, p. 22], [29, p. 155]) who reported that the posterior Dirichlet parameters should be simulated using the beta distribution approach. Table 1 shows how the posterior values estimated from algorithm are very close to the true ones.     . Potential scale reduction factor convergence to less than 1.02 with more iterations.

Conclusion
We have extended the method of Chib [21] for block update estimation of the states to a MAR (1)HMM model. Furthermore, we would like to point out that our model can easily be extended to include missing observations, as we should only add an extra step in each MCMC iteration to estimate the missing observations. Also, we can estimate the autoregressive model for different values of the autoregressive order, p ≥ 1, by evaluating the Bayesian information criterion to select the best order that fits the observations of the model. Our model would capture the complexity and the dynamics of the evolution of breast cancer by introducing the latent states; the probabilities of transition between the latent states allow to compare among the effects of treatments on slowing or accelerating the transition of the disease from one health stage to another the autoregressive parameter mean values corresponding to different stages of the disease would guide medical doctors and scientists to monitor patients in different phases of the disease. The model incorporates individual observations with different lengths.
Last but not least, we like to mention the utilities of switching diffusion processes in addressing and analyzing many complicated applications such as in finance and risk management.
Our future work would be to apply these processes to explore disease progression, because they are characterized by the coexistence of continuous dynamics and discrete events as well as their interactions.