Open access peer-reviewed chapter

Bayesian Estimation of Multivariate Autoregressive Hidden Markov Model with Application to Breast Cancer Biomarker Modeling

Written By

Hamid El Maroufy, El Houcine Hibbah, Abdelmajid Zyad and Taib Ziad

Submitted: 12 December 2016 Reviewed: 08 June 2017 Published: 02 November 2017

DOI: 10.5772/intechopen.70053

From the Edited Volume

Bayesian Inference

Edited by Javier Prieto Tejedor


Abstract

In this work, a first-order autoregressive hidden Markov model (AR(1)HMM) is proposed. It is a suitable model for characterizing a biomarker of breast cancer progression, whether that progression follows from a response to treatment or from the natural course of the disease. The model assumes that we have observations that increase or decrease in relation to a hidden phenomenon. We would like to discover whether those observations can teach us about the progression of the phenomenon and permit us to estimate the transitions between its states (assumed discrete here). The hidden states governed by the Markov process represent the disease stages, and the biomarker measurements are the dependent observations. The parameters of the autoregressive model are selected at the first level according to a Markov process; at the second level, the next observation is generated from a standard first-order autoregressive model (unlike other models, which consider the successive observations to be independent). A Markov chain Monte Carlo (MCMC) method is used for parameter estimation, in which we develop the posterior density of each parameter and use a joint estimation of the hidden states, or block update of the states.

Keywords

  • autoregressive hidden Markov model
  • breast cancer progression marker
  • Gibbs sampler
  • hidden states joint estimation
  • Markov Chain Monte Carlo

1. Introduction

The main motivation behind this work is to characterize progression in breast cancer. Disease progression cannot be assessed correctly without biomarkers that effectively monitor the evolution of a patient's health state, and breast cancer is no exception. The major challenge for researchers and clinicians is to determine the stage of the disease, so as to tailor treatment to each patient and to monitor each patient's response to treatment.

Studies have shown that there is a correlation between the levels of certain markers, such as cancer antigen CA15-3, carcinoembryonic antigen (CEA), and serum HER2/neu, and the stage of the disease [1]. This offers an opportunity to use a hidden Markov model (HMM) to predict the stage of the disease based on biomarker data and to assess the effectiveness of treatments through their influence on the transition of the cancer from one state to another. An HMM has two constituents: the hidden Markov process, suitable for representing the breast cancer stage, and the observation process, given by the biomarker data. In this way, we can learn about the disease transition rates and how the disease progresses, for example, from primary breast cancer to an advanced cancer stage.

Indeed, HMMs are useful tools for tackling numerous concrete problems in many fields; possible applications include speech processing [2], biology [3], disease progression [4], economics [5, 6], and gene expression [7]. For a complete review of HMMs, the reader is referred to Zucchini and MacDonald [8], in which properties and definitions of HMMs are presented clearly, covering classical estimation by the maximum likelihood method and the expectation-maximization (EM) algorithm as well as Bayesian inference.

The model we consider here is a variation of the regular hidden Markov model, since we use extensions to incorporate dependence among successive observations, namely autoregressive dependence among continuous observations. Consequently, we relax the conditional independence assumption of a standard HMM, both because we would like to add some dynamics to the patient's disease progression and because, in reality, the current biomarker observation of a patient depends on the previous one. In fact, the autoregressive assumption in HMMs has shown its advantage over regular HMMs, which cannot capture the strong dependence between successive observations (e.g., Ref. [9]). A model similar to ours can be found in Ref. [10]. This kind of model, first proposed in Ref. [11] to describe econometric time series, is a generalization of both HMMs and autoregressive models; it is effective in representing multiple heterogeneous dynamics, such as disease progression dynamics, and can even be generalized to regime-switching ARMA models as in Ref. [12].

Moreover, our model can also be viewed as an extension of the multivariate double-chain Markov model (DCMM) developed in Ref. [13], in which there are two discrete first-order Markov chains: the first Markov chain is observed and the second one is hidden. In contrast to the DCMM, our multivariate first-order autoregressive hidden Markov model (MAR(1)HMM) allows continuous observations, where each observation, conditional on the hidden process, depends on the previous observation according to a first-order autoregressive process. This dynamic is promising for continuously observed disease biomarkers.

Parameter estimation is very challenging for models of the HMM family, since the likelihood is usually not available in closed form. Thus, we call for a Markov chain Monte Carlo (MCMC) procedure instead of a maximum likelihood-based approach. This choice arises from the fact that the Bayesian analysis uses prior knowledge about the process being measured, allows direct probability statements, and provides an approximation of the posterior distributions of the parameters. In the maximum likelihood approach, by contrast, we cannot declare priors or obtain exact distributions for the parameters when the likelihood is intractable or when we have missing data (e.g., Refs. [14–16]).

Since the realization of an HMM includes two separate entities, the parameters and the hidden states, the Bayesian computation is carried out after augmenting the likelihood with the missing hidden states [17]. The hidden states are sampled using a Gibbs sampler that adopts a joint estimation of the hidden states, or block update of the states (instead of a single update of each state separately), by means of a forward filtering/backward smoothing algorithm. Given the hidden states, we can sample the autoregressive parameters and the transition probabilities of the Markov chain with a Gibbs sampler from their posterior densities, after specifying conjugate priors for the parameters. Hence, the MCMC algorithm alternates between simulating the hidden states and the parameters. Finally, we can obtain posterior statistics such as means, standard deviations, and confidence intervals after assessing the convergence of the MCMC algorithm.

This chapter is organized as follows: after a preliminary on HMMs, a description of the model is given in Section 3. In Section 4, we give the Bayesian estimation of the parameters and the hidden states and provide the details of the MCMC algorithm, before presenting the results of a simulation study in Section 5. We finish with a conclusion.


2. Preliminary

Since the model we suggest is of the HMM type, we describe HMMs in more detail. An HMM is a stochastic process $\{X_t, Y_t\}_{t=0}^{T}$, where $\{X_t\}_{t=0}^{T}$ is a hidden (unobservable) Markov chain and $\{Y_t\}_{t=0}^{T}$ is a sequence of observable random variables, conditionally independent given the hidden states, such that $Y_t$ depends only on $X_t$ for $t = 0, 1, \ldots, T$. Here the process $\{X_t\}_{t=0}^{T}$ evolves independently of $\{Y_t\}_{t=0}^{T}$ and is supposed to be a homogeneous finite Markov chain with transition probability matrix $\Pi$ of dimension $a \times a$, where $a$ denotes the number of hidden states, and $\Pi_0 = (\Pi_{01}, \ldots, \Pi_{0a})$ is the initial state distribution.

We denote the probability density function of $Y_t = y_t$ given $X_t = k$, for $k \in \{1, \ldots, a\}$, by $P_{x_t}(y_t \mid \theta_k)$, where $\theta_k$ refers to the parameters of $P$ when $X_t = k$. We suppose further that the processes $Y_t \mid X_t$ and $Y_{t'} \mid X_{t'}$ are independent for $t \neq t'$. Let $\Theta = (\theta_1, \ldots, \theta_a)$ and $\theta = (\Pi_0, \Pi, \Theta)$. The HMM can then be described as follows. First, the joint density of the observations and the hidden states factorizes as
$$P(y_0, \ldots, y_T, x_0, \ldots, x_T \mid \theta) = P(y_0, \ldots, y_T \mid x_0, \ldots, x_T, \theta)\, P(x_0, \ldots, x_T \mid \theta).$$
Since $\{X_t\}_{t=0}^{T}$ is a Markov chain,
$$P(x_0, \ldots, x_T \mid \theta) = \Pi_0(x_0) \prod_{t=1}^{T} \Pi(x_t \mid x_{t-1}).$$
Under the conditional independence of the observations given the hidden states,
$$P(y_0, \ldots, y_T \mid x_0, \ldots, x_T, \theta) = P_{x_0}(y_0 \mid \theta_{x_0}) \prod_{t=1}^{T} P_{x_t}(y_t \mid \theta_{x_t}).$$
Consequently, the likelihood of the hidden states and the observations is given by
$$P(y_0, \ldots, y_T, x_0, \ldots, x_T \mid \theta) = \Pi_0(x_0)\, P_{x_0}(y_0 \mid \theta_{x_0}) \prod_{t=1}^{T} \Pi(x_t \mid x_{t-1})\, P_{x_t}(y_t \mid \theta_{x_t}).$$
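To make this factorization concrete, the following minimal Python sketch (our illustration, not code from the chapter) evaluates the complete-data log-likelihood of an HMM with Gaussian emissions; the names `pi0`, `trans`, `means`, and `sds` are ours.

```python
import numpy as np
from scipy.stats import norm

def hmm_complete_loglik(y, x, pi0, trans, means, sds):
    """log P(y_0..y_T, x_0..x_T | theta) for an HMM with Gaussian emissions.

    y          : (T+1,) observations
    x          : (T+1,) hidden states coded 0..a-1
    pi0        : (a,) initial state distribution Pi_0
    trans      : (a, a) transition matrix, trans[g, h] = P(X_t = h | X_{t-1} = g)
    means, sds : (a,) emission parameters theta_k
    """
    ll = np.log(pi0[x[0]]) + norm.logpdf(y[0], means[x[0]], sds[x[0]])
    for t in range(1, len(y)):
        ll += np.log(trans[x[t - 1], x[t]])              # Markov term Pi(x_t | x_{t-1})
        ll += norm.logpdf(y[t], means[x[t]], sds[x[t]])  # emission term P_{x_t}(y_t | theta_{x_t})
    return ll
```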


3. Model description and specification

The MAR(1)HMM model we consider in this work is a hidden Markov model in which, conditionally on the latent states, the observations are not independent, as they would be in a regular hidden Markov model. Instead, the current observation is allowed to depend on the previous observation according to a first-order autoregressive model. As in an HMM, the latent states evolve according to a discrete, first-order, time-homogeneous Markov model. We consider data on $n$ continuous random variables observed over time, with potentially different lengths; i.e., for each individual $i = 1, 2, \ldots, n$, we observe a vector $y_{i,\cdot} = (y_{i,u_i}, \ldots, y_{i,m_i})^{T}$, with $u_i < m_i$.

Define $u_0 = \min_{1 \le i \le n}\{u_i\}$ and $M = \max_{1 \le i \le n}\{m_i\}$, and note that the times $u_i$ and $m_i$ may vary over the entire observation period from $u_0$ to $M$, with the restriction that $m_i - u_i \ge 1$ for $i = 1, 2, \ldots, n$.

We assume, for $i = 1, 2, \ldots, n$ and integer times $t = u_i, \ldots, m_i$, that the random variable $Y_{i,t}$, taking nonnegative values, depends only on the state $X_t$ and the previous observation $Y_{i,t-1}$. Based on the model developed by Farcomeni and Arima [10], we get the following model:

$$Y_{i,t} \mid X_t = x_t \;=\; \beta^{(x_t)} Y_{i,t-1} + \mu^{(x_t)} + \varepsilon_{i,t}. \qquad (1)$$

The choice of the autoregressive part of the model is motivated by the fact that, for many diseases, successive biomarker observations are usually correlated, unlike the hypothesis of independence between observations in HMMs.

We interpret $x$ as the vector of hidden health states of the patients; in the case of breast cancer, those states would be, for example, localized or advanced metastatic breast cancer, while $y$ is the vector of the biomarkers observed and measured for the patients. The $\varepsilon_{i,t}$ are normal variables with mean 0 and variance $\sigma^2$ such that $\varepsilon_{i,t}$ and $\varepsilon_{i',t'}$ are uncorrelated for $(i, t) \neq (i', t')$.

The parameters $\beta^{(x_t)}$ and $\mu^{(x_t)}$ take values in $\mathbb{R}$ for each hidden state, and $\sigma^2 \in \mathbb{R}^{+}$.

Similar to Ref. [13], the transition matrix $\Pi$ of the Markov chain is time homogeneous with dimension $a \times a$, where $a$ is the number of hidden states, and $\Pi = (\Pi_{gh},\, g = 1, \ldots, a;\, h = 1, \ldots, a)$ with $\Pi_{gh} = P(X_t = h \mid X_{t-1} = g)$ for $g, h = 1, 2, \ldots, a$ and $t = u_0 + 1, \ldots, M$. We let the first state $X_{u_0}$ be drawn from a discrete distribution with probability vector $r = (r_1, \ldots, r_a)$. We also record, for each individual, the time of initial observation $u_i$, the initial observation $y_{i,u_i}$, and the number of consecutive time points observed, $m_i - u_i + 1$. Let $\mu = (\mu^{(1)}, \ldots, \mu^{(a)})$, $\beta = (\beta^{(1)}, \ldots, \beta^{(a)})$, and $\theta = (\mu, \beta, \sigma^2, r, \Pi)$ be the set of all parameters of the model. We suppose that the individuals, i.e., the $Y_{i,t}$, behave independently conditionally on $X$. Therefore, for $i = 1, \ldots, n$,
$$P(y_{i,\cdot} \mid y_{i,u_i}, x, \theta) = \prod_{t = u_i + 1}^{m_i} P(y_{i,t} \mid y_{i,t-1}, x_t, \Theta)$$
and
$$P(x \mid \theta) = P(x_{u_0}) \prod_{t = u_0 + 1}^{M} P(x_t \mid x_{t-1}, \Pi),$$
where $P(x_t \mid x_{t-1}, \Pi) = P(X_t = x_t \mid X_{t-1} = x_{t-1}, \Pi) = \Pi_{x_{t-1}, x_t}$. Then, the likelihood of the observations of all individuals $y = (y_1, \ldots, y_n)$, given the vector of first observations $y_0 = (y_{1,u_1}, \ldots, y_{n,u_n})$, $x$, and $\theta$, is
$$P(y \mid y_0, x, \theta) = \prod_{i=1}^{n} P(y_{i,\cdot} \mid y_{i,u_i}, x, \theta).$$

This equality is due to the conditional independence of the $y_{i,\cdot}$ given $x$ and $\theta$. The joint density of each $y_{i,\cdot}$ and $x$ given $y_{i,u_i}$ and $\theta$ can be written as
$$P(y_{i,\cdot}, x \mid y_{i,u_i}, \theta) = P(y_{i,\cdot} \mid y_{i,u_i}, x, \theta) \times P(x \mid y_{i,u_i}, \theta).$$
Using the Markov property of the hidden process, we have after simplification
$$P(x \mid y_{i,u_i}, \theta) \propto P(y_{i,u_i} \mid x, \theta)\, P(x \mid \theta) = P(y_{i,u_i} \mid x_{u_i}, \theta)\, r_{x_{u_0}}\, \Pi_{x_{u_0}, x_{u_0+1}} \times \cdots \times \Pi_{x_{M-1}, x_M}.$$
In addition, $P(y_{i,\cdot} \mid y_{i,u_i}, x, \theta) = \prod_{t=u_i+1}^{m_i} P(y_{i,t} \mid y_{i,t-1}, x, \theta)$, and consequently,

$$P(y_{i,\cdot}, x \mid y_{i,u_i}, \theta) \propto r_{x_{u_0}}\, P(y_{i,u_i} \mid x_{u_i}, \theta) \prod_{t=u_0+1}^{M} \Pi_{x_{t-1}, x_t} \prod_{t=u_i+1}^{m_i} P(y_{i,t} \mid y_{i,t-1}, x, \theta).$$
Finally, under the hypothesis of a normal error distribution in the autoregressive model (Eq. (1)) and the Chapman-Kolmogorov property, the joint distribution of $y_{i,\cdot}$ and $x$ given $y_{i,u_i}$ and $\theta$ can be simplified to:

$$P(y_{i,\cdot}, x \mid y_{i,u_i}, \theta) \propto P(y_{i,u_i} \mid x_{u_i}, \theta) \prod_{h=1}^{a} r_h^{\chi_{\{x_{u_0}\}}(h)} \prod_{t=u_0+1}^{M} \prod_{g=1}^{a} \prod_{h=1}^{a} \Pi_{g,h}^{\chi_{\{x_{t-1}, x_t\}}(g,h)} \times \prod_{t=u_i+1}^{m_i} \prod_{h=1}^{a} \left[ \frac{1}{\sigma}\, \phi\!\left( \frac{y_{i,t} - \mu^{(h)} - \beta^{(h)} y_{i,t-1}}{\sigma} \right) \right]^{\chi_{\{x_t\}}(h)},$$

where $\phi$ denotes the density of the standard normal distribution $N(0,1)$ and $\chi_{A}(x)$ is the usual indicator function of a set $A$. Finally, the joint distribution of $y$ and $x$ has the following form:

$$P(y, x \mid y_0, \theta) \propto \prod_{h=1}^{a} r_h^{\chi_{\{x_{u_0}\}}(h)} \prod_{t=u_0+1}^{M} \prod_{g=1}^{a} \prod_{h=1}^{a} \Pi_{g,h}^{\chi_{\{x_{t-1}, x_t\}}(g,h)} \times \prod_{i=1}^{n} \prod_{l=1}^{a} \left[ \frac{1}{\sigma}\, \phi\!\left( \frac{y_{i,u_i} - \mu^{(l)}}{\sigma} \right) \right]^{\chi_{\{x_{u_i}\}}(l)} \times \prod_{i=1}^{n} \prod_{t=u_i+1}^{m_i} \prod_{h=1}^{a} \left[ \frac{1}{\sigma}\, \phi\!\left( \frac{y_{i,t} - \mu^{(h)} - \beta^{(h)} y_{i,t-1}}{\sigma} \right) \right]^{\chi_{\{x_t\}}(h)}. \qquad (2)$$

4. Bayesian estimation of the model parameters

We will use a Bayesian approach to estimate the model parameters. Inference in the Bayesian framework is obtained through the posterior density, which is proportional to the prior multiplied by the likelihood. The posterior distribution for our model, as in most cases, cannot be derived analytically, and we approximate it through MCMC methods specifically designed to work with the likelihood augmented by the hidden states. MCMC methods start by specifying the prior density $P(\theta)$ for the parameters. Since the data $y$ are available, the sampler then works recursively by alternating between simulating from the full conditional distribution of $x$ given $y$ and $\theta$, and that of $\theta$ given $x$ and $y$.

4.1. Prior distributions

Under the assumption of independence between the components of $\theta = (\mu, \beta, \sigma^2, r, \Pi)$, the prior density can be written as $P(\theta) = P(r)\, P(\Pi)\, P(\mu)\, P(\beta)\, P(\sigma^2)$. Since $r$ is the parameter vector of a multinomial distribution, the natural choice of prior is a Dirichlet distribution, $r \sim \mathcal{D}(\alpha_{01}, \ldots, \alpha_{0a})$. Likewise, since $\sum_{j=1}^{a} \Pi_{ij} = 1$, we assume $\Pi_i \sim \mathcal{D}(\delta_{i1}, \ldots, \delta_{ia})$ for each row $i$ of the transition matrix. The Dirichlet prior can even be the default $\mathcal{D}(1, \ldots, 1)$, as recently discussed in Ref. [18]. In fact, a Dirichlet prior is justified because the posterior density of each row of the transition matrix is then proportional to the density of a Dirichlet distribution; hence, choosing a Dirichlet prior yields a Dirichlet posterior. This can be justified as follows for a given set of parameters $\lambda = (\lambda_1, \ldots, \lambda_a)$ of a discrete or multinomial density:

$$\pi(x_1, \ldots, x_a \mid \lambda_1, \ldots, \lambda_a) = \frac{n!}{x_1! \cdots x_a!}\, \lambda_1^{x_1} \cdots \lambda_a^{x_a}$$
for nonnegative integers $x_1, \ldots, x_a$ with $\sum_{i=1}^{a} x_i = n$. Using the gamma function $\Gamma$, this probability mass function can be expressed as
$$\pi(x_1, \ldots, x_a \mid \lambda_1, \ldots, \lambda_a) = \frac{\Gamma\!\left(\sum_{i=1}^{a} x_i + 1\right)}{\prod_{i=1}^{a} \Gamma(x_i + 1)} \prod_{i=1}^{a} \lambda_i^{x_i},$$
which shows its resemblance to the Dirichlet distribution. Starting from the prior $\lambda \sim \mathcal{D}(\alpha_1, \ldots, \alpha_a)$, the posterior is
$$P(\lambda \mid x) \propto P(\lambda)\, P(x \mid \lambda) \propto \prod_{i} \lambda_i^{\alpha_i - 1} \prod_{i} \lambda_i^{x_i} = \prod_{i} \lambda_i^{x_i + \alpha_i - 1} \;\sim\; \mathcal{D}(x_1 + \alpha_1, \ldots, x_a + \alpha_a).$$
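As a small numerical illustration of this conjugacy (the counts and hyperparameters below are hypothetical, chosen only for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([1.0, 1.0, 1.0])     # Dirichlet prior D(1, 1, 1)
counts = np.array([12, 5, 3])         # observed multinomial counts x_1, ..., x_a

# Conjugate update: posterior is Dirichlet with parameters alpha_i + x_i
draws = rng.dirichlet(alpha + counts, size=5000)
print(draws.mean(axis=0))             # approximately (13, 6, 4) / 23
```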

Furthermore, concerning the priors for the parameters of the autoregressive model, we suppose, for $h = 1, \ldots, a$, that $\mu^{(h)} \sim N(\alpha_h, \tau_h)$ and $\beta^{(h)} \sim N(b_h, c_h)$, and we take an inverse gamma (IG) prior for the variance, $\sigma^2 \sim IG(\epsilon, \zeta)$. Here $\alpha_h, \tau_h, b_h, c_h, \epsilon, \zeta$ are hyperparameters to be specified. For more details on Bayesian inference and prior selection in HMMs, the reader is referred to Ref. [19]. In our case, the prior distributions for the autoregressive parameters are those proposed in Ref. [20] for a mixture autoregressive model, where they are described as conventional prior choices for mixture models.

4.2. Sampling the posterior distribution for the hidden states

Chib [21] developed a method for simulating the hidden states from their full joint distribution in the univariate hidden Markov model case. We will describe his full Bayesian algorithm for the univariate hidden Markov model before generalizing it to our MAR(1)HMM.

4.2.1. Chib’s algorithm for the univariate hidden Markov model for estimation of the states

Suppose we have an observed process $Y_n = (y_1, \ldots, y_n)$ and hidden states $X_n = (x_1, \ldots, x_n)$, with $\theta$ the parameters of the model. For simplicity, we write $X_t = (x_1, \ldots, x_t)$ for the history of the states up to time $t$ and $X^{t+1} = (x_{t+1}, \ldots, x_n)$ for the future from $t+1$ to $n$. We use the same notation for $Y_t$ and $Y^{t+1}$.

For each state $x_t \in \{1, 2, \ldots, a\}$, $t = 1, 2, \ldots, n$, the hidden model can be described by a conditional density given the hidden states, $\pi(y_t \mid Y_{t-1}, x_t = k) = \pi(y_t \mid Y_{t-1}, \theta_k)$, $k = 1, \ldots, a$, with $x_t$ depending only on $x_{t-1}$ and having transition matrix $\Pi$ and initial distribution $\Pi_0$; the parameters of $\pi(\cdot)$ are $\theta = (\theta_1, \ldots, \theta_a)$.

Chib [21] shows that it is preferable to simulate the full latent sequence $X_n = (x_1, \ldots, x_n)$ from the joint distribution of $x_1, \ldots, x_n \mid Y_n, \theta$ in order to improve the convergence properties of the MCMC algorithm, because instead of the $n$ additional blocks needed when each state is simulated separately, only one additional block is required. First, we write the joint conditional density as

$$P(X_n \mid Y_n, \theta, \Pi) = P(x_n \mid Y_n, \theta)\, P(x_{n-1} \mid Y_n, x_n, \theta, \Pi) \times \cdots \times P(x_1 \mid Y_n, X^{2}, \theta, \Pi). \qquad (4)$$

For sampling, it is sufficient to consider the sampling of $x_t$ from $P(x_t \mid Y_n, X^{t+1}, \theta, \Pi)$. Moreover,
$$P(x_t \mid Y_n, X^{t+1}, \theta, \Pi) \propto P(x_t \mid Y_t, \theta, \Pi)\, P(x_{t+1} \mid x_t, \Pi).$$
This expression has two ingredients: the first is $P(x_{t+1} \mid x_t, \Pi)$, which is the transition probability of the Markov chain. The second is $P(x_t \mid Y_t, \theta, \Pi)$, which is obtained recursively starting at $t = 1$.

The mass function $P(x_{t-1} \mid Y_{t-1}, \theta, \Pi)$ is transformed into $P(x_t \mid Y_t, \theta, \Pi)$, which is in turn transformed into $P(x_{t+1} \mid Y_{t+1}, \theta, \Pi)$, and so on. The update is as follows: for $k = 1, \ldots, a$, we can write

$$P(x_t = k \mid Y_t, \theta, \Pi) = \frac{P(x_t = k \mid Y_{t-1}, \theta, \Pi)\, \pi(y_t \mid y_{t-1}, \theta_k)}{\sum_{l=1}^{a} P(x_t = l \mid Y_{t-1}, \theta, \Pi)\, \pi(y_t \mid y_{t-1}, \theta_l)}. \qquad (5)$$

These calculations are initialized at $t = 0$ by setting $P(x_1 \mid Y_0, \theta)$ to be the stationary distribution of the Markov chain. Precisely, the simulation proceeds recursively for $k = 1, \ldots, a$: we first set $P(x_1 = k \mid Y_0, \theta)$ from the initial distribution $\Pi_0(k)$ and compute $P(x_1 = k \mid Y_1, \theta, \Pi) \propto P(x_1 = k \mid Y_0, \theta, \Pi)\, \pi(y_1 \mid Y_0, \theta_k)$. Then, by forward calculation, for each $t = 2, \ldots, n$,
$$P(x_t = k \mid Y_{t-1}, \theta) = \sum_{l=1}^{a} \Pi_{lk}\, P(x_{t-1} = l \mid Y_{t-1}, \theta),$$
where $\Pi_{lk}$ is the transition probability, and
$$P(x_t = k \mid Y_t, \theta) \propto P(x_t = k \mid Y_{t-1}, \theta, \Pi)\, \pi(y_t \mid Y_{t-1}, \theta_k).$$
The last term of the forward computation, $P(x_n = k \mid Y_n, \theta)$, serves as the start of the backward pass, in which we sample recursively, for each $t = n-1, \ldots, 1$, from
$$P(x_t = k \mid Y_n, X^{t+1}, \theta) \propto P(x_t = k \mid Y_t, \theta, \Pi)\, P(x_{t+1} \mid x_t = k, \Pi),$$
which permits the obtention of $X_n = (x_1, \ldots, x_n)$.
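To make the forward filtering/backward sampling recursion concrete, here is a minimal Python sketch for a univariate HMM with Gaussian emissions. It is an illustration of the scheme just described rather than the authors' code, and the function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm

def ffbs(y, trans, pi0, means, sds, rng):
    """Draw the whole state sequence x_1..x_n jointly, given y and theta."""
    n, a = len(y), len(pi0)
    filt = np.zeros((n, a))                   # filtered probabilities P(x_t = k | Y_t, theta)

    # Forward filtering
    pred = pi0                                # P(x_1 = k | Y_0, theta): initial distribution
    for t in range(n):
        lik = norm.pdf(y[t], means, sds)      # pi(y_t | theta_k) for k = 1..a
        filt[t] = pred * lik
        filt[t] /= filt[t].sum()
        pred = filt[t] @ trans                # one-step prediction P(x_{t+1} = k | Y_t, theta)

    # Backward sampling
    x = np.empty(n, dtype=int)
    x[-1] = rng.choice(a, p=filt[-1])
    for t in range(n - 2, -1, -1):
        probs = filt[t] * trans[:, x[t + 1]]  # filtered prob times transition into x_{t+1}
        x[t] = rng.choice(a, p=probs / probs.sum())
    return x
```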

4.2.2. Simulating the hidden states for the MAR(1)HMM

Returning to our model, and adopting the notation and algorithm developed by Fitzpatrick and Marchev [13], let $f$ denote the observation density of the MAR(1)HMM, and, for $t = u_0, \ldots, M$, define $x_{u_0:t} = (x_{u_0}, \ldots, x_t)$, $x_{t:M} = (x_t, \ldots, x_M)$, $y^{(t)} = (y_{i,t},\, i = 1, 2, \ldots, n)$, $y_{\le t} = \bigcup_{i:\, u_i \le t} \{y_{i,u_i}, \ldots, y_{i, \min\{t, m_i\}}\}$, and $y_{> t} = \bigcup_{i:\, t < m_i} \{y_{i, \max\{t+1, u_i\}}, \ldots, y_{i,m_i}\}$. The posterior distribution of the hidden states can be written as
$$P(x_{u_0:M} \mid y_{\le M}, \theta) = P(x_M \mid y_{\le M}, \theta) \times \cdots \times P(x_{u_0} \mid y_{\le M}, x_{u_0+1:M}, \theta),$$
so we can sample the whole sequence of states by sampling from $P(x_t \mid y_{\le M}, x_{t+1:M}, \theta)$. Hence, the simulation of the hidden states proceeds recursively, starting with the initialization

$$P(x_{u_0} \mid y_{\le u_0}, \theta) \propto P(y_{\le u_0} \mid x_{u_0})\, P(x_{u_0} \mid r), \qquad y_{\le u_0} = \{y_{i,u_i} : u_i = u_0,\ i = 1, \ldots, n\},$$
$$P(x_{u_0+1} = k \mid y_{\le u_0}, \theta) = \sum_{l=1}^{a} \Pi_{lk}\, P(x_{u_0} = l \mid y_{\le u_0}, \theta), \qquad k = 1, \ldots, a,$$
$$P(x_{u_0+1} = k \mid y_{\le u_0+1}, \theta) \propto P(x_{u_0+1} = k \mid y_{\le u_0}, \theta)\, f(y^{(u_0+1)} \mid y_{\le u_0}, \theta_k). \qquad (6)$$

We perform a similar calculation for every state at time $t$, and conclude by calculating
$$P(x_M = k \mid y_{\le M-1}, \theta) = \sum_{l=1}^{a} \Pi_{lk}\, P(x_{M-1} = l \mid y_{\le M-1}, \theta)$$
and
$$P(x_M = k \mid y_{\le M}, \theta) \propto P(x_M = k \mid y_{\le M-1}, \theta)\, f(y^{(M)} \mid y_{\le M-1}, \theta_k).$$
The resulting $P(x_M = k \mid y_{\le M}, \theta)$ permits the simulation of $x_M$ from $P(x_M \mid y_{\le M}, \theta)$. Finally, by backward calculation, we simulate from the probabilities
$$P(x_t \mid y_{\le M}, x_{t+1:M}, \theta) \propto P(x_{t+1} \mid x_t, \Pi)\, P(x_t \mid y_{\le t}, \theta)$$
for each time $t = M-1, \ldots, u_0$. These backward probabilities permit the simulation of the latent states.
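In this forward pass, the only model-specific ingredient is the emission term $f(y^{(t)} \mid y_{\le t-1}, \theta_k)$, which multiplies the AR(1) densities of all individuals observed at time $t$ (and the initial-observation densities of individuals entering at time $t$), in line with Eq. (2). A possible implementation under our reading of the model, with hypothetical names and data layout:

```python
import numpy as np
from scipy.stats import norm

def emission_lik(t, k, y, u, m, mu, beta, sigma):
    """f(y^{(t)} | y_{<=t-1}, theta_k): contribution of all individuals at time t
    under hidden state k.  y[i] maps an (absolute) time t to the observation y_{i,t}."""
    lik = 1.0
    for i in range(len(u)):
        if u[i] < t <= m[i]:
            # AR(1) term: y_{i,t} ~ N(mu_k + beta_k * y_{i,t-1}, sigma^2)
            lik *= norm.pdf(y[i][t], mu[k] + beta[k] * y[i][t - 1], sigma)
        elif u[i] == t:
            # individual's first observation: y_{i,u_i} ~ N(mu_k, sigma^2), as in Eq. (2)
            lik *= norm.pdf(y[i][t], mu[k], sigma)
    return lik
```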

4.3. Sampling from P(θ|x, y)

4.3.1. Sampling Π

Under the assumption of a Dirichlet prior for each row of the transition matrix, $P(\Pi_i) \sim \mathcal{D}(\delta_{i1}, \ldots, \delta_{ia})$, and the assumption of independence between those rows, the posterior distribution of $\Pi_i$ can be derived using Eq. (2) as follows. Let $n_{ij}$ denote the number of single-step transitions from state $i$ to state $j$; then

$$P(\Pi_i \mid y, x) \propto P(\Pi_i) \prod_{t=u_0+1}^{M} \prod_{j=1}^{a} \Pi_{ij}^{\chi_{\{x_{t-1}, x_t\}}(i,j)} \propto \prod_{j=1}^{a} \Pi_{ij}^{n_{ij}} \prod_{j=1}^{a} \Pi_{ij}^{\delta_{ij} - 1} = \prod_{j=1}^{a} \Pi_{ij}^{\delta_{ij} + n_{ij} - 1} \;\sim\; \mathcal{D}(\delta_{i1} + n_{i1}, \ldots, \delta_{ia} + n_{ia}). \qquad (7)$$

4.3.2. Sampling the posterior distribution of the initial distribution

Let $n_{0l} = \chi_{\{x_{u_0}\}}(l)$ for $l = 1, \ldots, a$. Using Eq. (2), under the Dirichlet prior $\mathcal{D}(\delta_{01}, \ldots, \delta_{0a})$ for the parameter $r$, we obtain
$$P(r \mid x, y) \propto P(r) \prod_{l=1}^{a} r_l^{\chi_{\{x_{u_0}\}}(l)} \propto \prod_{l=1}^{a} r_l^{\delta_{0l} + n_{0l} - 1} \;\sim\; \mathcal{D}(\delta_{01} + n_{01}, \ldots, \delta_{0a} + n_{0a}).$$
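A compact sketch of these two Dirichlet updates, counting transitions from the current draw of the hidden states (variable names and data layout are ours):

```python
import numpy as np

def sample_pi_and_r(x, a, delta, delta0, rng):
    """Gibbs updates for Pi (Eq. (7)) and r, given the state path x, an array of
    states over t = u_0, ..., M coded 0..a-1."""
    counts = np.zeros((a, a))
    for t in range(1, len(x)):
        counts[x[t - 1], x[t]] += 1           # n_ij: number of transitions i -> j

    trans = np.vstack([rng.dirichlet(delta[i] + counts[i]) for i in range(a)])

    n0 = np.zeros(a)
    n0[x[0]] = 1                              # indicator of the first state x_{u_0}
    r = rng.dirichlet(delta0 + n0)
    return trans, r
```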

4.3.3. Sampling the posterior distributions of the autoregressive parameters $\mu$, $\beta$, $\sigma^2$

When a complete conditional distribution has a known form, such as a normal or beta distribution, we use the Gibbs sampler to draw the corresponding random variable; this is the case for our model. Let us define $n_u(l) = \sum_{i=1}^{n} \chi_{\{x_{u_i} = l\}}$, $n_l = \sum_{i=1}^{n} \sum_{t=u_i+1}^{m_i} \chi_{\{x_t = l\}}$, $N = \sum_{l=1}^{a} n_l$, and $n_{0l} = \chi_{\{x_{u_0}\}}(l)$. Then, for $l = 1, 2, \ldots, a$, taking $N(\alpha_l, \tau_l)$ as prior distribution and using Eq. (2), the conditional posterior distribution of $\mu^{(l)}$ is:

$$P(\mu^{(l)} \mid y, x) \propto P(\mu^{(l)}) \prod_{i=1}^{n} \left[ \frac{1}{\sigma}\phi\!\left( \frac{y_{i,u_i} - \mu^{(l)}}{\sigma} \right) \right]^{\chi_{\{x_{u_i}\}}(l)} \times \prod_{i=1}^{n} \prod_{t=u_i+1}^{m_i} \left[ \frac{1}{\sigma}\phi\!\left( \frac{y_{i,t} - \mu^{(l)} - \beta^{(l)} y_{i,t-1}}{\sigma} \right) \right]^{\chi_{\{x_t\}}(l)}$$
$$\propto \exp\left\{ -\frac{1}{2}\left[ \frac{(\mu^{(l)} - \alpha_l)^2}{\tau_l} + \sum_{i=1,\, x_{u_i}=l}^{n} \left( \frac{y_{i,u_i} - \mu^{(l)}}{\sigma} \right)^2 + \sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} \left( \frac{y_{i,t} - \mu^{(l)} - \beta^{(l)} y_{i,t-1}}{\sigma} \right)^2 \right] \right\}, \qquad (8)$$

then $\mu^{(l)} \mid y, x \sim N(\tilde{\alpha}_l, \tilde{\tau}_l)$ with inverse variance $\tilde{\tau}_l^{-1} = \frac{n_u(l) + n_l}{\sigma^2} + \frac{1}{\tau_l}$ and mean

$$\tilde{\alpha}_l = \tilde{\tau}_l \left( \frac{\sum_{i=1,\, x_{u_i}=l}^{n} y_{i,u_i} + \sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} \left( y_{i,t} - \beta^{(l)} y_{i,t-1} \right)}{\sigma^2} + \frac{\alpha_l}{\tau_l} \right). \qquad (9)$$
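A minimal sketch of this Gibbs step for $\mu^{(l)}$, computing $\tilde{\tau}_l$ and $\tilde{\alpha}_l$ as in Eqs. (8)-(9); the data structures and names are our own illustration.

```python
import numpy as np

def sample_mu_l(l, y, u, m, x, beta, sigma2, alpha_l, tau_l, rng):
    """Draw mu^{(l)} | y, x ~ N(alpha_tilde, tau_tilde).
    y[i][t] is the observation of individual i at time t, x[t] the hidden state at
    time t, and (alpha_l, tau_l) are the prior mean and variance."""
    n_init = sum(1 for i in range(len(u)) if x[u[i]] == l)
    sum_init = sum(y[i][u[i]] for i in range(len(u)) if x[u[i]] == l)

    n_l, sum_ar = 0, 0.0
    for i in range(len(u)):
        for t in range(u[i] + 1, m[i] + 1):
            if x[t] == l:
                n_l += 1
                sum_ar += y[i][t] - beta[l] * y[i][t - 1]

    prec = (n_init + n_l) / sigma2 + 1.0 / tau_l          # tilde{tau}_l^{-1}
    tau_tilde = 1.0 / prec
    alpha_tilde = tau_tilde * ((sum_init + sum_ar) / sigma2 + alpha_l / tau_l)
    return rng.normal(alpha_tilde, np.sqrt(tau_tilde))
```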

For $\beta^{(l)}$, $l = 1, \ldots, a$, and similarly to $\mu^{(l)}$, the prior $N(b_l, c_l)$ is proposed, to obtain:

$$P(\beta^{(l)} \mid y, x) \propto P(\beta^{(l)}) \prod_{i=1}^{n} \prod_{t=u_i+1}^{m_i} \left[ \frac{1}{\sigma}\phi\!\left( \frac{y_{i,t} - \mu^{(l)} - \beta^{(l)} y_{i,t-1}}{\sigma} \right) \right]^{\chi_{\{x_t\}}(l)}, \qquad (10)$$

and therefore $\beta^{(l)} \mid y, x \sim N(\tilde{b}_l, \tilde{c}_l)$ with inverse variance $\tilde{c}_l^{-1} = \frac{1}{c_l} + \frac{\sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} y_{i,t-1}^2}{\sigma^2}$ and mean

$$\tilde{b}_l = \tilde{c}_l \left( \frac{b_l}{c_l} + \frac{\sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} \left( y_{i,t} - \mu^{(l)} \right) y_{i,t-1}}{\sigma^2} \right). \qquad (11)$$
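The analogous sketch for $\beta^{(l)}$, following Eqs. (10)-(11) (again with hypothetical names and data layout):

```python
import numpy as np

def sample_beta_l(l, y, u, m, x, mu, sigma2, b_l, c_l, rng):
    """Draw beta^{(l)} | y, x ~ N(b_tilde, c_tilde); (b_l, c_l) are the prior
    mean and variance."""
    sum_sq, sum_cross = 0.0, 0.0
    for i in range(len(u)):
        for t in range(u[i] + 1, m[i] + 1):
            if x[t] == l:
                sum_sq += y[i][t - 1] ** 2
                sum_cross += (y[i][t] - mu[l]) * y[i][t - 1]

    prec = 1.0 / c_l + sum_sq / sigma2                    # tilde{c}_l^{-1}
    c_tilde = 1.0 / prec
    b_tilde = c_tilde * (b_l / c_l + sum_cross / sigma2)
    return rng.normal(b_tilde, np.sqrt(c_tilde))
```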

For the posterior distribution of $\sigma^2$, taking $IG(\epsilon, \zeta)$ as prior, we deduce from Eq. (2)

$$P(\sigma^2 \mid y, x) \propto (\sigma^2)^{-(\epsilon+1)} \exp\!\left( -\frac{\zeta}{\sigma^2} \right) \prod_{i=1}^{n} \left[ \frac{1}{\sigma}\phi\!\left( \frac{y_{i,u_i} - \mu^{(x_{u_i})}}{\sigma} \right) \right] \times \prod_{i=1}^{n} \prod_{t=u_i+1}^{m_i} \left[ \frac{1}{\sigma}\phi\!\left( \frac{y_{i,t} - \mu^{(x_t)} - \beta^{(x_t)} y_{i,t-1}}{\sigma} \right) \right], \qquad (12)$$

and consequently $\sigma^2 \mid y, x \sim IG(\tilde{\epsilon}, \tilde{\zeta})$ with parameters $\tilde{\epsilon} = \frac{n_u + N}{2} + \epsilon$, where $n_u = \sum_{l=1}^{a} n_u(l)$ is the total number of initial observations, and

$$\tilde{\zeta} = \frac{\sum_{i=1}^{n} \left( y_{i,u_i} - \mu^{(x_{u_i})} \right)^2 + \sum_{i=1}^{n} \sum_{t=u_i+1}^{m_i} \left( y_{i,t} - \mu^{(x_t)} - \beta^{(x_t)} y_{i,t-1} \right)^2}{2} + \zeta. \qquad (13)$$
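A sketch of the corresponding $\sigma^2$ update via Eqs. (12)-(13), using the fact that if $G \sim \mathrm{Gamma}(\tilde{\epsilon}, \text{rate} = \tilde{\zeta})$ then $1/G \sim IG(\tilde{\epsilon}, \tilde{\zeta})$ (names are ours):

```python
import numpy as np

def sample_sigma2(y, u, m, x, mu, beta, eps, zeta, rng):
    """Draw sigma^2 | y, x ~ IG(eps_tilde, zeta_tilde)."""
    n = len(u)
    rss = 0.0
    for i in range(n):
        rss += (y[i][u[i]] - mu[x[u[i]]]) ** 2            # initial observations
        for t in range(u[i] + 1, m[i] + 1):
            rss += (y[i][t] - mu[x[t]] - beta[x[t]] * y[i][t - 1]) ** 2
    n_ar = sum(m[i] - u[i] for i in range(n))             # N: total number of AR transitions
    eps_tilde = (n + n_ar) / 2.0 + eps                    # n_u = n initial observations
    zeta_tilde = rss / 2.0 + zeta
    return 1.0 / rng.gamma(eps_tilde, 1.0 / zeta_tilde)   # inverse-gamma draw
```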

Finally, the algorithm is run for $d = 1, \ldots, D$ iterations, alternating between the following steps, where in each step we sample from the conditional posterior of the corresponding block of parameters (a compact code skeleton of the full sampler is given after the list):

The MCMC algorithm:

  1. For $h = 1, 2, \ldots, a$, give reference values to the hyperparameters $\alpha_h$, $\tau_h$, $b_h$, $c_h$, $\epsilon$, $\zeta$, $\delta_{0h}$, and $\delta_{ih}$ for $i = 1, 2, \ldots, a$.

  2. Initialization (step $d = 1$ of the MCMC iterations): initialize $\Pi^{(1)}$, $r^{(1)}$, $\mu^{(1)}$, $\beta^{(1)}$, and $\sigma^{2(1)}$.

  3. Simulation of the hidden states:

    1. Initialization of the forward pass: $P(x_{u_0}^{(d)} \mid y_{\le u_0}, \theta) \propto P(y_{\le u_0} \mid x_{u_0}^{(d)})\, P(x_{u_0}^{(d)} \mid r^{(d)})$, with $y_{\le u_0} = \{y_{i,u_i} : u_i = u_0,\ i = 1, \ldots, n\}$.

    2. Forward pass: for $k = 1, \ldots, a$ and $t = u_0 + 1, \ldots, M$: $P(x_t^{(d)} = k \mid y_{\le t-1}, \theta) = \sum_{l=1}^{a} \Pi_{lk}^{(d)}\, P(x_{t-1}^{(d)} = l \mid y_{\le t-1}, \theta)$ and

      $$P(x_t^{(d)} = k \mid y_{\le t}, \theta) = \frac{P(x_t^{(d)} = k \mid y_{\le t-1}, \theta)\, f(y^{(t)} \mid y_{\le t-1}, \theta_k)}{\sum_{l=1}^{a} P(x_t^{(d)} = l \mid y_{\le t-1}, \theta)\, f(y^{(t)} \mid y_{\le t-1}, \theta_l)}.$$

    3. Initialization of the backward pass: for $k = 1, \ldots, a$, given $P(x_M^{(d)} = k \mid y_{\le M}, \theta)$ from the forward pass, draw $x_M^{(d)}$ from $P(x_M^{(d)} \mid y_{\le M}, \theta)$.

    4. Backward pass: for $k = 1, \ldots, a$ and $t = M - 1, \ldots, u_0$:

      $$P(x_t^{(d)} \mid y_{\le M}, x_{t+1}^{(d)}, \theta) \propto P(x_{t+1}^{(d)} \mid x_t^{(d)}, \Pi)\, P(x_t^{(d)} \mid y_{\le t}, \theta).$$

  4. Estimation of the initial distribution and the transition distribution:

    1. For $l = 1, \ldots, a$ and $k = 1, \ldots, a$, calculate $n_{0l} = \chi_{\{x_{u_0}^{(d)}\}}(l)$ and $n_{kl} = \sum_{t=u_0+1}^{M} \chi_{\{x_{t-1}^{(d)}, x_t^{(d)}\}}(k, l)$.

    2. Sample $(r_1^{(d+1)}, \ldots, r_a^{(d+1)}) \sim \mathcal{D}(\delta_{01} + n_{01}, \ldots, \delta_{0a} + n_{0a})$.

    3. For $i = 1, \ldots, a$, sample $(\Pi_{i1}^{(d+1)}, \ldots, \Pi_{ia}^{(d+1)}) \sim \mathcal{D}(\delta_{i1} + n_{i1}, \ldots, \delta_{ia} + n_{ia})$.

  5. Simulation of $\mu$: for $l = 1, \ldots, a$,

    1. $\tilde{\tau}_l^{-1} = \frac{n_u(l) + n_l}{\sigma^{2(d)}} + \frac{1}{\tau_l}$.

    2. $\tilde{\alpha}_l = \tilde{\tau}_l \left( \frac{\sum_{i=1,\, x_{u_i}=l}^{n} y_{i,u_i} + \sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} \left( y_{i,t} - \beta^{(l)(d)} y_{i,t-1} \right)}{\sigma^{2(d)}} + \frac{\alpha_l}{\tau_l} \right)$.

    3. Simulate $\mu^{(l)(d+1)} \mid y, x \sim N(\tilde{\alpha}_l, \tilde{\tau}_l)$.

  6. Simulation of $\beta$: for $l = 1, \ldots, a$,

    1. $\tilde{c}_l^{-1} = \frac{1}{c_l} + \frac{\sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} y_{i,t-1}^2}{\sigma^{2(d)}}$.

    2. $\tilde{b}_l = \tilde{c}_l \left( \frac{b_l}{c_l} + \frac{\sum_{i=1}^{n} \sum_{t=u_i+1,\, x_t=l}^{m_i} \left( y_{i,t} - \mu^{(l)(d+1)} \right) y_{i,t-1}}{\sigma^{2(d)}} \right)$.

    3. Simulate $\beta^{(l)(d+1)} \mid y, x \sim N(\tilde{b}_l, \tilde{c}_l)$.

  7. Simulation of $\sigma^2$:

    1. $\tilde{\epsilon} = \frac{n_u + N}{2} + \epsilon$.

    2. $\tilde{\zeta} = \frac{\sum_{i=1}^{n} \left( y_{i,u_i} - \mu^{(x_{u_i})(d+1)} \right)^2 + \sum_{i=1}^{n} \sum_{t=u_i+1}^{m_i} \left( y_{i,t} - \mu^{(x_t)(d+1)} - \beta^{(x_t)(d+1)} y_{i,t-1} \right)^2}{2} + \zeta$.

    3. Simulate $\sigma^{2(d+1)} \mid y, x \sim IG(\tilde{\epsilon}, \tilde{\zeta})$.
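Putting the pieces together, a skeleton of the full sampler might look as follows. Here `ffbs_mar1hmm` stands for the state-sampling routine of step 3 adapted to the MAR(1)HMM emission, `hyper` collects the prior hyperparameters under our own naming, and the whole listing is an illustrative outline rather than the authors' implementation.

```python
import numpy as np

def mcmc_mar1hmm(y, u, m, a, hyper, D, rng=None):
    """Outline of the Gibbs sampler alternating hidden states and parameters."""
    rng = rng or np.random.default_rng()
    # Step 2: crude initial values
    trans = np.full((a, a), 1.0 / a)
    r = np.full(a, 1.0 / a)
    mu, beta, sigma2 = np.zeros(a), np.zeros(a), 1.0

    draws = []
    for d in range(D):
        # Step 3: joint draw of the hidden states (forward filter / backward sample)
        x = ffbs_mar1hmm(y, u, m, trans, r, mu, beta, np.sqrt(sigma2), rng)
        # Step 4: Dirichlet updates for the transition matrix and initial distribution
        trans, r = sample_pi_and_r(x, a, hyper["delta"], hyper["delta0"], rng)
        # Steps 5-7: normal / inverse-gamma updates of the autoregressive parameters
        mu = np.array([sample_mu_l(l, y, u, m, x, beta, sigma2,
                                   hyper["alpha"][l], hyper["tau"][l], rng)
                       for l in range(a)])
        beta = np.array([sample_beta_l(l, y, u, m, x, mu, sigma2,
                                       hyper["b"][l], hyper["c"][l], rng)
                         for l in range(a)])
        sigma2 = sample_sigma2(y, u, m, x, mu, beta,
                               hyper["eps"], hyper["zeta"], rng)
        draws.append((trans.copy(), r.copy(), mu.copy(), beta.copy(), sigma2))
    return draws
```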


5. Simulation study

In this section, we apply our results to the breast cancer model discussed earlier. The main reason behind our work is that the progression of breast cancer cannot be observed directly unless we use observations related to the disease that could characterize its progression. Those observations are measurable quantities called biomarkers, where the word biomarker designates any objective, measurable indication of a biological process or disease condition, including during treatment. Furthermore, biomarkers are increasingly used in the management of breast cancer patients. One example is reported in Ref. [22], which states that there is a correlation between elevation of CEA and/or CA15-3 and disease progression in breast cancer patients. We also use the autoregressive dependence among the observations to add more dynamics to the model, unlike conventional HMMs where the successive observations given the Markov process are independent.

We use the classification of breast cancer into three states: local, where the disease is confined within the breast; regional, when the lymph nodes are involved; and distant, where the cancer is found in other parts of the body. We restrict ourselves to these three stages, unlike other classifications that divide the progression into more than three stages, such as the TNM (tumor, node, and metastasis) system. For lack of data on breast cancer biomarkers, we confine ourselves to simulating an MAR(1)HMM with observation horizon $M = 24$, number of individuals $n = 210$, and $a = 3$ hidden Markov states, with the length of the observation period for each individual selected uniformly between 2 and $M$. The simulation assumes autoregressive means $\mu = (\mu^{(1)}, \mu^{(2)}, \mu^{(3)}) = (12, 24, 36)$, since markers such as CA15-3 increase as the disease advances toward metastatic breast cancer. In addition, CA15-3 increases rapidly between successive observations, and thus we take the autoregressive coefficients $\beta = (\beta^{(1)}, \beta^{(2)}, \beta^{(3)}) = (0.2, 0.4, 0.8)$ in the simulation.

The algorithm of simulation works as follows:

  1. For each individual $i = 1, \ldots, n$, choose $m_i$, the length of the observation period for that individual.

  2. Generate the discrete disease states $x_t$, for $t = u_0 + 1, \ldots, M$, using the transition matrix $\Pi = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.1 & 0.6 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}$.

  3. Generate the observations $y_{i,t}$ for all individuals using our model (1) (a code sketch of this procedure is given below).
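A minimal sketch of this data-generating procedure under the stated parameter values, taking $\sigma^2 = 2$ (the true value in Table 1) and $u_0 = 1$; the uniform placement of each observation window and the storage format are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
a, M, n = 3, 24, 210
trans = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.3, 0.5]])
r = np.full(a, 1.0 / a)                          # assumed uniform initial distribution
mu = np.array([12.0, 24.0, 36.0])
beta = np.array([0.2, 0.4, 0.8])
sigma = np.sqrt(2.0)

# One shared hidden state path x_1, ..., x_M (index 0 is unused)
x = np.zeros(M + 1, dtype=int)
x[1] = rng.choice(a, p=r)
for t in range(2, M + 1):
    x[t] = rng.choice(a, p=trans[x[t - 1]])

# Observation windows and MAR(1) observations for each individual
y, u, m = [], [], []
for i in range(n):
    length = int(rng.integers(2, M + 1))         # 2 to M observations
    u_i = int(rng.integers(1, M - length + 2))   # window fits inside 1..M
    m_i = u_i + length - 1
    obs = {u_i: rng.normal(mu[x[u_i]], sigma)}   # initial observation ~ N(mu^{(x)}, sigma^2)
    for t in range(u_i + 1, m_i + 1):
        obs[t] = rng.normal(mu[x[t]] + beta[x[t]] * obs[t - 1], sigma)
    y.append(obs)
    u.append(u_i)
    m.append(m_i)
```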

We choose the prior $\sigma^2 \sim IG(0.001, 0.001)$, a $\mathcal{D}(1, \ldots, 1)$ prior for each row of $\Pi$, and diffuse (noninformative) Gaussian priors for the $\mu$'s and the $\beta$'s. Having the hidden states and the observations, we ran our algorithm for 8000 MCMC iterations. Convergence of the MCMC algorithm was assessed by analyzing the mixing plots of the MCMC iterations shown in Figure 1, by checking the sample autocorrelation plots illustrated in Figure 2, and by inspecting the histograms of the posterior densities of the model parameters in Figure 3. All parameters show good mixing of the chains, autocorrelations that decay after a few lags, and well-behaved posterior densities. The Gelman [23] potential scale reduction factor (PSRF) was also plotted. The PSRF is computed for more than two MCMC chains (three chains are considered in this work) and for each parameter of the model; it indicates whether the chains have forgotten their initial values and whether the output from all chains is indistinguishable. It is based on a comparison of within-chain and between-chain variances and is similar to a classical analysis of variance; when the PSRF is high (say, greater than 1.1 or 1.2), the chains should be run longer to improve convergence to the stationary distribution. Each PSRF declines to 1 as the number of iterations approaches infinity, confirming convergence. All parameters showed a PSRF below 1.1 as the number of iterations increased, which is a good sign of convergence (Figure 4).

Moreover, we should point out that the family of Markov switching models suffers from the so-called label switching problem (e.g., Ref. [24]), which raises an identifiability issue and can prevent accurate estimation of the parameters; in addition, the posterior densities may show evidence of multimodality. Some authors postprocess the output of the MCMC algorithm to deal with this issue (e.g., Ref. [25]), others use a random permutation of the parameters in each iteration of the MCMC algorithm (e.g., Ref. [26]), and one can also call on an invariant loss function method (e.g., Ref. [27]). In our case, no identifiability issue was noticed, since we used well-separated prior hyperparameters. Even when we start from different initial values for the parameters, our algorithm converges after a few iterations.

Figure 1.

Markov chain mixing for each parameter through MCMC algorithm simulation.

Figure 2.

Autocorrelation sample plots for parameters of the model.

Figure 3.

Posterior densities for the parameters of the model (after 8000 iterations).

Figure 4.

Potential scale reduction factor convergence to less than 1.02 with more iterations.

Finally, before giving our results, we should report that the simulation of the Dirichlet posteriors was carried out following Refs. [28, p. 22] and [29, p. 155], which report that the Dirichlet posterior parameters should be simulated using the beta distribution approach. Table 1 shows that the posterior values estimated by the algorithm are very close to the true ones.
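One common beta-based construction of a Dirichlet draw (our reading of the approach referenced above) samples the components sequentially, each as a beta fraction of the remaining probability mass:

```python
import numpy as np

def dirichlet_via_betas(alpha, rng):
    """Draw p ~ Dirichlet(alpha) using sequential beta draws:
    p_1 ~ Beta(alpha_1, alpha_2 + ... + alpha_a), and each later component is a
    beta-distributed fraction of the mass left over by the previous ones."""
    a = len(alpha)
    p = np.zeros(a)
    remaining = 1.0
    for j in range(a - 1):
        frac = rng.beta(alpha[j], alpha[j + 1:].sum())
        p[j] = remaining * frac
        remaining -= p[j]
    p[-1] = remaining
    return p

rng = np.random.default_rng(1)
print(dirichlet_via_betas(np.array([2.0, 3.0, 5.0]), rng))
```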

Parameter | True value | Posterior mean | Posterior standard deviation | Confidence interval (5%)
μ(1) | 12 | 11.929 | 0.047 | (11.851–12.005)
μ(2) | 24 | 23.923 | 0.047 | (23.847–24.000)
μ(3) | 36 | 35.843 | 0.070 | (35.729–35.959)
β(1) | 0.2 | 0.2016 | 0.0012 | (0.1997–0.2035)
β(2) | 0.4 | 0.4018 | 0.0009 | (0.4004–0.4032)
β(3) | 0.8 | 0.8022 | 0.0010 | (0.8005–0.8038)
π11 | 0.7 | 0.688 | 0.068 | (0.5715–0.797)
π12 | 0.2 | 0.223 | 0.062 | (0.129–0.332)
π13 | 0.1 | 0.090 | 0.042 | (0.032–0.170)
π21 | 0.1 | 0.091 | 0.035 | (0.041–0.154)
π22 | 0.6 | 0.607 | 0.059 | (0.507–0.701)
π23 | 0.3 | 0.302 | 0.055 | (0.214–0.397)
π31 | 0.2 | 0.153 | 0.053 | (0.075–0.250)
π32 | 0.3 | 0.368 | 0.071 | (0.257–0.488)
π33 | 0.5 | 0.479 | 0.073 | (0.358–0.599)
σ² | 2 | 2.023 | 0.032 | (1.970–2.077)

Table 1.

Posterior inference for the parameters of the MAR(1)HMM model.


6. Conclusion

We have extended the method of Chib [21] for block-update estimation of the states to the MAR(1)HMM model. Furthermore, we would like to point out that our model can easily be extended to include missing observations, as we would only need to add an extra step in each MCMC iteration to estimate the missing observations. We can also estimate the autoregressive model for different values of the autoregressive order, $p \ge 1$, by evaluating the Bayesian information criterion to select the order that best fits the observations. Our model captures the complexity and the dynamics of the evolution of breast cancer by introducing the latent states: the transition probabilities between the latent states allow comparison of the effects of treatments in slowing or accelerating the transition of the disease from one health stage to another, and the autoregressive mean parameters corresponding to the different stages of the disease would guide medical doctors and scientists in monitoring patients in different phases of the disease. The model also accommodates individual observation sequences of different lengths.

Last but not least, we would like to mention the usefulness of switching diffusion processes for addressing and analyzing many complicated applications, such as in finance and risk management. Our future work will be to apply these processes to explore disease progression, because they are characterized by the coexistence of continuous dynamics and discrete events, as well as their interactions.


Acknowledgments

We would like to thank the editorial staff for the comments that helped in improving this work. We would also like to thank the supporters of this work: the Lalla Salma Foundation for Prevention and Treatment of Cancer, Rabat, Morocco, and the Germano-Moroccan Program for Scientific Research PMARS 2015-060.

References

  1. 1. Samy N, Ragab HM, El Maksoud NA, Shaalan M. Prognostic significance of serum Her2/neu, BCL2, CA15-3 and CEA in breast cancer patients: A short follow up. Cancer Biomarkers. 2009;6:63‐72
  2. 2. Benmiloud B, Pieczynski W. Estimation des paramètres dans les chaînes de Markov cachées et segmentation. Traitement du Signal. 1995;12:433–454
  3. 3. Boys R, Henderson D. A Bayesian approach to DNA sequence segmentation (with discussion). Biometrics. 2004;60:573–588
  4. 4. Guihenneuc-Jouyaux C, Richardson S, Longini IM Jr. Modeling disease progression by a hidden Markov process: Application to characterizing CD4 cell decline. Biometrics. 2000;56:733–741
  5. 5. Albert J, Chib S. Bayes inference via Gibbs sampling of autoregressive time series subject to Markov mean and variance shifts. Journal of Business and Economic Statistics. 1993;11:1–15
  6. 6. Korolkiewicz M, Elliott R. A hidden Markov model of credit quality. Journal of Economic Dynamics and Control. 2008;32:3807–3819
  7. 7. Zeng Y, Frias J. A novel HMM-based clustering algorithm for the analysis of gene expression time-course data. Computational Statistics and Data Analysis. 2006;50:2472–2494
  8. 8. Zucchini W, MacDonald I. Hidden Markov Models for Time Series: An Introduction Using R. New York: Springer; 2009
  9. 9. Ailliot P, Monbet V. Markov-switching autoregressive models for wind time series. Environmental Modelling & Software. 2012;30:92–101
  10. 10. Farcomeni A, Arima S. A Bayesian autoregressive three state hidden Markov model for identifying switching monotonic regimes in microarray time course data. Statistical Applications in Genetics and Molecular Biology. 2013;23:467–480
  11. 11. Hamilton J. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica. 1989;57:357–384
  12. 12. Kim C, Kim J. Bayesian inference in regime-switching ARMA models with absorbing states: The dynamics of the ex-ante real interest rate under regime shifts. Journal of Business and Economic Statistics. 2015;33:566–578
  13. 13. Fitzpatrick M, Marchev D. Efficient Bayesian estimation of the multivariate double chain Markov model. Statistics and Computing. 2013;23(4):467–480
  14. 14. Lindley D. The philosophy of statistics. The Statistician. 2000;49:293–337
  15. 15. Bolstad WM. Introduction to Bayesian Statistics. Hoboken, NJ: John Wiley and Sons; 2007
  16. 16. Gelman A, Shalizi CR. Philosophy and the practice of Bayesian statistics in the social sciences. British Journal of Mathematical and Statistical Psychology. 2013;66:8–38
  17. 17. Hobert JP. The data augmentation algorithm: Theory and methodology. In: Brooks S, Gelman A, Jones GL, Meng X-L, editors. The Handbook of Markov Chain Monte Carlo. Boca Raton, FL: Chapman and Hall/CRC; 2011. p. 253
  18. 18. Tuyl F, Gerlach R, Mengersen K. Posterior predictive arguments in favor of the Bayes-Laplace priors as the consensus prior for the binomial and multinomial parameters. Bayesian Analysis. 2013;4:151–158
  19. 19. Cappe O, Moulines E, Ryden T. Inference in Hidden Markov Models. New York: Springer-Verlag; 2005
  20. 20. Sampietro S. Mixture of autoregressive components for modeling financial market volatility. LIUC Papers. Serie Metodi quantitativi 16. 2005
  21. 21. Chib S. Calculating posterior distributions and model estimates in Markov mixtures models. Journal of Econometrics. 1996;75:79–98
  22. 22. Laessig D, Nagel D, Heinemann V, Untch M, Kahlert S, Bauerfeind I, Stieber P. Importance of CEA and CA15-3 during disease progression in metastatic breast cancer patients. Anticancer Research. 2007;27:1963–1968
  23. 23. Gelman A. Inference and monitoring convergence. In: Gilks W, Richardson S, Spiegelhalter D, editors. Markov Chain Monte Carlo in Practice. London: Chapman and Hall/CRC; 1995. p. 131
  24. 24. Fruhwirth-Schnatter S. Finite Mixture and Markov Switching Models. New York: Springer; 2006
  25. 25. Celeux G. Bayesian inference for mixture: The label switching problem. In: Payne R, Green P, editors. Proceedings in Computational Statistics. Heidelberg: Physica; 1998. pp. 227–232
  26. 26. Fruhwirth-Schnatter S. Markov Chain Monte Carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association. 2001;96:194–209
  27. 27. Hurn M, Justel A, Robert C. Estimating mixtures of regressions. Journal of Computational and Graphical Statistics. 2003;79:55–79
  28. 28. Kim C, Nelson C. State-Space Models with Regime Switching: Classical and Gibbs Sampling Approaches with Applications. Cambridge, MA: MIT Press; 1999
  29. 29. Krolzig H. Markov-Switching Vector Autoregressions. Berlin, Heidelberg: Springer-Verlag; 1997
