Dependent Dirichlet Processes for Analysis of a Generalized Shared Frailty Model

Bayesian paradigm takes advantage of well fitting complicated survival models and feasible computing in survival analysis owing to the superiority in tackling the complex censoring scheme, compared with the frequentist paradigm. In this chapter, we aim to display the latest tendency in Bayesian computing, in the sense of automating the posterior sampling, through Bayesian analysis of survival modeling for multivariate survival outcomes with complicated data structure. Motivated by relaxing the strong assumption of proportionality and the restriction of a common baseline population, we propose a generalized shared frailty model which includes both parametric and nonparametric frailty random effects so as to incorporate both treatment-wise and temporal variation for multiple events. We develop a survival-function version of ANOVA dependent Dirichlet process to model the dependency among the baseline survival functions. The posterior sampling is implemented by the No-U-Turn sampler in Stan, a contemporary Bayesian computing tool, automatically. The proposed model is validated by analysis of the bladder cancer recurrences data. The estimation is consistent with existing results. Our model and Bayesian inference provide evidence that the Bayesian paradigm fosters complex modeling and feasible computing in survival analysis and Stan relaxes the posterior inference.


Introduction
The shared frailty model, coined by [1], has been widely used in the analysis of multivariate survival outcomes that might be associated within subgroups or clusters. Enormous work has been devoted to the development of shared frailty model in both Bayesian and frequency paradigms, and the reviews can be found in [2][3][4]. As an extension of the well-known Cox's proportional hazard model, conditional on the frailty effect, the shared frailty model assumes the hazard ratio between two subjects is proportional to their difference in relative risk scores over time. In addition to the proportional hazard assumption, the shared frailty model fixes the baseline hazard function among all clusters.
Traditional shared frailty models provide a good framework for expediently mathematical tackling the heterogeneity among the multivariate observations, whereas in practice it needs modification and adaption to tolerate complex structure so as to incorporate cross information owing to the intra-and intersubject variability ( [5,6]). Take the renowned data on recurrences of bladder cancer for instance ( [7]). There are three treatment arms, placebo, thiotepa, and pyridoxine. Patients had multiple recurrences of tumors which were sparse beyond the fourth recurrence. Figure 1 shows the Kaplan-Meier estimators of the survival function for the times of the first and the second recurrences under three treatments. One observes that, the estimated survival curves at the first recurrence are crossed indicating a crossed hazard and that the proportional hazard assumption is suspected ( [8]); the survival curve of pyridoxine falls below that of placebo at the second recurrence compared to the first recurrence, indicating the functional form of the survival curves varies between recurrences. Neglecting such characteristics of non-proportionality and stratification of recurrences may yield inefficiency by encumbering borrowing strength from potentially related information sources, and consequently may jeopardize the prediction of the global survival times. Moreover, dependency might be existing among the treatment strata and the stratification of recurrences ( [5,9]). Consequently, more complex modeling is needy to characterize the dependence among the baseline hazard functions and treatment strata due to the temporal effects of recurrences. Frequentist inference and computing are pretty challenging and even infeasible. Existing Bayesian literature considered modifications of shared frailty model based on some kind of partial aberrant phenomena ( [10,11]; among others) but rare work has taken bi-level stratification into account ( [12]), not to mention that dependence among treatment strata ( [5]).
We propose a generalized shared frailty model (GSFM) for multiple events time data that allows the baseline hazard function to change along with the types of events and treatment strata, strengthening the ability to borrow information from many sources. The proposed model postulates multiplier frailty including both parametric and nonparametric ones, where the parametric frailty random effect accounts for the within subject association by treating each subject as a cluster; and a nonparametric frailty effect represents dependency among treatment strata and temporal recurrences. For the proposed model GSFM, we suggest a Bayesian solution to estimate the regression coefficient vector and the variance parameter of the frailty term, and baseline survival functions stratified by treatments and recurrences. In a Bayesian workflow, the posterior distribution is determined by the combination of observational data in the form of likelihood function and the prior distribution represented based on the background knowledge. From a Bayesian perspective, we model the dependent nonparametric prior through transferring the data context aforementioned into the ANOVA dependent Dirichlet process (ANOVA DDP), which will be further reviewed in Section 2. The construction of No-U-Turn sampler for Markov chain Monte Carlo (MCMC) sampling is automated by Stan ( [13]) with its R interface ( [14]). The posterior inference is conducted by Stan as well.
The rest of this chapter is organized as follows. In Section 2, under typical data scenarios of dependence structure, we summarize several modification versions of the dependent Dirichlet process (DDP) initiated from MacEachern's regression spirit that nested dependent predictors into the traditional Dirichlet Process (DP). In Section 3, we postulate the GSFM and transform the dependent dual-stratified multiple events to the survival-function based version of the ANOVA DDP. We have a short comparison between Stan and Nimble, two contemporary Bayesian computing tools based on our user experience. In section 4, we demonstrate the validity of the GSFM and Bayesian inference and analysis of the data on recurrences of bladder cancer. A brief conclusion is contained in Section 5.

Review of MacEachern's DDP
The DP is the most popular Bayesian nonparametric prior since the seminal work of [15]. The belief in data background that there exists some kind of dependence structure stimulates construction and selection of proper dependent prior. Some dependent DPs are constructed for unsupervised purposes such as clustering ( [16,17]). The DDP prior adopted in our proposed model is supervised and predictor-dependent, originated from [18,19], named as MacEachern's DDP in two recent review papers, which are interpretive and comprehensive ( [18,19]). The key idea behind the MacEachern's DDP is that the distributions of the random measures are marginally DP distributed, validated by in our subsections 3.2 and 3.3. Therefore we here confine how the MacEachern's DDP (henceforth we use the DDP to denote the MacEachern's DDP if the context is clear) came into being expanded from the DP, and compare various modification versions of the DDP under various dependent data structures.

DP vs. DDP
The DP is a distribution on distributions whereas the DDP aims to construct prior for a collection of distributions F = {F x |x ∈ X} indexed by covariate x. In general, there are several representations of the DP such as Polya Urn, Levy measure, and stick-breaking representations ( [20]). Here we use Sethuraman's stick-breaking construction to connect the DP with the DDP. The stick-breaking construction is a kind of infinite sum representation that divides the DP into two countable series, the weights and the atoms. Generally, a DP is expressed as a process with two components, the mass parameter determining the weights and the base measure to generate atoms. Through the stickbreaking construction, the DDP can be easily extended from the DP. We list their comparison in Table 1, where we can find that the dependency among the covariates set X is realized by indexing the mass parameter and base measure with the covariate x ∈ X. More specifically, the dependency can be characterized through the dependency among the weights and atoms in the DDP. Table 1.
The DDP can be widely applied to scenarios of various dependence data structure. We review modification versions of the DDP from three categories depending on which part it modifies in the stick-breaking representation, weights, atoms, or both. The first is to impose the dependency on the atoms but keep common weights, leading to two typical representatives, ANOVA and Spatial ( [9,21,22]). The ANOVA type DDP encoded the covariate dependence in the form of regression for the atom processes. The Spatial DDP models for nonstrationary spatial random fields with heterogeneous variance. The second category is to modify the weights to be dependent but keep the common atoms. The early and typical work is the time series DDP ( [23]). They introduced a Markov Beta process on the weights to account for the temporal dependency. The third category is to impose dependency on both weights and atoms ( [24]). They constructed vector autoregressive and autoregressive models for atoms and weights, respectively. We summarize the aforementioned types of typical modifications in Figure 2

Model and Bayesian inference
Consider a clinical trial with multiple event types, for example, the time of the kth recurrence of a certain disease. In the trial, n subjects are divided into G strata of treatment. Our goal is to describe the relationship between the time to the kth recurrence of a subject, and its treatment stratum as well as its vector covariates Z. For a certain subject, the times of recurrences may be dependent since they occur on the same individual and thus we assume an unobservable independent shared-frailty random effect W to account for this dependence. On No convolution the other hand, we may allow the conditional hazard affiliated with the script pair kj implying distinct survival distributions along with the temporal order of the recurrences of the disease and for specific treatment. For the ith subject in the jth treatment stratum, at the kth recurrence, given the value of frailty variable w i and its covariate vector z kji , we propose the following frailty model, Model (1) is called the generalized shared frailty model in the sense that non-proportionality among k-varying recurrences is allowed by the fact that the right-hand baseline hazard has footnotes k and j. We allow dependency among treatment strata in model (1) . Therefore, the baseline hazard function 0kj acts as if a nonparametric frailty random measure accounting for the dependency owing to the recurrences and treatment scheme.
Model (1) is an extension of the classical shared frailty model (4.1.1) on page 101 of [2] since the baseline hazard function there does not vary from the recurrences and the treatment strata. Model (1) has the analog spirit to the frailty model (1) in [10], whereas their treatment strata are independent.

Likelihood
The corresponding survival function of model (1) is given by: where S 0kj denotes the baseline survival function of the kth recurrence for subjects in the jth treatment stratum, v i = log(w i ) denotes logarithm transformation of the frailty effect. Let f 0kj be the corresponding baseline density function. Given the data sample (Y kji , kji , z ji ), where Y kji = min(C kji , T kji ), kji = I(T kji ≤ C kji ), with T kji being the gap time between the (k − 1)th and kth recurrence of the ith subject in the jth stratum and C kij being the corresponding censoring variable that is independent of T kji given the covariate vector z kji , for k = 1, · · · , K, j = 1, · · · , G, i = 1, · · · , n j , and G j=1 n j = n. In the jth stratum, suppose that there are n kj (n kj ≤ n j ) subjects suffering from the kth recurrence. Then the likelihood is written as:

Survival-function based version of the ANOVA DDP
In the Bayesian workflow for the estimation, prior distributions are first determined. We here specify appropriate nonparametric priors for S 0kj and f 0kj . Since they can be easily derived from one to the other, we here only introduce the priors for S 0kj .
We divide S 0kj into K groups, and the kth group has G baseline survival functions of different treatment strata at the kth time of recurrence. That is, for a fixed k, S k = {S 0kj , j = 1, · · · , G} is a collection of baseline survival functions with length G indexed by the categorical covariate j denoting the treatment stratum. The next procedures come from the spirit of [9]. As a general example, suppose two dugs A and B will be taken in treatment, with V and U levels of doses, respectively. In this case, G = VU denotes the number of treatment strata and let the level of the jth stratum be (v, w). We write the stick-breaking form of S 0kj such that S 0kj (t) = ∞ h=1 p h I(t > kjh ). We impose an ANOVA structure on kjh : where m kh denotes the ANOVA effect shared by all the strata at the kth recurrence, and the rest terms are the ANOVA effects of the jth stratum at the kth recurrence. Let the three components be independently generated from three distributions, and marginally on j, the baseline survival function S 0kj follows a DP. The aforementioned procedure implies that S k is a survivalfunction version of the ANOVA DDP.
Since any function in the stick-breaking form is discrete almost surely, we place a convolution through the Dirichlet process mixture (DPM) model ( [25]). Particularly, since the baseline survival functions are defined on the positive half real line, the convolution kernel in DPM should be positive such as log-normal, Gamma, and Weibull. In this chapter, a log-normal kernel is considered. For different recurrences, we treat the relationship among S k s to be independent.

One-way ANOVA DDP
Considering the data of our interest, where only one drug and one level of dose is used in each treatment stratum, we introduce the modeling of the survival-function version of one-way ANOVA DDP. In this case the prior for the S k reduces to a one-way ANOVA form since the dependency among the G treatment strata is explained by only one ANOVA effect. Furthermore, if we set m kh = 0 , kh = ( k1h , · · · , kGh ) T reduces to a G-variate variable denoting the locations of all G baseline distributions and thus kjh = T kh d j , where d j is the design vector of the jth stratum to select the appropriate ANOVA effects corresponding to j.
2. Convolution step. Let kh = ( k1h , · · · , kGh ) T , and d j be the jth design vector of length G with the jth element being 1 and others being 0. Let H 0k = (H 0k1 , · · · , H 0kG ) be the collection of base measures, , where S LN denotes the survival function of the log-normal distribution, and H k ∼ DP(M k , H 0k ).  Step 1 is a standard stick-breaking representation for DP.
Step 2 is kernel mixture of DP whereas the kernel is a survival function rather than a cumulative distribution function. The realization of Step 2 is quite straightforward in Stan as it provides the function lognormal lccdf to be used as the kernel of the survival function of the log-normal family.
In Step 3, we specify the base measure as the prior for the location and shape parameters of the log-normal kernel directly rather than adding another hyper prior distribution like [22] did. The main reason is to simplify the computation in Stan. Particularly, inspired by [27] and [28], we use the half-Cauchy distribution as the non-informative prior for the variance parameter instead of the inverse Gamma prior. In our practice, the choice of half-Cauchy prior significantly improves the speed of convergence and mixture performance of the MCMC chains in our real data analysis and simulation. Another interesting point we met in numerical studies is that the informativeness of the base measure for . Here we don't assign the non-informative distribution but a weakly informative one is considered since we find such a weakly informative prior provides better MCMC performance than that of non-informative one with higher effective sample size and better mixture performance. In our other research experience, the weakly informative prior for the variance parameter in the mixing component of the DPM seems to be more preferable.

Other priors and MCMC
In terms of the prior for the parametric prior w i , we choose log normal prior that v i = log(w i ) and v i ∼ N(0, 2 ), where > 0 is an unknown parameter. We further assign a half Cauchy prior for s.t ∼ Cauchy + (0, 5) as a non-informative prior. The prior for the vector of regression coefficients is ∼ N(0, 1000I) as a non-informative prior. We use the truncated Dirichlet process to replace the infinite summand in the DP. The selection of the truncation point is often ad-hoc. Since in Stan the NUTS cannot sampler discrete parameters, we have to fit the truncation number and the mass parameter before the MCMC procedure. In general, the truncation number is set to be large enough s.t the truncated part is negligible. [29] suggests to use a truncation number L that is greater than 5M + 5. In our computation, we set L = 12.
The MCMC sampling for the posterior distribution is realized in Stan. Stan and its R version are widely used in statistical modeling and high-performance statistical computing, especially in Bayesian. Stan realizes the MCMC sampling through the No-U-Turn sampler (NUTS). Stan automates the deriving of the full conditional posterior distribution and NUTS is able to obtain high effective sample size ( [30]).

Stan and NIMBLE: programming styles
The MCMC sampling procedure is implemented in Stan and we also tried to implement the model in NIMBLE, another contemporary Bayesian computing tool in R. Stan and NIMBLE are two contemporary Bayesian computing tools that have drawn arising interest for Bayesian analysis but still remain under active development ( [31,32]). The main advantage of Stan and NIMBLE is that they provide clear automatic posterior sampling procedures based on their specific sampling algorithms without particular justification. Therefore, users can be released from complicated probabilistic deriving and implementation. There has been buzz group discussion about the . One comparison on their built-in samplers is demonstrated through implementing weakly informative and informative estimation within the trimmed mean regression model setting ( [35]). Here we contribute a naive comparison on their programming styles based on the first two authors' experience in coding this project and using Stan and NIMBLE, respectively. A Bayesian paradigm is made up of three main steps, the prior, likelihood, and the posterior. MCMC generates samples to approximate the posterior distribution. Therefore, what one needs to set in a Bayesian computing tool is the prior and likelihood, let alone Stan or NIMBLE. Nevertheless, Stan and NIMBLE take different programming styles in writing likelihood. In Stan, the default way to present the log likelihood is the syntax target and users can add log contribution to it freely, which is similar to the natural language and straightforward to users whatever level of mathematical background. In NIMBLE, the default way is to transfer the likelihood into some standard distributions given by NIMBLE, which may not be friendly for users who have a relatively less mathematical background.
We take fitting the finite mixture of Gaussian model as an example. For a fixed positive integer L, the distribution of Y is given by F Y (s) = L l=1 p l N(s| l , 2 l ) and the log-likelihood is log L(p, , |Y) = n i=1 L l=1 {log(p l ) + log (y i | 1 , l )}, where denotes the density function of normal distribution. The code for Stan and NIMBLE to implement this model is listed in Listing 1.1 and 1.2, respectively. In Listing 1.1 we clearly find that the contribution to the syntax target is just the sum of log(p l ) and the logarithm of the density of normal distribution denoted by normal lpdf. The rest is to assign a Dirichlet prior for the weights p l and other parameters. However, in NIMBLE code shown in Listing 1.2, we have to transfer the likelihood into some sampling procedures by IMAGING that there are L clusters of random numbers, the random numbers are i.i.d Gaussian within each cluster, and the probability a random number is drawn from the lth cluster is p l . Thereafter, the Dirichlet prior is assigned to p l s. Such imagine matches the Bayesian philosophy but when the likelihood function becomes to be quite complicated, to understand this sampling procedure may not be easy anymore, especially for practitioners not coming from a mathematics or statistics background.

Application: bladder cancer recurrences
We apply the GSFM to analyze the Bladder cancer recurrences data set contained in R package survival. Totally 118 subjects in the clinical trial are divided into 3 treatment strata including placebo, pyridoxine (vitamin B6), and thiotepa. Each subject may experience k (from 1 to 9) times of recurrences and may die from or not from the recurrence of bladder cancer. We don't discriminate the death from cancer and the recurrence, and the death from other causes is treated as censoring status. Our interest is the gap time between the (k − 1)th and the kth recurrences. Besides the treatment schemes, two clinical covariates are considered: the number of tumors at the beginning (x 1 ) and the size of the largest tumor (x 2 ) within a subject. The values of these two covariates are evaluated at the beginning of each recurrence interval. This data set was once analyzed for the time between the first to the second recurrence as a univariate time-to-event outcome ( [36]). In this chapter, we consider both the first and the second recurrences and thus K = 2 here. The two covariates are scaled by divided by 100. To simplify the computation, the follow-up time is transferred from months to years to get lower scalars.

Model checking for baseline survival functions
Before further inference, we need to check whether the proposed model is appropriate. As an alternative, a shared frailty model is fit by R package spBayeSurv. In the shared frailty model, the treatment strata are considered as indicator covariates in the parametric term. We run 4 independent MCMC For the shared frailty model, we run the MCMC 16000 times with the first 6000 times burn-in through R function survregbayes using the "IID" Gaussian frailty under "PH" model name.
Other settings are default. The plots of the estimated baseline survival functions under different models stratified by treatment strata can be viewed in figure 3. From that, we find the baseline survival functions estimated under the GSFM shows similar trends as that of the K-M estimator in each recurrence, and reflects the crossing survival curves at the first recurrence like the K-M estimator. However, the curves estimated by the shared frailty model are not crossed and cannot change along with recurrences. Therefore, the proposed GSFM is appropriate for the data.

Parametric estimation I: real data
We use the mean of posterior samples (median for ) as the estimator of parameter and we list the estimation of vector of regression coefficients and standard deviation parameter in Table 2. From table 2 we find that as the number of tumors at start point increases, the hazard for recurrences increases as well whereas the larger size of the largest tumor will decrease the hazard. This conclusion is similar to that in  [36] who analyzed the first recurrence as a univariate time-to-event data by a transformation model. Besides the parametric estimation result, we also report two metrics about the MCMC performance here. The first one is the effective sample size (ESS), an approximation to the number of "independent" draws in MCMC sampling. It shows that the ESS of all parameters is greater than 400, which is considered to be adequate by [37]. The ESS of is significantly lower than that of , a possible reason is that the frailty random effect might be time dependent w i (t) rather than a time-fixed effect. Another metric of interest is the average time needed to generate each effective sample, called MCMC Pace. Stan develop team emphasized the importance of MCMC Pace, and the definition is given by the team of NIMBLE in [38] as the time consuming of generating one effective sample. The MCMC Pace to generate is much higher than that of , and we conjecture the possible reason is that the posterior distribution has a long upper tail leading to outliers in posterior samples, which slows down the speed to generate effective samples.

Parametric estimation II: simulation
Another simulation study is considered to evaluate the performance of parametric estimation of the MCMC procedure. Our simulation aims to simulate the occurrences of multiple events on the same individual. We take K = 2 and G = 3 denoting the number of types of events and number of treatment strata, respectively. The simulation includes two independent covariates, x i ∼ Bin(1, 0.5) and x 2 ∼ N(0, 1) to incorporate indicator variable and continuous variable as well. For k = 1, 2, j = 1, 2, 3, the baseline survival functions S 0kj are set as: When k = 1, the three baseline survival functions are crossed whereas when k = 2, the three curves are not. The vector of regression coefficients is = (1, 1) T and the log frailty random effect v i ∼ N(0, 1) independently. The survival time is generated following model (1) . The censoring variable of each event is generated from Unif(4, 6) independently, leading to a censoring rate of about 28%. We set the number of subjects to be 90 and they are equally divided into three treatment strata. We repeat the simulation for 150 times. Table 3

Discussion
In this chapter, we show the power of Bayesian computing illustrated by successfully applying the ANOVA DDP model as the nonparametric prior for a relatively complicated shared frailty model. Our survival-function based version of the ANOVA DDP, modified based on the ANOVA DDP directly in subsection 3.3, is constructed for the shared frailty model, but can reduce to modeling the univariate dependent survival functions by involving the continuous covariates into the predictor space of the ANOVA DDP. Hence, our work is an extension of [22] to some extent. However, the proposed GSFM is different to the Linear DDP models for generalization of accelerated failure time model ([39, 40]). Furthermore, although we point out that there exists potentially dual dependence for dual stratification of treatment strata and recurrences, we just simply allow dependence in treatment strata and assume that the recurrences are independent in our methodology demonstration. The dependence across recurrences per subject is dealt with only by the parametric frailty random effect in the proposed shared frailty model. It is more reasonable to be incorporated into the baseline survival functions so that the interaction effects between recurrence and treatment may be accounted for. Under the one-level stratification, [5] modeled such serial correlation among baseline hazard functions by constructing the so-called dependent tail free process as the prior. It is non-trivial to accommodate dual temporal and stratified dependency as a future research plan. [37] Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. Rank-normalization, folding, and localization: An improved r for assessing convergence of mcmc. Bayesian analysis, 1( [39] TIMOTHY E Hanson and ALEJANDRO Jara. Surviving fully bayesian nonparametric regression models. Bayesian theory and applications, pages 593-615, 2013.