Results from 31 randomized controlled trials of heart disease patients, where the treatment group received bone marrow stem cells and the control group a placebo treatment. The source of this table is Nowbar et al. [4].

## Abstract

This is the Information Age. We can expect that for a particular research question that is empirically testable, we should have a collection of evidence which indicates the best way to proceed. Unfortunately, this is not the case in several areas of empirical research and decision making. Instead, when researchers and policy makers ask a specific question, such as “What is the effectiveness of a new treatment?”, the structure of the evidence available to answer this question may be complex and fragmented (e.g. published experiments may have different grades of quality, observational data, subjective judgments, etc.).

### Keywords

- Bayesian hierarchical models
- meta-analysis
- multi-parameters evidence synthesis
- conflict of evidence
- randomized control trials
- retrospective studies

## 1. Introduction

In today’s information age one can expect that the digital revolution can create a knowledge-based society surrounded by global communications that influence our world in an efficient and convenient way. It is recognized that never in human history we have accumulated such an astronomical amount of data, and we keep on generating data at in an alarming rate. A new term, “big data,” was coined to indicate the existence of “oceans of data” where we may expect to extract useful information for any problem of interest.

In this technological society, one could expect that for a particular research question we should have a collection of high quality evidence which indicates the best way to proceed. Paradoxically, this is not the case in several areas of empirical research and decision making. Instead, when researchers and policy makers ask a specific and important question, such as “What is the effectiveness of a new treatment?”, the structure of the evidence available to answer this question may be complex and fragmented (e.g., published experiments may have different grades of quality, observational data, subjective judgments, etc.). The way how researchers interpret this multiplicity of evidence will be the basis for their understanding of reality and it will determine their future decisions.

Bayesian meta-analysis, which has its roots in the work of Eddy et al. [1], is a branch of statistical techniques for interpreting and displaying results of different sources of evidence, exploring the effects of biases and assessing the propagation of uncertainty into a coherent statistical model. A gentle introduction of this area can be found in Chap. 8 of Spiegelhalter et al. [2] and a recent review in Verde and Ohmann [3].

In this chapter we present a new method for meta-analysis that we have called: the “Hierarchical Meta-Regression” (HMR). The aim of HMR is to have an integrated approach for bias modeling when disparate pieces of evidence are combined in meta-analysis, for instance randomized and non-randomized studies or studies with different qualities. This is a different application of Bayesian inference than those applications with which we could be familiar, for instance an intricate regression model, where the available data bear directly upon the question of interest.

We are going to discuss two recent meta-analyses in clinical research. The reason for highlighting these two cases is that they illustrate a main problem in evidence synthesis, which is the presence of a multiplicity of bias in systematic reviews.

### 1.1. An example of meta-analysis of therapeutic trials

The first example, is a meta-analysis of 31 randomized controlled trials (RCTs) of two treatment groups of heart disease patients, where the treatment group received bone marrow stem cells and the control group a placebo treatment, Nowbar et al. [4]. The data of this meta-analysis appear in the Appendix, see Table 1. Figure 1 presents the forest plot of these 31 trials, where the treatment effect is measured as the difference of the ejection fraction between groups, which measures the improvement of left ventricular function in the heart.

At the bottom of Figure 1 we see average summaries represented by two diamonds: the first one corresponds to the fixed effect meta-analysis model. This model is based under the assumption that studies are identical and the between study variability is zero. The widest diamond represents the results of a random effects meta-analysis model, which assume a substantial heterogeneity between studies. In this meta-analysis both models confirmed a positive treatment of effect of a mean difference 3.95 95% CI [3.43; 4.47] and 2.92 and a 95% CI of [1.47, 4.36], respectively.

Could we conclude that we have enough evidence to demonstrate the efficacy of the treatment? Unfortunately, these apparently confirming results are completely misleading. The problem is that these 31 studies are very heterogeneous, which resulted in a wide 95% prediction interval [−4.33; 10.16] covering the no treatment effect, and a large number of contradictory evidence displayed in Figure 1.

In order to explain the sources of heterogeneity in this area Nowbar et al. [4] investigated whether detected discrepancies in published trials, might account for the variation in reported effect sizes. They define a discrepancy in a trial as two or more reported facts that cannot both be true because they are logically or mathematically incompatible. In other words, the term discrepancies is a polite way to indicate that a published study suffers from poor reporting, could be implausible or its results have been manipulated. For example, as we see at the bottom of Table 1 in the appendix, it would be difficult to believe in the results of a study with 55 discrepancies. In Section 2 we present a HMR model to analyze a possible link between the risk of bias results and the amount of discrepancies.

### 1.2. An example of meta-analysis of diagnostic trials

The topic of Section 3 is the meta-analysis of diagnostic trials. These trials play a central role in personalized medicine, policy making, healthcare and health economics. Figure 2 presents our example in this area. The scatter plot shows the diagnostic summaries of a meta-analysis investigating the diagnostic accuracy of computer tomography scans in the diagnostic of appendicitis [5]. Each circle identifies the true positive rate vs. the false positive rate of each study, where the different circles’ sizes indicate different sample sizes. One characteristic of this meta-analysis is the combination of disparate data. From 51 studies 22 were retrospective and 29 were prospective, which is indicated by the different grey scale of the circles.

The main problem in this area is the multiple sources of variability behind those diagnostic results. Diagnostic studies are usually performed under different diagnostic setups and patients’ populations. For a particular diagnostic technique we may have a small number of studies which may differ in their statistical design, their quality, etc. Therefore, the main question in meta-analysis of diagnostic test is: How can we combine the multiplicity of diagnostic accuracy rates in a single coherent model? A possible answer to this question is a HMR presented in Section 3. This model has been introduced by Verde [5] and it is available in the R’s package **bamdit** [6].

## 2. A Hierarchical Meta-Regression model to assess reported bias

Figure 3 shows the reported effect size and the 95% confidence intervals of 31 trials from [4] against the number of discrepancies (in logarithmic scale). The authors reported a positive statistical significant correlation between the size effect and the number of discrepancies detected in the papers. However, a direct correlation analysis of aggregated results is threatened by ecological bias and it may lead to misleading conclusions. The amount of variability presented by the 95% confidence intervals is very big to accept a positive correlation at face value. In this section we are going to present a HMR model to link the risk of reporting bias with the amount of reported discrepancies. This model assumes that the connection between discrepancies and size effect could be much more subtle.

The starting point of any meta-analytic model is the description of a model for the pieces of evidence at face value. In statistical terms, this means the likelihood of the parameter of interest. Let *y*_{1}, …, *y*_{N} and *SE*_{1}, …, *SE*_{N} be the reported effect sizes and their corresponding standard errors, we assume a normal likelihood of *θ*_{i} the treatment effect of study *i*:

If a prior assumption of exchangeability was considered reasonable, a random effects Bayesian model incorporates all the studies into a single model, where the *θ*_{1}, …, *θ*_{N} are assumed to be a random sample from a prior distribution with unknown parameters, which is known as a hierarchical model.

In this section we assume that exchangeability is unrealistic and we wish to learn how the un-observed treatment effects *θ*_{1}, …, *θ*_{N} are linked with some observed covariate *x*_{i}.

Let *x*_{i} be the number of observed discrepancies in the logarithmic scale. We propose to model the association between the treatment effect *θ*_{i} and the observed discrepancies *x*_{i} with the following HMR model:

where the non-observable variable *I*_{i} indicates if study *i* is at risk of bias:

The parameter *μ* corresponds to the mean treatment effect of studies with low risk of bias. We assume that in our context of application biased studies could report higher effect sizes and the biased mean *μ*_{biased} can be expressed as:

In this way, *K* measures the average amount of bias with respect to the mean effect *μ*. Eq. (4) also ensures that *μ* and *μ*_{biased} are identifiable parameters in this model. The parameter *τ* measures the between-studies variability in both components of the mixture distributions.

We model the probability that a study is biased as a function of *x*_{i} as follows:

In Eq. (5) positive values of *α*_{1} indicate that an increase in the number of discrepancies is associated with an increased risk of study bias.

In this HMR model the conditional mean is given by

Eqs. (5) and (6) can be calculated as functional parameters for a grid of values of *x*. Their posteriors intervals are calculated at each value of *x*.

This HMR not only quantifies the average bias *K* and the relationship between bias and discrepancies in Eq. (5), but also allows to correct the treatment effect *θ*_{i} by its propensity of being biased:

where the amount (*θ*_{i} − *K*) measures the bias of study *i* and Pr(*I*_{i} = 1|*x*_{i}) its propensity of being biased.

The HMR model presented above is completed by the following vague hyper-priors: For the regression parameters *α*_{0}, *α*_{1} ∼ *N*(0, 100). We give to the mean *μ*_{1} ∼ *N*(0, 100) and for the bias parameter *K* ∼ Uniform(0, 50). Finally, for the variability between studies we use *τ* ∼ Uniform(0, 100), which represent a vague prior within the range of possible study deviations.

The model presented in this section is mathematically non-tractable. We approximated the posterior distributions of the model parameters with Markov Chain Monte Carlo (MCMC) techniques implemented in *OpenBUGS*.

*BUGS* stands for *Bayesian Analysis Using Gibbs Sampling*, the *OpenBUGS* software constructs a Directed Acyclic Graph (DAG) representation of the posterior distribution of all model’s parameters. This representation allows to automatically factorize the DAG as a product of each node (parameters or data) conditionally on its parents and children. The software scans each node and proposes a method of sampling. The kernel of the Gibbs sampling is built upon this algorithm.

Computations were performed with the statistical language *R* and MCMC computations were linked to *R* with the package **R2OpenBUGS**. We used two chains of 20,000 iterations and we discarded the first 5000 for the burn-in period. Convergence was assessed visually by using the R package **coda**.

The diagonal panels of Figure 4 summarize the resulting posterior distributions for *μ*, *K*, *τ*, *α*_{0} and *α*_{1}. The posterior of *μ* clearly covers the zero indicating that the stem cells treatment is not effective. The bias parameter *K* indicates a considerable over-estimation of treatment effects reported for some trials. The posterior of *α*_{1} is concentrated in positive values, which indicates that an increase in discrepancies is associated with an increase of the risk of reporting bias. The posteriors of *α*_{0} and *α*_{1} also present a large variability, which is expected when a hidden effect is modeled.

Further results of the Hierarchical Meta-Regression model appears in Figure 5, where posteriors 95% intervals are plotted against the number of discrepancies. On the left panel, we can see the relationship between the number of discrepancies and the probability that a study is biased. We can observe an increase of probability with an increase of the number of discrepancies, but also a large amount of variability. On the right panel appears the conditional mean of effect size as a function of the number of discrepancies, which corresponds to Eq. (6). Our analysis shows that the 95% posterior intervals of the conditional mean covers the zero effect in most of the range of discrepancies. Only for studies with more than 33 (exp(3.5)) discrepancies the model predicts a positive effect. One interesting result of this analysis is, that a horizontal line which may represent a zero correlation is also predicted by the model. This means that the regression calculated directly from the aggregated data contains an ecological bias and it is misleading. We have added this regression line to the plot to highlight this issue.

The results presented so far indicate that increases in the amount of discrepancies increases the propensity of bias. The question is: How can we correct a particular study for its bias? Eq. (7) gives the bias correction of treatment effect in this HMR model.

In Figure 6 we can see HMR bias correction in action. We display two studies which have 21 and 18 discrepancies respectively. The solid lines correspond to the likelihood functions of these studies. These likelihoods represent the information of the effect size at face value. The dashed lines correspond to the posterior treatment effects after bias correction. Clearly, we can see a strong bias correction with the conclusion of no treatment effect.

## 3. Hierarchical Meta-Regression analysis for diagnostic test data

In meta-analysis of diagnostic test data, the pieces of evidence that we aim to combine are the results of *N* diagnostic studies, where results of the *i*th study (*i* = 1, …, *N*) are summarized in a 2 × 2 table as follows:

Patient status | |||
---|---|---|---|

With disease | Without disease | ||

Test | + | tp_{i} | fp_{i} |

Outcome | − | fn_{i} | tn_{i} |

Sum: | n_{i,1} | n_{i,2} |

where *tp*_{i} and *fn*_{i} are the number of patients with positive and negative diagnostic results from *n*_{i,1} patients with disease, and *fp*_{i} and *tn*_{i} are the positive and negative diagnostic results from *n*_{i,2} patients without disease.

Assuming that *n*_{i,1} and *n*_{i,2} have been fixed by design, we model the *tp*_{i} and *fp*_{i} outcomes with two independent Binomial distributions:

where TPR_{i} is the true positive rate or sensitivity, Se_{i}, of study *i* and FPR_{i} is the false positive rate or complementary specificity, i.e., 1 − Sp_{i}.

At face value, diagnostic performance of each study is summarized by the empirical true positive rate and true negative rate or specificity

and the complementary empirical rates of false positive rate and false negative diagnostic results,

In this type of meta-analysis we could separately model TPR_{i} and FPR_{i} (or Sp_{i}), but this approach ignores that these rates could be correlated by design. Therefore, it is more sensible to handle TPR_{i} and FPR_{i} jointly.

We define the random effect *D*_{i} which represents the study effect associated with the diagnostic discriminatory power:

However, diagnostic results are sensitive to diagnostic settings (e.g., the use of different thresholds) and to populations where the diagnostic procedure under investigation is applied. These issues are associated with the *external validity* of diagnostic results. To model external validity bias we introduce the random effect *S*_{i}:

This random effect quantifies variability produced by patients’ characteristics and diagnostic setup, that may produce a correlation between the observed *S*_{i} **the threshold effect** of study *i* and it represents an adjustment of external validity in the meta-analysis.

We could assume exchangeability of pairs (*D*_{i}, *S*_{i}), but study’s quality is known to be an issue in diagnostic studies. For this reason we model the *internal validity* of a study by introducing random weights *w*_{1}, …, *w*_{N}. Conditionally to a study weight *w*_{i}, the study effects *D*_{i} and *S*_{i} are modeled as exchangeable between studies and they follow a *scale-mixture of bivariate Normal* distributions with the following mean and variance:

and scale mixing density

The inclusion of the random weights *w*_{i} into the model was proposed by [5]. This approach was generalized in [6] in two ways: firstly, by splitting *w*_{i} in two weights *w*_{1,i} and *w*_{2,i} corresponding to each component *D*_{i} and *S*_{i} respectively. Secondly, by putting a prior on the degrees of freedom parameter *ν*, which corresponds to an adaptive robust distribution of the random-effects.

The Hierarchical Meta-Regression representation of the model introduced above is the model based on the conditional distribution of (*D*_{i}|*S*_{i} = *x*) and the marginal distribution of *S*_{i}. This HMR model was introduced by [7], who followed the stepping stones of the classical Summary Receiving Operating Characteristic (SROC) [8].

The conditional mean of (*D*_{i}|*S*_{i} = *x*) is given by:

where the functional parameters *A* and *B* are

We define the *Bayesian SROC Curve* (BSROC) by transforming back results from (*S*, *D*) to (FPR, TPR) with

where g(p) is the logit(p) transformation, i.e. logit(p) = log(p/(1 − p)).

The BSROC curve is obtained by calculating TPR in a grid of values of FPR which gives a posterior conditionally on each value of FPR. Therefore, it is straightforward to give credibility intervals for the BSROC for each value of FPR.

One important aspect of the BSROC is that it incorporates the variability of the model’s parameters, which influences the width of its credibility intervals. In addition, given that FPR is modeled as a random variable, the curve is corrected by measurement error bias in FPR.

Finally, we can define a *Bayesian Area Under the SROC Curve* (BAUC) by numerically integrating the BSROC for a range of values of the FPR:

In some applications it is recommend to use the limits of integration within the observed values of

In order to make this complex HMR model applicable in practice, we have implemented the model in the R’s package **bamdit**, which uses the following set of hyper-priors:

and

The correlation parameter *ρ* is transformed by using the Fisher transformation,

and a Normal prior is used for *z*:

Modeling priors in this way guarantees that in each MCMC iteration the variance-covariance matrix of the random effects *θ*_{1} and *θ*_{2} is positive definite. The values of the constants *m*_{1}, *v*_{1}, *m*_{2}, *v*_{2}, *u*_{1}, *u*_{2}, *m*_{r} and *v*_{r} have to be given. They can be used to include valid prior information which might be empirically available or they could be the result of expert elicitation. If such information is not available, we recommend setting these parameters to values that represent weakly informative priors. In this work, we use *m*_{1} = *m*_{2} = *m*_{r} = 0, *v*_{1} = *v*_{2} = 1, u_{1} = u_{2} = 5 and

These values are fairly conservative, in the sense that they induce prior uniform distributions for *TPR*_{i} and *FPR*_{i}. They give locally uniform distributions for *μ*_{1} and *μ*_{2}; uniforms for *σ*_{1} and *σ*_{2}; and a symmetric distribution for *ρ* centered at 0.

Figure 7 summarizes the meta-analysis results of fitting the bivariate random-effect model to the computer tomography diagnostic data. The Bayesian Predictive Surface are presented by contours at different credibility levels and compare these curves with the observed data represented by the circles with varying diameters according to the sample size of each study. The scattered points are samples from the predictive posteriors and the histograms correspond to the posterior predictive marginals. This result was generated by using the functions**bamdit**.

Figure 8 displays the posteriors of each components’ weights. The left panel shows that prospective studies number 25 and 33 deviate with respect to the prior mean of 1, while on the right panel we see that a prospective study (number 47) and five retrospective studies (number 1, 3, 4, 8 and 29) have substantial variability.

An important aspect of *w*_{i} is its interpretation as **estimated bias correction**. *A priori* all studies included in the review have a mean of *E*(*w*_{i}) = 1. We can expect that studies which are unusually heterogeneous will have posteriors substantially greater than 1. Unusual studies’ results could be produced by factors that may affect the quality of the study, such as errors in recording diagnostic results, confounding factors, loss to follow-up, etc. For that reason, the studies’ weights *w*_{i} can be interpreted as an adjustment of studies’ **internal validity bias**.

The BSROC curve and its area under the curve are presented in Figure 9. The left panel shows this HMR as a meta-analytic summary for this data. On the right panel the posterior distribution of the BAUC show quite a high diagnostic ability for computer tomography scans as diagnostic of appendicitis.

## 4. Conclusions

In this work we have seen the HMR in action. This approach of meta-analysis is based on a simple strategy: two sub-models are defined in the meta-analysis, one which models the problem of interest, for instance the treatment effect, and one which handles the multiplicity of bias. The meta-analysis is summarized by understanding how these components interact with each other.

The examples presented in this work have shown that we could have misleading conclusions from indirect evidence, if it were analyzed as directly contributing to the problem of interest.

For instance, in the first example, Section 2, we have seen in Figure 1 that pooling studies gave a wrong conclusion about the effect of stem cells treatment. The positive correlation between the aggregated effect size and the number of discrepancies exaggerates its relationship.

Actually, in Figure 5 the HMR has shown that it is possible to simultaneously have a zero correlation between effect size and discrepancies while still having a risk of reporting bias. In addition, the HMR allows to extract the amount of bias in the meta-analysis and to correct the treatment effect at the level of the study (Figure 6).

In the second example, Section 3, biases come from the external validity of diagnostic studies and the internal validity due to their quality. In this example the HMR showed that it was possible to simultaneously model these two types of subtle biases.

To account for internal validity bias, the application of a scale mixture of normal distributions allows us to detect conflictive studies, which can be considered as outliers. The Bayesian Summary Receiving Operative Curve accounts for the external validity bias due to changes in factors that affected the diagnostic results. In addition, the posterior for its Area Under the Curve (AUC) summarizes the results of the meta-analysis.

## Acknowledgments

This work was supported by the German Research Foundation project DFG VE 896 1/1.

Trial ID | Effect size | SE (effect size) | Sample size | Number of discrepancies | Author or principal investigator | Year | Country |
---|---|---|---|---|---|---|---|

t01 | 1.5 | 3.67 | 21 | 17 | Quyyumi | 2011 | USA |

t02 | 1.1 | 2.09 | 100 | 7 | Lunde | 2007 | Norway |

t03 | −1.7 | 2.91 | 23 | 7 | Srimahachota | 2011 | Thailand |

t05 | 0.8 | 2.78 | 60 | 4 | Meyer | 2006 | Germany |

t06 | 7 | 0.63 | 40 | 4 | Meluzín | 2006 | Czech Republic |

t09 | 7.8 | 2.76 | 38 | 21 | Piepoli | 2010 | Italy |

t11 | 14 | 4.05 | 20 | 13 | Suárez de Lezo | 2007 | Spain |

t12 | 5.4 | 2.44 | 77 | 18 | Huikuri HV | 2008 | Finland |

t13 | 2.7 | 1.2 | 82 | 16 | Perin | 2012 | USA |

t15 | 4.1 | 0.98 | 46 | 0 | Assmus | 2006 | Germany |

t16 | 2.2 | 0.88 | 79 | 27 | Assmus | 2013 | Germany |

t17 | 2.5 | 1.01 | 187 | 11 | Assmus | 2010 | Germany |

t18 | 2.5 | 3.96 | 20 | 2 | Hendrikx | 2006 | Belgium |

t19 | −0.2 | 1.17 | 127 | 3 | Hirch | 2011 | The Netherlands |

t20 | −2 | 2.09 | 20 | 0 | Perin | 2012 | USA |

t23 | −3 | 2.03 | 81 | 0 | Traverse | 2011 | USA |

t24 | 3.6 | 2.52 | 17 | 2 | Ang | 2008 | UK |

t25 | 4 | 1.28 | 40 | 3 | Rodrigo | 2012 | The Netherlands |

t26 | −1.5 | 1.75 | 66 | 8 | Herbolts | 2009 | Belgium |

t29 | 8.8 | 6.07 | 10 | 6 | Castellani | 2010 | Italy |

t30 | 4 | 3.89 | 14 | 2 | Maureira | 2012 | France |

t31 | −0.2 | 1.54 | 183 | 19 | Ribero dos Santos | 2012 | Brazil |

t32 | −3.2 | 3.63 | 40 | 7 | Traverse | 2010 | USA |

t35 | 10.1 | 1.21 | 20 | 11 | Patel | 2004 | USA |

t38 | 5.4 | 2.54 | 27 | 15 | Tse | 2007 | Hong Kong |

t42 | 3.5 | 1.04 | 86 | 6 | Cao | 2009 | China |

t45 | 1.25 | 1.56 | 118 | 9 | Sürder | 2013 | Switzerland |

t46 | 6.7 | 3.16 | 20 | 2 | Ge | 2006 | China |

t47 | −0.1 | 2.03 | 112 | 2 | Traverse | 2012 | USA |

t48 | −3.9 | 2.62 | 40 | 1 | Wöhrle | 2010 | Germany |

t49 | 10.4 | 1.01 | 116 | 55 | Yousef (Strauer) | 2009 | Germany |