Open access peer-reviewed chapter

Node-Level Conflict Measures in Bayesian Hierarchical Models Based on Directed Acyclic Graphs

Written By

Jørund I. Gåsemyr and Bent Natvig

Submitted: 12 December 2016 Reviewed: 08 June 2017 Published: 02 November 2017

DOI: 10.5772/intechopen.70058

From the Edited Volume

Bayesian Inference

Edited by Javier Prieto Tejedor


Abstract

Over the last decades, Bayesian hierarchical models defined by means of directed, acyclic graphs have become an essential and widely used methodology in the analysis of complex data. Simulation-based model criticism in such models can be based on conflict measures constructed by contrasting separate local information sources about each node in the graph. An initial suggestion of such a measure was not well calibrated. This shortcoming has, however, to a large extent been rectified by subsequently proposed alternative mutually similar tail probability-based measures, which have been proved to be uniformly distributed under the assumed model under various circumstances, and in particular, in quite general normal models with known covariance matrices. An advantage of this is that computationally costly precalibration schemes needed for some other suggested methods can be avoided. Another advantage is that noninformative prior distributions can be used when performing model criticism. In this chapter, we describe the basic framework and review the main uniformity results.

Keywords

  • cross-validation
  • data splitting
  • information contribution
  • MCMC
  • model criticism
  • pivotal quantity
  • preexperimental distribution
  • p-value

1. Introduction

Over the last decades, Bayesian hierarchical models have become an essential and widely used methodology in the analysis of complex data. Computational techniques such as Markov chain Monte Carlo (MCMC) methods make it possible to treat very complex models and data structures. Analysis of such models gives intuitively appealing Bayesian inference based on posterior probability distributions for the parameters.

In the construction of such models, an understanding of the underlying structure of the problem can be represented by means of directed acyclic graphs (DAGs), with nodes in the graph corresponding to data or parameters, and directed edges between parameters representing conditional distributions. However, a perfect understanding of the underlying structure is usually an unachievable goal, and there is always a danger of constructing inadequate models. Box [1] suggests a pattern for the model building process where an initial candidate model is assessed for adequacy, and if necessary modified and elaborated on, leading to a new candidate that again is checked for adequacy, and so on. As a tool in this model criticism process, Ref. [1] suggests using the prior predictive distribution of some checking function or test statistic as a reference for the observed value of this checking function, resulting in a prior predictive p-value. This requires an informative and realistic prior distribution, which is not always available or even desirable. Indeed, as pointed out in Ref. [2], in an early phase of the model building process, it is often convenient to use noninformative or even improper priors and thus avoid costly and time-consuming elicitation of prior information. Noninformative priors may be used also for the inference because relevant prior information is unavailable.

There exist many other methods for checking the overall fit of the model, or an aspect of the model of special interest, based on locating a test statistic or a discrepancy measure in some kind of reference distribution. The posterior predictive p-value (ppp) of Ref. [3] uses the posterior distribution as reference and does not require informative priors. But this method uses data twice and can as a result be very conservative [2, 4–6]. Hjort et al. [5] suggest remedying this by using the ppp value as a test statistic in a prior predictive test. The computation of the resulting calibrated cppp-value is, however, very computer intensive in the general case, and again realistic, informative priors are needed. A node-level discrepancy measure suggested in Ref. [7] is subject to the same limitations. The partial posterior predictive p-value of Ref. [4] avoids double use of data and allows noninformative priors but may be difficult to compute and interpret in hierarchical models.

Comparison with other candidate models through a technique for model comparison or model choice, such as predictive methods, maximum posterior probability, Bayes factors or an information criterion, can also serve as tools for checking model adequacy indirectly when alternative candidate models exist.

In this chapter, we will, however, focus on methods for criticizing models in the absence of any particular alternatives. We will review methods for checking the modeling assumptions at each node of the DAG. The aim is to identify parts or building blocks of the model that are in discordance with reality, which may be in need of adjustment or further elaboration. O’Hagan [8] regards any node in the graph as receiving information from two disjoint subsets of the neighboring nodes. This information is represented as a conditional probability density or a likelihood or as a combination of these two kinds of information sources. Adopting the same basic perspective, our aim is to check for inconsistency between such subsets. The suggestion in Ref. [8] is to normalize these information sources to have equal height 1 and to regard the height of the curves at the point of intersection as a measure of conflict. However, as shown in Ref. [2], this measure tends to be quite conservative. Dahl et al. [9] demonstrated that it is also poorly calibrated, with false warning probabilities that vary substantially between models. Dahl et al. [9] also identified the different sources of inaccuracy and modified the measure of Ref. [8] to an approximately χ2-distributed quantity under the assumed model by instead normalizing the information sources to probability densities. In Ref. [10], these densities were instead used to define tail probability-based conflict measures. Gåsemyr and Natvig [10] showed that these measures are uniformly distributed in quite general hierarchical normal models with fixed variances/covariances. In Ref. [11], such uniformity results were proved in various situations involving nonnormal and nonsymmetric distributions. These uniformity results indicate that the measures of Refs. [9] and [10] have comparable interpretations across different models. Therefore, they can be used without computationally costly precalibration schemes, such as the one suggested in Ref. [5]. Gåsemyr [12] focuses on some situations where the conflict measure approach can be directly compared to the calibration method of Ref. [5] and shows that the less computer-intensive conflict measure approach performs at least as well in these situations. Moreover, the conflict measure approach can be applied in models using noninformative prior distributions.

Focusing on the special problem of identifying outliers among the second-level parameters in a random-effects model, Ref. [13] defines similar conflict measures. In this setting, the group-specific means are the nodes of interest. In some models, there exist sufficient statistics for these means. Then, outlier detection at the group level can also be based on cross-validation, measuring the tail probability beyond the observed value of the statistic in the posterior predictive distribution given data from the other groups. In this context, the conflict measure approach can be viewed as an extension of cross-validation to situations where sufficient statistics do not exist. In Ref. [13], applications are also given to the examination of exceptionally high hospital mortality rates and to the results of a vaccination program. In Ref. [14], this methodology is used to check for inconsistency in multiple treatment comparisons of randomized clinical trials. Presanis et al. [15] apply these conflict measures in complex cases of medical evidence synthesis.


2. Directed acyclic graphs and node-specific conflict

2.1. Directed acyclic graphs and Bayesian hierarchical models

An example of a DAG discussed extensively in Ref. [8] is the random-effects model with normal random effects and normal error terms defined by

$$Y_{i,j} \sim N(\lambda_i, \sigma^2), \qquad \lambda_i \sim N(\mu, \tau^2), \qquad j = 1, \dots, n_i, \quad i = 1, \dots, m. \tag{1}$$

In general, we identify the nodes or vertices of the graph with the unknown parameters θ and the observed data y, the latter appearing as bottom nodes and being the realizations of the random vector Y. In the Bayesian model, the parameters, the components of θ, are also considered as random variables. In general, if there is a directed edge from node a to node b, then a is a parent of b, and b is a child of a. We denote by Ch(a) the set of child nodes of a, and by Pa(b) the set of parent nodes of b. More generally, b is a descendant of a if there is a directed path from a to b. The set of descendants of a is denoted by Desc(a) and, for convenience, is defined to contain a itself. The directed edges encode conditional independence assumptions, indicating that, given its parents, a node is assumed to be independent of all other nondescendants. Hence, writing θ = (ν, μ), with μ representing the vector of top-level nodes, the joint density of (Y, θ) = (Y, ν, μ) is

$$p(y, \nu, \mu) = \prod_{y \in \mathbf{y}} p(y \mid \mathrm{Pa}(y)) \prod_{\nu \in \boldsymbol{\nu}} p(\nu \mid \mathrm{Pa}(\nu))\, \pi(\mu), \tag{2}$$

where π(μ) is the prior distribution of μ. The posterior distribution π(θ|y) is the basis for the inference.
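As a concrete illustration of the factorization in Eq. (2), the following minimal Python sketch evaluates the log joint density for the random-effects model of Eq. (1). It assumes, for illustration only, the flat improper prior π(μ) = 1 (as used in Example 1 below); all function and variable names are our own.

```python
# Log joint density of the DAG factorization in Eq. (2), specialized to
# the random-effects model of Eq. (1); a sketch with illustrative names.
import numpy as np
from scipy.stats import norm

def log_joint(y, lam, mu, sigma2, tau2):
    """log p(y, lambda, mu) = sum_{i,j} log p(y_ij | lambda_i)
    + sum_i log p(lambda_i | mu) + log pi(mu), with pi(mu) = 1 (flat)."""
    lp = 0.0
    for i, y_i in enumerate(y):                      # data level
        lp += norm.logpdf(y_i, loc=lam[i], scale=np.sqrt(sigma2)).sum()
    lp += norm.logpdf(lam, loc=mu, scale=np.sqrt(tau2)).sum()  # parameter level
    return lp                                        # log pi(mu) = 0 (flat prior)

# Example: m = 3 groups with n_i = 4 observations each
rng = np.random.default_rng(0)
mu, tau2, sigma2 = 0.0, 1.0, 0.5
lam = rng.normal(mu, np.sqrt(tau2), size=3)
y = [rng.normal(l, np.sqrt(sigma2), size=4) for l in lam]
print(log_joint(y, lam, mu, sigma2, tau2))
```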

This setup can be generalized in various directions. The nodes may be allowed to represent vectors, at both the parameter and the data levels [10]. Instead of DAGs, one may consider chain graphs, as described in Ref. [16], with undirected edges representing mutual dependence as in Markov random fields. Scheel et al. [17] introduce a graphical diagnostic for model criticism in such models.

2.2. Information contributions

The representation of a Bayesian hierarchical model in terms of a DAG is often meant to reflect an understanding of the underlying structure of the problem. By looking for a conflict associated with the different nodes in the DAG, we may therefore put our understanding of this structure to test. We may also identify parts of the model that need adjustment.

The idea put forward in Ref. [8] is that for each node λ in a DAG one may in general think of each neighboring node as providing information about λ and that it is of interest to consider the possibility of conflict between different sources of information. For instance, one may want to contrast the local prior information provided by the factor p(λ|Pa(λ)) with the likelihood information source formed by multiplying the factors p(γ|Pa(γ)) for all child nodes γ ∈ Ch(λ). The full conditional distribution of λ given all the observed and unobserved variables in the DAG, i.e.,

$$\pi(\lambda \mid (y, \theta)_{-\lambda}) \propto p(\lambda \mid \mathrm{Pa}(\lambda)) \prod_{\gamma \in \mathrm{Ch}(\lambda)} p(\gamma \mid \mathrm{Pa}(\gamma)), \tag{3}$$

is determined by these two types of factors. Here, $(y, \theta)_{-\lambda}$ denotes the vector of all components of $(y, \theta)$ except for λ.

Dahl et al. [9] normalize the product $\prod_{\gamma \in \mathrm{Ch}(\lambda)} p(\gamma \mid \mathrm{Pa}(\gamma))$ to a probability density function denoted by $f_c(\lambda)$, the likelihood or child node information contribution, whereas the local prior density is denoted by $f_p(\lambda)$ and called the prior or parent node information contribution. These information contributions are integrated with respect to posterior distributions for the unknown nuisance parameters to form integrated information contribution (iic) densities, denoted by $g_c$ and $g_p$. In this construction, a key to avoiding the conservatism of the measure suggested in Ref. [8] is to prevent dependence between the two information sources by introducing a suitable data splitting $Y = (Y_p, Y_c)$ and conditioning the parameters of $f_p$ on $y_p$ and the parameters of $f_c$ on $y_c$.

Definition 1 For a given parameter node λ, denote by $\beta_p$ the vector whose components are the nodes in Pa(λ), and by $\beta_c$ the vector whose components are given by

$$\bigcup_{\gamma \in \mathrm{Ch}(\lambda)} \big(\{\gamma\} \cup \mathrm{Pa}(\gamma)\big) \setminus \{\lambda\} = \mathrm{Ch}(\lambda) \cup \big[\mathrm{Pa}(\mathrm{Ch}(\lambda)) \setminus \{\lambda\}\big]. \tag{4}$$

Let $Y = (Y_p, Y_c)$ be a splitting of the data Y. Define the densities $f_p$ and $f_c$, the prior and likelihood information contributions, respectively, by

$$f_p(\lambda; \beta_p) = p(\lambda \mid \beta_p), \qquad f_c(\lambda; \beta_c) \propto \prod_{\gamma \in \mathrm{Ch}(\lambda)} p(\gamma \mid \mathrm{Pa}(\gamma)). \tag{5}$$

Define the integrated information contribution densities gp, gc by

$$g_p(\lambda) = \int f_p(\lambda; \beta_p)\, \pi(\beta_p \mid y_p)\, d\beta_p, \qquad g_c(\lambda) = \int f_c(\lambda; \beta_c)\, \pi(\beta_c \mid y_c)\, d\beta_c, \tag{6}$$

and denote by Gp, Gc the corresponding cumulative distribution functions.

Note that βc may contain data nodes. The second integral in Eq. (6) is then taken only with respect to the random components of βc, i.e., the parameters in βc. If βc contains no parameters, then gc and fc coincide. Definition 1 may also be extended to the case when λ is a vector, corresponding to a subset of parameter nodes.

Combining the set of information sources linked to a specific node in different ways leads to a modification of Definition 1 in which βc does not contain all child nodes of λ, the others being instead included in βp together with their parent nodes. In this way, different types of conflict about the node may be revealed. This is natural, e.g., in the context of outlier detection among independent observations with a common mean. Note that βp and βc may then be overlapping, containing common coparents with λ. The setup is illustrated in Figure 1 in the case when the set of common components, by abuse of notation denoted by $\beta_p \cap \beta_c$, is empty. For the general setup, Definition 1 is modified as follows.

Figure 1.

Part of a DAG showing information sources about λ.

Definition 2 Let γ be a vector whose components are a subset of Ch(λ), and define βc as in Eq. (4). Denote by γ1 the rest of the child nodes of λ, and let βp consist of γ1 together with its parent nodes in the same way as in Eq. (4), as well as Pa(λ). The information contributions are then given by

$$f_p(\lambda; \beta_p) \propto p(\gamma_1 \mid \mathrm{Pa}(\gamma_1))\, p(\lambda \mid \mathrm{Pa}(\lambda)), \tag{7}$$
$$f_c(\lambda; \beta_c) \propto p(\gamma \mid \mathrm{Pa}(\gamma)). \tag{8}$$

In Eq. (7), $p(\lambda \mid \mathrm{Pa}(\lambda))$ is replaced by the prior density π(λ) if λ is a top-level parameter. The corresponding iic densities are defined by Eq. (6) as before.

2.3. Node-specific conflict measures

The conflict measure $c_\lambda^2$ of Ref. [9] is defined as

$$c_\lambda^2 = \big(E_{G_p}(\lambda) - E_{G_c}(\lambda)\big)^2 \big/ \big(\mathrm{var}_{G_p}(\lambda) + \mathrm{var}_{G_c}(\lambda)\big). \tag{9}$$

The $\chi_1^2$-distribution is the reference distribution for this measure. For the conflict measures of Ref. [10], the uniform distribution on [0, 1] is the reference distribution. They focus on tail behavior but are based on the same iic distributions. The general division of information sources given in Definition 2 is also introduced in Ref. [10]. For a given pair $G_p$, $G_c$ of iic distributions, let $\lambda_p^*$ and $\lambda_c^*$ be independent samples from $G_p$ and $G_c$, respectively. Let G be the cumulative distribution function for $\delta = \lambda_p^* - \lambda_c^*$. Define

$$c_\lambda^{3+} = G(0), \qquad c_\lambda^{3-} = \bar{G}(0) \stackrel{\mathrm{def}}{=} 1 - G(0), \tag{10}$$

and

$$c_\lambda^{3} = 1 - 2\min\big(G(0), \bar{G}(0)\big) = 2\,\big|G(0) - 1/2\big|. \tag{11}$$

The $c_\lambda^{3+}$-measure and the $P_\lambda^{\mathrm{conf}}$-measure of Ref. [13] are very similar. The latter measure is aimed at detecting outlying groups or units in a three-level hierarchical model, with the second-level parameters being location parameters for group-specific data. However, that measure is interpreted as a p-value, with small values indicative of conflict. Gåsemyr and Natvig [10] also define a measure based on a tail area defined in terms of the density g of G, namely

$$c_\lambda^4 = P_G\big(g(\delta) > g(0)\big), \tag{12}$$

applicable also when λ is a vector.
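Given independent samples from $G_p$ and $G_c$ (obtained, for instance, as described in Section 2.4 below), all of these measures can be estimated by straightforward Monte Carlo. The following sketch assumes a scalar λ and uses a kernel density estimate in place of the density g of G; the function and variable names are illustrative.

```python
# Monte Carlo estimation of the conflict measures in Eqs. (10)-(12)
# from samples of the two iic distributions; a sketch for scalar lambda.
import numpy as np
from scipy.stats import gaussian_kde

def conflict_measures(lam_p, lam_c):
    """lam_p, lam_c: independent samples from G_p and G_c."""
    delta = lam_p - lam_c                    # samples from G
    G0 = np.mean(delta <= 0.0)               # estimate of G(0)
    c3_plus, c3_minus = G0, 1.0 - G0         # Eq. (10)
    c3 = 2.0 * abs(G0 - 0.5)                 # Eq. (11)
    kde = gaussian_kde(delta)                # kernel density estimate of g
    c4 = np.mean(kde(delta) > kde(0.0))      # Eq. (12)
    return c3_plus, c3_minus, c3, c4
```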

Example 1. To illustrate the theory, consider the random-effects model of Eq. (1), with the variance parameters σ², τ² assumed known, and with μ having the improper prior π(μ) = 1. For simplicity, assume $n_i = n$ for all i. Suspecting the mth group of representing an outlier, let $\lambda = \lambda_m$ be the node of interest. Define the data splitting $(Y_p, Y_c)$ by letting $Y_c = Y_m = (Y_{m,1}, \dots, Y_{m,n})$, and let $\beta_c = y_c$, $\beta_p = \mu$. Denoting the normal density function by φ, it is easy to see that

$$g_c(\lambda) = f_c(\lambda) = \varphi\big(\lambda;\ \bar{y}_c,\ \sigma^2/n\big).$$

Furthermore, $f_p(\lambda; \mu) = \varphi(\lambda;\ \mu,\ \tau^2)$. Given $y_p$, μ has the density

$$\pi(\mu \mid y_p) = \varphi\Big(\mu;\ \textstyle\sum_{i=1}^{m-1} \bar{y}_i/(m-1),\ \big(1/(m-1)\big)\tau^2 + \big(1/(n(m-1))\big)\sigma^2\Big).$$

By a standard argument,

$$g_p(\lambda) = \int f_p(\lambda; \mu)\, \pi(\mu \mid y_p)\, d\mu = \varphi\Big(\lambda;\ \textstyle\sum_{i=1}^{m-1} \bar{y}_i/(m-1),\ \big(1 + 1/(m-1)\big)\tau^2 + \big(1/(n(m-1))\big)\sigma^2\Big).$$

It follows that

$$g(\delta) = \varphi\Big(\delta;\ \textstyle\sum_{i=1}^{m-1} \bar{y}_i/(m-1) - \bar{y}_c,\ \big(m/(m-1)\big)\big(\tau^2 + \sigma^2/n\big)\Big).$$

The conflict measures (Eqs. (9), (10), (11), and (12)) can hence be calculated analytically, with no simulation needed in this case.
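Since every distribution involved in Example 1 is normal, the measures can be coded directly from the closed-form expressions above. The sketch below uses illustrative names and exploits the fact that for a symmetric, unimodal g the c3- and c4-measures coincide (cf. Section 5.3).

```python
# Closed-form conflict measures for Example 1 (normal random-effects
# model with known sigma^2 and tau^2); a sketch with illustrative names.
import numpy as np
from scipy.stats import norm

def example1_conflicts(ybar, sigma2, tau2, n):
    """ybar: the m group means; the last group is the suspected outlier."""
    m = len(ybar)
    mean_delta = ybar[:-1].mean() - ybar[-1]             # E_G(delta)
    var_delta = (m / (m - 1.0)) * (tau2 + sigma2 / n)    # var_G(delta)
    G0 = norm.cdf(0.0, loc=mean_delta, scale=np.sqrt(var_delta))
    c3 = 2.0 * abs(G0 - 0.5)    # Eq. (11)
    c4 = c3                     # g is symmetric and unimodal here
    return G0, c3, c4
```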

In a simulation study of the $c_\lambda^2$-measure in Ref. [9], using a warning level equal to the 95% quantile of the $\chi_1^2$-distribution, a false warning probability close to 5% is obtained for a normal random-effects model with unknown variance parameters as in Eq. (1), and also in similar random-effects models with heavy-tailed t-distributed and uniformly distributed random effects. Also with respect to detection power, this measure performs well when compared to a calibrated version of the measure given in Ref. [8], provided that an optimal data splitting is used. Refs. [10] and [11] prove preexperimental uniformity of the conflict measures in various situations, i.e., that their distributions, as functions of a data vector Y distributed according to the assumed model, are uniform regardless of the true value of the basic parameter. Another way of stating this is that we obtain a proper p-value by subtracting these measures from 1. These results are reviewed in Section 5 of the present chapter.

2.4. Integrated information contributions as posterior distributions

In most cases, the conflict measures of Refs. [9] and [10] are based on simulated samples from $G_p$ and $G_c$. Definitions 1 and 2 suggest obtaining such samples by running an MCMC algorithm to generate posterior samples of the unknown parameters in $\beta_p$ and $\beta_c$ and then generating samples $\lambda_p^*$ and $\lambda_c^*$ from the respective information contributions for each such parameter sample. If the information contributions are standard probability densities, this procedure is straightforward. If not, one may often instead use the fact that, under certain conditions on the data splitting, the distributions $G_p$ and $G_c$ are posterior distributions conditional on $y_p$ and $y_c$, respectively, the latter based on the improper prior π(λ) = 1, independently of the coparents.
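The two-stage sampling scheme just described can be written generically as in the following sketch, where `posterior_draws` stands in for the output of any MCMC run and all names are illustrative.

```python
# Two-stage sampling of an iic distribution: draw beta* ~ pi(beta | y_part)
# (e.g., by MCMC), then lambda* ~ f(.; beta*) for each draw; a sketch.
import numpy as np

def sample_iic(posterior_draws, sample_info_contribution, rng):
    """posterior_draws: iterable of parameter draws beta*.
    sample_info_contribution(beta, rng): one draw lambda* ~ f(.; beta)."""
    return np.array([sample_info_contribution(b, rng) for b in posterior_draws])

# Example: the prior iic of Example 1, where beta_p = mu and
# f_p(lambda; mu) is the N(mu, tau^2) density
rng = np.random.default_rng(1)
tau2 = 1.0
mu_draws = rng.normal(0.2, 0.3, size=5000)     # stand-in for MCMC output
lam_p = sample_iic(mu_draws, lambda mu, r: r.normal(mu, np.sqrt(tau2)), rng)
```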

Theorem 1 Suppose that the data splitting satisfies

$$Y_c = Y \cap \Big[\bigcup_{\gamma \in \mathrm{Ch}(\lambda) \cap \beta_c} \mathrm{Desc}(\gamma)\Big], \qquad Y_p = Y \setminus Y_c, \tag{13}$$

the latter expression by abuse of notation meaning the components of Y not present in $Y_c$. Assume that λ and the coparents $\mathrm{Pa}(\mathrm{Ch}(\lambda) \cap \beta_p) \setminus \{\lambda\}$ are independent. We then have

$$g_p(\lambda) = \pi(\lambda \mid y_p)$$

and, specifying the prior density $\pi\big(\lambda \mid \mathrm{Pa}(\mathrm{Ch}(\lambda) \cap \beta_c) \setminus \{\lambda\}\big) = 1$,

$$g_c(\lambda) = \pi(\lambda \mid y_c). \tag{14}$$

The proof is given in Appendix A in the online supporting information for Ref. [11]. Specializing to the standard setup of Definition 1, where $\mathrm{Ch}(\lambda) \subseteq \beta_c$, we see that the requirement for Eq. (13) to hold is that $Y_c$ consists of all data descendant nodes of λ. In Ref. [9], this splitting was compared with two other splittings for $c_\lambda^2$ and found to be optimal with respect to detection power. Under this splitting, the measure was also found to be well calibrated.


3. Noninvariance and reparametrizations

The iic distributions and the corresponding conflict measures are parametrization dependent. Based on experience so far, the conflict measures seem to be fairly robust to changes in parametrization. However, this noninvariance can be handled in a theoretically satisfactory way under certain circumstances.

Let ϕ be the parameter, in a standard parametrization, corresponding to a specific node in the DAG. Suppose for simplicity that $Y_c = \mathrm{Ch}(\phi)$. Assume that there exists a sufficient statistic $Y_c$ and an alternative parametrization λ, given as a strictly monotonic function λ(ϕ), such that $Y_c - \lambda$ is a pivotal quantity, i.e., the density for $Y_c$ given λ is of the form

$$p(y_c \mid \lambda) = f_{Y_c}(y_c \mid \lambda) = f_0(y_c - \lambda) \tag{15}$$

for some known density function $f_0$. Such a parametrization will be considered a canonical or reference parametrization if it exists, as opposed to the standard parametrization involving ϕ. Accordingly, the conflict measures given in Eqs. (9)–(12) are preferably based on this reference parametrization.

By Theorem 1, samples $\lambda_c^*$ from $G_c$ may be obtained by MCMC as posterior samples from $\pi(\lambda \mid y_c)$ when the splitting satisfies Eq. (13) and the prior for λ satisfies Eq. (14), i.e., equals 1. According to an argument given in Section 1.3 of Ref. [18], such a prior expresses noninformativity for likelihoods of the form of Eq. (15). Computationally, we may, however, use the standard parametrization. When generating $\phi_c^*$ as posterior samples from $\pi(\phi \mid y_c)$, the prior density $|d\lambda/d\phi|$ for ϕ must be used. Then, we may calculate $\lambda_c^* = \lambda(\phi_c^*)$. To represent the iic distribution $G_p(\lambda)$, we may calculate $\lambda_p^* = \lambda(\phi_p^*)$ for samples $\phi_p^*$ from $\pi(\phi \mid y_p)$ according to the given model. Now, the $c_\lambda^4$-measure can be estimated from Eq. (12), using a kernel density estimate of g(δ) based on corresponding samples $\delta^* = \lambda_p^* - \lambda_c^*$. However, if we limit attention to the $c_\lambda^3$-measure (Eq. (11)) and its one-sided versions (Eq. (10)), we may use the samples from $\pi(\phi \mid y_c)$ and $\pi(\phi \mid y_p)$ directly. To see this, note that the condition $\lambda_p^* \le \lambda_c^*$ is equivalent to the condition $\phi_p^* \le \phi_c^*$ (assuming that λ is increasing as a function of ϕ). Hence, the probability G(0) that $\lambda_p^* - \lambda_c^* \le 0$ can be estimated as the proportion of sample pairs for which $\phi_p^* \le \phi_c^*$.
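For the c3-type measures, this last observation yields a particularly simple estimator that works entirely in the standard parametrization; a sketch, assuming λ(ϕ) is increasing, with illustrative names. The random matching of the two sample arrays mimics independent draws from $G_p$ and $G_c$.

```python
# Estimating G(0) from posterior samples of phi directly, using
# lambda_p* <= lambda_c*  <=>  phi_p* <= phi_c* for increasing lambda(phi).
import numpy as np

def estimate_G0_from_phi(phi_p, phi_c, rng=None):
    """phi_p, phi_c: posterior samples of phi given y_p and y_c."""
    rng = rng or np.random.default_rng()
    k = min(len(phi_p), len(phi_c))
    p = rng.permutation(phi_p)[:k]        # random pairing of the draws
    c = rng.permutation(phi_c)[:k]
    return np.mean(p <= c)                # estimates G(0) = P(delta <= 0)
```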


4. Extensions to deterministic nodes: Relation to cross-validation, prediction and hypothesis testing

4.1. Cross-validation and data node conflict

The model variables Y are represented by the bottom nodes in the DAG describing the hierarchical model. The framework can be extended to also cover conflict concerning these nodes. In this way, cross-validation can be viewed as a special case of the conflict measure approach.

Let Yc be an element in the vector Y of observable random variables. We define the prior iic density gp(yc) exactly as in Eq. (6), with λ replaced by yc. The Dirac measure at the observed value yc represents a degenerate iic information contribution about Yc. This leads to the following definitions:

$$c_{y_c}^{3+} = G_p(y_c), \qquad c_{y_c}^{3-} = \bar{G}_p(y_c), \tag{16}$$
$$c_{y_c}^{3} = 1 - 2\min\big(G_p(y_c), \bar{G}_p(y_c)\big), \tag{17}$$
$$c_{y_c}^{4} = P_{g_p}\big(g_p(Y_c) > g_p(y_c)\big). \tag{18}$$

The measures (Eqs. (16)–(18)) are called data node conflict measures. To see that these definitions are consistent with Eqs. (10)–(12), note that $\lambda_p^*$ corresponds to $Y_c$, and $\lambda_c^*$ is deterministic and corresponds to $y_c$. We define $X = Y_c - y_c$, corresponding to δ. We then have $g(x) = g_p(x + y_c)$. Hence,

$$G(0) = \int_{-\infty}^{0} g(x)\, dx = \int_{-\infty}^{y_c} g_p(y)\, dy = G_p(y_c), \tag{21}$$

and accordingly, $\bar{G}(0) = \bar{G}_p(y_c)$. It follows that Eqs. (16) and (17) are special cases of Eqs. (10) and (11). Moreover,

$$P_g\big(g(X) > g(0)\big) = P_{g_p}\big(g_p(Y_c) > g_p(y_c)\big), \tag{22}$$

showing that Eq. (18) is a special case of Eq. (12).
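In practice, $G_p$ is represented by posterior predictive draws of $Y_c$ given $y_p$, so the data node conflict measures reduce to the following sketch (illustrative names; a kernel density estimate stands in for $g_p$ in Eq. (18)).

```python
# Data node (cross-validatory) conflict measures, Eqs. (16)-(18),
# computed from posterior predictive samples; a sketch.
import numpy as np
from scipy.stats import gaussian_kde

def data_node_conflicts(yc_pred, yc_obs):
    """yc_pred: draws from g_p, the predictive given y_p; yc_obs: observed y_c."""
    Gp = np.mean(yc_pred <= yc_obs)           # G_p(y_c)
    c3_plus, c3_minus = Gp, 1.0 - Gp          # Eq. (16)
    c3 = 1.0 - 2.0 * min(Gp, 1.0 - Gp)        # Eq. (17)
    kde = gaussian_kde(yc_pred)               # estimate of g_p
    c4 = np.mean(kde(yc_pred) > kde(yc_obs))  # Eq. (18)
    return c3_plus, c3_minus, c3, c4
```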

Furthermore, this correspondence between the data node conflict measures (Eqs. (16) and (17)) and the parameter node conflict measures (Eqs. (10) and (11)) can be used to motivate the latter measures. We will treat the $c^{3+}$ measure as an example. Consider again a parameter node λ. If λ were actually observable and known to take the value $\lambda_c$, the data node version of the $c^{3+}$ measure could be used to measure deviations toward the right tail of $G_p$ as

$$G_p(\lambda_c) = \int_{-\infty}^{\lambda_c} g_p(\lambda)\, d\lambda = \int_{-\infty}^{0} g_p(\delta + \lambda_c)\, d\delta. \tag{23}$$

Now λ is in reality not known, but we can take the expectation of this conflict with respect to the distribution Gc, which reflects the uncertainty about λ when influence from data yp is removed. The result is the following theorem:

Theorem 2

$$E_{G_c}\big(G_p(\lambda)\big) = c_\lambda^{3+}. \tag{24}$$

Proof:

$$E_{G_c}\big(G_p(\lambda)\big) = \int g_c(\lambda)\Big(\int_{-\infty}^{0} g_p(\delta + \lambda)\, d\delta\Big)\, d\lambda = \int_{-\infty}^{0}\Big(\int g_p(\delta + \lambda)\, g_c(\lambda)\, d\lambda\Big)\, d\delta = \int_{-\infty}^{0} g(\delta)\, d\delta = G(0) = c_\lambda^{3+} \tag{25}$$

by Eq. (10).

4.2. Cross-validation and sufficient statistics

Suppose the node λ of interest is the parent of the subvector $\mathbf{Y}_c$ of $\mathbf{Y}$. Suppose also that the scalar $Y_c$ is a sufficient statistic for $\mathbf{Y}_c$. Evidently then, the measures $c_\lambda^{3+}$ and $c_{Y_c}^{3+}$ address the same kind of possible conflict in the model. The following theorem, proved in Ref. [11], states that the two measures agree under certain conditions. This is a generalization of a result in Ref. [13], which also unnecessarily assumed symmetry for the conditional density of $Y_c$.

Theorem 3 Suppose the conditional density for the scalar variable $Y_c$ given the parameter λ is of the form $f_{Y_c}(y \mid \lambda) = f_{c,0}(y - \lambda)$. Then,

$$c_{Y_c}^{3+} = c_\lambda^{3+}. \tag{26}$$

When a sufficient statistic exists, the cross-validatory p-value is considered by Ref. [13] as the gold standard, and the aim of their construction is to provide a measure which is generally applicable and matches cross-validation when a sufficient statistic exists.

4.3. Prediction

As mentioned in Section 2, the $c^4$ measure can be used to assess conflict concerning vectors of nodes. Applying this at the data node level, we may assess the quality of predictions of a subvector $\mathbf{Y}_c$ of $\mathbf{Y}$ based on the complementary subvector $y_p$ of observations. The relevant measure is given by Eq. (18), with the scalar $Y_c$ replaced by the vector $\mathbf{Y}_c$. This is particularly well suited to models where data accumulate as time evolves. Such a conflict measure can be used to assess the overall quality of the model. It can also be used as a tool for model comparison and model choice.

4.4. Hypothesis testing

Suppose the top-level nodes μ appearing in Eq. (2) are assumed fixed and known according to the model, so that π(μ) is a Dirac measure at these fixed values of the components of μ. Hence, the DAG has deterministic nodes both at the top and at the bottom, namely the vectors μ and y, respectively. We may then check for a conflict concerning a component λ of μ by introducing a random version $\tilde{\lambda}$ of λ and contrasting the corresponding $g_c(\tilde{\lambda})$ with the fixed value λ. The random $\tilde{\lambda}$ has the same children and coparents as λ, and the vector $\beta_c$, the information contribution $f_c(\tilde{\lambda}; \beta_c)$ and the iic density $g_c$ are defined as in Eqs. (4), (5) and (6). The respective conflict measures are defined as in Eqs. (16)–(18), with $y_c$ replaced by λ and with $G_p$ and $g_p$ replaced by $G_c$ and $g_c$. If the model is rejected when the conflict exceeds a certain predefined warning level, this corresponds to a formal Bayesian test of the hypothesis $\tilde{\lambda} = \lambda$. Using the conflict measure of Eq. (18), we may put the whole vector μ to the test in this way.
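For a scalar top-level component, this test takes a particularly simple form once MCMC draws of the randomized version λ̃ are available; a minimal sketch with illustrative names.

```python
# Conflict about a fixed top-level value lambda: Eq. (17) with y_c
# replaced by lambda and G_p replaced by G_c; a sketch.
import numpy as np

def top_level_conflict(lam_tilde_draws, lam_fixed):
    """lam_tilde_draws: samples from g_c for the randomized version."""
    Gc = np.mean(lam_tilde_draws <= lam_fixed)
    return 1.0 - 2.0 * min(Gc, 1.0 - Gc)   # large value: evidence against lambda
```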


5. Preexperimental uniformity of the conflict measures

In this section, we review some results concerning the distribution of the conflict measures. If c is one of the measures (Eqs. (10), (11), (12), (16), (17) or (18)), then preexperimentally, i.e., prior to observing the data y, c is a random variable taking a value in [0, 1]. A large value of c indicates a possible conflict in the model, and uniformity of c corresponds to 1 – c being a proper p-value. This does not mean that we propose a formal hypothesis testing procedure for model criticism, possibly even adjusted for multiple testing, nor that we think that a fixed significance level represents an appropriate criterion signaling the need for changing the model. A relatively large value of c may be accepted if there are convincing arguments for believing in a particular modeling aspect, while a less extreme value of c may indicate a need for adjustments in modeling aspects that are considered questionable for other reasons. But the terms “relatively large” and “less extreme” must refer to a meaningful common scale. In our view, uniformity of the conflict measure under all sources of uncertainty is the natural ideal criterion for being a well-calibrated conflict measure, the fulfillment of which ensures comparable assessment of the level of conflict across models. This means that we aim for preexperimental uniformity in cases where the prior distribution is highly noninformative, and also, as discussed in the following subsection, in cases where an informative prior represents part of the randomness in the data-generating process (aleatory uncertainty) rather than subjective (epistemic) uncertainty about the location of a fixed but unknown λ. In this chapter, we limit attention to situations where exact uniformity is achieved. The pivotality condition (Eq. (15)) turns out to be a key assumption needed to obtain such exact results. Refs. [10] and [12] provide some examples where exact uniformity is achieved in other cases.

5.1. Data-prior conflict

Consider the model

$$Y \sim F_Y(y \mid \lambda), \qquad \lambda \sim F_\lambda(\lambda), \tag{27}$$

where $F_\lambda$ is an arbitrary informative prior distribution. Here, we think of this prior distribution as representing aleatory rather than epistemic uncertainty. The corresponding densities are denoted by $f_Y$ and $f_\lambda$. If contrasting the prior density with the likelihood $f_Y(y \mid \lambda)$ indicates a conflict between the prior and likelihood information contributions, we consider this a data-prior conflict. The following theorem, proved in Ref. [11], deals with this kind of conflict. Note that in this situation, the $Y_p$ part of the data splitting is empty.

Theorem 4 Suppose the conditional density for the scalar variable Y given the parameter λ is of the form $f_Y(y \mid \lambda) = f_0(y - \lambda)$ and that λ is generated from an arbitrary informative prior density $f_\lambda(\lambda)$. Then, the data-prior conflict measures about λ are preexperimentally uniformly distributed for both the $c_\lambda^3$- and $c_\lambda^4$-measures.

The theorem obviously applies to the location parameter of normal and t-distributions with fixed variance parameters, as well as the location parameter in the skew normal distribution [19]. If the vector Y consists of IID normal variables, the theorem also applies to the location parameter, using the sufficient statistic $\bar{Y}$ as the scalar variable. If the n components of Y are IID exponentially distributed with failure rate λ, their sum is a sufficient statistic that is gamma distributed with shape parameter n and scale parameter 1/λ. We may then use the fact that for a variable Y which is gamma distributed with known shape parameter and unknown scale parameter λ, the quantity $\log(Y) - \log(\lambda)$ is a pivotal statistic, and uniformity is obtained by combining Theorem 4 with the approach of Section 3. In the standard parametrization, the appropriate prior distribution is $\pi(\lambda) = 1/\lambda$. Details are given in Ref. [11], which also deals with the gamma, inverse gamma, Weibull and lognormal distributions in a similar way.
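As an illustration, the following simulation sketch checks the preexperimental uniformity of the c3 data-prior conflict for exponential data with a gamma prior on the failure rate, combining Theorem 4 with the log reparametrization of Section 3. The prior parameters and all names are chosen for illustration only.

```python
# Preexperimental uniformity of the data-prior conflict for IID
# exponential data, checked by simulation; a sketch. In the reference
# parametrization zeta = -log(lambda), the flat prior on zeta corresponds
# to pi(lambda) = 1/lambda, under which the posterior of lambda given y
# is gamma(n, rate sum(y)).
import numpy as np

def data_prior_c3(y, a0, b0, n_mc=4000, rng=None):
    rng = rng or np.random.default_rng()
    n = len(y)
    lam_p = rng.gamma(a0, 1.0 / b0, size=n_mc)      # draws from the prior
    lam_c = rng.gamma(n, 1.0 / y.sum(), size=n_mc)  # posterior under 1/lambda
    delta = -np.log(lam_p) + np.log(lam_c)          # zeta_p* - zeta_c*
    G0 = np.mean(delta <= 0.0)
    return 2.0 * abs(G0 - 0.5)                      # Eq. (11)

rng = np.random.default_rng(2)
a0, b0, n = 3.0, 2.0, 10                            # illustrative values
c3_vals = []
for _ in range(2000):                               # preexperimental replication
    lam = rng.gamma(a0, 1.0 / b0)                   # lambda from the prior
    y = rng.exponential(1.0 / lam, size=n)          # data given lambda
    c3_vals.append(data_prior_c3(y, a0, b0, rng=rng))
print(np.quantile(c3_vals, [0.5, 0.9, 0.95]))       # approx. 0.5, 0.9, 0.95
```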

5.2. Data-data conflict

Suppose all components of Y have distributions determined by the same parameter λ. Suppose we want to contrast information contributions from separate parts of Y about λ and define the splitting $(Y_p, Y_c)$ accordingly. Focusing on this kind of possible conflict, we assume complete prior ignorance about λ and accordingly assume that λ has the improper prior π(λ) = 1. Hence, recalling Eqs. (7) and (8), we contrast the information in $f_c(\lambda; Y_c)$ with that in $f_p(\lambda; Y_p)$. We use the term data-data conflict in this context, since there is no prior information incorporated in $f_p$, and the two information contributions play symmetric roles. However, as a particular application, one may think of $Y_c$ as a scalar variable representing a possible outlier.

The following theorem is proved in Ref. [11].

Theorem 5 Suppose that the conditional densities for the scalar variables $Y_p$ and $Y_c$ given the parameter λ are of the form $f_{Y_p}(y \mid \lambda) = f_{p,0}(y - \lambda)$ and $f_{Y_c}(y \mid \lambda) = f_{c,0}(y - \lambda)$.

Assume λ has the improper prior π(λ) = 1. Then, the data-data conflict measures about λ are preexperimentally uniformly distributed for both the $c_\lambda^3$- and $c_\lambda^4$-measures.

Theorem 5 can be applied if the components of $Y_c$ and $Y_p$ are normally or lognormally distributed with known variance parameter, exponentially distributed, or gamma, inverse gamma or Weibull distributed with known shape parameter, since pivotal quantities based on sufficient statistics exist for these distributions.
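For two normal information sources, everything is available in closed form, and the uniformity claimed by Theorem 5 is easy to verify by simulation; a sketch with illustrative numbers.

```python
# Data-data conflict for Y_p ~ N(lambda, var_p), Y_c ~ N(lambda, var_c)
# under pi(lambda) = 1, so that delta ~ N(y_p - y_c, var_p + var_c).
import numpy as np
from scipy.stats import norm

def data_data_c3(yp, yc, var_p, var_c):
    G0 = norm.cdf(0.0, loc=yp - yc, scale=np.sqrt(var_p + var_c))
    return 2.0 * abs(G0 - 0.5)                      # Eq. (11)

rng = np.random.default_rng(3)
lam = 1.7                      # any true value; the result is uniform anyway
c3 = [data_data_c3(rng.normal(lam, 1.0), rng.normal(lam, 0.5), 1.0, 0.25)
      for _ in range(5000)]
print(np.quantile(c3, [0.5, 0.95]))                 # approx. 0.5 and 0.95
```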

5.3. Normal hierarchical models with fixed covariance matrices

Allowing each y and ν appearing in Eq. (2) to be interpreted as a vector of nodes, we now assume that each conditional distribution in the decomposition of Eq. (2) is multinormal with fixed and known covariance matrices. The random-effects model of Eq. (1) is a simple example of this. We also assume that the top-level parameter vector μ has the improper prior π(μ) = 1 and that each linear mapping $\mathrm{Pa}(\nu) \mapsto E(\nu \mid \mathrm{Pa}(\nu))$ has full rank.

Now let λ be any node in the model description. It is standard to verify that, regardless of how the vector of neighboring and coparent nodes β is decomposed into $\beta_p$, containing Pa(λ), and $\beta_c$, the densities $f_p(\lambda; \beta_p)$ and $f_c(\lambda; \beta_c)$ of Eqs. (5) and (8) are multinormal with fixed covariance matrices. Furthermore, this is true also for the iic densities $g_p$ and $g_c$ of Eq. (6), regardless of the data splitting. It follows that the density g of the difference δ between independent samples from $g_p$ and $g_c$ is multinormal with expectation $E_G(\delta) = E_{G_p}(\lambda) - E_{G_c}(\lambda)$ and covariance matrix $\mathrm{cov}_G(\delta) = \mathrm{cov}_{G_p}(\lambda) + \mathrm{cov}_{G_c}(\lambda)$. It follows that $(\delta - E_G(\delta))^t\, \mathrm{cov}_G(\delta)^{-1}\, (\delta - E_G(\delta))$ is $\chi^2$-distributed with $n = \dim(\lambda)$ degrees of freedom, and the probability under G that $g(\delta) > g(0)$ is easily seen to be $\Psi_n\big(E_G(\delta)^t\, \mathrm{cov}_G(\delta)^{-1}\, E_G(\delta)\big)$, where $\Psi_n$ is the cumulative distribution function of the $\chi_n^2$-distribution. The preexperimental uniformity of this quantity is proved in Ref. [10].

Theorem 6 Consider a hierarchical normal model as described above.

1. Let λ be an arbitrary scalar or vector parameter node. If the data splitting satisfies Eq. (13), then $c_\lambda^4$ is uniformly distributed preexperimentally.

2. Suppose the data splitting $(Y_p, Y_c)$ satisfies $\mathrm{Ch}(\mathrm{Pa}(Y_c)) = Y_c$. Then, $c_{Y_c}^4$ is uniformly distributed preexperimentally.

If λ in (i) or $Y_c$ in (ii) is one dimensional, then G is symmetric and unimodal, and therefore, the respective $c^3$-measures are defined and coincide with the $c^4$-measures. Gåsemyr and Natvig [10] also show that in this case the $c^{3+}$- and $c^{3-}$-measures are uniformly distributed preexperimentally.
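The closed-form expression for c4 in the multinormal case is immediate to code; in the sketch below, the means and covariance matrices of the two iic distributions are assumed to have been derived beforehand (as in Example 1, or by standard multivariate normal conjugacy), and all names are illustrative.

```python
# c4 = Psi_n(E_G(delta)^t cov_G(delta)^{-1} E_G(delta)) for multinormal
# iic distributions; a sketch.
import numpy as np
from scipy.stats import chi2

def c4_normal(mean_p, cov_p, mean_c, cov_c):
    """Means and covariances of the multinormal iic distributions G_p, G_c."""
    mu = np.atleast_1d(np.asarray(mean_p, float) - np.asarray(mean_c, float))
    Sigma = np.atleast_2d(np.asarray(cov_p, float) + np.asarray(cov_c, float))
    stat = mu @ np.linalg.solve(Sigma, mu)          # E_G(delta)^t Sigma^{-1} E_G(delta)
    return chi2.cdf(stat, df=len(mu))               # Psi_n(...)
```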

Example 2. Consider the following DAG model, a regression model with randomly varying regression coefficients.

$$Y_{i,j} \sim N(X_{i,j}^t \xi_i, \sigma^2), \qquad \xi_i \sim N(\xi, \Omega), \qquad j = 1, \dots, n, \quad i = 1, \dots, m, \qquad \pi(\xi) = 1. \tag{19}$$

The m units could be groups of individuals, with $y_{i,j}$ the measurement for a group member with individual covariate vector $X_{i,j}$, or individuals, with the successive $y_{i,j}$ representing repeated measurements over time. In this model, we could check for a possible exceptional behavior of the mth unit by means of the conflict measure $c_{\xi_m}^4$. With a data splitting for which $Y_c = Y_m = (Y_{m,1}, \dots, Y_{m,n})$, the conditions of Theorem 6, part (i), are satisfied if $\dim(\xi) \le n$, and the measure is preexperimentally uniformly distributed.


6. Concluding remarks

The assumption of fixed covariance matrices in Section 5.3 is admittedly quite restrictive. In general, the presence of unknown nuisance parameters, such as parameters describing the covariance matrices in a normal model, makes the derivation of exact uniformity results difficult at best and often impossible. Promising approximate results are reported in Ref. [9] for the closely related $c_\lambda^2$-measure. Further empirical studies are needed in order to examine to what extent the conflict measures are approximately uniformly distributed in other situations. As an informal tool to be used in conjunction with subject matter insight, the conflict measure approach does not require exact uniformity in order to be useful.

References

1. Box GEP. Sampling and Bayes' inference in scientific modelling and robustness (with discussion and rejoinder). Journal of the Royal Statistical Society, Series A. 1980;143:383-430
2. Bayarri MJ, Castellanos ME. Bayesian checking of the second levels of hierarchical models. Statistical Science. 2007;22:322-343
3. Gelman A, Meng X-L, Stern H. Posterior predictive assessment of model fitness via realized discrepancies (with discussion and rejoinder). Statistica Sinica. 1996;6:733-807
4. Bayarri MJ, Berger JO. P values in composite null models (with discussion). Journal of the American Statistical Association. 2000;95:1127-1142
5. Hjort NL, Dahl FA, Steinbakk GH. Post-processing posterior predictive p-values. Journal of the American Statistical Association. 2006;101:1157-1174
6. Dahl FA. On the conservativeness of posterior predictive p-values. Statistics and Probability Letters. 2006;76:1170-1174
7. Dey D, Gelfand A, Swartz T, Vlachos P. A simulation-intensive approach for checking hierarchical models. Test. 1998;7:325-346
8. O'Hagan A. HSSS model criticism (with discussion). In: Green PJ, Hjort NL, Richardson S, editors. Highly Structured Stochastic Systems. Oxford: Oxford University Press; 2003. pp. 423-444
9. Dahl FA, Gåsemyr J, Natvig B. A robust conflict measure of inconsistencies in Bayesian hierarchical models. Scandinavian Journal of Statistics. 2007;34:816-828
10. Gåsemyr J, Natvig B. Extensions of a conflict measure of inconsistencies in Bayesian hierarchical models. Scandinavian Journal of Statistics. 2009;36:822-838
11. Gåsemyr J. Uniformity of node level conflict measures in Bayesian hierarchical models based on directed acyclic graphs. Scandinavian Journal of Statistics. 2016;43:20-34
12. Gåsemyr J. Alternatives to post-processing posterior predictive p-values. Submitted 2017
13. Marshall EC, Spiegelhalter DJ. Identifying outliers in Bayesian hierarchical models: A simulation-based approach. Bayesian Analysis. 2007;2:409-444
14. Dias S, Welton NJ, Caldwell DM, Ades AE. Checking consistency in mixed treatment comparison meta-analysis. Statistics in Medicine. 2010;29:932-944
15. Presanis AM, Ohlssen D, Spiegelhalter D, De Angelis D. Conflict diagnostics in directed acyclic graphs, with applications in Bayesian evidence synthesis. Statistical Science. 2013;28:376-397
16. Lauritzen SL. Graphical Models. Oxford: Oxford University Press; 1996
17. Scheel I, Green P, Rougier JC. A graphical diagnostic for identifying influential model choices in Bayesian hierarchical models. Scandinavian Journal of Statistics. 2011;38:529-550
18. Box GEP, Tiao GC. Bayesian Inference in Statistical Analysis. New York: Wiley; 1992
19. Azzalini A. A class of distributions which includes the normal ones. Scandinavian Journal of Statistics. 1985;12:171-178
