## Abstract

Over the last decades, Bayesian hierarchical models defined by means of directed, acyclic graphs have become an essential and widely used methodology in the analysis of complex data. Simulation-based model criticism in such models can be based on conflict measures constructed by contrasting separate local information sources about each node in the graph. An initial suggestion of such a measure was not well calibrated. This shortcoming has, however, to a large extent been rectified by subsequently proposed alternative mutually similar tail probability-based measures, which have been proved to be uniformly distributed under the assumed model under various circumstances, and in particular, in quite general normal models with known covariance matrices. An advantage of this is that computationally costly precalibration schemes needed for some other suggested methods can be avoided. Another advantage is that noninformative prior distributions can be used when performing model criticism. In this chapter, we describe the basic framework and review the main uniformity results.

### Keywords

- cross-validation
- data splitting
- information contribution
- MCMC
- model criticism
- pivotal quantity
- preexperimental distribution
- p-value

## 1. Introduction

Over the last decades, Bayesian hierarchical models have become an essential and widely used methodology in the analysis of complex data. Computational techniques such as Markow Chain Monte Carlo (MCMC) methods make it possible to treat very complex models and data structures. Analysis of such models gives intuitively appealing Bayesian inference based on posterior probability distributions for the parameters.

In the construction of such models, an understanding of the underlying structure of the problem can be represented by means of directed acyclic graphs (DAGs), with nodes in the graph corresponding to data or parameters, and directed edges between parameters representing conditional distributions. However, a perfect understanding of the underlying structure is usually an unachievable goal, and there is always a danger of constructing inadequate models. Box [1] suggests a pattern for the model building process where an initial candidate model is assessed for adequacy, and if necessary modified and elaborated on, leading to a new candidate that again is checked for adequacy, and so on. As a tool in this model criticism process, Ref. [1] suggests using the prior predictive distribution of some checking function or test statistic as a reference for the observed value of this checking function, resulting in a prior predictive *p*-value. This requires an informative and realistic prior distribution, which is not always available or even desirable. Indeed, as pointed out in Ref. [2], in an early phase of the model building process, it is often convenient to use noninformative or even improper priors and thus avoid costly and time-consuming elicitation of prior information. Noninformative priors may be used also for the inference because relevant prior information is unavailable.

There exist many other methods for checking the overall fit of the model or an aspect of the model of special interest, based on locating a test statistic or a discrepancy measure in some kind of a reference distribution. The posterior predictive *p*-value (ppp) of Ref. [3] uses the posterior distribution as reference and does not require informative priors. But this method uses data twice and can as a result be very conservative [2, 4–6]. Hjort et al. [5] suggest remedying this by using the ppp value as a test statistic in a prior predictive test. The computation of the resulting calibrated cppp-value is, however, very computer intensive in the general case, and again realistic, informative priors are needed. A node-level discrepancy measure suggested in Ref. [7] is subject to the same limitations. The partial posterior predictive *p*-value of Ref. [4] avoids double use of data and allows noninformative priors but may be difficult to compute and interpret in hierarchical models.

Comparison with other candidate models through a technique for model comparison or model choice, such as predictive methods, maximum posterior probability, Bayes factors or an information criterion, can also serve as tools for checking model adequacy indirectly when alternative candidate models exist.

In this chapter, we will, however, focus on methods for criticizing models in the absence of any particular alternatives. We will review methods for checking the modeling assumptions at each node of the DAG. The aim is to identify parts or building blocks of the model that are in discordance with reality, which may be in need of adjustment or further elaboration. O’Hagan [8] regards any node in the graph as receiving information from two disjoint subsets of the neighboring nodes. This information is represented as a conditional probability density or a likelihood or as a combination of these two kinds of information sources. Adopting the same basic perspective, our aim is to check for inconsistency between such subsets. The suggestion in Ref. [8] is to normalize these information sources to have equal height 1 and to regard the height of the curves at the point of intersection as a measure of conflict. However, as shown in Ref. [2], this measure tends to be quite conservative. Dahl et al. [9] demonstrated that it is also poorly calibrated, with false warning probabilities that vary substantially between models. Dahl et al. [9] also identified the different sources of inaccuracy and modified the measure of Ref. [8] to an approximately *χ*^{2}-distributed quantity under the assumed model by instead normalizing the information sources to probability densities. In Ref. [10], these densities were instead used to define tail probability-based conflict measures. Gåsemyr and Natvig [10] showed that these measures are uniformly distributed in quite general hierarchical normal models with fixed variances/covariances. In Ref. [11], such uniformity results were proved in various situations involving nonnormal and nonsymmetric distributions. These uniformity results indicate that the measures of Refs. [9] and [10] have comparable interpretations across different models. Therefore, they can be used without computationally costly precalibration schemes, such as the one suggested in Ref. [5]. Gåsemyr [12] focuses on some situations where the conflict measure approach can be directly compared to the calibration method of Ref. [5] and shows that the less computer-intensive conflict measure approach performs at least as well in these situations. Moreover, the conflict measure approach can be applied in models using noninformative prior distributions.

Focusing on the special problem of identifying outliers among the second-level parameters in a random-effects model, Ref. [13] defines similar conflict measures. In this setting, the group-specific means are the nodes of interest. In some models, there exist sufficient statistics for these means. Then, outlier detection at the group level can also be based on cross validation, measuring the tail probability beyond the observed value of the statistic in the posterior predictive distribution given data from the other groups. In this context, the conflict measure approach can be viewed as an extension of cross-validation to situations where sufficient statistics do not exist. Also in Ref. [13] applications to the examination of exceptionally high hospital mortality rates and to results from a vaccination program are given. In Ref. [14], this methodology is used to check for inconsistency in multiple treatment comparison of randomized clinical trials. Presanis et al. [15] apply these conflict measures in complex cases of medical evidence synthesis.

## 2. Directed acyclic graphs and node-specific conflict

### 2.1. Directed acyclic graphs and Bayesian hierarchical models

An example of a DAG discussed extensively in Ref. [8] is the random-effects model with normal random effects and normal error terms defined by

In general, we identify the nodes or vertices of the graph with the unknown parameters **θ** and the observed data **y**, the latter appearing as bottom nodes and being the realizations of the random vector **Y**. In the Bayesian model, the parameters, the components of **θ**, are also considered as random variables. In general, if there is a directed edge from node *a* to node *b*, then *a* is a parent of *b*, and *b* is a child of *a*. We denote by Ch(*a*) the set of child nodes of *a*, and by Pa(*b*) the set of parent nodes of *b*. More generally, *b* is a descendant of *a* if there is a directed path from *a* to *b*. The set of descendants of *a* is denoted by Desc(*a*) and, for convenience, is defined to contain *a* itself. The directed edges encode conditional independence assumptions, indicating that, given its parents, a node is assumed to be independent of all other nondescendants. Hence, writing *θ* = (**ν, μ**), with **μ** representing the vector of top-level nodes, the joint density of (**Y, θ**) *=* (**Y**, **ν**, **μ**) is

where *π*(**μ**) is the prior distribution of **μ**. The posterior distribution *π*(**θ**|**y**) is the basis for the inference.

This setup can be generalized in various directions. The nodes may be allowed to represent vectors, at both the parameter and the data levels [10]. Instead of DAGs, one may consider chain graphs, as described in Ref. [16], with undirected edges representing mutual dependence as in Markov random fields. Scheel et al. [17] introduce a graphical diagnostic for model criticism in such models.

### 2.2. Information contributions

The representation of a Bayesian hierarchical model in terms of a DAG is often meant to reflect an understanding of the underlying structure of the problem. By looking for a conflict associated with the different nodes in the DAG, we may therefore put our understanding of this structure to test. We may also identify parts of the model that need adjustment.

The idea put forward in Ref. [8] is that for each node *λ* in a DAG one may in general think of each neighboring node as providing information about *λ* and that it is of interest to consider the possibility of conflict between different sources of information. For instance, one may want to contrast the local prior information provided by the factor *p*(*λ*|Pa(*λ*)) with the likelihood information source formed by multiplying the factors *p*(*γ*|Pa(*γ*)) for all child nodes *γ* ∈ Ch(*λ*). The full conditional distribution of *λ* given all the observed and unobserved variables in the DAG, i.e.,

is determined by these two types of factors. Here, (**y, θ**)_{−λ} denotes the vector of all components of (**y, θ**) except for *λ*.

Dahl et al. [9] normalize the product *f*_{c}(*λ*), the likelihood or child node information contribution, whereas the local prior density is denoted by *f*_{p}(*λ*) and called the prior or parent node information contribution. These information contributions are integrated with respect to posterior distributions for the unknown nuisance parameters to form integrated information contribution (iic) denoted by *g*_{c} and *g*_{p}. In this construction, a key to avoid the conservatism of the measure suggested in Ref. [8] is to prevent dependence between the two information sources by introducing a suitable data splitting **Y = (Y**_{p}, **Y**_{c}**)** and condition the parameters of *f*_{p} on **y**_{p} and the parameters of *f*_{c} on **y**_{c}.

**Definition 1** *For a given parameter node λ, denoted by β*_{p} *the vector whose components are Pa*(*λ*), *and by β*_{c} *the vector whose components are*

*Let* **Y** = (**Y**_{p}, **Y**_{c}) *be a splitting of the data* **Y**. *Define the densities f*_{p}, *f*_{c}, *the prior respectively likelihood information contributions, by*

*Define the integrated information contribution densities g*_{p}, *g*_{c} *by*

*and denote by G*_{p}, *G*_{c} *the corresponding cumulative distribution functions*.

Note that **β**_{c} may contain data nodes. The second integral in Eq. (6) is then taken only with respect to the random components of **β**_{c}, i.e., the parameters in **β**_{c}. If **β**_{c} contains no parameters, then *g*_{c} and *f*_{c} coincide. Definition 1 may also be extended to the case when *λ* is a vector, corresponding to a subset of parameter nodes.

Combining the set of information sources linked to a specific node in different ways leads to a modification of Definition 1 where **β**_{c} does not contain all child nodes of *λ*, the others being instead included in **β**_{p} together with their parent nodes. In this way, different types of conflict about the node may be revealed. This is natural, e.g., in the context of outlier detection among independent observations with a common mean. Note that **β**_{p} and **β**_{c} may then be overlapping, containing common coparents with *λ*. The setup is illustrated in Figure 1 in the case when the set of common components, by abuse of notation denoted by **β**_{p} ∩ **β**_{c}, is empty. For the general setup, Definition 1 is modified as follows.

**Definition 2** *Let γ be a vector whose components are a subset of Ch*(*λ*), *and define β*_{c} *as in Eq. (4). Denote by γ*_{1} *the rest of the child nodes of λ, and let β*_{p} *consist of γ*_{1} *together with its parent nodes in the same way as in Eq. (4), as well as Pa*(*λ*). *The information contributions are then given by*

*In Eq. (7), p ( λ | P a ( λ ) ) is replaced by the prior density π*(*λ*) *if* λ *is a top-level parameter. The corresponding iic densities are defined by Eq. (6) as before*.

### 2.3. Node-specific conflict measures

The conflict measure

The *G*_{p}, *G*_{c} of iic distributions, let *G*_{p} and *G*_{c}, respectively. Let *G* be the cumulative distribution function for

and

The *p* value, with small values indicative of conflict. Gåsemyr and Natvig [10] also defines a measure based on defining a tail area in terms of the density *g* of *G*, namely

applicable also when *λ* is a vector.

*Example 1*. To illustrate the theory, consider the random-effects model 1, with the variance parameters *σ*^{2}, *τ*^{2} assumed known, and with *μ* having the improper prior *π*(*μ*) = 1. For simplicity, assume *n*_{i} = *n* for all *i*. Suspecting the *m*th group of representing an outlier, let *λ*=*λ*_{m} be the node of interest. Define the data splitting **Y**_{p}, **Y**_{c} by letting *ϕ*, it is easy to see that **y**_{p}, *μ* has the density

It follows that

In a simulation study of the **Y** which is distributed according to the assumed model are uniform, regardless of the true value of the basic parameter. Another way of stating this is that we obtain a proper *p*-value by subtracting these measures from 1. These results are reviewed in Section 5 of the present chapter.

### 2.4. Integrated information contributions as posterior distributions

In most cases, the conflict measures of Refs. [9] and [10] are based on simulated samples from *G*_{p} and *G*_{c}. Definitions 1 and 2 suggest obtaining such samples by running an MCMC algorithm to generate posterior samples of the unknown parameters in **β**_{p} and **β**_{c} and then generate samples *G*_{p} and *G*_{c} are posterior distributions conditional on **y**_{p} and **y**_{c}, respectively, the latter based on the improper prior *π*(*λ*) = 1, independently of the coparents.

**Theorem 1** *Suppose that the data splitting satisfies*

*the latter expression by abuse of notation meaning the components of* **Y** *not present in* **Y**_{c}. *Assume λ and the coparents P a ( C h ( λ ) ∩ β p ) − λ are independent. We then have*

*and, specifying as prior density*

The proof is given in Appendix A in the online supporting information for Ref. [11]. Specializing to the standard setup of Definition 1, where **Y**_{c} consists of all data descendant nodes of *λ*. In Ref. [9], this splitting was compared with two other splittings for

## 3. Noninvariance and reparametrizations

The iic distributions and the corresponding conflict measures are parametrization dependent. Based on experience so far, the conflict measures seem to be fairly robust to changes in parametrization. However, this noninvariance can be handled in a theoretically satisfactory way under certain circumstances.

Let *ϕ* be the parameter, in a standard parametrization, corresponding to a specific node in the DAG. Suppose for simplicity that *Y*_{c} and an alternative parametrization *λ*, being a strictly monotonic function *λ*(*ϕ*), such that *Y*_{c} – *λ* is a pivotal quantity, i.e., the density for *Y*_{c} given *λ* is of the form

for some known density function *f*_{0}. Such a parametrization will be considered as a canonical or reference parametrization if it exists, as opposed to the standard parametrization involving *ϕ*. Accordingly, the conflict measures given in Eqs. (9)–(12) are preferably based on this reference parametrization.

By Theorem 1, samples *G*_{c} may be obtained by MCMC as posterior samples from *λ* satisfies Eq. (14), i.e., equals 1. According to an argument given in Section 1.3 of Ref. [18], such a prior expresses noninformativity for likelihoods of the form (Eq. (15)). Computationally, we may, however, use the standard parametrization. When generating *π*(*ϕ*|**Y**_{c}), the prior density |*dλ/dϕ*| for *ϕ* must be used. Then, we may calculate *G*_{p}(*λ*), we may calculate *g*(*δ*) based on corresponding samples *λ* is increasing as a function of *ϕ*). Hence, the probability *G*(0) that

## 4. Extensions to deterministic nodes: Relation to cross-validation, prediction and hypothesis testing

### 4.1. Cross-validation and data node conflict

The model variables **Y** are represented by the bottom nodes in the DAG describing the hierarchical model. The framework can be extended to also cover conflict concerning these nodes. In this way, cross-validation can be viewed as a special case of the conflict measure approach.

Let *Y*_{c} be an element in the vector **Y** of observable random variables. We define the prior iic density *g*_{p}(*y*_{c}) exactly as in Eq. (6), with *λ* replaced by *y*_{c}. The Dirac measure at the observed value *y*_{c} represents a degenerate iic information contribution about *Y*_{c}. This leads to the following definitions:

The measures (Eqs. (16)–(18)) are called data node conflict measures. To see that these definitions are consistent with Eqs. (10)–(12), note that *Y*_{c}, and *y*_{c}. We define *X* = *Y*_{c} *– y*_{c}, corresponding to *δ*. We then have

and accordingly,

showing that Eq. (18) is a special case of Eq. (12).

Furthermore, this correspondence between the data node conflict measures (Eqs. (16) and (17)) and the parameter node conflict measures (Eqs. (10) and (11)) can be used to motivate these latter measures. We will treat the *c*^{3+} measure as an example. Consider again a parameter node *λ*. If *λ* were actually observable and known to take the value *λ*_{c}, the data node version of the *c*^{3+} measure could be used to measure deviations toward the right tail of *G*_{p} as

Now *λ* is in reality not known, but we can take the expectation of this conflict with respect to the distribution *G*_{c}, which reflects the uncertainty about *λ* when influence from data **y**_{p} is removed. The result is the following theorem:

**Theorem 2**

*Proof*:

by Eq. (10).

### 4.2. Cross-validation and sufficient statistics

Suppose the node *λ* of interest is the parent of the subvector **Y**_{c} of **Y**. Suppose also that *Y*_{c} is a sufficient statistic for **Y**_{c}. Evidently then, the measures *Y*_{c}.

**Theorem 3** *Suppose the conditional density for the scalar variable Y*_{c} *given the parameter λ is of the form f Y c ( y | λ ) = f c , 0 2 ( y − λ ) . Then*,

When a sufficient statistic exists, the cross-validatory *p*-value is considered by Ref. [13] as the gold standard, and the aim of their construction is to provide a measure which is generally applicable and matches cross-validation when a sufficient statistic exists.

### 4.3. Prediction

As mentioned in Section 2, the *c*^{4} measure can be used to assess conflict concerning vectors of nodes. Applying this at the data node level, we may assess the quality of predictions of a subvector **Y**_{c} of **Y** based on a complementary subvector *y*_{p} of observations. The relevant measure is given by Eq. (18), with *Y*_{c} replaced by the vector **Y**_{c}. This is particularly well suited to models where data accumulate as time evolves. Such a conflict measure can be used to assess the overall quality of the model. It can also be used as a tool for model comparison and model choice.

### 4.4. Hypothesis testing

Suppose the top-level nodes **μ** appearing in Eq. (2) are assumed fixed and known according to the model, so that *π*(**μ**) is a Dirac measure at these fixed values of the components of **μ**. Hence, the DAG has deterministic nodes both at the top and at the bottom, namely the vectors **μ** and **y**, respectively. We may then check for a conflict concerning a component *λ* of **μ** by introducing a random version *λ* and contrast the corresponding *λ*. The random *λ*, and the vector **β**_{c}, the information contribution *g*_{c} are defined as in Eqs. (4), (5) and (6). The respective conflict measures are defined as in Eqs. (16)–(18) with *y*_{c} replaced by *λ* and *G*_{p} and *g*_{p} replaced by *G*_{c} and *g*_{c}. If the model is rejected when the conflict exceeds a certain predefined warning level, this corresponds to a formal Bayesian test of the hypothesis **μ** to test in this way.

## 5. Preexperimental uniformity of the conflict measures

In this section, we review some results concerning the distribution of the conflict measures. If *c* is one of the measures (Eqs. (10), (11), (12), (16), (17) or (18)), then preexperimentally, i.e., prior to observing the data **y**, *c* is a random variable taking a value in [0, 1]. A large value of *c* indicates a possible conflict in the model, and uniformity of *c* corresponds to 1 – *c* being a proper *p*-value. This does not mean that we propose a formal hypothesis testing procedure for model criticism, possibly even adjusted for multiple testing, nor that we think that a fixed significance level represents an appropriate criterion signaling the need for changing the model. A relatively large value of *c* may be accepted if there are convincing arguments for believing in a particular modeling aspect, while a less extreme value of *c* may indicate a need for adjustments in modeling aspects that are considered questionable for other reasons. But the terms “relatively large” and “less extreme” must refer to a meaningful common scale. In our view, uniformity of the conflict measure under all sources of uncertainty is the natural ideal criterion for being a well-calibrated conflict measure, the fulfillment of which ensures comparable assessment of the level of conflict across models. This means that we aim for preexperimental uniformity in cases where the prior distribution is highly noninformative, and also, as discussed in the following subsection, in cases where an informative prior represents part of the randomness in the data-generating process (aleatory uncertainty) rather than subjective (epistemic) uncertainty about the location of a fixed but unknown *λ*. In this chapter, we limit attention to situations where exact uniformity is achieved. The pivotality condition (Eq. (15)) turns out to be a key assumption needed to obtain such exact results. Refs. [10] and [12] provide some examples where exact uniformity is achieved in other cases.

### 5.1. Data-prior conflict

Consider the model

where *F*_{λ} is an arbitrary informative prior distribution. Here, we think of this prior distribution as representing aleatory rather than epistemic uncertainty. The corresponding densities are denoted by *f*_{Y} and *f*_{λ}. If contrasting the prior density with the likelihood **Y**_{p} part of the data splitting is empty.

**Theorem 4** *Suppose the conditional density for the scalar variable Y given the parameter λ is of the form f Y ( y | λ ) = f 0 ( y − λ ) and that λ is generated from an arbitrary informative prior density f*_{λ}(*λ*). *Then, the data-prior conflict measures about λ are preexperimentally uniformly distributed for both the c λ 3 - and c λ 4 -measures*.

The theorem obviously applies to the location parameter of normal and *t*-distributions with fixed variance parameters, as well as the location parameter in the skew normal distribution [19]. If the vector **Y** consists of IID normal variables, the theorem also applies to the location parameter, using as scalar variable the sufficient statistic *n* components of **Y** are IID exponentially distributed with failure rate *λ*, their sum is a sufficient statistic that is gamma distributed with shape parameter *n* and scale parameter *λ*. We may then use the fact that for a variable *Y* which is gamma distributed with known shape parameter and unknown scale parameter *λ*, the quantity

### 5.2. Data-data conflict

Suppose all components of **Y** have distributions determined by the same parameter *λ*. Suppose we want to contrast information contributions from separate parts of **Y** about *λ* and define the splitting *λ* and accordingly assume that *λ* has the improper prior *f*_{p}, and the two information contributions play symmetric roles. However, as a particular application, one may think of **Y**_{c} as a scalar variable representing a possible outlier.

The following theorem is proved in Ref. [11].

**Theorem 5** *Suppose that the conditional densities for the scalar variables Y*_{p} *and Y*_{c} *given the parameter λ are of the form*

*Assume λ has the improper prior π ( λ ) = 1 . Then, the data-data conflict measures about λ are preexperimentally uniformly distributed for both the c λ 3 - and c λ 4 -measures*.

Theorem 5 can be applied if the components of **Y**_{c} and **Y**_{p} are normally or lognormally distributed with known variance parameter, exponentially distributed, or gamma, inverse gamma or Weibull with known shape parameter, since pivotal quantities based on sufficient statistics exist for these distributions.

### 5.3. Normal hierarchical models with fixed covariance matrices

Allowing for each *y* and *ν* appearing in Eq. (2) to be interpreted as vectors of nodes, we now assume that each conditional distribution in the decomposition (Eq. (2)) is multinormal with fixed and known covariance matrices. The random-effects model (Eq. (1)) is a simple example of this. We also assume that the top-level parameter vector **μ** has the improper prior 1 and that each linear mapping

Now let *λ* be any node in the model description. It is standard to verify that, regardless of how the vector of neighboring and coparent nodes **β** is decomposed into **β**_{p}, containing **β**_{c}, the densities *g*_{p} and *g*_{c} of Eq. (6), regardless of the data splitting. It follows that the density *g* of the difference *δ* between independent samples from *g*_{p} and *g*_{c} is multinormal with expectation *χ*^{2}-distributed with *G* that _{n} is the cumulative distribution function for the

**Theorem 6** *Consider a hierarchical normal model as described above*.

*Let λ be an arbitrary scalar or vector parameter node. If the data splitting satisfies Eq. (13), then c λ 4 is uniformly distributed preexperimentally*.*Suppose the data splitting ( Y p , Y c ) satisfies C h ( P a ( Y c ) ) = Y c . Then, c Y c 4 is uniformly distributed preexperimentally*.

If *λ* in (i) or *Y*_{c} in (ii) are one dimensional, then *G* is symmetric and unimodal, and therefore, the respective *c*^{3}-measures are defined and coincide with the *c*^{4}-measures. Gåsemyr et al. [10] also show that in that case the *c*^{3+}- and *c*^{3−}-measures are uniformly distributed preexperimentally.

*Example 2*. Consider the following DAG model, a regression model with randomly varying regression coefficients.

The *m* units could be groups of individuals, with *y*_{i,j} the measurement for a group member with individual covariate vector **X**_{i,j}, or individuals with the successive *y*_{i,j} representing repeated measurements over time. In this model, we could check for a possible exceptional behavior of the *m*th unit by means of the conflict measure

## 6. Concluding remarks

The assumption of fixed covariance matrices in the previous subsection is admittedly quite restrictive. In general, the presence of unknown nuisance parameters, such as parameters describing the covariance matrices in a normal model, makes the derivation of exact uniformity at least difficult and often impossible. Promising approximate results are reported in Ref. [9] for the closely related