## 1. Introduction

In today’s world, most of the statistical inference problems involve high-dimensional multiple hypothesis testing. Whenever we collect data, we collect data on multiple features, involving very high-dimensional variables in some cases. For example, gene expression data consist of gene expressions on thousands of genes; image data consist of image expressions on multiple voxels. The statistical analysis for these types of data involves multiple hypotheses testing (MHT). It is well known that univariate methods cannot be applied to simultaneously test hypotheses on the multiple features. The reason for this is that the error rates for the univariate analysis get multiplied under MHT, and as a result the actual error rate can be very high. To understand the main issue of multiplicity, consider the following example. Suppose there are, say, 100 misspelled words in a book, and each of these words occurs in 5% of the pages. You pick a page at random. For each misspelled word, the probability is certainly 0.05 of finding that word in the page. However, the probability is much higher that you will find at least one of the 100 misspelled words. If these words were independently distributed, then the probability of finding at least one misspelled word is 1 − (0.95)^{100} ≈ 0.995. If the placements of the misspelled words were positively dependent, then the probability will be lower than 0.995. For example, if we take an extreme case of dependence that they all occur together, then the probability will be 0.05. The same phenomenon occurs in the MHT. The statistical inference, based on the error rate of individual hypothesis testing, can lead to very high error rate for the combined hypotheses. Thus, for the MHT, adjustment in the error rate needs to be made. Note that the adjustment rate may depend on the dependent structure, but due to the complexity of the dependent structure in high dimension, dependency is usually ignored in the literature [1].

The statistical inference depends on how we define the error rate for the combined hypotheses testing. Let us suppose that there are *m* hypotheses testing *FWER*), which is defined as

There are many methods for controlling *FWER* ≤ *α*_{F} (=0.05, e.g.). A simplest method is the Bonferroni’s procedure. Let *T*_{i} be the test statistics for testing *p*-values *p*_{i}. Then, Bonferroni’s procedure rejects *p*_{i} < *α*_{F}/*m*. To see the proof of this, suppose *I*_{0} be the set of all *i* for which *p*_{j} < *α*_{F}/*m* for at least one ∈ *I*_{0} . Then using Boole’s inequality, we have, from Eq. (1),

Now, since, under *p*_{i} < *α*_{F}/*m*} = *α*_{F}/*m*. Then, assuming that there are *m*_{0} number of elements in *I*_{0}, we have, from Eq. (2),

Holm [2] gave a modified version of Bonferroni’s procedure which also controls the familywise error rate. Holm’s Bonferroni Procedure is the following: First rank all the *p*-values, *p*_{(1)} ≤ *p*_{(2)} ≤ … ≤ *p*_{(m)}, and let *l* be the smallest index such that *p*_{(l)} > *α*_{F}/(*m* − *l* + 1). Then, reject only those null hypotheses that are associated with *p*-values with *p*_{(1)} < *α*_{F}/*m*, *p*_{(2)} < *α*_{F}/(*m* − 1), …, *p*_{(l − 1)} < *α*_{F}/(*m* − *l* + 2) , and thus more powerful than Bonferroni’s procedure, since hypotheses that are selected under Bonferroni’s procedure will also be selected under Holm’s procedure.

The above Bonferroni type procedures are not very satisfactory when *m* is very high. Let us suppose *m* = 10, 000 (this is actually not very high for most of the high-dimensional problems), and suppose we want to control FWER by *α*_{F} = 0.05. Then, for Holm’s procedure, the smallest *p*-value has to be lower than 0.000005 in order to reject at least one hypothesis, which may be very hard to achieve. The problem is not really with Holm’s procedure; the problem is with the use of FWER as an error rate. For a high-dimensional problem, it is unrealistic to seek for a procedure which will not make at least one false discovery. Benjamini and Hochberg [1] proposed a new approach called false discovery rate (FDR) and proposed a procedure that works much better for high-dimensional MHT.

In Section 2, we review the FDR procedure and Bayesian procedures for two-sided alternatives. An extension of directional hypotheses is presented in Section 3. In Section 3, we also discuss Bayesian procedures under skewed alternatives. In Section 4, the problem of directional hypotheses is considered by converting *p*-values to normally distributed test statistics. We also discuss, in Section 4, a Bayes procedure under skew-normal alternatives. An application using real data of gene expressions is also discussed in Section 4. Some concluding remarks are made in Section 5.

## 2. False discovery rate (FDR), Benjamini and Hochberg’s (BH) procedure, and Bayesian procedures

For each of the hypothesis testing **Table 1** shows the possible outcomes by a procedure, where, for example, *V* is the total number of discoveries, among them *V*_{0} is the number of false discoveries.

Thus, the proportion of the false discoveries is *V*_{0}/max(*V*, 1). The FDR is defined as the expected proportion of false discoveries, that is,

If, for example, *FDR* = 0.05, then we can expect on the average 5% of all discoveries to be false. In other words, under repeated experiments on the average, we make 5% of the false discoveries (in a frequentist’s sense). Note that *FDR* ≤ *FWER* = *P*(*V*_{0} ≥ 1) as the following inequality shows:

Thus, we are likely to make a higher number of discoveries under FDR approach than under FWER, since if a procedure controls FWER (≤*α*), then it also controls FDR ((≤*α*), but not vice versa.

### 2.1. Benjamini and Hochberg’s procedure

Benjamini and Hochberg [1] proposed the following BH procedure which controls the FDR.

Let *p*_{i} be the *p*-value for the *i*th hypothesis under a test statistic *T*_{i}. Suppose *T*_{1}, *T*_{2}, …, *T*_{m} are independently distributed. Let *p*_{[1]} < *p*_{[2]} < … < *p*_{[m]} be the ordered *p*-values with the corresponding null hypotheses be denoted by

Then, reject *i* ≤ *i*_{0}.

This procedure controls *m*_{0} is unknown, having the upper bound of *m*_{0} can be estimated reliably, a better bound is possible.

The above result was proven in [1], under the independence of the test statistics. Hochberg and Yekitieli [3] extended the result to positively correlated test statistics, and they also sharpened the BH procedure with new *i*_{0} defined as

where

### 2.2. Bayesian procedures

Under Bayesian setting, we assume that

Under this setting, [4] developed a concept of local false discovery rate (fdr). If *T*_{i}, *i* = 1, 2, …, *m* are test statistics with pdf *T*_{i}|*H*_{0} ∼ *f*_{0}(*t*) and *T*_{i}|*H*_{a} ∼ *f*_{a}(*t*). Then, marginally, *T*_{i} ∼ *f*(*t*) = *pf*_{0}(*t*) + (1 − *p*)*f*_{a}(*t*), and

The idea is that if *T*_{i} ∈ [*t*, *t* + *δt*], where *δt* → 0, then *fdr*(*t*) represents that the proportion of the times *t* is very high, then *fdr*(*t*) will be very small indicating the probability of *p* and *f*(*t*) are unknown, which can be estimated (see [4]).

Storey [5] proposed a positive false discovery rate

where expectation is with respect to the distribution of (*T*_{i}, *θ*_{i}), *i* = 1, 2, …, *m*. Under the assumption that *T*_{1}, *T*_{2}, … *T*_{m} are identically and independently distributed, [6] proved that

for a procedure that rejects *T*_{i} ∈ Γ. Based on this, *q* − *value* for the multiple hypothesis (analogous to *p*-value for a single hypothesis) is defined as the smallest value of *pFDR*(Γ) such that the observed *T*_{i} = *t*_{i} ∈ Γ, see [6]. Under most cases, *q* − *value*(*t*_{i}) = *P*(*H*_{0}| *T*_{i} > *t*_{i}). This gives a procedure under multiple hypothesis that rejects *q* − *value*(*t*_{i}) < *α*.

## 3. Directional hypotheses testing

As described earlier, the null hypothesis

Direction hypotheses testing involves testing *T*_{i} ∈ Γ_{−}} for selecting *T*_{i} ∈ Γ_{+}} for selecting *T*_{i} ∈ Γ_{−} or *T*_{i} ∈ Γ_{+}, and the direction *T*_{i} ∈ Γ_{−} or *T*_{i} ∈ Γ_{+}, respectively. Analogous to **Table 1**, we now have

**Table 2** illustrates the number of cases possible when accepting *H*_{0} or selecting *H*_{−} or selecting *H*_{+}. For example, out of *V* times when selecting *H*_{−}, *V*_{0} times errors are made when in fact *H*_{0} is true, and *V*_{+} times errors are made when in fact *H*_{+} is true. In other words, when selecting *H*_{−}, not only *H*_{0} is falsely rejected *V*_{0} times but the direction is also falsely selected *V*_{+} times. This leads to a concept of directional false discovery rate *DFDR* defined as

Accept H_{0} | Select H_{−} | Select H_{+} | Total | |
---|---|---|---|---|

H_{0} is true | U_{0} | V_{0} | W_{0} | m_{0} |

H_{−} is true | U_{−} | V_{−} | W_{−} | m_{−} |

H_{+} is true | U_{+} | V_{+} | W_{+} | m_{+} |

Total | U | V | W | m |

This is analogous to *FDR* for two-sided alternatives. For most cases, [8] showed that DFDR-controlling procedures for directional hypotheses can be treated as FDR-controlling procedure for two-sided multiple hypotheses with direction determined by the sign of the test statistics.

Bansal and Miescke [9] considered a decision theoretic formulation to multiple hypotheses problems. The approach assumes parametric modeling. Suppose the model for the observed data ** x** be represented by

*P*(

**;**

*x***,**

*θ**η*), where

**= (**

*θ**θ*

_{1},

*θ*

_{2}, …,

*θ*

_{m}) ′ is a parameter vector of interest, and

*η*is a nuisance parameter. The problem of interest is to test

Let the loss function of a decision rule *d*(** x**) = (

*d*

_{1}(

**),**

*x**d*

_{2}(

**), …,**

*x**d*

_{m}(

**)) is given by**

*x*where *l*_{i}(** θ**,

*d*

_{i}(

**)) is an individual loss of**

*x**d*

_{i}. Here,

*d*

_{i}∈ {−1, 0, 1} with

*d*

_{i}= 0,

*d*

_{i}= − 1, and

*d*

_{i}= 1 means accepting

*l*

_{i}= 0 for correct decision, and

*l*

_{i}= 1 for the incorrect decision,

*L*is the total number of incorrect decisions. Thus, minimizing the

*E*[

*L*(

**,**

*θ**d*(

**))] for the “0-1” loss amounts to minimizing the expected number of incorrect decisions.**

*X*Now, suppose under the Bayesian setting, *θ*_{i}, *i* = 1, 2, …, *m* are generated from

where *π*_{−} is the prior density over (−∞, 0) and *π*_{+} is the prior density over (0, ∞). A special case of prior (9) is that *π*_{−}(*θ*) = *π*_{+}(−*θ*). In this case, *p*_{−} and *p*_{+} reflect the skewness in the alternative hypotheses. For example, if *p*_{−} = *p*_{+}, then we have a symmetric case. In this case, the selection of *H*_{−} or *H*_{+}, after rejecting *H*_{0}, based on the sign of the test statistics makes sense. On the other hand, if *p*_{−} < *p*_{+}, then it reflects that more of the *θ*_{i}*s* are positives than negatives. For many gene expressions data analyses, this presents a useful case when over-expressed genes may occur more frequently than under-expressed genes as a result of gene mutation (naturally or as a result of external factors). For specific examples, see [9, 10].

From now on, we focus on the “0-1” loss. The results can be easily extended to other loss functions. The “0-1” loss can be written as

where *θ*_{i} < 0 when *θ*_{i} = 0 when *θ*_{i} > 0 when

### 3.1. The constrained Bayes rule

The Bayes procedure described earlier accommodates skewness in the prior, but no type of false discovery rates is controlled. In order to control a false discovery rate, we need to obtain a constrained Bayes rule that minimizes the posterior expected loss subject to a constraint on the false discovery rate.

The directional false discovery rate (6) is defined in a frequentist’s manner, in which expectation is with respect to ** X**|

*θ*. Let us define Eq. (6) as

*BDFDR*when expectation is taken with respect to

**|**

*X**θ*and then further expectation is taken with respect to

**. We define posterior version of Eq. (6) as**

*θ**PDFDR*when the expectation is taken with respect to the posterior distribution of

**|**

*θ**X*=

*x*. It can be shown that

Here,

A constrained Bayes rule can be obtained by minimizing the posterior expected loss subject to the constraint that *PDFDR* ≤ *α*. There can be many approaches to obtain the constraint minimization. We present, here, an approach given in [9], which is as follows:

*Consider the sets *

Let *D*_{ξ} denotes the set of indices corresponding to

### 3.2. Estimating mixture parameters

The above procedure requires estimation of the parameters (*p*_{−}, *p*_{0}, *p*_{+}) and estimation of the nuisance parameter *η*. Note that marginally,

where *f*_{0}(*x*_{i}| *η*) = *f*(*x*_{i}| 0, *η*), and

and *X*_{1}, *X*_{2}, …, *X*_{m} are independently distributed. Estimates of the parameters of the mixed density can be obtained by using EM algorithm. It is easy to see that the EM estimators of (*p*_{−}, *p*_{0}, *p*_{+}) follows the following iterative scheme:

Estimation of *η* can also be estimated iteratively by using EM algorithm or by different means. See [9] for more details.

## 4. Bayes rules by converting *p*-values to normally distributed test statistics

Let *T*_{i}, *i* = 1, 2,.., *m* be independently and identically distributed test statistics. Let *p*-values. Note that under *X*_{i} = Φ^{− 1}(*P*_{i}) be the corresponding *z*-score. Then, under *X*_{i} ∼ *N*(0, 1) . Efron [11] suggested using *X*_{i} ∼ *N*(0, *σ*^{2}) under *σ*^{2} appropriately estimated. Efron pointed out that, in practice, *σ*^{2} may not be equal to 1 due to possible correlation among multiple components. Under the alternative, we assume that *X*_{i} ∼ *N*(*θ*_{i}, *σ*^{2}), where *θ*_{i}*s* are generated with distribution described in Eq. (9). It is true that this is a big leap in making this assumption. In practice, this assumption can be tested, however, and if true, it can lead to very powerful results. [9] assumed that *π*_{+}(*θ*) is a truncated normal distribution *N*(0, *σ*^{2}/*ω*) , and *π*_{−}(*θ*) = *π*_{+}(−*θ*), where *ω* is some positive constant depending upon how inflated we believe the alternative *θ*_{i}*s* are. It can be seen that

with the proportionality constant [*p*_{−}*T*_{−}(*x*_{i}) + *p*_{+}*T*_{+}(*x*_{i}) + *p*_{0}}^{− 1} . Also, *T*_{−}(*x*_{i}) = *T*_{+}(−*x*_{i}), and

In order to apply the Bayes procedure as discussed in Section 3, all we need are Eqs. (11) and (12). For computation details, see [9].

### 4.1. Skew-normal alternatives

In the above discussions, we assumed that *θ*_{i}*s* are generated from distribution with pdf (9). [12] considered the case when *θ*_{i}*s* are generated from a skew-normal distribution under the alternative hypotheses. The skew-normal distribution was first introduced in [13]. It has an important property that if (*ξ*_{1}, *ξ*_{2}) ∼ Bivariate Norma with mean 0, then the distribution of *ξ*_{1}|*ξ*_{2} > 0 ∼ *Skew* − *normal*. Its pdf is given by

and is denoted by *SN*(0, *σ*_{1}, *λ*). Here, *λ* is a skew parameter. If *λ* = 0, then this distribution is *N*(0, *σ*_{1}). The implication of this result is the following: suppose within a normal system an outcome follows a normal distribution, but if a correlated factor starts exerting a positive effect, then the outcome variable will start following a skew-normal distribution. For example, consider RNAs experiments and assume that genes are in a normal state. Suppose a gene mutation occurs at a later state and it starts exerting positive effect on the affected genes. In this case, based on the above property of skew-normal distribution, we can assume that the expressions of the affected genes will follow a skew-normal distribution.

Under this formulation, we assume that *θ*_{i} = 1, 2, …, *m* are generated from

Now, similar to Eq. (11), it can be seen that

with proportionality constant [(1 − *p*)(*T*_{+}(*x*_{i}) + *T*_{−}(*x*_{i}) + *p*]^{− 1}, where

and

The sets

where *c*_{1} > 0 and *c*_{2} > 0 are determined as shown in **Figure 1** by considering the point of intersections of *y* = *p*/(1 − *p*) and *y* = *T*_{−}(*x*), and *y* = *p*/(1 − *p*) and *y* = *T*_{+}(*x*), respectively. Note that when *λ* > 0, the intersection point *Q* (as shown in the figure) will be to the left of *x* = 0, and when *λ* < 0, *Q* will be to the right of *x* = 0. Thus, when *λ* > 0, *c*_{1} > *c*_{2} and the opposite is true when *λ* < 0. When *λ* = 0, *T*_{−}(*x*) = *T*_{+}(−*x*) and thus *c*_{1} = *c*_{2}. If *λ* → ∞, *T*_{−}(*x*) → 0 and thus *BDFDR*. However, *c*_{1} and *c*_{2} can be further shrunk so that the resulting procedure achieves *BDFDR* ≤ *α*; see [12] for details.

To illustrate the above procedure, and to compare it with the standard FDR procedure (BY) of [8], which selects the direction based on the sign of the test statistics, we consider a HIV data described in [14]. For detailed analysis, see [12]. Here, we describe the analysis very briefly. The data consist of eight microarrays, four from cells of HIV-infected subjects and four from uninfected subjects, each with expression levels of 7680 genes. For each gene, we obtained a two-sample *t*-statistic, comparing the infected versus the uninfected subjects, which is then transformed to a *z*-value, where *z*_{i} = Φ^{− 1}{*F*_{6}(*t*_{i})}. Here, *F*_{6}(∙) denotes the cumulative distribution function (cdf) of *t -*distribution with six degrees of freedom. **Figure 2** shows the histogram of the *z*-values with a skew-normal fit. Although the null distribution of *Z*_{i} should be *N*(0, 1). However, as suggested in [11], we use the null distribution as *N*(−0.11, 0.75^{2}). Thus, we formulate our problem as testing hypotheses (7) with test statistics *Z*_{i} ∼ *N*(−0.11 + *θ*_{i}, 0.75^{2}).

BY procedure resulted in cutoffs (−3.94, 3.94), which resulted in 18 total discoveries with two genes declared as under-expressed and 16 as over-expressed. For the constrained Bayes rule, we first used the EM algorithm to obtain the parameter estimates as

## 5. Concluding remarks

There are many different methods of testing multiple hypotheses. Methodologies, however, depend on the criteria we choose. When the dimension of multiple hypotheses is not very high, the familywise error rate (FWER) is an appropriate criterion which safeguards against making even one false discovery. However, when the dimension of multiple hypotheses is very high, the FWER is not very useful; instead, a false discover rate (FDR) criterion is a good approach. Although FDR was originally defined as a frequentist’s concept, it can be re-interpreted in a Bayesian framework. The Bayesian framework brings many advantages. For example, a decision-theoretic formulation is easy to implement, directional hypotheses are easy to handle, and the skewness in the alternatives is easy to implement. Drawback is that we need to make an assumption about the prior distributions under the alternatives. Some work has been done based on nonparametric priors; however, much more work is needed.