Total number of decisions made.
Abstract
For high-dimensional hypothesis problems, new approaches have emerged since the publication. The most promising of them uses a Bayesian approach. In this chapter, we review some of the past approaches, which are applicable only to low-dimensional hypotheses testing, and contrast them with the modern approaches to high-dimensional hypotheses testing. We review some of the new results based on Bayesian decision theory and show how the Bayesian approach can be used to accommodate directional hypotheses testing and skewness in the alternatives. A real example of gene expression data is used to demonstrate a Bayesian decision-theoretic approach to directional hypotheses testing with skewed alternatives.
Keywords
- multiple directional hypotheses
- false discovery rate
- familywise error rate
- gene expression
- skew-normal distribution
1. Introduction
In today’s world, most statistical inference problems involve high-dimensional multiple hypothesis testing. Whenever we collect data, we collect them on multiple features, in some cases involving very high-dimensional variables. For example, gene expression data consist of gene expressions on thousands of genes; image data consist of image expressions on multiple voxels. The statistical analysis for these types of data involves multiple hypotheses testing (MHT). It is well known that univariate methods cannot be applied to simultaneously test hypotheses on the multiple features. The reason is that the error rates of the univariate analyses compound under MHT, and as a result the actual error rate can be very high.

To understand the main issue of multiplicity, consider the following example. Suppose there are, say, 100 misspelled words in a book, and each of these words occurs in 5% of the pages. You pick a page at random. For each misspelled word, the probability of finding that word on the page is 0.05. However, the probability that you will find at least one of the 100 misspelled words is much higher. If these words were independently distributed, then the probability of finding at least one misspelled word would be $1 - (0.95)^{100} \approx 0.994$. If the placements of the misspelled words were positively dependent, then the probability would be lower than 0.994. For example, in the extreme case of dependence in which they all occur together, the probability would be 0.05. The same phenomenon occurs in MHT. Statistical inference based on the error rate of individual hypothesis tests can lead to a very high error rate for the combined hypotheses. Thus, for MHT, an adjustment in the error rate needs to be made. Note that the adjustment may depend on the dependence structure, but because of the complexity of the dependence structure in high dimensions, dependency is usually ignored in the literature [1].
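The misspelled-word example can be checked numerically. The following is a minimal sketch; the function name `prob_at_least_one` is ours and is used only for illustration.

```python
def prob_at_least_one(n_words: int, p_single: float) -> float:
    """Probability that at least one of n_words appears on a random page,
    assuming the words are placed independently of each other."""
    return 1.0 - (1.0 - p_single) ** n_words


# Independent placement: the chance of at least one hit is close to 1.
independent = prob_at_least_one(100, 0.05)

# Extreme positive dependence: all words occur on the same pages,
# so the chance of at least one hit equals the single-word probability.
fully_dependent = 0.05

print(f"independent placement:   {independent:.4f}")
print(f"fully dependent (limit): {fully_dependent:.4f}")
```

The two printed values bracket the range discussed in the example: any positively dependent placement falls between them.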
The statistical inference depends on how we define the error rate for the combined hypotheses testing. Let us suppose that there are
There are many methods for controlling
Now, since, under
Holm [2] gave a modified version of Bonferroni’s procedure which also controls the familywise error rate. Holm’s Bonferroni Procedure is the following: First rank all the
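As a rough illustration, the Bonferroni rule and Holm's step-down modification can be sketched in their usual textbook form. The notation here (`pvalues`, `alpha`, and the function names) is our own assumption and may differ from the chapter's.

```python
from typing import List


def bonferroni(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject H_i whenever p_i <= alpha / m (single-step rule)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def holm(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm's step-down rule: compare the k-th smallest p-value with
    alpha / (m - k + 1) and stop at the first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are accepted as well
    return reject


pvals = [0.010, 0.013, 0.030, 0.040]
print(bonferroni(pvals))  # [True, False, False, False]
print(holm(pvals))        # [True, True, False, False]
```

On these four hypothetical p-values, Holm's rule rejects one more hypothesis than Bonferroni's because its thresholds relax after each rejection.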
The above Bonferroni type procedures are not very satisfactory when
In Section 2, we review the FDR procedure and Bayesian procedures for two-sided alternatives. An extension to directional hypotheses is presented in Section 3, where we also discuss Bayesian procedures under skewed alternatives. In Section 4, the problem of directional hypotheses is considered by converting p-values to normally distributed test statistics.
2. False discovery rate (FDR), Benjamini and Hochberg’s (BH) procedure, and Bayesian procedures
For each of the hypothesis testing
Accept | Reject | Total | |
---|---|---|---|
Thus, the proportion of the false discoveries is
If, for example,
Thus, we are likely to make a higher number of discoveries under FDR approach than under FWER, since if a procedure controls FWER (≤
2.1. Benjamini and Hochberg’s procedure
Benjamini and Hochberg [1] proposed the following BH procedure which controls the FDR.
Let
Then, reject
This procedure controls
The above result was proven in [1], under the independence of the test statistics. Benjamini and Yekutieli [3] extended the result to positively correlated test statistics, and they also sharpened the BH procedure with new
where
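The BH step-up rule of this subsection can be sketched as follows, in its standard form: find the largest rank k whose ordered p-value is at most kq/m and reject the k smallest p-values. The names `benjamini_hochberg`, `pvalues`, and `q` are our assumptions for illustration, not the chapter's notation.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float], q: float = 0.05) -> List[bool]:
    """Step-up BH rule: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k * q / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k = rank  # keep the largest rank that passes
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


pvals = [0.010, 0.013, 0.030, 0.040]
print(benjamini_hochberg(pvals))  # [True, True, True, True]
```

With four p-values and a level of 0.05, Bonferroni's single cutoff 0.05/4 rejects only the smallest of these hypothetical p-values, whereas the BH rule rejects all four. This illustrates why controlling the FDR typically yields more discoveries than controlling the FWER.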
2.2. Bayesian procedures
Under Bayesian setting, we assume that
Under this setting, [4] developed a concept of local false discovery rate (fdr). If
The idea is that if
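To make the local false discovery rate concrete, the sketch below evaluates it in a two-group mixture: the null proportion times the null density, divided by the mixture density. The standard normal null, the normal alternative with standard deviation 3, and the null proportion 0.9 are purely illustrative assumptions, not values taken from the chapter.

```python
from statistics import NormalDist


def local_fdr(z: float, pi0: float,
              null: NormalDist = NormalDist(0.0, 1.0),
              alt: NormalDist = NormalDist(0.0, 3.0)) -> float:
    """Posterior probability that a case with statistic z is null,
    under the mixture pi0 * f0 + (1 - pi0) * f1."""
    f0 = null.pdf(z)
    f1 = alt.pdf(z)
    return pi0 * f0 / (pi0 * f0 + (1.0 - pi0) * f1)


# Hypothetical setting: 90% of the cases are null.
for z in (0.0, 2.0, 4.0):
    print(f"z = {z:3.1f}  local fdr = {local_fdr(z, pi0=0.9):.3f}")
```

The local fdr is close to 1 near the center of the null distribution and drops toward 0 in the tails, so a small value flags a case as a likely discovery.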
Storey [5] proposed a positive false discovery rate
where expectation is with respect to the distribution of (
for a procedure that rejects
3. Directional hypotheses testing
As described earlier, the null hypothesis
Directional hypotheses testing involves testing
Table 2 illustrates the number of cases possible when accepting
Accept | Select | Select | Total | |
---|---|---|---|---|
Total |
This is analogous to
Bansal and Miescke [9] considered a decision theoretic formulation to multiple hypotheses problems. The approach assumes parametric modeling. Suppose the model for the observed data
Let the loss function of a decision rule
where
Now, suppose under the Bayesian setting,
where
From now on, we focus on the “0-1” loss. The results can be easily extended to other loss functions. The “0-1” loss can be written as
where
3.1. The constrained Bayes rule
The Bayes procedure described earlier accommodates skewness in the prior, but it does not control any type of false discovery rate. In order to control a false discovery rate, we need to obtain a constrained Bayes rule that minimizes the posterior expected loss subject to a constraint on the false discovery rate.
The directional false discovery rate (6) is defined in a frequentist’s manner, in which expectation is with respect to
Here,
A constrained Bayes rule can be obtained by minimizing the posterior expected loss subject to the constraint that
Let
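One common way to implement a rule of this kind is sketched below; it is not necessarily identical to the rule of [9]. For each hypothesis, the direction with the larger posterior probability is chosen, the hypotheses are ranked by the posterior probability that this directional choice is wrong, and the largest set whose average error probability stays at or below the level is selected. The function name, the input format (a triple of posterior probabilities for the negative, null, and positive states), and the level `alpha` are all assumptions made for illustration.

```python
from typing import List, Tuple


def constrained_directional_rule(
    posteriors: List[Tuple[float, float, float]], alpha: float = 0.05
) -> List[int]:
    """Return -1 (negative direction), 0 (accept null) or +1 (positive
    direction) for each hypothesis, keeping the average posterior
    probability of a false directional discovery at or below alpha."""
    candidates = []
    for i, (p_minus, _p_zero, p_plus) in enumerate(posteriors):
        direction = -1 if p_minus >= p_plus else 1
        error = 1.0 - max(p_minus, p_plus)  # chance the chosen direction is wrong
        candidates.append((error, i, direction))
    candidates.sort()  # most confident directional calls first

    decisions = [0] * len(posteriors)
    running = 0.0
    for count, (error, i, direction) in enumerate(candidates, start=1):
        running += error
        if running / count > alpha:
            break  # the running average only grows from here on
        decisions[i] = direction
    return decisions


# Hypothetical posterior probabilities (negative, null, positive).
post = [(0.01, 0.02, 0.97), (0.90, 0.08, 0.02),
        (0.10, 0.80, 0.10), (0.02, 0.30, 0.68)]
print(constrained_directional_rule(post, alpha=0.10))  # [1, -1, 0, 0]
```

Because the error probabilities are sorted in increasing order, their running average is nondecreasing, so the loop can stop at the first violation of the constraint.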
3.2. Estimating mixture parameters
The above procedure requires estimation of the parameters (
where
and
Estimation of
4. Bayes rules by converting p -values to normally distributed test statistics
Let
with the proportionality constant [
In order to apply the Bayes procedure as discussed in Section 3, all we need are Eqs. (11) and (12). For computation details, see [9].
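The conversion that gives this section its title can be sketched with the inverse standard normal distribution function. The sketch assumes that each p-value is a one-sided (upper-tail) p-value, so that large positive statistics correspond to small p-values; the function name `p_to_z` and the clipping constant `eps` are ours.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and variance 1


def p_to_z(p: float, eps: float = 1e-12) -> float:
    """Map an upper-tail p-value to a z-score that is N(0, 1) under the
    null. The p-value is clipped away from 0 and 1 to keep z finite."""
    p = min(max(p, eps), 1.0 - eps)
    return _STD_NORMAL.inv_cdf(1.0 - p)


for p in (0.5, 0.025, 0.975):
    print(f"p = {p:5.3f}  ->  z = {p_to_z(p):+.3f}")
```

Under the null hypothesis a p-value is uniformly distributed, so the converted statistic is standard normal. Its sign then carries the directional information: p-values near 0 map to large positive z and p-values near 1 to large negative z.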
4.1. Skew-normal alternatives
In the above discussions, we assumed that
and is denoted by
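For reference, the skew-normal density of Azzalini [13] is twice the standard normal density times the standard normal distribution function evaluated at the shape parameter times the argument. The sketch below adds a location and a scale in the usual way; the parameter names `location`, `scale`, and `shape` are our assumptions and may not match the chapter's parametrization.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def skew_normal_pdf(x: float, location: float = 0.0,
                    scale: float = 1.0, shape: float = 0.0) -> float:
    """Skew-normal density (2 / scale) * phi(z) * Phi(shape * z),
    where z = (x - location) / scale."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    z = (x - location) / scale
    return 2.0 / scale * _STD.pdf(z) * _STD.cdf(shape * z)


# shape = 0 gives the ordinary normal density; shape > 0 skews to the right.
for shape in (0.0, 3.0):
    left = skew_normal_pdf(-1.0, shape=shape)
    right = skew_normal_pdf(1.0, shape=shape)
    print(f"shape = {shape}: f(-1) = {left:.4f}, f(1) = {right:.4f}")
```

A positive shape parameter moves mass to the right of the location and a negative one to the left, which is what allows the alternatives to be asymmetric.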
Under this formulation, we assume that
Now, similar to Eq. (11), it can be seen that
with proportionality constant [(1 −
and
The sets
where
To illustrate the above procedure and to compare it with the standard FDR procedure (BY) of [8], which selects the direction based on the sign of the test statistic, we consider the HIV data described in [14]. For a detailed analysis, see [12]. Here, we describe the analysis very briefly. The data consist of eight microarrays, four from cells of HIV-infected subjects and four from uninfected subjects, each with expression levels of 7680 genes. For each gene, we obtained a two-sample
The BY procedure resulted in cutoffs (−3.94, 3.94), which yielded 18 total discoveries, with two genes declared under-expressed and 16 over-expressed. For the constrained Bayes rule, we first used the EM algorithm to obtain the parameter estimates as
5. Concluding remarks
There are many different methods of testing multiple hypotheses. The methodology, however, depends on the criterion we choose. When the dimension of the multiple hypotheses is not very high, the familywise error rate (FWER) is an appropriate criterion, since it safeguards against making even one false discovery. However, when the dimension is very high, the FWER is not very useful; instead, the false discovery rate (FDR) is a better criterion. Although the FDR was originally defined as a frequentist concept, it can be reinterpreted in a Bayesian framework. The Bayesian framework brings many advantages. For example, a decision-theoretic formulation is easy to implement, directional hypotheses are easy to handle, and skewness in the alternatives is easy to accommodate. A drawback is that we need to make an assumption about the prior distributions under the alternatives. Some work has been done based on nonparametric priors; however, much more work is needed.
References
- 1. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B. 1995; 57 (1):289-300
- 2. Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics. 1979; 6 (2):65-70
- 3. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics. 2001; 29 (4):1165-1188
- 4. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001; 96 (456):1151-1160
- 5. Storey JD. A direct approach to false discovery rates. Journal of the Royal Statistical Society B. 2002; 64 (3):479-498
- 6. Storey JD. The positive false discovery rate: A Bayesian interpretation and the q value. The Annals of Statistics. 2003; 31 (6):2013-2035
- 7. Shaffer JP. Multiplicity, directional (Type III) errors, and the null hypothesis. Psychological Methods. 2002; 7 (3):356-369
- 8. Benjamini Y, Yekutieli D. False discovery rate controlling confidence intervals for selected parameters. Journal of the American Statistical Association. 2005:71-80
- 9. Bansal NK, Miescke KJ. A Bayesian decision theoretic approach to directional multiple hypotheses problems. Journal of Multivariate Analysis. 2013:205-215
- 10. Bansal NK, Jiang H, Pradeep P. A Bayesian methodology for detecting targeted genes under two related experiments. Statistics in Medicine. 2015; 34 (25):3362-3375
- 11. Efron B. Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association. 2007:93-103
- 12. Bansal NK, Hamedani GG, Maadooliat M. Testing multiple hypotheses with skewed alternatives. Biometrics. 2016; 72 (2):494-502
- 13. Azzalini A. A class of distributions which includes the normal ones. Scandinavian Journal of Statistics. 1985; 12 (2):171-178
- 14. van’t Wout AB, Lehrman GK, Mikheeva SA, O’Keeffe GC, Katze MG, Bumgarner RE, Mullins JI. Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+-T-cell lines. Journal of Virology. 2003; 77 (2):1392-1402