Hypothesis Testing for High-Dimensional Problems

For high-dimensional hypothesis problems, new approaches have emerged since the publication. The most promising of them uses a Bayesian approach. In this chapter, we review some of the past approaches, which are applicable only to low-dimensional hypothesis testing, and contrast them with the modern approaches to high-dimensional hypothesis testing. We review some of the new results based on Bayesian decision theory and show how the Bayesian approach can accommodate directional hypothesis testing and skewness in the alternatives. A real example of gene expression data is used to demonstrate a Bayesian decision-theoretic approach to directional hypothesis testing with skewed alternatives.


Introduction
In today's world, most statistical inference problems involve high-dimensional multiple hypothesis testing. Whenever we collect data, we collect data on multiple features, in some cases involving very high-dimensional variables. For example, gene expression data consist of expression levels for thousands of genes; image data consist of measurements on many voxels. The statistical analysis of these types of data involves multiple hypotheses testing (MHT). It is well known that univariate methods cannot be applied directly to test hypotheses on multiple features simultaneously. The reason is that the error rates of the univariate analyses compound under MHT, and as a result the actual error rate can be very high.
To understand the main issue of multiplicity, consider the following example. Suppose there are, say, 100 misspelled words in a book, and each of these words occurs in 5% of the pages. You pick a page at random. For each misspelled word, the probability of finding that word on the page is 0.05. However, the probability that you will find at least one of the 100 misspelled words is much higher. If the words were placed independently, the probability of finding at least one misspelled word would be $1 - (0.95)^{100} \approx 0.994$. If the placements of the misspelled words were positively dependent, the probability would be lower than this. For example, in the extreme case of dependence in which they all occur together, the probability would be 0.05. The same phenomenon occurs in MHT. Statistical inference based on the error rate of individual hypothesis tests can lead to a very high error rate for the combined hypotheses. Thus, for MHT, an adjustment of the error rate needs to be made. Note that the adjustment may depend on the dependence structure, but because of the complexity of the dependence structure in high dimensions, dependence is usually ignored in the literature [1].
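The arithmetic of the misspelled-words example can be checked with a few lines of Python. This is only a sketch; the function names are ours and are not part of any library.

```python
# Probability of seeing at least one of m misspelled words on a random page,
# when each word appears on a fraction `rate` of the pages.

def prob_at_least_one_independent(m: int, rate: float) -> float:
    """P(at least one of m independent events), each with probability `rate`."""
    return 1.0 - (1.0 - rate) ** m


def prob_at_least_one_together(m: int, rate: float) -> float:
    """Extreme positive dependence: all m words always occur on the same pages."""
    return rate


m, rate = 100, 0.05
p_indep = prob_at_least_one_independent(m, rate)
p_dep = prob_at_least_one_together(m, rate)
print(f"independent placements: {p_indep:.4f}")
print(f"all words together:     {p_dep:.4f}")
```

Any intermediate degree of positive dependence gives a value between these two extremes, which is why an adjustment that ignores dependence tends to be conservative.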
The statistical inference depends on how we define the error rate for the combined hypothesis testing. Suppose that there are $m$ hypothesis testing problems $H_0^i$ vs. $H_a^i$, $i = 1, 2, \ldots, m$. If we do not want to make even one false discovery, then we should control the familywise error rate (FWER), which is defined as
$$\mathrm{FWER} = \Pr\{\text{at least one true null hypothesis is rejected}\}. \tag{1}$$
There are many methods for controlling FWER $\le \alpha_F$ (e.g., $\alpha_F = 0.05$). The simplest is Bonferroni's procedure. Let $T_i$ be the test statistic for testing $H_0^i$ vs. $H_a^i$, with corresponding p-value $p_i$. Then Bonferroni's procedure rejects $H_0^i$ if $p_i < \alpha_F/m$. To see that this controls the FWER, let $I_0$ be the set of all $i$ for which $H_0^i$ is true. A false discovery occurs when $p_j < \alpha_F/m$ for at least one $j \in I_0$. Then, using Boole's inequality, we have, from Eq. (1),
$$\mathrm{FWER} = \Pr\Big\{\bigcup_{j \in I_0} \{p_j < \alpha_F/m\}\Big\} \le \sum_{j \in I_0} \Pr\{p_j < \alpha_F/m\}. \tag{2}$$
Now, since $p_i \sim U(0,1)$ under $H_0^i$, $\Pr\{p_i < \alpha_F/m\} = \alpha_F/m$. Then, assuming that $I_0$ has $m_0$ elements, we have, from Eq. (2),
$$\mathrm{FWER} \le \frac{m_0}{m}\,\alpha_F \le \alpha_F.$$
Holm gave a modified version of Bonferroni's procedure which also controls the familywise error rate. Holm's procedure is the following: first rank all the p-values, $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$, and let $H_0^{(1)}, H_0^{(2)}, \ldots, H_0^{(m)}$ be their associated null hypotheses. Let $l$ be the smallest index such that $p_{(l)} > \alpha_F/(m - l + 1)$. Then reject only the null hypotheses $H_0^{(1)}, \ldots, H_0^{(l-1)}$. Note that the selected hypotheses have p-values with $p_{(1)} \le \alpha_F/m$, $p_{(2)} \le \alpha_F/(m-1)$, …, $p_{(l-1)} \le \alpha_F/(m - l + 2)$. Holm's procedure is thus at least as powerful as Bonferroni's procedure, since any hypothesis rejected under Bonferroni's procedure is also rejected under Holm's procedure.
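The two FWER-controlling rules described above can be sketched as follows. The function names and the example p-values are hypothetical; the thresholds are those stated in the text ($\alpha_F/m$ for Bonferroni, $\alpha_F/(m-l+1)$ at the $l$th ordered p-value for Holm).

```python
from typing import List


def bonferroni(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject H_0^i when p_i < alpha / m."""
    m = len(pvalues)
    return [p < alpha / m for p in pvalues]


def holm(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm's step-down rule: walk through the ordered p-values and stop at
    the first rank l with p_(l) > alpha / (m - l + 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order, start=1):  # rank plays the role of l
        if pvalues[i] > alpha / (m - rank + 1):
            break  # this hypothesis and all those with larger p-values are kept
        reject[i] = True
    return reject


pvals = [0.001, 0.012, 0.014, 0.20, 0.04]  # hypothetical p-values
print("Bonferroni:", bonferroni(pvals))
print("Holm:      ", holm(pvals))
```

With these inputs Holm's rule rejects three hypotheses while Bonferroni's rejects one, which illustrates the power comparison made above.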
The above Bonferroni-type procedures are not very satisfactory when $m$ is very high. Suppose $m = 10{,}000$ (which is actually not very high for most high-dimensional problems), and suppose we want to control the FWER at $\alpha_F = 0.05$. Then, for Holm's procedure, the smallest p-value has to be lower than $0.05/10{,}000 = 0.000005$ in order to reject at least one hypothesis, which may be very hard to achieve. The problem is not really with Holm's procedure; the problem is with the use of the FWER as an error rate. For a high-dimensional problem, it is unrealistic to seek a procedure that makes no false discovery at all. Benjamini and Hochberg [1] proposed a new error rate called the false discovery rate (FDR) and a procedure that works much better for high-dimensional MHT.
In Section 2, we review the FDR procedure and Bayesian procedures for two-sided alternatives. An extension to directional hypotheses is presented in Section 3, where we also discuss Bayesian procedures under skewed alternatives. In Section 4, the problem of directional hypotheses is considered by converting p-values to normally distributed test statistics. Section 4 also discusses a Bayes procedure under skew-normal alternatives and an application to real gene expression data. Some concluding remarks are made in Section 5.

False discovery rate (FDR), Benjamini and Hochberg's (BH) procedure, and Bayesian procedures
For each testing problem $H_0^i$ vs. $H_a^i$, suppose a statistical procedure either rejects the null hypothesis $H_0^i$ or fails to reject $H_0^i$. For the sake of simplicity, we equate failing to reject $H_0^i$ with accepting the null $H_0^i$. However, when the sample size is small, it would be unwise to conclude that $H_0^i$ is accepted. From now on, rejections of the null will be called discoveries. Table 1 shows the possible outcomes of a procedure, where, for example, $V$ is the total number of discoveries and, among them, $V_0$ is the number of false discoveries.
Thus, the proportion of false discoveries is $V_0/\max(V,1)$. The FDR is defined as the expected proportion of false discoveries, that is,
$$\mathrm{FDR} = E\left[\frac{V_0}{\max(V,1)}\right].$$
If, for example, FDR = 0.05, then we can expect on average 5% of all discoveries to be false. In other words, under repeated experiments, on average 5% of the discoveries are false (in a frequentist's sense). Note that FDR ≤ FWER = $P(V_0 \ge 1)$, as the following inequality shows:
$$\mathrm{FDR} = E\left[\frac{V_0}{\max(V,1)}\right] \le E\left[I(V_0 \ge 1)\right] = P(V_0 \ge 1) = \mathrm{FWER},$$
which holds because $V_0 \le V$, so the proportion is at most 1 and equals 0 when $V_0 = 0$. Thus, we are likely to make a higher number of discoveries under the FDR approach than under the FWER approach, since a procedure that controls the FWER ($\le \alpha$) also controls the FDR ($\le \alpha$), but not vice versa.
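The relation FDR ≤ FWER can also be seen numerically. The following Monte Carlo sketch estimates both quantities for the simple rule "reject when $p_i$ is below a fixed threshold". The model for the p-values of false nulls ($U^4$ with $U$ uniform) and all parameter values are assumptions made only for illustration.

```python
import random


def simulate(m=200, m0=180, threshold=0.01, n_rep=2000, seed=1):
    """Monte Carlo estimates of FDR = E[V0 / max(V, 1)] and FWER = P(V0 >= 1)
    for the rule 'reject when p_i < threshold'."""
    rng = random.Random(seed)
    fdp_sum, any_false = 0.0, 0
    for _ in range(n_rep):
        # True nulls: p-values are uniform on (0, 1).
        v0 = sum(rng.random() < threshold for _ in range(m0))
        # False nulls (assumed model): p = U**4 piles mass near zero.
        v1 = sum(rng.random() ** 4 < threshold for _ in range(m - m0))
        v = v0 + v1
        fdp_sum += v0 / max(v, 1)  # false discovery proportion in this replicate
        any_false += v0 >= 1       # indicator of at least one false discovery
    return fdp_sum / n_rep, any_false / n_rep


fdr_hat, fwer_hat = simulate()
print(f"estimated FDR  = {fdr_hat:.3f}")
print(f"estimated FWER = {fwer_hat:.3f}")
```

Because the false discovery proportion never exceeds the indicator of at least one false discovery in any replicate, the estimated FDR cannot exceed the estimated FWER, whatever the seed.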

Benjamini and Hochberg's procedure
Benjamini and Hochberg [1] proposed the following BH procedure which controls the FDR.
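Let $p_i$ be the p-value for the $i$th hypothesis under a test statistic $T_i$. Let $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$ be the ordered p-values, with the corresponding null hypotheses denoted by $H_0^{(1)}, \ldots, H_0^{(m)}$. Let
$$i_0 = \max\left\{ i : p_{(i)} \le \frac{i}{m}\,\alpha \right\},$$
and reject $H_0^{(1)}, \ldots, H_0^{(i_0)}$; if no such $i$ exists, no hypothesis is rejected. This procedure controls FDR $\le \frac{m_0}{m}\alpha \le \alpha$. Since $m_0$ is unknown, the sharper bound $\frac{m_0}{m}\alpha$ is not directly useful. If $m_0$ can be estimated reliably, a better bound is possible.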
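A minimal sketch of the BH step-up rule, in its standard form (find the largest rank $i$ with $p_{(i)} \le (i/m)\alpha$ and reject the $i$ smallest p-values), is given below. The function name and the example p-values are hypothetical.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """Step-up rule: find the largest rank i with p_(i) <= (i / m) * alpha
    and reject the hypotheses with the i smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    i0 = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            i0 = rank  # keep the largest qualifying rank
    reject = [False] * m
    for i in order[:i0]:
        reject[i] = True
    return reject


pvals = [0.001, 0.025, 0.028, 0.20, 0.035]  # hypothetical p-values
print(benjamini_hochberg(pvals))
```

In this example the second smallest p-value (0.025) exceeds its own threshold (0.02) but is still rejected, because a larger ordered p-value meets its threshold. This step-up behavior is what distinguishes BH from a step-down rule such as Holm's.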
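The above result was proven in [1] under independence of the test statistics. Benjamini and Yekutieli [3] extended the result to positively correlated test statistics, and they also modified the BH procedure, for control of the FDR under arbitrary dependence, with a new $i_0$ defined as
$$i_0 = \max\left\{ i : p_{(i)} \le \frac{i}{m \sum_{j=1}^{m} 1/j}\,\alpha \right\}.$$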

Bayesian procedures
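Under the Bayesian setting, we assume that $H_0^i$ and $H_a^i$, $i = 1, 2, \ldots, m$, are generated probabilistically with
$$P(H_0^i) = p, \qquad P(H_a^i) = 1 - p, \qquad T_i \mid H_0^i \sim f_0(t), \qquad T_i \mid H_a^i \sim f_1(t).$$
Under this setting, [4] developed the concept of the local false discovery rate (fdr). If $f(t) = p f_0(t) + (1-p) f_1(t)$ denotes the marginal density of $T_i$, then
$$\mathrm{fdr}(t) = P(H_0^i \mid T_i = t) = \frac{p\, f_0(t)}{f(t)}. \tag{3}$$
The idea is that if $T_i \in [t, t + \delta t]$, where $\delta t \to 0$, then fdr(t) represents the proportion of times that $H_0^i$ will be true. If $t$ is very high, then fdr(t) will be very small, indicating that the probability of $H_0^i$ is very small (i.e., the false discovery rate will be very small). In Eq. (3), $p$ and $f(t)$ are unknown, but they can be estimated (see [4]).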
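To make the local false discovery rate concrete, the sketch below evaluates the usual two-groups form $\mathrm{fdr}(t) = p f_0(t)/\{p f_0(t) + (1-p) f_1(t)\}$. The normal densities, the alternative mean, and the value of $p$ are assumptions chosen only for illustration; in practice $p$ and $f$ are estimated from the data, as noted above.

```python
import math


def normal_pdf(t: float, mean: float = 0.0, sd: float = 1.0) -> float:
    z = (t - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(t: float, p: float, alt_mean: float) -> float:
    """fdr(t) = p * f0(t) / f(t) with f = p * f0 + (1 - p) * f1.
    Assumed densities: f0 = N(0, 1) under the null, f1 = N(alt_mean, 1)."""
    f0 = normal_pdf(t)
    f1 = normal_pdf(t, mean=alt_mean)
    return p * f0 / (p * f0 + (1.0 - p) * f1)


for t in (0.0, 2.0, 4.0):
    print(f"t = {t:.1f}  fdr(t) = {local_fdr(t, p=0.9, alt_mean=3.0):.4f}")
```

As $t$ grows, the alternative density dominates and fdr(t) decreases toward zero, which matches the interpretation given in the text.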
Storey [5] proposed the positive false discovery rate
$$\mathrm{pFDR} = E\left[\frac{V_0}{V} \,\Big|\, V > 0\right],$$
where the expectation is with respect to the distribution of $(T_i, \theta_i)$, $i = 1, 2, \ldots, m$. Under the assumption that $T_1, T_2, \ldots, T_m$ are independently and identically distributed, [6] proved that
$$\mathrm{pFDR}(\Gamma) = P(H_0 \mid T \in \Gamma)$$
for a procedure that rejects $H_0^i$ when $T_i \in \Gamma$. Based on this, the q-value for multiple hypotheses (analogous to the p-value for a single hypothesis) is defined as the smallest value of $\mathrm{pFDR}(\Gamma)$ over regions $\Gamma$ such that the observed $T_i = t_i \in \Gamma$; see [6]. In most cases, $q\text{-value}(t_i) = P(H_0 \mid T_i > t_i)$. This gives a procedure for multiple hypotheses that rejects $H_0^i$ if $q\text{-value}(t_i) < \alpha$.
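For rejection regions of the form $(t, \infty)$, the posterior probability $P(H_0 \mid T > t)$ can be computed directly once a model is fixed. The sketch below assumes, purely for illustration, that $T \sim N(0,1)$ under the null and $T \sim N(\mu_1, 1)$ under the alternative, with null probability $p$; none of these choices come from the data analyzed in this chapter.

```python
import math


def normal_sf(t: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """Upper tail probability P(T > t) for T ~ N(mean, sd^2)."""
    return 0.5 * math.erfc((t - mean) / (sd * math.sqrt(2.0)))


def q_value(t: float, p: float, alt_mean: float) -> float:
    """P(H0 | T > t) for rejection regions of the form (t, infinity).
    Assumed model: T ~ N(0, 1) under H0, T ~ N(alt_mean, 1) otherwise."""
    null_tail = p * normal_sf(t)
    alt_tail = (1.0 - p) * normal_sf(t, mean=alt_mean)
    return null_tail / (null_tail + alt_tail)


for t in (1.0, 2.0, 3.0):
    print(f"t = {t:.1f}  q-value = {q_value(t, p=0.9, alt_mean=3.0):.4f}")
```

Rejecting whenever this quantity falls below $\alpha$ gives the q-value procedure described above under the assumed model.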

Directional hypotheses testing
As described earlier, the null hypothesis $H_0^i$ is either accepted or rejected. In most cases, however, rejection of the null hypothesis is not sufficient. After rejecting $H_0^i$, finding the direction of the alternative may also be important. A detailed discussion of directional hypotheses can be found in [7].
Directional hypothesis testing involves testing $H_0^i$ against the directional hypotheses $H_-^i$ and $H_+^i$, and the objective is to obtain a selection region $\{T_i \in \Gamma_-\}$ for selecting $H_-^i$ and a selection region $\{T_i \in \Gamma_+\}$ for selecting $H_+^i$. In other words, $H_0^i$ will be rejected if $T_i \in \Gamma_-$ or $T_i \in \Gamma_+$, and the direction $H_-^i$ or $H_+^i$ is determined according to whether $T_i \in \Gamma_-$ or $T_i \in \Gamma_+$, respectively. Analogous to Table 1, Table 2 illustrates the possible outcomes when accepting $H_0$, selecting $H_-$, or selecting $H_+$. For example, out of the $V$ times that $H_-$ is selected, errors are made $V_0$ times when in fact $H_0$ is true and $V_+$ times when in fact $H_+$ is true. In other words, when selecting $H_-$, not only is $H_0$ falsely rejected $V_0$ times, but the direction is also falsely selected $V_+$ times. This leads to the concept of the directional false discovery rate (DFDR), Eq. (6), which counts both kinds of error and is analogous to the FDR for two-sided alternatives. For most cases, [8] showed that DFDR-controlling procedures for directional hypotheses can be treated as FDR-controlling procedures for two-sided multiple hypotheses, with the direction determined by the sign of the test statistic.
Bansal and Miescke [9] considered a decision-theoretic formulation of multiple hypotheses problems. The approach assumes parametric modeling. Suppose the model for the observed data $x$ is represented by $P(x; \theta, \eta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_m)'$ is the parameter vector of interest and $\eta$ is a nuisance parameter. The problem of interest is to test
$$H_0^i: \theta_i = 0 \quad \text{vs.} \quad H_-^i: \theta_i < 0 \ \text{ or } \ H_+^i: \theta_i > 0, \qquad i = 1, 2, \ldots, m. \tag{7}$$
Let the loss function of a decision rule $d(x) = (d_1(x), d_2(x), \ldots, d_m(x))$ be given by
$$L(\theta, d(x)) = \sum_{i=1}^{m} l_i(\theta, d_i(x)),$$
where $l_i(\theta, d_i(x))$ is the individual loss of $d_i$. Here, $d_i \in \{-1, 0, 1\}$, where $d_i = 0$, $d_i = -1$, and $d_i = 1$ mean accepting $H_0^i$, selecting $H_-^i$, and selecting $H_+^i$, respectively. Note that for the "0-1" loss, that is, when $l_i = 0$ for a correct decision and $l_i = 1$ for an incorrect decision, $L$ is the total number of incorrect decisions. Thus, minimizing $E[L(\theta, d(X))]$ for the "0-1" loss amounts to minimizing the expected number of incorrect decisions. Now suppose that, under the Bayesian setting, $\theta_i$, $i = 1, 2, \ldots, m$, are generated from
$$\pi(\theta) = p_-\, \pi_-(\theta) + p_0\, I(\theta = 0) + p_+\, \pi_+(\theta), \tag{9}$$
where $p_- + p_0 + p_+ = 1$, $\pi_-$ is the prior density over $(-\infty, 0)$, and $\pi_+$ is the prior density over $(0, \infty)$. A special case of prior (9) is $\pi_-(\theta) = \pi_+(-\theta)$. In this case, $p_-$ and $p_+$ reflect the skewness in the alternative hypotheses. For example, if $p_- = p_+$, then we have a symmetric case, in which selecting $H_-$ or $H_+$ after rejecting $H_0$ based on the sign of the test statistic makes sense. On the other hand, $p_- < p_+$ reflects that more of the $\theta_i$s are positive than negative. For many gene expression data analyses, this is a useful case, since over-expressed genes may occur more frequently than under-expressed genes as a result of gene mutation (naturally or as a result of external factors). For specific examples, see [9,10].
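From now on, we focus on the "0-1" loss; the results can be easily extended to other loss functions. The "0-1" loss can be written as
$$L(\theta, d(x)) = \sum_{i=1}^{m} I\big(d_i(x) \ne v_{\theta_i}\big),$$
where $v_{\theta_i} \in \{-1, 0, 1\}$ is an indicator variable indicating whether $\theta_i < 0$, $\theta_i = 0$, or $\theta_i > 0$, respectively. It is easy to see that minimizing the posterior expected loss yields the selection rule that selects, for each $i$, the hypothesis among $H_-^i$, $H_0^i$, and $H_+^i$ with the largest posterior probability.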
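Under the "0-1" loss, the posterior expected loss of a decision for coordinate $i$ is one minus the posterior probability of the hypothesis chosen, so the Bayes decision picks the hypothesis with the largest posterior probability. A minimal sketch follows; the posterior probabilities are hypothetical inputs, and how they are computed depends on the model and prior.

```python
from typing import List, Sequence

LABELS = (-1, 0, 1)  # select H-, accept H0, select H+


def posterior_expected_loss(post: Sequence[float], d: int) -> float:
    """Posterior expected 0-1 loss of decision d: the posterior probability
    that the truth is not the hypothesis labelled d.
    `post` holds the posterior probabilities of (H-, H0, H+)."""
    return 1.0 - post[LABELS.index(d)]


def bayes_decision(post: Sequence[float]) -> int:
    """Decision in {-1, 0, 1} with the smallest posterior expected loss."""
    return min(LABELS, key=lambda d: posterior_expected_loss(post, d))


# Hypothetical posterior probabilities (H-, H0, H+) for three coordinates.
posteriors = [(0.05, 0.90, 0.05), (0.02, 0.28, 0.70), (0.60, 0.35, 0.05)]
decisions: List[int] = [bayes_decision(q) for q in posteriors]
total_loss = sum(posterior_expected_loss(q, d) for q, d in zip(posteriors, decisions))
print(decisions, round(total_loss, 2))
```

Because the loss is additive over coordinates, the decisions can be made one coordinate at a time, as the loop above does.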

The constrained Bayes rule
The Bayes procedure described earlier accommodates skewness in the prior, but it does not control any type of false discovery rate. In order to control a false discovery rate, we need a constrained Bayes rule that minimizes the posterior expected loss subject to a constraint on the false discovery rate.
The directional false discovery rate (6) is defined in a frequentist's manner, in which the expectation is with respect to $X \mid \theta$. Let us denote Eq. (6) by BDFDR when the expectation is taken with respect to $X \mid \theta$ and then a further expectation is taken with respect to $\theta$. We define the posterior version of Eq. (6) as PDFDR, in which the expectation is taken with respect to the posterior distribution of $\theta \mid X = x$. It can be shown that the PDFDR depends on the decision rule only through the posterior probabilities of the selected hypotheses and the sizes $|D_-|$ and $|D_+|$ of the sets $D_-$ and $D_+$ of indices for which $H_-^i$ and $H_+^i$ are selected; see [9].
A constrained Bayes rule can be obtained by minimizing the posterior expected loss subject to the constraint PDFDR $\le \alpha$. There are several ways to carry out the constrained minimization. We present here the approach given in [9]. Consider the sets $D_-^B$ and $D_+^B$ of indices for which the unconstrained Bayes rule selects $H_-^i$ and $H_+^i$, respectively. Rank all $\xi_i$, $i \in D_-^B \cup D_+^B$, from the lowest to the highest, and denote the ranked values by $\xi_{(1)} \le \xi_{(2)} \le \cdots$; the definition of $\xi_i$ and the remaining steps are given in [9].
The posterior probabilities involved are determined up to the proportionality constant $[p_- T_-(x_i) + p_+ T_+(x_i) + p_0]^{-1}$. Also, $T_-(x_i) = T_+(-x_i)$. In order to apply the Bayes procedure discussed in Section 3, all we need are Eqs. (11) and (12). For computational details, see [9].
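The idea of restricting a Bayes rule so that a posterior false discovery rate stays below $\alpha$ can be illustrated with a generic thresholding sketch. This is not the construction of [9]: we assume here that the posterior false discovery rate of a set of selections is the average posterior probability that a selection is wrong, and we simply keep the largest set of the most certain selections that satisfies the constraint. All names and inputs are hypothetical.

```python
from typing import List, Tuple


def constrained_selection(decisions: List[int], error_probs: List[float],
                          alpha: float = 0.05) -> Tuple[List[int], float]:
    """decisions[i] in {-1, 0, 1} comes from an unconstrained rule, and
    error_probs[i] is the posterior probability that decision i is wrong.
    Keeps the largest set of non-null selections whose average error
    probability (the assumed posterior FDR) is at most alpha; every other
    coordinate is set to 0 (accept the null)."""
    selected = sorted((i for i, d in enumerate(decisions) if d != 0),
                      key=lambda i: error_probs[i])
    kept, running = 0, 0.0
    for k, i in enumerate(selected, start=1):
        running += error_probs[i]
        # The running mean of ascending values never decreases, so the last
        # k that satisfies the bound gives the largest admissible set.
        if running / k <= alpha:
            kept = k
    final = [0] * len(decisions)
    for i in selected[:kept]:
        final[i] = decisions[i]
    pdfdr = sum(error_probs[i] for i in selected[:kept]) / max(kept, 1)
    return final, pdfdr


decisions = [1, -1, 0, 1, -1]                 # hypothetical unconstrained decisions
error_probs = [0.01, 0.03, 0.40, 0.08, 0.30]  # hypothetical posterior error probabilities
final, pdfdr = constrained_selection(decisions, error_probs, alpha=0.05)
print(final, round(pdfdr, 3))
```

In this example the least certain selection is dropped, and the remaining three selections have an average posterior error probability below the bound.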

Skew-normal alternatives
In the above discussion, we assumed that the $\theta_i$s are generated from a distribution with pdf (9). The case in which the $\theta_i$s are generated from a skew-normal distribution under the alternative hypotheses was considered in [12]. The skew-normal distribution was first introduced in [13]. It has the important property that if $(\xi_1, \xi_2)$ is bivariate normal with mean 0, then the distribution of $\xi_1 \mid \xi_2 > 0$ is skew-normal. Its pdf is given by
$$f(x) = \frac{2}{\sigma_1}\, \phi\!\left(\frac{x}{\sigma_1}\right) \Phi\!\left(\frac{\lambda x}{\sigma_1}\right),$$
where $\phi$ and $\Phi$ denote the standard normal pdf and cdf, and the distribution is denoted by $SN(0, \sigma_1, \lambda)$. Here, $\lambda$ is a skewness parameter. If $\lambda = 0$, then this distribution is $N(0, \sigma_1^2)$. The implication of this result is the following: suppose that within a normal system an outcome follows a normal distribution; if a correlated factor starts exerting a positive effect, then the outcome variable will start following a skew-normal distribution. For example, consider RNA experiments and assume that the genes are in a normal state. Suppose a gene mutation occurs at a later stage and starts exerting a positive effect on the affected genes. In this case, based on the above property of the skew-normal distribution, we can assume that the expressions of the affected genes will follow a skew-normal distribution.
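The skew-normal density is easy to evaluate with the standard library. The sketch below assumes the usual parametrization with location 0, scale $\sigma_1$ (called `scale` in the code), and skewness $\lambda$, and checks numerically that it integrates to one and puts more mass on the positive side when $\lambda > 0$. The parameter values are illustrative only.

```python
import math


def phi(z: float) -> float:
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z: float) -> float:
    """Standard normal cdf."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def skew_normal_pdf(x: float, scale: float, lam: float) -> float:
    """Density (2 / scale) * phi(x / scale) * Phi(lam * x / scale)."""
    z = x / scale
    return 2.0 / scale * phi(z) * Phi(lam * z)


def integrate(f, a: float, b: float, n: int = 20000) -> float:
    """Trapezoidal rule on [a, b] with n sub-intervals."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))


scale, lam = 1.54, 0.22  # illustrative values
total = integrate(lambda x: skew_normal_pdf(x, scale, lam), -20.0, 20.0)
positive = integrate(lambda x: skew_normal_pdf(x, scale, lam), 0.0, 20.0)
print(f"total mass = {total:.6f}, mass on (0, inf) = {positive:.4f}")
```

Setting `lam = 0.0` reduces the density to the normal density with standard deviation `scale`, in line with the remark above.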
Under this formulation, we assume that $\theta_i$, $i = 1, 2, \ldots, m$, are generated from
$$\pi(\theta) = p\, I(\theta = 0) + (1 - p)\, SN(0, \sigma_1, \lambda).$$
Now, the analogs of Eq. (11) can be derived under this prior. The sets $D_-^B$ and $D_+^B$ can be written as
$$D_-^B = \{ i : x_i < -c_1 \}, \qquad D_+^B = \{ i : x_i > c_2 \}, \tag{13}$$
where $c_1 > 0$ and $c_2 > 0$ are determined as shown in Figure 1 by considering the points of intersection of $y = p/(1-p)$ with $y = T_-(x)$ and of $y = p/(1-p)$ with $y = T_+(x)$, respectively. Note that when $\lambda > 0$, the intersection point Q (as shown in the figure) will be to the left of $x = 0$, and when $\lambda < 0$, Q will be to the right of $x = 0$. Thus, when $\lambda > 0$, $c_1 > c_2$, and the opposite is true when $\lambda < 0$. When $\lambda = 0$, $T_-(x) = T_+(-x)$ and thus $c_1 = c_2$. If $\lambda \to \infty$, then $T_-(x) \to 0$ and thus $D_-^B$ is an empty set, which is equivalent to a one-tailed test. As discussed in Section 3, the procedure based on Eq. (13) by itself does not control BDFDR. However, the selection regions can be shrunk further by adjusting $c_1$ and $c_2$ so that the resulting procedure achieves BDFDR $\le \alpha$; see [12] for details.
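The effect of the skewness parameter on the two cutoffs can be explored numerically. The sketch below rests on several assumptions that are ours and are not taken from [12]: the observation is $X \mid \theta \sim N(\theta, \sigma^2)$ with the null centered at 0, the functions $T_-$ and $T_+$ are taken to be the integrals of the likelihood ratio $f(x \mid \theta)/f(x \mid 0)$ against the skew-normal density over $\theta < 0$ and $\theta > 0$, and each cutoff is found by bisection as the point where the corresponding function equals $p/(1-p)$. The parameter values are illustrative.

```python
import math


def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def sn_pdf(theta, scale, lam):
    """Skew-normal density with location 0, scale `scale`, skewness `lam`."""
    z = theta / scale
    return 2.0 / scale * phi(z) * Phi(lam * z)


def t_stat(c, sign, sigma, scale, lam, n=2000):
    """Trapezoidal approximation of the integral over u > 0 of
    [f(x | theta) / f(x | 0)] * sn_pdf(theta), with theta = sign * u and
    x = sign * c. sign = +1 gives T_+(c); sign = -1 gives T_-(-c)."""
    upper = 12.0 * scale
    h = upper / n
    total = 0.0
    for k in range(n + 1):
        u = k * h
        weight = 0.5 if k in (0, n) else 1.0
        ratio = math.exp((c * u - 0.5 * u * u) / sigma ** 2)  # likelihood ratio
        total += weight * ratio * sn_pdf(sign * u, scale, lam)
    return h * total


def cutoff(sign, p, sigma, scale, lam, lo=0.0, hi=10.0, iters=50):
    """Bisection for c in (lo, hi) with t_stat(c, sign, ...) = p / (1 - p)."""
    target = p / (1.0 - p)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if t_stat(mid, sign, sigma, scale, lam) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p, sigma, scale, lam = 0.9, 0.75, 1.54, 0.22  # illustrative values
c1 = cutoff(-1, p, sigma, scale, lam)  # H- is selected when x < -c1
c2 = cutoff(+1, p, sigma, scale, lam)  # H+ is selected when x > c2
print(f"c1 = {c1:.3f}, c2 = {c2:.3f}")
```

With a positive skewness parameter the prior puts less mass on negative values, so a more extreme negative observation is needed before $H_-$ is preferred to the null. These cutoffs describe the unconstrained rule only; controlling BDFDR requires the further adjustment described in [12].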
To illustrate the above procedure and to compare it with the standard FDR procedure (BY) of [8], which selects the direction based on the sign of the test statistic, we consider the HIV data described in [14]. For a detailed analysis, see [12]; here, we describe the analysis only briefly. The data consist of eight microarrays, four from cells of HIV-infected subjects and four from uninfected subjects, each with expression levels of 7680 genes. For each gene, we obtained a two-sample t-statistic comparing the infected with the uninfected subjects, which was then transformed to a z-value, $z_i = \Phi^{-1}\{F_6(t_i)\}$. Here, $F_6(\cdot)$ denotes the cumulative distribution function (cdf) of the t-distribution with six degrees of freedom. Figure 2 shows the histogram of the z-values with a skew-normal fit. Although the theoretical null distribution of $Z_i$ is $N(0,1)$, we use the null distribution $N(-0.11, 0.75^2)$, as suggested in [11]. Thus, we formulate our problem as testing hypotheses (7) with test statistics $Z_i \sim N(-0.11 + \theta_i, 0.75^2)$.
The BY procedure resulted in cutoffs $(-3.94, 3.94)$, which gave 18 discoveries in total, with two genes declared under-expressed and 16 over-expressed. For the constrained Bayes rule, we first used the EM algorithm to obtain the parameter estimates $\hat{p} = 0.9$, $\hat{\sigma} = 0.79$, $\hat{\sigma}_1 = 1.54$, and $\hat{\lambda} = 0.22$. The Bayes procedure ended up with cutoff points $(-2.82, 2.70)$ and a total of 86 discoveries (23 under-expressed genes and 63 over-expressed genes). Note that the number of discoveries by the Bayes rule is much higher than by the BY procedure.

Concluding remarks
There are many different methods of testing multiple hypotheses. The methodology, however, depends on the criterion we choose. When the dimension of the multiple hypotheses is not very high, the familywise error rate (FWER) is an appropriate criterion, which safeguards against making even one false discovery. However, when the dimension is very high, the FWER is not very useful; instead, the false discovery rate (FDR) is a good criterion. Although the FDR was originally defined as a frequentist concept, it can be re-interpreted in a Bayesian framework. The Bayesian framework brings many advantages. For example, a decision-theoretic formulation is easy to implement, directional hypotheses are easy to handle,
Hypothesis Testing for High-Dimensional Problems http://dx.doi.org/10.5772/intechopen.70210