Open access peer-reviewed chapter

Hypothesis Testing for High-Dimensional Problems

By Naveen K. Bansal

Submitted: December 12th 2016Reviewed: June 26th 2017Published: November 2nd 2017

DOI: 10.5772/intechopen.70210

Abstract

For high-dimensional hypothesis testing problems, new approaches have emerged in recent years. The most promising of them use the Bayesian approach. In this chapter, we review some of the past approaches, which are applicable only to low-dimensional hypothesis testing, and contrast them with modern approaches to high-dimensional hypothesis testing. We review some of the new results based on Bayesian decision theory and show how the Bayesian approach can accommodate directional hypothesis testing and skewness in the alternatives. A real example of gene expression data is used to demonstrate a Bayesian decision-theoretic approach to directional hypothesis testing with skewed alternatives.

Keywords

• multiple directional hypotheses
• false discovery rate
• familywise error rate
• gene expression
• skew-normal distribution

1. Introduction

In today’s world, most statistical inference problems involve high-dimensional multiple hypothesis testing. Whenever we collect data, we collect it on multiple features, in some cases involving very high-dimensional variables. For example, gene expression data consist of expression levels on thousands of genes; image data consist of intensities on multiple voxels. The statistical analysis of these types of data involves multiple hypothesis testing (MHT). It is well known that univariate methods cannot be applied to simultaneously test hypotheses on multiple features. The reason is that the error rates of the individual univariate analyses compound under MHT, and as a result the actual error rate can be very high. To understand the main issue of multiplicity, consider the following example. Suppose there are, say, 100 misspelled words in a book, and each of these words occurs on 5% of the pages. You pick a page at random. For each misspelled word, the probability of finding that word on the page is 0.05. However, the probability of finding at least one of the 100 misspelled words is much higher. If these words were independently distributed, the probability of finding at least one misspelled word would be 1 − (0.95)^100 ≈ 0.994. If the placements of the misspelled words were positively dependent, the probability would be lower than 0.994. For example, in the extreme case of dependence in which they all occur together, the probability would be 0.05. The same phenomenon occurs in MHT: statistical inference based on the error rate of each individual hypothesis test can lead to a very high error rate for the combined hypotheses. Thus, for MHT, an adjustment to the error rate needs to be made. Note that the adjustment may depend on the dependence structure, but due to the complexity of the dependence structure in high dimensions, dependence is usually ignored in the literature [1].
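The arithmetic in this example is easy to verify numerically; a minimal sketch (the function name is ours, for illustration):

```python
# Probability of finding at least one of m independent "5% events":
# this is the multiplicity effect that inflates the combined error rate.
def prob_at_least_one(alpha, m):
    return 1 - (1 - alpha) ** m

print(prob_at_least_one(0.05, 1))    # a single misspelled word: 0.05
print(prob_at_least_one(0.05, 100))  # at least one of 100 words: about 0.994
```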

The statistical inference depends on how we define the error rate for the combined hypothesis test. Let us suppose that there are m hypotheses to test, H_{0i} vs. H_{ai}, i = 1, 2, …, m. If we do not want to make even one false discovery, then we should control the familywise error rate (FWER), which is defined as

$$\mathrm{FWER} = \Pr\{\text{falsely reject } H_{0i} \text{ for at least one } i,\ i = 1, 2, \ldots, m\}. \tag{1}$$

There are many methods for controlling FWER ≤ α_F (= 0.05, for example). The simplest is Bonferroni’s procedure. Let T_i be the test statistic for testing H_{0i} vs. H_{ai}, with corresponding p-value p_i. Then Bonferroni’s procedure rejects H_{0i} if p_i < α_F/m. To see why this controls the FWER, let I_0 be the set of all i for which H_{0i} is true, and suppose that p_i < α_F/m for at least one i ∈ I_0. Then, using Boole’s inequality, we have, from Eq. (1),

$$\mathrm{FWER} = \Pr\Big\{ \bigcup_{i \in I_0} \{ p_i < \alpha_F/m \} \Big\} \le \sum_{i \in I_0} \Pr\{ p_i < \alpha_F/m \}. \tag{2}$$

Now, since p_i ∼ U(0, 1) under H_{0i}, we have Pr{p_i < α_F/m} = α_F/m. Then, assuming that I_0 contains m_0 elements, we have, from Eq. (2),

$$\mathrm{FWER} \le \frac{m_0\,\alpha_F}{m} \le \alpha_F.$$
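Bonferroni’s procedure is a one-liner in practice; a minimal sketch (function name and example p-values are ours):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0_i whenever p_i < alpha / m, which controls FWER <= alpha."""
    p = np.asarray(pvals)
    return p < alpha / len(p)

# With m = 5 and alpha = 0.05, only p-values below 0.01 are rejected.
print(bonferroni([0.001, 0.011, 0.02, 0.04, 0.30]))
```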

Holm [2] gave a modified version of Bonferroni’s procedure that also controls the familywise error rate. Holm’s procedure is the following: first rank all the p-values, p_{(1)} ≤ p_{(2)} ≤ … ≤ p_{(m)}, and let H_{0(1)}, H_{0(2)}, …, H_{0(m)} be their associated null hypotheses. Let l be the smallest index such that p_{(l)} > α_F/(m − l + 1). Then, reject only the null hypotheses H_{0(1)}, H_{0(2)}, …, H_{0(l−1)}. Note that the rejected hypotheses have p-values satisfying p_{(1)} < α_F/m, p_{(2)} < α_F/(m − 1), …, p_{(l−1)} < α_F/(m − l + 2); Holm’s procedure is thus more powerful than Bonferroni’s, since every hypothesis rejected by Bonferroni’s procedure is also rejected by Holm’s.
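Holm’s step-down procedure can be sketched as follows (hypothetical p-values; note that it rejects p = 0.011 here, which Bonferroni’s cutoff of 0.05/5 = 0.01 would miss):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Step down through the ordered p-values, comparing p_(l) with
    alpha / (m - l + 1), and reject until the first failure."""
    p = np.asarray(pvals)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):   # rank = l - 1
        if p[idx] > alpha / (m - rank):
            break
        reject[idx] = True
    return reject

print(holm([0.001, 0.011, 0.02, 0.04, 0.30]))
```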

The above Bonferroni-type procedures are not very satisfactory when m is very large. Let us suppose m = 10,000 (this is actually not very high for most high-dimensional problems), and suppose we want to control the FWER at α_F = 0.05. Then, for Holm’s procedure, the smallest p-value has to be below 0.05/10,000 = 0.000005 in order to reject even one hypothesis, which may be very hard to achieve. The problem is not really with Holm’s procedure; the problem is with using the FWER as the error rate. For a high-dimensional problem, it is unrealistic to seek a procedure that never makes even one false discovery. Benjamini and Hochberg [1] proposed a new error rate called the false discovery rate (FDR) and a procedure that works much better for high-dimensional MHT.

In Section 2, we review the FDR procedure and Bayesian procedures for two-sided alternatives. An extension of directional hypotheses is presented in Section 3. In Section 3, we also discuss Bayesian procedures under skewed alternatives. In Section 4, the problem of directional hypotheses is considered by converting p-values to normally distributed test statistics. We also discuss, in Section 4, a Bayes procedure under skew-normal alternatives. An application using real data of gene expressions is also discussed in Section 4. Some concluding remarks are made in Section 5.

2. False discovery rate (FDR), Benjamini and Hochberg’s (BH) procedure, and Bayesian procedures

For each hypothesis test H_{0i} vs. H_{ai}, suppose a statistical procedure either rejects the null hypothesis H_{0i} or fails to reject it. For the sake of simplicity, we equate failing to reject H_{0i} with accepting H_{0i}; however, for small sample sizes, it would be unwise to conclude that H_{0i} is accepted. From now on, rejections of the null will be called discoveries. Table 1 shows the possible outcomes of a procedure, where, for example, V is the total number of discoveries, among which V_0 is the number of false discoveries.

                Accept H0    Reject H0    Total
H0 is true      U_0          V_0          m_0
Ha is true      U_a          V_a          m − m_0
Total           U            V            m

Table 1.

Number of decisions under two-sided multiple hypothesis testing.

Thus, the proportion of false discoveries is V_0/max(V, 1). The FDR is defined as the expected proportion of false discoveries, that is,

$$\mathrm{FDR} = E\!\left[\frac{V_0}{\max(V, 1)}\right]. \tag{3}$$

If, for example, FDR = 0.05, then we can expect, on average, 5% of all discoveries to be false. In other words, under repeated experiments, we make on average 5% false discoveries (in a frequentist sense). Note that FDR ≤ FWER = P(V_0 ≥ 1), as the following inequality shows (the ratio V_0/max(V, 1) is at most 1 and vanishes when V_0 = 0):

$$\mathrm{FDR} = E\!\left[\frac{V_0}{\max(V, 1)}\right] = E\!\left[\frac{V_0}{\max(V, 1)}\, I(V_0 \ge 1)\right] \le E\big[I(V_0 \ge 1)\big] = P(V_0 \ge 1).$$

Thus, we are likely to make more discoveries under the FDR approach than under the FWER approach, since if a procedure controls FWER (≤ α), then it also controls FDR (≤ α), but not vice versa.

2.1. Benjamini and Hochberg’s procedure

Benjamini and Hochberg [1] proposed the following BH procedure which controls the FDR.

Let p_i be the p-value for the ith hypothesis under a test statistic T_i. Suppose T_1, T_2, …, T_m are independently distributed. Let p_{(1)} ≤ p_{(2)} ≤ … ≤ p_{(m)} be the ordered p-values, with the corresponding null hypotheses denoted by H_{0(1)}, H_{0(2)}, …, H_{0(m)}. Let

$$i_0 = \max\left\{ i : p_{(i)} \le \frac{i}{m}\,\alpha \right\}.$$

Then, reject H_{0(i)} for all i ≤ i_0.

This procedure controls FDR ≤ (m_0/m) α ≤ α. Since m_0 is unknown, the sharper bound (m_0/m) α is not directly useful; if m_0 can be estimated reliably, a better bound is possible.
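A sketch of the BH procedure with hypothetical p-values (function name is ours):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up procedure: reject the i0 smallest p-values, where
    i0 = max{ i : p_(i) <= (i/m) * alpha }."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= np.arange(1, m + 1) / m * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        i0 = np.max(np.nonzero(below)[0])        # 0-based index of p_(i0)
        reject[order[: i0 + 1]] = True
    return reject

print(benjamini_hochberg([0.001, 0.011, 0.02, 0.04, 0.30]))
```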

The above result was proven in [1] under independence of the test statistics. Benjamini and Yekutieli [3] extended the result to positively correlated test statistics, and they also gave a modified BH procedure, valid under more general dependence, with a new i_0 defined as

$$i_0 = \max\left\{ i : p_{(i)} \le \frac{i}{m\,c_m}\,\alpha \right\},$$

where $c_m = \sum_{i=1}^{m} 1/i$.
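This dependence-robust variant only rescales the BH thresholds by c_m; a sketch with hypothetical p-values, showing how much more conservative it is than BH on the same inputs:

```python
import numpy as np

def benjamini_yekutieli(pvals, alpha=0.05):
    """BH step-up with thresholds shrunk by c_m = sum_{i=1}^m 1/i,
    which remains valid under dependence of the test statistics."""
    p = np.asarray(pvals)
    m = len(p)
    c_m = (1.0 / np.arange(1, m + 1)).sum()
    order = np.argsort(p)
    below = p[order] <= np.arange(1, m + 1) / (m * c_m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        reject[order[: np.max(np.nonzero(below)[0]) + 1]] = True
    return reject

print(benjamini_yekutieli([0.001, 0.011, 0.02, 0.04, 0.30]))
```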

2.2. Bayesian procedures

Under the Bayesian setting, we assume that H_{0i} and H_{ai}, i = 1, 2, …, m, are generated probabilistically with

$$P(H_{0i}) = p \quad \text{and} \quad P(H_{ai}) = 1 - p.$$

Under this setting, [4] developed the concept of the local false discovery rate (fdr). Let T_i, i = 1, 2, …, m, be test statistics with pdfs T_i | H_0 ∼ f_0(t) and T_i | H_a ∼ f_a(t). Then, marginally, T_i ∼ f(t) = p f_0(t) + (1 − p) f_a(t), and

$$\mathrm{fdr}(t) = P(H_{0i} \mid T_i = t) = \frac{p\, f_0(t)}{f(t)}. \tag{4}$$

The idea is that if T_i ∈ [t, t + δt], where δt → 0, then fdr(t) represents the proportion of times that H_{0i} is true. If t is very large, then fdr(t) is very small, indicating that the probability of H_{0i} is very small (i.e., the false discovery rate is very small). In Eq. (4), p and f(t) are unknown, but they can be estimated (see [4]).
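A minimal simulation sketch of the local fdr, assuming the theoretical null f_0 = N(0, 1), a known null proportion p, and a kernel estimate of the mixture density f (in practice p must also be estimated [4]; the alternative component here is hypothetical):

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
p, m = 0.9, 5000
is_null = rng.random(m) < p
# Null test statistics ~ N(0, 1); alternatives ~ N(3, 1) (hypothetical).
t = np.where(is_null, rng.normal(0.0, 1.0, m), rng.normal(3.0, 1.0, m))

f_hat = gaussian_kde(t)                          # estimate of the mixture f

def fdr(x):
    """fdr(t) = p * f0(t) / f_hat(t), clipped to [0, 1]."""
    return np.clip(p * norm.pdf(x) / f_hat(x), 0.0, 1.0)

print(fdr(0.0)[0])   # near the null center: close to 1
print(fdr(4.0)[0])   # far in the tail: close to 0
```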

Storey [5] proposed the positive false discovery rate

$$\mathrm{pFDR} = E\!\left[\frac{V_0}{V} \,\middle|\, V > 0\right], \tag{5}$$

where the expectation is with respect to the distribution of (T_i, θ_i), i = 1, 2, …, m. Under the assumption that T_1, T_2, …, T_m are independently and identically distributed, [6] proved that

$$\mathrm{pFDR}(\Gamma) = P(H_0 \mid T \in \Gamma)$$

for a procedure that rejects H_{0i} when T_i ∈ Γ. Based on this, the q-value for multiple hypotheses (analogous to the p-value for a single hypothesis) is defined as the smallest value of pFDR(Γ) over rejection regions Γ containing the observed T_i = t_i; see [6]. In most cases, q-value(t_i) = P(H_0 | T_i > t_i). This yields a multiple-hypothesis procedure that rejects H_{0i} if q-value(t_i) < α.
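q-values admit a simple plug-in estimate from the p-values; a sketch taking the null proportion as 1 (a conservative choice; [5] estimates it from the data):

```python
import numpy as np

def q_values(pvals):
    """q_(i) = min_{j >= i} m * p_(j) / j: the smallest estimated pFDR over
    all rejection regions that contain p_(i)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    pfdr = p[order] * m / np.arange(1, m + 1)
    q_sorted = np.minimum.accumulate(pfdr[::-1])[::-1]
    q = np.empty(m)
    q[order] = q_sorted
    return q

print(q_values([0.001, 0.011, 0.02, 0.04, 0.30]))
```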

3. Directional hypotheses testing

As described earlier, the null hypothesis H_{0i} is either accepted or rejected. In most cases, however, rejection of the null hypothesis is not sufficient: after rejecting H_{0i}, finding the direction of the alternative may also be important. A detailed discussion of directional hypotheses can be found in [7].

Directional hypothesis testing involves testing H_{0i} against the directional hypotheses H_{−i} and H_{+i}, and the objective is to obtain a selection region {T_i ∈ Γ_−} for selecting H_{−i} and a selection region {T_i ∈ Γ_+} for selecting H_{+i}. In other words, H_{0i} is rejected if T_i ∈ Γ_− or T_i ∈ Γ_+, and the direction H_{−i} or H_{+i} is chosen according to whether T_i ∈ Γ_− or T_i ∈ Γ_+, respectively. Analogous to Table 1, we now have Table 2.

Table 2 shows the numbers of possible outcomes when accepting H_0, selecting H_−, or selecting H_+. For example, out of the V times that H_− is selected, V_0 errors are made when in fact H_0 is true, and V_+ errors are made when in fact H_+ is true. In other words, when selecting H_−, not only is H_0 falsely rejected V_0 times, but the direction is also falsely selected V_+ times. This leads to the concept of the directional false discovery rate (DFDR), defined as

$$\mathrm{DFDR} = E\!\left[\frac{V_0 + V_+ + W_0 + W_-}{\max(V + W, 1)}\right]. \tag{6}$$

                Accept H0    Select H−    Select H+    Total
H0 is true      U_0          V_0          W_0          m_0
H− is true      U_−          V_−          W_−          m_−
H+ is true      U_+          V_+          W_+          m_+
Total           U            V            W            m

Table 2.

Number of decisions under directional hypotheses.

This is analogous to the FDR for two-sided alternatives. For most cases, [8] showed that a DFDR-controlling procedure for directional hypotheses can be obtained from an FDR-controlling procedure for two-sided multiple hypotheses, with the direction determined by the sign of the test statistic.

Bansal and Miescke [9] considered a decision-theoretic formulation of the multiple hypotheses problem. The approach assumes a parametric model. Suppose the observed data x are modeled by P(x | θ, η), where θ = (θ_1, θ_2, …, θ_m)′ is the parameter vector of interest and η is a nuisance parameter. The problem of interest is to test

$$H_{0i}: \theta_i = 0 \quad \text{vs.} \quad H_{-i}: \theta_i < 0 \quad \text{or} \quad H_{+i}: \theta_i > 0. \tag{7}$$

Let the loss function of a decision rule d(x) = (d_1(x), d_2(x), …, d_m(x)) be given by

$$L(\theta, d(x)) = \sum_{i=1}^{m} l_i(\theta, d_i(x)), \tag{8}$$

where l_i(θ, d_i(x)) is the individual loss of d_i. Here, d_i ∈ {−1, 0, 1}, with d_i = 0, d_i = −1, and d_i = 1 meaning accepting H_{0i}, selecting H_{−i}, and selecting H_{+i}, respectively. Note that for the “0-1” loss, that is, when l_i = 0 for a correct decision and l_i = 1 for an incorrect decision, L is the total number of incorrect decisions. Thus, minimizing E[L(θ, d(X))] under the “0-1” loss amounts to minimizing the expected number of incorrect decisions.

Now suppose that, under the Bayesian setting, θ_i, i = 1, 2, …, m, are generated from

$$\pi(\theta) = p_-\, \pi_-(\theta) + p_0\, I(\theta = 0) + p_+\, \pi_+(\theta), \tag{9}$$

where π_− is a prior density over (−∞, 0) and π_+ is a prior density over (0, ∞). A special case of prior (9) is π_−(θ) = π_+(−θ). In this case, p_− and p_+ reflect the skewness of the alternative hypotheses. For example, if p_− = p_+, then we have a symmetric case, and selecting H_− or H_+ after rejecting H_0 based on the sign of the test statistic makes sense. On the other hand, if p_− < p_+, then more of the θ_i’s are positive than negative. For many gene expression data analyses, this is a useful case, since over-expressed genes may occur more frequently than under-expressed genes as a result of gene mutation (occurring naturally or as a result of external factors). For specific examples, see [9, 10].

From now on, we focus on the “0-1” loss; the results can easily be extended to other loss functions. The “0-1” loss can be written as

$$L(\theta, d) = \sum_{i=1}^{m} \left( 1 - \sum_{j=-1}^{1} I(d_i = j)\, I(\nu_i(\theta) = j) \right),$$

where ν_i(θ) ∈ {−1, 0, 1} is an indicator variable, with ν_i(θ) = −1 when θ_i < 0, ν_i(θ) = 0 when θ_i = 0, and ν_i(θ) = 1 when θ_i > 0. It is easy to see that minimizing the posterior expected loss yields the selection rule that selects H_{−i}, H_{0i}, or H_{+i} according to the maximum of {v_i^−, v_i^0, v_i^+}, where v_i^− = P(H_{−i} | x), v_i^0 = P(H_{0i} | x), and v_i^+ = P(H_{+i} | x).
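Under the “0-1” loss, the unconstrained Bayes rule simply picks the largest of the three posterior probabilities for each i; a sketch with hypothetical posterior probabilities:

```python
import numpy as np

def bayes_rule(v_minus, v_zero, v_plus):
    """Return -1, 0, or +1 for each i (select H-, accept H0, or select H+)
    according to the largest of v_i^-, v_i^0, v_i^+."""
    v = np.column_stack([v_minus, v_zero, v_plus])   # columns: H-, H0, H+
    return np.argmax(v, axis=1) - 1                   # map {0,1,2} -> {-1,0,+1}

# Three hypotheses; each row of posterior probabilities sums to 1.
print(bayes_rule([0.7, 0.1, 0.2], [0.2, 0.8, 0.3], [0.1, 0.1, 0.5]))
```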

3.1. The constrained Bayes rule

The Bayes procedure described earlier accommodates skewness in the prior, but it does not control any type of false discovery rate. In order to control a false discovery rate, we need a constrained Bayes rule that minimizes the posterior expected loss subject to a constraint on the false discovery rate.

The directional false discovery rate (6) is defined in a frequentist manner, with the expectation taken with respect to X | θ. Let us define the BDFDR as Eq. (6) with the expectation taken with respect to X | θ followed by a further expectation with respect to θ. We define the posterior version of Eq. (6), the PDFDR, by taking the expectation with respect to the posterior distribution of θ | X = x. It can be shown that

$$\mathrm{PDFDR} = 1 - \frac{\sum_{i=1}^{m} \left[ I(d_i = -1)\, v_i^{-} + I(d_i = +1)\, v_i^{+} \right]}{\max(|D_-| + |D_+|,\, 1)}. \tag{10}$$

Here, $|D_-| = \sum_{i=1}^{m} I(d_i = -1)$ and $|D_+| = \sum_{i=1}^{m} I(d_i = 1)$.

A constrained Bayes rule can be obtained by minimizing the posterior expected loss subject to the constraint PDFDR ≤ α. There can be many approaches to this constrained minimization. We present here the approach given in [9], which is as follows.

Consider the sets D_−^B and D_+^B of indices that select H_{−i} and H_{+i}, respectively, according to the unconstrained Bayes rule, that is, when v_i^− = max{v_i^−, v_i^0, v_i^+} and when v_i^+ = max{v_i^−, v_i^0, v_i^+}, respectively. Define ξ_i = v_i^− for i ∈ D_−^B and ξ_i = v_i^+ for i ∈ D_+^B, and then rank all ξ_i, i ∈ D_−^B ∪ D_+^B, from lowest to highest. Let the ranked values be denoted by ξ_{[1]} ≤ ξ_{[2]} ≤ … ≤ ξ_{[k̂]}, where k̂ = |D_−^B| + |D_+^B|. Denote

$$\hat{i}_0 = \max\left\{ j \le \hat{k} : \frac{1}{j} \sum_{i=1}^{j} \xi_{[\hat{k}-i+1]} \ge 1 - \alpha \right\}.$$

Let D_ξ denote the set of indices corresponding to ξ_{[k̂]}, ξ_{[k̂−1]}, …, ξ_{[k̂−î_0+1]}. Now, select H_{−i} for i ∈ D_−^B ∩ D_ξ, and H_{+i} for i ∈ D_+^B ∩ D_ξ.
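The constrained rule can be sketched directly from the three posterior probabilities: keep only the unconstrained selections whose top-j average of ξ stays at or above 1 − α. This is a simplified reading of the procedure in [9], with hypothetical inputs:

```python
import numpy as np

def constrained_bayes_rule(v_minus, v_zero, v_plus, alpha=0.05):
    """Among indices selected by the unconstrained Bayes rule, retain the
    i0_hat with the largest xi_i, where i0_hat is the largest j whose top-j
    average of xi is >= 1 - alpha (so that PDFDR <= alpha)."""
    vm, v0, vp = map(np.asarray, (v_minus, v_zero, v_plus))
    d = np.zeros(len(v0), dtype=int)
    sel_minus = (vm > v0) & (vm > vp)                # D_-^B
    sel_plus = (vp > v0) & (vp > vm)                 # D_+^B
    xi = np.where(sel_minus, vm, np.where(sel_plus, vp, -np.inf))
    ranked = np.nonzero(xi > -np.inf)[0]
    ranked = ranked[np.argsort(-xi[ranked])]         # highest xi first
    avg_top = np.cumsum(xi[ranked]) / np.arange(1, len(ranked) + 1)
    keep = np.nonzero(avg_top >= 1 - alpha)[0]
    if len(keep) > 0:
        kept = ranked[: keep.max() + 1]
        d[kept] = np.where(sel_plus[kept], 1, -1)
    return d

vm = [0.99, 0.005, 0.05, 0.60]
v0 = [0.005, 0.85, 0.02, 0.30]
vp = [0.005, 0.145, 0.93, 0.10]
print(constrained_bayes_rule(vm, v0, vp, alpha=0.05))
```

Here the last hypothesis is selected by the unconstrained rule (v^− = 0.60 is its maximum) but is dropped by the constraint, because including it pulls the average posterior selection probability below 1 − α = 0.95.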

3.2. Estimating mixture parameters

The above procedure requires estimation of the parameters (p_−, p_0, p_+) and of the nuisance parameter η. Note that, marginally,

$$X_i \sim p_-\, f_-(x_i \mid \eta) + p_0\, f_0(x_i \mid \eta) + p_+\, f_+(x_i \mid \eta),$$

where $f_0(x_i \mid \eta) = f(x_i \mid 0, \eta)$, and

$$f_-(x_i \mid \eta) = \int_{-\infty}^{0} f(x_i \mid \theta, \eta)\, \pi_-(\theta)\, d\theta, \qquad f_+(x_i \mid \eta) = \int_{0}^{\infty} f(x_i \mid \theta, \eta)\, \pi_+(\theta)\, d\theta,$$

and X_1, X_2, …, X_m are independently distributed. Estimates of the parameters of the mixture density can be obtained using the EM algorithm. It is easy to see that the EM estimators of (p_−, p_0, p_+) follow the iterative scheme

$$p_-^{(j+1)} = \frac{1}{m} \sum_{i=1}^{m} \frac{p_-^{(j)} f_-(x_i \mid \eta)}{p_-^{(j)} f_-(x_i \mid \eta) + p_0^{(j)} f_0(x_i \mid \eta) + p_+^{(j)} f_+(x_i \mid \eta)},$$
$$p_0^{(j+1)} = \frac{1}{m} \sum_{i=1}^{m} \frac{p_0^{(j)} f_0(x_i \mid \eta)}{p_-^{(j)} f_-(x_i \mid \eta) + p_0^{(j)} f_0(x_i \mid \eta) + p_+^{(j)} f_+(x_i \mid \eta)},$$
$$p_+^{(j+1)} = \frac{1}{m} \sum_{i=1}^{m} \frac{p_+^{(j)} f_+(x_i \mid \eta)}{p_-^{(j)} f_-(x_i \mid \eta) + p_0^{(j)} f_0(x_i \mid \eta) + p_+^{(j)} f_+(x_i \mid \eta)}.$$

The nuisance parameter η can also be estimated iteratively using the EM algorithm or by other means; see [9] for more details.
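The iterative scheme above is easy to simulate when the component densities f_−, f_0, f_+ are known; a sketch with hypothetical normal components centered at −3, 0, and 3:

```python
import numpy as np
from scipy.stats import norm

def em_proportions(x, f_minus, f_zero, f_plus, n_iter=200):
    """EM iterations for the mixture proportions (p-, p0, p+), with the
    component densities assumed known."""
    p = np.array([1 / 3, 1 / 3, 1 / 3])            # initial (p-, p0, p+)
    F = np.column_stack([f_minus(x), f_zero(x), f_plus(x)])
    for _ in range(n_iter):
        w = p * F                                   # E-step: responsibilities
        w = w / w.sum(axis=1, keepdims=True)
        p = w.mean(axis=0)                          # M-step: new proportions
    return p

rng = np.random.default_rng(1)
comp = rng.choice(3, size=20000, p=[0.05, 0.80, 0.15])
x = rng.normal(np.array([-3.0, 0.0, 3.0])[comp], 1.0)
p_hat = em_proportions(x,
                       lambda t: norm.pdf(t, -3, 1),
                       lambda t: norm.pdf(t, 0, 1),
                       lambda t: norm.pdf(t, 3, 1))
print(np.round(p_hat, 3))   # close to the true proportions (0.05, 0.80, 0.15)
```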

4. Bayes rules by converting p-values to normally distributed test statistics

Let T_i, i = 1, 2, …, m, be independently and identically distributed test statistics, and let P_i = P(T_i ≤ t_i | H_{0i}) be the corresponding p-values. Note that, under H_{0i}, P_i ∼ U(0, 1). Let X_i = Φ^{−1}(P_i) be the corresponding z-score; then, under H_{0i}, X_i ∼ N(0, 1). Efron [11] suggested using X_i ∼ N(0, σ²) under H_{0i}, with σ² appropriately estimated; he pointed out that, in practice, σ² may not equal 1 due to possible correlation among the multiple components. Under the alternative, we assume that X_i ∼ N(θ_i, σ²), where the θ_i’s are generated from the distribution described in Eq. (9). Admittedly, this assumption is a big leap; in practice, however, it can be tested, and when it holds it can lead to very powerful results. [9] assumed that π_+(θ) is a truncated normal distribution N(0, σ²/ω) and π_−(θ) = π_+(−θ), where ω is a positive constant reflecting how inflated we believe the alternative θ_i’s to be. It can be seen that

$$v_i^{-} \propto p_-\, T_-(x_i), \qquad v_i^{+} \propto p_+\, T_+(x_i), \qquad v_i^{0} \propto p_0, \tag{11}$$

with proportionality constant $[p_-\, T_-(x_i) + p_+\, T_+(x_i) + p_0]^{-1}$. Also, $T_-(x_i) = T_+(-x_i)$, and

$$T_+(x_i) = \exp\!\left\{ \frac{x_i^2}{2(1+\omega)\sigma^2} \right\} \Phi\!\left( \frac{x_i}{\sigma\sqrt{1+\omega}} \right). \tag{12}$$

In order to apply the Bayes procedure discussed in Section 3, all we need are Eqs. (11) and (12). For computational details, see [9].

4.1. Skew-normal alternatives

In the above discussion, we assumed that the θ_i’s are generated from the distribution with pdf (9). [12] considered the case when the θ_i’s are generated from a skew-normal distribution under the alternative hypotheses. The skew-normal distribution was first introduced in [13]. It has the important property that if (ξ_1, ξ_2) ∼ bivariate normal with mean 0, then the distribution of ξ_1 | ξ_2 > 0 is skew-normal. Its pdf is given by

$$g_+(\xi_1) = \frac{2}{\sigma_1}\, \varphi\!\left(\frac{\xi_1}{\sigma_1}\right) \Phi\!\left(\lambda \frac{\xi_1}{\sigma_1}\right),$$

and is denoted by SN(0, σ_1, λ). Here, λ is a skewness parameter; if λ = 0, this distribution is N(0, σ_1²). The implication of this property is the following: suppose that, within a normal system, an outcome follows a normal distribution; if a correlated factor starts exerting a positive effect, then the outcome variable will start following a skew-normal distribution. For example, consider an RNA experiment and assume that the genes are in a normal state. Suppose a gene mutation occurs at a later stage and starts exerting a positive effect on the affected genes. In this case, based on the above property of the skew-normal distribution, we can assume that the expressions of the affected genes follow a skew-normal distribution.
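The conditioning property is easy to verify by simulation; a sketch with correlation ρ = 0.8, under which ξ_1 | ξ_2 > 0 is skew-normal with λ = ρ/√(1 − ρ²):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
xi1 = z[z[:, 1] > 0, 0]          # keep xi1 only on the event {xi2 > 0}

# Conditioning shifts mass to the right: E[xi1 | xi2 > 0] = rho * sqrt(2/pi).
print(xi1.mean())                 # about 0.64 for rho = 0.8
```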

Under this formulation, we assume that θ_i, i = 1, 2, …, m, are generated from

$$\pi_\lambda(\theta_i) = p\, I(\theta_i = 0) + (1 - p)\, \frac{2}{\sigma_1}\, \varphi\!\left(\frac{\theta_i}{\sigma_1}\right) \Phi\!\left(\lambda \frac{\theta_i}{\sigma_1}\right).$$

Now, similar to Eq. (11), it can be seen that

$$v_i^{-} \propto (1 - p)\, T_-(x_i), \qquad v_i^{+} \propto (1 - p)\, T_+(x_i), \qquad v_i^{0} \propto p, \tag{13}$$

with proportionality constant $[(1 - p)(T_+(x_i) + T_-(x_i)) + p]^{-1}$, where

$$T_+(x_i) = \frac{2}{\sigma_1} \int_{0}^{\infty} \exp\!\left\{ \frac{x_i\,\theta}{\sigma^2} \right\} \varphi\!\left( \sqrt{\frac{1}{\sigma_1^2} + \frac{1}{\sigma^2}}\; \theta \right) \Phi\!\left( \lambda \frac{\theta}{\sigma_1} \right) d\theta,$$

and

$$T_-(x_i) = \frac{2}{\sigma_1} \int_{-\infty}^{0} \exp\!\left\{ \frac{x_i\,\theta}{\sigma^2} \right\} \varphi\!\left( \sqrt{\frac{1}{\sigma_1^2} + \frac{1}{\sigma^2}}\; \theta \right) \Phi\!\left( \lambda \frac{\theta}{\sigma_1} \right) d\theta.$$

The sets D_−^B and D_+^B can be written as

$$D_-^B = \{ i : x_i < -c_1 \} \quad \text{and} \quad D_+^B = \{ i : x_i > c_2 \},$$

where c_1 > 0 and c_2 > 0 are determined, as shown in Figure 1, by the points of intersection of y = p/(1 − p) with y = T_−(x) and with y = T_+(x), respectively. Note that when λ > 0, the intersection point Q (shown in the figure) lies to the left of x = 0, and when λ < 0, Q lies to the right of x = 0. Thus, when λ > 0, c_1 > c_2, and the opposite holds when λ < 0. When λ = 0, T_−(x) = T_+(−x) and thus c_1 = c_2. As λ → ∞, T_−(x) → 0, and thus D_−^B is empty, which is equivalent to a one-tailed test. As discussed in Section 3, the procedure based on Eq. (13) by itself does not control the BDFDR; however, c_1 and c_2 can be shrunk further so that the resulting procedure achieves BDFDR ≤ α; see [12] for details.

To illustrate the above procedure and to compare it with the standard FDR procedure (BY) of [8], which selects the direction based on the sign of the test statistic, we consider HIV data described in [14]. For a detailed analysis, see [12]; here we describe the analysis only briefly. The data consist of eight microarrays, four from cells of HIV-infected subjects and four from uninfected subjects, each with expression levels of 7680 genes. For each gene, we obtained a two-sample t-statistic comparing the infected versus the uninfected subjects, which was then transformed to a z-value z_i = Φ^{−1}{F_6(t_i)}, where F_6(·) denotes the cumulative distribution function (cdf) of the t-distribution with six degrees of freedom. Figure 2 shows the histogram of the z-values with a skew-normal fit. The null distribution of Z_i should be N(0, 1); however, as suggested in [11], we use N(−0.11, 0.75²) as the null distribution. Thus, we formulate our problem as testing hypotheses (7) with test statistics Z_i ∼ N(−0.11 + θ_i, 0.75²).
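The t-to-z transformation used here is a one-liner; a minimal sketch (the input t-statistics are hypothetical):

```python
import numpy as np
from scipy.stats import norm, t as t_dist

def t_to_z(t_stats, df=6):
    """z_i = Phi^{-1}(F_df(t_i)): maps null t-statistics to N(0, 1)."""
    return norm.ppf(t_dist.cdf(t_stats, df))

print(t_to_z(np.array([-3.0, 0.0, 2.0])))
```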

The BY procedure resulted in cutoffs (−3.94, 3.94), yielding 18 total discoveries, with two genes declared under-expressed and 16 over-expressed. For the constrained Bayes rule, we first used the EM algorithm to obtain the parameter estimates p̂ = 0.9, σ̂ = 0.79, σ̂_1 = 1.54, and λ̂ = 0.22. The Bayes procedure ended up with cutoff points (−2.82, 2.70), for a total of 86 discoveries (23 under-expressed and 63 over-expressed genes). Note that the number of discoveries by the Bayes rule is much higher than by the BY procedure.

5. Concluding remarks

There are many different methods for testing multiple hypotheses; the methodology, however, depends on the criterion we choose. When the dimension of the multiple hypotheses is not very high, the familywise error rate (FWER) is an appropriate criterion, which safeguards against making even one false discovery. However, when the dimension is very high, the FWER is not very useful; instead, the false discovery rate (FDR) criterion is a good approach. Although the FDR was originally defined as a frequentist concept, it can be re-interpreted in a Bayesian framework. The Bayesian framework brings many advantages: a decision-theoretic formulation is easy to implement, directional hypotheses are easy to handle, and skewness in the alternatives is easy to accommodate. A drawback is that we need to make assumptions about the prior distributions under the alternatives. Some work has been done based on nonparametric priors; however, much more work is needed.


© 2017 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite this chapter: Naveen K. Bansal (November 2nd, 2017). Hypothesis Testing for High-Dimensional Problems. In: Javier Prieto Tejedor (Ed.), Bayesian Inference, IntechOpen. DOI: 10.5772/intechopen.70210.
