The Application of Pooled DNA Sequencing in Disease Association Study


Introduction
Hundreds of common genetic variants related to the risk of human diseases, such as diabetes, hypertension, bipolar disorder, and Crohn's disease, have been discovered by genome-wide association studies (GWAS) (Barret et al., 2008; Hindorff, 2009; Thomas et al., 1991; WTCCC, 2007). Current GWAS are based on linkage disequilibrium (LD) mapping, in which a sufficient number of single nucleotide polymorphism (SNP) markers are selectively genotyped to capture the genetic variation of the whole genome. However, the results of GWAS raise two major issues. First, they explain only a small fraction of the heritability of complex diseases. One reason may be that many functional variants, in particular rare variants, are not directly genotyped in GWAS and are in weak LD with the SNP markers, and hence are missed (Iyengar et al., 2004; Manolio et al., 2009). Second, the associations identified in GWAS are often inconsistent between populations. This may be because the LD structure between markers and the underlying causal variants varies among populations, so that an association can be observed only in specific populations.
To address these issues, an ideal approach is to directly sequence all the samples in a study (Bodmer & Bonilla, 2008). However, this is not feasible with traditional Sanger sequencing, which is extremely expensive and time-consuming for the thousands of samples required to achieve reasonable statistical power in a typical genetic association study.
Next-generation sequencing (NGS) technology, also called parallel sequencing, is a revolutionary technology for biomedical research (Shendure & Ji, 2008). The production of large numbers of low-cost reads makes NGS useful for many applications. Today there are three commonly used next-generation sequencing systems: the 454 GS FLX Genome Analyzer marketed by Roche Applied Sciences, Illumina's Genome Analyzer (GA), and Applied Biosystems' SOLiD system. Several new systems have either just been introduced or will become available soon (Metzker, 2010). One of the most important applications is to identify DNA variants, in particular rare variants, responsible for human diseases (Metzker, 2010). Ten billion bases can now be obtained routinely in a single run of an NGS instrument, and yields are expected to keep increasing. Even the smallest functional unit, e.g., a single 'lane', can generate many thousand-fold coverage of a target region. This is far more than is needed to genotype one individual, because an individual genotype at a specific locus is expected to be called accurately at about 15-30 fold coverage. It is therefore feasible to sequence the targeted regions of multiple individuals simultaneously, with dramatic savings in cost and time.
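The coverage argument above can be made concrete with a small calculation. The following is a minimal sketch, not part of the original chapter; the function name `individuals_per_lane` and the lane coverage of 6,000-fold are hypothetical values chosen only to illustrate "many thousand-fold" coverage.

```python
def individuals_per_lane(lane_fold_coverage: float, per_individual_coverage: float) -> int:
    """Number of individuals whose target region fits into one lane
    when each individual needs the given fold coverage."""
    if per_individual_coverage <= 0:
        raise ValueError("per-individual coverage must be positive")
    return int(lane_fold_coverage // per_individual_coverage)


# Hypothetical lane yielding 6,000-fold coverage of the target region.
lane_coverage = 6000.0
for depth in (15, 30):  # the 15-30 fold range quoted in the text
    print(depth, individuals_per_lane(lane_coverage, depth))
```

Under these assumed numbers, one lane could hold a few hundred individuals, which is the motivation for the multiplexing and pooling strategies discussed next.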
To reduce the cost of large-scale association studies, one efficient approach is to sequence a large number of individuals together in a single sequencing run. Two approaches are commonly used in disease association studies: bar-coding and pooling. Bar-coding ligates the DNA fragments of each sample to a short, sample-specific DNA sequence, and then sequences the DNA fragments from multiple subjects in a single run. In addition to allowing individual genotypes to be determined, bar-coding reduces sequencing variability (Craig et al., 2008). However, bar-coding at present supports only a limited degree of multiplexing, and the cost of individual DNA amplification and sequencing template preparation can be substantial in large-scale disease-association studies. Compared to bar-coding, simply pooling DNA samples is more cost-effective, as it makes full use of the high sequencing depth and vastly reduces the effort of sample preparation for thousands of individuals. Pooled DNA sequencing is therefore particularly appealing for large disease-association studies because of its substantial savings in cost and time (Shaw et al., 1998). With pooling, the sequencing throughput required per individual is much less than what is provided by a single run, and hence it is feasible to sequence multiple individuals together. For example, in a case-control study, the allele frequencies in a sample of 500 cases and 500 controls can be measured from two pooled samples rather than from 1,000 individual samples, a 500-fold reduction in the number of samples to be sequenced.
Pooling was first used in genetics in a case-control association study of HLA class II DR and DQ alleles in type I diabetes mellitus (Arnheim et al., 1985). It has since been used for linkage studies in plants (Michelmore et al., 1991), for the homozygosity mapping of recessive diseases in inbred populations (Sheffield et al., 1994; Carmi et al., 1995; Nystuen et al., 1996; Scott et al., 1996), and for mutation detection (Amos et al., 2000). The strategy was also proposed for high-throughput SNP arrays (Ito et al., 2003; Shaw et al., 1998; Zeng & Lin, 2005), but it was not widely accepted because SNP array technology does not provide accurate estimates of the allele frequencies in pooled samples. Next-generation sequencing technology provides a high-throughput solution for examining functional variants directly, and it may provide more accurate estimates of allele frequency, as shown by recent studies (Druley et al., 2009). One recent study adopted this strategy using 454 sequencing technology and identified associations of rare variants with insulin-dependent diabetes mellitus (Nejentsev et al., 2009). In genome-wide association studies, a two-stage design combined with DNA pooling can be used as a cost-efficient strategy to detect associated regions (Chi et al., 2009; Skol et al., 2006; Wang et al., 2006; Zuo et al., 2006, 2008). In the first stage, a fraction of the samples is genotyped for all SNPs, and a case-control association test is conducted for each SNP to select the most significant ones. In the second stage, the candidate SNPs from the first stage are further evaluated by genotyping.
To reduce the cost of large-scale association studies with a two-stage design, pools of DNA from many individuals have been successfully used in the first stage (Bansal et al., 2002; Boss et al., 2009; Nejentsev et al., 2009; Norton et al., 2004; Sham et al., 2002). As suggested by Out et al. (2009), NGS of pooled DNA samples for targeted regions can also be an attractive, cost-effective method to identify rare variants in candidate genes.
The data produced by next-generation sequencing differ from those of SNP chips. Next-generation re-sequencing produces large amounts of short reads. After mapping to the reference genome, an alignment of reads across the targeted regions is obtained. A schematic example of re-sequencing data in a case-control study is shown in Table 1. In this example, the case sample and the control sample each consist of two pools with two individuals in each pool. The two alleles (A and a) of each individual are shown in the "Genotype" column. Each allele appears a random number of times among the reads. Although NGS has the potential to discover the entire spectrum of sequence variation in a sample of well-phenotyped individuals, it also presents challenges. First, the error rate of these platforms is higher than that of conventional sequencing methods, and many errors are not random events (Johnson & Slatkin, 2008; Chaisson et al., 2009; Lynch, 2009; Bansal et al., 2010b). These errors may be frequent enough to obscure true associations or systematic enough to generate false-positive associations. Second, linkage disequilibrium (LD) information is lost in pooled sequencing. As a result, the powerful analytic approaches that combine multiple rare variants to examine disease association are not directly applicable to pooled sequencing, because they require individual genotypes to account for the LD between SNPs. The current single-locus analysis of pooled sequencing data can be very inefficient, in particular for rare variants.

[Table 1. Schematic example of re-sequencing data in a case-control study. Columns: Pool, Individual, Genotype, Read base. The table body is not recoverable.]

In section 2 of this chapter, we introduce several pooling design strategies, including PI-deconvolution, the shifted-transversal design, the multiplexed scheme, and overlapping pools, which recover information lost by simple pooling. With these well-chosen pool designs, the variant carriers can be identified, which greatly enhances the pooling efficiency. In section 3, we introduce statistical methods for variant detection and for case-control association studies that account for high levels of sequencing error. A brief summary is given at the end of the chapter.

Strategies of pooling design
The main idea of pooling is to sequence DNA from several individuals together in a single run. The allele frequency can then be estimated from the observed number of re-sequenced alleles. The simplest strategy is the naïve pooling scheme, also called disjoint pooling. In the naïve pooling scheme, DNA from several individuals is sequenced in a single pool, and each pool contains different individuals (Table 2). This offers insight into allele frequencies, but cannot reveal the identity of an allele carrier.
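As an illustration of frequency estimation under naïve pooling, the sketch below computes the proportion of reads that report the variant, across pools. It assumes that every chromosome contributes equally to the reads and it ignores sequencing error; the function name and the read counts are hypothetical.

```python
def pooled_allele_frequency(variant_reads, total_reads):
    """Estimate the allele frequency as (reads reporting the variant) / (all reads),
    summed over pools. Assumes equal DNA contribution and no sequencing error."""
    if len(variant_reads) != len(total_reads):
        raise ValueError("one variant count and one total count per pool are required")
    total = sum(total_reads)
    if total == 0:
        raise ValueError("no reads observed")
    return sum(variant_reads) / total


# Hypothetical data: three disjoint pools, 100 reads each at the locus.
estimate = pooled_allele_frequency([5, 0, 12], [100, 100, 100])
print(round(estimate, 4))
```

The estimate says how common the allele is, but nothing about which individual in a pool carries it, which is the limitation the designs below address.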
Recently, several well-chosen pooling strategies that aim to identify variant carriers have been proposed. In these designs, each individual is tested several times in different pools. This redundancy provides a potential increase in both sensitivity and specificity. We will introduce PI-deconvolution, the shifted-transversal design, the multiplexed scheme, and overlapping pools.

Table 2. Re-sequencing with the naïve pooling scheme. A total of 16 individuals are divided into groups of two and pooled.
PI-deconvolution (Jin et al., 2006)
The PI-deconvolution approach is a classic grid design. This strategy arranges individuals on an imaginary grid and constructs one pool from each row and each column. The individuals carrying a variant can then usually be identified from the pattern of pools in which the variant appears. If several individuals are confounded, only a few candidates need to be retested. For example (Table 3), 16 individuals are arrayed on an imaginary grid and mixed in 8 pools, each containing 4 individuals (individuals 1, 2, 3, and 4 are in pool 1, and individuals 1, 5, 9, and 13 are in pool 5). If the variant appears in pools 3 and 6, then individual 10 is the only variant carrier. If the variant also appears in pools 2 and 7, we cannot distinguish whether it comes from individuals 6 and 11 or from individuals 7 and 10.
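The 16-individual example can be reproduced with a short sketch of a row/column grid design. This is an illustration of the grid idea only, not the authors' software; the function names are hypothetical, and pools are numbered as in the example (rows first, then columns).

```python
def build_grid_pools(side):
    """Return {pool_id: set of individuals} for a side x side grid.
    Pools 1..side are the rows; pools side+1..2*side are the columns."""
    pools = {}
    for r in range(side):
        pools[r + 1] = {r * side + c + 1 for c in range(side)}
    for c in range(side):
        pools[side + c + 1] = {r * side + c + 1 for r in range(side)}
    return pools


def candidate_carriers(pools, positive_pools, side):
    """Individuals whose row pool and column pool are both positive."""
    positive = set(positive_pools)
    candidates = []
    for individual in range(1, side * side + 1):
        memberships = [p for p, members in pools.items() if individual in members]
        if all(p in positive for p in memberships):
            candidates.append(individual)
    return candidates


pools = build_grid_pools(4)
print(candidate_carriers(pools, [3, 6], 4))        # a single carrier
print(candidate_carriers(pools, [2, 3, 6, 7], 4))  # four confounded candidates
```

With pools 2, 3, 6, and 7 positive, the row and column pools alone leave four candidates, so the pattern cannot separate the pair 6 and 11 from the pair 7 and 10.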
To resolve this confounding, we can add four additional pools, built along the grid's diagonals as indicated by the colors of the individuals in Table 3. If the pink diagonal pool appears variant, individuals 6 and 11 are the variant carriers, whereas if both the orange and blue diagonal pools appear variant, the variant is from individuals 7 and 10. The authors validated the technique in three experimental contexts: protein chips, yeast two-hybrid assays, and drug resistance screening.

Shifted-Transversal Design (Thierry-Mieg, 2006)
This method minimizes the co-occurrence of objects and constructs pools with constant-sized intersections. The author proved that it allows unambiguous decoding of noisy experimental observations. It is highly flexible and can be tailored to function robustly in a wide range of experimental settings. For every $j \in \{0, \ldots, q\}$, let $M_j$ be a $q \times n$ Boolean matrix defined by its columns $C_{j,0}, \ldots, C_{j,n-1}$. [The defining equations and the worked example that followed are missing in the source.]

Multiplexed scheme
[The opening of this subsection is missing in the source.] In the example of Table 4, 20 individuals are pooled in 2 groups according to two pooling rules, which yield 5 pools in group 1 and 8 pools in group 2. The total number of pools in this design is 13 (5 + 8). The corresponding pooling matrix is a 13 × 20 table partitioned into two regions that correspond to the two pooling patterns. The staircase pattern (highlighted in yellow) in each region is typical of the multiplexed scheme. If pool 1 in group 1 and pool 6 in group 2 appear variant, then individual 6 can be identified as the variant carrier.
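The two pooling rules of this example are not legible in the text, so the following sketch makes an explicit assumption: individual i is placed in pool (i mod 5) of group 1 and pool (i mod 8) of group 2, with remainder 0 mapped to the last pool of the group. This assumed rule gives 5 + 8 = 13 pools and reproduces the statement about individual 6, but it may differ from the rule used for Table 4. All names are hypothetical.

```python
def pool_of(individual, window):
    """Assumed rule: pool index is the remainder modulo the window size,
    with remainder 0 mapped to the last pool of the group."""
    remainder = individual % window
    return remainder if remainder != 0 else window


def build_design(n_individuals, windows):
    """Map each individual to its pool index in every group."""
    return {
        i: tuple(pool_of(i, w) for w in windows)
        for i in range(1, n_individuals + 1)
    }


def decode(design, positive_pools_per_group):
    """Individuals whose pool is positive in every group."""
    return [
        i
        for i, pools in design.items()
        if all(p in positive for p, positive in zip(pools, positive_pools_per_group))
    ]


design = build_design(20, (5, 8))
print(design[6])                  # pools of individual 6 in groups 1 and 2
print(decode(design, [{1}, {6}])) # carriers consistent with the positive pools
```

Because 5 and 8 share no common factor and 5 × 8 = 40 exceeds 20, every individual receives a distinct pair of pools under this assumed rule, which is what makes a single carrier identifiable.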
Overlapping pools (Prabhu & Pe'er, 2009)
The central idea of the overlapping pool design is that, while DNA from several individuals is sequenced in a single pool, DNA from a single individual is also sequenced in several pools. Individuals are assigned to pools so as to create a code: a unique set of pools for each individual. The set of pools in which an individual is sequenced defines a code word, or pool signature. If a variant is observed in the signature pools of one individual and in no others, then that individual is identified as the variant carrier. Based on the overlapping design, the authors proposed two algorithms for pool design: logarithmic signature designs and error-correcting designs. They showed that their designs guarantee a high probability of unambiguous identification of singleton carriers while maintaining the features of naïve pools in terms of sensitivity, specificity, and the ability to estimate allele frequencies.
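The notion of a pool signature can be illustrated with plain binary code words. This is a deliberately simplified sketch of the "unique set of pools per individual" idea, and it is not the logarithmic signature or error-correcting construction of Prabhu & Pe'er; the function names are hypothetical.

```python
def binary_signatures(n_individuals):
    """Give individual k (1-based) the pools at the set bits of k.
    Using k >= 1 guarantees that everyone is in at least one pool."""
    n_pools = n_individuals.bit_length()
    return {
        k: frozenset(p for p in range(n_pools) if (k >> p) & 1)
        for k in range(1, n_individuals + 1)
    }


def identify_singleton(signatures, positive_pools):
    """Return the individual whose signature equals the set of positive pools,
    or None when no single individual explains the pattern."""
    positive = frozenset(positive_pools)
    for individual, signature in signatures.items():
        if signature == positive:
            return individual
    return None


signatures = binary_signatures(16)
print(sorted(signatures[10]))                # pools containing individual 10
print(identify_singleton(signatures, {1, 3}))
```

In this toy design, 16 individuals need only 5 pools, but two carriers in the same experiment can produce a pattern that matches a third individual or nobody at all, which is why the published designs add error-correcting structure.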

Statistical methods for pooled DNA sequencing in GWAS
GWAS have successfully identified hundreds of variants that are associated with complex traits, and pooled DNA sequencing has been considered a cost-effective approach for studying rare variants in large populations. In this section, we discuss statistical methods for the detection of variants and for case-control studies.

Detection of variants
SNPSeeker (Druley et al., 2009)
SNPSeeker is an algorithm based on large deviation theory. It uses a second-order dependency error model for single-nucleotide polymorphism identification that takes into account the position in the sequencing read and the identity of the two upstream bases. This algorithm greatly improves the specificity of SNP calling. The statistical model can be described as follows. Let $b$ denote an observed base, drawn from a five-symbol alphabet, and let $r \in \{A, C, G, T\}$ denote a base in the reference. For each cycle $c$, sequencing run $k$, and strand $s$, the bases observed at reference position $i$ are treated as i.i.d. random variables, and their empirical probability distribution is denoted $\hat{P}_{i,s}$.

Under the null hypothesis of no polymorphism at position $i$, the distribution of an observed base is

$$Q_{i,s}(b) = \sum_{r \in \{A, C, G, T\}} \Pr(b \mid R_i = r, c, k)\, \Pr(R_i = r \mid s, \mathbf{f}),$$

where $\Pr(b \mid R_i = r, c, k)$ is the probability of seeing base $b$ in the sequence at cycle position $c$ on run $k$ given that the original base at position $i$ in the reference is equal to $r$, and $\Pr(R_i = r \mid s, \mathbf{f})$ is the probability of observing nucleotide $r$ in the reference sequence at position $i$ given the strand $s$ and the true allele frequency vector $\mathbf{f}$. The cumulative p-value for each strand is calculated from $D(\hat{P}_{i,s} \,\|\, Q_{i,s})$, the Kullback-Leibler distance (Thomas & Joy, 1991) between $\hat{P}_{i,s}$ and $Q_{i,s}$ [the explicit expression is missing in the source]. A Bonferroni correction is applied for the total number of tests performed at each position in the reference sequence. The software for the SNPSeeker algorithm can be found at http://www.genetics.wustl.edu/rmlab/.
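To show how a Kullback-Leibler distance turns into a measure of surprise, the sketch below computes the divergence between an empirical base distribution and an expected error-only distribution, together with the standard large-deviation bound exp(-n D) on the probability of observing exactly that empirical distribution in n reads. This illustrates the general principle only; it is not SNPSeeker's p-value formula, and the counts and error probabilities are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats.
    Terms with p_i = 0 contribute nothing; q_i = 0 with p_i > 0 gives infinity."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i <= 0:
                return math.inf
            total += p_i * math.log(p_i / q_i)
    return total


def large_deviation_bound(counts, expected):
    """Upper bound exp(-n * D(empirical || expected)) on the probability that
    n i.i.d. draws from `expected` produce exactly the observed composition."""
    n = sum(counts)
    empirical = [c / n for c in counts]
    return math.exp(-n * kl_divergence(empirical, expected))


# Hypothetical position: 1,000 reads, 100 report a non-reference base,
# while the error model expects only 1% non-reference calls.
print(large_deviation_bound([900, 100], [0.99, 0.01]))
```

A very small bound indicates that sequencing error alone is an implausible explanation for the observed base composition at that position.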
CRISP (Bansal, 2010a)
This approach compares the distribution of allele counts across multiple pools using contingency tables and evaluates the probability of observing multiple non-reference base calls due to sequencing errors alone. The numbers of reads with the reference and alternate alleles at a particular position across the pools can be modeled as a contingency table with two bases (rows) and $k$ pools (columns). Let $a_i$ be the number of reads with the alternate allele and $n_i$ the total number of reads in pool $i$. The row sums are $A = \sum_{i=1}^{k} a_i$ and $N - A = \sum_{i=1}^{k} (n_i - a_i)$, and the column sums are $n_1, \ldots, n_k$. Under the null hypothesis, the probability of the observed reads can be defined as the probability of the table $T$ given its margins:

$$P(T) = \frac{\prod_{i=1}^{k} \binom{n_i}{a_i}}{\binom{N}{A}}.$$

The p-value associated with the observed table is defined as the sum of the probabilities of all $2 \times k$ contingency tables with identical row and column sums that have equal or lower probability than the observed table:

$$p = \sum_{T' :\, P(T') \le P(T)} P(T').$$
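The contingency-table p-value described above can be sketched by brute-force enumeration. The code assumes the usual hypergeometric probability of a table with fixed margins and enumerates every table with the same row and column sums; this is feasible only for very small read counts, and it is an illustration rather than the CRISP implementation. Function names and the example counts are hypothetical.

```python
from math import comb


def table_probability(alt_counts, totals):
    """Probability of a 2 x k table with fixed margins (hypergeometric form)."""
    numerator = 1
    for a, n in zip(alt_counts, totals):
        numerator *= comb(n, a)
    return numerator / comb(sum(totals), sum(alt_counts))


def enumerate_tables(totals, alt_sum):
    """Yield every tuple of alternate-allele counts with 0 <= a_i <= n_i
    and the required overall sum."""
    if len(totals) == 1:
        if 0 <= alt_sum <= totals[0]:
            yield (alt_sum,)
        return
    for a in range(min(totals[0], alt_sum) + 1):
        for rest in enumerate_tables(totals[1:], alt_sum - a):
            yield (a,) + rest


def exact_p_value(alt_counts, totals):
    """Sum the probabilities of all tables no more likely than the observed one."""
    totals = tuple(totals)
    observed = table_probability(alt_counts, totals)
    p_value = 0.0
    for table in enumerate_tables(totals, sum(alt_counts)):
        probability = table_probability(table, totals)
        if probability <= observed * (1 + 1e-9):  # tolerance for float ties
            p_value += probability
    return min(p_value, 1.0)


# Hypothetical tiny example: two pools with 4 reads each, 3 alternate reads in pool 1.
print(exact_p_value([3, 0], [4, 4]))
```

For realistic read depths the number of tables grows very quickly, so a practical implementation would need a faster evaluation strategy than this enumeration.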

Detecting association in case-control studies
The model in a case-control study can be described as follows. Let $N_{ij}$ be the total number of chromosome segments of the region of interest in the $j$th pool of phenotypic group $i$, where $i = 1$ for cases and $i = 2$ for controls. Let $X_{ij}$ be the unknown number of rare alleles at a specific locus of interest in that pool. After re-sequencing, a total of $m_{ij}$ sequencing reads at the locus are observed, of which $y_{ij}$ report the variant. We denote the random vectors $X_i = (X_{i1}, \ldots, X_{iJ_i})$ and $Y_i = (y_{i1}, \ldots, y_{iJ_i})$, where $J_i$ is the number of pools in group $i$. Let $p_i$ be the frequency of the minor allele at this locus for group $i$. The question of interest in the case-control study is whether this locus is associated with the disease. The hypothesis of association can be written as

$$H_0: p_1 = p_2 \quad \text{versus} \quad H_1: p_1 \neq p_2.$$

Several statistical methods can be utilized for this test.

Fisher's exact test
The allele frequencies of cases and controls are calculated from the observed numbers of total reads and the numbers of reads reporting the variant, and the number of variants carried by each phenotypic group is estimated by $\hat{X}_i = N_i \hat{p}_i = \left( \sum_j y_{ij} / \sum_j m_{ij} \right) \times N_i$, where $y_{ij}$ and $m_{ij}$ are the numbers of variant-reporting reads and total reads in pool $j$ of group $i$, $N_i$ is the total number of chromosome segments in group $i$, and $\hat{p}_i$ is the estimated allele frequency for group $i$. The data are then summarized in a $2 \times 2$ table (Table 7). The p-value is the sum of the probabilities of all tables with the same row and column margins that have probabilities less than or equal to that of the observed table. Although this test is simple to implement, it has several limitations. First, it treats the estimated numbers of the rare variant as if they were observed, without considering the uncertainty of such estimates. Thus, it may have an inflated Type I error rate. Second, the sampling scheme of Fisher's exact test is based on the hypergeometric distribution, which in principle requires that both the column and row marginal totals of the $2 \times 2$ table be fixed, i.e., both the sample sizes (the numbers of cases and controls) and the numbers of variants and non-variants are fixed in Table 7. The numbers of variants and non-variants are usually not fixed in a genetic case-control study, and Fisher's exact test used in this way can become very conservative (Upton, 1982). Finally, and most importantly, because next-generation sequencing has a relatively high rate of base-calling error, it is difficult to distinguish true rare variants from such errors in the sequence reads of pooled samples. For a rare variant whose frequency is not much higher than the error rate, the power to detect its association with a disease would be very low unless the statistical method adjusts for such error.

Combined Z-test (Abraham et al., 2008)
This method combines the chi-square statistic and the Z-statistic for testing the difference in mean allele frequencies between cases and controls. The general description of this statistic has been presented in Sham et al. (2002), Macgregor (2007), and Kirov et al. (2009):

$$Z = \frac{\bar{p}_1 - \bar{p}_2}{\sqrt{v_1 + v_2 + e_1^2 + e_2^2}},$$

where $\bar{p}_i = \frac{1}{J_i} \sum_j \hat{p}_{ij}$ is the mean of the allele frequencies over the $J_i$ pool replicates of group $i$, $v_i = \bar{p}_i (1 - \bar{p}_i) / (2 n_i)$ is the binomial sampling variance, $n_i$ is the number of individuals in group $i$ ($i = 1, 2$), and $e_i^2 = \frac{1}{J_i (J_i - 1)} \sum_j (\hat{p}_{ij} - \bar{p}_i)^2$ is the square of the standard error due to experimental error.
This method considers sampling error and experimental error, which is equivalent to a simplified version of the complex regression model suggested by Macgregor (2007).
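A minimal sketch of such a combined statistic follows. It assumes the usual form: the difference of mean pool frequencies divided by the square root of the summed sampling and experimental variances. The sampling variance uses 2n chromosomes per group and the experimental term is the squared standard error of the replicate mean; both normalizations, the function names, and the example frequencies are assumptions made for illustration.

```python
import math


def group_summary(pool_freqs, n_individuals):
    """Mean frequency, binomial sampling variance (2n chromosomes assumed),
    and squared standard error of the mean across pool replicates."""
    k = len(pool_freqs)
    mean = sum(pool_freqs) / k
    sampling_var = mean * (1 - mean) / (2 * n_individuals)
    if k > 1:
        experimental_var = sum((f - mean) ** 2 for f in pool_freqs) / (k * (k - 1))
    else:
        experimental_var = 0.0  # cannot be estimated from a single replicate
    return mean, sampling_var, experimental_var


def combined_z_test(case_freqs, control_freqs, n_cases, n_controls):
    """Return (z, two-sided p-value) for the difference in mean allele frequency."""
    m1, v1, e1 = group_summary(case_freqs, n_cases)
    m2, v2, e2 = group_summary(control_freqs, n_controls)
    variance = v1 + v2 + e1 + e2
    if variance <= 0:
        raise ValueError("total variance is zero; the statistic is undefined")
    z = (m1 - m2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical replicate estimates from two case pools and two control pools.
z, p = combined_z_test([0.10, 0.12], [0.05, 0.05], n_cases=500, n_controls=500)
print(round(z, 2), p)
```

Including the experimental term widens the denominator whenever replicate pools disagree, so noisy pool measurements reduce the evidence for association.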
Likelihood ratio test (Kim et al., 2010)
To quantify the sequencing error rate in pooled sequencing, one approach is to include a control DNA sequence in each pool, which makes it possible to obtain the empirical distribution of the sequencing error of individual pools. Let $v \in \{A, C, G, T\}$ be the variant allele and $r \in \{A, C, G, T\}$ be the reference allele. For pool $j$ of group $i$, we define $\epsilon_{ij} = \Pr(v \mid r, i, j)$ to be the false-positive error rate, i.e., the probability of reporting the variant given the reference base, and $\tau_{ij} = \Pr(v \mid v, i, j)$ to be the true-positive rate, i.e., the probability of reporting the variant given the variant base. Both the false-positive and true-positive rates can be estimated for each pool by the proportion of reads reporting the variant for the reference base and for the variant base, respectively. The allele frequency observed with error, that is, the expected proportion of reads reporting the variant, can be calculated as $\theta_{ij} = (1 - p_i)\, \epsilon_{ij} + p_i\, \tau_{ij}$, where $p_i$ is the allele frequency in group $i$.
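For next-generation sequencing data, the likelihood $L(p_1, p_2 \mid Y)$ is computed from the read counts $Y$, where the error-adjusted allele frequency is as defined above [the explicit expression is missing in the source]. The likelihood ratio statistic is then computed as

$$\Lambda = -2 \ln \frac{L(\hat{p}_0 \mid Y)}{L(\hat{p}_1, \hat{p}_2 \mid Y)},$$

where $\hat{p}_0$ is the maximum likelihood estimate of the common allele frequency under the null hypothesis, and $\hat{p}_1$ and $\hat{p}_2$ are the estimates for cases and controls under the alternative. Reject $H_0$ if $\Lambda > \chi^2_{1, 1-\alpha}$.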
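The likelihood ratio idea can be sketched under simplifying assumptions that go beyond the text: reads in each pool are treated as binomial with variant probability (1 - p) × false-positive rate + p × true-positive rate, the unknown number of variant chromosomes per pool is ignored, the maximum likelihood estimates are found by a grid search, and the statistic is compared with a chi-square distribution with one degree of freedom. All names and numbers are hypothetical.

```python
import math


def log_likelihood(p, pools):
    """Binomial log-likelihood (up to a constant) of allele frequency p.
    Each pool is (variant_reads, total_reads, false_positive_rate, true_positive_rate)."""
    total = 0.0
    for y, m, fp_rate, tp_rate in pools:
        theta = (1 - p) * fp_rate + p * tp_rate
        theta = min(max(theta, 1e-12), 1 - 1e-12)  # keep the logarithms finite
        total += y * math.log(theta) + (m - y) * math.log(1 - theta)
    return total


def max_log_likelihood(pools, grid=2000):
    """Maximize the log-likelihood over p on an evenly spaced grid in [0, 1]."""
    return max(log_likelihood(i / grid, pools) for i in range(grid + 1))


def likelihood_ratio_test(case_pools, control_pools):
    """Return (statistic, p-value) for H0: equal allele frequency in both groups."""
    null_fit = max_log_likelihood(case_pools + control_pools)
    alt_fit = max_log_likelihood(case_pools) + max_log_likelihood(control_pools)
    statistic = max(0.0, -2 * (null_fit - alt_fit))
    p_value = math.erfc(math.sqrt(statistic / 2))  # chi-square tail, 1 df
    return statistic, p_value


# Hypothetical pools: 1,000 reads each, 1% false-positive and 99% true-positive rates.
cases = [(60, 1000, 0.01, 0.99)]
controls = [(12, 1000, 0.01, 0.99)]
print(likelihood_ratio_test(cases, controls))
```

The error rates enter through the read-level probability, so a variant whose read fraction is close to the false-positive rate yields an estimated frequency near zero instead of a spurious signal.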
Differential test
Following the approaches of Liddell (1976) and Barry & Choongrak (1900) [...]

[...] for each individual. However, through well-chosen pool designs as introduced in Section 2, there is still a high chance of identifying the variant carrier. Recently, next-generation sequencing technologies have made it feasible to sequence several human genomes entirely. SNPSeeker and CRISP are efficient statistical methods for detecting variants from the short reads generated by NGS platforms. For case-control analysis, Fisher's exact test is commonly used but has been shown to be inappropriate. Several tests, such as the combined Z-test, the likelihood ratio test, and the differential test, can be used for the association analysis.