Sequencing Technologies and Their Use in Plant Biotechnology and Breeding

The development of DNA sequencing strategies has been a high priority in genetics research since the discovery of the structure of DNA and the basic molecular mechanisms of heredity. However, it was not until the work of Maxam and Gilbert (1977) and Sanger (Sanger et al, 1977) that the first practical sequencing methods were developed and implemented on a large scale. The first isolation and sequencing of a plant cDNA by Bedbrook and colleagues a few years later initiated the field of plant molecular genetics (Bedbrook et al, 1980). Plant biotechnology started shortly thereafter with the successful integration of recombinant DNA and sequencing techniques to generate the first transgenic plants using Agrobacterium (Fraley et al, 1983; Herrera-Estrella et al, 1983). The first genetic map in plants based on restriction fragment length polymorphisms (RFLPs; Bernatzky & Tanksley, 1986) enabled the capture of genetic variation and started the era of molecular marker-assisted plant breeding. Since then, sequencing methodologies have been essential tools in plant research. They have allowed the characterization and modification of genes and metabolic pathways, as well as the use of genetic variation for studies in species diversity, marker-assisted selection (MAS), germplasm characterization and seed purity. The determination of the reference genomes of Arabidopsis thaliana, rice and maize using Sanger sequencing strategies constituted major milestones that enabled the analysis of genome architecture and gene characterization in plants (The Arabidopsis Genome Initiative, 2001; International Rice Genome Project, 2005; Schnable et al, 2009). More recently, the development and increasing availability of multiple next-generation sequencing (NGS) technologies have reduced the research limitations and bottlenecks caused by limited sequence information (Metzker, 2010; Glenn, 2011).
It is difficult to overstate the influence that these massively parallel systems have had on our understanding of plant genomes and on the expansion, acceleration and diversification of breeding and biotechnology projects. At the same time, this influence tends to obscure the importance that capillary Sanger sequencing still has in day-to-day research and development work. This review describes the major sequencing technologies available today, their use, and their future prospects in basic plant genetics research, biotechnology and breeding in crop plants.


Sanger sequence analyzers
For more than 30 years and until recently, the Sanger and Maxam-Gilbert chemistries were the only practical methods to routinely determine DNA sequences in plants and other biological systems. During the 1980s and 1990s, Sanger-based platforms increased throughput by orders of magnitude and became the method of choice, while the Maxam-Gilbert method remained a low-throughput process. The development of automated Sanger systems was greatly facilitated by technical innovations such as thermal cycle-sequencing and single-tube reactions in combination with fluorescence-tagged terminator chemistry (Trainor, 1990). Additional improvements in parallelization, quality, read length, and cost-effectiveness were achieved by the development of automatic basecalling and capillary electrophoresis. In the current version of Sanger sequencing, a mixture of primer, DNA polymerase, deoxynucleotides (dNTPs) and a proportion of dideoxynucleotide terminators (ddNTPs), each labeled with a different fluorescent dye, is combined with the DNA template. During the thermal cycling reaction, DNA molecules are extended from templates and randomly terminated by the occasional incorporation of a labeled ddNTP. The DNA is then cleaned up and denatured. Detection is achieved by laser excitation of the fluorescent labels after capillary-based electrophoretic separation of the extension products. The differences in dye fluorescence generate a "four color" system that is easily translated by a computer to generate the sequence. Modern Sanger sequencers like the Applied Biosystems ABI 3730 have reached a high level of sophistication and can achieve routine read lengths close to 900 bp and per-base 'raw' accuracies of 99.99% or higher (Shendure & Ji, 2008). The ABI 3730xl analyzer can run 96 or 384 samples every 2-10 hours, generating approximately 100,000 bases of raw sequence at a cost of a few hundred dollars.
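The chain-termination principle described above can be illustrated with a minimal sketch. It is a toy model, not a description of any instrument's software: it assumes exactly one terminated fragment per position and no noise, and the function name `sanger_read` and the example template are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sanger_read(template_3to5: str, seed: int = 0) -> str:
    """Toy chain-termination read (illustrative assumption, no noise).

    template_3to5 is the template written 3'->5', so the newly
    synthesized strand (written 5'->3') is its base-by-base complement.
    """
    new_strand = "".join(COMPLEMENT[b] for b in template_3to5)
    # One terminated fragment per position: (length, dye of the terminal ddNTP).
    fragments = [(n, new_strand[n - 1]) for n in range(1, len(new_strand) + 1)]
    # Fragments of all lengths are mixed together in the reaction tube.
    random.Random(seed).shuffle(fragments)
    # Electrophoresis orders fragments by length; reading the dye of each
    # fragment in that order reconstructs the sequence.
    fragments.sort(key=lambda fragment: fragment[0])
    return "".join(dye for _, dye in fragments)


print(sanger_read("TACGGT"))  # prints ATGCCA
```

In a real run each fragment length is represented by many molecules and the basecaller works on noisy fluorescence traces, which is where quality scores come from.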

Roche 454
The 454 platform (now owned by Roche) was the first NGS platform available as a standalone system. DNA templates need to be prepared by emulsion PCR and bound to beads, with 1-2 million beads deposited into wells in a titanium-covered plate. The Roche 454 technology is based on pyrosequencing, and additional beads that have sulphurylase and luciferase attached to them are also loaded into the same wells to generate the light-producing reaction. DNA polymerase reactions are performed in cycles but, unlike Sanger, there are no terminators. Instead, a single dNTP species is supplied in each cycle, in limiting amounts, and the four dNTPs are alternated from cycle to cycle. Light emission after the reaction indicates the incorporation of the specific dNTP used in the cycle (Metzker, 2010). Because the intensity of the light peaks is proportional to the number of consecutive bases of the same type in the template, the signal can be used to determine the length of homopolymers, although accuracy decreases considerably with homopolymer length. The current 454 chemistry is able to produce the longest reads of any NGS system, about 700 bp, approaching those generated by Sanger sequencing. Unlike Sanger systems, however, 454 systems can sequence several megabases for less than 100 dollars.
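The relationship between signal intensity and homopolymer length can be sketched as follows. This is a simplified illustration, not the vendor's basecaller: the flow order "TACG", the rounding rule and the example signal values are assumptions made for the sketch.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert per-flow light intensities into a base sequence.

    Each flow supplies one dNTP species; the normalized signal is
    rounded to the nearest integer and read as the homopolymer length.
    flow_order and the rounding rule are illustrative assumptions.
    """
    bases = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        run_length = int(max(signal, 0.0) + 0.5)  # round half up
        bases.append(base * run_length)
    return "".join(bases)


# Hypothetical normalized signals for eight consecutive flows (T, A, C, G, T, A, C, G).
signals = [1.02, 0.05, 2.10, 0.98, 0.03, 3.45, 0.00, 1.10]
print(decode_flowgram(signals))  # prints TCCGAAAG
```

The sixth flow (3.45) sits almost halfway between a run of three and a run of four. Ambiguous signals of this kind become more frequent as homopolymers get longer, which is why accuracy drops with homopolymer length.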

Illumina
The Solexa platform (now owned by Illumina) has become the most widely used NGS system in plant biotechnology and breeding. Illumina captures template DNA that has been ligated to specific adapters in a flow cell, a glass enclosure similar in size to a microscope slide, with a dense lawn of primers. The template is then amplified into clusters of identical molecules, or polonies, and sequenced in cycles using DNA polymerase. Terminator dNTPs in the reaction are labeled with different fluorescent labels and detection is by optical fluorescence. As only terminators are used, only one base can be incorporated in each cluster in every cycle. After the reaction is imaged in four different fluorescence channels, the dye and terminator group are cleaved off and another round of dye-labeled terminators is added. The total number of cycles determines the length of the read and is currently up to 101 or 151, for a total of 101 or 151 bases incorporated, respectively. At the time of writing this review, this technology was able to yield the highest throughput of any system, with one of the highest raw accuracies. One major disadvantage is the short reads it produces. However, paired-end protocols effectively double the sequence obtained per template and facilitate some applications that were originally out of reach of the technology. The Illumina HiSeq 2000 sequencer is currently able to sequence up to 540-600 Gbp in a single 2-flow cell, 8.5-day run at a cost of about 2 cents per Mbp (http://www.illumina.com/systems/hiseq_2000.ilmn).
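The throughput and cost figures quoted above translate into run cost and genome coverage with simple arithmetic, sketched below. The 2.3 Gbp genome size used in the example is an assumption added for illustration (roughly the size of the maize genome), and the function names are hypothetical.

```python
def run_cost_usd(yield_gbp: float, cents_per_mbp: float) -> float:
    """Approximate reagent cost of a run from its yield and per-Mbp price."""
    return yield_gbp * 1000 * cents_per_mbp / 100  # Gbp -> Mbp, cents -> dollars


def fold_coverage(yield_gbp: float, genome_gbp: float) -> float:
    """Average number of times each genome position is sequenced."""
    return yield_gbp / genome_gbp


# 600 Gbp at about 2 cents per Mbp, as quoted in the text.
print(run_cost_usd(600, 2))  # prints 12000.0

# Assumed genome size of 2.3 Gbp (illustrative, roughly maize).
print(round(fold_coverage(600, 2.3)))  # prints 261
```

Average coverage hides large local differences: in repetitive genomes many short reads cannot be placed uniquely, so the usable coverage in such regions is lower than this figure suggests.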

Life Technologies SOLiD
ABI (now part of Life Technologies) has commercialized the SOLiD (Support Oligonucleotide Ligation Detection) platform. This platform is based on Sequencing by Ligation (SbL) chemistry. SbL is a cyclic method but differs fundamentally from other cyclic NGS chemistries in its use of DNA ligase instead of polymerase, and of two-base-encoded probes instead of individual bases as units. In SbL, a fluorescently labeled 2-base probe hybridizes to its complementary sequence adjacent to the primed template and is ligated. Non-ligated probes are then washed away, followed by fluorescent detection. In SOLiD, the cycle (probe hybridization, ligation, detection, and probe cleavage) is repeated ten times to yield ten color calls spaced at five-base intervals. The extension product is then removed and additional ligation rounds are performed with an n-1 primer, which shifts the calls by one position. Color calls from the ligation rounds are then ordered into a linear sequence to decode the DNA sequence (Metzker, 2010). SOLiD has similar throughput and cost per base to Illumina. It also has the best raw accuracy among commercial NGS systems.
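The final decoding step, from ordered color calls to bases, can be sketched as follows. The sketch uses the commonly described two-base color code, in which each color corresponds to the XOR of 2-bit codes for adjacent bases (A=0, C=1, G=2, T=3). That encoding, the function names and the example sequence are assumptions for illustration, not details given in this review.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3 (assumed encoding)


def encode_colors(sequence: str) -> list:
    """Color call for each pair of adjacent bases (XOR of their codes)."""
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base: str, colors: list) -> str:
    """Rebuild the base sequence from a known first base and color calls."""
    sequence = [first_base]
    for color in colors:
        previous = BASES.index(sequence[-1])
        sequence.append(BASES[previous ^ color])
    return "".join(sequence)


colors = encode_colors("ATGGC")
print(colors)                               # prints [3, 1, 0, 3]
print(decode_colors("A", colors))           # prints ATGGC
# A single miscalled color changes every downstream base:
print(decode_colors("A", [3, 2, 0, 3]))     # prints ATCCG
```

Because every base contributes to two adjacent color calls, an isolated color error is inconsistent with a true single-base difference, which changes two adjacent colors. Comparing reads to a reference in color space exploits this property to separate measurement errors from real polymorphisms.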

Life Technologies Ion Torrent
Ion Torrent is the commercial name for a new NGS platform now owned by Life Technologies (Rothberg et al, 2011; http://www.iontorrent.com). At the time of writing this chapter, the system was not widely used in plant research and its use elsewhere had been described in a limited number of published research papers (e.g. Miller et al, 2011). However, with recent upgrades, fast turnaround times and affordability, the system is finding its way into research laboratories. Currently, its usefulness is being evaluated for a number of applications in plant biotechnology and breeding. Ion Torrent differs from other NGS systems in that its chemistry does not require fluorescence or chemiluminescence, or, for that matter, optics (e.g. a CCD camera) to work. Beads, each carrying PCR clones from a single original fragment, are subjected to polymerase synthesis using standard dNTPs on an ion chip. The ion chip is a massively parallel semiconductor-sensing device that contains ion-sensitive, field-effect transistor-based sensors (tiny pH meters, essentially), coupled to more than one million wells where the polymerization reaction occurs. Reactions are run in cycles, each supplying a single nucleotide species, in a way that is analogous to the Roche 454 system. In each cycle, base incorporation releases a proton, and the resulting change in pH is detected electronically. Ion Torrent has the lowest throughput but also the fastest turnaround times of all commercially available NGS systems. The current Ion Torrent chip can yield several hundred thousand reads with an average length of about 100 bp in less than 2 hours.

Other NGS platforms, Helicos Heliscope, Polonator
Other NGS systems have been marketed in the last few years, but they have had limited use in plant sciences. Helicos developed the first commercial single-molecule sequencer, called HeliScope. However, very few units were sold due to the cost of the machine, on-site requirements and other considerations. Currently, Helicos provides sequencing as a service. One additional company, Azco-Biotech, is marketing the Max-Seq Genome sequencer (http://www.azcobiotech.com/instruments/maxseq.php). This commercial version of the academic, open-source Polonator can run either sequencing by synthesis or sequencing by ligation protocols, similar to Illumina and SOLiD, respectively, although it generates shorter reads of 35 or 55 bp.

Pacific Biosciences and the 3rd generation
Pacific Biosciences has launched the PacBio RS platform, considered the first commercially available 3rd-Generation system. The first early-access instruments were deployed in late 2010 and the first commercial batch became available by mid-2011. The PacBio system is based on SMRT, a single-molecule sequencing chemistry with real-time detection. The sequencing cell has DNA polymerases attached to nanowells and exposed to single-molecule templates and labeled dNTPs. No terminators are used, although conditions are set to slow polymerization to a level that can be detected by a CCD camera. Each dNTP has a unique fluorescent label that is detected and then cleaved off during synthesis. Polymerization is detected as it happens, at several bases per second. Because of this real-time detection and the enzyme's processivity, this method has the potential to generate reads in excess of 10 kilobases in a few minutes. The potential of a technology that is able to sequence single molecules and produce long reads is immense. However, the PacBio technology may need to overcome a number of technical challenges before it reaches widespread use in plant sciences. Average read length in current outputs exceeds 1 Kbp, although the single-pass error rate has been reported to be 15%, considerably higher than that of other sequencing platforms (Glenn, 2011). One major source of errors consists of deletions produced during detection. As will be discussed later, improvements in raw quality and further gains in read length will broaden the range of optimal applications for PacBio.

Applications in plant research
Sequencing platforms have different combinations of throughput, cost, read length, number of reads and raw accuracy. Their effective use in plant research and development programs depends on matching the best Sanger, NGS or Third Generation platform to specific applications (Morozova & Marra, 2008; Schuster, 2008; Varshney et al, 2009). One common misconception about Sanger-based systems is that they have become, or will soon become, obsolete. On the contrary, Sanger capillary systems are still the most widely used sequencers in routine molecular biology applications and are not likely to disappear in the near future. While the number of applications for which they are optimal has decreased, Sanger sequencers remain essential in many. The characteristics of capillary Sanger systems make them better suited for confirmatory sequencing in recombinant DNA technology, where the need to determine specific targets at low throughput makes them cost-effective. They are also the best choice in low- to medium-throughput, low-complexity shotgun and targeted sequencing experiments, where the use of highly parallel random sequencing is impractical. Currently, no other chemistry or technology can match Sanger's combination of read length and quality, which remains the gold standard of sequencing.
Most sequencing applications can be divided into two categories: 1) de novo sequencing, and 2) resequencing. In the case of de novo sequencing, reads are obtained from an unknown sequence and either assembled to reconstruct this sequence or compared directly to reads from other unknown sequences. In the case of resequencing, reads are mapped or aligned to a known reference sequence. De novo applications are usually slower and more computer-intensive than resequencing, but are needed to reconstruct genomes and transcriptomes in species with unknown genomes. Major resequencing applications include polymorphism discovery and transcription profiling. This section emphasizes the use of new massive sequencing technologies and how they have recently been deployed in de novo and resequencing applications in plant research.

Physical maps and reference genomes
It is not surprising that considerable effort has been devoted during the last 15 years to the sequencing of plant genomes. The determination of nuclear and organellar genomes enables the identification of genes and regulatory elements, and the analysis of genome structure. This information improves our understanding of the role of genes in development and evolution, and facilitates the discovery of related genes and functions across species (Messing & Llaca, 1998; Feuillet et al, 2011). Reference genomes are also important tools in the identification, analysis and exploitation of the genetic diversity of an organism in plant population genetics and breeding (Varshney et al, 2009; Edwards & Batley, 2010; Jackson et al, 2011). The sequencing of the genomes of humans and other vertebrates in the 1990s provided the technological pathway for the initial sequencing of genomes in plants (International Human Genome Sequencing Consortium, 2001; Venter et al, 2001). However, the structure of plant genomes poses additional challenges. Plant genomes are characterized by higher proportions of highly repetitive DNA and by the presence of segmental duplications or full genome duplications due to polyploidization events. The 1C genome content in maize, for example, is smaller than in humans but consists of higher proportions and larger tracts of high-copy elements such as retrotransposable elements. Only a small fraction of the genome corresponds to exons and regulatory regions, usually in low-copy DNA islands that harbor single genes or small groups of genes (Schnable et al, 2009; Llaca et al, 2011). The average plant genome, at approximately 5.8 Gbp, is larger than the human genome, and plant genomes have a wider size distribution than those of mammals (Bennett & Leitch, http://data.kew.org/cvalues). Some important crops like hexaploid wheat can have genomes that are more than 4 times the size of the human genome.
The first completed reference plant genomes, Arabidopsis and rice, were from model plant species with small genomes, approximately 4% and 12% the size of the human genome, respectively. The genomes were produced by Sanger-based shotgun sequencing of overlapping bacterial artificial chromosomes (BACs) (The Arabidopsis Genome Initiative, 2001; International Rice Genome Project, 2005). This BAC-by-BAC approach requires the initial construction, fingerprinting and physical mapping of large numbers of random BACs (Soderlund et al, 1997; Ding et al, 2001). A subset of BACs is selected based on a minimum tiling path, and shotgun libraries are individually constructed from each BAC and sequenced by subclone end-sequencing and assembly. Finally, BAC sequences are completed using a targeted approach aimed at closing sequencing gaps and finishing low-quality regions. This process, albeit slow and laborious, produced the only two references considered finished to date. These projects were performed by large collaborative consortia and took several years of fingerprinting and sequencing work. The cost of the Arabidopsis genome project has been estimated at US$70 million (Feuillet et al, 2011). In maize, a draft reference genome was completed from the inbred line B73 using a similar approach, although no gap closure or low-quality finishing steps were completed. The maize draft genome, a highly valuable genetic resource available to the plant research community, was accomplished by multiple laboratories at an estimated cost of tens of millions of dollars in a joint NSF/DOE/USDA program. The three BAC-by-BAC sequencing projects mentioned above benefited from working in small units (BACs), which minimized problems associated with the misassembly of highly repetitive DNA. One important consideration about BAC-by-BAC genomes is that they are not really complete. They have representation gaps in regions that are "unclonable" under the conditions used to prepare the BAC libraries. Many of these unclonable regions correspond to tandem repeats such as telomeric sequences and other repetitive regions, although they may also include gene space (Schnable et al, 2009). Furthermore, even in BAC-by-BAC approaches, the complexity of many plant genomes of moderate size, such as maize, prevents the creation of a complete physical assembly, and some sequences may still lie in unassigned regions.
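The selection of a minimum tiling path from a physical map can be sketched as a greedy interval-cover problem. This is a simplified illustration rather than the procedure used by any of the cited projects: clone coordinates are hypothetical, the function name is an assumption, and real pipelines also require a minimum overlap between neighbouring clones and work from fingerprint-based contigs instead of exact coordinates.

```python
def minimum_tiling_path(clones):
    """Pick the fewest clones that cover the mapped region end to end.

    clones: list of (name, start, end) positions on the physical map.
    Greedy rule: among clones that start inside the region covered so
    far, always take the one that extends furthest to the right.
    """
    ordered = sorted(clones, key=lambda clone: (clone[1], -clone[2]))
    target_end = max(clone[2] for clone in ordered)
    covered = ordered[0][1]
    path = []
    i = 0
    while covered < target_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in the physical map at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical BAC positions in kb.
bacs = [("BAC1", 0, 150), ("BAC2", 50, 180), ("BAC3", 100, 300),
        ("BAC4", 170, 320), ("BAC5", 290, 450), ("BAC6", 310, 460)]
print(minimum_tiling_path(bacs))  # prints ['BAC1', 'BAC3', 'BAC5', 'BAC6']
```

Sequencing only the clones on the path avoids redundant shotgun libraries, which is the main cost saving of this step. The error raised on a gap mirrors the "unclonable" regions mentioned above, where no clone bridges two contigs.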
The high cost, long time, and logistics of BAC-by-BAC projects led many groups to adopt an alternative strategy also previously implemented in humans and other vertebrates: Whole-Genome Sequencing (WGS; Venter et al, 2001). In WGS, whole genomic DNA is randomly sheared and the fragments are end-sequenced and assembled. This strategy has improved with the use of multiple genomic libraries with different insert sizes and with improved assembly software, which can use the known insert sizes as constraints during assembly. Not surprisingly, the first WGS, Sanger-based draft genomes were obtained from small genomes with relatively small amounts of repetitive DNA, including Populus (Tuskan et al, 2006), grape (Jailon et al, 2007), and papaya (Ming et al, 2008). More recent refinements enabled the sequencing of larger genomes such as Sorghum bicolor (~730 Mbp; Paterson et al, 2009) and soybean, an ancestral tetraploid (1.1 Gbp; Schmutz et al, 2010). The cost and time needed to accomplish these projects are reduced in comparison to BAC-by-BAC projects, although still considerable. In the case of soybean, the largest plant genome completed by Sanger WGS, sequencing was done by a team of 18 institutions, and a total of more than 15 million Sanger reads were produced and assembled from multiple libraries with average insert sizes ranging from 3.3 kb to 135 kb (Schmutz et al, 2010). In general, WGS approaches are effective in the determination of gene space in small and medium-size plant genomes. However, the reduction in time and cost is achieved at the expense of assembly fidelity in repetitive regions and an expanded need for computational resources. WGS-based approaches increase potential assembly artifacts due to haplotype and homeolog collapse in regions with high identity. This may leave large numbers of scaffolds still to be mapped.
The use of NGS platforms in WGS projects has improved the ability to rapidly determine reference genomes at the expense of overall assembly quality, especially in high-copy and duplicated regions. The potato reference genome (The Potato Genome Sequencing Consortium, 2011) was successfully constructed using a combination of Illumina, 454 and Sanger reads. The implementation of hybrid methods using Roche 454 sequencing in combination with Sanger sequences has been effective in reducing the overall cost and time needed to generate high-quality sequences in gene space regions (Rounsley, 2009). Examples of hybrid references are cucumber (Huang et al, 2009) and apple (Velasco et al, 2010). The use of NGS-only WGS assemblies, especially those based on Illumina or SOLiD reads, can reduce cost and time by orders of magnitude in relation to Sanger or hybrid strategies. Medium-size genomes such as maize can be covered up to 200-fold in a single 9-day run in an Illumina HiSeq 2000 system for under $30,000, for example. However, correct mapping and de novo assembly of these shotgun short reads has been problematic. Short reads have raised concern about the ability to accurately assemble genomes with a high abundance of near-identical repetitive sequences and gene duplications. The difficulty of using shotgun short-read data for de novo assembly has also been a challenge in humans and other animals, but it is exacerbated in plants due to the higher proportion of highly repetitive DNA, segmental duplications and polyploidization. However, improvements have been made recently by using strategies that rely on paired-end reads and mate pairs, the use of multiple libraries with different insert sizes and the development of software with algorithms that use end-sequence distance information from these libraries. Using these strategies, contig size, particularly in gene-rich regions, has increased considerably (Li et al, 2010). As read length in NGS continues to expand (e.g. Illumina platforms can perform 150-bp paired ends, and Roche 454 has released a long-read chemistry), assembly will be improved.
The use of NGS to sequence genomes in a BAC-by-BAC or pooled-BAC approach can be facilitated by the use of new physical mapping technologies such as whole genome profiling (WGP). This process allows the physical mapping of BACs using a restriction-based fingerprinting approach analogous to high-information-content gel electrophoresis. In this system, BAC clones are pooled, and DNA from the pools is then prepared and digested with a restriction endonuclease. Tags are then added to the ends of the fragments and the labeled fragments are end-sequenced using Illumina chemistry. The sequence data are processed and analyzed by an optimized FPC software program to build BAC contigs across the genome. The sequence data obtained during WGP can be combined with BAC, BAC pool, and/or whole-genome sequencing data (Steuernagel et al, 2009; van Oeveren et al, 2011).
Regardless of the assembly strategy or sequencing technology used, the completion of reference genomes for most plants remains a big challenge. All publicly available completed references indicated in Figure 2 correspond to plant species with below-average genome sizes. Sequencing full genomes with vast numbers of duplications and long, continuous stretches of high-copy transposable elements remains out of reach with current technology. The major technological breakthrough required here is the improvement of 3rd-Generation technologies able to produce long reads. Such long reads can then be used to improve contig length, either in combination with other technologies or by themselves.

Development of pan-genomes
The significant sequence diversity and the high structural polymorphism observed in important plant models such as maize highlight a serious limitation in the use of a single reference genome as a sufficient representative of a species. There is accumulating evidence that the large differences in 1C DNA content observed between closely related species, or between subspecies, landraces and lines in the same species, correspond not only to differences in repetitive, non-coding DNA but also to differences in gene content (Morgante et al, 2007; Llaca et al, 2011). Deep resequencing and the addition of de novo assemblies of non-reference genomes are necessary to capture the gene space included in larger structural variations (i.e. CNVs, PAVs, and large indels). With improvements in long-read sequencing technologies and in assembly software and strategies, the creation of reference "pan-genomes" for certain species will be an important resource in plant genetics research.

Genome surveys and partial genome assembly
The use of genome survey sequence (GSS) and partial targeted de novo assembly strategies can be useful in gene discovery research projects involving non-model plant species, or species with significant sequence diversity and high structural polymorphism. Maize and other crops exhibit pan-genomes that can be considerably larger than the standard available reference genome, with presence-absence variation (PAV) polymorphisms including both non-genic regions and gene space. Partial de novo assembly is also useful for gene discovery in large genomes or in any non-model species where a region of interest can be genetically mapped and a partial physical map can be derived from the genetic map. Target regions may be included in one single BAC clone or in a series of overlapping BAC clones, determined by fingerprinting or by known probes, that are used to develop assemblies. Rounsley et al. (2009) sequenced and assembled a 19-Mbp region of chromosome 3 in rice using 454 reads and were able to generate scaffolds ranging in size from 243 Kbp to 518 Kbp. Similar approaches have been used in cacao (Feltus et al, 2011) and barley (Steuernagel et al, 2009).

Plant-associate genomics
Genome sequence information from plant pathogenic, commensal and mutualistic species is an important resource for plant improvement. Knowledge of the gene content, expression and diversity of plant-associated organisms helps us understand the basis of their interactions with plants and develop strategies to modify such interactions.
Sequencing the genomes of mutualistic endophytes can help in the modification of nitrogen fixation and other processes, which are essential for developing a more sustainable agriculture.Genomic resources created from pathogens and their non-pathogenic relatives provide not only targets to develop increased disease resistance and pest control, but improved mechanisms for gene transfer in plant biotechnology as well (Wood et al, 2001).
Due to their simplicity, DNA and RNA viruses were among the first pathogens to be sequenced. Strains for more than 700 plant viruses, ranging from 1.2 to 30 Kb, have been completely sequenced to date (http://www.ncbi.nlm.nih.gov/genomes/). However, most recent advances in plant-associated genomes have been related to bacterial pathogens. These genomes represent an important data mining resource for plant pathologists regarding some of the most devastating agricultural diseases (Stavrinides, 2009). From a technical point of view, they are also simpler to sequence than plant genomes and better suited for current technologies. Most bacterial genomes are approximately 5 Mbp, approximately 1,000-fold smaller than an average plant genome, with a relatively simple structure, often consisting of a single replicon, although other replicons such as megaplasmids and plasmids are frequently present. Bacterial genomes have few or no highly repetitive elements. Due to their importance and relative technical simplicity, a considerable number of bacterial pathogens had already been sequenced by Sanger methodologies before the advent of NGS. Among the first published genomes are those of the citrus chlorosis agent Xylella fastidiosa (Simpson et al, 2000), Pseudomonas syringae pv tomato DC3000 (Buell et al, 2003), and Ralstonia solanacearum (Salanoubat et al, 2002). Currently, there are approximately 50 public finished and draft genomes from pathogenic bacteria, plus an even larger number in progress or unpublished. The smallest genome is that of Phytoplasma asteris, an obligate intracellular pathogen (0.9 Mbp; Oshima et al, 2004). Azoarcus is a mutualistic endophyte in cereal species, supplying biologically fixed nitrogen to its host while colonizing plants in high numbers without eliciting disease. Unlike related species that are pathogens, Azoarcus lacks the genes that pathogenic bacteria use to degrade plant cell walls.
Fungi and stramenopiles are eukaryotic groups that include important plant pathogens and mutualists and have been the focus of genome sequencing projects. Stramenopiles such as Phytophthora spp. are fungus-like eukaryotes, although they are more closely related to diatoms. Genomes from at least 15 fungal and stramenopile pathogens are publicly available (see the Phytopathogen Genomics Resource, http://cpgr.plantbiology.msu.edu/ for a comprehensive list). Among them is Phytophthora infestans, the stramenopile causal agent of the Irish Potato Famine in the 19th century. The 240 Mbp sequence was assembled using Sanger sequencing (Haas et al, 2009). Martin et al (2008) sequenced the 65-Mbp genome of the fungus Laccaria bicolor, which is part of a mycorrhizal symbiosis. The adoption of NGS for de novo sequencing of prokaryotic and fungal plant pathogens has been effective, especially when using a combination of 454 and Illumina reads (Reindhart et al, 2009; DiGuistini et al, 2009).
The biological relevance and economic importance of certain nematodes and insects have made them desirable targets for genome sequencing. Currently, publicly available genomes have been produced by Sanger-based WGS, including a draft for the Northern root-knot nematode Meloidogyne hapla (Opperman et al, 2008). The genomes of the aphid Acyrthosiphon pisum and the beetle Tribolium castaneum, which damage crops and stored grains, have been completed (International Aphid Genomics Consortium, 2010; Tribolium Genome Sequencing Consortium, 2008).
At two degrees of separation from the plant, genome sequences of bacterial and fungal species that are pathogens of plant-infesting insects and nematodes are important resources in developing effective and safer strategies for pest resistance. One such resource is the complete sequence of the replicons comprising the genome of Bacillus thuringiensis, which expresses insecticidal crystal proteins that have been used to engineer insect-resistant crops (Roh et al, 2007; Challacombe et al, 2007).

Metagenomics
The previous section underscores the importance of understanding the genomics of plant-associated microbiota. However, there are multiple interactions between plants and uncharacterized microorganisms, some of them still to be discovered, that cannot be grown in culture in the laboratory (Riesenfeld et al, 2004; Allen et al, 2009). There is increasing evidence of the effect these organisms have on traits such as disease resistance and nitrogen utilization (Handelsman, 2004). Metagenomic studies of these microbial communities can exploit the availability of DNA amplification techniques and highly parallel, clone-free NGS sequencers to sequence part of their genomes (Chen & Pachter, 2005; Leveau, 2007). There are two major roles that high-throughput sequencing technologies can play in metagenomics applied to agriculture. The most common role is the mass sequencing of environmental (e.g. soil, water) samples to provide a systems-biology view of the microbiota under study. This type of study focuses on the genetic diversity of large numbers of plant associates and on their interactions with plants (Krober et al, 2009). Roche 454 pyrosequencing of small subunit (16S) ribosomal RNA amplicons (pyrotags) is a method for profiling microbial communities that provides deep coverage at low cost, although it is complicated by several artifacts, including chimeric sequences caused by PCR amplification and sequencing errors. Illumina protocols have also been developed for the sequencing of "itags" derived from 16S hypervariable regions, for deep metagenomics analysis (Degnan & Ochman, 2011).
A second trend in modern metagenomics involves its exploitation for the discovery of biomolecules with novel properties. Current discovery strategies involve the screening of metagenomic libraries. Jin et al (2007) identified a novel EPSP gene with high resistance to glyphosate and potential use in plant biotechnology by screening a metagenomic library derived from a glyphosate-polluted area. However, the use of ultra-high-throughput sequencing that could lead to simpler strategies is currently limited by the length of reads in NGS systems. Long reads are needed to generate full-sequence information within a single read. The use of sequencing approaches in biomolecule discovery will be feasible with the improvement of longer-read, 3rd-Generation technologies such as PacBio.

Genomic variant discovery for marker development
Linkage mapping, diversity and evolutionary studies in plants rely on the ability to identify and analyze single nucleotide and insertion-deletion polymorphisms (SNPs and Indels), which can be directly related to differences in a phenotype of interest, be genetically linked to its causative factor, or indicate relationships between individuals in populations (Rafalski, 2002). The implementation of high-throughput PCR-based marker technologies (e.g. TaqMan) and improvements in Sanger sequencing throughput increased the limits for both the number of markers and the number of samples in marker-related studies. These changes have enabled new applications in linkage and association mapping analysis, marker-assisted selection (MAS) and characterization of germplasm. They have also facilitated fingerprinting and determination of seed purity. More recently, the emergence of NGS has enabled genome-wide discovery of polymorphisms on a massive scale. The Roche 454 system has been used effectively for polymorphism discovery (Gore et al, 2009a), although the higher throughput and lower cost of Illumina and SOLiD technologies make them particularly well suited for major programs when a reference genome is available (Deschamps & Campbell, 2010).
In species such as Arabidopsis and rice, which have a small genome and an available reference genome, NGS-based genome-wide variant discovery can be simply accomplished by WGS (Ossowski et al, 2008; Huang et al, 2010). In medium- to large-sized genomes, where the proportion of gene space is reduced and much of the sequence is repetitive, the use of reduced-representation strategies can improve cost effectiveness. Reduced-representation strategies involve the selection of specific regions of the genome to reduce complexity and increase coverage for the selected regions. Several enrichment strategies can be used to reduce genome representation. These approaches can utilize previous knowledge about the genome or region of interest. Examples of knowledge-driven enrichment include multiplex long-range PCR, molecular inversion probes (MIP), and sequence capture (Mamanova et al, 2010). These methods are usually preferred when a specific region or gene family is targeted. However, random approaches based on restriction digestion and transcriptome sequencing are more adequate in most genome-wide projects (Deschamps & Campbell, 2010). The use of methylation-sensitive enzymes or endonucleases that preferentially cut in low-copy DNA has been particularly successful in strategies to identify large sets of SNPs in maize and soybean varieties (Gore et al, 2009b; Deschamps et al, 2009; Hyten et al, 2010). Illumina-based SNP discovery strategies using reduced representation libraries (RRLs) have been described by Deschamps and colleagues in soybean. By combining one 6-bp methylation-sensitive and one 4-bp restriction endonuclease, they demonstrated enrichment for gene space and a considerable reduction of repetitive DNA reads (Deschamps et al, 2009; Fellers, 2008). Other methyl-filtration methods for reduced representation consist of digesting DNA with the endonuclease mcrBC (Gore et al, 2009; Palmer et al, 2003). Sequences derived from cDNA have been used in both Sanger and NGS sequencing to reduce representation for polymorphism detection (Barbazuk et al, 2007; Trick et al, 2009). One major advantage of this method is the direct targeting of exonic DNA, which increases the chance of detecting functional SNPs, especially when used in conjunction with cDNA normalization methods. However, it can also constrain SNPs to the relatively small number of genes expressed in the tissue and stage used. Both standard whole-genome and RRL approaches usually yield a massive number of polymorphisms, at a scale that is beyond prior Sanger-based projects. For example, Lam et al (2010) reported the genome resequencing of 31 wild and cultivated soy varieties, which led to the identification of more than 10 million SNPs in total, of which more than 1 million were in genic regions. Nelson et al (2011) resequenced 8 sorghum (Sorghum bicolor) accessions using a reduced-representation approach in an Illumina system and identified 283,000 SNPs. With seemingly unlimited numbers of SNPs, current bottlenecks have shifted from the discovery phase to marker assay development and validation.
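The idea behind restriction-based reduced representation can be illustrated with a small in silico digest. The sketch below is only a toy model under stated assumptions: the function names, the size window and the example sequence are hypothetical, methylation sensitivity is not modeled, and the EcoRI site (G^AATTC) is used purely as a familiar illustration rather than as the enzyme of any study cited above.

```python
def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site`.

    `cut_offset` is the position inside the recognition site where the
    cut falls (1 for G^AATTC). Returns the list of resulting fragments.
    """
    sequence = sequence.upper()
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return [f for f in fragments if f]


def size_select(fragments, min_len, max_len):
    """Keep only fragments inside the chosen size window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def fraction_retained(sequence, selected):
    """Share of the original sequence represented in the selected fragments."""
    return sum(len(f) for f in selected) / len(sequence)


# Toy example (hypothetical sequence, illustrative size window).
toy = "AAA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"
fragments = digest(toy, "GAATTC", 1)
selected = size_select(fragments, 5, 8)
print(fragments, selected, round(fraction_retained(toy, selected), 2))
```

In a real project the size window and enzyme combination would be chosen so that the retained fraction is enriched for low-copy sequence; that choice depends on the genome and is not captured by this sketch.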
In plant species where high-quality reference genomes are not available, variant discovery using an NGS resequencing approach can still be accomplished by using alternative references, such as high-quality transcriptome assemblies (see section 3.9) or de novo partial assemblies of individual BACs or BAC contigs (see section 3.5). However, both strategies carry some additional risk and validation must include potential for detection of repetitive sequences of paralogous genes. An alternative is an annotation-based strategy, as described in the wheat relative Aegilops tauschii by You et al (2011). Its genome is larger than 4 Gbp and has a large proportion (more than 90%) of repetitive DNA. They produced Roche 454 shotgun reads at low genome coverage from one genotype and identified single-copy sequences and repeat junctions from repetitive sequences, as well as sequences shared by paralogous genes. SOLiD and Solexa reads were then generated from another genotype and were mapped to the annotated Roche 454 reads. In this case, 454 reads provide a DNA "context" surrounding the putative SNPs, which can be used to generate genome-wide markers. They were able to identify nearly 500,000 SNPs with a validation rate higher than 81%.
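The mapping-based discovery step described in this section can be sketched as a naive pileup comparison. This is a minimal illustration and not the pipeline of any cited study: the function name, the ungapped alignment format and the `min_depth` and `min_fraction` thresholds are assumptions, and the filters needed in practice for repeats and paralogs are omitted.

```python
from collections import Counter


def call_snps(reference, alignments, min_depth=3, min_fraction=0.9):
    """Report positions where mapped reads consistently disagree with the reference.

    `alignments` is a list of (start, read_sequence) pairs describing
    ungapped placements of reads on `reference` (a simplifying assumption).
    Returns a list of (position, reference_base, alternate_base, depth).
    """
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support / depth >= min_fraction:
            snps.append((pos, reference[pos], base, depth))
    return snps


# Toy example: three overlapping reads carrying a T at position 4.
reference = "ACGTACGTAC"
reads = [(0, "ACGTTCGT"), (1, "CGTTCGTA"), (2, "GTTCGTAC")]
print(call_snps(reference, reads))
```

The high `min_fraction` default reflects the expectation that inbred lines are largely homozygous; positions with mixed alleles are skipped here, and in practice might deserve inspection as possible paralogous alignments.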

QTL and eQTL mapping, hapmaps and WGAS
In plants, most agronomically important traits are quantitative. Plant yield, flowering time, sugar content, disease resistance and fruit weight are examples of quantitative traits, which result from the segregation of many genes and are influenced by environmental interactions (Paran & Zamir, 2003). While quantitative traits have been studied for more than 100 years, the mapping of the underlying quantitative trait loci (QTL) could only be accomplished after the development of sequencing methodologies, molecular markers and improved statistical methods. Furthermore, until 2005, only a small fraction of mapped plant QTL had been cloned (Salvi & Tuberosa, 2005; Frary, 2000). One major difficulty was the low resolution of available mapping strategies. Before the advent of NGS platforms, most QTL identification and cloning was based on linkage mapping strategies. In linkage mapping, polymorphisms are identified between two parents and then followed in a large segregating population. The linkage of different regions of the genome to the individual phenotypes can then be inferred statistically by identifying recombinants that show phenotypic differences in the trait of interest. One drawback to linkage mapping is the low resolution that results from the relatively few recombinants generated from two original parents in a limited number of generations.
Even in cases of QTL with large effects on the total genetic variance, intervals can encompass a large genetic and physical distance and require walking through several megabase-pairs of sequence, with a large number of potential candidates (Yano et al, 1997; El-Assal et al, 2001; Liu et al, 2002). In maize, recent linkage mapping studies have identified QTL with relatively large effects on oil content (Zheng et al, 2008) and root architecture (Ruta et al, 2010).
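The statistical inference of linkage from recombinant counts can be made concrete with the standard two-point LOD score, a textbook statistic that the text above does not name explicitly. The sketch assumes a simple design in which every progeny can be scored as recombinant or non-recombinant between two loci; the function names are illustrative.

```python
import math


def recombination_fraction(recombinants, total):
    """Observed proportion of recombinant progeny between two loci."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recombinants / total


def lod_score(recombinants, total):
    """log10 likelihood ratio of linkage at the observed recombination
    fraction versus free recombination (r = 0.5)."""
    r = recombination_fraction(recombinants, total)
    non_recombinants = total - recombinants
    log_linked = 0.0
    if recombinants:
        log_linked += recombinants * math.log10(r)
    if non_recombinants:
        log_linked += non_recombinants * math.log10(1 - r)
    return log_linked - total * math.log10(0.5)


# With few progeny, a small number of recombinants gives only coarse resolution.
for k, n in [(0, 10), (2, 100), (50, 100)]:
    print(k, n, recombination_fraction(k, n), round(lod_score(k, n), 2))
```

Because the estimate depends on counting rare recombinants, the number of progeny limits how finely an interval can be resolved, which is the drawback noted above.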
The development of high-throughput genotyping technologies and later the emergence of NGS platforms have enabled the use of genome-wide association studies (GWAS) and bulked segregant analysis to map plant QTL (Rafalski, 2010; Schneeberger & Weigel, 2011). Unlike linkage mapping, GWAS exploit the natural diversity generated by multi-generational recombination events in a population or panel (Risch & Merikangas, 1996; Yu & Buckler, 2006; Belo et al, 2008; Nordborg & Weigel, 2008). This results in increased resolution compared to linkage mapping populations, as long as enough markers are provided: GWAS may require hundreds of thousands or even millions of genetic markers to achieve sufficient coverage. Before NGS, such marker density was unfeasible and linkage disequilibrium, or association, mapping studies needed to focus on polymorphisms in candidate genes that were suspected to have roles in controlling phenotypic variation for one specific trait of interest (Thornsberry et al, 2001). In plants, the availability of NGS and the ability to create lines of individuals with identical or near-identical background offer the potential to create public GWAS resources that can be accessed by multiple groups and rapidly resolve complex traits. Plant GWAS can be performed in large numbers of samples in replicated trials using inbreds and recombinant inbred lines (RILs) (Zhu et al, 2008). One or more research groups can then analyze one or many traits in multiple environments. The most important GWAS resource in maize is a collection of recombinant inbred lines derived from a nested association mapping (NAM) population (Gore et al, 2009). The maize NAM population is a collection of 5,000 RILs in sets of 200, each set derived from one of 25 populations. (Each of the 200 RILs is derived from one F2 plant from a cross between one of 25 inbred lines and B73.) The original inbred lines that were used as founders of the NAM have been resequenced using an NGS reduced-representation approach. Such resequencing surveys (HapMaps) include a high-quality data set consisting of 1.4 million SNPs and 200,000 indels spanning the 5,000 inbred lines. Seeds from the RILs can be used to grow and phenotype plants for any trait of interest (McMullen et al, 2009). Recent studies, all derived from the same NAM resource, demonstrate the effectiveness of this approach to identify and characterize QTL. Buckler et al (2009) identified 50 loci that contribute to variation in the genetic architecture of flowering time, with many loci showing small additive effects. Tian et al (2011) also identified large numbers of QTL with small effects determining leaf architecture. Poland et al (2011) identified candidate genes for resistance to northern leaf blight in 29 loci, which included QTL with small additive effects. Kump et al (2011) identified QTL for southern leaf blight. HapMaps in rice have been reported by Huang et al (2010), who resequenced at low coverage a total of 517 landraces, yielding a total of 3.6 million SNPs. The study identified QTL with minor and major contributions to phenotypic variance for drought tolerance, spikelet number and 12 additional agronomic traits. In Medicago truncatula, Branca et al (2011) detected more than 3 million SNPs in 26 inbred lines to study the genetics of traits related to symbiosis and nodulation. In Arabidopsis thaliana, the 1,001 Genomes Project, started in 2008, aims at discovering polymorphisms in that number of wild accessions (Weigel & Mott, 2009; http://1001genomes.org/). The complete genome sequences of over 80 accessions have already been released and inbred lines have been generated from each accession.
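At its simplest, an association scan tests each marker in turn against the phenotype. The following sketch ranks biallelic markers by the squared correlation between genotype (coded 0 or 1, as would suit inbred lines) and trait value. It is a deliberately naive illustration: the function names and toy data are hypothetical, and it omits the corrections for population structure and relatedness that real GWAS analyses, including those cited above, depend on.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker or constant phenotype
    return sxy * sxy / (sxx * syy)


def association_scan(genotypes, phenotype):
    """Rank markers by the share of phenotypic variance each one explains.

    `genotypes` maps marker name -> list of 0/1 calls, one per line,
    in the same order as `phenotype`.
    """
    scores = {name: r_squared(calls, phenotype) for name, calls in genotypes.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: four inbred lines, three markers, one trait.
genotypes = {
    "snp1": [0, 0, 1, 1],
    "snp2": [0, 1, 0, 1],
    "snp3": [0, 0, 0, 0],
}
phenotype = [1.0, 1.0, 3.0, 3.0]
print(association_scan(genotypes, phenotype))
```

With hundreds of thousands of markers, the same per-marker test is simply repeated across the genome, which is why marker density and multiple-testing control become the practical constraints.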
Finally, the determination of genome variants and transcription profiling by NGS approaches can be used effectively in the determination of expression quantitative trait loci (eQTLs; Damerval et al, 1994). Variation in the expression of transcripts, when measured across a segregating population, can be used to map regions with cis- and trans-effects (Holloway & Li, 2010). Massively parallel sequencing technologies have replaced microarrays as the method of choice for eQTL analysis (Holloway et al, 2011; West et al, 2007). Using NGS, Swanson-Wagner et al (2009) identified ~4,000 eQTLs in reciprocal crosses between the maize inbred lines B73 and Mo17, most of them acting in trans and regulated exclusively by the paternally transmitted allele.

Genotyping by sequencing
The value of NGS-driven massive polymorphism discovery can be seriously restricted by cost and time limitations in the design, validation and deployment of molecular markers. With the falling cost of NGS there is an increased interest in genotyping-by-sequencing (GbS), where the obtained sequence differences are used directly as markers for analysis. As described in section 3.7, maize NAM and other GWAS community-based resources already make use of GbS. However, such panels have limited utility beyond their populations or panels. A number of reduced-representation GbS protocols have been reported that can be applied to other populations or panels for linkage, association, bulked segregant analysis, fingerprinting, diversity and other studies. Depending on the details of the project and the available resources, sequences can be mapped to a reference. However, in large genomes or other genomes with no reference available, the consensus of reads flanking the polymorphism can be used as a partial reference, or polymorphic reads can simply be treated as dominant markers (Elshire et al, 2011). Construction of a low-density GbS linkage map using Restriction-Site-Associated DNA (RAD) has been reported in barley (Chutimanitsakun et al, 2010). The use of simpler and highly multiplexed protocols, however, is required in most cases to make GbS cost- and time-effective. The bottom line is an all-inclusive cost per sample that is lower than that of other available genotyping platforms. Cost estimates need to include the considerable computational and bioinformatic resources needed for GbS data analysis. Using a simple reduced-representation procedure based on ApeKI restriction digestion, Elshire et al (2011) identified and mapped approximately 200,000 polymorphisms in the 2 parents and the 276 RILs from the maize IBM (B73 x Mo17) mapping population at an estimated cost of US$29.00 per sample. With the same protocols and using a single Illumina run, they can process up to 672 samples, taking the actual data collection cost to well under US$20.00 per sample. The low cost per base and the high numbers of reads produced per run make the Illumina and SOLiD systems more suitable for GbS.
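Treating polymorphic reads as dominant markers amounts to scoring each sequence tag as present or absent in every sample. The sketch below shows one way this could be done after reads have already been assigned to samples; the function name, the `min_reads` threshold and the toy tags are assumptions for illustration, not details of the cited protocols.

```python
from collections import Counter


def score_dominant_markers(reads_by_sample, min_reads=2):
    """Build a presence/absence matrix of sequence tags across samples.

    `reads_by_sample` maps sample name -> list of tag sequences.
    A tag is called present (1) when seen at least `min_reads` times,
    otherwise absent (0). Only tags that differ among samples are kept.
    Returns (sample_names, {tag: [calls in sample order]}).
    """
    counts = {sample: Counter(reads) for sample, reads in reads_by_sample.items()}
    samples = sorted(counts)
    all_tags = set().union(*counts.values())

    matrix = {}
    for tag in sorted(all_tags):
        calls = [1 if counts[s][tag] >= min_reads else 0 for s in samples]
        if 0 < sum(calls) < len(samples):  # polymorphic across the panel
            matrix[tag] = calls
    return samples, matrix


# Toy example: two parents and one recombinant inbred line.
reads = {
    "P1": ["AAAA", "AAAA", "CCCC", "CCCC"],
    "P2": ["AAAA", "AAAA", "GGGG", "GGGG"],
    "RIL1": ["AAAA", "AAAA", "CCCC", "CCCC", "GGGG"],
}
print(score_dominant_markers(reads))
```

One caveat of this scoring is that absence cannot be distinguished from insufficient coverage: in the toy data the single GGGG read in RIL1 falls below the threshold and is scored as absent, so low-depth samples can produce false absences.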

Transcriptome assembly and profiling
The sequencing of DNA products synthesized from total and mRNA isolates (cDNA) has been crucial in gene expression analysis, discovery and determination of alternative splicing forms of genes (isoforms). In the case of organisms with a genome sequence available, cDNA sequencing has facilitated the annotation of splicing sites and untranslated regions (UTRs), as well as improved gene prediction algorithms (Brautigam & Gowik, 2010). As indicated before, transcriptome sequencing can also be deployed as a reduced-representation strategy to identify polymorphisms for marker development and genotyping. Before the advent of NGS, multiple Sanger sequencing strategies were developed for the quantitative and qualitative analysis of mRNA expression. The need for direct quantitative analysis of gene expression led to the development of profiling strategies such as Serial Analysis of Gene Expression (SAGE; Velculescu et al, 1995). On the other hand, the creation of large consortia dedicated to providing end-sequence for individual clones from cDNA libraries enabled gene discovery, annotation and expression analysis on a large scale (Rafalski et al, 1998). These efforts have yielded more than 22 million Sanger-based expressed sequence tags (ESTs) from more than 40 plant species. The largest datasets correspond to Arabidopsis, soybean, rice, maize and wheat, each of which contains more than 1 million entries (http://www.ncbi.nlm.nih.gov/dbEST/). There are also EST databases available for at least 42 fungal, stramenopile and nematode phytopathogen species (http://cpgr.plantbiology.msu.edu). Sequence information derived from these databases has been utilized to develop expression microarrays, which are aimed at establishing the relative abundance of known genes in large numbers of samples and tissues within the same species or among related species (Rensink & Buell, 2005). While such arrays have been effective in providing gene expression data, they are inherently biased in their design and have limitations in resolution and in their ability to differentiate between individual genes within families. The highly parallel, short-read NGS technologies such as Illumina and SOLiD have allowed the development of transcription profiling strategies that are more sensitive and accurate than SAGE or microarrays. Initial NGS strategies for transcription profiling had their roots in the innovative, now obsolete massively parallel signature sequencing (MPSS) technology. This technology, owned and provided as a service by Lynx Therapeutics, consisted of the generation and sequencing of short 17-bp unique tags, or signatures, from 3'-UTRs of transcripts at high coverage (Simon et al, 2009). It provided unparalleled resolution, generating over a million signature sequences per experiment, although the cost of every experiment was considerable (Reinartz et al, 2002). With Illumina, tag-based "digital" expression profiling protocols became relatively simple and achieved higher resolution than MPSS at a fraction of the cost (Wang et al, 2010).
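The "digital" nature of tag-based profiling comes from the fact that expression is estimated by counting. A minimal sketch of that counting step follows; scaling to tags per million is a common convention assumed here for comparing libraries of different depth, and the function name and toy tags are illustrative only.

```python
from collections import Counter


def tags_per_million(tags):
    """Count identical signature tags and scale counts to a library of one million tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}


# Toy library: two distinct 17-bp signatures (hypothetical sequences).
library = ["A" * 17] * 3 + ["C" * 17]
print(tags_per_million(library))
```

Because each tag is matched to a transcript by sequence rather than by hybridization, the approach does not require the genes to be known in advance, although closely related family members can still share a tag.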
The use of shotgun sequencing of cDNA using the Roche 454 analyzer has provided relatively long reads and high coverage for gene discovery, annotation and polymorphism discovery in both model and non-model plant species (Barbazuk et al, 2007; Emrich et al, 2007). More recently, the increasing gains in throughput, as well as improvements in shotgun RNA sequencing (RNA-seq) strategies and analysis software, have expanded the potential of Illumina and SOLiD platforms for full transcriptome analysis and replaced the use of the tag-based expression profiling approach. In RNA-seq, total or messenger RNA is fragmented and converted into cDNA. Alternatively, it is first converted into cDNA and then fragmented. Adaptors are attached to one or both ends, and the fragments are sequenced as single or paired ends (Wang et al, 2009; Margerat & Bahler, 2010). Depending on the genomic resources available for the organism of interest, the resulting sequences can be aligned to either a reference genome or reference transcripts. Alternatively, genes can be assembled de novo. In either case, cDNA sequencing provides considerably more information on the transcriptome, including gene structure, expression levels, presence of multiple isoforms and sequence polymorphism. Unlike microarray-based hybridization, it does not depend upon previous knowledge of potential genes. A considerable number of RNA-seq projects have been carried out in major crop species. Severin et al (2010), Zenoni et al (2010) and Zhang et al (2010) all applied Illumina-based RNA-seq to multiple tissues and stages in soy, grape and rice, respectively, and aligned transcript reads to their respective reference genome sequences. Li et al (2011) used Illumina in multiple stages along a leaf developmental gradient and in mature bundle sheath and mesophyll cells.

Small RNA characterization
Small RNAs (sRNA) are non-protein-coding RNA molecules ranging from 20 to 30 nt that have a role in development, genome maintenance and plant responses to environmental stresses (Simon et al, 2009). Most sRNAs belong to two major groups: 1) microRNAs (miRNA), which are about 21 nt long and usually have a post-transcriptional regulatory role by directing cleavage of a specific transcript; and 2) short interfering RNAs (siRNA), which are usually 24 nt long and influence de novo methylation or other modifications to silence genes (Vaucheret, 2006).
The finding of their prevalence in low-molecular-weight fractions of total RNA in animals and plants predated the development of NGS. However, the use of MPSS greatly expanded resolution, and it later became clear that short-read NGS technologies such as Illumina or SOLiD had optimal characteristics for sRNA analysis (Zhang et al, 2009). Roche 454 sequencing has also been used in sRNA analysis (Gonzales-Ibeas et al, 2011).
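A common first look at an sRNA library is its read-length distribution, since the two major groups described above differ in typical size. The sketch below bins adapter-trimmed reads by length; the function names are hypothetical, and length alone is only a rough proxy, since assigning a read to a class in practice also requires evidence such as precursor structure or genomic origin.

```python
from collections import Counter


def length_profile(reads, min_len=20, max_len=30):
    """Count reads of each length inside the 20 to 30 nt sRNA size range."""
    return Counter(len(read) for read in reads if min_len <= len(read) <= max_len)


def size_class(length):
    """Label a length by the sRNA group it is typical of (a rough heuristic)."""
    if length == 21:
        return "miRNA-sized"
    if length == 24:
        return "siRNA-sized"
    return "other"


# Toy reads (hypothetical sequences); the 35-nt and 18-nt reads fall outside the range.
reads = ["A" * 21, "C" * 21, "G" * 24, "T" * 35, "A" * 18]
for length, n in sorted(length_profile(reads).items()):
    print(length, n, size_class(length))
```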

Epigenomics
In plants and other multicellular organisms, cell differentiation is driven by variation of epigenetic marks encoded on the DNA or chromatin. Such variations can be stable, or heritable, but do not change the underlying DNA sequence of the genome (Zhang & Jeltsch, 2010). Furthermore, there is accumulating evidence that transgenerationally inherited epigenetic variants (epialleles) have a significant effect on differential gene expression in plant and animal populations (Reinders et al, 2009; Johannes et al, 2009). Biochemical alterations such as cytosine methylation polymorphisms and differential histone deacetylation are epigenetic marks that can play a critical role in development (Schöb & Grossniklaus, 2006; Chen & Tian, 2007). In plants, 5-methylcytosine can be present at symmetric CG sites but can also be located at CHG sites as well as in asymmetric CHH locations (where H can be A, C or T). Methylation at CG sites usually occurs symmetrically on both strands and is heritable, maintained by specific types of methyltransferases that recognize hemimethylated sites created during replication. Methylation of CHG and CHH is established and maintained by additional methyltransferases (Schöb & Grossniklaus, 2006). One important consideration is that both the epigenome and methylome will be considerably larger than the genome of an organism. As a major part of the epigenome, the methylome consists of the sum of genome and methylation states at every cytosine location. Multiple states coexist in the same individual, depending on cell types, tissues, developmental stages or environments. Adding further complexity is the fact that methylation at one position may be partial within the same cell type. Similar to transcriptome analysis, methylome analysis has, therefore, a quantitative component in addition to a qualitative one.
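The three sequence contexts can be told apart from the two bases that follow a cytosine on the same strand. The short sketch below classifies a forward-strand cytosine as CG, CHG or CHH using the definition given above (H is A, C or T); the function name is illustrative, and cytosines too close to the end of the sequence, or followed by ambiguous bases, are returned as unclassified.

```python
def cytosine_context(sequence, pos):
    """Return 'CG', 'CHG' or 'CHH' for the cytosine at `pos`, or None if
    the downstream bases are missing or ambiguous."""
    seq = sequence.upper()
    if seq[pos] != "C":
        raise ValueError("position does not hold a cytosine")
    downstream = seq[pos + 1:pos + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) == 2 and downstream[0] in "ACT":
        if downstream[1] == "G":
            return "CHG"
        if downstream[1] in "ACT":
            return "CHH"
    return None


# Toy sequence with one cytosine in each context.
toy = "CGACAGCTT"
for i, base in enumerate(toy):
    if base == "C":
        print(i, cytosine_context(toy, i))
```

Cytosines on the reverse strand can be handled the same way by applying the function to the reverse complement of the sequence.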
Different Sanger and NGS strategies have been developed over the years that can directly or indirectly identify epigenetic marks and patterns. Before NGS, epigenetic studies were mostly limited to individual genes or sets of candidate genes or regions. One exception is the extensive work done in Arabidopsis by Zhang et al (2007), which provided the first genome-wide study in plants and considerable information on methylation distribution and its effect on gene expression. The use of NGS technologies coupled with bisulfite conversion, restriction digestion, or immunoprecipitation strategies is facilitating genome-wide methylome analysis in plants. Of these, approaches based on sodium bisulfite conversion (BC) provide the highest resolution. In BC, denatured DNA is treated with sodium bisulfite, which induces the hydrolytic deamination of cytosine. Subsequent treatment with a desulfonation agent transforms the uracil-sulfonate intermediate into uracil. Replication and further amplification of the converted strands will incorporate a thymidine at originally non-methylated C sites. However, 5-methylcytosine residues remain unreactive during conversion, and further amplification will retain the original CG pairing at that position (Liu et al, 2004; Frommer et al, 1992). As a consequence, unmethylated and methylated cytosines can be mapped in the two original strands and resolution can reach single-base level, as long as the original sequence is known (Zhang et al, 2007; Lister et al, 2008). The combination of methylome BC-NGS and high-definition transcriptome analysis will have an important role in further characterization of epi-regulation in plants.
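The logic of bisulfite conversion, and why a known original sequence is required to interpret it, can be sketched in silico. The code below is a simplified model under stated assumptions: it considers only one strand, assumes complete conversion and error-free, already-aligned reads of the same length as the reference, and uses hypothetical function names.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate the read-out of one strand after conversion and amplification:
    unmethylated C appears as T, 5-methylcytosine remains C."""
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(sequence.upper())
    )


def call_methylation(reference, converted_reads):
    """Estimate the methylated fraction at every reference cytosine.

    Each read must be aligned to, and as long as, `reference`
    (a simplifying assumption). Returns {position: fraction of reads with C}.
    """
    levels = {}
    for i, base in enumerate(reference.upper()):
        if base != "C":
            continue
        retained = sum(1 for read in converted_reads if read[i] == "C")
        converted = sum(1 for read in converted_reads if read[i] == "T")
        if retained + converted:
            levels[i] = retained / (retained + converted)
    return levels


# Toy example: position 1 always methylated, position 5 methylated in half
# of the molecules, position 4 never methylated.
reference = "ACGTCCGA"
reads = [
    bisulfite_convert(reference, {1, 5}),
    bisulfite_convert(reference, {1}),
    bisulfite_convert(reference, {1, 5}),
    bisulfite_convert(reference, {1}),
]
print(reads[0], call_methylation(reference, reads))
```

The fractional value at position 5 illustrates the quantitative component of methylome analysis: partial methylation within a sample appears as a mixture of C and T reads at the same site.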

Future outlook
For more than 30 years, the use of sequencing methods and technologies, in combination with strategies in breeding and molecular genetic modification, has contributed both to our knowledge of plant genetics and to remarkable increases in agricultural productivity. In recent years, agricultural sciences have been in the middle of a second technological revolution in DNA sequencing, driven in large part by the post-human genome goal of affordable genome sequencing for disease research and personalized medicine. The resulting NGS systems have become a "disruptive technology", radically reducing limitations in sequence information and consequently altering the types of questions and problems that can be addressed (Mardis, 2010). As we have explored in this chapter, these massively parallel sequencing systems have had a dramatic effect on variant and gene discovery, genotyping, and the characterization of transcriptomes, genomes, epigenomes and metagenomes. The increasing ability to sequence complete genomes from multiple individuals within the same species is providing a more comprehensive view of crop diversity and transgenesis, as well as a better understanding of the effect of mutational and epimutational processes in plant breeding. For all their high throughput and low cost, different NGS platforms have specific flaws, and researchers have been playing on their weaknesses and strengths. Short reads, even at very high throughputs, may not be the best output for de novo sequencing. (After all, longer Sanger reads are not ideal either for determining the most complex eukaryotic genomes.) Alternatively, the use of systems that are able to produce longer reads but have relatively lower capacity will probably not be effective in some of the most important future applications in molecular breeding, such as genotyping-by-sequencing.
The boundaries of sequencing technologies continue to expand, and the quest for more universal sequencers continues. The expected improvements in quality and read length in real-time 3rd-Generation systems have the potential to greatly benefit plant de novo sequencing and metagenomics applications. The development of cheaper, portable, easier-to-use machines has the potential to decentralize sequencing and to create entirely new field applications. With shorter technological cycles in sequencing and computing, it is difficult to anticipate the next disruptive technology. While it is likely that 3rd-Generation systems will soon become widespread in plant research, there has also been progress toward nanopore-based technologies. Nanopore systems are based on electronic detection of DNA sequence and have the potential of low sample preparation work, high speed, and low cost (Branton et al, 2008). Future improvements in sequencing technologies will enable applications that can support the discovery and innovation needed to respond to growing population pressure, energy crises, decreasing fresh water availability and climate change (Gepts & Hancock, 2006; Moose & Mumm, 2008). Ultimately, DNA sequencing is one of a number of tools in plant breeding and biotechnology, albeit an essential one. The knowledge of genomic organization, diversity and function of genes in crops needs to be associated with an understanding of plant biology. Effective plant breeding programs need solid statistical strategies and the ability to create, manage and integrate large, heterogeneous sets of data based on phenotype and sequence-derived information.

Fig. 1. Increase in maximum throughput per run in sequencing platforms from 1980 to 2011. Throughput per run based on Stratton et al (2009) and Glenn (2011).

Table 1. Comparison of current sequence technologies.