Integration of Next-generation Sequencing Technologies with Comparative Genomics in Cereals

Cereals are the major sources of calories worldwide. Their production must remain high to achieve food security despite the projected increase in global population. Genomics research may enhance cereal productivity. Genomics benefits immensely from robust next-generation sequencing (NGS) techniques, which produce vast amounts of sequence data in a time- and cost-efficient way. Research has demonstrated that gene sequences among closely related species that share common ancestry have remained well conserved over millions of years of evolution. Comparative genomics allows genome sequences to be compared across different species, with the implication that large genomes can be investigated using closely related species with smaller genomes. This offers prospects of studying genes in a single species and, in turn, gaining information on their functions in other related species. Comparative genomics is expected to provide invaluable information on the control of gene function in complex cereal genomes, and to aid the design of molecular markers across related species. This chapter discusses advances in sequencing technologies, their application in cereal genomics and their potential contribution to the understanding of the relationships between the different cereal genomes and their phenotypes.


Introduction
Significant limitations to cereal crop production and productivity pose a threat to global food security since these crops are the main sources of calories that support the ever-growing human population. Despite the significant progress that has been made in the improvement of edible yield through classical breeding techniques, the current rates of increase in grain yield in several major cereal crops are still too slow to catch up with the increasing demand of the growing population [1,2]. This is likely to get worse according to the projected climate change scenarios [3], as it also affects biotic stresses such as pests, diseases and weeds, and abiotic stresses including drought, extreme temperatures, salinity and nutrient deficiencies [4][5][6]. Although there are various strategies to cope with these constraints, Kole [7] suggested the use of genomics-assisted breeding as an effective and economic strategy.
Despite the sustainability of breeding resilient crops, there are still several genomic constraints to genome-based selection and stress resistance improvement, particularly for multigenic traits. A poor understanding of the genetic basis and the regulatory mechanisms of responses to various stresses is among the major challenges for successful genetic manipulation through gene introgression, gene pyramiding, gene stacking or gene silencing. Additionally, more diagnostic genetic markers are needed to improve on the currently limited success of marker application in both foreground and background selection. These challenges are related to the fact that the genomes of some cereal crops are not yet fully sequenced and annotated, either because the crops have been under-researched or because the genomes are huge and structurally complex. For instance, the hexaploid wheat (Triticum aestivum) genome is the largest (about 17 billion nucleotides) among cultivated cereals, and is complicated by repetitive DNA sequences [8]. Furthermore, dissection of the genetic and regulatory mechanisms of host plant resistance is difficult because most traits of interest are multigenic and thus influenced by several genes with additive and nonadditive effects. Hence, tools that detect genetic variation at the genome sequence level are needed, so that all genes controlling particular traits can be investigated and phenotypic gains from genetic manipulation can be realized.
Enhanced application of next-generation sequencing (NGS) techniques in cereal crops is revolutionizing and speeding up plant breeding. The advances made so far in the use of NGS, particularly with the human genome in the field of medicine and on various model crops through plant biotechnology, point to the following prospects in cereals and other crops. Firstly, complete sequencing of small and less complex plant genomes is increasingly becoming possible, as costs have dropped significantly and more sequences are being generated in a shorter time than before. Secondly, the genetic mechanisms of particular traits in huge and complex plant genomes can now be investigated through comparative genomics, using the small and less complex genomes of related plants that share conserved regions. This will potentially identify genes or quantitative trait loci (QTL) and putative single nucleotide polymorphism (SNP) markers for genome-wide association mapping and annotation of genomes. This chapter discusses the advances made in improving sequencing technologies and how these advances can assist in generating complete sequences for the improvement of genome-aided selection. Complete sequences will also assist in identifying the unique sequences responsible for the major differences existing among cereals.

The need for high-throughput genome and transcriptome sequencing
Since the discovery of the DNA molecule by Friedrich Miescher in 1869 [9], and the subsequent elucidation of its double-helical structure by Watson and Crick in 1953, significant knowledge has been gained on the flow of genetic information. Understanding how this genetic information influences the phenotype (trait) of interest has, however, remained a challenge. This is mainly because the overall instruction contributing to the phenotype is not restricted to the coding region but is also influenced by posttranscriptional modifications controlled by noncoding DNA [10][11][12]. Also, multigenic traits are influenced by complex interactions of alleles at different loci, each having a major or minor influence [13]. These, together with differential genotype-by-environment interactions, add to the structural and functional complexity of most cereal genomes, which are complicated by repetitive DNA sequences, transposable elements and polyploidy, as in the case of wheat and finger millet (Eleusine coracana) [8,14]. Whole genome and transcriptome sequencing therefore becomes a necessity so that all the genomic and transcriptomic variation can be detected. NGS and various 'omic' technologies, including genomics, transcriptomics, proteomics, metabolomics and phenomics, offer prospects for whole-genome annotation, particularly in cereals that have small and less complex genomes. This will simplify comparative genomics and evolutionary genetic research, which will enhance the manipulation and exploitation of important genes for cereal improvement.
NGS technologies are among the available tools that can produce complete sequences for diverse research at the DNA and RNA level within and across species. Firstly, they make it easier to obtain the entire DNA sequence, covering both coding and noncoding regions. Secondly, they simplify studies on the whole transcriptome, including RNAs involved in protein synthesis (messenger, ribosomal, signal recognition particle, transfer and transfer-messenger RNAs) and other RNAs involved in posttranscriptional modifications, such as small RNAs [15]. Quantification of such transcripts through NGS under various stress conditions can determine the levels of gene expression within and across different species.
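As an illustration of how transcript quantification from NGS reads might be compared between stress conditions, the following is a minimal sketch rather than a method described in this chapter. It assumes that read counts per gene are already available, and it uses transcripts per million (TPM) normalization and a pseudocount-stabilized log2 fold change, both of which are choices made here for illustration. The gene names and numbers are hypothetical.

```python
import math
from typing import Dict


def tpm(counts: Dict[str, float], lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """Convert raw read counts to transcripts per million (TPM).

    counts and lengths_bp are keyed by gene identifier (hypothetical IDs here).
    """
    # Reads per kilobase of transcript corrects for gene length.
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    # Scaling so that values sum to one million corrects for sequencing depth.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def log2_fold_change(stress: float, control: float, pseudo: float = 1.0) -> float:
    """log2 ratio of expression under stress versus control.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return math.log2((stress + pseudo) / (control + pseudo))


if __name__ == "__main__":
    # Hypothetical counts for two genes in a control and a drought-stressed sample.
    lengths = {"geneA": 1000, "geneB": 3000}
    control = tpm({"geneA": 100, "geneB": 300}, lengths)
    drought = tpm({"geneA": 300, "geneB": 300}, lengths)
    for gene in lengths:
        print(gene, round(log2_fold_change(drought[gene], control[gene]), 3))
```

Because TPM values are relative within a sample, a rise in one gene lowers the TPM of the others, so comparisons across species or conditions would in practice need replicates and a dedicated statistical model.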

Advances in sequencing technologies
Since genome sequencing was pioneered through technologies such as Sanger sequencing [16], significant advances have been made to resolve the limitations of the early technologies. This has seen the development of more sophisticated sequencing technologies that allow de novo genome sequencing, generating vast amounts of data in a short period at low cost. Table 1 summarizes the advances made in sequencing technology development, from the advent of chain termination sequencing [16] to prominent NGS technologies including Roche/454 sequencing [17], Illumina (Solexa) sequencing [18], sequencing by oligonucleotide ligation and detection (SOLiD) [19], the single-molecule sequencing pioneered by Helicos Biosciences [20] and Ion Torrent sequencing [21]. These technological advances are instrumental in whole-genome research and are expected to simplify comparative genomics within species and across distantly related cereals and grasses. Several modifications are available for each of these technologies, and fine-tuned protocols are constantly being developed to address some of the current limitations.
Although NGS technologies have enormous prospective benefits, they come with their own limitations that need to be addressed to realize their full potential. Key among these drawbacks are the bioinformatic and computational challenges related to storage, image analysis, base calling and integration of the large amounts of data generated, which can reach several terabytes per day. As a result, much of the sequence data generated daily in cereal genomics cannot yet be transformed into information that is useful for detecting important genomic variants within and among species or for identifying genes that are differentially expressed under particular stress conditions. Hence, investment in computational and high-throughput bioinformatic equipment and human resources, together with the combined use of the various NGS technologies, will allow data generated by different laboratories using different NGS techniques to be related and to build on each other. Unlike traditional marker technologies, NGS is currently dissociated from phenomics, yet it should be complementary to high-throughput phenotyping in order to relate sequence variations to traits of interest for progressive discoveries through genome-wide association mapping, particularly for multigenic traits like adaptation to drought in complex cereal genomes [22]. Additionally, NGS technologies are still associated with high error rates [23] and short read lengths that limit the accuracy of data analysis. This further complicates the detection and distinction of sequence variations, including the large numbers of duplications, deletions, inversions and chromosomal rearrangements that characterize cereal genomes.

Application of next-generation sequencing in cereal biotechnology
Among the major cereals, the relatively small rice (Oryza sativa) genome (∼389 Mb) has long been fully sequenced by the International Rice Genome Sequencing Project [24]. Kawahara et al. [25] recently demonstrated, however, the robustness of NGS technologies by revising the rice genome using the Illumina and Roche 454 pyrosequencing platforms. Their study noted some errors in the initial assembly. This research provides evidence that high-quality, validated reference genomes can be produced for most cereals through resequencing using NGS technologies. Also, a recent genome-wide study of the hexaploid wheat genome (∼17 Gb) using the Roche/454 pyrosequencing technology revealed the capacity of NGS technologies to resequence huge and complex genomes and to identify SNPs for dissection of quantitative traits [26]. Similarly, Illumina sequencing was recently used to quantify the transposable element (TE) content in the complex maize (Zea mays) genome (∼2.3 Gb) [27] and to estimate the potential contribution of TEs to the genome size differences between the cultivated species and its close relative, Zea luxurians [28]. The latter study also reported high proportions of conserved TE families between the two species, revealing the potential of NGS technologies to enhance evolutionary and comparative genomic studies. Other major cereals whose genomes have been sequenced and are expected to further benefit from NGS technologies include barley (Hordeum vulgare) (∼5.1 Gb) [29] and sorghum (Sorghum bicolor) (∼730 Mb) [30].
Minor and under-researched cereals such as the allotetraploid finger millet (Eleusine coracana) -which has a genome size of about 1.  [34]. These minor crops are renowned for their adaptation to various biotic and abiotic stresses, particularly drought. Thus, sequencing or resequencing their genomes will potentially expose huge amounts of relevant genetic information for cereal improvement. NGS technologies will have great application in comparing genomic features of cereal crops through comparative genomic research.

Comparative genomics in cereal crops
Core questions unanswered with traditional cereal biotechnology approaches include: (1) What are the genetic foundations that underlie the similarities between different grass species or individuals within a species? (2) What are the genetic variations responsible for the detected phenotypic differences? Comparative genomics is the branch of biology in which DNA sequence information from genomes of different life forms is compared in an effort to directly answer these questions. It was founded mainly on two ideas. Firstly, comprehensive analysis and comparison of whole genomes can uncover the essentially conserved and the important variable components of any set of genomes [35]. Secondly, differences in genome sequence (genotype) contribute to differences in genome function and therefore explain differences between phenotypic traits [36]. The application of comparative genomic information to various plants, including cereals, has, however, previously been a challenge because of the large genome sizes of most species, which are complicated by high rates of structural rearrangements mainly due to transposable elements, duplications and inversions [35], as listed in Table 2. The application of comparative genomics for crop improvement has evolved over time. In the grass family, significant research provided remarkable and comprehensive datasets demonstrating a high degree of collinearity or synteny among genomes at the chromosome (macro) and gene (micro) levels [37,38]. Synteny, from the Greek syn (together with) and taenia (ribbon), refers to loci contained within the same chromosome. Collinearity, on the other hand, refers to some degree of conservation of gene order between chromosomes of different species or between nonhomologous chromosomes of a single species [39].
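The notion of collinearity defined above can be made concrete with a small sketch. The function below is an illustration only, not a published algorithm: it assumes one-to-one ortholog pairs, a single chromosome per species, and strict adjacency in both genomes, whereas real synteny tools tolerate gaps and duplicated genes. All gene identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def collinear_blocks(
    order_a: List[str],
    order_b: List[str],
    orthologs: Dict[str, str],
    min_len: int = 3,
) -> List[List[str]]:
    """Find runs of species-A genes whose orthologs are adjacent in species B.

    order_a, order_b: gene identifiers in chromosomal order.
    orthologs: maps a gene in A to its (assumed unique) ortholog in B.
    A run may be in the same orientation (step +1) or inverted (step -1).
    """
    pos_b = {gene: index for index, gene in enumerate(order_b)}
    # Keep only A genes that have an ortholog located on the B chromosome.
    anchors: List[Tuple[str, int]] = [
        (gene, pos_b[orthologs[gene]])
        for gene in order_a
        if gene in orthologs and orthologs[gene] in pos_b
    ]

    blocks: List[List[str]] = []
    current: List[Tuple[str, int]] = []
    direction = 0  # 0 = not yet known, +1 = same order, -1 = inverted
    for gene, pos in anchors:
        if current:
            step = pos - current[-1][1]
            if abs(step) == 1 and direction in (0, step):
                direction = step
                current.append((gene, pos))
                continue
            # Collinearity is broken: store the finished run if long enough.
            if len(current) >= min_len:
                blocks.append([g for g, _ in current])
        current = [(gene, pos)]
        direction = 0
    if len(current) >= min_len:
        blocks.append([g for g, _ in current])
    return blocks


if __name__ == "__main__":
    # Hypothetical gene orders: a4 has no ortholog, and a5-a7 are inverted in B.
    order_a = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]
    order_b = ["b1", "b2", "b3", "bx", "b7", "b6", "b5"]
    pairs = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b5", "a6": "b6", "a7": "b7"}
    print(collinear_blocks(order_a, order_b, pairs))
```

In this toy input the first block keeps the same gene order in both species, while the second is collinear but inverted, the kind of rearrangement that breaks collinearity without removing synteny.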
A large number of sequences within the grass family has remained considerably conserved at the genome level over millions of years of evolution, irrespective of the differences in ploidy level, chromosome number and haploid DNA content [37]. This conservation of gene content and order at the megabase level makes it easy to use species with small genome sizes such as Arabidopsis and rice as model species for studying similar gene contents in other related species. Their applications include allele discovery, positional cloning, and comparative studies in related species [40]. There is, however, limited synteny and gene homology between Arabidopsis and rice, but an extensive collinearity between the latter and other grasses, thereby suggesting that rice is an appropriate grass model species for cereal comparative genomics [41]. In this case, rice and purple false brome (Brachypodium distachyon) (genome size ~355 Mb), both of which are from the grass family, serve as functional model species for cereal comparative genomics owing to their small and fully sequenced genomes. Moreover, Brachypodium showed conservation of gene content and family structure with rice and sorghum [42]. A phylogenetic study carried out on seven grass species also revealed a close evolutionary relationship of Brachypodium with maize, barley and wheat based on 335 commonly shared sequences [43].
Microcollinearity has numerous interesting applications in cereal genome analysis, including the transfer of genetic markers between species and the identification of candidate genes across species borders [44]. Such advances make it possible to intensively study, decipher and understand the genetic makeup of cereal genomes, including those of rice, maize, wheat, barley and sorghum [30][45][46][47]. Comparing the gene sequences of these cereal crops is the initial step towards understanding their morphological and functional similarities and differences. Comparative analysis has been extended to the DNA sequence (micro) level, allowing the investigation of conservation of coding and noncoding regions as well as characterization of the molecular mechanisms of genome evolution [38].
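The identification of candidate genes across species borders can be sketched as follows, under stated assumptions. Suppose two markers flanking a trait locus in a large-genome cereal have orthologous anchor positions on one chromosome of a small model genome; if microcollinearity holds in that interval, the model genes lying between the anchors become candidates. The function name, gene identifiers and coordinates below are hypothetical and serve only to illustrate the idea.

```python
from typing import List, Tuple


def candidate_genes(
    model_genes: List[Tuple[str, int]],
    left_anchor_bp: int,
    right_anchor_bp: int,
) -> List[str]:
    """List model-genome genes located between two anchor positions.

    model_genes: (gene_id, start_bp) pairs on a single model chromosome.
    The anchors are the model-genome positions of the orthologs of the two
    flanking markers; they may be given in either order, since the region
    can be inverted relative to the target species.
    """
    low, high = sorted((left_anchor_bp, right_anchor_bp))
    ordered = sorted(model_genes, key=lambda gene: gene[1])
    return [gene_id for gene_id, start in ordered if low <= start <= high]


if __name__ == "__main__":
    # Hypothetical annotation of a short model chromosome segment.
    genes = [("g1", 100), ("g2", 500), ("g3", 900), ("g4", 1500)]
    print(candidate_genes(genes, left_anchor_bp=1000, right_anchor_bp=400))
```

The resulting list is only a starting point: breaks in collinearity, such as local deletions or insertions in either genome, mean that a gene present in the model interval may be absent from the target species, and vice versa.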

Several examples of macro-and microcollinearity in cereal crops
The advent of molecular markers and molecular mapping allowed researchers to conduct comparative mapping research, comparing the order and content of genes and markers along chromosomes of related species. The first large-scale restriction fragment length polymorphism (RFLP) mapping studies in economically important crops covered the genomes of wheat, rice, maize, oat and barley. They are benchmarks for the discovery of collinearity in the grass family [44]. Exploiting RFLPs to compare genomes was a valuable method because the markers made it possible to map, for the first time, a huge number of randomly distributed polymorphic loci in a single population and provided the foundation for efficient, whole-genome studies at the molecular level [48]. The application of RFLP technology in comparative genome analysis revealed that extensive commonality in gene content and arrangement was a basic chromosomal property, prompting the idea that the genetic map could be used to tie all grasses into a single model system. This led to the construction of a consensus grass map based on 25 rice linkage blocks [37,38]. The resolution of the genetic maps, however, proved to be very low, with an average of one marker every 5 to 10 centimorgans (cM), allowing the detection of only large rearrangements. The RFLP markers used to construct the maps were also low-copy, which limited the detection of small deletions, inversions and whole or partial genome duplication events [49]. The use of RFLP markers for comparative mapping also made it difficult to distinguish orthologous (derived from a common ancestor by speciation) from paralogous (derived by duplication within one genome) relationships in gene families.
Given these challenges associated with traditional genotyping, the NGS techniques discussed above are expected to advance comparative genomics because they provide actual DNA sequences that allow interspecies or intergeneric comparisons.
Traditional genome analyses have provided evidence that cereal genomes share conserved regions at the macro or micro level. For example, a comparative genomics study on rice and maize indicated high levels of collinearity between the two genomes, with some chromosomes or their arms, accounting for at least 67% of the two genomes, having almost the same gene order and sequences [46]. Similarly, large proportions of conserved regions between rice and wheat chromosomes were identified, with major differences arising from chromosomal rearrangements [40,45]. Conservation of about 24% of grass-specific gene orders has been reported in sorghum [30], including high collinearity with rice [50]. Thus, sorghum can also serve as a model species for cereal genomic studies due to its relatively small genome size and wide adaptability. High levels of microcollinearity have been demonstrated between chromosome 6 of rice and the telomeric regions of barley chromosome 1P, which further confirms the usefulness of mapping the small rice genome for map-based cloning of important genes in complex genomes [47]. Figures 1 and 2 illustrate the conservation of synteny and collinearity among different cereals by revealing the syntenic relationships between chromosomes of cereal crops. Furthermore, Figure 2B reveals that the 10 maize progenitor chromosomes and the 10 linkage groups of sorghum appear to be similar, thus exposing their evolutionary divergence from rice that could be their common ancestor before speciation [51].
The study of such evolutionary relationships, and of the changes that occurred after cereals diverged from their progenitors, will be further enhanced through comparative genomics integrated with NGS and third-generation (next-next-generation) sequencing techniques, which can generate higher-resolution physical maps. Availability of updated genome sequences will expose the multiple breaks in collinearity that occur in genomes due to structural rearrangements caused by transposable elements, inversions, deletions and duplications. The macro- and microcollinearities described in this section are reflected in the observed phenotypic similarities that exist among different cereal species.

Phenotypic commonality in cereals
The conservation of synteny and collinearity of genes among cereals is reflected in their common phenotypic features, which are evidence that they share common ancestry, while their differences mainly stem from chromosomal rearrangements and polyploidization, as shown in Figure 2. Their morphological similarity (Figure 3) is further evidence of common ancestry. Based on phenotype alone, most cereals also share similar rooting systems, leaf venation, flowering habits, tillering, inflorescences, physiological behavior such as vernalization requirements, and adaptation to biotic and abiotic stresses. For example, some cereals are hosts of common diseases, as in the case of maize streak virus (MSV), wheat streak mosaic virus (WSMV) and rusts [52,53], while others are nonhosts, as in the case of rice to rusts. The differences in phenotype and genome structure among all these species could be due to mutations, breaks in collinearity and loss of synteny that occurred in their genomes over millions of years. Such differences can be traced through comparative genomic analysis, particularly with the aid of high-throughput sequencing techniques. Likewise, the similarity in phenotype and genome structure could be due to common ancestry (Figures 2 and 3). Together, these observations point to phenotypes, gene orders and sequences that have been conserved over millions of years.

Outlook
Plant species have highly conserved regions at the DNA sequence level, whereas the bulk of the large genomes consists of repetitive DNA sequences, most of which are species-specific. Comparative genomics has opened new avenues for map-based positional cloning of genes encoding important traits in large and intricate genomes through the investigation of small and less complex genomes. In grasses, rice and Brachypodium have been identified as model species for such research since they have small and stable genomes. This, however, requires the integration of NGS techniques so that all the conserved and nonconserved regions can be fully sequenced and annotated with the aid of other "omic" technologies. Hence, the future of comparative genomics studies in cereals will largely rely on cost-effective sequencing technologies along with computational systems that handle large numbers of sequences, thus allowing effective sequence comparisons across species of interest. The substantial evidence for a common ancestry of cereals, based on genome and morphological structures, has led to the successful use of the genome sequence of one species to shed light on the function of that sequence in other related species. Wide adoption of this approach across different cereals will speed up gains and generate useful databases and datasets for effective cereal breeding. Furthermore, increased sequencing throughput will enable researchers to use other widely adapted cereals, like sorghum and some of the under-researched cereals, as models for sequencing genes and alleles responsible for unique traits such as wide adaptation to stress-prone environments. There is, however, a need to invest in advanced computational and bioinformatics tools to handle and analyze the huge datasets that will be generated through these technological advances.