Genome and transcriptome assemblers.
Cotton is economically and evolutionarily important crop for its fiber. In order to improve fiber quality and yield, and to exploit the natural genetic potential inherent in genotypes, understanding genome structure and function of cultivated cotton is important. In order to achieve this, a functional understanding of bioinformatics resources such as databases, software solutions, and analysis tools is required. But currently, there are very few unified reports on bioinformatics tools and even fewer repositories to access cotton genomic information. Also, resourceful developers and bioinformatics scientists actively addressing complex genomic challenges in cotton genomes are much in need. The primary goal of this chapter is to provide a review of such tools and resources for analyzing the structure and function of the cotton genome with preferential emphasis on this complex and economically important plant species. This discourse begins with a description of concurrent advances in high‐throughput genome sequencing and bioinformatics analyses and focuses on four major sections covering bioinformatics tools and resources for analysis of: (1) genomes; (2) transcriptomes; (3) small RNAs; and (4) epigenomes. In each section, recent advances in cotton have been discussed. Cotton genome sequencing and annotation efforts are outlined within these sections. This review discusses the availability of genome information of both diploid and tetraploid species that have impelled cotton genome research into the post‐genomics era, opening new avenues for exploring regulatory mechanisms associated with fine‐tuning of gene expression of fiber‐related genes. Finally, the potential impacts of these rapid advances, especially the challenges in handling and analyzing the large datasets are discussed.
- cotton fiber
Cotton is an economically and evolutionarily important crop species. Along with cotton improvement, which has progressed impressively with conventional and molecular breeding approaches, the genomic approaches utilizing next generation sequencing (NGS) technologies have enhanced our ability to understand and utilize the genetic potential of crop species . The early sequencing efforts in cotton (Gossypium spp.) are mostly limited to diploid species such as G. raimondii, as its genome is structurally less complex. Now, DNA sequencing has become a routine tool in cotton genetic research. Sequencing genomic, transcriptomic, and regulatory regions of a plant species and comparing their patterns will provide a better understanding of genome architecture, genetic variation, gene identification, regulation, and its expression . Understanding cotton genome, transcriptome, and regulatory molecules (small RNAs and epigenetic modulators) will therefore provide deeper insights into the structure and function of the genome . Dissecting the complexity associated with tetraploid cotton genome and narrow genetic variability is more challenging and requires efficient methodologies. Comparative genome and transcriptome analysis of cultivated cotton and its progenitors will aid in the identification of novel single‐nucleotide polymorphisms (SNPs), copy number variations (CNVs), transcripts, transcript quantification, and alternative splice junctions at the transcriptome level and the role of these elements in regulating cotton fiber quality and yield . Similarly, transcriptome profiling studies are powerful in unravelling the underlying mechanisms involved in gene expression associated with the sub‐genomes of a plant species. In addition, the role of small RNAs and epigenetic modulators is increasingly evident in determining the genome landscapes of plant species including cotton.
DNA sequencing technologies have evolved tremendously over two decades from Bacterial artificial chromosome (BAC)‐based cloning to single‐molecule sequencing [5, 6]. Concomitantly, the computational tools for addressing the problems in understanding sequence data have also evolved from sequence alignment algorithms such as Needleman–Wunsch and Smith–Waterman to assembly tools such as Burrows‐Wheeler Alignment (BWA) and Bowtie . As described here, a wide range of publicly available bioinformatics tools and some commercially available sequence analysis tools such as CLCBio Genomics Workbench, Strand NGS, Geneious, Laser gene suite, and NextGENe are being utilized .
2. Genome analysis in cotton
Approximately 50 naturally occurring cotton species are available, including 45 diploid (2n = 2x = 26) and five allotetraploid species (2n = 4x = 52) each with a haploid chromosome number of 13 . Based on meiotic pairing and chromosome size, diploid species (2n = 26) have been placed into genomic groups A, B, C, D, E, F, G, or K. Of these, A‐genome sources, G. herbaceum (2n = 2x = A1A1 = 26) and G. arboreum (2n = 2x = A2A2 = 26), and D‐genome sources, G. raimondii (2n = 2x = D5D5 = 26), and G. gossypioides (2n = 2x = D6D6 = 26) are considered as closest progenitor species to cultivated allotetraploid cotton . Diploid genome sizes of A‐G and K vary significantly between species ranging from ∼900 to ∼2800 Mb. Two natural allotetraploid (2n = 4x = AtAtDtDt = 52) species, G. hirsutum and G. barbadense, are derived from a complex inter‐specific hybridization process between two diploid species carrying A‐genome and D‐genomes . However, the sequencing of polyploid plant genomes has been a tedious task due to repeat regions, whole‐genome duplication events, and chromosomal rearrangements that have occurred in the process of evolution . Effective strategies must be developed and employed for sequencing such complex genomes. In order to reduce such complexity, closest diploid progenitor species carrying D5 (G. raimondii; 880 Mb), A2 (G. arboreum; 1700 Mb), and A1 (G. herbaceum; 1700 Mb) genomes can be sequenced and compared against cultivated tetraploid species carrying AtDt genome (G. hirsutum; ∼2400 Mb and G. barbadense; ∼2500 Mb), the estimated genome sizes  of the respective Gossypium species are given in parenthesis. The whole‐genome sequencing efforts in cotton resulted in assembling ∼775 Mb (∼88%) of G. raimondii and ∼1694 Mb (∼90%) of G. arboreum genomes [14, 15]. Also, ∼2.3 Gb (∼90%) and ∼2.5 Gb (∼88%) of the genomes have been assembled into 13 pseudochromosomes each from At and Dt sub‐genomes of G. hirsutum [16, 17] and G. barbadense, [18, 19] respectively. Presently, G. herbaceum (A1) genome sequencing efforts are in progress with a partial genome assembled up to 1.2 Gb (∼70%) .
Transposable elements (TEs) are abundant in plant genomes and are highly variable and often deleterious, as they undergo massive amplification within the genome. The role of TEs has been implicated in gene mutation, whole‐genome duplication, chromosomal rearrangements and novel gene formation through genetic and epigenetic changes. They can alter gene expression and phenotypes by establishing and modifying gene regulatory networks. This is accomplished by inducing changes in genetic and epigenetic mechanisms . The variation in genome structure and organization, even in closely related species, is primarily due to TEs and whole‐genome duplication (WGD). This may be as a result of a combination of non‐random events such as small RNA silencing and epigenetic mechanisms  and random events such as natural selection and adaptation .With the availability of allotetraploid and diploid cotton genomes, comparative genomics has been used to analyze and understand the structural variation and role of TEs in evolution . Their study identified ∼57, ∼68, and ∼67% of TEs in D5, A2, and AtDt genomes of cotton, respectively. Long terminal repeats (LTRs) have contributed significantly in whole‐genome duplication and evolution of domesticated cotton. Though the overall TE content in A2‐ and AtDt‐genomes has been found to be similar, the frequency of LTR‐gypsy and LTR‐copia‐type elements varied significantly. Terminal repeat retrotransposons in miniature (TRIMs) are a small, ubiquitous, conserved, poorly characterized, and scarcely reported group of repeats. TRIMs are often derived from partial deletion of long terminal repeat retrotransposons found in genic regions and in gene‐body methylation within the genome. TRIMS are often targeted by small RNAs (sRNAs) of 21–24 nt in length. Screening for TRIMs in land plants is critical to understanding differences in selection pressure and evolutionary relationships among the clades. Using high‐throughput sequencing followed by bioinformatics analysis, 145 unique families of TRIMs have been identified after screening 48 plant genomes .
Genetic variation is an important element for crop improvement. An understanding of the genetic and genomic relationships of cotton species and cultivars is critical for further utilization of diversity in the development of improved cultivars with favorable alleles . Allelic variations within a genome of the species can be classified into three major groups at DNA level: microsatellites, insertions/deletions, and single‐nucleotide polymorphisms (SNPs) . Molecular markers serve as efficient tools for genome characterization, understanding the genetic complex traits, marker‐assisted selection (MAS) and for map‐based cloning in breeding programs. Several molecular marker technologies have been used to study the genetic diversity and relationships of Gossypium species . However, cotton crop improvement is limited by its narrow genetic base and limited variation among the cultivated cotton cultivars. Genetic variation at molecular level in cotton was previously characterized using isozyme/allozyme markers ; using non‐coding genomic markers such as restriction fragment length polymorphisms (RFLPs) ; amplified fragment length polymorphisms (AFLPs) ; microsatellites [31, 32]; single‐nucleotide polymorphisms  in G. hirsutum and its related species.
Simple sequence repeat (SSR) markers are widely used in many plant and animal genomes due to their abundance, hypervariability, and suitability for high‐throughput analysis. Development of SSR markers using molecular methods is time consuming, laborious, and expensive. Use of computational approaches to mine ever‐increasing sequences such as expressed sequence tags (ESTs) in public databases permits rapid and economical discovery of SSRs . SSR mining programs such as Repeat Pattern tool kit; SSR Finder; Advanced Content Matching Engine for Sequences (ACMES); Spectral‐repeat finder; Adplot; REPEATS and other programs are routinely used to mine EST databases, genome survey sequences, and other nucleotide databases . In cotton, SSR markers mined from diploid species such as G. arboreum were also successfully employed to understand the structural variation in tetraploid cultivars . SSR markers had been extensively used for many genetic mapping, quantitative trait loci (QTL), and trait mapping experiments for favorable characteristics such as fiber quality, higher yield , pathogen resistance [37, 38], and other important traits in cotton. With the advent of next generation sequencing technologies, identification of allelic variation at single‐nucleotide level and their application in crop genetics is becoming a common practice.
SNP markers are becoming the “markers of choice” due to their abundance in the genomes, amenability to automation, and high‐throughput genotyping capability. In cotton, using available EST or transcriptome sequences, gene‐specific SNPs had been characterized ; however, these initial efforts and methods had limited application for genome wide SNP marker development in cotton species. Tetraploid nature, highly complex, and repetitive genome of cultivated cotton species poses significant challenges for genome wide SNP marker development. These complications usually result in high number of false positive SNPs, especially when they are developed from sequences that were not thoroughly characterized . Though there is considerable genetic variation across the cultivated cotton species, the narrow germplasm base within each of the cultivated tetraploid species: G. hirsutum and G. barbadense had made the discovery of useful SNP markers more difficult. Using highly conservative parameters such as minimum coverage of 8× at each SNP and 20% minor allelic frequency, a total of 11,834 and 1679 non‐genic SNPs were previously identified between accessions of G. hirsutum and G. barbadense in genome reduction assemblies, respectively, by Byers et al.  As a part of the same study, an additional 4327 genic SNPs were also identified between accessions of G. hirsutum in the EST assembly. The transcriptome sequencing has been extended to SNP marker discovery . Using oligonucleotide microarrays, SNP markers in seven differentially expressed EXPANSIN transcripts in early cotton fiber development have been identified . The variant analysis in cotton revealed 27,956 indels and 149,616 SNPs from 268,786 EST assembled contigs . Over 1000 SNPs from 92 single‐copy polymorphic loci have been characterized in tetraploid cotton .
Despite significant success in cotton breeding and genetic improvement for fiber‐related traits, the genome‐wide physical or high‐density genetic maps are scarcely available in cotton due to its large genome size. Approximately 5000 markers are needed to fully saturate the cotton genome . A diverse list of markers associated with cotton is available in the cotton marker database (CMD) . CottonGen is a comprehensive database that contains genetic, breeding, and genomic data in cotton including 49 genetic maps, ∼24,000 markers, ∼1000 quantitative trait loci (QTL) linked to more than 30 agronomic traits, ∼18,000 genes/transcripts and ∼460,000 expressed sequence tags (ESTs) . Cotton QTLdb contains a total of 2274 QTLs that have been identified in intraspecific populations . Recently, using genotype‐by‐sequencing method ultra‐dense inter‐specific genetic map that comprised of ∼5 million SNPs distributed unevenly across 26 linkage groups has been constructed in allotetraploid cotton .
Currently, there is a growing repertoire of available NGS platforms that support whole‐genome sequencing, re‐sequencing, and exome sequencing such as NextSeq, HiSeq and HiSeq‐X series (Illumina, Inc.), and IonProton (Life technologies). While ABI3500 series (Applied Biosystems), MiniSeq and MiSeq (Illumina, Inc.), and Ion S5 and Ion PGM (Life technologies) are suitable for amplicon sequencing. Extensive genomic information related to gene regulation, genetic, and epigenetic codes are available at GenBank (NCBI), EBI (EMBL) and DDBJ (NIG). In addition, raw sequencing data obtained from NGS platforms as a result of sequencing/re‐sequencing efforts has been reported at NCBI Sequence Read Archive (SRA). NCBI‐Genome hosts information related to genomes including sequences, physical and genetic maps, chromosomes, assemblies, and annotations. Single‐nucleotide polymorphisms (SNPs) and multiple small‐scale variations that include insertions/deletions, microsatellites and non‐polymorphic variants are also available at dbSNP. The databases available for repeats are Repbase, RepeatsDB, Dfam, P‐MITE and the PGSB Repeat Database, while the information associated with plant cis‐acting regulatory DNA elements in promoter regions are accessible by PLACE and PlantCARE.
|Assembler category||Assembly program||References|
|Overlap layout consensus (OLC)‐based||Celera assembler, Arachne, CAP/PCAP/CAP3, CABOG, MIRA, Newbler/GS de novo assembler, Edena||[49–53]|
|Eulerian‐based/De Bruijn graph|
|Euler, Velvet, AllPaths, ABySS, Trans‐ABySS, SOAP and SOAPdenovo, miraEST, Oases and Rnnotator||[54–57]|
|Greedy assemblers||SSAKE, VCAKE, and SHARCGS||[58, 59]|
Three major categories of assemblers the widely used in whole‐genome and transcriptome assembly are as follows: Overlap Layout Consensus (OLC)‐based, Eulerian‐based/De Bruijn Graph (DBG)‐based and Greedy assemblers. A table of various types of available sequence assembly. Tools are summarized in Table 1 .
|Mapping to reference genome||SAMtools, Picard, BEDtools, BWA, BWT, Bowtie2, CLC Genomics Workbench||[77–79]|
|Whole‐genome alignment||LASTZ, AVID, VISTA||[80, 81]|
|Multiple sequence alignment||MUSCLE; MEGA5|||
|Phylogenetic analysis||EMBOSS; PAML|||
|Homology‐based search||NCBI BLASTN, BLASTP|
|Repeat regions||RepeatMasker, Tandem Repeat Finder, MGEScanLTR, TransposonPSI, MITE Digger,||[84–86]|
|Gene structure||Augustus, GeneMark, FGENESH, GLAD, GeneWise, GenScan, GlimmerHMM, HMMER, GeneID, SNAP, and GLEAN||[87–90]|
|Gene annotation||BLAST2GO, BLAT, PASA||[91, 92]|
|Orthologous sequence search||OrthoMCL|||
|Paralogous sequence search||Mcscan, McscanX|||
|Transcriptome tools||Trinity, Tuxedo, TopHat, Cufflinks||[95, 96]|
|Single‐nucleotide polymorphisms detection||GATK pipeline, SnpEff||[98, 99]|
|Copy number variation detection||CNVKit|||
|Linkage mapping||MapMaker, JoinMap, MapManager QTX||[101, 102]|
|QTL mapping||MapQTL, Windows QTL Cartographer||[103, 104]|
The cotton genomes, D5 and A2, were assembled using SOAPdenovo, while AtDt utilized overlap‐layout‐consensus (OLC) assembly followed by SOAPdenovo to achieve a coverage of over 100×. Other tools commonly used in analyzing the structural and functional analysis in cotton genomes are summarized in Table 2.
3. Transcriptome analysis in cotton
The RNA‐sequencing (RNA‐Seq) and analysis research has exploded in parallel to the genome sequencing methodologies. RNA‐Seq has been widely adopted due to its high accuracy in characterization and quantification of transcriptomes . The major objectives of transcriptomics are to catalog the transcripts of all species, to predict the genes with biological functions and to quantify the divergence in gene expression during varied spatial, temporal, and ecological conditions. In the beginning, RNA‐Seq was focused on deciphering the transcriptomes of model species, but it was quickly extended to non‐model species due to the portability of the method because RNA‐Seq analysis can be achieved with or without a reference genome. There was a huge shift in characterizing the high‐resolution transcriptomes in plant species from EST‐sequencing to microarray analysis and currently to RNA‐Seq. In plants, RNA‐Seq was first adopted for sequencing the transcriptome of Arabidopsis using massive parallel 454 sequencing platform . Since then, transcriptomes of several plant species under differing genetic and physiological states have been sequenced including cotton . The currently available NGS platforms that support whole‐transcriptome or total RNA/mRNA analysis are GS FLX+ System (Roche 454), SOLiD and IonProton (Applied Biosystems), MiSeq, HiSeq series and NextSeq (Illumina, Inc.). The platforms that support targeted RNA analysis are GS Junior, IonTorrent PGM, and MiniSeq.
Several independent studies used RNA‐Seq analysis to determine spatial‐, temporal‐, tissue‐, genotype‐ and genome‐specific, and stress‐induced (abiotic and biotic) expression in both diploid and tetraploid cottons. Some recent studies are discussed here. Several groups have sequenced the cotton fiber cDNA or ESTs [63, 64] and transcriptome or mRNA [65, 66] using RNA‐Seq, while others utilized microarrays for analysis [67, 68]. Recently, 40,976, 41,330, 66,434, and 80,876 high‐confidence protein‐coding genes have been predicted in G. raimondii, G. arboreum, G. hirsutum and G. barbadense, respectively. Cotton fibers are unique have the most elongated cells in plants, and they account to 40–50% of the whole transcriptome in Upland cotton .
Differential gene expression analyses in tetraploid and diploid cottons have been carried out . Two separate studies in root transcriptome analyses of G. hirsutum revealed 519 and 1530 differentially expressed transcripts between well‐watered and water‐deficit conditions, respectively [66, 70]. The comparative transcriptome analysis of developing cotyledon and embryo axis in cotton revealed 17,384 differentially expressed unigenes between two tissues. Of these, ∼8000 unigenes were down‐regulated and ∼10,000 unigenes were up‐regulated in cotyledons . Transcriptome analysis of interspecific hybrid F1, synthetic and natural allopolyploid cotton revealed homeolog expression bias (relative contribution of homeologs towards gene expression) and expression level dominance bias (over‐all expression from both homeologs) towards A‐genome in diploid and natural allopolyploid cotton but not in synthetic cotton . Using RNA‐Seq, the global transcriptome profiles of developing cotton fibers from four wild and five domesticated cottons were compared, and this study identified over 5000 differentially expressed genes during the primary and secondary cell wall synthesis between wild and domesticated cottons . Comparative RNA‐Seq analysis of G. hirsutum from young, mature, and different senescence stages of leaves identified 3624 differentially expressed genes during leaf senescence . Comparative transcriptome profiling of G. hirsutum and G. davidsonii (diploid wild species) further characterized 4744 and 5337 differentially expressed salt‐stress responsive genes from roots and leaves of G. davidsonii . The potential role of signalling pathways, ethylene pathway in fiber elongation, and receptor‐like kinases (RLKs) in cell wall integrity have been proposed in determining fiber quality using comparative RNA‐Seq analyses of near isogenic lines (NILs) in G. hirsutum .
The major mRNA repositories are Crops‐ESTdb, NCBI dbEST, UniGene, RefSeq, and GEO; plant‐specific microarray databases are plant expression database (PLEXdb) and plant co‐expression database (PLANEX). Tools for RNA‐Seq expression analysis are R/bioconductor packages such as RSEM, DEGseq, DESeq, edger, and baySeq. The open access expVIP is a web‐based tool for visualizing, analyzing and comparing RNA‐Seq data for conducting gene expression analysis in diploid and polyploid plant species. Single‐nucleotide variant analysis in RNA‐Seq dataset can be performed by SAMtools followed by Genome Analysis Toolkit (GATK). RobiNA is a web‐based tool for accessing and comparing EST, microarray, and RNA‐Seq databases.
Though several OLC‐based, DBG‐based, and greedy assemblers discussed above have been utilized in RNA‐Seq analyses, Trinity is the most promising tool in generating transcriptome assembly. The Tuxedo pipeline contains a series of tools including Bowtie and TopHat for aligning the RNA‐Seq reads, Cufflinks for assembling the mapped reads, Cuffdiff for identifying differentially expressed genes and CummeRbund for visualizing differentially expressed genes and transcripts . The parameters such as false discovery rate (FDR) ≤ 0.001, fold change (log2FC) ≥ 2, and P‐value ≤ 0.05 are considered in comparative transcriptome profiling using RNA‐Seq . Using Trinity and Tuxedo, novel genes, and splice variants can be determined. In RNA‐Seq analysis, FPKM (Fragments per kilobase of transcript per million mapped reads) and RPKM (Reads per kilobase of transcript per million mapped reads) values are often used in quantifying gene expression. Express tool has been used in detecting transcript abundance in G. hirsutum.
Additionally, to annotate gene functions, homology‐based methods such as BLASTX, BLAST2GO, and BLAT have been used. PASA and EVM have been used in annotating 3' and 5' UTR regions and alternative splice events in the transcript assemblies. Functional annotation of transcript assembled fragments (TAFs) can be done using BLASTX (E‐value 1 × 10‐6) or BLASTP (E‐value 1 × 10‐5) against protein databases non‐redundant (NR), SwissProt, and TrEMBL. Gene ontology annotation is conducted by using BLAST2GO and AgriGO analysis. InterPro is used to annotate motifs and domains by comparing TAFs with publicly available databases such as Pfam, PRINTS, PROSITE, ProDom, and SMART. The databases used for predicting pathways are KEGG, PANTHER, Pathguide, PlantCyc, BioCyc, and MetaCyc. Using BLASTP (E‐value =< 1e‐5) and OrthoMCL gene clusters between sub‐genomes in tetraploid cotton have been classified. FASTQC and FASTX toolkit are widely used in filtering low quality and determining the quality of RNA‐Seq reads. Reads contaminated with adapters are removed using Trimmomatic software. HMMER has been used in identifying transcription factor (TF) gene families in tetraploid cotton using the PlnTFDB. Using D5 genome as a reference, homeologous genes/syntenic blocks between At and Dt sub‐genomes have been identified using MCscanX. These tools were summarized in Table 2.
The comparative transcriptome analysis of G. hirsutum with its progenitors ameliorates the complexities associated with the co‐existence of At and Dt genome transcripts in allotetraploid species. Moreover, comparative transcriptome mapping of sub‐genomes leads to identification of high‐confidence transcriptional modules that are evolutionarily conserved and are specific to the genus Gossypium. Identifying sub‐genome specific transcripts and analyzing their role in cotton gene regulation would help in developing novel biomarker tools associated with various complex polygenic traits in cotton. Alternative splicing in eukaryotes results in transcriptome and proteome diversity. The presence of abundant splice variants and novel transcriptionally active regions (nTARs) has been identified in Arabidopsis and rice using RNA‐Seq, but majority of these novel transcripts did not overlap with known protein‐coding genes and open reading frames (ORFs), suggesting their potential role in post‐transcriptional gene regulation and novel transcript/gene formation . Identification of alternative splice events and nTARs in tetraploid cotton is computationally challenging due to ploidy, duplication, chromosomal rearrangements, and presence of homeologous bias.
4. Small RNA analysis in cotton
Napoli et al. (1990) first identified the phenomenon of RNA interference (RNAi) and referred to it as co‐suppression in plants . Hamilton & Baulcombe (1999) reported a group of antisense RNAs that mediate in post‐transcriptional gene silencing (PTGS) in plants that target both cellular and viral mRNAs . The world of sRNAs exploded with the discovery of miRNAs (microRNA) in C. elgans . In plants, miRNAs were first identified in A. thaliana and studied extensively in other plant species using comparative genomic, cloning, or sequencing approaches . Unlike in animals, plant miRNAs are mainly found in intergenic and intronic regions . MiRNA genes are mostly localized and form clusters and they transcribe together as a single transcriptional unit . MiRNA discovery has gained momentum in diverse plant species. The majority of plant miRNAs identified to date negatively regulate the target gene expression at the post‐transcriptional level. MiRNAs regulate environmental stress response, metabolic processes, organogenesis, growth and development . The miRNAs associated with seed‐specific transcription factors have been identified and were found to influence differentiation and developmental timing .
The reduction in cotton fiber quality and yields is mainly attributed to several biotic and abiotic factors, and these complex traits are also regulated by small RNAs (sRNAs) which are mostly by miRNAs. Recent advances in cotton genomics have provided an impetus to pursue the discovery of novel miRNAs in the cotton genome [3, 114, 115]. MiRNAs and small interfering RNAs (siRNAs) are the two major classes of endogenous small non‐coding RNAs that regulate the gene expression in plants. Earlier reports suggest that miRNAs mediate a variety of functional roles such as developmental timing, cell proliferation, differentiation, morphogenesis, defense response, and signal transduction [116–118]. Cotton miRNA research is limited due to the lack of genomic information of cultivated cotton until recently. Cotton miRNAs have been identified in G. hirsutum (AtDt), G. barbadense (AtDt), G. herbaceum (A1), G. arboreum (A2), and G. raimondii (D5). Besides their role in fiber growth, development, initiation, and elongation, some miRNAs have been implicated in biotic and abiotic stress responses in cotton. Moreover, many cotton‐specific miRNAs have been identified, but their functions need experimental validation. Recent studies that use qRT‐PCR and degradome sequencing of cotton fiber RNA suggest that miRNAs play an important role in cotton fiber development. Similarly, Wang et al. (2012) identified 73 miRNAs that belong to 49 families in G. arboreum using homology‐based approach . Zhang et al. (2013) identified 65 novel miRNAs and their candidate gene targets in G. hirsutum using transcript sequences of G. raimondii . MiRNA was also analyzed in salt tolerance response in G. hirsutum. The miRNVL5 precursor was first discovered in Arabidopsis and later found in cotton. G. arboreum miRNA was studied in response to Cotton Leaf Curl Disease .
NGS technologies that are mentioned in transcriptome analysis are also applicable in small RNA sequencing. MiRNAs are highly conserved across the species and can be predicted by using homology‐based methods. Various resources commonly used in analyses of small RNAs are described in Table 3.
|miRNA precursors||RNAfold, mfold|||
|miRNA databases||miRBase, MicroRNAdb|||
|miRNA prediction||miRCat, miREAP, miRPlant, miRDeep‐P, miRNAKey||[125–128]|
|Gene targets for miRNAs||miRanda and psRNATarget||[129, 130]|
|lncRNA homologs||RefSeq, NONCODE, lncRNADB, PLncDB, PNRD, PlncRNADB,|
|Epigenomic databases||pENCODE, GEO, NGSmethDB, MethBase, plant methylome‐db||[135–137]|
|ChIP‐Enriched region identification||HOMER|||
|Methylation detection||CyMATE, MeQA, MEDIPS, FASTmC||[139, 140]|
|Bisulphite data analysis||Methylpipe|||
The other classes of small RNAs identified in Gossypium include trans‐acting small interfering RNA (TasiRNA), long intergenic noncoding RNA (lincRNA), and long noncoding natural antisense transcript loci (lncNAT). Long noncoding RNAs (lncRNAs) are 200 bp in length, rich in repetitive sequences, preferentially expressed in a tissue‐, genome‐ and lineage‐specific manner, are poorly conserved and participate in various regulatory processes. Higher overall methylation levels are exhibited by lncRNAs when compared with protein‐coding genes, while their expression is less affected by gene‐body methylation. The lncRNAs play a pivotal role in regulating lignin metabolism, cotton fiber initiation, and elongation. There are two main classes of lncRNAs, long intergenic noncoding RNAs (lincRNAs), and long noncoding natural antisense transcript (lncNATs), which are structurally similar but vary in number and length of exons and transcripts. In cotton (Gossypium spp.), 30550 lincRNA and 4718 lncNAT loci have been reported. However, homeolog expression bias of lncRNAs has been identified in subgenomes of polyploid species when compared to wild parents . In a similar study, 3510 lincRNAs and 2486 lncNATs that play an integral role in fiber initiation and elongation have been identified in G. arboreum using strand‐specific RNA sequencing (ssRNA‐Seq) of cotton fibers and leaves . A preferential expression of lncRNAs was observed (∼50%) than for protein coding genes (∼20%) during rapid fiber elongation, thus establishing their function in fiber development. The direct role of 21–22 nt siRNA derived from GhMML3 gene (MYBMIXTA‐like transcription factor 3) that encodes a lncNAT, implicated in fiber development has been investigated in G. hirsutum .
5. Epigenome analysis in cotton
Post‐transcriptional gene silencing (PTGS) is mainly associated with the regulation of gene expression directly by targeting the mRNA, while transcriptional gene silencing (TGS) is mostly coordinated by epigenetic modifications such as DNA methylation, histone modifications (methylation and acetylation), chromatin remodelling factors, polycomb group (PcG), and trithorax group (trxG) of proteins . These epigenetic modulators can be identified by using homology‐based searching in Upland cotton and its closest progenitor species, which will aid in understanding the epigenetic landscape of Upland cotton. The role of DNA methylation has been implicated in plant growth and development, by maintaining active and inactive chromatin states and in gene silencing by employing canonical and non‐canonical RNA‐directed DNA methylation (RdDM) pathways . For eliciting both the pathways, core RNAi machinery including DICER‐LIKE (DCL), RNA‐DEPENDENT RNA POLYMERASE (RDR), ARGONAUTE (AGO) proteins, RNA polymerases (Pol IV and V), and plant specific‐DNA methylases are required .
In plants, DNA methylation occurs in both CG and non‐CG (CHG and CHH) contexts, where “H” denotes A, T, or C. Although DNA is primarily methylated in CG context in both plants and mammals, the non‐CG contexts are abundant in plants when compared to mammals . The CG, CHG methylation contexts are symmetrical and are maintained by DNA replication, while CHH context is established by de novo DNA methylation. The DNA methylation within the genome can occur in promoter, gene‐body, transposable elements, and repeat regions . Previous reports suggest that 35–43%, 30–41%, and 30–32% of loci were methylated in A. thaliana ecotypes , Brassica oleracea accessions , and G. hirsutum accessions , respectively. The role of DNA methylation in fiber growth and gene silencing has been proposed in cotton. It has also been suggested that CHH methylation may have a role in developmental timing in cotton. Annual fluctuations in DNA demethylases and methyltransferases have shown opposite trends in their abundance in cotton ovules. Also, substantial changes in CHH methylation was noticed in the promoter regions of three key genes ETHYLENE RESPONSIVE FACTOR 6 (ERF6), SUPPRESSION OF RVS 161 DELTA 4 (SUR4) and 3‐KETOACYL‐COA SYNTHASE 13 (KCS13) that regulate cotton fiber growth. Development of homozygous RNAi lines, specifically targeting demethylases and methyltransferases, will aid in determining DNA methylation patterns in cotton fiber growth and development .
The variation in epigenetic modifications has been observed in spontaneous reciprocal hybrids of Rosa sect. Caninae and Rubigineae when compared to their parental species using cDNA–AFLPs and methylation‐sensitive amplified fragment length polymorphisms (MSAPs), suggesting the biased contribution of DNA methylation from parents to polyploid hybrids . Similarly, genotype‐specific and tissue‐specific variations in DNA methylation have been identified in cotton by using MSAP with methylation insensitive enzyme (BsiSI). Their results suggest that CHG methylation is more diverse than the other two contexts (CG and CHH) and this work established a relationship between DNA methylation and fiber development but failed to establish the correlation between epigenetic regulation and fiber quality. To achieve this, a robust study involving diverse genotypes/accessions and tissues is required. Cytosine methylation is among one of the key epigenetic regulatory processes that likely silence duplicated genes in polyploid crop species such as cotton. The levels, patterns, and diversity of methylation‐polymorphism in CG context have been investigated in 20 accessions of Upland cotton (G. hirsutum) to identify 32 methylation‐polymorphic cytosine sites using MSAP . The introgression of exotic DNA fragments from wild parents (G. bickii) into cultivated Upland cotton (G. hirsutum) has been investigated using AFLP and MSAP analysis to identify ~2000 genomic and ~800 epigenetic sites . This study identified ∼0.5% of alien DNA segments and ∼0.7% of genetic variation in the genomes of introgression lines. Though the overall methylation content is close to the wild parent, the context‐specific methylation varied significantly in introgression lines .
Using methylcytosine‐sequencing (MethylC‐Seq) and RNA‐sequencing (RNA‐Seq) analysis, variation in CHH methylation during ovule, and fiber development was reported in cotton . Further, this study suggested that CHH hypermethylation triggered RdDM‐dependent methylation in promoters, while RdDM‐independent methylation occurred in TEs and nearby genes, thus facilitating ovule and fiber development in cotton. Moreover, the contribution of CHG and CHH methylation in genic regions towards homeologous expression patterns in At, or Dt subgenomes have been proposed . Evolution is a very slow process when it is purely dependent on mutation and recombination. In plants, it is believed that there are additional forces that accelerated the process of evolution, which include interspecific hybridization. However, the mechanism of acceleration is less understood. The possible underlying mechanism has been proposed in wheat . It was suggested that the interaction of alien nuclear DNA with cytoplasmic macromolecules as a result of interspecific hybridization can potentially trigger a network of epigenetic changes in nuclear DNA, thus altering the expression of genes and genetic pathways associated with physiological processes, which may serve as a possible source of variation that facilitate evolutionary process. This idea sheds light into development of epigenomic‐segregating lines in crop plants, including cotton .
Screening the core RNAi machinery of important DNA methylases such as DOMAINS REARRANGED METHYLASE 1/2 (DRM1/2), CHROMOMETHYLASE 2/3 (CMT 2/3), and demethylases such as ROS1 and DEMETER (DME) in Upland cotton, and its closest progenitor species will aid in elucidating the key regulatory mechanisms in whole‐genome duplication, chromosomal rearrangements, dosage compensation, and evolutionary advantage of being polyploids. Chromosome painting techniques such as fluorescence in situ hybridization (FISH), variations of FISH and genomic in situ hybridization (GISH) are not only useful in determining chromosome structure and distribution of hetero/euchromatin but also in understanding epigenomic patterns and evolution of polyploid species .
Although several databases and repositories are available for storing, cataloging and normalizing animal, and mammalian epigenomic data, they are limited for plants. The popular resource is NCBI epigenomics sample browser to access the diverse collection of epigenomic data sets including genome‐wide DNA methylation maps and histone modifications. However, only A. thaliana dataset is currently available in the sample browser. Other than NCBI sample browser, the popular visualization tools for epigenomic data are as follows: UCSC epigenome browser, Ensembl encyclopedia of DNA elements (ENCODE), and WashU epigenome browser. A plant ENCODE (pENCODE) collects and curates epigenomic data from a wide range of plant species to compare, annotate, and understand mechanisms behind plant evolution. Epigenomics of Plants International Consortium (EPIC) provides a platform to share the protocols, methods, and results across the research labs to address basic questions of genetics and genome regulation beyond routine whole‐genome sequencing. GEO is one of the seminal repositories for hosting epigenome data from various resources including animals (human) and plants (Arabidopsis and Maize). NGSmethDB and MethBase specifically store whole‐genome single‐base resolution data obtained from whole‐genome bisulphite‐sequencing (WGBS/BS‐Seq) from diverse organisms. Currently, very few plant‐specific epigenome databases are publicly available and include plant methylome‐db, Arabidopsis Epigenome Browser, and Tomato Epigenome Database. Though the cotton‐specific epigenome database is not publicly available, methylomes of evolutionarily close relatives such as Ricinus communis and Theobroma cacao are available at plant methylomes‐db with user restrictions . Epigenetic resources and tools were summarized in Table 3.
The wide applications of chromatin Immunoprecipitation‐sequencing (ChIP‐Seq) analysis include understanding protein‐DNA interactions, histone modifications, chromatin states, and DNA methylation. Methylation and acetylation marked peaks can be identified using spatial clustering for identification of ChIP‐enriched regions (SICER), and later peaks can be annotated by using HOMER (motif search algorithm). Other peak‐finding algorithms, Model‐based Analysis of ChIP‐Seq (MACS), and PeakSeq have been extensively used in analysis. BS‐Seq analysis helps in identification of genome‐wide differentially methylated regions (DMRs) along with tools such as Bismark, BSMAP, and BS‐Seeker can be used. Cytosine Methylation Analysis Tool for Everyone (CyMATE) is a plant‐specific tool for analyzing BS‐Seq data. A web‐based tool, QUMA (quantification tool for methylation analysis) is suitable for BS‐Seq analysis, primarily for estimating CG methylation. MeDIP analysis helps in detecting methylated cytosines in both CG and non‐CG contexts and tools such as MeQA and MEDIPS can be used. FASTmC Webtool is available for comparative analysis of CG, CHG, and CHH DNA methylation levels in non‐model species. Methpipe is useful in analyzing bisulphite‐sequencing data including BS‐Seq, WGBS, and reduced representation bisulfite sequencing (RRBS). Integrative epigenome analysis can be done using methylPipe and compEpiTools. Differentially methylated regions (DMRs) have been identified in Arabidopsis by using methylpipe. ChIP‐Seq analysis in R (CSAR) accurately account the protein bound regions in the genome in plants. PolyCat determines mapping bias between genomes that may have resulted due to SNPs; hence, it can be used in comparing allopolyploid, G. hirsutum against diploid G. raimondii or G. arboreum. Moreover, it accepts data from diverse NGS platforms including RNA‐Seq, BS‐Seq, ChIP‐Seq, and SNP analysis .
6. Future direction
The reduction in cost of sequencing per base, increase in throughput and read size, and availability of better algorithms have significantly facilitated the integration of one or more genomic methods to address biological questions. For instance, continued progress in detection of SNPs and CNVs from sequencing data and combining this with genotyping‐by‐sequencing methods will be helpful in crop improvement. Rapid advances in genomics, bioinformatics, and computational biology in the past two decades have already facilitated generation and screening of large datasets, and thus have ushered us into the “Big Data” era of genomics. The plant science community still lacks full‐fledged computational infrastructure, data curation, and novel tools to extract information from the massive datasets. It should be realized that progress in analyzing polyploid genomes will continually need improvements because the nature of the data generated from high‐throughput technologies is voluminous, heterogeneous, and often in unstructured forms. The concepts of parallel processing, cloud computing, and machine learning offer promising solutions for managing and analyzing large datasets. The other important limiting factors are trained professionals to access and analyze these resources. Nonetheless, progress is being made as we enter the burgeoning age of “Big Data” to tackle genomes of widely grown and utilized crops such as cotton using bioinformatics tools.
The authors acknowledge Dr. Avinash Sreedasyam of HudsonAlpha Institute of Biotechnology and Mr. Joshua Reid at Alabama A&M University for reviewing this book chapter. Also authors would like to thank Dr. Ibrokhim Abdurakhmonov and Ms. Ivona Lovric for their valuable suggestions and comments in improving this book chapter.