Abstract
The recent emergence of long-read transcriptome sequencing has improved the overall accuracy of gene prediction compared with short-read RNA-Seq. In addition, the technology offers a more comprehensive view of functional genomics in uncharacterized species through efficient full-length unigene builds and high-precision gene annotation, making it an efficient way to develop transcriptome data resources from useful genetic pools. Hence, this review covers the applications of long-read RNA isoform sequencing, including the relative merits of the technology, the improvement in the accuracy of gene prediction and gene annotation, and full-length unigene builds in a new genome; the limitations of the technology are also discussed. The review will be valuable for collecting data resources for functional genomic studies.
Keywords
- functional genomics
- gene prediction
- long-read RNA sequencing
- transcriptome
1. Introduction
Transcriptomics is the study of transcript catalogs in a cell, tissue, or organism for a given developmental stage or physiological condition [1]. The transcriptome indicates the complete set of transcripts that consists of protein-coding messenger RNA (mRNA) and non-coding RNA (ncRNA), including ribosomal RNA (rRNA), transfer RNA (tRNA), and other ncRNAs [2, 3]. In contrast with the relatively stable genome, various factors such as developmental stage, physiological condition, and external environment influence the changes in the transcriptome. The goals of transcriptomics include the annotation of the transcriptome, and the determination of the functional structure of each gene in the genome and the changes in the expression levels of each gene among different transcriptome samples [1, 4, 5].
Because the transcriptome is complex, its analysis depends heavily on the availability of high-throughput tools. Thus, RNA sequencing (RNA-Seq) has become an important tool for biological studies, as it can quantify gene expression spatially and temporally. Although the high throughput of RNA-Seq has enabled the generation of massive amounts of sequence data, its reliance on short reads makes it poorly suited for genome and transcriptome assembly and for isoform detection. Single-molecule real-time (SMRT) sequencing, a method for generating long reads developed by PacBio, provides an alternative approach that overcomes these limitations in sequence length and accelerates our understanding of the complexity of transcripts [6].
In general, the read length of the Illumina HiSeq platform is about 100–150 bp, which is short compared with that of the PacBio platform (around 10 kb). However, the Illumina HiSeq platform has the advantage of generating more accurate reads and higher-throughput data. Although its accuracy is lower than that of the Illumina HiSeq platform, SMRT sequencing on the PacBio platform has been applied to elucidate the genomic structures of difficult-to-sequence organisms [7] because its long reads improve assembly, gene prediction, and annotation. With this technique, sequences are analyzed from a single strand of DNA without genomic amplification [9]. PCR-free long-read sequencing thus helps in sequencing large, complex whole genomes (e.g., hexaploid wheat and maize).
PacBio sequencing captures sequence information in real time during the replication of the target DNA. The template, called a SMRTbell, consists of a target double-stranded DNA (dsDNA) ligated to hairpin adaptors at both ends, which yields a closed, single-stranded circular DNA [8]. When the SMRTbell is loaded onto a chip called a SMRT cell, it diffuses into a sequencing unit called a zero-mode waveguide (ZMW) [10]. In each ZMW, a single polymerase immobilized at the bottom can bind to the adaptors of the SMRTbell [11]. Each of the four nucleotides is fluorescently labeled. As a nucleotide associates with the template in the active site of the polymerase, a light pulse is produced and used for base detection. A single polymerase read can reach up to 40 kb, depending on the library size and sequencing time. Because the SMRTbell is a closed circle, the polymerase does not stop after replicating one strand of the target dsDNA or double-stranded complementary DNA (cDNA) but keeps traversing the template until the reaction terminates. The mean length of full transcripts is only 1–3 kb in most plant and animal genomes (e.g., 1.6 kb in Arabidopsis [12], 1.8 kb in rice [13], 2.3 kb in human [14], and 1.2 kb in mouse [15]); thus, the same transcript can be covered multiple times by one long polymerase read. In this scenario, several reads (called subreads) are generated from the polymerase read by trimming the adaptor sequences. The consensus of the multiple subreads from a single ZMW yields a read of insert (ROI), or circular consensus sequence (CCS) read, with higher accuracy. On this basis, PacBio developed the isoform sequencing (Iso-Seq) protocol for long-read transcriptome sequencing, which covers library construction, size selection, sequencing, and data processing. Iso-Seq allows the direct sequencing of transcripts up to 10 kb, which is particularly useful for the genomes of uncharacterized species.
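The idea that repeated passes over one insert can be collapsed into a more accurate consensus can be sketched in a few lines. The following is a toy illustration only, not PacBio's CCS algorithm: it assumes subreads that are already aligned, of equal length, in the same orientation, and affected by substitution errors only, and the adaptor length of 45 nt is an assumed placeholder value.

```python
from collections import Counter

def estimate_passes(polymerase_read_len, insert_len, adapter_len=45):
    """Approximate number of full passes over the insert in one polymerase read.

    adapter_len is an assumed placeholder, not a value taken from the text.
    """
    return polymerase_read_len // (insert_len + adapter_len)

def majority_consensus(subreads):
    """Column-wise majority vote over pre-aligned, equal-length subreads.

    A toy stand-in for consensus calling; real subreads contain indels and
    alternate between strands, which this sketch ignores.
    """
    if not subreads:
        raise ValueError("need at least one subread")
    length = len(subreads[0])
    if any(len(s) != length for s in subreads):
        raise ValueError("toy model expects equal-length, pre-aligned subreads")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*subreads))

# A 10 kb polymerase read over a 2.3 kb insert gives about four passes.
passes = estimate_passes(10_000, 2_300)
# One subread carries a substitution error at position 3; the vote removes it.
consensus = majority_consensus(["ACGTAC", "ACGAAC", "ACGTAC"])
```

With more passes, a random error at a given position is outvoted more reliably, which is why short inserts sequenced many times yield higher-accuracy consensus reads.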
However, even though PacBio sequencing has an advantage over next-generation sequencing in terms of read length, its throughput is relatively low. A single SMRT cell contains 150,000 ZMWs, each of which can produce one polymerase read with a mean length of 10 kb. Typically, only 35,000–70,000 of the 150,000 ZMWs on a SMRT cell produce a read successfully, because a polymerase fails to anchor or more than one DNA molecule is loaded into a ZMW. Consequently, the typical throughput of the PacBio RS II system is around 0.5–1 Gb per SMRT cell [16]. More recently, PacBio developed another system, Sequel, which has 1,000,000 ZMWs, produces over seven times as many reads, and yields around 3.5–7 Gb per SMRT cell [17]. Sequel is appropriate for projects such as de novo genome assembly and isoform sequencing of transcriptomes. Another notable problem of PacBio sequencing is the relatively high error rate (around 11–15%) of polymerase reads [18]. Many hybrid sequencing approaches have been attempted to combine the accuracy of short reads with the length of PacBio reads [19].
Long-read transcriptome sequencing generates longer and improved transcripts with a high level of assembly completeness and gene annotation. Moreover, it helps avoid artifacts such as chimeras, structural errors, incomplete assembly, and base errors [20].
Here, we review the sample preparation, library construction, analytical pipelines, and results of isoform sequencing (Iso-Seq), a long-read transcriptome sequencing method, in gene prediction and annotation. We also discuss the relative merits and limitations of the Iso-Seq technology.
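The loading and yield figures quoted for the RS II can be related by simple arithmetic. The sketch below is a back-of-envelope calculation under the assumption that every productive ZMW contributes one read of exactly the mean length; the function names are illustrative and not part of any PacBio software.

```python
def loading_fraction(productive_reads, total_zmws):
    """Fraction of ZMWs on a SMRT cell that yield a usable polymerase read."""
    return productive_reads / total_zmws

def yield_gb(productive_reads, mean_read_len_bp):
    """Rough per-cell yield in gigabases, assuming every read has the mean length."""
    return productive_reads * mean_read_len_bp / 1e9

RS2_ZMWS = 150_000        # ZMWs per RS II SMRT cell (from the text)
MEAN_READ_LEN = 10_000    # mean polymerase read length in bp (from the text)

for reads in (35_000, 70_000):
    print(
        f"{reads} reads: loading {loading_fraction(reads, RS2_ZMWS):.0%}, "
        f"yield about {yield_gb(reads, MEAN_READ_LEN):.2f} Gb"
    )
```

Under these assumptions, roughly a quarter to a half of the ZMWs are productive, and the product of read count and mean length is 0.35–0.7 Gb per cell. This is the same order of magnitude as the cited 0.5–1 Gb; the simple product need not match the cited figure exactly, since real yield depends on the actual read-length distribution.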
2. Merits of long-read transcriptome sequencing
Long-read transcriptome sequencing such as Iso-Seq generates longer and improved transcripts from a species with a high level of assembly completeness and gene annotation, enabling a comprehensive view of the transcriptome. Conventional methods, such as cDNA cloning and EST sequencing, are limited by relatively low data coverage. Although deep short-read sequencing (i.e., RNA-Seq) provides good sequencing depth and coverage for genome-wide transcriptome analysis, its short read length leads to incomplete transcript assemblies, resulting in a high assembly error rate and unreliable gene annotation. Long-read transcriptome sequencing can also provide experimental verification of predicted gene models in a genome, improve the quality of predicted gene structures, and reduce missing gene annotations. Missing gene annotations may, for example, lead to false interpretations such as apparent gene loss, and to errors in gene expression profiles when RNA-Seq reads are mapped and quantified against predicted gene models. Thus, this technology helps to find full-length (FL) transcripts harboring complete open reading frames (ORFs) and to uncover novel splice isoforms as well as novel genes. The result is more accurate, experimentally verified gene prediction and annotation, which aids the study of gene regulation.
3. Sample preparation and library construction for isoform sequencing
Iso-Seq with the PacBio platform can generate FL cDNA sequences including the 5′ and 3′-UTRs (untranslated regions), as well as the polyA tails of the transcripts. The whole workflow, including the experimental protocol and analytical pipelines, is illustrated in Figure 1 [10].

Figure 1.
Schematic workflow of isoform sequencing.
3.1 Isolation of total RNA
The samples can be collected from various tissues (e.g., blood, gill, skin, muscle, liver, spleen, intestine, ovary, testis, kidney, heart, and brain of an animal) [21], or from specific developmental stages (e.g., developing rabbit at 21, 49, and 84 days of age) [22]. High-quality RNA with sufficient purity and integrity is critical to reduce the number of amplification cycles required in large-scale PCR and to improve sequencing diversity. RNA is usually extracted with an easy-spin RNA extraction kit or the RNAiso Pure RNA Isolation kit [20, 21, 22]. In general, 2–5 μg of total RNA with an RNA integrity number (RIN) greater than 7 is required.
3.2 cDNA synthesis and size partitioning
Analysis of the transcripts of protein-coding genes requires polyA mRNA. The Iso-Seq method is flexible and allows different types of RNA to be sequenced; mRNAs can be isolated directly or selected by polyA enrichment. First-strand cDNA is synthesized with oligo(dT) priming, which enriches for RNAs with a polyA tail, including mRNAs and long noncoding RNAs (lncRNAs), for further analysis.
For parallel analysis of RNA samples derived from various tissues, each sample can optionally be tagged with a unique barcode sequence. For instance, multiplex sequencing was performed to construct a maize transcriptome library from various tissues [23]. However, barcoding is not always desirable because the barcode sequence may reduce sequencing efficiency.
3.3 Size partitioning
Size selection is the most commonly used method to avoid over-representation of smaller transcripts in the sequencing data. Because smaller fragments may load preferentially on the sequencer, partitioning by size allows a more even representation of cDNAs of different size ranges. A second round of fractionation is recommended to remove any smaller fragments that remain after the first size selection. cDNA libraries of different sizes, including <1, 1–2, 2–3, and 3–6 kb, are generally constructed to enhance PCR amplification and to maximize the recovery of transcript diversity and sequence. However, such size selection may cause small transcripts of less than approximately 1 kb to be missed. This problem appears to result from the technical limitations of size selection in the construction of mRNA sequencing libraries. It can be addressed by combining the data with short-read RNA-Seq data, which are very effective for transcriptome coverage, especially for small transcripts.
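To make the size fractions concrete, the following sketch assigns transcript lengths to the four ranges named above and counts how many fall into each. It is an in-silico illustration only, not a model of the wet-lab fractionation; the boundary handling (for example, that exactly 1,000 bp falls into the 1–2 kb bin) and the example lengths are assumptions.

```python
from collections import Counter

# Size fractions named in the text, as (label, min_bp inclusive, max_bp exclusive).
SIZE_FRACTIONS = [
    ("<1 kb", 0, 1000),
    ("1-2 kb", 1000, 2000),
    ("2-3 kb", 2000, 3000),
    ("3-6 kb", 3000, 6000),
]

def assign_fraction(length_bp):
    """Return the label of the size fraction for a cDNA length, or None if outside all bins."""
    for label, low, high in SIZE_FRACTIONS:
        if low <= length_bp < high:
            return label
    return None

def fraction_counts(lengths_bp):
    """Count how many cDNAs fall into each size fraction (None = not captured)."""
    return Counter(assign_fraction(n) for n in lengths_bp)

# Hypothetical cDNA lengths in bp.
counts = fraction_counts([800, 1500, 1600, 2500, 7000])
```

Tabulating read lengths this way per library is one simple check of whether any size range is over- or under-represented before pooling.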
3.4 Library preparation and sequencing
After size selection, the amount of double-stranded cDNA is not sufficient for SMRTbell library construction. PacBio suggests PCR amplification using the KAPA HiFi Enzyme [24] with about 10 cycles. The amplified cDNAs are then converted into circularized molecules called SMRTbell templates with the SMRTbell Template Prep kit. Once this step is complete, the library is ready to be loaded into a SMRT cell and sequenced on the PacBio platform. There is a trade-off between the number of SMRT cells and the sequencing cost. In general, the Iso-Seq protocol recommends 8–50 SMRT cells to capture the transcript diversity in a tissue.
4. Building full-length transcripts in a genome
Error correction of the raw reads is necessary to improve the assembly quality of the FL transcripts. PacBio provides the Iso-Seq analysis software to perform the procedure by iterative clustering for error correction (ICE) and the Quiver algorithm (
The Iso-Seq raw reads are usually called polymerase reads or continuous long reads (CLRs) and have an average length of 10 kb (Figure 1). Because the average length of a transcript is 1–2 kb, a single polymerase read contains multiple copies of the same insert, which PacBio SMRT Link analysis splits into several subreads by removing the adaptor sequences [20]. The circular consensus sequences, or ROIs, are generated from these subreads. A read is defined as a full-length non-chimeric (FLNC) read when both the 5′ and 3′ cDNA primers are present and a polyA tail signal precedes the 3′ primer. To enhance consensus accuracy and remove redundancy among FLNC reads without any additional sequence data, ICE and Quiver can be applied [20]. The Iso-Seq classify tool sorts the ROIs into full-length non-chimeric and non-full-length reads by identifying the 5′ and 3′ adapters used in library preparation. The Iso-Seq cluster tool then clusters all the full-length reads, and the consensus sequences it produces are polished with the non-full-length reads through the Quiver algorithm [25]. Additionally, the CD-HIT program [26] can help to cluster the high- and low-quality Quiver consensus isoforms from ROIs at a high sequence identity threshold (e.g., 0.98–0.99) [20, 21].
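The full-length criterion described above (both cDNA primers present, with a polyA tail preceding the 3′ primer) can be illustrated with a small classifier. This is a minimal sketch, not the Iso-Seq classify tool: the primer sequences are hypothetical placeholders, matching is exact and sense-strand only, and the minimum polyA length of 15 is an assumed parameter.

```python
# Hypothetical placeholder primers; real library primers differ.
FIVE_PRIMER = "GCAATGAAGTCG"
THREE_PRIMER = "CTACTCAGTGGA"

def classify_read(seq, five_primer=FIVE_PRIMER, three_primer=THREE_PRIMER, min_polya=15):
    """Classify a read as full-length non-chimeric, chimeric, or non-full-length.

    Toy rules (assumptions of this sketch):
      - full-length: starts with the 5' primer, ends with the 3' primer,
        and a run of at least min_polya A's directly precedes the 3' primer;
      - chimeric: a full-length read whose insert contains another primer copy.
    """
    if not (seq.startswith(five_primer) and seq.endswith(three_primer)):
        return "non-full-length"
    insert = seq[len(five_primer):len(seq) - len(three_primer)]
    if not insert.endswith("A" * min_polya):
        return "non-full-length"
    if five_primer in insert or three_primer in insert:
        return "chimeric"
    return "full-length non-chimeric"

polya = "A" * 20
flnc = FIVE_PRIMER + "ATGGCCTTGA" + polya + THREE_PRIMER
no_tail = FIVE_PRIMER + "ATGGCCTTGA" + THREE_PRIMER
chimera = FIVE_PRIMER + "ATGG" + THREE_PRIMER + FIVE_PRIMER + "CCGT" + polya + THREE_PRIMER
labels = [classify_read(r) for r in (flnc, no_tail, chimera)]
```

A production classifier would additionally tolerate sequencing errors in the primers and handle reads in either orientation; both are omitted here to keep the rule itself visible.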
A disadvantage of Iso-Seq reads is their high frequency of nucleotide indels and mismatches. Indels and mismatches can therefore be corrected via alignment with a reference genome [27]. A viable alternative is to integrate short reads with long reads via hybrid sequencing. For instance, RNA prepared from the same samples is sequenced on both the PacBio and Illumina HiSeq platforms, and the short Illumina reads are used to correct the transcript isoforms with the LoRDEC tool v0.6 [28]. The corrected isoform sequences are then aligned against a reference genome with the GMAP aligner [29]. In the subsequent analyses, sequences with multiple or chimeric alignments should be excluded. To assess the quality of the unigenes, software such as CEGMA [30] and BUSCO [31] can be applied [20, 21, 32, 33]; the percentages of the transcripts that fully and partially align to the conserved proteins are calculated.
FL or longer transcriptome data have mostly been published for large, complex, or uncharacterized genomes of plant species (Table 1). Although deep short-read transcriptome sequencing (i.e., RNA-Seq) data have accumulated over recent years, they tend to yield low-quality transcripts with only a small proportion of FL transcripts, which prevents accurate transcript reconstruction and leads to incorrect annotation. Unlike RNA-Seq data, Iso-Seq data, when derived from as many tissues as possible, harbor a large proportion of unique FL transcripts. For example, Wang et al. [23] reported that maize yielded 111,151 non-redundant FL transcript isoforms, corresponding to approximately 26,946 genes. In addition, the genome coverage of Iso-Seq data approaches saturation. Ultimately, cost-effective long-read transcriptome sequencing can become the gold standard for transcript completeness, transcriptome characterization, and draft genome annotation. In species lacking a reference genome (e.g., garlic), Iso-Seq transcripts have been used as the reference sequence for scoring variation in both SNPs and expression levels across a population in order to identify trait-associated transcripts [36]; that study characterized transcripts (lncRNAs) associated with garlic clove shape traits.
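The completeness percentages mentioned above can be summarized as follows. This sketch does not reproduce how CEGMA or BUSCO score completeness; it assumes that, for each conserved protein, the best fraction of its length covered by any transcript is already known, and the 0.7 cutoff for calling a protein fully aligned is an assumed parameter.

```python
def completeness_summary(coverage_by_protein, complete_cutoff=0.7):
    """Summarize how many conserved proteins are fully, partially, or not recovered.

    coverage_by_protein maps a conserved-protein ID to the best fraction (0-1)
    of its length aligned by any transcript. complete_cutoff is an assumed
    threshold, not a value taken from CEGMA or BUSCO.
    """
    total = len(coverage_by_protein)
    if total == 0:
        raise ValueError("no conserved proteins supplied")
    complete = sum(1 for c in coverage_by_protein.values() if c >= complete_cutoff)
    partial = sum(1 for c in coverage_by_protein.values() if 0 < c < complete_cutoff)
    missing = total - complete - partial
    return {
        "complete_pct": 100 * complete / total,
        "partial_pct": 100 * partial / total,
        "missing_pct": 100 * missing / total,
    }

# Hypothetical coverage values for four conserved proteins.
summary = completeness_summary({"p1": 1.0, "p2": 0.8, "p3": 0.4, "p4": 0.0})
```

Comparing such summaries before and after error correction, or between an Iso-Seq and an RNA-Seq assembly of the same sample, gives a quick view of how much the unigene set has improved.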
Species | No. of transcripts | Mean length (bp) | Discovery: novel gene isoforms | Discovery: isoform annotation | Discovery: alternative splicing events | Discovery: gene prediction | Discovery: other | Reference |
---|---|---|---|---|---|---|---|---|
Panax ginseng | 135,317 | 3178 | Y | Y | Y | — | — | [20] |
Triticum aestivum | 91,881 | 2388 | Y | Y | — | — | — | [45] |
Zea mays B73 | 111,151 | 3372 | Y | Y | Y | Y | Fusion transcripts | [23] |
Sorghum bicolor | 27,860 | 1042 (full-length ROI) | Y | Y | Y | Y | — | [27] |
Trifolium pratense | 206,465 | 2789 | Y | Y | Y | — | — | [34] |
Zea mays W64A | 166,693 | 2715 | Y | Y | Y | — | Fusion transcripts | [35] |
Allium sativum | 36,321 | 1500 | Y | Y | — | — | Association study | [36] |
Populus (P. deltoides × P. euramericana cv. ‘Nanlin895’) | 87,150 | 2417 | Y | Y | Y | — | Fusion transcripts | [37] |
Coffea arabica | 95,995 | 3236 | Y | Y | Y | Y | — | [38] |
Table 1.
Transcriptomics studies in plants by isoform sequencing.
5. Improvement of the efficiency of functional gene prediction and annotation
The completeness of assembled transcripts is closely related to the efficiency of functional gene prediction and annotation, especially in the absence of reference genome information. Because of this advantage, Iso-Seq has been applied in a variety of species [20, 21, 22, 32, 33]. In addition, optimizing the training and prediction settings of gene predictors on the basis of both short- and long-read transcriptome data increases their sensitivity and precision [39]. In particular, the method is helpful for obtaining comprehensive gene sets for newly sequenced genomes of non-model eukaryotes [39].
To identify the protein coding potential of transcripts, Transdecoder (
For functional annotation, isoform sequences are used as queries for sequence homology searches with BLAST, Blast2GO [43], and InterProScan5 [44] to identify functional annotation terms from the non-redundant protein (NR), non-redundant nucleotide (NT), Gene Ontology (GO), Clusters of Orthologous Groups (COG), Kyoto Encyclopedia of Genes and Genomes (KEGG), SwissProt, and InterPro databases. For example, when the RNA-Seq data of H. glomeratus were re-annotated with Iso-Seq transcriptome data, the length distribution, functional annotation, and coding sequence quantity of the Iso-Seq transcripts were significantly improved [42]. In particular, with respect to the species distribution of the annotation from the NR database, 98.31% of the annotated isoforms showed the highest similarity to sequences from the three most prevalent species. In addition, a high proportion of the Illumina RNA-Seq data mapped to the Iso-Seq transcripts (unigenes). This suggests that long-read, full-length or partial unigene data with high-quality assemblies are invaluable resources as transcriptomic references in a genome and can be used for comparative analyses in closely related medicinal plants.
6. Conclusion
Iso-Seq yields longer and improved unigenes with a high level of assembly completeness and gene annotation, enabling a comprehensive view of the transcriptome. In particular, compared with conventional methods, long-read transcriptome sequencing appears to reduce the misassembly rate and unreliable gene annotation, which helps to elucidate the function of genes associated with traits of interest as well as novel transcripts. A hybrid approach that combines isoform sequencing, which provides full-length transcripts, with RNA-Seq, which can fix sequence errors and quantify gene expression, is the optimal solution for studying transcriptomes with improved transcript completeness, data coverage, and gene annotation.
Acknowledgments
This work was supported by grants from the National Agricultural Genome Center (project No. PJ01349002), Rural Development Administration, Republic of Korea.
Conflict of interest
The author declares no conflict of interest.