Open access peer-reviewed chapter

Gene Expression and Transcriptome Sequencing: Basics, Analysis, Advances

Written By

Nakul D. Magar, Priya Shah, K. Harish, Tejas C. Bosamia, Kalyani M. Barbadikar, Yogesh M. Shukla, Amol Phule, Harshvardhan N. Zala, Maganti Sheshu Madhav, Satendra Kumar Mangrauthia, Chirravuri Naga Neeraja and Raman Meenakshi Sundaram

Submitted: 09 April 2022 Reviewed: 17 June 2022 Published: 14 August 2022

DOI: 10.5772/intechopen.105929

From the Edited Volume

Gene Expression

Edited by Fumiaki Uchiumi

Abstract

Gene expression studies are extremely useful for understanding a broad range of biological, physiological, and molecular responses. Techniques for measuring gene expression reveal differential patterns of gene regulation and have evolved from detecting one gene at a time to profiling many genes in parallel. Gene expression is spatiotemporal: it depends on the tissue and the time point sampled, and therefore needs critical examination and interpretation. Transcriptome sequencing, or RNA-seq, using next-generation sequencing (short and long reads) is the most widely deployed technology for accurate quantification of gene expression. Depending on the biological aim of the experiment, replication, platform, and chemistry, RNA-seq has delivered documented advances in plant, human, animal, and clinical sciences with respect to the expression of mRNA, small non-coding and long non-coding RNAs, alternative splice variants, isoform variations, gene fusions, and single-nucleotide variants. Integrating transcriptome sequencing with other techniques, such as chromatin immunoprecipitation, methylation analysis, and genome-wide association studies, provides insights into genetic and epigenetic regulation. The epitranscriptome, including RNA methylation, modification, and alternative polyadenylation events, can also be explored through long-read sequencing. In this chapter, we present an account of the basics of gene expression methods, transcriptome sequencing, and the various methodologies involved in downstream analysis.

Keywords

  • ESTs
  • microarray
  • RNA-seq
  • assembly
  • annotation
  • visualization
  • tools
  • databases

1. Introduction

The phenotypic manifestation of the genetic code through transcription and translation is known as gene expression. Determining the specific spatiotemporal expression patterns under a particular condition or at a particular developmental stage is known as gene expression analysis, which has gained considerable attention in biological research. Conventional methods of gene expression and functional analysis focus on one gene at a time. In the last decade, however, numerous high-throughput technologies, such as microarrays and transcriptome analysis (RNA-seq), have been developed that allow the expression of thousands of genes to be studied simultaneously in a single experiment. These methods generate large amounts of biological data. Data repositories have progressed phenomenally, and data are continuously being deposited in databases. In parallel, advances in bioinformatics pipelines, tools, and software (online or offline, with a graphical user interface or programming-language based) have made data analysis easier and more convenient. Several databases serve as repositories of sequence data, the most widely used being that of the National Center for Biotechnology Information (NCBI).

2. Evolution of high-throughput transcriptomic technologies

Traditional methodologies for gene prediction and transcriptomic studies involve preparing complementary deoxyribonucleic acid (cDNA) clones, using them to generate expressed sequence tags (ESTs), and then sequencing these tags on first-generation platforms such as Sanger sequencing. Figure 1 shows historic advancements in gene expression technologies. In the mid-1990s, gene expression was studied for 45 Arabidopsis genes using early high-capacity microarrays in which cDNA was spotted on glass microscope slides [1]. Another pioneering quantitative transcriptomic method is the serial analysis of gene expression (SAGE), which was first performed on 1000 tags to characterize the gene expression pattern of the human pancreas [2]. Later, with advances in sequencing technology, RNA sequencing (RNA-seq) emerged, which uses next-generation sequencing techniques to retrieve both the sequence and the expression level of RNA transcripts [3, 4]. Continuous efforts have been made over the years to develop feasible high-throughput technologies for gene expression profiling and quantification, in order to cope with the challenges associated with sequencing technology, which include cost, complexity, availability, and the error rate during sequence assembly [5]. Array-based technology does not face these challenges and is therefore still widely used for expression studies. However, it has other limitations: because microarrays are probe-based, they require predefined probes and hence cannot deliver precise readings [6].

Figure 1.

Historic timeline of technologies involved in gene expression analysis.

With the onset of RNA sequencing (RNA-seq) technology, whole-transcriptome sequencing became possible [7, 8]. RNA-seq studies provide a genome-wide assessment of transcripts at a sequencing depth of 100–1000 reads per base pair of a transcript [9]. The output of RNA-seq generally comprises short reads, generated by sequencing cDNA fragments from one end or from both ends. Errors are then minimized, and the short reads are assembled into longer sequences corresponding to the sample RNAs.

Short reads are generally sequenced on next-generation sequencing platforms, which read quite short sequences of 35–500 bp [5, 10]. Reconstructing full-length transcripts from such reads requires high-powered computing systems with large storage and memory and several cores, so that the algorithms can run in parallel. However, such platforms have been observed to give limited coverage and quite high error rates, which increases the informatics challenges [6, 11, 12]. Additional reads are required to ensure high-quality coverage and improve throughput [9, 12]. Assembly algorithms have continued to evolve and to improve the quality of data. The main aim now is to extend read length and eliminate the dependence on assembly. Advanced RNA sequencing technologies, including single-molecule real-time (SMRT) sequencing and nanopore sequencers, can overcome the existing limitations by providing reads several kilobases long that span full-length transcripts. SMRT platforms have an average read length of 3000 bp, extendable up to 20,000 bp [9, 13, 14].

In combination with fluorescent in situ hybridization, RNA-seq technology has advanced to generate data on the cellular localization of transcripts. In fluorescent in situ sequencing (FISSEQ) [15], a breakthrough in transcriptomic research, a cell's RNA is sequenced using next-generation sequencing while it remains in tissue or culture. In this technology, cDNA is first generated in situ by reverse transcription of RNA; copies of the cDNA are then generated via rolling-circle amplification to form DNA “nanoballs.” These nanoballs are sequenced at the cellular level using “sequencing by oligonucleotide ligation and detection” (SOLiD) technology, which is based on the sequential hybridization of fluorescently labeled two-base probes. This technology has made it possible to generate sequence and positional information simultaneously. However, it still requires further optimization for wider adoption.

Novel sequencing approaches such as nanopore sequencing can perform direct RNA sequencing, eliminating the cDNA generation and sequence assembly that several high-throughput technologies require [16] and thereby avoiding two important inherent sources of error in indirect approaches. The most important point regarding all these methods, from microarrays to next-generation sequencing, is that when very many measurements are made simultaneously, even a very low error rate produces a large number of errors. Hence, high-throughput data need cross-validation by an alternative procedure, such as quantitative real-time polymerase chain reaction (qRT-PCR) or the other gene expression methods discussed later [17, 18], to improve their accuracy.
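
The point about many simultaneous measurements can be made concrete with a little arithmetic. The sketch below is only an illustration under the assumption that errors are independent and equally likely for every measurement; the figures (20,000 genes, 0.1% error rate) are hypothetical and are not taken from the cited studies.

```python
def expected_errors(n_measurements: int, error_rate: float) -> float:
    """Expected number of erroneous results among n independent measurements."""
    return n_measurements * error_rate


def prob_any_error(n_measurements: int, error_rate: float) -> float:
    """Probability that at least one of n independent measurements is wrong."""
    return 1.0 - (1.0 - error_rate) ** n_measurements


if __name__ == "__main__":
    # Hypothetical figures chosen only for illustration.
    n_genes = 20_000
    rate = 0.001  # 0.1% chance of error per measurement
    print("expected erroneous measurements:", expected_errors(n_genes, rate))
    print("probability of at least one error:", prob_any_error(n_genes, rate))
```

With these assumed numbers, roughly twenty measurements would be expected to be wrong even though each one is individually very reliable, which is why independent validation of selected genes is recommended.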

3. Expressed sequence tags (ESTs)

An EST is a short fragment of cDNA sequence (200–800 bp) generated by sequencing randomly selected cDNA clones. An RNA transcript is reverse transcribed to cDNA, cloned, and then sequenced. The resulting cDNA libraries provide ESTs that can be used for transcript identification, gene discovery, and sequence determination [19]. An EST is mapped to a location on a specific chromosome using physical mapping strategies or by aligning the EST sequence with the genome. This helps to determine the expression of the corresponding gene under specific conditions or treatments [20]. Hence, ESTs are used to study the structure of plant genomes, gene expression, and gene function [21]. Additionally, ESTs help to clarify structural gene annotation and support the development of molecular markers [22, 23], the construction of genomic maps [24], the study of ancestral relationships between species, and the elucidation of transcriptome activity [25, 26], and they provide information for developing probes for DNA chips [1]. Figure 2 shows the methodology of EST sequencing. With advances in sequencing techniques, approaches such as whole-genome sequencing and transcriptome sequencing have become alternatives to ESTs. These NGS techniques avoid missing rare transcripts by reducing the complexity and cost of sequencing [27]. Sanger sequencing generates fewer ESTs than GS-FLX, which is widely used for de novo sequencing and EST analysis in plants [6]. Table 1 shows the databases available for ESTs. Suppression subtractive hybridization (SSH) was developed to generate subtracted cDNA libraries based on suppression PCR. It combines normalization and subtraction in a single procedure, in which sequences common to the two samples compared for differential gene expression are subtracted and rare sequences are enriched [28]. The use of this technique is limited because of its complexity and its focus on the identification of low-abundance genes.

Figure 2.

Flow chart of the EST sequencing.

4. Serial analysis of gene expression (SAGE/CAGE)

Serial analysis of gene expression (SAGE) provides comprehensive, unbiased, and quantitative profiles of transcript expression. SAGE builds on the EST concept by using short, high-throughput sequence tags for quantification [2]. Unlike array techniques, it does not require prior knowledge of the genes, which gives it an advantage over them. In SAGE, cDNA generated from the respective mRNA is digested with a specific restriction enzyme, yielding 10–11 bp tag fragments. These tags are then concatenated head to tail into long strands (>400–600 bp) and sequenced. Each tag sequence is then aligned with reference genes to identify the corresponding gene (Figure 3). Where reference genome information is lacking, differentially expressed tags can still be used as diagnostic markers. Variants of SAGE have been developed, such as cap analysis of gene expression (CAGE), which uses sequence tags from the 5′ end of an mRNA transcript only [29]. Aligning these tags with the reference genome helps to reveal the transcriptional start site. Likewise, several SAGE-like variants have been developed (MAGE, SAGE, microSAGE, miniSAGE, longSAGE, superSAGE, deepSAGE, etc.) to study genome-wide DNA copy-number changes and methylation patterns, chromatin structure, and transcription factor targets.

Figure 3.

Flow chart for serial analysis of gene expression.
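
The tag-based counting described for SAGE can be sketched in a few lines. This is a minimal illustration, not the published protocol: the anchoring site `CATG`, the 10 bp tag length, and the `tag_to_gene` lookup table are assumptions introduced here for the example, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

ANCHOR_SITE = "CATG"  # assumption: recognition site of the anchoring enzyme
TAG_LENGTH = 10       # assumption: tag length within the 10-11 bp range


def extract_tag(cdna: str, site: str = ANCHOR_SITE,
                tag_length: int = TAG_LENGTH) -> Optional[str]:
    """Return the tag immediately downstream of the 3'-most anchoring site.

    Returns None when the site is absent or too close to the 3' end
    to yield a full-length tag.
    """
    pos = cdna.rfind(site)
    if pos == -1:
        return None
    start = pos + len(site)
    tag = cdna[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def count_tags(cdnas: Iterable[str]) -> Counter:
    """Count how often each tag occurs across a collection of cDNAs."""
    tags = (extract_tag(seq) for seq in cdnas)
    return Counter(tag for tag in tags if tag is not None)


def assign_genes(tag_counts: Counter, tag_to_gene: Dict[str, str]) -> Counter:
    """Sum tag counts per gene; tags without a match are kept as 'unassigned'."""
    gene_counts: Counter = Counter()
    for tag, n in tag_counts.items():
        gene_counts[tag_to_gene.get(tag, "unassigned")] += n
    return gene_counts
```

Because each transcript contributes one tag, the tag counts act as a digital measure of transcript abundance, and tags that match no reference entry remain usable as anonymous markers.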

5. Microarray

For the last decade, DNA microarrays have been preferred for high-throughput transcriptome profiling. Microarrays rely on the principle of complementarity between nucleic acid strands [30], and gene expression is quantified through hybridization of the sample RNA to the array. Microarrays are of two types: genotyping microarrays and expression microarrays. The former are used to detect specific DNA sequences, while the latter are used to detect specific RNA [31]. Comparing the results of arrays establishes the specific variation in gene expression patterns and mRNA abundance. This ultimately leads to the detection of promising candidate genes that respond to different treatments or differ between genetic backgrounds. The methodology involves extraction and preparation of high-quality RNA from specific tissues, followed by RNA amplification to facilitate hybridization and conversion of the mRNA into cDNA. The cDNA is then fragmented and biotin-labeled, and a fluorescent molecule that binds to biotin is added. Hybridization is then carried out; the time required to complete it is related to the sample concentration. Finally, the hybridized microarray is rinsed to remove unbound strands. The microarray is then scanned: a laser excites the fluorescent tags, and the recorded emission indicates hybridization of a specific sequence at a specific spot. The fluorescence intensity reflects the amount of sample bound to each probe (Figure 4).

Figure 4.

Flowchart of the microarray methodology.

However, microarrays have several limitations. Their detection range for expression levels is limited, which makes them ineffective for genes expressed at extremely high or low levels. They depend on prior knowledge of the sequences, and they sometimes give error-prone results. Additionally, cross-hybridization between similar sequences reduces the accuracy of detection. Hence, microarray results need cross-validation by qRT-PCR, Northern blot, etc., using appropriate reference genes [32].

6. RNA sequencing: the next-generation sequencing

RNA sequencing (RNA-seq) refers to quantifying the transcriptome using high-throughput sequencing methodology and computational methods [7]. The transcriptome is the set of the various types of ribonucleic acid present in the cell, such as messenger ribonucleic acid (mRNA), transfer ribonucleic acid (tRNA), ribosomal ribonucleic acid (rRNA), small nuclear ribonucleic acid (snRNA), non-coding ribonucleic acids (ncRNA), and others [33, 34]. The RNA-seq workflow includes total RNA extraction from a tissue sample, enrichment of RNA using either oligo(dT) selection or rRNA depletion, fragmentation of the RNA (100–500 bp), cDNA synthesis, and library preparation. The library is then sequenced using one of various high-throughput sequencing methods. Sequencing from one end (single-end sequencing) yields short sequences, is faster and more cost-effective than paired-end sequencing, and is appropriate for quantifying gene expression levels. Sequencing from both ends (paired-end sequencing), however, generates more robust alignments and/or assemblies, which is beneficial for gene annotation and transcript isoform discovery [7, 35]. Read lengths range from 30 bp to over 10,000 bp, depending on the sequencing method used [6]. To study the expression level and transcriptional structure of each gene, the resulting sequences are aligned with reference genome sequences available in databases. RNA-Seq reveals the genes that are active at a particular time, at a particular growth stage, or during a treatment, and read counts are used to study relative expression levels.

Sequencing technology has been continuously improved to obtain the best results. The importance of DNA sequencing in biological research is hard to overstate, as it helps to reveal the fundamental differences between organisms. The limitations of first-generation Sanger sequencing, developed by Frederick Sanger and colleagues, were overcome by second-generation sequencing (SGS), and those of SGS in turn by third-generation sequencing (TGS). Over the years, wide-ranging innovations in sequencing protocols and great advances in automation have increased the capabilities of DNA sequencing technology. These technological advances have also made sequencing cost-effective, which has broadened its application, and they allow massively parallel reading of DNA fragments hundreds of base pairs long in a single run. Sequencing technology has moved researchers from desktop computers to high-end servers, from code to programs, from single to multiple time points, and from single to multiple databases. Table 2 provides a comparative account of the sequencing technologies.
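
The use of read counts for relative expression, mentioned above, can be illustrated with a small sketch. Counts-per-million (CPM) scaling and a log2 fold change with a pseudocount are shown here as one common convention, chosen for the example rather than prescribed by the chapter; the function names and the pseudocount value are assumptions.

```python
import math
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw per-gene read counts by library size (counts per million)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library contains no mapped reads")
    return {gene: n * 1e6 / total for gene, n in counts.items()}


def log2_fold_change(cpm_control: Dict[str, float],
                     cpm_treated: Dict[str, float],
                     pseudocount: float = 1.0) -> Dict[str, float]:
    """log2(treated / control) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((cpm_treated[gene] + pseudocount)
                        / (cpm_control[gene] + pseudocount))
        for gene in cpm_control
        if gene in cpm_treated
    }
```

Scaling by library size matters because two samples are rarely sequenced to the same depth, so raw counts alone are not comparable between them. Dedicated differential expression tools use more elaborate normalization and statistical testing than this sketch.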

| Method | Commercial release year | Typical read length | Single-read accuracy (%) | Reads per run | Time per run | Advantages | Limitations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pyrosequencing (454 Life Sciences) | 2005 | 700 bp | 99.9 | 1 million | 24 hrs | Long read size; fast | Runs are expensive; homopolymer errors |
| Sequencing by synthesis (Illumina) | 2006 | 50–600 bp | 99.9 | 1 million to 2.5 billion | 1–11 days | High sequence yield | Expensive equipment; requires high DNA concentrations |
| Sequencing by ligation (SOLiD sequencing) | 2008 | 50 bp | 99.9 | 1.2 to 1.4 billion | 1–2 weeks | Low cost per base | Slower; issues with palindromic sequences |
| Combinatorial probe anchor synthesis (cPAS-BGI/MGI) | 2009 | 35–300 bp | 99.9 | 50 to 1300M per flow cell | 1–9 days | - | |
| Ion semiconductor (Ion Torrent sequencing) | 2010 | 600 bp | 99.60 | Up to 80 million | 2 hrs | Less expensive equipment; fast | Homopolymer errors |
| Nanopore sequencing (Oxford Nanopore Technologies Ltd.) | 2011 | 500 kb | 92–97 | Up to 500 kb | 1 min–48 hrs | Portable | Lower throughput; lower accuracy |
| Single-molecule real-time sequencing (Pacific Biosciences) | 2011 | >100,000 bases | 87 | 500,000 per Sequel SMRT cell; 10 to 20 gigabases | 30 mins–20 hrs | Fast | Expensive |

Table 2.

Comparative analysis of next-generation sequencing technologies for gene expression.

First-generation sequencing initially involved gene fragmentation, cloning, and a cumbersome manual analysis process. Later, it adopted capillary gel electrophoresis, with automated loading of the capillaries with polymer and sample and computer-based detection of the sequence [36]. It generates reads slightly shorter than 1 kilobase (kb) with an error rate of 0.001%. Second-generation sequencing, also known as next-generation DNA sequencing (NGS), involves PCR-based in vitro cloning, unlike the in vivo cloning used for first-generation sequencers [37, 38]. Commercially available TGS, in contrast, does not involve cloning and can sequence a single DNA molecule [39]. Nevertheless, the Sanger sequencing platform is still widely applied as a gap-filling technology between contigs generated on NGS and TGS platforms.

NGS platforms perform massively parallel sequencing of hundreds of thousands to hundreds of millions of different DNA fragments [40] with less template preparation. NGS includes 454 pyrosequencing (Roche), Solexa sequencing (Illumina), ion semiconductor sequencing (Ion Torrent Proton), the sequencing by oligonucleotide ligation and detection (SOLiD) system from Applied Biosystems, and massively parallel signature sequencing (MPSS). The first three are based on the principle of sequencing by synthesis, while SOLiD and MPSS employ oligonucleotide-template hybridization followed by ligation to the growing chain [41]. MPSS is best suited for gene expression studies and utilizes enzymatic cleavage and ligation. This helps in distinguishing and quantifying the sample RNA from different species. NGS technology has been used mainly for mRNA expression profiling, targeted resequencing, and biomarker discovery. Bases are deduced from the color and intensity of light signals.

The short-read NGS sequencers available commercially can sequence up to 600 bases; examples include Illumina's NovaSeq, HiSeq, NextSeq, and MiSeq, Thermo Fisher's Ion Torrent sequencers, and BGI's MGISEQ and BGISEQ. However, NGS methods have limitations, such as the short read length, which results from the destructive effect of lasers on DNA and enzymes. Repetitive washing after each cycle also reduces the amount of DNA available for sequencing. As plant genomes contain extensive repeat sequences, short reads complicate the assembly of genome sequences. In addition, heterozygous regions and regions of high or low GC content cannot be assembled precisely using NGS. NGS uses PCR to generate multiple copies of DNA fragments, which introduces bias, so the quality of coverage is not uniform across genomic regions. The method relies on hybridization to a template, and because PCR generates millions of copies of a single DNA fragment, the reactions on these copies do not all proceed in synchrony. These asynchronous reactions ultimately increase the error rate in the base sequence of a given fragment, and the errors build up through the cycles. However, NGS platforms provide software packages for “base calling” to minimize the error rate, and several additional base-calling algorithms reduce the error rate by ~5–30%.

Another limitation of NGS technology is the time (several days) required for sample preparation. In addition, although sequence data are generated at a comparatively low cost per base, the equipment, chemicals, data storage, analysis, management, and other consumables increase the overall cost. Despite these limitations, NGS technology still dominates the commercial sector because it generates a huge amount of data at a low cost per nucleotide. To address the above limitations, third-generation sequencing technology has been introduced.

6.1 Third-generation sequencing

As discussed above, NGS sequencers are fast, cheap, and user-friendly, with extremely high throughput. TGS is versatile and can carry out several distinct analyses with much higher throughput and more cost-effectively than NGS sequencers. Additionally, TGS technologies sequence single DNA molecules. Hence, they do not depend on in vivo cloning or PCR amplification and are extremely time-saving, as the necessary template preparation is completed in a few hours [39]. They are therefore often known as single-molecule sequencing (SMS) methods.

TGS technologies make use of DNA polymerase enzymes, fluorescence energy transfer, transmission electron microscopy, nanopores, and electronic detection. TGS platforms are long-read sequencers that produce reads of 10–15 kb. At present, Pacific Biosciences' (PacBio) single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies (ONT) nanopore sequencing are widely deployed [42]. However, the error rate of TGS methods is reported to be considerably higher, about 10–15%: because a single molecule is sequenced, there is less opportunity for error removal than in NGS, which sequences multiple copies of each fragment. This necessitates supporting technologies to carry out correction before and after the assembly process. Moreover, supporting technologies such as optical mapping (Bionano), linked-read technology (10x Genomics Chromium system), and the genome-folding-based technique Hi-C add support to existing genome assemblies.
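
Why multiple copies give more opportunity for error removal can be seen from a simple probability sketch. It assumes, as a deliberate simplification, that each copy errs independently at a given position with the same probability and that the consensus is wrong only when more than half of the copies are wrong; the function name and the example figures are hypothetical.

```python
from math import comb


def majority_error_probability(n_reads: int, per_read_error: float) -> float:
    """Probability that more than half of n independent reads are wrong
    at one position, i.e. an upper bound on the majority-vote consensus error."""
    threshold = n_reads // 2 + 1  # strict majority
    p = per_read_error
    return sum(
        comb(n_reads, k) * p ** k * (1.0 - p) ** (n_reads - k)
        for k in range(threshold, n_reads + 1)
    )


if __name__ == "__main__":
    # Illustrative only: 15% per-read error, within the 10-15% range quoted above.
    for depth in (1, 3, 5, 11):
        print(depth, majority_error_probability(depth, 0.15))
```

Under these assumptions the consensus error falls quickly as the number of copies grows, which is the rationale behind correcting long reads with additional coverage or with more accurate short reads.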

The appropriate sequencing technology has to be selected on the basis of several considerations: read coverage, accuracy, type of samples, DNA quality and quantity, and computational resources. In some cases, a combination of long- and short-read sequencing platforms is a better option for downstream analysis. This combination overcomes the individual limitations of both technologies and improves the quality of whole-genome assembly.

In comparison with microarrays, RNA-seq measures both low- and high-abundance RNAs and requires very little starting material, as little as 50 pg. This has made transcriptome studies possible on single cells rather than tissue samples and helps in finer examination of cellular structures and of expression at the single-cell level, including alternative transcripts, novel transcripts, and fusion genes. Several modifications of RNA-Seq have been used to identify candidate non-coding RNAs in plant species. A few of them are briefly described below.

6.2 Strand-specific RNA-Seq

Antisense transcription generates transcripts, many of them non-coding RNAs, that are complementary to the associated sense transcript. Antisense transcription has been reported in nucleosome-free regions such as promoters in bacteria, fungi, protozoa, plants, invertebrates, and mammals, where it carries out important regulatory functions. Identifying the presence and function of antisense non-coding transcripts requires strand-specific RNA-Seq. Conventional RNA-Seq does not preserve the strand information of sequenced transcripts. Without strand information, reads can be aligned to a gene locus, but the alignment gives no indication of the direction in which the gene is transcribed. Strand-specific RNA-Seq (ssRNA-Seq) helps to identify transcribed genes that overlap in different directions and to predict bouncing genes in organisms [43]. In ssRNA-Seq, the identity of the strand (sense or antisense) is preserved. The technique thus reveals the strand of origin and distinguishes antisense from other non-canonical RNAs, which improves the detection of transcripts in a sequencing experiment. For example, it can be used to uncover sense and antisense transcripts, to mark off the boundaries of neighboring genes transcribed from both strands, and to study the expression level of both coding and non-coding transcripts. A commonly used method for ssRNA-Seq is the dUTP method [43], in which uracil replaces thymine in the complementary strand generated during second-strand cDNA synthesis. The complementary strand is then degraded by uracil DNA glycosylase (UDG), so that only the original first strand remains. In this way, the strand from which a transcript originated can be identified by aligning the sequence with the reference genome. Using strand-specific RNA-Seq, various novel lncRNAs have been identified in many plant species. For example, a substantial amount of antisense transcription and many long non-coding natural antisense transcripts (lncNATs) have been identified in Arabidopsis using this method. This cutting-edge ability to capture strand information is key to a greater understanding of the transcriptome. The methodology is depicted in Figure 5.

Figure 5.

Flowchart for strand-specific RNA-Seq.
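
How the strand of origin is recovered from an alignment can be sketched as follows. The sketch assumes a dUTP-type library in which read 1 derives from the first-strand cDNA and therefore aligns antisense to the original transcript, while read 2 aligns in the sense orientation; it also assumes alignments stored with standard SAM flag bits (0x10 for a reverse-strand alignment, 0x80 for the second read of a pair). The function names are hypothetical, and other library protocols would need the opposite rule.

```python
def transcript_strand(is_read1: bool, is_reverse: bool) -> str:
    """Infer the genomic strand of the original transcript for a dUTP library.

    Read 1 is antisense to the transcript, so its alignment strand is flipped;
    read 2 already has the same orientation as the transcript.
    """
    if is_read1:
        return "+" if is_reverse else "-"
    return "-" if is_reverse else "+"


def strand_from_flag(flag: int) -> str:
    """Apply the rule above to a SAM FLAG value (single-end reads count as read 1)."""
    is_reverse = bool(flag & 0x10)
    is_read2 = bool(flag & 0x80)
    return transcript_strand(not is_read2, is_reverse)


def orientation_to_gene(flag: int, gene_strand: str) -> str:
    """Label a read as 'sense' or 'antisense' relative to an annotated gene."""
    return "sense" if strand_from_flag(flag) == gene_strand else "antisense"
```

With such a rule, reads overlapping an annotated gene but originating from the opposite strand can be counted separately, which is how antisense transcripts are told apart from the sense transcript at the same locus.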

6.3 RNA immunoprecipitation sequencing (RIP-Seq) and CLIP-Seq

RIP-Seq refers to high-throughput sequencing of the RNA that interacts with a target protein and is captured through immunoprecipitation of that protein; it helps to infer the mechanisms of post-transcriptional regulatory networks. RIP-Seq recovers RNA-protein complexes and maps the protein-binding sites on RNA. Various long non-coding RNAs (lncRNAs) have been reported to date, while the functions of many are still unclear. Hence, to reveal the significance of lncRNAs, scientists have developed various technologies to study the interaction between RNA and RNA-binding proteins (RBPs), which is a critical mechanism regulating translation. There are two main methods to characterize the functions of lncRNAs, namely RNA immunoprecipitation sequencing (RIP-Seq) and cross-linking immunoprecipitation sequencing (CLIP-Seq) (Figures 6 and 7).

Figure 6.

Flowchart of single-cell RNA-seq and RIP-Seq.

Figure 7.

Flowchart of lncRNA sequencing.

In RIP-Seq, proteins are used as bait to pull down RNA from the sample: an antibody targeting the protein is used to immunoprecipitate the RNA-protein complexes, which are then purified under optimum physiological conditions to retain the native interactions. After RNase digestion, the RNA protected by protein binding is extracted and reverse-transcribed to cDNA. High-throughput sequencing is then carried out, and data analysis reveals a transcriptome-wide view of the protein-RNA/lncRNA regulatory network. Likewise, in CLIP-Seq, covalent bonding between RNA molecules and RBPs under ultraviolet irradiation improves the binding strength between RNA-binding proteins and their corresponding RNA targets.

6.4 Single-cell RNA-Seq

Rapid advancement in NGS-based technologies for genomics, transcriptomics, and epigenomics facilitated scientists to focus on individual cell characterization, which reveals significant novel and potentially expected discoveries. Studies on any biological system were carried out at the level of the organism, organ, or tissue. Nonetheless, cells of identical genotypes also change in the activity of only a subset of genes. Moreover, for a better understanding of a biological phenomenon, there is an essential requirement of a more precise transcriptomics study for individual cells that will further elucidate their role in numerous cellular functions. This will ultimately lead to a better understanding of gene expression in promoting beneficial and harmful states. There are six methods for sRNA seq includes, cell expression by linear amplification and sequencing (CEL-seq), droplet sequencing (Drop-seq), massively parallel single-cell RNA sequencing (MARS-seq), single-cell RNA barcoding and sequencing (SCRB-seq), switch mechanism at the 5’ end of RNA template (Smart-seq, and Smart-seq2). In various plants such as Zea mays, A. thaliana, Medicago truncatula, rice, and Glycine max, expression pattern of genes was studied with the help of single-cell RNA seq methods [44]. However, there have been no reports about the utilization of the scRNA-seq to study plant lncRNAs. The major cause behind this is low sequencing coverage and the inability to capture and sequence non-poly A RNA. Although cell and tissue-specific roles and functional identification of lncRNAs in plants could be deduced using the scRNA-seq. In fluorescence microscopy, only a few genes can be studied under the response of cells to a specific signal or environment, while RNA-seq has been used for the study of differential gene expression levels with transcriptional differences of both coding and non-coding RNAs on a genome-wide scale. 
Single-cell transcriptomics has also been useful for reconstructing temporal transcription networks during developmental processes [45] or during the exposure of cells to external stimuli, all of which can be masked at the population level. Critical points for consideration are given in the sections below.

7. RNA-Seq data analysis

7.1 Data quality control and reads mapping

Once RNA sequencing has been completed, the data generated need to be checked for the total number of reads, read quality, and other sequencing requirements. Low-quality reads and base calls are removed by filtering reads and trimming bases, guided by QC reports generated with tools such as RNA-SeQC [46] and RSeQC [47]. Mapping RNA-seq reads to a reference genome is more challenging than mapping genomic reads. This is mainly because introns are spliced out and exons are joined when mRNA is produced, so RNA-Seq reads that span exon-exon junctions align discontinuously to the genome (Table 3).

S. no. | Tool | Remarks
--- | --- | ---
1 | SAMStat | A tool that evaluates unmapped, poorly mapped, and accurately mapped sequences independently to infer possible causes of poor mapping.
2 | FastQC | A quality control tool for high-throughput sequence data.
3 | RNA-SeQC | A tool with application in experiment design, process optimization, and quality control before computational analysis. Provides three types of quality control: read counts, coverage, and expression correlation.
4 | RSeQC | Analyzes diverse aspects of RNA-Seq experiments: sequence quality, sequencing depth, strand specificity, GC bias, read distribution over the genome structure, and coverage uniformity. The input can be SAM, BAM, FASTA, or BED files, or a chromosome size file (two-column, plain text file).
5 | Kraken | A set of tools for quality control and analysis of high-throughput sequence data.
6 | dupRadar | An R package that provides functions for plotting and analyzing duplication rates as a function of expression level.
7 | HTSeq | The Python script htseq-qa takes a file with sequencing reads (either raw or aligned reads) and produces a PDF file with useful plots to assess the technical quality of a run.
8 | MultiQC | Aggregates and visualizes results from numerous tools (FastQC, HTSeq, RSeQC, TopHat, STAR, and others) across all samples into a single report.

Table 3.

Tools for quality control of the transcriptome data.
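
The read filtering and base trimming mentioned at the start of this section can be illustrated with a minimal Python sketch. It is not how any of the tools in Table 3 is implemented; it assumes Phred+33 encoded FASTQ quality strings, and the function names and the thresholds (base quality 20, minimum length 30) are illustrative assumptions rather than recommended settings.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until a base with quality >= min_quality is met."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def passes_filter(sequence, quality_string, min_length=30, min_mean_quality=20):
    """Keep a read only if it is long enough and its mean quality is acceptable."""
    if not sequence or len(sequence) < min_length:
        return False
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores) >= min_mean_quality


# Hypothetical read: eight high-quality bases ('I' = Q40) and two poor ones ('#' = Q2).
seq, qual = trim_3prime("ACGTACGTAC", "IIIIIIII##")
print(seq, qual, passes_filter(seq, qual, min_length=5))
```

Dedicated trimmers such as those listed in Table 4 additionally handle adapters, paired-end reads, and sliding-window quality rules, which this sketch omits.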

There are two approaches for mapping RNA-seq reads. The first constructs a database of reference transcript sequences, consisting of currently annotated exons derived from a reference genome, and maps reads to this database with unspliced aligners such as BWA and Bowtie. Examples include SpliceSeq [48], SAMMate [49], PASTA [50], and RNASEQR [51]. The second detects splice junctions ab initio and is independent of genome annotation. Examples of ab initio spliced mappers are TopHat/TopHat2 [52], MapSplice [53], HMMSplicer [54], GSNAP [55], MapNext [56], and STAR [57].

More recently, lightweight quantification tools that avoid full read alignment, such as Salmon (https://salmon.readthedocs.io/en/latest/salmon.html#using-salmon), Sailfish (https://www.cs.cmu.edu/~ckingsf/software/sailfish/), and Kallisto (http://pachterlab.github.io/kallisto/), are being deployed.

The percentage of mapped reads is an important QC parameter, although it varies with factors such as the alignment method and the species (Tables 4 and 5). Several other critical factors also need to be considered, such as rRNA reads and duplicate reads, which arise from biological factors such as overrepresentation of a small number of highly expressed genes, or from technical factors such as PCR over-amplification. RNA-Seq QC tools also report genomic coverage, that is, the percentage of reads falling in intragenic regions (within genes, including exons and introns) or intergenic regions (between genes). If a sequenced reference genome is not available for mapping the RNA-Seq reads, there are two ways to analyze the data: map the reads to the reference genome of a related species, or assemble the target transcriptome de novo. Many de novo transcriptome assemblers are available, including Oases [58], SOAPdenovo-Trans [59], Trinity [60], and Trans-ABySS [61]. Note that the reference genome of a related species should have ~85% or greater similarity with the genome of the species under study; otherwise, the de novo assembly approach is preferable.

S. no. | Tool | Remarks
--- | --- | ---
1 | Cutadapt | Removes adapter sequences from next-generation sequencing data (Illumina, SOLiD, and 454). It is used especially when the read length of the sequencing machine is longer than the sequenced molecule, as in the case of microRNA.
2 | PRINSEQ | Generates statistics of sequence data for sequence length, GC content, quality scores, n-plicates, complexity, tag sequences, poly-A/T tails, and odds ratios. Filters, reformats, and trims sequences.
3 | SnoWhite | A pipeline designed to flexibly and aggressively clean sequence reads (gDNA or cDNA) prior to assembly.
4 | AlienTrimmer | Implements a very fast approach (based on k-mers) to trim low-quality base pairs and clip technical (alien) oligonucleotides from single- or paired-end sequencing reads in plain or gzip-compressed FASTQ files.
5 | Trimmomatic | Performs trimming for Illumina platforms and works with FASTQ reads (single- or paired-end). Tasks include cutting adapters, cutting bases in optional positions based on quality thresholds, cutting reads to a specific length, and converting quality scores to Phred-33/64.

Table 4.

Tools for trimming and adapter removal.

S. no. | Tool | Remarks
--- | --- | ---
Short (unspliced) aligners | |
1 | Subread | Expression analysis.
2 | Bowtie | A short aligner based on the Burrows–Wheeler transform algorithm and the FM-index. Bowtie tolerates a small number of mismatches.
3 | Burrows–Wheeler Aligner (BWA) | A software package for mapping low-divergent sequences.
4 | Bowtie2 | Aligns sequencing reads to long reference sequences and supports gapped, local, and paired-end alignment modes.
5 | PerM | Genome-scale alignments for hundreds of millions of short reads produced by the ABI SOLiD and Illumina sequencing platforms.
6 | ZOOM | Short aligner for the Illumina/Solexa 1G platform; uses an extended spaced seeds methodology, building hash tables for the reads, and tolerates mismatches, insertions, and deletions.
Spliced aligners | |
1 | RNA-MATE | Pipeline for alignment of data from the Applied Biosystems SOLiD system.
2 | Erange | Alignment and data quantification to mammalian transcriptomes.
3 | RUM | Pipeline-based alignment, able to manipulate reads with splice junctions, using Bowtie and Blat.
4 | RNASEQR | Tools used for alignment.
5 | SAMMate |
6 | SpliceSeq |
7 | X-Mate |
De novo splice aligners | |
1 | HISAT | Alignment program for mapping RNA-seq reads.
2 | HISAT2 | Alignment program for mapping next-generation sequencing reads.
3 | HMMSplicer | Canonical and non-canonical splice junctions in short reads.
4 | GMAP | A Genomic Mapping and Alignment Program for mRNA and EST sequences.
5 | PASS | Aligns gapped and ungapped reads and also bisulfite sequencing data.
6 | QPALMA | Predicts splice junctions using machine learning algorithms. The training set is a set of spliced reads with quality information and already known alignments.
7 | SuperSplat | Splits each read into all possible two-chunk combinations iteratively and attempts to align each chunk.
8 | SOAPsplice | Tool for genome-wide ab initio detection of splice junction sites from RNA-Seq, a method using new generation sequencing technologies to sequence the messenger RNA.
9 | RASER | Read aligner for SNPs and editing sites of RNA.
De novo splice aligners (also for annotation) | |
1 | STAR | Aligns long reads and can reach speeds of 45 million paired reads per hour per processor.
2 | TopHat | Alignment of shotgun cDNA sequencing reads.
3 | Subjunc | Uses all mappable regions in an RNA-seq read to discover exons and exon-exon junctions.

Table 5.

Tools for alignment of the transcriptome data.
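
The mapping-related QC parameters discussed in this section (percentage of mapped, duplicate, and rRNA reads) and the choice between mapping to a related genome and de novo assembly can be summarized in a short Python sketch. The function names and read counts are hypothetical; the 0.85 cut-off mirrors the ~85% similarity guideline stated in the text and is a rule of thumb, not a fixed standard.

```python
def mapping_qc(total_reads, mapped_reads, duplicate_reads, rrna_reads):
    """Express basic mapping QC counts as percentages of all sequenced reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")

    def pct(n):
        return 100.0 * n / total_reads

    return {
        "mapped_pct": pct(mapped_reads),
        "duplicate_pct": pct(duplicate_reads),
        "rrna_pct": pct(rrna_reads),
    }


def choose_strategy(has_reference, related_similarity=None, min_similarity=0.85):
    """Pick an analysis route following the guideline described in the text."""
    if has_reference:
        return "map to reference genome"
    if related_similarity is not None and related_similarity >= min_similarity:
        return "map to related species genome"
    return "de novo assembly"


# Hypothetical sample with 1000 reads.
print(mapping_qc(1000, 900, 100, 50))
print(choose_strategy(has_reference=False, related_similarity=0.80))
```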

7.2 Data normalization, differential gene expression, and splicing variant analysis

Normalization removes technical bias and unwanted variation in the total read counts of different samples, so that the analysis can focus on true differences between samples. In RNA-Seq, the more highly a gene is expressed (transcribed), the more reads are obtained for that gene. However, two critical factors must be considered when applying this basic principle: the sequencing depth and the length of the gene transcript. Normalizing the number of reads for each gene makes it possible to compare the reads of different genes across samples and treatments (Table 6).

S. no. | Tool | Remarks
--- | --- | ---
1 | baySeq | A Bioconductor package to identify differential expression using next-generation sequencing data, via empirical Bayesian methods.
2 | DESeq | A Bioconductor package to perform differential gene expression analysis based on the negative binomial distribution.
3 | derfinder | Annotation-agnostic differential expression analysis of RNA-seq data at base-pair resolution via the DER Finder approach.
4 | DiffSplice | A method for differential expression detection and visualization, not dependent on gene annotations.
5 | edgeR | An R package for analysis of differential expression of data from DNA sequencing methods, such as RNA-Seq, SAGE, or ChIP-Seq data.
6 | EdgeRun | An R package for sensitive, functionally relevant differential expression discovery using an unconditional exact test.
7 | MetaDiff | Differential isoform expression analysis using random-effects meta-regression.
8 | MMSEQ | Estimates isoform expression and allelic imbalance in diploid organisms based on RNA-Seq.
9 | Rcount | Simple and flexible RNA-Seq read counting.
10 | rDiff | A tool that can detect differential RNA processing (e.g., alternative splicing, polyadenylation, or ribosome occupancy).
11 | StringTie | An assembler of RNA-Seq alignments into potential transcripts.
12 | TIGAR | Transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference.
13 | TimeSeq | Detects differentially expressed genes in time course RNA-Seq data.

Table 6.

Tools for quantitative analysis and differential expression.

Reads per kilobase per million mapped reads (RPKM) and fragments per kilobase per million mapped reads (FPKM) are the two simplest early normalization approaches for RNA-Seq data; nevertheless, additional tools such as DESeq and edgeR are also commonly used for normalization. FPKM values are computed by software such as StringTie, which performs transcript assembly and RNA-seq quantification; a higher FPKM value indicates higher gene expression. This software has also been used to identify alternative transcripts generated by mRNA splicing during developmental stages. Transcripts per million (TPM), which is normalized for both transcript length and sequencing depth, and counts per million mapped reads (CPM), which is normalized for sequencing depth only, are also used as metrics depending on the experimental considerations.
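
As a worked illustration of these metrics, the following minimal Python sketch computes CPM, RPKM, and TPM from raw counts using their standard definitions. The gene names, counts, and lengths are made-up values; FPKM follows the same arithmetic as RPKM with fragments (read pairs) counted instead of individual reads. Production tools such as StringTie, DESeq, or edgeR apply further corrections that are not shown here.

```python
def cpm(counts):
    """Counts per million: corrects for sequencing depth only."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: corrects for depth and transcript length."""
    total = sum(counts.values())
    return {
        gene: c / (lengths_bp[gene] / 1e3) / (total / 1e6)
        for gene, c in counts.items()
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to one million."""
    rate = {gene: c / (lengths_bp[gene] / 1e3) for gene, c in counts.items()}
    scale = sum(rate.values())
    return {gene: r / scale * 1e6 for gene, r in rate.items()}


# Hypothetical counts for one sample and transcript lengths in base pairs.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

print(cpm(counts))
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

In this example geneA and geneB receive the same RPKM and TPM although geneB has three times as many reads, because geneB is three times as long. TPM values always sum to one million within a sample, which makes proportions comparable across samples.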

To identify differentially expressed genes, various tools are available, such as baySeq [62], Cuffdiff/Cuffdiff2 [45], DEGSeq [63], DESeq/DESeq2 [64], edgeR [65], and limma-voom (https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/). Read-counting software such as HTSeq is used to count the aligned reads that overlap annotated features, and edgeR or DESeq2 is then used to identify differentially expressed genes. The results are often shown as a heat map, in which darker colors denote higher expression and paler colors denote decreased expression relative to controls [65]. Sequencing depth (the number of reads generated per sample) and coverage (the number of reads covering a given transcript) are key factors in uncovering novel transcripts expressed at low levels, such as lncRNAs. Deep RNA-Seq helps to identify novel lncRNAs in plants by resequencing cDNA fragments.

Similarly, SpliceSeq quantifies and compares reads covering exons, while splice-junction-based approaches are used to identify changes in splicing patterns. Many of these methods focus on individual splicing events rather than full-length splicing variants. Available methods include MISA [66], ALEXA-Seq [67], FDM [68], rDiff [69], and rSeqDiff [70]. Genome-independent methods for the analysis of splicing variants are used especially for species without a sequenced reference genome, or when the RNA transcripts vary greatly from the reference genome (e.g., in diseased conditions). These methods assemble and differentiate splicing variants from a transcriptome preassembled from RNA-Seq reads; such transcriptome-based approaches include RSEM [71], IsoEM [72], BitSeq [73], and the more recently developed Rnnotator [74] and KisSplice [75].

7.3 Functional analysis of identified genes

Once the differentially expressed genes are identified, their biological functions need to be understood. Functional analysis of identified genes is an important part of data analysis and is carried out at multiple levels, such as biological pathways, gene ontology, and gene networks. Many tools are available for functional analysis: DAVID, g:Profiler, and clusterProfiler [76, 77] are used for the analysis of GO and biological terms; GSEA [78] is used for functional analysis of entire gene sets; and IPA (Ingenuity Pathway Analysis) is used for gene network analysis. Online resources for gene annotation such as OmicsBox (https://www.biobam.com/omicsbox/), PANNZER2 (http://ekhidna2.biocenter.helsinki.fi/sanspanz/), and eggNOG-mapper (http://eggnog-mapper.embl.de) are widely deployed (Table 7).

S. no. | Tools | Remarks
--- | --- | ---
1 | Tombo | A suite of tools for the identification of modified nucleotides, and for analysis and visualization of raw nanopore signal from nanopore sequencing data.
2 | IDP | Tool for de novo transcriptome assembly and isoform annotation by hybrid sequencing.
3 | NanoMod | Detection of DNA modifications using Nanopore long-read sequencing data.
4 | Pinfish | A collection of tools helping to make sense of long transcriptomics data (long cDNA reads, direct RNA reads).
5 | TAPIS | TAPIS (Transcriptome Analysis Pipeline from Isoform Sequencing) is a program for correcting and aligning long reads with/without second-generation reads, transcript clustering, novel and full-length splice isoform detection, and identification and analysis of polyadenylation (poly(A)) and alternative poly(A) (APA).
6 | SQANTI | Provides a wide range of descriptors of transcript quality and generates a graphical report to aid in the interpretation of the sequencing results.
7 | TAMA | Software designed for processing Iso-Seq data and other long-read transcriptome data.

Table 7.

Tools for annotation.

7.4 Data visualization

After annotation, variants can be visualized using genome browsers and visualization tools. Many RNA-Seq data visualization tools are available, such as the UCSC Genome Browser (https://genome.ucsc.edu/), the Integrative Genomics Viewer (https://software.broadinstitute.org/software/igv/), and JBrowse (https://jbrowse.org/jb2/). Alternative splicing visualization tools such as Alexa-Seq, SpliceSeq, SpliceGrapher, and SpliceViewer are also available. These visualization tools help in examining reads, read mapping, and annotation information such as the consequences, scores, and impact of variants. A volcano plot can be used to display large changes in gene expression. In a volcano plot, each dot represents a gene, the x-axis shows the log fold change (e.g., based on FPKM values), and the y-axis shows the negative log10 of the p-value (Table 8).

S. no. | Tool | Remarks
--- | --- | ---
1 | BamView | A free interactive display of read alignments in BAM data files.
2 | IGV | The Integrative Genomics Viewer (IGV) is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data.
3 | BrowserGenome | Web-based RNA-seq data analysis and visualization.
4 | ABrowse | A customizable next-generation genome browser framework.
5 | Savant | A next-generation genome browser designed for the latest generation of genome data.
6 | EagleView | An information-rich genome assembler viewer with data integration capability.
7 | TBro | A transcriptome browser for de novo RNA-sequencing experiments.
8 | MicroScope | A comprehensive genome analysis software suite for gene expression heatmaps.
9 | MatchAnnot | A Python script that accepts a SAM file of IsoSeq transcripts aligned to a genomic reference and matches them to an annotation database in GTF format.
10 | Iso-Seq | The Iso-Seq method produces full-length transcripts using Single Molecule, Real-Time (SMRT) Sequencing.
11 | IsoSeq-Browser | Interactive visual analytics tool for long-read RNA sequencing (Pacific Biosciences’ isoform sequencing (Iso-Seq) techniques).

Table 8.

Tools for data visualization of transcriptome data.
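
The volcano plot coordinates described in Section 7.4 can be computed as in the following Python sketch. It assumes that mean expression values (e.g., FPKM) and a p-value per gene are already available from a differential expression tool; the pseudocount of 1, the cut-offs (log2 fold change of 1, p < 0.05), and the function names are illustrative assumptions. Plotting itself is left to any plotting library.

```python
import math


def volcano_point(mean_control, mean_treated, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 p-value) for one gene.

    The pseudocount avoids division by zero for unexpressed genes;
    p_value must be greater than zero.
    """
    log2_fc = math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p


def classify(log2_fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene as up, down, or not significant using simple cut-offs."""
    if p_value < p_cutoff and log2_fc >= fc_cutoff:
        return "up"
    if p_value < p_cutoff and log2_fc <= -fc_cutoff:
        return "down"
    return "not significant"


# Hypothetical gene: FPKM 7 in control, 31 in treatment, p = 0.001.
x, y = volcano_point(7, 31, 0.001)
print(x, y, classify(x, 0.001))
```

Genes far to the left or right and high on the y-axis are those with both large and statistically supported changes in expression.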

8. Analyses with PacBio and NanoPore datasets

Long-read sequencing of the transcriptome is generally done to understand the expression of genes/transcripts in an organism qualitatively, by determining where the genes are located and whether events such as fusions or deletions affect them. Unlike Illumina or other short-read sequencing approaches, where the number of reads can be in the millions and expression is therefore captured many times, long reads generally number a few thousand but can capture full-length transcripts, depending on the libraries prepared. In 2010–14, most expression data consisted of paired or unpaired reads 25–100 bp long. Currently, short-read technology can sequence reads up to 250 bp long, which essentially means that transcripts are still fragmented several times.

PacBio’s IsoSeq and Nanopore’s direct cDNA/amplified cDNA sequencing can capture complete expressed transcripts of up to 90 kbp, with a median length of around 1400 bp for PacBio and 770 bp for Nanopore (depending on the Nanopore library preparation). Generally, IsoSeq and direct cDNA capture are done to confirm the existence of long repetitive regions, gene isoforms, and gene fusions. This in turn aids genome annotation, the capture of alternative splice sites, etc. Figure 8 depicts the analyses to be executed on transcriptomics datasets.

Figure 8.

A complete overview of the downstream analysis to be executed for the transcriptomics datasets.

8.1 Generalized workflow for PacBio/Nanopore

The general steps of analyses for IsoSeq are as follows:

  1. Generate reads of the insert with multiple passes to ensure high-quality reads with Q>30.

  2. Identify the reads that represent full-length transcripts based on the presence of poly-A tails.

  3. Cluster the reads iteratively using the longest reads and polish the read to obtain high-quality consensus.

  4. Map the consensus to the genome with long read aligners such as Minimap2 or GMAP.

  5. Identify the genomic regions where the read maps to derive gene models based on identity or coverage thresholds.

The general steps of analyses for Nanopore are as follows:

  1. Perform base-calling with Guppy using the model suitable for MinION, GridION, or PromethION based on the flowcell used.

  2. Identify reads with primers on both ends; these are likely to be full-length transcripts.

  3. Perform all-vs-all overlap of the reads with Minimap2 and call the consensus with Racon.

  4. Align the consensus reads to the genome to derive gene models using Minimap2 or GMAP.
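
Step 2 of the Nanopore workflow above (reads carrying primers on both ends) can be sketched in Python as follows. The primer sequences are placeholders invented for illustration, not the sequences of any sequencing kit, and exact string matching is used for brevity; because Nanopore reads contain errors, dedicated tools rely on approximate matching instead.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def has_both_primers(read, primer_5, primer_3, window=50):
    """Check whether a read starts and ends with the expected primers.

    Both orientations are tested because a cDNA read may come from either strand.
    Only the first and last `window` bases are searched.
    """
    read = read.upper()
    head, tail = read[:window], read[-window:]
    forward = primer_5 in head and reverse_complement(primer_3) in tail
    reverse = primer_3 in head and reverse_complement(primer_5) in tail
    return forward or reverse


# Placeholder primers and a hypothetical insert.
PRIMER_5 = "AAGCAGTGGT"
PRIMER_3 = "ACTCTGCGTT"
insert = "GATTACA" * 5
full_length = PRIMER_5 + insert + reverse_complement(PRIMER_3)
truncated = PRIMER_5 + insert

print(has_both_primers(full_length, PRIMER_5, PRIMER_3))
print(has_both_primers(truncated, PRIMER_5, PRIMER_3))
```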

Once the gene models or their transcript sequences are identified, the next step is to determine whether these sequences have a function. For example, are they protein-coding transcripts or long non-coding transcripts? Fortunately, there are multiple approaches to address this question.

One of the easiest ways is to evaluate the coding potential of these transcripts using tools such as Coding Potential Calculator, Coding–Non-Coding Index, and Coding Potential Assessment Tool. Once the coding transcripts are identified from the results of these approaches, the remaining transcripts can be treated as candidate long non-coding RNAs.
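
To give a feel for what coding potential means, the Python sketch below computes the longest open reading frame (ORF) in the three forward frames of a transcript and applies a length cut-off. This is only one simple feature and not the algorithm of any of the tools named above, which combine several sequence features; the 300-nucleotide threshold and the function names are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(transcript):
    """Length (nt, including the stop codon) of the longest ATG...stop ORF
    found in the three forward reading frames."""
    seq = transcript.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def looks_coding(transcript, min_orf_nt=300):
    """Crude call: treat transcripts with a long complete ORF as likely coding."""
    return longest_orf_length(transcript) >= min_orf_nt


# Hypothetical transcript with a short ORF (ATG + 10 codons + stop = 36 nt).
example = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
print(longest_orf_length(example), looks_coding(example))
```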

Another approach is to convert the transcripts into peptide sequences using tools such as TransDecoder or EvidentialGene; the peptides can then be used to obtain functional assignments from databases such as Pfam, Rfam, eggNOG, and InterPro. The above approaches do not necessarily require a reference genome, as recently developed tools can use partial-order alignment (POA)-based approaches to generate clusters of reads, from which a consensus can be derived directly.

8.2 Combinatorial analyses with short reads

The consensus sequences can then be used in place of a reference genome to study the transcriptome of the organism directly, by using short-read data to quantify the expression of the transcripts. The long reads provide a qualitative view of expression in the organism, whereas the short reads give the actual measure of expression owing to the sheer quantity of data. Once the gene models from long reads are available, annotated, and curated, approaches such as Salmon, Kallisto, or RSEM can be deployed to quantify expression using short reads, on which differential expression analysis can then be performed. Similarly, tools such as SQANTI and TAMA can use short-read data to augment the long-read data by annotating with CAGE peaks, polyA sites, NMD prediction, etc., which can be used for downstream analyses (Appendix).

9. Transcriptomic databases

Transcriptomic studies provide enormous information beyond the aim of the original experiments, which can serve as a resource for other scientific communities. The large amount of data generated through experiments may be deposited in publicly available databases. Transcriptome data quantify the expression of genes, along with small RNAs and noncoding RNAs, in cells, organs, particular growth stages, or stress conditions [79, 80]. Information identified in transcriptomic studies, such as genes differentially regulated under stress conditions, can be targeted in crop improvement programs [80, 81, 82]. Presently, a lot of information is publicly available through databases for in silico studies. NCBI GEO is one of the most frequently updated and curated databases, and it provides microarray data, RNA sequencing data, and functional annotation [83]. In addition, many other databases are available that cover RNA co-expression (http://atted.jp), plant-pathogen interactions, phosphorylation sites, RNA editing events, and transcription factors.

10. Validation of the RNA-seq data

The gene expression data obtained from RNA-seq studies need to be validated experimentally. High-throughput datasets yield a large number of genes, and validating all the differentially expressed genes is not practical. Hence, validation can be performed on small or large subsets as per the design, sampling, and tissues of the experiment. Such validation should ideally be done using the same samples that were subjected to RNA-seq or microarray. Quantitative real-time PCR (qRT-PCR) is the most widely used technique for the validation of gene expression on account of its reliability, accuracy, and sensitivity. It is considered a medium-throughput gene expression analysis technology and is largely used for the validation of transcriptome studies. Other, less frequently deployed techniques are translational fusion reporters using reporter genes, functional assays, etc. Virus-induced gene silencing (VIGS) is an RNA interference-based technology deployed to transiently knock down target gene expression by utilizing modified plant viral genomes. It is an emerging, resourceful tool for functional validation of a larger number of genes [84].

qRT-PCR remains a widely adopted, practically mandatory technique for the validation of gene expression. It works best when the same samples as those assayed for RNA-seq are used. Validation is more meaningful still when other replicates from the same sampling population are also assayed using qRT-PCR. Reference genes (also called housekeeping or endogenous genes), whose expression is expected to be stable in a particular tissue at a given time, play a critical role in the quantification of gene expression. Appropriate use of reference genes accounts for the variables in the experiment [85]. The most commonly used reference genes are 18S rRNA, GAPDH, actin, ubiquitin, elongation factor, tubulin, etc. The selection of a reference gene set is crucial in differential expression studies because the stability of a reference gene varies with the tissue in a spatiotemporal manner. The available software/tools for validation of reference genes are the delta Ct (∆Ct) method, geNorm, qBASE, NormFinder, BestKeeper, and RefFinder [86]. The number of reference genes to use is selected based on the M value (expression stability) and the V value (pairwise variation).
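
The M value mentioned above can be illustrated with a minimal Python sketch of the geNorm-style stability measure: for each candidate reference gene, M is the average, over all other candidates, of the standard deviation of the log2 expression ratios across samples. The gene names and relative quantities are invented, and the sketch omits the stepwise exclusion of the least stable gene and the V value calculation performed by the actual software.

```python
import math
from statistics import stdev


def stability_m(expression):
    """Compute a geNorm-style M value for each candidate reference gene.

    `expression` maps gene name -> list of relative quantities, one per sample,
    in the same sample order for every gene. Lower M means more stable.
    """
    genes = list(expression)
    m_values = {}
    for gene_j in genes:
        variations = []
        for gene_k in genes:
            if gene_k == gene_j:
                continue
            log_ratios = [
                math.log2(a / b)
                for a, b in zip(expression[gene_j], expression[gene_k])
            ]
            variations.append(stdev(log_ratios))
        m_values[gene_j] = sum(variations) / len(variations)
    return m_values


# Hypothetical relative quantities in three samples.
candidates = {
    "refA": [1.0, 2.0, 4.0],
    "refB": [2.0, 4.0, 8.0],  # constant ratio to refA, so the pair is stable
    "refC": [1.0, 4.0, 2.0],  # varies independently of refA and refB
}
print(stability_m(candidates))
```

In this example refA and refB obtain a lower M value than refC, so they would be the preferred reference pair.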

Livak’s 2^(−ΔΔCt) method is the most popular method for quantifying relative gene expression using the Cq (quantification cycle) or Ct (cycle threshold) values of the target and reference genes. The Minimum Information for Publication of Quantitative Real-Time PCR Experiments (MIQE) guidelines describe the information required for publishing qRT-PCR data in terms of transparency, accuracy, and reliability. Readers are encouraged to refer to [87, 88].
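
The arithmetic of the 2^(−ΔΔCt) method can be followed in this short Python sketch. The Ct values are invented for illustration, and the calculation assumes amplification efficiencies close to 100% for both the target and the reference gene, which is the usual premise of this method.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene in treated versus control samples.

    Step 1: delta Ct = Ct(target) - Ct(reference), for each condition.
    Step 2: delta-delta Ct = delta Ct(treated) - delta Ct(control).
    Step 3: fold change = 2 ** (-delta-delta Ct).
    """
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_treated - delta_control
    return 2.0 ** (-delta_delta)


# Hypothetical mean Ct values: the target appears three cycles earlier after
# treatment, while the reference gene is unchanged.
fold = relative_expression(
    ct_target_treated=22.0, ct_ref_treated=18.0,
    ct_target_control=25.0, ct_ref_control=18.0,
)
print(fold)  # 8.0, i.e., eight-fold higher expression in the treated sample
```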

10.1 Critical points for qRT-PCR validation

  1. Design qRT-PCR primers with a desirable amplicon size (60–150 bp), GC content (40–60%), primer length (18–25 bases), and melting temperature (Tm, 60–62°C), and without formation of dimers or loop structures. The primers can be checked for secondary structure formation and thermodynamic parameters using online tools.

  2. Selection of sample sets (as per the experimental design) for experimental validation and the number of samples for validation in replications.

  3. Prior checks using semiquantitative PCR for cDNA template concentration, primer concentration, and PCR reaction components.

  4. Selection of a validated set of reference/endogenous genes in the experiment.

  5. Appropriate use of positive controls, negative controls, and no-template controls in plate setup.

  6. Number of biological replications (at least three) and technical replications (two) for statistical inference.
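
The primer design criteria in point 1 above can be checked programmatically, as in the following Python sketch. The primer sequence is made up, and the melting temperature is estimated with the simple Wallace rule (2°C per A/T and 4°C per G/C), which is only a rough approximation; primer design tools use nearest-neighbor thermodynamics, and dimer or hairpin formation is not assessed here.

```python
def gc_percent(primer):
    """GC content of a primer as a percentage."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature (degrees C) by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def check_primer(primer, amplicon_size):
    """Compare a primer and its amplicon with the ranges given in the text."""
    return {
        "length_ok": 18 <= len(primer) <= 25,
        "gc_ok": 40.0 <= gc_percent(primer) <= 60.0,
        "tm_ok": 60 <= wallace_tm(primer) <= 62,
        "amplicon_ok": 60 <= amplicon_size <= 150,
    }


# Hypothetical 20-base primer producing a 120 bp amplicon.
primer = "AGCTGACCTGAAGCTGATCC"
print(gc_percent(primer), wallace_tm(primer))
print(check_primer(primer, amplicon_size=120))
```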

11. Critical factors to be considered in expression studies

Numerous points need to be taken into account in gene expression studies. The first and most important factor is the sample tissue, because expression varies enormously with the developmental stage and the aim of the experiment. Hence, the time of sample collection and proper storage are crucial. The selected individuals should be representative of the species and should have a strong genetic background, which enables adequate information to be extracted on a large scale. Further, the quality and quantity of the isolated nucleic acid should be thoroughly checked before NGS to obtain accurate outcomes. Several points also need to be considered when selecting the sequencing platform and the assembly tools/software/programs to obtain accurate results. The choice of sequencing platform influences the cost and success of the assembly process. Different sequencing platforms generate different types of data, which are analyzed using different assembly programs. Because an assembly program is specific to the type of data to be analyzed, the analysis pipeline should be decided before sequencing. The biological factors include the selection of individuals with pure genetic backgrounds that are good representatives of the species for DNA isolation. The basic biochemical, morphological, and physiological characteristics of the sample should be known. Among the technical factors, computational resources play an enormous role in proper genome assembly, in running the analysis, and in storage. Accurate selection of the annotation program is also a critical step, so that genes/transcripts with a low copy number do not escape the study. An appropriate stringency level of the bioinformatics tools/software should be maintained for the final interpretation of the data generated. For proper alignment of the assembly and annotation of coding regions, the RNA sequencing data must be generated from RNA extracted from the same sample.

11.1 Critical points for biology consideration before the start of transcriptome sequencing

  1. To understand spatiotemporal gene expression in a tissue at a given time

  2. To know the changes happening in a tissue at different time points

  3. To compare the changes in expression across tissues in varying organisms

  4. To understand the biology of microorganisms affecting/invading a host plant/animal

  5. To pinpoint the specific gene(s)/transcript(s) responsible for a trait/condition

11.2 Critical parameters affecting a transcriptome

  1. Quality of reads (Phred score)

  2. Availability of the reference genome

  3. The completeness of the transcript/gene

  4. GC content

  5. Number of transcripts in the assembly (assembly thinning)

  6. The correctness of the transcripts

  7. Replications for statistical analysis of data

  8. Functional significance

  9. Validation of transcripts

  10. Phylogeny, etc.

11.3 Critical points before the start of the experiment

  1. Know all about the biology of the sample/trait

  2. The time points of sampling and the data required at each time point need to be checked or validated

  3. Cross-check against, and refer to, the biological experiment under consideration

  4. Critical time points at which data need to be recorded

  5. Keep the data sorted numerically and make full use of Excel sorting/filtering on parameters such as e-value, FPKM, q-value, and p-value

  6. Use more than two/three tools for each data

  7. Keep the stringencies at different levels and observe the data/data distribution

  8. Always check back the data with the available information from RNA-seq, ESTs, genome data, back references

  9. For validation using qRT-PCR, select as many transcripts as possible based on function, including transcripts falling in one pathway

11.4 Critical points for data accuracy

  1. Cross-check the data at every step of data analysis

  2. Using two tools/software will increase accuracy

  3. Stringencies in terms of parameters and statistics should be checked at every step

  4. Annotation can be checked with the known set of proteins of the nearest genome or with the RNA-seq data available in the public databases

  5. The functionally relevant transcripts (with relatively higher or lower expression) should be biologically validated

  6. The gene expression validation using a particular significant pathway or metabolic process or a set of specific gene families will hold a higher confidence level for data validation

12. Integration of transcriptomics with other techniques for unravelling the gene expression

Gene expression techniques, especially RNA-seq, are widely deployed in plant, human, and animal sciences for quantitative and qualitative profiling. The data obtained from RNA-seq can be analyzed for differential gene expression, annotation, isoform identification, metabolic pathways, domain identification, alternative splicing variant identification, insertion-deletion variations, single nucleotide variants, gene co-expression network analysis, and mapping to already identified regions or quantitative trait loci (QTLs). Along with the mentioned downstream analyses, RNA-seq gene expression data can be integrated with other techniques for a better understanding of specific human or animal processes, depending on the tissue, its complexity, and the metabolic and biological processes involved. The data can be generated de novo (by sequencing the samples), or the data available in public databases can be mined and utilized [89].

In the human and animal sciences, given the complexities associated with different diseases, responses to disease conditions, and healthcare medications, the following advances in RNA-seq have been widely deployed for studying gene expression. Spatial gene expression analysis of tissue sections retains the precise location of biological molecules in the tissue sample, so that the sequenced transcripts can be related to morphological differences. Similarly, formalin-fixed, paraffin-embedded (FFPE) tissues and cells labeled with antibodies against cell-surface proteins can be sequenced. For better analysis of relative gene expression, RNA-seq is being combined with DNA methylation, protein, and chromatin studies, and adapted to degraded RNA samples, for a thorough understanding of gene expression at a given time point. Circulating RNA can also be captured using modified protocols (during the initial isolation steps from tissues) and sequenced for the identification of transcripts. In the clinical treatment of diseases, it is essential to characterize the immune repertoire at the single-cell level. Techniques such as cellular indexing of transcriptomes and epitopes by sequencing (CITE-Seq) combine single-cell RNA-seq with the analysis of cell-surface proteins. Specific regions or transcripts of interest can also be sequenced using a probe-based enrichment approach called targeted enrichment RNA-seq. The understanding of alternative splice variants also forms a major application of RNA-seq in clinical research [90].

RNA-bulk sequencing is a modification of bulk segregant analysis in which extreme bulks are made to identify QTLs and the gene expression patterns associated with the trait of interest. RNA samples from contrasting types of tissue are bulked and sequenced, and the approach can be combined with spatial RNA-seq to quantify the gene expression of tissues at a given time [91].

Epi-transcriptomics is the analysis of the transcriptome to understand RNA modifications such as N6-methyladenosine, 5-methylcytidine, and 5-hydroxymethylcytidine. Specific antibodies are used to precipitate the modified RNA (RNA immunoprecipitation (RIP)), which is then sequenced on a high-throughput platform. Oxford Nanopore direct RNA sequencing can detect the modifications directly, without the need for antibodies [92].

Dual RNA-seq is a powerful method for analyzing the gene expression patterns of a host and a microorganism simultaneously during their interaction. The interaction can be beneficial, as observed with growth-promoting microorganisms, or can occur during an infection process. The transcripts from the host and the microorganism are captured concurrently, so that the genome-wide transcriptional changes in both can be assessed. This technique unravels the mechanism of the beneficial organism or invading pathogen, enabling an understanding of the effectors and molecular processes of host colonization. Nevertheless, isolating the interacting transcriptomes in practice requires specialized protocols and further bioinformatics analysis [93].

The RNA-seq datasets generated through sequencing can be utilized further for mapping the trait of interest. Genetic variants that regulate the expression levels of local or distant genes, and thereby explain variation in gene expression, are called expression quantitative trait loci (eQTLs). The results of genome-wide association studies (GWAS) can be integrated with eQTL data in an approach called transcriptome-wide association studies (TWAS). The gene expression levels for the GWAS samples are combined with the trait values in order to identify gene-trait associations (the involvement of a genic region or gene in the trait). TWAS is a promising approach to ascertain the causal genes at GWAS loci [94]. In addition, the differentially expressed genes can be mapped to previously reported QTLs for a particular trait of interest. Such co-localized genes increase the confidence of the study in terms of linkage with the QTL.

For specific applications such as understanding the biogenesis and development of non-coding RNA, identifying transcription sites, or finding the binding sites of transcriptionally active RNA polymerase II (RNAPII), Global Run-On sequencing (GRO-seq) is utilized. GRO-seq allows the unbiased mapping of nascent transcripts. Nascent RNA is labeled with brominated nucleotides (5-bromouridine 5′-triphosphate (Br-UTP)), which allows its immunoprecipitation and enrichment, followed by cDNA conversion and sequencing [95].

  1. QC of the data—FastQC

  2. Adapter/low-quality data trimming—FastP

  3. Alignment—STAR/HiSAT2

  4. Alignment QC—Biotype plot, duplicate marking, strandedness identification

  5. Quantification—featureCounts

  6. Sample correlation and principal components—DESeq2

  7. Comparative analyses—DESeq2: Volcano plots, heatmaps

  8. Term enrichments—GO overrepresentation, Reactome, and KEGG pathways

1. QC of the Data—FastQC

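FastQC is generally used to judge the quality of the data based on Phred scores. A Phred score is a negative logarithmic score that expresses the quality of each called base.

Q = -10 log10(P), where P is the estimated probability that the base call is incorrect. As the error probability decreases, the quality score increases: Q20 corresponds to a 1 in 100 chance of error (99% accuracy) and Q30 to a 1 in 1,000 chance (99.9% accuracy).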
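The relationship between a Phred score and the base-calling error probability can be illustrated with a short Python sketch. The function names are hypothetical, and the decoding step assumes the Phred+33 ASCII encoding commonly used in current FASTQ files.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Return the probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Return the Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string, assuming the Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality_string]


# Print the error probability and accuracy for a few common scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

Each increase of 10 in the score corresponds to a tenfold drop in the error probability, which is why small differences in Q at the low end matter more than at the high end.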

FastQC gives visual confirmation that all is well with the data. GC content plots can be used to check whether any contaminants are present in the sequences. An unusual base distribution at the start or end of the reads can be a signature of library artifacts or of systematic biases of the sequencer, especially on Illumina platforms.

Distribution of the bases at the start and end of the reads, an example of sequencer biases or improper library preparation

2. Adapter/low-quality data trimming—FastP

FastP is an all-in-one QC tool. It gives a summary of the data before and after the removal of bad regions in a read.

Bad regions can be specified as a stretch of bases with a lower quality than expected. For example, if the average read quality is expected to be Q35, i.e., greater than 99.9% accuracy, but a few bases have a Q15 score, i.e., around 97% accuracy, then these bases can be removed.
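The idea of removing a low-quality stretch can be sketched as a sliding-window trim. This is a simplified illustration, not FastP's actual implementation: the function name, the window size of 4, and the mean-quality threshold of 20 are assumptions chosen for the example.

```python
def sliding_window_trim(
    sequence: str,
    qualities: list[int],
    window: int = 4,
    min_mean_quality: float = 20.0,
) -> tuple[str, list[int]]:
    """Scan from the 5' end and cut the read at the first window whose
    mean quality falls below the threshold (the window and everything
    to its right are discarded)."""
    if not sequence:
        return sequence, qualities
    last_start = max(len(sequence) - window, 0)
    for start in range(last_start + 1):
        window_quals = qualities[start:start + window]
        if sum(window_quals) / len(window_quals) < min_mean_quality:
            return sequence[:start], qualities[:start]
    return sequence, qualities


# Hypothetical read whose last three bases have poor quality.
read = "ACGTACGTAC"
quals = [35, 35, 35, 35, 35, 35, 35, 15, 12, 10]
trimmed_seq, trimmed_quals = sliding_window_trim(read, quals)
print(trimmed_seq, trimmed_quals)
```

A window-based rule is less sensitive to a single poor base in an otherwise good read than a per-base cutoff, at the cost of occasionally discarding one or two good bases next to the bad stretch.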

FastP can also trim off the head or tail of the reads (see Figure 2 above), perform a duplication analysis to show how much of the data is duplicated, and report the common motifs present in the data.

3. Alignment—STAR/HiSAT2

FastP-trimmed reads are aligned to the genome using STAR or HiSAT2, both of which are splice-aware aligners. In eukaryotes, mature mRNA is formed by splicing out the introns, so a read derived from mRNA can span the junction between two exons. A splice-aware aligner can split such a read at the exon-intron boundary and continue the alignment at the next intron-exon boundary.
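A spliced alignment can be recognized in the aligner output. The sketch below assumes the standard SAM format, in which the CIGAR operation N marks a region skipped on the reference (an intron for RNA-seq reads); the function name is hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos: int, cigar: str) -> list[tuple[int, int]]:
    """Return (intron_start, intron_end) pairs, 1-based and inclusive,
    for every N operation in a SAM CIGAR string.

    pos is the 1-based leftmost mapped position (the SAM POS field)."""
    junctions = []
    ref_pos = pos
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "N":
            junctions.append((ref_pos, ref_pos + length - 1))
        # Only these operations advance the position on the reference.
        if op in "MDN=X":
            ref_pos += length
    return junctions


# A 100-base read split across a 200-base intron.
print(splice_junctions(1000, "50M200N50M"))
```

Here the first 50 bases align to positions 1000 to 1049, the intron spans 1050 to 1249, and the remaining 50 bases align from position 1250 onward.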

STAR stores the genome as an uncompressed suffix array, whereas HiSAT2 stores its indices in a hierarchical manner (a global FM index together with many small local indices) to align the reads.

STAR alignment scores

In a good sample, most of the aligned reads map uniquely. A large proportion of multi-mapped reads can indicate rRNA contamination

4. Alignment QC—Biotype plot, duplicate reads, strandedness identification

The alignments can specify a plethora of information about the samples and the sequenced data. A few pointers to note are:

  1. In the case of RNA-seq, the majority of the reads should align to the exons.

  2. The major biotype should be protein-coding genes.

  3. Duplicate reads do arise in RNA-seq, and you can choose whether or not to mark them as duplicates.

  4. Strandedness of the data. Some genes have regulatory elements (for example, antisense transcripts) on the opposing strand. Asking the provider to perform stranded sequencing can help identify whether the gene is being expressed or repressed due to the regulatory effect.

The type of sequencing reaction also affects the duplication rates

Biotypes, identified by annotating the alignments, can tell whether the prepared library has captured the protein-coding genes or contains contaminating RNA due to failed library preparation

5. Quantification—featureCounts

Once the alignment QC is done and looks satisfactory, the next step is to use the annotation of the reference organism to count the reads assigned to each gene. These counts form the basis of all the following analyses, including the identification of genes with heightened or reduced expression. This is also the step where the effect of strandedness can be observed.
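The counting step, and the effect of strandedness on it, can be sketched with a toy example. This is a deliberately simplified model, not the featureCounts algorithm: genes are single intervals, a read is counted only when it overlaps exactly one gene, and all names and coordinates are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str


class Read(NamedTuple):
    start: int
    end: int
    strand: str


def count_reads(genes: list[Gene], reads: list[Read], stranded: bool = True) -> dict[str, int]:
    """Count reads per gene; reads overlapping zero or several genes are skipped."""
    counts = {gene.name: 0 for gene in genes}
    for read in reads:
        hits = [
            gene.name
            for gene in genes
            if read.start <= gene.end
            and read.end >= gene.start
            and (not stranded or read.strand == gene.strand)
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


# Two hypothetical genes that overlap on opposite strands.
genes = [Gene("geneA", 100, 500, "+"), Gene("geneB", 400, 900, "-")]
reads = [
    Read(120, 170, "+"),
    Read(420, 470, "+"),
    Read(430, 480, "-"),
    Read(800, 850, "-"),
    Read(1000, 1050, "+"),
]
print("stranded:  ", count_reads(genes, reads, stranded=True))
print("unstranded:", count_reads(genes, reads, stranded=False))
```

With strand information, the two reads in the overlapping region can be assigned to the correct gene; without it, they are ambiguous and are lost from the counts.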

6. Sample correlation and principal components

Once the counts are obtained, the subsequent steps are to interrogate whether there are any clusters observed among the various samples.

Ideally, the correlation and principal components (as a by-product) should tell you whether the sequenced replicates cluster together, and whether there are any batch effects or other covariates to be adjusted for.

A simple Pearson correlation, which measures the strength of a linear relationship, ranges from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation). Replicates of the same condition are expected to have a correlation close to 1.
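A sample-to-sample Pearson correlation can be computed as in the following sketch. The sample names and the (log-transformed) expression values are hypothetical and serve only to show the calculation.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(samples: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Pairwise Pearson correlation between all samples."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


# Hypothetical log-scale expression values for five genes in three samples.
samples = {
    "control_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "control_2": [1.1, 2.0, 3.2, 3.9, 5.1],
    "treated_1": [5.0, 4.0, 3.0, 2.0, 1.0],
}
for name, row in correlation_matrix(samples).items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

In this toy data the two control replicates correlate strongly with each other, while the treated sample is anti-correlated with both, which is the kind of pattern a correlation heatmap makes visible.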

PCA plots can tell which individuals have similar gene expression profiles

7. Comparative analyses—DESeq2: Volcano plots, heatmaps, etc.

Depending on the experiments planned, you prepare metadata stating what the datasets represent: do the samples differ by treatment, population, time point, etc.? Once the objective is understood, you specify the model formula (called the design) in DESeq2, which looks something like:

  1. design= ~ condition + age

  2. design= ~ batch + condition

The thing to note here is that the variables listed first after “~” are the controlling features, and the variable after the last “+” is the feature whose effect is explored. You want to control for things such as sequencing batches, populations, genders, etc., while exploring the impact of the experiment on the condition/age, etc. By default, DESeq2 reports results for the last variable in the design. In the absence of the + sign, the single feature in the design is the one explored.
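To make the role of the design more concrete, the sketch below expands a design such as ~ batch + condition into a numeric design matrix. DESeq2 itself is an R package and builds this matrix internally; the Python function, the column naming, and the choice of the alphabetically first level as the reference are assumptions made for illustration.

```python
def design_matrix(
    samples: list[dict[str, str]], factors: list[str]
) -> tuple[list[str], list[list[int]]]:
    """Expand categorical factors into an intercept plus indicator columns,
    using the alphabetically first level of each factor as the reference."""
    columns = ["Intercept"]
    encoders = []
    for factor in factors:
        levels = sorted({sample[factor] for sample in samples})
        reference = levels[0]
        for level in levels[1:]:
            columns.append(f"{factor}_{level}_vs_{reference}")
            encoders.append((factor, level))
    rows = [
        [1] + [1 if sample[factor] == level else 0 for factor, level in encoders]
        for sample in samples
    ]
    return columns, rows


# Hypothetical metadata: two batches, each with a control and a treated sample.
metadata = [
    {"batch": "A", "condition": "control"},
    {"batch": "A", "condition": "treated"},
    {"batch": "B", "condition": "control"},
    {"batch": "B", "condition": "treated"},
]
columns, rows = design_matrix(metadata, ["batch", "condition"])
print(columns)
for row in rows:
    print(row)
```

The batch column absorbs the difference between batches, so the coefficient on the last column reflects the treatment effect after that difference has been accounted for.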

Heatmap of the top “N” differential genes and volcano plot of the gene expression

Genes in red are upregulated, while genes in blue are downregulated. Genes can also be clustered hierarchically based on expression patterns to see which genes are expressed together

A volcano plot shows the change in expression in the context of its statistical significance. A commonly used significance threshold is 0.05 on the adjusted p-values (false discovery rates). The plot is drawn from the results reported by DESeq2, with the log2 fold change on the x-axis and the negative log10 of the adjusted p-value (false discovery rate) on the y-axis.
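The coordinates of a volcano plot, and a typical way of labeling the points, can be sketched as follows. The fold-change cutoff of 1 (a twofold change), the gene names, and the values are assumptions for illustration; only the 0.05 threshold comes from the text.

```python
import math


def volcano_point(
    log2_fold_change: float,
    padj: float,
    lfc_cutoff: float = 1.0,
    alpha: float = 0.05,
) -> tuple[float, float, str]:
    """Return the (x, y) plot coordinates and a label for one gene."""
    # Guard against log10(0) when a tool reports an adjusted p-value of 0.
    y = -math.log10(max(padj, 1e-300))
    if padj < alpha and log2_fold_change >= lfc_cutoff:
        label = "up"
    elif padj < alpha and log2_fold_change <= -lfc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return log2_fold_change, y, label


# Hypothetical results: gene name -> (log2 fold change, adjusted p-value).
results = {
    "gene1": (2.5, 0.001),
    "gene2": (-1.8, 0.01),
    "gene3": (3.0, 0.2),
    "gene4": (0.3, 0.0001),
}
for gene, (lfc, padj) in results.items():
    print(gene, volcano_point(lfc, padj))
```

Requiring both a significance threshold and a fold-change cutoff excludes genes that change a lot but inconsistently (gene3) as well as genes that change consistently but only slightly (gene4).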

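DESeq2 normalizes the counts using size factors estimated by the median-of-ratios method and offers the regularized logarithm (rlog) and variance stabilizing transformations for visualization and clustering. Other normalization approaches in common use include TMM (trimmed mean of M-values), UQ (upper quartile), etc.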
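Count normalization in DESeq2 is built on size factors estimated by the median-of-ratios method. The following is a minimal sketch of that calculation on hypothetical counts; it is not the package's own code and omits the options that the real implementation provides.

```python
import math
from statistics import median


def size_factors(counts: dict[str, list[int]]) -> list[float]:
    """Median-of-ratios size factors.

    counts maps each gene to its raw counts, one value per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero."""
    n_samples = len(next(iter(counts.values())))
    log_ratios: list[list[float]] = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c == 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Hypothetical counts: the second sample was sequenced twice as deeply.
raw = {
    "gene1": [10, 20],
    "gene2": [100, 200],
    "gene3": [50, 100],
    "gene4": [0, 3],
}
factors = size_factors(raw)
normalized = {g: [c / f for c, f in zip(cs, factors)] for g, cs in raw.items()}
print(factors)
print(normalized)
```

Dividing each sample's counts by its size factor removes the difference in sequencing depth, so that genes with unchanged expression end up with similar normalized counts across samples.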

8. Term enrichments—GO overrepresentation, Reactome and KEGG pathways, etc.

One of the end goals of an RNA-Seq study is generally to understand what the perturbations mean biologically. These can be cell cycle disruption, increased metabolic processes, cell senescence, cell growth, increased trafficking of vesicles, etc. A few tools to perform these analyses are g:Profiler, clusterProfiler, etc.

Reactome, KEGG, etc., are databases built from physiological knowledge of how genes are arranged and act in sequence within a pathway; for less-studied organisms, pathways are transferred from systems in which they are already understood.

GO (gene ontology) is similarly a classification that is applied universally across the tree of life. A GO term is assigned to a gene after the function of the gene has been studied, giving the gene a generalized role.

By identifying genes that share a similar role and examining how their expression is modulated, we can try to understand the impact of these genes on the organism.

KEGG pathways. The colored circles are scaled according to the number of genes involved in the particular pathway. The q value indicates the confidence of the assignment. The rich factor is the ratio of the number of genes observed in the study to the number of genes annotated in the pathway.

In this case, ~20 genes are involved in glutathione metabolism with a q value close to 0, showing that this pathway is significantly impacted.
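Over-representation of a pathway is commonly assessed with a hypergeometric (one-sided Fisher) test, and the rich factor is a simple ratio. The sketch below assumes that test, which the tools named above may implement with additional corrections; the gene numbers in the example are hypothetical, and the resulting p-values would still need multiple-testing correction to obtain q values.

```python
from math import comb


def enrichment_p_value(n_background: int, n_pathway: int, n_study: int, n_overlap: int) -> float:
    """Probability of seeing at least n_overlap pathway genes among n_study
    genes drawn at random from n_background genes, of which n_pathway
    belong to the pathway (hypergeometric upper tail)."""
    total = comb(n_background, n_study)
    upper = min(n_pathway, n_study)
    favourable = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_study - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / total


def rich_factor(n_overlap: int, n_pathway: int) -> float:
    """Genes observed in the study divided by genes annotated in the pathway."""
    return n_overlap / n_pathway


# Hypothetical numbers: 20 of 500 differential genes fall in a 50-gene pathway,
# against a background of 20,000 annotated genes.
print("p-value:", enrichment_p_value(20000, 50, 500, 20))
print("rich factor:", rich_factor(20, 50))
```

A pathway with a large rich factor and a small p-value is one in which a substantial share of its genes changed, more than would be expected by chance for a gene list of that size.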

While KEGG/Reactome/GO are shown as examples, one can also create custom databases and compute the impact of the observed genes accordingly.

References

  1. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270:467-470
  2. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science. 1995;270:484-487
  3. Chu Y, Corey DR. RNA sequencing: Platform selection, experimental design, and data interpretation. Nucleic Acid Therapeutics. 2012;22:271-274
  4. Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008;453:1239-1243
  5. Metzker ML. Sequencing technologies—the next generation. Nature Reviews Genetics. 2010;11:31-46
  6. Wang Z, Gerstein M, Snyder M. RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009;10:57-63
  7. Ozsolak F, Milos PM. RNA sequencing: Advances, challenges and opportunities. Nature Reviews Genetics. 2011;12:87-98
  8. Marguerat S, Bähler J. RNA-seq: From technology to biology. Cellular and Molecular Life Sciences. 2010;67:569-579
  9. Martin JA, Wang Z. Next-generation transcriptome assembly. Nature Reviews Genetics. 2011;12:671-682
  10. Shendure J, Ji H. Next-generation DNA sequencing. Nature Biotechnology. 2008;26:1135-1145
  11. Guo Y, Li J, Li C-I, Long J, Samuels DC, Shyr Y. The effect of strand bias in Illumina short-read sequencing data. BMC Genomics. 2012;13:1-11
  12. Bahassi EM, Stambrook PJ. Next-generation sequencing technologies: Breaking the sound barrier of human genetics. Mutagenesis. 2014;29:303-310
  13. Mikheyev AS, Tin MMY. A first look at the Oxford Nanopore MinION sequencer. Molecular Ecology Resources. 2014;14:1097-1102
  14. Maitra RD, Kim J, Dunbar WB. Recent advances in nanopore sequencing. Electrophoresis. 2012;33:3418-3428
  15. Lee JH, Daugharthy ER, Scheiman J, Kalhor R, Yang JL, Ferrante TC, et al. Highly multiplexed subcellular RNA sequencing in situ. Science. 2014;343:1360-1363
  16. Ayub M, Bayley H. Individual RNA base recognition in immobilized oligonucleotides using a protein nanopore. Nano Letters. 2012;12:5637-5643
  17. Milward EA, Daneshi N, Johnstone DM. Emerging real-time technologies in molecular medicine and the evolution of integrated ‘pharmacomics’ approaches to personalized medicine and drug discovery. Pharmacology & Therapeutics. 2012;136(3):295-304. DOI: 10.1016/j.pharmthera.2012.08.008
  18. Piepenburg O, Williams CH, Stemple DL, Armes NA. DNA detection using recombination proteins. PLoS Biology. 2006;4:e204
  19. Parkinson J, Blaxter M. Expressed sequence tags: An overview. Expressed Sequence Tags. 2009:1-12
  20. Hatey F, Tosser-Klopp G, Clouscard-Martinato C, Mulsant P, Gasser F. Expressed sequence tags for genes: A review. Genetics, Selection, Evolution. 1998;30:521-541
  21. Lopez C, Soto M, Restrepo S, Piégu B, Cooke R, Delseny M, et al. Gene expression profile in response to Xanthomonas axonopodis pv. manihotis infection in cassava using a cDNA microarray. Plant Molecular Biology. 2005;57:393-410
  22. Yonekura-Sakakibara K, Saito K. Functional genomics for plant natural product biosynthesis. Natural Product Reports. 2009;26:1466-1487
  23. Kalia RK, Rai MK, Kalia S, Singh R, Dhawan AK. Microsatellite markers: An overview of the recent progress in plants. Euphytica. 2011;177:309-334
  24. Paterson AH, Bowers JE, Burow MD, Draye X, Elsik CG, Jiang C-X, et al. Comparative genomics of plant chromosomes. Plant Cell. 2000;12:1523-1539
  25. Ewing RM, Claverie JM. EST databases as multi-conditional gene expression datasets. Biocomput. World Scientific. 1999;200:430-432
  26. Ogihara Y, Mochida K, Nemoto Y, Murai K, Yamazaki Y, Shin-I T, et al. Correlated clustering and virtual display of gene expression patterns in the wheat life cycle by large-scale statistical analyses of expressed sequence tags. The Plant Journal. 2003;33:1001-1011
  27. Morozova O, Hirst M, Marra MA. Applications of new sequencing technologies for transcriptome analysis. Annual Review of Genomics and Human Genetics. 2009;10:135-151
  28. Diatchenko L, Lau YF, Campbell AP, et al. Suppression subtractive hybridization: A method for generating differentially regulated or tissue-specific cDNA probes and libraries. Proceedings of the National Academy of Sciences. 1996;93(12):6025-6030
  29. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proceedings of the National Academy of Sciences. 2003;100:15776-15781
  30. Southern E, Mir K, Shchepinov M. Molecular interactions on microarrays. Nature Genetics. 1999;21:5-9
  31. Daudén E. Farmacogenética II. Métodos moleculares de estudio, bioinformática y aspectos éticos. Actas Dermo-Sifiliográficas. 2007;98:3-13
  32. Malone JH, Oliver B. Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biology. 2011;9:1-9
  33. Santos CA, Blanck DV, de Freitas PD. RNA-seq as a powerful tool for penaeid shrimp genetic progress. Frontiers in Genetics. 2014;5:298
  34. San Segundo-Val I, Sanz-Lozano CS. Introduction to the gene expression analysis. Molecular Genetics in Asthma. 2016;2016:29-43
  35. Lowe R, Shirley N, Bleackley M, Dolan S, Shafee T. Transcriptomics technologies. PLoS Computational Biology. 2017;13:e1005457
  36. de Vienne D. Molecular Markers in Plant Genetics and Biotechnology. CRC Press; 2003
  37. Pandey V, Nutter RC, Prediger E. Applied biosystems SOLiD™ System: Ligation-Based Sequencing. In: Next Generation Genome Sequencing: Towards Personalized Medicine. Wiley. 2008:29-41
  38. Edwards M. Whole-genome Sequencing for Marker Discovery. In: Henry RJ, editor. Molecular Markers in Plants. Oxford: Blackwell Publishing Ltd.; 2012:21-34
  39. Schadt EE, Turner S, Kasarskis A. A window into third-generation sequencing. Human Molecular Genetics. 2010;19:R227-R240
  40. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376-380
  41. McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research. 2009;19:1527-1541
  42. Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biology. 2020;21:1-16
  43. Levin JZ, Yassour M, Adiconis X, Nusbaum C, Thompson DA, Friedman N, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature Methods. 2010;7:709-715
  44. Efroni I, Ip P-L, Nawy T, Mello A, Birnbaum KD. Quantification of cell identity from single-cell gene expression profiles. Genome Biology. 2015;16:1-12
  45. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology. 2013;31:46-53
  46. DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire M-D, Williams C, et al. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics. 2012;28:1530-1532
  47. Wang L, Wang S, Li W. RSeQC: Quality control of RNA-seq experiments. Bioinformatics. 2012;28:2184-2185
  48. Ryan MC, Cleland J, Kim R, Wong WC, Weinstein JN. SpliceSeq: A resource for analysis and visualization of RNA-Seq data on alternative splicing and its functional impacts. Bioinformatics. 2012;28:2385-2387
  49. Xu G, Deng N, Zhao Z, Judeh T, Flemington E, Zhu D. SAMMate: A GUI tool for processing short read alignments in SAM/BAM format. Source Code for Biology and Medicine. 2011;6:1-11
  50. Tang S, Riva A. PASTA: Splice junction identification from RNA-Sequencing data. BMC Bioinformatics. 2013;14:1-11
  51. Chen LY, Wei K-C, Huang AC-Y, Wang K, Huang C-Y, Yi D, et al. RNASEQR—a streamlined and accurate RNA-seq sequence analysis program. Nucleic Acids Research. 2012;40:e42-e42
  52. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology. 2013;14:1-13
  53. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, et al. MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Research. 2010;38:e178-e178
  54. Dimon MT, Sorber K, DeRisi JL. HMMSplicer: A tool for efficient and sensitive discovery of known and novel splice junctions in RNA-Seq data. PLoS One. 2010;5:e13875
  55. Wu TD, Reeder J, Lawrence M, Becker G, Brauer MJ. GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality. Springer; 2016. pp. 283-334
  56. Bao H, Xiong Y, Guo H, Zhou R, Lu X, Yang Z, et al. MapNext: A software tool for spliced and unspliced alignments and SNP detection of short sequence reads. BMC Genomics. 2009;10:1-6
  57. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15-21
  58. Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012;28:1086-1092
  59. Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, et al. SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics. 2014;30:1660-1666
  60. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Trinity: Reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature Biotechnology. 2011;29:644
  61. Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, et al. De novo assembly and analysis of RNA-seq data. Nature Methods. 2010;7:909-912
  62. Hardcastle TJ, Kelly KA. baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;11:1-14
  63. Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: An R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010;26:136-138
  64. Anders S, Huber W. Differential expression of RNA-Seq data at the gene level–the DESeq package. European Molecular Biology Lab. 2012;10:f1000
  65. Robinson MD, McCarthy DJ, Smyth GK. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139-140
  66. Beier S, Thiel T, Münch T, Scholz U, Mascher M. MISA-web: A web server for microsatellite prediction. Bioinformatics. 2017;33:2583-2585
  67. Griffith M, Griffith OL, Mwenifumbo J, Goya R, Morrissy AS, Morin RD, et al. Alternative expression analysis by RNA sequencing. Nature Methods. 2010;7:843-847
  68. Singh D, Orellana CF, Hu Y, Jones CD, Liu Y, Chiang DY, et al. FDM: A graph-based statistical method to detect differential transcription using RNA-seq data. Bioinformatics. 2011;27:2633-2640
  69. Drewe P, Stegle O, Hartmann L, Kahles A, Bohnert R, Wachter A, et al. Accurate detection of differential RNA processing. Nucleic Acids Research. 2013;41:5189-5198
  70. Shi Y, Jiang H. rSeqDiff: Detecting differential isoform expression from RNA-Seq data using hierarchical likelihood ratio test. PLoS One. 2013;8:e79448
  71. Li B, Dewey CN. RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:1-16
  72. Nicolae M, Mangul S, Măndoiu II, Zelikovsky A. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms for Molecular Biology. 2011;6:1-13
  73. Glaus P, Honkela A, Rattray M. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics. 2012;28:1721-1728
  74. Martin J, Bruno VM, Fang Z, Meng X, Blow M, Zhang T, et al. Rnnotator: An automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genomics. 2010;11:1-8
  75. Sacomoto GAT, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot M-F, et al. KisSplice: De-novo calling alternative splicing events from RNA-seq data. BMC Bioinformatics. 2012;13:1-12
  76. Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, Stephens R, et al. The DAVID Gene Functional Classification Tool: A novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biology. 2007;8:1-16
  77. Huang DW, Sherman BT, Zheng X, Yang J, Imamichi T, Stephens R, et al. Extracting biological meaning from large gene lists with DAVID. Current Protocols in Bioinformatics. 2009;27:1-13
  78. Subramanian A, Kuehn H, Gould J, Tamayo P, Mesirov JP. GSEA-P: A desktop application for Gene Set Enrichment Analysis. Bioinformatics. 2007;23:3251-3253
  79. El-Metwally S, Ouda OM, Helmy M. New horizons in next-generation sequencing. Next Generation. 2014;2014:51-59
  80. Shen W, Li H, Teng R, Wang Y, Wang W, Zhuang J. Genomic and transcriptomic analyses of HD-Zip family transcription factors and their responses to abiotic stress in tea plant (Camellia sinensis). Genomics. 2019;111:1142-1151
  81. Leisner CP, Yendrek CR, Ainsworth EA. Physiological and transcriptomic responses in the seed coat of field-grown soybean (Glycine max L. Merr.) to abiotic stress. BMC Plant Biology. 2017;17:1-11
  82. Kreszies T, Shellakkutti N, Osthoff A, Yu P, Baldauf JA, Zeisler-Diehl VV, et al. Osmotic stress enhances suberization of apoplastic barriers in barley seminal roots: Analysis of chemical, transcriptomic and physiological responses. The New Phytologist. 2019;221:180-194
  83. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, et al. NCBI GEO: Mining tens of millions of expression profiles—database and tools update. Nucleic Acids Research. 2007;35:D760-D765
  84. Burch-Smith TM, Anderson JC, Martin GB, Dinesh-Kumar SP. Applications and advantages of virus-induced gene silencing for gene function studies in plants. The Plant Journal. 2004;39(5):734-746
  85. Fang Z, Cui X. Design and validation issues in RNA-seq experiments. Briefings in Bioinformatics. 2011;12:280-287
  86. Phule AS, Barbadikar KM, Madhav MS, Senguttuvel P, Babu MBB, Ananda KP. Genes encoding membrane proteins showed stable expression in rice under aerobic condition: Novel set of reference genes for expression studies. 3 Biotech. 2018;8:1-12
  87. Bustin SA, Benes V, Garson JA, Hellemans J, Huggett J, Kubista M, et al. The MIQE Guidelines: Minimum Information for Publication of Quantitative Real-Time PCR Experiments. Clinical Chemistry. 2009;55(4):611-622. DOI: 10.1373/clinchem.2008.112797
  88. Schmittgen TD, Livak KJ. Analyzing real-time PCR data by the comparative CT method. Nature Protocols. 2008;3:1101-1108
  89. Sripathi VR, Anche VC, Gossett ZB, Walker LT. Recent applications of RNA sequencing in food and agriculture. In: Louis IV, editor. Applications of RNA-Seq in Biology and Medicine. London: IntechOpen; 2021
  90. Byron A, Van Keuren-Jensen K, Engelthaler D. Translating RNA sequencing into clinical diagnostics: Opportunities and challenges. Nature Reviews Genetics. 2016;17:257-271
  91. Li X, Wang CY. From bulk, single-cell to spatial RNA sequencing. International Journal of Oral Science. 2021;13:36
  92. Zhao L, Zhang H, Kohnen MV, Prasad KVSK, Gu L, Reddy ASN. Analysis of transcriptome and epitranscriptome in plants using PacBio Iso-Seq and nanopore-based direct RNA sequencing. Frontiers in Genetics. 2019;10:253
  93. Westermann AJ, Barquist L, Vogel J. Resolving host–pathogen interactions by dual RNA-seq. PLoS Pathogens. 2017;13(2):e1006033
  94. Wainberg M, Sinnott-Armstrong N, Mancuso N, et al. Opportunities and challenges for transcriptome-wide association studies. Nature Genetics. 2019;51:592-599
  95. Gardini A. Global Run-On Sequencing (GRO-Seq). Methods in Molecular Biology. 2017;1468:111-120
