Tools for RNA‐seq data analysis.
RNA‐sequencing (RNA‐seq) is the state‐of‐the‐art technique for transcriptome analysis that takes advantage of high‐throughput next‐generation sequencing. Although being a powerful approach, RNA‐seq imposes major challenges throughout its steps with numerous caveats. There are currently many experimental options available, and a complete comprehension of each step is critical to make right decisions and avoid getting into inconclusive results. A complete workflow consists of: (1) experimental design; (2) sample and library preparation; (3) sequencing; and (4) data analysis. RNA‐seq enables a wide range of applications such as the discovery of novel genes, gene/transcript quantification, and differential expression and functional analysis. This chapter will encompass the main aspects from sample preparation to downstream data analysis. It will be discussed how to obtain high‐quality samples, replicates amount, library preparation, sequencing platforms and coverage, focusing on best recommended practices based on specialized literature. Basic techniques and well‐known algorithms are presented and discussed, guiding both beginners and experienced users in the implementation of reliable experiments.
- next‐generation sequencing
- data analysis
- best practices
A transcriptome represents the entire repertoire of RNA content from an organism, a tissue or a cell and it is dynamic, changing in response to genetic and environmental factors. Several approaches have been developed for transcriptome analysis: hybridization‐based (DNA microarray ) or sequence‐based (ESTs—Expressed Sequence Tags , SAGE—Serial Analysis of Gene Expression , CAGE—Cap Analysis of Gene Expression  and MPSS—Massively Parallel Signature Sequencing ). The first sequence‐based methods relied on Sanger sequencing , but with advances in next‐generation sequencing technology (NGS), transcriptomic studies have evolved considerably and RNA‐seq [7, 8] became the state‐of‐art for transcriptome analysis.
RNA‐seq consists of the direct sequencing of transcripts by NGS. Several NGS platforms [9–11] are commercially available nowadays. In general, an RNA set of interest is converted to a library of complementary DNA (cDNA) fragments and sequenced in a high‐throughput manner. Compared to ESTs, RNA‐seq provides better resolution and representativeness, whereas when compared to microarrays, the independence of reference sequences facilitates the discovery of novel genes and isoforms .
RNA‐seq experiments harbors challenges from the experimental design to data analysis. Since a complete comprehension of each step is critical to make right decision, this chapter will encompass essential principles required for a successful RNA‐seq experiment, focusing on best recommended practices based on specialized and recent literature. Basic techniques and well‐known algorithms are presented and discussed, guiding both beginners and experienced users in the implementation of reliable experiments.
2. Experimental design
In order to obtain a successful RNA‐seq experiment, it is critical to have a good experimental design. Despite its importance, a proper planning is not always done. There are many experimental options available, and to fully comprehend each step, it is essential to make right decisions, avoiding inconclusive results. These choices depend on extrinsic (e.g., cost, time, samples availability) and intrinsic (e.g., experimental design complexity, transcriptional variability among tissues, samples and organisms) factors. The amount of available resources is usually the main extrinsic limiting factor driving researchers’ decisions. First, it is necessary to identify the main goal of an RNA‐seq experiment in order to be able to choose the best approach. Qualitative (e.g., annotation) and quantitative (e.g., differential gene expression—DGE) data analyses have some different requirements such as those related to the starting RNA amount, the number and type of replicates, library type and preparation, sequencing platforms, throughput, coverage and depth, and read length. Scotty , RNASeqPower  and RnaSeqSampleSize  are statistical tools designed to aid in the conception of the experimental design, adjusting many of these variables to the main objective and taking into account the financial limitations. A detailed workflow from experimental design to library sequencing is presented in Figure 1.
2.1. Starting sample amount
The necessary starting amount of an RNA sample varies between kits and platforms, and the amount of available RNA is one of the limiting factors for an RNA‐seq experiment. The majority of library construction kits require micrograms of RNA, sometimes limited to high‐quality samples. Takara Bio USA Inc presents some kits for low quantity and/or quality RNA samples: SMARTer Ultra Low mRNA‐seq kits (as little as 1 cell or 10 pg of total RNA), SMARTer Stranded kits (100 pg, regardless of RNA quality) and SMARTer Universal kits (200 pg, regardless of RNA quality). These kits are compatible with both Illumina and Ion Torrent platforms. NuGEN company has also some kits with input RNA levels of 10 pg (Ovation Ultralow Library System V2 and Ovation SoLo RNA‐Seq System) available only for Illumina. For a comparison study of four commercially available RNA amplification kits using low‐input RNA samples, see Ref. .
The variability of an RNA‐seq experiment depends on the organism, the biological question under investigation and the available laboratory techniques, and it can be measured by technical and biological variances. Technical replication consists on the repeated analysis of the same sample to infer the variance associated with the technology, that is, equipment and protocols . If only experimental errors analysis is desired, technical replication is satisfactory. Otherwise, biological replicates are necessary . Three biological replicates are the minimum suggested for any inferential analysis , although the minimum amount required for a reliable RNA‐seq experiment depends on the desired statistical power. For example, in DGE analysis, performing more biological replication is recommended over increasing the sequencing depth [19, 20], and from 6 to 12 biological replicates have been suggested . Biological replication is often preferable to enrich the inferential analysis and increase your statistical power. Statistical knowledge helps to understand the different statistical analysis methods required for different levels of replication [16, 17, 22].
2.3. Sequencing platforms
There are several sequencing platforms available with diverse data formats, throughputs and qualities [9–11]. Two commonly used approaches are sequencing by synthesis (e.g., Illumina, Helicos and PacBio) and ion semiconductor sequencing (Ion Torrent). They can also be classified as clonal amplification‐based sequencing (e.g., Illumina and Ion Torrent) or single‐molecule‐based sequencing (e.g., Helicos, PacBio, Nanopore). For RNA‐seq experiments, the most popular platform is Illumina due to its high throughput and low‐error rates. PacBio has gained attention due to read length increases since its reads can be long enough to recapitulate a full‐length cDNA transcript [23–26]. RNA‐seq approaches can also be combined to take advantage of each method benefits. Further information and comparison studies are available in Refs. [11, 27–29].
2.4. Sequencing depth
The required sequencing depth for RNA‐seq experiments varies over several degrees. Transcripts are expressed at different levels within the cell, and their coverage differs considerably in any RNA‐seq experiment. A deeper sequencing is required to detect low abundance transcripts and rare splicing events, but their relevance can only be assessed with a good biological replication . However, deeper sequencing may increase the detection of off‐target RNA species and the number of false positives in differential expression calls . A correlation between sequencing depth and accuracy demonstrated that as low as one million reads can provide similar information of transcript abundance as more than 30 million reads for highly expressed genes. This result was consistently shown in all six widely used model organisms (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus and Saccharomyces cerevisiae) that represent a wide range of genome sizes . For the majority of human tissue genes, the amount required was about 15–50 million reads . It is noteworthy that there is a point of sequencing depth saturation where a deeper sequencing results in only a small gain of information. More about the impact of sequencing depth on gene detection, gene expression quantification and structural variants discovery can be found in Ref. .
2.5. Read length
Short‐read sequencing is cheaper than long‐read sequencing. RNA‐seq experiments usually make use of short‐reads; however, longer reads can be helpful and more informative. Reads are usually shorter than full‐length transcripts, and a single read may map to multiple positions in the genome stickling expression analysis and transcriptome assembly. Longer read length reduces mapping bias and ambiguity in assigning reads to genomic elements  and improves splicing detection [35, 36] and complex transcriptome analysis [37, 38]. However, some studies question the advantages of long reads sustaining that for humans, there are no substantial improvements in transcriptome assembly quality with reads over 150 base pairs  and in differential expression analysis with reads over 50 base pairs .
2.6. Library type
Standard RNA‐seq library protocols do not retain the strand orientation for each original transcript, making it difficult to discriminate gene expression from overlapping genes. Therefore, it is often desirable to construct strand‐specific libraries [40–42]. There are several strand‐specific protocols available, and they can be performed by two main alternatives. One method consists of marking the second strand by chemical modification, preventing it from being amplified by PCR and leading to the amplification of the first strand only. The deoxy‐UTP (dUTP) approach  is a well‐known example, and it is one of the leading protocols. The other method involves adapter’s ligation in a known orientation in the RNA molecule such as Illumina RNA ligation method . A comparison between seven library‐construction protocols reveals strong differences and substantial variation in the experimental complexity . Stranded RNA‐seq provides more accurate downstream expression analysis, and it is the recommend approach for RNA‐seq studies [40, 42]. Moreover, the dUTP and the Illumina RNA ligation methods were identified as the best overall protocols [40, 45].
The External RNA Control Consortium (ERCC)  has developed a set of 92 polyadenylated synthetic spike‐in controls for normalization and noise reduction of gene expression. ERCC spike‐ins mimic eukaryotic mRNAs and can be added (‘spiked’) equally to each sample prior to library construction . Ambion ERCC spike‐in control mixes (Thermo Fisher Scientific) are commercially available. Sequins, another set of spike‐in RNA standards, can also be used as internal controls and are freely available for non‐profit research upon request . Normalization methods should be carefully chosen to ensure that spike‐in will behave as expected. The R package erccdashboard  and Anaquin  can be used for spike‐in analysis.
3. Sample preparation and library construction
After defining the experimental design, a typical RNA‐seq experiment workflow consists of (i) RNA preparation, (ii) cDNA library construction, (iii) sequencing and (iv) bioinformatic analysis. Each step will be briefly discussed below.
3.1. RNA preparation
Since RNA is more labile than DNA and RNases are ubiquitous and very stable enzymes, special precautions and more stringent working practices should be taken to obtain pure and high‐quality RNA. Best practices can be found at  or spread on diverse companies’ websites such as Thermo Fisher Scientific, Qiagen and Ambion.
In an RNA‐seq experiment, the RNA preparation consists basically of isolation/extraction and enrichment. Many RNA sample preparation techniques and commercial kits are available. No unique method is optimal for every application, and combination of methods may vary depending on the sample type and the study goals. It is always recommended to carefully follow manufacturer’s instructions.
3.1.1. RNA isolation and extraction
In order to isolate high‐quality RNA, the samples need to be processed immediately after harvest. If an immediate isolation is not possible, samples can be stabilized in an intermediary solution to preserve RNA integrity and allow storage. Commonly used stabilizers are RNAlater (Thermo Fisher Scientific and Qiagen) and RNAstable (Sigma‐Aldrich). RNA isolation and extraction methods can be manual (e.g., TRIzol–Thermo Fisher Scientific) or automated (e.g., RNeasy—Qiagen), and different types of samples require different approaches, although all of them comprise: (i) sample solubilization in the presence of detergent and chaotropic agents, (ii) sample homogenization for complete cell disruption and (iii) RNA recovery from the lysate with organic or solid‐phase extraction. It is also important to have a final RNA free of genomic DNA (gDNA) contaminants. Some protocols can carry over some gDNA into total RNA samples that can be removed by a DNAse treatment. gDNA contamination can lead to a counting bias in downstream analysis and can be detected by reads background over the whole genome (false positive signal). Further information about sample preparation techniques and some commercial kits available can be found in Ref. . Different commercial kits demonstrated satisfactory RNA yield, but differences in the quality of extracted RNA were observed, which can interfere on the downstream analysis .
RNA quality can be assessed by gel electrophoresis (agarose or polyacrylamide) or through Agilent Bioanalyzer. RNA quantity can be assessed using spectrophotometer (e.g., Nanodrop), fluorometer (e.g., Qubit) or Agilent Bioanalyzer. No single RNA quantification and quality control method are ideal, and it is necessary to know the limits of each method. We recommend Bioanalyzer since it measures the RNA integrity and level of degradation by the RNA Integrity Number (RIN) score that allows sample quality comparison by a scale with a range from 1 (most degraded) to 10 (most intact) [54, 55]. There is no consensus about the RIN cutoff for sample inclusion or exclusion in a study, but RIN ≥ 6 are commonly acceptable. DGE analysis could be performed even with RIN scores around 4 , but non‐degraded RNA is preferred for a successful transcriptome analysis. It is also important to highlight that some organisms do not present typical rRNAs peaks and cannot be evaluated by RIN value. Most insect RNA shows a cleavage of 28S rRNA into two similar fragments (28Sα and 28Sβ) that comigrate with 18S rRNA depending on pretreatment and electrophoresis conditions. This comigration is due to the disruption of the hydrogen bonds responsible for maintaining the two 28S fragments together. This profile should not be misinterpreted as low integrity and degradation . In these cases, check the overall Bioanalyzer trace. More information about each method and a comparison study can be found in Refs. [58, 59], respectively.
3.1.2. RNA enrichment
The type of the desired RNA molecule drives the RNA enrichment approach. Selection of mature mRNAs by their poly(A) tails is the most common application and can be carried out with magnetic or cellulose beads coated with oligo(dT) molecules or through oligo(dT) priming for reverse transcription (RT). Therefore, since RNAs from formalin‐fixed and paraffin‐embedded (FFPE) are degraded and mRNA‐seq poorly captures degraded mRNAs, it is not an appropriate method to use with FFPE samples , unless adapted protocols are applied such as the recently described protocol based on in vitro T7 transcription for linear amplification of mRNA . In order to surpass this limitation, rRNA depletion protocols have been developed based on hybridization in highly conserved ribosomal regions, including the selective depletion of abundant RNA (SDRNA) with RNase H [61, 62], Ribominus (Thermo Fisher Scientific), Ribo‐Zero (Illumina), GeneRead (Qiagen) and RiboGone (Takara). Another approach is the duplex‐specific nuclease (DSN) normalization by depletion of abundant transcripts, such as rRNAs and tRNAs [63, 64]. Samples can be also enriched of small ncRNAs (e.g., miRNA, siRNA and piRNA) via size‐selection through electrophoresis or based on solid phase extraction with commercial kits such as mirVana (Thermo Fisher Scientific) and miRNeasy (Qiagen). For comparison studies between these methods, see Refs. [42, 65]. rRNA depletion is recommended rather than oligo(dT) because it can capture a complete view of the transcriptome and can be used for low‐quality RNA samples .
3.2. cDNA Library construction
The library construction includes four steps: (i) RNA/cDNA fragmentation, (ii) cDNA synthesis, (iii) adapters ligation and (iv) quantification. Some specific points will be briefly discussed below, but additional information can be found in Refs. [41, 45].
3.2.1. RNA/cDNA fragmentation
The length of your RNA insert is a key factor for library construction and sequencing. Since most current platforms sequence only short reads, most protocols incorporate an RNA or cDNA fragmentation step that allows amplification and sequencing. For short RNAs (under 200 pb), no fragmentation is required. There are three main ways to fragment the nucleic acid samples: physical (e.g., sonication, nebulization), enzymatic (e.g., RNase III, DNase I or Fragmentase) and chemical (e.g., heat, metal ion) shearing. Little information is known about which is the best method for each application. A comparison study of nebulization, sonication and enzymatic digestion showed that all three methods presented equal performance and that fragmentation is indicated . In most cases, RNA is fragmented before conversion into cDNA. Furthermore, it is important to highlight that due to FFPE samples degradation, cDNA fragmentation must be performed instead of RNA fragmentation when using oligo(dT) priming for first‐strand synthesis.
3.2.2. cDNA synthesis
After an adequate RNA preparation, RNA must be converted to double complementary DNA (cDNA) via RT, generating a cDNA:RNA hybrid. This process is known as first‐strand cDNA synthesis and requires an oligonucleotide primer. Three options are available: oligo(dT) priming, random priming or gene‐specific priming. The first two are the mainly used for RNA‐seq. Oligo(dT) priming is one of the oldest methods for first‐strand synthesis and involves oligo(dT) primer to capture the poly(A) tail of mature mRNA. Because of their specificity for poly(A) tails, oligo(dT) priming is not compatible with fragmented RNA, such as FFPE samples, nor for RNAs that lack poly(A) tails, such as non‐mRNAs (e.g., microRNAs (miRNAs)). If using this methodology, cDNA fragmentation must be performed instead of RNA fragmentation. Besides that, RTs are not highly processive polymerases and can prematurely terminate the strand biosynthesis, leading to 3′ end bias and under‐representation of the 5′ ends. Random priming involves oligonucleotides with random base sequences that prime at random positions along the RNA (i.e., no template specificity), and it is preferable to oligo(dT) priming. This approach allows recovery of non‐poly(A) RNAs and prevents 3′ end bias, resulting in a more uniform transcript coverage. However, it was shown that random priming is not completely random leading to a nucleotide bias across the first reads positions [67, 68].
The first‐strand cDNA is used as a template to generate double‐stranded cDNA. Second‐strand cDNA synthesis can be performed by (i) RNA nicking of the RNA template by RNase H and synthesis with E. coli DNA polymerase I and T4 DNA ligase , (ii) using an oligo that is complementary to an adapter located in the 5′ end of the RNA template or by (iii) Clontech’s SMART (Switching Mechanism At 5′ end of RNA Transcript) technology . RNase H method presented a better performance for low‐quality RNA when compared to four other methods (Ribo‐Zero, NuGEN, SMART and DSN‐lite) .
3.2.3. Adapters sequences and ligation
Adapters sequences must be ligated at the ends of every single molecule during library preparation, and this process varies depending upon the sequencing platform. It can contain one or more extra functional elements such as barcode/index to allow multiplexing and a second sequencing‐priming site to allow paired‐end sequencing. The addition of adapter via Y‐adapter PCR is the most commonly used technique. Adapters can also be added via RT/PCR during the first‐ and second‐strand synthesis process or via ligation.
3.2.4. Library quantification
To ensure the maximum yield (i.e., data output) and quality from your RNA‐seq experiment, it is important to have a precise quantification of your NGS libraries. Inaccurate quantification may lead to lower throughput, lower sequences qualities and poor samples balance within your multiplex. There are many ways to quantify your libraries, but the most accurate and effective method is quantitative real‐time PCR (qPCR). qPCR is more sensitive and only quantifies amplifiable DNA molecules (i.e., molecules that contain both adaptor sequence), providing a more precise estimation. Some commercial kits available are KAPA Library Quantification Kit (Kapa Biosystem), GeneRead Library Quant System (Qiagen), Ion Library TaqMan Quantitation Kit (Thermo Fisher Scientific), QPCR NGS Library Quant Kit (Agilent), PerfeCTa NGS Quantitation Kit (Quantabio) and NEBNext Library Quant Kit (New England BioLabs). Other methods are similar to the previously mentioned for RNA quantification: spectrophotometer (e.g., Nanodrop), fluorometer (e.g., Qubit) and Agilent Bioanalyzer. However, since these methods measure total nucleic acid concentrations, including non‐amplifiable DNA, they can lead to inaccurate results. It is also recommended to verify the libraries fragment size distribution, which can be performed by electrophoresis, preferably Agilent Bioanalyzer. Bioanalyzer electropherogram needs to show a narrow distribution with a peak height of the average size fragmentation value. After quantification, the library must be sequenced with the platforms discussed in Section 2.3, and data must be analyzed through bioinformatic tools. RNA‐seq data analysis will be discussed below.
4. Data analysis
RNA‐seq data analysis involves many different strategies that depend on the goals and biological questions established at the time of the study design. A typical data analysis includes quality control, reads preprocessing, alignment to a reference or de novo assembly and downstream analysis such as transcripts annotation, DGE, gene fusion analysis and alternative splicing. In the following topics, we will emphasize common steps and applications of this technology. A detailed workflow for data analysis is presented in Figure 2. Bioinformatic tools discussed in this chapter are compiled at Table 1, and a more exhaustive list of available tools can be found in Ref. . For those with limited access for computational resources or little experience with command‐line execution of these bioinformatic tools, free online (Galaxy ) and commercial (Illumina BaseSpace  and Geneious ) platforms can be very helpful and intuitive.
|Experimental design||Scotty |
|Raw reads quality control||FASTQC |
|Reads preprocessing||Read clipping/trimming||Picard , BamUtil , BamTools , PRINSEQ , Cutadapt , FASTX‐Toolkit , Trimmomatic |
|Paired‐end reads overlapping detection||FLASh , PEAR |
|Reads error correction||SEECER , Rcorrector |
|Unspliced mapping||Hash Table index based||BFAST , MAQ , Mosaik , Novoalign , RMAP , SHRiMP |
|FM‐index based||Bowtie2 , BWA |
|Spliced‐aware mapping||Hash Table index based||GSNAP , RNASEQR |
|FM‐index‐based||TopHat2 , HISAT2 , SOAP‐splice , STAR , RNASEQR |
|Alignment quality assessment||Picard , BamUtil , BamTools , Samtools , Qualimap2 , BAMstats , SAMstat |
|Assembly||Genome‐guided||Cufflinks , Scripture , StringTie |
|De novo||Rnnotator , Trans‐ABySS , Trinity , Oases |
|Assembly quality assessment||Detonate , TransRate , BUSCO |
|Alignment visualization||IGV , Tablet , UCSC |
|Raw read counts||Mapped‐based||featureCounts , HTSeq‐count , RSEM |
|Pseudoalignment||Kallisto , Salmon |
|Raw read counts quality assessment||NOISeq |
|Differential expression||DESeq , DESeq2 , edgeR , CuffDiff2 , BitSeq , Ballgown |
|Annotation||BLAST [145, 146], DIAMOND , InterProScan , tRNAscan‐SE , RNAmmer , Blast2GO , Annocript , TRAPID , Trinotate |
|Enrichment analysis||GSEA |
|Alternative splicing||Cufflinks , Scripture , StringTie |
|Differential alternative splicing||CuffDiff , Ballgown , DEXSeq , rMATS , SpliceR , MISO , DiffSplice |
|Fusion genes||SOAPfuse , FusionCatcher , JAFFA |
|miRNA||miRdeep2 , miReNA , miRanalyzer |
4.1. Quality control and reads preprocessing
A complete pipeline for an RNA‐seq analysis demands some checkpoints in order to ensure the quality of the results and elimination of noise from the biological samples. After sequencing, the analysis starts with files containing the raw reads. The FASTQ  is the standard format used to store the nucleotide sequences along with a per base quality score in Phred log scale. The qualities, typically with scores from 0 to 40, are represented by single letters encoded with pre‐defined ranges of characters from the American Standard Code for Information Interchange (ASCII) table. Currently, there are two patterns: Phred + 64, used in initial Illumina versions 1.3 + and 1.5 + and Phred + 33, the default encoding for Sanger and more recent sequencers. The FASTQ is widely accepted and used in most downstream software, although the unmapped BAM (uBAM) format has been recently encouraged as it is capable of storing important sequencing metadata not present in FASTQ, and for being binary, it demands less disk storage. Some sequencing platforms, like Ion Torrent, have already included uBAM as default output format in their pipelines. Both formats are interchangeable by using Picard , BamUtil  and BamTools .
The first step is to perform a quality control (QC) of the data, checking parameters like amount of reads per sample, general read and base qualities, mean reads length, G + C content, presence of unclipped adapters or PCR primers and unexpected repetitive sequences. This general overview will indicate if library construction and sequencing were properly performed, or if errors like contaminants, poor ribosomal RNA depletion or low sequencing output will demand a new round of experiments. The most common software used to retrieve these basic statistics is FastQC  and PRINSEQ . The first was mainly designed for Illumina, while the later for 454/Roche technology and may be also used for preprocessing. Both programs are available with intuitive graphical user interfaces (GUI), accept other sequencing technologies input files and generate graphical reports, which are very useful for guiding the choice of filtering thresholds.
The following preprocessing step is crucial and can greatly influence the data analysis . Besides PRINSEQ, other tools like Cutadapt , FASTX‐Toolkit  and Trimmomatic  are efficient in preprocessing reads, but FASTX‐Toolkit cannot be used with paired‐end reads. Generally, due to problems inherent in sequencing technologies, the bases in 3’end of reads have lower quality, and one may choose to filter off reads with low mean quality or trim only the low‐quality ends. Trimming in most cases may improve mappability, although shorter reads have a higher probability of erroneous mapping. Therefore, it is recommended to remove short reads in conjunction with non‐aggressive base‐quality trimming to avoid spurious mapping and incorrect inferences [85, 86]. Adapter removal and trimming low‐quality ends improve RNA‐seq assembly, single nucleotide polymorphism (SNP) detection and gene‐expression analysis.
Modern mapping tools (see next section) are capable of labeling the unaligned read ends, a process known as soft‐clipping, without actually removing them (hard‐clipping). There is no consensus on which approach is the best, but it has been considered that keeping as much as information as possible would be better for downstream analysis. For example, the soft‐clipped reads are important for detection of genomic structural variants .
When the goal is to perform RNA‐seq de novo assembly, supplementary tools can be used to join overlapping paired‐end reads, like FLASh  and PEAR . Additionally, base error correction can be applied as an alternative to read trimming and filtering, increasing the amount of useful data and consequently the contig sizes. SEECER  and Rcorrector  were specifically designed for this task. Both strategies will likely improve assembly qualities.
In summary, preprocessing is beneficial, but there is no best tool for any experiment or general rule for filtering thresholds. All software has its own standard parameters, advantages and limitations, being recommended a case‐by‐case analysis and a thorough software comparison.
4.2. Mapping, assembling and visualizing mapped reads
Now that the raw reads have been preprocessed, alternative approaches can be chosen according to the availability of a reference sequence. If present, reads can be mapped to the genome and the gene that originated the transcript from which the reads were derived may be inferred and expression quantified. The genome may also be used to guide transcriptome assembly, resulting in several contigs representing the genes and its isoforms. On the other hand, if the studied species still lacks a reference sequence, reads can be de novo assembled, and transcripts can now be used as a mapping reference.
4.2.1. Mapping to a reference
Mapping reads to a reference can be also seen as a traditional pair‐wise sequence alignment, as observed in common Basic Local Alignment Search Tool (BLAST) , but with the main difference that a vast amount of reads are compared with a database composed of fewer and longer sequences instead of several thousand nucleotides/proteins. This is a field under constant development with plenty of tools available . These tools have to deal with inherent mapping challenges, such as sequencing errors, natural sequence variability like SNPs and indels, reads spanning exon junctions and repetitive regions or pseudogenes in references. To guarantee reproducibility, it is highly recommended reporting alignment parameter details, such as mapper and reference versions and sources, allowed seed mismatches, minimal alignment score and treatment given to multi mapping reads.
Mappers can be roughly divided by the algorithm chosen to create indexes and by the ability to recognize exon‐exon junctions. Indexes have the purpose of making the alignments significantly faster and are mainly divided into Hash Table or compressed prefix or suffix array‐like structures (FM‐index). Their principle is to quickly find small local alignments representing substrings of whole reads—designated as seeds—in the reference and then extend those alignments surpassing a defined quality threshold toward the read ends, assigning a Phred‐based mapping quality score for each read. Unfortunately, most mappers have developed their own mapping quality formulas, creating a non‐uniform mapping qualification. Some well‐known Hash Table‐based algorithms are BFAST , GSNAP , MAQ , Mosaik , Novoalign , RMAP  and SHRiMP , while Bowtie2 , TopHat2 , HISAT2 , BWA , SOAP‐splice  and STAR  are examples of FM‐index based algorithms.
Regarding the splicing events, they can be divided into unspliced and splice‐aware aligners. Most recent mappers are capable of using reference annotation files to deal with known exon‐exon junctions and to predict new splice sites, which is essential when analyzing RNA‐seq data from most eukaryotes. GSNAP, SOAP‐splice, RNASEQR , STAR and TopHat2 are some recommended options for spliced alignments, but for intronless species, miRNA and transcriptomes, unspliced aligners can be used. Comparative evaluations showed that FM‐index‐based mappers are preferable  and that, again, no tool is the best for every performance parameters like speed, alignment yield, exon discovery and accuracy .
The standard alignment output is the Sequence Alignment/Map (SAM) format or its binary version BAM and they are essential inputs for many downstream applications. Picard  and Samtools  are frequently used to manipulate these files. It is advisable to assess the alignment quality from SAM/BAM files with tools like Qualimap2 , BAMstats  and SAMstat  for general characterization or for comparing mappers’ performances.
4.2.2. Genome‐guided assembly
Short RNA‐seq reads represent only a small portion of most transcripts, and therefore, overlaps have to be detected in order to fully reconstruct the original molecules. Paralogous genes, alternative splicing, alternative transcription initiation and termination sites increase the complexity and impose computational challenges in Eukaryotic assembly analysis . For Bacteria, Archaea and lower eukaryotes, the absence or smaller amount of introns makes the assembly more straightforward.
RNA‐seq assemblers greatly differ from DNA‐seq algorithms because a wide range of transcripts coverage is expected, and several gene isoforms can be observed resulting in thousands of contigs stead of ideally one per chromosome. When a good quality reference genome is available, the usual procedure is to use the coordinates of aligned reads to separate them into clusters and perform a de novo alignment individually for each locus, from which individual isoforms can be inferred. Cufflinks , Scripture  and StringTie  are recommended tools, and their algorithm strategies have been reviewed , with StringTie  presenting better transcript reconstruction performance. Paired‐end, strand‐specific libraries and longer reads are highly encouraged for better assemblies and to allow distinction in overlapping transcripts from opposite strands for gene‐dense species and antisense transcription. Genome‐guided assembled transcriptomes can be used to improve gene structures annotation through detection of transcription boundaries and splice‐sites.
4.2.3. De novo assembly
In the absence of a reference sequence or if only a fragmented draft genome is available, overlaps have to be detected from the complete read set in a de novo assembly approach. The independence from a good quality reference and mapping procedures can be also seen as an advantage. The counterpart is that sequencing depth must be obtained in a higher coverage, estimated around 30× , while genome‐guided approach requires about 10× [120, 121] to find full‐length transcripts. The higher throughput increases the processing requirements, so data digital normalization is recommended in order to remove redundancy without impacting the assembly outcome . Although the de novo approach is usually more error prone and computationally intensive, it allows the discovery of novel splicing events, unpredicted genes and exons, chromosomal rearrangements and trans‐splicing. Trinity , Oases , Rnnotator  and Trans‐ABySS  are advised for this task. Whenever possible, a combined genome‐guided/de novo strategy is recommended, as enhanced performance is observed . A comprehensive overview of transcriptome assembly can be found in Ref. . Evaluation of the assembly quality and transcriptome completeness can be assessed with Detonate , TransRate  and BUSCO .
Alignment output SAM files are hard to be interpreted with common text editors, and therefore, a number of graphical browsers have been developed to inspect NGS sequencing data at any specific loci at nucleotide level. IGV , Tablet , Browser Genome  and UCSC  are extremely useful when validating novel transcripts and gene junctions, checking the coverage support for genomic variants and spot read piles, which may represent repetitive regions.
4.3. Downstream analyses
After conducting these general steps, the experiments can be directed to specific applications in order to address the scientific questions, designated as downstream analysis.
4.3.1. Quantification and differential expression
The primary goal of most RNA‐seq projects is to quantify and compare the gene expression under different conditions and infer biological function to differential expression at gene or transcript level. Intra‐sample abundance comparisons were commonly performed with Reads Per Kilobase per Million (RPKM mapped reads) or Fragments Per Kilobase per Million (FPKM mapped reads) metrics. Their principle is to count the amount of raw reads mapped to each genomic feature and normalize considering the gene length and library depth. Although still widely applied, these normalization metrics should be avoided as RPKM has shown to be inconsistent and Transcripts Per Million (TPM) is preferable . Raw reads counting can be obtained with feature counts  and HTSeq‐count , which are capable of detecting multi‐mapping reads, exon junctions and overlapping reference features. NOISeq  can be used to assess the count quality parameters, such as saturation and specificity, in a set of comprehensive plots.
DESeq , DESeq2  and edgeR  packages are recommended for between‐sample comparisons to detect differences in the relative abundances of genes . Quantification at transcript level can be analyzed with Cufflinks  and RSEM  and compared with DESeq2, CuffDiff2 , BitSeq  or Ballgown . Variations in expression between different conditions are usually measured in log2 fold‐change units. DESeq2 can also perform pair‐wise and time series analysis.
Generally, a control set of housekeeping genes should present non‐differential expression and a high between replicates correlation (Spearman R2 ≥ 0.9) observable in Principal Component Analysis (PCA) plots . For a set of 12 or less replicates, at gene level, edgeR or DESeq2 is recommended to detect differential expression and DESeq when more than 12 replicates are available . Thresholds in log2 fold‐change should be applied to increase the true positive and decrease the false positive rates, but this parameter is highly dependent on the amount of biological replicates, varying from 0.1 to 0.5 .
Recently, quasi‐mapping (or pseudoalignment) approaches have been proposed for RNA‐seq quantification, like Kallisto  and Salmon . Their main difference is that reads are assigned to reference sequences without base‐to‐base alignment, making analyses usually considerably faster. They have shown comparable performance over complete mapping‐based methods, can incorporate information from multi‐mapping reads, and provide counts and abundances already as normalized TPM values, which can be used as input for differential expression analysis. These are promising although under development tools.
Although RNA‐seq provides a precise and accurate estimation of RNA abundance, these findings are still widely required to be further validated through quantitative PCR, also known as qPCR or real‐time PCR as it is still considered the gold standard for gene expression quantification. However, it is still questionable whether qPCR validation is still necessary for RNA‐seq studies. High correlation between RNA‐seq and qPCR results has been observed in previous studies [7, 147, 148]. Due to this high consistency, qPCR may be more useful when performed on different biological replicate samples from those already sequenced, confirming the DGE findings and validating the biological conclusions.
In computational biology, annotation is the process of identifying the location and sequence of genomic elements and/or assigning biological function to them. Despite the annotation process being mostly carried over genomic sequences, such as newly sequenced genomes, RNA‐seq data can provide valuable information to improve existing annotations  or create novel transcript annotations for an unsequenced organism .
The major drawback of using genome sequences for annotation is that only features with patterns or conservation with annotated elements, such as open reading frames (ORFs), tRNAs and rRNAs can be inferred from it. On the other hand, RNA‐seq data provide a new layer of information that allows precise identification of pattern less features such as untranslated regions (UTRs), non‐coding RNAs and post‐transcriptional events. Even though some features can be somewhat inferred through DNA sequences, for example, Transcription Start Site (TSS), TATA box/CpG islands and splicing sites, transcriptomic data still provide a more reliable annotation.
Transcriptome assembly, de novo or reference‐guided, often reveals new potential transcripts whose functions are unknown. Before any further step can be made, it is crucial to gather information on these transcripts function in order to extract any meaningful answer.
The most common approach to annotate a transcript is to look for similar known transcripts or protein sequences in large databases. This is usually done using versatile tools like BLAST/BLASTX [150, 151] or DIAMOND  when looking for similar nucleotide or protein sequences. It is often better to perform searches at protein level since it is easier to find homology, as they tend to be more conserved than nucleotide sequences, especially if the study subject has no close species sequenced.
InterProScan  can be used to search for conserved protein signatures. This is especially useful when it is difficult to find full sequence homologs given that the study organism might be too divergent from species sequences available in the database. Protein families often present signature domains that are well conserved even among divergent species, so these signatures can give insights into the putative function of the protein. The process for annotating non‐coding transcripts differs from protein coding transcripts. They usually present poor sequence conservation since their function relies on factors, such as secondary structure, rather than amino acid sequences. Therefore, their annotation process requires specialized software to detect those intrinsic characteristics of a given class of non‐coding transcripts, for example, tRNAscan‐SE  for tRNAs and RNAmmer  for rRNAs.
Given the importance of annotation, there are plenty of tools and pipelines developed to streamline this process. Some annotation tools like Blast2GO  are generic and very user‐friendly, although it requires a paid license to use it. Others like Annocript , TRAPID  and Trinotate  are pipelines developed specifically for annotating transcriptomes. It is important to note that although automatic pipelines often ease and speed up the analysis, it comes at a cost of lesser control of the annotation process.
4.3.3. Enrichment analysis
Functional enrichment analysis is a computational method capable of determining whether a pre‐defined set of genes shows significant differences between samples. The GSEA software from Broad Institute runs the original GSEA algorithm . Although alternative algorithms have been published since then, the original algorithm is still the most widely used. In order to perform an enrichment analysis from RNA‐Seq data, the GSEAPreranked software is recommended and it requires two types of data: a gene set list and a ranked list.
A gene set is a set of genes related to the feature to be tested for enrichment. A variety of features can be tested from general features such as pathways and chromosome location, to more specific features such as cancer signatures or miRNAs targets. Gene sets can be obtained from the Molecular Signatures Database (MSigDB) that comprehends thousands of pre‐defined gene sets, or it can be created by the user.
A ranked list of the genes needs to be provided to test if the chosen gene set is significantly enriched at either end of the ranking. The list can be ranked according to any quantitative feature such as gene expression or fold‐change results from DGE analysis.
4.3.4. Alternative splicing
Alternative splicing (AS) is a post‐transcriptional mechanism present in the majority of eukaryotes that greatly increases the diversity of proteins that can be encoded by a determined genome. This process occurs when particular regions of a gene are included or excluded, through splicing, from the final processed mRNA sequence. AS can occur in several ways, such as exon skipping, intron retention, alternative 5′ donor and 3′ receptor sites [161, 162], analysis of new AS events or patterns is relevant since many traits, especially genetic diseases such as cancers, are related with disorders in splicing patterns that generates aberrant variants [162, 163].
AS analysis by deep sequencing requires splice‐aware programs capable of aligning transcripts reads to a reference genome while performing the difficult task of placing spliced reads across introns by determining the exon‐intron boundaries. A systematic evaluation of splice‐aware alignment programs for RNA‐seq data performed by the RNA‐seq Genome Annotation Assessment Project (RGASP) Consortium  tested 26 RNA‐seq alignment protocols and concluded that, in general, GSNAP , MapSplice  and STAR  compared favorably to other methods. Still, two of this software (GSNAP and STAR) presented many false exons junctions in the output if they were not filtered based on the number of supporting alignments.
Following the alignment step, software like cufflinks , scripture  and StringTie  can be used to perform transcript reconstruction, which can reveal new splicing isoforms evidenced by the alignments. This step usually yields an updated GTF annotation file as output that can be used in subsequent steps.
If data from different conditions are available, differential AS analysis can be performed. With the alignment results (SAM file) and a GTF annotation file at hand, differential exon usage analysis can be performed with DEXSeq  and differential analysis of AS events, such as skipped exon, alternative 5′ and 3′ splice site, mutually exclusive exons, and retained intron events can be performed with rMATS . There are plenty of other software specialized in performing differential AS analysis each one with their advantages and disadvantages, such as CuffDiff , Ballgown , SpliceR , MISO  and DiffSplice .
4.3.5. Fusion genes
Fusion genes or chimeras are aberrant alterations commonly found in tumor cells  that can be useful biomarkers or therapeutic targets . They may originate from chromosomal rearrangements, insertions, deletions and inversions or even by trans‐splicing events. The increasing throughput and reads length from NGS technologies have facilitated their detection and supported the development of several bioinformatic tools . For fusion detection, most and more accurate methods rely on good quality read alignments supporting discordant mappings (read segments aligning to different genes) and both single‐ or paired‐end sequencing, although paired data increase the probability of fusion detection . A recent evaluation defined SOAPfuse , FusionCatcher  and JAFFA  the best tools among 18 options for real and simulated data, and their combination has shown increased performance, albeit high false‐positive rates are still a reality in this field, with space for improvements .
MicroRNAs (miRNAs) are a subset of small non‐coding RNAs, usually 21–23 nt long that play a post‐transcriptional regulatory role in several pathogenic and developmental processes . These molecules are part of an RNA‐induced silencing complex (RISC) containing Dicer, Argonaute and many associated proteins that can cause enhanced decay/cleavage of mRNA target, elongation and ribosomal binding inhibition, thus acting at transcriptional and translational levels .
A common miRNA pipeline follows the same steps as the conventional RNA‐seq: (i) raw data must be preprocessed as previously described where adapters and low quality bases are trimmed with a minimum length filter (e.g., 18–21 nt for miRNAs), (ii) sequences are mapped to a reference (genome, RefSeq, miRBase) and raw counts are estimated, (iii) the raw count of mapped reads is normalized and (iv) downstream analysis is conducted to investigate biologically relevant questions. Due to its small nature, miRNA sequencing analysis has some caveats that require attention especially in steps (ii) and (iii).
The read mapping step is crucial for accurate miRNA abundance estimation, and therefore, the alignment algorithm must be carefully selected and adjusted to deal with its small size. Although a wide range of software are available to perform this task, some aligners are designed and optimized for specific tasks (e.g., SNP calling, splicing detection, gapped alignment) that might not be appropriate for the task at hand . Compared with conventional RNA‐seq, indels and splicing events are usually not relevant to miRNA alignment, and therefore, splice‐aware aligners are not required for this task. To these extent general purpose aligners such as BWA‐MEM, bowtie  and STAR  can be used. Most aligners default settings are set for conventional longer RNA‐seq reads, and since miRNAs are very short, aligners’ parameters should be tweaked. The default seed size for these aligners is longer than miRNA sizes and therefore should be set to a value that is at least shorter than the smallest read size. Given that sequencing errors might occur and the fact that many miRNAs often does not present an exact match with their target, it is recommended to allow at least one mismatch in the seeding and alignment process as well . Also during the mapping step, it is very common to find multi‐mapping reads since we are dealing with very small sequences. Similarly to conventional RNA‐seq, multi‐mapping reads are usually not taken into account for the abundance estimation, since it is impossible to know from where the read was originated. As long as these aligners are properly set, they should yield similar results .
Please note that for the aforementioned pipeline, miRNA annotations or sequences are usually required for raw counting estimation. If annotations are not available for the study subject or looking for novel miRNAs candidates, algorithms such as miRdeep2 , miReNA  and miRanalyzer  can be used to annotate novel canonical and non‐canonical miRNA.
After raw miRNAs abundances are estimated, a normalization step is required in order to remove bias of non‐biological origin (e.g., sequencing depth, sample handling, library preparation). A good normalization technique should reduce those biases without generating noise, so that the remaining differences between samples are truly of biological origin. Previous comparative studies on normalization procedures for miRNA data resulted in conflicting results. A study from Garmire and Subramaniam  supported the use of quantile and Lowess normalization, while Tam et al.  and Dillies et al.  advocated for the use of Trimmed Mean of M (TMM) and Upper quartile (UQ) normalization. Nevertheless, the results from any of these methods and also DESeq2 normalization  method should be highly similar, while other normalization methods such as CPM, total count scaling and linear regression should be avoided since they tend to present higher variance and bias . Several R/Bioconductor packages can be used to normalize the data and also run differential expression, such as edgeR  (TMM and UQ), DESeq2  (DESeq normalization) and limma  (quantile and cyclic loess).
After all these processing steps, the resulting miRNA estimation is ready to conduct downstream analysis. This can be done with useful databases. Being the primary miRNA sequence repository, miRBase  contains several features that may help to investigate the roles for miRNAs of interest, such as annotations for a wide range of species, references links for studies and deep sequencing evidence.
Quantitative trait loci (QTLs) are genomic regions that contain sequence variants that can affect any given trait. Since genome‐wide association studies (GWAS) started , thousands of variants have been associated with complex traits and diseases. The process of assigning variants to a gene is relatively straightforward when variants are located in coding regions that can have a direct effect on a gene product; however, most variants are found in non‐coding regions making difficult to identify the causal genes . By integrating transcriptomic data, it is possible to identify causal genes for non‐coding variants that affect its expression. When the trait in question is gene expression, they are referred as expression quantitative trait loci (eQTLs) that, similarly to other QTLs, are sequence variants capable of affecting the expression level of one or more genes that will ultimately result in different phenotypes. eQTLs can be classified according to the location of the QTL itself and its targeted gene, and according to the mechanism that affects the expression .
Regarding the eQTL‐Gene position, when they are located close to the genes, they influence they are called local eQTLs. Local eQTLs can affect a gene in two ways: in cis (cis‐eQTL) when the variant affects only the gene that is located on the same chromosome and not affecting the copy of the homologous chromosome, thus causing an allelic imbalance; and in trans (trans‐eQTL) when the eQTLs do not affect the target expression directly, but instead affect an intermediate factor that will ultimately affect its target expression. Since the intermediate factor acts equally for both alleles, it does not cause allelic imbalance. On the other hand, eQTLs located further away from their target genes are referred as distant eQTLs, usually act in trans and are harder to find . Several eQTL‐mapping studies published in the past few years showed that many variants often affect gene expression levels of nearby and distant genes [193–197] highlighting the importance of integrating transcriptomic and genomic data.
Despite the mapping process for eQTL analysis being conceptually simple, since this analysis is dealing with allelic specific expression, some caution is required during its counting estimation. For the aligning process, general purpose aligners or variant aware aligners such as GSNAP  can be used. After the alignment, some steps are recommended for retrieving allelic‐specific counts, such as removing duplicate reads that may arise from PCR artifacts. However, it is important that the choice for discarding a duplicate read is not done by mapping score as this might bias toward the reference allele . Also, mapping bias should be controlled by filtering sites with likely bias . Some tools like ASEReadCounter from GATK for allelic‐specific expression implement these filters by default .
The GTEx portal is a valuable resource to study human gene expression and regulation related to genetic variation. It hosts data from several eQTL studies and much information on laboratory and analysis methods for eQTL .
5. Concluding remarks
In the past few years, recent advances in sequencing technologies allowed the cost‐efficient generation of an unprecedented amount of biological information. Similarly, RNA‐seq techniques are under continuous improvements allowing wide range applications and development of high level resolution experiments such as those based on the emergent single‐cell RNA sequencing (scRNA‐seq) field. To couple with this ever increasing data, several tools and pipelines have been constantly developed. The bioinformatics field changes in an astonishing pace, in a way that it is almost impossible to keep up with all the new tendencies, the overwhelming amount of available software and the controversial opinions in the scientific community. For some aspects, it is difficult to find a consensus on the best pipeline to be applied. This chapter goal was to guide RNA‐seq users through its complex steps, providing a brief overview of the complete workflow, highlighting accessible protocols and currently available tools, most of which correlated with supporting benchmark studies.