Open Access book

Transcriptome, the functional element of the genome, is comprised of different kinds of RNA molecules such as mRNA, miRNA, ncRNA, rRNA, and tRNA to name a few. Each of these RNA molecules plays a vital role in the physiological response, and understand‐ ing the regulation of these molecules is extremely critical for the better understanding of the functional genome. RNA Sequencing (RNASeq) is one of the latest techniques applied to study genome-wide transcriptome characterization and profiling using high-through‐ put sequenced data. As compared to array-based methods, RNASeq provides in-depth and more precise information on transcriptome characterization and quantification. Based upon availability of reference genome, transcriptome assembly can be reference- guided or de novo . Once transcripts are assembled, downstream analysis such as expres‐ sion profiling, gene ontology, and pathway enrichment analyses can give more insight into gene regulation. This chapter describes the significance of RNASeq study over array-based traditional methods, approach to analyze RNASeq data, available methods and tools, challenges associated with the data analysis, application areas, some of the recent advancement made in the area of transcriptome study and its application. Genome Project Data Processing The Sequence


Introduction
Completion of the Human Genome Project in 2001 brought with it the realization that while understanding the genome is of great value, our understanding of biology is woefully incomplete without the knowledge of the functional elements of the genome. The functional element of the genome is the transcriptome, which is the set of RNA molecules such as mRNA, rRNA, tRNA, and various small RNAs. A large number of research projects are now focused on the transcriptome rather than on genome and proteome as only 1-2% of genes are coding and 80-90% of the transcribed genes are not translated to proteins. However, these are known to be involved in epigenetic regulation and gene expression regulation [1][2][3][4]. Gene expression is a complex process regulated at multiple levels such as gene transcription, post-transcriptional modifications, and translation. Briefly, complexity at the transcription regulation arises from the presence of multiple Transcription Start Sites (TSSs), which can result in production of multiple transcripts from a single gene [5] and alternate splicing as well as alternate polyadenylation of the primary RNA to produce several different forms of transcripts originating from the same gene [6,7]. Because of different TSSs, eventually each mature transcript will code for different protein [8]. Additionally, noncoding RNAs, which are not translated to proteins, play catalytic and structurally important roles. For example, tRNAs and rRNAs play a critical role in translation, small nuclear RNAs (snRNAs) participate in mRNA splicing, small nucleolar RNAs (snoRNAs) regulate rRNA splicing, guide RNAs (gRNAs) regulate RNA editing, and miRNA are involved in translational repression [9]. Study of the transcriptome provides an understanding of the regulation of gene expression pattern [10], alternative splicing and transcript structure [11], dynamic regulation of transcripts in different tissues [12], and detailed information about the gene regulation in normal and diseased conditions [13].
Transcriptome profiling is typically performed using hybridization or sequencing-based methodologies. Hybridization-based methods involve binding of fluorescently labeled fragments to complementary probe sequences either in solution or on a solid surface, e.g., microarray [14,15]. These approaches, however, suffer from limitations such as low resolution, low specificity, and low sensitivity [16]. Later, Sanger sequencing-based approaches such as SAGE (Serial Analysis of Gene Expression) [17], CAGE (Cap Analysis of Gene Expression) [18], and MPSS (Massively Parallel Signature Sequencing) [19] were developed, but these approaches have serious limitations such as consideration of partial transcripts structure for gene expression and inability to distinguish between isoforms [20]. With the advent of Next Generation Sequencing (NGS), a technology that enables sequencing of millions of nucleotide fragments in parallel, RNA Sequencing (RNASeq) has emerged as a powerful method for studying the transcriptome. Though microarrays are high-throughput and economical, RNASeq offers numerous advantages over microarrays [15]. Some of the key benefits of using RNASeq over microarrays are: a. Genome-wide coverage of transcripts is offered by RNASeq.

b.
No prior knowledge of genome sequence is required in the case of RNASeq as opposed to microarray and hence RNASeq experiment can be performed in the absence of the reference genome.
c. Improved sensitivity and specificity: RNASeq offers enhanced detection of transcripts and differentially expressed genes and isoforms. Moreover, RNASeq is known to be more accurate in terms of fold change detection for both high-and low-abundance genes.
d. Detection of novel transcripts: Unlike microarray, RNASeq enables genome-wide unbiased study and is not dependent on transcript or region-specific probes and hence it investigates both known and novel transcripts.
e. Detection of low-abundance transcripts if sequencing is done at high depth.
f. No or minimal background signal: While mapping the reads to the genome, one can consider reads mapping unambiguously, which results in noise reduction. On the other hand, cross-hybridization increases noise-to-signal ratio in microarrays.
g. SNP detection: RNASeq data can be used for SNP detection especially for highly and medium expressed genes.
Because of its wider detection range, more sensitivity, genome-wide capture of expression profile, and rapidly decreasing cost, RNASeq technology is being preferred over array-based methods for transcriptome profiling. RNASeq has been widely used in the detection of differentially expressed genes between cancerous and normal tissue samples [21], identification of novel gene fusion events in melanoma [22], discovery of several novel miRNAs in cancerous cells [23], identification of differential gene expression and splicing events in Alzheimer's disease [24], identification of differential promoter usage, and higher expression of noncoding RNA in diabetes [25,26]. RNASeq is now being used extensively for transcriptome assembly, thus enabling better characterization of economically important plants such as Garlic [27], Pea [28], Chickpea [29], Rice [30], Olive [31], Wheat [32], and many other plants [33]. Further, combination of molecular biology and biochemical techniques with sequencing has led to the study of different aspects of the transcriptome, such as mRNASeq, miRNASeq, GROSeq, CLIPSeq, NETSeq, PARESeq, and ChIRPSeq (additional information in Table 1). Projects such as ENCODE (ENCyclopedia of the DNA Elements) and TCGA (The Cancer Genome Atlas) have characterized transcriptome of several different human cell lines and tumor samples, respectively, using NGS-based transcriptome profiling. Goal of ENCODE (https://www.encodeproject.org/) is to identify genome-wide transcriptome profile to understand the downstream effects of gene regulation in the human genome. TCGA (www.cancergenome.nih.gov/), which contains information on cancer patient data, aims to understand the mechanism of tumor transformation and progression.

Sequencing), PROSeq
To identify nascent RNAs that are actively transcribed by RNA Pol II [168] ChIRPSeq (Chromatin Isolation by RNA Purification) To discover regions of the genome bound by a specific RNA [169] RiboSeq (Ribosome profile Sequencing) To identify RNAs that are being processed by the ribosome and hence this method helps to monitor the translation process [170] CLIPSeq (Cross-Linking and Immunoprecipitation Sequencing) To identify the binding sites of cellular RNA-binding proteins (RBPs) using UV light to cross-link RNA to RBPs without the incorporation of photoactivatable groups into RNA [171] Transcriptomic Profiling Using Next Generation Sequencing -Advances, Advantages, and

Bioinformatics analysis of RNASeq data
Analysis of the RNASeq data is a multistep process that typically includes quality check, data preprocessing, transcriptome assembly (reference-guided and de novo transcriptome assembly), quantification, statistical analysis, and functional annotation ( Figure 1). These steps are described in details in the following.

Figure 1.
Basic RNASeq data analysis workflow. Firstly, raw sequenced data are checked for the quality and, if required, low-quality reads and artifacts are removed. In the case of reference-based assembly, the reads are mapped to the reference genome in order to know their location. All the mapped reads are then analyzed for expression profiling. Further, differentially expressed genes and isoforms can be annotated using Gene Ontology (GO) and Pathway enrichment analyses. In de novo assembly approach, after preprocessing of the raw reads, transcriptome can be assembled using different de novo transcriptome assemblers. Once transcripts are constructed and abundance estimate is obtained, the complete Open Reading Frames (ORFs) transcripts are predicted. The predicted ORFs can be annotated or analyzed for expression profiling and then annotated using remote homology search method, GO, and pathway enrichment analyses.

Quality check and data preprocessing
Next generation sequencers assign a Phred quality score, which is the probability of the base call being inaccurate, to the called bases. Low Phred scores (Q< 30) indicate read data of poor quality. Poor-quality read data can arise from problems in the library preparation or from sequencing itself. Additionally, PCR artifacts, sequence-specific biasness, untrimmed adapter sequences, and other possible contaminants can lead to poor data quality. These factors can affect the downstream analysis and data interpretation, and can give inaccurate results. In order to assess quality of raw sequenced data several tools such as FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and PRINSEQ [36] are available. Once the data are checked for quality, they should be processed to remove reads with low-quality bases, adapter sequences, and other contaminating sequences. Tools such as Cutadapt [37], Trimmomatic [38], TrimGalore (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), which trim adapter or other contaminants based upon user-provided parameters, can be used for performing these operations. A brief description of some of these quality and data preprocessing tools is provided below: FastQC: FastQC is a simple, easy-to-use tool that evaluates the quality of read data generated from the next generation sequencers. The input file/s for FastQC can be in Fastq, SAM, or BAM format either in the compressed or uncompressed form. FastQC reports basic statistics for the read data such as overrepresented sequences, k-mer content, base quality and content, adapter content, read duplication level, etc. FastQC is available as a stand-alone Java-based program with a graphical user interface and can be run from both Linux (using command line) and Windows systems.
PRINSEQ: PRINSEQ reports base quality, GC content, duplicates, adapters, presence of ambiguous sequences represented as "N," poly A tails, etc. Unlike FastQC, PRINSEQ also has the option of trimming and filtering reads. PRINSEQ is available as stand-alone as well as web application (http://prinseq.sourceforge.net/). It accepts uncompressed files in Fasta, Qual, and Fastq formats.
Trimmomatic: Trimmomatic is a Java-based program for the preprocessing of NGS read data (http://www.usadellab.org/cms/?page=trimmomatic). It can trim contaminant sequences, adapters, and filter reads based upon the quality. It supports compressed files in Fastq format and generates output in Fastq format. Because of its multithreading option, its data processing speed is higher than other tools available to perform the same function. Unlike some of the other tools, Trimmomatic can analyze both single-end as well as paired-end read data.
Cutadapt: Cutadapt is a python-based tool for read preprocessing and can be run as a command line application (https://cutadapt.readthedocs.org/en/stable). It accepts compressed files in Fasta, Qual, and Fastq formats, and supports both paired-end and single-end files. It trims low-quality bases, multiple adapter sequences from either 3', 5', or from both ends. In addition, Cutadapt can remove fixed number of bases from either ends of the sequences and supports demultiplexing, i.e., reads can be written to different output files depending upon the adapter sequence found in the reads. The demultiplexing feature is particularly useful since pooling multiple samples in a single run is an increasingly common practice as a result of increased sequencer throughput.
TrimGalore: TrimGalore is a wrapper tool written around FastQC and Cutadapt for quality check and adapter trimming for regular as well as MspI-digested RRBS-type (Reduced Representation Bisufite-Seq) libraries. It accepts compressed Fastq files and supports pairedend and single-end data.

FASTX-Toolkit:
FASTX-Toolkit is a collection of tools that accepts read data in Fasta and Fastq file formats and trim the data based on base quality and adapter sequence contamination. Additionally, the toolkit has tools that can perform file format conversion, split sequences based upon barcodes, and generate reverse complement of sequences.
Once the read data are filtered and trimmed to remove low-quality bases, adapter sequence, and contaminants, they are ready for transcriptome assembly and profiling analysis. There are two different approaches for constructing full-length transcripts: reference-based assembly (when a reference genome is available) and de novo assembly (when the reference genome is not available), a computationally intensive and complex process ( Table 2). Reference-based or genome-guided assembly refers to mapping sequenced reads to the reference genome followed by assembling the transcripts. In contrast, in de novo transcriptome assembly, transcripts are constructed directly from the overlapping sequenced reads. For the transcriptome assembly of organisms without reference genome, only de novo transcriptome assembly approach is available for transcriptome construction. However, for organisms with known reference genome, both reference-based and de novo transcriptome assembly can be employed for transcriptome construction. In fact, in this case, de novo transcriptome assembly will be more effective in filling in the gaps (observed due to variation in reference genome sequence and poor-quality annotation) and hence would complement the reference-based transcriptome assembly. More details on these two transcriptome assembly approaches are discussed in the following sections.

Reference-based assembly de novo assembly
Reference genome is required to assemble the transcriptome

Reference-based transcriptome assembly and profiling
Typically, in a reference-based transcriptome profiling study, the computational workflow starts with aligning the quality-checked reads to the reference genome or transcriptome using a suitable read aligner. The aligned reads are then used to quantitate the genomic features (genes/isoforms). The quantity of the features needs to be normalized before comparison between different experimental conditions. The normalized feature counts are then used for drawing statistical inference on their difference in expression between samples under study. Finally, the differentially expressed set of genes is processed to derive biological insights relevant to the experimental setup. The success of this analysis depends very much on decisions that the user takes while choosing reference genome, annotation, tools, and associated parameter values at every step of the analysis. Steps involved in reference-based transcriptome assembly and analysis are described below.

Choice of reference build and annotation file
Reference genome and annotation files of a large number of species are available from a number of publicly available resources. Three of the most widely used resources are Ensembl (http://www.ensembl.org), the National Center of Biotechnology Information (NCBI; ftp:// ftp.ncbi.nih.gov/genomes), and UCSC genome browser (http://genome.ucsc.edu). Ensembl is jointly headed by the European Molecular Biology Laboratory -European Bioinformatics Institute (EMBL-EBI) and the Wellcome Trust Sanger Institute (WTSI). Ensembl generates genome annotation for vertebrates and other eukaryotic species, and the information is made freely available to the research community [39]. According to the latest Ensembl release 81, a total 23,636 genomes from 4,991 species are available. The NCBI also hosts genome sequence annotation data of over 1000 organisms including bacteria, archaea, eukaryote, viruses, phages, viroids, plasmids, and organelles. The UCSC genome browser is maintained by the UCSC Genome Bioinformatics group and provides data for over 90 organisms that belong to vertebrates, deuterostomes, insects, nematodes, yeast, viruses, and others [40]. In addition to the aforementioned data resources, Genome Reference Consortium (GRC), comprising of WTSI, the Genome Institute of Washington University (TGI), EBI, and NCBI ensures that the human, mouse, and zebrafish, and the genome assemblies of other model organisms are continuously updated and properly maintained.
Irrespective of the source, it is always recommended to use the latest genome sequence and its annotation. Zhao et al. [41] demonstrated that the choice of a gene model (annotation information/annotation catalog) has a dramatic effect on both gene quantification and differential analysis. We would recommend using Ensembl as it provides more detailed annotation of the genomic features.

Choice of read aligner
One of the most challenging parts of RNASeq analysis is mapping the sequencing reads to the genome correctly, especially for eukaryotes where presence of splicing events adds to the complexity. Multiple aligners, which can be divided into two categories, are available for aligning short-reads to the genome: 1. Non-spliced aligners: These aligners do not handle splicing events and are therefore suitable for prokaryotic RNASeq analysis only.

2.
Spliced aligners: These aligners can place spliced reads across introns and determine exon-intron boundaries. Therefore, these are preferred for eukaryotic RNASeq analysis.
The non-spliced aligners can be further classified, on the basis of the algorithm used, into two categories: • Hash table-based aligners: This set of aligners uses a seed sequence to identify alignment candidates, which are then either extended or discarded using more precise dynamic programming alignment algorithms. These aligners can be further divided, based upon the approach of finding a seed, into two groups: a. Reference indexing: Aligners create index using reference genome. Examples include BFAST [42], Novoalign (http://www.novocraft.com), GNUMAP [43], SHRiMP2 [44], Mosaik [45].
Example of spliced aligners include GSNAP [53], MapSplice [54], SpliceMap [55], STAR [56], and TopHat2 [57]. GSNAP can identify a splice site in two ways: first, by evaluating the surrounding genomic sequence using probabilistic models of donor and acceptor splice site; second, by utilizing user-provided database of known exon-intron boundaries, which improves the sensitivity and specificity of the tool. Both MapSplice and TopHat2 use a twostep algorithm where in the first step potential splice sites are detected, which are then used in the second step to find correct map of reads. MapSplice is a de novo spliced aligner, whereas TopHat2 can perform both de novo and gene-annotation-based splice alignment. TopHat2 incorporates Bowtie1 or Bowtie2, in the back-end, for initial alignments. SpliceMap is also a de novo splice aligner, which is highly sensitive and specific in finding novel splice junctions without using any existing gene model information in arbitrary RNASeq read lengths. Another splice-aware aligner, STAR, utilizes sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. It has been evaluated to be the fastest aligner among the above-listed spliced aligners with lowest false-positive rate at high sensitivity [56]. However, its RAM requirement is higher as compared to its counterparts.
The latest addition to the list of spliced aligners is HISAT (Hierarchical Indexing for Spliced Alignments of Transcripts) [58], which is claimed to be the fastest aligner currently available.
The reason for this highly efficient system is believed to be the indexing scheme it utilizes. As compared to its counterparts, HISAT uses two different types of indexes instead of a single index: (i) a whole-genome FM index to anchor each alignment, and (ii) numerous local FM indexes for very rapid extension of these alignments. HISAT is 50 times faster than TopHat2, 12 times faster than GSNAP, and slightly faster than STAR [56]. In addition, HISAT requires comparable amount of RAM as TopHat2 but maximum 20% of RAM as GSNAP or STAR needs. Similar to TopHat2, HISAT also uses Bowtie2 in the back-end. Furthermore, it is the only aligner that can work directly on an SRA file, which eliminates the sra to fastq file conversion requirement.
Considering the options available, selecting the right aligner is a nontrivial task and there are several publications comparing the read aligners. Fonseca et al. [59] published a feature-level comparison of 60 mappers and highlighted the difficulties in determining the best aligner (in terms of accuracy and speed). Other comparative studies include one by Lindner and Friedel [60] on non-spliced aligners and another by Engstrom et al. [61] on spliced aligners.
Answers to the following questions may help to choose a suitable aligner: 1. Does the genome sequence belong to a prokaryote (where a gene lacks intron) or eukaryote (where a gene has introns)?
If the genome is bacterial (example of a prokaryote), then computationally intensive splice aligners such as TopHat2 or STAR are not required. In this case, non-splice aligners such as Bowtie1, Bowtie2, or BWA are more appropriate because of the contiguous read mapping to the reference genome. On the contrary, for eukaryotic genomes such as human/mouse, where the reads will span an exon boundary and therefore a part of it will not map contiguously on the reference genome; it is better to use a splice aligner that can identify splice sites.

2.
Are the sequence data available in base space or color space format?
If the data are generated from a SOLiD sequencing platform, they will be in color space format and almost all recently developed tools do not support color space data. In this case, the only available options are aligners such as BWA (older than 0.6.x), Bowtie1, and TopHat2.

Does the aim of RNASeq experiment include calling variants in transcripts?
In experiments where the aim is to find variants in transcripts, mapping quality plays a crucial role, and hence it is advisable to use only aligners that provide accurate mapping quality. BWA and STAR aligners are suitable for this purpose; however, Bowtie 1 is not because it does not assign appropriate quality score to the mapped reads.
Additionally, one should also consider the comparative precision and recall statistics, CPU, and RAM requirements of the aligners. In addition to the aligners used, the data type itself plays a critical role in the quality of mapped data. For example, paired-end information improves mapping accuracy and, therefore, paired-end data are favored over single-end data for RNASeq experiment.
The aligned read data generated from aligners mentioned in the previous section are stored in Sequence Alignment/Map (SAM) file format, which is a gold standard to store alignment data. The SAM format has been created by the SAM/BAM format specification working group (https://samtools.github.io/hts-specs/SAMv1.pdf) for standardizing the format in which aligned data are stored. A SAM file contains information about the reference sequence name, query sequence name, alignment position and direction of the read on the genome, mapping quality, etc. However, SAM files are typically very large; hence, these files are converted into binary counterpart known as BAM (Binary of SAM) files. This is done typically using SAMtools [62], which provides a set of programs to manipulate the alignment files. Alignment files can be further manipulated with utilities such as SAMtools and Picard (http://broadinstitute.github.io/picard/) to efficiently retrieve reads and regions of interest.

Choice of annotation source
Depending upon the biological question of interest, one may wish to perform expression study either on known transcripts only, as per a given annotation catalog, or on reconstructed transcriptome built using a known reference annotation. This enables the quantification of novel genes/isoforms in addition to the known ones. In the first case, the mapped reads and the annotation catalog can be used to assign read counts to each feature (genes/transcripts) using a tool like htseq-count [63], and then perform statistical analysis to identify the differentially expressed genes/isoforms. In the second case, transcriptome reconstruction is required prior to differential expression analysis. It requires assembly of reads into transcription units using either the reference-based or de novo assembly approach. Given a reference genome and an annotation catalog, there are tools such as Cufflinks [64,65] that first map all the reads to the genome and then use spliced reads directly to reconstruct the transcriptome. It generates a GTF file that contains the assembled isoforms along with isoform-level relative abundance in Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) units [65].

De novo transcriptome assembly
Building a transcriptome using de novo methods is a powerful way to create the transcriptome of a divergent or novel species. Mainly three features affect the quality of assembled transcripts: a) type of transcript: presence of repeats, polymorphisms, splicing event, complexity of organism, e.g., ploidy level, GC content; b) sequencing technology: library preparation, sequencing accuracy; c) bioinformatics workflow: assembly algorithms and annotation. Currently available de novo assemblers have different sensitivity, and specificity in terms of transcript identification are error-prone, and lead to fused transcripts, splicing errors, and gaps [66]. In order to enhance the sensitivity and specificity one can take the combined approach, which employs de novo assembly method with reference-guided approach.

De novo assembly approaches
There are several algorithms available for de novo transcriptome assembly (Table 3). In de novo transcriptome assembly, contigs or transfragments are created from overlapping reads. Process of assembly involves either de Bruijn graphs construction using k-mers or overlaplayout-consensus (OLC) approach for short and long reads, respectively [67].  Overlap-Layout-Consensus (OLC) approach: OLC approach was initially developed for reconstruction of the genome from Sanger sequence and EST (Expressed sequenced tag) data. As the name suggests, in the OLC approach, the read data are searched for overlapping sequences and merged to create longer reads. Depending on the volume of data and complexity of genome (e.g., repeats), the OLC approach is computation-intensive. Some of the OLC-based assemblers are MIRA [68], Newbler (from Roche/454 Life Sciences), and CAP3 [69]. The assemblers using the OLC approach are more suitable for small volume of data, not sensitive to repeat region detection and resolution, and cannot handle the high-depth short read data generated from sequencers such as Illumina. The Eulerian path assemblers, which are based on de Bruijn graph algorithms [70], are more suitable for the high-depth short read data and are discussed in detail below.

De Bruijn-graph-based approach:
De Bruijn graph is a mathematical graph that uses a substring of letters (here nucleotides) of length k to represent nodes. Nodes are connected if shifting a substring by one nucleotide creates an exact k-1 overlap between the nodes [70]. De Bruijn graph can be created for both small as well as large sequences. Based upon the defined k-mer (a nucleotide substring of length k) length, reads are broken in k-length to generate substrings. Using these substrings, de Bruijn graph is generated in which each unique substring represents a node (or vertex) connected with overlaps between the last k-1 nucleotides of the previous sequence with the first k-1 nucleotides of the subsequent sequence [71]. Identical overlaps of k-mers are merged and counted while creating the graph. If the assembler finds differences in the nodes, the graph is branched. Upon subsequent identity and overlap in the nodes, the graph will join the ends. Presence of single nucleotide difference between the sequence data gives rise to bubbles in the graph. In the case of RNASeq data, occurrence of large bubbles and open-ended branches in the graph suggests presence of alternative splicing and alternative transcription start and end.
Occurrence of small bubbles can be due to single nucleotide variation or sequencing errors [72].
In most of the de Bruijn-graph-based assemblers, the preferred value of k-mer is usually an odd number in order to avoid reverse complement of k-mers. The chosen size of k-mer has great impact on the assembly process as using a large k-mer can result in a unique de Bruijn graph, but this approach is computationally intensive. On the other hand, using small k-mers can result in a fragmented assembly. According to some of the previous studies it has been observed that smaller k-mers can be useful in more accurate transcriptome assembly of lowly expressed genes whereas larger k-mers perform better for abundant transcripts [73][74][75]. It is therefore essential to identify the optimal k-mer for the sequence being assembled and it depends to a large extent on the read length, sequencing depth, sequencing error rate, and the complexity of the genome. Additionally, using directionality of the read from paired-end data, assemblers can generate more accurate assembly as compared to single-end data [76]. Some of the most commonly used de Bruijn-graph-based assemblers are: Velvet/Oases [74,77], Trinity [78], Trans-Abyss [79], SOAPdenovo-Trans [80].
Oases: Oases has a set of algorithms that post-processes the assembly generated by Velvet at different k-mers such as dynamic filtering of the noise, resolution of alternative splicing transcripts, and merging of the multiple assemblies generated using different k-mers (www.ebi.ac.uk/~zerbino/oases/). Data generated from different k-mers are merged to generate a complete assembly. Oases works well for the correction of errors and resolution of repeats in the case of paired-end data.
Trinity: Trinity uses three steps to produce transcriptome assembly: inchworm, chrysalis, and butterfly. Inchworm builds initial sets of contigs using k-mer graphs. Chrysalis groups these contigs and builds de Bruijn graphs from them. Butterfly simplifies and resolves the graphs to generate the final set of transcripts containing spliced variants and isoforms.
Trans-Abyss: Trans-Abyss considers multiple assemblies generated from Abyss to optimize the assembly and can tackle varying coverage of the transcripts very well.
SOAPdenovo-Trans: SOAPdenovo-Trans is derived from the genome assembler, SOAPdeno-vo2 [81] and is known to construct transcriptome faster than the above-mentioned assemblers.

Choosing the transcriptome assembler
Choosing an assembly algorithm is difficult as it depends on a number of factors such as read type, length, and complexity of the genome. Some instrument vendors such as Roche provide assembly algorithms, e.g., Newbler, which can handle the long read data and the homopolymer issue frequently observed in the data generated from 454. A recent study using peanut plant RNASeq data suggests that performance of Trinity is better than TransAByss and SOAPdenovo-Trans when raw reads are mapped to the reconstructed assembly of the polyploidy transcriptome [82]. Another study suggested use of multiple k-mers and clustering of k-mer assemblies and at the same time identifying unique contigs from each assembly for effective extraction of biological information from transcriptome assembly [83].

Assessing quality and accuracy of de novo assembled transcriptome
Because of sequencing errors and presence of repeats in the genome, it is hard to achieve a perfect assembly. Moreover, different assemblers use different heuristic approaches to assemble the transcriptome, which results in different number of identified transcripts.
Quality and accuracy of assembled transcriptome are assessed in several different ways [84,85]: 1. Assembly statistics: Most algorithms generate an assembly statistic that includes the number of contigs/transfragments generated, total contigs/transfragments length and singletons, size of the assembly (in number of nucleotides), percentage of reads assembled to transfragments, percent GC content, etc. Assembly statistics provide overview of the organisms' transcriptome.

2.
Transfragments/contigs statistics: This statistics includes lengths of the largest and shortest transfragments, average and median length of transfragments, and N50 of assembled transcriptome. N50 of the assembly is calculated by sorting the contigs in descending order and the size of the contig that makes the total greater than or equal to 50% of the genome size is regarded as the N50 value. A large N50 is indicative of a more contiguous assembly.

3.
Mis-assembly and variations: Some of the major reasons for mis-assembly of the transcriptome are presence of ambiguous bases, repeat regions, insertions, deletions, SNPs, and chromosomal rearrangements in the transcriptome. Percentage of mis-assembled contigs can be calculated by mapping the contigs back to the reference genome. QUAST, a tool, generates consolidated report on mis-assembly statistics [84].

4.
Number of transfragments matching with the closest reference genome: Once transcripts are assembled, it can be compared against a closely related species/genome. Assembly is considered to be of high quality if the number of reference transcripts matching with the transfragments is high. However, the genes that are not expressed, or lowly expressed, might not be captured.

5.
Hybrid or fused transcripts: Hybrid transcripts result from joining of two or more different transcripts and hence matching to different locations of the genome. Reasons for hybrid transcript generation are sequencing error, improper trimming of the adapter/contaminant from the raw read, similarity of the transcripts, assembly algorithm's parameters, etc. Low number of hybrid transcripts reflects better assembly.

Choice of expression unit: CPM, RPKM, FPKM, TPM, or read count
Once the read data is aligned to the reference genome, the gene expression can be quantitated by read counting at exon, transcript, or gene-level. Here are few possible expression units: a. Read Count: read counts are number of reads overlapping a genomic feature such as a gene or transcript.
b. CPM (Counts Per Million mapped reads): CPMs are read counts scaled by the number of fragments sequenced times one million. This unit is used in a differential expression analysis R package edgeR [86].
c. RPKM (Reads Per Kilobase of transcript per Million): RPKM for a feature is computed by dividing the number of read counts by it length and total number of reads sequenced, followed by multiplication with one billion [12]. Applicable only for single-end data.

d. FPKM (Fragments Per Kilobase of transcript per Million)
: similar as RPKM. But takes into account a fragment (not reads) [65]. For pair-end data, there will be two reads for a single fragment of genome while for single-end data, there will be one read for a single fragment. Both the situations will add only one count.
e. TPM (Transcripts Per Million): TPM for a transcript is calculated by dividing the ratio of its read counts over its length by the summation of ratios for all the transcripts, and multiplying with one million [87]. Especially for transcript abundance.

Why should one normalize the expression data?
RNASeq experiments have multiple sources of systematic variations introduced through intersample differences such as difference in library size (sequencing depth) or unwanted variations due to batch effects such as sampling time or different sequencing technology [12] or through intra-sample differences such as difference in read length [88] or GC content between genes [89,90]. These variations, if ignored, can dramatically reduce the accuracy of statistical inference and hence should be removed or controlled during statistical analysis. Therefore, read count and FPKM of a feature, as calculated for example by htseq-count and Cufflinks, respectively, may not be appropriate to compare across features and samples without normalization.
Normalization is a process that aims to ensure that expression estimates are comparable. There are a number of normalization methods, such as: a. Total Count: each read count of a feature expression is divided by total number of mapped reads in that sample and multiplied by the average total count across all the samples.
b. Upper Quartile: each feature expression is divided by the upper quartile of expression values, other than 0, in that sample and multiplied by the average upper quartile across all the samples [91]. Upper quartile for FPKMs or fragment counts has been implemented in Cuffdiff2 tool from Cufflinks suite [92].
c. Median: each feature expression is divided by the median of these expression values (other than 0) in that sample and multiplied by the average median expression across all the samples.

d.
Quantile: the distribution of expression values for each sample is made identical [93]. Quantile method is available in R package limma [94].
e. Trimmed Mean of M-values (TMM): TMM normalization factor for each sample is computed as the weighted mean of log ratios between a test sample and a reference sample after excluding the features with highest expressions and features with largest log ratios. These factors are rescaled by the mean of normalized library sizes. Finally, each feature expression value is divided by these rescaled normalization factors to get the normalized expression [86,95]. TMM method has been implemented in R package edgeR [86].
f. Median of ratio: the normalization factor for each sample is computed as the median of ratios of expressions of features over their geometric means across all samples. Finally, each feature expression is divided by this factor to get the normalized expression [96]. Median of ratio has been implemented in R packages DESeq [96], DESeq2 [97], and in Cuffdiff2 [92].
Several publications [98,99] comparing normalization methods suggest that median of ratio is the best method for normalization in differential expression study for mRNASeq experiment.
In addition to normalization methods, several packages have been developed to control batch effects, for example, svaseq [100]. svaseq can work on both, count-based data (e.g., htseq-count generated data) as well as FPKMs (Cufflinks generated data).

Differential expression analysis
Differential expression analysis helps identify genes that are important in the experimental conditions being tested and hence is the most routine analysis performed using the RNASeq data. In RNASeq data, a linear relationship has been observed between the number of reads that map to a transcript and the abundance of the transcript [12]. The goal of differential expression analysis is to compare these read counts for a feature between distinct sample groups and perform a statistical test to determine whether the difference is significant. For this purpose, a distribution is required to be fitted to the count data using generalized linear model (GLM). Based upon the assumption that reads are independently sampled from a population with a given, fixed fractions of genes, it can be said that the read counts will follow a multinomial distribution. This multinomial distribution can be approximated by the Poisson distribution and therefore Poisson distribution has been used to test differential expression in several studies [101][102][103]. But it has been found that this distribution predicts smaller variations than what is seen in the data. To overcome this issue, negative binomial (NB) distribution and beta negative binomial distribution were proposed. NB has been used in several differential expression tools such as edgeR [86], DESeq [96], DESeq2 (an enhanced version of DESeq) [97], and BaySeq [104]. Though these tools use a common distribution, the method of variance (dispersion) estimation differs, which affects the final outcome of the analysis. Cuffdiff2 uses beta negative binomial distribution to fit fragment counts [92].
Recent advances in this area of research suggest that a combination of Poisson distribution and NB distribution may yield better results. Chen et al. [105] derived a novel algorithm XBSeq from DESeq, where they used Poisson distribution to fit read counts that map to nonexonic regions (considered as sequencing noise) and used NB distribution to fit read counts that map to exonic regions (considered as true signals).
Recently, limma [94], a well-known R package for performing differential expression analysis of microarray data, has been empowered with RNASeq data analysis ability. It does not use the above-mentioned distribution, rather converts count data (or normalized count data) to log-counts per million using voom transformation, then fits a linear model to this data and performs differential expression analysis using an empirical Bayes method.
There is no clear evidence as such about the best tool for differential expression analysis; however, multiple studies comparing available methods have been performed. Soneson and Delorenzi [106] evaluated and compared eleven methods for differential expression analysis on simulated and real RNASeq data, whereas Seyednasrollah et al. [107] compared eight widely used tools on real data sets. Both the studies concluded that no single method is optimal under all circumstances. Soneson and Delorenzi [106] observed that limma performed well under many different conditions and Seyednasrollah et al. [107] found limma and DESeq as the preferred choice. Additionally, these studies have suggested that the method of choice should depend on the experimental conditions that include the number of samples per condition.

Annotation of de novo assembled transcriptome
In addition to transcriptome abundance calculation after mapping the assembled contigs/ transfragments to the assembled transcriptome or reference genome and differential expression data analysis, coding regions within de novo assembled transcripts can be searched using ORF predictor tools such as Transdecoder (http://transdecoder.github.io/). Further, homologous gene/protein identification of assembled transcripts can be done using tools such as BLAT and BLAST [108].

Making sense of the differentially expressed gene list
List of differentially expressed genes is just the first tangible outcome of an RNASeq experiment. In order to derive biological insight from this list of genes, it is important to identify functional categories of the genes that are differentially expressed and the biological pathways that are enriched as a result of these differentially expressed genes. In order to do so, enrichment analysis is typically performed using publicly available resources such as GO (Biological Processes and Molecular Functions) databases [109], KEGG pathways [110], BioCarta (www.biocarta.com), and Reactome [111].
In a review article, Khatri et al. [112] elaborated the current approaches of pathway analysis and their challenges and divided the existing approaches into three generations:

a. First Generation: Overrepresentation Analysis (ORA) approach
This approach statistically evaluates the fraction of genes, among the set of differentially expressed genes, in a particular pathway. There are many tools that follow this approach, for example, Onto-Express [113], GenMAPP [114], GoMiner [115], and DAVID [116,117]. However, this approach has certain limitations. For example, it does not consider the fold change values associated with the genes, thereby ignoring the extent of regulation. Moreover, it does not consider the gene product interactions that are found in a pathway. This approach also ignores the dependency between the pathways.

b. Second Generation: Functional Class Scoring (FCS) approach
This approach addresses few limitations of ORA. It considers all the genes and their expression for pathway enrichment, so as to take into consideration the coordinated changes (irrespective of the magnitude) unlike ORA where only differentially expressed genes were considered and that too without considering their expression levels. Example of such tools include global test [118], GSEA [119].
But this approach too has some limitations. Similar to ORA, this approach ignores the dependency between the pathways and the interaction between gene products in a given pathway.

c. Third Generation: Pathway Topology (PT)-based approach
To overcome the limitations of ORA and FCS, the Pathway-Topology-based approach has been devised. It uses pathway knowledgebase to include pathway topology information for enrichment analysis [112]. This information includes genes that are interacting, their mode of interaction (e.g, activation, inhibition), and their location of interaction (e.g, cytoplasm, nucleus). SPIA [120], an R package, is an example of this category of pathway analysis approach, which combines evidence of pathway overrepresentation and unusual signaling perturbations. NetGSA [121] is another method in this category that takes into consideration the change in correlation as well as the change in network structure as experimental condition changes. However, in the absence of high-resolution knowledge databases that can provide knowledge for all conditions, tissue-and cell-specific functions of a gene product; the true pathway topology is rarely inferred. And hence this restricts a researcher to investigate the dynamic states of a system [112].
We have recently developed SanGeniX (www.sangenix.com), an easy-to-use client-serverbased NGS data analysis application with a highly intuitive user interface (manuscript under preparation). SanGeniX supports primary, secondary, and tertiary analysis of sequence data from Illumina, Ion Torrent, SOliD, and PacBio RS. SanGeniX integrates multiple robust and validated algorithms in the form of predefined workflows and offers flexibility to construct custom workflows for RNASeq (reference-based as well as de novo), genome assembly, ChIPSeq and DNASeq (for SNP and CNV calling). For example, in the case of RNASeq workflow, the analysis starts with quality check (using tool FastQC), contaminant/adapter trimming and removal (using Cutadapt and in-house scripts), read mapping using splice aware aligners (using STAR, TopHat2), transcript quantification, differential expression analysis (using Cufflinks packages and DESeq2), and gene ontology, as well as pathway enrichment analysis (using GoMiner) ( Figure 2). Further, graphically enriched visuals such as heatmap based on clustering, scatter plot, and volcano plot for differentially expressed genes, pie chart on gene-ontology-based annotation, visualization of read data in the genome viewer, etc., are generated for easy interpretation of the data (Figure 3). These figures and underlying data can be downloaded in svg, png, and tsv formats. Moreover, the raw output files such as output of mapping in SAM and BAM formats can also be downloaded. The executed workflows can be shared with peers, rerun after changing parameters or tools. SanGeniX is available as cloud-hosted as well as on premise solution and supported on multiple Linux platforms such as Ubuntu, CentOS, and RedHat.  Here, log2 fold change of genes in three groups with respect to a reference group, Group1 has been plotted. The color-code helps to infer gene expression level. Scatter plot (C), MA plot (D), and Volcano plot (E) present visual investigation of differentially expressed genes between two conditions, for example, here Group 4 and Group 1. Scatter plot helps to quickly compare the expression of a gene between the two conditions, while MA plot depicts trends of difference in expression over the average expression, and Volcano plot helps to spot genes by considering both fold change and test statistic.

Challenges in RNASeq data generation and analysis
As described above, NGS-based transcriptomic data generation and analysis is a complex and multistep process. Every step has some key challenges that hinder the data analysis.

Library preparation
The process of library preparation is generating cDNA from the large RNA fragments, adding the adapters, and amplifying the cDNA for sequencing. Due to a series of experimental reactions, several biases can be introduced in the library preparation step. In majority of the cases, fragmentation of RNA or DNA, which plays an important role in the preparation of high-quality sequencing library, is done using physical or enzymatic methods or chemical shearing. The fragmentation of RNA has even coverage in the gene body and hence it is biased toward the gene body as compared to the 5′ and 3′ ends where the coverage is relatively depleted [20]. The library preparation step is further complicated by the presence of several identical short reads and hence duplicate sequences in the library could arise from abundance of RNA molecules. Another source of duplicate sequences in a library could be due to PCR artifacts. These two different scenarios can be assessed by considering biological replicates in the study. In the case of total RNAseq, abundance of ribosomal RNA (rRNA) dominates sequenced reads and hence creates bias if not removed.

Sequencing platform
Sequencing platforms are available from multiple vendors such as Illumina (http://www.illumina.com/), Life Technologies (https://www.lifetechnologies.com/), and Pacific Biosciences (www.pacificbiosciences.com/), and each of the platforms has its set of advantages and disadvantages [35]. In choosing a sequencing platform, some of the factors to be considered are sequencing length, sequencing type (single end or paired end), throughput, error rate, and type of errors in the generated sequence data. The gigabytes of short reads generated from the current platforms are not error-free, which affects the downstream analysis and interpretation. For transcriptome assembly, the larger read length (such as produced from 454, PacBio) is preferred over short read length (as produced by Illumina) as it will result in assembly of the high-quality and reliable transcripts. However, both 454 and PacBio platforms have limited throughput and hence the approach most commonly used is to generate data from multiple platforms and combine the data during analysis.

Mapping
Accurate mapping of RNASeq reads is a challenging issue because of large data volume, slow mapping speed, false-positive splicing events and incorrect estimation of exon-intron boundaries, large genome size, repeat sequences in the genome, and annotation quality of the genome. Usually, aligners search for introns smaller than a fixed length to reduce the computational power, which often leads to missing the splice reads spanning longer introns [66].
Multiple mapping of reads is another major problem that can be due to presence of repeat regions, similar sequences, and number of mismatches allowed in the mapping step. If such reads mapping to multiple regions are discarded, it will lead to gap in the regions that cannot be mapped uniquely, and if it is included, it can lead to false-positive transcription status. Reference-based assembly cannot efficiently detect trans-spliced genes that are formed from splicing and joining of two different precursor mRNAs and found in some disease conditions such as cancer [125,126]. Additionally, aligners have to cope with sequencing errors, SNP, InDels, other genomics variations and parameters-based, suboptimal mapping outcome. In summary, mapping-based RNASeq analysis can be more effective and complete when reads are long, genome is well-annotated, and it can be combined with de novo genome assembly to identify novel transcripts.

Read quantification for the estimation of gene expression
Once the sequenced reads are aligned, gene expression is measured. The most common way of read quantification is counting the number of reads overlapping the exons of a gene and if the exon boundaries are not well-annotated, it may lead to false-positive hits. Another major challenge in read quantification is reads mapping to multiple locations.

Count normalization
There are several methods such as quantile-based normalization, GC-content-based normalization, Poisson model with variable rates for different positions, available to normalize and correct the biasness in the count data for the improved detection of differentially expressed genes [91,127,128]. The increasing number of normalization methods requires a state-of-theart technique for comparing these methods. In the absence of such technique, there is no consensus on the best method for normalization. For example, Zyprych-Walczak et al. [99] found that TMM method worked poorly for them while Dillies et al. [98] found TMM and median of ratio methods to be the best as compared to other methods. The transcript length is another source of bias and leads to detection of more differential expression in longer transcripts compared to shorter transcripts [88].

Differential expression analysis
There are several tools and methods developed for the differential expression analysis comparing differences in gene expression in different conditions (see section 2). Nonparametric methods are not capable of better differential expression detection in the absence of sample replicates and hence parametric methods are preferred for differential expression analysis [129]. A study comparing various differential expression methods suggests that there is no optimized method that can serve well for all the different conditions. As compared to other tools, Cuffdiff performed poorly with large number of false-positives [130]. The accuracy of differentially expressed genes is statistically significant and makes more sense if multiple replicates are used in the analysis.
Similar to the situation as in normalization, picking up the best tool for differential analysis is a tricky job. This is because there is no consensus about the tool best-suited for all experimental setups. Soneson and Dolerenzi [106] found limma performing well under many conditions but it required at least three replicates. Furthermore, they found limma performing worse when dispersion differed between two conditions. They also observed that with large sample sizes DESeq was overly conservative, while edgeR was producing large number of false-positives.

De novo assembly
The performance and accuracy of the de novo transcriptome assembly is largely dependent on the complexity of the genome (e.g., genome size, number of paralogs, ploidy level), differential read coverage of the sequenced data, and sequencing error. Transcriptome assembly is complex and different from genome assembly in which read coverage is uniform. In contrast, in RNASeq, the abundance of reads vary based upon gene expression, in which case isoforms originating from same gene can have different expression levels and hence poses significant challenge in estimating the abundance especially for the lowly expressed genes if the sequencing depth is too low. In general, de novo transcriptome requires much higher sequencing depth than the reference-based transcriptome assembly.
The de novo transcriptome assembly generally consumes more time and is more computationintensive than reference-based assembly [131]. The number of transfragments produced using the de novo approach is quite high, which can be due to multiple similar transcripts/isoforms at the locus from allelic variation, or could be due to artifacts. Additionally, the contiguity and completeness of the de novo assembled transcriptome is less than the reference-based assembly especially for the data with less sequencing depth [132].

Deep sequencing versus cost
Another challenge associated with the RNASeq technology is read coverage and cost associated with it. In order to detect lowly expressed genes or rare variants in the coding region, high read coverage is required. According to Nagalakshmi et al. [10], for simple organism such as yeast, which does not undergo alternative splicing, 30 million reads are sufficient to observe genome-wide transcriptome profile [10]. But for larger and complex genomes such as the human genome, higher-depth RNASeq data are required in order to capture the complete transcriptomes. Moreover, in a given organism the number of transcripts expressed in different conditions is different and hence same coverage may not be sufficient to capture all the transcripts expressed under different conditions. Hence, before designing an experiment, one should be aware of both sequencing depth required and the number of samples to be sequenced. If the aim of experiment is to detect rare variants or lowly expressed genes, one should go for high coverage of the transcriptome, whereas, if the aim of the experiment is focused on gene expression differences between different samples (or conditions), one should consider generating replicate data for statistical power [133].
There are other bioinformatics challenges such as data retrieval, storing, unavailability of optimized statistical methods, and high-end compute infrastructure requirement that add to the complexity of transcriptome analysis.

Applications of RNASeq
RNASeq provides an unprecedented view into the complexity of the transcriptome and hence is a powerful tool to characterize and profile transcriptome on a genome-wide scale. Some of these applications with detailed examples are discussed below.

Transcriptome profiling of economically important plants
Understanding the transcriptome and the functional elements of the economically important plants can provide tremendous insights into biological entities, critical for traits such as disease resistance, productivity, and characteristics such as flavor. Recently, Hu et al. [134] performed transcriptome assembly and annotation for the spice black pepper. Black pepper is one of the most widely used fruit for adding flavor to food as well for its medicinal properties. The authors were able to identify genes that might participate in piperidine, quinolizidine, indolizidine, and lycopodium alkaloid biosynthesis, of which piperidine alkaloids account for pungent taste and medicinal properties of black pepper. Similarly, Shudeesh et al. [135] performed assembly and annotation of field pea, a legume that is cultivated worldwide for human as well as livestock consumption. Studies have also been undertaken to identify transcriptomes of the pathogens that infect economically important plants and the defense mechanisms deployed by the plants. For example, the transcriptome of coffee leaf rust pathogen Hemileia vastatrix was sequenced by Talhinhas et al. [136] to identify genes/ pathways that play a key role in the early stage of the infection, and Yang et al. [137] sequenced the sand pear germplasm with differential resistance to infection by Alternaria alternata to identify genes that contribute toward the resistance.

Transcriptome profiling of economically important animals
Similar to the value provided by transcriptome profiling of plants, transcriptome profiling of economically important animals contributes toward better understanding of disease resistance, productivity, breeding, quality of meat, etc., in animals. Ropka-Molik et al. [138] have used the NGS transcriptome profiling approach to identify genes that are differentially expressed between two pig breeds with differences in muscularity that could contribute toward the quality of meat. Gene expression profiles have been generated from different breeds of cows to identify genes that contribute toward milk protein and fat percentage in cow milk [139,140] and milk yield [141]. Transcriptome profiling has also been used very recently to identify the genes that are differentially expressed in silkworms (B. mori) undergoing thermal parthenogenesis [142]. Thermal parthenogenesis is a process that is used in silkworm breeding and selection.

Cancer
Cancer is a complex and heterogeneous genetic disorder that results from either inherited or somatic genetic variations such as single nucleotide variations (SNV), insertions, deletions, copy number variations, dysregulation of gene expression, and epigenetic modifications. As changes in the gene expression pattern play a key role in tumorigenicity [143], metastasis [144], prognosis [145], and relapse [146,147], gene expression profiling has been used extensively in cancer research and diagnosis. OncotypeDx (http://www.oncotypedx.com/) is a gene-expression-based commercially available test that is used for breast cancer, colon cancer, and prostate cancer diagnosis and prognosis.
Contrary to microarrays and RT-PCR-based approaches used earlier, RNASeq, which can detect coding and noncoding RNA, strand orientation, and genetic variants all in one go, is a very powerful tool in deciphering the complex transcriptome changes usually found in cancer. One of the most comprehensive studies published recently is the transcriptome profiling of 4043 Cancers and 548 Normal Tissue Controls across 12 TCGA Cancer Types [148]. In this study, in addition to identifying tissue specific gene signature, the authors were able to identify a 14-gene signature that accurately distinguished the cancer samples from the normal. Using a whole transcriptome sequencing approach, Koh et al. [149] recently reported 14 candidate genes that are important in rhabdoid glioblastoma (R-GBM) tumor, a rare form of GBM. Similarly, RNASeq approach was used to identify gene signature in flow-sorted viable EpCAM + tumor epithelial cells and CD45+ tumor-infiltrating immune cells that were obtained from cervical cancer samples [150]. The authors identified TCL1A as a novel biomarker, found specifically in the immune cells, for predicting survival in cervical cancer patients.
The aforementioned studies highlight the varied approaches that can be used for identifying biomarkers or gene signatures associated with distinct cancer characteristics.

Reproductive health
With the advancing parental age and a desire to limit the number of pregnancies, many couples opt for assisted reproduction for childbearing. The advanced parental age is a key factor that contributes toward the complications in assisted reproduction, and genomics-based approaches are widely used to ensure a high success rate. Gene expression changes in ovarian granulosa cells in women >35 years of age include downregulation of polo-like kinase pathway, which plays an important role in cell cycle arrest of granulosa cells, and the G2/GM checkpoint pathway [151]. Another very recent study also used the RNASeq approach to identify differential gene expression profiles in women with successful pregnancy and a failed pregnancy through assisted reproduction [152]. The authors found that the genes that were differentially expressed played a role in immune response and inflammation, oocyte meiosis, and rhythmic process.
The application of RNASeq in reproductive health is relatively new and as more knowledge is gleaned through this, it might be possible to develop a signature that can be used for predicting the success of assisted reproductive approach.

Developmental disorders
Developmental disorders are ones in which the child develops slower than peers in areas such as motor function, social skills, and cognitive ability. Developmental disorders include Austism, Asperger's Syndrome, Attention Deficit Hyperactivity Disorder (ADHD), Rett Syndrome, and stereotypic movement disorder, to name a few. Gene expression profiling has been used extensively in Austism and genes involved in neuronal action potential, myelination, axon ensheathment, cellular development, and cellular proliferation have been found to be differentially expressed in autistic children [153]. Another study, using an in vitro model of Autism found expression differences in genes involved in cell proliferation, neuronal differentiation, and synaptic assembly [154]. Similarly, a gene expression study in Rett Syndrome [155], which is a rare variant of Austism, has identified genes involved in mitochondrial functions, cellular protein metabolic processes, and RNA processing and DNA organization to be differentially regulated.
In addition to the applications listed here, gene expression profiling can be used in number of other human disorders such as diabetes, hypertension, psychiatric disorders, and infectious diseases.

Future perspective
RNASeq technology is proving to be a valuable tool to study known and novel transcripts of an organism by providing more insights into the role of gene expression in development, differential expression between different conditions, changes in gene expression in disease progression, alternative splicing events, RNA editing, fusion transcripts, allele-specific expression, etc. This technology is revolutionizing the field of plant and animal transcriptome, where many of the species lack reference genome because of genome size and complexity. Metatranscriptomic-NGS technology employed to study microbial transcriptome is another emerging area of research in which construction of transcriptome assembly has led to simultaneous identification of thousands of transcripts from the microbial community of the human gastrointestinal tract [156], and the marine [157,158] and soil [159]. Because of the fact that gene expression levels vary significantly from one cell to another, researchers are now moving toward single-cell transcriptomics, in which cell-to-cell variability on a genome-wide scale can be profiled. Hence, transcriptome of single cell can be probed more efficiently as compared to cell population where average transcript abundance of population is seen [160,161]. A recent study by Sasagawa et al. developed the method Quartz-Seq for individual cell isolation followed by RNA sequencing and distinguished mouse embryonic stem cells from primitive endoderm based upon transcriptome profile as well as cell-to-cell stochastic variation [162].
Another recently developed method, RaceID, is very useful in identifying rare cell types in healthy and diseased tissues using mRNA sequencing [163]. Tissue-specific RNASeq is another emerging area of research that can reveal tissue-specific requirement of RNA expression. A recent study done on 13 different cell types discovered many tissue-specific and novel miRNAs, which suggests that the repertoire of human miRNA is more extensive than our current knowledge [164]. RNASeq is used as a powerful tool for clinical application as well. A recent study developed exome capture RNASeq protocol for degraded clinical formalin-fixed samples, which has shown to work successfully on prostate cancer samples suggesting that capture transcriptome study can be used beyond cell lines and in the clinical setting [165].
Moreover, there are several publicly available RNASeq data repositories such as ENCODE (https://www.encodeproject.org/), TCGA (www.cancergenome.nih.gov), and The Geuvadis Project (http://www.geuvadis.org/), which provide enormous amount of data to researchers to conduct genome-wide analyses beyond traditional gene expression and profiling analysis. Mining data from public repositories will provide new insights into the transcriptome and hence enable researchers to gain more information on gene regulation, which has been previously neglected.
Sequencing method and experimental protocols are also continuously improving to reduce the challenges associated with the technology. Platforms such as PacBio can produce a fulllength transcript in a single read, which can eventually eliminate the transcript assembly step of the data analysis.
Additionally, to cater to the high volume of data and the demand for high-end computational resources for the transcriptome assembly, many assemblers have started supporting parallel data processing, which has significantly reduced the time required for the assembly (reviewed in [66]). Cloud computing is another lucrative approach for parallel computing, which is scalable and can be used as per the user requirement [166].