Open access peer-reviewed chapter

Recent Advancement on In-Silico Tools for Whole Transcriptome Analysis

Written By

Vidya Niranjan, Lavanya Chandramouli, Pooja SureshKumar and Jitendra Kumar

Submitted: 29 January 2023 Reviewed: 08 December 2023 Published: 12 February 2024

DOI: 10.5772/intechopen.114077

From the Edited Volume

Population Genetics - From DNA to Evolutionary Biology

Edited by Payam Behzadi

Abstract

Delving into the intricate world of transcriptome analysis, this chapter unfolds the story of gene expression in organisms. The classic DNA microarray and RNA-seq methods have long been the pillars, with RNA-seq taking the spotlight for its superior resolution in understanding dynamic aspects. Yet, tools like Hisat2 and DESeq2, while effective, come with the drawback of being time-consuming and reliant on powerful GPUs. The need for quicker, less resource-intensive techniques has sparked a shift toward simpler R and Python-based tools that not only sidestep GPU dependence but also offer enhanced graphical representations. As we navigate through the content, the chapter draws a vivid comparison between the established tools and the emerging ones, highlighting the pressing need for innovative approaches in transcriptome analysis. The narrative guides readers through the fundamentals, from the Central Dogma’s backstory to the pivotal role of RNA in gene expression and disease. It uncovers the nuances between RNA-Seq and microarray technologies, providing a comprehensive overview of tools for data collection and interpreting changes in gene expression. Our journey extends to the latest breakthroughs, such as the TACITuS platform and the TALON pipeline, tailored for in-depth analysis of transcriptomes using long-read data. The chapter concludes by emphasizing the ever-growing significance of transcriptomics in unraveling complex biological phenomena, with a spotlight on the promising applications of next-generation sequencing. A comprehensive summary ties it all together, detailing the step-by-step protocol of transcriptome analysis, along with insights into current tools, their advantages, and limitations, providing readers with a holistic understanding of their practical application and outcomes.

Keywords

  • transcriptome analysis
  • in-silico tools
  • current trends
  • run time
  • memory
  • protocol

1. Introduction

In 1958, Francis Crick introduced the Central Dogma, outlining the directional flow of genetic information within cells [1]. This fundamental principle of molecular biology involves two key processes: the transcription of DNA into RNA and the translation of RNA into proteins [2]. Subsequent research revealed the existence of various RNA types, including messenger RNA (mRNA), transfer RNA (tRNA), and ribosomal RNA (rRNA), which play pivotal roles in protein synthesis [3]. During the 1980s, investigations into U-rich RNA, particularly in Tetrahymena thermophila and bacterial RNase P complexes, revealed that RNA can itself act as a catalyst (ribozymes). Subsequent findings highlighted the existence of micro-RNAs (miRNAs) and their regulatory properties. Both mRNA and non-coding RNA (ncRNA) play pivotal roles in governing gene expression and in cellular development [4]. RNA stands out as a crucial macromolecule in biological cells, carrying essential messages from DNA to facilitate protein synthesis, thereby sustaining life. Minor alterations in transcription can disrupt the entire mechanism, sometimes leading to severe diseases. The transcriptome encompasses all mRNA molecules expressed by an organism; the term is also used to describe the mRNA transcripts within a specific cell or tissue type. The field of transcriptomics examines the regulation, variation, and mechanisms governing RNA molecules in cellular processes [5]. Comprehensive examination of the entire transcriptome has become increasingly vital for understanding the altered expression of genetic variants implicated in complex diseases such as cancer, diabetes, and cardiovascular conditions. In particular, transcriptome analysis yields fresh insights for biomarker discovery, establishing gene-centric benchmarks for personalized medicine and therapies [6]. Furthermore, transcriptome analysis plays a significant role in advancing research on the long-term effects of COVID-19 [7]. By scrutinizing genome-wide differential RNA expression, researchers can gain a deeper understanding of the biological processes and molecular mechanisms governing cell fate, development, and the progression of diseases.

The analysis of the transcriptome commonly involves a comparative evaluation between two groups, notably healthy and diseased conditions [8]. This approach proves valuable in elucidating the functionalities of genes and regulatory pathways. Advanced techniques in transcriptome analysis encompass microarrays, which provide a comprehensive quantitative and qualitative gene expression profile of the sample by scrutinizing the entire transcriptome. Additionally, RNA-Seq employs high-throughput sequencing to capture all sequences, offering an enhanced perspective on the complexity levels within the eukaryotic transcriptome [9]. Table 1 illustrates distinctions between RNA-Seq and Microarray technologies.

| | RNA-Seq | Microarray |
|---|---|---|
| Definition | RNA-Seq (RNA sequencing) examines the quantity and sequences of RNA in a given sample using next-generation sequencing (NGS) | Microarray detects the expression of genes |
| Technique | Sequence-based | Hybridization-based |
| Sequence | Identifies novel RNA sequences | Identifies known sequences |
| SNP | Identifies SNPs, except novel SNPs at low abundance | Cannot identify SNPs |
| Sensitivity | High sensitivity | Comparatively low sensitivity |
| Accuracy | High accuracy | Low accuracy |
| Labor intensity (sample preparation and data analysis) | High | Low |
| Technical reproducibility | >99% | >99% |

Table 1.

Comparison of microarray and RNA-Seq technologies.

Despite the drawbacks of RNA-Seq, such as high costs and the need for powerful computing systems, this technology offers the advantage of not depending on prior sequence information. It facilitates the exploration of both known and unknown transcripts. The RNA-Seq process entails preparing libraries and sequencing RNA samples through diverse platforms like Illumina, Nanopore, and PacBio. The sequencing techniques yield fastq files, and the subsequent bioinformatics pipeline is implemented using suitable tools. The following sections delve into discussions about the existing tools, emerging technologies, and the protocol involved in RNA-Seq.

2. Glimpse of protocol and tools used for transcriptome analysis

2.1 Data collection and processing

Data collection is a key step in protocol development, decision making, planning and research. Effective collection of transcriptome sequences gives the required outcome and makes it possible to predict future trends. Read quality plays a major role in the analysis. Quality control of the raw reads involves analyzing sequence quality, GC content, adaptor presence, overrepresented k-mers, and duplicate reads in order to identify sequencing errors, PCR artifacts, or contamination. Acceptable duplication, k-mer, or GC content levels are organism- and experiment-specific, but these values should be uniform among samples in the same experiment. We advise discarding outliers with more than 30% disagreement. FastQC [11] is a popular tool for carrying out these analyses on Illumina reads, whereas NGSQC [10] can be used on any platform. In general, read quality declines toward the 3′ end of reads, and if it drops too low, those bases should be trimmed to improve mappability. Low-quality reads can be removed using Trimmomatic and the FASTX-Toolkit [12, 13].
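The 3′ quality trimming and GC-content checks described above can be illustrated with a minimal Python sketch. It assumes Phred+33 quality encoding and a hypothetical cutoff of 20; the function names are illustrative only, and this is not how Trimmomatic or the FASTX-Toolkit implement trimming.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumption: Phred+33 encoding, as used by current Illumina output.
    """
    return [ord(symbol) - offset for symbol in quality_string]


def trim_three_prime(sequence, quality_string, min_quality=20):
    """Drop trailing bases whose Phred score is below min_quality.

    min_quality=20 is an illustrative cutoff, not a recommended value.
    """
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def gc_content(sequence):
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)


# Toy read: five high-quality bases followed by three low-quality ones.
read, quality = "ACGTACGT", "IIIII#!#"
trimmed_read, trimmed_quality = trim_three_prime(read, quality)
print(trimmed_read, trimmed_quality, gc_content(trimmed_read))
```

Dedicated trimmers typically add sliding-window and adaptor-aware logic on top of this basic idea.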

2.2 De novo assembly

Contigs are created from partially or completely overlapping reads, and scaffolds are created by joining groups of contigs, which may or may not overlap. A chromosome is in turn reconstructed by joining groups of overlapping or non-overlapping scaffolds. During contig assembly, reads must share an overlap of a minimum length before they can be joined; in k-mer-based assemblers this overlap is expressed in terms of k-mers, subsequences of k bases. During scaffold assembly, contigs can be connected to one another without necessarily overlapping, because paired-end sequencing provides the linking information. Scaffolds are connected via a gap-filling, gap-closing, or genome-finishing procedure during the chromosome construction stage. Using only short-read technology, it can be challenging and occasionally impossible to finish this last stage. Despite considerable improvement in this area, repetitive sequences can prevent gap-filling using only short reads [14]. The ability to generate large amounts of RNA-sequencing data has led to the creation of a variety of reference-based and de novo transcriptome assemblers, each with its own benefits and drawbacks. Although many transcriptome investigations routinely use reference-based methodologies, these are applicable only to organisms with complete, well-annotated genomes. De novo transcriptome reconstruction from short reads is challenging, and this challenge is exacerbated by alternative splicing, paralogous genes, and the wide range of gene expression levels. rnaSPAdes and Trinity [15, 16] are among the robust tools for performing de novo assembly.
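As a rough illustration of the overlap requirement in contig assembly, the following Python sketch joins two reads when the end of one matches the start of the other over at least a minimum number of bases. The function names and the threshold of 4 are assumptions made for this example; real assemblers such as Trinity and rnaSPAdes work on de Bruijn graphs rather than pairwise merging.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least min_overlap bases exists.
    """
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left, right, min_overlap=4):
    """Join two reads into one contig, or return None if they do not overlap.

    min_overlap=4 is only suitable for toy sequences.
    """
    length = overlap_length(left, right, min_overlap)
    if length == 0:
        return None
    return left + right[length:]


# Two toy reads sharing the 4-base overlap "TGCA".
print(merge_reads("ACGTTGCA", "TGCAGGTC"))
```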

2.3 Read alignment

Typically, reads are mapped to either a genome or a transcriptome. The percentage of mapped reads, which serves as a general indicator of sequencing accuracy and of the presence of contaminating DNA, is a crucial measure of mapping quality. Between 70 and 90 percent of reads from a typical RNA-seq experiment are expected to map onto the human genome [17], with a sizeable portion mapping equally well to a small number of identical regions (referred to as “multi-mapping reads”). The uniformity of read coverage across exons and the mapped strand are further crucial parameters. If reads accumulate predominantly at the 3′ end of transcripts in poly(A)-selected samples, this could suggest suboptimal RNA quality in the starting material. The standard tool for alignment is HISAT2. HISAT (hierarchical indexing for spliced alignment of transcripts) is a highly efficient system for aligning reads from RNA sequencing experiments. HISAT uses two types of indexes for alignment: a whole-genome FM index to anchor each alignment, and numerous local FM indexes for rapid extension of these alignments. This indexing scheme is based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index. HISAT’s hierarchical index for the human genome contains about 48,000 local FM indexes, each representing a genomic region of about 64,000 base pairs [18].
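To make the FM-index idea more concrete, here is a minimal Python sketch that builds a Burrows-Wheeler transform for a toy sequence and counts exact pattern occurrences by backward search. It is a textbook illustration under simplifying assumptions (naive suffix sorting, uncompressed occurrence tables, a text without the `$` symbol) and does not reproduce HISAT2's hierarchical global and local indexes; all function names are illustrative.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM index: first-column offsets plus occurrence tables.

    Assumption: `text` does not contain '$', which is used as the sentinel.
    """
    text += "$"
    # Naive suffix sorting; real indexes use far more efficient construction.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first_col_start[c]: number of symbols in the text smaller than c.
    counts = Counter(bwt)
    first_col_start, total = {}, 0
    for symbol in sorted(counts):
        first_col_start[symbol] = total
        total += counts[symbol]

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {symbol: [0] for symbol in counts}
    for current in bwt:
        for symbol in occ:
            occ[symbol].append(occ[symbol][-1] + (1 if symbol == current else 0))
    return first_col_start, occ, len(bwt)


def count_matches(pattern, fm_index):
    """Count exact occurrences of `pattern` using backward search."""
    first_col_start, occ, size = fm_index
    lo, hi = 0, size  # half-open interval of matching suffix ranks
    for symbol in reversed(pattern):
        if symbol not in first_col_start:
            return 0
        lo = first_col_start[symbol] + occ[symbol][lo]
        hi = first_col_start[symbol] + occ[symbol][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("GATTACAGATTACA")
print(count_matches("ATTA", index), count_matches("CAT", index))
```

Each pattern symbol narrows the suffix interval in constant time, which is why the search cost depends on the read length rather than on the genome size.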

2.4 Quantification

The primary use of RNA-seq is commonly to assess transcript and gene expression levels. This application relies mainly on counting the reads that align to each transcript sequence, although tools such as Sailfish instead count k-mers in reads without the need for mapping [19]. Gene-level quantification, which quantifies genes rather than transcripts, usually disregards multireads and uses a gene transfer format (GTF) file containing the genomic coordinates of exons and genes. Raw read counts cannot be used to compare expression levels between samples, since they are influenced by factors such as transcript length, total number of reads, and sequencing biases. The within-sample normalization method RPKM (reads per kilobase of exon model per million reads) removes the feature-length and library-size effects [20]. This metric and its subsequent modifications, FPKM (fragments per kilobase of exon model per million mapped reads), a within-sample normalized measure of transcript expression analogous to RPKM, and TPM (transcripts per million), are the most frequently reported RNA-seq gene expression values. It is important to note that for single-end (SE) reads, FPKM and RPKM are equivalent, and that TPM can be derived from FPKM by dividing each FPKM value by the sum of all FPKM values in the sample and multiplying by one million.
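The relationship between these normalized values can be sketched in a few lines of Python. The gene names, counts, and lengths below are invented for illustration, and the sum of the listed counts stands in for the total number of mapped reads, which is an assumption of this example.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of feature per million mapped reads.

    Assumption: the sum of `counts` approximates the total mapped reads.
    """
    total_reads = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (lengths[gene] * total_reads)
        for gene in counts
    }


def tpm(counts, lengths):
    """Transcripts per million, computed from length-normalized rates."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    rate_sum = sum(rates.values())
    return {gene: rate / rate_sum * 1e6 for gene, rate in rates.items()}


def tpm_from_fpkm(fpkm_values):
    """Rescale FPKM (or RPKM) values so that they sum to one million."""
    total = sum(fpkm_values.values())
    return {gene: value / total * 1e6 for gene, value in fpkm_values.items()}


# Hypothetical counts and feature lengths (bp) for three genes.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

print(rpkm(counts, lengths))
print(tpm(counts, lengths))
print(tpm_from_fpkm(rpkm(counts, lengths)))
```

In this toy sample geneA and geneB receive the same RPKM despite different raw counts, because geneB is three times longer; both routes to TPM give the same values.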

2.5 Differential gene expression

Differential expression analysis requires that gene expression values be compared across samples. RPKM, FPKM, and TPM normalize away the most important factor for inter-sample comparisons, sequencing depth, whether directly or by accounting for the number of transcripts, which may vary substantially between samples. These methods normalize on the basis of total or effective counts and tend to perform poorly when samples have heterogeneous transcript distributions, that is, when highly and differentially expressed features distort the count distribution [21]. One of the important parameters by which differential gene expression is evaluated is the log2 fold change. Both negative and positive fold change values are taken into consideration: a negative value indicates downregulation of a gene and a positive value indicates upregulation. The most widely studied tools for differential gene expression are DESeq2, Ballgown, Cuffdiff and Cufflinks [22, 23, 24, 25]. Figure 1 shows the complete protocol involved in computational transcriptomics.
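The sign convention of the log2 fold change can be shown with a short Python sketch. It assumes the inputs are already normalized mean expression values, adds a pseudocount of 1 to avoid division by zero, and uses a hypothetical threshold of 1 (a two-fold change); tools such as DESeq2 estimate fold changes with statistical models that this example does not attempt to reproduce.

```python
import math


def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """log2 ratio of treated to control expression.

    Assumption: inputs are normalized means; the pseudocount is illustrative.
    """
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


def classify(lfc, threshold=1.0):
    """Label a gene by the sign and size of its log2 fold change."""
    if lfc >= threshold:
        return "upregulated"
    if lfc <= -threshold:
        return "downregulated"
    return "unchanged"


for treated, control in [(7, 1), (1, 7), (3, 3)]:
    lfc = log2_fold_change(treated, control)
    print(treated, control, lfc, classify(lfc))
```

In practice the fold change is reported together with a significance estimate, since a large ratio between two low counts can arise by chance.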

Figure 1.

Protocol for computational transcriptomics: the figure is inclusive of all the steps in transcriptomics; dataset collection, alignment, assembly, differential gene expression and pathway analysis.

3. New era for transcriptome analysis

3.1 Dataset collection

3.1.1 Gene Expression Omnibus

The Gene Expression Omnibus (GEO), managed by the National Center for Biotechnology Information (NCBI), is a leading repository for high-throughput genomics data. It is a hub for gene expression studies and is fully accessible to researchers everywhere. Anyone can contribute, which keeps the database growing and rich in varied experimental data. GEO offers a user-friendly search interface through which datasets and their detailed metadata can be downloaded. Beyond gene expression, the repository also holds other omics data and follows community standards, which promotes easy sharing and reuse of information. GEO is critical to progress in genomics research, providing learning tools, supporting research work, and aiding scientific collaboration and discovery.

GEO [26] is an international public repository that provides free access to high-throughput gene expression and functional genomics datasets. It is maintained by NCBI and supported by the National Library of Medicine (NLM). The raw files in FASTQ format obtained after sequencing are deposited along with descriptions, experimental design, attributes, and information on the study protocol. The database allows direct retrieval of a specific GEO record and quick access to datasets through appropriate keywords.

The GEO database offers two separate search engines: (i) GEO DataSets and (ii) GEO Profiles. GEO DataSets is used to access the datasets of a specific study. This database comprises the submitter-supplied platform, sample, and series entries, supplemented with curated gene expression DataSet records. Every record has an accession code, title, summary, species information, and links to relevant data, which enables thorough and meaningful retrieval.

GEO Profiles makes it easier to find gene expression profiles. These profiles are derived from the curated DataSet records held in the database. For each profile, users can see the gene name, the dataset name, and a thumbnail showing the gene’s expression level in each sample of the dataset. This allows quick recognition of whether a gene is differentially expressed under different experimental conditions.
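Direct retrieval of a GEO record by accession can be sketched as follows. The example recognizes the four GEO accession prefixes (GPL for platforms, GSM for samples, GSE for series, GDS for curated DataSets) and builds the address of the record page; the helper names are illustrative, and the accession used is only a placeholder to show the format.

```python
import re
from urllib.parse import urlencode

# GEO accession prefixes and the record types they denote.
GEO_PREFIXES = {
    "GPL": "platform",
    "GSM": "sample",
    "GSE": "series",
    "GDS": "curated DataSet",
}
ACCESSION_PATTERN = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def geo_record_type(accession):
    """Return the record type for a GEO accession, or raise ValueError."""
    match = ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GEO accession: {accession!r}")
    return GEO_PREFIXES[match.group(1)]


def geo_record_url(accession):
    """Build the address of the GEO accession display page for a record."""
    geo_record_type(accession)  # validates the accession format
    query = urlencode({"acc": accession.strip().upper()})
    return f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?{query}"


# Placeholder accession, used only to illustrate the format.
print(geo_record_type("GSE1000"), geo_record_url("GSE1000"))
```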

3.1.2 TACITuS

Researchers have access to a wide range of RNA-Seq and microarray datasets in many sizable public repositories, among the most popular of which are NCBI GEO and ArrayExpress. These repositories contain large numbers of files, so identifying the subsets relevant to a research question requires substantial bandwidth and specialized tools, and importing or modifying data from such sources is not easy. The web-based platform TACITuS provides quick querying of archived microarray and NGS data. It includes modules for handling large files, storing them in the cloud, and efficiently extracting data subsets, and it facilitates importing data into Galaxy for further analysis. High-throughput microarray and NGS data analysis often involves processing extensive data from publicly accessible repositories. TACITuS streamlines this pre-processing task by automating several modules, enabling agile management of large data files, and offers a user-friendly interface that works together with Galaxy [27].

TACITuS is developed using the Laravel framework and employs two databases: MariaDB for swift indexing of available datasets and MongoDB for storing both data and metadata. The data processing pipeline optimizes performance by leveraging R, C++, and PHP. The platform integrates information from prominent sources such as NCBI GEO and ArrayExpress with user data. TACITuS offers five essential functionalities: (i) data import, (ii) data selection, (iii) identifier mapping, (iv) data integration, and (v) Galaxy export. A detailed, step-by-step tutorial accessible through the web interface explains how these modules are used. The “Dataset submission” panel within TACITuS lets users import datasets from diverse public transcriptomics resources. Users select a data source (NCBI GEO, ArrayExpress, or a custom source), specify the dataset accession number, and indicate whether the dataset should be public or private. Following submission, the request joins a priority queue, and computation starts once resources become available.

Once processing begins, the files are downloaded and the relevant data are archived in MongoDB, together with two indexes: one holding metadata attributes and the other mapping each sample to its position in the expression matrix through a unique code. In addition, a Lucene-backed full-text index is built for each metadata attribute, making the dataset easier for end users to search.

For NCBI GEO datasets, TACITuS retrieves the platform descriptor and links each probe to its counterpart identifiers in other systems, such as Entrez or Ensembl Gene IDs. The “Selections” panel provides services such as “Map Identifiers,” which maps probe identifiers to a standard format (for example, Entrez) so that data from different platforms can be integrated. To merge selections into one complete dataset, the system uses ComBat, gene standardization, and XPN. These methods involve z-score normalization, empirical Bayesian modeling, and normalization based on gene and sample similarity.
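Of the integration methods listed above, z-score based gene standardization is the simplest to illustrate. The Python sketch below standardizes each gene within each dataset and then concatenates the samples for the genes that the datasets share. It is only an assumption-laden illustration of the general idea, not the TACITuS implementation; the expression values are invented and the function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize_genes(expression):
    """Convert each gene's values to z-scores within one dataset.

    `expression` maps a gene identifier to its values across samples.
    Genes with zero variance are set to 0.0 (a choice made for this sketch).
    """
    standardized = {}
    for gene, values in expression.items():
        centre = mean(values)
        spread = pstdev(values)
        if spread == 0:
            standardized[gene] = [0.0 for _ in values]
        else:
            standardized[gene] = [(value - centre) / spread for value in values]
    return standardized


def integrate(datasets):
    """Standardize each dataset separately, then join samples of shared genes."""
    shared_genes = set.intersection(*(set(dataset) for dataset in datasets))
    scaled = [standardize_genes(dataset) for dataset in datasets]
    return {
        gene: [value for dataset in scaled for value in dataset[gene]]
        for gene in sorted(shared_genes)
    }


# Invented values on very different scales, as from two platforms.
dataset_a = {"TP53": [2.0, 4.0, 6.0], "MYC": [1.0, 1.0, 1.0]}
dataset_b = {"TP53": [100.0, 300.0], "BRCA1": [5.0, 6.0]}
print(integrate([dataset_a, dataset_b]))
```

After standardization, values from both datasets lie on a comparable scale, which is what makes the concatenated matrix usable downstream.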

Figure 2 illustrates the principle and working of the tool.

Figure 2.

Workflow of TACITuS: the figure outlines the steps through which the tool operates.

3.1.3 ArrayExpress

ArrayExpress is one of the most important online databases for functional genomics datasets. Most of the data are genome-wide gene expression data collected using microarray or next-generation sequencing (NGS) platforms. ArrayExpress also holds data from a variety of DNA assays, including ChIP-seq and genotyping. An experiment typically combines several assays from a study. The definition of an assay depends on the type of investigation: in microarray studies an assay corresponds to one hybridization (of biological sample material to an array), whereas in NGS studies an assay is a read-out (sequencing) of one library [28].

ArrayExpress is one of the repositories that follow the MIAME (Minimum Information About a Microarray Experiment) standard, and it serves as a primary archive for data supporting publications and for data from collaborative projects. The MGED Society recommends ArrayExpress, together with GEO and CIBEX, for data deposition, and data can be held confidentially prior to publication. Once published, the data are openly accessible. In the last 2 years, the ArrayExpress repository has expanded and now consists of over 50,000 hybridizations across 1650 experiments. More than 90% of these experiments are gene expression profiling studies, and the rest are array-based chromatin immunoprecipitation or comparative genomics studies. The repository covers more than 200 species, with considerable representation of human, mouse, Arabidopsis, yeast and rat.

Data can be submitted to the ArrayExpress repository by three routes. First, web-based submission with the MIAMExpress online tool is recommended for experiments of up to ∼20 hybridizations; a batch loader for larger experiments via this route was in testing and slated for release in 2006. Second, experiments of any type and size can be submitted as spreadsheets, with a template generation system provided for user convenience. Third, laboratories with local databases can use the MAGE-ML or MAGE-TAB formats to export data automatically and directly to ArrayExpress. Curation involves meticulous checking for MIAME compliance, including the presence of raw and processed data, the accuracy and completeness of biological information, and data consistency. Journals have asked ArrayExpress to provide a MIAME assessment service, and MIAME scores for legacy data are to be displayed in the user interface and made accessible to reviewers of the supporting publications. MIAME scoring is currently used internally to select data for the Data Warehouse [28].

3.2 Alignment

3.2.1 STAR

Accurate alignment of high-throughput RNA-seq data is a challenging and still unresolved problem because of the non-contiguous transcript structure, relatively short read lengths, and constantly increasing throughput of sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitations, and mapping biases. To align the large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, the developers of STAR (Spliced Transcripts Alignment to a Reference) built the software on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure. On a modest 12-core server, STAR outperforms other aligners by a factor of >50 in mapping speed, aligning 550 million 2 × 76 bp paired-end reads to the human genome per hour while also improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and it is capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, the developers experimentally validated 1960 novel intergenic splice junctions with an 80–90% success rate, corroborating the high precision of the STAR mapping strategy.

While many RNA-seq aligners extend contiguous short read mappers, STAR takes a unique approach by directly aligning non-contiguous sequences to the reference genome. The STAR algorithm involves two main steps: seed searching and clustering/stitching/scoring. In the seed search phase, STAR identifies Maximal Mappable Prefixes (MMPs), akin to concepts in large-scale genome alignment tools. MMP represents the longest substring of a read sequence that exactly matches one or more substrings of the reference genome. STAR’s sequential MMP search efficiently detects splice junctions, making it notably faster than comparable tools. This approach, implemented through uncompressed suffix arrays, allows for precise splice junction detection in a single pass, without prior knowledge of junction loci. The binary nature of SA search results in efficient, logarithmic scaling search times, particularly advantageous for large genomes, and facilitates accurate alignment of multimapping reads [29].
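The sequential Maximal Mappable Prefix search can be illustrated with a toy Python sketch. It builds a naive uncompressed suffix array, finds the longest prefix of a read that matches the genome exactly by repeatedly narrowing a binary-search interval, and then restarts the search from the first unmatched base. This covers only exact-match seed search under simplifying assumptions; the clustering, stitching, and scoring steps are omitted, and the sequences and function names are invented for the example.

```python
def build_suffix_array(genome):
    # Naive construction: fine for a toy genome, far too slow for a real one.
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def narrow_interval(genome, suffix_array, prefix, lo, hi):
    """Binary-search suffix_array[lo:hi] for suffixes starting with `prefix`."""
    width = len(prefix)

    def head(rank):
        start = suffix_array[rank]
        return genome[start:start + width]

    left, right = lo, hi
    while left < right:  # first suffix whose head is >= prefix
        mid = (left + right) // 2
        if head(mid) < prefix:
            left = mid + 1
        else:
            right = mid
    first = left
    right = hi
    while left < right:  # first suffix whose head is > prefix
        mid = (left + right) // 2
        if head(mid) <= prefix:
            left = mid + 1
        else:
            right = mid
    return first, left


def maximal_mappable_prefix(read, genome, suffix_array):
    """Return (length, genome positions) of the longest exactly matching prefix."""
    lo, hi = 0, len(suffix_array)
    best = 0
    for length in range(1, len(read) + 1):
        new_lo, new_hi = narrow_interval(genome, suffix_array, read[:length], lo, hi)
        if new_lo == new_hi:
            break
        lo, hi, best = new_lo, new_hi, length
    positions = sorted(suffix_array[lo:hi]) if best else []
    return best, positions


def seed_search(read, genome):
    """Sequentially find MMPs; each seed is (read offset, length, positions)."""
    suffix_array = build_suffix_array(genome)
    seeds, offset = [], 0
    while offset < len(read):
        length, positions = maximal_mappable_prefix(read[offset:], genome, suffix_array)
        if length == 0:
            offset += 1  # skip a base that matches nowhere
            continue
        seeds.append((offset, length, positions))
        offset += length
    return seeds


# Toy genome: exon (0-7), intron (8-20), exon (21-28); the read spans the junction.
genome = "ACGTTGCA" + "GTAAGTCCCCCAG" + "TTGACCTA"
read = "TTGCA" + "TTGAC"
print(seed_search(read, genome))
```

In this toy case the first seed ends exactly where the read leaves the first exon and the second seed begins in the next exon, so the gap between the two genomic positions corresponds to the intron.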

3.2.2 TopHat2

By sequencing the RNA molecules transcribed in cells, RNA-seq allows a more comprehensive understanding of transcriptional activity. RNA-seq data are analyzed to determine which genes are expressed and their abundance in a cell. The first stage is mapping the RNA-seq reads back onto the reference genome, which poses its own specific sequence alignment challenges. Because eukaryotic genes contain introns, RNA-seq alignment software must be able to perform gapped (spliced) alignment across introns of varying lengths. In addition, the human genome contains many processed pseudogenes, which can cause alignment problems for reads that span exons. Mature mRNAs have an average length of about 2227 bp per transcript, exons average 235 bp, and a typical transcript contains around 9.5 exons. The shorter the read, the more complex the alignment: about twenty percent of junction-spanning reads are problematic because they extend only a very short ‘anchor’ into one of the exons. This makes correct alignment difficult, especially for algorithms that rely on initial k-mer mapping.

TopHat is a widely used spliced aligner in RNA-sequencing (RNA-seq) research. TopHat2 is the latest version; it can align reads produced by the most modern sequencing technologies to any reference genome while allowing for variable-length indels. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 produces sensitive and precise alignments, even for highly repetitive genomes or in the presence of pseudogenes, by combining the ability to find novel splice sites with direct mapping to known transcripts [30].

3.3 Assembly

3.3.1 BinPacker

The splicing graph, introduced by Heber, is the foundational concept in BinPacker. BinPacker constructs directed acyclic splicing graphs in which nodes represent exons and edges represent splicing junctions. Nodes represent contiguous genomic sequences that contain no alternative splicing events, but they may not correspond to actual exons because of factors such as sequencing errors and low gene expression. The tool constructs a splicing graph for each expressed gene from the RNA-seq data and then searches for an optimal edge-path cover of the graph by iteratively solving a series of bin-packing problems, so that every splicing event is supported and full-length transcripts can be recovered. Unlike other assemblers, which reconstruct transcripts from the de Bruijn graph, BinPacker works on the splicing graph together with its coverage information. Ideally, each splicing graph corresponds to a single expressed gene and represents all of its alternatively spliced transcripts. Using overlapping sequence reads and junctions, the bin-packing method seeks an optimal edge-path cover of each splicing graph in order to recover the set of transcripts it contains [31].
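The splicing graph itself can be sketched as a small data structure. In the Python example below, nodes are exon labels, edges carry a junction coverage value, and every source-to-sink path is listed as a candidate transcript together with the lowest coverage along it. Plain path enumeration and the bottleneck-coverage score are simplifications chosen for this illustration; they are not BinPacker's bin-packing procedure, and the exon labels and coverage values are invented.

```python
def enumerate_transcripts(graph, source, sink):
    """List every source-to-sink path in a directed acyclic splicing graph.

    `graph` maps a node to {successor: junction coverage}. Each result is
    (path, support), where support is the lowest coverage on the path
    (an assumption of this sketch, not BinPacker's scoring).
    """
    results = []

    def walk(node, path, support):
        if node == sink:
            results.append((path, support))
            return
        for successor, coverage in graph.get(node, {}).items():
            walk(successor, path + [successor], min(support, coverage))

    walk(source, [source], float("inf"))
    return results


# Toy gene with four exons; exon E2 is skipped in one isoform.
splicing_graph = {
    "E1": {"E2": 30, "E3": 10},
    "E2": {"E3": 30},
    "E3": {"E4": 40},
}
for path, support in enumerate_transcripts(splicing_graph, "E1", "E4"):
    print("-".join(path), support)
```

Exhaustive enumeration grows quickly with the number of alternative splicing events, which is one reason assemblers select a covering set of paths instead of reporting all of them.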

3.4 Gene count generation

3.4.1 featureCounts

Next-generation sequencing technologies produce vast amounts of short reads that are typically aligned to a reference genome. One crucial aspect of downstream analysis involves determining the number of reads associated with each genomic feature, such as exons or genes. This process, known as read summarization, is essential for various genomic analyses but has not been extensively explored in the literature. A notable tool for read summarization is featureCounts, designed to count reads from both RNA and genomic DNA sequencing experiments. This program employs efficient chromosome hashing and feature blocking techniques, resulting in significantly faster performance (approximately tenfold improvement for gene-level summarization) and reduced memory requirements compared to existing methods. featureCounts is versatile, accommodating single- or paired-end reads and offering a range of options tailored to different sequencing applications [32].

featureCounts takes aligned reads in SAM or BAM format and genomic features in GFF or SAF format as input. The read input format is automatically detected, and both the read alignment and the feature annotation should correspond to the same reference genome. SAM or BAM files provide detailed alignment information, including the chromosome to which each read maps and the specifics of the alignment. Genomic features, specified in GFF or SAF format, include information such as feature identifier, chromosome name, start and end positions, and strand. The tool supports strand-specific read counting when strand information is provided. It handles any number of reference sequences and counts either individual features or groups of features (meta-features), such as genes. Both paired and unpaired reads are supported; for paired reads, fragments are counted rather than individual reads.

featureCounts ensures precise read assignment by evaluating the mapping location of each base in a read or fragment against genomic features, accounting for gaps such as insertions, deletions, and exon–exon junctions. A hit is registered when the read or fragment overlaps a feature by 1 bp or more. A meta-feature receives a hit if the read overlaps any of its component features. Multi-overlap reads, which overlap more than one feature or meta-feature, can be excluded or counted depending on the experiment type. For RNA-seq, excluding multi-overlap reads is suggested, while counting them is recommended for ChIP-seq, considering potential regulatory effects on overlapping genes. Chromosome hashing facilitates quick matching of reference sequence names for efficient analysis [32].
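The assignment rules described above can be sketched in Python. In this simplified illustration, features are grouped by chromosome in a dictionary (a stand-in for chromosome hashing), a read hits a gene when it overlaps any of the gene's exons by at least 1 bp, and reads that hit more than one gene are excluded unless requested otherwise. Reads are reduced to a single interval with 1-based inclusive coordinates, so gapped alignments, strandedness, and paired-end fragments are not handled; all names and coordinates are invented, and this is not the featureCounts implementation.

```python
from collections import defaultdict


def build_feature_index(features):
    """Group (gene_id, chrom, start, end) features by chromosome name."""
    index = defaultdict(list)
    for gene_id, chrom, start, end in features:
        index[chrom].append((start, end, gene_id))
    return index


def count_reads(reads, features, count_multi_overlap=False):
    """Count reads per gene (meta-feature).

    Coordinates are assumed to be 1-based and inclusive. A read hits a gene
    when it overlaps any exon of that gene by at least one base.
    """
    index = build_feature_index(features)
    counts = {gene_id: 0 for gene_id, _chrom, _start, _end in features}
    for chrom, read_start, read_end in reads:
        hits = {
            gene_id
            for start, end, gene_id in index.get(chrom, [])
            if read_start <= end and start <= read_end
        }
        if len(hits) == 1 or (hits and count_multi_overlap):
            for gene_id in hits:
                counts[gene_id] += 1
    return counts


# Invented annotation: geneA has two exons, one of which overlaps geneB.
features = [
    ("geneA", "chr1", 100, 200),
    ("geneA", "chr1", 300, 400),
    ("geneB", "chr1", 390, 500),
    ("geneC", "chr2", 100, 200),
]
reads = [
    ("chr1", 150, 180),  # inside geneA
    ("chr1", 200, 250),  # 1 bp overlap with geneA
    ("chr1", 395, 420),  # overlaps geneA and geneB
    ("chr1", 201, 299),  # intronic, no hit
    ("chr2", 50, 100),   # 1 bp overlap with geneC
    ("chr3", 1, 50),     # chromosome absent from the annotation
]
print(count_reads(reads, features))
print(count_reads(reads, features, count_multi_overlap=True))
```

The first call follows the suggestion for RNA-seq (multi-overlap reads excluded), while the second counts such reads for every gene they overlap.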

3.5 Differential gene expression

3.5.1 NOISeq

NOISeq is a non-parametric method for analyzing differential expression in RNA-seq data. By contrasting the number of reads for each gene in samples taken under the same condition, NOISeq generates a null or noise distribution of count changes. The change in count between two conditions for a given gene is then evaluated against this reference distribution to determine whether it is more likely noise or genuine differential expression. The method is implemented in two variants: NOISeq-real uses replicates to compute the noise distribution when they are available, while NOISeq-sim simulates them when replication is not available [33].
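The idea of scoring a gene against an empirical noise distribution can be sketched as follows. The Python example summarizes each comparison by a log2 ratio (M) and an absolute difference (D), builds the noise distribution from two replicates of the same condition, and reports the fraction of noise points that a gene's between-condition change exceeds in both measures. It is a heavily simplified illustration in the spirit of NOISeq-real, assuming already normalized counts, a single replicate pair, and a pseudocount of 0.5; the counts and function names are invented and this is not the NOISeq package.

```python
import math


def m_and_d(count_a, count_b, pseudocount=0.5):
    """Return the log2 ratio (M) and absolute difference (D) of two counts.

    Assumption: counts are already normalized; the pseudocount avoids log(0).
    """
    m = math.log2((count_a + pseudocount) / (count_b + pseudocount))
    d = abs(count_a - count_b)
    return m, d


def noise_distribution(replicate_a, replicate_b):
    """M and D values between two replicates of the same condition."""
    return [m_and_d(replicate_a[gene], replicate_b[gene]) for gene in replicate_a]


def de_probability(count_cond1, count_cond2, noise):
    """Fraction of noise points with both |M| and D below the gene's values."""
    m, d = m_and_d(count_cond1, count_cond2)
    below = sum(
        1 for noise_m, noise_d in noise if abs(noise_m) < abs(m) and noise_d < d
    )
    return below / len(noise)


# Invented counts for two replicates of one condition.
replicate_1 = {"g1": 100, "g2": 50, "g3": 10, "g4": 200}
replicate_2 = {"g1": 110, "g2": 45, "g3": 12, "g4": 190}
noise = noise_distribution(replicate_1, replicate_2)

print(de_probability(400, 50, noise))   # large change between conditions
print(de_probability(100, 104, noise))  # change within replicate noise
```

A gene is then called differentially expressed when this fraction exceeds a chosen cutoff, so no parametric model of the counts is required.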

3.6 Automated pipelines

3.6.1 TALON

It is well known that alternative splicing regulates gene expression and plays a significant role in both healthy development and disease states. Although short-read RNA-seq is accurate and cost-effective for quantification, it cannot determine full-length transcript isoforms, even with increasingly powerful computational techniques. Long-read sequencing platforms, such as those from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), avoid the difficulties of short-read transcript reconstruction. TALON is the ENCODE4 pipeline for platform-independent analysis of long-read transcriptomes. For both simple studies and larger projects, TALON can track known and novel transcript models, as well as their expression levels, across datasets. These features allow TALON users to overcome the limits of short-read data and to perform isoform detection and quantification uniformly on current and future long-read platforms [34].

3.6.2 TCC-GUI

Differential expression (DE) analysis of RNA-Seq count data is a critical stage in the workflow. The TCC R/Bioconductor package was developed for this purpose. Although the package has the distinctive ability to include a robust normalization mechanism, only R users have been able to use it so far, so non-R users need an alternative route to TCC-based DE analysis. TCC-GUI is developed in R and packaged as a Shiny application. It includes all of the key TCC features, such as robust normalization for DE pipelines and the creation of simulation data under varied scenarios. Additionally, it includes (i) tools for exploratory analysis, such as the average silhouette score, (ii) visualization tools such as the volcano plot and heatmap with hierarchical clustering, and (iii) a reporting tool that uses R Markdown [35]. Table 2 summarizes the traditional and current tools, together with their run times and memory usage.
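As an illustration of what a volcano plot encodes, the following Python sketch converts DE results into plot coordinates and labels. It is not TCC-GUI code (TCC-GUI is an R/Shiny application); the function name `volcano_points`, the cutoffs, and the toy results are assumptions chosen for the example.

```python
import math

def volcano_points(results, lfc_cutoff=1.0, p_cutoff=0.05):
    """Turn (gene, log2 fold change, p-value) rows into volcano coordinates.

    x is the log2 fold change, y is -log10(p-value). The cutoffs are
    illustrative defaults, not values taken from TCC-GUI.
    """
    points = []
    for gene, log2fc, p_value in results:
        y = -math.log10(p_value)  # requires p_value > 0
        if p_value < p_cutoff and log2fc >= lfc_cutoff:
            label = "up"
        elif p_value < p_cutoff and log2fc <= -lfc_cutoff:
            label = "down"
        else:
            label = "ns"  # not significant or change too small
        points.append((gene, log2fc, y, label))
    return points

# Hypothetical DE results.
results = [
    ("g1", 2.5, 0.001),   # strong, significant increase
    ("g2", -1.8, 0.01),   # significant decrease
    ("g3", 0.3, 0.0001),  # significant but small change
    ("g4", 3.0, 0.4),     # large change, not significant
]
points = volcano_points(results)
labels = [label for _, _, _, label in points]
```

Plotting x against y with one color per label gives the familiar volcano shape, with genes of interest in the upper left and upper right corners.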

| Tool usage | Tool name | Key feature | Run time | Memory usage (RAM) | Input | Output |
|---|---|---|---|---|---|---|
| Alignment | HISAT2 | Provides greater alignment accuracy for reads with SNPs | 8.28 ms per read | 5 GB | FASTQ | BAM |
| Alignment | STAR | Different parts of a read can be aligned to different genomic positions on the reference | 2 hrs | 27 GB | FASTQ | BAM |
| Alignment | TopHat2 | Fast splice junction mapper | 40 mins | 4 GB | FASTQ | BAM |
| Gene count | HTSeq | Provides the total gene counts | 22.7 mins | 101 MB | BAM | Count table |
| Gene count | featureCounts | Provides the total gene counts | 1 min | 16 MB | BAM | Count table |
| Differential expression | DESeq2 | Provides a function to collapse technical replicates, combining their counts into a single column of the count matrix | 15–20 mins | 4 GB/thread | BAM | Expression values |
| Differential expression | Ballgown | Offers a variety of quick, easy statistical techniques to determine whether transcripts differ in expression depending on the conditions of an experiment or a continuous covariate | 20–25 mins | Comparatively more memory usage | BAM | Expression values |
| Differential expression | Cuffdiff | Produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes | 30 mins | 72 GB | BAM | Expression values |
| Differential expression | Cufflinks | Takes the aligned reads from two or more conditions and reports genes and transcripts that are differentially expressed using a rigorous statistical analysis | 30 mins | 72 GB | BAM | Expression values |
| Differential expression | NOISeq | Provides differential expression of genes without replicates | 15–20 mins | 8 GB | BAM | Expression values |
| De novo assembly | Trinity | To extract full-length splicing isoforms and separate transcripts originating from paralogous genes, Trinity divides the sequencing data into several unique de Bruijn graphs, each of which represents the transcriptional complexity at a particular gene or locus | 2 hrs/million bp | 20 GB | FASTQ | Contigs |
| De novo assembly | rnaSPAdes | Captures transcripts and produces high-quality contigs | 1 hr | 512 MB/thread | FASTQ | Contigs |
| De novo assembly | BinPacker | Able to assemble splice junctions | 1 hr | Less memory usage | FASTQ | Contigs |

Table 2.

Comparison of different transcriptome analysis tools.

4. Concluding remarks

Computational transcriptomics is now an essential technique for functional genomics and related fields. Its success as a mature discipline depends heavily on high-throughput methods such as cDNA microarrays and RNA sequencing. Bioinformatics offers a variety of databases, programs, and automated pipelines for processing and statistically analyzing high-throughput data. To help readers draw biological meaning from experimental data, this chapter presents an overview of the key computational components used for RNA-seq data processing. The chapter also gives a thorough overview of the tools and databases that are often used in transcriptomics research. It can be concluded that recent advances in transcriptome analysis have driven a new surge of activity in the next-generation sequencing field, encouraging researchers to explore transcriptomics and its applications in solving complex biological problems.

Conflict of interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

  1. Cobb M. 60 years ago, Francis Crick changed the logic of biology. PLoS Biology. 2017;15(9):e2003243. DOI: 10.1371/journal.pbio.2003243
  2. Mattick JS. Challenging the dogma: The hidden layer of non-protein-coding RNAs in complex organisms. BioEssays. 2003;25(10):930-939
  3. Evans JS. Principles of molecular biology and biomacromolecular chemistry. Reviews in Mineralogy and Geochemistry. 2003;54(1):31-56
  4. Ratti M, Lampis A, Ghidini M, Salati M, Mirchev MB, Valeri N, et al. MicroRNAs (miRNAs) and long non-coding RNAs (lncRNAs) as new tools for cancer therapy: First steps from bench to bedside. Targeted Oncology. 2020;15(3):261-278. DOI: 10.1007/s11523-020-00717-x
  5. Amaro A, Petretto A, Angelini G, Pfeffer U. Advancements in omics sciences. In: Translational Medicine. Cambridge, Massachusetts, United States: Academic Press; 2016. pp. 67-108
  6. Ramsköld D, Wang ET, Burge CB, Sandberg R. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Computational Biology. 2009;5(12):e1000598
  7. Lavanya C, Upadhyaya A, Neogi AG, Niranjan V. Identification of novel regulatory pathways across normal human bronchial epithelial cell lines (NHBEs) and peripheral blood mononuclear cell lines (PBMCs) in COVID-19 patients using transcriptome analysis. Informatics in Medicine Unlocked. 2022;31:100979
  8. Lowe R, Shirley N, Bleackley M, Dolan S, Shafee T. Transcriptomics technologies. PLoS Computational Biology. 2017;13(5):e1005457
  9. Wang Z, Gerstein M, Snyder M. RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009;10(1):57-63. DOI: 10.1038/nrg2484
  10. Patel RK, Jain M. NGS QC toolkit: A toolkit for quality control of next generation sequencing data. PLoS One. 2012;7(2):e30619
  11. Leggett RM, Ramirez-Gonzalez RH, Clavijo BJ, Waite D, Davey RP. Sequencing quality assessment tools to enable data-driven informatics for high throughput genomics. Frontiers in Genetics. 2013;4:288. DOI: 10.3389/fgene.2013.00288
  12. Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114-2120
  13. Liu X, Yan Z, Wu C, Yang Y, Li X, Zhang G. FastProNGS: Fast preprocessing of next-generation sequencing reads. BMC Bioinformatics. 2019;20(1):345. DOI: 10.1186/s12859-019-2936-9
  14. Schatz MC, Delcher AL, Salzberg SL. Assembly of large genomes using second-generation sequencing. Genome Research. 2010;20(9):1165-1173. DOI: 10.1101/gr.101360.109
  15. Bushmanova E, Antipov D, Lapidus A, Prjibelski AD. rnaSPAdes: A de novo transcriptome assembler and its application to RNA-Seq data. GigaScience. 2019;8(9):giz100
  16. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology. 2011;29(7):644-652. DOI: 10.1038/nbt.1883
  17. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, et al. A survey of best practices for RNA-seq data analysis. Genome Biology. 2016;17(1):1-19
  18. Kim D, Langmead B, Salzberg SL. HISAT: A fast spliced aligner with low memory requirements. Nature Methods. 2015;12(4):357-360. DOI: 10.1038/nmeth.3317
  19. Kukurba KR, Montgomery SB. RNA sequencing and analysis. Cold Spring Harbor Protocols. 2015;2015(11):951-969. DOI: 10.1101/pdb.top084970
  20. Abbas-Aghababazadeh F, Li Q, Fridley BL. Comparison of normalization approaches for gene expression studies completed with high-throughput sequencing. PLoS One. 2018;13(10):e0206312. DOI: 10.1371/journal.pone.0206312
  21. Zhao S, Ye Z, Stanton R. Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. RNA. 2020;26(8):903-909. DOI: 10.1261/rna.074922.120
  22. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15(12):1-21
  23. Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT. Ballgown bridges the gap between transcriptome assembly and expression analysis. Nature Biotechnology. 2015;33(3):243-246. DOI: 10.1038/nbt.3172
  24. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols. 2012;7(3):562-578. DOI: 10.1038/nprot.2012.016
  25. Li W, Jiang T. Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics. 2012;28(22):2914-2921
  26. Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, et al. NCBI GEO: Mining millions of expression profiles—Database and tools. Nucleic Acids Research. 2005;33(suppl_1):D562-D566
  27. Alaimo S, Di Maria A, Shasha D, Ferro A, Pulvirenti A. TACITuS: Transcriptomic data collector, integrator, and selector on big data platform. BMC Bioinformatics. 2019;20(9):1-11
  28. Athar A, Füllgrabe A, George N, Iqbal H, Huerta L, Ali A, et al. ArrayExpress update–from bulk to single-cell expression data. Nucleic Acids Research. 2019;47(D1):D711-D715
  29. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. DOI: 10.1093/bioinformatics/bts635
  30. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology. 2013;14(4):1-13
  31. Liu J, Li G, Chang Z, Yu T, Liu B, McMullen R, et al. BinPacker: Packing-based de novo transcriptome assembly from RNA-seq data. PLoS Computational Biology. 2016;12(2):e1004772. DOI: 10.1371/journal.pcbi.1004772
  32. Liao Y, Smyth GK, Shi W. featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923-930
  33. Tarazona S, Furió-Tarí P, Turrà D, Pietro AD, Nueda MJ, Ferrer A, et al. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Research. 2015;43(21):e140
  34. Wyman D, Balderrama-Gutierrez G, Reese F, Jiang S, Rahmanian S, Forner S, et al. A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. bioRxiv. 2019;672931
  35. Su W, Sun J, Shimizu K, Kadota K. TCC-GUI: A Shiny-based application for differential expression analysis of RNA-Seq count data. BMC Research Notes. 2019;12(1):1-6
