Comparison of commonly used methods for gene and transcriptome analysis.
RNA sequencing is a valuable tool brought about by advances in next generation sequencing (NGS) technology. Initially used for transcriptome mapping, it has grown to become one of the ‘gold standards’ for studying molecular changes that occur in niche environments or within and across infections. It employs high-throughput sequencing with many advantages over previous methods. In this chapter, we review the experimental approaches of RNA sequencing from isolating samples all the way to data analysis methods. We focus on a number of NGS platforms that offer RNA sequencing with each having their own strengths and drawbacks. The focus will also be on how RNA sequencing has led to developments in the field of host-pathogen interactions using the dual RNA sequencing technique. Besides dual RNA sequencing, this review also explores the application of other RNA sequencing techniques such as single cell RNA sequencing as well as the potential use of newer techniques like ‘spatialomics’ and ribosome-profiling in host-pathogen interaction studies. Finally, we examine the common challenges faced when using RNA sequencing and possible ways to overcome these challenges.
- next generation sequencing
- systems biology
- host-pathogen interactions
1.1 RNA sequence profiling
RNA sequencing (most commonly abbreviated as RNA-Seq) is an advanced sequencing approach that has transformed the way we look at the intricacies that exist within complex biological systems. Using high-throughput next generation sequencing (NGS) technology, RNA-Seq allows the detection and quantification of RNA transcripts in a biological sample with high accuracy. Further analysis of RNA-Seq data can reveal a wide range of information, including alternatively spliced transcripts, gene fusions, single nucleotide polymorphisms (SNPs), post-transcriptional modifications and temporal fluctuations in RNA expression across cells during infection [2, 3, 4, 5]. This extensive capability of RNA-Seq has also recently found its way into studies investigating host-pathogen interaction networks, with hopes of further elucidating this multi-faceted system [6, 7].
One of the earliest papers describing the term ‘RNA-Seq’ successfully mapped the transcriptome of the yeast genome using a high-throughput sequencing platform. In fact, a handful of studies had already started using the RNA-Seq method even before the term was coined [9, 10, 11, 12, 13]. Commonly referred to as ‘transcriptome sequencing’, these studies mainly adopted the massively parallel pyro-sequencing technology, which was one of the newer sequencing technologies at the time. While DNA sequencing and genomic studies have led to many breakthroughs, RNA-Seq brings forth a more functional, integrated view of expressed genes with distinct advantages over previous methods. Different aspects of RNA-Seq will be discussed in the following sections, leading to its role in unravelling host-pathogen interaction networks.
2. Introduction to RNA Seq approaches in biology and medicine
Transcriptomics is an area that is being continuously developed, especially with recent advances in technology that make it easier to carry out large-scale analysis of RNA. Prior to the use of RNA-Seq, traditional methods used to study transcriptomes included hybridization-based, sequence-based and tag-based approaches. A popular hybridization-based approach is the use of microarrays. The main principle behind microarrays is complementary binding of nucleotides. A microarray or ‘gene chip’ is prepared containing thousands of different oligonucleotides or cDNA molecules. Extracted RNA samples converted into cDNA are fluorescently labelled and allowed to hybridise on the microarray. This approach has proven to be useful in studies looking to compare levels of gene expression, but it does not generate quantitative values and can only be used for known genes. A related method called the genome tiling array, however, can examine genomic regions without prior knowledge of their expression. As with any method scrutinised over time, the pitfalls of microarrays stem from inconsistent protocols, high background noise due to cross-hybridization, low technical reproducibility and other technical issues [20, 21].
As for sequence-based approaches, a method used early on for gene discovery was expressed sequence tags (ESTs), which are single-pass sequence reads selected from cDNA libraries. Aside from being expensive, the single-pass reads produced using this method are more prone to error and likely to contain redundancies in large datasets. On the other hand, tag-based approaches like serial analysis of gene expression (SAGE) and massively parallel signature sequencing (MPSS) employ the principle of generating short ‘tags’ (9–20 base pairs) which are then sequenced and quantified on a large scale [24, 25]. Both methods make use of bead-based technology and produce accurate quantitative levels of gene expression, but mostly focus on the 3′-ends. Cap analysis of gene expression (CAGE) was then introduced to examine 5′-end short tag sequences, revealing more information about promoters and transcription start sites. Altogether, these relatively costly methods were common during the Sanger sequencing era and could only be used optimally in conjunction with already known genome or EST databases. In addition, these approaches had limitations such as cloning biases and technical challenges, and were generally not robust enough to serve as stand-alone approaches for transcriptome analysis [28, 29].
After decades of utilising Sanger sequencing, the development of Next Generation Sequencing (NGS) was a giant leap for researchers everywhere. NGS technologies have developed constantly, so they can be more distinctly categorised as second-, third- and even fourth-generation sequencing. Second generation sequencing mainly consists of two methods, which are sequencing by hybridization (SBH) and sequencing by synthesis (SBS). SBH was the main principle behind microarray technology using known DNA sequences, as explained previously. Meanwhile, SBS differs from Sanger sequencing because dideoxy terminators are not used. In addition, it employs repeated cycles of nucleotide incorporation and tiny-volume reactions that are run massively in parallel. Most second generation methods rely on sequencing reactions that take place in micro wells or channels. One of the most common second generation sequencing technologies was developed by Illumina and produces short read lengths. On the other hand, third- and fourth-generation sequencing technologies are more focused on producing longer read lengths. These technologies have creatively exploited the principle of sequencing reactions occurring in millions of tiny wells, either in specially engineered chambers or in biological nanopores. The front runners of third- and fourth-generation sequencing are currently Pacific Biosciences and Oxford Nanopore Technologies. Their technologies will be discussed in the coming sections. Also known as deep sequencing, these high-throughput sequencing technologies eventually led to the development of next generation RNA-Seq. Originally described by Nagalakshmi et al., preliminary RNA-Seq studies focused on improving genomic annotation by examining novel untranslated regions, promoter regions, intergenic transcripts, alternative gene splicing events and single nucleotide polymorphisms (SNPs), among others [31, 32, 33, 34, 35].
Advances in next generation RNA-Seq have allowed diverse studies spanning areas like diagnosis of genetic conditions, characterisation of immune microenvironments, understanding of cellular frameworks and viral genetics [36, 37, 38, 39, 40]. Table 1 shows a comparison of RNA-Seq with some of the main methods used to study the transcriptome.
|Feature|Microarray|SAGE|RNA-Seq|
|---|---|---|---|
|Type of method|Hybridization-based|Tag-based|cDNA library preparation & high-throughput sequencing|
|Amount of input material|High|High|Low|
|Data analysis|Based on relative intensity|Based on amplified SAGE tag counts|Based on amplified & sequenced cDNA fragments producing raw read counts|
|Detection of novel genes/transcripts|No|Limited|Yes|
|Detection of alternatively spliced isoforms|Limited|No|Yes|
|Detection of single nucleotide polymorphisms|No|No|Yes|
|Detection of non-coding transcripts|Limited|Limited|Yes|
|Prior knowledge of gene sequence|Yes|Limited|No|
2.1 Experimental flow in approaches
The flow chart in Figure 1 shows the initial steps involved when carrying out an RNA-Seq experiment.
The first step in an RNA-Seq experiment is to isolate RNA from a biological sample (e.g. cell or tissue populations). As a quality control step, the integrity of extracted RNA samples is commonly measured using an Agilent Bioanalyzer. Based on electrophoretic separation of RNA and a built-in software algorithm, it produces an RNA Integrity Number (RIN) indicating the level of RNA degradation. The next step involves either an enrichment or a depletion procedure to select specific RNA species. Any given total RNA sample contains a variety of RNA species, including messenger RNAs, ribosomal RNAs, precursor RNAs and non-coding RNAs. The bulk of the RNA (~95%) in most cells consists of rRNA which, if not removed, would make up a large part of the sequencing reads. Since this would largely restrict the study of less-abundant RNA species, protocols were created to circumvent this issue.
One such protocol is the enrichment of polyadenylated (poly-A) RNAs. This procedure selects for poly(A)+ RNA, mainly mRNA, and exploits the fact that rRNAs generally lack this structure. One study did, however, find rRNA polyadenylation, but only in small amounts. This selection step can be carried out using magnetic beads coated with oligo-dT or by reverse transcription (RT) using oligo-dT primers. An alternative step is rRNA depletion, which eliminates rRNAs from total RNA samples. Different researchers use various approaches for this. One such approach uses probes such as biotinylated DNA or locked nucleic acid, which are allowed to hybridise to rRNAs. This is followed by a depletion step using streptavidin beads. Another method that can be used for rRNA depletion is known as probe-directed degradation (PDD). This method involves obtaining cDNA:RNA duplexes, circularising them and then hybridising them with rRNA-specific probes. The final step involves digestion with Duplex-Specific Nuclease (DSN), which renders the hybridised sequences unusable. Some researchers also use not-so-random (NSR) primers that bind to specific RNA molecules during RT, excluding rRNAs. In essence, the various methods that exist for rRNA depletion focus on unique features of rRNA that can be singled out and developed into an elimination step. The choice between poly(A)+ selection and rRNA depletion ultimately depends on the aims of the experiment. An evaluation of these two methods showed that while rRNA depletion could record more unique characteristics of the transcriptome, poly(A)+ selection was more accurate in terms of gene quantification.
Following poly(A)+ enrichment or rRNA depletion, RNA samples need to be fragmented into shorter sequences according to the size restrictions of sequencing platforms. RNAs are usually fragmented using alkaline solutions, divalent cations or enzymes. Alternatively, RNA can be reverse transcribed first, followed by cDNA fragmentation. Similarly, enzymes like DNases can be used to fragment cDNA, with recent advances including a transposon-based approach. Next, either fragmented RNAs or cDNAs are ligated with adapters that are specific to the sequencing platform to be used. This step, however, overlooks RNA directionality, whereby information about DNA strands and their corresponding sense RNA strands is lost. This may impede the identification of novel RNA species and also make it harder to accurately measure sense RNA expression. Methods have been developed to preserve this directionality, and they can be carried out directly on fragmented RNA, on cDNA or even on RNA:cDNA hybrids that are formed during RT. One of these approaches involves adding distinct adapters to the 5′ and 3′ ends of fragmented RNA. This difference in sequences at the two ends preserves the strandedness of RNA. Other methods to preserve strand-specificity of RNA are BrAD-Seq and the Peregrine method. The BrAD-Seq method exploits the transient strand separation or ‘breathing’ of the RNA:cDNA hybrid during reverse transcription to add an adapter to the 5′ end of the duplex. This is followed by incorporation of nucleotides by
Finally, after cDNA synthesis and adapter ligation, cDNA libraries need to be amplified using PCR. Once amplified, they are ready for sequencing using a chosen NGS sequencing technology.
3. Next-generation sequencing technologies
3.1 Illumina, second generation sequencing technology
In 2005, Solexa released the Genome Analyser, which established a quality standard for the sequencing platforms that came after. Solexa was acquired by Illumina in 2007, which continued developing second-generation sequencing platforms for specific aims. The strategy behind Illumina’s sequencing process is a four-colour reversible termination sequencing method. After clonal amplification of DNA, sequencing occurs through successive base incorporation onto the template strand, followed by washing, imaging and cleavage. In this method, the polymerisation reaction is halted using fluorescently-labelled dNTPs and unincorporated bases are removed. Final analysis is carried out on the obtained four-colour images to ascertain base composition. Currently, Illumina provides an impressive number of sequencing platforms, which include MiniSeq, MiSeq, NextSeq 550 and NovaSeq 6000, among others. NextSeq 500 was discontinued with the introduction of NextSeq 550, which has more flexible features for microarray scanning and sequencing. Their newest sequencing systems, NextSeq 1000 and 2000, boast an integrated cartridge containing fluidics, waste compartment and reagents. They also possess a novel system that takes advantage of super resolution optics, resulting in higher sensitivity and increased accuracy of imaging data.
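The four-colour read-out described above can be illustrated with a toy base caller. This is a minimal sketch and not Illumina's actual pipeline: it assumes that each sequencing cycle yields one intensity per channel for a cluster, the channel order A, C, G, T and the purity threshold are hypothetical choices, and real-world corrections such as cross-talk and phasing are ignored.

```python
# Toy four-colour base caller: one intensity per channel per cycle.
# The channel order and the purity threshold are illustrative assumptions.
CHANNELS = "ACGT"


def call_bases(cycles, min_purity=0.6):
    """Call one base per cycle from four channel intensities.

    cycles: list of 4-tuples (A, C, G, T intensities) for one cluster.
    Returns the read as a string; ambiguous cycles are called 'N'.
    """
    read = []
    for intensities in cycles:
        total = sum(intensities)
        brightest = max(range(4), key=lambda i: intensities[i])
        # Purity is the share of the total signal in the brightest channel.
        if total > 0 and intensities[brightest] / total >= min_purity:
            read.append(CHANNELS[brightest])
        else:
            read.append("N")
    return "".join(read)


example_cycles = [
    (900, 40, 30, 30),     # clear A signal
    (20, 50, 880, 50),     # clear G signal
    (300, 280, 210, 210),  # mixed signal, no confident call
    (10, 15, 5, 970),      # clear T signal
]
print(call_bases(example_cycles))
```

The mixed third cycle is reported as `N` rather than forced to the brightest channel, which mirrors the general idea that low-confidence positions should be flagged rather than guessed.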
3.2 Pacific Biosciences, third generation sequencing technology
The single-molecule real-time sequencing (SMRT) method is a third-generation sequencing approach developed by Pacific Biosciences (PacBio). This method directly observes DNA or cDNA synthesis by DNA polymerase as it occurs in real time. The principle behind this method is the use of zero-mode waveguide (ZMW) technology. A ZMW is essentially a tiny, zeptoliter-sized hole deposited slightly above a glass surface. Within each ZMW is a chamber containing a single DNA polymerase molecule affixed to the bottom glass surface using a biotin/streptavidin system. Fluorophore-labelled nucleotides are added to the compartment above an array of ZMWs. Diffusion then occurs, whereby labelled nucleotides travel downwards through the ZMW to reach the DNA polymerase for incorporation onto the DNA strand. The ZMW system is sufficiently sensitive to detect incorporations against background nucleotides. In addition, one of the first commercially available sequencing systems employing SMRT contains an assembly of ~75000 ZMWs. Therefore, single-molecule sequencing can be carried out massively in parallel. PacBio also offers the Iso-Seq method, which analyses long reads produced by SMRT to examine novel transcripts, gene fusions, alternative splicing, etc. Their newest release is the Sequel IIe system, which promises higher quality data, shorter analysis time and lower costs.
3.3 Oxford Nanopore Technologies, fourth generation sequencing technology
As suggested by their name, Oxford Nanopore Technologies (ONT) developed and commercialised nanopore-based sequencing. The idea behind this strategy is that each nucleotide can induce a unique fluctuation in ionic current while passing through a tiny channel . An α-hemolysin pore secreted by
3.4 Other genome analysers
Roche 454 pyrosequencing was the first commercially successful second generation sequencing platform, initially developed by 454 Life Sciences and later acquired by Roche. Sequencing by this platform depended on the detection of visible light produced by a group of enzymes in response to pyrophosphate release during nucleotide incorporation. Roche, however, stopped supplying the 454 sequencing machines and any accompanying reagents in 2016. Another NGS instrument is Sequencing by Oligonucleotide Ligation and Detection (SOLiD), released by Applied Biosystems Instruments (ABI). This technology uses sequencing by ligation. It involves cycles of annealing and ligation of primers and probes. Four-colour imaging is also carried out, after which ligated probes are cleaved to allow another cycle of ligation. Despite being quite accurate, it has a long run time and requires experts to analyse raw data. Furthermore, another sequencing approach called DNA nanoball sequencing was developed by Complete Genomics and later acquired by Beijing Genomics Institute (BGI). This approach combines the principles of hybridization and ligation. DNA nanoballs are produced by amplifying DNA or cDNA using rolling-circle replication. They are then added onto a flow cell with an array of wells, and the nanoball in each well is sequenced at high density. This process, however, only yields short reads and takes a long time. Meanwhile, Ion Torrent technology, introduced by the team behind the 454 sequencer, is based on the electronic detection of pH changes as opposed to the detection of light used previously. Each incorporated nucleotide generates an electronic signal detected by electronic sensors placed at the bottom of each flow cell. Lastly, a third generation sequencing platform called Helicos sequencing employs the principle of single-molecule fluorescent sequencing. The Helicos sequencer, Heliscope, does not require clonal amplification and uses a very sensitive fluorescence detection system. This method merges sequencing by synthesis and hybridization.
3.5 NGS advantages
All NGS platforms have significant advantages over previously used methods; however, each platform has its own strengths and unique features. The four major sequencing platforms currently in use are Illumina, Pacific Biosciences, Oxford Nanopore Technologies (ONT) and Ion Torrent. Both Illumina and Ion Torrent are highly accurate but they are relatively more costly and have short reads (≤ 400 bp). The problem with short read lengths is that they hinder de novo assembly and impede the detection of structural variations. On the other hand, PacBio and ONT platforms produce long reads (≥ 500 bp) but they have variable accuracies. Although ONT and PacBio have similar read lengths, ONT, specifically the MinION device, has higher error rates of up to 38.2%. ONT also produces a higher yield but PacBio has better data quality overall. All these platforms except ONT share a similar disadvantage, which is a long turnaround time. In addition, ONT also has lower capital costs compared to the others.
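Error rates such as the one quoted above are often compared on the Phred scale, Q = -10 log10(p), where p is the per-base error probability. The snippet below is a small illustrative conversion; the function names are hypothetical, and the resulting scores are only what the formula implies for the given rates, not platform specifications.

```python
import math


def error_rate_to_phred(p):
    """Convert a per-base error probability (0 < p <= 1) to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


def phred_to_error_rate(q):
    """Convert a Phred score back to a per-base error probability."""
    return 10 ** (-q / 10)


# A 1% error rate and the 38.2% upper bound mentioned in the text.
for label, p in [("1% error rate", 0.01), ("38.2% error rate", 0.382)]:
    print(f"{label}: Q = {error_rate_to_phred(p):.1f}")
```

On this scale every 10-point increase corresponds to a tenfold drop in error probability, which is why a difference between a few percent and a few tens of percent matters so much for downstream analysis.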
Illumina sequencing has error rates of <1%, and one of its systems, the NextSeq 550, employs a two-channel sequencing strategy instead of the four-channel strategy used by previous systems. This method only needs two images to detect nucleotides, which makes data processing much faster. However, a few studies found that PacBio sequence data produced better results than Illumina datasets specifically when used for
NGS technologies are also capable of producing either single-end or paired-end reads during sequencing. The question that normally arises is which type of sequencing to perform. Single-end sequencing in RNA-Seq is when a cDNA fragment is sequenced from only one end, whereas paired-end sequencing is when both ends of a fragment are sequenced. Paired-end sequencing produces twice the amount of data, which increases the accuracy of read alignment. It is also more sensitive and allows the detection of events like gene fusions and new splice isoforms. On the other hand, single-end sequencing is much cheaper than paired-end sequencing. It is also more suitable for some methods such as ChIP-Seq and small RNA-Seq. Although it is the more economical choice, it has drawbacks such as lower read counts per RNA feature and a weaker ability to assign reads to features. In the context of functional profiling, single- and paired-end reads in an RNA-Seq experiment only showed a 65% agreement in the top 20 gene ontology (GO) terms obtained. However, when looking at the top 300 GO terms, both led to similar broad conclusions. Since the cost of sequencing is an important consideration, Corley et al. suggested that single-end sequencing could be carried out with more biological replicates, as they found its results to be comparable to those obtained using paired-end sequencing if functional analysis is done cautiously. As mentioned before, the choice between single- and paired-end sequencing ultimately comes down to the research question. For instance, if the main objective of the experiment is transcriptome assembly, then paired-end sequencing would be the more suitable choice.
4. Application of systems biology in understanding host-pathogen interactions
Systems biology is the comprehensive study of a biological system encompassing molecular-level interactions, sub-cellular dynamics and overall physiological functions of cells, tissues and organs. A systems biology approach aims to look at the larger picture involved in a given system or condition. For a long time, research had centred on the molecular understanding of genes and proteins. Current illustrations or diagrams of interconnecting pathways are not enough to completely understand a system. Kitano et al. aptly describe these diagrams as mere static roadmaps, whereas what we seek to understand leans more toward patterns, their causes and regulatory dynamics. In the context of host-pathogen interactions, a systems biology view examines components from both the host and the pathogen as well as their interactions with one another. Some of the approaches used in systems biology include identification of key molecules or biomarkers, inference between networks and disease module discovery. The advancement of -omics technologies supported by high throughput sequencing has increased the number of whole-system analyses focusing on host-pathogen interactions between genes, proteins and small ligands. This is accomplished by carrying out dual RNA sequencing, whereby both host and pathogen transcriptomes are profiled during the course of an infection. Multiple cascades of events are triggered by an infection, and dual RNA-Seq allows the monitoring of host and pathogen in parallel. Knowledge gained from comprehensive host-pathogen interaction studies, especially with the use of dual RNA-Seq, can guide efforts toward better therapeutics against infection. Dual RNA-Seq was first described by Westermann et al.; however, it only started gaining attention recently, resulting in a surge of studies utilising this method.
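One possible way to realise the parallel profiling described above is to align reads against a combined host-plus-pathogen reference and then split the alignments by organism. The sketch below illustrates only that bookkeeping step under stated assumptions: the reference names, the `(read_id, reference_name)` record format and the function `partition_reads` are hypothetical, and a real workflow would read alignments from SAM/BAM files with a dedicated library.

```python
from collections import Counter, defaultdict


def partition_reads(alignments, host_refs, pathogen_refs):
    """Count reads per origin: host, pathogen, ambiguous or unassigned.

    alignments: iterable of (read_id, reference_name) pairs; a read may
    appear several times if it aligned to more than one location.
    """
    origins = defaultdict(set)
    for read_id, ref in alignments:
        if ref in host_refs:
            origins[read_id].add("host")
        elif ref in pathogen_refs:
            origins[read_id].add("pathogen")
        else:
            origins[read_id].add("other")

    counts = Counter()
    for labels in origins.values():
        if "host" in labels and "pathogen" in labels:
            # Reads hitting both organisms cannot be attributed confidently.
            counts["ambiguous"] += 1
        elif "host" in labels:
            counts["host"] += 1
        elif "pathogen" in labels:
            counts["pathogen"] += 1
        else:
            counts["unassigned"] += 1
    return counts


# Invented reference names and alignments, for illustration only.
host_refs = {"chr1", "chr2"}
pathogen_refs = {"pathogen_chromosome"}
alignments = [
    ("r1", "chr1"),
    ("r2", "chr1"),
    ("r2", "pathogen_chromosome"),
    ("r3", "pathogen_chromosome"),
    ("r4", "unplaced_contig"),
]
print(partition_reads(alignments, host_refs, pathogen_refs))
```

Keeping ambiguous reads in a separate category, rather than assigning them arbitrarily, avoids inflating either transcriptome; how such reads are ultimately handled is a design decision that depends on the study.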
4.1 Bacteria-host interaction
Interactions between bacteria and hosts usually begin with the attachment or adherence of bacteria to host cells, followed by internalisation, which may involve direct or indirect receptor binding. Entry into the host may seem like a straightforward step but it involves a drastic change in environment for the pathogen. Hence, entry and any subsequent mechanism employed are bound to involve a complex interplay between the host and pathogen. Previous methods were limited in the sense that they only allowed the analysis of mRNA in either infected host cells or bacteria. Dual RNA-Seq has given researchers access to the complete story. Some of the host-bacteria interaction studies utilising dual RNA-Seq have looked at bacteria infecting humans, such as
4.2 Virus-host interaction
Viruses are obligate intracellular parasites that manipulate various machinery and components of the host cell. The human body has developed efficient responses against viruses, particularly the interferon system. An antiviral state is induced by the family of interferon proteins and other effectors upon viral infection. However, over time, certain viruses have evolved mechanisms to evade these immune responses. Given their complex nature, viral infections most certainly involve multi-level interactions, and a method like dual RNA-Seq can help us understand these elaborate interaction networks. One of the first studies examining host-virus interactions using dual RNA-Seq was carried out using a murine infection model for cytomegalovirus (CMV). This study found some unexpected results, such as highly abundant viral transcripts with unknown functions and a viral transcript bearing functions of both non-coding RNA and mRNA. From the host perspective, the expected upregulation of genes involved in inflammation and immunity was observed. Unforeseen results included upregulation of genes associated with development and differentiation. More importantly, this study found many differentially expressed genes within specific biological pathways, including certain networks with unknown relevance to infection, providing new insights into CMV pathogenesis. Dual RNA-Seq has been applied to a range of studies analysing host-virus interactions, which include infections by avian influenza (H5N8), varicella zoster virus, Crimean-Congo hemorrhagic fever virus (CCHFV), influenza A (H3N2) and Zika virus. Similar to host-bacterial studies, a wide range of findings were uncovered, including variable alternative gene splicing events, association between clinical phenotypes and viral gene induction, remodelling of the host epidermal environment, inhibition of functional pathways, host metabolic regulation and many more.
Michlmayr et al. successfully identified CD169 (Siglec-1) on CD14+ monocytes as a potential biomarker of acute Zika virus infection while also providing evidence that dengue-immune patients did not necessarily have an upper hand when faced with Zika virus. Another interesting study by Wesolowska-Andersen et al. using dual RNA-Seq found that transcriptionally active respiratory viruses were present in children even in the absence of any observable respiratory illness. These viral carriers also displayed alterations in their nasal transcriptomes. This shows that underlying host-virus interaction networks can be engaged ‘silently’, and not only in cases where the illness clearly manifests itself. In due time, these studies will hopefully reveal horizontal inter-study patterns which will point toward the discovery of common disease modules or host-pathogen interaction networks. Furthermore, the discovery of a novel coronavirus in Hong Kong was achieved through a process of elimination using a series of laboratory tests and eventually genome sequencing. In addition to the discovery of novel pathogens, RNA-Seq analysis can provide information relating to genome sequence, gene expression and pathogen abundance, along with a myriad of other information that offers useful insight into the pathogen and how it causes disease. Currently, most RNA-Seq studies examining novel viruses are focused on plant viruses [107, 108]. The rapid detection of novel viruses in humans by RNA-Seq is an area that should be further investigated and optimised, as it can help us take precautionary steps before a disease spreads widely.
4.3 Fungi-host interaction
At least 712 000 fungal species are known to exist around the world; however, the total number of fungal species is estimated to be more than 1.5 million. The proportion of fungal species causing human diseases is comparatively small. Some of the most common opportunistic fungal pathogens are
4.4 Combination of pathogens and host interactions
Aside from the pathogens discussed above, other pathogens that exist include parasites, prions and, in rare cases, algae [121, 122, 123]. Parasites in particular have extremely complex life cycles involving different hosts at different life stages. A clear comprehension of parasitic life cycles will undoubtedly require a systems biology approach and RNA-Seq has provided an avenue for that. RNA-Seq studies have allowed inter-sex, inter-stage and inter-host studies involving parasites like
5. Bioinformatics and statistical approaches in analysing RNA-Seq data
The initial experimental workflow of RNA-Seq has been described earlier and, briefly, includes depletion of rRNA or enrichment of mRNA, fragmentation of samples and subsequent reverse transcription to form a cDNA library. These cDNA fragments are then sequenced using a high-throughput sequencing platform. This section will describe the analysis of RNA-Seq data, including the statistical approaches taken to analyse differentially expressed genes. The whole process is simplified in Figure 2, covering all the important analytical steps involved.
Once sequencing data is obtained in the form of raw reads, quality control and sequence filtering need to be carried out. This is a key pre-processing step because next-generation sequencing data may contain unexpected artefacts, poor quality reads, low-complexity regions, high GC content and sequencing errors [139, 140]. The presence of these low-quality sequences will affect downstream analysis, leading to inaccuracies in overall RNA-Seq data interpretation. There are a variety of tools that can be used to perform data pre-processing. Two important pre-processing concepts are the quality assessment of reads and processing/filtering to remove contaminants, adapter sequences and low-quality sequences. Some of the tools developed include FastQC, RSeQC, NGSQC, Trimmomatic and CutAdapt. Weaknesses of these tools include the inability to carry out both data quality control and processing steps, slow run times and single-platform services [147, 148]. Recently developed tools are more comprehensive, encompassing all steps required in raw read processing. Some of these include FastProNGS, FastqPuri, Zseq, RNA-QC-Chain and fastp.
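The two pre-processing concepts above, quality assessment and filtering, can be sketched in a few lines. The following is an illustrative toy and not a substitute for the tools listed: it assumes four-line FASTQ records with Phred+33 quality encoding, and the thresholds, the example adapter sequence and all function names are hypothetical.

```python
import io


def parse_fastq(handle):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def filter_reads(handle, adapter, min_mean_q=20, min_len=10):
    """Trim adapters, then keep reads passing length and quality cut-offs."""
    for header, seq, qual in parse_fastq(handle):
        seq, qual = trim_adapter(seq, qual, adapter)
        if len(seq) >= min_len and mean_quality(qual) >= min_mean_q:
            yield header, seq, qual


ADAPTER = "AGATCGGAAGAGC"  # example adapter, for illustration only
records = [
    ("@read1", "ACGTACGTACGT" + ADAPTER, "I" * 25),  # good read with adapter
    ("@read2", "ACGTACGTACGT", "#" * 12),            # low-quality read
    ("@read3", "ACG" + ADAPTER, "I" * 16),           # too short after trimming
]
fastq_text = "".join(f"{h}\n{s}\n+\n{q}\n" for h, s, q in records)
kept = list(filter_reads(io.StringIO(fastq_text), ADAPTER))
print(kept)
```

Only exact adapter matches are trimmed here; dedicated trimmers also handle partial and mismatched adapters, which is one reason to prefer them in practice.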
The next step is mapping or aligning the quality-assessed reads onto a genome or transcriptome. Reads can be mapped either uniquely to a single position or to multiple positions (multi-reads) in the reference genome. Some of the mapping software or algorithms available are STAR, TopHat2, MapSplice, BowTie2 and Magic-BLAST, among others. A range of benchmarking studies have compared the efficiencies of various RNA-Seq aligners. Baruzzo et al. examined 14 common RNA-Seq aligners, whereas Schaarschmidt et al. evaluated 7 alignment tools. In addition, Engström et al. carried out a comprehensive analysis of a total of 26 alignment protocols. A similarity across these three studies is that they all found STAR to be one of the more reliable aligners, although other aligners do have their own strengths. After alignment, transcript identification is carried out. Reads that are mapped onto known reference transcriptomes can only be used for quantification and not for novel transcript discovery. Meanwhile, reads mapped onto a reference genome can be identified as either known transcripts or alternative transcripts. For rapid discovery of novel transcripts, a popular programme called Cufflinks utilises existing annotated genomes as a reference to assist in transcript assembly. Other methods focusing on novel transcript identification are SLIDE, iReckon and StringTie. In the case where a reference genome is absent or incomplete, de novo transcript reconstruction is carried out. Reads are first assembled into longer contigs, which are then treated as the ‘reference transcriptome’ onto which the reads are mapped back for quantification purposes. Some of the tools available for de novo transcript assembly include Trinity, SOAPdenovo-Trans, TransABySS and Oases. Depending on the experiment, transcript identification and quantification can be carried out either simultaneously or sequentially.
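The distinction between uniquely mapped reads and multi-reads can be made concrete by counting how many distinct locations each read aligns to. This is a minimal sketch, assuming alignments are available as simple `(read_id, reference, position)` tuples; that record format and the function name are hypothetical, and real aligners report this information through fields in their SAM/BAM output.

```python
from collections import defaultdict


def classify_reads(alignments):
    """Split read IDs into uniquely mapped reads and multi-reads.

    alignments: iterable of (read_id, reference, position) tuples, one
    per reported alignment.
    """
    locations = defaultdict(set)
    for read_id, ref, pos in alignments:
        locations[read_id].add((ref, pos))

    unique = sorted(r for r, locs in locations.items() if len(locs) == 1)
    multi = sorted(r for r, locs in locations.items() if len(locs) > 1)
    return unique, multi


# Invented alignments, for illustration only.
toy_alignments = [
    ("r1", "chr1", 100),
    ("r2", "chr1", 250),
    ("r2", "chr3", 990),   # second location makes r2 a multi-read
    ("r3", "chr2", 42),
]
unique_reads, multi_reads = classify_reads(toy_alignments)
print("unique:", unique_reads)
print("multi:", multi_reads)
```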
One of the most frequent applications of RNA-Seq is estimating the abundance of gene or transcript expression. HTSeq-count and featureCounts are two gene-level quantification approaches, with HTSeq-count being specially designed for downstream differential expression analysis [168, 169]. These are ‘union exon’-based approaches whereby overlapping exons are merged to form a union exon. This method can assign reads to their respective genes with high confidence; however, difficulty arises when dealing with alternatively spliced transcripts. Because raw counts are biased by transcript length and by the total number of reads, within-sample normalisation methods are used to standardise them, with common measures being RPKM (reads per kilobase of exon model per million mapped reads), FPKM (fragments per kilobase of exon model per million mapped reads) and TPM (transcripts per million) [34, 139]. Besides union exon-based methods, several transcript-level statistical quantification methods also exist, such as RSEM, eXpress and TIGAR2. More recently, alignment-free methods such as Salmon, kallisto and Sailfish have also been developed.
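To make the within-sample measures concrete, the following Python sketch computes RPKM and TPM from a table of raw counts and feature lengths. The gene names, counts and lengths are invented for illustration, and the total of the supplied counts is used as a stand-in for the number of mapped reads, which is an assumption of this sketch.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of feature per million mapped reads.

    counts: dict mapping gene -> raw read count
    lengths_bp: dict mapping gene -> feature length in base pairs
    """
    total_reads = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total_reads) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then scale to 1e6."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] / scale * 1e6 for g in counts}


# Hypothetical data: counts are proportional to length, so read density is
# identical for all three genes and their RPKM/TPM values come out equal.
counts = {"geneA": 100, "geneB": 200, "geneC": 700}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 7000}

print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

TPM values always sum to one million within a sample, which is why they are easier to compare between samples than RPKM or FPKM; for paired-end data, FPKM follows the same formula as RPKM with fragments counted instead of reads.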
A crucial step before carrying out differential gene expression (DGE) analysis is data normalisation. The within-sample normalisation approaches applied during quantification are not sufficient where high numbers of differentially expressed transcripts exist. The software currently available for RNA-Seq differential gene expression analysis can be categorised into four main groups based on the statistical methods employed. These include (1) Poisson or negative binomial model-based methods – baySeq, DESeq, DESeq2, EBSeq, edgeR, NBPSeq, PoissonSeq, TSPM, (2)
|Author (Year)|Statistical methods compared|Data used|Main findings|
|Robles et al.|DESeq, edgeR, NBPSeq|Simulations using statistical models derived from real RNA-Seq data| |
|Soneson & Delorenzi|baySeq, DESeq, EBSeq, edgeR, NBPSeq, NOIseq, SAMseq, ShrinkSeq, TSPM, voom + limma, vst + limma|Simulations using statistical models derived from real RNA-Seq data| |
|Rapaport et al.|baySeq, Cuffdiff, DESeq, edgeR, limma, PoissonSeq|Benchmark datasets: SEQC dataset & ENCODE project data| |
|Zhang et al.|Cuffdiff2, DESeq, edgeR|Real RNA-Seq & simulated datasets: MAQC dataset (human), K_N dataset (mouse), LCL dataset (human)| |
|Seyednasrollah et al.|baySeq, Cuffdiff2, DESeq, EBSeq, edgeR, limma, NOIseq, SAMseq|Real mouse RNA-Seq and human RNA-Seq data| |
|Rajkumar et al.|Cuffdiff2, DESeq2, edgeR, TSPM|Real RNA-Seq data from mice amygdalae micro-punches| |
|Costa-Silva et al.|baySeq, DESeq, DESeq2, EBSeq, edgeR, limma + voom, NOIseq, SAMseq|Real RNA-Seq dataset produced for MAQC project| |
A common finding across these studies is that no single method is superior in all circumstances; each has its own strengths and weaknesses. Across the seven studies listed in Table 2, edgeR and DESeq were commonly found to perform better than other software, although a few studies reported contrasting results. Ultimately, the choice of statistical approach depends largely on the nature of the study, the type of biological sample, the number of replicates, the study budget and many other factors that need to be matched to the strengths of a particular approach.
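As an illustration of the between-sample normalisation that precedes DGE testing, the sketch below implements a median-of-ratios size factor calculation, the general strategy popularised by DESeq. It is a simplified stand-alone version written for this example, not the package's own code, and the count table is hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts, one entry per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero. At least one gene must be
    non-zero in every sample, otherwise the median is undefined.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue
        log_geo_mean = sum(math.log(k) for k in row) / n_samples
        for j, k in enumerate(row):
            log_ratios[j].append(math.log(k) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def normalise(counts):
    """Divide each sample's counts by its size factor."""
    sf = size_factors(counts)
    return {g: [k / s for k, s in zip(row, sf)] for g, row in counts.items()}


# Hypothetical table: sample 2 was sequenced twice as deeply as sample 1.
table = {"g1": [10, 20], "g2": [100, 200], "g3": [50, 100], "g4": [0, 5]}
print(size_factors(table))
print(normalise(table))
```

Because the size factor is a median over genes, a minority of strongly differentially expressed genes has little influence on it, which is the property that within-sample measures lack.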
The next step usually examines differential expression at the transcript level, that is, alternative splicing (AS) events. Many computational tools can infer AS events, including some of the previously mentioned methods. These include exon-based methods like DEXSeq and JunctionSeq, event-based methods like MAJIQ, dSpliceType and SUPPA2 and, lastly, isoform-based methods like Cuffdiff2 and DiffSplice. The final step is pathway enrichment analysis. The list of differentially expressed genes (DEGs) obtained is further analysed to characterise their molecular involvement in biological pathways. Some of the RNA-Seq-specific tools developed for this aim are GOSeq, Gene Set Variation Analysis (GSVA) and SeqGSEA. Annotation resources such as KEGG, Gene Ontology and Bioconductor also complement functional profiling of DEGs. This is an important step, particularly in host-pathogen interaction studies, for unravelling the interaction networks that exist. Common databases and software used by dual RNA-Seq studies examining host-pathogen interactions are Gene Ontology and KOBAS (KEGG Orthology-based Annotation System) [215, 216, 217]. Novel transcripts detected by de novo assembly can be functionally annotated by finding orthologous proteins in protein databases. Challenges arise when annotating non-protein-coding transcripts like long non-coding RNAs, which still lack proper functional annotation procedures.
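The core of a pathway enrichment analysis can be sketched as a one-sided hypergeometric (over-representation) test, which asks whether a pathway contains more DEGs than expected by chance. This is a generic textbook formulation rather than the method of any specific tool mentioned above; GOSeq, for example, additionally adjusts for gene length bias. The function name and the gene numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_pathway, n_degs, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_background: number of genes tested in the experiment
    n_in_pathway: background genes annotated to the pathway
    n_degs:       number of differentially expressed genes
    n_overlap:    DEGs that are annotated to the pathway
    """
    upper = min(n_in_pathway, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 of 200 DEGs fall in a pathway of 100 genes,
# out of 10,000 genes tested (2 would be expected by chance).
p = enrichment_p_value(10_000, 100, 200, 20)
print(f"p = {p:.3e}")
```

In practice the test is repeated for every pathway or GO term, so the resulting p-values need to be corrected for multiple testing before interpretation.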
6. Other applications of RNA-Seq in host-pathogen interaction studies
RNA-Seq can be applied in innovative ways to answer many of the questions posed by biology and disease. Initially, it was used for simpler research goals like profiling transcriptomes and monitoring gene expression. RNA-Seq technology has since developed rapidly, and one of its vital uses is characterising host-pathogen interaction networks. Dual RNA-Seq in particular has been applied to many infection models involving bacteria, viruses, fungi and parasites, as described in previous sections. Understanding the mechanics of pathogen-induced infection and the subsequent host response is a crucial step before clinical treatment strategies can be devised. Besides dual RNA-Seq, as extensively detailed earlier, another application of RNA-Seq is single-cell RNA sequencing (scRNA-Seq). The difference between bulk RNA-Seq and scRNA-Seq is that the latter allows transcriptional comparison of individual cells and can capture cellular heterogeneity that is normally obscured by bulk RNA-Seq. In the context of host-pathogen interaction studies, dual scRNA-Seq is commonly utilised. Unlike bulk sequencing of mixed cell populations, scRNA-Seq involves an extra step: isolating single cells from tissue samples using techniques like fluorescence-activated cell sorting (FACS), micro-dissection and droplet-based methods. While dual RNA-Seq provides insight into the bigger picture, dual scRNA-Seq can elucidate the smaller-scale interactions that sum up to produce the host outcome during infection.
It is common for bacteria to have distinct co-existing subpopulations owing to their dynamic adaptability. This heterogeneity can lead to phenotypic variation in infection, and scRNA-Seq is capable of characterising this variability. Avraham et al. examined individual macrophages infected with Salmonella typhimurium and found molecular variation despite what seemed to be identical infections in these cells. They discovered that the type I interferon response pathway is influenced by PhoPQ activity levels in the bacterium: host cells infected with a bacterium expressing high levels of PhoPQ had an increased type I interferon response. A similar study examined bone marrow-derived macrophages exposed to Salmonella using a method called scDual-Seq. From a time-dependent analysis of macrophage single-cell transcriptomes, the authors found that, among infected cells, some had fully induced immune responses while others had only ‘partially induced’ immune responses. They also found two intracellular classes of Salmonella with unique transcriptional signatures. One of their interesting findings is that the progression of infection from partially induced to fully induced immune responses also involves changes in Salmonella subpopulations. Meanwhile, scRNA-Seq has also been applied to host-viral interaction studies. In HIV infection, the virus can persist in latent reservoirs where it is not completely eradicated by treatments like antiretroviral therapy (ART). Golumbeanu et al. used scRNA-Seq to characterise the transcriptomes of latent and reactivated HIV-infected cells. They identified two main subpopulations, with one cell cluster being more predisposed to HIV reactivation. Their results provide interesting insights for the identification of potential latency-reversing agents and biomarkers for susceptible cells. However, the use of scRNA-Seq in host-pathogen interaction studies is still in its infancy. Many more questions can be answered using scRNA-Seq, such as the mechanism behind selective infection of host cells, the antibiotic tolerance of certain bacteria and the switch between active and latent infection in viruses.
Furthermore, scRNA-Seq has played a role in the development of human organoids from stem cells by assessing the similarity between these organoids and their primary tissue counterparts. In addition, scRNA-Seq can be used to characterise the development and maturation stages of stem cells into specific organ tissue, or even as a blueprint to direct the recreation of actual human organs [224, 225]. Moreover, scRNA-Seq can be used in conjunction with CRISPR-based gene editing to confirm activation or repression of target genes. Advancements in the application of scRNA-Seq in these research areas can provide valuable tools for host-pathogen interaction studies in the future. For instance, human organoids that closely resemble real organs could be used as infection models to study disease mechanisms.
Innovations in RNA-Seq methods driven by experimental needs have led to its application in various settings. Two of these methods are spatially resolved RNA-Seq, known as ‘spatialomics’, and ribosome profiling, which uses RNA-Seq to understand the translatome. Bulk RNA-Seq and scRNA-Seq do not provide spatial information, which could be crucial for comprehending cellular processes and how they relate to gene expression. The main concept behind spatialomics is in situ transcriptomics, which produces data within tissue sections using either sequencing or imaging. Some of the approaches used in spatial transcriptomics are fluorescent in situ RNA sequencing (FISSEQ) and a combination of scRNA-Seq data with single molecule fluorescence in situ hybridization (smFISH); these have been used to examine the spatial division of genes along liver lobules and to investigate gene expression as well as post-transcriptional modifications while preserving spatial information [228, 229]. The smFISH method, however, was limited in the number of RNA species that could be imaged at once in single cells. Hence, another method called multiplexed error-robust FISH (MERFISH) was developed, which allows thousands of RNA species to be imaged in individual cells while retaining spatial distribution information. The use of spatialomics in host-pathogen interaction studies shows great promise, as many infections induce alterations in specific subcellular compartments. Understanding both the temporal and spatial changes that occur during the course of an infection can improve our comprehension of host-pathogen interplay. As for ribosome profiling, the highly regulated process of mRNA translation by ribosomes inspired this translatome-based analysis, which assumes that protein synthesis is proportional to the density of ribosomes on an mRNA. By sequencing ribosome-protected mRNA fragments, studies have gained insight into translational control in yeast, codon usage biases and unannotated translational events [232, 233, 234]. Ribosome profiling coupled with RNA-Seq has been carried out as well to study infections by pathogens like
7. Challenges in RNA-Seq
The rapid surge of RNA-Seq technology has led to many new discoveries, and RNA-Seq is currently the go-to method for transcriptomic analysis. Although its use has brought significant advancements, the technology is still evolving and many aspects need improvement. The drawbacks of short-read sequencing platforms mentioned earlier have mostly been solved with the advent of long-read technology. While long-read technology has its own strengths, analysing long-read datasets still poses a challenge. Aside from the lower per-read accuracy compared with short-read platforms, most long-read transcriptomic tools do not take into account factors like coverage bias and high error rates. Several studies have found benefits in combining short- and long-read technologies; however, integrating different tools is often laborious and still needs to be improved [238, 239]. Library preparation poses challenges as well. In this process, cDNA is generated from fragmented RNAs, followed by adapter ligation, amplification and finally sequencing. Linsen et al. compared three different library preparation methods and found large differences between methods in the frequency of miRNAs captured. Other biases include PCR amplification bias, which may be introduced by variations in template length and base composition during parallel amplification of multiple templates [241, 242]. Yet another issue in library preparation is the influence of batch effects. Batch effects may arise from various factors including experimental conditions, quality of reagents, pipetting technique and the individual technician in charge on a particular day. Researchers should take care to reduce the effects of these confounding variables.
A recent discovery was the abundance of circular RNAs (circRNAs) in various eukaryotic organisms, including humans. Previous RNA-Seq protocols were mostly biased against circRNAs, since the poly(A) enrichment step efficiently depletes them because they lack poly(A) tails. The development of alternative protocols better suited to non-coding transcripts, such as rRNA depletion, improved the detection of circRNAs. However, these approaches are not entirely efficient for circRNAs, and further research is required to improve the detection sensitivity for circRNAs and possibly other non-coding RNA transcripts. There are several technical challenges associated with scRNA-Seq as well. In host-bacterial studies, the bacterial lysis protocols employed, whether physical or chemical, are not very compatible with downstream RNA-Seq steps like amplification and library preparation. These steps also do not preserve the RNA effectively. Another problem is the accurate identification of minority transcripts in bacteria. ScRNA-Seq protocols commonly employ poly(A) enrichment strategies, which are useful for eukaryotes; however, prokaryotic mRNAs are not polyadenylated. Analysis of non-polyadenylated RNAs has been attempted; however, it involves complex and specialised protocols that need to be simplified [218, 246]. This problem is also faced when analysing viral infections in host cells, because certain viruses like dengue virus and hepatitis C virus have non-polyadenylated mRNAs. A more optimal procedure is needed to accurately quantify bacterial and viral transcripts. Furthermore, scRNA-Seq examines individual cells, so the amount of input material is very low. This results in high levels of technical noise, which can be confused with biological variability. A few statistical models capable of quantifying this technical noise have been proposed, but additional research is required to assess their validity [247, 248].
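One simple way to reason about technical noise in single-cell data is through the squared coefficient of variation (CV²) of a transcript across cells, compared with the value expected from pure Poisson sampling, which is 1/mean. The Python sketch below computes both quantities. It is an illustration of the concept only, not one of the published noise models referred to above, and the counts are invented.

```python
from statistics import mean, variance


def cv_squared(values):
    """Squared coefficient of variation (sample variance / mean^2) across cells."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV^2 is undefined for a transcript with zero mean")
    return variance(values) / (m * m)


def excess_over_poisson(values):
    """CV^2 minus the Poisson sampling expectation (1 / mean).

    Values near zero are consistent with sampling noise alone; large
    positive values indicate extra variability, technical or biological.
    """
    return cv_squared(values) - 1.0 / mean(values)


# Hypothetical counts of one transcript across six cells.
stable = [9, 11, 10, 12, 8, 10]
bursty = [0, 0, 35, 1, 0, 24]
print(cv_squared(stable), excess_over_poisson(stable))
print(cv_squared(bursty), excess_over_poisson(bursty))
```

Applying the same calculation to spike-in transcripts, whose true abundance is identical in every cell, gives an estimate of the purely technical component against which endogenous genes can be compared.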
The development of more complex tools for RNA-Seq analysis is quite possible, and challenges may arise in understanding or using such approaches. Efforts should be made to keep approaches practical, so that methods are not manageable only by those with very high expertise. Although many tools exist for the analysis of RNA-Seq data, there now seem to be more than the field can handle: a multitude of pipelines incorporate many different tools with multiple versions and licences. This is a major challenge, especially in the context of translating RNA-Seq into the clinic. Bringing a laboratory test into the clinic involves an important step, the demonstration of analytical validity. One aspect of analytical validity is accuracy, which is commonly measured by comparing obtained values to a reference standard. The development of reference standards, especially for NGS data, can reduce method- and platform-specific biases. One of the first reference standards for RNA-Seq was developed by the External RNA Controls Consortium (ERCC) using synthetic RNA spike-in controls. Other projects like the Sequencing Quality Control (SEQC) project, the Association of Biomolecular Resource Facilities (ABRF) study and GEUVADIS carried out extensive studies investigating the accuracy of RNA-Seq data across many platforms, protocols and laboratory sites, providing a guide for other researchers. The continuous technological advancements in sequencing have to be accompanied by more reference standards. The constant development and assessment of reference standards is required to reduce the variability that arises from the emergence of numerous tools. Overcoming this challenge will also improve the translation of RNA-Seq into the clinic and ensure the smooth transition of NGS technologies into clinical settings.
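The comparison of obtained values with a reference standard can be sketched as follows: for spike-in controls of known input concentration, the agreement between expected and measured abundance is summarised as a Pearson correlation on the log2 scale. The spike-in identifiers and values below are hypothetical, and real assessments typically also report the slope and the lower limit of detection.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spike_in_agreement(expected_conc, observed_abundance):
    """Correlation of log2 expected vs log2 observed for detected spike-ins.

    expected_conc:      dict spike-in id -> known input concentration
    observed_abundance: dict spike-in id -> measured abundance (e.g. TPM)
    Spike-ins that were not detected (missing or zero) are excluded.
    """
    pairs = [
        (math.log2(expected_conc[s]), math.log2(observed_abundance[s]))
        for s in expected_conc
        if observed_abundance.get(s, 0) > 0
    ]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Hypothetical spike-in mix spanning three orders of magnitude.
expected = {"spike1": 1.0, "spike2": 10.0, "spike3": 100.0, "spike4": 1000.0}
observed = {"spike1": 2.5, "spike2": 31.0, "spike3": 290.0, "spike4": 3300.0}
print(round(spike_in_agreement(expected, observed), 4))
```

A correlation close to 1 indicates that measured abundances track the known inputs across the dynamic range, whereas systematic departures at low concentrations point to limited sensitivity.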
RNA-Seq has revolutionised the approach taken by researchers in exploring host-pathogen interactions. From scRNA-Seq to bulk RNA-Seq, the vast amount of information derived from these studies provides novel insights into the mechanisms of disease and the host counter-reactions that combat it. RNA-Seq has allowed us to examine the mechanisms of gene expression, differentially expressed genes in development or disease, alternative splicing events, gene fusion events, transcriptional regulation and much more. The use of dual RNA-Seq has changed our perspective on host-pathogen interactions. It is clear that infection induces systems-level alterations, all the way from immune responses to metabolic processes. These studies are laying the foundation for more complex interrogations of our immune system and, eventually, their translation into clinical settings. Other creative innovations in RNA-Seq are also bound to occur as long as the determination to answer biological questions is present. The use of spatialomics seems very promising, as it allows known transcripts to be assessed while preserving the three-dimensional surroundings of the tissue. This has major implications, especially for studies investigating the influence of cellular architecture on infection progression. Single-cell RNA-Seq is also slowly gaining momentum in host-pathogen interaction studies, mainly owing to its ability to elucidate pathogen subpopulations. This is a key factor that will provide further information about pathogenesis, host cell susceptibility and potential targeted treatment strategies. The current discrepancies and biases within RNA-Seq protocols are challenges that need to be met in order to sustain its upward trajectory. The next few years will be a period of concurrent growth for RNA-Seq technology and biomedical research. A new phase of biological discovery has just begun, and RNA-Seq has proved to be a valuable tool to guide us through it.