Human diseases caused by single-stranded, positive-sense RNA viruses, are among the deadliest of the 21st Century. In particular, there are two notable standouts: human immunodeficiency virus (HIV) and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Detection of these disease-causing viral transcripts, by next-generation RNA sequencing (RNA-Seq), represents the most immediate opportunity for advances in diagnostic, therapeutic, and preventive applicability in infectious diseases (e.g., AIDS and COVID-19). Moreover, RNA-Seq technologies add significant value to public health studies by first, providing real-time surveillance of known viral strains, and second, by the augmentation of epidemiological databases, construction of annotations and classifications of novel sequence variants. This chapter intends to recapitulate the current knowledge of HIV and SARS-CoV-2 transcriptome architecture, pathogenicity, and some features of the host immune response. Additionally, it provides an overview of recent advances in diagnostic sequencing methodologies and discusses the future challenges and prospects on the utilization of RNA-Seq technologies.
- Next generation RNA Sequencing
- RNA viruses
- human immunodeficiency virus
- severe acute respiratory syndrome coronavirus
- HIV transcriptome
- SARS-CoV-2 transcriptome
Next-generation RNA sequencing is rapidly replacing cDNA microarrays and quantitative PCR in clinics and in the public health laboratories, due to higher sensitivity and precision, as well as its cost-effective ability to identify novel RNA species. High-throughput sequencing has become widely adopted in infectious diseases. Supporting bioinformatics services have improved substantially, which aids in genotyping causative mutations for antimicrobial resistance, pathogens’ virulence factors, and global epidemiological surveillance to monitor molecular dynamics of pathogen diversity [1, 2]. Integration of pathogen genome sequencing into infectious disease surveillance was established by United States Centers for Disease Control and Prevention (CDC) under Advanced Molecular Detection program in 2016 . As described below, RNA sequencing became particularly important to identify newly emerging HIV and SARS-CoV strains. RNA-Seq can be used for many different purposes, such as improve diagnostics, characterization of novel strains, investigation of disease epidemiology, spatiotemporal spread and transmission routes, assessment of evolutionary rates, and the development of countermeasures and public health policies.
The standard practice of viral RNA detection is composed of a two-step procedure - reverse transcription of RNA into cDNA and cDNA amplification. The reverse transcription reaction is prone to recombination errors leading to potential alterations in the amplicon library and inadequate sequence representation of the original sample. Although both the nucleic acid amplification test (NAAT) and real-time reverse transcriptase - polymerase chain reaction (RT-PCR) test are cost-effective, the occurrence of amplicon failures needs to be monitored for substitutions in primer binding sites. In order to circumvent possible mutations in the primer regions, it is recommended multiple primer designs targeting the same genes, when designing RT-PCR assays [3, 4]. The RNA-Seq method is capable to unbiasedly identify variants across thousands of target regions with single-base resolution, in cases where NAAT or qRT-PCR produce false-negative results.
Illumina’s technology appears to be at the forefront of viral sequencing, however others such as IonTorrent and Oxford Nanopore technologies are quickly closing the gaps . Illumina deploys sequencing-by-synthesis chemistry, which provides high accuracy and deep coverage of the viral genome. A limitation of Illumina is that short read length (< 400 nt) fragmented sequences should be re-assembled computationally, during which the haplotype information can be lost. Short-read sequencing data requires computational pipelines for trimming of low-quality reads, removal of optical duplicates, detection of reads orientation, and alignment to reconstruct full-length viral sequences. Short read RNA-Seq technologies allow to pull multiple samples per lane to reduce the cost, however, demultiplexing of reads from different samples can be bioinformatically challenging.
The recent shift toward emerging long-read technologies, enabled direct sequencing of individual RNA molecules as opposed to classical indirect approaches . Long-read RNA-seq technologies, such as that provided by Ion Torrent or Oxford Nanopore, allow mapping of novel insertions, deletions, or substitutions in natural variants . Thus, the technology offers advantages over established sequencing technologies, as it has simplified procedures for library preparation and bioinformatics analysis. Although direct RNA sequencing possesses low accuracy, than Illumina’s short read method, it enables easier full-length sequencing, rapid viral genome annotation and analysis, which would be particularly useful for understanding changes in transcriptome architectures . Additionally, long-read technologies are more cost-effective, portable, and provide robust reproducibility of the results, when compared to short-read RNA-seq . A heterozygous variants can be detected when filtered by bioinformatic pipelines, which may identify co- infections and estimate the risk of superinfection . Detailed descriptions and of long- and short- read methodologies, as well as analysis workflows, are provided in other chapters of this book.
2. Sequencing of human immunodeficiency virus (HIV)
The human immunodeficiency virus (HIV) is organized in the genus Lentivirus, within the family of Retroviridae, subfamily Orthoretrovirinae. HIV is a human retrovirus with an RNA genome that is composed of copies of a single stranded positive sense RNA . The RNA genome encodes viral enzymes and is packed into nucleocapsid proteins (the viral capsid). After infecting the cell, the viral RNA genome is reverse transcribed into double stranded viral DNA, which is subsequently integrated into cellular host DNA with the aid of viral and host cofactors. The HIV life cycle has several life-cycle phases; a prolonged latent phase of host genome integration, and a phase of active transcription of new viral RNA in the infected cell . The of HIV replication process is characterized by high rates of nucleotide misincorporations, insertions, deletions, and recombinations. Twentieth century research studies, in 1980s and 1990s, uncovered only a partial viral genome. Determination of the HIV partial genome sequence provided the basis for the development of next-generation sequencing and real-time reverse transcriptase polymerase chain reaction (RT-PCR) methods.
Next-generation RNA-Sequencing enabled the first successful pan-HIV-1 sequencing, including subtype identification and phylodynamic diversification during the course of AIDS pandemic . The current HIV nomenclature includes two types: HIV-1 and HIV-2. The HIV-1 is phylogenetically classified into of 4 groups (M, N, O, and P) . Phylodynamic analyses and evolutionary clock models estimated the time of the most recent common ancestor (HIV-1 pandemic group) to be around the late 19th – early 20th centuries, in Central Africa, and where it may have been associated with an epidemic in primates . The major (M) pandemic group, which causes human-to-human transmission, is classified into 9 subtypes. The most prevalent M subtypes are A, B, C, D, and G; while subtypes F, H, J, L and K are collectively responsible for approximately 1% of all HIV infections [14, 15, 16]. An important genetic feature of HIV is that it is prone to recombination. The highest levels of inter-subtype re-combinations are found in HIV-1 infected specimens, from patients in the Sub-Saharan region of Africa. Analysis of a whole-genome HIV-1 sequence, from the Congo Basin, identified ancestral HIV species which clustered basal to all major subtypes; many of which underwent purifying selection and are no longer in circulation. However, 72 additional recombinant forms of HIV remain in circulation .
Intra-individual mixtures of recombinant genomes have been reported throughout the world. It is hypothesized that inter-subtype recombinant viruses have an advantage of transmissibility over parental strains [18, 19, 20, 21, 22, 23, 24]. To achieve full viral genome sequence coverage, several approaches were employed. One such approach was the amplification of two large fragments (“half genomes”), spanning the full HIV-1 sequence, including all critical regions. Using this approach, molecular surveillance studies of circulating and recombinant forms of HIV, has been conducted in Cameron. Researchers used primers which target two overlapping “half genome” sequencing approach: the 5′ region (
The percentage of people living with HIV drug resistance appears to be increasing . RNA-seq has been key to revealing so-called viral invasive genes and the acquisition of drug-resistance mutations throughout of the 9.7 Kb of viral genome . Achievement of long amplicon length allowed quantification of viral load and multiple drug-resistant mutations using viral RNA and pro-viral DNA as a template [35, 36, 37, 38]. RNA-based long-read sequencing is also being explored for surveillance of viral diversification and assessment of HIV quasispecies [39, 40]. The FDA has cleared Sentosa HIV-1 RNA-Seq genotyping SQ assay for clinical use, to detect and monitor antiretroviral drug resistance mutations in the blood of HIV-infected patients. This product helps clinicians to prescribe effective combination of antiretroviral drugs, such as reverse transcriptase, integrase, envelope (gp41), or CCR5/CXCR4 inhibitors . Phylogenetic clustering of genotyped samples from HIV patients who diagnosed between 1999 and 2012 identified strong associations between molecular clusters and epidemiological hotspots for transmitted drug resistance .
3. Sequencing of SARS-CoV-2
Viruses of the order Nidovirales - family Coronaviridae - subfamily Coronavirinae are positive-sense, single-stranded sequences of RNAs; ranging from 26 to 32 kilobases in length . The novel SARS-CoV-2 species was identified via metagenomic RNA sequencing and made publicly available, in NCBI Virus, within two months of unknown pneumonia cases outbreak in China (accession number NC_045512) . The SARS-CoV-2 viral genome (made up of 29,903 nucleotides) is phylogenetically related to genus beta-coronavirus (subgenus sarbecovirus), which has demonstrated the ability to crossover, from the animal kingdom, and subsequently cause infection in humans . A viral RNA genome encodes several structural proteins (spike, S; nucleocapsid, N; transmembrane, M; envelope, E). SARS-CoV-2 is known to produce 16 non-structural proteins, and at least six accessory proteins (encoded primarily by reading frames ORF1 and ORF2) . Each viral transcript has a 5′ cap structure and a 3′ poly(A) tail . Genomic characterization of SARS-CoV-2 receptor-binding domains, in conjunction with phylogenetic analysis and homology modeling, revealed the sequence for viral port of entry proteins that conferred human-to-human transmission . Direct RNA sequencing via nanopore and DNA nanoball methodologies identified that SARS-CoV-2 generates a tiered series of canonical and non-canonical subgenomic RNAs, through a process involving homology recombinations between transcriptional regulatory sequences . The function of these transcripts is currently unknown .
Sequencing-based genomic surveillance has been employed by UK, USA, and Canadian consortia, to coordinate efforts to sequence large numbers of SARS-CoV-2 genomes. An executive summary “Genomic sequencing of SARS-CoV-2” was issued by the WHO, in January 2021. It discusses the implementation of NGS, for “maximum impact on public health” and provide detailed overview of current sequencing methodologies .
In lockstep with the continuing advances of next-generation sequencing technologies, the bioinformatics community has developed many computational tools capable of keeping up with big data analysis and interpretation. Data sharing repositories such as GISAID (EpiCoV), NCBI Virus/GenBank, COG-UK, ENA provide free access to sequencing data (Figure 1). Bioinformatics workspaces (e.g., Terra , UshER , NextStrain [53, 54], PANGO Lineages , CoV-Glue , see Figure 1) allow users to assign their own sequences to globally circulating lineages .
Genomics researchers who track phylogenetic dynamics of SARS-CoV-2 developed several schemes to describe the accumulation of mutations and detectable divisions within circulating SARS-CoV-2 populations (Figure 2). Three popular clade naming conventions are used:
The PANGO Lineages nomenclature refers to corresponding outbreaks caused by distinct lineages (Figure 2A). Lineage A is suggested to be ancestral because it shares more nucleotides similarities with related coronaviruses in animals . Lineages B presumably originated from United Kingdom, and lineages P – from Brazil [61, 62]. The Clades by NextStrain and GISAID nomenclature refers to the divergence of newly emergent clades (see Figure 2B and C) [57, 59, 63]. A formal naming system for SARS-CoV-2 evolutionary lineages has not been universally adopted. The nomenclatures from PANGO Lineages and GISAID are updated on CDC and WHO web [64, 65]. Novel phylodynamic models for visualization of phylogenomic datasets have been employed in different web formats. For example, Auspice, Augur, NextClade are open-source interactive web tools for visualizing phylogenomic data within NextStrain interface . UShER (Ultrafast Sample placement on Existing tRee) is available in UCSC, and the virus phylogeny is a part of T-BioInfo platform . Novel human coronavirus lineages are now regularly reported, and lineage nomenclature is continually updated . The WHO Virus Evolution Working Group has been working on simplifying the nomenclature, for SARS-CoV-2 variants of interests (VOI) and variants of concern (VOC). This group has recommended easy-to-pronounce variant naming based on the Greek alphabet (e.g. Alpha, Beta, Gamma, Delta - for VOC; and Epsilon thought Lambda – for VOI) .
Genomic epidemiology is a novel discipline that assesses viral genomic diversification, including the acquisition of pathogenic mutations, and molecular dynamics of a pandemic, at the population level. Real-time viral sequencing helps delineate the origin of imported cases. For example, the assessment of WGS data (in several European countries) enabled a more precise understanding of SARS-CoV-2 transmission patterns, during various phases of the current pandemic. Molecular epidemiology data aided in identifying multiple introduction events, from the community spread in the Netherlands, Brazil, and several other countries [68, 69, 70, 71]. WGS analysis leveraged with phylogenetic methodology approximated the source of lineage exportation. For example, the first outbreak of COVID-19 in Taiwan was imported from China. However, the second was from Italy, as strains that caused the second outbreak were phylogenetically more related to GISAID Accession IDs from the Italian outbreak . As determined by the neighbor-joining method using MEGAX software, the circulating Moroccan lineage is more closely related to the South African lineage (20H/501Y.V2). This is not surprising given that travel within the continent is more predominant than travel between continents [73, 74]. In the United States, epidemiological investigation of sequencing data showed that SARS-CoV-2 varies at the single-nucleotide polymorphism (SNP) resolution, within various States. Additionally, SARS-CoV-2 genomic surveillance identified introduction and community spread of more transmissible strains in the states of California, Ohio, Washington and others (e.g., 20C > 20G clade switch) [75, 76, 77, 78, 79]. Genome sequencing in the New York City and Boston areas identified a cryptic spread of multiple simultaneous lineage introductions, from European or Asian travel-related exposures [80, 81, 82]. Using Global Epidemic and Mobility Model (GLEAM), Davis et al. estimated the time for the arrival and cryptic phase of the COVID-19 epidemic, which began in early to mid-January on both the East and West coasts of the USA . During the cryptic phase of transmission, monophyletic clades of imported SARS-CoV-2 were spreading until air travel restrictions were dictated [83, 84].
US Center for Disease Controls and Prevention regularly updates the SARS-CoV-2 variant classification, adding novel mutations that have an impact on the immune response or virus virulence factor. Variants are generally classified as a variant of interest, a variant of concern, and a variant of high consequence . Genome-wide sequencing analysis of samples, from various geographic locations, has revealed novel mutations dispersed throughout SARS-CoV-2 genome . Weber et.al has recently detected a number of hotspot mutations, juxtaposing SARS-CoV-2 sequences from various geographic regions, which occurred
4. Advances in diagnostic sequencing methods
Diagnostic sequencing methods have reached a state of capability, which enables the detection of low quantities of viral RNA, with greater sensitivity and specificity. Several amplicon construction approaches have been utilized: targeted capture-based, tiling hexamer-based, and standard amplicon-based [89, 90, 91]. A newly commercialized methodology that includes the hybridization capture method yielded near-complete genome coverage, even in samples with relatively low viral loads (cycle thresholds, Ct > 25) . Several studies reported that 93 to 99% of genome coverage is the samples with Ct values ranging from 26 to 32. The median on-target percentage of genome coverage dropped below 50%, in higher Ct value samples (Ct > 30) . Public Health Laboratories have recently utilized capture-based methods followed by sequencing to evaluate if SARS-CoV-2 is present in wastewater samples. The goal of the study was the detection the meta-transcriptomic differences between SARS-CoV-2 variants, found in wastewater, and compare the differences to locally reported clinical genotypes . An Illumina Flex for Enrichment kit and Illumina Respiratory Virus Oligo Panel were used to enrich the samples for virus cDNA amplicon.
The tiling amplicon-based approach has been adapted for SARS-CoV-2 sequencing. It is based on modified reverse transcription (RT with designed primer pool) and an overlapping long amplicon multiplex PCR strategy . Primers can be designed in a random-hexamer-primed RT fashion, and a not-so-random-hexamer-primed fashion, to eliminate human rRNA during cDNA synthesis. The number of primers can reach up to 400, in metagenomics protocols, to improve genome coverage. However, less than 20 primers are used more widely for amplicon construction [95, 96, 97]. The tiling amplicon-based approach has been validated by many laboratories and adapted for nanopore sequencing technologies as well as Illumina and Sanger sequencing [97, 98, 99]. A Swift Biosciences’ Normalase® tiling amplicon SARS-CoV-2 panel achieved 80% success of partial genome recovery from samples with a Ct value between 32 and 34, and 40% - for samples with a Ct value between 34 and 36 . A 98 non-overlapping primer pair set have been designed by a group of scientists from ARTIC network in early March of 2020. This primer set has been modified on 3 occasions, reducing primer-dimers formation during RT-PCR (a major cause of coverage bias) . Recently, the ARTIC’s CoronaHiT platform (Coronavirus High Throughput) has been launched and provided accurate consensus SARS-CoV-2 genomes when at least 20x coverage is obtained. It is important to note that in the low-titer samples, with Ct > 32 (< 100 viral genome copies/uL), sequencing results became less reliable for all sequencing methods (long- and short- read) [96, 102, 103, 104].
5. Assessment of host response by RNA-Seq
The host-pathogen interactions have been studied by RNA-Seq in research settings. Transcriptome analysis of hosts’ diseased tissues has been an integral part of the discovery of immune response biomarkers. University of Minnesota Division of Infectious Diseases (USA) leads intensive transcriptomic studies for discoveries of novel biomarkers of immune reconstitution inflammatory syndrome (IRIS) in the HIV-infected population [105, 106]. IRIS presents as an exaggerated immune reaction (in a form of cytokine release syndrome) that occurs in immunocompromised patients who have commenced antiretroviral treatments (ART) . Several predictive and diagnostic biomarkers have already been identified in peripheral blood, utilizing Galaxy, JMP Genomics, or Partek Flow software (Figure 1). For example, the low expression of type I interferon, interferon-response genes, and components of antiviral defense pathways were identified as a risk factor for subsequent occurrence of non-fatal forms of IRIS [107, 108]. More recently, transcriptomic predictors of fatal IRIS have been delineated. These include the overexpressed genes that encode numerous proinflammatory cytokines (e.g., IL6), chemokines, stress response kinase signaling and production of reactive oxygen species in monocytes pathways . Interestingly, the deficiency in innate antiviral defense genes, and poly-immuno-cytopenia have also been identified as significant contributors to subsequent cytokine release syndrome during COVID-19 [110, 111, 112]. The reduced IFN I and IFN III type gene expression and elevated monocyte- and granulocyte- derived chemokines at an early stage of COVID-19 have been proposed as predictors of subsequent cytokine release syndrome and disease severity .
In a small percentage of the human population, the SARS-CoV-2 infection results in a critical illness that begins with uncontrolled overexpression of proinflammatory cytokines (the so-called “cytokine storm”) . A cytokine storm may lead to pathophysiological events such as activation of the coagulation cascades, systemic oxidative stress, and various forms of cell death . Clinically, these events manifest as an acute respiratory distress syndrome (ARDS), or death from multiple organ failure [116, 117]. SARS-CoV-2 is recognized by innate immune cells and, in cases of ARDS, elicits highly abundant transcriptional gene expression, which is somewhat comparable to that of immunocompromised patients who have developed fatal IRIS [109, 118, 119]. The upregulation of inflammasome, Toll-like receptor signaling, HMGB1, and oxidative stress response transcripts, have been recently demonstrated as biomarkers of fatal immune reconstitution inflammatory syndrome . Additionally, NF-kB-associated inflammatory genes and strong neutrophile activation/degranulation signatures differentiate fatal IRIS events from those of survivors. Likewise, fatal ARDS in COVID-19 patients have been characterized by maladapted hyperactivation of the same innate inflammatory pathways described for fatal IRIS [120, 121, 122, 123]. Systemic upregulation of cytokines TNFa, IL6, IL1B, and NLRP3 are noteworthy of mention as they are targetable molecules for drugs such as adalimumab, tocilizumab, anakinra, and RAPA-501-Allo, respectively [124, 125, 126, 127].
SARS-CoV-2 has the propensity to selectively induce mortality in elderly population (over 65 years of age) and in immunocompromised individuals [128, 129]. The most reasonable explanation for this propensity is the depletion of the thymic population, and the inability to mount adaptive T cell immune responses in timely manner . Innate immune cells become the first line of defense against SARS-CoV-2, however in the absence of proper regulatory feedback – are thrown into an uncontrolled proinflammatory loop. Similarly, the irreparable thymic damage and a waning adaptive immunity in AIDS patients enable the sustained HIV replication, which may explain why serious complications such as fatal IRIS are hinged on responses driven by innate immune cells.
There have been isolated cases of SARS-CoV-2 reinfection in individuals who had no known immune disorders [131, 132, 133]. Since SARS-CoV-2 has not diversified significantly, these patients have been re-infected with variants from the same or adjacent clade . Studies have revealed a number of SARS-CoV-2 epitopes are recognized by CD8+ T cells in COVID-19 convalescent subjects (as determined by TruSight HLA sequencing panel) . It is hypothesized that re-infected individuals are unable to mount a sufficient response to primary infection, which result in secondary COVID-19 manifestation. Many of these patients had tested positive for anti- SARS-CoV-2 antibodies after the reinfection, but it is unknown whether they developed antibodies after the first infection at all. It has been hypothesized that susceptibility to re-infection is genetically driven . Several hypothetical models had been proposed . Thus, simultaneous RNA-seq of host and pathogen during viral infection would be helpful to characterize the molecular responses in case of COVID-19 reinfection.
There are now several vaccine types that are being used, in an attempt to reach herd immunity, in various countries . The most successful vaccines are mRNA-based, which have an estimated 92% efficacy [138, 139]. Still, there is an accumulating body of evidence of the so-called ‘vaccine breakthrough’ phenomenon, when SARS-CoV-2 infection occurs in persons who are fully vaccinated [140, 141]. The immune gene variants that contribute to poor T cell memory and undelaying vaccine breakthroughs are not yet known . As vaccination increases, RNA-Seq may become a useful tool in exploring the consequence of vaccine dose-spearing strategies, the emergence of novel vaccine-escape variants, intra- host variant diversity, as well as in understanding the phenomenon of vaccine breakthrough.
6. Conclusion and future directions
RNA-seq has become an indispensable diagnostic tool of clinically important RNA viruses. It has taken the research community almost 20 years to sequence a complete HIV viral genome, but it took a little over 2 months to sequence a near complete SARS-CoV-2 genome and identify its origin! With an improved understanding of the genomic structures of virus populations, RNA-seq can be used to track the origins, genetic diversity, and global monitoring of transmission patterns. In a public health context, RNA-Seq is an indispensable tool to be utilized in the assessment of risks and the emergence of future outbreaks. Additionally, assessment of transcriptomic profiles, during host responses to infection improved the understanding of disease pathogenesis and identify patients at risk for severe outcomes. To overcome drug resistance, novel viral genotype-guided agents based on antisense RNAs (e.g., aptamers, peptide nucleic acids oligomers, ribozymes, and RNAi silencing therapies) are being developed to specifically inhibit viral replication. Complex bioinformatics approaches enabled vaccines to be designed with high probabilities of intercepting viral escape . Much work awaits the global scientific community, including the establishment of unified reference standards and the adoption of a single nomenclature for SARS-CoV-2. Additionally, the assessment of long-term vaccination efficacy by RNA-Seq in the clinical laboratory is necessary to fully understand its benefits .
severe acute respiratory syndrome coronavirus 2
Center for Disease Controls and Prevention
University of California, Santa Cruz
Global Initiative on Sharing Avian Flu Data
European Nucleotide Archive
COVID-19 Genomics UK Consortium
whole genome sequencing
The National Center for Biotechnology Information
Ultrafast Sample placement on Existing tRee
immune reconstitution inflammatory syndrome
Tumor Necrosis Factor Alfa
High Mobility Group Box 1 gene
NLR (nucleotide-binding domain and leucine-rich repeat containing) Family Pyrin Domain Containing 3
European Molecular Biology Laboratory - European Bioinformatics Institute