Some popular databases and tools in finding DEGs in RNA-Seq data.
RNA sequencing (RNA-Seq) is the leading, routine, high-throughput, and cost-effective next-generation sequencing (NGS) approach for mapping and quantifying transcriptomes, and determining the transcriptional structure. The transcriptome is a complete collection of transcripts found in a cell or tissue or organism at a given time point or specific developmental or environmental or physiological condition. The emergence and evolution of RNA-Seq chemistries have changed the landscape and the pace of transcriptome research in life sciences over a decade. This chapter introduces RNA-Seq and surveys its recent food and agriculture applications, ranging from differential gene expression, variants calling and detection, allele-specific expression, alternative splicing, alternative polyadenylation site usage, microRNA profiling, circular RNAs, single-cell RNA-Seq, metatranscriptomics, and systems biology. A few popular RNA-Seq databases and analysis tools are also presented for each application. We began to witness the broader impacts of RNA-Seq in addressing complex biological questions in food and agriculture.
- gene expression
- and tools
Transcriptome broadly refers to a collection of RNA transcripts within a particular context that includes combinations of spatial and temporal factors: biological level of organization, from organelle to organism; and phase of growth, differentiation, or development, from zygote through adult. Additionally, one can investigate transcriptomes under more experimental contexts by controlling or varying the factors mentioned above, along with combinations of environmental, genetic, and physiological conditions. All of these factors influence the constituents of a transcriptome, an array of RNA types that traditionally fall into two categories: coding, the messenger RNAs (mRNAs); and non-coding (ncRNAs), such as ribosomal (rRNAs), transfer (tRNA), small interfering (siRNAs), micro (miRNAs), tRNA-derived small (tsRNA), Piwi-interacting (piRNAs), short hairpin (shRNAs), small nuclear (snRNAs), small nucleolar (snoRNAs), long non-coding (lncRNAs), and circular RNAs (circRNAs) [1, 2]. Interestingly, studies have questioned this sharp distinction between coding and non-coding RNAs, paving the way for more research into multifunctional RNA types that transcend this traditional dichotomy [3, 4]. Given the complex definitions of transcriptome and its constituent RNAs, keen attention is required in understanding and managing the context within which a transcriptome is generated and analyzed throughout the experimental procedure and downstream analysis.
Thus far, RNA research efforts have concentrated on a few major types of RNAs: mRNAs, rRNAs, tRNAs, and miRNAs. Accounting for 3-4% of the total RNA in a cell , mRNAs are products of transcription and, in eukaryotes, multiple processing steps that usually involve the addition of adenosine monophosphates to form a poly(A) tail via polyadenylation . This coding mRNA is then translated into an amino acid (AA) chain by the ribosome, in a process incorporating ribosomal proteins, AAs, and non-coding RNAs, such as rRNAs and tRNAs. About 60% of the ribosome’s mass  and up to 95% of the total RNA in a cell  can consist of rRNAs, which facilitate mRNA and tRNA binding while catalyzing the transfer of an AA from the tRNA to the growing AA chain. Many processes that comprise gene expression, including the steps mentioned above, can be regulated by miRNAs . These short (17-22 bp), single-stranded, non-coding RNAs are exclusive to eukaryotes and typically bind to complementary sequences on mRNA molecules, thereby inducing degradation or inefficient translation of the target transcript .
These four major types of RNA and the multitude of minor types can be selectively isolated and analyzed using various wet lab and dry lab techniques, depending on the specific applications and biological questions under investigation. In the case of transcriptome profiling for coding RNAs in a eukaryotic organism, the ratio of mRNA to rRNAs can be increased: first during library preparation through poly(A) selection, ribosomal depletion, and size selection strategies; and again during the bioinformatic analysis by rRNA filtering during the initial quality control (QC) step in the pipeline. Especially for capturing miRNAs, in addition to rRNA decontamination steps, size selection strategies are used for selective isolation of small RNA . Many bioinformatics tools are available customized for short sequence alignments , and a few can evaluate the thermodynamics of miRNA secondary structures . The molecular biology of RNA transcription, processing, transportation, and translation can be drastically different between phylogenetically distant organisms, and hence the taxonomy of the species being studied is often considered. A variety of wet lab and dry lab techniques have been developed to account for the biological differences in mRNA structure and processing throughout the phylogenetic tree of life.
Transcriptome analysis evolved steadily from nucleic acid detection methods (e.g., northern blots), to hybridization-based methods (e.g., microarrays), through a multitude of sequencing-based methods (e.g., RNA-Seq). RNA-Seq has been the most widely used approach for analyzing transcriptomes obtained from phylogenetically diverse organisms . The swift advancements in RNA-Seq research are being driven by the continual improvements in sequencing technologies (first, second, and third generation), which have steadily provided higher throughput, lower cost, and more accurate sequencing for transcriptome analyses. Despite the availability of many sequencing technologies, the Illumina short-read method remains the most widely used platform for transcriptome sequencing, and many consider it as the gold-standard sequencing for single-nucleotide resolution transcriptome analysis with an accuracy of 99.99% and minimal biases . This method has evolved from 35 bp to 350 bp fragment sequencing in the past decade, and it offers multiple library preparation options, including single-end, mate-pair, and paired-end. Library preparation can yield either stranded sequences, where the sense and/or antisense orientation of the output reads is known, or unstranded sequences, where the read orientation is unknown. Stranded RNA-Seq enables the resolution of both sense and antisense transcription for genes overlapping on opposite strands , and it remains the standard for most RNA-Seq applications.
A thorough conceptual understanding of the prospective RNA-Seq experiment is required to overcome the plethora of potential biases, errors, misinterpretations, and other various challenges common in RNA-Seq experiments [17, 18]; researchers ought to precisely monitor and engineer each phase of the entire process, wet lab through the dry lab, from beginning to end and in all steps between: experimental design, sample collection, RNA isolation, RNA-QC, adapter ligation, multiplexing, library preparation, library-QC, sequencing, data collection, demultiplexing, pre-processing, data-QC, analyses, and interpretation. The experimental design is the first fundamental process in RNA-Seq analysis. When the goal is to detect statistically significant, differentially expressed genes (DEGs), increasing the number of replicates usually has a more positive effect than increasing the sequencing depth, especially when sequencing over 2 million reads per sample [19, 20]. For most RNA-Seq experiments, six or more biological replicates are recommended, and at least three biological replicates are necessary. If one aims to identify DEGs, then pooling biological replicates before multiplexing is discouraged, but such pooling might be pragmatic when one only attempts to assemble a comprehensive transcriptome. Contrary to biological replicates, technical replicates are unnecessary for RNA-Seq on modern sequencing platforms , and resources can be better utilized by increasing the number of biological replicates and minimizing batch effects from unintended influences, such as variance in personnel, in the laboratory environment, and in the selection and usage of materials and methods. A thorough review of the expansive RNA-Seq landscape is available, and to confine our discussion to the scope of this chapter, we will be highlighting the most popular and current RNA-Seq applications in food and agriculture.
2. RNA sequencing (RNA-Seq) applications
2.1 Differential gene expression (DGE)
As previously mentioned, transcriptomes are spatially and temporally dynamic, and they evolve in response to changing environmental, genetic, and physiological conditions. For instance, the transcriptome of one cell type can be significantly different from another cell type, even within the same tissue, and similarly, the transcriptome of a particular cell can vary drastically, as it transitions through the cell cycle, differentiates, acclimates to environmental factors, adapts to the introduction of particular treatments, or changes during disease progression. RNA-Seq can detect such changes in gene expression levels between samples and, in DGE studies, between two or more experimental groups [21, 22]. DGE analysis seeks to identify statistically significant genes that are expressed differently between groups, which are generated through careful attention to experimental design . DGE studies can elucidate functional elements of the genome by identifying gene-level relationships between transcript abundance and experimental conditions, thereby illuminating the mechanisms of associated physiological processes and expanding our understanding of the links between genotype and phenotype .
While DGE analysis focuses on quantifying and comparing the complete collection of all transcript isoforms for a gene to identify differentially expressed genes (DEGs), differential isoform expression (DIE) analysis focuses on quantifying and comparing each individual isoform in a collection of transcripts associated with a particular gene, to identify differentially expressed isoforms (DEIs) between experimental groups . The materials and methods for analyzing DEGs differ from those used for DEIs. The decision to find differential genes or isoforms is crucial and determines the downstream analysis, and it is ideally taken at the beginning of the experiment. Given these differences, we discussed the methods most relevant to DGE analysis, since it has been more deeply studied and widely applied. Some methods being applied to investigate DEGs include northern blot, western blot, quantitative real-time PCR (qPCR), expressed sequence tags (ESTs), microarrays, and RNA-Seq. Most bioinformatics pipelines for DGE analysis of RNA-Seq data include five main stages: QC, alignment, quantification, normalization, and DGE calculation, which usually assumes either a negative binomial, log-normal, or nonparametric statistical distribution. Many databases and bioinformatics tools are available for all these stages and downstream analyses, and a few popular, reliable databases and DGE calculation tools are presented below (Table 1). Often each program will output slightly different collections of statistically significant DEGs , so many investigators use multiple tools, assign higher confidence to intersectional DEGs, and then continue by piping these results through various downstream functional analyses, which will be discussed later in this chapter.
RNA-Seq followed by DGE analysis has been extensively used in the agriculture and food industry. Poultry scientists have applied RNA-Seq analysis to identify DEGs associated with the eggshell formation in the shell gland at different time-points in laying hens . A dairy research group identified significant enrichment of DEGs associated with mammary gland development, milk protein formation, lipid metabolism, and other biological processes linked with milk production traits in lactating cows . Interestingly, the possible roles of DEGs involved in pathogenesis-related pathways in response to peanut allergy have been examined by comparing the transcriptome profiles of high-risk and risk-free infants, facilitating early detection of food allergies in infants . The symbiotic association between rhizobium bacteria and root nodules in leguminous plants is important in agriculture and soil metagenomics, as this interaction improves soil fertility by nitrogen fixation and increases crop production. Differences in nodulation phenotypes have been observed by comparing two diverse symbiotic systems at different time-points using RNA-Seq . Furthermore, these researchers identified DEGs in response to specific strains of rhizobia in soybean roots, and the majority of these DEGs were involved in plant-pathogen interactions and flavonoids biosynthesis . By studying global transcriptome profiles in strawberry fruits, plant scientists have elucidated the influence of red and blue light on the differential expression of genes associated with anthocyanin biosynthesis and accumulation .
2.2 Variants calling and detection
The genetic variations in the coding region may or may not alter the amino acid sequence, resulting in asynonymous or synonymous variants, respectively; characterizing such variants is important for associating the genomic locations with a trait or phenotype . RNA-Seq can be used to identify variations in the coding sequences, including single-nucleotide variants (SNVs), short insertions/deletions (indels <50 bp), and structural variants (SVs). SNVs result from a single nucleotide substitution at a particular coordinate and single-nucleotide polymorphism (SNP) refers to a frequent SNV, generally present in at least 1% of the subject population . SNPs are ubiquitous throughout the coding, non-coding, and regulatory regions of the genome. In comparison, a haplotype is a set of genes, alleles, or SNPs, which are inherited together. Copy number variations (CNVs) are a type of SV where regions in the genome are repeated, and the number of these repeats varies among individuals due to duplication or deletion events. The percentage of CNVs detected in diverse organisms varied significantly. Over 80% and > 15% of the detected SNPs and CNVs were associated with gene expression in the mammalian system, respectively .
Many experimental methods have been developed to detect genetic variants in the genomes of plants and animals, and a few routinely used techniques include rhAmp (RNase H2-dependent amplification assay), Kompetitive Allele-Specific PCR (KASP), TaqMan, Fluidigm, AmpliSeq, Fluorescence In Situ hybridization (FISH), qRT-PCR, microarray, and RNA-Seq. When generating RNA-Seq data for the downstream bioinformatics analysis, sequencing depth is a major consideration, given its influence on not only the overall results but also the cost of experimentation; and after analyzing variants for mutated myeloid genes, researchers suggested 30-40 million paired-end reads per sample was sufficient . Additionally, highly variable coverage between different genes can hinder variant calling and annotation of RNA-Seq data. To identify variants (SNPs and short indels) in RNA-Seq reads, a typical bioinformatics pipeline involves three phases: data clean-up, variant discovery and filtering, and evaluation. A selection of databases and programs for variant analysis is presented below (Table 2).
The application of RNA-Seq in genome-wide screening for genetic variants is imperative to accelerate the usage of genome-based breeding approaches for selecting agriculturally desirable traits in plants  and animals [41, 56]. Functional SNPs associated with quality traits (e.g., plant color, flowering, fruit color, size, and ripening) and/or quantitative traits (e.g., grain yield, abiotic, and biotic stress tolerance) may result in phenotypic diversity among individuals. Previous studies have used RNA-Seq analysis to identify SNPs in relatively smaller genomes, such as barley , and larger genomes, such as wheat . One of the main goals of livestock germplasm improvement is identifying the genetic variation associated with phenotypic traits of economic importance. By screening 15 duck transcriptomes, SNPs in genes related to fat metabolism and digestion were found in genomic regions that have undergone selective pressures . In a similar study, SNPs associated with the fat deposition in sheep have been identified, potentially leading to breeding programs that reduce tail size in fat-tailed phenotypes . While comparing RNA-Seq variant analysis methodologies for investigating beef production in Nellore steers, researchers recently identified SNPs in genes related to feed efficiency, an economically important trait in cattle .
2.3 Allele-specific expression (ASE)
RNA-Seq data can be used to investigate allele-specific expressions (ASEs), which denotes a differential expression of two or more alleles in a diploid or a polyploid organism, sometimes may result in multiple traits and phenotypes. Heterozygous SNPs may lead to ASE, and this phenomenon is conserved in most higher organisms, including those in plant and animal kingdoms. Due to the intrinsic potential of heterozygous SNPs, ASE can be a sensitive marker for detecting cis-regulatory variation and reducing background noise in an individual . Heterozygous variants have been identified in coding regions of mRNA, possibly leading to a variant polypeptide or a truncated protein ; non-coding regions (splice site, 5’-UTR, or 3’-UTR), possibly influencing mRNA processing and degradation ; and non-coding regulatory regions (promoter, enhancer, or silencer), possibly affecting the binding of transcription and epigenetic factors . Genetic and epigenetic factors regulate transcriptional activity and contribute to ASE, and an imbalanced expression via heterozygous SNP loci in a non-haploid genome may lead to a diseased or abnormal condition . Using whole genome sequencing (WGS) alone, variants throughout the entire genome can be identified. However, by combining WGS and RNA-Seq analyses, ASE and allele silencing information can also be obtained.
Of the many bioinformatics tools and databases created to explore ASE, a few are listed here (Table 3). However, despite the recent developments in ASE bioinformatics analysis, significant challenges in applying these tools include: 1) required family tree information, i.e., sequencing data from the individual under investigation and their respective parents, which is more laborious and costly; 2) required phased genotype information, i.e., the haplotype of the individual must be known in order to use the source file as input; 3) commonly required genomic and transcriptomic data to obtain ASE, but MBASED (Table 3) requires only RNA-Seq data; 4) common usage of short-read data (100-250 bp) due to the low error rate, which is incapable of covering multiple SNVs and subject to read bias at the exon-intron junctions; and 5) lack of advanced statistical methods. Long read (1-100 kb) data allows the detection of multiple SNVs, but it is prone to high error rates and low throughput, which is not ideal for downstream ASE quantification. Therefore, researchers can use a hybrid sequencing approach that combines both short and long reads. IDP-ASE (Table 3) can utilize such hybrid data to simultaneously phase haplotype and quantify the ASE at both gene and transcript/isoform levels. More sophisticated tools are required to identify ASE associated with multiple phenotypes and complex traits in comprehensive datasets.
Using genome-wide analysis, the underlying genetic and molecular mechanisms associated with ASE in heterosis have been determined in hybrid rice . ASE of
2.4 Alternative splicing (AS)
During the canonical splicing process in eukaryotes, introns are removed as lariats, and the flanking exons are rejoined to form a processed mRNA, with sequences in the RNA determining where splicing occurs. Usually, exons of the same mRNA are spliced, but sometimes exons from different mRNAs can be combined by trans-splicing . The RNA splicing machinery is a complex of proteins called the spliceosome, its major components being small nuclear Ribo-Nuclear Proteins (snRNPs). The three main types of spliceosome complexes are GU–AG spliceosome (major spliceosome), AU–AC spliceosome, and trans-spliceosome . In general, three main classes of RNA splicing are found: pre-mRNA splicing, Group II introns self-splicing, and Group I introns self-splicing. A single gene can produce multiple products by alternative splicing (AS). In addition to normal, canonical splicing, the primary AS events identified in eukaryotes are exon skipping (ES), mutually exclusive exons (EE), alternative 5′ donor sites (A5), alternative 3′ acceptor sites (A3), alternative promoters (AP), intron retention (IR), and alternative polyadenylation (APA) . Of these, the later three events gained attention recently with the advancements in RNA-Seq. AS is often regulated by activator and repressor proteins, and it can lead to premature termination of translation due to the interaction of exon junction complexes (EJC) with release factors, triggering the Nonsense-Mediated mRNA Decay (NMD) pathway .
RNA-Seq data can be assembled into full-length isoforms from the raw reads associated with AS of the same gene, and then the corresponding AS events can be identified and characterized. Mate-pair and paired-end sequences have performed better than single-end short-reads for detecting AS patterns . Among the contemporary approaches, long-read sequencing (PacBio/Oxford Nanopore) is an ideal solution for generating full-length transcript sequences and detecting AS events and isoforms . Full-length isoforms can be assembled with or without a reference, and each approach requires specific bioinformatics software. Some of these AS tools and databases are presented here (Table 4). Many AS tools can be used to analyze these AS events genome-wide and/or for a single gene. For example, the ASGAL pipeline (Table 4) begins by building a splice graph from a reference genome and an annotation file. Then, the RNA-Seq reads are aligned to the splice graph. Finally, these splice graph alignments are used to detect novel AS events.
Emerging functional roles of AS in generating transcriptomic and proteomic diversity have been evident in diverse biological processes . In the tea leaves of a
2.5 Alternative polyadenylation (APA) site usage
During post-transcriptional processing at the 3’UTR region of pre-mRNA, differential usage of polyadenylation sites can lead to a diverse set of transcript isoforms with different 3’UTR lengths and sequences, as part of a ubiquitous regulatory mechanism called Alternative Polyadenylation (APA). Most eukaryotic genes have multiple APA sites (APAs) that are often found in a coding region (CR-APA) or 3’UTR (UTR-APA) . APAs found in internal intronic and exonic regions account for a small proportion of identified APAs, but these predominantly disrupt the coding regions and can result in variable protein isoforms or NMD decay . In contrast, APAs found in the terminal exon and 3’UTR regions account for a significant proportion of identified APAs, and though such APAs usually do not disrupt the coding regions, they may result in transcript isoforms with variable lengths. A poly(A) tail in the 3’UTR region of an mRNA transcript generally provides mRNA stability, localization, and translational efficiency, so these factors are subject to APA-mediated regulation . Since the 3’UTR region can have hotspots for the binding of miRNAs and RNA-binding proteins (RBPs), any modifications in this region may lead to new RNA species interactions or the formation of novel secondary structures, thereby affecting translational efficiency [101, 103]. APAs likely play a role in many processes involved in gene expression, including nuclear export, localization, stability, degradation, repression, translation, and protein diversification . Additionally, APAs associated with differentiation, proliferation, and tissue-specific expression have been reported .
APAs at the gene-level can be discovered using EST, microarray, RNA-Seq, 3’ RNA-Seq, and qRT–PCR methodologies. However, genome-wide screening for APAs can be achieved through NGS based approaches, such as Whole Transcriptome Termini Site sequencing (WTTS-Seq), poly(A) site sequencing (PAS-Seq), direct RNA sequencing (DRS), poly(A) single-molecule sequencing, as well as 3′ region extraction and deep sequencing (3′ READS). Moreover, researchers can engage in cell type-specific APA profiling by preprocessing the samples with specialized wet-lab methods, such as cell sorting, crosslinking immunoprecipitation and green fluorescent protein (GFP)-tagging, and cellular and molecular barcoding. All these methods utilize total RNA or mRNA as their starting material, but they diverge in their usage of polyA enrichment, library preparation, and sequencing strategies. Usually, NGS data analysis for APAs includes preprocessing, size selection, QC, mapping/assembly, normalized expression value assessment for the poly(A) enriched 3’UTRs or transcripts, DGE, functional annotation, motif analysis, and pathway analysis. A few tools that use most of these steps and databases for APA analysis are presented (Table 5).
APA processing has been associated with around 70% of human genes, with the longest resulting isoform for each usually observed to be the most abundant [102, 116]. Recent studies have proposed a role for APAs in leaf development and stress response in the two dominant rice (
2.6 microRNA (miRNA) profiling
RNA-Seq can identify and characterize diverse classes of small (17-200 bp) ncRNAs, including miRNAs, siRNAs, piRNAs, tsRNAs, snoRNAs, and snRNAs. Almost all types of RNAs crosstalk, and especially miRNAs, the abundant class of sRNAs act as mediator molecules in regulating and deregulation of genes via complementary binding to miRNA response elements (MREs) on target transcripts . Moreover, co-localization and co-expression of ncRNA and mRNA and their interactions are well established . MiRNA genes can be found in exonic, intronic, and intergenic regions of the genome, and they are predominantly localized, form clusters, and generally transcribed together as a single transcriptional unit. The various miRNAs can positively and/or negatively regulate gene expression post-transcriptionally or by translational repression . While competing endogenous RNA, ceRNAs (e.g., lncRNAs and circRNAs) contain MREs and can regulate gene expression by acting as “miRNA sponges”, thus reducing the availability of one or more miRNAs for other potential targets . A nascent miRNA transcript undergoes post-transcriptional processing and nuclear export during the canonical regulation, eventually being loaded into the RNA-induced silencing complex (RISC) . After the incorporated miRNA binds to a target mRNA at MREs often located in the 3’-UTR, RISC mediates gene expression by post-transcriptional gene silencing (PTGS) or by mRNA cleavage or mRNA degradation . However, the presence of ceRNAs challenges the canonical miRNA regulation of gene targets, and the mechanisms and functions of miRNA sponges are still unclear .
Though several wet lab and computational methods have been evolved in the past two decades for genome-wide screening of miRNAs, in silico approaches, continue to be more widely used due to the ease in exploring the properties of miRNAs. MiRNAs are highly conserved, and the thermodynamics of miRNA secondary structures and target binding have been elucidated; identification of conserved and novel miRNAs and their targets can be performed using readily available bioinformatics tools. A few frequently accessed databases and tools used are listed here (Table 6). Most studies have applied homology-based approaches in identifying conserved miRNAs, and miRNA precursors can be identified by conducting secondary structure analysis using RNAfold  or mfold . The properties of miRNAs, such as cooperativity and multiplicity, can also predict miRNAs and their targets computationally .
|Databases||miRNA gene prediction tools||miRNA target prediction tools|
|1||Rfam||||UEA sRNA workbench||||miRWalk|||
Since the first reported miRNAs in
2.7 Circular RNAs
Among the many ncRNAs species, circRNAs are characterized by a stable, closed-loop structure formed through back-splicing via an upstream splice acceptor (SA) site, in contrast to the downstream SA sites of standard linear splicing . CircRNAs span exonic, intronic, intergenic regions, UTR (5′ and 3′), and lncRNA loci , and they are stable, conserved, non-random, as well as cell-type and tissue-specific . Additionally, circRNAs have been found in all life domains, and, similar to miRNAs, their orthologous expression facilitates discovery, validation, and functional assignments. CircRNAs are transcribed at higher levels than mRNA in specific cells, tissues, or conditions, and they are expressed during chromatin remodeling  and in some disease-specific contexts . For example, 14.4% of actively transcribed genes in human fibroblasts produced circRNAs , and due to their orthologous, tissue-specific, and spatial expression tendencies, circRNAs may be employed as plausible biomarkers in disease control and treatment . Biological functions for circRNAs continue to be discovered and currently include scaffolding for RNA-binding proteins; formation of regulatory complexes; promotion of translation; regulation of protein function; and target decoys for other regulatory molecules, like miRNAs .
Similar to the methods used in experimental validation of linear mRNA, circRNA-forming exons can be determined by RNA-Seq, back-splice junction specific quantitative PCR (qPCR), northern blot, microarrays, RNA fluorescence in situ hybridization (FISH), Chromatin immunoprecipitation (ChIP), RNA immunoprecipitation (RIP), RNA pulldown, mass spectrometry,
The biogenesis mechanisms and functional roles of plants are different from animals, but their expression-specific patterns are very similar . Plant circRNAs have been implicated in stress-induced (dehydration, chilling, high-light, etc.) expression patterns . Intricate regulatory roles of circRNAs in ripening through ethylene signaling pathway has been investigated using integrated RNA-Seq and bioinformatics analysis in tomato . The role of circRNAs in the fat deposition by regulating adipogenic differentiation and lipid metabolism has been determined by studying subcutaneous adipose tissues of two pig breeds using RNA-Seq and bioinformatics and their potential to serve as early diagnostic markers in treating metabolism-related diseases . CircRNAs found on four casein genes in the bovine mammary gland harbor complementary sites for specific miRNAs, suggesting their regulatory role in milk protein synthesis. These circRNAs can be used to fine-tune the gene expression of casein genes, thus producing high-quality milk protein and enhanced milk in dairy cows .
2.8 Single-cell RNA-Seq
Cell-specific transcriptome changes are critical for understanding single cells or groups of cells throughout tissues, organs, and organ systems. Single-cell RNA-Seq (scRNA-Seq) can be used to measure individual gene expression in a single cell and the distribution of expression levels across a cell population. It was first developed to undertake the whole-transcriptome analysis of a single mouse blastomere  and gained widespread popularity recently due to sequencing chemistry advancements and the steep decline in sequencing costs since 2014. scRNA-Seq can illuminate the complex interplay between intrinsic cellular processes and extrinsic stimuli in cell fate determination , and scRNA-Seq can facilitate novel discovery species or regulatory processes, which may serve as tools in biotechnology and medicine . Many scRNA-Seq protocols have been developed, often differing in their methods used for cell isolation , but studies continue to be limited by the difficulties of culturing certain cell types and by issues involving accurate and precise viable cell isolation .
Different methodologies are available in generating single-cell RNA-Seq data from a biological sample. However, most of these methodologies utilize these steps: 1) digest the tissue, i.e., single-cell dissociation; 2) isolate single cells by plate-based or droplet-based methods; 3) capture intracellular mRNA and prepare the massively multiplexed library with sample-specific cellular barcodes or unique molecular identifiers (UMI); 4) sequence on an NGS platform to generate raw reads. Several different platforms and frameworks (stand-alone, cloud-based, and interactive web-based) are presently available for conducting the bioinformatics analysis of scRNA-Seq data, and a few examples for each platform are listed in Table 8. The majority of scRNA-Seq frameworks partially or fully follow these steps: QC; alignment; mapping QC; cell QC; normalization; batch correction; imputation; cell cycle-assignment; feature selection; dimensionality reduction and visualization; pseudotime; cell type annotation; DGE; unsupervised clustering; and network analysis.
|Databases||Web-based scRNA-tools||Cloud-based scRNA-tools|
|scRNASeqDB||||Single Cell Explorer||||Falco|||
scRNA-Seq has been a valuable tool in determining differential gene expression by using gene cluster analyses among heterogeneous cell types and understanding their complex interactions and cellular responses in woody plants . The use of scRNA-Seq and single-cell gene regulatory networks (scGRN) frameworks in studying complex agronomic traits and resistance to various stresses in crops have been proposed . Gene expression profiles among subcellular populations of the skeletal muscle and its development in chicken have been determined using scRNA-Seq, which are important in producing quantity and quality meat in poultry . In sea urchins, using scRNA-Seq, different cell types commonly seen during the embryo development have been identified by the selective inhibition of Delta/Notch and Wnt responsive pathways . Studying the infant and adult cattle mammary glands (MG) with scRNA-Seq, dairy scientists developed a MG-specific single-cell atlas, determined the cell-type heterogeneity, and identified a novel myofibroblast that can differentiate into luminal epithelial cells, and has potential role in lactation and immunity .
Metatranscriptome refers to the total RNA sequences (protein-coding and non-coding) collected from a location or source or body, which corresponds to the expression profiles of prokaryotic and eukaryotic species found in natural environments such as soil, sea, space, gut, airways, feces, and skin . Metagenomics focuses on the overall genetic composition of the microbial community, while metatranscriptomics provides more profound insights about the genes expressed, their abundance, diversity, differential expression, and aims to address the functional, metabolic, and pathway diversity present in a microbial community . Metatranscriptome is a dynamic entity that can detect gene expression variability with time and environmental changes . Metatranscriptomics is a culture-free profiling method that helps understand the structure (i.e., microbial communities and taxonomic analysis), function (DEGs, enrichment, and annotation), and mechanisms (adaptability, selection, and domestication) of complex microbial communities . It also helps in understanding RNA-mediated regulation and in deriving biological signatures associated with microbial communities.
The experimental methods for analyzing RNA, such as northern blot, qRT-PCR, microarrays, cDNA clone-based Sanger sequencing, and RNA-Seq, are also used for studying and analyzing metatranscriptomes. The main challenges in molecular metatranscriptome methods include low total RNA yield commonly found in environmental samples, high rRNA content in total RNA and its removal, and the fidelity of microbial mRNA isolated. Metatranscriptome analysis using RNA-Seq can distinguish and handle metadata , whereas the previous transcript analysis approaches failed to: categorize or catalog metadata, understand community-wide gene expression, and determine functional diversity. Most of the metatranscriptome tools utilize one or more steps from the following: 1) preprocessing (QC, trimming, and filtering), 2) Binning, 3) Mapping or
Though several applications have been documented in the recent past, only selected studies from agriculture and food disciplines are presented here. In agriculture, metatranscriptome analysis can help us find beneficial and harmful rhizosphere-associated microbes specific to plant and soil types. Thus, it allows us to enrich associated rhizosphere microbes that promote crop health and yield. Metatranscriptomics has been used in deciphering multifunctional genes and enzymes linked with the degradation of contaminants in the crop rhizosphere . Metatranscriptomic profiling helped to determine the variation in the rumen’s microbial composition based on the host feed efficiency in beef cattle . In the food industry, metatranscriptomics can be applied to detect food contamination, toxins, and metabolic activities of food-associated microbes and enhance food safety, quality, and function. Metatranscriptomics has been used in finding insights into the core functional microbiota of soy sauce aroma type liquor production in the fermentation process under varied environmental conditions . Metatranscriptome analysis has been used to study the community dynamics of bacteria in fermented foods . Using metatranscriptome sequencing followed by 16S and 18S rRNA analysis, temperature-induced changes in the structural landscape and functional diversity of the mesophilic and thermophilic food web communities respond to two contrasting temperatures in the rice fields have been observed .
2.10 Systems biology/biological network analysis
The ultimate goal of RNA-Seq analysis is to understand the underlying biological processes and mechanisms linked with gene expression and regulation. From molecule to biospheres, biological systems can be represented as networks of pairwise relationships between biological entities throughout various levels of organization. The interactions between biomolecules can be: direct, via physical contact, or indirect, via causal chains or mere correlations. Interactomes that are commonly studied include networks between: DNA–RNA; DNA-Protein; RNA–RNA; RNA-Protein; and Protein–Protein. Theoretically, any network of words can be merged with these interactions, as some elements are shared by both, like common gene, transcript, or protein identifiers. The systems biology approach examines the overall structure and function of a cell or an organism, rather than looking at its components as isolated events . The systems biology approach considers gene expression of an organism or an interaction as a sum of individual genes, sets of genes, and other compounding factors . Gene regulatory networks (GRNs) and co-expression analyses are common elements while studying a biological problem as a system rather than as an individual problem .
Given the growing avalanche of RNA-Seq data along with the wealth of network analysis (NA) programs, there are tremendous opportunities to find networks within and between their available datasets, guiding them toward valuable insights, future validation experiments, and a more holistic understanding of their research. NA of RNA-Seq data can illuminate the interrelationships and functional associations  between several elements: regulators/co-regulators, upstream/downstream sequences, and genic features; differentially expressed subnetworks; global connectivity among genes and gene networks. Often combined with the aforementioned biomolecular interactions, a more abstracted view of biological systems can be provided by semantic networks, which involve the relationships between categories of biological meaning, commonly ontological, that have been assigned to the biomolecules. Traditional systems biology relied on mathematical and statistical models. In contrast, modern systems biology depends on computer models that simulate an organism’s entire biological systems by considering all components . So, these approaches depend on the constant selection of predictors, building models, and testing. Thus, it allows us to move from descriptive science to data science in providing a holistic answer to the biological question under investigation. Thankfully, the inherent complexity of systems biology is ameliorated by the availability of many open-source tools to reconstruct and visualize networks (a few tools and databases are presented in Table 10).
RNA-Seq data from a plant (maize) and a pathogen (
In conclusion, a combination of multi-omic approaches and bioinformatics tools developed to date has unquestionably expanded the scope of RNA-Seq applications and improved our understanding of gene expression data. In addition to the applications discussed in this chapter, fusion gene analysis, RNA editing, RNA interference, and Epitranscriptomics can also be used to understand novel functions of the gene, complex interactions, and the interplay between coding and non-coding regions during gene regulation. In the near future, we will be able to: sequence transcriptomes from complex environments, study more comprehensive RNA datasets using data science tools, functionally validate predicted genes using gene-editing technologies, which will positively impact the food and agriculture sectors.
The authors acknowledge Ms. Shalini P. Etukuri and Dr. Govind Sharma at Alabama A&M University for reviewing this book chapter. Also, authors would like to thank anonymous reviewers and editor for their efforts in improving this book chapter. The authors acknowledge the funding support by the Capacity Building grant #2020-38821-31103 from the USDA National Institute of Food and Agriculture.
Conflict of interest
The authors declare no conflict of interest.