Overview of some sequencing platforms for transcriptome analysis and their characteristics.
The recent remarkable development of transcriptomics technologies, especially next generation sequencing technologies, allows deeper exploration of the hidden landscapes of complex traits and creates great opportunities to improve livestock productivity and welfare. Non-coding RNAs (ncRNAs), RNA molecules that are not translated into proteins, are key transcriptional regulators of health and production traits, thus, transcriptomics analyses of ncRNAs are important for a better understanding of the regulatory architecture of livestock phenotypes. In this chapter, we present an overview of common frameworks for generating and processing RNA sequence data to obtain ncRNA transcripts. Then, we review common approaches for analyzing ncRNA transcriptome data and present current state of the art methods for identification of ncRNAs and functional inference of identified ncRNAs, with emphasis on tools for livestock species. We also discuss future challenges and perspectives for ncRNA transcriptome data analysis in livestock species.
- genome editing
- livestock species
- long non-coding RNA
- non-coding RNA
A vast portion of the mammalian transcriptome is composed of non-protein coding transcripts or non-coding RNA (ncRNA). Some ncRNAs are processed into functionally important transcripts such as microRNA (miRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA), small interfering RNA (siRNA), PIWI-interacting RNA (piRNA), circular RNA (circRNA), long non-coding RNA (lncRNA) and several classes with limited information about their functions. In addition to the well described ncRNA classes, clusters of ncRNA (22–200 nucleotides (nt)) were detected at the 5, and 3′end of human and mouse genes, and named promoter-associated short RNAs (PASRs) and termini-associated short RNAs (TASRs) . Mercer et al.  described a class of ncRNA, about 50–200 nt, that are processed from the 3′UTRs of protein-coding genes (uaRNAs). The uaRNAs are in sense direction to the protein-coding gene and show stage, sex and subcellular specific expression. A class of ncRNA derived from tRNA precursors and named tRNA-derived RNA fragments (tRF) or tRNA-derived small RNAs (tsRNAs) appear to be processed by Dicer while others are Dicer independently processed [3, 4]. Small nucleolar RNAs (snoRNA) can also be processed into small miRNA-like molecules called sno-derived RNAs or sdRNAs [5, 6] which play roles in guiding enzymes to target RNAs for modification . In this chapter, only the main classes of functional ncRNAs (miRNA, snoRNA, siRNA, piRNA and lncRNA), not considering the translation related ncRNAs (rRNA and tRNA), will be further discussed. NcRNAs have been implicated in many biological processes including transcriptional inference, translational modifications, mRNA cleavage, epigenetic modifications, regulation of structural organization, and modulation of alternative splicing, small RNA precursor, and endo or secondary siRNA generation [7–10].
2. Transcriptome analysis of non-coding RNA
2.1. Platforms for transcriptome analysis of non-coding RNA
Transcriptome analysis reached a turning point in its history with the arrival of high throughput next-generation sequencing technologies like RNA-Sequencing (RNA-Seq) [11, 12]. Before this time, microarray was the gold standard for transcript profiling or simultaneous measurement of the expression level of thousands of genes in a given sample [13, 14]. Microarray technology however has major drawbacks like non-specific probe hybridization signals and errors in background level measurements , as well as limited gene diversity since probes are designed to represent only a set of preselected genes. Unique hybridization properties of each probe may affect their dynamic range and thus create bias in data processing algorithms . The flexibility offered by RNA-Seq technology enables detection of unknown splice junctions , novel transcripts , new single nucleotide polymorphisms (SNPs)  and many other features all in the same assay. RNA-Seq technology has taken the possibility of fine tuning our knowledge of the transcriptome to a much higher level. In recent years, RNA-Seq has proved its worth as a technology that will replace microarray in whole-genome transcript profiling [20–22]. Correlation of RNA-Seq to RNA-Seq differential gene expression data resulted in good overlap than RNA-Seq to microarray data [23, 24], thus confirming that RNA-Seq is the preferred method to analyze the transcriptome. Moreover, correlation of transcriptome quantification by the two methods versus transcript level measured by shotgun mass spectroscopy showed better estimation with RNA-Seq analysis . Through the evolution process of RNA-Seq technology, other new aspects have been included such as allele specific transcriptome analysis. Moreover, since the RNA-Seq procedure does not rely on known genome annotation, but rather on all the information available in a given sample, there is clear opportunity to make discoveries at a rate never expected before.
A diversity of platforms offer a wide range of RNA-sequencing possibilities. For example, Illumina HiSeq and MiSeq technologies offer short sequence reads (36–300 base pairs (bp)) while Oxford Nanopore can reach sequence lengths of greater than 150 kilo base pairs (kb) . The sequencing techniques could be DNA-polymerase dependent (i.e. sequencing-by-synthesis (e.g. Illumina MiSeq/HiSeq)) while others like PacBio and Oxford Nanopore are single-molecule sequencers. The sequencing error rate ranges from 0.1% (Illumina MiSeq/HiSeq) to about 1.3% (PacBio RSII single pass). An overview of sequencing platforms and their characteristics is shown in Table 1. The error rate between platforms varies , so it is important to consider this especially when the goal is to sequence short read transcripts like miRNA.
|Platform||Read length1 (base pair)||Throughput2||Number of reads3||Error profile|
|Illumina MiniSeq (high output)||75 (SE)||1.6–1.8 Gb||22–25 M||<1%, substitution|
|75 (PE)||3.3–7.5 Gb||44–50 M||<1%, substitution|
|150 (PE)||6.6–7.5 Gb||44–50 M|
|Illumina MiniSeq (mid output)||75 (SE)||2.1–2.4 Gb||14–16 M||<1%, substitution|
|Illumina MiSeq v2||36 (SE)||540–610 Mb||12–15 M||<0.1%, substitution|
|25 (PE)||750–850 Mb||24–30 M||<0.1%, substitution|
|150 (PE)||4.5–5.1 Gb||24–30 M||<0.1%, substitution|
|250 (PE)||7.5–8.5 Gb||24–30 M||<0.1%, substitution|
|Illumina MiSeq v3||75 (PE)||3–4 Gb||44–50 M||<0.1%, substitution|
|300 (PE)||13–15 Gb||44–50 M||<0.1%, substitution|
|Illumina NextSeq 500/550 (high output)||75 (SE)||25–30 Gb||400 M||<1%, substitution|
|75 (PE)||50–60 Gb||800 M||<1%, substitution|
|150 (PE)||100–120 Gb||800 M||<1%, substitution|
|Illumina NextSeq 500/550 (mid output)||75 (PE)||16–20 Gb||~260 M||<1%, substitution|
|150 (PE)||32–40 Gb||~260 M||<1%, substitution|
|Illumina HiSeq250v2 Rapid run||36 (SE)||9–11 Gb||300 M||0.1%, substitution|
|50 (PE)||25–30 Gb||600 M||0.1%, substitution|
|100 (PE)||50–60 Gb||0.1%, substitution|
|150 (PE)||75–90 Gb||0.1%, substitution|
|250 (PE)||125–150 Gb||0.1%, substitution|
|Illumina HiSeq250v3||36 (SE)||47–52 Gb||1.5 B||0.1%, substitution|
|50 (PE)||135–150 Gb||3 B||0.1%, substitution|
|100 (PE)||270–300 Gb||0.1%, substitution|
|Illumina HiSeq250v4||36 (SE)||64–72 Gb||2 B||0.1%, substitution|
|50 (PE)||180–200 Gb||4 B||0.1%, substitution|
|100 (PE)||360–400 Gb||0.1%, substitution|
|125 (PE)||450–500 Gb||0.1%, substitution|
|Illumina HiSeq3000/4000||50 (SE)||105–125 Gb||2.5 B||0.1%, substitution|
|75 (PE)||325–375 Gb||0.1%, substitution|
|150 (PE)||650–750 Gb||0.1%, substitution|
|Illumina HiSeqX||150 (PE)||800–900 Gb||2.6–3 B||0.1%, substitution|
|150 (PE)||1.6–20 B||167 Gb–6 Tb|
|Ion Proton||200 (SE)||Up to 10 Gb||60 M||1% indel|
|Ion PGM 318||200 or 400 (SE)||0.6–2 Gb||4–5.5 M||1% indel|
|Ion PGM 316||200 or 400 (SE)||0.3–1 Gb||2–3 M||1% indel|
|Ion PGM 314||200 or 400 (SE)||30–100 Mb||0.4–0.5 M||1% indel|
|PacBio Sequel||8–12 kb (SE)||3.5–7 Gb||>100,000||N/A|
|PacBio RS II||~20 kb||0.5–1Gb||~55,000||~13%, indel|
|454 GS Junior||~400 (SE, PE)||35 Mb||~0.1 M||1%, indel|
|454 GS Junior+||~700 (SE, PE)||70 Mb||~0.1 M||1%, indel|
|454 GS FLX Titanium XLR70||Up to 600; 450 mode (SE, PE)||450 Mb||~1 M||1%, indel|
|454 GS FLX Titanium XL+||Up to 1000; 700 mode (SE, PE)||700 Mb||~1 M||1%, indel|
|SOLiD 5500 xl||50 or 75 (SE)||160–320 Gb||~1.4 B||≤0.1%, AT bias|
|SOLiD 5500 Wildfire||50 or 75 (SE)||80–160 Gb||700 M||≤0.1%, AT bias|
|Oxford Nanopore MK1 MinION||Up to 200 Kb||~1.5 Gb||~12%, indel|
|Oxford Nanopore GridION X5||~Hundreds of Kb||100 Gb|
|Oxford Nanopore PromethION||~4 Tb|
The challenges of managing RNA-Seq data are considerable in terms of data storage and analysis as well as algorithm development. Since the technology is not yet fully matured, shortcomings exist at every step of sequence analysis. Various tools are available for alignment of reads, transcript construction, quantification, differential gene expression, pathways and correlation analyses  (Tables 2 and 3). Nonetheless, the use and specificity of the softwares differ highly from one type of analysis to another and the hardest part is making sure that the right tool is chosen at every step. A review of best practices for RNA-Seq data analysis was published recently . The gap between the rapid evolution of RNA-Seq technology and the development of data analysis tools is hindering wide application in livestock species. Most data analysis tools are developed for use with genomes of human and common model organisms (mouse, rat) and require tweaking before use with livestock genomes. For example, when performing target prediction analysis for newly discovered transcripts, it is the practise to use human/mouse databases as it brings a lot of power to the analysis. However, there is great bias coming from the assumption that livestock biological systems are identical to human or mouse.
|Trimming*||Trimmomatic||Illumina single end and paired end quality and adapter trimming. http://www.usadellab.org/cms/?page=trimmomatic|||
|PEAT||Specific for paired end sequencing quality and adapter trimming. https://github.com/jhhung/PEAT|||
|Trim Galore||Quality and adapter trimming with some extra functionality for Bisulfite-Seq. https://www.bioinformatics.babraham.ac.uk/projects/trim_galore|||
|Skewer||Adapter trimming, can take into account indels. https://github.com/relipmoc/skewer|||
|AlienTrimmer||Detect and remove alien k-mers in both ends of sequence reads. ftp://ftp.pasteur.fr/pub/gensoft/projects/AlienTrimmer/.|||
|Cutadapt||Finds and remove adapter, primers, poly-A and other types of unwanted sequences. https://github.com/marcelm/cutadapt|
|NxTrim||Discard as little sequence as possible from Illumina Nextera Mate Pair reads, single end and paired end reads. https://github.com/sequencing/NxTrim|||
|SeqPurge||Can detect very short adapter sequences. https://github.com/imgag/ngs-bits/blob/master/doc/tools/SeqPurge.md|||
|Alignment**||STAR||Align RNA-Seq reads to a reference genome, detect splice junctions. https://github.com/alexdobin/STAR|||
|Bowtie / Bowtie2||Align short DNA sequences to genomes with Burrows-Wheeler index. bowtie-bio.sourceforge.net/bowtie2||[56, 57]|
|BWA||Mapping low-divergent sequences against large reference genome. bio-bwa.sourceforge.net|||
|TopHat2||Use Bowtie for alignment. TopHat analyzes results to identify splice junctions. https://ccb.jhu.edu/software/tophat|||
|Rockhopper||Specific for bacterial RNA-Seq data. It supports de novo and reference based transcript assembly. cs.wellesley.edu/~btjaden/Rockhopper|||
|SpliceMap||De novo splice junction discovery and alignment tool. https://web.stanford.edu/group/wonglab/SpliceMap|||
|StringTie||De novo transcript assembly.|
Quantitation of full-length transcripts representing multiple splice variants for each gene locus. https://ccb.jhu.edu/software/stringtie
|Trinity||De novo reconstruction of transcriptomes from RNA-seq data. https://github.com/trinityrnaseq/trinityrnaseq/wiki|||
|Names||Major purpose1||Known miRNA annotation2||Novel miRNA discovery||DE analyses||Target prediction||Pathway enrichment||Livestock Species||References|
|UEA sRNA Workbench||miRNA identification||+||+||+||+||−||+|||
|Target scan||Target prediction||−||−||−||+||−||+|||
|DIANA-mirPath v3||Down-stream miRNA analyses||−||−||−||+||+||+|||
|MAGIA||Down-stream miRNA analyses||−||−||−||−||+||−|||
|miRNet||Down-stream miRNA analyses||−||−||−||−||+||+|||
|miRSystem||Down-stream miRNA analyses||−||−||−||−||+||+|||
|TransmiR||Down-stream miRNA analyses||−||−||−||−||+||+|||
|DIANA-mirExTra||Down-stream miRNA analyses||−||−||−||−||+||−|||
2.2. Generation of ncRNA sequence data and pre-mapping quality control
2.2.1. Generation of ncRNA sequence data
The choice of the sequencing platform is critical to attain the goals of a study. Numerous protocols and commercial kits to generate cDNA libraries from RNA samples are available and they are mostly based on the same principles (e.g. fragmentation, reverse-transcription, adapter ligation and amplification). The steps in library preparation for lncRNA are the same as for mRNA since they share similar biogenesis pathways. The starting material for lncRNA library preparation is total RNA. Majority of lncRNA transcripts have poly-A tails while a small proportion do not. Library preparation methods based on poly-A tail selection are cheaper but less robust since non-poly-A tail transcripts are lost. An ideal but more expensive method involves depletion of rRNA (constitutes ~90% of total RNA). Library preparation with rRNA depleted total RNA is robust as it allows quantification of all other RNA transcripts including lowly expressed transcripts. Thus, the first step in lncRNA library preparation is to consider whether to perform poly-A tail selection or to deplete rRNA (Figure 1). The next dilemma is deciding whether or not to preserve strand information during library preparation. As lncRNA annotation is still in the initial phase, it is crucial to preserve strand information to enable correct genome localization of novel transcripts. Paired-end sequencing is to be considered over single end sequencing for lncRNA characterization to facilitate construction of transcripts with clear-cut exon boundaries. Paired-end sequencing also allows accurate detection of splicing position. Sequencing long fragments (>100 bp) is also desired to get adequate coverage of the genome and consequently, better transcript construction. The number of multiplexed samples on each sequencing lane affects lncRNA sequence depth. Reducing cost by multiplexing more samples than necessary reduces quality of results obtained. It has been demonstrated that the depth of sequencing is relative to the nature of the expected results [30, 31]. To accomplish lncRNA discovery with confidence, a minimum of 100 million reads per sample is suggested to enable
The procedure for the generation of miRNA sequence data differs slightly from the procedure for lncRNA analysis. First of all, miRNAs are small (18–24 bp) in size and do not require RNA fragmentation prior to library construction. Total RNA is the recommended starting material for miRNA library preparation (Figure 1). Although some commercial kits provide the option to enrich the miRNA fraction prior to library preparation, there is evidence that some small RNA species are lost during enrichment . The protocols for miRNA library preparation are generally similar to lncRNA and include adapter ligation step, reverse transcription and amplification followed by size selection and purification of the cDNA. Fifty bp single end sequencing is sufficient for miRNA libraries since miRNAs are generally small. Thus, Illumina platforms are well suited for sequencing miRNA libraries. Studies showed that approximately 2 million reads are sufficient for differential expression analysis while 8 million reads are sufficient for discovery analysis [33, 34]. Considering that over 150 million reads are available per lane on HiSeq machines, sample multiplexing can be as high as 18 to 20 libraries per lane.
2.2.2. Common data processing steps
Upon availability of sequence data, many bioinformatics tools are used in the analytical procedures. Some processing steps are optional but strongly recommended; while others are required before the next step can be performed. Many pipelines have been developed to answer specific questions, but the softwares used can be very different. A global view of the general processing steps and frequently used tools for lncRNA and miRNA sequence data analyses are presented in Figures 2 and 3, respectively. These processing steps can be modified to include desired or specific tools depending on the research question.
2.2.3. Raw data quality control
Sequence data generated by Illumina platforms and most platforms is in FASTQ format. The FASTQ format is a text file consisting of the nucleic acid sequence (read) and base calling accuracy score (Phred score) attributed to each base pair of the sequence. FastQC , Picard tools (https://broadinstitute.github.io/picard/) and NGS QC tool kit  are often used to assess the quality of raw sequence reads. This step is necessary to determine if the sequencing outcome is as expected. These tools inform on the total number of reads, the overall quality of base call according to the position, GC percentage and other features. Care should be taken when interpreting the results because GC content is species specific and some softwares evaluate GC content according to the human genome. In order to avoid bias in the mapping step, a quality trimming is necessary to get rid of low quality base pairs and remaining adapter sequences. A recent study showed that incorrect trimming can lead to generation of short reads impairing the capacity to correctly predict differences in expression changes . Several trimming tools are available  (https://omictools.com/adapter-trimming-category) including Trimmomatic , FASTX-Toolkit , CutAdapt , etc.(Table 2). Following trimming, filtering of reads is necessary to get rid of very short and overall low quality reads to keep bias level as low as possible.
After trimming and filtering, reads are ready for alignment or
2.2.5. Transcript construction and quantification
RNA-Seq transcript construction and the alignment steps can demand considerable computing time. Transcript construction tools are many (https://omictools.com/transcript-quantification-category) including commonly used tools like Cufflinks , iReckon , StringTie , etc. This step requires paired-end data and high sequence coverage to reconstruct lowly expressed transcripts. With the assumption that transcripts are species specific, raw data or alignment files from all samples from the same population can be merged to increase coverage . This modification will help clarify transcript boundaries in case of
2.2.6. miRNA processing steps
Overall, the procedures for miRNA identification and discovery are less time consuming and do not include as many steps as for mRNA and lncRNA identification. The global process includes quality and adaptors trimming with quality checkpoints before and after each step. A size selection to keep sequences between 17 and 30 nt (sometimes up to 35 nt) is often performed right after the quality and adaptors trimming step. This is followed by read mapping and filtering of other RNA sequences (rRNA, tRNA, snRNA, mRNA, lncRNA, etc.). The reads thought to represent miRNA are analyzed with miRNA prediction tools like miRDeep2 , miRanalyzer , mirTools 2.0 , etc. (Table 3). Subsequent interrogation of miRBase database enables classification of retained miRNAs as known or novel miRNAs. A tool like miRDeep2 has a quantifier module that generates a read count table for each miRNA using precursor and mature sequence files as input. An overview of tools for miRNA identification are presented in Table 3 and further discussed in the next section.
3. Tools for ncRNA identification
3.1. Tools for miRNA identification
The identification of miRNAs can be either annotation of known miRNAs or discovery of novel miRNAs. A variety of algorithms and bioinformatics tools are applied to annotate known miRNAs as well as to discover new miRNAs from sequence data. These tools can use several features such as sequence conservation among species, structural features like hairpin and minimal folding free energy . Many tools are available for miRNA annotation (https://tools4mirs.org/software/known_mirna_identification/)  including frequently used tools like miRdeep , miRanalyzer , mirTools 2.0, UEA sRNA Workbench , sRNAtoolbox , and SeqBuster  (Table 3). Many more tools have been developed for novel miRNA discovery and miRNA precursor prediction (https://tools4mirs.org/software/precursor_prediction/) including frequently used tools like MiPred , miRanalyzer , miR-Abela , MiReNA , UEA sRNA Workbench  and mirDeep  (Table 3). Major features of miRNA discovery tools have been reviewed [82–84]. Regarding livestock species, the choice of methods for miRNA discovery and novel miRNA annotation vary among studies and species. For example, De Vliegher et al.  used miRbase  and UNAFold  for miRNA annotation and discovery in bovine mammary gland tissues while Peng et al  used miRbase  and RNAfold  for these purposes in porcine mammary glands. In our own studies, miRbase  and mirDeep2  were used to identify miRNAs in various tissues including bovine mammary gland tissues , milk fat [90–92], milk whey and cells .
3.2. Tools for lncRNA identification
To date, a large number of lncRNA genes have been identified in the genomes of human (141,353), cow (23,896) and chicken (13,085) (http://www.bioinfo.org/noncode/analysis.php, accessed on 24-03-2017). Several methodologies have been described to identify/distinguish lncRNAs from mRNAs and successfully applied to livestock species such as coding potential calculator (CPC) , PhyLoCSF , coding-non-coding index (CNCI) , coding potential assessment tool (CPAT) , Predictor of Long non-coding RNAs and mRNAs based on an improved k-mer scheme (PLEK)  and Flexible Extraction of LncRNAs (FEELnc) , etc. The FEELnc program developed by the functional annotation of animal genome project consortium (FAANG)  is recommended as a standardized protocol for lncRNA analyses in animal species. In order to distinguish lncRNAs from mRNAs, FEELnc program uses a machine-learning method for estimation of a protein-coding score according to the RNA size, open reading frame coverage and multi k-mer usage . The FEELnc program can derive an automatically computed cut-off so it maximizes the lncRNA prediction sensitivity and specificity. An overview of tools for lncRNA identification/characterization is listed in Table 4.
|Tools||Type||Major Function/web link||References|
|ChIPBase||Database||Identifies binding motif matrices and their binding sites. Predicts transcriptional regulatory relationships between transcription factors and genes. http://rna.sysu.edu.cn/chipbase/.|||
|LNCipedia||Database||Provides basic transcript information and structure, human lncRNA transcripts and genes. http://www.lncipedia.org/.|||
|lncRNAdb||Database||Provides comprehensive annotation of eukaryotic lncRNAs. Offers an improved user interface enabling greater accessibility to sequence information, expression data and the literature. http://www.lncrnadb.org/.|||
|LNCat||Database||Stores the information of 24 lncRNA annotation resources. Allows achieving refined annotation of lncRNAs within the interested region. http://biocc.hrbmu.edu.cn/LNCat/|||
|LncRNASNP||Database||Provide comprehensive resources of single nucleotide polymorphisms (SNPs) in human/mouse lncRNAs. bioinfo.life.hust.edu.cn/lncRNASNP/|||
|lncRNAWiki||Database||Provide open-content and publicly editable curation and collection of information on human lncRNAs. http://lncrna.big.ac.cn/index.php/Main_Page|||
|NONCODE||Database||Presents the most complete collection and annotation of non-coding RNAs (excluding tRNAs and rRNAs) for 18 species including human, mouse, cow, rat, chicken, pig, fruitfly, zebrafish, |||
|ALDB||Database||Enables the exploration and comparative analysis of lncRNAs in domestic animals. Offers information on genome-wide expression profiles and animal quantitative trait loci (QTLs) of domestic animals. http://res.xaut.edu.cn/aldb/index.jsp|||
|GENCODE||Database||Presents all gene features in the human genome.|
Contains annotation of lncRNA loci publicly available with the predominant transcript form consisting of two exons. https://www.gencodegenes.org
|ncRDeathDB||Database||Present a comprehensive bioinformatics resource to ncRNA-associated cell death interactions. www.rna-society.org/ncrdeathdb|||
|LncVar||Database||Presents genetic variation associated with long noncoding genes. bioinfo.ibp.ac.cn/LncVar|||
|IRNdb||Database||Combines microRNA, PIWI-interacting RNA, and lncRNA information with immunologically relevant target genes. http://irndb.org|||
|AnnoLnc||Annotation||Presents online portal for systematically annotating newly identified human lncRNAs.|||
|LongTarget||Target prediction||Present a computational method and program to predict lncRNA DNA-binding motifs and binding sites. lncrna.smu.edu.cn|||
|LncRNA2Function||Functional inferences||Facilitates search for the functions of a specific lncRNA or the lncRNAs associated with a given functional term, or annotate functionally a set of human lncRNAs of interest. http://mlg.hit.edu.cn/lncrna2function|||
|Co-LncRNA||Function inference||Presents a web-based computational tool that allows users to identify GO annotations and KEGG pathways that may be affected by co-expressed protein-coding genes of single or multiple lncRNAs. www.bio-bigdata.com/Co-LncRNA/|||
|LncReg||Function inference||Provides regulatory information about lncRNAs, such as targets, regulatory mechanisms, and experimental evidence for regulation and key molecules participating in regulation. bioinformatics.ustc.edu.cn/lncreg/|||
|Linc2GO||Function inference||Provides comprehensive functional annotations for human lincRNA. http://www.bioinfo.tsinghua.edu.cn/~liuke/Linc2GO|||
|FARNA||Function annotation||Integrates ncRNA information related to expression, pathways and diseases in a large number of human tissues and primary cells. www.cbrc.kaust.edu.sa/farna/|||
|ViRBase||Database||Provides the scientific community with a resource for efficient browsing and visualization of virus-host ncRNA-associated interactions and interaction networks in viral infection. http://www.rna-society.org/virbase|||
|LncRNA2Target||Database||Stores lncRNA-to-target genes. Provides a web interface for searching targets of a particular lncRNA or for the lncRNAs that target a particular gene. https://www.lncrna2target.org/|||
|Lncin||Function annotation||Identifies lncRNAs-associated modules from protein interaction networks and predicts the function of lncRNAs based on the protein functions in the modules. lncin.ym.edu.tw|||
|NPInter||Function annotation||Integrates experimentally verified functional interactions between noncoding RNAs (excluding tRNAs and rRNAs) and other biomolecules (proteins, RNA and genomic DNA). www.bioinfo.org.cn/NPInter|||
|CPC||Coding potential assessment||Distinguishes between coding and noncoding RNA. Uses a Support Vector Machine-based classifier to assess the protein-coding potential of a transcript. cpc.cbi.pku.edu.cn/|||
|CNCI||Coding potential assessment||Distinguishes between protein-coding and non-coding sequences independent of known annotations. Applies to a variety of species without whole-genome sequence or with poorly annotated information. https://github.com/www-bioinfo-org/CNCI|||
|CPAT||Coding potential assessment||Distinguishes between coding and noncoding RNA. Uses a logistic regression model to assess the protein coding potential. rna-cpat.sourceforge.net/|||
|FEELnc||LncRNA prediction||Derives an automatically computed cut-off so it maximizes the lncRNA prediction sensitivity and specificity. https://github.com/tderrien/FEELnc|||
|PLEK||lncRNA prediction||Uses k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from mRNAs. http://www.ibiomedical.net/plek/|||
3.3. Tools for identification of other non-coding RNA
Currently, few tools have been developed for the identification of groups of ncRNAs other than miRNAs and lncRNAs. The popular tools for piRNA identification include ProTRAC , piClust , piRNAQuest , etc. (Table 5). proTRAC detects piRNA clusters based on a probabilistic analysis with assumption of a uniform distribution while piClust uses a density based clustering approach for the detection of piRNAs. piRNAQuest allows a search of the piRNome for silencers . Another notable framework is SeqCluster , a python pipeline for the annotation and classification of non-miRNA small ncRNAs. The pipeline permits a highly versatile and user-friendly interaction with data in order to easily classify small RNA sequences with putative functional importance . For other small RNAs, ncPRO-seq  allows the discovery of unknown ncRNA or siRNA-coding regions from small RNA sequence data. DARIO  is a web-tool that allows annotation and detection of ncRNAs from various species but not livestock species. CoRAL  is a machine learning method that classifies ncRNAs by relying on biologically interpretable features. Several tools also have been developed for predicting circRNAs such as PredicircRNATool  and PredcircRNA  which apply a machine learning approach to distinguish circRNAs from other ncRNAs (Table 5).
|Tools||Types||Main Features/web link||References|
|ProTRAC||piRNA prediction||Detects and analyses piRNA clusters based on quantifiable deviations from a hypothetical uniform distribution regarding the decisive piRNA cluster characteristics. https://sourceforge.net/projects/protrac/|||
|piClust||piRNA prediction||Finds piRNA clusters and transcripts from small RNA-seq data using a density based clustering approach. http://epigenomics.snu.ac.kr/piclustweb|||
|piRNAQuest||piRNA database||Provides annotation of piRNAs based on their genomic location in gene, intron, intergenic, CDS, UTR, repeat elements, pseudogenes and syntenic regions. bicresources.jcbose.ac.in/zhumur/pirnaquest|||
|SeqCluster||ncRNA classification||A framework python for the annotation and classification of the non-miRNA small RNA transcriptome. http://seqcluster.readthedocs.io/#|||
|ncPRO-seq||ncRNA discovery||Allows the discovery of unknown ncRNA- or siRNA-coding regions from sRNA sequence data. http://ncpro.curie.fr/.|||
|DARIO||ncRNA discovery||Allows annotation and detection of ncRNAs from various species but not livestock species. http://dario.bioinf.uni-leipzig.de/index.py|||
|CoRAL||ncRNA classification||A machine learning method that classifies ncRNA by relying on biologically interpretable features. http://wanglab.pcbi.upenn.edu/coral|||
|DASHR||Database||Stores human small ncRNAs: miRNAs, piRNAs, snRNAs, snoRNAs, scRNAs (small cytoplasmic RNAs), tRNAs, and rRNAs information. lisanwanglab.org/DASHR|||
|Sno/scaRNAbase||Database||A curated database for small nucleolar RNAs (snoRNAs) and small cajal body-specific RNAs (scaRNAs). gene.fudan.edu.cn/snoRNAbase.nsf|||
|snoRNA||Database||Contains over 1000 snoRNA sequences from Bacteria, Archaea, and Eukaryotes. http://evolveathome.com/snoRNA/snoRNA.php|||
|CircNet||Provides the following resources: (i) novel circRNAs, (ii) integrated miRNA-target networks, (iii) expression profiles of circRNA isoforms, (iv) genomic annotations of circRNA isoforms, and (v) sequences of circRNA isoforms. circnet.mbc.nctu.edu.tw|||
|PredicircRNATool||circRNA prediction||Uses a machine learning method for predicting circRNAs from those of non-circularized, expressed exons based on conformational and thermodynamic properties in the flanking introns. https://sourceforge.net/projects/predicircrnatool|||
|circRNADb||circRNA database||Contains 32,914 human circular RNAs. http://reprod.njmu.edu.cn/circrnadb|||
|PredcircRNA||cirRNA prediction||Applies a machine learning approach to predict circRNA. https://github.com/xypan1232/PredcircRNA|||
|CirsBase||Database||Provides scripts to identify known and novel circRNAs in sequence data. circbase.org|||
|Circ2Traits||Database||Contains a database of potential association of circular RNAs with diseases in human. http://gyanxet-beta.com/circdb|||
|CircInteractome||Database||Provides a web tool for mapping (RNA Binding Proteins (RBP)- and miRNA-binding sites on human circRNAs. Allows to (i) identify potential circRNAs which can act as RBP sponges, (ii) design junction-spanning primers for specific detection of circRNAs of interest, (iii) design siRNAs for circRNA silencing, and (iv) identify potential internal ribosomal entry sites. https://circinteractome.nia.nih.gov|||
|tRNAdb||Database||Contains 12,000 tRNA genes from 577 species and 623 tRNA sequences from 104 species, provides various services such as graphical representations of tRNA secondary structures. trnadb.bioinf.uni-leipzig.de|||
4. Tools for differential expression analysis of non-coding RNA
Various tools allow for the detection of genes (mRNA or ncRNA) differentially expressed (DE) between two or more conditions or states from sequence data. The major differences among tools are their implemented statistical methods, input and output file formats as well as filtering steps for DE analyses. Many tools such as DESeq , edgeR , NBPSeq , TSPM , baySeq , EBSeq , NOISeq , SAMseq  and ShrinkSeq  use count data as input file, while others like limma  and Cufflinks use transformed data or BAM files (the binary version of sequence alignment data) as input, respectively. Tools that use count data can be divided in to two groups; parametric (DESeq , edgeR , NBPSeq , TSPM , baySeq , EBSeq ) and non-parametric methods (NOISeq , SAMseq ). For parametric methods, most softwares (baySeq , DESeq , NBPSeq , edgeR , EBSeq  and NBPSeq) use a negative binomial model to account for over dispersion except ShrinkSeq which has two options for distribution, either negative binomial or a zero-inflated negative binomial distribution. These methods also implement different statistical test approaches; DESeq, edgeR and NBPSeq perform a classical hypothesis testing approach while baySeq, EBSeq and ShrinkSeq apply Bayesian methods. The comparison of methods and performances have been done and reviewed by many authors [29, 179–183]. In general, no single method performs well for all datasets. In a survey of performance of DE analyses methods, Conesa et al.  observed that limma package  performed well under many conditions. Many studies observed similar performances by DESeq and edgeR in ranking genes [29, 179–183]. However, DESeq is more conservative while edgeR is more liberal in controlling false discovery rate (FDR) . Other tools such as SAMseq is better in controlling FDR while NOISeq is efficient in avoiding false positives .
5. Bioinformatics tools for target prediction and functional inference of non-coding RNA
Following discovery and detection of important ncRNAs from RNA sequence data, the important next steps are to understand their regulatory roles. Since ncRNAs commonly act by interacting with target genes (mostly inhibit expression), various tools have been developed to predict their target genes and to infer their functions (Tables 3 and 4). A simple work flow for inferring the functions of miRNAs is shown in Figure 4.
5.1. Functional inference of miRNAs
5.1.1. Bioinformatics tools for target prediction and functional inference of miRNAs
Inferring individual targets for a given miRNA can be done either by computational or experimental methods. Computational target prediction is coordinated in a sequence-specific manner and the target genes are normally predicted based on information derived from the potency of binding between miRNA and putative targets. Generally, the methods for computational prediction of miRNA targets can be grouped in single platforms such as TargetScan , PicTar , RNAhybrid  or multiple platforms such as miRwalk , TarBases , miRecords  as well as integrative platforms which include downstream analyses of putative target genes such as DIANA-microT-CDS , miRPathDB , etc. A collection of tools for miRNA target prediction are available at https://omictools.com/mirna-target-prediction-category and https://tools4mirs.org/software/target_prediction/  (Table 3). Among the prediction tools, the major differences in principles are in the algorithm applied and in filtering steps considering the secondary structure of the target mRNA (reviewed in [83, 115, 186]). Consequently, the specificity, sensitivity and accuracy of prediction are different among tools. Additionally, the performances of tools also differ based on the skills of the user (such as formatting of input and output, programming skills, web interface and so on). Taken together, all these factors affect popularity of tools [72, 187]. A word cloud plot of the popularity of tools based on their citation per year is shown in Figure 5.
5.1.2. Popular single platforms for miRNA target prediction
TargetScan can be accessed via the web interface or by running a perl script (local run) . The software detects targets in the 3′UTR of protein-coding transcripts by base-pairing rules (seed complementarity) and predicts miRNAs for miRNA families instead of individual miRNAs. To assess important miRNA-target interaction, TargetScan outputs two matrices: probability of conserved targeting (Pct) and total contextual score (TCS). Pct corresponds to a Bayesian estimate of the probability that a miRNA site on the 3′ UTR of a mRNA is conserved due to miRNA targeting while TCS represents the strength of the sequential features (site-type, 3′ pairing contribution, local AU contribution, position contribution, target site abundance and seed-pairing stability) that facilitate miRNA-target hybridization/cleavage. PicTar also searches for identical seed sequences to predict miRNA-mRNA interaction . PicTar derives an overall score to assess the strength of the miRNA-target interaction. PicTar computes a score based on the maximum likelihood that a given 3′ UTR sequence is targeted by a fixed set of miRNAs. The PicTar algorithm scores any 3′ UTR that has at least one aligned conserved predicted binding site for a miRNA, and then incorporates all possible binding sites into the score. RNAhybrid computes target genes based on the free energy of hybridization of a long and a short RNA . Hybridization is performed in a kind of domain mode; for example the short sequence is hybridized to the best fitting part of the long one. Rna22  is a pattern-based approach to find miRNA binding sites and corresponding miRNA:mRNA complexes without a cross-species sequence conservation filter. Rna22 is resilient to noise and does not rely upon cross-species conservation. Unlike previous methods, Rna22 starts by finding putative miRNA binding sites in the sequence of interest followed by identification of the targeting miRNA. It can identify putative miRNA binding sites even though the targeting miRNA is unknown. miRanda was the first bioinformatics tool to predict the target genes of miRNAs. The miRanda algorithm is based on a comparison of miRNAs complementarity to 3′UTR of genes . miRanda calculates the binding energy of the duplex structure, evolutionary conservation of the whole target site and its position within the 3′UTR and accounts for a weighted sum of match and mismatch scores for base pairs and gap penalties.
5.1.3. Portals for miRNA target prediction
miRWalk, a comprehensive database developed by Dweep et al  documents miRNA binding sites within the complete sequence of a gene and combines this information with predicted binding sites data resulting from 12 target prediction programs (DIANA-microTv4.0, DIANA-microT-CDS, miRanda-rel2010, mirBridge, miRDB4.0, miRmap, miRNAMap, doRiNA, PicTar2, PITA, RNA22v2, RNAhybrid2.1 and Targetscan6.2) to build platforms of binding sites for the promoter, coding (5 prediction datasets), 5’ and 3′UTR regions. It also contains experimentally verified miRNA-target interaction information collected via text-mining search and data from existing resources (miRTarBase, PhenomiR, miR2Disease and HMDD). MirRecords is a resource for animal miRNA-target interactions developed at the University of Minnesota . MiRecords integrates predicted miRNA targets produced by 10 miRNA target prediction programs (DIANA-microTv4.0, miRanda-rel2010, miRDB4.0, PicTar2, PITA, RNAhybrid2.1, Targetscan6.2, miRTarget2, microinspector, NBmiRTar). It also contains information on experimentally validated miRNA targets obtained from the literature. mirDIP integrates 12 miRNA prediction datasets from miRNA prediction databases (DIANA-microTv4.0, miRanda-rel2010, miRDB4.0, PicTar2, PITA, RNAhybrid2.1, Targetscan6.2 and microCosm) allowing to customize miRNA target searches. multiMiR contains a collection of nearly 50 million records from 14 different databases . It allows user-defined cut-offs for predicted binding strength to provide the most confident selection.
5.1.4. Integrated tools for miRNA analysis
Various integrated tools as well as work flow for miRNA analysis have been developed to perform downstream analyses of putative target genes (e.g. gene ontology, pathways enrichments of target genes, etc.) such as MMIA , MAGIA  and miRconnX , to link miRNA to transcription factors or to analyze the effect of several miRNAs such as DIANA-mirExTra v2.0  and TransMIR . Typically, predicted target genes are used as input for functional enrichment to infer the potential functions of miRNAs. Furthermore, several tools are also used to correlate the expression levels of miRNAs with mRNA in a particular experiment to infer miRNA function such as miRnet , miRSystem  and DIANA-miRPath v3.0 . Several tools have also been developed to directly link miRNAs to biological processes such as DMirNet , miRnet  and DIANA-miRPath v3.0 . Many tools and resources have also been developed to link miRNAs to specific phenotypes/environments including diseases such as miRNAs in obsessive-compulsive disorder , autophagy in gerontology , epilepsy  and cancer . Among the most popular integrated tools, DIANA-tools (www.microrna.gr) covers a wide scope and research scenarios integrating several tools such as DIANA-microT-CDS, DIANA-TarBase v7.0, DIANA-miRGen v3.0, DIANA-miRPath v3.0, and DIANA-mirExTra v2.0. DIANA-microT-CDS uses different thresholds and meta-analysis followed by pathway enrichment to perform miRNA target prediction . DIANA-TarBase is a manually curated target database with more than half a million miRNA-target interactions curated from published experiments performed with 356 different cell types from 24 species. DIANA-miRPath is an online software suite dedicated to the assessment of miRNA regulatory roles and the identification of controlled pathways . DIANA-mirExTra performs combined differential expression analysis of mRNAs and miRNAs to uncover miRNAs and transcription factors that play important regulatory roles between two investigated state . miRNet is an easy-to-use web-based tool for statistical analysis and functional interpretation of various datasets generated in miRNAs studies in various species. Moreover, it also allows users to explore the results of miRNA-target interaction . MMIA is a web tool for integration of miRNA and mRNA expression data with predicted miRNA target information for analyzing miRNA-associated phenotypes and biological functions by gene set enrichment analyses .
5.2. Functional inference of lncRNA
Compared to miRNAs, fewer bioinformatics tools have been developed for functional inference of lncRNAs. Several databases have been developed to curate computationally predicted and experimentally verified lncRNAs, such as LncRNAdb , GENCODE , lncRNAtor , lncRNome , NONCODE , lncRNAWiki , LncRNA2Function  and starBase v2.0 . LncRNAdb was the first lncRNA database  and its updated version (LncRNAdb v2.0) integrates lncRNAs reported in livestock species (cattle, sheep, pig, horse and chicken) . DeepBase database is an online platform for annotation and discovery of lncRNAs from RNA-seq data and it contains a large number of transcript entries for bovine (43,156) and chicken (47,004) lncRNAs. Other databases for livestock species are RNAcentral  which currently houses information from 23 ncRNA databases (http://rnacentral.org/, access March, 2017) but only contains a small number of lncRNAs from livestock species (cattle, pig, horse and chicken). NONCODE  contains lncRNAs for 16 species including cattle and chicken in the latest version. The first lncRNA database with a particular focus on domesticated animals was ALDB . ALDB contains 12,103 pig lincRNAs (long intergenic non-coding RNA), 8923 chicken lincRNAs, and 8250 cow lincRNAs (http://www.ibiomedical.net/aldb/, access March, 2017). However, no comprehensive database currently covers available information on lncRNAs from livestock species, therefore the availability of a comprehensive tool will be valuable and helpful for subsequent genomic and functional annotation of lncRNAs and comparative interspecies analyses . Inference of lncRNAs functions can also be done by connecting their expression patterns with specific cell types or biological processes to draw possible conclusions on their potential roles. LncRNAs can act in cis and/or trans manner to influence or interact with nearby or distant genes, respectively [2, 199]. For cis-regulation, the genomic location can be used as a guide for guilt-by-association analysis which allows global understanding of lncRNAs and protein coding genes that are tightly co-expressed and thus presumably co-regulated. Cis-relationships can foreseeably arise through complementary sequence motifs, tethering, blocking, and product-independent transcription . For example, the human HOTTIP lncRNA is a cis-acting lncRNA expressed in the HOXA cluster that activates transcription of flanking genes . The bioinformatics tools for cis-regulation prediction include ncFANs (http://www.ebiomed.org/ncFANs)  which uses a coding-non-coding gene co-expression network to infer lncRNA function.
6. Emerging platforms and technologies for understanding and using ncRNAs
Efficient and reliable techniques for accurate detection of genome information are important for productivity and health of livestock species . The introduction of next generation sequencing technologies has increased throughput studies of ncRNAs considerably. Consequently, studies on ncRNAs have contributed toward better understanding of disease resistance, productivity, breeding and meat quality in livestock species . Although the numbers of detected ncRNA transcripts are increasing continuously, the ncRNAs identified and annotated in livestock species are still very scanty, compared with human data. Therefore, there is need to continue to explore the ncRNA transcriptome of livestock species . The ability to explore and modify the genomes of livestock species could be beneficial in improving disease resistance, productivity, breeding capability as well as generation of new biomedical models .
Genome editing tools have emerged that allow efficient and precise genome manipulation of many organisms including livestock. The genome editing technique is built on engineered, programmable and highly specific nucleases that induce site-specific changes in the genomes of cellular organisms . Subsequent cellular DNA repair processes generates desired insertions, deletions or substitutions at the loci of interest establishing linkages between genetic variations and biological phenotypes . Presently, four artificially engineered nuclease systems have been developed for genome editing: meganucleases derived from microbial mobile elements, zinc finger nucleases (ZFNs) based on eukaryotic transcription factor DNA binding motif, transcription activator-like effector-based nucleases (TALEN) derived from a plan-invasive bacterial protein, and clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR associated protein 9 (Cas9) system . Centromere and Promoter Factor 1 (Cpf1) is used as an alternative to Cas9 nuclease which requires only a single CRISPR RNA (crRNA) for targeting . CRISPR/Cas9 is easily applicable and has developed really fast over the past years since only programmable RNA is required to generate sequence specificity .
CRISPR–Cas9 system is based on a bacterial CRISPR-Cas9 nuclease from
In order to adapt this far-reaching application of gene-editing technology to agricultural improvement, various approaches have been applied to a number of livestock species. In pigs, direct cytoplasmic injection of Cas9 mRNA and single-guide RNA into zygotes generated biallelic knockout piglets . The CRISPR-Cas9 system was used to generate gene-edited pigs protected from porcine reproductive and respiratory syndrome virus  and to genetically modify single blastocyst inducing indel mutations in a given gene locus. Both Talen and ZNF have been injected directly into pig zygotes to produce live genome edited pigs . Similarly, the porcine myostatin (MSTN) gene, which functions as a negative regulator of muscle growth, was disrupted using CRISPR/Cas9 system to efficiently generate biologically safe genetically modified pigs . Similarly, zygote injection of TALEN mRNA targeting MSTN gene led to production of gene-edited cattle and sheep 
In cattle, the CRISPR/Cas9 system was successfully used to clone embryos that could be used to develop livestock transgenes for agricultural science . Hornlessness was introduced into dairy cattle by genome editing and reproductive cloning providing the potential to improve the welfare of millions of cattle . In the cattle industry, gene-edited calves have been produced with specified genetics by ovum pickup,
In livestock, CRISPR-Cas9 has been greatly enhanced by single-guide RNA generating site-specific DNA breaks through homology-directed repair and used for diverse applications, from disease modelling of individual loci to parallelized loss-of-function screens of thousands of regulatory elements . Equally, bioinformatics designs for CRISPR deletions are now possible with a tool known as CRISPETa developed with efficient CRISPR deletion of an enhancer and exonic fragment of MALAT1, a lncRNA. CRISPETa can be used for single target regions or thousands of targets and has high-coverage library designs for entire classes of non-coding elements which can be adopted for use in livestock species . CRISPR-Cas9 may be used with a gene drive incorporated with genome edit to investigate the control of any biological process and can be used to accelerate livestock breeding . Gene drives can be constructed with the use of CRISPR-Cas9 tool that can favour the inheritance of edited alleles possible to modify a whole population . In the DNA, a double strand break can be initiated by a gene drive during the copying process. Using the sequence of the chromosome containing the gene drive elements as a repair template, the DNA break could be repaired by cellular pathways such as homology-directed repair . Editing the genomic DNA elements targeting non-coding regions is vital since silencing of ncRNA genes using RNA interference tools still presents major challenges. An improved vector system adapted to delete non-protein-coding regulatory elements; double excision CRISPR Knockout (DECKO) using two-step cloning to produce vectors (lentivirus) with two guide RNAs concurrently , has been used effectively to silenced five ncRNAs (miRNAs-miR21, miR29a and lncRNAs-UCA1 and MALAT1) . The use of genome editing technologies will create novel viewpoints for enquiry to advance our knowledge on biological function of ncRNAs in livestock species and facilitate creating animals with precise alterations.
7. Conclusion and remarks
With the application of next generation sequencing technologies, the number of ncRNAs reported in livestock species has increased dramatically in the last 5 years. Various tools and pipelines have been introduced to make sense out of ncRNA sequence data. This chapter has provided a comprehensive overview of the current and emerging tools and methods for generating and analyzing ncRNA (miRNA, lncRNA as well as other small ncRNAs) sequence data (transcriptome) with special emphases on the tools that can be applied to livestock species. While bioinformatics tools for miRNA analyses are quite mature, there is a general lack of comprehensive bioinformatics tools for lncRNA and other small ncRNAs. It is our belief that comprehensive “omics” databases that integrate existing and future ncRNA transcriptome databases in the framework of livestock species will contribute towards elucidation of the ambiguity surrounding RNA sequence data. Moreover, given the fact that several emerging platforms (such as genome editing tools) for understanding ncRNAs have been introduced recently, these tools certainly bring great opportunities for broader and also deeper exploration of ncRNA functions. In addition, meticulous
We acknowledge financial support from Agriculture and Agri-Food Canada.