Sequencing statistics. Adapted from 
New technologies are constantly being released, and their improvements bring advances not only to transcriptomics, the focus of this chapter, but also to diverse areas of biological research. Since the announcement and application of the RNA-seq approach, discoveries have been made steadily in this field, although for bacterial species this progress has lagged a few years behind. However, with the application of RNA-seq derivative approaches, we can gain biological insights into the bacterial world and aspire to uncover the mysteries involving gene expression, organization and other functional genomic features.
- bioinformatics analysis workflow
RNA-seq technology has driven advances in gene expression analysis through new-generation sequencing platforms, which are versatile and powerful and deliver results with an accuracy and reproducibility never reached before. This technology generates information that provides meaning to the set of transcripts (transcriptome), opening up possibilities for understanding cell behavior in different environments. RNA is an important component within the cell, since it plays different roles as a messenger, regulatory molecule and carrier; it is also essential for the maintenance of housekeeping genes.
In 2005, the first next-generation sequencing technology was released, and the field has evolved rapidly since. Once it made gene expression analysis in bacteria [3, 4] possible at a more accessible cost, in shorter experimental time and without probes, the technology took off; today it has overtaken other tools used for this purpose, such as microarray technology, which until now has been extremely useful for this type of analysis.
2. Applications of RNA-seq
Understanding the transcriptome is essential to knowledge of the functional genomics of an organism. The development of next-generation sequencing (NGS) impacts different areas, such as medical and industrial, and has gone through a revolutionary process. Different approaches, among them the RNA-seq technique, have emerged in the fields of microbiology and molecular biology in order to aid in understanding and bring solutions to bacterial domain investigations. In this section, we will detail some applications that are part of our current context.
2.1. The medical field
The applications of these NGS technologies in medicine have allowed expansion in the fields of diagnosis, treatment and prevention, especially concerning bacterial diseases. One of their major applications has been the quantification of expression levels of each transcript under different conditions that simulate the intracellular environment. Such work has been done by Pinto et al. (2014) to understand the host–pathogen relationship. Westermann et al. (2012) demonstrated the validity of this technique with dual RNA-seq, which simultaneously analyzes gene expression in the pathogenic bacterium and in its host. This gives us a better understanding of the systems biology involving bacteria and their hosts, helping scientists to develop drugs and vaccines.
Another field that has been explored extensively involves metatranscriptome, as scientists have sought to comprehend the composition and regulation of microbial ecosystems [7, 8]. To pursue this, they have used the RNA-seq technique to generate, and allow the interpretation of, a large volume of very reliable data. Leimena et al. (2013) also validated the RNA-seq technique using the microbiota of a human small intestine with ileostomy. Their aim was to understand the interactions involved in this microbial ecosystem and how these relationships can be associated with disease . Transcriptome analysis pipelines (see Section 5) can be used with different experimental designs and applied to many bacteria in addition to those in the medical field.
2.2. The industrial field
Industrial applications have been developed in recent years, mainly in the probiotic industry, given its economic importance. Bisanz et al. (2014) used the RNA-seq technique to characterize the metatranscriptome of probiotic yogurt, seeking to understand the metabolic activities that allow the probiotic organism to survive in the product. Their results show the adaptive capacity of this bacterium, as well as variation in differential gene expression that can affect the taste or storage life of the product. Studies such as these are important because they enrich the knowledge of the industrial field and open new possibilities for an attractive area in the marketplace, which results in improvement in the quality of the product that is ultimately delivered to the consumer.
In addition to the probiotic market, another important area is the bacterial production and synthesis of biomolecules. Wiegand et al. (2013) used the RNA-seq technique to understand the regulatory RNAs in the fermentation of Bacillus licheniformis. Their study identified active genomic regions which, in turn, contribute to the efficiency and optimization of the fermentation process, which can promote the industrial production of exoenzymes and antibiotics .
Microorganisms produce antioxidant molecules that can be used in the pharmaceutical and cosmetic industries. They also produce other compounds, such as propionate, which is used in the production of chemical aids and is produced by Propionibacterium freudenreichii ssp. shermanii, a bacterium considered valuable in the food industry. In this area, the RNA-seq technology is very promising and its application can bring advances in these studies.
3. RNA-seq and derivative techniques
The RNA-seq technology is able to identify all RNAs directly and quantitatively: coding and non-coding, rare and abundant, smaller and larger. This method provides information about the transcription start site (TSS), untranslated regions (UTRs), detection of unknown open reading frames (ORFs), improved quality in genomic annotation , and also allows the distinction between primary and processed transcripts (dRNA-seq) .
The major constraint is ensuring adequate representation of rare transcripts. In this case, the recommendation is either to increase the number of reads per library or to enrich for these transcripts by eliminating the ribosomal (rRNA) and transfer (tRNA) RNAs, which are abundant in the cell and represent about 95% of total RNA.
Although RNA-seq is generally considered the gold standard for gene expression analysis, some researchers hesitate to define it as such. The method is available on different platforms and addresses different strategies, each with advantages and disadvantages. However, the superiority of this technology over earlier ones is not questioned.
Despite this technological superiority, the need for biological replicates and adequate sequencing depth remains, since both make the results more reliable and reproducible. Differentially expressed genes are assessed better with more biological replicates than with greater depth and fewer replicates.
Transcriptomics studies have brought a revolution to the study of the bacterial environment. Different bacterial species have been targeted for RNA-seq studies [5, 13, 19, 20], and gene expression-based discovery has transformed the scientific paradigm of these organisms. The detection of an unexpected number of coding genes in Helicobacter pylori has demonstrated that, despite having a small compact genome, the transcriptome of this bacterium is extremely complex.
A surprising result was the detection of a large number of transcription start sites (TSS). This had never been achieved with any technology before the derivative differential RNA-seq (dRNA-seq) approach, which differentiates primary transcripts, which carry triphosphate ends, from processed transcripts, which carry monophosphate ends, such as rRNAs and tRNAs. In this case, to enrich mRNA, the strategy was to treat the RNA samples with an exonuclease that degrades transcripts with 5' monophosphate ends. This strategy identified 5'UTR ends, operons and antisense transcription, thus providing a new perception of the organization of the bacterial transcriptome and a new model for the analysis of individual genes.
The results obtained allow the inference of a role of 5'UTR regions. A correlation between size and cell function was proposed by the researchers, who found that larger size is related to pathogenicity . These results show how little knowledge there is regarding microorganisms, believed to be the simplest form of life, yet which nevertheless prove to be more complex than previously anticipated. This leaves a lot to be discovered.
An RNA-seq application that has been widely used in bacterial genomes is found in studies focused on identifying small RNAs (sRNA). These elements are regulators of various biological processes and were initially studied primarily in Escherichia coli. However, with the advances in technology, it has been possible to identify and characterize small RNAs in a variety of bacterial species [13, 22, 23]. Yan et al. (2013) identified an expression profile of sRNAs in Yersinia pestis, both in vitro and in vivo. This has allowed the identification of new sRNAs and the recognition of gene expression modulation during the infection process, thus improving the understanding of the transcription regulation mechanisms of this organism. The importance of studies involving sRNA also extends to research on antibiotic therapies, a field still in its early stages with much knowledge yet to be exploited.
RNA-seq has been used in different areas and situations. Advanced studies using this technology can detect details in cell expression. Even with the difficulties in separating eukaryotic and prokaryotic materials, it was possible to distinguish the simultaneous expression profiles of host and pathogen through dual transcriptome studies. This work disclosed the host response to bacterial infection and the virulence factors involved, enabling the infectious process to be characterized. These studies contribute to research in the field of biological infection by examining diverse pathogens with different life cycles and methods of infection and by providing crucial knowledge for studies of diagnostics and vaccines, including metatranscriptomic studies.
After a relatively short time on the market, RNA-seq can accurately reveal structural and functional elements of bacteria. The mapping of transcripts in the genome can refine the annotation or even identify new regions, improve the quality of the studied genome compared to regions previously annotated by predictors or assembled using an ab initio approach [28, 29], and can even check the abundance of transcript expression.
Data from a high-quality genome tend to give more promising results when addressing the biological question under investigation. In the search for a high-quality genome, ab initio transcript assembly, or a hybrid approach that uses both the reference genome and ab initio assembly, is a promising way to resolve many problems in the genome that are otherwise complicated to correct.
Pinto et al. (2012) conducted a study of Corynebacterium pseudotuberculosis adopting ab initio assembly and, therefore, were able to identify differences in the expression of active genes under different environmental conditions. This allowed them to detect new possible virulence factors involved in pathogenicity, making them targets for vaccine development, diagnosis or treatment against caseous lymphadenitis disease caused by this bacterium .
These results suggest the importance of this technology and the possibility of going further with a tool that aims to improve, and probably will expand, the field of analysis. This could bring the results increasingly closer to bacterial molecular reality.
Bacterial RNA can be divided in two groups: primary and processed transcripts. Primary transcripts are represented by the presence of 5’-triphosphate (5’PPP), which includes messenger RNA (mRNA) and small RNAs (sRNA). Processed transcripts are those carrying 5’-monophosphate (5’P), such as mature ribosomal RNA (rRNA) and transfer RNA (tRNA).
Processed transcripts represent approximately 95% of the total bacterial transcriptome. A recently developed approach called dRNA-seq revolutionized the study of the primary transcripts by considering the 5’ difference between the primary and the processed groups, as mentioned previously (see Section 3.1).
RNAs are very unstable and, during preparation in the “wet-lab” experiments, some transcripts are partially or totally degraded. The 5’PPP and 5’P ends are two of the mechanisms that protect against exonucleases, and they are the first portion of the transcript to be degraded. During that process, information is lost and some primary transcripts end up with 5’P and are treated as processed transcripts. Consequently, they are eliminated by the dRNA-seq technique. A new methodology was created to overcome this problem by tagging and clustering the two groups together in an RNA-seq-derived approach named tagRNA-seq. This technique also considers the difference between processed and primary transcripts, but instead of degrading the processed ones, two different ligation reactions are implemented with two different markers: PSS-tag (processed start site) and TSS-tag (transcription start site). They differ in their nucleotide sequence. Figure 1 briefly shows the methodology, considering the three main steps: (1) the first ligation reaction, which attaches the PSS-tag to the processed transcripts; (2) treatment with tobacco acid pyrophosphatase (TAP), in which the 5’PPP loses two phosphates, which allows the third step; (3) the second ligation reaction, which attaches the TSS-tag to the primary transcripts. After those steps are completed, the transcripts are sequenced and, due to the different markers, they can be distinguished and compared.
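Once the tagged library has been sequenced, separating the two groups comes down to inspecting the 5’ end of each read. The following is a minimal sketch of that idea, not the published tagRNA-seq pipeline: the two tag sequences and the function names `classify_read` and `count_labels` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical tag sequences, used only for illustration; the real PSS-tag
# and TSS-tag sequences are defined by the tagRNA-seq protocol.
PSS_TAG = "CTGTAGGA"
TSS_TAG = "GATCGTCC"


def classify_read(read):
    """Return (label, insert); label is 'PSS', 'TSS' or 'untagged'."""
    if read.startswith(PSS_TAG):
        return "PSS", read[len(PSS_TAG):]
    if read.startswith(TSS_TAG):
        return "TSS", read[len(TSS_TAG):]
    return "untagged", read


def count_labels(reads):
    """Count how many reads fall into each class."""
    return Counter(classify_read(read)[0] for read in reads)


reads = [
    "CTGTAGGAATGCCGTA",  # starts with the PSS-tag: processed 5' end
    "GATCGTCCTTGACAGC",  # starts with the TSS-tag: primary 5' end
    "ACGTACGTACGTACGT",  # no tag: internal fragment
]
print(count_labels(reads))
```

After the tag is removed, the remaining insert can be mapped like any other read, with its label kept as an annotation of the 5’ end.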
This methodology was first described for Enterococcus faecalis and was based on another technique, 5’tagRACE, a 5’RACE-derived method. The results provided by tagRNA-seq improved the annotation of the E. faecalis genome by identifying or correcting several portions of the genome, including both non-coding and coding regions. This study also compared different libraries to prove the effectiveness of this innovative approach. It thus provided a new method capable of differentiating primary and processed RNAs, suited to a better comprehension of the genetic information of bacteria as well as other groups.
dRNA-seq and tagRNA-seq are approaches that enable a new view of the transcriptome by selecting the primary transcripts for sequencing or by differentiating the primary from the processed transcripts, for a broader insight into the transcriptome. These state-of-the-art techniques promise a better understanding of RNA structures like TSS, 5’UTR, promoters, among others, besides the knowledge of non-annotated genes and small RNAs.
3.3. FRT-seq (flowcell reverse transcription sequencing)
Flowcell reverse transcription sequencing (FRT-seq) is a new and improved methodology, derived from the RNA-seq technology, that was created for Illumina sequencers. Unlike RNA-seq, FRT-seq does not require amplification by PCR, a step that usually introduces bias into the results by displaying an erroneous view of the quantity of some RNA species. Other important features of the Illumina sequencing methodology are the ability to generate strand-specific information, the use of paired-end libraries and the need for a considerable initial amount of RNA template. PCR-free amplification is a major step towards a more comprehensive library, akin to the original one, but without the formation of intermolecular priming artefacts among other errors. It will probably become a fairly useful technique in the near future [33, 34]. Third-generation sequencing platforms, like Nanopore and PacBio, also use amplification-free approaches. However, neither is currently being broadly used since they still exhibit sequencing errors.
FRT-seq comprises the fragmentation of the template (e.g., mRNA) followed by ligation of adapters to both the 3’ and the 5’ ends, which are responsible for the hybridization of the template with oligonucleotides on the flowcell surface. The next steps performed are quantification, reverse transcription and then the sequencing reaction [33, 34].
This approach can be applied to both eukaryotes and prokaryotes, although the number of published papers involving eukaryotes is more substantial. From the bacterial world, we can quote papers involving Salmonella enterica and Shigella flexneri in which FRT-seq was applied as a complementary approach to describe the transcriptional landscape of the species. In both cases, FRT-seq showed greater sensitivity and excellent concordance when compared to other approaches and replicates.
The S. enterica paper  shows that FRT-seq is as efficient as the RNA-seq and dRNA-seq techniques (Figure 2) (Table 1). Figure 2 compares nine different RNA libraries: TEX (1, 2, 3), RNA-seq (1, 2, 3, *) and FRT-seq (depleted and not depleted). TEX (libraries treated with terminator exonuclease) is a dRNA-seq methodology (see Sections 3.1 and 3.2) that, together with the first three RNA-seq biological replicates, was sequenced using a 454 (1 and 2) or an Illumina GAII (3 and FRT-seq) sequencer and the RNA-seq* (library enriched for small RNA species) was sequenced using Illumina HiSeq. The charts relate the percentages of different RNA species and show that the FRT-seq libraries provide similar or better results than the other approaches. The data presented in Table 2 also support this claim, especially considering both the total number of reads and the uniquely mapped reads achieved using the FRT-seq libraries.
|Library||Sequencing technology||Description||Total number||Number of reads (not mapped)||Number of reads||mapped reads [%]||Minimum fold coverage#|
| || ||biological replicate 1|
| || ||biological replicate 1|
| || ||biological replicate 2|
| || ||biological replicate 2|
|TEX_3||Illumina GAII||dRNA-seq library, biological replicate 3|
|RNA-seq_3||Illumina GAII||RNA-seq library, biological replicate 3|
|RNA-seq*||Illumina HiSeq||RNA-seq library, biological replicate 4|
|FRT-seq||Illumina GAII||FRT-seq library, biological replicate 5|
|FRT-seq dep||Illumina GAII||FRT-seq library, biological replicate 5|
| ||Condition A||Condition B||Condition A||Condition B|
|Total number of mapped reads||20,099,597||22,736,494||49,925,286||47,605,241|
|Total number of reads mapping to genes||1,525,782||2,271,423||3,037,954||2,585,600|
|Reads mapping genes in sense||1,195,446||1,958,533||2,469,828||2,129,951|
|Reads mapping genes in antisense||330,336||312,890||568,126||455,649|
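The counts in the table above can be turned into proportions, which are easier to compare across libraries of different depth. The sketch below is only an illustrative calculation on the published counts; the helper name `strand_summary` is hypothetical, and the columns are taken in table order because their group headers are not shown.

```python
def strand_summary(mapped, genic, sense, antisense):
    """Return (genic % of mapped reads, antisense % of genic reads)."""
    if sense + antisense != genic:
        raise ValueError("sense + antisense should equal the genic total")
    return 100 * genic / mapped, 100 * antisense / genic


# One tuple per table column, in table order:
# (mapped reads, reads mapping to genes, sense reads, antisense reads)
columns = [
    (20_099_597, 1_525_782, 1_195_446, 330_336),
    (22_736_494, 2_271_423, 1_958_533, 312_890),
    (49_925_286, 3_037_954, 2_469_828, 568_126),
    (47_605_241, 2_585_600, 2_129_951, 455_649),
]

for number, column in enumerate(columns, start=1):
    genic_pct, antisense_pct = strand_summary(*column)
    print(f"column {number}: {genic_pct:.1f}% of mapped reads hit genes; "
          f"{antisense_pct:.1f}% of those are antisense")
```

The built-in consistency check confirms that, in every column, the sense and antisense counts add up to the number of reads mapping to genes.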
The data presented in this topic demonstrate the quality of this recently published methodology and, according to the authors [33, 34], new updates are still being developed. This will probably provide an even better approach for users. The fact that this technique is only applicable for Illumina sequencers is a drawback; but, since this sequencing platform is available worldwide, this disadvantage can easily be fixed. Perhaps, in the near future, it can be extended to work in other sequencing platforms. Another particularity of this technique is its efficiency with AT-rich genomes, which does not constrain its application with AT-poor genomes. This is due to the PCR-free amplification, which raises a question for other sequencers like Nanopore and PacBio. Despite these issues, this technology has a bright future and is a great advance over the conventional RNA-seq.
3.4. Chromatin immunoprecipitation followed by sequencing (ChIP-seq)
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique for the genome-wide profiling of DNA-binding proteins, histone modifications or nucleosomes. ChIP-seq has become an essential tool for studying gene regulation and epigenetic mechanisms. It offers higher resolution, less noise and greater coverage than its array-based predecessor, ChIP-chip [37, 38]. This approach has six main steps: (1) cell cultures are grown under defined conditions and, when they reach the desired stage of development, are treated with formaldehyde to cross-link proteins and DNA; (2) the chromatin is sheared by sonication into small fragments (200–600 bp); (3) an antibody specific to the protein is used to immunoprecipitate the DNA–protein complex; (4) the cross-links are reversed by heating; (5) the released DNA is subjected to high-throughput sequencing and (6) in silico analysis is carried out, in which the resulting sequencing reads are assessed for quality and then trimmed accordingly [38–40]. The trimmed reads are then aligned to a reference genome. Afterwards, areas of enrichment in the ChIP-seq data are identified and those areas, usually called peaks, represent where the transcription factors (TF) bind throughout the genome. CisGenome, MOSAiCS and MACS are some known algorithms that have been utilized in bacterial ChIP-seq analysis [38, 41]. After peaks are associated with downstream genes, a number of bioinformatics analyses can be carried out, including identification and analysis of motifs, differential analysis and association with expression data for a deeper understanding of the bacterial regulon. This is shown in Figure 3.
Whole-genome transcription profiling cannot reveal whether the influence of a transcription factor (TF) on RNA levels is direct or indirect; establishing this requires identifying TF binding within the appropriate promoter region. ChIP-seq provides information about where the TFs are bound. Thus, by integrating ChIP methods and transcription profiling, it is possible to identify all direct regulatory targets of a TF for a given condition. For example, work carried out by Stringer et al. (2014) on the araC gene of Escherichia coli and Salmonella enterica has identified direct regulatory targets of AraC, including five novel target genes: ytfQ, ydeN, ydeM, ygeA and polB. Although ChIP-seq has so far been used in only a few bacterial species, such as Vibrio harveyi, V. cholerae, Rhodobacter sphaeroides, Mycobacterium tuberculosis, S. enterica and Caulobacter crescentus [36, 37, 43–45], it can identify novel regulatory interactions, even for well-studied proteins [46, 47].
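The enrichment step of the in silico analysis can be illustrated with a deliberately simple window-based comparison between a ChIP library and a control library. This is a sketch under stated assumptions and not the algorithm used by CisGenome, MOSAiCS or MACS: the function names, the 200 bp window, the fold threshold and the pseudocount are all illustrative choices.

```python
from collections import Counter


def window_counts(positions, window):
    """Count read start positions per fixed-size window."""
    return Counter(pos // window for pos in positions)


def call_peaks(chip, control, window=200, min_fold=3.0, pseudocount=1.0):
    """Return (start, end, fold) for windows enriched in ChIP over control."""
    chip_counts = window_counts(chip, window)
    control_counts = window_counts(control, window)
    # Scale the control so that both libraries have the same total depth.
    scale = len(chip) / max(len(control), 1)
    peaks = []
    for w in sorted(chip_counts):
        expected = control_counts.get(w, 0) * scale + pseudocount
        fold = (chip_counts[w] + pseudocount) / expected
        if fold >= min_fold:
            peaks.append((w * window, (w + 1) * window, round(fold, 2)))
    return peaks


# Toy data: read start coordinates on a single replicon.
chip = [1010, 1020, 1050, 1080, 1110, 1150, 1190, 5000, 9000, 12000]
control = [100, 1100, 3000, 5000, 7000, 9000, 11000, 12000, 13000, 15000]
print(call_peaks(chip, control))  # [(1000, 1200, 4.0)]
```

Dedicated peak callers replace the fixed threshold with a statistical model of the background, but the underlying question is the same: where does the ChIP sample pile up more reads than the control predicts?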
ChIP-seq, in combination with RNA-seq, could be an efficient tool to get detailed information about bacterial transcription regulation and how bacteria respond to different external conditions.
3.5. RNA immunoprecipitation sequencing (RIP-seq)
RNA immunoprecipitation (RIP) is used to study intracellular RNA–protein binding; it is a tool for understanding the dynamic process of post-transcriptional regulatory networks. With this technique, an antibody against a protein of interest is used to recover the RNA species bound to the protein. Since the sequence information of the RNA species bound to a specific protein is often desired, an approach combining RNA immunoprecipitation with sequencing technology (RIP-seq) was created. The main challenge of RIP-seq is the cross-linking step, which is relatively inefficient, so only a small amount of RNA is available to construct the library [48, 49]. After that step, treatment with endonuclease reveals the specific binding sites within the RNA, as they are protected from digestion. This is followed by purification of the RNA–protein complexes using electrophoresis and by high-throughput sequencing [48, 50]. Finally, the data obtained from the sequencer are analyzed using bioinformatics tools. The first study using the RIP-seq-based technique was carried out on Salmonella by Sittka et al. (2008). They used the RNA-binding property of the Hfq protein in their analysis and, as a result, many new sRNAs were discovered. Thus, RIP-seq could be an efficient tool for the identification of bacterial non-coding RNAs.
3.6. LEA-seq (low error amplicon sequencing)
The LEA-seq technique (low error amplicon sequencing) emerged in 2013 and was developed and patented by Gordon and Faith (2014). This method was created to improve the quality and depth of sequencing runs, since the massive amount of data produced by NGS comes with a high error rate, due to problems with the algorithms or with platform read lengths.
LEA-seq is a nucleic acid sequencing technique that identifies events that occur at low frequency, seeking to understand mutation events. The three basic steps for implementing this technique are: (1) linear PCR, (2) exponential PCR and (3) sequencing. This technique was implemented for bacterial 16S sequencing, in which PCR is carried out numerous times and each amplification uses primers specific to each linear molecule.
The LEA-seq technique is a quantitative method whose advantage lies in generating and reading each molecule multiple times. This permits the formation of a consensus and the elimination of errors for each molecule. Other currently available techniques do not support error detection in sequencing, nor can they establish whether a variation in the sequence of a microorganism is real. The multiple sequencing used in LEA-seq supports better quality and precision about the organism.
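The consensus idea can be sketched in a few lines. The example below assumes that each template molecule was labelled with its own short barcode during the linear PCR step and that all reads sharing a barcode have the same length; the barcode length, the minimum number of copies and the function names are illustrative assumptions, not parameters taken from the published protocol.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Majority base at each position of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def lea_consensus(reads, barcode_len=4, min_copies=3):
    """Group reads by their leading barcode and build one consensus each."""
    groups = defaultdict(list)
    for read in reads:
        groups[read[:barcode_len]].append(read[barcode_len:])
    # Barcodes seen fewer than min_copies times cannot be error-corrected.
    return {
        barcode: consensus(seqs)
        for barcode, seqs in groups.items()
        if len(seqs) >= min_copies
    }


reads = [
    "AAACACGTAC", "AAACACGTAC", "AAACACCTAC",  # third copy has one error
    "GGTTACGAAC", "GGTTACGAAC", "GGTTACGAAC",
    "CCCCTTTTTT",                              # single copy: discarded
]
print(lea_consensus(reads))  # {'AAAC': 'ACGTAC', 'GGTT': 'ACGAAC'}
```

A base that differs in only one copy of a molecule is outvoted and treated as a sequencing error, whereas a difference shared by all copies of a barcode survives as a candidate real variant.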
The study by Faith et al. (2013) aimed to identify the composition of the faecal microbiota of adults and to understand the role of these bacterial species and their therapeutic potential for intestinal diseases. This technique allowed them to work with a large number of samples (over 500 isolates), as well as to achieve a fast and accurate analysis of the data .
Researchers have a continuing interest in improving this technique, since it can be used for clinical investigation due to its high accuracy: for example, in patients with genetic mutations or somatic mutations. LEA-seq can assist in the search for knowledge about intestinal microbiota, as it may reveal their composition, opening up prospects for the diagnosis, treatment and prevention of gastrointestinal tract diseases.
3.7. CRISPR (clustered regularly interspaced short palindromic repeats)
Ishino et al. (1987) were the first to describe CRISPR. These loci, defined as short, clustered repeats of bases, have been identified in 40% of bacterial genomes so far. The determination of the CRISPR locus and the characterization of the adjacent genes, known as cas genes, which are responsible for the function of CRISPR, only occurred in 2002. The CRISPR/Cas system uses small non-coding RNAs in association with Cas proteins. Cas9 is a nuclease that cleaves DNA in the selected region, so the CRISPR/Cas9 system can be used to edit genomes.
CRISPR/Cas activity involves three main mechanisms: (1) acquisition, the step in which the DNA fragment is inserted into the CRISPR locus in the genome of interest; (2) transcription, in which the CRISPR locus is transcribed and processed; (3) interference, in which foreign nucleic acids are targeted and eliminated. All those mechanisms contribute to bacterial persistence in the environment [58, 59]. Furthermore, CRISPR provides mechanisms to limit the spread of antibiotic resistance or virulence factors. However, Gophna et al. (2015) demonstrated that, even though there are different measures for evaluating horizontal gene transfer, it is not possible to identify a correlation between the CRISPR/Cas system and the evolution of the species. Changes occur only at the population level.
RNA-seq has helped in the annotation of transcribed regions, mainly non-coding ones, and has also enabled the identification of CRISPR elements in prokaryotes. The CRISPR system can also be used as a tool in studies centered on gene regulation, since this system is able to activate or repress genes.
Zoephel and Randau (2013) discuss how the structure of CRISPR can affect the maturation of RNA and, thus, influence the functionality of the CRISPR/Cas system. The RNA-seq approach was used to evaluate differential gene expression in S. aureus, a pathogen of major importance. It was able to identify the CRISPR in these strains and helped in investigating their possible role, since these regions show an adaptive response to infection. Thus, we see the importance of the RNA-seq approach in expanding knowledge about function in prokaryotes.
4. RNA Sequencing Platforms
The RNA-seq approach can be applied on different next-generation sequencing platforms, and the results obtained depend on the capability of the machine. Table 3 compares some of the most widely used platforms.
|Company Name||Instrument||Version||Run Time (hours)||Read Length (bp)||Reads Per Run (millions)||Applications|
|Illumina||HiSeq 2000||High Output||132||50||6,000||Gene expression, Splice junction detection, variant calling, fusion|
|Illumina||HiSeq 2500||High Output||132||50||6,000||Gene expression, Splice junction detection, variant calling, fusion|
|Illumina||MiSeq||v2 kit||39||250||30||Splice junction detection, variant calling|
|Life Technologies||PGM||318 Chip||7.3||176||6||Splice junction detection, variant calling|
|Life Technologies||Proton||Proton I chip||2-4||81||70||Gene expression, Splice junction detection, variant calling|
|Pacific Biosciences||RS||RS||0.5-2||1,289||0.03||Splice junction detection, variant calling, full-length gene coverage|
|Roche||454||GS FLX+||20||686||1||Splice junction detection|
5. Bioinformatics Analysis
Experimental investigations in prokaryotes have been facilitated, extended and complemented using computational approaches . Large amounts of data have been generated from RNA-seq experiments which need to be stored and analyzed using computational techniques and tools . This amount has become a bottleneck to bioinformatics analysis and to biologists, since today's transcriptome analysis consists of experiments and data evaluation . Extracting biological information from RNA-seq datasets requires bioinformatics knowledge and tools, making the software choice an important issue for successful RNA-seq analysis [65, 67].
According to Chierico et al. (2015)  and Pinto et al. (2011) , RNA-seq can be understood as a five-step process: (1) isolation of the total RNA of the organism; (2) mRNA enrichment; (3) synthesis of cDNA; (4) NGS sequencing, which returns raw data to the (5) bioinformatics analysis . A flowchart of this process can be seen in Figure 4.
This section focuses on bioinformatics analysis and the computational tools available. Based on a literature review [29, 65, 67–69], bioinformatics analysis can be understood as the extraction and classification/division of biological information gleaned from the raw sequencing data (Figure 5).
5.1. Bioinformatics workflow
The quality check step aims to increase the accuracy of the results by removing sequences that may contain errors ; trimming sequences introduced in the library preparation step, such as adapters and poly(A)-tails ; and, removing reads with low phred quality. However, in that regard, the use of poor-quality databases can lead to less precise results ; considering this, the quality check can affect the next steps drastically.
Some RNA-Seq pipelines, like ReaDemption , implement quality checking which performs quality trimming, removes adapters and poly(A) tails and discards reads shorter than a given cut-off (the default cut-off is 12 nucleotides (nt)). Quality assessment  evaluates the quality based on quality-graph analysis and estimated coverage. According to Backofen et al. (2014) , FastQC (http://www.bioinformatics.babraham.ac.uk/projects/ fastq c/) is a tool commonly used to check read quality and to determine the quality profile of the reads. Software suites can also be used for this purpose, FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) provides tools to remove sequences attached in previous steps and to perform other pre-processing strategies on raw data.
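The pre-processing operations described above can be sketched for a single read as follows. This is a simplified illustration and not the implementation used by READemption or the FASTX-Toolkit: only the 12 nt length cut-off comes from the text, whereas the Phred threshold, the adapter sequence, the minimum poly(A) run and the function names are assumptions that should be adapted to the library at hand.

```python
import re

MIN_LENGTH = 12             # length cut-off quoted above for READemption
MIN_PHRED = 20              # assumed quality threshold
ADAPTER = "AGATCGGAAGAGC"   # assumed 3' adapter; use the one from your kit
POLY_A = re.compile(r"A{10,}$")  # assumed: a 3' run of at least 10 A's


def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


def trim_read(seq, quality):
    """Return the trimmed (seq, quality) pair, or None if it is too short."""
    # 1. Clip low-quality bases from the 3' end.
    scores = phred_scores(quality)
    end = len(seq)
    while end > 0 and scores[end - 1] < MIN_PHRED:
        end -= 1
    seq, quality = seq[:end], quality[:end]
    # 2. Remove the adapter and everything downstream of it.
    idx = seq.find(ADAPTER)
    if idx != -1:
        seq, quality = seq[:idx], quality[:idx]
    # 3. Remove a trailing poly(A) tail.
    seq = POLY_A.sub("", seq)
    quality = quality[:len(seq)]
    # 4. Discard reads that are now shorter than the cut-off.
    if len(seq) < MIN_LENGTH:
        return None
    return seq, quality


read = "ACGTACGTACGTACGTACGT" + "A" * 12 + "GG"
qual = "I" * 32 + "##"       # the last two bases have Phred 2
print(trim_read(read, qual))
```

In this example the two low-quality bases are clipped first, the poly(A) tail is then removed and the remaining 20 nt read passes the length filter.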
After the quality check, reads are mapped if a reference genome is available; otherwise, they are assembled de novo. Mapping consists of producing the transcriptome map by aligning reads to a reference genome. This aims to detect the correct position of the reads and to distinguish between sequencing errors and genetic variations. A large number of mapping programs have been released, differing in their algorithms, memory management, speed and computational cost, which makes the choice of a mapping tool a challenge. McClure et al. (2013) compared the SOAP2, BWA, Bowtie and Bowtie2 aligners using data from 75 RNA-seq experiments. A comparison of mapping algorithms applied to IonTorrent data has also been published. After mapping, quality is evaluated; ReadXplorer offers a quality classification of read mappings in order to provide information about the quality and quantity of each single read mapping. Reference mapping is recommended when a high-quality genome is available as a reference. If one is unavailable, transcripts should be assembled de novo.
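To make the idea of read mapping concrete, the following toy example places a read on a reference by exact k-mer seeding followed by mismatch counting. It is a didactic sketch only: production aligners such as BWA or Bowtie2 use compressed indexes and handle gaps, and the function names, the seed length of 8 and the limit of 2 mismatches are assumptions made here.

```python
from collections import defaultdict


def build_index(reference, k=8):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k=8, max_mismatches=2):
    """Return (start, mismatches) of the best ungapped placement, or None."""
    best = None  # (mismatches, start) of the best placement seen so far
    # Every k-mer of the read is tried as a seed, so that a single
    # sequencing error or variant does not prevent the read from mapping.
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                candidate = (mismatches, start)
                if best is None or candidate < best:
                    best = candidate
    return None if best is None else (best[1], best[0])
```

The mismatch count returned with each position is the kind of per-mapping information that downstream steps can use to separate likely sequencing errors from candidate variants.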
De novo assembly can be used when investigating poorly studied organisms, complex microbial communities or uncultivable organisms. Both DNA and RNA data must be assembled, but transcriptome assembly differs significantly from genome assembly; thus, it is important to use RNA assemblers. Tjaden (2015) argues that assemblers should be designed specifically for prokaryotes, because eukaryotic and prokaryotic transcriptomes pose different challenges. Bacterial genomes are often denser than eukaryotic genomes, with genes in close proximity. Neighbouring bacterial transcripts can overlap, making it difficult to identify transcript boundaries appropriately. Eukaryotic non-coding RNA models are not appropriate for detecting bacterial small regulatory RNAs. A comparison of three assemblers (Trinity, SOAPdenovo2 and Rockhopper 2), using data from nine different bacteria, has also been published.
Once reference mapping or de novo assembly is done, the data can be analyzed structurally and differentially. The main purpose of differential analysis is to determine differences in expression among growth conditions or treatments. Several software packages have been released for this purpose, but there is no consensus about best practices, which makes it difficult to select a tool or method. Seyednasrollah et al. (2013) compared eight differential expression software packages using two real, publicly available datasets. Software that analyzes differential expression can be based on the Poisson distribution (DEGseq and Myrna), the negative binomial distribution (edgeR and DESeq) or other methods [67, 76]. Pinto et al. (2011) recommend using DESeq or edgeR when analyzing replicates.
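The practical difference between the Poisson and negative binomial models can be seen in how each treats variation among replicates. Under a Poisson model the variance of a gene's counts equals its mean, whereas the negative binomial model allows variance = mean + alpha × mean², where alpha is the dispersion. The sketch below is a simple method-of-moments estimate for a single gene; it is not the procedure used by edgeR or DESeq, which share information across genes, and the function name is an assumption.

```python
from statistics import mean, variance


def dispersion(counts):
    """Method-of-moments estimate of the negative binomial dispersion (alpha).

    Solves variance = mean + alpha * mean**2 for alpha, floored at zero.
    A value near zero means the replicates look Poisson-like.
    """
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    if m == 0:
        return 0.0
    return max(0.0, (v - m) / m ** 2)


# Replicate counts for two hypothetical genes with the same mean.
tight = [90, 100, 110]   # variance equals the mean: Poisson is adequate
noisy = [50, 100, 150]   # variance far exceeds the mean: overdispersed
print(dispersion(tight), dispersion(noisy))
```

When replicates show overdispersion of this kind, a Poisson-based test tends to overstate significance, which is consistent with the recommendation to prefer negative binomial tools for replicated designs.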
Transcriptome annotation and classification can be based on structural analysis, which evaluates transcripts according to the genomic region with which they are associated and classifies them accordingly: protein-coding, non-coding and intergenic regions. Several computational methods have been developed to predict ncRNA transcripts. Herbig and Nieselt (2011) highlight the SIPHT, sRNAFinder, sRNAscanner, nocoRNAc and sRNAPredict software. nocoRNAc stands out because it both predicts and characterizes ncRNAs in bacteria.
Assessing transcripts with respect to genomic regions relies on transcript annotation. Computational annotation is convenient because of its speed and precision compared with manual annotation. However, human supervision of the results is considered important in order to avoid false positives or missed features. With this technique, some main structures must be detected: 5' transcript ends, 3' transcript ends, TSS and operons [1, 65].
Transcript boundaries identification
Annotation of transcript boundaries is important for operon identification and regulatory analyses. Identifying a 5' UTR is not always possible: a significant number of bacterial transcripts lack a 5' UTR and are called leaderless transcripts. In these transcripts, the translation start site and the transcription start site lie at almost the same position. Annotation of the 3' UTR is important in order to obtain the full analytical value of the RNA-seq data. Creecy and Conway (2014) state that the current best method for detecting 3' ends is to search for correlations between replicate data. They highlight that the software package TransTermHP can find intrinsic terminators successfully.
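As a small illustration of how a leaderless transcript might be flagged once a transcription start site and a start codon have been annotated, consider the sketch below. The 10 nt threshold for "almost the same position", the 1-based coordinate convention and the function names are assumptions made for this example and are not taken from the cited works.

```python
def utr5_length(tss, start_codon, strand):
    """Length of the 5' UTR in nt, given 1-based genome coordinates.

    tss is the transcription start site and start_codon is the first base
    of the start codon; on the minus strand the TSS has the larger coordinate.
    """
    return start_codon - tss if strand == "+" else tss - start_codon


def is_leaderless(tss, start_codon, strand, max_utr=10):
    """True if the TSS lies at, or within max_utr nt upstream of, the start codon.

    max_utr = 10 is a hypothetical threshold chosen for illustration.
    """
    length = utr5_length(tss, start_codon, strand)
    return 0 <= length <= max_utr
```

For example, a plus-strand gene whose TSS and start codon are both at position 1000 is classified as leaderless, whereas a TSS at position 950 implies a 50 nt leader.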
TSS annotation can assist in the annotation of ncRNAs and polycistronic transcripts. According to Creecy and Conway (2013), it is essential for discovering unknown transcripts and for analyzing operon, 5' UTR and promoter architecture. Although there are no well-established strategies for TSS identification, owing to scarce knowledge about transcription start sites in bacteria, developments in both computational analyses and “wet-lab” experiments have made TSS annotation more feasible. TSSAR is a dRNA-seq data-based tool for rapid annotation of TSS that considers dRNA-seq library statistics. According to Backofen et al. (2014), its main advantage is that the statistical analysis is presented as an easy-to-use web service. The TSSpredator tool provides automated TSS detection and classification from RNA-seq data, performing a genome-wide comparative prediction of TSS. A comparison among manual annotation, TSSpredator and TSSAR annotation has also been published.
An operon is a cluster of genes regulated by the same regulatory sequence and co-transcribed into a single mRNA. This structure has immense biological importance, improving functional gene annotation and providing important information for studies of drug targeting, functional analyses and antibiotic resistance. To handle the complexity of operon occurrence, operons should be detected using operon architecture (i.e., 5' ends and 3' ends) and should have sufficient read coverage to connect promoters and terminators. A strong indication that an operon is real is that at least 90% of its bases are covered by reads. Chuang et al. (2012) classify computational methods for predicting operons and evaluate 15 algorithms with respect to accuracy, specificity and sensitivity.
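The coverage criterion can be checked with a few lines of Python. This sketch reads the 90% figure as the fraction of bases in the putative operon that are covered by at least one mapped read, which is an interpretation adopted here; the function names and the representation of reads as (start, end) intervals in 1-based inclusive coordinates are likewise assumptions.

```python
def covered_fraction(start, end, read_intervals):
    """Fraction of bases in [start, end] covered by at least one read.

    Coordinates are 1-based and inclusive; read_intervals is an iterable of
    (read_start, read_end) tuples for reads mapped to the same strand.
    """
    covered = [False] * (end - start + 1)
    for r_start, r_end in read_intervals:
        lo = max(start, r_start)
        hi = min(end, r_end)
        for pos in range(lo, hi + 1):
            covered[pos - start] = True
    return sum(covered) / len(covered)


def operon_supported(start, end, read_intervals, min_fraction=0.9):
    """True if read coverage spans at least min_fraction of the region."""
    return covered_fraction(start, end, read_intervals) >= min_fraction
```

A candidate operon spanning positions 101 to 200 with reads covering positions 101 to 195 reaches a covered fraction of 0.95 and passes; the same region with reads over only its first half does not.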
5.2. RNA-seq pipeline tools
Not all pipeline tools cover the complete RNA-seq workflow described earlier. To help with tool selection, a comparison of software functionalities was developed and is shown in Table 4. To provide additional support, important points about each tool are described below.
Rockhopper is a system designed specifically for the analysis of bacterial RNA-seq data. It implements a novel approach to mapping transcripts (similar to the Bowtie2 approach). Normalization is then performed, followed by transcript assembly, identification of transcript boundaries, quantification of transcript abundance, testing for differential gene expression and operon prediction. Analysis results are presented using the Integrative Genomics Viewer, which allows different experiments to be viewed simultaneously.
Rockhopper 2 is a comprehensive system focused on de novo assembly that also supports differential analysis and quantification of transcript abundance. According to Tjaden (2015), it does not require high-performance computers and can run on personal computers. Rockhopper 2 implements a novel de novo assembly algorithm for bacterial transcriptomes. The algorithm works in two stages: (1) candidate transcripts are assembled from k-mers found in the reads and (2) sequencing reads are mapped to the candidate transcripts in order to filter them down to high-quality final transcripts. For differential analysis, Rockhopper 2 first normalizes each RNA-seq dataset, which enables comparison of different experiments or samples.
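The general shape of a two-stage strategy of this kind (assemble candidates from k-mers, then filter them with the reads) can be conveyed by a toy example. The code below is not Rockhopper 2's algorithm; it is a simple greedy k-mer extension written for illustration, and the function names, k = 5 and the support threshold of two reads are assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend(contig, kmers, k, used, to_right):
    """Grow a contig one base at a time while exactly one unused k-mer fits."""
    while True:
        if to_right:
            options = [contig[-(k - 1):] + b for b in "ACGT"]
        else:
            options = [b + contig[:k - 1] for b in "ACGT"]
        options = [km for km in options if km in kmers and km not in used]
        if len(options) != 1:  # dead end or ambiguous branch: stop here
            return contig
        used.add(options[0])
        contig = (contig + options[0][-1]) if to_right else (options[0][0] + contig)


def assemble_candidates(reads, k=5):
    """Stage 1: build candidate transcripts, seeding from frequent k-mers."""
    kmers = kmer_counts(reads, k)
    used, candidates = set(), []
    for seed, _ in kmers.most_common():
        if seed in used:
            continue
        used.add(seed)
        contig = extend(seed, kmers, k, used, to_right=True)
        contig = extend(contig, kmers, k, used, to_right=False)
        candidates.append(contig)
    return candidates


def filter_candidates(candidates, reads, min_reads=2):
    """Stage 2: keep candidates that fully contain at least min_reads reads."""
    return [c for c in candidates
            if sum(read in c for read in reads) >= min_reads]
```

In this sketch, exact substring matching stands in for the read-mapping stage; a realistic implementation would tolerate mismatches and resolve branches rather than stopping at them.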
RNA-Rocket aims to simplify the process of aligning RNA-seq data to a reference genome and generating quantitative transcript profiles. It is built on Galaxy, which provides the tools and services necessary to process RNA-seq data. Its benefits include the possibility of sharing results across research groups, support for batch analysis of multiple samples and the integration of tools and projects with data from the PATRIC platform.
The READemption pipeline aims to integrate individual RNA-seq analysis tasks and provides a user-friendly tool with a command-line interface. It was developed primarily to analyze bacterial transcriptomes. In order to use the full capacity of modern computers and reduce run time, READemption offers parallel data processing. First, it performs quality trimming and removes poly(A) tails and adapters; this is followed by mapping, coverage calculation, gene expression quantification, differential gene expression analysis and plotting. The software can analyze RNA-seq data from the Illumina and 454 platforms.
ReadXplorer offers straightforward visualization and analysis functions built around its unique read mapping classification. Analyses such as TSS and operon detection, differential expression, RPKM value and read count calculations are available in ReadXplorer, and their results can be exported to Microsoft Excel files. The read mapping classification sorts read mappings into three classes: perfect match, best match and common match. These classes are incorporated in all analysis functions.
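For readers unfamiliar with the RPKM value mentioned above, the following sketch applies its standard definition (reads per kilobase of transcript per million mapped reads). It does not reproduce ReadXplorer's implementation, and the function and parameter names are chosen here for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# A hypothetical 1,000 bp gene with 500 reads in a library of 10 million
# mapped reads.
print(rpkm(500, 1000, 10_000_000))
```

Because the value is normalized by both gene length and library size, it allows a rough comparison of expression between genes of different lengths and between libraries of different depths.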
5.3. Bioinformatics challenges
A review of the literature [29, 66, 69, 71, 82, 83] shows that bioinformatics faces many challenges related to computational issues. RNA-seq experiments generate large amounts of data that must be processed, analyzed, stored and retrieved, which demands a great deal of computational power. In addition, not all bioinformatics researchers have extensive computational experience, which makes the lack of user-friendly tools a problem for some users and an important issue for developers. However, powerful computers, excellent bioinformatics researchers and user-friendly tools do not guarantee a successful analysis: the software selected must be appropriate to each biological question and to the organisms studied. Even with all the issues presented here, RNA-seq analysis has been very successful in recent years, and this success suggests promising possibilities for RNA-seq bioinformatics analyses in the future.