Determining when and how genes are “turned on and off” is a challenge of the post-genomic era. Differences between species owe more to gene expression and regulation than to gene structure (Howard & Benson, 2002). The first and key step in gene expression is promoter recognition by the RNA polymerase enzyme (RNAP). Promoter sequences can be defined as cis-acting elements located upstream of the transcription start site (TSS) of open reading frames (ORF). By analogy, genes represent the “computer memory” and promoters represent the “computer program” that acts on that memory. The study of promoters can help provide new models of how this computer program is constituted and how it operates (Howard & Benson, 2002).
Proper regulation of transcription is crucial for a single-cell prokaryote, since its environment can change dramatically and instantly (Huffmann & Brennan, 2002). Therefore, a detailed account of the principles and organization of the transcriptional process helps in understanding the complexity of the biological systems involved in, for instance, cellular responses to environmental changes or the molecular bases of many diseases caused by microbes (Janga & Collado-Vides, 2007).
While several sequenced genomes have their protein-coding gene repertoire well described, the accurate identification and delineation of cis-regulatory elements remain elusive (Fauteux et al., 2008). At present, the challenges are to analyze the available sequences and to locate TSS, promoters and other regulatory sequences (Askary et al., 2009). The purpose of this review is to provide a brief survey of the characteristics of promoter sequences and of the advances in computer algorithms for their analysis and prediction. This chapter is organized in two main sections. The established knowledge about the biological features of promoters will be described in the first section, focusing on their genetic role and sequence content. This is an important topic for understanding the intrinsic difficulties in the
2. The bacterial promoter sequences
A common feature of transcriptional regulators is their ability to recognize specific DNA patterns in order to modulate gene expression (Jacques et al., 2006). The regulatory region upstream of bacterial coding regions contains the promoter, that is, the DNA sequence which determines specific recognition by RNAP (Barrios et al., 1999). The following section presents a detailed description of promoter sequences and their role as regulators of gene expression.
2.1. Promoter sequences and gene expression specificity
In bacteria, the RNAP holoenzyme consists of five core subunits (2α, β, β’, ω) and an additional sigma (σ) factor subunit (Figure 1). A collection of different σ subunits acts as key regulators of bacterial gene expression. The σ factor directs the sequence-specific binding of RNAP at the promoter, where melting of the DNA double strand occurs (Borukov & Nudler, 2003). The substitution of one σ factor by another can initiate the transcription of different groups of genes (Schultzaberger et al., 2006). The number of σ factors encoded in bacterial genomes is highly variable. It is possible that the number of σ factor genes is related to the diversity of lifestyles encountered by a bacterium (Janga & Collado-Vides, 2007).
The σ factors are labeled according to their molecular weight (e.g. σ24, σ28, σ32, σ38, σ54 and σ70) and each one has been assigned a global functional role (Table 1). The σ70 is the most commonly used σ factor in
Regardless of the σ factor, most promoters can be dissected into two functional sites, known as the -35 and -10 regions upstream of the TSS. Mutations in the consensus sequences of promoters can affect the level of expression of the gene(s) they control without altering the gene products themselves (Lewin, 2008). The canonical consensus sequences and the number of spacer nucleotides recognized by the most important σ factors are presented in Table 1. Only for σ54 are the consensus regions located at positions -12 and -24.
|28||fliA||Flagellar genes||CTAAA||15 bp||GCCGATAA||Helmann & Chamberlin, 1987|
|CCCTTGAA||13-15 bp||CCCGATNT||Cowing et al., 1985|
|TTGACA||16-18 bp||CTATACT||Typas et al., 2007|
|CTGGNA||6 bp||TTGCA||Barrios et al., 1999|
|GGAACTT||15 bp||GTCTAA||Rhodius et al., 2006|
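The two-site layout summarized in Table 1 can be sketched as a mismatch-tolerant scan. This is only an illustrative sketch, not a published tool: the function names, the `max_mismatch` tolerance and the toy sequence are assumptions, while the σ28 motifs and spacer length are taken from the first row of Table 1.

```python
def mismatches(motif, window):
    """Count mismatching positions; 'N' in the motif matches any base."""
    return sum(1 for m, w in zip(motif, window) if m != "N" and m != w)


def scan_two_site(seq, box35, box10, spacers, max_mismatch=1):
    """Return (start of -35 box, spacer, start of -10 box, total mismatches)
    for every candidate whose two boxes each have at most max_mismatch errors."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        m35 = mismatches(box35, seq[i:i + len(box35)])
        if m35 > max_mismatch:
            continue
        for spacer in spacers:
            j = i + len(box35) + spacer
            window = seq[j:j + len(box10)]
            if len(window) < len(box10):
                continue  # the -10 box would run past the end of the sequence
            m10 = mismatches(box10, window)
            if m10 <= max_mismatch:
                hits.append((i, spacer, j, m35 + m10))
    return hits


# sigma28 consensus and spacer from Table 1; the toy sequence is hypothetical.
SIGMA28 = {"box35": "CTAAA", "box10": "GCCGATAA", "spacers": [15]}
toy = "GG" + "CTAAA" + "A" * 15 + "GCCGATAA" + "TT"
print(scan_two_site(toy, **SIGMA28))
```

A scan of this kind finds only sequences close to the consensus, so it is best read as a baseline for the more flexible methods discussed later.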
A comprehensive study of the information content of promoters was carried out by Schultzaberger et al. (2006). The authors used Claude Shannon’s information theory and built a promoter model by aligning and refining 559 sequences upstream of TSS. Among other findings, the analysis of the promoter motifs yielded two interesting results:
As described so far, promoter motifs are not strictly conserved within a set of promoters recognized by a given σ factor, and they also differ according to the σ factor which recognizes them. The structure of bacterial promoters limits the efficacy of prediction by a global analysis approach. A limited analysis of a putative promoter sequence by comparison with the σ70 promoter consensus motif can lead to an unacceptable rate of false negatives and incorrect assignments (de Avila e Silva et al., 2011).
2.2. Structural properties of promoter sequences
The motifs obtained from compilations of promoter sequences indicate the existence of a nucleotide signal in them. Nonetheless, it has also been demonstrated that the primary DNA sequence is not the only source of information in the genome for the regulation of transcription (Olivares-Zavaleta et al., 2006). According to many authors (e.g., Kanhere & Bansal, 2005a; Klaiman et al., 2009; Wang & Benham, 2006), regulatory sequences not only contain specific sequence elements that serve as targets for interacting proteins, but also present distinctive properties, such as a suitable geometrical arrangement of DNA (curvature), a propensity to adopt a deformed conformation that facilitates protein binding (flexibility) and particular physical properties (e.g., stacking energy, stability, stress-induced duplex destabilization). Several studies have reported that eukaryotic and prokaryotic σ70-dependent promoter sequences have lower stability, higher curvature and lower flexibility than coding sequences (Gabrielian & Bolshoy, 1999; Kanhere & Bansal, 2005a).
DNA stability is a sequence-dependent property based on the sum of the interactions between the dinucleotides of a given sequence. It is possible to calculate the stability of a DNA duplex and to predict its melting behavior if the contribution of each nearest-neighbor interaction is known (SantaLucia & Hicks, 2004). An analysis of eukaryotic and prokaryotic promoter stability was carried out by Kanhere & Bansal (2005a). The authors reported that promoters from three bacteria which have different genome composition (A+T composition:
Intrinsic DNA curvature and bendability have been shown to be an important physical basis of many biological processes, in particular those which involve the interaction of DNA with DNA-binding proteins, such as transcription initiation and termination, DNA origins of replication and nucleosome positioning (Gabrielian & Bolshoy, 1999; Jáuregui et al., 2003; Nickerson & Achberger, 1995; Thiyagarajan et al., 2006). Specifically, bending refers to twists and short bends of approximately 3 base pairs, while curvature refers to loops and arcs involving around 9 base pairs (Holloway et al., 2007). DNA curvature in prokaryotes is usually present upstream of the promoter, but sometimes within the promoter sequence (Jáuregui et al., 2003; Kozobay-Avraham et al., 2006). The distribution of curved DNA in promoter regions is evolutionarily preserved, since orthologous groups of genes with highly curved upstream regions have been identified (Kozobay-Avraham et al., 2006). As reported by Pandey & Krishnamachari (2006), sequences derived from non-coding regions had an overall base composition similar to that of promoter regions but different curvature values, indicating that the differences in curvature are a consequence not just of base composition but also of the organization of the bases in the sequences.
Another DNA feature that can distinguish promoter sequences is stress-induced DNA duplex destabilization (SIDD). According to Wang & Benham (2006), SIDD is neither directly related to primary sequence alone nor equivalent to the stability of the DNA double helix. In this complex process, SIDD is governed by the balance between the energy cost of strand separation for the specific base pairs involved and the energy benefit from fractional relaxation of the superhelical stress. Promoters are strongly associated with regions of low SIDD energy. Certain non-coding regions containing promoters or terminators are unstable, while transcribed regions remain stably duplexed under the stress imposed by negative superhelicity. A change in the level of superhelicity in a promoter region can have a variety of effects on the expression of the genes it controls (Wang & Benham, 2006).
As described so far, promoters present organizational properties which, at different scales, may play a significant role in the transcription process. Recent studies have reported promising results using structural or biophysical properties of DNA as predictors of promoter regions, either alone or in association with sequence composition. A detailed description of these approaches and their results will be presented in the next section.
Molecular methods for the experimental identification of promoters can be laborious, time-consuming and expensive. Consequently, it is important to develop algorithms that can rapidly and accurately evaluate the presence of promoters (Jacques et al., 2006; Li & Lin, 2006). A variety of
3.1. Performance measures for the evaluation of promoter classification programs
A classification model (or classifier) is a mapping from instances to predicted classes (Fawcett, 2006). Promoter prediction is a binary classification problem, as each input sequence is assigned to exactly one of two non-overlapping classes (Sokolova & Lapalme, 2009). The performance of a classifier during testing is assessed by counting the correct and incorrect classifications for each class (Bradley, 1997). The four possible outcomes of a classification are (Bradley, 1997; Fawcett, 2006; Sokolova & Lapalme, 2009):
TP: promoter sequences classified as promoters (true positive);
TN: non-promoter sequences classified as non-promoters (true negative);
FP: non-promoter sequences classified as promoters (false positive);
FN: promoter sequences classified as non-promoters (false negative).
This information is then normally displayed in a two-by-two confusion matrix (Table 2). A confusion matrix is a form of contingency table showing the differences between the true and predicted classes for a set of labeled examples (Bradley, 1997).
|Data class||Classified as promoter||Classified as non-promoter|
|Promoter||True positive (TP)||False negative (FN)|
|Non-promoter||False positive (FP)||True negative (TN)|
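The four cells of the confusion matrix can be tallied directly from paired true and predicted labels. The following is a minimal sketch; the function name and the label strings are illustrative assumptions.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels, positive="promoter"):
    """Tally TP, FN, FP and TN for a binary promoter/non-promoter task."""
    counts = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            counts["TP" if predicted == positive else "FN"] += 1
        else:
            counts["FP" if predicted == positive else "TN"] += 1
    # Return all four cells, including those that stayed at zero.
    return {cell: counts[cell] for cell in ("TP", "FN", "FP", "TN")}


truth = ["promoter", "promoter", "promoter", "non-promoter", "non-promoter"]
predicted = ["promoter", "non-promoter", "promoter", "promoter", "non-promoter"]
print(confusion_counts(truth, predicted))
```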
The confusion matrix contains all the information about the classifier’s performance and is the basis for many common metrics (Bradley, 1997; Fawcett, 2006). The most frequently used performance measures are accuracy, sensitivity, specificity, precision and receiver operating characteristic (ROC) graphs. Their formulas are presented in equations 1 to 4. Accuracy measures the overall effectiveness of a classifier. Alternative measures are sensitivity (the proportion of true promoter sequences that are predicted as such) and specificity (the proportion of non-promoter sequences that the classifier identifies as such). Additionally, precision is the proportion of sequences classified as promoters that are true promoters (Sokolova & Lapalme, 2009). A reliable summary of the performance of a promoter prediction program is the harmonic mean of sensitivity and specificity. A ROC graph is a technique for visualizing, organizing and selecting classifiers based on their performance (Figure 2). It is a two-dimensional graph in which the TP rate is plotted on the Y axis and the FP rate is plotted on the X axis. A common method associated with the ROC graph is to calculate the area under the ROC curve, abbreviated AUC. The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Further information about the ROC curve can be found in Fawcett (2006).
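As a minimal sketch, the measures above can be computed from the four confusion-matrix counts using their standard definitions, and the AUC can be obtained from its rank interpretation. The function names and example numbers are illustrative assumptions, and no guard against empty classes (division by zero) is included.

```python
def metrics(tp, fn, fp, tn):
    """Standard measures derived from the confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": tp / (tp + fp),
        # Harmonic mean of sensitivity and specificity.
        "harmonic_mean": 2 * sensitivity * specificity / (sensitivity + specificity),
    }


def auc(positive_scores, negative_scores):
    """Probability that a random positive is ranked above a random negative
    (ties count as one half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


print(metrics(tp=8, fn=2, fp=1, tn=9))
print(auc([0.9, 0.8, 0.4], [0.7, 0.3]))
```

The pairwise AUC computation is quadratic in the number of examples; it is chosen here for clarity rather than speed.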
3.2. Position-weight matrices
Consensus sequences have been used to predict promoters by simple pattern matching. These strategies for promoter identification are usually based on prior knowledge of some characterized sequences (Jacques et al., 2006). The first alignments of
A more sophisticated approach based on alignment is the Position-Weight Matrix (PWM). In this two-dimensional array, each row represents one of the nucleotides A, T, C or G and each column represents a position of the analyzed motif. This well-accepted method works by aligning examples of reference sequences, which allows the base preference at each position of the matrix to be estimated (Song et al., 2007). A weight is assigned to each base at each position in the promoter sequence, and the final score of a candidate sequence decreases with its differences from the reference matrix. Detailed information about the first implementations and the mathematical background can be found in Stormo (2000).
Huerta & Collado-Vides (2003) used a two-stage PWM approach code-named Cover. This approach searches for conserved motifs using multiple sequence alignment methods and generates weight matrices for σ70-dependent promoter sequences. Aiming to select the best matrices, the authors added criteria such as the spacer between the -10 and -35 hexamers, the distance between the -10 region and the start codon, the distance between the -10 region and the TSS, statistical analysis and the matrix score. Despite the 86% predictive capacity of this approach, the accuracy obtained was 53%. This value indicates that the approach yields a high number of false positives.
Li & Lin (2006) proposed a variation of PWMs called the Position Correlation Scoring Matrix (PCSM). This approach considers position-specific weight matrices at ten specific positions of the promoter. One PCSM was computed for the promoter training set and another for the non-promoter training set. A new test sequence was scored with both the promoter and the non-promoter PCSM, and it was identified as a promoter only if the score from the promoter PCSM was higher. The results reported in that paper show a sensitivity of 91% and a specificity of 81%. When the PCSM was applied to predict promoters in the whole genome, all 683 experimentally identified σ70-dependent promoter sequences were successfully predicted. In addition, 1567 predictions were considered probable promoters.
To predict σ28-dependent promoter sequences in ten gamma-proteobacteria species, Song et al. (2007) used an alternative PWM-based approach named Position Specific Score Matrix (PSSM). The species chosen were
PWM models are commonly used because they are a simple predictive approach. Moreover, they are a convenient way to account for the fact that some positions are more conserved than others (Stormo, 2000). However, in a large number of sequences the consensus can be insufficiently conserved, that is, the sequences present insertions, deletions or variable spacing between elements, or the elements are difficult to define. In such cases, this approach yields many false predictions (Kalate
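The construction and use of a PWM can be sketched as follows. This is a minimal illustration rather than any published implementation: the log-odds weighting against a uniform background, the pseudocount, the function names and the four aligned example sites are all assumptions. The matrix is stored as a list of columns (one dictionary per motif position), which is equivalent to the row-by-column layout described above.

```python
import math

BASES = "ACGT"


def build_pwm(aligned_sites, pseudocount=1.0, background=0.25):
    """Log-odds weight for each base at each position of the aligned sites."""
    pwm = []
    for position in range(len(aligned_sites[0])):
        column = [site[position] for site in aligned_sites]
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, candidate):
    """Sum of the weights of the candidate's bases; lower for poorer matches."""
    return sum(column[base] for column, base in zip(pwm, candidate))


def best_window(pwm, sequence):
    """Best-scoring window as (score, start index)."""
    width = len(pwm)
    return max((score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Hypothetical aligned -10-like sites, used only to exercise the code.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)
print(score(pwm, "TATAAT"), score(pwm, "GCGCGC"))
print(best_window(pwm, "GGGTATAATGGG"))
```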
3.3. Machine learning
Machine Learning (ML) concerns the development of computer algorithms which allow the machine to learn from examples. Classification (or pattern recognition) is an important application of ML techniques in bioinformatics due to their capability of capturing hidden knowledge from data. This is possible even if the underlying relationships are unknown or hard to describe. Additionally, these techniques can recognize complex patterns automatically and distinguish exemplars based on these patterns (Cen et al., 2010; Sivarao
ML approaches usually split the data set into training and test groups. They learn from examples (training data), and a set of examples which were not exposed to the classifier during training is used to test the classification model. Among all ML techniques, Support Vector Machine (SVM) and Artificial Neural Network (ANN) applications have produced promising results in the promoter prediction problem. For this reason, the purpose of this section is to explain the basic ideas of these two ML approaches.
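The separation into training and test groups can be sketched as a reproducible random split. The function name, the 25% test fraction and the toy labeled examples are illustrative assumptions.

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (sequence identifier, label) pairs; 1 = promoter, 0 = non-promoter.
data = [("seq%d" % i, i % 2) for i in range(20)]
train, test = train_test_split(data)
print(len(train), len(test))
```

Because the held-out examples never reach the learning step, performance measured on them estimates how the classifier behaves on unseen sequences.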
3.3.1. Support vector machines
SVM has been applied to identify important biological elements including proteins, promoters and TSS, among others. This technique is used in bioinformatics because it can represent complex nonlinear functions and is flexible in modeling diverse sources of data. This approach, introduced by Vapnik and his collaborators in 1992, is usually implemented as a binary classifier and relies on two key concepts: the separation of the data set into two classes by a hyperplane, and the application of supervised learning algorithms known as kernel machines (Ben-Hur et al., 2008; Kapetanovic et al., 2004; Polat & Günes, 2007). In a simple way (Figure 3), SVM classifies the data by:
Several published papers are devoted to promoter prediction using SVM. L. Gordon et al. (2003) applied an SVM with an alignment kernel to two different data sets: promoters and coding regions, and promoters and non-promoter intergenic regions. The average errors achieved were 16.5% and 18.6%, respectively. This method is preferable when a sufficient number of promoter regions is known but little or nothing is known about their composition (L. Gordon et al., 2003). This tool is available online at http://nostradamus.cs.rhul.ac.uk/~leo/sak_demo/. Another SVM, developed by J. J. Gordon et al. (2006), made a joint prediction of
By using a combination of feature selection and
SVM algorithms present several advantages over other methods. First, SVM produces a unique solution, since its training is a convex optimization problem. Second, SVM is able to deal with very large amounts of dissimilar information. Third, the discriminant function is characterized by only a comparatively small subset of the entire training data set, which makes the computations noticeably faster (Kapetanovic et al., 2004). On the other hand, a drawback of SVM is its slow training, as it is trained by solving a quadratic programming problem in which the number of variables equals the number of training data (Abe, 2010).
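To illustrate how the discriminant function depends only on a small subset of the training data (the support vectors), the sketch below evaluates the standard kernel decision function, f(x) = Σ αᵢ yᵢ K(xᵢ, x) + b. Everything specific in it is an assumption made for illustration: the k-mer "spectrum" kernel stands in for the alignment kernel used in the literature, and the support vectors, coefficients and bias are hypothetical values, not the result of training.

```python
from collections import Counter


def spectrum_kernel(a, b, k=3):
    """Similarity of two sequences as the dot product of their k-mer counts."""
    kmers_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    kmers_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum(count * kmers_b[kmer] for kmer, count in kmers_a.items())


def decision(x, support_vectors, bias=0.0, k=3):
    """Kernel decision function; positive values indicate the promoter class.

    support_vectors is a list of (sequence, label, alpha) with label +1 or -1.
    """
    return sum(alpha * label * spectrum_kernel(sv, x, k)
               for sv, label, alpha in support_vectors) + bias


# Hypothetical support vectors and coefficients (not obtained by training).
svs = [("TATAATTATAAT", +1, 0.5), ("GCGCGCGCGCGC", -1, 0.5)]
print(decision("TTATAATG", svs), decision("CGCGCGCG", svs))
```

Only the stored support vectors enter the sum, so classifying a new sequence costs one kernel evaluation per support vector rather than one per training example.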
3.3.2. Artificial Neural Networks
Artificial neural networks (ANN) are powerful computational tools inspired by the structure and behavior of biological neurons, although they are not faithful models of biological neural or cognitive phenomena (Hilal et al., 2008; Wu, 1996). By analogy with the human brain, the basic unit of an ANN is called an artificial neuron (Figure 4b), and it can be considered a processing unit which performs a weighted sum of inputs (Hilal et al
The way in which the neurons are interconnected defines the ANN architecture. There are many kinds of architecture, but this review describes only the multilayer perceptron (MLP). The reasons for this choice are the capability of the MLP to capture and discover high-order correlations and/or relationships in input data, and its wide applicability to promoter prediction (Hilal et al., 2008; Wu, 1996). A three-layer ANN (Figure 4a) is known as a universal classifier, as it is able to classify any labeled data correctly if there are no identical data in different classes (Baldi & Brunak, 2001).
The MLP presents three kinds of layers: input layer, output layer, and hidden layers (Figure 4a). The input layer contains the neurons which receive information from external sources and pass it to the hidden layer for network processing. The use of hidden neurons makes the learning process harder to visualize, since the search has to be conducted in a much larger space of possible functions in order to decide how input features should be represented by the hidden neurons. The output layer contains neurons that receive the processed information and send output signals out of the system. In all layers there is a bias input which provides a threshold for the activation of the neurons (Hilal et al., 2008). The neurons in a given layer are fully connected by weights to the neurons of the adjacent layer. Each layer comprises a set number of neurons. The number of input neurons corresponds to the number of input variables of the ANN, and the number of output neurons is the same as the number of desired output variables. The number of neurons in the hidden layer(s) depends on the application of the network (Hilal et al., 2008).
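The flow of information through the layers can be sketched as a forward pass in which every neuron computes a weighted sum of its inputs plus a bias and applies an activation function. The sigmoid activation, the one-hot encoding of nucleotides and all names below are assumptions chosen for illustration; other encodings and activations are equally possible.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Each neuron: weighted sum of its inputs plus bias, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> output layer."""
    hidden = layer_forward(inputs, hidden_w, hidden_b)
    return layer_forward(hidden, out_w, out_b)


def one_hot(sequence):
    """Four input neurons per nucleotide (hypothetical encoding)."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [bit for base in sequence for bit in table[base]]


x = one_hot("TA")                         # 8 input variables
hidden_w = [[0.0] * 8, [0.0] * 8]         # 2 hidden neurons, untrained weights
output = mlp_forward(x, hidden_w, [0.0, 0.0], [[0.0, 0.0]], [0.0])
print(output)
```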
MLPs have been applied successfully to many problems by training them in a supervised way with a highly popular algorithm known as back-propagation (Wu, 1996), the most widely used algorithm for adjusting the connection weights. During the training of multilayer neural network classifiers, the weights are usually corrected so that the sum-of-squares error between the network outputs and the desired outputs is minimized (Abe, 2010).
The first ANN-based promoter prediction, presented by Demeler and Zhou (1991), had a simple architecture, and the results showed high accuracy but also a high false positive rate. More complex architectures were applied by Mahadevan and Ghosh (1994), who used a combination of two ANN to identify
DNA promoter information other than nucleotide composition has been used as ANN input data by several authors. Rani et al. (2007) proposed a global feature extraction scheme which extracts an average signal from the entire promoter sequence of 80 bp length. The resulting signal was composed of a combination of promoter dinucleotides. After this procedure, MLP training was carried out with the promoter signal as positive examples and four different negative data sets:
By using an ANN architecture fed by the difference in DNA stability values between upstream and downstream regions in the vicinity of known TSS, Askary
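One back-propagation update for a network with a single hidden layer and a single output neuron can be sketched as below, using the sum-of-squares error E = ½(o − t)² for one training example. The sigmoid activation, the learning rate, the starting weights and all names are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    """Hidden activations h and network output o."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    o = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)
    return h, o


def backprop_step(x, t, W1, b1, w2, b2, lr=0.5):
    """One gradient-descent update of all weights for E = 0.5 * (o - t)**2."""
    h, o = forward(x, W1, b1, w2, b2)
    delta_o = (o - t) * o * (1.0 - o)            # error signal at the output
    delta_h = [delta_o * w2[j] * h[j] * (1.0 - h[j])  # propagated to hidden layer
               for j in range(len(h))]
    new_w2 = [w2[j] - lr * delta_o * h[j] for j in range(len(h))]
    new_b2 = b2 - lr * delta_o
    new_W1 = [[W1[j][i] - lr * delta_h[j] * x[i] for i in range(len(x))]
              for j in range(len(h))]
    new_b1 = [b1[j] - lr * delta_h[j] for j in range(len(h))]
    return new_W1, new_b1, new_w2, new_b2


# Hypothetical starting weights and a single training example with target 1.
params = ([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0], [0.2, -0.1], 0.0)
x, t = [1.0, 0.0], 1.0
for _ in range(100):
    params = backprop_step(x, t, *params)
print(forward(x, *params)[1])
```

In practice the update is repeated over many examples and training cycles; the single-example loop here only shows the direction in which the weights move.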
Rani and Bapi (2007) used
An ANN-based approach was used by de Avila e Silva et al. (2011) for promoter prediction according to the σ factor which recognizes the sequence. This bioinformatics tool, denoted BacPP, was developed by weighting rules extracted from ANNs trained with promoter sequences known to respond to a specific σ factor. The information obtained from the rules was weighted to optimize the prediction of promoters and their classification according to the σ factor which recognizes them. The accuracy results for
In spite of the capability of ANN to capture imprecise and incomplete patterns, such as individual promoter motifs including mismatches (Cotik et al., 2005), this ML approach presents some intrinsic difficulties. Many decisions related to the choice of ANN structure and parameters are often completely subjective. The final ANN solution may be influenced by a number of factors (e.g., starting weights, number of cases, number of training cycles, etc.). Besides, overtraining needs to be avoided, since it results in an ANN which memorizes the data instead of generalizing from them (Kapetanovic et al., 2004).
3.4. Other approaches
The symbolic representation of DNA nucleotides by the letters A, T, G and C has led to many studies aiming at understanding its structure through distributions, complexities, redundancy and statistical regularities (Krishnamachari et al., 2004). All of this information has the theoretical potential to be a distinguishing feature of promoter sequences. Some papers are devoted to applying these features, either alone or in combination with other approaches, to improve promoter prediction results.
Kanhere and Bansal (2005b) developed their own promoter recognition approach based on differences in DNA stability between promoter and coding regions. That tool was improved by Rangannan and Bansal (2007) and achieves a sensitivity of 98% but a precision of only 55%. The authors claim that this stability-based approach can be used to annotate entire genome sequences for promoter regions. According to the authors, the low precision could be improved by combining the approach with other sequence-based methods. Additionally, they argue that this method can be used to investigate characteristic properties of specific subclasses of promoters, as well as other functional elements which do not exhibit obvious consensus sequences.
Jacques et al. (2006) describe a novel approach based on matrices representing the genomic distribution of hexanucleotide pairs. The principal strategy was based on the observation that promoters are over-represented in intergenic regions relative to the whole genome. This approach was carried out for ten prokaryotic genomes, and characterized promoter sequences were used to estimate the sensitivity of the resulting matrices. The sensitivity values differ according to the bacterium analyzed. The lowest value was 29.4% for
The SIDD values were used by Wang and Benham (2006) to demonstrate that this information can be useful when applied to promoter prediction. They defined a promoter as extending from positions -80 to +20 with respect to the TSS, and they defined strong SIDD as any value below 6 kcal/mol. SIDD values correctly predicted 74.6% of the real promoters with a false positive rate of 18%. When the SIDD values were combined with -10 motif scores in a linear classification function, promoter regions were predicted with better than 90% accuracy. The authors attribute their success to the fact that about 80% of documented promoters contain a strong SIDD site. The authors also observed a bimodal distribution of SIDD properties, which can reflect the complexity of transcriptional regulation, suggesting that SIDD may be needed to initiate transcription from some promoters, but not others.
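The idea behind stability-based prediction can be sketched as a sliding-window average of nearest-neighbor free energies along a sequence. This is a simplified illustration and not the authors' implementation: the window length, the function name and the two toy sequences are assumptions, and the dinucleotide free energies (kcal/mol) are approximate values of the kind tabulated in the nearest-neighbor literature, included here only as placeholders. Initiation and symmetry corrections are ignored.

```python
# Approximate nearest-neighbor free energies in kcal/mol (illustrative values;
# each dinucleotide shares its value with its reverse complement).
NN_DG = {
    "AA": -1.00, "TT": -1.00, "AT": -0.88, "TA": -0.58,
    "CA": -1.45, "TG": -1.45, "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28, "GA": -1.30, "TC": -1.30,
    "CG": -2.17, "GC": -2.24, "GG": -1.84, "CC": -1.84,
}


def window_stability(sequence, window=15):
    """Average free energy per dinucleotide step in each sliding window.

    Values closer to zero indicate a less stable (more easily melted) duplex.
    """
    sequence = sequence.upper()
    steps = [NN_DG[sequence[i:i + 2]] for i in range(len(sequence) - 1)]
    n = window - 1  # number of dinucleotide steps inside one window
    return [sum(steps[i:i + n]) / n for i in range(len(steps) - n + 1)]


# Hypothetical AT-rich (promoter-like) and GC-rich (coding-like) stretches.
at_rich = window_stability("TATAATATTAAATTTATAAT")
gc_rich = window_stability("GCGGCCGCGGGCCGCGGCGC")
print(max(at_rich), min(gc_rich))
```

Comparing such profiles upstream and downstream of a candidate TSS gives the kind of stability difference on which these predictors rely.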
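The two ingredients described above, a strong SIDD site inside the -80 to +20 window and a linear combination of SIDD with a -10 motif score, can be sketched as follows. The per-position SIDD energy profile is assumed to come from an external calculation, and the weights, bias and function names are hypothetical; only the 6 kcal/mol threshold and the window limits come from the text.

```python
def has_strong_sidd(sidd_profile, tss, threshold=6.0, upstream=80, downstream=20):
    """True if any position from -80 to +20 around the TSS has a SIDD energy
    (kcal/mol) below the threshold."""
    start = max(0, tss - upstream)
    end = min(len(sidd_profile), tss + downstream + 1)
    return min(sidd_profile[start:end]) < threshold


def linear_score(min_sidd, motif_score, w_sidd=-1.0, w_motif=1.0, bias=0.0):
    """Linear classification function; the weights are placeholders that would
    have to be fitted to labeled data."""
    return w_sidd * min_sidd + w_motif * motif_score + bias


# Hypothetical profile: stable everywhere except one destabilized site.
profile = [10.0] * 200
profile[95] = 3.2
print(has_strong_sidd(profile, tss=100), has_strong_sidd(profile, tss=190))
print(linear_score(3.2, 5.0), linear_score(10.0, 5.0))
```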
A brief survey of currently
The currently available approaches described in this paper strive to reduce the number of false predictions. Recent bioinformatics applications increasingly take DNA structural features into account and incorporate this kind of information into promoter detection tools. Some works show the advantage of using feature selection or extraction as an important part of pattern recognition, since this procedure can decrease the computational cost and increase the performance of the classification (Polat and Günes, 2009). One of the goals of promoter recognition is to locate promoter regions in the genome sequence. Predicting promoters on a genome-wide scale is problematic due to the higher number of false positive predictions caused by the large amount of DNA analyzed. It is important to consider that no classification method is universally better than the others, since each method has a class of target functions for which it is best suited (Bradley, 1997).
The authors would like to express their gratitude to the University of Caxias do Sul for the financial support provided to the research project.