Significance Score of Motifs in Biological Sequences

In Bioinformatics, it is common to search biological sequences (DNA, RNA, proteins) for functional motifs such as cross-over hotspot instigators (chi), restriction sites, regulation motifs, binding sites, active sites in proteins, etc. (Beaudoing et al., 2000; Brazma et al., 1998; El Karoui et al., 1999; Frith et al., 2002; Hampson et al., 2002; Karlin et al., 1992; Leonardo Marino-Ramirez & Landsman, 2004; van Helden et al., 1998). Due to evolution pressure, functional motifs are likely to be more conserved than non-functional motifs. As a consequence, it is a natural strategy to search biological sequences for motifs which are statistically exceptional (ex: overor under-represented). Given M a motif of interest (from simple strings to complex regular expressions), a recurrent question is: “how surprising is it to observe n occurrences of M in my dataset ”. In statistical terms, this is equivalent to compute the p-value of observation n in respect with a relevant reference model. More precisely, if X1:l = X1 . . .Xl is a length l random sequence generated by our reference model, and if N denotes the random number of occurrences ofM in X1:l, for any n 0 our objective is to compute the significance score of observation n:


Introduction
In Bioinformatics, it is common to search biological sequences (DNA, RNA, proteins) for functional motifs such as cross-over hotspot instigators (chi), restriction sites, regulation motifs, binding sites, active sites in proteins, etc. (Beaudoing et al., 2000;Brazma et al., 1998;El Karoui et al., 1999;Frith et al., 2002;Hampson et al., 2002;Karlin et al., 1992;Leonardo Marino-Ramírez & Landsman, 2004;van Helden et al., 1998). Due to evolution pressure, functional motifs are likely to be more conserved than non-functional motifs. As a consequence, it is a natural strategy to search biological sequences for motifs which are statistically exceptional (ex: over-or under-represented). Given M a motif of interest (from simple strings to complex regular expressions), a recurrent question is: "how surprising is it to observe n occurrences of M in my dataset ". In statistical terms, this is equivalent to compute the p-value of observation n in respect with a relevant reference model. More precisely, if X 1:ℓ = X 1 ...X ℓ is a length ℓ random sequence generated by our reference model, and if N denotes the random number of occurrences of M in X 1:ℓ , for any n 0 our objective is to compute the significance score of observation n: this score representing the p-value in a decimal log-scale, negative (resp. positive) values being associated to under-(resp. over-) representation events. In order to compute such a score for a given motif M and a given dataset, one needs two essential steps: 1) Counting: count the observed number n of occurrences of motif M in the dataset; 2) Significance: compute the p-value of observation n with respect to a reference model.
In this chapter, we give all the necessary details to perform these two steps using state of the art approaches including some unpublished results.  (Cornish-Bowden, 1985) alphabet (DNA), multiple alignment (proteins), sequence logo (proteins), consensus pattern (proteins), and frequency matrix (DNA). Various sources including ReBase (Roberts et al., 2010), PROSITE (Sigrist et al., 2010), and JASPAR databases (Bryne et al., 2008).
or confirmed by experiments) in the form of a multiple alignment or a frequency matrix from which can be derived a consensus. This consensus could sometimes be a simple string (ex: AGCGG the chi site of B. subtilis) but in most cases it is a degenerated pattern (ex: CAYNNNNNRTG a restriction site in the IUPAC alphabet, PROSITE signatures). In all cases however, it is possible to consider our biological motif M as a (possibly large) set of strings. Formally, let M be a finite set of strings over a finite alphabet A. Ex: A = {A, C, G, T} for DNA sequences; this is the alphabet we are going to use from now on in our examples. Let X 1:ℓ = X 1 ...X ℓ be an observed sequence of length ℓ over A. Then the number N(M; X 1:ℓ ) of A ⋆ M being the set of all finite sequences over A ending with one element of W (this notation will be explained in the next section), and where 1 A is the indicator function of event A.
In the particular case where M contains no strings that are included into each other (which is a common assumption), the number N of matching position corresponds exactly to the number of occurrences. However, there is no need to put any restriction on M as long as we are interested in the number of matching positions like we do. From now on, if the sequence X 1:ℓ is observed, we denote by the number of matching positions by n, and if the sequence X 1:ℓ is random, we simply denote by N the random number of matching positions.

Regular languages
Let us denote by A ⋆ the set of all finite sequences over A. Any subset L⊂A ⋆ is then called a language over A. We denote by P (A ⋆ ) the set of all possible languages over A. We denote by ε ∈A ⋆ the empty sequence, and for the sake of simplicity, the singletons of P (A ⋆ ) will be simply denoted by their element. Ex: We define on these languages three regular operations:  The precedence rule of these operations is: | (lowest precedence), · (associative operator), ⋆ (highest precedence). Ex: A|C · T ⋆ =(A|(C(·T ⋆ )), TT · A|C ⋆ · G =((TT · A)|((C ⋆ ) · G)). We call regular expression over A any algebric expression over P (A ⋆ ) defined from singleton elements and a finite number of regular operations. The resulting language is called a regular language. Ex: any finite language is a regular language, A ⋆ is a regular language, (A|C|G|T) ⋆ GGATG is a regular language, {AG, AAGG, AAAGGG,...} is not a regular language.

Non-deterministic finite automaton
A Non-deterministic Finite Automaton (NFA) is defined as a 5-tuple (A, Q, σ, F , δ) where: A is a finite alphabet, Q is a finite state space, σ ∈Qis the starting state, F⊂Qis the set of final states, and δ : Q×A→P(Q) is the transition function. An element X 1:ℓ ∈A ⋆ is accepted by this NFA if and only if it exists a path from the starting state to one of the final state that sequentially use the letters X 1:ℓ in the transitions. More formally, it means that it exists a sequence of states (ie: elements of Q) q 0 = σ, q 1 , q 2 ...,q ℓ−1 , q ℓ ∈Fsuch as q i ∈ δ(q i−1 , X i ) for all 1 i ℓ. The language of a NFA is the set of all elements of A ⋆ it accepts. Theorem 1. For any language L∈P(A ⋆ ): L regular ⇐⇒ it exists a NFA whose language is L.
We admit that the language of a NFA is always regular (see Hopcroft et al., 2001, for the formal proof) but we will prove the reciprocal with the Glushkov's construction (Allauzen &  Mohri , 2006). This construction provides a simple way to build the NFA directly from the regular expression of the language. The idea is to treat the regular expression as any algebraic expression with a stack of operands (NFAs) and a stack of operators (regular operations). Since a regular expression is by definition built from singleton elements of A ⋆ and the three regular operations, we only need to give the construction of a NFA corresponding to singleton elements, and the constructions corresponding to the regular operations.
Union: the union (A, Q, σ, F , δ) of two NFAs (A, Q 1 , σ 1 , F 1 , δ 1 ) and (A, Q, σ 2 , F 2 , δ 2 ) is given by: Concatenation: the concatenation (A, Q, σ, F , δ) of two NFAs (A, Q 1 , σ 1 , F 1 , δ 1 ) and (A, Q, σ 2 , F 2 , δ 2 ) is given by: Using Glushkov's construction, it is then possible to build a NFA whose language correspond to the regular expression of our choice. However in general, this construction is not optimal in terms of number of states. Fortunately, the reduction algorithm (Algorithm 1) due to Hopcroft provides a (partial) solution to this problem. Note that finding a minimal NFA for a given regular expression is a difficult task in general, but that Hopcroft's reduction is a good heuristic (we will see later that in the case of DFA, Hopcroft's reduction is indeed a minimization).

Counting with NFA
NFAs provide with Algorithm 2 an extremely efficient way to look for matching positions of any motif M (in fact, any regular expression) in a sequence X 1:ℓ . The algorithm directly results from the definition of the language of a NFA. Let us illustrate this algorithm with a toy example: how to find all matching positions of M = G(G|C)G in X 1:12 = AGCGGTGGGCGA ? We first use Glushkov's construction and Algorithm 1 to obtain on Fig. 3 a minimal NFA whose language is (A|C|G|T) ⋆ G(G|C)G. Then we directly apply Algorithm 2 starting with S = {0}: • i = 9, X 9 = G, S←δ({0, 1, 2}, G)={0, 1, 2, 3}, matching position; • i = 10, X 10 = C, S←δ({0, 1, 2, 3}, C)={0, 2}.
We hence return three matching positions: 4, 9 and 11. One should note in this example that in twice occasions, we need to recompute a previously computed transition (i = 7 and i = 11). Obviously, this kind of event is likely to appear very often when working with longer sequences. It is hence a natural idea to store in memory previously computed transitions. This approach, known as lazy determinization (Green et al., 2004), speeds up considerably pattern matching (reducing the complexity from O(|Q| × ℓ) to O(ℓ)) at the expense of a higher memory usage. We will see later that the amount of memory needed can increase exponentially with the NFA size |Q|; this problem is usually addressed by allocating a fixed amount of memory to a buffer of computed transitions which is flushed when full.

Significance
Since we now have efficient algorithms to count the number of occurrence of a motif M in a sequence X 1:ℓ , let us deal with the significance of an observation n.

www.intechopen.com
Significance Score of Motifs in Biological Sequences 7

Reference model
The choice of a reference model is obviously a key point. Since biological sequences like DNA or proteins are known to have unbalanced letter compositions, it is hence clear that our model should at least take into account this source of bias. A natural parametric approach 1 is hence to model X 1:ℓ as a i.i.d. sequence with P(X i = a)=π(a) ∀a ∈Awith all π(a) ∈ [0, 1] and ∑ a∈A π(a)=1. This model is called model M0 with parameter π. For example, in the complete genome of HIV1 (Genbank AF033819) we observe the following counts: 3272 A, 1642 C, 2225 G, and 2042 T. The maximum likelihood estimates of a M0 model based on this observation is then: For example, we observe 971/3272 = 29.68% of G after a A, but a G occurs after a C only 82/1641 = 16.41% of the time. This phenomenon is directly explained by the fact that the di-nucleotide CG tend to be easily methylated (see CpG island, Fatemi et al., 2005). Is hence tempting to take into account the frequencies of di-nucleotides in our reference model, or tri-nucleotides, or more, which naturally leads to Markov models. For any d 0, we denote by Md the (homogeneous) Markov model of order d defined for any i d + 1, a ∈A d , and b ∈Aby: where π denotes the transition matrix of Md. This model is clearly defined conditionally to X 1:d . The maximum likelihood estimator π is then given for all a ∈A d , and b ∈Aby: where n ab are the observed counts of word ab in the training dataset.
When working with Markov model and biological sequences, a recurrent question is: what order d should I choose for my reference model ? This is a classical model selection problem which can easily be solved using penalized likelihood criteria like BIC or AIC (Liddle, 2007). For example, using the BIC criterion, one would select d = 1 for the complete genome of HIV1 (ℓ ≃ 10kb), and d = 5 for the complete genome of E. coli (ℓ ≃ 4.6Mb). However, since our objective is the significance of motifs counts rather than the modelization of biological sequence in itself, we suggest a different approach. First, it is critical to realize than working with a model Md as reference model allows to take into account the sequence composition bias in (d + 1)-mers. Hence, with d = 1 one takes into account the composition bias in di-nucleotides, and with d = 5, one takes into account the composition bias in hexa-nucleotides. The decision could then be based on the information one wishes to include in the reference model; working on coding sequences, one might wish to take into account at least the codon bias hence resulting in the choice of d 2. On the other 8

Will-be-set-by-IN-TECH
Require: for all a ∈Ado 4: if ∃j, q j = S then 6: end for 12: end for Output: return (A, Q 2 , q 0 , F 2 , δ 2 ) Algorithm 3. Determinization. Build a DFA which recognizes the same language than the original NFA.
hand, it would obviously be pointless to use a reference model of order d = 7 to study a motif of length 8 or less. Another critical point to keep in mind is that motif significance is by nature very sensitive to the parameters of the reference model. In order to convince us, let us a consider the following simple example with M = GGATG, a reference model M0 of parameter π, and ℓ = 1, 000, 000. If If we admit that the standard deviation of N ℓ is roughly equal to σ = 25 (we will see later on how to perform such computation), an observation of n = 550 could be interpreted as a significant over-representation with the first parameters, and a significant under-representation with the second parameters (observation n deviates from the expectation by more than three standard deviations in both cases). The reason behind this is that parameter values are typically involved in complex products when evaluating the significance of an observation, and that such operations usually increase small variations rather than averaging them (like with sums). This problem have been investigated in Nuel (2006c) where it is shown that unwise choices of d might lead to many false positive results.

Monte-Carlo simulations
Since the theoretical distribution of N not easy to obtain, it is tempting to study it from the empirical point of view by performing simple simulations. The approach is quite straightforward: 1) generate a random dataset i according to the reference model; 2) count the number of occurrence n i of M in the dataset; 3) repeat 1) and 2) until we have a sample n 1 , n 2 ,...,n r .
Once a reference sample have been obtained, we can derive the empirical p-value of the observation n using: Bioinformatics -Trends and Methodologies

www.intechopen.com
Significance Score of Motifs in Biological Sequences 9 or, alternatively, one might use this sample to derive empirical expectation, variance, and z-score: If this approach is quite simple, it suffers several drawbacks: 1) it is slow; 2) sample size must be large to obtain accurate results. Indeed, if the true p-value is p, then p ∼B (r, p) where r is the sample size. The following table gives a 90% upper bound confidence for p for several value of r in the case where p = 10 −5 : r 10 3 10 4 10 5 10 6 10 7 10 8 bound 1.00 × 10 −3 1.00 × 10 −4 3.00 × 10 −5 1.50 × 10 −5 1.14 × 10 −5 1.04 × 10 −5 we clearly see that it requires at least r = 10 6 samples to obtain the first accurate digit in p, and a prohibitive r = 10 8 samples for the second digit. Considering that very small p-value are easily encountered in motif significance (ex: 10 −20 ,10 −50 ,10 −100 ), it is clear that empirical p-value have a limited interest in this context. Empirical z-score does not suffer the same drawback but makes the implicit assumption that N has a Gaussian distribution which is highly questionable as we will see later on. For completeness, let us point out that importance sampling techniques might solve the estimation problem by sampling N using a tailored dataset distribution (Chan et al., 2010). However, these sophisticated numerical techniques are slow and requires a good skills to be implemented.

181
Significance Score of Motifs in Biological Sequences

www.intechopen.com
We start with a NFA whose language is A ⋆ M from which we build a DFA (A, Q, q 0 , F , δ) using the determinization algorithm (Algorithm 3). A DFA differs from an NFA only by the definition of its transition function: δ : Q×A → P(Q) for a NFA, and δ : Q×A → Q for a DFA. For example, we can see on Figure 4, a (minimal) DFA whose language is (A|C|G|T) ⋆ G(G|C)G. This DFA has more states (6) than the corresponding NFA (4). In fact, since the state space Q 2 of a DFA corresponds to a subset of the parts of the original NFA state space Q 1 , we have |Q 2 | 2 |Q 1 | . Fortunately, this upper bound is seldom reached in practice.
Theorem 2 (Markov chain embedding for Model M0). Let (A, Q, σ, F , δ) be a (minimal) DFA whose language is A ⋆ M. Let X 1:ℓ be a random sequence generated by the M0 model of parameter π. We consider the sequence Z 0:ℓ recursively defined by Z 0 = σ, and Z i = δ(Z i−1 , X i ) for all 1 i ℓ. Then Z 0:ℓ is an order 1 Markov chain whose transition matrix T is defined for all p, q ∈Qby: and having the following property for all 1 i ℓ: For example, if we consider the DNA motif G(G|C)G and the corresponding DFA of Figure 4, we get the following transition matrix: In order to extend Theorem 2 to order Md with d > 0 it is necessary to build DFA (A, Q, σ, F , δ) be a (minimal) DFA whose language is A ⋆ M and with the property that for all q ∈Q , past(q)={a ∈A d , ∃p ∈Q , δ(p, a)=q} is either empty or a singleton. A DFA having this property is called a order d DFA by Lladser (2007), and is called non d-ambiguous by Nuel (2008a). The construction of such a (minimal) DFA is not very complicated but is a bit technical. A possible approach suggested by Nuel (2008a) consists in starting from a DFA without this property and duplicating any "ambiguous" state. Another more straightforward approach consists in adding the elements of A ⋆ A d to the original language with a specific label for the final states corresponding to each elements of A d , and to keep these labels during minimization and determinization algorithms.
Theorem 3 (Markov chain embedding for Model Md). Let (A, Q, σ, F , δ) be a (minimal) order d DFA whose language is A ⋆ M. Let X 1:ℓ be a random sequence generated by the Md model of parameter π. We consider the sequence Z d:ℓ recursively defined by Z d = δ(σ, X 1:d ), and Then Z d:ℓ is an order 1 Markov chain whose transition matrix T is defined for all p, q ∈Qby: and having the following property for all 1 i ℓ: X 1:i ∈A ⋆ M⇐ ⇒Z i ∈F.
One should note that Z d:ℓ is defined on δ(σ, A d A ⋆ ) which could be slightly smaller than Q. This subset corresponds to the states of Q having a order d past. If we consider the DFA of Figure 5, d = 1, and with X 1 = A, we see that the Markov chain Z d:ℓ is defined on {1, 2, 3, 4, 5, 6, 7, 8} by Z 1 = 1 and the following transition matrix: From now on, we assume that our motif problem with Md reference model is embedded into the Markov chain Z d:ℓ whose transition matrix is decomposed into T = P + Q where matrices P and Q are defined for all p, q by: P(p, q)=T(p, q)1 q/ ∈F , and Q(p, q)=T(p, q)1 q∈F .

Main results
We present here the main results that are then used to derive exact computations and various approximations of S(n). In all this section, we assume that N is the random number of occurrences of M in X 1:ℓ , a sequence generated by a Md model (X 1:d being fixed) with d 0.
we denote by T = P + Q be the transition (L × L) matrix of the Markov chain embedding of the corresponding problem. We also introduce two vectors: u a1× L vector filled with '0' and having a '1' in the position corresponding to X 1:d , and v a L × 1 vector of '1'.

Proposition 4 (probability generating function). If we denote by G(y)=E[y N
] the probability generating function (pgf) of N, then we have: 183 Significance Score of Motifs in Biological Sequences

www.intechopen.com
Proof. The first equality derives directly from the definition of G(y). For the second equality now, it is clear that u(P + Q) ℓ−d gives the marginal distribution of Z ℓ . We then connect this distribution to N by counting the number of times we use the transitions of Q with the dummy variable y so that u(P + yQ) ℓ−d gives the joint distribution of (Z ℓ , N). Finally, we sum up the contributions of all states using the product with v.
Proof. The formal proof can be found in Nuel (2010) in a slightly less general case. Here we prove it only for the first two derivatives in the particular case where ℓ − d = 3. Starting from G(y)=u(P + yQ) 3 v we get: G ′ (y)=u Q(P + yQ) 2 +(P + yQ)Q(P + yQ)+(P + yQ) 2 Q v and G ′′ (y)=2u Q 2 (P + yQ)+Q(P + yQ)Q +(P + yQ)Q 2 v which are easily connected to the terms coefficients of z 1 and z 2 in u(P + yQ + zQ) ℓ−d v.

Exact computations
As we have seen above, Proposition 4 provides a way to obtain the whole distribution of N by computing G(y)=u(P + yQ) ℓ−d v from which we can easily derive S(n) for any n 0: .
From the algorithmic point of view, there are basically two approaches to compute S(n) using Expression (12). The first one, called power, consists in computing (P + yQ) ℓ−d using the power method and a binary decomposition of ℓ − d. Ex: if ℓ − d = 1097 then ℓ − d = 2 10 + 2 6 + 2 3 + 2 0 . We then just have to recursively compute D k (y)=( P + yQ) 2 k using the relation D k+1 (y)=D k (y) × D k (y) for all k 0. Since in the computation of S(n) we are only interested in terms of degree n or less (or n or more), we can easily truncate 2 all polynomials at degree n thus dramatically reducing the computational costs of polynomial products. We end up with a O(log 2 ℓ × n 2 × L 3 ) complexity in time where L is the order of the transition matrix T = P + Q. The corresponding memory complexity is O(log 2 ℓ × n × L 2 ). Since the length ℓ of the dataset appears in a logarithmic scale in these complexity, the power approach is obviously suitable for large datasets (ex: ℓ = 10 6 or ℓ = 10 9 ). Unfortunately, the cubic complexity with L (quadratic in memory) prevents the approach to deal with complex motifs with high L. One should also note that the quadratic complexity in n could really be a problem when dealing with frequent motifs and/or large datasets. In order to overcome this problem, Ribeca & Raineri (2008) suggested to use fast Fourier transforms (FFT) to perform all polynomial product hence replacing n 2 by n log 2 n in the time complexity. However appealing at first glance, this approach is not recommended in practice since the FFT products in floating-point arithmetics induce numerical instabilities that make totally unreliable the smallest coefficients of the polynomials. And unfortunately, these coefficients are precisely the one needed to study the tail distribution of N.
Another interesting approach called full recursion, consists in computing v i =( P + yQ) i v for all 0 i ℓ − d recursively using the relation v i+1 =(P + yQ)v i . There are two main interests for this approach: 1) we have only products between polynomials of degree 1 and polynomials of degree n (by dropping terms of degree greater than n like in the power approach); 2) we can take full advantage of the sparse structure (only L ×|A|non-zero terms in the worst case) of the transition matrix T = P + Q. The resulting complexity is O(ℓ × L ×|A|×n) in time, and O(L × n) in memory. Since these complexities are linear with L, this approach is able to handle very complex motifs. The drawback is that the approach can be very slow when dealing with large ℓ and n. It exists a sophisticated version of this recursion called partial recursion (see Nuel & Dumas, 2010) which allows to replace ℓ × n by log ℓ × n 2 in the time complexity. However, the quadratic complexity in n and numerical instabilities in floating-point arithmetic restrains its use to small n (ex: n 10). For completeness, let us point out another approach to the problem. The idea is that we can derive from Expression (12) the following expression: where I is the identity matrix and N ℓ the number of matching position in X 1:ℓ . It is then possible to obtain P(N ℓ = n) for any ℓ and n using (fast) Taylor expansions of G(y, z). For the mathematician, this approach is so "natural" that it is often referred as the "golden" approach to the problem of motif significance (Nicodème et al., 2002). However, this approach suffers several severe drawbacks that dramatically limits its practical interest: 1) the approach needs sophisticated computer algebra systems to be implemented (rather than simple floating point arithmetic for the previous approaches); 2) the explicit computation of (I − Pz + yzQ) −1 could be very time (and memory) consuming; 3) even if the explicit computation of the inverse matrix is avoided (which is highly advisable), the coefficient extraction using state of the art techniques (like high-order lifting for example) is often slower than the much simpler alternative developed above (see Nuel & Dumas, 2010, for details). Considering either the power or the recursion approaches we obtain easy to implement algorithms allowing to compute the exact value of S(n) in all cases except when dealing with high complexity motifs (large L) and/or frequent motifs (large n). But even if we stick to more tractable cases, exact computations could be slow. The question hence is: is it possible to compute fast and reliable approximations of S(n) ?  Table 1. Characteristic moments the number N of occurrences of motif M = G(G|C)G in a sequence X 1:ℓ generated by a M0 model with parameters π(A)=π(T)=0.10 and π(C)=π(G)=0.40. Computation performed using the power approach.

Near-Gaussian approximations
Since the random count N is basically defined by Eq.
(2) as large sum of Bernouilli variables, the idea of approximating the distribution of N using Gaussian approximation sounds appealing. Indeed, Gaussian approximations are historically the first ones to have been suggested for this problem (Cowan, 1991;Kleffe & Borodovski, 1997;Pevzner et al., 1989;Prum et al., 1995). From the theoretical point of view, Central Limit Theorems (CLT) for weakly dependent variables ensure that N is asymptotically normal distributed. On Table 1, we can see the characteristic moments of N for motif M = G(G|C)G and various value of the sequence lengths ℓ. According to theory, we observe that the skewness and excess kurtosis both decease toward 0 when ℓ grows (a normal distribution has null skewness and excess kurtosis). But it is also clear that N is not normally distributed for small values of ℓ.A sa consequence, the quality of a Gaussian approximation for S(n) is expected to be questionable at finite distance. In order to overcome this issue, Nuel (2010) suggested to consider near Gaussian approximations instead of simple Gaussian approximations for this problem. The idea is simply to perform a higher order asymptotic development that exploits more than the two first moments of N. This technique is known as the Edgeworth's expansion. Blinnikov & Moessner (1998) gives a general (and rather complicated) formula for this expansion. For practical purpose, we present the result only up to order 3 expansions.
Proposition 7 (Edgeworth's expansion). If we denote by ϕ(z)=exp(−z 2 /2)/ √ 2π the probability distribution function (pdf) of a standard Gaussian, then for all n 0 we have the following approximation: with where µ = E[N], σ = V[N], z =( n − µ)/σ, S k = κ k /σ 2k−2 for all k 1, and where H k (z) are the Hermite polynomials defined recursively by H 0 (z)=1 and H k (z)=zH k−1 (z) − H ′ k−1 (z) for all k 1. For h ∈{0, 1, 2, 3} we define the Near Gaussian (NG) approximation of order h of S(n) by: We can see on Figure 6 the reliability of NG approximations. In solid black, the order 0 approximation corresponds to the classical Gaussian approximation. Unsurprisingly, this central limit approximation is accurate for the center of the distribution (n close to the expectation µ = 153.3), the reliability quickly deceases when |n − µ| increases. Central limit theorems (CLT) for N have established long ago that N should be asymptotically Gaussian distributed. The problem however with CLT theorems is that the quality of the resulting approximation dramatically decreases at finite distance when considering tail distribution events. Here we try to overcome the issue by considering Near-Gaussian approximations that exploits higher moments of N to improve the quality of the approximations. In order to do this, a critical problem is first to obtain the first k-th moments of N. Of course we can access these moments by computing the full distribution of N, but if it is possible to do so, why bothering with approximations. We hence need an method to compute the moments of N whose complexity should be somehow significantly smaller than the complete exact computations. With higher order approximation, we can see a dramatic improvement of reliability of the results, with a noticeable increase of the region where at least two digits are correct (up to n ∈ [80; 240] for NG 3 ). Thanks to NG approximations, we hence have a fast and reliable way to compute an approximation of S(n) when n falls in the center of the distribution (ex: |S(n)| 3.0), but NG approximations unfortunately remain totally unreliable for tail distribution events (ex: |S(n)| > 3.0), which are moreover often precisely the event of interest. Fortunately we have a solution to this problem.

Bahadur-Rao
We want here to study specifically the tail distribution of N with events on the form P(N n) with large n (or P(N n) with small n). For all t > 0 let us first notice that we can use the Markov inequality to write: P(N n)=P(e tN e tn ) E[e tN ]/e tn = exp(Λ(t) − tn).B y taking the smallest of these bounds for t > 0 we hence get: log P(N n) Λ(τ) − τn with τ defined by Λ ′ (τ)=n. This upper bound, known as the Chernoff's Bound (CB), is often surprisingly sharp for events located in the tail distribution. By dealing symmetrically with P(N n) and t < 0 we hence obtain the following approximation for S(n): where δ n = −1ifn E[N], and δ n =+1ifn > E[N].

189
Significance Score of Motifs in Biological Sequences

www.intechopen.com
From the computational point of view, the solution τ of Λ ′ (τ)=n can be easily determined numerically using (for example) using the Newton-Raphson sequence (Press et al., 1992). Starting for a first guess t 0 (ex: t 0 = 0), one performs t i+1 = t i +(n − Λ ′ (t i ))/Λ ′′ (t i ) for i 0 until convergence to τ. The computation of Λ, Λ ′ , and Λ ′′ being possible thanks to Lemma 5 and the following formulas: with Moreover, this bound can be further refined using the Bahadur-Rao Theorem (Bahadur & Rao, 1960) and gives the following approximation for S(n): On Figure 7 we can see the reliability of the approximations CB(n) and BR(n). Unsurprisingly, the farther from the center of the distribution, the better are both approximations. We also observe that BR(n) is a dramatic improvement over CB(n) since it obtains at least two correct digits of S(n) for all n but on [120,200]. At the end of previous section, we have seen that the order 3 NG approximation achieves the same precision for region [80; 240], hence, by combining both NG 3 (n) (for the center of the distribution) and BR(n) (for the tail distributions), one can achieve at least two correct digits of S(n) on the whole bulk of the distribution for a modest computational cost.

Discussion
Obtaining the distribution of motif count in random sequences is a very challenging problem that has attracted considerable attention from mathematicians and computer scientists in the last fifty years. Recently however, a significant advance has been obtained by connecting the well-known theory of pattern matching and automata to the Markov chain embedding technique Lladser (2007); Nuel (2008a); Nuel & Prum (2007). Thanks to this finding, it is now possible to deal with simple (runs of 1 in binary sequences, single words, etc.) or complex motifs (PROSITE signature, gapped motifs, etc.) using the same general framework. Using exact approaches, it is possible to obtain efficiently the first moments of any motif count N, and even the complete distribution of N. As a consequence, the computation of S(n) is now tractable for a wide range of motif problems including large datasets or complex motifs. However, the case of complex frequent motifs in large datasets remains an open problem (Nuel & Dumas, 2010).
As an alternative to exact computations, a wide range of approximations have been developed (see Lothaire, 2005;Nuel, 2006b;Reignier, 2000, for a review). We can basically classify these approximations in three categories: 1) Gaussian approximations (Cowan, 1991;Kleffe & Borodovski, 1997;Nuel, 2010;Pevzner et al., 1989;Prum et al., 1995); 2) Poisson approximations Erhardsson (2000); Geske et al. (1995); Godbole (1991); Reinert & Schbath (1999); Roquain & Schbath (2007); 3) large deviations approximations Denise et al. (2001);Nuel (2004). In this chapter we deliberately left aside the Poisson-based approximations and considered only two of these approximations: the (Near-) Gaussian approximations with NG h (n), and the large deviations based approximations with CB(n) and BR(n). The reason why Poisson-based approximations are not considered here is basically practical, these approximations cannot be directly derived from the formalism of this manuscript and require the introduction of many tedious notions like clumps, overlapping words and so on. However, we compare here the performance of all these approximations (including compound Poisson approximations) in the case where X 1:ℓ generated by a M0 model with parameters π(A)=π(T)=0.10 and π(C)=π(G)=0.40 i.i.d. DNA sequence, and for two motifs: the frequent G(G|C)G, and the rare A(A|T)A.
We can see on Figure 8 the relative error (in log-scale) for all approximations. For Gaussian approximations, performances are only good in the very center of the distribution (for n very close to E(n)) for the frequent motif G(G|C)G, and performances are poor almost everywhere for the rare motif T(A|T)T. This observation to consistent with the well known claim that "Gaussian approximations a more suitable for frequent motif" (Lothaire, 2005). It has however to be pointed out that even in the most favorable case (with highly frequent motif), Gaussian approximations totally fail to capture the tail distribution of N and hence not suitable for the highly significant observations we usually encounter in biological sequences (Nuel, 2006b). If we consider now the near-Gaussian approximation, taking into account more moments of N dramatically improve the result for both motifs, but the failure to deal with extreme distribution events remains. Compound Poisson approximations are known to be extremely sensitive to the relative abundance of the motif of interest in the sequence, being more accurate for rare motifs (Lothaire, 2005;Roquain & Schbath, 2007). It is hence not a surprise to see that Poisson approximations are totally unreliable for the frequent motif G(G|C)G. For the rare motif T(A|T)T we naturally obtain much better results but like for Gaussian approximations, and even in this favorable case, reliability decreases in the tail distribution. Considering that Poisson

191
Significance Score of Motifs in Biological Sequences www.intechopen.com approximations are not easily generalizable to motifs defined by regular expressions, that their computations could be complicated and time consuming, and that their reliability is highly questionable in some configurations, it seems advisable to avoid their use is most cases. With large deviations based approximations, we unsurprisingly get a low reliability in the center of the distribution, but a high reliability in the tail distribution. With Bahadur-Rao precise approximations, the improvement over the classical Chernoff's bound is quite impressive, and the complementarity with Near-Gaussian approximations clearly shows that a combination of both approaches could be a very efficient way to obtain reliable approximations of S(n) for all n.
In this chapter we gave all the necessary ingredients to assess the significance score of motif in a biological sequence using state of the art results, including several unpublished ones: Lemma 5 which is an extension of the results of Nuel (2010), and the complete "Bahadur-Rao" Section which provides interesting improvements over previous large deviations work (Denise et al., 2001;Nuel, 2004).
Let us finally point out that for the sake of compactness, we have left aside some interesting questions and extensions like: approximate matching Hopcroft et al. (2001), renewal occurrences (Nuel, 2006b;Roquain & Schbath, 2007), joint distributions (Nuel, 2008b;Stefanov & Szpankowski, 2007), dataset with many sequences , and sensitivity to parameter estimation (Nuel, 2006c). Even if some results are already available for these problems, many questions still have to be answered in the exciting and challenging field of the distribution of motifs in random sequences.