Open access peer-reviewed chapter

Research on Pattern Matching with Wildcards and Length Constraints: Methods and Completeness

By Haiping Wang, Taining Xiang and Xuegang Hu

Submitted: December 6th 2011Reviewed: May 14th 2012Published: November 28th 2012

DOI: 10.5772/48574

Downloaded: 1559

1. Introduction

The practical importance of the string matching problem should be obvious to everyone. For typical word-processing applications, immense amounts of work have been done on this subject. However, with the developments in bioinformatics (Cole et al., 2005), information retrieval (Califf et al., 2003), pattern mining (Xie et al., 2010; Ji et al., 2007 ; He et al., 2007), etc, sequential Pattern Matching with Wildcards and Length constraints (PMWL) has attracted more and more attention. It is not difficult to think up realistic cases where PMWL plays an important role. In Dan Gusfield’s book (Gusfield, 1997), they give an example about transcription factorto illustrate the concept of wildcard. A transcription factoris a protein that binds to specific locations in DNA and regulates the transcription of the DNA into RNA. In this way, production of the protein that the DNA codes for is regulated. Many transcription factors are found and can be separated into families characterized by specific substrings containing wildcards. They use Zinc Finger, a common transcription factor as an example. It has the following signature:


Where CYS is the amino acid cysteine and HIS is the amino acid histidine. They also give a conclusion that if the number of wildcards is bounded by a fixed constant, the problem can be solved in linear time.

Another respective example is about promoter. In bioinformatics, promoterwill help researchers to quickly locate the starting position of the intron from hundreds of millions of the sequence of ACGT. Among these promoters, TATAbox is a common one (Manber & Baeza-Yates, 1991). It has very loose sequence specificity, so many TATAsequences are not TATAbox. As a result, indirect positioning by pairs of sites is needed. The commonly used one is CAATCTsequence. The DNA sequence TATAis a common promoter that often occurs after the sequence CAATCTwithin 30-50 wildcards. Therefore, matching patterns with wildcards becomes especially crucial in exploring valuable information from DNA sequences.

There are many applications that involve pattern matching with wildcards and various researches have provided many solutions to different forms of this problem. Fischer and Paterson were the first to generalized pattern matching with wildcards (Fischer & Paterson, 1974): given a pattern Pand a text T, either of which may contain wildcards, denoted by ¢, the goal is to locate all P’s occurrences in T. ¢ can match any letter in a given alphabet, such as a¢¢c¢t. Unlike previous work, Chen, et al. proposed a PMWL problem integrating two problems(Chen et al., 2006): one is complex local constraints which means the user can specify a different range of wildcards between each two consecutive letters of P, for example, a¢[0,3]c¢[1,3]t. Another one is global length constraints. The user can constrain the length of each matching substring of Tin which Poccurs. Therefore, flexible constraints of wildcards conduct flexible jump of the matching positions. The definition of PMWL problem is an extension of Fischer and Paterson’s definition and the introduction of complex local constraints increase the flexibility. On one hand, this definition of pattern is more suitable for areas such as bioinformatics; on the other hand, the size of the matching candidate positions is in the exponential increment which greatly increases the complexity of the problem solving.

Figure 1.

The flexibility and complexity of PMWL problem

From a view of practical point, they also proposed two issues: with and without the one-offcondition (Chen et al., 2006; Min et al., 2009). In their problem definition, users have more flexibility to search on sequences and the one-offcondition has both theoretical and practical significance. One-offcondition means that every letter in Tcan be used once at most. In practical applications, with and without the one-offcondition has practical meaning in specific areas. For example, in sequential pattern analysis in data mining, Pcan be treated as a candidate shopping pattern, the user is interested in how frequently Poccurs in one document, it makes sense to count each occurrence for once, what is more, the one-offcondition also makes the problem solving possible. However, under the one-offcondition, how to allocate limited text resource to each matching occurrences, in order to obtaining the maximum number of occurrences, belongs to optimization problem. In the allocation of resources, the matching of different letters in the pattern possess a strong correlation, which conducts the selection of matching positions in the combination of explosive growth. Since it is difficult to develop a complete matching strategy in this problem, almost existed algorithms for PMWL are using greedy matching strategies, which is the root reason why matching algorithm is not complete. This article will focus on SAIL algorithm (Chen et al., 2006) which is a representative algorithm for PMWL problem and will also describe RSAIL (Wang et al., 2010), SBO (Wu et al., 2011), BPBM (Guo et al., 2011) algorithm which are all designed to solve PMWL problem in different conditions. The each of above algorithms has its own characteristics in the data structures and matching strategies, which will be analyzed in this paper.

What is more, since the theoretic and practical importance of the definition of PMWL, we need to research the nature of this problem. To our best knowledge, there are still no efficient methods on this problem, because as for completeness of the problem, we still do not know whether it could be solved in polynomial time. In this article, we will research the completeness of PMWL under certain condition. In the traditional matching problem, description of pattern and text information is the key to the algorithm design, however, flexibility and complexity of the PMWL problem all depends on the pattern features, so this article will focus on pattern information, especially the pattern features including the size of alphabet, the length of pattern, the gapof wildcards in the pattern, etc. We will also investigate the relationship between pattern features and completeness, and use the approximate ratio judgment. Further more, since the definition itself is produced in realistic background, we need to consider the situation in real biological background and improve the solution of the problem. Based on the above, we choose this topic as a research object in this book.

This capture is organized as follows: In section 2 we will give the development, definition and application of PMWL problem; Section 3 will show the representative algorithms, we will introduce their structure, strategy, complexity and completeness; Section 4 will analyze the PMWL problem completeness based on pattern features. We will give our conclusions in section 5.


2. Pattern matching with wildcards and length constraints

The sequential pattern matching problem is to given a Text Tand a pattern Pas input, and output all the occurrences of Pin T. After Fischer and Paterson’s work, there are a variety of non-standard definitions of the pattern matching problem: the approximate matching (He et al., 2007), the swapped matching (Amir et al., 2000), the Parameterized matching (Amir et al., 2009), etc. They all belong to Non-standard Stringologyproblem (Muthukrishnan, 1994). Many of them are still open problems.

2.1. The development of PMWL problem

After years of development, these Non-standard Stringologyproblems always focus on a problem: that is, how to conduct the traditional pattern matching definition to be more flexible to adapt the development of application. The don’t caresproblem always focus on how to combine the wildcards and the pattern. After Fischer and Paterson’s work, Cole et al. considered a slightly different problem (Cole et al., 2004), where instead of fixing the number of ¢s between two consecutive letters in Pand T, they fixed the total number of ¢s in P. The disadvantage of these problem definitions is that the number of ¢s is a constant but not a range. This limits flexibilities for the user’s queries. To alleviate the problem of a fixed number of ¢s, Kucherov et al. (Kucherov et al., 1995) proposed a solution to allow an unbounded number of ¢s between two consecutive letters in a given pattern. Given a set of such patterns, their objective is to find whether any of these patterns matches some substring of the text that does not contain any ¢.Obviously, allowing an unbounded number of ¢s still does not offer the users enough flexibilities to control their queries. Manber et al. (Manber & Baeza-Yates, 1991) proposed an algorithm for string matching with a sequence of wildcards. They considered the following problem: given two pattern strings Pand Q, each of which consists of letters, and an integer g, all occurrences of the form P¢0−gQin the text are returned. The number of ¢s between Pand Qis in the range of [0, g], and the text does not contain any ¢. This problem was so-called exact string matching with variable-length don’t cares. Chen et al. sum up all these definitions into three conditions (Chen et al., 2006): firstly, there is a wildcard between two consecutive letters in P, for example A¢[0,1]T¢[0,2]G¢[1,3]C; secondly, every letter in Tcan only be used once for matching; thirdly, there is a global constraint to limit the matching occurrence length. We call the problem satisfying above definition PMWL problem, which has been used in approximate matching, pattern mining, information retrieval, etc.

2.2. The potential applications of the PMWL problem

  1. Text Indexing: There is a large amount of hypertext information on the Internet. How to effectively obtain information that meet users’ needs is becoming more and more urgent. Text indexing is a method to solve this problem. How to determine the position of user-specified pattern (may contain wildcards) is a challenge task.

  2. Data stream is becoming more and more crucial in many new database applications such as data warehouse and sensor network. Mining dependence or association in large amounts of data flow has practical value and during which the sequential pattern matching with wildcards is the first and the most important step. In addition, in data mining, sequential pattern mining also search frequent patterns as in transaction sequence, a typical instance is the similar consuming pattern of many consumers, for example, buying a desktop, a laser printer, a digital camera and an LCD screen monitor in turn, between each of them exists a certain time interval. Mining such typical user mode, which is obviously a pattern matching with wildcards, will has a great influence on the market.

  3. Network Security: Pattern matching methods in network security and intrusion detection need high performance. A complete IDS (Intrusion Detection System) based on Snort rules needs to optimize hundreds of rules and many of them need to do pattern matching efficiently for the entire data partition of a package. Efficient pattern matching and mining with wildcards constraints give the system administrator a more flexible and accurate solution to locate the suspicious users.

2.3. Problem statement for PMWL

Definition 1Let Σbe an alphabet, T = t0t1…tn-1Σ*is called a text of Σwhere n= |T| is the length of T. A patternis a tuple P= (p, g) where p = p0p1…pm-1is a sequence of characters, which belong to the alphabetΣ, and g= g0g1…gm-2is a sequence of wildcards. And m= |P| is the length of P. The interval of wildcards between pi and pi+1 is denoted by gi= g(Ni, Mi) where 0 ≤ im- 1, called the local constraints. Niand Miis the upper and lower limit of wildcard. Such as P= a¢[1,3]g, where 1, 3 is respectively the lower and upper limit of local constraints. ¢[1,3] means the wildcards between aand gis referring to a string which length is 1~3. Given interval [minLen, maxLen], set globalLength= t[am-1] - t[a0] +1, if globalLength∈[minLen, maxLen], then it is called global constraint(Chen et al., 2006).

Definition 2Given a pattern P= (p, g), p = p0p1…pm-1, gi= g(Ni, Mi). The max{Mi- Ni} where 0 ≤ im- 1 is called the gapof local constraints, named Gap for short. For example, P= a¢[0,2]g¢[1,4]g, then gap= max{2 - 0, 4 - 1} = 3.

PMWL problem can be defined by the above definition:

Definition 3PMWL problem(Pattern Matching with Wildcards and Length constraints)

Pattern Matching with Wildcards and Length constraints meets the following conditions:

  1. ¢s can occur between each two consecutive letters in pattern and are independent to each other;

  2. ¢s between two consecutive letters can match a string which length is limited by local constraints, and the total length of pattern is limited by global constraint;

  3. One-offcondition is taken into consideration that every letter in Tcan only be used once for matching pj (0 ≤ jm- 1) and as soon as there exists one occurrence of Pin Twhen Tis being scanned from left to right it will be returned.

Definition 4Given a text Tand a pattern P, if there is a sequence of matching positions A= (a1, a2, …, am-1), where t[ai] = p[i] for every 0i<m1, we say Ais a matching occurrence of P. A set of occurrences A1, A2,..., Atconstitute an occurrence set Uwhere tis the number of occurrences, and also named matching numberin our paper.

Definition 5Let t be the matching number of A, if there is no occurrence set A’with the matching number t’, and t’> t, then Ais called a complete occurrence set. If there is another occurrence set U, with the matching number tu= t, the Uis equivalent to A. Specially, if Ais complete, and so is U. So the complete occurrence set is not always unique.

3. Algorithms for PMWL

The results of traditional matching algorithm are complete, so the focus of research is to improve the matching efficiency. As a kind of searching problem, the key to solving matching problem is how to use and extract information getting from text and pattern. KMP, BM algorithm uses automata to describe the pattern characteristics, and deposit information obtained from scanning during matching process into automata. Algorithm visits the automata, when the jump distance needs to be calculated, thus to avoid obtaining the pattern information repeatedly and to ensure the jump in matching process does not affect the final result. The basic idea the suffix tree is to use the tree structure to describe the text information, and to avoid scanning the same text repeatedly when matching a set of patterns. We believe that data structure and search strategy are crucial for traditional algorithms to access to information of text and pattern. Reasonable data structure is better to explore the potential of the computer, such as bit parallel technology, and can also be a more reasonable representation of the sequence information, such as automata. In addition, there exist the sliding window, indexes and other data structures. Reasonable matching strategy makes better use of sequence information. These strategies can approximately be divided into prefix searching, suffix searching and factor searching (Navarro & Raffinot, 2001).

Table 1.

Analysis of the traditional pattern matching algorithms

As different extension of traditional matching problem, PMWL problem, approximate matching, and swap matching all belong to the Non-standard Stringologyproblem. Problems in this field mostly belong to the optimization problem, and most of them have not yet been completely solved, such as PMWL and approximate matching problems with wildcards etc. What PMWL and traditional matching problem have in common are:

  1. From the view of algorithm itself, the data structures and matching strategies are as the key of algorithm design.

  2. From the view of describing the object, how to effectively describe the patterns and text information is the key to solve the problem.

What PMWL and traditional matching problem have in difference are:

  1. For PMWL, there is no complete solving yet, so algorithm evaluation criteria include both time efficiency and solution quality; but traditional matching algorithm is only concerned with matching time.

  2. The flexibility and complexity of the PMWL problem definition are reflected in the pattern, therefore, compared with traditional matching, PMWL pay more attention to the description of the pattern information. Pattern characteristics are extremely associated with the solution of PMWL problem.

Next, we will give the representative algorithms for solving PMWL problem, and detailed description of their design ideas from the perspective of data structure and matching strategy.

3.1. The SAIL Algorithm

Description of SAIL Algorithm (Chen et al., 2006):

Input: A text T= t0t1…tn-1, a pattern P= p0p1…pm-1, local constraints gi= g(Ni, Mi), global constraints [minLen, maxLen].

Output: Occurrences of Pin Tsatisfying the constraints.

The Steps of the algorithm:

  1. Location: ① Search position iwhere t[i] = p[m-1], and locate position kwhere t[k] = p[0] by considering the global constraint. ② Cut out a substring in Tfrom t[k] to t[i] named T’. ③ Build the table with the row and column according to T’and P.

  2. Forward: Scan the table forward, and mark all the positions satisfying the local and global constraints. They are the potential matching positions.

  3. Backward: Scan the table backward, and select the left-mostposition in the marked cells every row that compose an occurrence. Then mark them used.

Generally, SAIL starts from the beginning of Tto search position iwhere t[i] = p[m-1]. After that, SAIL conducts two phases, the Forwardphase and the Backwardphase. In the Forwardphase, SAIL determines whether there is a potential matching occurrence by using a search table. Afterwards, if a potential matching occurrence can be determined, Backwardphase is triggered out to output an optimal occurrence by using the left-moststrategy.

A running example for SAIL:

In this subsection, we show how SAIL works with a running example where P, Tand constraints are given as follows.

Table 2.

A running example for SAIL

Step 1. Scan the P[m-1], that is the letter ‘c’, in Tfrom left to right. The first matching position is 6, and then SAIL enters the Locationphase. Use the global constraint [6, 7] to locate P[0]’s position, that is the letter ‘a’. We get the scanning range is [6-6, 7-6]. However there are no matching in [0, 1]. Then SAIL move on.

Step 2. The second matching position is 7, and we can locate P[0] in position 2. Then we get the substring “aaggcc” from T. In this way global constraint is satisfied.

Step 3. Build a 4×6 table. The row stand for character in P, and the column is the substring. Then set the position pos[3][5], pos[0][0] and pos[0][1] to 1.

Step 4. Enter Forwardand set all the positions to 1 in the table, which satisfy the local constraints.

Table 3.

The constructed search table pos[j][i-start] when P[m-1] is 7

Step 5. Enter Backwardand select the left-most one from the marked positions in each row, and they are highlighted. In this way, we will get an occurrence {2, 4, 6, 7} and mark the four positions used.

Step 6. Go on to execute Locationand get the third matching position is 8, then we can build the table below. Notice the positions of 6, 7 have been used. Under the one-offcondition, all used positions (marked as * in Table) of Tare never considered for further matching again. If the one-offcondition is not considered, SAIL will get another two occurrences {2, 4, 6, 8} and {3, 5, 7, 9}. Then the Forwardphase returns false, and SAIL go back to Location.

Step 7. In position 9, the Forwardalso returns false. Finally, SAIL output only one occurrence {2, 4, 6, 7}.

Table 4.

The constructed search table pos[j][i-start] when P[m-1] is 8

The time complexity and completeness analysis:

O(SAIL) = O(n+ klmg) where nis the length of T, kis the frequency of P’s last letter occurring in T, lis the user-specified maximum length for each matching substring, mis the length of P, and g is the maximum gapof wildcards in P.

Two important issues, Online searchingand Optimization, are taken into consideration to design SAIL. As for optimization, under the one-offcondition, SAIL determines which occurrence is an optimal one if multiple occurrences end at a P[m-1]’s position by applying the left-moststrategy. As a heuristic algorithm, SAIL utilizes a kind of greedy strategy to select a set of occurrences; consequently, SAIL may obtain locally optimal solution which lead to losing occurrences in offline searching.

Form the above example, we can know that a complete occurrence set for text Tis {{2, 4, 6, 8}, {3, 5, 7, 9}}, but SAIL’s output is {{2, 4, 6, 7}}.

We believe that the SAIL’s data structure is based on the sliding window, the Locationalso uses the left-moststrategy, so it is possible to lose occurrences, for instance:

Figure 2.

A sliding window in SAIL

Obviously, in the above example, SAIL loses occurrences in offline condition because of the selection of character c’s matching position. For further observation, it is not difficult to find that character cappears in the pattern twice. If pattern is a¢[0,1]g¢[0,1]c, SAIL will get a complete occurrence set, that is, {{2, 4, 6},{3, 5, 7}}. Further experiments show that, the recurring appearances of pattern characters influence the quality of matching occurrences obtained by the algorithm. In next part, we will analyze the completeness of PMWL based on pattern features.

3.2. The RSAIL Algorithm

Description of RSAIL Algorithm (Wang et al., 2010):

Definition 6 Given a pattern P, if there are letters p[i] = p[j] where 0 ≤ im-1,0 ≤ jm-1, Pis called a pattern with Recurring characters, and R patternin short, such as a¢[0,1]c¢[0,1]c¢[0,1]t.

Definition 7 Given a pattern P, if all the letters in Pare different, Pis called a pattern with No-Recurring charactersand NRPatternfor brevity, such as a¢[0,1]c¢[0,1]g¢[0,1]t.

Definition 8 Given a pattern P, if there is a position isuch that p[i] = p[i+1] =……= p[m-1] where 1 ≤ i< m-1, Pis called a pattern with recurring tail characters and RT patternin short. Such as a¢[0,1]c¢[0,1]c. As we can see, the RT pattern is a special form of the R pattern.

From the above discussion, in the research of Chen et al., since they only concern about the on-line situation, their proof of SAIL’s completeness is incomplete, which is only suitable for the on-line situation. What is more, it ignores the interaction between different occurrences.

We find that SAIL satisfies the completeness under a certain restriction, i.e. the pattern with no-recurring character (NR pattern), such as a¢[0, 1]t¢[0, 1]g¢[0, 1]c¢[0, 1]. The concept of NR pattern has practical significance, for example, in text mining, where the text is a sequence of words, the NR pattern reflects the semantic relation between words.

We utilize the symmetry to scan the text and the pattern. Then convert an RT pattern into an R pattern.

  1. According to the characteristic of P, if it is not an RT pattern, we directly call SAIL; otherwise go to (2);

  2. Reverse Tand P, respectively get T’, P’;

  3. Call SAIL, and obtain the occurrences of P’in T’;

  4. Obtain the occurrences of Pin Tby coordinate transformation of the obtained solution.

Obviously, since the time of the identification of pattern’s characteristics is linear, O (RSAIL) = O (SAIL).

Experiments and Analysis:

We will give a set of experiments to illustrate two problems:

  1. Analysis of the complete extent of the SAIL algorithm;

  2. The comparison of the complete extent of RSAIL and SAIL algorithm;

Considering there is no algorithm can obtain the completeness occurrences of PMWL problem in polynomial time, we have developed a text generator (Xie et al., 2010) to generate experimental text, by which way we can know the completeness occurrences in order to analyze the complete extent of algorithm. In addition, the patterns used in these experiments are all RT patterns.

Table 5.

The parameters in the experiments.

The experimental results and analysis:

Figure 3.

The approximate ratio experimental results of RSAIL and SAIL

From the above images, SAIL itself is already a near-complete algorithm, in the above graphs, the average approximation ratio of SAIL is higher than 0.94;For the RT patterns, the completeness of RSAIL is better than SAIL in differentΣ, mand gap. Thus, not only a revised algorithm is obtained, the SAIL’s deficiency on handling RT patterns is also proved from another aspect.

3.3. The BPBM Algorithm

Description of BPBM Algorithm (Guo et al., 2011):

Like the SAIL algorithm, the BPBM algorithm also focuses on pattern matching in online sequential text with both flexible gapconstraints by user’s specification and the one-offcondition. BPBM is based on bit-parallel technology to simulate the matching process and adopt two nondeterministic finite state automatons (NFAs). One is a search mechanism to identify all pattern P’s suffix, and another one is a security window transition mechanism which accelerates the scanning process by dropping useless sequences in text.

BPBM has following characteristics:

  1. BPBM also uses the left-moststrategy to obtain the maximal occurrence of pattern in text, and return all these matching position sequences. This algorithm combines bit-parallel technology with nondeterministic finite state automatons. It also simplifies the calculation of shift distance of the security window transition, which gets good results. BPBM inherits the advantage of BM algorithm to skip some of characters in text, which conducts the algorithm with a sub linear average time complexity. Therefore, the time complexity of BPBM is lower compared to SAIL.

  2. Compared with Gaps-Shift-And algorithm and Gaps-BNDM algorithm, since they are all based on bit-parallel technology, their time performances are equal, however, because BPBM improves the formula of ε-transition in matching process, BPBM is fit for all patterns. In addition, BPBM returns a concrete set of matching position sequence, which makes it more applicable.

  3. Compared with SAIL, SAIL applies two-dimensional table as data structure, but BPBM is base on bit-parallel technology. Since the difference in data structure, BPBM has a better time performance. The similarity between these two algorithms is they all utilize the left-moststrategy, therefore, they are all heuristic algorithm with greedy strategy. In addition, the matching occurrences of these two algorithms are same and both incomplete.

3.4. The SBO Algorithm

Description of BPBM Algorithm (Wu et al., 2011):

Wu et al. propose a new nonlinear structure called Nettreeto deal with pattern matching with flexible constraints of wildcards. A Nettreeis a kind of directed acyclic graph (DAG) with edge labels. They apply a heuristic algorithm to select better occurrence (SBO). In this algorithm, they use two strategies: Strategy of Greedy-Search Parent, SGSP and Strategy of right-mostParent, SRMP to two occurrences of the same leaf, and select the better one as occurrence. The core idea of SGSP is finding an approximately optimal parent (AOP) of current node in each step; while the core idea of SRMP is finding the right-most parent node of current node in each step.

In off-lineconditions, owing to its heuristic strategy, SBO can obtain more occurrences than SAIL and BPBM in most cases, but it is still incompleteness. However, the time complexity of SBO algorithm is O(gap*n*(n+m2)) which is nonlinear of length of text. Further more, experiments show that, in general, SBO indeed consumes more time than SAIL. In SBO, the improvement of solution’s quality is relying on using heuristic strategies repeatedly, and this also consumes a lot of time. Therefore, we need consider the balance between completeness and time efficiency of algorithm. What is more, SAIL originally is not designed for off-linecondition, as an on-linealgorithm, it only has current information, so when applying it to off-linematching it will definitely be imperfect. But SBO uses global information to search occurrences, it also uses heuristic strategies to search on solution space. With the improvement of occurrences’ quality, there are two problems: more information need more place to store, so the space complexity of SBO is O(gap*m*n), the next one is more information means more calculations thus consuming more time.

3.5. Other algorithms

In many literatures, similar problems are defined and various algorithms are put out to solve certain problems. Morgante, et al. (Morgante, et al., 2004) described a structured model, which can be considered as ‘compound patterns’ made of a list of simple motifs and a list of intervals that specify at what distances adjacent motifs should occur. They gave a detailed description of the biological background of the problem definition. For example, many retrotransposons belonging to the Ty1-copia group contain a match of MT¢[115,136]MTNTAYGG¢[121,151]GTNGAYGAY, which consists of three patterns and two intervals. As the paper pointed out, structured motifs are called classes of Characters and Bounded Gaps (CBG) expressions in Navarro and Raffinot, but use of these expressions is quite different: the underlying motivation for CBG expressions is searching in database like PROSITE and a sequence of this kind is usually not very long, while structured motifs can be very long since gaps may span many letters. As we can see, the concept of CBG and structured motifs are all have practical meaning. Because of the different application background, they design different algorithms to solve their problems. From the application point, this paper also considered a problem of q-approximation match which means just finding partial motifs in the sequence. In this paper, they proposed a two-step procedure which is used in many algorithms for PMWL: firstly, finding the occurrences of all the component patterns; secondly, combining the occurrences that satisfy the distance constraints into a structured motif. For step two, they gave a detailed algorithm to build a directed acyclic graph according to the positions of the component patterns and interval constraints. Then they discussed how to output all the occurrences in detail. In (Rahman et al., 2006), the definition of their problem likes SAIL, but they don’t consider global constraints and the one-offsearching. In addition, just like paper (Chen et al., 2006), the local constraints exist between two substrings, while in SAIL, exist between any two consecutive letters. Certainly, a single character is a substring, but in this paper, all these substrings are used to build an AC automaton. It is not efficient to build a Trie structure over a set of single letters. This paper also used a two-step procedure: firstly using AC automaton to get occurrences of each sub-patterns in orders and combine them. They built an implicit graph, in which vertices are partitioned into several sets in order according to the corresponding sub-pattern and edges between two consecutive sets means two positions in these two consecutive sets fit corresponding local constraints. To output all Pin T, we have to enumerate all possible paths in the implicit directed graph which length is the number of sub-patterns in the pattern. Morgante, et al. (Morgante, et al., 2004) applied a revised depth first searching algorithm. Philip Bille et al. (Bille et al., 2010) defined a concept named variable length gap(VLG) which is a pattern formed by a sequence of strings and variable length gaps. Obviously, this definition is almost the same with above works. Unlike Rahman’s work, although this paper also applies AC automaton, it maintains a sorted list containing the ranges defined by previously reported relevant occurrences, and naturally it uses the left-moststrategy to count an occurrence as soon as it appears. Haapasalo et al. (Haapasalo et al., 2011) extended the usual dictionary matching problem to the case in which patterns may include single wildcards, or wildcard strings of variable length with fixed or unlimited upper bound. And their algorithm is designed for on-linematching: the text is scanned only once, and the matches for all patterns are reported at the point of occurrence. Firstly, they constructed an AC PMA (pattern matching automaton) with output tuples identifying the keywords of the patterns to be matched. The idea in their algorithm is that they recognize keywords by the PMA and check whether or not a newly found keyword forms a continuation of a pattern prefix found thus far.

3.6. Discussion

Because of the complexity of the PMWL problem definition, we believe that the matching occurrence of PMWL problem is with high degrees of freedom. In traditional matching problem with fixed-length wildcard, the positions of each match in the same set of the matching occurrence are relatively fixed to each other. Therefore, to determine the position of any one character, a set of matching occurrences have been identified. We believe that matching of each character in above problem has a strong correlation. For instance, in pattern a¢g¢¢c, p[2] – p[1] = 1, p[3] – p[2] = 2. However, for pattern in PWML, matching positions of adjacent characters are bounded by local constraints, which means matching of each character has a weak correlation, and to determine the positions of all characters, a set of matching occurrences could have been identified, that is, freedom degree of matching increases. This is an important factor leads to the complexity of the PMWL problem, which greatly increases the difficulties of searching process in matching algorithms. In order to get a complete solution, a lot of backtracking operation are required, making it difficult to be completed in polynomial time, therefore, almost all PMWL algorithms use greedy strategies in matching process. This is destined to incomplete results. However, on the other hand, although the above algorithms are not complete, we find that when length of pattern is shorter than 6, the approximation ratio of these algorithms are more than 0.9. Consequently, the next work can be considered form two aspects: 1, based on SAIL algorithm etc, improving the time efficiency, like BPBM; 2, designing algorithm for PMWL under certain conditions, such as RSAIL’s work for RT pattern. We believe that the pattern features, data structures and matching strategies will continue to be the center for PMWL algorithm design.

Table 6.

The strategy, structure, time consumption and completeness of PMWL methods

4. Analysis of PMWL based on pattern features

In the traditional matching problems, how to search pattern in text much faster as well as the correlation between pattern and text are paid more attention. But the characteristics of the pattern itself are paid little direct attention, because traditional matching problems can always have complete occurrences. The main characteristic of PMWL is with flexible wildcards, which leads to a large number of candidate matching positions. And the conflict between these occurrences will cause the final output incompleteness. However, our research shows that the direct cause which impacts PMWL incompleteness is not wildcards; it is the pattern characteristics directly conduct PMWL incompleteness. This is much different from traditional matching problems.

4.1. The impact of the alphabet, the length of pattern and the gap on completeness

In the traditional pattern matching research, the length of pattern and the size of alphabet are key elements influencing time complexity when analyzing traditional matching problems. Taking into account PMWL problem definition, upper and lower limits of the length constraints probably affect problem solving. Especially, instead of upper and lower limits themselves, the distance between the upper and lower limits, that is gap, are taken into consideration. Therefore, the parameters related to the algorithm completeness may be the size of alphabet, the length of pattern and distance between the upper and lower limits, denoted as Σ, m, and gaprespectively. In this article, the approximate degree of completeness of the algorithm will be measured by approximation ratio ε. Consequently, we try to build following model:

ε = F (Σ, m, gap)(1)

Taking into account that the size of Σ is determined in a specific area, for example, in bioinformatics, DNA sequences can be defined on Σ = {a, c, g, t}, the above formula can be simplified as ε = F (m, gap). In experiment project, input text is a biology DNA sequence, so Σ = {a, c, g, t}. Then the remaining parameter values are as follows: gap∈ [1, 30], m∈ [3, 9], consequently, there are 30*7 = 210 groups of experiments. The aim is to find approximation ratio ε.

Firstly, pattern Pis generated randomly by pattern generator according to Σ, m, and gap. For example, when m= 5, Σ = {a, c, g, t}, gap= 2, a¢[0,2]c¢[0,2]c¢[0,2]t¢[0,2]g is a qualified pattern. For simplicity, in generated patterns, each two consecutive characters have the same length constraints i.e. gap. Then, what needs to be done is calculating approximate ratio ε for each pattern. Since ε = N(UALG) / N(Uopt), we need to know N(Uopt). However, it is not desirable to directly solve this from a text T, since there is no any known algorithm to obtain the completeness solution. If we use a simple brute-force, the exponential time will be need. Therefore, we have developed a text generator, which can generate text Taccording to Pand N(UALG). In addition, SAIL algorithm is currently regarded as the most representative algorithm for PMWL problem, since SAIL firstly adopts the left-moststrategy which is applied in different situations and technologies such as BPBM(Guo et al., 2011) algorithm and the mining algorithm MAIL(Xie et al., 2010). Based on the above analysis, we have SAIL as a research object, that is, N(UALG) = N(USAIL).

In summary, the concrete steps of the experiment are as follows:

  1. For given Σ, mand gap, 100 patterns pi are generated randomly, where i= 1, 2,.., 100;

  2. For pattern pi, given N(Uopt) = 100, text length n = 2000, generate text Ti;

  3. For Ti, call SAIL algorithm to get N(USAIL);

  4. Calculating εi = N(UALG) / N(Uopt);

  5. Calculatingε=i=0100εi/100.

Table 7.

Parameters in experiments for ε = F (gap)

The experimental results:

Figure 4.

Curves of ε = F (gap) in experiment 1

By the figure 4, as mincreases, ε is gradually decreasing. As the gapincreases, the trend of ε is decreasing first and then increases, especially when gap= 1 and ε = 1, since the left-moststrategy can obtain a complete occurrence set. With the increase of gap, ε begin to decline because when the gapis becoming greater, the probability of matching occurrences overlap is becoming greater and the algorithm is becoming more easily to lose occurrences; when gapis sufficient, although matching occurrences are still overlap, greater gapreserve enough space for matching, making the remaining occurrences which have not yet been still have enough resources. Moreover, it is worth noting that the minimum of these curves can be reached when gapis about 7, and have nothing to do with the pattern length.

Figure 5.

Curves of ε = F (gap) in experiment 2

In figure 5, the trend of curves is the same as in figure 4. The difference between them is curves in figure 5 reach the minimum when gapis about 9~11. It can be found that the impact of Σ, m, and gapon the curves is that the change of gapdetermines the trend of the curve, maffects the magnitude of this change, and Σ makes the curve do translational move.

Table 8.

Parameters in mathematic model

After a series of experiments, we speculate that ε = A*gap4+ B*gap3+ C*gap2+ D*gap+E, where A, B, C, D and E are parameters and for different mthere are different parameters. We try to use this model to illustrate the relation between gapand approximation ratio ε.

Use this parameter table, some of illustrations for m= 3, 4……14 are listed below, where horizontal axis is the gap, vertical axis is the ε.

Figure 6.

Model fitting

We believe this model can be used to predict the completeness of solutions given a certain pattern. For example, given m= 10, Σ = {a, c, g, t}, gap= 5, this model shows the prediction of approximation ratio ε of SAIL algorithm is about 0.878. Therefore, this model can be used in pattern mining showed as below.

PMWL pattern mining evaluation mechanism

Input: Given T, Σ, m, gap, support sup

Output: pattern P

As we know, mining algorithm strategy is to learn from the strategy of matching algorithm, so PMWL pattern mining problem is naturally based on PMWL matching problem. For example, in mining algorithm MAIL (Xie et al., 2010), although a graph structure is utilized which conducts it different from SAIL; it is still based on the left-moststrategy. As a result, they have the same degree of completeness. Therefore, our model can propose an evaluation mechanism for mining.

4.2. The impact of pattern repon completeness

In next part, we will put forward another important concept, named rep, and analyze its impact on completeness. We first give an example to illustrate the reason why this concept is needed. Given m= 4, Σ = {a, c, g, t}, gap= 2, the corresponding patterns maybe P1 = a¢[0,2]c¢[0,2]g¢[0,2]t or P2 = a¢[0,2]c¢[0,2]c¢[0,2]t. They have the same Σ, mand gap. However, when applying SAIL or BPBM, the completeness of solutions is not the same, since for P1 algorithms can obtain complete solutions while for P2 can not.

Considering two examples below:

Table 9.

Example 1 for repconcept

A complete occurrence set of this example is {{0, 3, 5}, {2, 4, 6}}, the number of matching occurrences is 2. It is not difficult to find that, in SAIL algorithm, for position5, the selection of position 2 as p[1]’s occurrence by the left-moststrategy will consume the position for the next matching occurrence. We can guess that, the recurring 'b' character in this pattern affect the quality of matching occurrences.

Table 10.

Example 1 for repconcept

In this example, A complete occurrence set is {{0, 2, 4}, {1, 3, 5}}, the number of matching occurrences is 2. If we use SAIL algorithm and first obtain {0, 2, 3}, then we will only get this occurrence and lose {0, 2, 4}, {1, 3, 5}. Obviously, the recurring 'b' character in this pattern affects the completeness.

From above examples, the matching of recurring character in the pattern may determine the completeness of the algorithm. As a result, we consider this repeatability as an element to influence the completeness.

In order to quantify the repeatability, the concept of repeatability, rep, is proposed in this paper.

Definition 9Given a pattern P= p0p1…pm-1, let fij=(pi, pj) be all binary combinations of characters in pattern P

Letfij={0,pipj1,pi=pj, andrep=i=0m1j=0m1fij, where 0 ≤ i, jm-1, and ij. then repis the repeatability of characters in pattern. It shows the number of pairs of the same characters in pattern.

Definition 10Given occurrences Aand S,if a[i] = s[k] where 0 ≤ im-1, 0 ≤ km-1, we say Aconflicts with S.For example, T= aacccc, P= a¢[0,1]c¢[0,1]c, {0,2,3} conflicts with {0,2,4} and {1,3,5}. And “c” is the conflict letter.

For simplicity, global length constraint is deliberately ignored in our proof, and it does not affect the conclusion.

LEMMA 1Given two occurrences A, S, if Aand Scome from the same occurrence set, then a[i] ≠ s[k] where 0 ≤ im-1, 0 ≤ km-1.

Proof: Assume a[i] = s[k], then A conflicts with S, so they can not belong to the same set. The contradiction is achieved. Lemma 1 is proved.

LEMMA 2Given two occurrences Aand Swhere SUSAIL. If there is a conflict between Aand S, and let a[t] and s[i] be the conflict positions. According to the definition 10, under the one-offcondition, Ashould be discarded. Moreover, if i= t, then s[i] = a[i]; if it, s[i] < a[i] where 0 ≤ im-1. For instance, S= {0, 2, 3}, A= {1, 2, 4}, for s[1] = a[1], the conflict position is 1, and the other positions satisfy s[0] < a[0], s[2] < a[2].

Proof: Assume s[i] > a[i], then a[i] is in the left of s[i] in T. In accordance with the left-moststrategy of SAIL, the left-mostone prior to others is selected, which is a[i]. Due to the issue, SUSAIL, so s[i] should be selected. The contradiction is achieved. Thus, s[i] ≤ a[i]. If i= t, s[i] = a[t] = a[i], and if it, s[i] = a[t] ≠ a[i]. It is obvious to concluded that s[i] < a[i].

LEMMA 3Given a text T, a pattern Pand an occurrence S. Let USAILbe the occurrence set of SAIL. If SUSAIL, Sconflicts with at least one occurrence in USAIL.

Proof: Assume Sdoes not conflict with any occurrence in USAIL. Then it indicates that the reason why SAIL lose Scan only be the length constraint. According to the definition 4, all the occurrences satisfy the length constraint. The contradiction is achieved. So the lemma is proved.

LEMMA 4Let USAILbe the occurrence set of SAIL, and Uoptbe the optimal one. Let NSAIL(Nopt) be the matching number in USAIL(Uopt).

(1) If USAILis the completeness set, NSAIL= Noptis satisfied.

(2) Otherwise, NSAIL< Noptis obtained and there is a conflict between USAILand Uopt.

(3) If the condition holds Nopt= 1, NSAIL= Noptis obtained.

(4) If there is no conflict, NSAIL= Noptis achieved.

Proof: It is obvious to conclude (1) is obviously true. According to the definition 5, if NSAIL< Nopt, there is an occurrence Ssatisfying SUoptand SUSAIL. Due to LEMMA 3, Sconflicts with at least one occurrence in USAIL. That is Uopt is conflict with USAIL. So (2) is proved. With regard to (3), let Sbe the unique occurrence of Uopt. Assume NSAIL< Nopt, then NSAIL= 0. That is SAIL has no occurrence. In accordance with LEMMA 3, Sconflicts with at least one occurrence of SAIL. But USAILis empty, so there is no conflict. Thus the contradiction is achieved. And (3) is proved. With regard to (4), it is obvious NSAILNopt.We assume NSAIL< Nopt, then there is an occurrence Ssatisfying SUoptand SUSAIL. Due to LEMMA 3, Sconflicts with at least one occurrence in USAIL. That is Uoptand USAILhave a conflict. The contradiction is achieved. So (4) is proved.

LEMMA 5Given two occurrence sets U1,U2, if U1 conflict with U2, there are two sub-sets u1,u2 with a conflict where u1U1, u2U2.Proof: Assume there is no sub-sets with a conflict. All the matching positions of U1 and U2 have no conflict. According to definition 10, U1 and U2 have no conflict and satisfy the one-offcondition. The contradiction is achieved. Lemma 5 is proved.

LEMMA 6Given two occurrence sets U1, U2, U2 is Uopt. If there is a conflict between U1 and U2, and N(U1)< N(U2), there are two subsets u1,u2 where u1⊆U1, u2⊆U2, u1 is conflict with u2 and N(u1)< N(u2).

Proof: In accordance with LEMMA 5, there are subsets u1,u2 where u1U1, u2U2 with conflict. Let U1 = u11u12∪ ……∪u1n, U2 = u21u22∪ ……∪u2m, and u1i, u2jare arbitrary subsets where u1i⊆U1, u2j⊆ U2, 1 ≤ in, 1 ≤ jm. We discuss in three conditions: ①u1i, u2jhave no conflict and do not satisfy N(u1i)< N(u2j), then U1,U2have no conflict and N(U1)= N(U2), the contradiction is achieved. ② u1i, u2jhave a conflict and do not satisfy N(u1i)< N(u2j), then U1,U2have no conflict, the contradiction is achieved. ③ u1i, u2jhave no conflict and satisfy N(u1i)< N(u2j), then N(U1)= N(U2), the contradiction is achieved. So u1i, u2jhave a conflict and satisfy N(u1i)< N(u2j). Lemma 6 is proved.

THEOREM 1Given a text T, a pattern P, if SAIL is incomplete, Pmust be R pattern.Proof: Let USAILbe the occurrence set of SAIL, Uoptis the completeness set, NSAILis the matching number of SAIL, and Noptis the complete matching number. Consider the SAIL is incompleteness, according to LEMMA 4, NSAIL< Nopt, and USAILconflicts with Uopt. Due to LEMMA 6, we get two subsets u1, u2 with conflict, which are satisfying N(u1) < N(u2) where u1USAIL, u2Uopt. Without loss of generality, let N(u1) = 1, N(u2) = 2. Set u1 = {S}, u2 = {A, B}, that is SUSAIL, AUopt, BUopt. Let:T= t[0], t[1]… t[i]… t[n-1], t[i] is stand for the ithletter in Twhere i= 0,1,2……n-1

P= p[0], p[1]… p[i]… p[m-1], p[i] is stand for the ithletter in Pwhere i= 0,1,2……m-1

A= a[0], a[1]… a[u]… a[m-1], a[i] is stand for the ithcharacter maching position of occurrence Awhere i= 0,1,2…m-1

B= b[0], b[1]… b[w]… b[m-1], another occurrece.

S= s[0], s[1]… s[i]… s[k]… s[m-1], another occurrece.

Let a[u], b[w] be the positions in A, B, which conflict with s[i], s[k] in S separately.We assume other positions in Aand Bdo not conflict with the occurrences in USAIL. ∴ a[u] = s[i], b[w] = s[k] ∴t[a[u]] = t[s[i]], t[b[w]] = t[s[k]] ∵According to the definition 4, t[a[u]] = p[u], t[s[i]] = p[i], t[ b[w] ] = p[w], t[s[k]] = p[k] ∴ p[u] = p[i], p[w] = p[k]

It would be discussed in the following two cases:

uior wk

u= iand w= k

For ①, when if ui, ∵p[u] = p[i] ∴There are two of the same letters from different positions in P. ∴According to definition 6, Pis an R pattern. For the case of wk, similarly, it can be proved.

Then we will prove the other condition is impossible, and conclude P is R pattern.

For ②, we obtain a[u] = s[i] = s[u], b[w] = s[k] = s[w]. There is uw. ∵ Assume u= w, then u= i= w= ka[u] = s[i] = s[k] = b[w]. Consider A, Bbelong to the same occurrence set, which contradicts with LEMMA 1 ∴ uw. Without loss of generality, let u< w, according to LEMMA 2 ∵ SAIL adopts the left-moststrategy, and SUSAIL,A,B ∉ USAILs[u] < b[u], s[w] < a[w] ∵ a[u] = s[u], b[w] = s[w]

a[u] < b[u], b[w] < a[w]

And∵ u< w, we can obtain b[u] < b[w]

A= …a[u]…………..a[w]…

B= ……..b[u]…b[w]………

The occurrence {b0, b1,…, bu,…,aw,…, am-1} can be considerd as {{b0, b1,…,bu}, {bu,…,aw}, {aw,…, am-1}}.

According to the definition 4, {a0, a1,…,au,…,aw,…,am-1} and {b0, b1,…,bu,…,bw,…,bm-1} satisfy the local constraints. So {aw,…, am-1} and {b0, b1,…,bu} satisfy the local constraints.

Due to {bu,…,aw}, we can get { bu, bu+1…, aw-1, aw}.

From the equation (1), a[u] < b[u], b[w] < a[w], and according to the definition 4:

b[i] < b[i+1], a[i] < a[i+1] where uiw-1(2)

There is a tsatisfying:

b[t+1] < a[t+1] and a[t] < b[t] where utw-1 (3)

A= …a[u]……a[t]……………a[t+1]……..a[w]…

B= ……..b[u] ……b[t]……b[t+1]……b[w]………

Assume there is no tsatisfying the condition, consider b[t] ≠ a[t] where utw.Then due to any tthere is a[t+1] < b[t+1] or a[t] > b[t] where utw-1.Consider a[u] < b[u], there is a[u+1] < b[u+1].Due to a[u+k] < b[u+k], we can obtain a[u+k+1] < b[u+k+1] where 0 ≤ kw-u-1.Then we can induce a[i] < b[i] where uiw.It contradicts b[w] < a[w], so the assume is incorrect Due to equation (2) and (3), a[t] < b[t] < b[t+1] < a[t+1] where utw-1.

b[t+1] - b[t] < a[t+1] - b[t] < a[t+1] - a[t] (4)

That is a[t] and b[t-1] satisfy the local constraints. In this way, { bu, bu+1…, aw-1, aw}can be considered as {{ bu, bu+1…,bt-1},{bt, at+1},{ at+2 …, aw-1, aw}}. In accordance with definition 4, { bu, bu+1…,bt-1},{ at+2 …, aw-1, aw} satisfy the local constraints. ∴{ bu, bu+1…, aw-1, aw} satisfy the local constraints. ∴From the above analysis, {b0, b1,…,bu,…,aw,…, am-1} satisfy the local constraints.

However, according to the theorem, the other positions in A,B do not conflict with USAILexcept for a[u], b[w]. That is, {b0, b1,…,bu,,aw,…,am-1} satisfies the one-offcondition. ∴{b0, b1,…,bu, aw,…, am-1} is another occurrence, and does not conflict with any occurrences in USAIL. But USAILdoes not include this occurrence. It contradicts with LEMMA 3. Thus, condition ② is impossible. And from the analysis of ①, under the condition of the theorem, P must be R pattern. The theorem 1 is proved.

THEOREM 2Given a text T, a pattern P, if Pis NR pattern, then SAIL is complete.

Proof: It is the inverse negation of THEOREM 1. Apparently, THEOREM 2 is true.

THEOREM 3Given a text T, a pattern P, if Pis R pattern, then SAIL is incomplete.

Proof: It can be concluded from the analysis and example in section 2.

THEOREM 4If the pattern fulfills gap= 0, SAIL is complete.

Proof: If gap= 0, the wildcard is a constant. For example a¢[1,1]c¢[2,2]c is converted into a¢c¢¢c. There won’t be any conflict or exist seizing between occurrences. SAIL will perform complete.

Experiment design

When ∑ and m are determined, rep can only be some certain values, because rep has correlation with ∑ and m

: ∑= 4, m= {5,7,9}, gap= [0,3], rep= {0,1,2,3,4,6,7,10,11,15, 21,28,35}. In each set of experiments, 20 patterns are randomly generated; the final result is the average.

Analysis of experimental results: with increment of rep, the curve of approximation ratio gradually decreases, followed by a slight increase. The reason for decline is that replead to more nested occurrences, resulting in a greater degree of the possibility of losing occurrences; the reason for the increscent is that larger repcan cause more extreme pattern. For instance, when ∑ = 4, m= 7, rep= 21, patterns like P1 = a¢[0,3]a¢[0,3]a¢[0,3]a¢[0,3]a¢[0,3]a¢[0,3]a which is difficult to find a special text containing nested occurrences of such pattern, will be produced. For P1, the text like “aaaaaaaaaaaaaa” contains nested occurrences of this pattern. Obviously, this extreme text is very rare. Therefore, under the premise of nested occurrences are not easily to be formed, the approximation ratio will be increased slightly.

Figure 7.

The relation betweenrepand approximation ratio ε

Next we will analyze the relationship between the repeatability rep and alphabet size ∑, pattern length m. Original problem: a pattern which length is m, and alphabet size is ∑, what is the expectation of repeatability E(rep)?

This description is equivalent to the model of ‘taking ball from the bag’ in the combination mathematics:

There is a bag of balls, and |Σ| kinds of colors, taking m balls from the bag with replacement, then in fetched balls, how many pairs of the same color?

Table 11.

The relationship between ∑, m and rep

Finally, we can deduce:


5. Conclusions

As an extension of traditional matching problem, the PMWL problem has aroused more and more attention because of its unique flexibility and complexity. Based on problem definition and drawing on research idea in traditional matching problem, this article introduces SAIL, RSAIL, SBO and BPBM which are representative algorithms for PMWL in three important respects: the data structures, the matching strategies and the characteristics of pattern. The article also analyzes the pros and cons of the above algorithms from the point of quality of the solution and time complexity, and gives experimental matching results by using real DNA data. Among them, the SAIL algorithm is the first to propose the method of solving PMWL problem, it uses the sliding window structure and the representative left-most matching strategy. This paper finds that in short patterns, the approximation ratio of SAIL is higher than 0.9, while in longer patterns, the occurrences obtained by SAIL are of poor quality; the quality of occurrences obtained by SBO is best, but its time consumption has a non-linear relationship with the length of text; BPBM utilizes bit parallel technology to improve the efficiency of matching greatly, but also is impact by the machine word; for pattern with repeated letters in tail, RSAIL uses symmetry to improve the quality of occurrences under certain conditions, thus providing a solving idea to PMWL problem, but in longer patterns and wilder gaps, the efficiency is not obvious.

Afterwards, this article focus on relationship between approximation ratio ε and alphabet size ∑, pattern length m, wildcards span gap and repeatability rep. Firstly, this article proposes the model ε = F (Σ, m, gap), describing the functional relationship between pattern characteristics and approximation ratio approximately; secondly, this article proves PMWL’s completeness under the conditions of rep = 0; finally, the relationship between the pattern features are also analyzed andm in addition, relationship thatE(rep)=Cm2||is proposed.

In future work, the formal description of the PMWL problem will be considered, in order to explain the complexity of the problem better, thus helping algorithm design and analysis for problem complexity.


  • When ∑ and m are determined, rep can only be some certain values, because rep has correlation with ∑ and m

© 2012 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Haiping Wang, Taining Xiang and Xuegang Hu (November 28th 2012). Research on Pattern Matching with Wildcards and Length Constraints: Methods and Completeness, Bioinformatics, Horacio Pérez-Sánchez, IntechOpen, DOI: 10.5772/48574. Available from:

chapter statistics

1559total chapter downloads

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

Hierarchical Biological Pathway Data Integration and Mining

By Shubhalaxmi Kher, Jianling Peng, Eve Syrkin Wurtele and Julie Dickerson

Related Book

First chapter

Molecular Evolution & Phylogeny: What, When, Why & How?

By Pandurang Kolekar, Mohan Kale and Urmila Kulkarni-Kale

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More About Us