Research on Pattern Matching with Wildcards and Length Constraints : Methods and Completeness

© 2012 Wang et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Research on Pattern Matching with Wildcards and Length Constraints: Methods and Completeness


Introduction
The practical importance of the string matching problem should be obvious to everyone. For typical word-processing applications, immense amounts of work have been done on this subject. However, with the developments in bioinformatics (Cole et al., 2005), information retrieval (Califf et al., 2003), pattern mining Ji et al., 2007;, etc, sequential Pattern Matching with Wildcards and Length constraints (PMWL) has attracted more and more attention. It is not difficult to think up realistic cases where PMWL plays an important role. In Dan Gusfield's book (Gusfield, 1997), they give an example about transcription factor to illustrate the concept of wildcard. A transcription factor is a protein that binds to specific locations in DNA and regulates the transcription of the DNA into RNA. In this way, production of the protein that the DNA codes for is regulated. Many transcription factors are found and can be separated into families characterized by specific substrings containing wildcards. They use Zinc Finger, a common transcription factor as an example. It has the following signature:

CYS¢¢CYS¢¢¢¢¢¢¢¢¢¢¢¢¢HIS¢¢HIS
Where CYS is the amino acid cysteine and HIS is the amino acid histidine. They also give a conclusion that if the number of wildcards is bounded by a fixed constant, the problem can be solved in linear time.
Another respective example is about promoter. In bioinformatics, promoter will help researchers to quickly locate the starting position of the intron from hundreds of millions of the sequence of ACGT. Among these promoters, TATA box is a common one (Manber & Baeza-Yates, 1991). It has very loose sequence specificity, so many TATA sequences are not TATA box. As a result, indirect positioning by pairs of sites is needed. The commonly used one is CAATCT sequence. The DNA sequence TATA is a common promoter that often occurs after the sequence CAATCT within 30-50 wildcards. Therefore, matching patterns with wildcards becomes especially crucial in exploring valuable information from DNA sequences.
There are many applications that involve pattern matching with wildcards and various researches have provided many solutions to different forms of this problem. Fischer and Paterson were the first to generalized pattern matching with wildcards (Fischer & Paterson, 1974): given a pattern P and a text T, either of which may contain wildcards, denoted by ¢, the goal is to locate all P's occurrences in T. ¢ can match any letter in a given alphabet, such as a¢¢c¢t. Unlike previous work, Chen, et al. proposed a PMWL problem integrating two problems (Chen et al., 2006): one is complex local constraints which means the user can specify a different range of wildcards between each two consecutive letters of P, for example, a¢[0,3]c¢ [1,3]t. Another one is global length constraints. The user can constrain the length of each matching substring of T in which P occurs. Therefore, flexible constraints of wildcards conduct flexible jump of the matching positions. The definition of PMWL problem is an extension of Fischer and Paterson's definition and the introduction of complex local constraints increase the flexibility. On one hand, this definition of pattern is more suitable for areas such as bioinformatics; on the other hand, the size of the matching candidate positions is in the exponential increment which greatly increases the complexity of the problem solving. From a view of practical point, they also proposed two issues: with and without the one-off condition (Chen et al., 2006;Min et al., 2009). In their problem definition, users have more flexibility to search on sequences and the one-off condition has both theoretical and practical significance. One-off condition means that every letter in T can be used once at most. In practical applications, with and without the one-off condition has practical meaning in specific areas. For example, in sequential pattern analysis in data mining, P can be treated as a candidate shopping pattern, the user is interested in how frequently P occurs in one document, it makes sense to count each occurrence for once, what is more, the one-off condition also makes the problem solving possible. However, under the one-off condition, how to allocate limited text resource to each matching occurrences, in order to obtaining the maximum number of occurrences, belongs to optimization problem. In the allocation of resources, the matching of different letters in the pattern possess a strong correlation, which conducts the selection of matching positions in the combination of explosive growth. Since it is difficult to develop a complete matching strategy in this problem, almost existed algorithms for PMWL are using greedy matching strategies, which is the root reason why matching algorithm is not complete. This article will focus on SAIL algorithm (Chen et al., 2006) which is a representative algorithm for PMWL problem and will also describe RSAIL (Wang et al., 2010), SBO ), BPBM (Guo et al., 2011 algorithm which are all designed to solve PMWL problem in different conditions. The each of above algorithms has its own characteristics in the data structures and matching strategies, which will be analyzed in this paper.
What is more, since the theoretic and practical importance of the definition of PMWL, we need to research the nature of this problem. To our best knowledge, there are still no efficient methods on this problem, because as for completeness of the problem, we still do not know whether it could be solved in polynomial time. In this article, we will research the completeness of PMWL under certain condition. In the traditional matching problem, description of pattern and text information is the key to the algorithm design, however, flexibility and complexity of the PMWL problem all depends on the pattern features, so this article will focus on pattern information, especially the pattern features including the size of alphabet, the length of pattern, the gap of wildcards in the pattern, etc. We will also investigate the relationship between pattern features and completeness, and use the approximate ratio judgment. Further more, since the definition itself is produced in realistic background, we need to consider the situation in real biological background and improve the solution of the problem. Based on the above, we choose this topic as a research object in this book.
This capture is organized as follows: In section 2 we will give the development, definition and application of PMWL problem; Section 3 will show the representative algorithms, we will introduce their structure, strategy, complexity and completeness; Section 4 will analyze the PMWL problem completeness based on pattern features. We will give our conclusions in section 5.

Pattern matching with wildcards and length constraints
The sequential pattern matching problem is to given a Text T and a pattern P as input, and output all the occurrences of P in T. After Fischer and Paterson's work, there are a variety of non-standard definitions of the pattern matching problem: the approximate matching , the swapped matching (Amir et al., 2000), the Parameterized matching (Amir et al., 2009), etc. They all belong to Non-standard Stringology problem (Muthukrishnan, 1994). Many of them are still open problems.

The development of PMWL problem
After years of development, these Non-standard Stringology problems always focus on a problem: that is, how to conduct the traditional pattern matching definition to be more flexible to adapt the development of application. The don't cares problem always focus on how to combine the wildcards and the pattern. After Fischer and Paterson's work, Cole et al. considered a slightly different problem (Cole et al., 2004), where instead of fixing the number of ¢s between two consecutive letters in P and T, they fixed the total number of ¢s in P. The disadvantage of these problem definitions is that the number of ¢s is a constant but not a range. This limits flexibilities for the user's queries. To alleviate the problem of a fixed number of ¢s, Kucherov et al. (Kucherov et al., 1995) proposed a solution to allow an unbounded number of ¢s between two consecutive letters in a given pattern. Given a set of such patterns, their objective is to find whether any of these patterns matches some substring of the text that does not contain any ¢.Obviously, allowing an unbounded number of ¢s still does not offer the users enough flexibilities to control their queries. Manber et al. (Manber & Baeza-Yates, 1991) proposed an algorithm for string matching with a sequence of wildcards. They considered the following problem: given two pattern strings P and Q, each of which consists of letters, and an integer g, all occurrences of the form P¢0−gQ in the text are returned. The number of ¢s between P and Q is in the range of [0, g], and the text does not contain any ¢. This problem was so-called exact string matching with variable-length don't cares. Chen et al. sum up all these definitions into three conditions (Chen et al., 2006): firstly, there is a wildcard between two consecutive letters in P, for example A¢[0,1]T¢[0,2]G¢[1,3]C; secondly, every letter in T can only be used once for matching; thirdly, there is a global constraint to limit the matching occurrence length. We call the problem satisfying above definition PMWL problem, which has been used in approximate matching, pattern mining, information retrieval, etc.

The potential applications of the PMWL problem
1. Text Indexing: There is a large amount of hypertext information on the Internet. How to effectively obtain information that meet users' needs is becoming more and more urgent. Text indexing is a method to solve this problem. How to determine the position of user-specified pattern (may contain wildcards) is a challenge task. 2. Data stream is becoming more and more crucial in many new database applications such as data warehouse and sensor network. Mining dependence or association in large amounts of data flow has practical value and during which the sequential pattern matching with wildcards is the first and the most important step. In addition, in data mining, sequential pattern mining also search frequent patterns as in transaction sequence, a typical instance is the similar consuming pattern of many consumers, for example, buying a desktop, a laser printer, a digital camera and an LCD screen monitor in turn, between each of them exists a certain time interval. Mining such typical user mode, which is obviously a pattern matching with wildcards, will has a great influence on the market.
3. Network Security: Pattern matching methods in network security and intrusion detection need high performance. A complete IDS (Intrusion Detection System) based on Snort rules needs to optimize hundreds of rules and many of them need to do pattern matching efficiently for the entire data partition of a package. Efficient pattern matching and mining with wildcards constraints give the system administrator a more flexible and accurate solution to locate the suspicious users.

Problem statement for PMWL
Definition 1 Let  be an alphabet, T = t0t1…tn-1 ∈ *  is called a text of  where n = |T| is the length of T. A pattern is a tuple P = (p, g) where p = p0p1…pm-1 is a sequence of characters, which belong to the alphabet  , and g = g0g1…gm-2 is a sequence of wildcards. And m = |P| is the length of P. The interval of wildcards between pi and pi+1 is denoted by gi = g (Ni, Mi) where 0 ≤ i ≤ m -1, called the local constraints. Ni and Mi is the upper and lower limit of wildcard. Such as P = a¢ [1,3] (Chen et al., 2006). PMWL problem can be defined by the above definition:

Definition 3 PMWL problem (Pattern Matching with Wildcards and Length constraints)
Pattern Matching with Wildcards and Length constraints meets the following conditions: 1. ¢s can occur between each two consecutive letters in pattern and are independent to each other; 2. ¢s between two consecutive letters can match a string which length is limited by local constraints, and the total length of pattern is limited by global constraint; 3. One-off condition is taken into consideration that every letter in T can only be used once for matching pj (0 ≤ j ≤ m -1) and as soon as there exists one occurrence of P in T when T is being scanned from left to right it will be returned.

Algorithms for PMWL
The results of traditional matching algorithm are complete, so the focus of research is to improve the matching efficiency. As a kind of searching problem, the key to solving matching problem is how to use and extract information getting from text and pattern. KMP, BM algorithm uses automata to describe the pattern characteristics, and deposit information obtained from scanning during matching process into automata. Algorithm visits the automata, when the jump distance needs to be calculated, thus to avoid obtaining the pattern information repeatedly and to ensure the jump in matching process does not affect the final result. The basic idea the suffix tree is to use the tree structure to describe the text information, and to avoid scanning the same text repeatedly when matching a set of patterns. We believe that data structure and search strategy are crucial for traditional algorithms to access to information of text and pattern. Reasonable data structure is better to explore the potential of the computer, such as bit parallel technology, and can also be a more reasonable representation of the sequence information, such as automata. In addition, there exist the sliding window, indexes and other data structures. Reasonable matching strategy makes better use of sequence information. These strategies can approximately be divided into prefix searching, suffix searching and factor searching (Navarro & Raffinot, 2001 As different extension of traditional matching problem, PMWL problem, approximate matching, and swap matching all belong to the Non-standard Stringology problem. Problems in this field mostly belong to the optimization problem, and most of them have not yet been completely solved, such as PMWL and approximate matching problems with wildcards etc. What PMWL and traditional matching problem have in common are: 1. From the view of algorithm itself, the data structures and matching strategies are as the key of algorithm design. 2. From the view of describing the object, how to effectively describe the patterns and text information is the key to solve the problem.
What PMWL and traditional matching problem have in difference are: 1. For PMWL, there is no complete solving yet, so algorithm evaluation criteria include both time efficiency and solution quality; but traditional matching algorithm is only concerned with matching time. 2. The flexibility and complexity of the PMWL problem definition are reflected in the pattern, therefore, compared with traditional matching, PMWL pay more attention to the description of the pattern information. Pattern characteristics are extremely associated with the solution of PMWL problem.
Next, we will give the representative algorithms for solving PMWL problem, and detailed description of their design ideas from the perspective of data structure and matching strategy.

The SAIL Algorithm
Description of SAIL Algorithm (Chen et al., 2006): After that, SAIL conducts two phases, the Forward phase and the Backward phase. In the Forward phase, SAIL determines whether there is a potential matching occurrence by using a search table. Afterwards, if a potential matching occurrence can be determined, Backward phase is triggered out to output an optimal occurrence by using the left-most strategy.
A running example for SAIL: In this subsection, we show how SAIL works with a running example where P, T and constraints are given as follows.

Positions in Text
Positions in Table   a g c c Step 5. Enter Backward and select the left-most one from the marked positions in each row, and they are highlighted. In this way, we will get an occurrence {2, 4, 6, 7} and mark the four positions used.
Step 6. Go on to execute Location and get the third matching position is 8, then we can build the table below. Notice the positions of 6, 7 have been used. Under the one-off condition, all used positions (marked as * in Table) of T are never considered for further matching again. If the one-off condition is not considered, SAIL will get another two occurrences {2, 4, 6, 8} and {3, 5, 7, 9}. Then the Forward phase returns false, and SAIL go back to Location.
Step 7. In position 9, the Forward also returns false. Finally, SAIL output only one occurrence {2, 4, 6, 7}. Table   * * * a g c c The time complexity and completeness analysis:

Positions in Text Positions in
where n is the length of T, k is the frequency of P's last letter occurring in T, l is the user-specified maximum length for each matching substring, m is the length of P, and g is the maximum gap of wildcards in P.
Two important issues, Online searching and Optimization, are taken into consideration to design SAIL. As for optimization, under the one-off condition, SAIL determines which occurrence is an optimal one if multiple occurrences end at a P[m-1]'s position by applying the left-most strategy. As a heuristic algorithm, SAIL utilizes a kind of greedy strategy to select a set of occurrences; consequently, SAIL may obtain locally optimal solution which lead to losing occurrences in offline searching.
We believe that the SAIL's data structure is based on the sliding window, the Location also uses the left-most strategy, so it is possible to lose occurrences, for instance: Obviously, in the above example, SAIL loses occurrences in offline condition because of the selection of character c's matching position. For further observation, it is not difficult to find that character c appears in the pattern twice. If pattern is a¢[0,1]g¢[0,1]c, SAIL will get a complete occurrence set, that is, {{2, 4, 6},{3, 5, 7}}. Further experiments show that, the recurring appearances of pattern characters influence the quality of matching occurrences obtained by the algorithm. In next part, we will analyze the completeness of PMWL based on pattern features.

The RSAIL Algorithm
Description of RSAIL Algorithm (Wang et al., 2010):  [m-1] where 1 ≤ i < m-1, P is called a pattern with recurring tail characters and RT pattern in short. Such as a¢[0,1]c¢[0,1]c. As we can see, the RT pattern is a special form of the R pattern.
From the above discussion, in the research of Chen et al., since they only concern about the on-line situation, their proof of SAIL's completeness is incomplete, which is only suitable for the on-line situation. What is more, it ignores the interaction between different occurrences.
We find that SAIL satisfies the completeness under a certain restriction, i.e. the pattern with no-recurring character (NR pattern), such as a¢[0, 1]t¢[0, 1]g¢[0, 1]c¢[0, 1]. The concept of NR pattern has practical significance, for example, in text mining, where the text is a sequence of words, the NR pattern reflects the semantic relation between words.
We utilize the symmetry to scan the text and the pattern. Then convert an RT pattern into an R pattern.
1. According to the characteristic of P, if it is not an RT pattern, we directly call SAIL; otherwise go to (2); 2. Reverse T and P, respectively get T', P'; 3. Call SAIL, and obtain the occurrences of P' in T'; 4. Obtain the occurrences of P in T by coordinate transformation of the obtained solution.
Obviously, since the time of the identification of pattern's characteristics is linear, O (RSAIL) = O (SAIL).

Experiments and Analysis:
We will give a set of experiments to illustrate two problems: 1. Analysis of the complete extent of the SAIL algorithm; 2. The comparison of the complete extent of RSAIL and SAIL algorithm; Considering there is no algorithm can obtain the completeness occurrences of PMWL problem in polynomial time, we have developed a text generator  to generate experimental text, by which way we can know the completeness occurrences in order to analyze the complete extent of algorithm. In addition, the patterns used in these experiments are all RT patterns. The experimental results and analysis: Figure 3. The approximate ratio experimental results of RSAIL and SAIL From the above images, SAIL itself is already a near-complete algorithm, in the above graphs, the average approximation ratio of SAIL is higher than 0.94;For the RT patterns, the completeness of RSAIL is better than SAIL in different  , m and gap. Thus, not only a revised algorithm is obtained, the SAIL's deficiency on handling RT patterns is also proved from another aspect.

The BPBM Algorithm
Description of BPBM Algorithm (Guo et al., 2011): Like the SAIL algorithm, the BPBM algorithm also focuses on pattern matching in online sequential text with both flexible gap constraints by user's specification and the one-off condition. BPBM is based on bit-parallel technology to simulate the matching process and adopt two nondeterministic finite state automatons (NFAs). One is a search mechanism to identify all pattern P's suffix, and another one is a security window transition mechanism which accelerates the scanning process by dropping useless sequences in text.
BPBM has following characteristics: 1. BPBM also uses the left-most strategy to obtain the maximal occurrence of pattern in text, and return all these matching position sequences. This algorithm combines bitparallel technology with nondeterministic finite state automatons. It also simplifies the calculation of shift distance of the security window transition, which gets good results. BPBM inherits the advantage of BM algorithm to skip some of characters in text, which conducts the algorithm with a sub linear average time complexity. Therefore, the time complexity of BPBM is lower compared to SAIL. 2. Compared with Gaps-Shift-And algorithm and Gaps-BNDM algorithm, since they are all based on bit-parallel technology, their time performances are equal, however, because BPBM improves the formula of ε-transition in matching process, BPBM is fit for all patterns. In addition, BPBM returns a concrete set of matching position sequence, which makes it more applicable. 3. Compared with SAIL, SAIL applies two-dimensional table as data structure, but BPBM is base on bit-parallel technology. Since the difference in data structure, BPBM has a better time performance. The similarity between these two algorithms is they all utilize the left-most strategy, therefore, they are all heuristic algorithm with greedy strategy. In addition, the matching occurrences of these two algorithms are same and both incomplete.

The SBO Algorithm
Description of BPBM Algorithm  Wu et al. propose a new nonlinear structure called Nettree to deal with pattern matching with flexible constraints of wildcards. A Nettree is a kind of directed acyclic graph (DAG) with edge labels. They apply a heuristic algorithm to select better occurrence (SBO). In this algorithm, they use two strategies: Strategy of Greedy-Search Parent, SGSP and Strategy of right-most Parent, SRMP to two occurrences of the same leaf, and select the better one as occurrence. The core idea of SGSP is finding an approximately optimal parent (AOP) of current node in each step; while the core idea of SRMP is finding the right-most parent node of current node in each step.
In off-line conditions, owing to its heuristic strategy, SBO can obtain more occurrences than SAIL and BPBM in most cases, but it is still incompleteness. However, the time complexity of SBO algorithm is O(gap*n*(n+m 2 )) which is nonlinear of length of text. Further more, experiments show that, in general, SBO indeed consumes more time than SAIL. In SBO, the improvement of solution's quality is relying on using heuristic strategies repeatedly, and this also consumes a lot of time. Therefore, we need consider the balance between completeness and time efficiency of algorithm. What is more, SAIL originally is not designed for off-line condition, as an on-line algorithm, it only has current information, so when applying it to off-line matching it will definitely be imperfect. But SBO uses global information to search occurrences, it also uses heuristic strategies to search on solution space. With the improvement of occurrences' quality, there are two problems: more information need more place to store, so the space complexity of SBO is O(gap*m*n), the next one is more information means more calculations thus consuming more time.

Other algorithms
In many literatures, similar problems are defined and various algorithms are put out to solve certain problems. Morgante, et al. (Morgante, et al., 2004) described a structured model, which can be considered as 'compound patterns' made of a list of simple motifs and a list of intervals that specify at what distances adjacent motifs should occur. They gave a detailed description of the biological background of the problem definition. For example, many retrotransposons belonging to the Ty1-copia group contain a match of MT¢[115,136]MTNTAYGG¢[121,151]GTNGAYGAY, which consists of three patterns and two intervals. As the paper pointed out, structured motifs are called classes of Characters and Bounded Gaps (CBG) expressions in Navarro and Raffinot, but use of these expressions is quite different: the underlying motivation for CBG expressions is searching in database like PROSITE and a sequence of this kind is usually not very long, while structured motifs can be very long since gaps may span many letters. As we can see, the concept of CBG and structured motifs are all have practical meaning. Because of the different application background, they design different algorithms to solve their problems. From the application point, this paper also considered a problem of q-approximation match which means just finding partial motifs in the sequence. In this paper, they proposed a two-step procedure which is used in many algorithms for PMWL: firstly, finding the occurrences of all the component patterns; secondly, combining the occurrences that satisfy the distance constraints into a structured motif. For step two, they gave a detailed algorithm to build a directed acyclic graph according to the positions of the component patterns and interval constraints. Then they discussed how to output all the occurrences in detail. In (Rahman et al., 2006), the definition of their problem likes SAIL, but they don't consider global constraints and the one-off searching. In addition, just like paper (Chen et al., 2006), the local constraints exist between two substrings, while in SAIL, exist between any two consecutive letters. Certainly, a single character is a substring, but in this paper, all these substrings are used to build an AC automaton. It is not efficient to build a Trie structure over a set of single letters. This paper also used a two-step procedure: firstly using AC automaton to get occurrences of each sub-patterns in orders and combine them. They built an implicit graph, in which vertices are partitioned into several sets in order according to the corresponding sub-pattern and edges between two consecutive sets means two positions in these two consecutive sets fit corresponding local constraints. To output all P in T, we have to enumerate all possible paths in the implicit directed graph which length is the number of sub-patterns in the pattern. Morgante, et al. (Morgante, et al., 2004)  Rahman's work, although this paper also applies AC automaton, it maintains a sorted list containing the ranges defined by previously reported relevant occurrences, and naturally it uses the left-most strategy to count an occurrence as soon as it appears. Haapasalo et al. (Haapasalo et al., 2011) extended the usual dictionary matching problem to the case in which patterns may include single wildcards, or wildcard strings of variable length with fixed or unlimited upper bound. And their algorithm is designed for on-line matching: the text is scanned only once, and the matches for all patterns are reported at the point of occurrence. Firstly, they constructed an AC PMA (pattern matching automaton) with output tuples identifying the keywords of the patterns to be matched. The idea in their algorithm is that they recognize keywords by the PMA and check whether or not a newly found keyword forms a continuation of a pattern prefix found thus far.

Discussion
Because of the complexity of the PMWL problem definition, we believe that the matching occurrence of PMWL problem is with high degrees of freedom. In traditional matching problem with fixed-length wildcard, the positions of each match in the same set of the matching occurrence are relatively fixed to each other. Therefore, to determine the position of any one character, a set of matching occurrences have been identified. We believe that matching of each character in above problem has a strong correlation. For instance, in pattern a¢g¢¢c, However, for pattern in PWML, matching positions of adjacent characters are bounded by local constraints, which means matching of each character has a weak correlation, and to determine the positions of all characters, a set of matching occurrences could have been identified, that is, freedom degree of matching increases. This is an important factor leads to the complexity of the PMWL problem, which greatly increases the difficulties of searching process in matching algorithms. In order to get a complete solution, a lot of backtracking operation are required, making it difficult to be completed in polynomial time, therefore, almost all PMWL algorithms use greedy strategies in matching process. This is destined to incomplete results. However, on the other hand, although the above algorithms are not complete, we find that when length of pattern is shorter than 6, the approximation ratio of these algorithms are more than 0.9. Consequently, the next work can be considered form two aspects: 1, based on SAIL algorithm etc, improving the time efficiency, like BPBM; 2, designing algorithm for PMWL under certain conditions, such as RSAIL's work for RT pattern. We believe that the pattern features, data structures and matching strategies will continue to be the center for PMWL algorithm design.

Analysis of PMWL based on pattern features
In the traditional matching problems, how to search pattern in text much faster as well as the correlation between pattern and text are paid more attention. But the characteristics of the pattern itself are paid little direct attention, because traditional matching problems can always have complete occurrences. The main characteristic of PMWL is with flexible wildcards, which leads to a large number of candidate matching positions. And the conflict between these occurrences will cause the final output incompleteness. However, our research shows that the direct cause which impacts PMWL incompleteness is not wildcards; it is the pattern characteristics directly conduct PMWL incompleteness. This is much different from traditional matching problems.

The impact of the alphabet, the length of pattern and the gap on completeness
In the traditional pattern matching research, the length of pattern and the size of alphabet are key elements influencing time complexity when analyzing traditional matching problems. Taking into account PMWL problem definition, upper and lower limits of the length constraints probably affect problem solving. Especially, instead of upper and lower limits themselves, the distance between the upper and lower limits, that is gap, are taken into consideration. Therefore, the parameters related to the algorithm completeness may be the size of alphabet, the length of pattern and distance between the upper and lower limits, denoted as Σ, m, and gap respectively. In this article, the approximate degree of completeness of the algorithm will be measured by approximation ratio ε. Consequently, we try to build following model: Taking into account that the size of Σ is determined in a specific area, for example, in bioinformatics, DNA sequences can be defined on Σ = {a, c, g, t}, the above formula can be simplified as ε = F (m, gap). In experiment project, input text is a biology DNA sequence, so Σ = {a, c, g, t}. Then, what needs to be done is calculating approximate ratio ε for each pattern. Since ε = N(UALG) / N(Uopt), we need to know N(Uopt). However, it is not desirable to directly solve this from a text T, since there is no any known algorithm to obtain the completeness solution. If we use a simple brute-force, the exponential time will be need. Therefore, we have developed a text generator, which can generate text T according to P and N(UALG). In addition, SAIL algorithm is currently regarded as the most representative algorithm for PMWL problem, since SAIL firstly adopts the left-most strategy which is applied in different situations and technologies such as BPBM (Guo et al., 2011) algorithm and the mining algorithm MAIL . Based on the above analysis, we have SAIL as a research object, that is, N(UALG) = N(USAIL).

5.
Calculating The experimental results： By the figure 4, as m increases, ε is gradually decreasing. As the gap increases, the trend of ε is decreasing first and then increases, especially when gap = 1 and ε = 1, since the left-most strategy can obtain a complete occurrence set. With the increase of gap, ε begin to decline because when the gap is becoming greater, the probability of matching occurrences overlap is becoming greater and the algorithm is becoming more easily to lose occurrences; when gap is sufficient, although matching occurrences are still overlap, greater gap reserve enough space for matching, making the remaining occurrences which have not yet been still have enough resources. Moreover, it is worth noting that the minimum of these curves can be reached when gap is about 7, and have nothing to do with the pattern length. In figure 5, the trend of curves is the same as in figure 4. The difference between them is curves in figure 5 reach the minimum when gap is about 9~11. It can be found that the impact of Σ, m, and gap on the curves is that the change of gap determines the trend of the curve, m affects the magnitude of this change, and Σ makes the curve do translational move. After a series of experiments, we speculate that ε = A*gap 4 + B*gap 3 + C*gap 2 + D*gap +E, where A, B, C, D and E are parameters and for different m there are different parameters. We try to use this model to illustrate the relation between gap and approximation ratio ε.
Use this parameter table, some of illustrations for m = 3, 4……14 are listed below, where horizontal axis is the gap, vertical axis is the ε. We believe this model can be used to predict the completeness of solutions given a certain pattern. For example, given m = 10, Σ = {a, c, g, t}, gap = 5, this model shows the prediction of approximation ratio ε of SAIL algorithm is about 0.878. Therefore, this model can be used in pattern mining showed as below.

PMWL pattern mining evaluation mechanism
Input: Given T, Σ, m, gap, support sup Output: pattern P As we know, mining algorithm strategy is to learn from the strategy of matching algorithm, so PMWL pattern mining problem is naturally based on PMWL matching problem. For example, in mining algorithm MAIL , although a graph structure is utilized which conducts it different from SAIL; it is still based on the left-most strategy. As a result, they have the same degree of completeness. Therefore, our model can propose an evaluation mechanism for mining.

The impact of pattern rep on completeness
In next part, we will put forward another important concept, named rep, and analyze its impact on completeness. We first give an example to illustrate the reason why this concept is needed. Given m = 4, Σ = {a, c, g, t}, gap = 2, the corresponding patterns maybe P1 = a¢[0,2]c¢[0,2]g¢[0,2]t or P2 = a¢[0,2]c¢[0,2]c¢[0,2]t. They have the same Σ, m and gap. However, when applying SAIL or BPBM, the completeness of solutions is not the same, since for P1 algorithms can obtain complete solutions while for P2 can not.
Considering two examples below: Table 9. Example 1 for rep concept A complete occurrence set of this example is {{0, 3, 5}, {2, 4, 6}}, the number of matching occurrences is 2. It is not difficult to find that, in SAIL algorithm, for position 5, the selection of position 2 as p[1]'s occurrence by the left-most strategy will consume the position for the next matching occurrence. We can guess that, the recurring 'b' character in this pattern affect the quality of matching occurrences.  From above examples, the matching of recurring character in the pattern may determine the completeness of the algorithm. As a result, we consider this repeatability as an element to influence the completeness.
In order to quantify the repeatability, the concept of repeatability, rep, is proposed in this paper.
Definition 9 Given a pattern P = p0p1…pm-1, let fij = (pi, pj) be all binary combinations of characters in pattern P (1) If USAIL is the completeness set, NSAIL = Nopt is satisfied.
(2) Otherwise, NSAIL < Nopt is obtained and there is a conflict between USAIL and Uopt.
Proof: It is obvious to conclude (1)  It would be discussed in the following two cases: ∴There are two of the same letters from different positions in P. ∴According to definition 6, P is an R pattern. For the case of w ≠ k, similarly, it can be proved.
Then we will prove the other condition is impossible, and conclude P is R pattern.
There is a t satisfying: , so the assume is incorrect Due to equation (2) and (3), THEOREM 2 Given a text T, a pattern P, if P is NR pattern, then SAIL is complete.
Proof: It is the inverse negation of THEOREM 1. Apparently, THEOREM 2 is true.
THEOREM 3 Given a text T, a pattern P, if P is R pattern, then SAIL is incomplete.
Proof: It can be concluded from the analysis and example in section 2.
THEOREM 4 If the pattern fulfills gap = 0, SAIL is complete.
Proof: If gap = 0, the wildcard is a constant. For example a¢[1,1]c¢[2,2]c is converted into a¢c¢¢c. There won't be any conflict or exist seizing between occurrences. SAIL will perform complete. Analysis of experimental results: with increment of rep, the curve of approximation ratio gradually decreases, followed by a slight increase. The reason for decline is that rep lead to more nested occurrences, resulting in a greater degree of the possibility of losing occurrences; the reason for the increscent is that larger rep can cause more extreme pattern. For instance, when ∑ = 4, m = 7, rep = 21, patterns like P1 = a¢[0,3]a¢[0,3]a¢[0,3]a¢[0,3]a¢[0,3]a¢[0,3]a which is difficult to find a special text containing nested occurrences of such pattern, will be produced. For P1, the text like "aaaaaaaaaaaaaa" contains nested occurrences of this pattern. Obviously, this extreme text is very rare. Therefore, under the premise of nested occurrences are not easily to be formed, the approximation ratio will be increased slightly. This description is equivalent to the model of 'taking ball from the bag' in the combination mathematics:

Experiment design
There is a bag of balls, and |Σ| kinds of colors, taking m balls from the bag with replacement, then in fetched balls, how many pairs of the same color?

Conclusions
As an extension of traditional matching problem, the PMWL problem has aroused more and more attention because of its unique flexibility and complexity. Based on problem definition and drawing on research idea in traditional matching problem, this article introduces SAIL, RSAIL, SBO and BPBM which are representative algorithms for PMWL in three important respects: the data structures, the matching strategies and the characteristics of pattern. The article also analyzes the pros and cons of the above algorithms from the point of quality of the solution and time complexity, and gives experimental matching results by using real DNA data. Among them, the SAIL algorithm is the first to propose the method of solving PMWL problem, it uses the sliding window structure and the representative left-most matching strategy. This paper finds that in short patterns, the approximation ratio of SAIL is higher than 0.9, while in longer patterns, the occurrences obtained by SAIL are of poor quality; the quality of occurrences obtained by SBO is best, but its time consumption has a non-linear relationship with the length of text; BPBM utilizes bit parallel technology to improve the efficiency of matching greatly, but also is impact by the machine word; for pattern with repeated letters in tail, RSAIL uses symmetry to improve the quality of occurrences under certain conditions, thus providing a solving idea to PMWL problem, but in longer patterns and wilder gaps, the efficiency is not obvious.
Afterwards, this article focus on relationship between approximation ratio ε and alphabet size ∑, pattern length m, wildcards span gap and repeatability rep. Firstly, this article proposes the model ε = F (Σ, m, gap), describing the functional relationship between pattern characteristics and approximation ratio approximately; secondly, this article proves PMWL's completeness under the conditions of rep = 0; finally, the relationship between the pattern features are also analyzed andm in addition, relationship that 2 () || m C Er e p   is proposed.
In future work, the formal description of the PMWL problem will be considered, in order to explain the complexity of the problem better, thus helping algorithm design and analysis for problem complexity.

Author details
Haiping Wang, Taining Xiang and Xuegang Hu Hefei University of Technology, China