Mining Frequent Itemsets over Recent Data Stream Based on Genetic Algorithm

Introduction
A data stream is a massive sequence of data elements generated at a rapid rate, characterized by continuous flow, a high arrival rate, unbounded size and real-time query requests. The knowledge embedded in a data stream is likely to change as time goes by. Identifying recent changes in a data stream, especially an online data stream, can provide valuable information for its analysis. Frequent patterns in a data stream can provide an important basis for decision making and applications. Because of the fluidity and continuity of the data stream, the frequent-pattern information changes as new data arrive.
Mining data streams has been one of the most interesting issues in data mining in recent years. Online mining of data streams is an important technique for real-world applications such as traffic flow management, stock ticker monitoring and analysis, and wireless communication management. In most data stream applications, users tend to pay more attention to the pattern information of the recent data stream. Mining frequent patterns in the recent data stream is therefore challenging work: the mining process should use a one-pass algorithm, update efficiently, use limited space and answer queries online. However, most mining algorithms or frequency approximation algorithms over a data stream cannot efficiently differentiate the information of recently generated data elements from the obsolete information of old data elements, which may be no longer useful or possibly invalid at present.
Many previous studies have contributed to the efficient mining of frequent itemsets over streams. Generally, three processing models are used: the landmark model, the sliding window model and the damped model [1]. The landmark model analyzes the stream in a particular window, which starts from a fixed timestamp called the landmark and ends at the current timestamp. In the sliding window model, the mining process is performed over a sliding window of fixed length, and the oldest data are pruned immediately when new data arrive. The damped model uses the entire stream to compute the frequency with a decay factor d, which makes recent data more important than earlier data.
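To make the difference between the three processing models concrete, the following minimal Python sketch counts the occurrences of one item under each model. The function names, the representation of the stream as a list of sets, and the exact form of the decay update are assumptions made for illustration; they are not taken from [1].

```python
def landmark_count(stream, item, landmark):
    """Count transactions containing `item` from the landmark position to now."""
    return sum(1 for tx in stream[landmark:] if item in tx)


def sliding_window_count(stream, item, window):
    """Count transactions containing `item` among the last `window` transactions."""
    if window <= 0:
        return 0
    return sum(1 for tx in stream[-window:] if item in tx)


def damped_count(stream, item, d):
    """Count over the whole stream; every older contribution is multiplied
    by the decay factor d (0 < d <= 1) each time a new transaction arrives."""
    count = 0.0
    for tx in stream:
        count = count * d + (1.0 if item in tx else 0.0)
    return count


stream = [{"a"}, {"a", "b"}, {"b"}, {"a"}]
print(landmark_count(stream, "a", 1))        # transactions 1..3
print(sliding_window_count(stream, "a", 2))  # last two transactions
print(damped_count(stream, "a", 0.5))        # recent occurrences weigh more
```

With a decay factor below 1, an occurrence in the newest transaction contributes 1 while an occurrence k transactions earlier contributes only d to the power k, which is how the damped model favors recent data without discarding the history.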
DStree is a relatively recent algorithm for mining frequent itemsets that uses the concept of nested sub-windows within a sliding window. DStree separates the data of the current transaction database into blocks and then counts the frequent itemsets in the current window. When the next block of data arrives, the prior block becomes historical data, and the second block replaces the first. Some of the information in the current DStree is retained to prepare for the next generation of the DStree.
The estDec algorithm is an effective way to mine the frequent itemsets of the current online data stream. Each node of the estDec model tree contains a triple (count, error, Id). For the relevant item e, count is its number of occurrences, error is the maximum error count of e, and Id identifies the most recent transaction that contains e. The estDec algorithm is divided into four parts: parameter updating, count updating, delayed insertion and frequent item selection.
Because FP-Tree, DStree and estDec all rely on a model tree, it is difficult to parallelize their computation, and their run time is also difficult to reduce.
With the development of graphics cards, the GPU (Graphics Processing Unit) has become more and more powerful. It has surpassed the CPU not only in graphics but also in scientific computing. CUDA is a parallel computation framework introduced by NVIDIA. It enables the GPU to solve complex computational problems, and it comprises the CUDA instruction set architecture and the parallel computation engine inside the GPU. The GPU is characterized by parallel computation and dense data processing, so CUDA suits large-scale parallel computation very well [12].
This work proposes the NSWGA (Nested Sliding Window Genetic Algorithm) algorithm. First, NSWGA obtains the current data stream through the sliding window and uses a nested sub-window to divide the data stream in the current window into sub-blocks; then, the parallelism of the genetic algorithm and the parallel computation ability of the GPU are used to find frequent itemsets in the nested sub-window; finally, NSWGA derives the frequent patterns in the current window from the frequent patterns of the nested sub-windows.
This chapter is organized as follows. The theoretical foundation is described in Section 2. The Nested Sliding Window Genetic Algorithm for mining frequent itemsets in data streams is designed in Section 3. In Section 4, comprehensive experiments on the algorithm are carried out in the test environment and compared with other methods; an analysis of the algorithm for mining time-sensitive sliding windows is also given in that section. Finally, we summarize the work in Section 5.

Theoretical foundation
The study combines sliding window techniques, frequent itemsets, genetic algorithms and parallel processing technology.
Sliding windows have been used in network communication, time-series data mining, data stream mining and so on. This algorithm uses the sliding window [9,10] to obtain the current data stream.

Definition 1 (sliding window): For a positive number ω1 and a certain time T, if the data sets $D = (d_0, d_1, \ldots, d_n)$ fall into the window SW, whose size is ω1, then SW is called the sliding window.
Definition 2 (nested sub-window): For a positive number ω2 and a certain time T, if the newest data set $d_n$ in the sliding window SW falls into the nested window NSW, whose size is ω2, then NSW is called the nested sub-window.
Figure 1 illustrates how the sliding window is applied to update the data sets dynamically.
Definition 3 (frequent itemsets in the sliding window): For the current data in the sliding window, let $I = \{i_1, i_2, \ldots, i_n\}$ be a collection of items and $S = \{s_1, s_2, \ldots, s_n\}$ a set of transaction items, where each transaction item s is a collection of items, $s \subseteq I$. If $X \subseteq I$, then X is an itemset. If there are k elements in X, we call X a k-itemset. If the support degree of an itemset X is greater than or equal to the minimum support threshold given by the user, then X is called a frequent itemset.
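Definition 3 can be illustrated with a few lines of Python. This is only a sketch: the function names `support` and `is_frequent` and the representation of transaction items as Python sets are assumptions made for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for tx in transactions if items <= frozenset(tx))
    return hits / len(transactions)


def is_frequent(itemset, transactions, min_support):
    """An itemset is frequent when its support reaches the user threshold."""
    return support(itemset, transactions) >= min_support


window = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
print(support({"a", "b"}, window))
print(is_frequent({"a"}, window, 0.75))
```

Here the 2-itemset {a, b} is contained in two of the four transactions, so its support degree is 0.5.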
A genetic algorithm starts the search process from an initial population. Each individual in the population is a possible frequent pattern. We use the genetic algorithm to reach the result mainly through crossover, mutation and selection [8]. After several generations of selection, we obtain the final frequent itemsets. The major rules and operators of the genetic algorithm are as follows:
1. Coding rule: this work uses integer coding. For example, if each transaction item in a data stream has the five attributes A, B, C, D and E, the transaction item coded 21530 expresses that we take the second value of attribute A, the first value of attribute B, and so on in turn; the 0 expresses that the value of attribute E is not considered.
2. Fitness function: $F_i = W_i / W_Z$, where $F_i$ is the support degree of transaction item i, $W_i$ is the number of transaction items that have the same value as i for each attribute, and $W_Z$ is the total number of transaction items in the window.
3. Selection operator: this algorithm uses Roulette Wheel Selection. For individual i with fitness degree $F_i$ and population size M, the probability of being selected is $p_i = F_i / \sum_{j=1}^{M} F_j$.
4. Crossover operator: this algorithm uses One Point Crossover. The crossover operator is mainly used to interchange some genes between the parent chromosomes. Through the operation on two individuals of the parent generation we obtain the daughter generation, which thus inherits the effective patterns of the parent generation.
5. Mutation operator: the algorithm uses Simple Mutation. If the parent chromosome is $A = (a_1 a_2 a_3 \ldots a_i \ldots a_n)$, after the mutation the daughter chromosome becomes $A_1 = (a_1 a_2 a_3 \ldots b_i \ldots a_n)$.
The mutation operation changes some genes randomly to generate new individuals. It is an important means of reaching a global optimum and helps to increase population diversity. In this algorithm, however, the genes required to generate the frequent itemsets already exist, so we use a lower mutation rate.
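The coding rule, the fitness function, the selection operator and the mutation operator can be sketched in Python as follows. This is a minimal illustration rather than the chapter's MATLAB and CUDA C implementation: individuals are assumed to be tuples of integers with 0 meaning "attribute not considered", and the helper names (`matches`, `fitness`, `roulette_select`, `simple_mutation`) are hypothetical.

```python
import random


def matches(individual, transaction):
    """True if the transaction agrees with every non-zero gene (0 = ignored)."""
    return all(g == 0 or g == t for g, t in zip(individual, transaction))


def fitness(individual, window):
    """F_i = W_i / W_Z: share of transactions in the window matching i."""
    if not window:
        return 0.0
    return sum(matches(individual, tx) for tx in window) / len(window)


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability F_i / sum_j F_j."""
    total = sum(fitnesses)
    if total == 0:
        return rng.choice(population)
    r = rng.random() * total  # r lies in [0, total)
    acc = 0.0
    for individual, f in zip(population, fitnesses):
        acc += f
        if r < acc:
            return individual
    return population[-1]  # guard against floating-point round-off


def simple_mutation(individual, values_per_attribute, rng):
    """Replace one randomly chosen gene a_i by a different value b_i."""
    i = rng.randrange(len(individual))
    choices = [v for v in range(values_per_attribute + 1) if v != individual[i]]
    child = list(individual)
    child[i] = rng.choice(choices)
    return tuple(child)


window = [(2, 1, 5, 3, 4), (2, 1, 5, 3, 1), (1, 1, 5, 3, 4), (2, 2, 5, 3, 4)]
print(fitness((2, 1, 5, 3, 0), window))  # the code 21530 from the coding rule
```

In this toy window the individual 21530 matches two of the four transactions, so its fitness (support degree) is 0.5.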
Once the parallel part of the program has been identified, it can be run on the GPU. A function that runs on the GPU is called a kernel (kernel function). A kernel function is not a complete program but the parallel part of the entire CUDA program [13,14]. The execution of a complete CUDA program is shown in Figure 2. The figure shows that a kernel function has two levels of parallelism: the parallel blocks in the grid and the parallel threads in each block.

Algorithm design
NSWGA uses the sliding window to get the recent data and uses genetic algorithms to mine frequent itemsets of the data in the current window.

Algorithm description
Input: the data stream to be mined.
Output: the frequent itemsets of the recent data stream.
The NSWGA algorithm is divided into three parts:
(1) NSWGA uses the parallelism of the genetic algorithm to search for the frequent itemsets of the latest data in the nested sub-window.
(2) The final frequent itemsets of the sliding window are obtained by the integrated treatment of this series of frequent itemsets in the nested sub-windows.
(3) As new data arrive, the expired data are deleted periodically, and the above two operations are repeated.
In the first part, the current frequent itemsets in the NSW are obtained. The process is shown in Figure 3.
Step 1. Set the size ω1 of the sliding window SW and the size ω2 of the nested sub-window NSW. The window sizes are determined by the properties of the data stream: ω1 depends on the number of recent transactions whose frequent itemsets we are interested in, and ω2 depends on the processing capability of the algorithm and on how often statistics are collected. Given the support threshold S and the fitness function $F_i = W_i / W_Z$, transaction item i is a frequent pattern of the data set in the sliding window when $F_i \ge S$.
The number of iterations T depends on the number of attributes in a transaction item, the range of the attribute values and the size of the initial population. The role of the nested sub-window is to avoid repeatedly processing data that remain in the sliding window after the old data have left it. Let the crossover probability be P and the individual mutation probability be Q. To implement parallel computing, the data in the nested sub-window are divided into Z segments.
Fig. 3. The generation of the initial population.
Step 2. Use the nested sub-window to obtain the latest data, find the frequent 1-itemsets of the data, encode the frequent 1-itemsets as integer strings, and combine the frequent 1-itemsets randomly to constitute the initial population in the nested sub-window.
In Step 3, relying on the powerful parallel computing capability of the GPU, the individuals are matched against the Z sections in parallel, which greatly reduces the running time; the process is shown in Figure 5.
Step 4. If the number of iterations is smaller than T, the algorithm jumps back to Step 3.
After T iterations, the iteration finishes and the frequent itemsets in the current nested sub-window are obtained.
In the second part, the final frequent itemsets in the sliding window are obtained. The process is described in Step 5.
Step 5. Constitute the pattern sets from the frequent itemsets obtained this time and the frequent itemsets obtained in the previous M (M = ω1/ω2 - 1) rounds. Carry out a search to determine the final frequent itemsets in the sliding window.
    For i = 1 : M+1
        Constitute the pattern sets;
    End
    Make a parallel search in the sliding window SW;
    When a pattern's support degree is greater than or equal to S, identify it as a final frequent pattern;
The process is shown in Figure 6.
In the third part, the above two operations are repeated dynamically. The process is described in Step 6.
Step 6. As the data stream flows, the algorithm continues to process the newly arriving data and to discard the old data; it returns to Step 2 and repeats the above operations until the data stream comes to an end.
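Steps 5 and 6 can be sketched in Python as follows. The sketch is sequential and only illustrates the bookkeeping: the class name `NestedSlidingWindow`, its methods, and the assumption that ω1 is a multiple of ω2 are hypothetical, and the per-block frequent patterns passed to `push_block` stand in for the output of the genetic search of Steps 2 to 4.

```python
from collections import deque


def matches(pattern, transaction):
    """True if the transaction agrees with every non-zero gene (0 = ignored)."""
    return all(g == 0 or g == t for g, t in zip(pattern, transaction))


class NestedSlidingWindow:
    """Keeps the last omega1/omega2 blocks and their per-block frequent patterns."""

    def __init__(self, omega1, omega2):
        if omega2 <= 0 or omega1 % omega2 != 0:
            raise ValueError("omega1 must be a positive multiple of omega2")
        n_blocks = omega1 // omega2  # equals M + 1 in the notation of Step 5
        self.blocks = deque(maxlen=n_blocks)
        self.candidates = deque(maxlen=n_blocks)

    def push_block(self, block, block_patterns):
        """Step 6: append the newest block; the oldest one expires automatically."""
        self.blocks.append(list(block))
        self.candidates.append(set(block_patterns))

    def final_frequent(self, min_support):
        """Step 5: merge the pattern sets and re-check them on the whole window."""
        window = [tx for block in self.blocks for tx in block]
        if not window:
            return set()
        pattern_set = set().union(*self.candidates)
        return {
            p for p in pattern_set
            if sum(matches(p, tx) for tx in window) / len(window) >= min_support
        }


nsw = NestedSlidingWindow(omega1=4, omega2=2)
nsw.push_block([(1, 1), (1, 2)], {(1, 0)})
nsw.push_block([(1, 1), (2, 1)], {(0, 1)})
print(nsw.final_frequent(0.75))
```

Because both deques are bounded by the number of blocks in the sliding window, pushing a new block discards the expired block and its patterns in one step, so only the newest block ever needs to be searched by the genetic algorithm.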

NSWGA algorithm analysis
Compared with other algorithms that use a pattern tree to maintain the historical information of the data stream, NSWGA processes a batch of data in parallel at one time, whereas the pattern-tree algorithms process a single transaction item at a time and each transaction item must be matched repeatedly. When mining the frequent itemsets of the data in the current window, the time of the whole process depends not only on the number of scans of the data in the window but also on the number of basic internal operations, namely the number of matches.
Suppose a data stream has N transaction items, each transaction item has V attributes, and each attribute has K possible values. The pattern-tree algorithms may have $K^V$ frequent-pattern search paths. Let the window size be N. When the entire data stream in the window has flowed through, the amount of computation needed to get the frequent itemsets is N * K * V.
For the fp-tree algorithm, when the fp-tree has L paths, the amount of computation is 2 * N * V + V * L; L increases as the support threshold decreases.
When the support degree is S, the number of iterations of the genetic algorithm is T, the degree of parallelism is Z (Z is chosen according to the amount of data; in this case Z is set to 200) and the sliding window size is N, the amount of computation needed to get the frequent itemsets is P = P1 + P2 + P3, where:
P1 = N * V is the amount of computation to get the frequent 1-itemsets;
P2 = V * T * N / S * Z is the amount of computation to get the frequent itemsets in the nested sub-window;
P3 = α * V * N / S * Z * M (1 <= α <= 1/S) is the amount of computation to get the final frequent itemsets.
When the number of attribute values K is large, this algorithm has an obvious advantage in time complexity. The larger Z is, the shorter the run time.
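The cost expressions above can be evaluated numerically with a short Python sketch. One point is an assumption: the printed forms of P2 and P3 do not state their grouping, so the sketch reads them as V * T * N / (S * Z) and α * V * N * M / (S * Z), that is, with the parallel degree Z in the denominator, which is consistent with the remark that a larger Z shortens the run time. The function names are hypothetical.

```python
def pattern_tree_cost(n, k, v):
    """N * K * V: cost quoted for the pattern-tree algorithms."""
    return n * k * v


def fp_tree_cost(n, v, paths):
    """2 * N * V + V * L: cost quoted for an fp-tree with L paths."""
    return 2 * n * v + v * paths


def nswga_cost(n, v, t, s, z, m, alpha=1.0):
    """P = P1 + P2 + P3 under the assumed grouping (S * Z in the denominator)."""
    if not 0 < s <= 1:
        raise ValueError("support S must lie in (0, 1]")
    if not 1.0 <= alpha <= 1.0 / s:
        raise ValueError("alpha must satisfy 1 <= alpha <= 1/S")
    p1 = n * v                        # frequent 1-itemsets
    p2 = v * t * n / (s * z)          # genetic search in the nested sub-window
    p3 = alpha * v * n * m / (s * z)  # final search over the sliding window
    return p1 + p2 + p3


print(pattern_tree_cost(1000, 10, 3))
print(nswga_cost(n=1000, v=3, t=10, s=0.5, z=200, m=1))
```

Under this reading, K does not appear in the NSWGA cost at all, which matches the statement that the advantage grows when K is large.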

Experiment
In this experiment, we use artificial data sets and implement the NSWGA algorithm in MATLAB and CUDA C. The performance of the algorithm is tested on a computer with a 2.61 GHz CPU, 2 GB of memory, an NVIDIA C1060 GPU and the Windows XP operating system.
The size of the sliding window is 100K. The size of the data set is 200K. As the data flow, statistics are collected every 10K data items.

Analysis of the experimental results
As shown in Table 1, as the support degree increases, the number of frequent patterns found by the two algorithms decreases rapidly, the number of matches is reduced and the run time eventually falls. However, the fp-tree algorithm not only needs to maintain the global frequent pattern tree but also requires additional time to build a sub-pattern tree for each data segment, whose information it then saves to the global frequent pattern tree. As the number of processing rounds increases, the run time of the fp-tree algorithm becomes longer than that of NSWGA.
Table 2 shows that, as the support degree increases, algorithms that use a pattern tree to maintain the frequent-pattern information, such as the DStree algorithm, cannot reduce the run time, whereas the NSWGA algorithm saves a lot of run time.
In Table 2, each attribute of the simulated data has 10 possible values, and in Table 3 it has 20. As the number of possible values increases, the run time of the DStree algorithm increases greatly, while that of the NSWGA algorithm hardly changes.

Summary
Finding frequent items in a huge data stream is important for prediction and decision-making. This chapter presents an approach, NSWGA (Nested Sliding Window Genetic Algorithm), for mining frequent itemsets in a data stream within the current window. NSWGA uses the parallelism of the genetic algorithm to search for the frequent itemsets of the latest data in the nested sub-window. The final frequent itemsets of the sliding window are obtained by the integrated treatment of this series of frequent itemsets in the nested sub-windows. NSWGA captures the latest frequent itemsets of the data stream accurately and in a timely manner, while the expired data are deleted periodically. Owing to the nested windows and the parallel processing capability of the genetic algorithm, the method reduces the time complexity.
In this chapter, the NSWGA algorithm for mining frequent patterns in data streams is proposed. The main contributions of this algorithm are: (1) the parallelism of the genetic algorithm is used to mine the frequent patterns of the data stream, which reduces the run time; (2) the algorithm combines the sliding window with the genetic algorithm and proposes an improved method of obtaining the initial population; (3) the algorithm guarantees the speed of execution and the query precision.

Fig. 1. Dynamic updating of the data in the sliding window.

4. Crossover: This algorithm uses One Point Crossover. If the parent chromosomes are $A = (a_1 a_2 a_3 \ldots a_i \ldots a_n)$ and $B = (b_1 b_2 b_3 \ldots b_i \ldots b_n)$, then after the crossover operation the daughter chromosomes are $A_1 = (a_1 a_2 a_3 \ldots b_i \ldots b_n)$ and $B_1 = (b_1 b_2 b_3 \ldots a_i \ldots a_n)$.
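A minimal Python sketch of One Point Crossover on integer-coded chromosomes is given below. The function name `one_point_crossover`, the tuple representation and the uniformly random choice of the cut point are assumptions made for illustration.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length chromosomes after a random cut point."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if len(parent_a) < 2:
        return tuple(parent_a), tuple(parent_b)  # nothing to exchange
    point = rng.randrange(1, len(parent_a))  # cut strictly inside the chromosome
    child_a = tuple(parent_a[:point]) + tuple(parent_b[point:])
    child_b = tuple(parent_b[:point]) + tuple(parent_a[point:])
    return child_a, child_b


rng = random.Random(42)
print(one_point_crossover((2, 1, 5, 3, 0), (1, 1, 0, 2, 4), rng))
```

Each daughter keeps the head of one parent and the tail of the other, so at every position the two daughters together carry exactly the two parental genes.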

In Step 2, the frequent 1-itemsets are combined randomly to constitute the initial population in the nested sub-window. The individuals of this population are possible frequent patterns. The procedure is as follows:
(1) Count the occurrences of $I_1$, $I_2$, $I_3$ in attribute A;
(2) Count the occurrences of $I_1$, $I_2$, $I_3$ in attribute B;
(3) Count the occurrences of $I_1$, $I_2$, $I_3$ in attribute C;
(4) Keep the values whose count is greater than or equal to the threshold S and set the others to 0 (in this case, S is 3);
(5) Remove the all-zero rows and keep the non-zero values in their original rows;
(6) Line up every non-zero value, keeping its original position in the row, and fill the remaining positions with 0;
(7) Combine the non-zero items according to their original positions.
Constitute the initial population from the frequent 1-itemsets and the combined items. The process is shown in Figure 4.
Step 3. Calculating the individual fitness degree is the process of matching the individuals of the initial population against the actual transaction items. To realize parallel matching, we divide the data into Z sections. Although this operation increases the memory cost, it reduces the running time, which is important in mining frequent patterns of a data stream. The operations of this step are:
    Make Roulette Wheel Selection according to the fitness degree;
    Make crossover with the crossover probability P;
    Carry out mutation with the mutation probability Q;
    Determine the individual fitness degree after scanning the data;
    Add each individual that satisfies the condition to the frequent itemsets.
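The generation of the initial population described above can be sketched in Python. The sketch makes several assumptions for illustration: transaction items are tuples of non-zero integer codes (0 is reserved for "attribute not considered"), the threshold is an absolute count as in the example with S = 3, and the number of random combinations `n_combinations` is a hypothetical parameter.

```python
import random
from collections import Counter


def frequent_1_itemsets(block, min_count):
    """For each attribute, the values that occur at least `min_count` times."""
    if not block:
        return []
    frequent = []
    for a in range(len(block[0])):
        counts = Counter(tx[a] for tx in block)
        frequent.append(sorted(v for v, c in counts.items() if c >= min_count))
    return frequent


def initial_population(block, min_count, n_combinations, rng):
    """Frequent 1-itemsets plus random combinations that keep each value
    at its original attribute position; unused positions are filled with 0."""
    frequent = frequent_1_itemsets(block, min_count)
    n_attr = len(frequent)
    population = set()
    for a, values in enumerate(frequent):
        for v in values:  # a frequent 1-itemset has exactly one non-zero gene
            population.add(tuple(v if k == a else 0 for k in range(n_attr)))
    for _ in range(n_combinations):
        individual = tuple(rng.choice(vals + [0]) if vals else 0 for vals in frequent)
        if any(individual):  # skip the all-zero individual
            population.add(individual)
    return sorted(population)


block = [(1, 2, 3), (1, 2, 1), (1, 3, 3), (2, 2, 3), (3, 1, 2)]
print(frequent_1_itemsets(block, 3))
print(initial_population(block, 3, 5, random.Random(0)))
```

Because every gene of every individual is either 0 or a frequent value of its attribute, the genetic search starts from candidates that can still be frequent, which is the reason the text gives for using a low mutation rate.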

Fig. 6. The process of obtaining frequent patterns.

1. The simulated data stream has three attributes, and each attribute has 10 possible values. The running results of the algorithms are shown in Table 1.
Table 1. The comparison of the fp-tree algorithm and the NSWGA algorithm.
2. The simulated data stream is the same as above. The running results of the algorithms are shown in Table 2.
Table 2. Comparison 1 of the DStree algorithm and the NSWGA algorithm.
3. The simulated data stream has three attributes, and each attribute has 20 possible values. The running results of the algorithms are shown in Table 3.
Table 3. Comparison 2 of the DStree algorithm and the NSWGA algorithm.