Mining Complex Network Data for Adaptive Intrusion Detection

Intrusion detection is the process of identifying intrusions or misuses in a computer network that compromise the confidentiality and integrity of the network. An Intrusion Detection System (IDS) is a security tool used to monitor network traffic and detect unauthorized activities in the network [23, 28, 30]. A security monitoring surveillance system, an intrusion detection model based on detecting anomalies in user behaviors, was first introduced by James P. Anderson in 1980 [1]. Several intrusion detection models based on statistics, Markov chains, time series, etc. were subsequently proposed by Dorothy Denning in 1986. Host-based IDS, located on the server machine to examine the internal interfaces, was implemented first [35], but with the evolution of computer networks the focus gradually shifted toward network-based IDS [20]. A network-based IDS monitors and analyzes network traffic to detect intrusions from internal and external intruders [26, 27, 34]. In the last decade, many data mining algorithms have been applied to large volumes of network audit data to detect known and unknown intrusions [3, 9, 18, 32, 33]. Even for a small network, the amount of audit data an IDS needs to examine is very large. Using data mining for intrusion detection aims to solve the problem of analyzing these large volumes of audit data and optimizing the performance of detection rules.


Introduction
There are many drawbacks in currently available commercial IDS, such as low and unbalanced detection rates for different types of network attacks, large numbers of false positives, long response times in high-speed networks, and redundant input attributes in intrusion detection training datasets. In general, a conventional intrusion detection dataset is complex, dynamic, and composed of many different attributes. It has been shown that not all the input attributes in an intrusion detection training dataset are needed for training intrusion detection models or detecting intrusions [31]. Redundant attributes interfere with the correct completion of the mining task and increase the complexity of the detection model and the computational time, because the information they add is already contained in other attributes [7]. Ideally, an IDS should have an intrusion detection rate of 100% along with a false positive rate of 0%, which is very difficult to achieve in practice.
Applying mining algorithms for adaptive intrusion detection is the process of collecting network audit data, converting the collected data into a format suitable for mining, and finally developing a clustering or classification model for intrusion detection that provides decision support to intrusion management for detecting known and unknown intrusions by discovering intrusion patterns [4, 5].

Intrusion detection
Intrusion detection is the process of monitoring and analyzing network traffic. It takes sensor data to gather information for detecting intrusions from internal and external networks [6], and notifies the network administrator or intrusion prevention system (IPS) about the attack [19, 24]. Intrusion detection models are broadly classified into three categories: misuse, anomaly, and hybrid [10]. A misuse detection model detects attacks based on known attack patterns already stored in a database, using pattern matching of incoming network packets against the signatures of known intrusions. It begins protecting the network immediately upon installation and produces very few false positives, but it requires frequent signature updates and cannot detect new intrusions. An anomaly-based detection model detects deviations from normal behavior to identify new intrusions [22]. It creates a normal profile of the network, and any action that deviates from the normal profile is treated as a possible intrusion; this produces a large number of false positives. A hybrid detection model combines both misuse and anomaly detection [2], making decisions based on both the normal behavior of the network and the intrusive behavior of the intruders. Table 1 shows a comparison of misuse, anomaly, and hybrid detection models. Detection rate (DR) and false positive (FP) rate are the most important parameters used for performance estimation of intrusion detection models [8]. The detection rate is the number of intrusions detected by the IDS divided by the total number of intrusion instances present in the intrusion dataset, and a false positive is an alarm raised for something that is not really an attack. This is expressed by equation 1:

DR = (intrusions detected by the IDS) / (total intrusion instances in the dataset)   (1)
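As a concrete illustration, DR and the FP rate can be computed directly from raw counts; the sketch below assumes simple integer tallies (the figures used are hypothetical) rather than any particular IDS output format.

```python
def detection_rate(detected_intrusions, total_intrusions):
    # DR: fraction of intrusion instances in the dataset that the IDS flags.
    return detected_intrusions / total_intrusions

def false_positive_rate(false_alarms, total_normal):
    # FP rate: fraction of normal connections wrongly flagged as attacks.
    return false_alarms / total_normal

# Hypothetical figures: 950 of 1000 intrusions detected,
# 30 false alarms over 5000 normal connections.
dr = detection_rate(950, 1000)
fpr = false_positive_rate(30, 5000)
```

The ideal IDS of the previous paragraph would have `dr == 1.0` and `fpr == 0.0`; real systems trade one against the other.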

Intrusion detection dataset
The data generated by an IDS contain information about network topology, hosts, and other confidential information; for this reason intrusion detection datasets cannot be shared in the public domain. It is not a difficult task, however, to generate a large set of intrusion detection alerts by running an IDS in a private or Internet-exposed network. The major challenge in generating an intrusion detection dataset is labeling the network data, because network data are unlabeled and it is not clear which alerts are false positives and which are true positives. The KDD 1999 cup benchmark intrusion detection dataset is the most popular dataset for intrusion detection research. It was first used in the 3rd International Knowledge Discovery and Data Mining Tools Competition for building a network intrusion detector, a predictive model capable of distinguishing between intrusions and normal connections [29]. A simulated environment was set up by the MIT Lincoln Lab to acquire raw TCP/IP dump data for a local-area network (LAN) to compare the performance of various IDS; these data form part of the KDD99 dataset. Examples in the KDD99 dataset represent attribute values of a class in the network data flow, and each class is labeled either normal or attack. Each network connection in the KDD99 dataset is described by 41 attributes with either discrete or continuous values. These attributes are divided into three groups: basic features, content features, and statistical features of the network connection.
The classes in the KDD99 dataset are categorized into five classes: normal, denial of service (DoS), remote to user (R2L), user to root (U2R), and probing. Normal connections are generated by simulated daily user behavior. Denial of service is an attack that makes the computing power or memory of a victim machine too busy or too full to handle legitimate requests. Remote to user is an attack in which a remote user gains access to a local user account by sending packets to a machine over a network. User to root is an attack in which an intruder begins with access to a normal user account and then becomes a root user by exploiting various vulnerabilities of the system. Probing is an attack that scans a network to gather information or find known vulnerabilities. In the KDD99 dataset the four main attack classes are divided into 22 different attack types, which are tabulated in Tables 3 and 4.
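As an illustration, the grouping of individual KDD99 attack labels into the four attack categories (plus normal) can be represented as a simple lookup table; only a representative subset of the 22 attack types is shown here.

```python
# Representative subset of KDD99 attack labels mapped to the four
# main attack categories ('normal' marks legitimate traffic).
ATTACK_CATEGORY = {
    "normal": "normal",
    "smurf": "dos", "neptune": "dos", "back": "dos",
    "ipsweep": "probe", "portsweep": "probe", "nmap": "probe", "satan": "probe",
    "guess_passwd": "r2l", "ftp_write": "r2l", "warezmaster": "r2l",
    "buffer_overflow": "u2r", "rootkit": "u2r", "loadmodule": "u2r",
}

def categorize(label):
    # Labels outside the table are treated as an unseen attack type.
    return ATTACK_CATEGORY.get(label, "unknown")
```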

Combining naïve Bayesian and Decision Tree
In this section, we present the hybrid learning algorithms NBDTAID (naïve Bayesian with Decision Tree for Adaptive Intrusion Detection) [14], ACDT (Attacks Classification using Decision Tree) [13], and Attribute Weighting with Adaptive NBTree [11, 15] for classifying intrusions in the intrusion detection problem. The presented algorithms achieve balanced detection rates and keep false positives at an acceptable level for different types of network intrusions. It has been shown that by combining the properties of the naïve Bayesian classifier and the decision tree classifier, the performance of an intrusion detection classifier can be enhanced.

Adaptive intrusion classifier
Naïve Bayesian with Decision Tree for Adaptive Intrusion Detection (NBDTAID) achieves balanced detection rates and keeps false positives at an acceptable level in intrusion detection. NBDTAID eliminates redundant attributes and contradictory examples from the training data, and addresses several mining difficulties such as handling continuous attributes, dealing with missing attribute values, and reducing noise in training data [14].
Given training data D = {t_1, · · · , t_n} where t_i = {t_i1, · · · , t_ih}, the training data D contains the attributes {A_1, A_2, · · · , A_n}, and each attribute A_i contains the attribute values {A_i1, A_i2, · · · , A_ih}. The attribute values can be discrete or continuous. The training data D also contains a set of classes C = {C_1, C_2, · · · , C_m}, and each example in D has a particular class C_j. The algorithm first searches for multiple copies of the same example in the training data D and, if found, keeps only one unique example (two examples are considered identical when all their attribute values are equal). The algorithm then discretizes the continuous attributes in D by finding each adjacent pair of continuous attribute values that are not classified into the same class value for that continuous attribute. Next, the algorithm calculates the prior probabilities P(C_j) and conditional probabilities P(A_ij|C_j) in D. The prior probability P(C_j) for each class is estimated by counting how often each class occurs in D. For each attribute A_i, the number of occurrences of each attribute value A_ij can be counted to determine P(A_ij). Similarly, the conditional probability P(A_ij|C_j) for each attribute value A_ij can be estimated by counting how often each attribute value occurs in the class in D. The algorithm then classifies all the examples in D with these prior P(C_j) and conditional P(A_ij|C_j) probabilities. To classify an example, the prior and conditional probabilities are used to make the prediction by combining the effects of the different attribute values of that example. Suppose the example e_i has independent attribute values {A_i1, A_i2, · · · , A_ip}; we know P(A_ik|C_j) for each class C_j and attribute A_ik.
We then estimate P(e_i|C_j) by

P(e_i|C_j) = ∏_{k=1}^{p} P(A_ik|C_j).

To classify the example, the algorithm estimates the likelihood that e_i is in each class. The probability that e_i is in a class is the product of the conditional probabilities for each attribute value with the prior probability for that class. The posterior probability P(C_j|e_i) is then found for each class, and the example is classified with the class having the highest posterior probability. After classifying all the training examples, the class value of each example in the training data D is updated with the maximum likelihood (ML) of the posterior probability P(C_j|e_i).
The algorithm then recalculates the prior P(C_j) and conditional P(A_ij|C_j) probabilities using the updated class values in the training data D, and again classifies all the examples using these probabilities. If any training example is misclassified, the algorithm calculates the information gain of each attribute {A_1, A_2, · · · , A_n} in D.
It then chooses the attribute A_i with the highest information gain among {A_1, A_2, · · · , A_n} and splits the training data D into sub-datasets {D_1, D_2, · · · , D_n} according to the values of A_i. The algorithm then estimates the prior and conditional probabilities for each sub-dataset and proceeds recursively. The main procedure is summarized in Algorithm 1.

Algorithm 1 NBDTAID
1: Search for multiple copies of the same example in D; if found, keep only one unique example.
2: Discretize the continuous attributes in D by finding each adjacent pair of continuous attribute values that are not classified into the same class value for that continuous attribute.
3: Calculate the prior probabilities P(C_j) and conditional probabilities P(A_ij|C_j) in D.
4: Classify all the training examples using these prior and conditional probabilities.
5: Update the class value of each example with the maximum likelihood of the posterior probability P(C_j|e_i).
6: Recalculate the prior P(C_j) and conditional P(A_ij|C_j) probabilities using the updated class values in D.
7: Again classify all training examples in D using the updated probability values.
8: If any training example in D is misclassified, calculate the information gain of each attribute A_i = {A_1, A_2, · · · , A_n} in D using equation 11.
9: Choose the best attribute A_i from D with the maximum information gain value.
10: Split dataset D into sub-datasets {D_1, D_2, · · · , D_n} depending on the attribute values of A_i.
11: Calculate the prior P(C_j) and conditional P(A_ij|C_j) probabilities of each sub-dataset D_i.
12: Classify the examples of each sub-dataset D_i with their respective prior and conditional probabilities.
13: If any example of a sub-dataset D_i is misclassified, calculate the information gain of the attributes in D_i, choose the attribute with the maximum gain value, split D_i into sub-sub-datasets D_ij, calculate the probabilities for each D_ij, and classify the examples in each D_ij using their respective probabilities.
14: Continue until all examples are correctly classified or no further split is possible.
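The counting at the heart of the algorithm is ordinary naïve Bayesian estimation. A minimal Python sketch (plain naïve Bayes over discrete attributes, without NBDTAID's discretization, class-relabelling, and tree-splitting steps; the toy data are hypothetical) might look like:

```python
from collections import Counter, defaultdict

def train_nb(examples, labels):
    # Estimate P(C_j) and P(A_ij|C_j) by frequency counting.
    n = len(examples)
    class_count = Counter(labels)
    prior = {c: k / n for c, k in class_count.items()}
    cond = defaultdict(lambda: defaultdict(Counter))  # cond[class][attr_index][value]
    for ex, c in zip(examples, labels):
        for i, v in enumerate(ex):
            cond[c][i][v] += 1
    return prior, cond, class_count

def classify_nb(example, prior, cond, class_count):
    # Posterior is proportional to P(C_j) * prod_k P(A_ik|C_j); take the argmax class.
    best_class, best_p = None, -1.0
    for c, p in prior.items():
        for i, v in enumerate(example):
            p *= cond[c][i][v] / class_count[c]
        if p > best_p:
            best_class, best_p = c, p
    return best_class

# Toy audit-log data with two discrete attributes (hypothetical values):
examples = [("tcp", "http"), ("tcp", "http"), ("udp", "dns"), ("udp", "dns")]
labels = ["normal", "normal", "attack", "attack"]
prior, cond, class_count = train_nb(examples, labels)
pred = classify_nb(("tcp", "http"), prior, cond, class_count)
```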

Intrusions Classification using Decision Tree
Attacks Classification using Decision Tree (ACDT) for anomaly-based network intrusion detection [13] addresses the problem of classifying different types of network attacks in intrusion detection.
Given a dataset, the ACDT algorithm first initializes the weight of each example to W_i = 1/n, where n is the total number of examples in the dataset. The algorithm then estimates the prior probability P(C_j) for each class by summing the weights of the examples of that class. For each attribute A_i, the number of occurrences of each attribute value A_ij is counted by summing weights to determine P(A_ij). Similarly, the conditional probabilities P(A_ij|C_j) are estimated for all attribute values by summing the weights of the examples having that attribute value in class C_j. The ACDT algorithm then uses these probabilities to update the weight of each example in the dataset. This is done by multiplying the probabilities of the example's different attribute values. Suppose the example e_i has independent attribute values {A_i1, A_i2, · · · , A_ip}; we already know P(A_ik|C_j) for each class C_j and attribute A_ik. We then estimate P(e_i|C_j) by

P(e_i|C_j) = ∏_{k=1}^{p} P(A_ik|C_j).

To update the weight, the algorithm estimates the likelihood of e_i in each class C_j. The probability that e_i is in a class is the product of the conditional probabilities for each attribute value. The posterior probability P(C_j|e_i) is then found for each class, and the weight of the example is updated with the highest posterior probability for that example. Finally, the algorithm calculates the information gain using the updated weights and builds a tree for decision making. Algorithm 2 describes the main procedure of the learning process:

Algorithm 2 ACDT
1: Initialize the weight of each example in D to W_i = 1/n.
2: Calculate the prior probabilities P(C_j) by summing the weights of each class in D.
3: Calculate the conditional probabilities P(A_ij|C_j) by summing the weights of each attribute value in class C_j.
4: Calculate the posterior probabilities for each example in D.
5: Update the weight of each example with the highest posterior probability.
6: Find the splitting attribute with the highest information gain using the updated weights W_i in D.
7: T = create the root node and label it with the splitting attribute.
8: For each branch of T, set D to the dataset created by applying the splitting predicate to D, and continue steps 1 to 7 until each final subset belongs to the same class or a leaf node is created.
9: When the decision tree construction is completed, the algorithm terminates.
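The weight-based counting that distinguishes ACDT from plain frequency counting can be sketched as follows (priors only; the conditional probabilities are accumulated the same way, and the example data are hypothetical):

```python
from collections import defaultdict

def weighted_priors(labels, weights):
    # P(C_j) estimated by summing example weights per class, not raw counts.
    total = sum(weights)
    per_class = defaultdict(float)
    for c, w in zip(labels, weights):
        per_class[c] += w
    return {c: s / total for c, s in per_class.items()}

labels = ["normal", "normal", "attack"]
weights = [1 / 3, 1 / 3, 1 / 3]  # initial W_i = 1/n
priors = weighted_priors(labels, weights)
```

With equal initial weights this reduces to ordinary frequency counting; the two diverge once the weights are updated from the posterior probabilities.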

Attribute Weighting with Adaptive NBTree
This subsection presents two learning algorithms, the Attribute Weighting algorithm and the Adaptive NBTree algorithm, for reducing false positives in intrusion detection [11]. The approach is based on decision-tree-based attribute weighting with an adaptive naïve Bayesian tree (NBTree), which not only reduces false positives to an acceptable level but also scales up the detection rate for different types of network intrusions. It estimates the degree of attribute dependency by constructing a decision tree and considers the depth at which attributes are tested in the tree. In an NBTree, internal nodes split as in a regular decision tree, but the leaves contain naïve Bayesian classifiers. The purpose of this subsection is to identify important input attributes for intrusion detection in a way that is computationally efficient and effective.

Attribute Weighting Algorithm
Given training data D with attributes {A_1, A_2, · · · , A_n} and a set of classes C = {C_1, C_2, · · · , C_m}, each example in the training data carries a weight, w = {w_1, w_2, · · · , w_n}. Initially, all example weights are set to the equal value w_i = 1/n, where n is the total number of training examples. The algorithm estimates the prior probability P(C_j) for each class by summing the weights of the examples of that class in the training data. For each attribute A_i, the number of occurrences of each attribute value A_ij is counted by summing weights to determine P(A_ij). Similarly, the conditional probability P(A_ij|C_j) is estimated by summing the weights of the examples having that attribute value in class C_j; the conditional probabilities P(A_ij|C_j) are estimated for all attribute values. The algorithm then uses the prior and conditional probabilities to update the weights. This is done by multiplying the probabilities of the different attribute values of each example. Suppose the training example e_i has independent attribute values {A_i1, A_i2, · · · , A_ip}; we already know the prior probabilities P(C_j) and conditional probabilities P(A_ik|C_j) for each class C_j and attribute A_ik. We then estimate P(e_i|C_j) by

P(e_i|C_j) = ∏_{k=1}^{p} P(A_ik|C_j).

To update the weight of training example e_i, we estimate the likelihood of e_i for each class. The probability that e_i is in a class is the product of the conditional probabilities for each attribute value. The posterior probability P(C_j|e_i) is then found for each class. The weight of the example is updated with the highest posterior probability for that example, and the class value is also updated according to the highest posterior probability. The algorithm then calculates the information gain using the updated weights and builds a tree. After the tree construction, the algorithm initializes a weight for each attribute in the training data D.
If an attribute in the training data is not tested in the tree, its weight is initialized to 0; otherwise, the algorithm finds the minimum depth d at which the attribute is tested and initializes the attribute's weight to 1/√d. Finally, the algorithm removes all attributes with zero weight from the training data D. The main procedure of the algorithm is described in Algorithm 3.
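The depth-based weighting rule is straightforward to express in code; this sketch assumes the minimum test depth of each attribute has already been extracted from the constructed tree (the attribute names and depths below are hypothetical, with depth 1 for the root test):

```python
import math

def attribute_weights(attributes, min_depth):
    # Weight 1/sqrt(d) for an attribute first tested at depth d in the tree;
    # 0 for attributes never tested (these are later removed from the data).
    return {a: (1.0 / math.sqrt(min_depth[a]) if a in min_depth else 0.0)
            for a in attributes}

weights = attribute_weights(
    ["duration", "protocol", "src_bytes"],
    {"protocol": 1, "src_bytes": 4},  # hypothetical depths from the tree
)
# protocol -> 1.0 (root test), src_bytes -> 0.5, duration -> 0.0 (removed)
```

Attributes tested nearer the root thus receive larger weights, reflecting their stronger influence on the class.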

Adaptive NBTree Algorithm
Algorithm 3 Attribute Weighting Algorithm
1: Initialize the weight of each example in D to w_i = 1/n.
2: Calculate the prior probabilities P(C_j) by summing the weights of each class in D.
3: Calculate the conditional probabilities P(A_ij|C_j) by summing the weights of each attribute value in class C_j.
4: Calculate the posterior probabilities for each example in D.
5: Update the weight and class value of each example with the highest posterior probability.
6: Calculate the information gain of each attribute using the updated weights W_i in D.
7: Find the splitting attribute with the highest information gain using the updated weights W_i in D.
8: T = create the root node and label it with the splitting attribute.
9: For each branch of T, set D to the dataset created by applying the splitting predicate to D, and continue steps 1 to 8 until each final subset belongs to the same class or a leaf node is created.
10: When the decision tree construction is completed, for each attribute in the training data D: if the attribute is not tested in the tree, initialize its weight to 0; otherwise, let d be the minimum depth at which the attribute is tested in the tree and initialize its weight to 1/√d.
11: Remove all attributes with zero weight from the training data D.

Given training data D in which each attribute A_i and each example e_i carries a weight value, the Adaptive NBTree algorithm estimates the prior probability P(C_j) and conditional probability P(A_ij|C_j) from the training dataset using the weights of the examples. It then classifies all the examples in the training dataset using these prior and conditional probabilities, incorporating the attribute weights into the naïve Bayesian formula:

P(e_i|C_j) = ∏_{i=1}^{p} P(A_ij|C_j)^{W_i},

where W_i is the weight of attribute A_i. If any example of the training dataset is misclassified, then for each attribute A_i the algorithm evaluates the utility u(A_i) of a split on attribute A_i. Let j = argmax_i(u_i), i.e., the attribute with the highest utility. If u_j is not significantly better than the utility of the current node, a naïve Bayesian classifier is created for the current node. Otherwise, the training data D is partitioned according to the test on attribute A_i: if A_i is continuous, a threshold split is used; if A_i is discrete, a multi-way split is made for all possible values. For each child, the algorithm is called recursively on the portion of D that matches the test leading to the child. The main procedure of the algorithm is described in Algorithm 4.

Algorithm 4 Adaptive NBTree Algorithm
Input: Training dataset D of labeled examples.
Output: A hybrid decision tree with naïve Bayesian classifiers at the leaves.
Procedure:
1: Calculate the prior probabilities P(C_j) for each class C_j in D.
2: Calculate the conditional probabilities P(A_ij|C_j) for each attribute value in D.
3: Classify each example in D with the maximum posterior probability.
4: If any example in D is misclassified, then for each attribute A_i evaluate the utility u(A_i) of a split on attribute A_i.
5: Let j = argmax_i(u_i), i.e., the attribute with the highest utility.
6: If u_j is not significantly better than the utility of the current node, create a naïve Bayesian classifier for the current node and return.
7: Partition the training data D according to the test on attribute A_i. If A_i is continuous, a threshold split is used; if A_i is discrete, a multi-way split is made for all possible values.
8: For each child, call the algorithm recursively on the portion of D that matches the test leading to the child.
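The attribute-weighted naïve Bayesian formula used at the leaves raises each conditional probability to the power of its attribute weight, so zero-weight attributes drop out of the product. A minimal sketch for one example and one class, assuming the probabilities are already estimated (the numeric values are hypothetical):

```python
def weighted_nb_posterior(prior, cond_probs, attr_weights):
    # P(C_j) * prod_i P(A_i|C_j) ** W_i for one example and one class.
    # cond_probs[i] = P(A_i|C_j); attr_weights[i] = W_i.
    p = prior
    for cp, w in zip(cond_probs, attr_weights):
        p *= cp ** w
    return p

# A zero-weight attribute contributes a factor of 1, i.e. it is ignored:
score = weighted_nb_posterior(0.5, [0.8, 0.9], [1.0, 0.0])  # 0.5 * 0.8 * 1
```

Classification then picks the class with the largest such score, exactly as in the unweighted case.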

Clustering, boosting, and bagging
In this section, we present the IDNBC (Intrusion Detection through naïve Bayesian with Clustering) algorithm [12] and the Boosting [16] and Bagging [17] algorithms for adaptive intrusion detection. The Boosting algorithm builds a series of classifiers and combines the votes of each individual classifier to classify intrusions using the NB classifier. The Bagging algorithm ensembles the ID3, NB, and k-Nearest-Neighbor classifiers for intrusion detection, which improves the detection rate and reduces false positives. The purpose of this section is to combine several classifiers to improve the classification of different types of network intrusions.

Naïve Bayesian with clustering
It has been observed that a single probability set derived from the whole dataset is not sufficient for a good classification rate. This subsection presents the IDNBC (Intrusion Detection through naïve Bayesian with Clustering) algorithm for mining network logs to detect network intrusions with the NB classifier [12]. The algorithm clusters the network logs into several groups based on the similarity of the logs and then calculates a probability set for each cluster. To classify a new log, the algorithm checks which cluster the log belongs to and then uses that cluster's probability set to classify it.
Given a database D = {t_1, t_2, · · · , t_n} where t_i = {t_i1, t_i2, · · · , t_ih}, the database D contains the attributes {A_1, A_2, · · · , A_n}, and each attribute A_i contains the attribute values {A_i1, A_i2, · · · , A_ih}. The attribute values can be discrete or continuous. The database D also contains a set of classes C = {C_1, C_2, · · · , C_m}, and each example in D has a particular class C_j. The algorithm first clusters the database D into several clusters {D_1, D_2, · · · , D_K} depending on the similarity of the examples in D. Given a similarity measure sim(t_i, t_l) defined between any two examples t_i, t_l in D, and an integer value K, clustering defines a mapping f : D → {1, · · · , K} where each t_i is assigned to one cluster. If two examples match in two attribute values, the similarity is 0.5; if they match in only one attribute value, the similarity is 0.25, and so on. The algorithm then calculates the prior probabilities P(C_j) and conditional probabilities P(A_ij|C_j) for each cluster. The prior probability P(C_j) for each class is estimated by counting how often each class occurs in the cluster. For each attribute A_i, the number of occurrences of each attribute value A_ij can be counted to determine P(A_ij). Similarly, the conditional probability P(A_ij|C_j) for each attribute value A_ij can be estimated by counting how often each attribute value occurs in the class within the cluster. To classify a new example whose attribute values are known but whose class value is unknown, the algorithm checks which cluster the new example belongs to and then uses that cluster's probability set; the prior and conditional probabilities are used to make the prediction.
This is done by combining the effects of the different attribute values of the example. Suppose the example e_i has independent attribute values {A_i1, A_i2, · · · , A_ip}, and we know P(A_ik|C_j) for each class C_j and attribute A_ik. We then estimate P(e_i|C_j); the probability that e_i is in a class is the product of the conditional probabilities for each attribute value with the prior probability for that class. The posterior probability P(C_j|e_i) is then found for each class, and the example is classified with the highest posterior probability. The main procedure of the algorithm is described in Algorithm 5.
Algorithm 5 IDNBC
1: Define the similarity measure sim(t_i, t_l) between the examples in D.
2: Cluster the database D into clusters {D_1, D_2, · · · , D_K} based on the similarity of the examples.
3: For each cluster D_i, calculate the prior probabilities P(C_j).
4: For each cluster D_i, calculate the conditional probabilities P(A_ij|C_j).
5: For each cluster D_i, store the prior probabilities, S_1 = P(C_j), and the conditional probabilities, S_2 = P(A_ij|C_j).
6: To classify a new example, check which cluster the example belongs to and then use that cluster's probability set to classify it.
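One reading of the attribute-matching similarity, assuming examples with four discrete attributes where each matching attribute value contributes equally (so two matches give 0.5 and one match gives 0.25, as in the text), can be sketched as:

```python
def similarity(t1, t2):
    # Fraction of attribute positions where the two examples agree.
    matches = sum(1 for a, b in zip(t1, t2) if a == b)
    return matches / len(t1)

def assign_cluster(example, representatives):
    # The mapping f: D -> {1, ..., K}: pick the most similar cluster,
    # here represented by one representative example per cluster.
    return max(range(len(representatives)),
               key=lambda k: similarity(example, representatives[k]))

# Hypothetical cluster representatives over four attributes:
reps = [("tcp", "http", "SF", "0"), ("udp", "dns", "S0", "1")]
cluster = assign_cluster(("udp", "dns", "S0", "0"), reps)  # closer to cluster 1
```

The representative-based assignment is an assumption for illustration; the source only specifies the pairwise similarity and the mapping f.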

Boosting
Adaptive intrusion detection using boosting and the naïve Bayesian classifier [16] considers a series of classifiers and combines the votes of each individual classifier to classify an unknown or known intrusion. The algorithm generates a probability set for each round using the naïve Bayesian classifier and updates the weights of the training examples based on the misclassification error rate produced by the training examples in each round.
Given training data D = {t_1, · · · , t_n}, where t_i = {t_i1, · · · , t_ih}, with attributes {A_1, A_2, · · · , A_n}, each attribute A_i contains the attribute values {A_i1, A_i2, · · · , A_ih}. The training data D also contains a set of classes C = {C_1, C_2, · · · , C_m}. In each round, the prior probability P(C_j) for each class is estimated by counting how often each class occurs in the dataset D_i. For each attribute A_i, the number of occurrences of each attribute value A_ij can be counted to determine P(A_ij). Similarly, the class conditional probability P(A_ij|C_j) for each attribute value A_ij can be estimated by counting how often each attribute value occurs in the class in D_i. The algorithm then classifies all the training examples in D with these prior P(C_j) and class conditional P(A_ij|C_j) probabilities from D_i. To classify the examples, the prior and conditional probabilities are combined across the different attribute values of each example. Suppose the example e_i has independent attribute values {A_i1, A_i2, · · · , A_ip}; we know P(A_ik|C_j) for each class C_j and attribute A_ik. We then estimate P(e_i|C_j) using equation 14.
To classify an example, the probability that e_i is in a class is the product of the conditional probabilities for each attribute value with the prior probability for that class. The posterior probability P(C_j|e_i) is then found for each class, and the example is classified with the highest posterior probability value. The algorithm classifies each example t_i in D with the maximum posterior probability. After that, the weights of the training examples t_i in D are adjusted according to how they were classified: if an example was misclassified, its weight is increased; if it was correctly classified, its weight is decreased.
To update the weights of the training data D, the algorithm computes the misclassification rate, the sum of the weights of the training examples t_i in D that were misclassified:

error(M_i) = Σ_i W_i × err(t_i),

where err(t_i) is the misclassification error of example t_i: if the example t_i was misclassified, err(t_i) is 1; otherwise, it is 0. The misclassification rate affects how the weights of the training examples are updated. If a training example was correctly classified, its weight is multiplied by error(M_i)/(1 − error(M_i)), and the weights of all examples are then normalized. Alternatively, we can set the number of rounds for which the algorithm iterates the process. To classify a new or unseen example, the probabilities of every round are used (each round is considered a classifier), and the new example takes the class with the highest classifiers' vote. The main procedure of the boosting algorithm is described in Algorithm 6.
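The weight-update step can be sketched as follows; the multiplier error/(1 − error) for correctly classified examples, followed by normalization, matches the AdaBoost-style rule the text describes (the weight vector used in the demonstration is hypothetical):

```python
def update_weights(weights, misclassified):
    # Reweight training examples after one boosting round.
    # `misclassified` is a parallel list of booleans, i.e. err(t_i).
    error = sum(w for w, m in zip(weights, misclassified) if m)
    factor = error / (1.0 - error)
    # Shrink the weights of correctly classified examples...
    new = [w * factor if not m else w for w, m in zip(weights, misclassified)]
    # ...then normalize so the weights sum to 1 again.
    total = sum(new)
    return [w / total for w in new]

# One of four equally weighted examples was misclassified (error = 0.25):
new_weights = update_weights([0.25, 0.25, 0.25, 0.25], [False, False, False, True])
# the misclassified example's relative weight rises from 0.25 to 0.5
```

The next round's naïve Bayesian probabilities are then estimated from these updated weights, focusing the classifier on the hard examples.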

Bagging
Classification of streaming data based on bootstrap aggregation (bagging) [17] creates an ensemble model using the ID3, naïve Bayesian, and k-Nearest-Neighbor classifiers in a learning scheme where each classifier gives a weighted prediction.
Given a dataset D of d examples, the dataset contains the attributes {A_1, A_2, · · · , A_n}, each attribute A_i contains the attribute values {A_i1, A_i2, · · · , A_ih}, and the dataset contains a set of classes C = {C_1, C_2, · · · , C_m}, where each example in D has a particular class C_j. The algorithm first generates the training dataset D_i from the given dataset D using the selection-with-replacement technique. It is very likely that some of the examples from the dataset D will occur more than once in the training dataset D_i. The algorithm then builds three classifiers using the ID3, naïve Bayesian (NB), and k-Nearest-Neighbor (kNN) classifiers.
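The selection-with-replacement step can be sketched in a few lines; each ensemble member is then trained on its own bootstrap sample:

```python
import random

def bootstrap_sample(dataset, rng=random):
    # Draw len(dataset) examples with replacement: some examples will
    # typically appear more than once, others not at all.
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)  # fixed seed for reproducibility
sample = bootstrap_sample(list(range(10)), rng)
```

On average roughly 63% of the distinct examples of D appear in each bootstrap sample, which is what gives the ensemble members their diversity.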
The basic strategy used by the ID3 classifier is to choose the splitting attribute with the highest information gain first and then build a decision tree. The amount of information associated with an attribute value is related to its probability of occurrence. The concept used to quantify information is called entropy, which measures the amount of randomness in a data set. When all data in a set belong to a single class there is no uncertainty, and the entropy is zero. The objective of decision tree classification is to iteratively partition the given data set into subsets where all elements in each final subset belong to the same class. The entropy calculation is shown in equation 16: given probabilities p1, p2, · · · , ps for the different classes in the data set D, H(D) = ∑(i=1..s) pi log(1/pi). Given a data set D, H(D) measures the amount of entropy in the class-based subsets of the data set. When that set is split into s new subsets S = {D1, D2, · · · , Ds} using some attribute, we can again look at the entropy of those subsets. A subset of the data set is completely ordered and needs no further split if all examples in it belong to the same class. The ID3 algorithm calculates the information gain of a split using equation 17, Gain(D, S) = H(D) − ∑(i=1..s) P(Di) H(Di), and chooses the split that provides the maximum information gain.
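The entropy and information-gain calculations of equations 16 and 17 can be sketched directly; the tabular row layout (attribute columns followed by a class label) is an assumption for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = sum_i p_i * log2(1/p_i) over the class distribution of D.
    Zero when every example belongs to a single class."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, label_idx=-1):
    """Gain(D, S) = H(D) - sum_i P(Di) * H(Di), where the Di are the
    subsets produced by splitting D on the values of attribute `attr`."""
    labels = [r[label_idx] for r in rows]
    total = entropy(labels)
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr], []).append(r[label_idx])
    remainder = sum(len(s) / len(rows) * entropy(s)
                    for s in subsets.values())
    return total - remainder
```

ID3 evaluates `info_gain` for every candidate attribute at a node and splits on the one with the largest value.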
The naïve Bayesian (NB) classifier calculates the prior probability P(Cj) and the class conditional probability P(Aij|Cj) from the dataset. To classify an example, the NB classifier uses these prior and conditional probabilities to predict the class of that example. The prior probability P(Cj) for each class is estimated by counting how often each class occurs in the dataset Di. For each attribute Ai, the number of occurrences of each attribute value Aij can be counted to determine P(Aij). Similarly, the class conditional probability P(Aij|Cj) for each attribute value Aij can be estimated by counting how often each attribute value occurs within each class in the dataset Di.
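These counting-based estimates can be sketched as follows; the toy connection records (protocol and service values with a class label) are hypothetical, and no smoothing is applied, matching the plain frequency counts described above.

```python
from collections import Counter, defaultdict

def train_nb(rows):
    """Estimate P(Cj) and the counts behind P(Aij|Cj) from labeled rows,
    where each row is (attribute values..., class label)."""
    class_counts = Counter(label for *_, label in rows)
    cond = defaultdict(Counter)      # (attribute index, class) -> value counts
    for *attrs, label in rows:
        for i, v in enumerate(attrs):
            cond[(i, label)][v] += 1
    n = len(rows)
    priors = {c: k / n for c, k in class_counts.items()}   # P(Cj)
    return priors, cond, class_counts

def predict_nb(model, example):
    """Pick the class maximizing P(Cj) * prod_i P(Aij|Cj)."""
    priors, cond, class_counts = model
    best_class, best_score = None, -1.0
    for c, p in priors.items():
        score = p
        for i, v in enumerate(example):
            score *= cond[(i, c)][v] / class_counts[c]     # P(Aij|Cj)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Only one scan of the training data is needed to collect all the counts, which is the efficiency advantage noted for NB.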
The k-Nearest-Neighbor (kNN) classifier assumes that the entire training set is available, including not only the data items themselves but also the desired classification for each item. When a classification is to be made for a test or new example, its distance to each item in the training data must be determined. The test example is then placed in the class that contains the most examples among its k closest training items.
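The distance-and-vote step can be sketched as below; the Euclidean metric on 2-D points and the toy training set are assumptions for illustration.

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """Place x in the class holding the majority among the k closest
    training items, using Euclidean distance.
    `train` is a list of (point, label) pairs."""
    by_dist = sorted(train, key=lambda item: math.dist(item[0], x))
    return Counter(label for _, label in by_dist[:k]).most_common(1)[0][0]
```

Note that every prediction scans the whole training set, which is why kNN has no training phase but a comparatively expensive classification phase.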

Experimental results
The experiments were performed on an Intel Core 2 Duo 2.0 GHz processor (2 MB cache, 800 MHz FSB) with 1 GB of RAM.

NBDTAID evaluation
In order to evaluate the performance of the NBDTAID algorithm for network intrusion detection, we performed 5-class classification using the KDD99 intrusion detection benchmark dataset [14]. The results of the comparison of NBDTAID with the naïve Bayesian classifier and the ID3 classifier are presented in Table 5 using 41 input attributes, and in Table 6 using 19 input attributes. The performance of the NBDTAID algorithm using the reduced dataset (12 and 17 input attributes) increases the DR; the results are summarized in Table 7.

ACDT evaluation
The results of the comparison of the ACDT, ID3, and C4.5 algorithms using 41 attributes are tabulated in Table 8, and using 19 attributes in Table 9.

Adaptive NBTree evaluation
First, we used the attribute weighting algorithm to perform attribute selection on the training portion of the KDD99 dataset, and then we used the adaptive NBTree algorithm for classifier construction [11]. The performance of the proposed algorithm on 12 attributes of the KDD99 dataset is listed in Table 10. We compare the detection rates of Support Vector Machines (SVM), Neural Network (NN), Genetic Algorithm (GA), and the adaptive NBTree algorithm on the KDD99 dataset; the results are tabulated in Table 15.

IDNBC evaluation
The performance of the IDNBC algorithm was tested on the KDD99 benchmark network intrusion detection dataset; the experimental results, tabulated in Table 16, show that it improves the DR and reduces the FP for different types of network intrusions [12].

Boosting evaluation
We compared the performance of the boosting algorithm with the k-Nearest-Neighbor classifier (kNN), the decision tree classifier (C4.5), Support Vector Machines (SVM), Neural Network (NN), and Genetic Algorithm (GA) on the KDD99 benchmark intrusion detection dataset [16]; the results are tabulated in Table 17. It has also been verified that effective attribute selection improves the detection rates for different types of network intrusions. The performance of the boosting algorithm on 12 attributes of the KDD99 dataset is listed in Table 18.

Bagging evaluation
The presented bagging algorithm was tested on the KDD99 benchmark intrusion detection dataset; the comparison of the results (detection rate %) is tabulated in Table 19.

Conclusions and future work
The work presented in this chapter has explored the basic concepts of adaptive intrusion detection employing data mining algorithms. We focused on the naïve Bayesian (NB) classifier and the decision tree (DT) classifier for extracting intrusion patterns from network data. Both NB and DT are efficient learning techniques for mining complex data and have already been applied to many real-world problem domains. NB has several advantages. First, it is easy to use. Second, unlike other learning algorithms, only one scan of the training data is required. NB can easily handle missing attribute values by simply omitting the corresponding probability when calculating the likelihood of membership in each class. On the other hand, the ID3 algorithm builds a decision tree based on information theory and attempts to minimize the expected number of comparisons. The basic strategy used by ID3 is to choose the splitting attribute with the highest information gain. The amount of information associated with an attribute value is related to its probability of occurrence. The evaluation of the mining algorithms on the KDD99 benchmark intrusion detection dataset showed that supervised intrusion classification can increase the DR and significantly reduce the FP. It also showed that data mining for intrusion detection works, and that the combination of the NB classifier and the DT algorithm forms a robust intrusion-processing framework. Algorithms such as NBDTAID, ACDT, Attribute Weighting with Adaptive NBTree, IDNBC, Boosting, and Bagging presented in this chapter can increase the DR and significantly reduce the FP in intrusion detection. Future work will focus on reducing the FP for R2L attacks and on ensembles with other mining algorithms to improve the DR for new network attacks.