Open access peer-reviewed chapter

Mining Complex Network Data for Adaptive Intrusion Detection

By Dewan Md. Farid, Mohammad Zahidur Rahman and Chowdhury Mofizur Rahman

Submitted: November 21st 2011Reviewed: April 16th 2012Published: September 12th 2012

DOI: 10.5772/48308

Downloaded: 1242

1. Introduction

Intrusion detection is the method of identifying intrusions or misuses in a computer network, which compromise the confidentiality and integrity of the network. Intrusion Detection System (IDS) is a security tool used to monitor network traffic and detect unauthorized activities in the network [23, 28, 30]. A security monitoring surveillance system, which is an intrusion detection model based on detecting anomalies in user behaviors was first introduced by James P. Anderson in 1980 [1]. After that several intrusion detection models based on statistics, Markov chains, time-series, etc proposed by Dorothy Denning in 1986. At first host-based IDS was implemented, which located in the server machine to examine the internal interfaces [35], but with the evolution of computer networks day by day focus gradually shifted toward network-based IDS [20]. Network-based IDS monitors and analyzes network traffics for detecting intrusions from internal and external intruders [26, 27, 34]. A number of data mining algorithms have been widely used by the intelligent computational researchers in the large amount of network audit data for detecting known and unknown intrusions in the last decade [3, 9, 18, 32, 33]. Even for a small network the amount of network audit data is very large that an IDS needs to examine. Use of data mining for intrusion detection aim to solve the problem of analyzing the large volumes of audit data and realizing performance optimization of detection rules.

There are many drawbacks in currently available commercial IDS, such as low and unbalanced detection rates for different types of network attacks, large number of false positives, long response time in high speed network, and redundant input attributes in intrusion detection training dataset. In general a conventional intrusion detection dataset is complex, dynamic, and composed of many different attributes. It has been successfully tested that not all the input attributes in intrusion detection training dataset may be needed for training the intrusion detection models or detecting intrusions [31]. The use of redundant attributes interfere with the correct completion of mining task, increase the complexity of detection model and computational time, because the information they added is contained in other attributes [7]. Ideally, IDS should have an intrusion detection rate of 100% along with false positive of 0%, which is really very difficult to achieve.

Applying mining algorithms for adaptive intrusion detection is the process of collecting network audit data and convert the collected audit data to the format that is suitable for mining. Finally, developing a clustering or classification model for intrusion detection, which provide decision support to intrusion management for detecting known and unknown intrusions by discovering intrusion patterns [4, 5].

2. Intrusion detection

Intrusion detection is the process of monitoring and analyzing the network traffics. It takes sensor data to gather information for detecting intrusions from internal and external networks [6], and notify the network administrator or intrusion prevention system (IPS) about the attack [19, 24]. Intrusion detection is broadly classified into three categories: misuse, anomaly, and hybrid detection model [10]. Misuse detection model detects attacks based on known attack patterns, which already stored in the database by using pattern matching of incoming network packets to the signatures of known intrusions. It begins protecting the network immediately upon installation and produces very low FP, but it requires frequently signature updates and cannot detect new intrusions. Anomaly based detection model detects deviations from normal behaviors to identify new intrusions [22]. It creates a normal profile of the network and then any action that deviated from the normal profile is treated as a possible intrusion, which produces large number of false positives. Hybrid detection model combines both misuse and anomaly detection models [2]. It makes decision based on both the normal behavior of the network and the intrusive behavior of the intruders. Table 1 shows the comparisons among misuse, anomaly, and hybrid detection models.

CharacteristicsMisuseAnomalyHybrid
Detection AccuracyHigh (for known attacks)LowHigh
Detecting New AttacksNoYesYes
False PositivesLowVery highHigh
False NegativesHighLowLow
Timely NotificationsFastSlowRather Fast
Update Usage PatternsFrequentNot FrequentNot Frequent

Table 1.

Comparisons of Detection Models.

Detection rate (DR) and false positive (FP) are the most important parameters that are used for performance estimation of intrusion detection models [8]. Detection rate is calculated by the number of intrusions detected by the IDS divided by the total number of intrusion instances present in the intrusion dataset, and false positive is an alarm, which rose for something that is not really an attack, which are expressed be equation 1 and 2.

DetectionRate,DR=TotalDetectedAttacksTotalAttacks*100E1
FalsePositive,FP=TotalMisclassifiedProcessTotalNormalProcess*100E2

Also precision, recall, overall, and false alarm have been used to measure the performance of IDS [21, 25] from table 2, precision, recall, overall, and false alarm may be expressed be equation 3 to 6.

ParametersDefinition
True Positive (TP) or Detection Rate (DR)Attack occur and alarm raised
False Positive (FP)No attack but alarm raised
True Negative (TN)No attack and no alarm
False Negative (FN)Attack occur but no alarm

Table 2.

Parameters for performances estimation of IDS.

Precision=TPTP+FPE3
Recall=TPTP+FNE4
Overall=TP+TNTP+FP+FN+TNE5
FalseAlarm=FP+FNTP+FP+FN+TNE6

2.1. Intrusion detection dataset

The data generated by IDS contain information about network topology, hosts and other confidential information for this reason intrusion detection dataset cannot be shared in public domain. Whereas it is not difficult task to generate a large set of intrusion detection alets by running IDS in a private or Internet-exposed network. For generating intrusion detection dataset only major challenge is to labeling the network data, because network data are unlabeled and it is not clear which attacks are false positives and which are true positives. The KDD 1999 cup benchmark intrusion detection dataset is the most popular dataset for intrusion detection research, a predictive model capable of distinguishing between intrusions and normal connections, which was first used in the 3rdInternational Knowledge Discovery and Data Mining Tools Competition for building a network intrusion detector [29]. A simulated environment was set up by the MIT Lincoln Lab to acquire raw TCP/IP dump data for a local-area network (LAN) to compare the performance of various IDS that is the part of KDD99 dataset. Examples in KDD99 dataset represent attribute values of a class in the network data flow, and each class is labelled either normal or attack. For each network connection 41 attributes are in KDD99 dataset that have either discrete or continuous values. These attributes are divided into three groups: basic features, content features, and statistical features of network connection.

The classes in KDD99 dataset are mainly categorized into five classes: normal, denial of service (DoS), remote to user (R2L), user to root (U2R), and probing. Normal connections are generated by simulated daily user behaviours. Denial of service computes power or memory of a victim machine too busy or too full to handle legitimate requests. Remote to user is an attack that a remote user gains access of a local user or account by sending packets to a machine over a network communication. User to root is an attack that an intruder begins with the access of a normal user account and then becomes a root-user by exploiting various vulnerabilities of the system. Probing is an attack that scans a network to gather information or find known vulnerabilities. In KDD99 dataset the main four attacks are divided into 22 different attacks that tabulated in table 3 and table 4 shows the number of training and testing examples for each major class.

4 Main Attack Classes22 Attacks Classes
Denial of Service (DoS)back, land, neptune, pod, smurf, teardrop
Remote to User (R2L)ftp_write, guess_passwd, imap, multihop, phf, spy,
warezclient, warezmaster
User to Root (U2R)buffer_overflow, perl, loadmodule, rootkit
Probingipsweep, nmap, portsweep, satan

Table 3.

Different types of attacks in KDD99 Dataset.

Attack TypesTraining ExamplesTesting Examples
Normal9727860592
Denial of Service391458237594
Remote to User11268606
User to Root5270
Probing41074166
Total Examples494021311028

Table 4.

Number of training and testing examples in KDD99 dataset.

3. Combining na ve Bayesian and Decision Tree

In this section, we present the hybrid learning algorithms, NBDTAID (na ve Bayesian with Decision Tree for Adaptive Intrusion Detection) [14], ACDT (Attacks Classificaton using Decision Tree) [13], and Attribute Weighting with Adaptive NBTree Algorithm [11, 15] for intrusions classification in intrusion detection problem. Presented algorithms are performed balance detections and keep FP at acceptable level for different types of network intrusions. It has been successfully tested that by combining the properties of na ve Bayesian classifier and decision tree classifier, the performance of intrusion detection classifier can be enhanced.

3.1. Adaptive intrusion classifier

Na ve Bayesian with Decision Tree for Adaptive Intrusion Detection (NBDTAID) performs balance detections and keeps FP at acceptable level in intrusion detection. NBDTAID eliminates redundant attributes and contradictory examples from training data, and addresses some mining difficulties such as handling continuous attribute, dealing with missing attribute values, and reducing noise in training data [14].

Given a training data D={t1,,tn}where ti={ti1,,tih}and the training data Dcontains the following attributes {A1,A2,,An}and each attribute Aicontains the following attribute values{Ai1,Ai2,,Aih}. The attribute values can be discrete or continuous. Also the training data Dcontains a set of classesC={C1,C2,,Cm}. Each example in the training data Dhas a particular classCj. The algorithm first searches for the multiple copies of the same example in the training dataD, if found then keeps only one unique example in the training data D(suppose all attribute values of two examples are equal then the examples are similar). Then the algorithm discreties the continuous attributes in the training data Dby finding each adjacent pair of continuous attribute values that are not classified into the same class value for that continuous attribute. Next the algorithm calculates the prior P(Cj)and conditional P(Aij|Cj)probabilities in the training dataD. The prior probability P(Cj)for each class is estimated by counting how often each class occurs in the training dataD. For each attribute Aithe number of occurrences of each attribute value Aijcan be counted to determineP(Ai). Similarly, the conditional probability P(Aij|Cj)for each attribute values Aijcan be estimated by counting how often each attribute value occurs in the class in the training dataD. Then the algorithm classifies all the examples in the training data Dwith these prior P(Cj)and conditional P(Aij|Cj)probabilities. For classifying the examples, the prior and conditional probabilities are used to make the prediction. This is done by combining the effects of the different attribute values from that example. Suppose the example eihas independent attribute values{Ai1,Ai2,,Aip}, we knowP(Aik|Cj), for each class Cjand attributeAik. We then estimate P(ei|Cj)by

P(ei|Cj)=P(Cj)k=1pP(Aij|Cj)E7

To classify the example, the algorithm estimates the likelihood that eiis in each class. The probability that eiis in a class is the product of the conditional probabilities for each attribute value with prior probability for that class. The posterior probability P(Cj|ei)is then found for each class and the example classifies with the highest posterior probability for that example. After classifying all the training examples, the class value for each example in training data Dupdates with Maximum Likelihood (ML) of posterior probabilityP(Cj|ei).

Cj=CiPML(Cj|ei)E8

Then again the algorithm calculates the prior P(Cj)and conditional P(Aij|Cj)probabilities using updated class values in the training dataD, and again classifies all the examples of training data using these probabilities. If any of the training example is misclassified then the algorithm calculates the information gain for each attributes {A1,A2,,An}in the training dataD.

Info(D)=-j=1mfreq(Cj,D)|D|log2freq(Cj,D)|D|E9
Info(T)=-i=1n|Ti||T|info(Ti)E10
InformationGain(Ai)=Info(D)-Info(T)E11

And chooses one of the best attributes Aiamong the attributes {A1,A2,,An}from the training data Dwith highest information gain value, Then split the training data Dinto sub-datasets {D1,D2,,Dn}depending on the chosen attribute values ofAi. After the algorithm estimates the prior and conditional probabilities for each sub-dataset Di={D1,D2,,Dn}and classifies the examples of each sub-dataset Diusing their respective probabilities. If any example of any sub-dataset Diis misclassified then the algorithm calculates the information gain of attributes for that sub-datasetDi, and chooses the best attribute Aiwith maximum information gain value from sub-datasetDi, and split the sub-dataset Diinto sub-sub-datasetsDij. Then again calculates the prior and conditional probabilities for each sub-sub-datasetDij, and also classifies the examples of sub-sub-datasets using their respective probabilities. The algorithm will continue this process until all the examples of sub/sub-sub-datasets are correctly classified. When the algorithm correctly classifies all the examples then the prior and conditional probabilities for each datasets are preserved for future classification of unseen examples. The main procedure of the algorithm is described in Algorithm 1.

3.2. Intrusions Classification using Decision Tree

Attacks Classificaton using Decision Tree (ACDT) for anomaly based network intrusion detection [13] addresses the problem of attacks classification in intrusion detection for classifying different types of network attacks.

In a given dataset, first the ACDT algorithm initializes the weights for each example of dataset; Wiequal to1n, where nis the number of total examples in dataset. Then the ACDT algorithm estimates the prior probability P(Cj)for each class by summing the weights that how often each class occurs in the dataset. Also for each attribute, Ai, the number of occurrences of each attribute value Aijcan be counted by summing the weights to determineP(Aij). Similarly, the conditional probabilities P(Aij|Cj)are estimated for all values of attributes by summing the weights how often each attribute value occurs in the classCj. After that the ACDT algorithm uses these probabilities to update the weights for each example in the dataset. It is performed by multiplying the probabilities of the different attribute values from the examples. Suppose the example eihas independent attribute values{Ai1,Ai2,,Aip}. We already knowP(Aik|Cj), for each class Cjand attributeAik. We then estimate P(ei|Cj)by

P(ei|Cj)=P(Cj)k=1pP(Aij|Cj)E12

To update the weight, the algorithm estimate the likelihood of eiin each classCj. The probability that eiis in a class is the product of the conditional probabilities for each attribute value. The posterior probability P(Cj|ei)is then found for each class. Now the weight of the example is updated with the highest posterior probability for that example. Finally, the algorithm calculates the information gain by using updated weights and builds a tree for decision making. Algorithm 2 describes the main procedure of learning process:

3.3. Attribute Weighting with Adaptive NBTree

This subsection presents learning algorithms: Attribute Weighting Algorithm and Adaptive NBTree Algorithm for reducing FP in intrusion detection [11]. It is based on decision tree based attribute weighting with adaptive na ve Bayesian tree (NBTree), which not only reduce FP at acceptable level, but also scale up DR for different types of network intrusions. It estimate the degree of attribute dependency by constructing decision tree, and considers the depth at which attributes are tested in the tree. In NBTree nodes contain and split as regular decision tree, but the leaves contain na ve Bayesian classifier. The purpose of this subsection is to identify important input attributes for intrusion detection that is computationally efficient and effective.

3.3.1. Attribute Weighting Algorithm

In a given training data, D={A1,A2,,An}of attributes, where each attribute Ai={Ai1,Ai2,,Aik}contains attribute values and a set of classesC={C1,C2,,Cn}, where each class Cj={Cj1,Cj2,,Cjk}has some values. Each example in the training data contains weight,w={w1,w2,,wn}. Initially, all the weights of examples in training data have equal unit value that set towi=1n. Where nis the total number of training examples. Estimates the prior probability P(Cj)for each class by summing the weights that how often each class occurs in the training data. For each attribute, Ai, the number of occurrences of each attribute value Aijcan be counted by summing the weights to determineP(Aij). Similarly, the conditional probability P(Aij|Cj)can be estimated by summing the weights that how often each attribute value occurs in the class Cjin the training data. The conditional probabilities P(Aij|Cj)are estimated for all values of attributes. The algorithm then uses the prior and conditional probabilities to update the weights. This is done by multiplying the probabilities of the different attribute values from the examples. Suppose the training example eihas independent attribute values{Ai1,Ai2,,Aip}. We already know the prior probabilities P(Cj)and conditional probabilitiesP(Aik|Cj), for each class Cjand attributeAik. We then estimate P(ei|Cj)by

P(ei|Cj)=P(Cj)P(Aij|Cj)E13

To update the weight of training exampleei, we can estimate the likelihood of eifor each class. The probability that eiis in a class is the product of the conditional probabilities for each attribute value. The posterior probability P(Cj|ei)is then found for each class. Then the weight of the example is updated with the highest posterior probability for that example and also the class value is updated according to the highest posterior probability. Now, the algorithm calculates the information gain by using updated weights and builds a tree. After the tree construction, the algorithm initialized weights for each attributes in training dataD. If the attribute in the training data is not tested in the tree then the weight of the attribute is initialized to 0, else calculates the minimum depth, dthat the attribute is tested at and initialized the weight of attribute to1d. Finally, the algorithm removes all the attributes with zero weight from the training dataD. The main procedure of the algorithm is described in Algorithm 3.

3.3.2. Adaptive NBTree Algorithm

Given training data, Dwhere each attribute Aiand each example eihave the weight value. Estimates the prior probability P(Cj)and conditional probability P(Aij|Cj)from the given training dataset using weights of the examples. Then classify all the examples in the training dataset using these prior and conditional probabilities with incorporating attribute weights into the na ve Bayesian formula:

P(ei|Cj)=P(Cj)i=1mP(Aij|Cj)WiE14

Where Wiis the weight of attributeAi. If any example of training dataset is misclassified, then for each attributeAi, evaluate the utility, u(Ai), of a spilt on attributeAi. Letj=argmaxi(ui), i.e., the attribute with the highest utility. If ujis not significantly better than the utility of the current node, create a NB classifier for the current node. Partition the training data Daccording to the test on attributeAi. If Aiis continuous, a threshold split is used; if Aiis discrete, a multi-way split is made for all possible values. For each child, call the algorithm recursively on the portion of Dthat matches the test leading to the child. The main procedure of the algorithm is described in Algorithm 4.

4. Clustering, boosting, and bagging

In this section, we present IDNBC (Intrusion Detection through na ve Bayesian with Clustering) algorithm [12], Boosting [16], and Bagging [17] algorithms for adaptive intrusion detection. The Boosting algorithm considers a series of classifiers and combines the votes of each individual classifier for classifying intrusions using NB classifier. The Bagging algorithm ensembles ID3, NB classifier, and k-Nearest-Neighbor classifier for intrusion detection, which improves DR and reduces FP. The purpose of this chapter is to combines the several classifiers to improve the classification of different types of network intrusions.

4.1. Na ve Bayesian with clustering

It has been tested that one set of probability derived from data is not good enough to have good classification rate. This subsection presents algorithm namely IDNBC (Intrusion Detection through na ve Bayesian with Clustering) for mining network logs to detect network intrusions through NB classifier [12], which clusters the network logs into several groups based on similarity of logs, and then calculates the probability set for each cluster. For classifying a new log, the algorithm checks in which cluster the log belongs and then use that cluster probability set to classify the new log.

Given a database D={t1,t2,,tn}where ti={ti1,ti2,,tih}and the database Dcontains the following attributes {A1,A2,,An}and each attribute Aicontains the following attribute values{Ai1,Ai2,,Aih}. The attribute values can be discrete or continuous. Also the database Dcontains a set of classesC={C1,C2,,Cm}. Each example in the database Dhas a particular classCj. The algorithm first clusters the database Dinto several clusters {D1,D2,,Dn}depending on the similarity of examples in the databaseD. A similarity measure, sim(ti,tl), defined between any two examples, t1, t2inD, and an integer valuek, the clustering is to define a mapping f:D{1,,K}where each tiis assigned to one clusterKj. Suppose for two examples there is a match between two attribute values then the similarity becomes 0.5. If there is a match only in one attribute value, then similarity between the examples is taken as 0.25 and so on. Then the algorithm calculates the prior probabilities P(Cj)and conditional probabilities P(Aij|Cj)for each cluster. The prior probability P(Cj)for each class is estimated by counting how often each class occurs in the cluster. For each attribute Aithe number of occurrences of each attribute value Aijcan be counted to determineP(Ai). Similarly, the conditional probability P(Aij|Cj)for each attribute values Aijcan be estimated by counting how often each attribute value occurs in the class in the cluster. For classifying a new example whose attribute values are known but class value is unknown, the algorithm checks in which cluster the new example belongs and then use that cluster probability set to classify the new example. For classifying a new example, the prior probabilities and conditional probabilities are used to make the prediction. This is done by combining the effects of the different attribute values from that example. Suppose the example eihas independent attribute values{Ai1,Ai2,,Aip}, we know theP(Aik|Cj), for each class Cjand attributeAik. We then estimate P(ei|Cj)to classify the example, the probability that eiis in a class is the product of the conditional probabilities for each attribute value with prior probability for that class. The posterior probability P(Cj|ei)is then found for each class and the example classifies with the highest posterior probability for that example. The main procedure of the algorithm is described in Algorithm 5.

4.2. Boosting

Adaptive intrusion detection using boosting and na ve Bayesian classifier [?], which considers a series of classifiers and combines the votes of each individual classifier for classifying an unknown or known intrusion. This algorithm generates the probability set for each round using na ve Bayesian classifier and updates the weights of training examples based on the misclassification error rate that produced by the training examples in each round.

Given a training dataD={t1,,tn}, where ti={ti1,,tih}and the attributes{A1,A2,,An}. Each attribute Aicontains the following attribute values{Ai1,Ai2,,Aih}. The training data Dalso contains a set of classesC={C1,C2,,Cm}. Each training example has a particular classCj. The algorithm first initializes the weights of training examples to an equal value ofwi=1n, where nis the total number of training examples inD. Then the algorithm generates a new dataset Diwith equal number of examples from training data Dusing selection with replacement technique and calculates the prior P(Cj)and class conditional P(Aij|Cj)probabilities for new datasetDi.

The prior probability P(Cj)for each class is estimated by counting how often each class occurs in the datasetDi. For each attribute Aithe number of occurrences of each attribute value Aijcan be counted to determineP(Ai). Similarly, the class conditional probability P(Aij|Cj)for each attribute values Aijcan be estimated by counting how often each attribute value occurs in the class in the datasetDi. Then the algorithm classifies all the training examples in training data Dwith these prior P(Cj)and class conditional P(Aij|Cj)probabilities from datasetDi. For classifying the examples, the prior and conditional probabilities are used to make the prediction. This is done by combining the effects of the different attribute values from that example. Suppose the example eihas independent attribute values{Ai1,Ai2,,Aip}, we knowP(Aik|Cj), for each class Cjand attributeAik. We then estimate P(ei|Cj)by using equation 14.

P(ei|Cj)=P(Cj)k=1pP(Aij|Cj)E15

To classify the example, the probability that eiis in a class is the product of the conditional probabilities for each attribute value with prior probability for that class. The posterior probability P(Cj|ei)is then found for each class and the example classifies with the highest posterior probability value for that example. The algorithm classifies each example tiin Dwith maximum posterior probability. After that the weights of the training examples ti in training data Dare adjusted or updated according to how they were classified. If an example was misclassified then its weight is increased, or if an example was correctly classified then its weight is decreased.

To updates the weights of training dataD, the algorithm computes the misclassification rate, the sum of the weights of each of the training example tiin Dthat were misclassified. That is,

error(Mi)=idWi*err(ti)E16

Where err(ti)is the misclassification error of exampleti. If the example tiwas misclassified, then is err(ti)1. Otherwise, it is 0. The misclassification rate affects how the weights of the training examples are updated. If a training example was correctly classified, its weight is multiplied by error(Mi1-error(Mi)). Once the weights of all of the correctly classified examples are updated, the weights for all examples including the misclassified examples are normalized so that their sum remains the same as it was before. To normalize a weight, the algorithm multiplies the weight by the sum of the old weights, divided by the sum of the new weights. As a result, the weights of misclassified examples are increased and the weights of correctly classified examples are decreased. Now the algorithm generates another new data set Difrom training data Dwith maximum weight values and continues the process until all the training examples are correctly classified. Or, we can set the number of rounds that the algorithm will iterate the process. To classify a new or unseen example use all the probabilities of each round (each round is considered as a classifier) and consider the class of new example with highest classifier’s vote. The main procedure of the boosting algorithm is described in Algorithm 6.

4.3. Bagging

Classification of streaming data based on bootstrap aggregation (bagging) [17] creates an ensemble model by using ID3 classifier, na ve Bayesian classifier, and k-Nearest-Neighbor classifier for a learning scheme where each classifier gives the weighted prediction.

Given a datasetD, of dexamples and the dataset Dcontains the following attributes {A1,A2,,An}and each attribute Aicontains the following attribute values{Ai1,Ai2,,Aih}. Also the dataset Dcontains a set of classesC={C1,C2,,Cm}, where each example in dataset Dhas a particular classCj. The algorithm first generates the training dataset Difrom the given dataset Dusing selection with replacement technique. It is very likely that some of the examples from the dataset Dwill occur more than once in the training datasetDi. The examples that did not make it into the training dataset end up forming the test dataset. Then a classifier model, Mi, is learned for each training examples dfrom training datasetDi. The algorithm builds three classifiers using ID3, na ve Bayesian (NB), and k-Nearest-Neighbor (kNN) classifiers.

The basic strategy used by ID3 classifier is to choose splitting attributes with the highest information gain first and then builds a decision tree. The amount of information associated with an attribute value is related to the probability of occurrence. The concept used to quantify information is called entropy, which is used to measure the amount of randomness from a data set. When all data in a set belong to a single class, there is no uncertainty, and then the entropy is zero. The objective of decision tree classification is to iteratively partition the given data set into subsets where all elements in each final subset belong to the same class. The entropy calculation is shown in equation 16. Given probabilities p1,p2,,psfor different classes in the data set

Entropy:H(p1,p2,,ps)=i=1s(pilog(1pi))E17

Given a data set, D, H(D)finds the amount of entropy in class based subsets of the data set. When that subset is split into snew subsets S=D1,D2,,Dsusing some attribute, we can again look at the entropy of those subsets. A subset of data set is completely ordered and does not need any further split if all examples in it belong to the same class. The ID3 algorithm calculates the information gain of a split by using equation 17 and chooses that split which provides maximum information gain.

Gain(D,S)=H(D)-i=1sp(Di)H(Di)E18

The na ve Bayesian (NB) classifier calculates the prior probability, P(Cj)and class conditional probability, P(Aij|Cj)from the dataset. For classifying an example, the NB classifier uses these prior and conditional probabilities to make the prediction of class for that example. The prior probability P(Cj)for each class is estimated by counting how often each class occurs in the datasetDi. For each attribute Aithe number of occurrences of each attribute value Aijcan be counted to determineP(Ai). Similarly, the class conditional probability P(Aij|Cj)for each attribute values Aijcan be estimated by counting how often each attribute value occurs in the class in the datasetDi.

The k-Nearest-Neighbor (kNN) classifier assumes that the entire training set includes not only the data in the set but also the desired classification for each item. When a classification is to be made for a test or new example, its distance to each item in the training data must be determined. The test or new example is then placed in the class that contains the most examples from this training data of kclosest items.

After building classifiers using ID3, NB, and kNN, each classifier, Mi, classifies the training examples and initialized the weight, Wiof each classifier based on the accuracies of percentage of correctly classified examples from training dataset. To classify the testing examples or unknown examples each classifier returns its class prediction, which counts as one vote. The proposed bagged classifier counts the votes with the weights of classifiers, and assigns the class with the maximum weighted vote. The main procedure of the bagging algorithm is described in Algorithm 7.

5. Experimental results

The experiments were performed by using an Intel Core 2 Duo Processor 2.0 GHz processor (2 MB Cache, 800 MHz FSB) with 1 GB of RAM.

5.1. NBDTAID evaluation

In order to evaluate the performance of NBDTAID algorithm for network intrusion detection, we performed 5-class classification using KDD99 intrusion detection benchmark dataset [14]. The results of the comparison of NBDTAID with na ve Bayesian classifier and ID3 classifier are presented in Table 5 using 41 input attributes, and Table 6 using 19 input attributes. The performance of NBDTAID algorithm using reduced dataset (12 and 17 input attributes) increases DR that are summarized in Table 7.

MethodNormalProbeDOSU2RR2L
NBDTAID (DR %)99.7299.2599.7599.2099.26
NBDTAID (FP %)0.060.390.040.116.81
na ve Bayesian (DR %)99.2799.1199.6964.0099.11
na ve Bayesian (FP %)0.080.450.040.148.02
ID3 (DR %)99.6397.8599.5149.2192.75
ID3 (FP %)0.100.550.040.1410.03

Table 5.

NBDTAID Algorithm: Comparison of the results using 41 attributes.

MethodNormalProbeDOSU2RR2L
NBDTAID (DR %)99.8499.7599.7699.4799.35
NBDTAID (FP %)0.050.280.030.106.22
na ve Bayesian (DR %)99.6599.3599.7164.8499.15
na ve Bayesian (FP %)0.050.320.040.126.87
ID3 (DR %)99.7198.2299.6386.1197.79
ID3 (FP %)0.060.510.040.127.34

Table 6.

NBDTAID Algorithm: Comparison of the results using 19 attributes.

Class Value12 Attributes17 Attributes
Normal99.9899.95
Probe99.9299.93
DoS99.9999.97
U2R99,3899.46
R2L99.5599.69

Table 7.

Performance of NBDTAID algorithm using reduced dataset.

5.2. ACDT evaluation

The results of the comparison of ACDT, ID3, and C4.5 algorithms using 41 attributes are tabulated in Table 8 and using 19 attributes are tabulated Table 9 [13].

MethodNormalProbeDoSU2RR2L
ACDT (DR %)98.7698.2198.5598.1197.16
ACDT (FP %)0.070.440.050.126.85
ID3 (DR %)97.6396.3597.4143.2192.75
ID3 (FP %)0.100.550.040.1410.03
C4.5 (DR %)98.5397.8597.5149.2194.65
C4.5 (FP %)0.100.550.070.1411.03

Table 8.

Comparison of ACDT with ID3 and C4.5 using 41 Attributes.

MethodNormalProbeDoSU2RR2L
ACDT (DR %)99.1999.1599.2698.4398.05
ACDT (FP %)0.060.480.040.106.32
ID3 (DR %)98.7198.2297.6386.1194.19
ID3 (FP %)0.060.510.040.127.34
C4.5 (DR %)98.8198.2297.7356.1195.79
C4.5 (FP %)0.080.510.050.128.34

Table 9.

Comparison of ACDT with ID3 and C4.5 using 19 Attributes.

5.3. Adaptive NBTree evaluation

Firstly, we used attribute weighting algorithm to perform attribute selection from training dataset of KDD99 dataset and then we used adaptive NBTree algorithm for classifier construction [11]. The performance of our proposed algorithm on 12 attributes in KDD99 dataset is listed in Table 10.

ClassesDetection Rates (%)False Positives (%)
Normal1000.04
Probe99.930.37
DoS1000.03
U2R99,380.11
R2L99.536.75

Table 10.

Performance of adaptive NBTree algorithm on KDD99 Dataset.

Table 11 and Table 12 depict the performance of na ve Bayesian (NB) classifier and C4.5 algorithm using the original 41 attributes of KDD99 dataset. Table 13 and Table 14 depict the performance of NB classifier and C4.5 using reduces 12 attributes.

ClassesDetection Rates (%)False Positives (%)
Normal99.270.08
Probe99.110.45
DoS99.680.05
U2R64.000.14
R2L99.118.12

Table 11.

Performance of NB classifier on KDD99 Dataset.

ClassesDetection Rates (%)False Positives (%)
Normal98.730.10
Probe97.850.55
DoS97.510.07
U2R49.210.14
R2L91.6511.03

Table 12.

Performance of C4.5 algorithm on KDD99 Dataset.

ClassesDetection Rates (%)False Positives (%)
Normal99.650.06
Probe99.350.49
DoS99.710.04
U2R64.840.12
R2L99.157.85

Table 13.

Performance of NB classifier using 12 attributes.

ClassesDetection Rates (%)False Positives (%)
Normal98.810.08
Probe98.220.51
DoS97.630.05
U2R56.110.12
R2L91.798.34

Table 14.

Performance of C4.5 algorithm using 12 attributes.

We compare the detection rates among Support Vector Machines (SVM), Neural Network (NN), Genetic Algorithm (GA), and adaptive NBTree algorithm on KDD99 dataset that tabulated in Table 15.

SVMNNGAAdaptive NBTree
Normal99.499.699.399.93
Probe89.292.798.4699.84
DoS94.797.599.5799.91
U2R71.44899.2299.47
R2L87.29898.5499.63

Table 15.

Comparison of several algorithms with adaptive NBTree algorithm.

5.4. IDNBC evaluation

The performance of IDNBC algorithm tested by employing KDD99 benchmark network intrusion detection dataset, and the experimental results proved that it improves DR as well as reduces FP for different types of network intrusions are tabulated in Table 16 [12].

MethodNormalProbeDoSU2RR2L
IDNBC (DR %)99.6699.2499.6299.1999.08
IDNBC (FP %)0.080.860.090.187.85
NB (DR %)99.2799.1199.6964.0099.11
NB (FP %)0.080.450.050.148.02

Table 16.

Performance of IDNBC algorithm with na ve Bayesian classifier.

5.5. Boosting evaluation

We tested the performance of Boosting algorithm with k-Nearest-Neighbor classifier (kNN), Decision Tree classifier (C4.5), Support Vector Machines (SVM), Neural Network (NN), and Genetic Algorithm (GA) by employing on the KDD99 benchmark intrusion detection dataset [16] that is tabulated in Table 17.

MethodNormalProbeDoSU2RR2L
Boosting Algorithm10099.9599.9299.5599.60
kNN99.6075.0097.3035.000.60
C4.598.4994.8297.5149.2591.26
SVM99.4089.294.771.4087.20
NN99.6092.797.5048.0098.00
GA99.3098.4699.5799.2298.54

Table 17.

Comparison of the results for the intrusion detection problem (Detection Rate %).

It has been successfully tested that effective attributes selection improves the detection rates for different types of network intrusions in intrusion detection. The performance of boosting algorithm on 12 attributes in KDD99 dataset is listed in Table 18.

Attack TypesDR (%)FP (%)
Normal1000.03
Probing99.950.36
DoS1000.03
U2R99.670.10
R2L99.586.71

Table 18.

Boosting on reduce KDD99 dataset.

5.6. Bagging evaluation

The presented bagging algorithm was tested on the KDD99 benchmark intrusion detection dataset that is tabulated in Table 19 [17].

MethodNormalProbeDoSU2RR2L
ID399.6397.8599.5149.2192.75
NB99.2799.1199.6964.0099.11
kNN99.6075.0097.3035.000.60
Bagging Algorithm10099.9299.9399.5799.61

Table 19.

Comparison of the results on KDD99 dataset using bagging (Detection Rate %).

6. Conclusions and future work

The work presented in this chapter has explored the basic concepts of adaptive intrusion detection employing data mining algorithms. We focused on na ve Bayesian (NB) classifier and decision tree (DT) classifier for extracting intrusion patterns from network data. Both NB and DT are efficient learning techniques for mining the complex data and already applied in many real world problem domains. NB has several advantages. First, it is easy to use. Second, unlike other learning algorithms, only one scan of the training data is required. NB can easily handle missing attribute values by simply omitting that probability when calculation the likelihoods of membership in each class. On the other side, the ID3 algorithm to build a decision tree based on information theory and attempts to minimize the expected number of comparisons. The basic strategy used by ID3 is to choose splitting attributes with the highest information gain. The amount of information associated with an attribute value is related to the probability of occurrence. Having evaluated the mining algorithms on KDD99 benchmark intrusion detection dataset, it proved that supervised intrusion classification can increased DR and significantly reduced FP. It also proved that data mining for intrusion detection works, and the combination of NB classifier and DT algorithm forms a robust intrusion-processing framework. Algorithms such as NBDTAID, ACDT, Attribute Weighting with Adaptive NBTree, IDNBC, Boosting, and Bagging presented in this chapter can increase the DR and significantly reduce the FP in intrusion detection. The future works focus on improving FP for R2L attacks and ensemble with other mining algorithms to improve the DR for new network attacks.

Acknowledgement

Support for this research received from the National Science & Information and Communication Technology (NSICT), Ministry of Science and Information & Communication Technology, Government of Bangladesh, Department of Computer Science and Engineering, Jahangirnagar University, Bangladesh, and Department of Computer Science and Engineering, United International University, Bangladesh.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Dewan Md. Farid, Mohammad Zahidur Rahman and Chowdhury Mofizur Rahman (September 12th 2012). Mining Complex Network Data for Adaptive Intrusion Detection, Advances in Data Mining Knowledge Discovery and Applications, Adem Karahoca, IntechOpen, DOI: 10.5772/48308. Available from:

Embed this chapter on your site Copy to clipboard

<iframe src="http://www.intechopen.com/embed/advances-in-data-mining-knowledge-discovery-and-applications/mining-complex-network-data-for-adaptive-intrusion-detection" />

Embed this code snippet in the HTML of your website to show this chapter

chapter statistics

1242total chapter downloads

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

BotNet Detection: Enhancing Analysis by Using Data Mining Techniques

By Erdem Alparslan, Adem Karahoca and Dilek Karahoca

Related Book

First chapter

Survey of Data Mining and Applications (Review from 1996 to Now)

By Adem Karahoca, Dilek Karahoca and Mert Şanver

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More about us