Data Mining in Banking Sector Using Weighted Decision Jungle Method

Classification, one of the most popular data mining techniques, has been used in the banking sector for many purposes, for example, bank customer churn prediction, credit approval, fraud detection, bank failure estimation, and bank telemarketing prediction. However, traditional classification algorithms do not take the class distribution into account, which results in undesirable performance on imbalanced banking data. To solve this problem, this paper proposes an approach that improves the decision jungle (DJ) method with a class-based weighting mechanism. Experiments conducted on 17 real-world bank datasets show that the proposed approach outperforms the decision jungle method when handling imbalanced banking data.


Introduction
Data mining is the process of analyzing large volumes of data stored in data warehouses in order to automatically extract hidden, previously unknown, valid, interesting, and actionable knowledge such as patterns, anomalies, associations, and changes. It has been commonly used in a wide range of areas that include marketing, health care, the military, the environment, and education. Data mining is becoming increasingly important and essential for the banking sector as well, since the amount of data collected by banks has grown remarkably and the need to discover hidden and useful patterns from banking data has become widely recognized.
Banking systems collect huge amounts of data ever more rapidly as the number of channels (i.e., Internet banking, telebanking, retail banking, mobile banking, ATM) has increased. Banking data is currently generated from various sources, including but not limited to bank account transactions, credit card details, loan applications, and telex messages. Hence, data mining can be used to extract meaningful information from this collected banking data, enabling banking institutions to make better decisions. For example, classification, which is one of the most popular data mining techniques, can be used to predict bank failures [1][2][3], to estimate bank customer churn [4], to detect fraud [5], and to evaluate loan approvals [6].
In many real-world banking applications, the distribution of the classes in the dataset is highly skewed. A banking dataset is imbalanced when its target variable is categorical and the number of samples in one class differs significantly from those of the other class(es). For example, in credit card fraud detection, most of the instances in the dataset are labeled as "non-fraud" (majority class), while very few are labeled as "fraud" (minority class). Similarly, in bank customer churn prediction, most instances belong to the negative class, whereas the minority are marked as the positive class. The performance of classification models is significantly affected by a skewed class distribution; hence, this imbalance problem in the dataset may lead to bad estimates and misclassifications. Dealing with imbalanced data has been considered one of the 10 most difficult problems in the field of data mining [7]. With this motivation, this paper proposes a class-based weighting strategy.
The main contribution of this paper is that it improves the decision jungle (DJ) method by a class-based weighting mechanism to make it effective in handling imbalanced data. In the proposed approach, a weight is assigned to each class based on its distribution, and this weight value is combined with class probabilities. The experimental studies conducted on 17 real-world banking datasets confirm that our approach generally performs better than the traditional decision jungle algorithm when the data is imbalanced.
The rest of this paper is organized as follows. Section 2 briefly presents the recent and related research in the literature. Section 3 describes the proposed approach, class-based weighted decision jungle method, in detail. Section 4 is devoted to the presentation and discussion of the experimental results, including the dataset descriptions. Finally, Section 5 gives the concluding remarks and provides some future research directions.

Related work
As a data-intensive sector, banking has been a popular application area for data mining researchers since the information technology revolution. The continuous developments in banking systems and the rapidly increasing availability of big banking data make data mining one of the most essential tasks for the banking industry.
The banking industry has used data mining techniques in various applications, especially bank failure prediction [1][2][3], identification of possible customer churn [4], fraudulent transaction detection [5], customer segmentation [8][9][10], bank telemarketing prediction [11][12][13][14], and sentiment analysis for bank customers [15]. Some of the classification studies in the banking sector are compared in Table 1. The table shows each study's objective, the year it was conducted, the algorithms and ensemble learning techniques used, the country of the bank, and the obtained results.
The main data mining tasks are classification (or categorical prediction), regression (or numeric prediction), clustering, association rule mining, and anomaly detection. Among these, classification is the most frequently used in the banking sector [16], followed by clustering. Some banking applications [8,10] have used more than one data mining technique, among which applying clustering before classification has proven both popular and applicable.
Apart from novel task-specific algorithms proposed by the authors, the most commonly used classification algorithms in the banking sector are decision tree (DT), neural network (NN), support vector machine (SVM), k-nearest neighbor (KNN), Naive Bayes (NB), and logistic regression (LR), as shown in Table 1. Some data mining studies in the banking sector [1,2,6,11,15] have used ensemble learning methods to increase classification performance. Bagging and boosting are the most popular ensemble learning methods due to their theoretical performance advantages. Random forest (RF) [2,6,11,15] as a well-known bagging algorithm, and AdaBoost (AB) [6] and extreme gradient boosting (XGB) [2,15] as well-known boosting algorithms, have also been used in the banking sector. As shown in Table 1, accuracy (ACC) and area under the ROC curve (AUC) are the most commonly used performance measures for classification.
Various solutions have been proposed in the literature to deal with the class imbalance problem. These methods can be grouped under two main approaches: (i) applying a data preprocessing step and (ii) modifying existing methods. The first approach focuses on balancing the dataset, either by increasing the number of minority class examples (over-sampling) or by reducing the number of majority class examples (under-sampling). In the literature, the synthetic minority over-sampling technique (SMOTE) [17] is commonly used for over-sampling. As an alternative, some studies (e.g., [18]) focus on modifying existing classification algorithms to make them more effective on imbalanced data. Unlike these studies, this paper proposes a novel class-based weighting approach to solve the imbalanced data problem.
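To make the first (preprocessing) approach concrete, the sketch below shows random over-sampling, the simplest form of the balancing idea that SMOTE refines: minority-class samples are duplicated until every class matches the majority count. This is an illustrative helper, not the technique used in this paper; the function name and signature are our own.

```python
import random

def random_oversample(X, y, seed=42):
    """Duplicate minority-class samples (with replacement) until every
    class has as many instances as the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_class.values())
    X_out, y_out = [], []
    for label, samples in by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        for xi in samples + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out
```

SMOTE differs in that it interpolates new synthetic minority samples between nearest neighbors instead of duplicating existing ones, which reduces overfitting to repeated points.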

Decision jungle
A decision jungle is an ensemble of rooted decision directed acyclic graphs (DAGs), which are compact and powerful models for classification. While a traditional decision tree allows only one path to every node, a DAG in a DJ allows multiple paths from the root to each leaf [19]. During the training phase, node splitting and merging operations are performed by minimizing an objective function (the weighted sum of entropies at the leaves).
Unlike a decision forest, which consists of an ensemble of decision trees, a decision jungle consists of an ensemble of decision DAGs. Experiments presented in [19] show that, compared to decision forests and their variants, decision jungles require significantly less memory while significantly improving generalization.
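The structural difference described above can be illustrated with a toy rooted decision DAG in which two internal nodes share a child, something a tree cannot express. This is only a hand-built illustration of DAG routing, not the decision jungle training procedure; the node layout and class probabilities are made up.

```python
# Toy rooted decision DAG: internal nodes 1 and 2 both route to node 3,
# which is allowed in a DAG (multiple paths to a node) but not in a tree.
DAG = {
    0: {"feature": 0, "threshold": 0.5, "left": 1, "right": 2},
    1: {"feature": 1, "threshold": 0.3, "left": 3, "right": 4},
    2: {"feature": 1, "threshold": 0.7, "left": 3, "right": 5},  # shares node 3
    3: {"leaf": {"fraud": 0.8, "non-fraud": 0.2}},
    4: {"leaf": {"fraud": 0.1, "non-fraud": 0.9}},
    5: {"leaf": {"fraud": 0.4, "non-fraud": 0.6}},
}

def predict_proba(dag, x, node=0):
    """Route an instance from the root to a leaf and return the leaf's
    class probability distribution."""
    n = dag[node]
    if "leaf" in n:
        return n["leaf"]
    child = n["left"] if x[n["feature"]] <= n["threshold"] else n["right"]
    return predict_proba(dag, x, child)
```

Sharing leaves between paths is what lets a jungle cover the same decision boundaries as a forest with far fewer nodes, which is the source of the memory savings reported in [19].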

Class-based weighted decision jungle method
In this study, we improve the decision jungle method by a class-based weighting mechanism to make it effective in dealing with imbalanced data.
Given a training dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} that contains N instances, each instance is represented by a pair (x, y), where x is a d-dimensional vector x_i = [x_i1, x_i2, ..., x_id] and y is its corresponding class label. While x is the input variable, y is the output variable in the categorical domain Y = {y_1, y_2, ..., y_k}, where k is the number of class labels. The goal is to learn a classifier function f: X → Y that optimizes some specific evaluation metric(s) and can predict the class label of unseen instances.
The training dataset is usually considered a set of samples from a probability distribution F on X × Y. An instance x is assigned a class label y_j of Y such that:

P(y_j | x) / P(y_i | x) ≥ threshold, for all i ≠ j    (1)

where P(y_j|x) is the predicted conditional probability of x belonging to y_j and threshold is typically set to 1, in which case the rule reduces to selecting the most probable class.
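The decision rule just described, assigning x to the class whose predicted conditional probability dominates every other class by at least the threshold ratio (with threshold = 1 reducing to picking the most probable class), can be sketched as follows. The helper name and dict-based representation of P(y_j|x) are our own illustration.

```python
def assign_label(probs, threshold=1.0):
    """Return the class y_j whose predicted probability P(y_j|x) is at least
    `threshold` times every other class's probability, or None if no class
    dominates. With threshold = 1 this picks the most probable class."""
    for label, p in probs.items():
        if all(p >= threshold * q for other, q in probs.items() if other != label):
            return label
    return None  # no class satisfies the ratio test at this threshold
```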
In this paper, we focus on the imbalanced data problem, where the number of instances in one class (y_i) is much larger or smaller than in another class (y_j). Like many other classification algorithms, the decision jungle method is affected by a skewed class distribution, because traditional classifiers tend to be overwhelmed by the majority class and to ignore the rare samples in the minority class. To overcome this problem, we adapted a class-based weighting mechanism, where the weights are determined by the distribution of the class labels in the dataset. The main idea is that the minority class receives a higher weight, while the majority class is assigned a lower weight, during the combination of class probabilities. According to this approach, the weight of a class is calculated as follows:

W_c = N / (k × N_c)    (2)

where W_c is the weight assigned to class c, N is the total number of instances in the dataset, N_c is the number of instances in class c, and k is the number of class labels. In the proposed approach, Eq. (1) is updated as follows:

(W_j × P(y_j | x)) / (W_i × P(y_i | x)) ≥ threshold, for all i ≠ j    (3)

Figure 1 shows the general structure of the proposed approach. In the first step, various types of raw banking data are obtained from different sources such as account transactions, credit card details, loan applications, and social media texts. Next, the raw banking data is preprocessed with several techniques that provide data integration, data selection, and data transformation. The prepared data is then passed to the training step, where the weighted decision jungle algorithm is used to build a model that accurately maps inputs to the desired outputs. The validation step provides feedback to the learning phase so that the model can be adjusted to improve its performance. The training phase is repeated until a desired classification performance is achieved. Once a model is built, it can be used to predict unseen data.
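The class-based weighting described above, W_c = N / (k × N_c) combined multiplicatively with the class probabilities, can be sketched as follows. The function names are our own; the weighted argmax corresponds to the updated decision rule with threshold 1.

```python
from collections import Counter

def class_weights(y):
    """W_c = N / (k * N_c): inverse-frequency weights, so the minority class
    receives a higher weight than the majority class."""
    counts = Counter(y)
    N, k = len(y), len(counts)
    return {c: N / (k * n_c) for c, n_c in counts.items()}

def weighted_predict(probs, weights):
    """Scale each class probability by its class weight, then pick the
    class with the largest weighted score."""
    return max(probs, key=lambda c: weights[c] * probs[c])
```

For a 90/10 split the weights come out as roughly 0.56 for the majority class and 5.0 for the minority class, so a minority probability of 0.3 can outscore a majority probability of 0.7, flipping the prediction that the unweighted rule would make.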

Experimental studies
We implemented the proposed approach on the Azure Machine Learning Studio cloud platform. In all experiments, the default input parameters of the decision jungle algorithm were used, including:

• Ensemble approach: Bagging

Conventionally, accuracy is the most commonly used measure for evaluating classifier performance. However, in the case of imbalanced data, accuracy alone is not sufficient, since the minority class has far less impact on accuracy than the majority class. Using only the accuracy measure is misleading when the data is imbalanced and the main learning target is the identification of rare samples. In addition, accuracy does not distinguish between the numbers of correct classifications or misclassifications of the different classes. Therefore, in this study, we also used several additional metrics: macro-averaged precision, recall, and F-measure.
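The macro-averaged metrics mentioned above can be computed as sketched below: per-class precision and recall are calculated independently and then averaged, so every class counts equally regardless of its size. This is a generic illustration, not the authors' evaluation code, and it assumes the convention of computing F from the macro-averaged precision and recall.

```python
def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall, and F-measure over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    prec = sum(precisions) / len(labels)
    rec = sum(recalls) / len(labels)
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```

A degenerate classifier that always predicts the majority class on a 90/10 dataset scores 90% accuracy but only 0.5 macro-averaged recall, which is exactly why accuracy alone is misleading here.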

Dataset description
In this study, we conducted a series of experiments on 17 publicly available real-world banking datasets, which are described in Table 2. We obtained eight datasets from the UCI Machine Learning Repository [20] and nine from the Kaggle data repository. Table 3 shows the comparison of the classification performances of the DJ and weighted DJ methods. According to the experimental results, on average, the weighted DJ method shows better classification outcomes than its traditional version on the imbalanced banking datasets in terms of both accuracy and recall. For example, on the imbalanced "bank additional" dataset, accuracy is 94.54% with the DJ method and 94.61% with the weighted DJ method. The accuracy is slightly higher with the weighted version because the classifier was able to classify the minority class samples better (a recall of 0.8385 instead of 0.7914). The proposed method produced lower accuracy and recall values for only 4 of the 17 datasets (those with IDs 5, 9, 12, and 13).

Experimental results
It is observed from the experiments that the weighted DJ method failed on only one of the 17 datasets in terms of macro-averaged recall. This means that the proposed method can generally build a good model for predicting minority class samples.
It can be deduced from the average precision and recall values that higher classification rates can be achieved with the weighted DJ method for minority classes, while more points in the majority classes may be misclassified in the case of imbalanced data. Figure 2 compares the classification performances of the two methods in terms of F-measure: decision jungle and class-based weighted decision jungle (weighted DJ). In principle, the F-measure is defined as F = (2 × Recall × Precision) / (Recall + Precision), the harmonic mean of recall and precision. According to the results, the proposed method showed an increase or the same performance in F-measure for all banking datasets.
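The choice of the harmonic mean in the F-measure matters for imbalanced data: unlike the arithmetic mean, it is dragged down by whichever of precision or recall is weak, so a classifier cannot hide a poor minority-class recall behind a high precision. A minimal sketch:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For precision 1.0 and recall 0.1, the arithmetic mean is 0.55, but the F-measure is only about 0.18, reflecting the weak recall.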
It can be concluded from the experiments that the minority and majority class ratios are not the only factors in constructing a good prediction model. For example, the minority and majority ratios of the first and last datasets are very close, but the classification outcomes for these datasets are not similar: although the class ratios are almost the same, there is a significant difference between the classification accuracy, precision, and recall values of the two datasets, as can be seen in Table 3. There is also a need for appropriate training examples whose data characteristics are consistent with the class labels assigned to them.

Table 3.
Comparison of unweighted and class-based weighted decision jungle methods in terms of accuracy, macro-averaged precision, and macro-averaged recall.

Figure 2.
Comparison of unweighted and class-based weighted decision jungle methods in terms of F-measure.

Conclusion and future work
As a well-known data mining task, classification in real-world banking applications usually involves imbalanced datasets. In such cases, the performance of classification models is significantly affected by a skewed distribution of the classes. The data imbalance problem in the banking dataset may lead to bad estimates and misclassifications. To solve this problem, this paper proposes an approach which improves the decision jungle method with a class-based weighting mechanism. In the proposed approach, a weight is assigned to each class based on its distribution, and this weight value is combined with class probabilities. The empirical experiments conducted on 17 real-world bank datasets demonstrated that it is possible to improve the overall accuracy and recall values with the proposed approach.
As a future study, the proposed approach can be adapted for multi-label classification task. In addition, it can be enhanced for the ordinal classification problem.