Open access peer-reviewed chapter

Data Mining in Banking Sector Using Weighted Decision Jungle Method

Written By

Derya Birant

Submitted: 25 November 2019 Reviewed: 20 February 2020 Published: 20 April 2020

DOI: 10.5772/intechopen.91836

From the Edited Volume

Data Mining - Methods, Applications and Systems

Edited by Derya Birant


Abstract

Classification, as one of the most popular data mining techniques, has been used in the banking sector for different purposes, for example, for bank customer churn prediction, credit approval, fraud detection, bank failure estimation, and bank telemarketing prediction. However, traditional classification algorithms do not take the class distribution into account, which results in undesirable performance on imbalanced banking data. To solve this problem, this paper proposes an approach that improves the decision jungle (DJ) method with a class-based weighting mechanism. The experiments conducted on 17 real-world bank datasets show that the proposed approach outperforms the decision jungle method when handling imbalanced banking data.

Keywords

  • data mining
  • classification
  • banking sector
  • decision jungle
  • imbalanced data

1. Introduction

Data mining is the process of analyzing large volumes of data stored in data warehouses in order to automatically extract hidden, previously unknown, valid, interesting, and actionable knowledge such as patterns, anomalies, associations, and changes. It has been commonly used in a wide range of areas, including marketing, health care, the military, the environment, and education. Data mining is becoming increasingly important and essential for the banking sector as well, since the amount of data collected by banks has grown remarkably and the need to discover hidden and useful patterns in banking data has become widely recognized.

Banking systems collect huge amounts of data ever more rapidly as the number of channels (e.g., Internet banking, telebanking, retail banking, mobile banking, ATMs) has increased. Banking data is currently generated from various sources, including but not limited to bank account transactions, credit card details, loan applications, and telex messages. Hence, data mining can be used to extract meaningful information from these collected banking data and to enable banking institutions to improve their decision-making processes. For example, classification, which is one of the most popular data mining techniques, can be used to predict bank failures [1, 2, 3], to estimate bank customer churn [4], to detect fraud [5], and to evaluate loan approvals [6].

In many real-world banking applications, the distribution of the classes in the dataset is highly skewed. A banking dataset is imbalanced when its target variable is categorical and the number of samples in one class differs significantly from those of the other class(es). For example, in credit card fraud detection, most of the instances in the dataset are labeled as “non-fraud” (majority class), while very few are labeled as “fraud” (minority class). Similarly, in bank customer churn prediction, most instances belong to the negative class, whereas only a minority are marked as the positive class. However, the performance of classification models is significantly affected by a skewed class distribution; hence, this imbalance in the dataset may lead to poor estimates and misclassifications. Dealing with imbalanced data has been considered one of the 10 most challenging problems in the field of data mining [7]. With this motivation, this paper proposes a class-based weighting strategy.

The main contribution of this paper is that it improves the decision jungle (DJ) method by a class-based weighting mechanism to make it effective in handling imbalanced data. In the proposed approach, a weight is assigned to each class based on its distribution, and this weight value is combined with class probabilities. The experimental studies conducted on 17 real-world banking datasets confirm that our approach generally performs better than the traditional decision jungle algorithm when the data is imbalanced.

The rest of this paper is organized as follows. Section 2 briefly presents the recent and related research in the literature. Section 3 describes the proposed approach, class-based weighted decision jungle method, in detail. Section 4 is devoted to the presentation and discussion of the experimental results, including the dataset descriptions. Finally, Section 5 gives the concluding remarks and provides some future research directions.


2. Related work

As a data-intensive sector, banking has been a popular application area for data mining researchers since the information technology revolution. The continuous developments in banking systems and the rapidly increasing availability of big banking data make data mining one of the most essential tasks for the banking industry.

The banking industry has used data mining techniques in various applications, especially bank failure prediction [1, 2, 3], identification of possible customer churn [4], fraudulent transaction detection [5], customer segmentation [8, 9, 10], prediction of bank telemarketing outcomes [11, 12, 13, 14], and sentiment analysis for bank customers [15]. Some of the classification studies in the banking sector are compared in Table 1, which shows the objective of each study, the year it was conducted, the algorithms and ensemble learning techniques it used, the country of the bank, and the reported result.

| Ref | Year | Description | Country of the bank | Result |
| --- | --- | --- | --- | --- |
| Manthoulis et al. [1] | 2020 | Bank failure prediction | USA | AUC >0.97 |
| Ilham et al. [11] | 2019 | Long-term deposit prediction | Portugal | ACC 97.07% |
| Lv et al. [5] | 2019 | Fraud detection in bank accounts | — | ACC 97.39% |
| Krishna et al. [15] | 2019 | Sentiment analysis for bank customers | India | AUC 0.8268 |
| Farooqi and Iqbal [12] | 2019 | Prediction of bank telemarketing outcomes | Portugal | ACC 91.2% |
| Carmona et al. [2] | 2019 | Bank failure prediction | USA | ACC 94.74% |
| Jing and Fang [3] | 2018 | Bank failure prediction | USA | AUC 0.916 |
| Lahmiri [13] | 2017 | Prediction of bank telemarketing outcomes | Portugal | ACC 71% |
| Marinakos and Daskalaki [8] | 2017 | Customer classification for bank direct marketing | Portugal | AUC 0.9 |
| Keramati et al. [4] | 2016 | Bank customer churn prediction | — | AUC 0.929 |
| Wan et al. [6] | 2016 | Predicting nonperforming loans | China | AUC 0.965 |
| Ogwueleka et al. [10] | 2015 | Identifying bank customer behavior | Intercontinental | AUC 0.94 |
| Moro et al. [14] | 2014 | Prediction of bank telemarketing outcomes | Portugal | AUC 0.8 |
| Smeureanu et al. [9] | 2013 | Customer segmentation in banking sector | Romania | ACC 97.127% |

(The table additionally marks, for each study, which of the algorithms DT, NN, SVM, KNN, NB, and LR and which ensemble learning method — bagging (e.g., RF) or boosting (AB, XGB) — were used.)

Table 1.

Classification applications in the banking sector.

The main data mining tasks are classification (or categorical prediction), regression (or numeric prediction), clustering, association rule mining, and anomaly detection. Among these tasks, classification is the most frequently used in the banking sector [16], followed by clustering. Some banking applications [8, 10] have used more than one data mining technique, among which applying clustering before classification has proven both popular and applicable.

Apart from novel task-specific algorithms proposed by the authors, the most commonly used classification algorithms in the banking sector are decision tree (DT), neural network (NN), support vector machine (SVM), k-nearest neighbor (KNN), Naive Bayes (NB), and logistic regression (LR), as shown in Table 1. Some data mining studies in the banking sector [1, 2, 6, 11, 15] have used ensemble learning methods to increase classification performance. Bagging and boosting are the most popular ensemble learning methods due to their theoretical performance advantages; random forest (RF) [2, 6, 11, 15] is the best-known bagging algorithm, while AdaBoost (AB) [6] and extreme gradient boosting (XGB) [2, 15] are the best-known boosting algorithms used in the banking sector. As shown in Table 1, accuracy (ACC) and area under the ROC curve (AUC) are the most commonly used performance measures for classification.

To deal with the class imbalance problem, various solutions have been proposed in the literature. These methods can be grouped under two main approaches: (i) applying a data preprocessing step and (ii) modifying existing methods. The first approach focuses on balancing the dataset, either by increasing the number of minority class examples (over-sampling) or by reducing the number of majority class examples (under-sampling), as sketched below. In the literature, the synthetic minority over-sampling technique (SMOTE) [17] is commonly used as an over-sampling technique. As an alternative, some studies (e.g., [18]) focus on modifying existing classification algorithms to make them more effective when dealing with imbalanced data. Unlike these studies, this paper proposes a novel class-based weighting approach to solve the imbalanced data problem.
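For illustration, the following minimal sketch (assuming the open-source scikit-learn and imbalanced-learn packages, with a synthetic dataset standing in for real banking data) shows both resampling directions of the first approach:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced data standing in for a banking dataset (e.g., fraud vs. non-fraud).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
print("original class counts:", Counter(y))

# Over-sampling: SMOTE synthesizes new minority-class examples.
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# Under-sampling: randomly discard majority-class examples instead.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("after under-sampling:", Counter(y_under))
```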


3. Methods

3.1 Decision jungle

A decision jungle is an ensemble of rooted decision directed acyclic graphs (DAGs), which are powerful and compact models for classification. While a traditional decision tree allows only one path to every node, a DAG in a DJ allows multiple paths from the root to each leaf [19]. During the training phase, node splitting and merging operations are guided by the minimization of an objective function (the weighted sum of entropies at the leaves).

Unlike a decision forest, which consists of an ensemble of decision trees, a decision jungle consists of an ensemble of decision directed acyclic graphs. Experiments presented in [19] show that decision jungles require significantly less memory while significantly improving generalization, compared to decision forests and their variants.

3.2 Class-based weighted decision jungle method

In this study, we improve the decision jungle method by a class-based weighting mechanism to make it effective in dealing with imbalanced data.

Given a training dataset D = {(x1, y1), (x2, y2), ..., (xN, yN)} that contains N instances, each instance is represented by a pair (x, y), where x is a d-dimensional vector such that xi = [xi1, xi2, ..., xid] and y is its corresponding class label. While x is defined as the input variable, y is referred to as the output variable in the categorical domain Y = {y1, y2, ..., yk}, where k is the number of class labels. The goal is to learn a classifier function f: X → Y that optimizes some specific evaluation metric(s) and can predict the class label of unseen instances.

The training dataset is usually considered a set of samples drawn from a probability distribution F on X × Y. An instance x is associated with a class label yj of Y such that:

\frac{P(y_j \mid x)}{P(y_m \mid x)} > \mathrm{threshold}, \quad \forall\, m \neq j \qquad (1)

where P(yj | x) is the predicted conditional probability of x belonging to class yj and the threshold is typically set to 1.
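As a small sketch (with purely hypothetical probability values), Eq. (1) with threshold = 1 amounts to choosing the class with the highest predicted probability:

```python
import numpy as np

# Hypothetical predicted probabilities P(y_j | x) for one instance, k = 2 classes
# (index 0 = majority class, index 1 = minority class).
proba = np.array([0.80, 0.20])
threshold = 1.0

# Eq. (1): assign class j if P(y_j | x) / P(y_m | x) > threshold for every m != j.
j = int(np.argmax(proba))
assert all(proba[j] / proba[m] > threshold for m in range(len(proba)) if m != j)
print("predicted class:", j)  # -> 0 (the majority class)
```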

In this paper, we focus on the imbalanced data problem, where the number of instances in one class (yi) is much larger or smaller than the number of instances in the other class (yj). Like many other classification algorithms, the decision jungle method is affected by a skewed class distribution, because traditional classifiers tend to be overwhelmed by the majority class and to ignore the rare samples in the minority class. In order to overcome this problem, we adopted a class-based weighting mechanism, where the weights are determined by the distribution of the class labels in the dataset. The main idea is that the minority class receives a higher weight, while the majority class is assigned a lower weight during the combination of the class probabilities. According to this approach, the weight of a class is calculated as follows:

W_c = \frac{1 / \log(N_c + 1)}{\sum_{i=1}^{k} 1 / \log(N_i + 1)} \qquad (2)

where Wc is the weight assigned to class c, N is the total number of instances in the dataset, Nc is the number of instances in class c, and k is the number of class labels. In the proposed approach, Eq. (1) is updated as follows:

\frac{W_j \, P(y_j \mid x)}{W_m \, P(y_m \mid x)} > \mathrm{threshold}, \quad \forall\, m \neq j \qquad (3)
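A minimal sketch of how Eqs. (2) and (3) could be realized, using hypothetical class counts and predicted probabilities (the natural logarithm is assumed; the actual implementation inside the decision jungle is not reproduced here):

```python
import numpy as np

# Hypothetical class counts N_c for an imbalanced binary banking dataset.
class_counts = np.array([8850, 1150])   # majority class, minority class
k = len(class_counts)

# Eq. (2): W_c = (1 / log(N_c + 1)) / sum_{i=1..k} (1 / log(N_i + 1)).
inv_log = 1.0 / np.log(class_counts + 1)
weights = inv_log / inv_log.sum()       # the minority class receives the larger weight
print("class weights:", weights)        # approx. [0.44, 0.56]

# Eq. (3): compare the weighted probabilities W_j * P(y_j | x) instead of the raw ones.
proba = np.array([0.55, 0.45])          # hypothetical P(y_j | x) for a borderline instance
threshold = 1.0
weighted = weights * proba
j = int(np.argmax(weighted))
if all(weighted[j] / weighted[m] > threshold for m in range(k) if m != j):
    # The weighting flips this borderline case towards the minority class (class 1).
    print("predicted class:", j)
```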

Figure 1 shows the general structure of the proposed approach. In the first step, various types of raw banking data are obtained from different sources such as account transactions, credit card details, loan applications, and social media texts. Next, raw banking data is preprocessed by applying several different techniques to provide data integration, data selection, and data transformation. The prepared data is then passed to the training step, where weighted decision jungle algorithm is used to build an effective model which accurately maps inputs to desired outputs. The classification validation step provides feedback to the learning phase for adjustment to improve model performance. The training phase is repeated until a desired classification performance is achieved. Once a model is build, after that it can be used to predict unseen data.

Figure 1.

General structure of proposed approach.
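Since the decision jungle is not available in common open-source libraries, the end-to-end workflow of Figure 1 can only be sketched here with an analogous tree ensemble from scikit-learn, where class-based weighting is approximated through the class_weight argument; this is an illustrative stand-in, not the proposed weighted DJ itself:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier     # stand-in for the decision jungle
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed banking data (imbalanced binary target).
X, y = make_classification(n_samples=10000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Training step: a small tree ensemble with class-based weighting
# ('balanced' gives the minority class a higher weight; note this is not
# the exact weighting of Eq. (2)).
model = RandomForestClassifier(n_estimators=8, max_depth=32,
                               class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

# Validation step: the macro-averaged recall obtained here feeds back into
# parameter adjustment until the desired performance is reached (cf. Figure 1).
print("macro recall:", recall_score(y_val, model.predict(X_val), average="macro"))
```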


4. Experimental studies

We implemented the proposed approach in the Azure Machine Learning Studio framework on a cloud platform. In all experiments, the default input parameters of the decision jungle algorithm were used as follows:

  • Ensemble approach: Bagging

  • Number of decision DAGs: 8

  • Maximum width of the decision DAGs: 128

  • Maximum depth of the decision DAGs: 32

  • Number of optimization steps per decision DAG layer: 2048

Conventionally, accuracy is the most commonly used measure for evaluating classifier performance. However, in the case of imbalanced data, accuracy alone is not sufficient, since the minority class has much less impact on accuracy than the majority class. Relying only on the accuracy measure is misleading when the data is imbalanced and the main learning target is the identification of the rare samples. In addition, accuracy does not distinguish between the numbers of correctly and incorrectly classified samples of the different classes. Therefore, in this study, we also used several additional metrics: macro-averaged precision, recall, and F-measure, computed as sketched below.
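A minimal sketch (assuming scikit-learn and hypothetical label vectors) of how these macro-averaged metrics can be computed:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted labels for an imbalanced binary problem.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
# Macro averaging gives every class equal influence, regardless of its frequency.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"accuracy={acc:.2f}  macro-P={prec:.2f}  macro-R={rec:.2f}  macro-F={f1:.2f}")
```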

4.1 Dataset description

In this study, we conducted a series of experiments on 17 publicly available real-world banking datasets, which are described in Table 2. We obtained eight datasets from the UCI Machine Learning Repository [20] and nine from the Kaggle data repository.

| No | Dataset | #Instances | #Features | #Classes | Majority class (%) | Minority class (%) | Data source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Abstract dataset for credit card fraud detection | 3075 | 12 | 2 | 85.4 | 14.6 | Kaggle |
| 2 | Bank marketing [14]: Bank | 4521 | 17 | 2 | 88.5 | 11.5 | UCI |
| 3 | Bank marketing [14]: Bank full | 45,211 | 17 | 2 | 88.3 | 11.7 | UCI |
| 4 | Bank marketing [14]: Bank additional | 4119 | 21 | 2 | 89.1 | 10.9 | UCI |
| 5 | Bank marketing [14]: Bank additional full | 41,188 | 21 | 2 | 88.7 | 11.3 | UCI |
| 6 | Bank customer churn prediction | 10,000 | 14 | 2 | 79.6 | 20.4 | Kaggle |
| 7 | Bank loan status | 100,000 | 19 | 2 | 77.4 | 22.6 | Kaggle |
| 8 | Banknote authentication | 1372 | 5 | 2 | 55.5 | 44.5 | UCI |
| 9 | Credit approval | 690 | 16 | 2 | 55.5 | 44.5 | UCI |
| 10 | Credit card fraud detection [21] | 284,807 | 31 | 2 | 99.8 | 0.2 | Kaggle |
| 11 | Default of credit card clients [22] | 30,000 | 25 | 2 | 77.9 | 22.1 | UCI |
| 12 | German credit | 1000 | 21 | 2 | 70.0 | 30.0 | UCI |
| 13 | Give me some credit | 150,000 | 12 | 2 | 93.3 | 6.7 | Kaggle |
| 14 | Loan campaign response | 20,000 | 40 | 2 | 87.4 | 12.6 | Kaggle |
| 15 | Loan data for dummy bank | 887,379 | 30 | 2 | 92.4 | 7.6 | Kaggle |
| 16 | Loan prediction | 614 | 13 | 2 | 68.7 | 31.3 | Kaggle |
| 17 | Loan repayment prediction | 9578 | 14 | 2 | 84.0 | 16.0 | Kaggle |

Table 2.

The main characteristics of the banking datasets.

4.2 Experimental results

Table 3 shows a comparison of the classification performance of the DJ and weighted DJ methods. According to the experimental results, on average, the weighted DJ method yields better classification outcomes than its traditional version on the imbalanced banking datasets in terms of both accuracy and recall. For example, the imbalanced dataset “bank additional” has an accuracy of 94.54% with the DJ method and 94.61% with the weighted DJ method. The accuracy is slightly higher with the weighted version because the classifier was able to classify the minority class samples better (a recall of 0.8385 instead of 0.7914). The proposed method produced slightly lower accuracy values for only 4 of the 17 datasets (IDs 5, 9, 12, and 13).

| ID | Dataset | DJ: Acc (%) | DJ: Precision | DJ: Recall | Weighted DJ: Acc (%) | Weighted DJ: Precision | Weighted DJ: Recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Abstract dataset for credit card fraud detection | 99.09 | 0.9918 | 0.9715 | 99.19 | 0.9923 | 0.9749 |
| 2 | Bank | 92.70 | 0.8909 | 0.7175 | 92.70 | 0.8492 | 0.7593 |
| 3 | Bank full | 91.06 | 0.8181 | 0.6874 | 91.17 | 0.8039 | 0.7217 |
| 4 | Bank additional | 94.54 | 0.9082 | 0.7914 | 94.61 | 0.8739 | 0.8385 |
| 5 | Bank additional full | 92.21 | 0.8332 | 0.7347 | 92.19 | 0.8126 | 0.7762 |
| 6 | Bank customer churn prediction | 87.37 | 0.8514 | 0.7291 | 87.40 | 0.8394 | 0.7411 |
| 7 | Bank loan status | 84.37 | 0.9170 | 0.6328 | 84.38 | 0.9169 | 0.6332 |
| 8 | Banknote authentication | 99.85 | 0.9987 | 0.9984 | 100.00 | 1.0000 | 1.0000 |
| 9 | Credit approval | 92.80 | 0.9273 | 0.9275 | 92.65 | 0.9257 | 0.9261 |
| 10 | Credit card fraud detection | 99.97 | 0.9915 | 0.9167 | 99.97 | 0.9861 | 0.9309 |
| 11 | Default of credit card clients | 83.05 | 0.7833 | 0.6695 | 83.16 | 0.7793 | 0.6785 |
| 12 | German credit | 86.30 | 0.8545 | 0.8088 | 85.70 | 0.8338 | 0.8198 |
| 13 | Give me some credit | 93.88 | 0.8245 | 0.5986 | 93.77 | 0.7861 | 0.6240 |
| 14 | Loan campaign response | 89.34 | 0.9393 | 0.5763 | 90.34 | 0.9390 | 0.6178 |
| 15 | Loan data for dummy bank | 95.19 | 0.9753 | 0.6837 | 95.20 | 0.9753 | 0.6844 |
| 16 | Loan prediction | 83.54 | 0.8715 | 0.7443 | 83.54 | 0.8631 | 0.7481 |
| 17 | Loan repayment prediction | 84.82 | 0.9059 | 0.5266 | 85.35 | 0.8900 | 0.5453 |
| | Average | 91.18 | 0.8990 | 0.7479 | 91.25 | 0.8863 | 0.7659 |

Table 3.

Comparison of unweighted and class-based weighted decision jungle methods in terms of accuracy, macro-averaged precision, and macro-averaged recall.

It is observed from the experiments that the weighted DJ method yielded a lower macro-averaged recall value for only one of the 17 datasets. This means that the proposed method is generally able to build a good model for predicting minority class samples.

It can be deduced from the average precision and recall values that higher classification rates can be achieved with the weighted DJ method for the minority classes, at the cost of a few more misclassified samples in the majority classes in the case of imbalanced data.

Figure 2 shows the comparison of the classification performance of the two methods in terms of F-measure: decision jungle and class-based weighted decision jungle (weighted DJ). The F-measure is defined as F = (2 × Recall × Precision)/(Recall + Precision), i.e., the harmonic mean of recall and precision. According to the results, for every banking dataset, the proposed method achieved an equal or higher F-measure value.

Figure 2.

Comparison of unweighted and class-based weighted decision jungle methods in terms of F-measure.
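As a rough illustration (computed from the average macro precision and recall values in Table 3, which is not identical to averaging the per-dataset F-measures), the formula gives:

F_{\mathrm{DJ}} \approx \frac{2 \times 0.8990 \times 0.7479}{0.8990 + 0.7479} \approx 0.817,
\qquad
F_{\mathrm{weighted\;DJ}} \approx \frac{2 \times 0.8863 \times 0.7659}{0.8863 + 0.7659} \approx 0.822

That is, on aggregate the weighted DJ also edges out the unweighted DJ on this combined measure.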

It is possible to conclude from the experiments that the minority and majority class ratios are not the only factors in constructing a good prediction model. For example, the minority and majority ratios of the first and last datasets are very close, but the classification outcomes for these datasets are not similar: although the class ratios are almost the same, there is a significant difference between their classification accuracy, precision, and recall values, as can be seen in Table 3. There is also a need for appropriate training examples whose data characteristics are consistent with the class label assigned to them.


5. Conclusion and future work

As a well-known data mining task, classification in real-world banking applications usually involves imbalanced datasets. In such cases, the performance of classification models is significantly affected by a skewed class distribution, and the data imbalance problem in a banking dataset may lead to poor estimates and misclassifications. To solve this problem, this paper proposed an approach that improves the decision jungle method with a class-based weighting mechanism. In the proposed approach, a weight is assigned to each class based on its distribution, and this weight is combined with the class probabilities. The empirical experiments conducted on 17 real-world bank datasets demonstrated that it is possible to improve the overall accuracy and recall values with the proposed approach.

As future work, the proposed approach can be adapted to the multi-label classification task. In addition, it can be extended to the ordinal classification problem.

References

  1. 1. Manthoulis G, Doumpos M, Zopounidis C, Galariotis E. An ordinal classification framework for bank failure prediction: Methodology and empirical evidence for US banks. European Journal of Operational Research. 2020;282(2):786-801
  2. 2. Carmona P, Climent F, Momparler A. Predicting failure in the U.S. banking sector: An extreme gradient boosting approach. International Review of Economics and Finance. 2019;61:304-323
  3. 3. Jing Z, Fang Y. Predicting US bank failures: A comparison of logit and data mining models. Journal of Forecasting. 2018;37:235-256
  4. 4. Keramati A, Ghaneei H, Mirmohammadi SM. Developing a prediction model for customer churn from electronic banking services using data mining. Financial Innovation. 2016;2(1):1-13
  5. 5. Lv F, Huang J, Wang W, Wei Y, Sun Y, Wang B. A two-route CNN model for bank account classification with heterogeneous data. PLoS One. 2019;14(8):1-22
  6. 6. Wan J, Yue Z-L, Yang D-H, Zhang Y, Jiao L, Zhi L, et al. Predicting non performing loan of business Bank with data mining techniques. International Journal of Database Theory and Application. 2016;9(12):23-34
  7. 7. Yang Q, Wu X. 10 challenging problems in data mining research. International Journal of Information Technology and Decision Making. 2006;5(4):597-604
  8. 8. Marinakos G, Daskalaki S. Imbalanced customer classification for bank direct marketing. Journal of Marketing Analytics. 2017;5(1):14-30
  9. 9. Smeureanu I, Ruxanda G, Badea LM. Customer segmentation in private banking sector using machine learning techniques. Journal of Business Economics and Management. 2013;14(5):923-939
  10. 10. Ogwueleka FN, Misra S, Colomo-Palacios R, Fernandez L. Neural network and classification approach in identifying customer behavior in the banking sector: A case study of an international bank. Human Factors and Ergonomics in Manufacturing. 2015;25(1):28-42
  11. 11. Ilham A, Khikmah L, Indra A, Ulumuddin A, Iswara I. Long-term deposits prediction: A comparative framework of classification model for predict the success of bank telemarketing. Journal of Physics Conference Series. 2019;1175(1):1-6
  12. 12. Farooqi R, Iqbal N. Performance evaluation for competency of bank telemarketing prediction using data mining techniques. International Journal of Recent Technology and Engineering. 2019;8(2):5666-5674
  13. 13. Lahmiri S. A two-step system for direct bank telemarketing outcome classification. Intelligent Systems in Accounting, Finance and Management. 2017;24(1):49-55
  14. 14. Moro S, Cortez P, Rita P. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems. 2014;62:22-31
  15. 15. Krishna GJ, Ravi V, Reddy BV, Zaheeruddin M, Jaiswal H, Sai Ravi Teja P, et al. Sentiment classification of Indian Banks’ Customer Complaints. In: Proceedings of IEEE Region 10 Annual International Conference. India; 17–20 October 2019. pp. 429-434
  16. 16. Hassani H, Huang X, Silva E. Digitalisation and Big Data Mining in Banking. Big Data and Cognitive Computing. 2018;2(3):1-13
  17. 17. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321-357
  18. 18. Cieslak D, Liu W, Chawla S, Chawla N. A robust decision tree algorithm for imbalanced data sets. In: Proceedings of the Tenth SIAM International Conference on Data Mining (SDM 2010). Columbus, Ohio, USA; 29 Apr-1 May 2010. pp. 766-777
  19. 19. Shotton J, Nowozin S, Sharp T, Winn J, Kohli P, Criminisi A. Decision jungles: Compact and rich models for classification. Advances in Neural Information Processing Systems. 2013;26:234-242
  20. 20. Dua D, Graff C. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. 2019. Available from: http://archive.ics.uci.edu/ml
  21. 21. Carcillo F, Borgne Y-A, Caelen O, Oble F, Bontempi G. Combining unsupervised and supervised learning in credit card fraud detection. Information Sciences. 2020 in press. DOI: 10.1016/j.ins.2019.05.042
  22. 22. Yeh IC, Lien CH. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications. 2009;36(2):2473-2480
