Abstract: In this paper, an innovative fraud detection model built upon existing data mining and fraud detection methods is proposed. A bagging model is applied and compared with other methods, namely Logistic Regression, Naïve Bayes (Lewis, 1998a), and Decision Tree (DT) (Murthy, 1998). We use these methods as base classifiers and build a bagging model upon them. A variety of measures are used to evaluate the efficiency and performance of each classifier and then of the proposed model. This study is based on a real-world dataset, which has been divided into 4 smaller datasets with different fraudulent transaction rates. The proposed bagging model shows higher performance than the other models on almost all measures. The introduced model uses a virtual binary dataset derived from the real-life dataset.
Keywords: Fraud detection, Bagging, Naïve Bayes, Logistic Regression, Decision Tree.
1. Introduction
The rapid growth of E-commerce and the use of credit cards and online purchasing have caused an explosion in credit card fraud (Raj and Portia, 2011). The 14th annual online fraud report by CyberSource shows that although the percentage of revenue lost to online payment fraud has decreased over the last four years in developed countries, the very large volume of E-commerce transactions means the absolute revenue loss is still very large and cannot be underestimated (CyberSource Report, 2014). The report also indicates that applying fraud detection systems has a great effect on decreasing fraud. One important point is that the report covers developed countries, where E-commerce started many years ago and has now reached maturity; in developing countries like Iran, where E-commerce started only recently, the situation is worse, and hence using fraud detection systems, especially in electronic banking, is a vital necessity.
According to (West and Bhattacharya, 2016), the emergence of new technologies such as cloud and mobile computing has compounded the problem, and traditional methods can no longer keep up.
Fraudsters constantly adapt to new fraud detection systems (Patnaik et al., 2015a), and their behavior varies from one culture to another, which adds further complexity to the fraud detection process. Hence, fraud detection methods require continual revision.
The performance of fraud detection methods can be estimated based on different measures and the importance of these measures depends on the fraud detection system and the type of the business.
Three famous data mining methods are used in this study for detecting fraud in card transactions and different performance measures are used for comparing them.
Cost (monetary value) is always an important measure for fraud detection systems, especially in financial institutions and banks. The bagging model introduced in this paper, which is built on simple classifier methods, achieves the best result on almost all measures, including cost.
The remainder of the paper is ordered as follows. Section 2 describes the work related to card fraud detection. Section 3 describes the card frauds. Section 4 describes the performance measures used for evaluation. In section 5 the data mining techniques which have been used as a basis for this paper are described. Section 6 describes meta-classifiers. The proposed bagging model has been described in section 7. Section 8 describes the dataset. Results are put forth in section 9, and section 10 describes the conclusion and future works.
2. Literature Review
The application of predictive models in real-world fraud detection systems is extensive; some are surveyed in (West and Bhattacharya, 2016). Despite the usefulness of data mining techniques for processing data and extracting knowledge from large datasets, there have been few reported studies of data mining for fraud detection, and according to the survey by (West and Bhattacharya, 2016), no single method is accepted as dominant in this field.
Some of the main issues in fraud detection, namely concept drift, skewed class distribution, reduction of large amounts of data, and support for real-time detection, are discussed in (Abdallah et al., 2016).
Prevalent methods in credit card fraud detection are artificial immune systems (Wong et al., 2012) (Soltani Halvaiee and Akbari, 2014), fuzzy logics (Sánchez et al., 2009) (Jans et al., 2011), self-organising map (Quah and Sriganesh, 2008) (Olszewski, 2014), decision trees (Bai et al., 2008) (Whitrow et al., 2009) (Sahin et al., 2013), support vector machines (Kim et al., 2003) and hybrid methods (Duman and Ozcelik, 2011).
Case-based reasoning is another fraud detection technique (Wheeler and Aitken, 2000) and Markov models have been used in (Raj and Portia, 2011), (Srivastava et al., 2008). The evaluation of some other techniques such as support vector machine, random forest, and logistic regression has been discussed in (Whitrow et al., 2009). Neural networks, decision trees, and Bayesian belief networks as classifiers are discussed and compared in (Kirkos et al., 2007). According to (Al-Furiah and Al-Braheem, 2009) the neural network and the logistic regression methods do better than the decision tree.
3. Card Frauds
Using payment cards is a good alternative for paper money and has been prevalent in almost all ranges of purchases in recent years. There are two main categories of frauds namely CP (Card Present) and CNP (Card Not Present) (Krivko, 2010).
Counterfeit cards, stolen cards, lost cards, and cards received from banks under another person's identity all fall into CP fraud. The most common CNP frauds are online transactions carried out with a computer or a cell phone; this type has accounted for the largest share of card fraud in recent years (“Cybersource 2015 annual report,” 2015). In these cases, obtaining the personal information of others gives fraudsters access to accounts. There are several methods for obtaining others’ personal information, some of which are mentioned in (Bolton and Hand, 2002).
There are many practical obstacles to fraud detection, and researchers face many challenges; some are mentioned in (Dal Pozzolo et al., 2014) and (Phua et al., 2012).
Exchanging ideas in this field is difficult: each idea can be regarded as a competitive advantage for a company, and this holds back innovation in fraud detection.
According to (Bhattacharyya et al., 2011a), researchers have difficulty obtaining card transaction datasets, and few detection techniques are discussed in public, because this might help fraudsters learn how to evade detection.
The heterogeneity of the fields in fraud detection datasets (numerical, date, categorical, etc.) makes the analysis of transactions more difficult; hence, using data mining and machine learning techniques is inevitable.
In this study, some new numeric attributes have been derived from the main dataset. An effective model has been proposed for making a new dataset based on the main dataset in which all attributes are binary.
4. Performance measures
Several measures of classification performance used in previous studies are employed in this paper; some are described in (Bhattacharyya et al., 2011b). Accuracy alone is not an informative indicator for fraud detection because of the significant class imbalance in the data: a default prediction of all cases into the majority class (non-fraud) will still show a high performance value (Bhattacharyya et al., 2011b), in our case 99.2%. Sensitivity and specificity measure the accuracy on the positive (fraud) and negative (non-fraud) cases respectively, which is more informative than accuracy alone. The F-measure gives the harmonic mean of precision and recall, and the G-mean gives the geometric mean of the fraud and non-fraud accuracies (Bhattacharyya et al., 2011b).
The various performance measures are defined with respect to the confusion matrix below (Table 1). Here positive and negative correspond to fraud and non-fraud cases respectively. TP and TN are those cases which have been correctly predicted as fraud or non-fraud and FP and FN are those which have been predicted as fraud or non-fraud mistakenly.
Table 1- Confusion Matrix
             Predicted as fraud     Predicted as non-fraud
Fraud        True Positive (TP)     False Negative (FN)
Non-fraud    False Positive (FP)    True Negative (TN)
Some of the various measures according to (Labatut and Cherifi, 2012) are calculated as follows:
Accuracy = (TP + TN) / (TP + FP + TN + FN)    (3)
Sensitivity = TP / (TP + FN)    (4)
Specificity = TN / (FP + TN)    (5)
Precision = TP / (TP + FP)    (6)
F-measure = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)    (7)
G-mean = √(Sensitivity × Specificity)    (8)
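These measures follow directly from the confusion-matrix counts. The sketch below (plain Python, not from the paper; the counts are hypothetical) implements equations (3)-(8):

```python
import math

def classification_measures(tp, fp, tn, fn):
    """Equations (3)-(8) computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)           # accuracy on fraud cases
    specificity = tn / (fp + tn)           # accuracy on non-fraud cases
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return accuracy, sensitivity, specificity, precision, f_measure, g_mean

# Hypothetical counts: 50 frauds caught, 10 false alarms,
# 930 correct non-frauds, 10 missed frauds.
print(classification_measures(tp=50, fp=10, tn=930, fn=10))
```

Note how accuracy stays near 0.98 even when a third of fraud-related decisions are wrong, which is why the imbalance-aware measures above are needed.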
Those transactions that are more prone to fraud are investigated more than others, and this will impose more cost on the business. Correct detection of fraud cases prevents the loss of fraudulent activities. The cost of losses is always more than the cost of investigation of a fraud-prone transaction. This means that the cost of FP and cost of FN are not equal. Hence, measures like AUC which treat FN and FP equally (Dal Pozzolo et al., 2014) are not useful for problems like fraud detection.
The importance of cost measure in fraud detection is so high that it can easily overshadow other measures. Cost can be regarded as the time, power and resources needed for processing a transaction. In fraud detection and in this study when we say cost, we mean the money value lost because of false recognition. False recognitions are either “False Positive” or “False Negative” and their costs are not equal. In almost all businesses it is obvious that the cost of FN is greater than the cost of FP. Regardless of the overall cost of using a fraud detection system, we can say that the cost of TP and TN are equal to zero. The cost of FP is equal to the review cost plus credit costs. The cost of FN is equal to the expected loss because of the fraud plus a credit cost. These costs are different for different businesses and also different cultures and should be estimated using historical data of that business and also expert’s recommendation.
Here and for our business, the cost of fraud can be estimated as below:
The cost of fraud is equal to immediate direct loss due to fraud plus the cost of fraud prevention and detection plus the cost of lost business (when replacing card) plus the opportunity cost of fraud prevention/detection plus deterrent effect on the spread of e-commerce.
In this study, the cost measure is computed with respect to the cost matrix below.
Table 2- Cost Matrix
Cost Matrix        Fraud             Not Fraud
Alarm fraud        0                 Credit loss = 23
Alarm non-fraud    Loss cost = 70    0
Here we consider the cost of fraud as the immediate direct loss due to fraud and set the other parameters equal to zero and since the average amount of all transactions in our sample dataset is about 700000, we set this equal to 70.
The credit loss has been estimated according to the following formula:
Credit loss = CCP × (CPR + CFN)    (9)
where CCP is the customer churn probability, CPR is the average customer profit, and CFN is the cost of finding a new customer. In this paper, the parameters of the formula were chosen based on the advice of bank experts; for other businesses they may differ. CCP is set to 0.05. CPR equals all of the money a customer spent in this business multiplied by 0.2 (in the retail business, the average profit on the money a customer spends is 20 percent of that amount), and CFN is set to zero. In our case, the formula yields 23, which is the credit loss entry of the cost matrix (the cost of a false alarm, FP).
The value of these parameters can be estimated from the nature of the business data and from expert opinion.
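As an illustration of equation (9) and the cost matrix, the sketch below (hypothetical helper names; the CPR value of 460 is chosen only so that the example reproduces the credit loss of 23) computes the credit loss and the total misclassification cost:

```python
# Cost-matrix values from Table 2 (amounts scaled as in the paper,
# where the average transaction amount of about 700000 becomes 70).
COST_FP = 23   # false alarm on a legal transaction: credit loss
COST_FN = 70   # missed fraud: average transaction amount

def credit_loss(ccp, cpr, cfn):
    """Equation (9): expected loss from a wrongly-flagged customer churning."""
    return ccp * (cpr + cfn)

def total_cost(fp, fn, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total misclassification cost; TP and TN are taken as zero-cost."""
    return fp * cost_fp + fn * cost_fn

# CPR = 460 is a hypothetical average customer profit (0.05 * 460 = 23).
print(credit_loss(ccp=0.05, cpr=460, cfn=0))
print(total_cost(fp=100, fn=30))   # -> 4400
```

Because a missed fraud (70) costs roughly three times a false alarm (23), a model that trades a few extra false alarms for fewer missed frauds lowers the total cost.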
5. Data mining techniques
Classifiers are supervised data mining methods. In this paper, three classification methods, namely Naïve Bayes (NB), Logistic Regression (LR), and C5, are used. According to (Entezari-Maleki et al., 2009), C4.5 (C5 is an improved version of C4.5) is better than NB and LR based on the AUC measure, while NB and LR have the same performance. Here we compare these three classifiers according to the other measures described in section 4.
5.1 Naïve Bayes(NB)
Naïve Bayes is a supervised classification method based on the Bayes rule of conditional probability; a well-known treatment in machine learning is given by John and Langley (John and Langley, 1995). It is particularly suitable for situations in which the dimensionality of the input is very high (Lewis, 1998b). Experiments on real-world datasets show that the NB algorithm performs comparably well (Patnaik et al., 2015b).
The Bayes rule makes it possible to obtain posterior probabilities from prior probabilities (Lewis, 1998b):
Pr[Y|X] = (Pr[X|Y] × Pr[Y]) / Pr[X]    (1)
Here Pr[Y] denotes the probability of an event Y, and Pr[Y|X] denotes the probability of Y conditional on another event X. The evidence X can be seen as a particular combination of attribute values, X=(X1,…,Xp).
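A minimal sketch of a Naïve Bayes fraud classifier on binary attributes, using scikit-learn's BernoulliNB on synthetic data (not the paper's dataset; the "fraud rule" tying the label to the first two attributes is invented for illustration):

```python
# Synthetic illustration: Naive Bayes on five binary attributes.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 5))   # 1000 transactions, 5 binary attributes
y = X[:, 0] & X[:, 1]                    # toy fraud label: both flags set

clf = BernoulliNB().fit(X, y)            # learns Pr[X|Y] and Pr[Y] per attribute
print(clf.predict([[1, 1, 0, 0, 0]]))    # predicted class for a new transaction
```

Each attribute contributes an independent likelihood term, which is exactly the "naïve" conditional-independence assumption behind equation (1).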
5.2 Logistic Regression (LR)
LR is a data mining technique that originates in statistics. It is derived from linear regression and is suitable for situations in which the output variable has only two states (Shen et al., 2007a), just like fraud detection: fraud or not fraud. LR is useful when one wants to predict the presence or absence of a characteristic or outcome based on the values of a set of predictor variables (Shen et al., 2007b). The model of Logistic Regression is:
log(p/(1-p))=β_0+β_1 X_1+β_2 X_2+⋯+β_n X_n (2)
where p denotes the probability of the response variable Y being 1, X_1, X_2, …, X_n are the explanatory variables, and β_0, β_1, …, β_n are the regression coefficients determined by the regression model, usually via maximum likelihood estimation.
LR is ideally suited to binary classification problems, in which the occurrence of class 1 implies the non-occurrence of class 2 and vice versa. Therefore, Logistic Regression can be applied to fraud detection, a typical two-class problem.
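As an illustration of equation (2), the sketch below fits a logistic regression to synthetic data in which the log-odds of fraud grow with a hypothetical "amount" feature; the fitted coefficients approximate β_1 and β_0:

```python
# Synthetic illustration: log-odds of fraud grow linearly with "amount".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
amount = rng.exponential(scale=50.0, size=2000)   # hypothetical amount feature
p = 1 / (1 + np.exp(-(0.05 * amount - 4)))        # true beta_1 = 0.05, beta_0 = -4
y = rng.binomial(1, p)                            # labels drawn from those odds

clf = LogisticRegression().fit(amount.reshape(-1, 1), y)
print(clf.coef_[0, 0], clf.intercept_[0])         # recovered slope and intercept
```

The recovered coefficients should land near the true 0.05 and -4, showing maximum likelihood estimation at work on a single explanatory variable.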
5.3 C5 Decision tree
Decision trees are a technique that classifies or predicts data using a tree with internal nodes representing binary choices on attributes and branches representing the outcome of that choice (West and Bhattacharya, 2016).
Decision trees are popular among data scientists. Although they are simple and easy to interpret and implement, their predictive ability is high. Decision trees are also suitable for financial institutions because, in addition to these advantages, the reasoning behind a prediction can be described, which is required in some cases, for example in fraud or money-laundering prediction. Decision tree algorithms can handle non-numerical as well as numeric data and are very suitable for individual credit evaluation in commercial banks (PANG and GONG, 2009, p. 5).
C5 is a decision tree package that adds boosting to C4.5 (Yoav Freund and Llew Mason, 1999).
C4.5 is based on the ID3 algorithm; both were developed by Quinlan.
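C5 itself is a commercial package (used here through Clementine), but the tree idea can be sketched with scikit-learn's CART-based DecisionTreeClassifier as a stand-in, on synthetic data with an invented fraud rule:

```python
# Stand-in for C5: a CART decision tree on two synthetic features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
X = rng.random((500, 2))                              # e.g. scaled amount, hour
y = ((X[:, 0] > 0.8) & (X[:, 1] > 0.5)).astype(int)   # toy fraud rule

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["amount", "hour"]))  # readable rules
```

The printed rules are the interpretability advantage mentioned above: each prediction can be traced to a chain of threshold tests.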
6. Meta-Classifiers
Meta-classifiers are algorithms that strive to improve the performance of a predictive model using any given classifier as a base classifier (Patnaik et al., 2015b). The nice thing about meta-classifiers is that they work with essentially any classifier, although some classifiers are better suited to specific meta-classifiers than others (Westreich et al., 2010). Improving performance, of course, depends on how performance is measured. The measures for evaluating a data mining method fall into two main categories: reducing the error rate, computed by dividing the number of incorrectly classified instances by the total number of instances, and reducing the cost due to wrongly classified samples plus the extra costs the fraud detection system imposes on the business (Westreich et al., 2010).
Bagging, AdaBoost, and Stacking are the most popular meta-classifiers (Bauer and Kohavi, 1999). According to (Phua et al., 2004), selecting different good classifier algorithms in a bagging model is likely to produce better cost savings than bagging multiple classifiers from the same algorithm.
In this study, a Bagging meta-classifier is used. The word bagging stands for Bootstrap Aggregating. Bootstrapping means taking equally sized samples from a dataset with replacement; a subset may therefore contain some samples repeated several times (Dudoit and Fridlyand, 2003).
The basic idea behind Bagging is to aggregate different models derived by the same classifier from different bootstrap samples. All models in Bagging take part in an ensemble vote with equal weight: to classify a new instance, each of the created models, which are usually quite different, votes for a class, and the class with the most votes is chosen as the prediction (Dudoit and Fridlyand, 2003).
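The bootstrap-and-vote procedure can be sketched as follows (a generic implementation, not the paper's code; any scikit-learn classifier can serve as the base, and the toy data is invented):

```python
# Generic Bootstrap Aggregating: bootstrap samples + equal-weight majority vote.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(base, X_train, y_train, X_test, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_test))
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), len(X_train))  # sample with replacement
        votes += clone(base).fit(X_train[idx], y_train[idx]).predict(X_test)
    return (votes / n_models >= 0.5).astype(int)           # majority vote

X = np.array([[0.], [1.], [2.], [3.], [10.], [11.], [12.], [13.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(bagging_predict(DecisionTreeClassifier(max_depth=1), X, y,
                      np.array([[1.], [12.]])))
```

Each of the 25 trees sees a slightly different bootstrap sample, and the equal-weight vote smooths out their individual errors.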
7. The proposed Bagging model
Here a multi-level method is presented. Each level contains several models (the models mentioned previously) with equal voting weight. This model's accuracy is higher and its cost lower than those of the simple models. The model is built in 5 steps (Figure 1):
1. Using the training dataset, the corresponding models of the simple algorithms (LR, NB, and C5) are created.
2. The created models are evaluated on the training dataset. The result for each sample is a 0 or 1, and this column of predictions is added to the training dataset. Here we use 3 models, so three new binary columns are added.
3. All training dataset columns except the three new binary columns are removed. The result is a training dataset containing the three new binary columns plus the target column.
4. Steps 2 and 3 are repeated for the test dataset, yielding a new test dataset with 4 binary columns.
5. New models are created from the new training dataset and evaluated on the new test dataset.
Figure 1- Proposed Bagging Model
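The five steps above can be sketched as follows (synthetic data; a scikit-learn decision tree stands in for C5, and the target rule is invented for illustration):

```python
# Sketch of the proposed model: base-model predictions become a binary dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X_train, X_test = rng.random((400, 4)), rng.random((100, 4))
y_train = (X_train[:, 0] > 0.6).astype(int)   # toy target
y_test = (X_test[:, 0] > 0.6).astype(int)

# Step 1: build the simple models on the training data.
bases = [LogisticRegression(), GaussianNB(),
         DecisionTreeClassifier(max_depth=3, random_state=0)]
for b in bases:
    b.fit(X_train, y_train)

# Steps 2-3: replace the original columns with each model's binary prediction.
Z_train = np.column_stack([b.predict(X_train) for b in bases])
# Step 4: the same transformation for the test set.
Z_test = np.column_stack([b.predict(X_test) for b in bases])

# Step 5: train a new model on the binary dataset and evaluate it.
meta = LogisticRegression().fit(Z_train, y_train)
print((meta.predict(Z_test) == y_test).mean())   # test accuracy
```

The final model sees only the three binary prediction columns, which is what the paper calls the virtual binary dataset.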
8. Dataset
The real-life dataset in this study contains the POS transactions of an Iranian bank: 102423 transactions from 4472 customers over a period of 23 months. In the pre-processing phase, 4221 transactions were removed. The transactions belong to 3576 separate accounts and were made on 5061 devices distributed across 411 different city points. The dataset contains 786 fraudulent transactions (0.8% of all transactions).
The transactions mostly belong to B2C commerce, and the average transaction amount is about 700000 (22 US dollars).
Primary attributes are those attributes that are available in the real dataset (like amount, time, id etc) and all of them except the “Amount” are non-numerical.
Incompatibility and high dimensionality make it impractical to feed all raw transactions of the dataset into fraud detection systems (Whitrow et al., 2009). To solve this problem, a transaction aggregation strategy can be used, in which the current behavior of customers is integrated into new attributes (Whitrow et al., 2009). In this study a similar method is used: based on the primary attributes in the dataset, new attributes are derived. These derived attributes aggregate the historical background of recent transactions, for example the number of purchases, the average purchase amount, and the maximum and minimum purchase amounts over particular periods such as days, weeks, or months.
In this study, three classifier methods, the C5 decision tree, Naïve Bayes, and Logistic Regression, are evaluated and compared with a meta-learning algorithm. The meta-learning algorithm used here works on a virtual dataset derived from the real dataset. Because of the low rate of fraudulent transactions relative to legal transactions (0.8%), which is typical in such applications, the two classes are sampled at different rates to obtain training data with a reasonable proportion of fraud to non-fraud cases. According to (Van Hulse et al., 2007), random under-sampling of the majority class is generally better than other sampling approaches for this purpose.
Following (Bhattacharyya et al., 2011b), we examine the performance of the selected algorithms on sampled training datasets with 15%, 10%, 5%, and 1% fraudulent transactions, labeled DS-15, DS-10, DS-5, and DS-1 in the results. Performance is observed on a separate test dataset containing 0.1% fraudulent transactions.
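Random under-sampling of the majority class to a target fraud rate can be sketched as follows (a generic implementation, not the paper's code; the toy data is invented):

```python
# Under-sample the majority (non-fraud) class to a target fraud rate,
# as used to build the DS-15 ... DS-1 training sets.
import numpy as np

def undersample(X, y, fraud_rate, seed=0):
    rng = np.random.default_rng(seed)
    fraud_idx = np.flatnonzero(y == 1)
    legit_idx = np.flatnonzero(y == 0)
    # Number of non-fraud rows that makes frauds exactly fraud_rate of the total.
    n_legit = round(len(fraud_idx) * (1 - fraud_rate) / fraud_rate)
    keep = np.concatenate([fraud_idx,
                           rng.choice(legit_idx, n_legit, replace=False)])
    return X[keep], y[keep]

# Toy data with a 0.8% fraud rate, resampled to 10% fraud.
X = np.arange(2000).reshape(-1, 1)
y = np.r_[np.ones(16, dtype=int), np.zeros(1984, dtype=int)]
Xs, ys = undersample(X, y, fraud_rate=0.10)
print(len(ys), ys.mean())   # 160 rows, fraud share 0.1
```

All fraud cases are kept; only the plentiful non-fraud cases are discarded, which preserves the scarce minority-class information.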
Clementine, a popular machine learning suite, is used for the analysis of the data.
As in (Bhattacharyya et al., 2011b), the parameters of the techniques were set to values found useful in the literature for the comparative evaluation. Further fine-tuning of the parameters was not attempted: it requires significant effort and time, can be a deterrent to practical use, and can lead to over-fitting to specific data (Goodwin et al., 2003).
9. Results
In this section, we present the results from our experiments comparing the performance of Naïve Bayes (NB), Logistic Regression(LR) and C5 decision tree with each other and finally with respect to the Bagging model which is proposed in this study.
The results of the mentioned models and their respective Bagging models for the DS-1 dataset are presented in Table 3. The difference in the Accuracy measure for NB, LR, and C5 is small, while the difference in Sensitivity among these three models is notable. Among them, C5 has the best results.
Table 3- Comparative table of mentioned measures on DS-1
Model        Accuracy  Sensitivity  Specificity  Precision  F-Measure  G-Mean
LR           0.9860    0.4987       0.9908       0.3522     0.4129     0.7029
NB           0.9874    0.1937       0.9966       0.3976     0.2605     0.4394
C5           0.9895    0.6337       1            1          0.7758     0.7960
Bagging-LR   0.9943    0.7968       0.9970       0.7846     0.7906     0.8913
Bagging-NB   0.9965    0.8019       0.9996       0.9728     0.8791     0.8953
Bagging-C5   0.9963    0.6337       1            1          0.7758     0.7960
After applying the proposed Bagging model to each of the three models, all measures improve (Table 3). In this case, the Accuracy of Bagging-NB exceeds the others. Except for Specificity and Precision, the following ordering holds:
Bagging-NB > Bagging-C5 > Bagging-LR
For Precision and Specificity, the ordering is:
Bagging-C5 > Bagging-NB > Bagging-LR
In the following, the performance of the mentioned models according to different measures on datasets with different fraud rates is presented (refer to tables 4,5 and 6).
Table 4- Performance of Logistic Regression across different fraud rates
LR           DS-15    DS-10    DS-5     DS-1
Accuracy     0.0096   0.8416   0.9898   0.9860
Sensitivity  0.9766   0.4103   0.1896   0.4987
Specificity  0        0.8459   0.8459   0.9908
Precision    0.0096   0.0258   0.4620   0.3522
F-measure    0.0190   0.0486   0.2688   0.4129
G-Mean       0        0.5892   0.4004   0.7029
Table 5- Performance of Naive Bayes across different fraud rates
NB           DS-15    DS-10    DS-5     DS-1
Accuracy     0.7416   0.7999   0.9519   0.9874
Sensitivity  0.6190   0.5287   0.3048   0.1937
Specificity  0.7433   0.8034   0.8034   0.9966
Precision    0.0310   0.0329   0.0888   0.3976
F-measure    0.0590   0.0620   0.1376   0.2605
G-Mean       0.6783   0.6517   0.4948   0.4394
One of the most important measures in fraud detection systems is Cost measure. In most cases, Cost is a measure for accepting or rejecting a fraud detection system, especially in retail commerce and E-commerce. The overall cost involves the cost of using the fraud detection system and the cost of the wrong prediction of the system.
The cost of false negative (FN) and false positive (FP) is presented in Table 2.
Since the model on DS-1 had the best performance across different measures, we computed the cost measure for DS-1 and the results were compared with the results of the proposed Bagging model.
Table 6- Performance of C5 across different fraud rates
C5           DS-15    DS-10    DS-5     DS-1
Accuracy     0.0096   0.9967   0.9584   0.9895
Sensitivity  0.9766   0.6935   0.7409   0.6337
Specificity  0        0.9997   0.9997   1
Precision    0.0096   0.9673   0.1581   1
F-measure    0.0190   0.8078   0.2607   0.7758
G-Mean       0        0.8326   0.8606   0.7960
The results of comparing the costs are presented in Table 7. Among the simple models LR, NB, and C5, the C5 model has the best (lowest) cost. The overall costs of the Bagging models are far lower than those of the simple models: B-NB (Bagging Naïve Bayes) has the best result at 4501, while the worst result among the Bagging models belongs to B-C5 at 9870. For C5, the Bagging model and the simple model have the same cost.
Table 7- Comparing Cost measure for models
Model   LR      NB      C5     B-LR   B-NB   B-C5
Cost    21629   22179   9870   6160   4501   9870
Figure 2- Comparative diagram of costs
Figure 3- Performance of mentioned models across different fraud rates
10. Conclusion
This paper examined the performance of three well-known data mining techniques, Naïve Bayes, Logistic Regression, and the C5 decision tree, together with their respective Bagging models, for fraud detection in card transactions. The real dataset contains transactions of an Iranian bank. Due to the highly imbalanced data, which is typical in such applications, an undersampling method was used to obtain datasets with a reasonable proportion of fraud to non-fraud cases. Hence, 4 datasets with different fraud rates were obtained, and the performance measures were computed for each separately.
The results showed that the models rank differently according to different measures and on datasets with different fraud rates. Although the unbalanced nature of fraud datasets might be expected to distort the results, DS-1 produced better results than the other datasets. The C5 model performed better than the other simple models, LR and NB. After applying the Bagging model to them, all results improved, and Bagging-C5 still had better results according to transactional measures.
As mentioned before, the Cost measure is very important in retail E-commerce. The Bagging models have much lower costs than the simple models, and among them Bagging-NB gives the best result, imposing the lowest cost on the business (Figure 2).
Since meta-learning models perform so well in fraud detection, more research on improving their performance is to be expected in the future.