Abstract

Santander is the largest Spanish bank and has the highest market value of any bank in the Eurozone. Part of its vision is to win its customers’ trust. Unhappy customers are known to switch loyalty, so the bank wants to identify customers who are thinking of leaving in order to take the necessary steps to retain them. In this paper we use data provided on Kaggle, consisting of hundreds of anonymized features, to predict which customers the bank might be in danger of losing. We used four classification models to predict the unhappy customers: Random Forest, Rpart, Gradient Boosting Model (GBM) and Extreme Gradient Boosting (XGBoost).

Keywords

Satisfaction classification: XGBoost, GBM, Rpart and Random Forest, banking industry, data mining.

1. Introduction

The Kaggle competition we chose for our project is ‘Santander Customer Satisfaction’. Santander wants to identify dissatisfied customers at an early stage of the banking life cycle, so that it can take proactive measures to improve customer satisfaction and prevent customers from leaving.

The competition is evaluated on the area under the ROC curve (AUC) between the predicted probability and the observed target. We built four models using the Random Forest, Rpart, GBM and XGBoost algorithms. With each model we predicted the probability of the TARGET variable for every ID in the test data set and uploaded the results to Kaggle as a submission file. The competition is still active at the time of writing, and our best submission places us 964th out of 3600 entries, which is roughly the top 27%.

2. Background

In this section we give a brief summary of the models we used.

2.1 Random Forest: A random forest is built from a simple machine learning algorithm known as a “decision tree”. A decision tree works by dividing the data set into smaller and smaller subsets, which helps to identify patterns that in turn help with the prediction. Random forest is an ensemble technique in which several weak learners come together to form a strong learner. In other words, several trees are combined to give better predictive results than a single tree.

2.2 Recursive partitioning algorithm (Rpart): Rpart was developed from the ideas of CART (Classification and Regression Trees). It builds classification or regression models in a two-stage process, and the resulting model can be represented as a binary tree. The decision tree is constructed step by step by deciding, for each node, whether or not to split it into two daughter nodes. Splitting continues until no further improvement to the results can be made.

2.3 Gradient Boosting Model (GBM): GBM is a predictive modeling algorithm that can be used for both classification and regression problems. We used GBM for classification, with decision trees as the base learners. Every successive tree is fitted to the prediction residuals of the trees before it. Boosting algorithms are efficient on large data sets: the model combines several weak models, and the combination is ‘gradient boosted’ for better accuracy. We used the AdaBoost exponential loss function for 0/1 outcomes for the GBM in our study.

2.4 Extreme Gradient Boosting (XGBoost): XGBoost is a very powerful boosting algorithm, known for its speed and predictive accuracy. XGBoost accepts only numeric vectors as input, which suits the Santander dataset because all of its variables are numeric. The algorithm works by creating a large number of trees, which combine to form a highly predictive model. The key to the model’s performance lies in choosing the right parameters.

3. Related Work

There are many machine learning algorithms in data mining. To get an understanding of which algorithms will work best for this Kaggle problem, we looked at research papers that could give us a direction.

In their project on sales forecasting for retail chains (Jain, A., Menon, M.N. and Chandra, S., n.d.), the authors made predictions using Random Forest Regression, Linear Regression and XGBoost. Their analysis aimed to identify patterns and outliers in order to improve the ranking of the prediction algorithm. Our focus is also on obtaining good predictions. We use Random Forest and XGBoost to identify patterns and to remove variables that do not add value to the prediction.

Another work with relevant information was ‘Handling Class Imbalance in customer churn prediction’ (J Burez, D van den Poel, 2009). Although that paper is about churn prediction, it gave us the idea of using gradient boosting and weighted random forests. The authors used ROC graphs to assess the accuracy of each classifier and evaluated the classifiers using AUC, with which they found an increase in performance over standard techniques without sampling. This is what we need in our project. However, we have not used lift as they did, because they used it to analyze accuracy in the context of marketing practice.

4. Methods and Tools

The purpose of this competition is to identify the bank’s unhappy customers. Predicting unhappy customers can be approached with classification models from data mining. For this work we used the R programming language for data analysis, data understanding and building the tree-based models. For some of the data visualization we used Python, which generates good-quality plots. We submitted the predicted values to Kaggle. We compared the following algorithms: Random Forest, Rpart, Gradient Boosting and XGBoost. Of these techniques, XGBoost suited the problem best.

4.1 Business Understanding and Data Understanding:

The main goal of this competition is to identify unhappy customers as early as possible so that the bank can convince them to stay. The bank provided a train dataset of 76020 observations with 371 variables and a test dataset of 75818 observations with 370 variables. The train dataset includes an indicator of customer satisfaction called TARGET. The model has to be tuned to predict customer satisfaction for the test dataset, where this indicator is not provided. The 370 variables common to both datasets are anonymized and hold numeric values.

The bank provides no description of the variables, and the variable names are in Spanish. We are not sure which variables are categorical and which are continuous. At first glance, many columns hold the same constant value for every observation. Some variables closely resemble other variables and add no value to the prediction process. The TARGET variable is “0” for happy customers and “1” for unhappy customers, and it is highly imbalanced: there are 73,012 satisfied customers and 3,008 unsatisfied customers, so the class proportions are 0.96 for ‘0’ and 0.04 for ‘1’.
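As a quick check of the imbalance, the class proportions follow directly from the two counts quoted above. The sketch below is written in Python rather than R, and the function name class_proportions is our own choice for illustration.

```python
def class_proportions(counts):
    """Return each class's share of the total number of observations."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Counts reported for the TARGET variable in the train data.
target_counts = {0: 73012, 1: 3008}
proportions = class_proportions(target_counts)

for label, share in proportions.items():
    print(label, round(share, 4))
```

The minority class makes up about 4% of the observations, which is why plain accuracy is a poor guide for this problem.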

4.2 Data Preparation:

The train data contains a unique ID and the TARGET variable. The test data contains only the unique ID. As part of the data cleaning process, the ID and TARGET variables are stored in buffer variables and removed from the train and test data. We count the number of zero values in each row and append this count as a new column to the train and test data frames. If a column has the same value in every row, we remove it, since it gives no information about the data. We then build a vector of the names of columns whose values are identical to those of another column, and use a set-difference function to drop these columns from the train and test data. Once the clean-up is finished, we append the TARGET variable back to the train data from the buffer variable.
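The clean-up steps can be sketched as follows. This is a minimal illustration in plain Python on a toy column-wise table, not the R code we ran; the function names, the column names and the name n0 for the zero-count column are all hypothetical. In practice the columns to drop would be determined on the train data and removed from both train and test.

```python
def add_zero_count(columns):
    """Append a column holding the number of zero values in each row."""
    n_rows = len(next(iter(columns.values())))
    zero_count = [sum(1 for col in columns.values() if col[i] == 0)
                  for i in range(n_rows)]
    return {**columns, "n0": zero_count}


def drop_constant(columns):
    """Remove columns whose value is the same in every row."""
    return {name: col for name, col in columns.items() if len(set(col)) > 1}


def drop_duplicates(columns):
    """Keep only the first of any group of columns with identical values."""
    seen, kept = set(), {}
    for name, col in columns.items():
        key = tuple(col)
        if key not in seen:
            seen.add(key)
            kept[name] = col
    return kept


# Toy table stored column-wise; the names are made up for illustration.
train = {
    "var_a": [0, 3, 0, 5],
    "var_b": [1, 1, 1, 1],  # constant: carries no information
    "var_c": [0, 3, 0, 5],  # duplicate of var_a
    "var_d": [2, 0, 0, 7],
}
cleaned = drop_duplicates(drop_constant(add_zero_count(train)))
print(sorted(cleaned))
```

The zero count is computed before any column is dropped, following the order of the steps described above.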

5. Principal Component Analysis

We carried out a Principal Components Analysis (PCA) of the train data set. Figure 1 shows the number of principal components. Plotting the first two components against the target variable gave a split between the happy and unhappy customers, represented by 0 and 1 respectively, as shown in Figure 2. The same plot produced in Python is shown in Figure 3 for comparison.

6. Implementation

The implementation phase focuses on translating the business goals into models through the application of data mining techniques. The modelling was done in RStudio 3.2.4. RStudio is an integrated development environment for R, which we used for data mining, machine learning and predictive analysis. We employed four different classification models and compared their performance.

6.1 Random Forest Model

First we built a simple model using random forest. We ran this model using the R caret package, the randomForest package to build the model, and the plyr package, which provides tools for splitting, applying and combining data. We used the e1071 package (miscellaneous functions of the Department of Statistics, Probability Theory Group, TU Wien) to obtain the confusion matrix. In this model we combined the train and test data with the rbind function. Variables in the combined dataset that have no predictive value can be removed after checking with the nearZeroVar function. We then split the data back into train and test sets to predict TARGET. We built an unbalanced random forest model with 100 trees and mtry = 10. The model took a long time to build, and we could not get effective results from it.

6.2 Recursive Partitioning (RPart) Tree:

To build this model, we first checked the imbalance of the TARGET variable in the train data using table(dataframe$TARGET) in R. We checked the class proportions using prop.table(table(dataframe$TARGET)). The results show 96% satisfied customers and only 4% unsatisfied customers. Because the data is imbalanced, our background reading suggested using a recursive partitioning tree. This algorithm can be used to build both classification and regression trees. The Rpart tree can be run by loading the rpart package in R. No further data clean-up is required beyond the data preparation described in Section 4.2, after which we built the model. Building the tree is very quick. This model performs well compared with the random forest model.

6.3 Gradient Boosting Model:

After creating our own models, the Kaggle forum motivated us to focus on the Gradient Boosting and XGBoost models. Gradient boosting for classification builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage, n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced [2]. The model combines multiple weak learners and is boosted along the gradient: each tree fits the residuals of the previous trees to improve accuracy. We chose AdaBoost, an exponential loss function for 0/1 outcomes. For this model we chose a set of variables and stored them in a vector; after data cleaning, we believe the predictive signal lies in these variables. The important parameters are: number of trees = 10000; shrinkage (how fast the algorithm moves along the gradient) = 0.01; depth (the number of decisions that each tree evaluates) = 5; and cv.folds = 5 (five-fold cross-validation). This model took nearly 2 hours to run. Its performance is very good compared with the other models.
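To illustrate how each tree is fitted to the residuals of the trees before it, the following is a minimal sketch of gradient boosting in plain Python. It makes several simplifying assumptions that differ from our actual model: it uses squared-error loss (for which the negative gradient is simply the residual) rather than the AdaBoost exponential loss, one-split trees (stumps) on a single feature rather than trees of depth 5, and toy data. The function names are our own.

```python
def fit_stump(x, residuals):
    """Least-squares one-split tree on a single feature.

    Assumes x contains at least two distinct values.
    Returns (threshold, left_value, right_value).
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_trees=200, shrinkage=0.1):
    """Fit stumps one after another, each to the current residuals."""
    base = sum(y) / len(y)
    prediction = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared-error loss the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        threshold, left_value, right_value = fit_stump(x, residuals)
        stumps.append((threshold, left_value, right_value))
        prediction = [
            p + shrinkage * (left_value if xi <= threshold else right_value)
            for xi, p in zip(x, prediction)
        ]
    return base, shrinkage, stumps


def predict(model, xi):
    """Additive model: base value plus the shrunken sum of all stumps."""
    base, shrinkage, stumps = model
    return base + shrinkage * sum(
        left if xi <= threshold else right for threshold, left, right in stumps
    )


# Toy data: the outcome switches from 0 to 1 between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
model = boost(x, y)
fitted = [predict(model, xi) for xi in x]
print([round(f, 3) for f in fitted])
```

A smaller shrinkage makes each tree contribute less, so more trees are needed to reach the same fit. This is the trade-off behind pairing a shrinkage of 0.01 with a large number of trees.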

6.4 Extreme Gradient Boosting Tree:

XGBoost is short for extreme gradient boosting. It is used for supervised learning problems, where we use the train data to predict a target variable in the test data. For this model we need a way to find the best parameters given the training data. The objective function measures the performance of the model for a given set of parameters. An important fact about objective functions is that they must always contain two parts: training loss and regularization.

Obj (Θ) = L (Θ) + Ω (Θ)

Here L is the training loss function and Ω is the regularization term. The training loss measures how well the model fits the training data. The regularization term controls the complexity of the model, which helps to avoid overfitting [3]. In the XGBoost model, the prediction scores of the individual trees are summed to give the final score. We ran this model using the xgboost package in R, together with the Matrix package to create a sparse model matrix of the train data, which is the input we passed to XGBoost. The parameters were specified as objective = "binary:logistic", booster = "gbtree", eval_metric = "auc", eta = 0.02, max_depth = 5, subsample = 0.7, colsample_bytree = 0.7. We tried three different parameter settings in building the model. This model performs best of the four models.
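The two-part objective can be made concrete with a small numeric sketch in Python. We assume the logistic loss for L (matching a binary logistic objective) and the tree regularization used in the XGBoost documentation, Ω = γT + ½λΣw², where T is the number of leaves of a tree and w its leaf weights. The leaf weights, leaf assignments and function names below are made up for illustration and do not come from our fitted model.

```python
import math


def logistic_loss(y_true, scores):
    """Training loss L: summed log loss of the raw (pre-sigmoid) scores."""
    return sum(
        math.log(1 + math.exp(-s)) if y == 1 else math.log(1 + math.exp(s))
        for y, s in zip(y_true, scores)
    )


def regularization(leaf_weights, gamma=0.0, lam=1.0):
    """Omega: for each tree, gamma * leaves + 0.5 * lam * sum of squared weights."""
    return sum(
        gamma * len(weights) + 0.5 * lam * sum(w * w for w in weights)
        for weights in leaf_weights
    )


def objective(y_true, scores, leaf_weights, gamma=0.0, lam=1.0):
    """Obj = L + Omega."""
    return logistic_loss(y_true, scores) + regularization(leaf_weights, gamma, lam)


# Two hypothetical trees: their leaf weights, and the leaf each of three
# rows falls into in each tree.
leaf_weights = [[0.4, -0.2], [0.1, -0.3, 0.2]]
leaf_index = [[0, 1, 0], [0, 1, 2]]
y_true = [1, 0, 1]

# The final score of a row is the sum of its leaf weights over all trees.
scores = [
    sum(weights[index[i]] for weights, index in zip(leaf_weights, leaf_index))
    for i in range(len(y_true))
]
print(scores)
print(objective(y_true, scores, leaf_weights))
```

Raising gamma or lam increases the penalty on trees with many leaves or large leaf weights, which is how the regularization term discourages overly complex models.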

7. Evaluation

The evaluation criterion on Kaggle for this competition is the area under the ROC curve (AUC) between the predicted and observed target. The accuracy of a model depends on how successfully it separates the satisfied customers from the unsatisfied customers, and this is measured by the area under the ROC curve. A perfect model, which correctly predicts every customer as either satisfied or unsatisfied, has an area of 1.00. The evaluation of each of the models is discussed below.
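The area under the ROC curve equals the probability that a randomly chosen unsatisfied customer receives a higher predicted score than a randomly chosen satisfied customer. The sketch below computes it directly from that definition in plain Python. It is an illustration on made-up scores, not the code Kaggle uses, and the pairwise loop is only practical for small samples.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))
```

Because only the ranking of the scores matters, a model that gives every customer the same score gets an AUC of 0.5, however imbalanced the classes are.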

7.1 Random Forest:

A confusion matrix is a good measure of model performance. We built our model on two samples, a down sample and an unbalanced sample. From the results shown below we can see that the accuracy is good for the down sample but poor for the unbalanced sample. This is our least successful model. The ROC curves in Figure 4 confirm the results of the confusion matrices. The blue curve represents the unbalanced sample, whose poor accuracy shows in the curve lying close to the 45-degree line. The red curve represents the down sample, which has better accuracy.

Confusion Matrix and Statistics

Down Sample

Prediction      0      1
         0  15822     94
         1   6081    808

Balanced Accuracy: 0.8091
AUC: 0.8049

Unbalanced Sample

Prediction      0      1
         0  21902    899
         1      1      3

Balanced Accuracy: 0.501640
AUC: 0.5333

7.2 Recursive Partitioning (Rpart) Tree:

We evaluated the Rpart algorithm using cross-validation. The model split on two variables, ‘saldo_var30’ and ‘var15’. Cross-validation was performed with these two variables, and the results are shown below. The xerror (cross-validation error), relative error and xstd together help to determine the optimal place to prune the tree, which is shown in Figure 5.

         CP  nsplit  rel error   xerror      xstd
1  0.028954       0    1.00000  1.00003  0.017133
2  0.010000       2    0.94209  0.94238  0.015463

The tree summary produced 7 nodes, of which nodes 2, 4 and 5 produced primary and surrogate splits, as shown in Figure 6. The way in which the splits are made affects the performance of the model. Our model produced good results.

7.3 Gradient Boosting Model:

The GBM was trained on a selected list of variables and cross-validated on the target variable, using five-fold cross-validation. Figure 7 shows the exponential log loss of the model trained with five-fold cross-validation. The optimum number of trees was determined to be 2014. For a model with 50% of the data held out, the optimum number of trees was 1947. The model took nearly two hours to run. We measured the relative importance of the variables used to train the model; the results are shown in Figure 8, from which ‘var38’ is the most important. The RMSE (root mean squared error) is 0.1859, which is another indicator that the model performs well. This was our most successful model. To evaluate its performance we built the confusion matrix shown below. The sensitivity is very high and the specificity is low. The accuracy of the test reflects how well it separates the unhappy customers from the happy customers and is measured by the area under the ROC curve (Figure 9). The accuracy of the model is very good, and the score received on Kaggle’s public leaderboard is also good. The relative influence of the variables is plotted in Figure 10.

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0  1440    63
         1    16     1

Accuracy: 0.948
Sensitivity: 0.98901
Specificity: 0.01562
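The reported statistics can be recomputed from the four cells of the confusion matrix. The sketch below, in plain Python with a function name of our own, treats class 0 (happy customers) as the positive class, which is the convention that reproduces the sensitivity and specificity reported above.

```python
def confusion_metrics(matrix):
    """Summary statistics for a 2x2 confusion matrix.

    matrix[i][j] is the number of cases predicted as class i whose
    reference (true) class is j. Class 0 is treated as the positive class.
    """
    (pred0_ref0, pred0_ref1), (pred1_ref0, pred1_ref1) = matrix
    total = pred0_ref0 + pred0_ref1 + pred1_ref0 + pred1_ref1
    sensitivity = pred0_ref0 / (pred0_ref0 + pred1_ref0)
    specificity = pred1_ref1 / (pred0_ref1 + pred1_ref1)
    return {
        "accuracy": (pred0_ref0 + pred1_ref1) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


gbm_metrics = confusion_metrics([[1440, 63], [16, 1]])
for name, value in gbm_metrics.items():
    print(name, round(value, 5))
```

The low specificity means that only 1 of the 64 unhappy customers in this sample was identified, even though the overall accuracy is close to 95%. Applied to the two random forest matrices in Section 7.1, the same function gives a balanced accuracy of about 0.8091 for the down sample and 0.5016 for the unbalanced sample, which are the values reported there.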

8. Conclusion and Future Work

As a starting point for our project we referred to academic papers on classification models, which helped us understand the features and characteristics of each model. We hope our work provides a comparison of how effective the different models are on problems similar to differentiating between happy and unhappy customers.

The competition requires predicting happy and unhappy customers, and with the models we built we were able to do so. However, the terms happy and unhappy are subjective, and our conclusions are based on the predictions of the models. We are not able to identify the causes that lead a customer to be classified as happy or unhappy.

The solutions we obtained are limited by the parameters we chose, as performance varies when we alter the parameters. Because the variables are anonymized, we were unable to make an informed choice when selecting variables to train the models. Our models are based on assumptions about what information we think the variables provide. Our models might have improved if we had been able to do more feature engineering.

Time permitting, we would like to rebuild the models with varied parameters and compare their performance. We would also like to try other ensemble techniques such as bagging or stacking. Another focus will be to use a different programming language, such as Python, and tools such as RapidMiner, to compare performance.

The scores the four models received on Kaggle’s public leaderboard are shown in Figure
