To build this model, we first checked the class imbalance of the TARGET variable in the training data using the table(dataframe$TARGET) function in R. The relative class frequencies were checked with prop.table(table(dataframe$TARGET)). The results show 96% satisfied customers and only 4% unsatisfied customers. Because the data is imbalanced, our background reading suggested using a recursive partitioning tree. This algorithm can be used to build both classification and regression trees. The R-part tree can be run by loading the rpart package in R. No further data cleanup was required to build this model, so we built it directly after data preparation. Building the tree is very quick. The performance of this model is good when compared with the random forest model.
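The class-balance check described above can be sketched in a few lines. This is a Python illustration only, not the R code used in the project; the function name class_distribution and the label vector are made up for the example, with the labels chosen to mirror the reported 96%/4% split.

```python
from collections import Counter

def class_distribution(labels):
    """Return (counts, proportions) per class, similar in spirit to
    table() followed by prop.table() in R."""
    counts = Counter(labels)
    total = sum(counts.values())
    proportions = {cls: n / total for cls, n in counts.items()}
    return dict(counts), proportions

# Made-up labels: 0 = satisfied, 1 = unsatisfied.
target = [0] * 96 + [1] * 4
counts, proportions = class_distribution(target)
print(counts)
print(proportions)
```

A split this skewed means that a model predicting "satisfied" for everyone would already be right 96% of the time, which is why plain accuracy is a weak guide here.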
6.3 Gradient Boosting Model:
After creating our own models, the Kaggle forum motivated us to focus on the gradient boosting model and the XGBoost model. Gradient boosting for classification builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage, n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced. The model combines many weak learners, and each new tree is fit to the gradient of the loss. The trees iteratively fit the residuals of the current model to improve accuracy. We chose adaboost, which is an exponential loss function for 0/1 outcomes. In this model, we chose a set of variables and stored it in a vector. We believe that, after data cleaning, the predictive signal lies in these variables. The important parameters are: number of trees = 10000; shrinkage (how fast the algorithm moves along the gradient) = 0.01; depth (the number of decisions that each tree will evaluate) = 5; and cv.folds = 5 (five-fold cross-validation). This model took nearly 2 hours to run. The performance of this model is very good compared to the other models.
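To make the stage-wise idea concrete, here is a minimal Python sketch of gradient boosting with an exponential loss. It is a toy under stated assumptions, not the gbm implementation used in the project: it works on a single feature, uses depth-1 stumps instead of depth-5 trees, takes a plain shrunken gradient step at each stage, and codes the labels as -1/+1. All function names and the data are hypothetical.

```python
import math

def fit_stump(xs, targets):
    """Least-squares regression stump on one feature.
    Returns (threshold, left_value, right_value).
    Assumes xs contains at least two distinct values."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = (sum((t - lv) ** 2 for t in left)
               + sum((t - rv) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]

def boost(xs, ys, n_trees=50, shrinkage=0.1):
    """Forward stage-wise boosting for labels ys in {-1, +1}.
    Each stump is fit to the negative gradient of exp(-y * F)."""
    scores = [0.0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Negative gradient of exp(-y * F) with respect to F.
        residuals = [y * math.exp(-y * f) for y, f in zip(ys, scores)]
        thr, lv, rv = fit_stump(xs, residuals)
        # Store the leaf values already scaled by the shrinkage.
        stumps.append((thr, shrinkage * lv, shrinkage * rv))
        scores = [f + (shrinkage * lv if x <= thr else shrinkage * rv)
                  for x, f in zip(xs, scores)]
    return stumps

def predict_score(stumps, x):
    """Additive model: sum of the (shrunken) stump outputs."""
    return sum(lv if x <= thr else rv for thr, lv, rv in stumps)

def exp_loss(stumps, xs, ys):
    """Mean exponential loss of the additive model on (xs, ys)."""
    return sum(math.exp(-y * predict_score(stumps, x))
               for x, y in zip(xs, ys)) / len(xs)

# Made-up, cleanly separable data for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, -1, 1, 1, 1]
model = boost(xs, ys, n_trees=50, shrinkage=0.1)
print(exp_loss([], xs, ys), exp_loss(model, xs, ys))
```

A smaller shrinkage makes each stage contribute less, so more trees are needed to reach the same loss; that trade-off is one reason a run with 10000 trees and shrinkage 0.01 is slow.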
6.4 Extreme Gradient Boosting Tree:
XGBoost is short for extreme gradient boosting and is used for supervised learning problems, where we use the training data to predict a target variable in the test data. For this model, we need a way to find the best parameters given the training data. The objective function measures the performance of the model for a given set of parameters. An important fact about objective functions is that they always contain two parts: training loss and regularization.
Obj (Θ) = L (Θ) + Ω (Θ)
where L is the training loss function and Ω is the regularization term. The training loss measures how predictive our model is on the training data. The regularization term controls the complexity of the model, which helps us avoid overfitting. In the XGBoost model, the prediction scores of the individual trees are summed to get the final score. We can run this model by loading the xgboost package in R, together with the Matrix package to create a sparse model matrix of the training data. This model matrix is the input to the XGBoost model. The parameters were specified as objective = "binary:logistic", booster = "gbtree", eval_metric = "auc", eta = 0.02, max_depth = 5, subsample = 0.7, colsample_bytree = 0.7. We tried three different parameter settings when building the model. The performance of this model is the best when compared with the other three models.
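The two-part objective and the summing of tree scores can be worked through on a tiny example. The Python sketch below assumes the logistic loss for L and the tree penalty used by XGBoost, Ω(f) = γT + ½λ Σ w², where T is the number of leaves and w are the leaf weights. The leaf weights, leaf assignments and labels are invented for illustration, and the function names are hypothetical.

```python
import math

def log_loss(y_true, scores):
    """Sum of logistic losses; scores are raw additive scores (log-odds)."""
    total = 0.0
    for y, s in zip(y_true, scores):
        p = 1.0 / (1.0 + math.exp(-s))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

def regularization(trees, gamma=0.0, lam=1.0):
    """Omega summed over trees: gamma * (number of leaves)
    + 0.5 * lam * (sum of squared leaf weights)."""
    return sum(gamma * len(w) + 0.5 * lam * sum(v * v for v in w)
               for w in trees)

def objective(y_true, scores, trees, gamma=0.0, lam=1.0):
    """Obj = training loss + regularization."""
    return log_loss(y_true, scores) + regularization(trees, gamma, lam)

# Two made-up trees, each described only by its leaf weights.
trees = [[-0.4, 0.6], [-0.2, 0.3]]
# Which leaf each of four rows falls into, per tree (also made up).
leaf_index = [[0, 0, 1, 1], [0, 1, 0, 1]]
y_true = [0, 0, 1, 1]

# The final score of a row is the sum of its leaf weights over all trees.
scores = [sum(trees[k][leaf_index[k][i]] for k in range(len(trees)))
          for i in range(len(y_true))]

print(scores)
print(objective(y_true, scores, trees, gamma=1.0, lam=1.0))
```

Raising γ or λ makes large or many-leaved trees more expensive, which is how the regularization term discourages overfitting.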
The evaluation criterion on Kaggle for this competition is the area under the ROC curve between the predicted and observed target. The accuracy of a model depends on how successful it is in separating the satisfied customers from the unsatisfied customers, and it is measured by the area under the ROC curve. A perfect model, which correctly predicts every customer as either satisfied or unsatisfied, will have an area of 1.00. The evaluation of each of the models is discussed below.
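One way to see what the area under the ROC curve measures is to compute it directly as the fraction of (unsatisfied, satisfied) pairs in which the unsatisfied customer gets the higher score. The Python sketch below does this with a simple pairwise loop; it is an illustration with made-up scores, not the evaluation code used by Kaggle or by the project, and the function name roc_auc is hypothetical.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive (label 1) is scored
    above a random negative (label 0); ties count as one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up examples: a perfect ranking and an imperfect one.
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))    # perfect separation
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # one pair out of order
```

Because only the ordering of the scores matters, the measure is not distorted by the 96%/4% class imbalance in the way plain accuracy is.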
7.1 Random Forest:
A confusion matrix is a good measure of model performance. We based our model on two samples, a down sample and an unbalanced sample. From the results shown below, we can see that the accuracy is good for the down sample but poor for the unbalanced sample. This is our least successful model. The ROC curve in Figure 4 confirms the results of the confusion matrix. The blue curve represents the unbalanced sample, which has poor accuracy because it runs almost along the 45-degree line. The down sample, represented by the red line, has better accuracy.
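The report does not state how the down sample was drawn. A common approach, sketched below in Python as an assumption rather than a description of what was actually done, is to randomly discard rows from the larger class until every class has as many rows as the rarest one. The function name down_sample and the data are hypothetical.

```python
import random

def down_sample(rows, labels, seed=0):
    """Randomly keep only as many rows of each class as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample without replacement so no row is duplicated.
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Made-up data: row ids 0..99, the last four belong to the rare class.
rows = list(range(100))
labels = [0] * 96 + [1] * 4
ds_rows, ds_labels = down_sample(rows, labels)
print(len(ds_rows), ds_labels.count(0), ds_labels.count(1))
```

The price of balancing this way is that most of the majority-class rows are thrown away, so the model sees far less data.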
We evaluated the Rpart algorithm using cross-validation. The model selected two variables, ‘saldo_var30’ and ‘var15’. Cross-validation was performed on these two variables, and the results are shown below. The x-error (cross-validation error), relative error and x-std together help to determine the optimal place to prune the tree, which is shown in Figure 5. The tree summary produced 7 nodes, of which nodes 2, 4 and 5 produced primary and surrogate splits, as shown in Figure 6. The way in which the splits are made affects the performance of the model. Our model produced good results.
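One widely used way to combine x-error and x-std when choosing where to prune is the one-standard-error rule: take the simplest tree whose cross-validation error is within one x-std of the lowest error. The report does not say which rule was applied, so the Python sketch below is only an illustration of that idea. The table values are invented and are not the project's results; the function name choose_cp_one_se is hypothetical.

```python
def choose_cp_one_se(cp_table):
    """Return the row of the simplest tree (fewest splits) whose xerror is
    within one xstd of the minimum xerror in the table."""
    best = min(cp_table, key=lambda row: row["xerror"])
    threshold = best["xerror"] + best["xstd"]
    candidates = [row for row in cp_table if row["xerror"] <= threshold]
    return min(candidates, key=lambda row: row["nsplit"])

# Invented complexity table, laid out like the one rpart prints.
cp_table = [
    {"cp": 0.050, "nsplit": 0, "xerror": 1.00, "xstd": 0.03},
    {"cp": 0.020, "nsplit": 1, "xerror": 0.80, "xstd": 0.03},
    {"cp": 0.010, "nsplit": 3, "xerror": 0.75, "xstd": 0.03},
    {"cp": 0.005, "nsplit": 6, "xerror": 0.74, "xstd": 0.03},
]
chosen = choose_cp_one_se(cp_table)
print(chosen["nsplit"], chosen["cp"])
```

In this made-up table the tree with 6 splits has the lowest error, but the tree with 3 splits is statistically indistinguishable from it, so the smaller tree is preferred.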
The gradient boosting model was trained on a selected list of variables and cross-validated against the target variable. Five-fold cross-validation was used to train the model. Figure 7 shows the exponential log loss of the model trained with five-fold cross-validation. The optimum number of trees was determined to be 2014. For a model with 50% of the data held out, the optimum number of trees was 1947. The model took nearly two hours to run. We measured the relative importance of the variables that were used to train the model, and the results are shown in Figure 8. From the figure, ‘var38’ has the most importance. The RMSE (root mean squared error) is 0.1859, which is another indicator that the model performance is good.
XGBoost was our most successful model. To evaluate its performance we built the confusion matrix; the results are shown below. The sensitivity is very high and the specificity is low. The accuracy of the model reflects how well it has separated the unhappy customers from the happy customers and is measured by the area under the ROC curve (Figure 9). The accuracy of the model is very good, and the score received on Kaggle’s public leaderboard is also good. The relative influence of the variables is plotted in Figure 10.
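Sensitivity and specificity both come straight from the four cells of the confusion matrix, and their values depend on which class is treated as "positive". The report does not state that choice, so the Python sketch below takes it as a parameter. The labels and predictions are made up, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for the chosen positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(confusion_counts(y_true, y_pred))
print(sensitivity_specificity(y_true, y_pred))
```

Swapping the positive class swaps the two numbers, so a result of "high sensitivity, low specificity" should always be read together with the class that was taken as positive.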
As a starting point for our project, we referred to academic papers in the field of classification models to help us understand the features and characteristics of each model. We hope our work provides a comparison of how effective each of the different models is at prediction problems similar to differentiating between Happy and Unhappy customers.
The competition requires predicting Happy and Unhappy customers, and the models we built were able to do so. However, the terms happy and unhappy are subjective, and our conclusions are based on the predictions of the models. We are not able to identify the causes that lead customers to be classified as Happy or Unhappy.
The solutions we obtained are limited by the parameters we chose, as the performance seems to vary when we alter the parameters. As the variables provided are anonymized, we are unable to make an informed choice in selecting the variables when training the model. Our models are based on assumptions about what information we think the variables provide. Our models might have improved if we had been able to do more feature engineering.
Time permitting, we would like to build the models with varying parameters and compare their performance. We would also like to try building the models using other ensemble techniques such as bagging or stacking. Another focus will be to use a different programming language, such as Python, and tools such as RapidMiner to compare the performance.
The scores the four models received on Kaggle’s public leaderboard are shown in Figure 11.