
Essay: Analyzing Hijama Data Using Decision Tree and Regression Tree Techniques

Essay details and download:

  • Subject area(s): Sample essays
  • Reading time: 5 minutes
  • Price: Free download
  • Published: 1 April 2019*
  • Last Modified: 23 July 2024
  • File format: Text
  • Words: 1,191 (approx)
  • Number of pages: 5 (approx)




This chapter focuses on the statistical methods used to analyze the study data; the practical application follows in the next chapter. The first part explains the methodology of the decision tree: its structure, types, and size. The second part deals with the basic statistical methods related to analyzing the Hijama data.

Decision Tree

A decision tree generates classification or regression models in the shape of a flowchart-like tree structure. It is a nonparametric method, so it makes no distributional assumptions. In general, the tree has a fixed structure built from a number of nodes. Some of these are internal nodes, which branch out to other nodes; the rest are terminal nodes, or leaves, which hold the class labels. Prediction starts at the root node of the tree with a question about that node's attribute; the answer selects one of the node's branches, and the process repeats until a leaf is reached. Figure 2.1 represents this step in detail.

Figure 2.1: Decision tree structure for numerical and categorical variables

Classification and Regression Tree

Decision trees come in two types, classification and regression trees (CART). If the study at hand aims to predict a categorical variable, the appropriate tree is a classification tree. On the other hand, if the predicted variable is continuous, a regression tree should be used. Accordingly, a regression tree uses the deviance to represent its error, while the misclassification rate is the measure used to evaluate a classifier's performance.

Splitting Rules

The process of splitting a node needs a specific measure in order to generate the tree in a reasonable and accurate way. This measure is called node impurity, and it quantifies how pure the node is: if a node is completely pure, all elements in the node belong to a single class, and the impurity is zero. There are many measures of node impurity.
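As a quick illustration of node impurity, the sketch below computes the Gini index (the measure adopted in the next section) for a single node in plain Python; the class labels are invented purely for illustration:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# A completely pure node has impurity 0.
print(gini(["improved"] * 8))                          # 0.0
# An evenly mixed two-class node has the maximum impurity of 0.5.
print(gini(["improved"] * 4 + ["not improved"] * 4))   # 0.5
```

A split is preferred when the weighted impurity of the resulting child nodes is lower than the impurity of the parent node.
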
For this study we used the Gini index, given in equation 2.1:

Gini(D) = 1 − ∑i pi2 (2.1)

where D is a data partition and pi is the probability that a tuple in D belongs to class i, with the sum taken over all classes. When considering a binary split, we compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on attribute A partitions D into D1 and D2, the Gini index of D given that partitioning is

GiniA(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)

For each categorical variable A, the subset that gives the minimum Gini index is selected as the splitting subset. Likewise, for a continuous variable A, the point giving the minimum Gini index is taken as the split point of that node. The chosen split point of A then produces a binary split, represented as A ≤ split point and A > split point.

Complexity Parameter

The complexity parameter (CP) is used to control and select the best size of the decision tree: tree construction does not continue unless a split decreases the overall lack of fit by at least a factor of CP. The CP table is the most important part of regression tree pruning, since it reports the complexity of the tree model (CP), the training error (rel error), and the cross-validation error (xerror). The chosen value of CP should be the smallest one, alongside the minimum cross-validated error rate.

Tree Pruning

There are two common types of tree pruning, pre-pruning and post-pruning.

Pre-pruning: Also called an early stopping rule, it terminates the algorithm before the tree is complete. The stopping rule might stop, for instance, if all instances belong to the same class, or if all the attribute values are the same.

Post-pruning: This is the common pruning method in decision tree models. The main idea is to remove subtrees from the completed tree. When a subtree is removed, all of its branches are removed with it and replaced by a single leaf.
Figure 2.2 shows an example of tree pruning.

Figure 2.2: Tree pruning

It can be seen from Figure 2.2 that the subtree at A3 has been removed with all of its branches and replaced by a leaf labeled with the majority class of that subtree.

2.2.4 Regression Tree Deviance

As in ordinary least squares regression, where the objective is to minimize the mean squared error (MSE), the objective in regression trees is to minimize the sum of squares (the deviance):

Deviance = ∑i=1..n (yi − f(xi))2 (2.2)

2.2.5 Evaluating Classifier Performance

There are measures that compute the accuracy and error rate of a classifier. A good classifier must have a minimum misclassification rate (error rate) and maximum accuracy. They are given by

Accuracy = (TP + TN) / (P + N) (2.3)

Misclassification = (FP + FN) / (P + N) = 1 − Accuracy (2.4)

where:

• P is the total number of positive tuples.
• N is the total number of negative tuples.
• TP and TN are the positive and negative tuples, respectively, that were correctly labeled. Conversely, FP and FN are the positive and negative tuples, respectively, that were incorrectly labeled (Jiawei et al. 2012).

Accuracy and misclassification are easy to compute from a table called the confusion matrix, which cross-tabulates the number of tuples in each actual and predicted class.

Paired Samples Wilcoxon Signed Rank Test

This is a non-parametric statistical hypothesis test used when comparing two related samples, matched samples, or repeated measurements on a single sample, to assess whether their population mean ranks differ.

State the Hypotheses

Null hypothesis: Both samples have the same median.
Alternative hypothesis: There is a significant difference between the two samples.

Assumptions

1- Data are paired and come from the same population.
2- Each pair is chosen randomly and independently.
3- The data are measured on at least an ordinal scale.

Test statistic

If there are no ties and the number of observations is ≤ 50, the test statistic is equal to the sum of the positive ranks.
If there are ties or the number of observations is > 50, the test statistic is instead based on the normal approximation

z = (ws − n(n + 1)/4) / sqrt(n(n + 1)(2n + 1)/24)

where ws is the sum of the positive ranks and n is the number of non-zero paired differences; when ties are present, the variance in the denominator is adjusted with a tie correction.
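To make the small-sample statistic concrete, here is a hand-rolled Python sketch of the sum of positive ranks (W+), using average ranks for tied absolute differences; the before/after scores are hypothetical:

```python
def wilcoxon_w_plus(before, after):
    # Signed differences, dropping zero differences as the test requires.
    diffs = [b - a for b, a in zip(before, after) if b != a]
    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Test statistic: sum of the ranks attached to positive differences.
    return sum(r for d, r in zip(diffs, ranks) if d > 0)

# Hypothetical paired scores before and after a treatment.
before = [12, 15, 11, 18, 14, 16, 13, 17]
after  = [10, 14, 12, 15, 11, 15, 12, 14]
print(wilcoxon_w_plus(before, after))  # 33.5
```

In practice a library routine such as scipy.stats.wilcoxon computes the same statistic together with a p-value.
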

The chi-square test is applied when there are two categorical variables from the same population and we want to determine whether there is a significant association between the two variables.

Test Hypotheses

Null hypothesis: There is no relationship between the variables.
Alternative hypothesis: There is a relationship between the variables.

Assumptions

1- The data must be randomly obtained.

2- The expected frequency for each cell must be ≥ 5.

Degrees of freedom (DF) is equal to (r − 1) × (c − 1)

Where r is the number of rows for one categorical variable, and c is the number of columns for the other categorical variable.

The expected frequencies are computed separately for each cell, so that the total number of expected frequencies is r × c, according to the formula:

E = (nr × nc)/n

Where nr is the total number of sample observations at the row level, and nc is the total number of sample observations at the column level.

Test statistic

χ2 = ∑i=1..r ∑j=1..c (Oij − Eij)2 / Eij

Where Oij is the observed frequency for the ith row and jth column, and Eij is the expected frequency for the ith row and jth column (Bluman, 2014).
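To make the expected-frequency and test-statistic formulas concrete, here is a short Python sketch on an invented 2×2 contingency table (the counts are hypothetical, chosen only to exercise the formulas above):

```python
# Hypothetical 2x2 table of observed frequencies (rows x columns).
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]        # nr for each row
col_totals = [sum(col) for col in zip(*observed)]  # nc for each column
n = sum(row_totals)                                # total sample size

# Expected frequency for each cell: E = (nr * nc) / n
expected = [[row_totals[i] * col_totals[j] / n for j in range(len(col_totals))]
            for i in range(len(row_totals))]

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi2_stat = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
                for i in range(len(row_totals)) for j in range(len(col_totals)))
df = (len(row_totals) - 1) * (len(col_totals) - 1)
print(round(chi2_stat, 3), df)  # 16.667 1
```
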

Rejection Region

Briefly, in the chi-square test the null hypothesis should be rejected if the test statistic χ2 is greater than the critical value χ2(DF, 1 − α) from the chi-square distribution table.
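The rejection rule amounts to a table lookup followed by a comparison. A minimal sketch, with a few α = 0.05 critical values copied from a standard chi-square table and a hypothetical test-statistic value:

```python
# Tabulated upper-tail critical values chi2(DF, 1 - alpha) for alpha = 0.05.
CRITICAL_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def reject_h0(test_statistic, df, critical_table=CRITICAL_95):
    # Reject the null hypothesis when the statistic exceeds the critical value.
    return test_statistic > critical_table[df]

print(reject_h0(16.667, df=1))  # True: 16.667 > 3.841, so H0 is rejected
print(reject_h0(2.500, df=1))   # False: 2.500 <= 3.841
```

For other α levels or larger DF, the same rule applies with the corresponding table entry.
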

About this essay:

Essay Sauce, Analyzing Hijama Data with Decision Tree and Regression Tree : Analyzing Hijama Data Using Decision Tree and Regression Tree Techniques. Available from:<https://www.essaysauce.com/sample-essays/2017-5-7-1494160139/> [Accessed 02-05-26].


* This essay may have been previously published on EssaySauce.com and/or Essay.uk.com at an earlier date than indicated.