AI Classifiers

INTRODUCTION

My task is to produce an output decision based on a dataset that I was given. I am going to use machine learning to experiment with the data using various classifiers and then report on the best classifier approach. This report compares classifiers and offers an insight into why I chose a specific classifier to make an output decision from the dataset.

A classifier is a set of instructions that takes as input data or information about an entity (a picture, a house, a car, a person, an animal, etc.) and outputs a prediction about that entity (a quality, the answer to a binary question, the probability of a value, etc.).

Examples include:
– inputting a picture (an ensemble of RGB values arranged in a matrix) and outputting the probability that there is a dog in the picture;
– inputting the details of a house and outputting the most probable price the house will be sold for.

Each classifier uses a learning algorithm to identify the model that best fits the relationship between the attribute set and the class label of the data.

The decision tree classifier is a simple and widely used classification technique. It applies a straightforward idea to the classification problem: it poses a series of carefully crafted questions about the attributes of the test record. Each time it gets an answer, a follow-up question is asked, until a conclusion about the class label of the record is reached.

So when does it terminate?
1. Either the data has been divided into classes that are pure (containing members of a single class only), or
2. some stopping criterion on the classifier's attributes is met.
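As a minimal illustration of this question-and-answer idea, the sketch below hard-codes a tiny tree by hand; the attribute names, the threshold and the class labels are hypothetical and are not taken from the tree trained later in this report.

```python
# Illustrative sketch only: a hand-written stand-in for a fitted decision tree.
# The attributes ("shape", "age"), the threshold 57 and the labels are hypothetical.
def classify(record):
    """Classify a record by asking a series of questions about its attributes."""
    if record["shape"] == 4:        # first question: is the shape irregular?
        if record["age"] > 57:      # follow-up question about the patient's age
            return "malignant"
        return "benign"
    return "benign"                 # this branch is treated as pure, so stop asking

print(classify({"shape": 4, "age": 63}))  # -> malignant
```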
DECISION TREE PARAMETERS
criterion: string, optional. The default value is "gini". It is the function used to measure the quality of a split; a good splitting choice is the one that gives the best information gain.
splitter: string, optional. The default value is "best". It is the strategy used to choose the split at each node.
max_depth: integer or None, optional. The default value is None. It is the maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_split: integer or float, optional. The default value is 2. Ideally, a decision tree stops splitting a working set either when it runs out of features or when the working set ends up in a single class; training can be made quicker, at the cost of some error, by imposing a minimum split criterion. With this parameter, the classifier stops splitting a node once the number of items in the working set drops below the specified value.
min_samples_leaf: integer or float, optional. The default value is 1. It is the minimum number of samples required to be at a leaf node: if an integer, min_samples_leaf is that minimum number; if a float, min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node. (Float values for fractions were added in a recent release.)
min_weight_fraction_leaf: float, optional. The default value is 0. It is the minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features: integer, float, string or None, optional. The default value is None. If None, then max_features = n_features. The search for a split does not stop until at least one valid partition of the node samples is found.
random_state: integer, RandomState instance or None, optional. The default value is None.
max_leaf_nodes: integer or None, optional. The default value is None.
min_impurity_decrease: float, optional. The default value is 0. A node will be split if the split induces a decrease of the impurity greater than or equal to this value.
min_impurity_split: float. The threshold for early stopping in tree growth: a node will split if its impurity is above the threshold, otherwise it is a leaf.
class_weight: dict, list of dicts, "balanced" or None. The default value is None.
presort: Boolean, optional. The default value is False. It controls whether to presort the data to speed up the finding of the best splits during fitting. For the default settings of a decision tree on large datasets, setting this to True may slow down the training process; with a smaller dataset or a restricted depth, it may speed up training.
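These parameters correspond to scikit-learn's DecisionTreeClassifier. The sketch below simply writes the defaults described above out explicitly; presort and min_impurity_split are left out because they have been deprecated or removed in more recent scikit-learn releases, so the exact argument list depends on the installed version.

```python
from sklearn.tree import DecisionTreeClassifier

# Sketch: the default parameter values described above, spelled out explicitly.
# (presort and min_impurity_split are omitted; they are deprecated/removed in
# newer scikit-learn versions.)
clf = DecisionTreeClassifier(
    criterion="gini",              # function measuring split quality
    splitter="best",               # strategy used to split each node
    max_depth=None,                # grow until leaves are pure or too small
    min_samples_split=2,           # minimum samples needed to split a node
    min_samples_leaf=1,            # minimum samples required at a leaf
    min_weight_fraction_leaf=0.0,  # minimum weighted fraction at a leaf
    max_features=None,             # consider all features at each split
    random_state=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    class_weight=None,
)
```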

Artificial Neural Networks
Artificial neural networks are relatively crude electronic networks of neurons, modelled on the neural structure of the brain. They process records one at a time, and learn by comparing their classification of the record (which is largely arbitrary at first) with the known real classification of the record. The errors from the initial classification of the first record are fed back into the network and used to modify the network's weights for further iterations.

Neurons are organised into layers: input, hidden and output. The input layer is not made up of full neurons; it contains the record's values, which are input to the next layer of neurons. A single neural network can contain several hidden layers.
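As a minimal sketch of that error-feedback idea for a single neuron, the snippet below uses invented numbers and a simple delta-rule update; real networks repeat this across many neurons and layers.

```python
import numpy as np

# Minimal sketch of learning by error feedback for one neuron; the numbers are
# invented purely for illustration.
rng = np.random.default_rng(0)
w = rng.normal(size=3)          # initial weights are largely arbitrary
x = np.array([0.5, 1.0, -0.3])  # one record's input values
target = 1.0                    # the known, real classification

for _ in range(10):
    prediction = 1.0 / (1.0 + np.exp(-w @ x))  # sigmoid output of the neuron
    error = target - prediction                # compare with the known label
    w += 0.3 * error * x                       # feed the error back into the weights
```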

ARTIFICIAL NEURAL NETWORKS PARAMETERS

LEARNING RATE
The learning rate parameter represents the speed at which the value of each weight is updated; it can be set between 0 and 1.

NUMBER OF EPOCHS
The neural network needs to iterate over the training data several times to achieve higher accuracy. An epoch is one forward pass and one backward pass of all training instances through the network; the number of epochs determines how many such iterations are made.

MOMENTUM
Similar to the learning rate, the momentum parameter is set to help ensure that minimisation of the cost function does not get stuck in a local minimum; it should be between 0 and 1.

NUMBER OF LAYERS AND NEURONS
These parameters define the number of hidden layers and the number of neurons in each layer (the ith element represents the number of neurons in the ith hidden layer). As the input data passes through the neurons, calculations are performed to arrive at an outcome.
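The experiments later in this report were run with Weka's MultilayerPerceptron; purely as an assumed illustration, the same parameters map roughly onto scikit-learn's MLPClassifier as follows.

```python
from sklearn.neural_network import MLPClassifier

# Rough scikit-learn analogue of the settings used later in this report;
# this mapping is an assumption for illustration, not the tool actually used.
mlp = MLPClassifier(
    hidden_layer_sizes=(10,),  # one hidden layer with 10 neurons
    solver="sgd",              # plain gradient descent, so momentum applies
    learning_rate_init=0.3,    # learning rate between 0 and 1
    momentum=0.4,              # momentum between 0 and 1
    max_iter=1000,             # upper bound on training epochs
)
```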

DATA SET
The data set consists of six attributes, all numerical: BI-RADS, Age, Shape, Margin, Density and Severity. It is a dataset for the Breast Imaging Reporting and Data System (mammographic masses).

BI-RADS has a label scale of 0-6. The BI-RADS assessment categories are:
♣ 0-Incomplete
♣ 1-Negative
♣ 2-Benign findings
♣ 3-Probably benign
♣ 4-Suspicious abnormality
♣ 5-Highly suspicious of malignancy
♣ 6-Known biopsy with proven malignancy
BI-RADS is ordinal. With an ordinal attribute, the order of the values is important and significant, but the differences between them are not well defined. Looking at the categories above, we know that a 1 is better than a 2 or a 3, but we do not know and cannot quantify how much better it is.

The median of BI-RADS is 4, which means a large proportion of the dataset falls in category 4 (suspicious abnormality). The number of missing values is 2.

Age is an attribute that records the age of each patient. The minimum age in this dataset is 18 and the maximum is 96. Age is an integer attribute and its mean value is 55. The number of missing values is 5.

Shape is an attribute with nominal type that refers to the shape of the tumor. A nominal attribute uses numeric values purely as unique labels, so we use the mode to summarise it statistically. The shape labels are split into 4 categories:
♣ 1-Round
♣ 2-Oval
♣ 3-Lobular
♣ 4-Irregular
The mode is 4. The number of missing values is 31.

Margin is an attribute with nominal type that refers to the margin of the tumor. A nominal attribute uses numeric values purely as unique labels, so we use the mode to summarise it statistically. The margin labels are split into 5 categories:
♣ 1-Circumscribed
♣ 2-Microlobulated
♣ 3-Obscured
♣ 4-Ill-defined
♣ 5-Spiculated
The mode is 1. The number of missing values is 48.

Density is an attribute with ordinal type and it refers to how dense the tumor is. With an ordinal attribute, the order of the values is important and significant, but the differences between them are not well defined. We use the median because the mean is not meaningful when the data is not evenly distributed.
The density labels are split into 4 categories:
♣ 1-High
♣ 2-Iso
♣ 3-Low
♣ 4-Fat-containing
The median is 4. The number of missing values is 76.

Severity is an attribute with binary (binomial) type and it refers to the classification of the tumor.
The severity labels are split into 2 categories:
♣ 0-Benign
♣ 1-Malignant
The mode is 0.
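As a sketch of how these summary statistics and missing-value counts could be computed, the snippet below assumes the data has been loaded into a pandas DataFrame with these column names and with "?" parsed as missing; the file name is hypothetical.

```python
import pandas as pd

# Sketch: summary statistics per attribute. The file name and column names are
# assumptions; missing entries ("?") are parsed as NaN.
df = pd.read_csv("mammographic_masses.data",
                 names=["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"],
                 na_values="?")

print(df.isna().sum())           # number of missing values per attribute
print(df["BI-RADS"].median())    # ordinal attributes: use the median
print(df["Age"].mean())          # numeric attribute: use the mean
print(df["Shape"].mode()[0])     # nominal attributes: use the mode
```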

Classifiers for the dataset
In this section, a brief list of the advantages and disadvantages of each of the two algorithms used in this study is provided. Taking these into account, a prediction of which algorithm should perform better under the current scenario is given at the end of the section.

Decision trees

Pros:
• Easy to interpret visually when the trees only contain several levels.
• Can easily handle qualitative (categorical) features.
• Works well with decision boundaries parallel to the feature axis.

Cons:
• Prone to overfitting.
• Possible issues with diagonal decision boundaries.

Neural networks

Pros:
• Can approximate complex situations.
• Can achieve really high levels of accuracy.

Cons:
• Training process requires time.
• Requires large amounts of data.
• Difficult to interpret (black-box model).

According to the pros and cons of each algorithm, there is not enough information to exploit the full potential of a neural network. The decision tree should get better results because the data provided has few attributes.

Initial observations and experiments with preprocessing
The first observation is the value 55 in the BI-RADS attribute, as can be seen in the image below. The assessment has a label scale of 0-6, so, looking at the graph, this value is most likely a mistake in the dataset; as it is only one entry, almost no data is lost by excluding it.

The second observation is that there is no entry with the value 1. It is possible that some inputs lie outside the valid range, and it is impossible to know whether such values are errors.

I will experiment by deleting all 16 entries containing the values 0 or 6 and compare the performance of the modified dataset with that of the unmodified one.

Metric | Model 1 (after pre-processing) | Model 2 (before pre-processing)
Accuracy (%) | 83.37 | 81.06
False positive rate (%) | 15.0 | 16.0
True positive rate (%) | 81.6 | 79.0
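A sketch of this filtering step with pandas, reusing the hypothetical DataFrame df from the data-set section, might look as follows; the exact rows removed in the actual experiment may differ.

```python
# Sketch: remove the out-of-scale BI-RADS entry (value 55) and, for the
# pre-processing experiment, the entries with BI-RADS values of 0 or 6.
in_scale = df[df["BI-RADS"].between(0, 6)]
experiment = in_scale[~in_scale["BI-RADS"].isin([0, 6])]
print(len(df) - len(experiment), "entries removed in total")
```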

Strategy for missing attribute values
Across 131 different entries there are 161 missing values among 5 attributes, which is 13.64% of the dataset. My candidate strategies were either to replace all the missing values, which occur in the dataset as "?", with "0", or to delete the 131 affected rows from the dataset, letting Weka handle the rest.
I used the true positive rate and false negative rate to compare them, and after analysing the results the best option was to replace all the missing attribute values, which gave a consistent result. I also tried deleting all the entries with missing values, but that was not the best choice, as they represent 13.64% of the dataset.
Model | True positive rate (%) | False negative rate (%)
1 | 82.35 | 17.45
2 | 80.24 | 19.76
3 | 81.86 | 18.14
4 | 79.25 | 20.75
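Reusing the same hypothetical DataFrame, the two candidate strategies can be sketched in pandas as follows (the actual handling was done in Weka).

```python
# Sketch of the chosen strategy: replace missing values (read in as NaN from "?")
# with 0 rather than dropping the affected rows.
filled = df.fillna(0)

# The alternative that was tried and rejected: drop every row with a missing value.
dropped = df.dropna()
```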

Main experimental series
For all experiments, a 10-fold cross-validation strategy was used to validate the results.
I experimented with both the decision tree and the artificial neural network.

Decision tree
In this section I trained 16 models for this series of experiments. The experiments were performed to identify the influence of three crucial elements of the model:
• The best configuration of parameters.
• The BI-RADS assessment attribute.
• Whether or not to use a pruned tree.
I included the MDLCorrection, BinarySplits, CollapseTree and SubtreeRaising (when pruned) parameters. For the first 8 models I started with all parameters set to false, then changed one parameter at a time to true and compared the results. If the result was better than the previous one, the parameter was left as true; otherwise it was returned to false.
The same procedure was applied when evaluating the pruning parameter and the influence of the BI-RADS assessment.
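The Weka J48 options listed above (MDL correction, binary splits, collapse tree, subtree raising) have no direct scikit-learn equivalents, so the sketch below illustrates only the 10-fold cross-validation and the false-positive-rate calculation, using the hypothetical pre-processed data from the earlier sketches.

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Sketch of the validation strategy only. X and y are the hypothetical feature
# matrix and Severity labels taken from the pre-processed DataFrame above.
X = filled.drop(columns=["Severity"])
y = filled["Severity"]

predictions = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10)
tn, fp, fn, tp = confusion_matrix(y, predictions).ravel()
print("false positive rate:", fp / (fp + tn))
```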
Model # | BI-RADS | Pruned | MDL correction | Binary splits | Collapse tree | Subtree raising
1 | FALSE | FALSE | FALSE | FALSE | FALSE |
2 | FALSE | FALSE | TRUE | FALSE | FALSE |
3 | FALSE | FALSE | TRUE | TRUE | FALSE |
4 | FALSE | FALSE | TRUE | TRUE | TRUE |
5 | FALSE | TRUE | FALSE | FALSE | FALSE |
6 | FALSE | TRUE | TRUE | FALSE | FALSE |
7 | FALSE | TRUE | TRUE | TRUE | FALSE |
8 | FALSE | TRUE | TRUE | TRUE | TRUE |
9 | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE
10 | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE
11 | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE
12 | TRUE | FALSE | TRUE | TRUE | FALSE | TRUE
13 | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE
14 | TRUE | TRUE | FALSE | FALSE | FALSE | FALSE
15 | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE
16 | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE
The main purpose of this experiment is to reduce the number of benign masses wrongly classified as malignant, because an invasive biopsy must be performed whenever an instance is classified as a malignant mass. The false positive rate reported below is the proportion of benign masses that were classified as malignant.

Model # | False positives | True positives | False positive rate
1 | 108 | 300 | 26.37%
2 | 116 | 343 | 25.17%
3 | 116 | 354 | 24.58%
4 | 124 | 360 | 25.52%
5 | 110 | 346 | 24.02%
6 | 106 | 347 | 23.30%
7 | 106 | 362 | 22.65%
8 | 103 | 362 | 22.15%
9 | 73 | 314 | 18.86%
10 | 77 | 347 | 18.16%
11 | 65 | 334 | 16.29%
12 | 65 | 334 | 16.29%
13 | 81 | 344 | 19.06%
14 | 69 | 346 | 16.63%
15 | 77 | 357 | 17.74%
16 | 62 | 337 | 15.54%

Looking at the table above, model 16 achieved the lowest false positive rate, 15.54%. Its configuration can be seen in the table below.

Metric | Model 1 (after pre-processing) | Model 2 (before pre-processing)
Accuracy (%) | 83.37 | 81.06
False positive rate (%) | 15.0 | 16.0
True positive rate (%) | 81.6 | 79.0

The MDLCorrection and SubtreeRaising parameters were deactivated, while the BinarySplits and CollapseTree parameters contributed to increasing the performance of the model.
It is crucial to compare the top model with the prediction to know whether it is worth using this model or not. It will not be useful to implement this model, as the predicted false positive rate of 11.59% is lower than that of the best decision tree model, which is 15.54%.

Artificial Neural Network
16 models were created for the multilayer perceptron algorithm. The influence of three different elements was tested:
• The BI-RADS assessment attribute.
• The size of the network. Three configurations were evaluated here:
♣ A one-layer neural network.
♣ A deep network (many layers, with few neurons in each).
♣ A big network (few layers, with many neurons per layer).
• The values of the most important parameters of the NN: learning rate and momentum.
Model # | BI-RADS | Number of layers | Number of neurons | Learning rate | Momentum | Number of epochs
1 | FALSE | 1 | 10 | 0.3 | 0.2 | 500
2 | FALSE | 1 | 10 | 0.3 | 0.2 | 1000
3 | FALSE | 1 | 10 | 0.1 | 0.2 | 1000
4 | FALSE | 1 | 10 | 0.1 | 0.2 | 2000
5 | FALSE | 1 | 10 | 0.3 | 0.4 | 1000
6 | FALSE | 1 | 10 | 0.3 | 0.7 | 1000
7 | FALSE | 1 | 10 | 0.3 | 0.4 | 5000
8 | TRUE | 1 | 10 | 0.3 | 0.2 | 500
9 | TRUE | 1 | 10 | 0.3 | 0.2 | 1000
10 | TRUE | 1 | 10 | 0.1 | 0.2 | 1000
11 | TRUE | 1 | 10 | 0.1 | 0.2 | 2000
12 | TRUE | 1 | 10 | 0.3 | 0.4 | 1000
13 | TRUE | 1 | 10 | 0.3 | 0.7 | 1000
14 | TRUE | 1 | 10 | 0.3 | 0.4 | 5000
15 | TRUE | 3 | 10 | 0.3 | 0.4 | 1000
16 | TRUE | 3 | 10 | 0.3 | 0.4 | 2000

It is crucial to note that the number of epochs can differ between models, but it is not considered a parameter, so it is not included as part of the experiments. If the momentum increases or the learning rate decreases, for example, it is necessary to increase the number of epochs, as the training process may be slower. If the results then improve, it is because of the change to the learning rate or the momentum rather than the change in the number of epochs.

A different method was used to evaluate the results here, because the parameters of the multilayer perceptron algorithm are numeric rather than Boolean.
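Because the parameters are numeric, one way to sketch the sweep is a small grid over learning rate and momentum evaluated with the same 10-fold procedure; this scikit-learn version is an assumed stand-in for the Weka runs actually performed, reusing the hypothetical X and y from the decision tree sketch.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Illustrative sweep over the numeric parameters; the real experiments were run
# in Weka, so this grid is an assumption. X and y come from the earlier sketch.
for lr in (0.1, 0.3):
    for momentum in (0.2, 0.4, 0.7):
        mlp = MLPClassifier(hidden_layer_sizes=(10,), solver="sgd",
                            learning_rate_init=lr, momentum=momentum,
                            max_iter=1000)
        accuracy = cross_val_score(mlp, X, y, cv=10).mean()
        print(f"lr={lr} momentum={momentum} accuracy={accuracy:.3f}")
```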

Model # | False positives | True positives | False positive rate
1 | 110 | 354 | 23.71%
2 | 114 | 353 | 24.41%
3 | 107 | 353 | 23.26%
4 | 115 | 353 | 24.57%
5 | 110 | 360 | 23.40%
6 | 123 | 358 | 25.57%
7 | 117 | 363 | 24.38%
8 | 66 | 344 | 16.10%
9 | 69 | 340 | 16.87%
10 | 79 | 343 | 18.72%
11 | 79 | 355 | 18.20%
12 | 59 | 335 | 14.97%
13 | 71 | 337 | 17.40%
14 | 67 | 333 | 16.75%
15 | 82 | 347 | 19.11%
16 | 84 | 342 | 19.72%
17 | 92 | 343 | 21.15%
18 | 102 | 82 | 55.43%

When the learning rate parameter was decreased there was no improvement in the outcome of the neural network, even when the number of epochs was increased as well, so the learning rate was left at 0.3. The best results were obtained when the momentum was increased to 0.4, which suggests there were local minima that could not be overcome with the momentum at 0.2.

From the above, it can be concluded that both algorithms reach almost the same level of accuracy: the best neural network configuration in this experiment reached a false positive rate of 14.97%, a difference of less than one percentage point compared with the best decision tree model.

Advanced preprocessing

I implemented an advanced preprocessing approach that converts the dataset into a new one with a different representation. When I applied the principal component analysis algorithm to the dataset, all the attributes were redefined as numeric with a mean of 0, and the dataset went from 5 attributes to 11 attributes.

When evaluating the performance of this approach with both the decision tree and the neural network, I obtained a false positive rate of 18.53% for the decision tree and 18.66% for the artificial neural network. This means the performance did not improve compared with using just the 5 original attributes.
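A sketch of this step with scikit-learn's PCA, as an assumed stand-in for the Weka principal-components filter actually used, is shown below. Note that the 5-to-11 attribute increase reported above comes from Weka expanding nominal attributes into binary indicators first, which this sketch does not reproduce.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Centre and scale the attributes, then project them onto principal components
# so that each transformed attribute has mean 0. X is the hypothetical feature
# matrix from the earlier sketches.
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_pca = pca.fit_transform(X)
print(X_pca.shape)   # the number of retained components depends on the data
```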

Conclusion
• The best decision tree had a false positive rate of 15.54%.

Decision tree diagram

• 14.97% was the lowest false positive rate achieved by any neural network model. This configuration achieved the rate with just one hidden layer (of 10 neurons), a learning rate of 0.3 and a momentum of 0.4.

Artificial neural network

• I recommend the neural network over the decision tree, as it achieved a 14.97% false positive rate compared with the decision tree's 15.54%. The difference is not large, but the neural network had the better result.

