Introduction
In this review, I outline the methods that will be used throughout the project and the steps that will be taken to reach a final solution to the given problem.
Machine Learning
– What is it?
Machine learning today is not what it was in the past: in light of new technologies, its processes have changed with the times, and it can be seen as an advantage to the modern computing world. Machine learning is a form of artificial intelligence (AI) that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed. In simple terms, machine learning focuses on the development of computer programmes that can access data and learn from it. The field grew out of pattern recognition and the theory that computers can learn to perform specific tasks without being programmed to do so; researchers in this field wanted to find out whether computers are able to learn from data. This aspect of machine learning is seen as important because, as models are exposed to new data, they can independently adapt to the new information. They do this by learning from previous computations to produce reliable, repeatable decisions and results. The idea itself is not new; it is one that is gaining new insight.
Classification
When looking at what classification is, we can ask: what is meant by classifying data? Why classify data? And how many classification methods are there? With these questions in mind, classification is a data-mining task that predicts the value of a categorical variable (the target, or class). This is done by building a model based on one or more numerical and/or categorical predictor variables.
In classification there are two types of problem: binary and multi-class. A binary problem is a two-class problem with only two possible outcomes, for example yes or no, or success or failure. A multi-class problem, in comparison, has more than two possible outcomes; for example, predicting revenue as either 'Low', 'Medium' or 'High'.
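This distinction can be sketched directly in R, the language used for analysis later in this project. The labels below are made up purely for illustration; a target stored as a factor reveals from its number of levels whether the problem is binary or multi-class:

```r
# A binary target: only two possible outcomes
outcome <- factor(c("yes", "no", "yes", "yes", "no"))
nlevels(outcome)            # 2 -> a binary classification problem

# A multi-class target: more than two possible outcomes
revenue <- factor(c("Low", "High", "Medium", "Low", "High"),
                  levels = c("Low", "Medium", "High"))
nlevels(revenue)            # 3 -> a multi-class classification problem
table(revenue)              # counts per class
```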
The classification of brain tumour types plays an important role in medical image diagnosis. The need to identify brain tumour types has grown over the years, alongside the growth of analytical fields such as artificial intelligence and the increase in the analytical methods available. The early stage that preceded medical image analytics involved signal and, in particular, image processing methods, which led to the development of strong and efficient algorithms. In practical applications, however, many of these algorithms fail because of how the analysis is applied rather than because of the characteristics of the algorithm itself.
Supervised Classification
Assessing the model
Testing and training the data
To assess the model, we need to carry out a few steps and tests to ensure we reach a suitable solution.
Firstly, we acquire the dataset, train the model, and then test the model for accuracy. To do this, we divide the dataset into two parts: the training set and the testing set. The training set is used to train the model, while the testing set is used to evaluate the model we trained. The most commonly used split is to take 2/3 of the original dataset as the training set and the remaining 1/3 as the testing set.
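A 2/3 to 1/3 split of this kind can be sketched in a few lines of base R. The project's brain-tumour dataset is not shown here, so the built-in iris data stands in for it:

```r
# Sketch of a 2/3 / 1/3 train-test split in base R.
# 'iris' stands in for the project's brain-tumour dataset.
set.seed(42)                                     # make the split reproducible
n         <- nrow(iris)
train_idx <- sample(n, size = round(2 * n / 3))  # 2/3 of the row indices
train_set <- iris[train_idx, ]                   # used to fit the model
test_set  <- iris[-train_idx, ]                  # held out for evaluation

nrow(train_set)   # 100 of the 150 rows
nrow(test_set)    # the remaining 50
```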
When training and testing the model, we need to ensure that all of the class labels, in this case the brain tumour types held in the 'cls' column of the dataset, are present in the training set, and that the three classes appear in roughly equal numbers, so that the results are not biased when predicting the class of new data.
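One way to guarantee this balance is a stratified split, sampling 2/3 of the rows within each class separately. A minimal sketch in base R, again using iris$Species in place of the project's 'cls' column:

```r
# Stratified 2/3 split: sample within each class so that all classes
# are present, and balanced, in the training set.
# iris$Species stands in for the project's 'cls' column.
set.seed(1)

train_idx <- unlist(lapply(
  split(seq_len(nrow(iris)), iris$Species),              # row indices per class
  function(rows) sample(rows, round(2 * length(rows) / 3))
))
train_set <- iris[train_idx, ]

table(train_set$Species)   # each of the three classes contributes 33 rows
```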
Cross Validation
Cross-validation is a simple method for estimating the prediction accuracy of a model.
One way to evaluate a model is to see how well it predicts the data used to fit it, but this estimate tends to be over-optimistic; cross-validation instead repeatedly holds out part of the data and tests the model on the held-out portion.
In my project, I used 10-fold cross-validation, which means the dataset is separated into 10 parts; each part is held out in turn as a test set while the model is fitted on the remaining nine, and the 10 results are averaged to give a figure for the accuracy of the model on the dataset.
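The procedure can be sketched in base R. This is an illustration rather than the project's actual analysis: MASS::lda serves as the classifier and iris stands in for the brain-tumour dataset.

```r
# A sketch of 10-fold cross-validation in base R, using MASS::lda as the
# classifier and iris in place of the project's dataset.
library(MASS)

set.seed(7)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))  # assign each row a fold

fold_acc <- sapply(1:k, function(i) {
  test  <- iris[folds == i, ]        # fold i is held out for testing
  train <- iris[folds != i, ]        # the remaining 9 folds fit the model
  fit   <- lda(Species ~ ., data = train)
  pred  <- predict(fit, test)$class
  mean(pred == test$Species)         # accuracy on the held-out fold
})

mean(fold_acc)   # the cross-validated accuracy estimate
```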
I also used the bootstrapping method to estimate the accuracy of the data, which is similar in spirit to cross-validation. Bootstrapping is a very powerful statistical tool: it can be used to quantify the uncertainty associated with a given learning method, for example to estimate the standard errors of the coefficients from a linear regression fit.
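That regression example can be sketched in base R by resampling the rows with replacement and refitting the model each time. The built-in cars data is used here purely as a stand-in:

```r
# Sketch: bootstrapping the standard error of a linear-regression slope
# in base R ('cars' is a stand-in dataset for illustration).
set.seed(123)
B <- 1000

boot_slopes <- replicate(B, {
  rows <- sample(nrow(cars), replace = TRUE)   # resample rows with replacement
  coef(lm(dist ~ speed, data = cars[rows, ]))["speed"]
})

sd(boot_slopes)   # bootstrap estimate of the slope's standard error

# For comparison, the usual formula-based standard error:
summary(lm(dist ~ speed, data = cars))$coefficients["speed", "Std. Error"]
```

The two estimates should be of a similar magnitude; the bootstrap has the advantage of not relying on the model's distributional assumptions.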
ROC
The receiver operating characteristic curve, otherwise known as the ROC curve, allows us to compare the tests we have carried out previously on the given dataset. The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR).
The ROC curve shows us the relationship between sensitivity and specificity: as the decision threshold changes, an increase in sensitivity typically comes at the cost of a decrease in specificity. The ROC curve also shows us how accurate the test we carried out was; the closer the curve is to the top-left corner of the plot, the more accurate the test.
The accuracy is also summarised as the area under the curve (AUC): the greater the area under the curve, the more accurate the test results are.
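Both quantities can be computed by hand in base R by sweeping the decision threshold downwards through the predicted scores. The labels and scores below are made up for illustration:

```r
# A minimal sketch of computing TPR/FPR points and the AUC in base R,
# using made-up binary labels and classifier scores.
labels <- c(1, 1, 0, 1, 0, 0, 1, 0, 1, 0)   # 1 = positive class
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)

ord <- order(scores, decreasing = TRUE)      # sweep the threshold downwards
tpr <- cumsum(labels[ord] == 1) / sum(labels == 1)   # sensitivity
fpr <- cumsum(labels[ord] == 0) / sum(labels == 0)   # 1 - specificity

# AUC by the trapezoidal rule over the (FPR, TPR) points
auc <- sum(diff(c(0, fpr)) * (tpr + c(0, head(tpr, -1))) / 2)
auc   # 0.68 for these labels and scores
```

Plotting fpr against tpr (e.g. `plot(fpr, tpr, type = "l")`) draws the ROC curve itself.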
Feature selection
Linear Discriminant Analysis (LDA)
When assessing the model and trying to find a solution to the problem, I will be applying Linear Discriminant Analysis (LDA). LDA is a supervised feature-extraction technique used to find a linear combination of the available features that separates the classes.
The main aim when applying LDA is to reduce the dimensionality of the data, in order to reduce the computational cost of classification.
After applying the Linear Discriminant Analysis algorithm to the given dataset, the "new features" minimise the scatter between samples of the same class and maximise the scatter between samples of different classes.
An example of Linear Discriminant Analysis would be a dataset of people who practise different sports, for example football, swimming, badminton and cricket, where the available features are height, weight and muscular index.
When plotted in the original feature space, these instances might have regions of intersection that include two or more classes. When LDA is applied to the dataset, it generates a new dataset which may have better class separation.
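The same idea can be sketched in R with MASS::lda. The built-in iris data is used here as a stand-in: it has three classes and four numeric features, which LDA reduces to two discriminant directions.

```r
# A sketch of LDA as feature extraction, using MASS::lda on the built-in
# iris data (three classes, four numeric features).
library(MASS)

fit  <- lda(Species ~ ., data = iris)   # find the discriminant directions
proj <- predict(fit)$x                  # project onto the new features LD1, LD2

dim(proj)    # 150 x 2: four original features reduced to two discriminants
head(proj)
```

Plotting the projected data, e.g. `plot(proj, col = iris$Species)`, shows the three classes with much better separation than any pair of the raw features.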
R
– Why use R?
In a nutshell, R is a language and environment used for statistical computing and graphics. R is an open-source solution, which means R's source code is made available under a licence allowing users to review, change and develop the code. There are other popular graphical and statistical software packages available that can do the same job as R (Microsoft Excel, for example), so why choose R as the method to analyse the given dataset?
Firstly, R is a comprehensive statistical platform, which means that just about any data-analysis technique can be carried out in R. New methods are also readily available for download, so a wider range of statistical techniques becomes available regularly. R can easily import data from a wide range of sources, including text files and database-management systems, and can also access data from web pages and a wide range of online data services.
When looking at how R functions, it can also be integrated into applications written in other languages such as C++ and Java.
When analysing the dataset in R, I will be using RStudio, an integrated development environment for R. This will allow me to use the relevant methods to test my dataset and arrive at a suitable solution to my problem.
Brain tumour
A specialist in the relevant area, for example a neurologist or a neurosurgeon (a doctor who specialises in the surgical treatment of nervous-system diseases), usually diagnoses brain tumours. These specialists carry out numerous tests and examinations in order to make a detailed diagnosis.
When trying to identify a brain tumour, imaging tests are usually requested, such as magnetic resonance imaging (MRI), a computed tomography (CT) scan, or magnetic resonance spectroscopy. These tests use computer technology to produce detailed images of the brain.
In previous years, MRI has played an important role in detecting irregularities in the brain and locating where they are, which enables the specialist to establish the type of tumour the patient may have.
Compared with other imaging techniques, MRI is efficient at detecting and identifying brain tumour types.
When diagnosing brain tumours from medical images, there are limitations to these methods. The first limitation observed in medical image diagnostics is manual interpretation: diagnosing a patient from a scan usually involves manual interpretation, which increases the cost and is time-consuming. This is why an automated classification method for detecting the type of brain tumour is seen as necessary.
Magnetic Resonance Spectroscopy