Exploring Leukemia Classification with Data Mining: Challenges and Research Problem

Chapter 1

Introduction

This chapter gives the reader a close view of the research area. Section 1.1 provides a general overview of the thesis, including background on data mining and leukemia. Section 1.2 presents the challenges of cancer classification research, Section 1.3 explains the research problem, Section 1.4 states the research objectives, and Section 1.5 describes the organization of the thesis.

1.1 Overview

Data mining plays an important role in predicting diseases. Recent advances in microarray technology make it possible to measure the expression levels of thousands of genes simultaneously. Analysis of such data helps to identify different clinical outcomes that are caused by the expression of a few predictive genes. In this work, feature extraction and classification are carried out by combining the high accuracy of ensemble-based algorithms with the comprehensibility of a single decision tree. This combination allows exact rules to be derived that describe the expression differences among significantly expressed genes in leukemia. Our results indicate that it is possible to achieve better accuracy in classifying leukemia without sacrificing comprehensibility. Some of the most important and popular data mining techniques are association rules, classification, clustering, prediction and sequential patterns [1].

Data mining techniques can be divided into unsupervised and supervised learning techniques. An unsupervised learning technique is not guided by a target variable and does not form a hypothesis before the analysis; instead, a model is built from the results. A common unsupervised technique is clustering [2]. A supervised learning technique requires a model to be built before the analysis is performed; the model is learned from cases whose target variable is already known. Supervised learning techniques used in both medical and clinical research include classification, statistical regression and association rules [3].
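The distinction between the two families can be illustrated with a minimal sketch. It assumes a single made-up expression value per sample and two classes; the function names, the toy values and the choice of a two-means clustering and a nearest-centroid classifier are illustrative assumptions, not methods taken from the thesis.

```python
from statistics import mean

def two_means_1d(values, iterations=10):
    """Unsupervised: split values into two clusters without looking at any labels."""
    low, high = min(values), max(values)  # start from the two extreme values
    if low == high:
        return [0] * len(values)  # nothing to separate
    for _ in range(iterations):
        near_high = [abs(v - high) < abs(v - low) for v in values]
        low = mean(v for v, h in zip(values, near_high) if not h)
        high = mean(v for v, h in zip(values, near_high) if h)
    return [int(abs(v - high) < abs(v - low)) for v in values]

def fit_centroids(values, labels):
    """Supervised: learn one centroid per class from samples with known labels."""
    return {c: mean(v for v, l in zip(values, labels) if l == c) for c in set(labels)}

def nearest_centroid(centroids, value):
    """Assign a new sample to the class whose centroid is closest."""
    return min(centroids, key=lambda c: abs(value - centroids[c]))

# Hypothetical expression values of one gene in six samples.
expression = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

clusters = two_means_1d(expression)            # uses no labels
centroids = fit_centroids(expression, labels)  # uses the known labels
print(clusters, nearest_centroid(centroids, 4.5))
```

The clustering step only reports group membership (0 or 1), while the supervised model returns a named class because it was trained on labelled cases.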

Leukemia is a group of cancers that usually begin in the bone marrow and result in high numbers of abnormal white blood cells. These white blood cells are not fully developed and are called blasts or leukemia cells. Symptoms may include bleeding and bruising problems, feeling very tired, and an increased risk of infections. These symptoms occur due to a lack of normal blood cells. Diagnosis is typically made by blood tests or bone marrow biopsy. Acute Myelogenous Leukemia (AML), Acute Lymphoblastic Leukemia (ALL), Chronic Myeloid Leukemia (CML) and Chronic Lymphocytic Leukemia (CLL) are the four main types of leukemia [4]. In general, leukemia is grouped by how fast it gets worse and by the kind of white blood cells it affects [5]. Leukemia may be acute or chronic. Acute leukemia gets worse very fast and may make the patient feel sick right away. Chronic leukemia gets worse slowly and may not cause symptoms for years. Leukemia may also be lymphocytic or myeloid. Lymphocytic (or lymphoblastic) leukemia affects white blood cells called lymphocytes. Myeloid leukemia affects the cells that normally develop into granulocytes, red blood cells, or platelets.

Microarray technology enables researchers to investigate and address questions that were once thought to be untraceable by facilitating the simultaneous measurement of the expression levels of thousands of genes [6]. Microarray datasets are commonly very large, and analytical precision is influenced by a number of variables. It is therefore extremely useful to reduce the dataset to those genes that best distinguish between the two cases or classes (e.g. normal vs. diseased). Two common methods for in-depth microarray data analysis are clustering and classification [7]. Clustering is an unsupervised approach that groups genes or samples with similar patterns that are characteristic of the group. Classification is a supervised learning approach, also known as class prediction or discriminant analysis. Generally, classification is a process of learning from examples. DNA microarray techniques allow the expression levels of thousands of genes to be observed simultaneously during significant biological processes and across collections of related samples [8].
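Reducing a dataset to the genes that best separate two classes can be sketched as a simple ranking step. The example below assumes a signal-to-noise score (difference of class means divided by the sum of class standard deviations), which is one common choice and not necessarily the one used in this thesis; the gene names and expression values are invented for illustration.

```python
from statistics import mean, stdev

def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b); a larger magnitude means better separation.

    Assumes at least two samples per class and a non-zero combined spread.
    """
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))

def rank_genes(expression, labels, top_k):
    """Return the top_k gene names ordered by absolute signal-to-noise score.

    expression: {gene_name: [value per sample]}; labels: class of each sample.
    """
    first, second = sorted(set(labels))  # exactly two classes are assumed
    scores = {}
    for gene, values in expression.items():
        a = [v for v, l in zip(values, labels) if l == first]
        b = [v for v, l in zip(values, labels) if l == second]
        scores[gene] = abs(signal_to_noise(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical data: three genes measured in three ALL and three AML samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "gene_A": [2.0, 2.2, 1.8, 8.0, 8.3, 7.7],  # strongly different between classes
    "gene_B": [5.0, 4.0, 6.0, 5.5, 4.5, 5.0],  # same mean in both classes
    "gene_C": [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],  # moderately different
}
print(rank_genes(expression, labels, top_k=2))
```

On real microarray data the same ranking would run over thousands of genes, and only the highest-scoring ones would be passed on to the classifier.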

In the present study, we focus on the use of classification techniques in the field of medical bioinformatics. Classification is the most commonly applied data mining technique; it employs a set of preclassified examples to develop a model that can classify the population of records at large. The major goal of classification is to predict the target class accurately for each case in the data. Several classification methods are used in analyzing medical data, including decision trees, K-Nearest Neighbor (KNN), Bayesian networks, neural networks, fuzzy logic and support vector machines.
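As an illustration of how preclassified examples are used to label a new case, the following is a minimal sketch of K-Nearest Neighbor, one of the methods listed above. The two-gene samples and their labels are made up, and Euclidean distance with majority voting is assumed.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training samples."""
    neighbours = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical preclassified examples: expression of two genes per sample.
train_x = [(2.0, 1.0), (2.2, 1.3), (1.8, 0.9), (8.0, 6.0), (8.3, 5.5), (7.7, 6.2)]
train_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

print(knn_predict(train_x, train_y, (7.9, 5.8)))
print(knn_predict(train_x, train_y, (2.1, 1.1)))
```

The training set plays the role of the preclassified examples, and each call to `knn_predict` classifies one previously unseen record.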

WEKA was used as the data mining tool for the experiments and implementations. WEKA (Waikato Environment for Knowledge Analysis) is a data mining tool written in Java and developed at the University of Waikato. It allows users to apply different algorithms to a dataset and compare their classification accuracy, which makes it well suited to bioinformatics. This research uses these data mining techniques to predict leukemia and compares the accuracy of the different classification algorithms.

The rest of this chapter is organized as follows: Section 1.2 presents the challenges of cancer classification research. Section 1.3 describes the research problem. Section 1.4 briefly states the research objectives. Section 1.5 outlines the structure of the thesis.

1.2 Cancer classification research challenges

Gene classification as a research domain poses new challenges because of the unique nature of the problem. The first challenge comes from the nature of the available gene expression datasets: most have fewer than 200 samples, against thousands to hundreds of thousands of genes in each tuple. Second, only a few of these genes are relevant to the disease under investigation. The third challenge comes from the noise (biological and technical) inherent in the data. The fourth challenge arises from the application area: accuracy is an important criterion in cancer classification, but it is not the only goal; in the cancer domain we want to achieve biological relevance as well as classification accuracy [9].

1.3 Problem statement

Leukemia is a type of cancer caused by an abnormal increase in white blood cells. Every year, thousands of people throughout the world die of leukemia because leukemia cells grow out of control and spread randomly. The most effective way to reduce deaths from this disease is early detection, which requires an accurate diagnosis; however, doctors do not have an effective technique for predicting the disease at an early stage.

A major problem in bioinformatics analysis and medical science is attaining the correct diagnosis of leukemia. Reaching a final diagnosis normally involves many tests, which generally require the clustering or classification of large-scale data. All of these test procedures are said to be necessary in order to reach the final diagnosis. However, too many tests can complicate the main diagnostic process and make it difficult to obtain the end result. In addition, there is no specific classification technique or prediction tool for predicting leukemia. This difficulty could be resolved with the aid of machine learning, which can obtain the end result directly by using several classification algorithms as classifiers.

Another problem in classification is the accuracy of the classifier, which depends not only on the classification algorithm but also on the feature selection method. Selecting irrelevant or inappropriate features may increase complexity. The feature selection method therefore plays a major role in the efficiency of classification. Although different kinds of feature selection methods are available, the method chosen should maximize classification accuracy, and it should also consume little space and time.
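The effect of an irrelevant feature on accuracy can be shown with a deliberately contrived sketch. It assumes a 1-nearest-neighbour classifier evaluated by leave-one-out testing; the data are invented so that the first feature is informative and the second is large-scale noise, and none of the names below come from the thesis.

```python
from math import dist

def loo_accuracy(samples, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    Only the feature indices listed in `features` are used to compute distances.
    """
    def project(sample):
        return [sample[i] for i in features]

    correct = 0
    for i, (held_out, truth) in enumerate(zip(samples, labels)):
        # Every sample except the held-out one acts as training data.
        rest = [(project(s), l)
                for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        _, predicted = min(rest, key=lambda pair: dist(pair[0], project(held_out)))
        correct += predicted == truth
    return correct / len(samples)

# Feature 0 separates the classes; feature 1 is unrelated noise on a larger scale.
samples = [(1.0, 10.0), (1.2, 90.0), (0.9, 50.0),
           (3.0, 12.0), (3.1, 88.0), (2.9, 52.0)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

print(loo_accuracy(samples, labels, features=[0, 1]))  # noise dominates the distance
print(loo_accuracy(samples, labels, features=[0]))     # informative feature only
```

With these toy values the noisy feature dominates the distance and every held-out sample is misclassified, while using the informative feature alone classifies all six correctly. Real datasets show far less extreme differences.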

Data mining methods for diagnosing diseases based on previous data and information have been improving over the years. The methods currently used for disease diagnosis employ various classification techniques, including decision trees and rule classifiers. Data mining techniques can not only draw conclusions accurately but also aid in visualizing patterns within the dataset itself. No single classifier is superior to the rest, because classification accuracy depends on the classification method, the gene selection method, and the dataset.

1.4 Research objectives

Considering the social and medical impact of controlling leukemia, it is proposed to carry out a data mining analysis of the available genetic data on leukemia using advanced data mining tools. Apart from a detailed study of leukemia, it is intended to collect a large sample of genetic data from people with and without the disease. Advanced data mining tools and algorithms will be developed to analyze the collected data, with the aim of identifying patterns that help to detect leukemia. It is also proposed to suggest the best measures, if any.

This research uses data mining techniques to analyze and evaluate classification algorithms on leukemia datasets. Using the open source WEKA data mining tool, we can generate predictive models for the classification of leukemia and evaluate the accuracy and performance of several techniques.
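Evaluating and comparing classifiers usually comes down to a few measures derived from the confusion matrix. The sketch below is a plain-Python illustration rather than WEKA code; the two prediction lists stand for the output of two hypothetical classifiers, and accuracy, sensitivity and specificity are assumed as the measures of interest.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives with respect to the `positive` class."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def summarize(actual, predicted, positive):
    """Accuracy, sensitivity and specificity; assumes both classes occur in `actual`."""
    c = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (c["tp"] + c["tn"]) / sum(c.values()),
        "sensitivity": c["tp"] / (c["tp"] + c["fn"]),
        "specificity": c["tn"] / (c["tn"] + c["fp"]),
    }

# Hypothetical true labels and the predictions of two made-up classifiers.
actual = ["AML", "AML", "AML", "ALL", "ALL", "ALL", "ALL", "ALL"]
pred_a = ["AML", "AML", "ALL", "ALL", "ALL", "ALL", "ALL", "AML"]
pred_b = ["AML", "AML", "AML", "ALL", "ALL", "AML", "AML", "ALL"]

for name, pred in (("classifier A", pred_a), ("classifier B", pred_b)):
    print(name, summarize(actual, pred, positive="AML"))
```

In this toy case both classifiers reach the same accuracy but differ in sensitivity and specificity, which is why more than one measure is worth reporting when techniques are compared.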

In order to reach the main goal of the research the following tasks are to be fulfilled:

• Thorough literature survey to determine which algorithms are used in medicine (especially for leukemia) and which performance metrics are used to evaluate them.

• Analysis of the performance of various classification techniques in data mining for predicting leukemia from blood disease datasets.

• Overview of the methods and their implementation in the analytical environment.

• Investigation of the performance of different classification methods for leukemia using WEKA.

• Identification and selection of the most common data mining algorithms applied to leukemia.

• Comparison of different data mining classification algorithms on leukemia datasets.

• Extraction of useful classification accuracy results for the prediction of leukemia.

• Generation of data mining models to classify leukemia.

• Evaluation of the performance of the models.

• Identification of the best-performing algorithm for disease prediction.

• Finding the right classification algorithm that works well on diverse datasets.

1.5 Thesis organization

This chapter has presented an introduction to the research area of this master's thesis. The remainder of the thesis is organized as follows. Chapter 2 presents a review of the scientific literature and background on the topic. Chapter 3 discusses data mining concepts and techniques. Chapter 4 describes the test datasets: it presents the sources of the data, the methods used to evaluate the effectiveness and accuracy of the data mining methods, and the measures taken into consideration. Chapter 5 describes the results, analysis and general findings. Finally, Chapter 6 presents the thesis summary, conclusions and future work.
