Diagnosing Hepatitis Disease: A Framework Using Rule-Based Induction, Robust BoxCox Transformation, and Neural Network

Diagnosing liver disease is a challenging task for many public health physicians. In this study, we propose a framework to diagnose hepatitis. An adaptive rule-based induction was formulated, and the adaptive rule was implemented in a combined Robust BoxCox Transformation (RBCT) and Neural Network (NN) method. The performance of the proposed model is evaluated and compared on the basis of classification accuracy. RBCT-NN obtained an improved accuracy rate of 98.07% compared to other techniques, thereby reducing both the difficulty of predicting hepatitis and the likelihood of errors.

Health care services need regular updates in quality of service at affordable cost and in optimal time [10, 12]. Quality of service includes diagnosing patients accurately and providing treatments that are effective for them [2, 3, 14]. This can be achieved by employing decision support systems [12, 14, 17]. The aim of this research is to design a framework that diagnoses hepatitis accurately within an optimal time period and distinguishes the type of liver disease in hepatitis patients using data mining techniques.

Hepatitis is an inflammation of the liver and is considered one of the most common infectious diseases, causing about 1.5 million deaths worldwide each year [10, 18]. Viral hepatitis is associated with six virus types: HAV, HBV, HCV, HDV, HEV, and HGV [15]. Proper diagnosis and accurate, timely prediction of the disease can save many patients. Data mining is an efficient tool for diagnosing hepatitis from large databases and predicting the severity of the disease.

The remainder of this paper is organized as follows. Section 2 reviews the related studies; Section 3 describes the dataset used for the experimental work. The problem of identifying the disease is formulated in Section 4. The next section specifies the steps of the methodology, followed by the cases of the rule-based induction conditions applied in the algorithm. Finally, the paper closes with the conclusion.

II. RELATED STUDY

Yılmaz Kaya et al. [16] implemented a hybrid medical decision support system based on rough set (RS) and extreme learning machine (ELM) for the diagnosis of hepatitis. The diagnosis has two stages: in the first stage, redundant features are removed from the data set using the RS approach; in the second stage, classification is carried out using ELM. The RS-ELM model achieved a classification accuracy of 96.49%.

Javed Salimi Sartakhti et al. [11] presented a machine learning method that hybridizes a Support Vector Machine with simulated annealing for hepatitis diagnosis. Simulated annealing is a method for solving difficult optimization problems. The accuracy obtained using 10-fold cross validation was 96.25%.

Duygu et al. [4] proposed an intelligent hepatitis diagnosis system using Principal Component Analysis and a Least Squares Support Vector Machine classifier (PCA–LSSVM). The proposed system has two phases: (1) in the first phase, features are extracted from the hepatitis disease database and reduced using PCA; (2) in the second phase, the LSSVM classifier is used to obtain the classification accuracy. Feature extraction is an important step for making good predictions. The original hepatitis database has 19 features, which PCA reduced to 10. In the second phase, these 10 features were given as inputs to the LSSVM classifier. The LSSVM classifier uses two parameters for prediction: the width of the Gaussian kernel σ and the regularization factor C. The value of σ was adjusted between 0.1 and 25 and the value of C between 1 and 100000, ranges that are suitable for SVM prediction. From 10 combinations of C and σ values, the best classification accuracy of 96.12% was obtained with σ = 0.8 and C = 100.

G. Sathya Devi [7] proposed an application for diagnosing hepatitis using the decision tree algorithms C4.5, ID3 and CART. It first classifies the hepatitis cases and then compares the effectiveness of the algorithms. Among the classification models compared, the CART model gave the best accuracy of 83.2%.

In A. H. Roslina et al. [1], Support Vector Machines and a wrapper method were used for hepatitis prognosis prediction. Noisy features were removed from the dataset using the wrapper method before classification, and Support Vector Machines were then used to obtain the accuracy. The accuracy rate increased, and the results indicated that the cost and time of clinical lab tests could be reduced for patients. Combining the wrapper method with SVM thus showed improved results in the diagnosis of hepatitis.

Fadl Mutaher et al. [6] presented a comparative analysis of techniques for the prognosis of hepatitis data using the rough set technique and a multi-layer neural network trained with the back-propagation algorithm. The predictions obtained with these techniques were more specific and accurate, and the time taken to run the hepatitis data through the prediction process was much shorter than with other techniques.

Rong-Ho Lin [13] presented a model for diagnosing liver disease using classification and regression tree (CART) and case-based reasoning (CBR). The model has two stages: the first adopts CART to predict whether a patient suffers from liver disease, and the second employs CBR to predict the type of liver disease. Five-fold cross validation was performed, yielding an accuracy rate of 92.16% using CART and 87.25% using CBR.

Ihsan Omur Bucak et al. [9] implemented a system for diagnosing liver disease using the CMAC neural network approach, in order to shorten the medical diagnosis process and help physicians handle complicated cases. The network consists of 24 input nodes and 5 output nodes. The load and the input training data were normalized, the variation between the actual and desired output was calculated, and the testing weight was saved. The learning rate was 0.5, and it increased sharply as the desired output was 0.0001.

Hui-Ling et al. [8] developed a medical diagnostic method using local Fisher discriminant analysis and support vector machines (LFDA-SVM) for the hepatitis diagnosis problem. Experiments were carried out on the hepatitis dataset to distinguish live from dead liver patients. A comparative study was conducted against PCA_SVM, FDA_SVM and the standard SVM. LFDA_SVM achieved the best classification accuracy of 96.77% with an 80–20% training–testing partition.

Enas M. F. El Houby [5] proposed a framework for predicting HCV patients' response to HCV treatment from clinical information. The framework has three phases: a pre-processing phase that prepares the data for the data mining technique, a phase that applies the data mining technique, and an evaluation phase that evaluates and compares the performance. Associative classification, Artificial Neural Network and Decision Tree techniques were applied; of these, associative classification obtained an accuracy rate of 92%.

Yugal Kumar and G. Sahoo [17] proposed a rule-based classification model for the prediction of different liver diseases. They compared the rule-based classification model with a model without rules, and the rule-based classification model with a decision tree provided more accurate results.

The primary goal of this work is to design a framework for healthcare management that diagnoses and categorizes hepatitis. The proposed work is organized in three phases, shown in Fig. 1:

• The initial phase pre-processes the input data by scaling the dataset in order to assess its normality.

• The next phase is model processing, which splits the dataset into a training dataset and a test dataset. The rule-based induction is then formulated and applied in different classification algorithms.

• The last phase is evaluation, where cross-validation is performed on each model to obtain the maximum accuracy and the model with the best accuracy rate is selected. The selected model is evaluated with different evaluation parameters, the results are compared, and the appropriate technique is chosen.

A. Algorithm for Pre-Processing

Step-1: Start

Step-2: Read the Dataset Xn

Step-3: Scale the dataset to normalize it: read the minimum and maximum value of each attribute and compute the normalized value using (1),

\[ Y_i = \frac{Y_i - Y_{\min}}{Y_{\max} - Y_{\min}} + 1 \qquad (1) \]

The p-value is generated using (2); if the p-value is less than 0.05, the null hypothesis is rejected.

\[ z = \frac{p - P}{\sqrt{\dfrac{P(1 - P)}{n}}}, \qquad p = \frac{m}{n} \qquad (2) \]

where m represents the number of "yes" responses, n represents the random sample size, p represents the sample proportion and P represents the population proportion.
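The pre-processing steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' R code. The function names `min_max_scale` and `proportion_p_value` are hypothetical, the `shift` argument reproduces the `+ 1` offset of (1), and the two-sided normal approximation used for the p-value is an assumption, since the text does not state which variant of the test was used.

```python
import math


def min_max_scale(values, shift=1.0):
    """Scale values with (1): (y - min) / (max - min) + shift."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map it to the offset alone.
        return [shift for _ in values]
    return [(y - lo) / (hi - lo) + shift for y in values]


def proportion_p_value(m, n, P):
    """Two-sided p-value of the one-proportion z statistic in (2).

    m: number of 'yes' responses, n: sample size,
    P: hypothesized population proportion (0 < P < 1).
    """
    p = m / n
    z = (p - P) / math.sqrt(P * (1.0 - P) / n)
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Made-up numbers for illustration only.
scaled = min_max_scale([12.0, 40.0, 26.0])
p_val = proportion_p_value(m=60, n=100, P=0.5)
reject = p_val < 0.05  # decision rule stated in the text
```

Calling `min_max_scale(values, shift=0.0)` gives the plain 0 to 1 range instead of the offset version.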

B. Algorithm for Model Processing

Step 1: Read the pre-processed dataset

Step 2: Split the dataset into a 50-50% training set and test set partition

Step 3: Formulate the rule based on the rule-based induction

Step 4: Analyze the rule-based induction for diagnosing the presence / absence of liver disease, the hepatitis types HAV, HBV, HCV and HDV, and the intensity of the liver disease as acute or chronic.

Step 5: The robust linear mixed model for normally distributed vector elements, with a random effect of binomial type, is used to obtain the predicted result,

\[ Y_i = \mathrm{rlm}\big(E(Y \mid X)\big) = \beta + \gamma_1 X_{i1} + \gamma_2 X_{i2} + \gamma_3 X_{i3} + \dots + \gamma_n X_{in} \qquad (3) \]

where Y_i represents the linear predictor, γ represents the vector of weights applied to the data elements, and β specifies the intercept weight of the dataset elements.

Step 6: Transform the robust multi-linear mixed model using the Box-Cox transformation in (4).

\[ x^{(\lambda)} = \begin{cases} \dfrac{(x + \lambda_2)^{\lambda_1} - 1}{\lambda_1}, & \lambda_1 \neq 0 \\[2ex] \log(x + \lambda_2), & \lambda_1 = 0 \end{cases} \qquad (4) \]

Step 7: Calculate the accuracy of the obtained model using the following measures.

a) Accuracy: Prediction accuracy is calculated using (5),

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5) \]

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives.

b) Standard error: The error in the prediction is identified using (6),

\[ SE = \sqrt{\frac{p(1 - p)}{n}} \qquad (6) \]

where p represents the sample proportion and n represents the sample size.

c) R-squared value:

\[ r^2 = 1 - \frac{SS_{\text{Error}}}{SS_{\text{Total}}} \qquad (7) \]

where SS Total is the total sum of squares and SS Error is the sum of squared errors.

Step 8: Apply the formulated rule in the Random Forest, Neural Network, Naïve Bayes, Decision Tree, K-Nearest Neighbor and Support Vector Machine classification techniques

Step 9: Obtain the result of each technique

Step 10: Stop
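The three evaluation measures of Step 7 are simple enough to compute by hand. The sketch below shows them in Python, assuming plain counts and lists as inputs. The function names are hypothetical and the numbers are made up for illustration; they are not results from this study.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Prediction accuracy as in (5)."""
    return (tp + tn) / (tp + tn + fp + fn)


def standard_error(p, n):
    """Standard error of a sample proportion p with sample size n, as in (6)."""
    return math.sqrt(p * (1.0 - p) / n)


def r_squared(y_true, y_pred):
    """R-squared as in (7): 1 - SS_Error / SS_Total."""
    mean_y = sum(y_true) / len(y_true)
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    ss_error = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    return 1.0 - ss_error / ss_total


# Made-up confusion-matrix counts and predictions.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
se = standard_error(acc, 100)
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```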

C. Algorithm for Evaluation

Step 1: Read the results obtained from the classification techniques

Step 2: Perform cross-validation on each model

Step 3: Select the model that gives the best accuracy rate

Step 4: Evaluate the selected model with different evaluation parameters

Step 5: Stop
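The selection in Step 3 can be illustrated with a short Python sketch. It assumes that each model's cross-validation run has already produced one accuracy per fold; the helper `select_best_model`, the model names and the scores are hypothetical placeholders, not values from this study.

```python
from statistics import mean


def select_best_model(cv_scores):
    """Return (name, mean accuracy) of the model with the highest mean fold accuracy."""
    means = {name: mean(scores) for name, scores in cv_scores.items()}
    best = max(means, key=means.get)
    return best, means[best]


# Made-up fold accuracies for three placeholder models.
demo_scores = {
    "model_a": [0.90, 0.92, 0.91],
    "model_b": [0.95, 0.96, 0.97],
    "model_c": [0.88, 0.93, 0.90],
}
best_name, best_mean = select_best_model(demo_scores)
```

Using the mean over folds is one common choice; the text only says that the model with the best accuracy rate is selected.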

VI. RULE BASED INDUCTION IN HEPATITIS DIAGNOSIS

The rule-based induction is formulated based on the standard values given in Table I.

1) Rule 1: Partition the patients by gender,

If (Gender == 0, "Male", "Female")

2) Rule 2: This condition differentiates between healthy and unhealthy male patients

If ((Gender == 0 && (AST > 5 && AST < 40) && (ALT > 5 && ALT < 42) && (ALP > 25 && ALP < 120) && (GGT <= 40) && (PT > 11 && PT < 16) && (TBil > 0.1 && TBil < 1.0) && (TP > 5 && TP < 80) && (Al > 3.5 && Al < 5)), "Mhealthy", "Munhealthy")

3) Rule 3: If Rule 2 fails, this condition checks for acute, chronic, alcoholic and cirrhosis liver diseases in male patients

Case 1: If ((TBil > 2.0 && ALT > 900 && AST > 90 && ALP > 120 && GGT > 65 && TBil > 20 && (RatioAST && RatioALT < 1)), "Acute Viral Hepatitis")

Case 2: If ((TBil > 20 && GGT > 65), "Jaundice")

Case 3: If (((AST > 225 && AST < 450) && (RatioAST && RatioALT > 2) && ALP > 120 && GGT > 65 && TBil > 20), "Chronic Hepatitis")

Case 4: If (((AST > 225 && AST < 450) && (RatioAST && RatioALT > 2.5) && ALP > 120 && GGT > 65 && TBil > 20), "Alcoholic Acute Hepatitis")

Case 5: If ((AST > ALT && TBil > 20 && GGT > 65), "Cirrhosis")

4) Rule 4: This condition differentiates between healthy and unhealthy female patients

If ((Gender != 0 && (AST > 5 && AST < 38) && (ALT > 6 && ALT < 34) && (ALP > 25 && ALP < 90) && (GGT <= 25) && (PT > 11 && PT < 16) && (TBil > 0.1 && TBil < 1.0) && (TP > 5 && TP < 80) && (Al > 3.5 && Al < 5)), "Fhealthy", "Funhealthy")

5) Rule 5: If Rule 4 fails, this condition checks for acute, chronic, alcoholic and cirrhosis liver diseases in female patients

Case 1: If ((TBil > 2.0 && ALT > 850 && AST > 25 && ALP > 90 && GGT > 65 && TBil > 20 && (RatioAST && RatioALT < 1)), "Acute Viral Hepatitis")

Case 2: If ((TBil > 20 && GGT > 65), "Jaundice")

Case 3: If (((AST > 225 && AST < 450) && (RatioAST && RatioALT > 2) && ALP > 90 && GGT > 65 && TBil > 20), "Chronic Hepatitis")

Case 4: If (((AST > 220 && AST < 425) && (RatioAST && RatioALT > 2.5) && ALP > 90 && GGT > 65 && TBil > 20), "Alcoholic Acute Hepatitis")

Case 5: If ((AST > ALT && TBil > 20 && GGT > 65), "Cirrhosis")

6) Rule 6: If Rule 2 and Rule 4 fail, this rule is used to diagnose the type of viral disease in male and female patients

Case 6: If ((HAVIgM == "pos" && HAVIgG == "pos"), "Viral Hepatitis A")

Case 7: If ((HBsAg == "pos" && anti-HBC == "pos" && IgMantiHBC == "pos" && antiHBs == "neg"), "Acute Viral Hepatitis B")

Case 8: If ((HBsAg == "pos" && anti-HBC == "pos" && IgMantiHBC == "neg" && antiHBs == "neg"), "Chronic Viral Hepatitis B")

Case 9: If ((HCVAb == "pos" && HCVRNA == "pos"), "Viral Hepatitis C")

If Case 7 or Case 8 holds, then check Case 10:

Case 10: If ((HDVAb == "pos" && HDVRNA == "pos"), "Viral Hepatitis D")

Case 11: If ((HEV == "pos"), "Viral Hepatitis E")

The experiment on the hepatitis disease database was performed using RStudio version 0.99.903.

To demonstrate the effectiveness of the proposed algorithm, the dataset listed in Table I is used for the experimental work. Initially, the normality of the dataset is assessed; for this, each attribute is scaled to the range 0 to 1. Table II shows the results, including the p-values, obtained after scaling the dataset. The p-value used to assess normality is calculated using (2). The histograms of the scaled attributes are shown in Fig. 2.

[Fig. 1: flowchart of the proposed framework. Pre-processing: patient's raw dataset, scaling of dataset, estimating normality, estimating linearity of dataset. Model processing: adaptive rule formulated (proposed), adaptive transformative robust linear model (proposed), apply classification algorithms, testing for diagnosis and categorization. Evaluation: model evaluation by cross-validation, comparative analysis of models, evaluating model for prediction. Outputs: hepatitis detection and hepatitis categorization.]

Next, the experimental analysis of the proposed work is carried out by dividing the dataset into training and testing sets with a 50-50% partition. Then 10-fold cross validation is performed for each technique to evaluate the training phase of each classifier. The rule-based induction algorithm is formulated to identify whether the disease is present or absent in patients. It also categorizes the hepatitis types based on the standard values of the attributes listed in Table I.
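The partitioning described above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the helpers `split_half` and `k_fold` are hypothetical, the shuffle seed is arbitrary, and the sketch does not stratify by class because the text does not say whether the split was stratified.

```python
import random


def split_half(n_samples, seed=0):
    """Shuffle sample indices and split them 50-50 into train and test."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    half = n_samples // 2
    return idx[:half], idx[half:]


def k_fold(indices, k=10):
    """Yield (fit, validate) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validate = folds[i]
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, validate


# 100 is a placeholder sample count, not the size of the study dataset.
train_idx, test_idx = split_half(100)
cv_folds = list(k_fold(train_idx, k=10))
```

Each classifier would be fitted on `fit` and scored on `validate` for every fold, and the held-out `test_idx` is used only for the final evaluation.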

The formulated algorithm first classifies each record in the patient database as "present" or "absent" with respect to the disease. It then discards the absent cases and classifies the present cases into the types HAV, HBV, HCV, HDV and HEV. Finally, for each present case it identifies the severity of the disease as "acute" or "chronic".
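The type and severity stage can be illustrated with the serology conditions of Rule 6 (Cases 6 to 11). The Python sketch below is a hedged reading of those cases, not the authors' implementation. The function name `viral_types` is hypothetical, marker names are taken from the rule text, a missing marker is treated as not matching, and returning a list of labels (so that co-infections are reported together) is a design choice of this sketch.

```python
def viral_types(markers):
    """Map serology markers ('pos' / 'neg') to hepatitis labels per Rule 6."""

    def pos(name):
        return markers.get(name) == "pos"

    def neg(name):
        return markers.get(name) == "neg"

    labels = []
    if pos("HAVIgM") and pos("HAVIgG"):                       # Case 6
        labels.append("Viral Hepatitis A")

    hbv_core = pos("HBsAg") and pos("anti-HBC") and neg("antiHBs")
    has_hbv = False
    if hbv_core and pos("IgMantiHBC"):                        # Case 7
        labels.append("Acute Viral Hepatitis B")
        has_hbv = True
    elif hbv_core and neg("IgMantiHBC"):                      # Case 8
        labels.append("Chronic Viral Hepatitis B")
        has_hbv = True

    if pos("HCVAb") and pos("HCVRNA"):                        # Case 9
        labels.append("Viral Hepatitis C")
    # Case 10 is only checked when Case 7 or Case 8 matched.
    if has_hbv and pos("HDVAb") and pos("HDVRNA"):
        labels.append("Viral Hepatitis D")
    if pos("HEV"):                                            # Case 11
        labels.append("Viral Hepatitis E")
    return labels


# Made-up marker panel for illustration.
demo_panel = {"HBsAg": "pos", "anti-HBC": "pos", "IgMantiHBC": "neg",
              "antiHBs": "neg", "HDVAb": "pos", "HDVRNA": "pos"}
demo_labels = viral_types(demo_panel)
```

An empty list means none of the cases matched, which corresponds to a record that is not categorized by Rule 6.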

The dataset used in this experiment is first examined for linearity. To assess the linearity of the dataset, the linear model and generalized linear models are used. Transformation techniques are then applied to the models to improve the results: binomial, log, square root, Gaussian, Poisson and Box-Cox transformations are applied to each model. The following models are evaluated: linear model, linear model with binomial transformation, robust log transformation model, robust square transformation model, robust Box-Cox transformation model, robust Gaussian Box-Cox transformation model, robust Poisson Box-Cox transformation model, generalized Gaussian linear model, and generalized linear binomial square root transformation model. Table III lists the accuracy, residual error, R² value and adjusted R² value of all the models. It shows that the Robust BoxCox Transformation (RBCT) model gives better results than the other models in evaluating linearity. The accuracy rates of all the models are shown graphically in Fig. 3.
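For readers unfamiliar with the Box-Cox family, the following Python sketch shows the standard two-parameter transform and a simple grid search for the power parameter by maximum likelihood. It is an illustration under stated assumptions: it does not reproduce the robust model fitting used in this study, the names `box_cox`, `log_likelihood` and `best_lambda` are hypothetical, and the grid from -2 to 2 is an arbitrary choice.

```python
import math


def box_cox(x, lam1, lam2=0.0):
    """Two-parameter Box-Cox transform of a single value (x + lam2 > 0)."""
    shifted = x + lam2
    if shifted <= 0:
        raise ValueError("x + lam2 must be positive")
    if abs(lam1) < 1e-12:
        return math.log(shifted)          # limiting case lam1 = 0
    return (shifted ** lam1 - 1.0) / lam1


def log_likelihood(xs, lam1, lam2=0.0):
    """Profile log-likelihood of lam1 under a normal model (constants dropped)."""
    ys = [box_cox(x, lam1, lam2) for x in xs]
    n = len(ys)
    mean_y = sum(ys) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    jacobian = (lam1 - 1.0) * sum(math.log(x + lam2) for x in xs)
    return -0.5 * n * math.log(var_y) + jacobian


def best_lambda(xs, grid=None, lam2=0.0):
    """Pick the lam1 on the grid with the highest profile log-likelihood."""
    if grid is None:
        grid = [i / 10.0 for i in range(-20, 21)]
    return max(grid, key=lambda lam: log_likelihood(xs, lam, lam2))


# Made-up positive, right-skewed values for illustration.
demo = [1.2, 1.5, 2.9, 1.1, 4.8, 1.9, 2.2, 7.5]
lam_hat = best_lambda(demo)
transformed = [box_cox(x, lam_hat) for x in demo]
```

A shift `lam2` is only needed when the data contain zeros or negative values; for strictly positive attributes it can stay at 0.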

Further, the result of RBCT is incorporated in the Random Forest (RBCT-RF), Neural Network (RBCT-NN), Naïve Bayes (RBCT-NB), Decision Tree (RBCT-DT), K-Nearest Neighbor (RBCT-KNN) and Support Vector Machine (RBCT-SVM) classification algorithms to obtain the classification accuracy. Among these techniques, RBCT-NN gives an improved accuracy rate of 98.07% compared with the other algorithms. Table IV summarizes the classification accuracy of all the algorithms. Fig. 4 shows the accuracy, sensitivity, specificity, positive predictive value, negative predictive value and balanced accuracy of the proposed models.
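The measures reported in Fig. 4 all derive from the four confusion-matrix counts. The Python sketch below shows the standard definitions; the function name `diagnostic_metrics` is hypothetical and the counts are made up, so the printed values are not results from this study.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),            # positive predictive value
        "npv": tn / (tn + fn),            # negative predictive value
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
    }


# Made-up counts for illustration.
metrics = diagnostic_metrics(tp=40, tn=45, fp=5, fn=10)
```

Balanced accuracy is useful when the present and absent classes differ in size, since plain accuracy can then be dominated by the larger class.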

In this study, a framework is proposed for diagnosing hepatitis. The proposed framework first scales the dataset to assess its normality. It then estimates the linearity of the dataset using different regression techniques. The regression obtained with the RBCT model gives better accuracy in measuring linearity than the other techniques. After the linearity of the model was estimated, the prediction accuracy was calculated using different classification algorithms. Among these, RBCT-NN showed improved results. The proposed RBCT-NN can be an effective model for diagnosing hepatitis; once the disease is diagnosed, it further categorizes the type and determines the severity of the disease. In the future, this model can be applied to the prediction of other diseases.
