Diagnosing liver disease is a challenging task for many
public health physicians. In this study, we propose a
framework to diagnose hepatitis. An adaptive rule-based
induction was formulated, and the adaptive rule was
implemented in a combination of Robust Box-Cox Transformation
(RBCT) and Neural Network (NN) methods. The performance of
the proposed model is evaluated and compared with other
techniques on the basis of classification accuracy. RBCT-NN
obtained an improved accuracy rate of 98.07% compared with
the other techniques, thereby reducing the difficulty of
predicting hepatitis and the likelihood of errors.
Health care services need regular improvements in quality,
at affordable cost and in optimal time [10, 12]. Quality of
service includes diagnosing patients accurately and providing
treatments that are effective for them [2, 3, 14]. This can
be achieved by employing decision support systems [12, 14,
17]. The aim of this research is to design a framework that
diagnoses hepatitis accurately and in optimal time, and that
distinguishes the type of liver disease in hepatitis patients
using data mining techniques.
Hepatitis is an inflammation of the liver and is considered
one of the most common infectious diseases, causing about
1.5 million deaths worldwide each year [10, 18]. Viral
hepatitis is associated with six virus types: HAV, HBV, HCV,
HDV, HEV and HGV [15]. Proper and timely diagnosis and
accurate prediction of the disease can save many patients.
Data mining is an efficient tool for diagnosing hepatitis
from large databases and predicting the severity of the
disease.
The remainder of the paper is organized as follows. Section 2
reviews the related studies on which this research builds.
Section 3 describes the dataset used for the experimental
work. The problem of identifying the disease is formulated in
Section 4. The next section specifies the steps of the
methodology, followed by the cases of the rule-based
induction conditions applied in the algorithm. Finally, the
paper ends with the conclusion.
II. RELATED STUDY
Yılmaz Kaya et al. [16] implemented a hybrid medical
decision support system based on rough sets (RS) and an
extreme learning machine (ELM) for the diagnosis of
hepatitis. The diagnosis has two stages. In the first stage,
redundant features are removed from the dataset using the RS
approach. In the second stage, classification is carried out
with the ELM. The RS-ELM model achieved a classification
accuracy of 96.49%.
Javed Salimi Sartakhti et al. [11] presented a machine
learning method that hybridizes a support vector machine with
simulated annealing for hepatitis diagnosis. Simulated
annealing is a method for solving difficult optimization
problems. The accuracy obtained with 10-fold cross-validation
was 96.25%.
Duygu et al. [4] proposed an intelligent hepatitis diagnosis
system using Principal Component Analysis and a Least Squares
Support Vector Machine classifier (PCA–LSSVM). The system has
two phases: (1) in the first phase, features are extracted
from the hepatitis database and reduced using PCA; (2) in the
second phase, the LSSVM classifier is used to obtain the
classification accuracy. Feature extraction is important for
identifying the features that support good predictions. The
original hepatitis database has 19 features, which PCA
reduced to 10. In the second phase, these 10 features were
given as inputs to the LSSVM classifier. The LSSVM classifier
uses two parameters: the width of the Gaussian kernel, σ, and
the regularization factor, C. The values of σ were varied
between 0.1 and 25 and the values of C between 1 and 100000,
ranges considered suitable for SVM prediction. Of 10
combinations of C and σ, the best classification accuracy,
96.12%, was obtained with σ = 0.8 and C = 100.
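To make the role of the two LSSVM parameters concrete, the following Python sketch evaluates a Gaussian kernel for a given width σ and enumerates a small grid of (C, σ) pairs inside the ranges reported above. The function name, the kernel form exp(-||x - z||² / (2σ²)) and the grid values are illustrative assumptions; the cited work reports only its best combination, not the other nine.

```python
import math
from itertools import product


def gaussian_kernel(x, z, sigma):
    """Gaussian (RBF) kernel between two feature vectors.

    The form exp(-||x - z||^2 / (2 * sigma^2)) is one common
    convention and is assumed here.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Hypothetical candidate values inside the reported ranges
# (sigma in [0.1, 25], C in [1, 100000]).
sigma_grid = [0.1, 0.8, 5.0, 25.0]
c_grid = [1, 100, 100000]

# Every (C, sigma) pair that a grid search would evaluate.
candidates = list(product(c_grid, sigma_grid))

# A wider kernel treats the same two points as more similar.
narrow = gaussian_kernel([0.0, 0.0], [1.0, 1.0], sigma=0.8)
wide = gaussian_kernel([0.0, 0.0], [1.0, 1.0], sigma=5.0)
print(narrow < wide)
```

In a full grid search, each pair in `candidates` would be used to train and score a classifier, and the pair with the best validation accuracy would be kept.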
G. Sathya Devi [7] proposed an application for diagnosing
hepatitis using the decision tree algorithms C4.5, ID3 and
CART. It classifies the hepatitis cases and compares the
effectiveness of the algorithms. Of the classification models
compared, the CART model gave the best accuracy, 83.2%.
A. H. Roslina et al. [1] used Support Vector Machines and a
wrapper method for hepatitis prognosis prediction. Noisy
features were removed from the dataset with the wrapper
method before classification, and Support Vector Machines
were then used to obtain the accuracy. The accuracy rate
increased, and the results indicate that the cost and time of
clinical laboratory tests can be reduced for patients.
Combining the wrapper method with SVM thus improved the
diagnosis of hepatitis.
Fadl Mutaher et al. [6] presented a comparative analysis of
techniques for the prognosis of hepatitis data, using the
rough set technique and a multi-layer neural network trained
with the back-propagation algorithm. The predictions obtained
with these techniques were more specific and accurate, and
processing the hepatitis data took less time than with other
techniques.
Rong-Ho Lin [13] presented a model for diagnosing liver
disease using classification and regression trees (CART) and
case-based reasoning (CBR). The model has two stages: the
first uses CART to predict whether a patient suffers from
liver disease, and the second uses CBR to predict the type of
liver disease. With five-fold cross-validation, the accuracy
rate was 92.16% for CART and 87.25% for CBR.
Ihsan Omur Bucak et al. [9] implemented a system for
diagnosing liver disease using a CMAC neural network, to
shorten the medical diagnosis process and help physicians
handle complicated cases. The network consists of 24 input
nodes and 5 output nodes. The load and the training inputs
were normalized, and the variation between the actual and
desired outputs was calculated. The testing weights were
saved. The learning rate was 0.5 and increased sharply when
the desired output was 0.0001.
Hui-Ling et al. [8] developed a medical diagnostic method
for the hepatitis diagnosis problem using local Fisher
discriminant analysis and support vector machines (LFDA-SVM).
Experiments on the hepatitis dataset distinguished surviving
patients from those who died. The method was compared with
PCA-SVM, FDA-SVM and the plain SVM. LFDA-SVM achieved the
best classification accuracy, 96.77%, with an 80–20%
training–testing partition.
Enas M. F. El Houby [5] proposed a framework for predicting
the response of HCV patients to treatment from clinical
information. It has three phases: a preprocessing phase that
prepares the data for data mining, a phase that applies the
data mining technique, and an evaluation phase that evaluates
and compares performance. Associative classification,
Artificial Neural Network and Decision Tree techniques were
applied; of these, associative classification obtained an
accuracy rate of 92%.
Yugal Kumar and G. Sahoo [17] proposed a rule-based
classification model for the prediction of different liver
diseases. They compared prediction with and without the
rule-based model, and the rule-based classification model
with a decision tree provided more accurate results.
The primary goal of this work is to design a framework for
healthcare management that diagnoses and categorizes
hepatitis. The proposed work has three phases, shown in
Fig. 1:
The first phase pre-processes the input data by scaling the
dataset to assess its normality.
The second phase is model processing, which splits the
dataset into a training set and a test set. Rule-based
induction is then formulated and applied in different
classification algorithms.
The last phase is evaluation, in which cross-validation is
performed on each model and the model with the best accuracy
rate is selected. The selected model is evaluated with
different evaluation parameters, the results are compared,
and the appropriate technique is chosen.
A. Algorithm for Pre-Processing
Step 1: Start
Step 2: Read the dataset Xn
Step 3: Scale the dataset to normalize it. The minimum and
maximum values of each attribute are read, and the normalized
value is calculated using (1),
$$Y_i = \frac{Y_i - Y_{min}}{Y_{max} - Y_{min}} + 1 \qquad (1)$$
The p-value is generated using (2); if the p-value is less
than 0.05, the hypothesis is rejected.
$$x = \frac{p - P}{\sqrt{\frac{P(1 - P)}{n}}}, \qquad p = \frac{m}{n} \qquad (2)$$
where m represents the number of 'yes' responses, n
represents the random sample size, p represents the sample
proportion and P represents the population proportion.
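The pre-processing steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `offset` argument follows the "+ 1" term of (1) as printed (setting it to 0 gives the plain 0-to-1 range), and (2) is assumed to be the usual one-sample proportion z statistic with a two-sided p-value. The function names and the sample numbers are hypothetical.

```python
import math


def min_max_scale(values, offset=1.0):
    """Scale values by (y - min) / (max - min) and add an offset.

    offset=1.0 follows Eq. (1) as printed; offset=0.0 gives the
    plain 0-to-1 range (an assumption about the intended variant).
    """
    y_min, y_max = min(values), max(values)
    if y_max == y_min:
        raise ValueError("all values are identical; range is zero")
    return [(y - y_min) / (y_max - y_min) + offset for y in values]


def one_proportion_p_value(m, n, population_p):
    """Two-sided p-value of a one-sample proportion z-test.

    m: number of 'yes' responses, n: sample size,
    population_p: hypothesised population proportion P.
    """
    p_hat = m / n
    se = math.sqrt(population_p * (1.0 - population_p) / n)
    z = (p_hat - population_p) / se
    # Two-sided tail probability under the standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical attribute values and survey counts.
scaled = min_max_scale([10.0, 20.0, 40.0], offset=0.0)
z, p = one_proportion_p_value(m=60, n=100, population_p=0.5)
reject = p < 0.05  # decision rule stated in Step 3
print(scaled, round(z, 2), reject)
```

The 0.05 threshold is the one given in the algorithm; the choice of a two-sided test is an assumption, since the text does not state the alternative hypothesis.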
B. Algorithm for Model Processing
Step 1: Read the pre-processed dataset
Step 2: Split the dataset into training and test sets with a
50–50% partition
Step 3: Formulate the rules using rule-based induction
Step 4: Analyze the rule-based induction to diagnose the
presence or absence of liver disease, the type of hepatitis
(HAV, HBV, HCV or HDV) and the intensity of the disease
(acute or chronic).
Step 5: Use the robust linear mixed model for normally
distributed vector elements, with a random effect of binomial
type, to obtain the predicted result,
$$Y_i = \mathrm{rlm}\big(E(Y \mid X)\big) = \beta + \gamma \left( X_{i1} + X_{i2} * X_{i3} * \cdots * X_{in} \right) \qquad (3)$$
where Y_i represents the linear predictor, γ represents the
vector of data elements and β specifies the intercept weight
of the dataset elements.
Step 6: Transform the robust multi-linear mixed model using
the Box-Cox transformation in (4).
[Equation (4): Box-Cox transformation of the model, expressed
in terms of log x; the full expression is not recoverable
from the source.]
Step 7: Calculate the following measures from the obtained
model:
a) Accuracy: the prediction accuracy is calculated using (5),
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$
where TP, TN, FP and FN represent the numbers of true
positives, true negatives, false positives and false
negatives.
b) Standard error: the error in the prediction is identified
using (6),
$$SE = \sqrt{\frac{p * (1 - p)}{n}} \qquad (6)$$
where p represents the sample proportion and n represents the
sample size.
c) R-squared value:
$$r^2 = 1 - \frac{SS_{Error}}{SS_{Total}} \qquad (7)$$
where SS Total is the total sum of squares and SS Error is
the sum of squared errors.
Step 8: Apply the formulated rule in the Random Forest,
Neural Network, Naïve Bayes, Decision Tree, K-Nearest
Neighbor and Support Vector Machine classification techniques
Step 9: Obtain the result of each technique
Step 10: Stop
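Parts of the model-processing steps can be illustrated with a short Python sketch. The Box-Cox function below is the standard one-parameter form, ((x^λ - 1) / λ for λ ≠ 0 and log x for λ = 0), and is not claimed to be the exact expression of (4); the accuracy and standard error functions follow (5) and (6). All function names and the example numbers are hypothetical.

```python
import math


def box_cox(x, lam):
    """Standard one-parameter Box-Cox transform of a positive value.

    This is the textbook form, assumed here for illustration.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def accuracy(tp, tn, fp, fn):
    """Prediction accuracy as in Eq. (5)."""
    return (tp + tn) / (tp + tn + fp + fn)


def standard_error(p, n):
    """Standard error of a sample proportion as in Eq. (6)."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical inputs, for illustration only.
transformed = [box_cox(v, lam=0.5) for v in [1.0, 4.0, 9.0]]
acc = accuracy(tp=50, tn=45, fp=3, fn=2)
se = standard_error(p=0.5, n=100)
print(transformed, acc, round(se, 4))
```

The transform is defined only for strictly positive inputs, so attribute values would have to be shifted or scaled into a positive range before it is applied.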
C. Algorithm for Evaluation
Step 1: Read the results obtained from the classification
techniques
Step 2: Perform cross-validation on each model
Step 3: Select the model that gives the best accuracy rate
Step 4: Evaluate the selected model with different evaluation
parameters
Step 5: Stop
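The evaluation steps can be sketched as below: a generator that produces k-fold train/test index splits, and a helper that picks the model with the highest mean fold accuracy. The shuffling with a fixed seed, the function names and the per-fold scores are assumptions made for illustration; the paper does not specify how folds are formed.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx


def select_best_model(cv_scores):
    """Return the model name with the highest mean fold accuracy."""
    means = {name: sum(s) / len(s) for name, s in cv_scores.items()}
    best = max(means, key=means.get)
    return best, means[best]


# Hypothetical per-fold accuracies, for illustration only.
cv_scores = {
    "model_a": [0.90, 0.92, 0.91],
    "model_b": [0.95, 0.97, 0.96],
}
best_name, best_mean = select_best_model(cv_scores)
splits = list(k_fold_indices(n_samples=20, k=10))
print(best_name, len(splits))
```

Each sample appears in exactly one test fold, so every record is used once for testing and k - 1 times for training.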
VI. RULE BASED INDUCTION IN HEPATITIS DIAGNOSIS
The rule-based induction is formulated from the standard
values given in Table I.
1) Rule 1: Partition the patients by gender,
If (Gender == 0, "Male", "Female")
2) Rule 2: This condition differentiates between healthy
and unhealthy male patients
If ((Gender == 0) && (AST > 5 && AST < 40) && (ALT > 5 &&
ALT < 42) && (ALP > 25 && ALP < 120) && (GGT <= 40) &&
(PT > 11 && PT < 16) && (TBil > 0.1 && TBil < 1.0) &&
(TP > 5 && TP < 80) && (Al > 3.5 && Al < 5), "Mhealthy",
"Munhealthy")
3) Rule 3: If Rule 2 fails, this condition checks for acute,
chronic, alcoholic and cirrhotic liver disease in male
patients
Case 1: If ((TBil > 2.0 && ALT > 900 && AST > 90 &&
ALP > 120 && GGT > 65 && TBil > 20 && (RatioAST &&
RatioALT < 1)), "Acute Viral Hepatitis")
Case 2: If ((TBil > 20 && GGT > 65), "Jaundice")
Case 3: If (((AST > 225 && AST < 450) && (RatioAST &&
RatioALT > 2) && ALP > 120 && GGT > 65 && TBil > 20),
"Chronic Hepatitis")
Case 4: If (((AST > 225 && AST < 450) && (RatioAST &&
RatioALT > 2.5) && ALP > 120 && GGT > 65 && TBil > 20),
"Alcoholic Acute Hepatitis")
Case 5: If ((AST > ALT && TBil > 20 && GGT > 65),
"Cirrhosis")
4) Rule 4: This condition differentiates between healthy
and unhealthy female patients
If ((Gender != 0) && (AST > 5 && AST < 38) && (ALT > 6 &&
ALT < 34) && (ALP > 25 && ALP < 90) && (GGT <= 25) &&
(PT > 11 && PT < 16) && (TBil > 0.1 && TBil < 1.0) &&
(TP > 5 && TP < 80) && (Al > 3.5 && Al < 5), "Fhealthy",
"Funhealthy")
5) Rule 5: If Rule 4 fails, this condition checks for acute,
chronic, alcoholic and cirrhotic liver disease in female
patients
Case 1: If ((TBil > 2.0 && ALT > 850 && AST > 25 &&
ALP > 90 && GGT > 65 && TBil > 20 && (RatioAST &&
RatioALT < 1)), "Acute Viral Hepatitis")
Case 2: If ((TBil > 20 && GGT > 65), "Jaundice")
Case 3: If (((AST > 225 && AST < 450) && (RatioAST &&
RatioALT > 2) && ALP > 90 && GGT > 65 && TBil > 20),
"Chronic Hepatitis")
Case 4: If (((AST > 220 && AST < 425) && (RatioAST &&
RatioALT > 2.5) && ALP > 90 && GGT > 65 && TBil > 20),
"Alcoholic Acute Hepatitis")
Case 5: If ((AST > ALT && TBil > 20 && GGT > 65),
"Cirrhosis")
6) Rule 6: If Rule 2 and Rule 4 fail, this rule is used to
diagnose the type of viral disease in male and female
patients
Case 6: If ((HAVIgM == "pos" && HAVIgG == "pos"),
"Viral Hepatitis A")
Case 7: If ((HBsAg == "pos" && antiHBC == "pos" &&
IgMantiHBC == "pos" && antiHBs == "neg"),
"Acute Viral Hepatitis B")
Case 8: If ((HBsAg == "pos" && antiHBC == "pos" &&
IgMantiHBC == "neg" && antiHBs == "neg"),
"Chronic Viral Hepatitis B")
Case 9: If ((HCVAb == "pos" && HCVRNA == "pos"),
"Viral Hepatitis C")
If Case 7 or Case 8 holds, then Case 10 is checked:
Case 10: If ((HDVAb == "pos" && HDVRNA == "pos"),
"Viral Hepatitis D")
Case 11: If ((HEV == "pos"), "Viral Hepatitis E")
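As an illustration of how the range-based rules can be executed, the sketch below encodes Rule 1 and the healthy/unhealthy checks of Rules 2 and 4 in Python. It is a minimal sketch, not the authors' implementation: the dictionary layout and function names are hypothetical, the upper bounds on ALT and ALP are read as strict less-than comparisons, and the patient record is invented for the example.

```python
# Reference ranges from Rules 2 and 4 as exclusive (low, high) bounds.
MALE_RANGES = {
    "AST": (5, 40), "ALT": (5, 42), "ALP": (25, 120),
    "PT": (11, 16), "TBil": (0.1, 1.0), "TP": (5, 80), "Al": (3.5, 5),
}
FEMALE_RANGES = {
    "AST": (5, 38), "ALT": (6, 34), "ALP": (25, 90),
    "PT": (11, 16), "TBil": (0.1, 1.0), "TP": (5, 80), "Al": (3.5, 5),
}
# GGT has only an inclusive upper limit in Rules 2 and 4.
GGT_MAX = {"Male": 40, "Female": 25}


def gender_label(gender_code):
    """Rule 1: code 0 is male, any other code is female."""
    return "Male" if gender_code == 0 else "Female"


def health_label(patient):
    """Rules 2 and 4: healthy only if every test is within range."""
    sex = gender_label(patient["Gender"])
    ranges = MALE_RANGES if sex == "Male" else FEMALE_RANGES
    in_range = all(low < patient[name] < high
                   for name, (low, high) in ranges.items())
    in_range = in_range and patient["GGT"] <= GGT_MAX[sex]
    prefix = "M" if sex == "Male" else "F"
    return prefix + ("healthy" if in_range else "unhealthy")


# Hypothetical patient record, for illustration only.
patient = {"Gender": 0, "AST": 30, "ALT": 35, "ALP": 100, "GGT": 38,
           "PT": 13, "TBil": 0.6, "TP": 7, "Al": 4.2}
print(health_label(patient))
```

Keeping the thresholds in tables rather than in nested conditions makes it easier to compare the male and female ranges and to update them if the standard values in Table I change.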
The experiments on the hepatitis database were performed
using RStudio version 0.99.903.
To demonstrate the effectiveness of the proposed algorithm,
the dataset listed in Table I is used for the experimental
work. First, the normality of the dataset is assessed; for
this, scaling is applied so that the values of each attribute
lie between 0 and 1. Table II shows the p-values obtained
after scaling the dataset. The p-value used to assess
normality is calculated using (2). Table II also shows the
results of the dataset after scaling. The histograms of the
scaled attributes are shown in Fig. 2.
[Fig. 1. Proposed framework. Patient's raw dataset;
pre-processing (scaling of dataset, estimating normality,
estimating linearity of dataset); model processing (adaptive
rule formulated (proposed), adaptive transformative robust
linear model (proposed), apply classification algorithms,
testing for diagnosis and categorization on success);
evaluation (model evaluation by cross-validation, comparative
analysis of models, evaluating model for prediction); outputs:
hepatitis detection and hepatitis categorization.]
Next, the experimental analysis of the proposed work is
carried out by dividing the dataset into training and testing
sets with a 50–50% partition. Then 10-fold cross-validation
is performed for each technique to evaluate the training
phase of each classifier. The rule-based induction algorithm
is formulated to identify whether the disease is present or
absent in a patient. It also categorizes the hepatitis types
based on the standard values of the attributes given in
Table I.
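The 50–50% partition can be sketched in Python as follows. The random shuffle with a fixed seed and the function name are assumptions for illustration; the paper does not state how records are assigned to the two halves.

```python
import random


def split_half(records, seed=42):
    """Shuffle records and split them into two halves.

    Returns (training_set, test_set). With an odd number of
    records the test set receives the extra one.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Hypothetical record identifiers, for illustration only.
records = list(range(10))
train, test = split_half(records)
print(len(train), len(test))
```

Fixing the seed makes the partition reproducible, so that every classifier is trained and tested on the same halves.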
The formulated algorithm first classifies each patient
record as disease "present" or "absent". It then rejects the
absent cases and classifies the present cases into the types
HAV, HBV, HCV, HDV and HEV. Finally, for the present cases,
it identifies the severity of the disease as acute or
chronic.
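The staged logic described above (reject absent cases, then assign a viral type and, for hepatitis B, an acute or chronic label) can be sketched with the serology cases of Rule 6. This is an illustrative reading, not the authors' code: the marker keys, the function names and the choice to return every matching case rather than only the first are assumptions.

```python
def hepatitis_type(markers):
    """Map serology markers ('pos'/'neg') to viral hepatitis labels.

    Follows Cases 6 to 11 of Rule 6; all matching cases are
    collected, which is an assumption about the intended logic.
    """
    def pos(name):
        return markers.get(name) == "pos"

    def neg(name):
        return markers.get(name) == "neg"

    found = []
    if pos("HAVIgM") and pos("HAVIgG"):
        found.append("Viral Hepatitis A")

    hbv_core = pos("HBsAg") and pos("antiHBC") and neg("antiHBs")
    hbv_matched = False
    if hbv_core and pos("IgMantiHBC"):
        found.append("Acute Viral Hepatitis B")
        hbv_matched = True
    elif hbv_core and neg("IgMantiHBC"):
        found.append("Chronic Viral Hepatitis B")
        hbv_matched = True

    if pos("HCVAb") and pos("HCVRNA"):
        found.append("Viral Hepatitis C")
    # Case 10 is checked only when Case 7 or Case 8 matched.
    if hbv_matched and pos("HDVAb") and pos("HDVRNA"):
        found.append("Viral Hepatitis D")
    if pos("HEV"):
        found.append("Viral Hepatitis E")
    return found


def diagnose(disease_present, markers):
    """Stage 1 rejects absent cases; stage 2 assigns the type."""
    if not disease_present:
        return {"status": "absent", "types": []}
    return {"status": "present", "types": hepatitis_type(markers)}


# Hypothetical serology results, for illustration only.
markers = {"HBsAg": "pos", "antiHBC": "pos", "IgMantiHBC": "neg",
           "antiHBs": "neg", "HDVAb": "pos", "HDVRNA": "pos"}
print(diagnose(True, markers))
```

Returning a list allows for co-infection, such as hepatitis D alongside hepatitis B; a first-match policy would be an equally plausible reading of the rules.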
The dataset used in this experiment is first modeled to
assess its linearity. For this, the linear model and
generalized linear models are used. Transformation techniques
are then applied to the models to improve the results:
binomial, log, square root, Gaussian, Poisson and Box-Cox
transformations are applied to each model. The following
models are evaluated: linear model, linear model with
binomial transformation, robust log transformation model,
robust square transformation model, robust Box-Cox
transformation model, robust Gaussian Box-Cox transformation
model, robust Poisson Box-Cox transformation model,
generalized Gaussian linear model and generalized linear
binomial square root transformation model. Table III lists
the accuracy, residual error, R² value and adjusted R² value
of all the models. It shows that the Robust Box-Cox
Transformation (RBCT) model gives better results than the
other models for evaluating linearity. The accuracy rates of
all the models are shown graphically in Fig. 3.
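For reference, the two goodness-of-fit measures reported in Table III can be computed as in the Python sketch below. The R² function follows (7); the adjusted R² uses the standard formula 1 - (1 - R²)(n - 1) / (n - k - 1) for n samples and k predictors, which is assumed here because the paper does not state it. The function names and data values are hypothetical.

```python
def r_squared(actual, predicted):
    """R-squared as 1 - SS_error / SS_total, following Eq. (7)."""
    mean = sum(actual) / len(actual)
    ss_total = sum((y - mean) ** 2 for y in actual)
    ss_error = sum((y - y_hat) ** 2
                   for y, y_hat in zip(actual, predicted))
    return 1.0 - ss_error / ss_total


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Standard adjusted R-squared (formula assumed, see lead-in)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (
        n_samples - n_predictors - 1)


# Hypothetical observed and fitted values, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(actual, predicted)
adj = adjusted_r_squared(r2, n_samples=len(actual), n_predictors=1)
print(round(r2, 4), round(adj, 4))
```

The adjusted value penalizes additional predictors, which matters when models with different numbers of transformed terms are compared.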
Further, the RBCT result is incorporated into the Random
Forest (RBCT-RF), Neural Network (RBCT-NN), Naïve Bayes
(RBCT-NB), Decision Tree (RBCT-DT), K-Nearest Neighbor
(RBCT-KNN) and Support Vector Machine (RBCT-SVM)
classification algorithms to obtain the classification
accuracy. Among these, RBCT-NN gives an improved accuracy
rate of 98.07% compared with the other algorithms. Table IV
summarizes the classification accuracy of all the algorithms.
Fig. 4 shows the accuracy, sensitivity, specificity, positive
predictive value, negative predictive value and balanced
accuracy of the proposed models.
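The additional measures shown in Fig. 4 can all be derived from the confusion-matrix counts used in (5). The Python sketch below uses their standard definitions; the function name and the counts are hypothetical and do not reproduce the paper's results.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard measures derived from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),     # negative predictive value
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
    }


# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Balanced accuracy averages sensitivity and specificity, so it is less affected than plain accuracy when the present and absent classes differ in size.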
In this study, a framework is proposed for diagnosing
hepatitis. The framework first scales the dataset to assess
its normality. It then estimates the linearity of the dataset
using different regression techniques. The regression
obtained with the RBCT model gives better accuracy in
measuring linearity than the other techniques. After the
linearity of the model is estimated, the prediction accuracy
is calculated using different classification algorithms. Of
these, RBCT-NN showed improved results. The proposed RBCT-NN
can be an effective model for diagnosing hepatitis; once the
disease is diagnosed, the model further categorizes its type
and determines its severity. In future, this model can be
applied to the prediction of other diseases.