IntelliHealth
1
Overview
Introduction
Problem Statement
Existing Approaches
Research Methodology
Experimental Results
IntelliHealth Application
Case Studies
Conclusion
Future Work
2
Motivation
3
Introduction
4
Why Medical Data Mining
• In 2012, worldwide digital healthcare data was estimated at 500 petabytes and is expected to reach 25,000 petabytes by 2020 (1,024 gigabytes = 1 terabyte; 1,024 terabytes = 1 petabyte)
Sun, J., & Reddy, C. K. (2013, August). Big data analytics for healthcare. In Proceedings of the 19th ACM SIGKDD international
conference on Knowledge discovery and data mining (pp. 1525-1525). ACM.
5
Why Medical Data Mining
• Such volumes of data are impossible to process manually
• Identify hidden patterns or structures of historical data
• Assist doctors in medical decision making
Sun, J., & Reddy, C. K. (2013, August). Big data analytics for healthcare. In Proceedings of the 19th ACM SIGKDD international
conference on Knowledge discovery and data mining (pp. 1525-1525). ACM.
6
Medical Data Mining
[Diagram: data integration and a repository feed analytics and pattern extraction, building a knowledge base that drives a clinical decision support system]
Hunink, M. G. M. and Glasziou, P. (2009) Decision Making in Health and Medicine: Integrating Evidence and Values (7th ed.). Cambridge: Cambridge University Press
7
Clinical Decision Support System
A computer-based clinical decision support system (CDSS):
• Provides clinical knowledge and patient-related information
• Intelligently filtered and presented at the appropriate time
• Enhances patient care
Osheroff, J.A., Pifer, E.A., Sittig, D.F., Jenders, R.A., Teich, J.M. "Clinical decision support implementers' workbook". Chicago: HIMSS, 2004
8
An Overview of Machine Learning
Techniques for CDSS
Machine learning techniques:
• Classification techniques: Naive Bayes, decision trees, support vector machine, k-nearest neighbor
• Clustering techniques: k-means clustering, SOM, expectation maximization
• Ensemble techniques: majority voting, bagging, AdaBoost
Foster, K. R., Koprowski, R., & Skufca, J. D. (2014). Machine learning, medical diagnosis, and biomedical engineering research-
commentary.Biomedical engineering online, 13(1), 94.
9
Machine Learning Techniques for CDSS
Classification Techniques
New objects are labeled based on the training set
Examples:
• Naïve Bayes
• SVM
Manning, E. M., Holland, B. R., Ellingsen, S. P., Breen, S. L., Chen, X., & Humphries, M. (2016). Comparison of three Statistical
Classification Techniques for Maser Identification. arXiv preprint arXiv:1603.06395.
10
Machine Learning Techniques for CDSS
Clustering Techniques
Identify hidden structure in unlabeled data
Examples:
• k-means,
• AGNES, etc.
Machnik, Ł. (2015). Documents Clustering Techniques. Annales UMCS Sectio AI Informatica, 2(1), 401-411.
11
Machine Learning Techniques for CDSS
Ensemble Techniques
• Aggregate predictions made by multiple classifiers
Examples include:
• Majority voting
• Weighted voting
• Bagging, etc.
Sarumathi, S., Shanthi, N., & Ranjetha, P. (2016). Analysis of Diverse Cluster Ensemble Techniques. World Academy of Science,
Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering, 9(11),
2208-2218.
12
Why Ensemble
• Each individual classifier has its own limitations; combining classifiers can offset these weaknesses
13
Existing Approaches
(Literature Review)
14
Existing Approaches
• Several machine learning techniques are used for
disease diagnosis
15
Examples of Benchmark Datasets
| No. | Dataset name | Reference |
16
Machine Learning Techniques Applied on
Benchmark Heart Disease Datasets
| Author/Reference | Year | Technique | Accuracy |
| Shouman, M., et al. [6] | 2011 | Bagging with Gain Ratio Decision Trees | 84.1% |
17
Machine Learning Techniques Applied on
Benchmark Breast Cancer Datasets
| Author/Reference | Year | Technique | Accuracy |
| - | - | SMO | 77.31% |
| Christobel, A., et al. [14] | 2011 | Support Vector Machine Ensemble | 96.84% |
| Lavanya, D., et al. [15] | 2011 | CART with Feature Selection | 94.56% |
18
Machine Learning Techniques Applied on
Benchmark Diabetes Datasets
| Author/Reference | Year | Technique | Accuracy |
| Gandhi, K.K. et al. [17] | 2014 | F-score Feature Selection + SVM | 75% |
19
Machine Learning Techniques Applied on
Benchmark Liver Disease Datasets
| Author/Reference | Year | Technique | Accuracy |
| Vijayarani, S. et al. [26] | 2015 | Support Vector Machine | 61.2% |
20
Machine Learning Techniques Applied on
Benchmark Hepatitis Dataset
| Author/Reference | Year | Technique | Accuracy |
21
Gaps Identified in Existing Approaches
• Most existing work is based on single classifiers
22
Gaps Identified in Existing Approaches
• No single framework for diagnosis of multiple diseases with
high accuracy
23
Ensemble Model
• Aggregate the predictions made by multiple classifiers
• Two design steps: 1. Selection of classifiers; 2. Combination of classifiers
Rokach, L. (2005). Ensemble methods for classifiers. In Data Mining and Knowledge Discovery Handbook (pp. 957-980). Springer
US.
24
Selection of Classifiers
• The individual classifiers composing an ensemble must
be accurate and diverse:
• Accurate classifier
• An accurate classifier has an error rate better than random guessing on new examples
• Diverse classifiers
• Two classifiers are diverse if they use different
training data or different learning algorithms and
produce predictions independently
Rokach, L. (2005). Ensemble methods for classifiers. In Data Mining and Knowledge Discovery Handbook (pp. 957-980). Springer
US.
25
Combination of Classifiers
• Different ways to combine output of single classifiers for
maximum accuracy
• Examples
• Majority Voting
• Bagging
• AdaBoost
• Stacking, etc.
Hou, J., Xu, E., Xia, Q., & Qi, N. M. (2015). Evaluating classifier combination in disease classification. Pattern Analysis and
Applications,18(4), 799-816.
26
Research Methodology
27
Proposed Approach
• Different variations of ensemble models have been
proposed
28
Medical Data Preprocessing
29
Why Data Preprocessing
• Patient records consist of clinical and lab parameters and the results of particular investigations, specific to each disease
Hu, Y. H., Lin, W. C., Tsai, C. F., Ke, S. W., & Chen, C. W. (2015). An efficient data preprocessing approach for large scale medical
data mining. Technology and Health Care, 23(2), 153-160.
30
Medical Data Preprocessing
• Missing Value Replacement
• KNNimpute method
• Find the instances most similar to the test instance
• Euclidean distance based similarity
d(x, y) = √( Σ_{i=1..k} (x_i − y_i)² )
García-Laencina, P. J., Abreu, P. H., Abreu, M. H., & Afonso, N. (2015). Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values. Computers in Biology and Medicine, 59, 125-133.
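As an illustration, KNNimpute-style replacement can be sketched in a few lines; the function name `knn_impute` and the choice of averaging the k neighbours' observed values are assumptions made for this sketch, not necessarily the exact procedure used here.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill missing values (NaN) with the mean of the k nearest complete
    rows, using Euclidean distance on the features that are observed."""
    X = X.astype(float).copy()
    # rows with no missing values serve as imputation donors
    complete = X[~np.isnan(X).any(axis=1)]
    for row in X:
        miss = np.isnan(row)
        if not miss.any():
            continue
        # Euclidean distance computed on the observed features only
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]
        row[miss] = nearest[:, miss].mean(axis=0)
    return X
```

The distance is the same Euclidean formula as above, restricted to the features both instances share.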
31
Medical Data Preprocessing
• Outlier Detection and Elimination
• Euclidean Distance based outliers
d(x, y) = √( Σ_{i=1..k} (x_i − y_i)² )
Aggarwal, C. C. (2015). Outlier analysis. In Data Mining (pp. 237-263). Springer International Publishing.
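Distance-based outlier elimination can be sketched as follows; the specific rule used here (flagging rows whose mean k-nearest-neighbour distance exceeds a multiple of the dataset-wide average) is an illustrative assumption.

```python
import numpy as np

def remove_outliers(X, k=3, threshold=2.0):
    """Drop rows whose mean Euclidean distance to their k nearest
    neighbours exceeds `threshold` times the dataset-wide average."""
    # full pairwise Euclidean distance matrix
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # mean distance to the k nearest neighbours (skip self at distance 0)
    knn_dist = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
    keep = knn_dist <= threshold * knn_dist.mean()
    return X[keep]
```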
32
Medical Data Preprocessing
• Feature Selection
• Greedy forward feature selection [*]
• Select subset of features based on performance
gain
*Ververidis, D., & Kotropoulos, C. (2011). Fast and accurate sequential floating forward feature
selection with the Bayes classifier applied to speech emotion recognition. Signal
Processing, 88(12), 2956-2970.
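A minimal sketch of greedy forward selection, assuming a caller-supplied `evaluate` function that scores a feature subset (e.g. by cross-validated accuracy); the names and the exact stopping rule are illustrative.

```python
def forward_select(n_features, evaluate, max_features=None):
    """Greedy forward feature selection: starting from the empty set,
    repeatedly add the single feature whose inclusion most improves
    the score returned by evaluate(subset) (higher is better), and
    stop as soon as no candidate yields a performance gain."""
    selected, best_score = [], float("-inf")
    limit = max_features or n_features
    while len(selected) < limit:
        candidates = [
            (evaluate(selected + [f]), f)
            for f in range(n_features) if f not in selected
        ]
        score, f = max(candidates)
        if score <= best_score:  # no performance gain: stop
            break
        selected.append(f)
        best_score = score
    return selected
```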
33
Medical Data Preprocessing
34
Proposed Ensemble Model 1
MV5
35
Proposed Ensemble Model 1 - MV5
• Majority Voting Ensemble based on Five Classifiers
(MV5)
1. Selection of classifiers
2. Combination of classifiers
36
Selection of Classifiers
37
Selection of Classifiers
• From literature, following five heterogeneous
classifiers are selected:
• Naïve Bayes (NB)
• Decision Tree Induction using Information Gain (DT-IG)
• Decision Tree Induction using Gini Index (DT-GI)
• Support Vector Machine (SVM)
• K Nearest Neighbor (KNN)
38
Selection of Classifiers
• Heterogeneous classifiers, complement each other
• Naïve Bayes
P(C_i | X) = P(X | C_i) P(C_i) / P(X)
where X = (x_1, …, x_n) represents the features and C_i ranges over the possible classes
Parthiban, G., Rajesh, A., & Srivatsa, S. K. (2011). Diagnosis of heart disease for diabetic patients using naive bayes
method. International Journal of Computer Applications, 24(3), 7-11.
39
Selection of Classifiers
• Decision Trees
• Powerful and popular tools for classification and prediction
• Faster learning, high accuracy
• May suffer from overfitting -> Resolved by SVM
• Handle continuous and categorical variables -> Resolves the Naïve Bayes issue
Nahar, J., Imam, T., Tickle, K. S., & Chen, Y. P. P. (2013). Computational intelligence for heart disease diagnosis: A medical
knowledge driven approach. Expert Systems with Applications, 40(1), 96-104.
40
Selection of Classifiers
• Decision Tree using Information Gain
Entropy: Info(D) = − Σ_{i=1..m} p_i log₂(p_i)
where p_i is the class probability and m is the number of classes
Entropy of each attribute: Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
where attribute A splits D into v partitions D_j
Nahar, J., Imam, T., Tickle, K. S., & Chen, Y. P. P. (2013). Computational intelligence for heart disease diagnosis: A medical
knowledge driven approach. Expert Systems with Applications, 40(1), 96-104.
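The two entropy formulas above translate directly into a short sketch; function names are chosen for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum of p_i * log2(p_i) over the m classes in D."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Gain(A) = Info(D) - Info_A(D), where attribute A splits D into
    partitions D_j, each weighted by |D_j| / |D|."""
    n = len(labels)
    partitions = {}
    for v, y in zip(attr_values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - info_a
```

An attribute that splits the classes perfectly gains the full entropy; one that splits them not at all gains zero.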
41
Selection of Classifiers
• Decision Tree using Gini Index
Gini index: gini(D) = 1 − Σ_{j=1..n} p_j², where n is the number of classes
Gini of each attribute (binary split): gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
Nahar, J., Imam, T., Tickle, K. S., & Chen, Y. P. P. (2013). Computational intelligence for heart disease diagnosis: A medical
knowledge driven approach. Expert Systems with Applications, 40(1), 96-104.
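Likewise, the Gini formulas can be sketched in a few lines:

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum of p_j^2 over the classes in D."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    """gini_A(D) for a binary split of D into partitions D1 and D2,
    each weighted by its share of the records."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
```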
42
Selection of Classifiers
• K Nearest Neighbor
Euclidean distance: d(x, y) = √( Σ_{i=1..k} (x_i − y_i)² )
Gagliardi, F (2011). "Instance-based classifiers applied to medical databases: Diagnosis and knowledge extraction". Artificial
Intelligence in Medicine 52 (3): 123–139
43
Selection of Classifiers
• Support Vector Machine
• SVM finds the maximum-margin hyperplane that separates the classes
Sartakhti, J. S., Zangooei, M. H., & Mozafari, K. (2012). Hepatitis disease diagnosis using a novel hybrid method based on support
vector machine and simulated annealing (SVM-SA). Computer methods and programs in biomedicine, 108(2), 570-579.
44
Selection of Classifiers
• Support Vector Machine
• High classification and prediction accuracy
• Alternative to neural network
• Handles both continuous and categorical variables -> Resolves the Naïve Bayes issue
• Reduces overfitting -> Resolves the decision tree issue
• Sensitive to noise -> Addressed by preprocessing
Sartakhti, J. S., Zangooei, M. H., & Mozafari, K. (2012). Hepatitis disease diagnosis using a novel hybrid method based on support
vector machine and simulated annealing (SVM-SA). Computer methods and programs in biomedicine, 108(2), 570-579.
45
Majority Voting Ensemble
The majority voting ensemble outputs the class that receives the highest number of votes. Mathematically, each vote is counted with the indicator function:
g(y, c) = 1 if y = c, 0 otherwise
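The majority-voting rule can be sketched minimally:

```python
from collections import Counter

def majority_vote(predictions):
    """Output the class with the highest number of votes among the
    base classifiers' predictions (ties go to the first-seen class)."""
    return Counter(predictions).most_common(1)[0][0]
```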
47
MV5 Example
• Classifier training is performed on the training data
48
Experimental Results
Performance comparison of MV5 for heart disease datasets
Cleveland Dataset | Eric Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 77.23% | 81.71% | 71.94% | 76.51% | 68.90% | 77.78% | 57.61% | 66.19% |
| DT-GI | 75.91% | 79.74% | 72.00% | 75.67% | 71.77% | 75.89% | 67.01% | 71.18% |
| DT-IG | 73.60% | 76.92% | 70.07% | 73.34% | 75.12% | 75.19% | 75.00% | 75.10% |
| SVM | 78.22% | 74.26% | 86.14% | 79.76% | 78.95% | 76.64% | 73.33% | 74.95% |
| KNN | 64.36% | 68.90% | 58.99% | 63.56% | 65.55% | 68.38% | 61.96% | 65.01% |
| MV5 | 85.23% | 84.10% | 86.68% | 85.37% | 82.22% | 77.99% | 74.00% | 75.94% |
SPECT Dataset | SPECTF Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 80.52% | 76.36% | 81.60% | 78.90% | 78.28% | 23.64% | 92.45% | 37.65% |
| DT-GI | 78.65% | 48.08% | 86.05% | 61.69% | 74.16% | 38.71% | 84.88% | 53.17% |
| DT-IG | 78.65% | 51.85% | 93.01% | 66.58% | 77.15% | 41.18% | 82.40% | 54.91% |
| SVM | 81.52% | 36.36% | 95.75% | 52.71% | 79.50% | 47.27% | 87.26% | 61.32% |
| KNN | 79.40% | 81.10% | 66.91% | 73.32% | 71.91% | 36.36% | 81.13% | 50.22% |
| MV5 | 81.99% | 81.21% | 95.86% | 87.92% | 80.15% | 55.56% | 93.93% | 69.82% |
Statlog Dataset
| Classifier | Acc | Prec | Rec | F-M |
| NB | 78.52% | 82.00% | 74.17% | 77.89% |
| DT-GI | 74.81% | 79.29% | 70.00% | 74.35% |
| DT-IG | 73.33% | 75.00% | 71.05% | 72.97% |
| SVM | 78.52% | 73.96% | 89.74% | 81.09% |
| KNN | 65.56% | 68.67% | 61.67% | 64.98% |
| MV5 | 80.52% | 82.11% | 90.92% | 86.29% |
Key: Acc = Accuracy, Prec = Precision, Rec = Recall, F-M = F-Measure; NB = Naïve Bayes, SVM = Support Vector Machine, DT-IG = decision tree using information gain, DT-GI = decision tree using Gini index, KNN = K Nearest Neighbor
49
Proposed Ensemble Model 1 - MV5
• Limitations
• Majority voting is the simplest way of combining
predictions
• More complex methods can be employed to improve
accuracy
50
Proposed Ensemble Model 2
AccWeight
51
Proposed Ensemble Model 2- AccWeight
• Improvement in MV5
• Weighted voting ensemble scheme instead of
majority voting
52
Proposed Ensemble Model 2- AccWeight
• Weighted Voting Ensemble using Accuracy Measure
(AccWeight)
• Five heterogeneous classifiers as used in MV5
• Naïve Bayes
• Decision Tree using Gini index
• Decision Tree using Information Gain
• Support Vector Machine
• K Nearest Neighbor
g(y, c) = 1 if y = c, 0 otherwise
55
AccWeight Example
• Classifier training is performed on the training data; accuracy is calculated for each classifier:
NB=70%, DT-GI= 75%, DT-IG=80%, KNN=82%, SVM=87%
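Using the accuracies above as weights, weighted voting can be sketched as follows; the predicted labels in the test are made up for illustration.

```python
def weighted_vote(predictions, weights):
    """Each base classifier votes for its predicted class with its own
    weight (here, training accuracy); the class with the largest total
    weight wins, so a few strong classifiers can outvote the majority."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)
```

Unlike plain majority voting, a minority of heavily weighted classifiers can decide the outcome.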
56
Experimental Results
57
Experimental Results
Performance comparison of AccWeight for heart disease datasets
Cleveland Dataset | Eric Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 77.23% | 81.71% | 71.94% | 76.51% | 68.90% | 77.78% | 57.61% | 66.19% |
| DT-GI | 75.91% | 79.74% | 72.00% | 75.67% | 71.77% | 75.89% | 67.01% | 71.18% |
| DT-IG | 73.60% | 76.92% | 70.07% | 73.34% | 75.12% | 75.19% | 75.00% | 75.10% |
| SVM | 78.22% | 74.26% | 86.14% | 79.76% | 78.95% | 76.64% | 73.33% | 74.95% |
| KNN | 64.36% | 68.90% | 58.99% | 63.56% | 65.55% | 68.38% | 61.96% | 65.01% |
| AccWeight | 86.82% | 86.18% | 87.27% | 86.72% | 83.12% | 78.31% | 76.60% | 77.44% |
SPECT Dataset | SPECTF Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 80.52% | 76.36% | 81.60% | 78.90% | 78.28% | 23.64% | 92.45% | 37.65% |
| DT-GI | 78.65% | 48.08% | 86.05% | 61.69% | 74.16% | 38.71% | 84.88% | 53.17% |
| DT-IG | 78.65% | 51.85% | 93.01% | 66.58% | 77.15% | 41.18% | 82.40% | 54.91% |
| SVM | 81.52% | 36.36% | 95.75% | 52.71% | 79.50% | 47.27% | 87.26% | 61.32% |
| KNN | 79.40% | 81.10% | 66.91% | 73.32% | 71.91% | 36.36% | 81.13% | 50.22% |
| AccWeight | 82.40% | 81.32% | 95.87% | 87.99% | 81.00% | 71.71% | 94.50% | 81.54% |
Statlog Dataset
| Classifier | Acc | Prec | Rec | F-M |
| NB | 78.52% | 82.00% | 74.17% | 77.89% |
| DT-GI | 74.81% | 79.29% | 70.00% | 74.35% |
| DT-IG | 73.33% | 75.00% | 71.05% | 72.97% |
| SVM | 78.52% | 73.96% | 89.74% | 81.09% |
| KNN | 65.56% | 68.67% | 61.67% | 64.98% |
| AccWeight | 83.04% | 82.67% | 91.31% | 86.77% |
Key: Acc = Accuracy, Prec = Precision, Rec = Recall, F-M = F-Measure; NB = Naïve Bayes, SVM = Support Vector Machine, DT-IG = decision tree using information gain, DT-GI = decision tree using Gini index, KNN = K Nearest Neighbor
58
Proposed Ensemble Model 2- AccWeight
• Limitations
• Accuracy results can be biased when the dataset itself is imbalanced
• Sometimes a single measure cannot reliably capture the quality of a good ensemble
59
Proposed Ensemble Model 3
FmWeight
60
Proposed Ensemble Model 3- FmWeight
• Improvements in AccWeight
• Weighted voting ensemble using F-Measure
• An unbiased metric, F-Measure, assigns the weights to base classifiers instead of the accuracy measure
• Multi-objective weighted voting that combines precision and recall
61
Proposed Ensemble Model 3- FmWeight
Weighted Voting Ensemble using F-Measure
(FmWeight)
Five heterogeneous classifiers as used in MV5 and
AccWeight
• Naïve Bayes
• Decision Tree using Gini index
• Decision Tree using Information Gain
• K Nearest Neighbor
• Support Vector Machine
62
Weighted Voting Ensemble-FmWeight
The weighted voting ensemble outputs the class with the highest total weight (F-Measure) associated with it. Mathematically, each vote is counted with the indicator function:
g(y, c) = 1 if y = c, 0 otherwise
64
FmWeight Example
• Classifier training is performed on the training data; F-Measure is calculated for each classifier:
NB = 60%, DT-GI = 70%, DT-IG = 80%, KNN = 85%, SVM = 90%
65
Experimental Results
Comparison between MV5, AccWeight & FmWeight Ensembles for
Cleveland heart disease dataset
66
Experimental Results
Performance comparison of FmWeight for heart disease datasets
Cleveland Dataset | Eric Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 77.23% | 81.71% | 71.94% | 76.51% | 68.90% | 77.78% | 57.61% | 66.19% |
| DT-GI | 75.91% | 79.74% | 72.00% | 75.67% | 71.77% | 75.89% | 67.01% | 71.18% |
| DT-IG | 73.60% | 76.92% | 70.07% | 73.34% | 75.12% | 75.19% | 75.00% | 75.10% |
| SVM | 78.22% | 74.26% | 86.14% | 79.76% | 78.95% | 76.64% | 73.33% | 74.95% |
| KNN | 64.36% | 68.90% | 58.99% | 63.56% | 65.55% | 68.38% | 61.96% | 65.01% |
| FmWeight | 87.37% | 87.50% | 88.27% | 87.90% | 84.19% | 83.75% | 77.53% | 80.52% |
SPECT Dataset | SPECTF Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 80.52% | 76.36% | 81.60% | 78.90% | 78.28% | 23.64% | 92.45% | 37.65% |
| DT-GI | 78.65% | 48.08% | 86.05% | 61.69% | 74.16% | 38.71% | 84.88% | 53.17% |
| DT-IG | 78.65% | 51.85% | 93.01% | 66.58% | 77.15% | 41.18% | 82.40% | 54.91% |
| SVM | 81.52% | 36.36% | 95.75% | 52.71% | 79.50% | 47.27% | 87.26% | 61.32% |
| KNN | 79.40% | 81.10% | 66.91% | 73.32% | 71.91% | 36.36% | 81.13% | 50.22% |
| FmWeight | 83.75% | 81.40% | 95.99% | 88.09% | 82.73% | 72.67% | 96.33% | 82.84% |
Statlog Dataset
| Classifier | Acc | Prec | Rec | F-M |
| NB | 78.52% | 82.00% | 74.17% | 77.89% |
| DT-GI | 74.81% | 79.29% | 70.00% | 74.35% |
| DT-IG | 73.33% | 75.00% | 71.05% | 72.97% |
| SVM | 78.52% | 73.96% | 89.74% | 81.09% |
| KNN | 65.56% | 68.67% | 61.67% | 64.98% |
| FmWeight | 86.82% | 87.50% | 93.27% | 90.29% |
Key: Acc = Accuracy, Prec = Precision, Rec = Recall, F-M = F-Measure; NB = Naïve Bayes, SVM = Support Vector Machine, DT-IG = decision tree using information gain, DT-GI = decision tree using Gini index, KNN = K Nearest Neighbor
67
Proposed Ensemble Model 4
BagMoov
68
Proposed Ensemble Model 4- BagMoov
• Improvements in FmWeight
• Bagging is enhanced with weighted metric
• F-Measure is used to assign weight
69
Proposed Ensemble Model 4- BagMoov
• Bagging Approach with Multi-objective Weighted
voting Ensemble Scheme (BagMoov)
70
Selection of Classifiers
71
Selection of Classifiers
• Based on literature, following five classifiers are selected
to make an ensemble model:
• Naïve Bayes (NB)
• Support Vector Machine (SVM)
• K Nearest Neighbor (KNN)
• Linear Regression (LR)
• Linear Discriminant Analysis (LDA)
72
Selection of Classifiers
• In addition to NB, SVM and KNN, which were used in the previous frameworks, two more classifiers are added:
• Linear Regression
y = w0 + w1·x
A straight line is fitted to the data; x is the given variable and y the predicted value; w0 is the y-intercept and w1 the slope of the line
Least-squares method: choose the best-fitting straight line (the values of w0 and w1)
Campbell, M.J. Statistics at Square Two (2nd ed.). Blackwell BMJ Books, 2006
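The least-squares fit follows directly from the formula: the slope is cov(x, y) / var(x) and the intercept follows from the means.

```python
def fit_line(xs, ys):
    """Least-squares estimates for y = w0 + w1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope: covariance of x and y divided by the variance of x
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    # intercept: the fitted line passes through the point of means
    w0 = my - w1 * mx
    return w0, w1
```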
73
Selection of Classifiers
• Linear Discriminant Analysis
D = b1·X1 + b2·X2 + b3·X3 + … + bk·Xk + c
where D is the discriminant score, the b's are discriminant coefficients (weights), the X's are input variables and c is a constant
J. Vander Sloten, P. Verdonck, M. Nyssen, J. Haueisen (Eds.): ECIFMBE 2008, IFMBE Proceedings 22, pp. 389–392, 2008
74
Proposed Bagging Approach with Multi-objective
Weighted Voting
BagMoov(T, M)                  // T is the original training set of N samples
  Sample-With-Replacement(T, N)
    S = { }
    For i = 1, 2, …, N
      r = random_integer(1, N)
      Add T[r] to S
    Return S
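The Sample-With-Replacement routine above translates directly to Python; the `seed` parameter is added here only to make the sketch reproducible.

```python
import random

def sample_with_replacement(T, seed=None):
    """Draw a bootstrap sample of N = len(T) records from the training
    set T, with replacement, forming one bag for a bagged classifier."""
    rng = random.Random(seed)
    n = len(T)
    return [T[rng.randrange(n)] for _ in range(n)]
```

Because sampling is with replacement, a bag typically repeats some records and omits others, which is what gives each bagged base classifier a different view of the data.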
75
BagMoov Framework
76
BagMoov Example
• Classifier training is performed on the training data; F-Measure is calculated for each classifier:
NB=60%, LR= 70%, LDA=80%, KNN=85%, SVM=90%
77
Experimental Results
Benchmark datasets taken from UCI data repository
• 5 heart disease datasets
• 4 breast cancer datasets
• 2 liver disease datasets
• 2 diabetes datasets
• 1 hepatitis dataset
78
Experimental Results
79
Experimental Results- Heart disease
Performance comparison of BagMoov for heart disease datasets
Cleveland Dataset | Eric Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 77.23% | 81.71% | 71.94% | 76.51% | 68.90% | 77.78% | 57.61% | 66.19% |
| LR | 83.50% | 88.41% | 77.70% | 82.71% | 77.99% | 88.89% | 64.13% | 74.51% |
| LDA | 65.68% | 68.29% | 62.59% | 65.32% | 69.41% | 64.00% | 73.33% | 68.35% |
| SVM | 78.22% | 74.26% | 86.14% | 79.76% | 78.95% | 76.64% | 73.33% | 74.95% |
| KNN | 64.36% | 68.90% | 58.99% | 63.56% | 65.55% | 68.38% | 61.96% | 65.01% |
| BagMoov | 88.83% | 89.01% | 89.17% | 89.09% | 85.16% | 93.29% | 78.38% | 85.19% |
SPECT Dataset | SPECTF Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 80.52% | 76.36% | 81.60% | 78.90% | 78.28% | 23.64% | 92.45% | 37.65% |
| LR | 83.15% | 38.18% | 94.81% | 54.44% | 78.28% | 68.38% | 61.96% | 65.01% |
| LDA | 83.52% | 36.36% | 95.75% | 52.71% | 68.60% | 47.27% | 87.26% | 61.32% |
| SVM | 81.52% | 36.36% | 95.75% | 52.71% | 79.50% | 47.27% | 87.26% | 61.32% |
| KNN | 79.40% | 81.10% | 66.91% | 73.32% | 71.91% | 36.36% | 81.13% | 50.22% |
| BagMoov | 84.92% | 83.27% | 96.23% | 89.28% | 83.28% | 75.27% | 96.70% | 84.65% |
Statlog Dataset
| Classifier | Acc | Prec | Rec | F-M |
| NB | 78.52% | 82.00% | 74.17% | 77.89% |
| LR | 82.59% | 87.33% | 76.67% | 81.65% |
| LDA | 68.15% | 64.00% | 73.33% | 68.35% |
| SVM | 78.52% | 73.96% | 89.74% | 81.09% |
| KNN | 65.56% | 68.67% | 61.67% | 64.98% |
| BagMoov | 87.07% | 92.00% | 95.17% | 93.56% |
Key: Acc = Accuracy, Prec = Precision, Rec = Recall, F-M = F-Measure; NB = Naïve Bayes, SVM = Support Vector Machine, LR = Linear Regression, LDA = Linear Discriminant Analysis, KNN = K Nearest Neighbor
Experimental Results- Breast Cancer
Performance comparison of BagMoov for breast cancer datasets
UMC Dataset | WPBC Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 68% | 78.9% | 75.04% | 76.92% | 72.76% | 36.36% | 16.5% | 22.70% |
| SVM | 70.31% | 73.76% | 89.62% | 80.92% | 76.29% | 67.65% | 69.99% | 68.80% |
| LR | 71.33% | 74.16% | 91.12% | 81.77% | 78.87% | 63.02% | 46.5% | 53.51% |
| LDA | 71.05% | 73.75% | 91.60% | 81.71% | 72.95% | 45.46% | 85.55% | 59.37% |
81
Experimental Results- Diabetes
Performance comparison of BagMoov for diabetes datasets
82
Experimental Results- Liver Disease
Performance comparison of BagMoov for liver disease datasets
83
Experimental Results- Hepatitis
Performance comparison of BagMoov for hepatitis dataset
| Classifier | Accuracy | Precision | Recall | F-Measure |
84
Proposed Ensemble Model 4- BagMoov
• Limitations
• Single layer ensemble model
• Experimentation showed no further increase in accuracy as the number of classifiers is increased
• Multi-layer ensemble models may result in higher
accuracy
85
Proposed Ensemble Model 5
HM-BagMoov
86
Proposed Ensemble Model 5- HM-BagMoov
• Improvement in BagMoov
• Multi-layer ensemble framework in order to further
improve disease diagnosis accuracy
87
Proposed Ensemble Model 5- HM-BagMoov
Hierarchical Multi-level classifiers Bagging with
Multi-objective Weighted voting (HM-BagMoov)
88
Layered Structure
• A taxonomy (layered) structure improves classification efficiency and accuracy
Zolfaghar, K., Verbiest, N., Agarwal, J., Meadem, N., Chin, S. C., Roy, S. B., ... & Reed, L. (2013). Predicting risk-of-readmission for
congestive heart failure patients: A multi-layer approach. arXiv preprint arXiv:1306.2094
89
Selection of Classifiers
• So far, single classifiers have been used in the proposed frameworks
90
Selection of Classifiers
• From the literature, the top two ensembles are selected, based on highest accuracy and diversity, for multiple-disease diagnosis
• Artificial Neural Network Ensemble (ANN)
• Random Forest (RF)
91
Selection of Classifiers
• Artificial Neural Network Ensemble
[Diagram: a single unit — input vector x = (x0 … xn), weight vector w, weighted sum, activation function f, output y]
For example: y = sign( Σ_{i=0..n} w_i·x_i − k )
Ani, R., Augustine, A., Akhil, N. C., & Deepa, O. S. (2016). Random Forest Ensemble Classifier to Predict the Coronary Heart
Disease Using Risk Factors. In Proceedings of the International Conference on Soft Computing Systems (pp. 701-710). Springer
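The single-unit computation y = sign(Σ w_i·x_i − k) can be sketched as:

```python
def neuron(x, w, k):
    """One unit: weighted sum of the input vector minus the threshold k,
    passed through a sign activation."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - k
    return 1 if s >= 0 else -1
```

With weights [0.6, 0.6] and threshold 1.0 the unit behaves like a logical AND of two binary inputs.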
93
Architecture of HM-BagMoov
• Layer 1
• Five heterogeneous classifiers as used in BagMoov
• Naïve Bayes
• Linear Regression
• Linear Discriminant Analysis
• K Nearest Neighbor
• Support Vector Machines
94
Architecture of HM-BagMoov
• Layer 2
• Output of layer-1 classifiers is combined using
proposed weighted bagging ensemble approach
• Two ensemble classifiers are added
• Artificial Neural Network Ensemble
• Random Forest
95
Architecture of HM-BagMoov
• Layer 3
• Final result is obtained
• Final output of the proposed ensemble is labeled with
the class that has highest weighted vote
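A minimal sketch of the three-layer flow described above. Weighting the fused layer-1 vote by the mean F-Measure of its base classifiers is an assumption made only for this sketch, not the authors' exact weighting.

```python
def weighted_vote(predictions, weights):
    # the class with the largest total (F-Measure) weight wins
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

def hm_bagmoov(base_preds, base_fm, ens_preds, ens_fm):
    """Layer 1: the five base classifiers vote with F-Measure weights.
    Layer 2: their fused vote is pooled with the ANN-ensemble and
    random-forest predictions.  Layer 3: a final weighted vote yields
    the diagnosis.  The mean-F-Measure weight for the fused vote is an
    illustrative assumption."""
    layer1 = weighted_vote(base_preds, base_fm)
    layer2_preds = [layer1] + list(ens_preds)
    layer2_weights = [sum(base_fm) / len(base_fm)] + list(ens_fm)
    return weighted_vote(layer2_preds, layer2_weights)
```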
96
HM-BagMoov Framework
97
HM-BagMoov Example
• Classifier training is performed on the training data; F-Measure is calculated for each classifier:
NB=60%, LDA= 70%, LR=80%, KNN=85%, SVM=82%, ANN=85%, RF= 87%
99
Single Layer vs Layered Approach
Comparison of Single layer vs multi layer ensemble for Cleveland
heart disease dataset
| Dataset | Technique | Accuracy | Precision | Recall | F-Measure |
100
Experimental Results- Heart disease
Cleveland Dataset | Eric Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 77.23% | 81.71% | 71.94% | 76.51% | 68.90% | 77.78% | 57.61% | 66.19% |
| LR | 83.50% | 88.41% | 77.70% | 82.71% | 77.99% | 88.89% | 64.13% | 74.51% |
| LDA | 65.68% | 68.29% | 62.59% | 65.32% | 69.41% | 64.00% | 73.33% | 68.35% |
| SVM | 78.22% | 74.26% | 86.14% | 79.76% | 78.95% | 76.64% | 73.33% | 74.95% |
| KNN | 64.36% | 68.90% | 58.99% | 63.56% | 65.55% | 68.38% | 61.96% | 65.01% |
| RF | 69.64% | 84.76% | 51.80% | 64.30% | 69.86% | 90.60% | 43.48% | 58.76% |
| ANN | 79.21% | 79.88% | 78.42% | 79.14% | 76.08% | 80.34% | 70.65% | 75.19% |
| HM-BagMoov | 91.99% | 91.31% | 90.19% | 90.75% | 89.82% | 94.74% | 85.74% | 90.01% |
SPECT Dataset | SPECTF Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 80.52% | 76.36% | 81.60% | 78.90% | 78.28% | 23.64% | 92.45% | 37.65% |
| LR | 83.15% | 38.18% | 94.81% | 54.44% | 78.28% | 68.38% | 61.96% | 65.01% |
| LDA | 83.52% | 36.36% | 95.75% | 52.71% | 68.60% | 47.27% | 87.26% | 61.32% |
| SVM | 81.52% | 36.36% | 95.75% | 52.71% | 79.50% | 47.27% | 87.26% | 61.32% |
| KNN | 79.40% | 81.10% | 66.91% | 73.32% | 71.91% | 36.36% | 81.13% | 50.22% |
| RF | 79.40% | 59.09% | 87.00% | 70.38% | 79.40% | 77.76% | 71.60% | 74.55% |
| ANN | 79.78% | 50.91% | 87.26% | 64.30% | 78.65% | 50.91% | 85.85% | 63.92% |
| HM-BagMoov | 88.77% | 87.33% | 97.67% | 92.21% | 89.21% | 83.54% | 97.10% | 89.81% |
Statlog Dataset
| Classifier | Acc | Prec | Rec | F-M |
| NB | 78.52% | 82.00% | 74.17% | 77.89% |
| LR | 82.59% | 87.33% | 76.67% | 81.65% |
| LDA | 68.15% | 64.00% | 73.33% | 68.35% |
| SVM | 78.52% | 73.96% | 89.74% | 81.09% |
| KNN | 65.56% | 68.67% | 61.67% | 64.98% |
| RF | 71.11% | 81.33% | 58.33% | 67.94% |
| ANN | 78.15% | 79.33% | 76.67% | 77.98% |
| HM-BagMoov | 90.93% | 94.67% | 96.50% | 95.58% |
Key: Acc = Accuracy, Prec = Precision, Rec = Recall, F-M = F-Measure; NB = Naïve Bayes, SVM = Support Vector Machine, LR = Linear Regression, LDA = Linear Discriminant Analysis, KNN = K Nearest Neighbor, RF = Random Forest, ANN = Artificial Neural Network
Experimental Results- Heart disease
Performance comparison of HM-BagMoov with other proposed ensembles
for heart disease datasets
Cleveland Dataset | Eric Dataset
| Ensemble | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
104
Experimental Results- Breast Cancer
Performance comparison of HM-BagMoov for breast cancer datasets
UMC Dataset | WPBC Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| NB | 68% | 78.9% | 75.04% | 76.92% | 72.76% | 36.36% | 16.5% | 22.70% |
| SVM | 70.31% | 73.76% | 89.62% | 80.92% | 76.29% | 67.65% | 69.99% | 68.80% |
| LR | 71.33% | 74.16% | 91.12% | 81.77% | 78.87% | 63.02% | 46.5% | 53.51% |
| LDA | 71.05% | 73.75% | 91.60% | 81.71% | 72.95% | 45.46% | 85.55% | 59.37% |
State of the art: Sørensen, K.P. et al. [81] (2015), long term coding RNA: 92%, 90%, 65%
107
Experimental Results- Diabetes
Performance comparison of HM-BagMoov for diabetes datasets
108
Experimental Results- Diabetes
Performance comparison of HM-BagMoov with multi-layer ensembles for
diabetes datasets
Pima Indian Diabetes Dataset | Biostat Diabetes Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| Majority voting | 76.30% | 50.00% | 80.40% | 61.66% | 91.07% | 79.54% | 48.33% | 60.13% |
| AdaBoost | 76.43% | 52.99% | 89.00% | 66.42% | 88.83% | 85.79% | 43.33% | 57.58% |
| Stacking | 74.61% | 69.99% | 75.39% | 72.59% | 87.78% | 81.88% | 60.11% | 69.33% |
109
Experimental Results- Diabetes
State of the art comparison of HM-BagMoov for Biostat dataset
| Reference | Year | Technique | Accuracy | Precision | Recall |
| Kandhasamy, J.P. et al. [86] | 2015 | J48 | 73.82% | 59.7% | 81.4% |
| | | KNN | 73.17% | 53.7% | 83.6% |
| | | SVM | 73.34% | 53.84% | 73.39% |
| | | Random forest | 71.74% | 53.81% | 80.4% |
| Bashir, S. et al. [87] | 2014 | Majority voting | 74.09% | 89.4% | 45.52% |
| | | AdaBoost | 74.22% | 84.2% | 55.6% |
| | | Bayesian boosting | 73.18% | 82.6% | 55.6% |
| | | Stacking | 68.23% | 76% | 53.73% |
| | | Bagging | 74.48% | 81.4% | 61.5% |
| Gandhi, K.K. et al. [88] | 2014 | F-score FS + SVM | 75% | - | - |
| Tapak, L. et al. [89] | 2013 | Logistic regression | 76.3% | 13.3% | 99.9% |
| | | Linear discriminant analysis | 71% | 0.6% | 99.8% |
| | | Fuzzy c-means | 67.8% | 33% | 90.1% |
| | | Neural network | 75.1% | 8.4% | 99.8% |
| | | Random forest | 71.7% | 8.1% | 99.8% |
| Karthikeyani, V. et al. [90] | 2012 | Discriminant analysis | 76% | - | - |
| Proposed technique | | HM-BagMoov | 95.07% | 90.31% | 99.88% |
110
Experimental Results- Liver Disease
Performance comparison of HM-BagMoov for liver disease datasets
111
Experimental Results- Liver Disease
Performance comparison of HM-BagMoov with multi-layer ensembles for
liver disease datasets
ILPD Dataset | BUPA liver disease Dataset
| Classifier | Acc | Prec | Rec | F-M | Acc | Prec | Rec | F-M |
| Majority voting | 71.53% | 69.04% | 62.99% | 65.87% | 71.88% | 45.52% | 81.00% | 58.28% |
112
Experimental Results- Liver Disease
State of the art comparison of HM-BagMoov for ILPD dataset
| Reference | Year | Technique | Accuracy | Precision | Recall |
| Gulia, A. et al. [91] | 2015 | J48 | 68.7% | - | - |
| | | MLP | 68.2% | - | - |
| | | SVM | 71.3% | - | - |
| | | Random forest | 70.3% | - | - |
| | | BayesNet | 67.2% | - | - |
| Jin, H. et al. [29] | 2014 | Naïve Bayes | 53.9% | 37.4% | 75.2% |
| | | Decision tree | 69.4% | 73.1% | 35.2% |
| | | Multi-layer perceptron | 67.9% | 72.9% | 30.3% |
| | | K nearest neighbor | 65.3% | 72.7% | 46.7% |
| Sug, H. [92] | 2012 | CART | 67.82% | - | - |
| | | C4.5 | 66.47% | - | - |
| Ramana, B.V. et al. [33] | 2011 | NBC | 55.59% | 71.03% | 37.5% |
| | | C4.5 | 55.94% | 50.34% | 60% |
| | | Back propagation | 66.66% | 51.03% | 78% |
| | | K-NN | 57.97% | 0% | 1% |
| | | SVM | 62.6% | 55.86% | 67.5% |
| Karthik, S. et al. [34] | 2011 | Naïve Bayes | 55% | 39.7% | - |
| | | RBF network | 61.5% | 46.6% | - |
| Proposed technique | | HM-BagMoov | 82.7% | 83.9% | 81.76% |
Experimental Results- Hepatitis
Performance comparison of HM-BagMoov for hepatitis dataset
| Classifier | Accuracy | Precision | Recall | F-Measure |
114
Experimental Results- Hepatitis
Performance comparison of HM-BagMoov with multi-layer ensembles for
hepatitis datasets
115
Experimental Results- Hepatitis
State of the art comparison of HM-BagMoov for hepatitis dataset
| Reference | Year | Technique | Accuracy | Precision | Recall |
| Houby, E.M.F. [93] | 2015 | Artificial neural network | 69% | 77.3% | 50.6% |
| | | Associative classification | 81.3% | 62.5% | 73.4% |
| | | Decision tree | 73.3% | 68.53% | 74.3% |
| Karthikeyan, T. et al. [38] | 2013 | Bayes.NaiveBayes | 84% | - | - |
| | | Bayes.BayesNet | 81% | - | - |
| | | Bayes.NaiveBayesUpdatable | 84% | - | - |
| | | J48 | 83% | - | - |
| | | Random forest | 83% | - | - |
| | | Multilayer perceptron | 83% | - | - |
| Neshat, M. et al. [94] | 2012 | KNN | 70.29% | - | - |
| | | Naïve Bayes | 66.94% | - | - |
| | | SVM | 65.22% | - | - |
| | | FDT | 61.49% | - | - |
| | | CBR-PSO | 77.16% | - | - |
| Kumar, M.V. et al. [40] | 2012 | SVM | 83.12% | - | - |
| | | SVM and wrapper method | 74.55% | - | - |
| Proposed technique | | HM-BagMoov | 87.04% | 83.27% | 81.67% |
116
Processing Time
• The performance of HM-BagMoov is also compared with other ensemble approaches with respect to processing time
117
Processing Time
Comparison of HM-BagMoov with other ensemble classifiers
[Bar chart: processing time in ms (0 to 120) of Majority Voting, AdaBoost, Bagging and HM-BagMoov on the Cleveland, Eric, Statlog, SPECT, SPECTF, PIMA, Diabetes, UMC, WBC, ILPD, BUPA and Hepatitis datasets]
118
Stopping Criteria
The stopping criteria are based on:
• No. of classifiers
• No. of layers
• Processing time
119
Stopping Criteria
• Accuracy decreases when more classifiers are added
• Addition of two classifiers at layer 1 (9 classifiers): perceptron and polynomial regression
• Further addition of two classifiers at layer 2 (11 classifiers): SVM ensemble and decision tree ensemble
• Evaluated on the Cleveland heart disease dataset
120
Stopping Criteria
• Accuracy decreases when more layers are added
• Addition of one more layer (4-layered approach): SVM ensemble and decision tree ensemble
• Evaluated on the Cleveland heart disease dataset
• 3-layered approach (HM-BagMoov): Acc 91.99%, Prec 91.31%, Rec 90.19%, F-M 90.75%
121
Stopping Criteria
• Processing time increases as more classifiers or layers are added
• Cleveland heart disease dataset: HM-BagMoov takes 35.10 ms
122
Analysis of Proposed Technique
• An optimal model of classifier diversity
• Consistently produced the highest accuracy for all
diseases/datasets
• Each classifier in ensemble framework has diverse set of
qualities
• Complement and overcome the limitations of each other
• Make an accurate ensemble framework for disease
diagnosis
• The framework is not limited to particular attributes or numbers of records
• The framework can be used for any disease diagnosis
123
IntelliHealth: An Intelligent Application for
Disease Diagnosis
124
IntelliHealth
• A complete medical decision support application for
• Data acquisition and preprocessing
• Disease diagnosis and report generation
125
IntelliHealth Architecture
126
Users of IntelliHealth
• Three Users
• Each user has their own login ID and password
• Admin/ I.T. Staff
• Doctor
• Patient
127
Modules of IntelliHealth
• Four Modules
1. Data Acquisition and Preprocessing module
2. Classifier Training and Model Generation
3. Disease Diagnosis Module
4. Report Generation
128
Data Acquisition & Preprocessing Module
131
Report Generation
• Report Generation
• Generate medical report for each patient
• Display whether patient shows the symptoms of
disease or not
132
Case Studies
133
Implementing IntelliHealth in Real-time Clinical Practice
[Flowchart: a patient showing certain symptoms of disease enters the system → clinical examination → suggested medical tests → diagnosis by the DSS → comparison of results with the clinical diagnosis]
135
Case Study-2
• Dr. Lubna Naseem, Pakistan Institute of Medical
Sciences (PIMS), Islamabad
• The system was trained on data from 495 patients, consisting of 11 attributes
• Tested on 150 patients
Tests Unit Reference value
T.L.C *1000/µL 4.5-11.5
Red cell count million/µL 3.5-5.5
Haemoglobin g/dL 12-15
PCV/HCT % 35-55
MCV fL 75-100
MCH pg 25.0-35.0
MCHC g/dL 31.0-38.0
Platelet count *1000/µL 100-400
RDW-CV % 11.6-15.0
Neutrophils % 60-70
Lymphocytes % 30-40
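A preprocessing step for data like this is to flag test values that fall outside the reference ranges in the table above. The sketch below assumes attribute names and uses a subset of the ranges; it is illustrative, not the application's actual code.

```python
# Reference ranges taken from the table above (subset); keys are assumed names.
REFERENCE_RANGES = {
    "TLC": (4.5, 11.5),            # *1000/µL
    "red_cell_count": (3.5, 5.5),  # million/µL
    "haemoglobin": (12.0, 15.0),   # g/dL
    "platelet_count": (100, 400),  # *1000/µL
}

def flag_out_of_range(report):
    """Return the tests whose value falls outside the reference range."""
    flags = {}
    for test, value in report.items():
        low, high = REFERENCE_RANGES[test]
        if not low <= value <= high:
            flags[test] = value
    return flags

patient = {"TLC": 13.2, "red_cell_count": 4.1,
           "haemoglobin": 10.8, "platelet_count": 220}
print(flag_out_of_range(patient))  # {'TLC': 13.2, 'haemoglobin': 10.8}
```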
137
Conclusion
138
Conclusion
Proposed multiple ensemble frameworks for disease diagnosis
The final ensemble framework adopts a layered approach with enhanced bagging to attain maximum accuracy
Evaluation was performed on multiple benchmark and real-time medical datasets to show the performance
Consistent performance across all medical datasets
Proposed the IntelliHealth application, based on HM-BagMoov, for disease diagnosis
Three case studies on real-time medical datasets demonstrate the superiority of the results
139
Future Work
140
Future Work
The proposed system can be extended to predict the levels and types of a particular disease
The next step in this research is a recommender system that recommends medicines and medical treatments for the predicted disease
The proposed system can be applied to data from other fields where classification is required, such as banking, finance and marketing
141
References
1. Thenmozhi,K.,Deepika, P.: Heart Disease Prediction Using Classification with Different Decision Tree Techniques. International
Journal of Engineering Research and General Science Volume 2, Issue 6, October-November, 2015
2. Bashir, S., Qamar, U., & Younus Javed, M. (2014, November). An ensemble based decision support framework for intelligent
heart disease diagnosis. In Information Society (i-Society), 2014 International Conference on (pp. 259-264). IEEE.
3. Chitra, R., Seenivasagam, D.V.: Heart Disease Prediction System Using Supervised Learning Classifier. In: International Journal
of Software Engineering and Soft Computing, Vol. 3, No. 1, March (2013)
4. Shouman,M.,Turner, T., Stocker,R.: Integrating Clustering with Different Data Mining Techniques in the Diagnosis of
Heart Disease. In: Journal of computer science and engineering, Vol 20, Issue 1, (2013).
5. Ghumbre, S., Patil, C., Ghatol, A.: Heart Disease Diagnosis using Support Vector Machine. In: International Conference on
Computer Science and Information Technology (ICCSIT') Pattaya (2011).
6. Shouman, M., Turner, T., Stocker, R.: Using Decision Tree for Diagnosing Heart Disease Patients, In: Proceedings of the 9th
Australasian Data Mining Conference, Ballarat, Australia (2011).
7. Özçift, A. Random forests ensemble classifier trained with data resampling strategy to improve cardiac arrhythmia
diagnosis. Computers in Biology and Medicine, 41(5), 265-271. (2011).
8. Chaurasia, V., Pal, S.: A Novel Approach for Breast Cancer Detection using Data Mining Techniques. International Journal of
Innovative Research in Computer and Communication Engineering. Vol. 2, Issue 1, January 2014
9. K,A. A., Aljahdali, S., Hussain, S.N.: Comparative Prediction Performance with Support Vector Machine and Random Forest
Classification Techniques. International Journal of Computer Applications (0975 – 8887) Vol 69– No.11, (2013).
10. Salama, G.I., Abdelhalim, M.B., Zeid, M.A.: Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers.
International Journal of Computer and Information Technology (2277 – 0764) Vol 01– Issue 01, (2012).
11. Lavanya, D., Rani, K.U.: Ensemble decision tree classifier for breast cancer data. In: International Journal of Information
Technology Convergence and Services (IJITCS) Vol.2. No. 1 (2012).
12. Luo, S. T., & Cheng, B. W. (2012). Diagnosing breast masses in digital mammography using feature selection and ensemble
methods. Journal of medical systems, 36(2), 569-577.
13. Aruna, S., Rajagopalan, S.P., Nandakishore, L.V.: Knowledge based analysis of various statistical tools in detecting breast cancer.
CCSEA, CS & IT, Vol 02, pp. 37–45 (2011).
14. Christobel, A., Sivaprakasam, Y.: An Empirical Comparison of Data Mining Classification Methods. International Journal of
Computer Information Systems,Vol. 3, No. 2, (2011).
15. Lavanya, D., Rani,K.U.: Analysis of feature selection with classification: Breast cancer datasets. Indian Journal of Computer
Science and Engineering (IJCSE), Vol 2, No 5 (2011).
142
References
16. Han, L., Luo, S., Yu, J., Pan, L., & Chen, S. (2015). Rule extraction from support vector machines using ensemble learning
approach: an application for diagnosis of diabetes. Biomedical and Health Informatics, IEEE Journal of, 19(2), 728-734.
17. Gandhi, K.K., Prajapati, N.B.: Diabetes prediction using feature selection and classification. International Journal of Advance
Engineering and Research Development (2014)
18. Stahl, F., Johansson, R., Renard, E.: Ensemble Glucose Prediction in Insulin-Dependent Diabetes. Data driven modeling for
diabetes. Springer (2014)
19. Aslam, M.W., Zhu, Z., Nandi, A.K.: Feature generation using genetic programming with comparative partner selection for diabetes
classification.In: Expert Systems with Applications. Elsevier (2013)
20. NirmalaDevi, M., Appavu, S., Swathi, U.V.: An amalgam KNN to predict diabetes mellitus. In: International conference on
Emerging Trends in Computing, Communication and Nanotechnology (ICE-CCN), (2013)
21. Christobel. Y. A., SivaPrakasam, P.: The Negative Impact of Missing Value Imputation in Classification of Diabetes Dataset and
Solution for Improvement. IOSR Journal of Computer Engineering (IOSRJCE). Volume 7, Issue 4 (2012), PP 16-23
22. Zolfaghari, R.: Diagnosis of Diabetes in Female Population of Pima Indian Heritage with Ensemble of BP Neural Network and
SVM. IJCEM International Journal of Computational Engineering & Management, Vol. 15 (2012)
23. Lee, C.: A Fuzzy Expert System for Diabetes Decision Support Application. Systems, Man, and Cybernetics, Part B: Cybernetics,
IEEE Transactions on (Volume:41 , Issue: 1 ) PP(139-153) 2011
24. Ozcift, A., & Gulten, A. (2011). Classifier ensemble construction with rotation forest to improve medical diagnosis performance of
machine learning algorithms. Computer methods and programs in biomedicine, 104(3), 443-451.
25. Verma, B., & Hassan, S. Z. (2011). Hybrid ensemble approach for classification. Applied Intelligence, 34(2), 258-278.
26. Vijayarani, S., Dhayanand, S., Liver Disease Prediction using SVM and Naïve Bayes Algorithms. International Journal of Science,
Engineering and Technology Research (IJSETR) Volume 4, Issue 4, April 2015
27. Seker, S. E., Unal, Y., Erdem, Z., & Kocer, H. E. (2014). Ensembled correlation between liver analysis outputs. arXiv preprint
arXiv:1401.6597
28. Pahareeya, J., Vohra, R., Makhijani, J., Patsariya, S.: Liver Patient Classification using Intelligence Techniques. International
Journal of Advanced Research in Computer Science and Software Engineering. Volume 4, Issue 2, February 2014
29. Jin, H., Kim, S., Kim, J.: Decision Factors on Effective Liver Patient Data Prediction. International Journal of BioScience and
BioTechnology Vol.6, No.4 (2014), pp.167-178
30. Sugawara, K., Nakayama, N., Mochida, S.: Acute liver failure in Japan: definition, classification, and prediction of the outcome. J
Gastroenterol. 2012 Aug; 47(8): 849–861
143
References
31. Kumar, Y., Sahoo, G.: Prediction of different types of liver diseases using rule based classification model. Technology and
healthcare, vol 21 no. 5, pp 417-432 (2013)
32. Ramana, B.V., Babu, M.P.: Liver Classification Using Modified Rotation Forest. International Journal of Engineering Research and
Development. Volume 1, Issue 6. PP.17-24. (2012)
33. Ramana, B.V., Babu, M.P., Venkateswarlu, N.B.: A Critical Study of Selected Classification Algorithms for Liver Disease
Diagnosis. International Journal of Database Management Systems ( IJDMS ), Vol.3, No.2, (2011)
34. Karthik, S., Priyadarishini, A., Anuradha, J., Tripathy, B.K.: Classification and Rule Extraction using Rough Set for Diagnosis of
Liver Disease and its Types. Advances in Applied Science Research, 2011, 2 (3): 334-345
35. Krawczyk, B., Woźniak, M., & Cyganek, B. (2014). Clustering-based ensembles for one-class classification. Information
Sciences, 264, 182-195.
36. Pushpalatha, S., Pandya, J.: Data model comparison for Hepatitis diagnosis. International Journal of Emerging Research in
Management &Technology. Volume-3, Issue-7. Pp 138-141. July 2014
37. El Houby, E.M.F.: Analysis of Associative Classification for Prediction of HCV Response to Treatment. International Journal of
Computer Applications (0975 – 8887) Volume 63– No.15, February 2013
38. Karthikeyan, T., Thangaraju, P.: Analysis of Classification Algorithms Applied to Hepatitis Patients. International Journal of
Computer Applications (0975 – 8887) Volume 62– No.15, January (2013)
39. Kaya,Y., Murat Uyar, A hybrid decision support system based on rough set and extreme learning machine for diagnosis of
hepatitis disease. 2013 Elsevier, 3429–3438
40. Kumar.M, V., Sharathi.V, V., Devi.B.R.G.: Hepatitis Prediction Model based on Data Mining Algorithm and Optimal Feature
Selection to Improve Predictive Accuracy. International Journal of Computer Applications (0975 – 8887) Volume 51– No.19,
August 2012
41. Lin, S., Chen, S., Chou, S., Enhancing the classification accuracy by scatter-search-based ensemble approach. Applied Soft
Computing. Volume 11, Issue 1, January 2011, Pages 1021–1028
42. Eldin, A.M.S.A.G.: A Data Mining Approach for the Prediction of Hepatitis C Virus protease Cleavage Sites. (IJACSA) International
Journal of Advanced Computer Science and Applications, Vol. 2, No. 12, December 2011
43. Javad Salimi Sartakht, J. S. (2011). Hepatitis disease diagnosis using a novel hybrid method. Elsevier, 570-579.
44. Oh, S., Lee, M., Zhang, B.: Ensemble Learning with Active Example Selection for Imbalanced Biomedical Data Classification.
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). Volume 8 Issue 2, March 2011 .Pages 316-325
45. Pattekari, S.A., Parveen, A.: Prediction system for heart disease using Naïve Bayes. In: International Journal of Advanced
Computer and Mathematical Sciences .Vol 3, Issue 3, pp 290-294 , (2012).
144
References
46. Shouman,M.,Turner, T., Stocker,R.: Integrating Naive Bayes and K-means clustering with different initial centroid selection methods in the
diagnosis of heart disease patients. In: Computer science and information technology, pp. 125–137 (2012)
47. Peter, J., Somasundaram.: Probabilistic Classification for Prediction of Heart Disease. Australian Journal of Basic and Applied Sciences,
9(7), Pp 639-643 (2015)
48. K,A. A., Aljahdali, S., Hussain, S.N.: Comparative Prediction Performance with Support Vector Machine and Random Forest Classification
Techniques. International Journal of Computer Applications, Vol 69– No.11, (2013)
49. Kousarrizi, M.R.N., Seiti, F., Teshnehlab, M.: An Experimental Comparative Study on Thyroid Disease Diagnosis Based on Feature Subset
Selection and classification. International Journal of Electrical & Computer Sciences IJECS-IJENS Vol: 12 No: 01 (2012)
50. Chen, J. H., Podchiyska, T., & Altman, R. B., OrderRex: Clinical order decision support and outcome predictions by data-mining electronic
medical records. Journal of the American Medical Informatics Association (2015)
51. Rajkumar, A. and G.S. Reena, Diagnosis Of Heart Disease Using Data mining Algorithm. Global Journal of Computer Science and
Technology, 2010. Vol. 10 (Issue 10).
52. Polat , K., S. Sahan, and S. Gunes, Automatic detection of heart disease using an artificial immune recognition system (AIRS) with fuzzy
resource allocation mechanism and k-nn (nearest neighbour) based weighting preprocessing. Expert Systems with Applications 2007. 32 p.
625–631
53. Kurt,I., Ture, M., Kurum, A.T.: Comparing performances of logistic regression, classification and regression tree, and neural networks for
predicting coronary artery disease. Expert Systems with Applications Volume 34, Issue 1, January 2008, Pages 366–374
54. Liao,J.G., Chin,K.: Logistic regression for disease classification using microarray data: model selection in a large pand small n case.
Bioinformatics. Vol. 23 no. 15 2007, pages 1945–1951
55. Ma,Y., Zuo,L., Chen,J., Luo,Q., Yu,X., Li,Y., Xu,J., Huang,S.,Wang,L., Huang, W., Wang,M., Xu,G., Wang,H.: Modified Glomerular Filtration
Rate Estimating Equation for Chinese Patients with Chronic Kidney Disease. Journal of American society of nephrology. Oct 2006
56. Dubberke, E.R., Reske, K., Olsen, M., McDonald,L., Fraser,V.: Short- and Long-Term Attributable Costs of Clostridium difficile–Associated
Disease in Nonsurgical Inpatients. Clinical Infectious Diseases 2008; 46:497–504
57. KNAUS, WILLIAM A. MD; DRAPER, ELIZABETH A. MS; WAGNER, DOUGLAS P. PhD; ZIMMERMAN, JACK E. MD. APACHE II: A severity
of disease classification system. Critical care medicine. Oct 1985. Vol 3 issue 10
58. Wilson,P.W.F., D’Agostino,R.B., Levy,D., Belanger,A.M., Silbershatz,H., Kannel,W.B.: Prediction of Coronary Heart Disease Using Risk
Factor Categories. American heart association. 1998;1837-1847
59. Ster, B., Dobnikar, A.: Neural networks in medical diagnosis: Comparison with other methods. In: Proceedings of the international conference
on engineering applications of neural networks pp. 427–430. (1996).
60. Georgiou-Karistianis ,N., Gray,M.A., Domínguez D,J.F., Dymowski,A.R., Bohanna,I., Johnston,L.A., Churchyard,A., Chua,P., Stout,J.C.
Egan,G.F.: Automated differentiation of pre-diagnosis Huntington's disease from healthy control individuals based on quadratic discriminant
analysis of the basal ganglia: The IMAGE-HD study. Neurobiology of Disease Volume 51, March 2013, Pages 82–92
145
References
61. Zhang, M.Q. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proceedings of the national
academy of sciences of the United States of America. Vol 94 no. 2. 1997
62. Maroco,J., Silva,D.,Rodrigues,A., Guerreiro,M., Santana,I.,Mendonç,A.: Data mining methods in the prediction of Dementia: A real-data
comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector
machines, classification trees and random forests. BMC research notes. 2011, 4:299
63. Drent,M,. Mulder,PG.,Wagenaar,SS., Hoogsteden,HC.,Velzen-Blad,H.,Bosch,JM.: Differences in BAL fluid variables in interstitial lung
diseases evaluated by discriminant analysis. European respiratory journal. June 1993
64. Srivastava,S., Gupta,M., Frigyik,B.: Bayesian Quadratic Discriminant Analysis. Journal of Machine Learning Research 8 (2007) 1277-1305
65. Ani, R., Augustine, A., Akhil, N. C., & Deepa, O. S. (2016). Random Forest Ensemble Classifier to Predict the Coronary Heart Disease Using
Risk Factors. In Proceedings of the International Conference on Soft Computing Systems (pp. 701-710). Springer India
66. Yang, J., Lee, Y., Kang, U.: Comparison of Prediction Models for Coronary Heart Diseases in Depression Patients. International Journal of
Multimedia and Ubiquitous Engineering Vol. 10, No. 3, pp. 257-268 (2015)
67. Chandna, D. (2014). Diagnosis of heart disease using data mining algorithm.Int. J. Comput. Sci. Inf. Technol.(IJCSIT), 5(2), 1678-1680.
68. Kiruthika, C., Rajini, S.N.S.: An Ill-identified Classification to Predict Cardiac Disease Using Data Clustering. International Journal of Data
Mining Techniques and Applications. Volume: 03, Pages: 321-325 (2014)
69. Jabbar, M.A., Chandra, P., Deekshatulu, B.L.: Heart Disease Prediction System using Associative Classification and Genetic Algorithm.
International Conference on Emerging Trends in Electrical, Electronics and Communication Technologies-ICECIT, 2012
70. Das, R., Turkoglu, I., and Sengur, A.: Effective diagnosis of heart disease through neural networks ensembles. In: Expert Systems with
Applications, Elsevier, pp. 7675–7680. (2009).
71. Chen, H., Yang, B., Liu, J., Liu, D.: A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis.
Expert systems with applications. pp 9014-9022. Vol 38 (2011).
72. Chunekar,V.N., Ambulgekar, H.P.: Approach of Neural Network to Diagnose Breast Cancer on three different Data Set. In: International
Conference on Advances in Recent Technologies in Communication and Computing. pp 893-895 (2009)
73. Kahramanli, H., Allahverdi, N.: Design of a hybrid system for the diabetes and heart diseases. Expert systems with applications. Elsevier.
Volume 35, Issues 1–2, July–August 2008, Pages 82–89
74. Tu,M.C., Shin,D., Shin, D.: Effective Diagnosis of Heart Disease through Bagging Approach. 2nd International Conference on Biomedical
Engineering and Informatics. pp 1-4 (2009).
75. Khalilia, M., Chakraborty, S., & Popescu, M. (2011). Predicting disease risks from highly imbalanced data using random forest. BMC medical
informatics and decision making, 11(1), 1.
76. Hsieh, C. H., Lu, R. H., Lee, N. H., Chiu, W. T., Hsu, M. H., & Li, Y. C. J. (2011). Novel solutions for an old disease: diagnosis of acute
appendicitis with random forest, support vector machines, and artificial neural networks.Surgery, 149(1), 87-93.
77. Ramírez, J., Górriz, J. M., Segovia, F., Chaves, R., Salas-Gonzalez, D., López, M., ... & Padilla, P. (2010). Computer aided diagnosis system
for the Alzheimer's disease based on partial least squares and random forest SPECT image classification. Neuroscience letters, 472(2), 99-
103.
146
References
78. Yang, J., Lee, Y., Kang, U.: Comparison of Prediction Models for Coronary Heart Diseases in Depression Patients. International Journal of
Multimedia and Ubiquitous Engineering Vol. 10, No. 3 (2015), pp. 257-268
79. Peter, J., Somasundaram.: Probabilistic Classification for Prediction of Heart Disease. Australian Journal of Basic and Applied Sciences,
9(7), April 2015. Pp 639-643
80. Kiruthika, C., Rajini, S.N.S.: An Ill-identified Classification to Predict Cardiac Disease Using Data Clustering. International Journal of Data
Mining Techniques and Applications. Volume: 03, June 2014, Pages: 321-325
81. Sørensen, K.P., Thomassen, M., Tan, Q., Bak, M., Cold, S., Burton, M., Larsen, M.J., Kruse, T.A.: Long non-coding RNA expression profiles
predict metastasis in lymph node-negative breast cancer independently of traditional prognostic markers. Breast Cancer Research (2015)
82. Zand. H.K.K.: A comparative survey on data mining techniques for breast cancer diagnosis and prediction. Indian Journal of Fundamental and
Applied Life Sciences. 2015 Vol.5 (S1), pp. 4330-4339
83. Chaurasia, V., Pal, S.: Data Mining Techniques: To Predict and Resolve Breast Cancer Survivability. International Journal of Computer
Science and Mobile Computing. Vol.3 Issue.1, January 2014, pg. 10-22
84. Chaurasia, V., Pal, S.: A Novel Approach for Breast Cancer Detection using Data Mining Techniques. International Journal of Innovative
Research in Computer and Communication Engineering. Vol. 2, Issue 1, January 2014
85. K,A. A., Aljahdali, S., Hussain, S.N.: Comparative Prediction Performance with Support Vector Machine and Random Forest Classification
Techniques. International Journal of Computer Applications (0975 – 8887) Vol 69– No.11, (2013).
86. Kandhasamy, J.P., Balamurali, S.: Performance Analysis of Classifier Models to Predict Diabetes Mellitus. Procedia Computer Science 47 (
2015 ) 45 – 51
87. Bashir, S., Qamar, U., Khan, F.H.: An efficient rule based classification of diabetes using ID3, C4.5 and CART ensemble. Frontier Information
Technology. IEEE. Islamabad, Pakistan (2015)
88. Gandhi, K.K., Prajapati, N.B.: Diabetes prediction using feature selection and classification. International Journal of Advance Engineering and
Research Development (2014)
89. Tapak, L., Mahjub, H., Hamidi, O., Poorolajal, J., Real-Data Comparison of Data Mining Methods in Prediction of Diabetes in Iran. Healthcare
Information Research. 2013 Sep; 19(3): 177–185.
90. Karthikeyani, V., Begum, I.P., Tajudin, K., Begam, I.S.: Comparative of Data Mining Classification Algorithm (CDMCA) in Diabetes Disease
Prediction. International Journal of Computer Applications (0975 – 8887) Volume 60– No.12, December 2012
91. Julia, A., Vohra, R., Rani, P.: Liver patient classification using intelligent techniques. International Journal of Computer Science and
Information Technologies, Vol. 5 (4) , 2015, 5110-511
92. Sug, H.: Improving the Prediction Accuracy of Liver Disorder Disease with Oversampling. Applied Mathematics in Electrical and Computer
Engineering. Pp 331-335. 2012
93. Houby, E.M.F.: A Framework for Prediction of Response to HCV Therapy Using Different Data Mining Techniques. Advances in
Bioinformatics Volume 2014 (2015)
94. Neshat, M., Sargolzaei, M., Toosi, A.N., Masoumi, A.: Hepatitis Disease Diagnosis Using Hybrid Case Based Reasoning and Particle Swarm
Optimization. Artificial Intelligence. Volume 2012 (2012), Article ID 609718, 6 pages
147
Questions?
148
Thank You!
149
Appendix
150
Top 5 Classifiers for heart disease diagnosis
KNN 69.35%
SVM 79.34%
151
Top 5 Classifiers for multiple diseases diagnosis
152
Selection of Classifiers [2/2]
153
Combination in BagMoov
154
Proposed Bagging approach with multi-objective weighted voting
Training Set (samples 1 … m)
→ Sampling with replacement
→ Bootstrap Sets: Set 1 … Set t (samples 1 … n each)
→ Hypotheses h1, h2, h3, …, ht (trained classifiers)
→ Predictions p1, p2, p3, …, pt
→ Multi-Objective Optimized Weighted Voting
→ P: Final Prediction
(m and n are the numbers of samples; the h's are trained classifiers and the p's are the predictions made by the classifiers)
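The bagging-with-weighted-voting scheme in this figure can be sketched from scratch: draw t bootstrap sets with replacement, train one hypothesis per set, and combine their predictions by weighted voting. In the sketch below a 1-nearest-neighbour rule stands in for the real base classifiers, and bootstrap training accuracy stands in for the multi-objective weights; both are simplifying assumptions.

```python
import random
from collections import Counter

def train_1nn(samples):
    """h_i: predict the label of the nearest training sample (1-D features)."""
    def predict(x):
        nearest = min(samples, key=lambda s: abs(s[0] - x))
        return nearest[1]
    return predict

def bagging_predict(data, x, t=5, seed=0):
    """Bagging: t bootstrap sets -> hypotheses h1..ht -> weighted vote -> P."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(t):
        bootstrap = [rng.choice(data) for _ in data]  # sampling with replacement
        h = train_1nn(bootstrap)
        # training accuracy on the bootstrap set stands in for the
        # multi-objective weight of this classifier
        weight = sum(h(xi) == yi for xi, yi in bootstrap) / len(bootstrap)
        votes[h(x)] += weight                          # weighted vote p_i
    return votes.most_common(1)[0][0]                  # final prediction P

data = [(1.0, 0), (1.2, 0), (2.9, 1), (3.1, 1)]
print(bagging_predict(data, 3.0))
```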
155
Limitations
Identify multiple levels of a disease using different classifiers such as SVM, LR, etc.
156
BagMoov
Classifier Accuracy Precision Recall F-Measure
157