
An Iterative Hadoop-Based Ensemble Data Classification Model on Distributed Medical Databases

Thulasi Bikku, Sambasiva Rao Nandam and Ananda Rao Akepogu

Abstract As the size and complexity of online biomedical databases grow day by day, finding essential structured or unstructured patterns in distributed biomedical applications has become more complex. Traditional Hadoop-based distributed decision tree models such as the Probability based decision tree (PDT), Classification And Regression Tree (CART) and the multiclass classification decision tree fail to discover relational patterns, user-specific patterns and feature-based patterns because of the large number of feature sets. These models depend on the selection of relevant attributes and on uniform data distribution. Data imbalance, indexing and sparsity are the three major issues in these distributed decision tree models. In the proposed model, an enhanced attribute-selection ranking model and a Hadoop-based decision tree model were implemented to extract user-specific interesting patterns from online biomedical databases. Experimental results show that the proposed model achieves a higher true positive rate, higher precision and a lower error rate than traditional distributed decision tree models.

Keywords Distributed data mining ⋅ Hadoop ⋅ Ensemble approach ⋅ Medical databases

T. Bikku (✉)
Department of CSE, VNITSW, Guntur, AP, India
e-mail: thulasi.bikku@gmail.com
S.R. Nandam
Department of CSE, SRITW, Warangal, Telangana, India
e-mail: snandam@gmail.com
A.R. Akepogu
Department of CSE, JNTUCEA, Ananthapuramu, India
e-mail: akepogu@gmail.com

© Springer Science+Business Media Singapore 2017


S.C. Satapathy et al. (eds.), Proceedings of the First International Conference
on Computational Intelligence and Informatics, Advances in Intelligent Systems
and Computing 507, DOI 10.1007/978-981-10-2471-9_33

1 Introduction

In the past, to identify named entities in the biomedical field, single kinds of entities
such as protein and/or gene labels were used. It is more effective to use multiple
kinds of named entities at the same time. For each named entity, an entity pattern
recognizer could be executed, and then, after multiple runs, all kinds of named
entities are annotated and, finally, the results could be merged. Gene clustering is
one of the data mining models that were developed for microarray gene expression
data [1]. The basic assumption with respect to training and test data is that all
distributed instances are drawn from the same feature space with the same
distribution. In the biomedical field, transfer learning enables the context knowl-
edge contained in labeled source domains to predict unlabeled data in the target
domain, where the domains differ in distributions. Traditionally, the transfer
learning was developed to find the correspondence between pivot features and other
specific features extracted from different domains. These learning models extract
persistent information that aims to reduce the difference between the domains.
Domain transfer learning reduces the difference between the distributions of dif-
ferent domains and, thus, minimizes cross-domain prediction errors. There have
been many approaches implemented for transfer learning; one of the promising
ones is feature transfer learning. In this, most relevant contextual features are
adopted as representative objects of both domains.
The distributed data generated from different sources are multiple, complex,
distinct and independent. Because of the large number of instances with redundant
and irrelevant attributes, an appropriate filtering model is used to repair noisy
attributes or instances. To handle the large number of attributes, an efficient
feature selection model is used to select ranked attributes for classification or
clustering. Al-Khateeb and Masud [2] implemented a concise set of rules using an
attribute selection model in rough set theory. This model has gained wide
acceptance in data analysis, machine learning and statistical analysis. The
important challenges in classification algorithms are the error rate and class
imbalance. Traditional models attempt to optimize the overall precision of their
predictions. We therefore prefer an efficient classification model that performs
well on the minority class and copes with the size, complexity and arbitrary data
distribution found in distributed data mining. Big data concerns large volumes of
growing datasets that are complex to analyze automatically, making it difficult to
filter noise and recognize unstructured data for decision-making.
The main challenges in handling online medical databases are data accessing and
arithmetic computing procedures, semantics and domain knowledge for a variety of
big data applications, and difficulties that arise from dynamic, complex, noisy and
constantly changing and evolving data. These challenges can be handled using the
three-tier architecture shown in Fig. 1. The three tiers of the architecture are
data accessing and computing; data privacy and domain knowledge; and the big data
classification model.

Fig. 1 Big data mining framework (tiers: accessing distributed medical databases; sharing domain knowledge; big data mining algorithms)
In the first tier, data accessing and arithmetic computing procedures are per-
formed on the distributed data. The main problem is that huge amounts of data are
shared in different locations and the data are growing in each data source enor-
mously. Hence, to compute and analyze large-scale distributed data, we need an
effective platform like Hadoop. In the second tier, data semantics and domain
knowledge are used for a variety of applications that involve big data. In online
biomedical applications, users collaborate with each other and share knowledge
with groups and user communities; incorporating this domain knowledge is a crucial
task in both low-level and high-level big data mining algorithms. In the third
tier, problems that come from big
data size, distributed data, complexity and dynamic nature are analyzed prior to
mining algorithm. In this tier, uncertain, sparse, incomplete and multisource data
are preprocessed using filtering models and data are analyzed after preprocessing.
After the preprocessing, the local Hadoop learning models are applied to find the
hidden patterns, and the feedback is sent to preprocessing stage.
Hadoop-based classification models are distributed data mining techniques that
classify high-dimensional unstructured data into meaningful patterns, helping
users with decision-making and knowledge discovery. These classification
techniques present large datasets in a simpler form that exposes hidden patterns.
Classification approaches for big data are broadly divided into two categories,
supervised and unsupervised. In supervised learning, all instances in the training
data carry class labels for decision-making; in unsupervised learning, they do
not. A supervised classifier uses the training data to classify new instances.
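To make the supervised case concrete, here is a minimal nearest-neighbour sketch that classifies a new instance from labelled training data. The feature vectors and labels are hypothetical toy values, not drawn from the paper's corpus:

```python
# Minimal supervised classification sketch: 1-nearest-neighbour on toy
# numeric instances (hypothetical data, for illustration only).

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_classify(train, new_instance):
    """train: list of (feature_vector, class_label) pairs."""
    # Pick the label of the closest labelled training instance.
    _, label = min(train, key=lambda pair: euclidean(pair[0], new_instance))
    return label

train = [((0.1, 0.2), "benign"), ((0.9, 0.8), "malignant")]
print(knn_classify(train, (0.85, 0.9)))  # closest to (0.9, 0.8)
```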

2 Related Work

Yu and Li [1] and Zhang and Suganthan [3] implemented multi-scale decision tree
representation using granularity computation for generating mixed hierarchical
decision rules. They implemented these multi-scale decision tables under different
levels of granularity and level-wise threshold. Al-Khateeb and Masud [2] imple-
mented Probability based decision tree (PDT) under single level granularity to
hierarchical multi-level granularity and research on attribute generalization
344 T. Bikku et al.

reduction by refining attribute values. Mendes-Moreira and Soares [4] and Mathe
et al. [5] proposed a hierarchical reduction model for concept hierarchy to acquire
multi-confidence rules from the covering decision systems. These attribute selection
models are not applicable to big data for storage and manipulations on a single
machine. Hence, it is necessary to create the most efficient hierarchical attribute
reduction models for big data to accommodate a variety of user’s medical
requirements on different levels. In most of the big data applications, sampling
techniques have been applied to find the relational features or interesting patterns.
Sampling models would be practically successful only if the samples are equally
distributed or satisfy the hypothetical space. Instead of using sampling techniques,
the best solution to handle the large number of attributes for smaller databases is
parallel computing. Traditional features selection based classifiers were applied on
large databases to compute granularities separately and later combined together to
find the global solution over the whole data. However, there is no guarantee that
these partitioned attributes or instances can exchange relational information with
each other; thus, in the majority of cases, these techniques fail to extract a
subset of features for large datasets. Graph-based medical disease models have
been gaining attention for their handling of uncertainty and the way they detect
disease patterns or relationships between entities. The structural representation
holds essential information for categorizing and visualizing entities and is hence
useful for learning, clustering and decision-making in biomedical applications.
Graph-based classification models are promising and could improve on traditional
keyword-based techniques [5].
Machine learning models applied on distributed clinical databases attempt to find
patterns and relationships, to understand the features and progression of certain
diseases. Single-class models are used for noise filtering with limited instances.
In many medical applications, the degree of class imbalance varies, particularly
when classifying online medical datasets. Conventional models such as k-nearest
neighbors classify an instance by comparing its Euclidean distances to each class
without considering the features' contextual information. Multiclass imbalanced
classification brings many challenges in big data because of the data complexity.
A typical solution is to partition the multiclass data into binary classifications
and then use balancing techniques to combine the distributed data. The data
imbalance rate in binary classification is defined as the ratio of the majority
class instance count to the minority class instance count [6, 7]. Existing
dictionary-based approaches do not give optimal results for identifying protein
names, because new protein names keep appearing and, sometimes, hundreds of
variant names reference identical proteins. Medline is a large repository of
publicly available scientific literature. Model-based clustering algorithms have been
implemented on document clustering [6–8], where document clusters are repre-
sented as probabilistic methods that are conceptually separated from the data
dimensions. Presently, graph-based clustering models using statistical models are
also successfully applied to document clustering. These graph models are optimized
using predefined document measures on the directed graph. The hierarchical Latent
Dirichlet allocation (hLDA) method was implemented in [9] as an unsupervised
method that generalizes LDA. Graphical ranking-based clustering algorithms have
been implemented to construct a sentence model graph in which each node is a
sentence in the overlay documents. Traditional medical clustering algorithms are
not suitable because they work in batch processing, whereas an iterative process
merges each iteration's clusters into the existing clusters, which leads to
duplicate attributes.
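The binary imbalance rate described above (majority-class instance count over minority-class instance count) can be sketched as:

```python
from collections import Counter

def imbalance_rate(labels):
    """Ratio of majority-class count to minority-class count for binary labels."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

labels = [0] * 90 + [1] * 10   # hypothetical imbalanced dataset
print(imbalance_rate(labels))  # 9.0
```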

3 Proposed Model

In the proposed framework, an iterative Hadoop-based ensemble classification
model was implemented to find qualitative patterns from the associated features.
Medical disease prediction is executed in three steps, as shown in Fig. 2. In the
first step, distributed medical data from different sources are integrated into a
single dataset. In the second step, an improved distributed data filtering
algorithm is applied to replace inconsistent values in the Hadoop framework. In
the third step, the mapper-filtered dataset is used to find interesting patterns
using the reducer's ensemble algorithm (Fig. 3).

Fig. 2 Different static pattern analyzers (distributed databases are combined into integrated databases; user-specific data integrity (Step 1), data filtering (Step 2), and phase-based clustering with the ensemble hybrid classifier (Step 3) produce feature-based decision patterns)

Fig. 3 Attributes similarity computation (each attribute Att_i of DB1 is compared with attributes Att_j1, Att_j2, ..., Att_jn of DB2)

Mapper Phase:
Distributed Data Integration (combining two databases)
______________________________________________
Input: Medical datasets MDList (MDlist1 and MDlist2)
Output: A single integrated dataset D'
Procedure:
For each attribute Att_i in MDlist1 do
    For each attribute Att_j in MDlist2 do
        If (Att_i != φ && Att_j != φ) then
            If (Type(Att_i) == Type(Att_j)) then
                Sim(Att_i, Att_j) = P(Att_i | Att_j) * P(Att_j) * Correlation(Att_i, Att_j)
                // Correlation(Att_i, Att_j) is the correlation between the two attributes.
                // Map each correlated attribute pair with its similarity measure:
                Map((Att_i, Att_j), Sim(Att_i, Att_j))
            End if
        End if
    Done
Done
Select the maximum similarity values from the list of attributes for data
integration:
For each attribute Att_i in MDlist1 do
    For each attribute Att_j in MDlist2 do
        Select the attribute pair (Att_i, Att_j) with Maximum(Sim(Att_i, Att_j))
        D' = Integrate attributes Att_i and Att_j
        Map((Att_i, Att_j), Sim(Att_i, Att_j))
    Done
Done
______________________________________________

In this algorithm, medical data from different sources are integrated using
relational attributes. A similarity measure is computed to find the attribute
relationships for data integration.
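A minimal single-machine sketch of this similarity computation follows. It assumes the joint term P(Att_i | Att_j) · P(Att_j) = P(Att_i, Att_j) is approximated by the fraction of rows where both attribute columns are present (an assumption; the paper does not fix the estimator), with plain Pearson correlation over those rows:

```python
# Sketch of the mapper's attribute-similarity step (simplified, single-machine).

def correlation(xs, ys):
    """Pearson correlation between two numeric attribute columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def similarity(col_i, col_j):
    """Sim(Att_i, Att_j) ≈ P(Att_i, Att_j) * Correlation(Att_i, Att_j)."""
    pairs = [(x, y) for x, y in zip(col_i, col_j)
             if x is not None and y is not None]
    joint = len(pairs) / len(col_i)          # joint-presence estimate
    xs, ys = zip(*pairs)
    return joint * correlation(list(xs), list(ys))

print(similarity([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly correlated: 1.0
```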
Mapper-Based Data Cleaning (data filtering method)
In this mapper preprocessing mechanism, the integrated dataset is used to fill
continuous and nominal missing values. Since the distributed medical data may
have numerical and nominal missing or inconsistent values, these values are
replaced with computed values. For numerical attributes, the maximum possible
estimators can be used to replace null or inconsistent values. For nominal
attributes, the conditional posterior probability estimator can be used to
replace null or inconsistent values.

Input: Integrated dataset D'
Output: Filtered dataset FDList
Procedure:
For each attribute A_i in D' do
    For each instance I(A_i) in A_i do
        // Nominal attributes
        If (Type(I(A_i)) == Nominal && I(A_i) == φ) then
            The probability estimates are computed as
                P1(I(A_i)) = log(Prob(I(A_i) | Cls_m)) + e^{log(Prob(I(A_i)))}
                P2(I(A_j)) = log(Prob(I(A_j) | Cls_m)) + e^{log(Prob(I(A_j)))}
                A_j = Max({P1(I(A_i))}, {P2(I(A_j))})
            If A_j ∈ P1(I(A_i)) then
                I(A_j) = val_i
            Else
                I(A_j) = val_j
            End if
        End if
        // Numerical attributes
        If (Type(I(A_i)) == Numerical && I(A_i) == φ) then
            I(A_j) = Max(num(A_i), num(A_j)) / | μ − (e^{log(Prob(I(A_i)))} + e^{log(Prob(I(A_j)))}) / 2 |
        End if
    End for
End for
FData = cleaned dataset
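A simplified single-machine sketch of this cleaning step follows. The paper's log-probability estimators are replaced here by plain mean and mode fills (an assumption), keeping only the numerical vs. nominal branching:

```python
# Simplified sketch of the mapper-based cleaning step: fill missing values
# per column, branching on numerical vs. nominal type.

def fill_missing(column):
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = sum(present) / len(present)           # numerical: mean estimate
    else:
        fill = max(set(present), key=present.count)  # nominal: most frequent value
    return [fill if v is None else v for v in column]

print(fill_missing([1.0, None, 3.0]))       # [1.0, 2.0, 3.0]
print(fill_missing(["a", None, "a", "b"]))  # ['a', 'a', 'a', 'b']
```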

To execute our hybrid ensemble model, we must distribute the filtered data among
different machines. The given data are partitioned across different parallel
machines rather than replicated, to minimize memory storage. Because of the large
size of the data, we take a random sample of the training data and ensure the
consistency of the ensemble model by repeated sampling. The input data objects
are distributed across the Hadoop mappers. The initial number of clusters N and
the representative objects R are selected randomly. These parameters are placed
in individual mappers or in a common location accessed by all the mappers. The
distance from each object to each representative object is measured, and the
closest objects are clustered together. The reducers accept the clustered objects
and update the representative objects using the fuzzy membership value.
Hadoop-based Ensemble Classifier (HBEC):

Input: Filtered dataset FData
Output: Decision patterns for medical databases
Procedure:
For each attribute in FData do
    Divide the medical records FData into 'N' independent clusters.
    Select a representative point R_i randomly.
    While i < N do
        Dist(R_i, x) = lim_{p→0} ( Σ_i |x_i − R_i|^p )^{1/p}
        Assign each data object to the cluster with the nearest distance.
        Update the representative instance using the fuzzy membership matrix:
            Mem(μ_x) = (1/Dist(R_i, x))^{1/(θ−1)} / Σ_{r=1..N} (1/Dist(R_r, x))^{1/(θ−1)}
        where θ is the fuzzy parameter and N is the number of clusters.
        Update representative instance = Maximum{Dist(R_i, x)} / Mem(μ_x)
    End while
Done
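The fuzzy membership update in HBEC can be sketched as follows, reading the garbled exponent as 1/(θ − 1) (an assumption consistent with fuzzy clustering practice) and using hypothetical distances:

```python
# Sketch of the reducer's fuzzy membership update:
#   Mem_i = (1/Dist(R_i, x))^(1/(theta-1)) / sum_r (1/Dist(R_r, x))^(1/(theta-1))
# Distances and theta below are hypothetical values for illustration.

def fuzzy_memberships(distances, theta=2.0):
    """Membership of one object x to each representative, from its distances."""
    weights = [(1.0 / d) ** (1.0 / (theta - 1.0)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

mems = fuzzy_memberships([1.0, 3.0])  # closer representative gets higher membership
print(mems)  # [0.75, 0.25]
```

Memberships always sum to 1, so the object is shared across clusters rather than assigned exclusively, which is what lets the reducer blend objects into updated representatives.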

Hybrid ensemble decision tree construction:

Here the clustered data are represented as CData.

For each phase-clustered partition CD_ij in CData do
    If CD_ij == Null then
        Return a leaf node with an empty matched medical pattern set.
    Else if class(CD_ij) == 1 then
        Return a leaf node with medical patterns m.
    Else
        Split CD_ij into r disjoint partitions using a random sampling
        distribution, where r = m (the number of classes).
        Let CD_1(i1, j1), CD_2(i2, j2), ..., CD_r(ir, jr) be the r disjoint
        partitions with m classes such that
            CD_ij = CD_1(i1, j1) ∪ CD_2(i2, j2) ∪ ... ∪ CD_r(ir, jr),
        and let At_l(n) be the attribute list of data partition l.
        For each matched partition s do
            Find the medical attribute ranking using
                ARank(P, At_i(n)) = {Σ P(At_i(i) | At_j(i))}^{m} / max{IG(At_i(i), At_j(i))}
            where i, j = 1, 2, ..., n index the attributes, m is the number
            of classes, P(A_i(n)) is the probability of the tuples satisfying
            A_i(n), and θ is the data scaling factor (0–1).
            If ARank(P, At_i(n), m) < θ then
                ARank(P, At_i(n)) = ARank(P, At_i(n)) + θ
            End if
        End for
        Select the root node using the attribute with the highest ARank in
        all the partitions.
        Repeat until no instances remain in the partitions.
        Display phase-based patterns in the decision tree.
    End if
End for

In this algorithm, the ensemble model's filtered data are clustered, and patterns
are extracted using hybrid decision tree construction. In each phase, cross-defect
metrics and their relationships are evaluated using the pattern discovery process.
These patterns are used to identify the medical documents and their dependencies
in each partition. As the scaling parameter changes, different decision patterns
are evaluated at each iteration.
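The attribute-ranking idea behind ARank (score every candidate attribute and split the tree on the best one) can be illustrated with plain information gain, a simplification of the paper's formula, on a hypothetical four-record dataset:

```python
# Attribute ranking via information gain: the attribute with the highest
# score becomes the split (root) node. Data below are hypothetical.
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in {l: labels.count(l) for l in set(labels)}.values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting the rows on the given attribute."""
    base = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - remainder

rows = [("y", "l"), ("y", "r"), ("n", "l"), ("n", "r")]  # hypothetical records
labels = ["pos", "pos", "neg", "neg"]
print(information_gain(rows, labels, 0))  # attribute 0 separates the classes perfectly
```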

4 Performance Analysis

In this experimental study, we analyzed the proposed ensemble classifier model
against traditional models in terms of classifier accuracy, different cluster
rates and the run time taken to classify different medical datasets drawn from
the Medline and PubMed repositories. The separation index, measured between
clusters, describes compactness and measures the gap between clusters in a
partition. Entropy measures the uncertainty (disorder) of a partition: lower
entropy means better clustering, and greater entropy means the clustering is poor
(Tables 1 and 2). Purity is a simple evaluation measure of cluster quality: bad
clusters have purity close to 0, and a perfect cluster has a purity of 1. To
measure the efficiency of the clustering we use precision:
Precision = True Positives / (True Positives + False Positives).
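The evaluation measures above (precision, cluster purity, cluster entropy) can be sketched on hypothetical counts:

```python
# Evaluation measures used in this section, on hypothetical inputs.
import math

def precision(tp, fp):
    return tp / (tp + fp)

def cluster_purity(cluster_labels):
    """Fraction of the cluster belonging to its dominant class (1.0 = perfect)."""
    dominant = max(cluster_labels.count(l) for l in set(cluster_labels))
    return dominant / len(cluster_labels)

def cluster_entropy(cluster_labels):
    """Shannon entropy of the class labels inside a cluster (0 = perfect)."""
    n = len(cluster_labels)
    probs = [cluster_labels.count(l) / n for l in set(cluster_labels)]
    return -sum(p * math.log2(p) for p in probs)

print(precision(90, 10))                # 0.9
print(cluster_purity(["a", "a", "b"]))  # dominant class covers 2 of 3 members
print(cluster_entropy(["a", "a"]))      # 0.0 for a pure cluster
```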
In Table 3, time taken to extract medical patterns is minimized using the pro-
posed model compared to the traditional models.

Table 1 Comparing the clustering accuracy of the proposed ensemble model with other algorithms in terms of entropy, separation index and precision

Algorithm                     Avg_Cluster_Entropy   Avg_Separation_Index   Precision (%)
Hierarchical multi-class DT   0.61                  0.25                   78.34
P2P C4.5                      0.678                 0.26                   74.25
CART                          0.598                 0.198                  89.6
Neural networks               0.698                 0.473                  89.13
PDT                           0.498                 0.526                  83.5
Proposed ensemble model       0.3987                0.187                  94.76

Table 2 Comparing the proposed ensemble classifier performance with different algorithms based on different cluster rates (classifier accuracy, %)

Algorithm                     5 clusters   10 clusters   30 clusters   40 clusters
Hierarchical multi-class DT   69           78.35         67.88         79.35
P2P C4.5                      74           84.5          82.34         71.45
CART                          89.57        81.56         79.67         81.46
Neural networks               82.56        69.35         71.64         82.45
PDT                           75.74        83.45         78.34         74.35
Proposed ensemble model       91.45        88.43         89.35         92.46

Table 3 Comparing the proposed model with traditional algorithm models in terms of runtime (ms)

Algorithm                     Total patterns   Runtime (ms)
Hierarchical multi-class DT   100              2562
P2P C4.5                      100              3244
CART                          100              2891
Neural networks               100              2608
PDT                           100              2382
Proposed ensemble model       100              1693

5 Conclusion

Pattern extraction from medical databases using a traditional rule-based approach
results in a higher error rate for the similarity identification of
genes/proteins. Patterns that are subsets of medical patterns are not relevant to
the biological study. Traditional models do not provide an efficient preprocessing
approach for protein/gene name tokenization. Data imbalance, indexing and sparsity
are the three major issues in distributed decision tree models. In the proposed
model, an enhanced attribute-selection ranking model and a Hadoop-based decision
tree model were implemented to extract user-specific interesting patterns from
online biomedical databases. Experimental results show that the proposed model
achieves a higher true positive rate, higher precision and a lower error rate
than traditional distributed decision tree models.

References

1. Yu, Z., Li, L., Liu, J.: Hybrid Adaptive Classifier Ensemble. IEEE Transactions on Cybernetics (2015) 177–190.
2. Al-Khateeb, T., Masud, M.M.: Recurring and Novel Class Detection Using Class-Based Ensemble for Evolving Data Stream. IEEE Transactions on Knowledge and Data Engineering (TKDE) (2015) 34–45.
3. Zhang, L., Suganthan, P.N.: Oblique Decision Tree Ensemble via Multisurface Proximal Support Vector Machine. IEEE Transactions on Cybernetics (2015) 2165–2176.
4. Mendes-Moreira, J., Soares, C.: Ensemble Approaches for Regression: A Survey. ACM Computing Surveys 45(1) (2012) 123–136.
5. Mathe, C., Sagot, M.F., Schiex, T., Rouze, P.: Current Methods of Gene Prediction, Their Strengths and Weaknesses. Nucleic Acids Research 30 (2002) 4103–4117.
6. Stein, L.: The Case for Cloud Computing in Genome Informatics. Genome Biology 11 (2010) 207.
7. Mason, C.E., Elemento, O.: Faster Sequencers, Larger Datasets, New Challenges. Genome Biology 13 (2012) 314.
8. Tang, S., Lee, B.-S., He, B.: DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters. IEEE Transactions 2(3) (2013) 333–345.
9. Raghupathi, W., Raghupathi, V.: Big Data Analytics in Healthcare: Promise and Potential. Health Information Science and Systems (2014) 1–10.
