Bikku2016 - Springer Con
Abstract As the size and complexity of online biomedical databases grow day by
day, finding essential structured or unstructured patterns in distributed
biomedical applications has become more difficult. Traditional Hadoop-based
distributed decision tree models such as the Probability-based Decision Tree
(PDT), Classification And Regression Tree (CART) and Multiclass Classification
Decision Tree have failed to discover relational patterns, user-specific
patterns and feature-based patterns, due to the large number of feature sets.
These models depend on the selection of relevant attributes and a uniform data
distribution. Data imbalance, indexing and sparsity are the three major issues
in these distributed decision tree models. In the proposed model, an enhanced
attribute-selection ranking model and a Hadoop-based decision tree model were
implemented to extract user-specific interesting patterns from online
biomedical databases. Experimental results show that the proposed model
achieves a higher true positive rate, higher precision and a lower error rate
than traditional distributed decision tree models.
T. Bikku (✉)
Department of CSE, VNITSW, Guntur, AP, India
e-mail: thulasi.bikku@gmail.com
S.R. Nandam
Department of CSE, SRITW, Warangal, Telangana, India
e-mail: snandam@gmail.com
A.R. Akepogu
Department of CSE, JNTUCEA, Ananthapuramu, India
e-mail: akepogu@gmail.com
1 Introduction
In the past, single kinds of entities such as protein and/or gene labels were
used to identify named entities in the biomedical field. It is more effective
to use multiple kinds of named entities at the same time: an entity pattern
recognizer can be executed for each named entity and, after multiple runs, all
kinds of named entities are annotated and the results merged. Gene clustering
is one of the data mining models developed for microarray gene expression
data [1]. The basic assumption with respect to training and test data is that
all distributed instances have been taken from the same feature space with the
same distribution. In the biomedical field, transfer learning enables the
context knowledge contained in labeled source domains to predict unlabeled
data in the target domain, where the domains differ in their distributions.
Traditionally, transfer learning was developed to find the correspondence
between pivot features and other specific features extracted from different
domains. These learning models extract persistent information that aims to
reduce the difference between the domains. Domain transfer learning reduces
the difference between the distributions of different domains and, thus,
minimizes cross-domain prediction errors. Many approaches have been
implemented for transfer learning; one of the promising ones is feature
transfer learning, in which the most relevant contextual features are adopted
as representative objects of both domains.
The distributed data generated from different sources are multiple, complex,
distinct and independent. Due to the large number of instances with redundant
and irrelevant attributes, an appropriate filtering model is used to handle
noisy attributes or instances. To handle a large number of attributes, an
efficient feature-selection model is used to select ranked attributes for
classification or clustering. Al-Khateeb and Masud [2] implemented a concise
set of rules using an attribute-selection model in rough set theory. This
model has gained wide acceptance in data analysis, machine learning and
statistical analysis. The important challenges in classification algorithms
are the error rate and class imbalance. Traditional models attempt to optimize
the overall precision of their predictions. Therefore, we would prefer an
efficient classification model that performs well on the minority class and
copes with the size, complexity and arbitrary data distribution of distributed
data mining. Big data concerns large-volume, growing datasets that are
difficult to analyze automatically; filtering noise and recognizing
unstructured data for decision-making are especially complex.
The main challenges in handling online medical databases are data accessing
and arithmetic computing procedures; semantics and domain knowledge for a
variety of big data applications; and difficulties arising from dynamic,
complex, noisy and constantly evolving data. These challenges can be handled
using the three-tier architecture shown in Fig. 1.
An Iterative Hadoop-Based Ensemble Data … 343
The three tiers in the architecture are data accessing and computing; data
privacy and domain knowledge; and the big data classification model.
In the first tier, data accessing and arithmetic computing procedures are
performed on the distributed data. The main problem is that huge amounts of
data are shared across different locations and the data in each source are
growing enormously. Hence, to compute and analyze large-scale distributed
data, we need an effective platform like Hadoop. In the second tier, data
semantics and domain knowledge are used for a variety of applications that
involve big data. In online biomedical applications, users collaborate with
each other and share knowledge with their groups/user communities;
incorporating this domain knowledge is the most crucial task in both low-level
and high-level big data mining algorithms. In the third tier, problems arising
from big data size, distributed data, complexity and dynamic nature are
analyzed prior to running the mining algorithm. In this tier, uncertain,
sparse, incomplete and multisource data are preprocessed using filtering
models, and the data are analyzed after preprocessing. After the
preprocessing, the local Hadoop learning models are applied to find the hidden
patterns, and the feedback is sent back to the preprocessing stage.
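As an illustrative sketch of this third-tier loop (filter, learn locally, feed results back to preprocessing), consider the following toy Python fragment; the filtering rule, the "learning" step and the feedback signal are all hypothetical stand-ins, not the paper's actual models:

```python
def preprocess(data, feedback=None):
    """Filter incomplete/noisy records; feedback can tighten the filter."""
    threshold = feedback if feedback is not None else 0.0
    return [x for x in data if x is not None and x >= threshold]

def local_learning(partition):
    """Stand-in for a local Hadoop learning model: reports the mean as a
    'hidden pattern' and suggests a new filter threshold as feedback."""
    pattern = sum(partition) / len(partition)
    return pattern, pattern * 0.5   # (pattern, feedback signal)

data = [None, 0.2, 3.5, 4.1, None, 2.8, 0.1]
feedback = None
for _ in range(3):                  # iterate: preprocess -> learn -> feedback
    cleaned = preprocess(data, feedback)
    pattern, feedback = local_learning(cleaned)
print(round(pattern, 2))
```

The loop stabilizes once the feedback-driven filter stops changing which records survive preprocessing.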
Hadoop-based classification models are distributed data mining techniques that
classify high-dimensional unstructured data into meaningful patterns, helping
users with decision-making and knowledge discovery. These classification
techniques reduce large datasets to a simpler form that reveals hidden
patterns. Classification approaches on big data are broadly divided into two
categories, supervised and unsupervised. In supervised learning, all the
instances in the training data have class labels for decision-making; in
unsupervised learning, the instances in the training data do not possess class
labels. A supervised classifier is used to classify a new instance using the
training data.
2 Related Work
Yu and Li [1] and Zhang and Suganthan [3] implemented multi-scale decision tree
representation using granularity computation for generating mixed hierarchical
decision rules. They implemented these multi-scale decision tables under different
levels of granularity and level-wise thresholds. Al-Khateeb and Masud [2]
extended the probability-based decision tree (PDT) from single-level
granularity to hierarchical multi-level granularity and studied attribute
generalization reduction by refining attribute values.
344 T. Bikku et al.
Mendes-Moreira and Soares [4] and Mathe et al. [5] proposed a hierarchical
reduction model for concept hierarchies to acquire multi-confidence rules from
covering decision systems. These attribute-selection models are not applicable
to big data, since storage and manipulation on a single machine are
infeasible. Hence, it is necessary to create efficient hierarchical attribute
reduction models for big data to accommodate a variety of users' medical
requirements at different levels. In most big data applications, sampling
techniques have been applied to find relational features or interesting
patterns. Sampling models are practically successful only if the samples are
uniformly distributed or cover the hypothesis space. Instead of using sampling
techniques, the best solution for handling the large number of attributes in
smaller databases is parallel computing. Traditional feature-selection-based
classifiers were applied to large databases to compute granularities
separately and later combine them into a global solution over the whole data.
But there is no guarantee that these partitioned attributes or instances can
exchange relational information with each other. Thus, in the majority of
cases, these techniques fail to extract a subset of features for large
datasets. Graph-based medical disease models have been gaining a lot of
attention due to the way they handle uncertainty and detect disease patterns
or relationships between entities. The structural representation holds
essential information for categorizing and visualizing entities and is, hence,
useful for learning, clustering and decision-making in biomedical
applications. Graph-based classification models are promising and could
improve on traditional keyword-based techniques [5].
Machine learning models applied to distributed clinical databases attempt to
find patterns and relationships, in order to understand the features and
progression of certain diseases. Single-class models are used for noise
filtering with limited instances. In many medical applications the degree of
class imbalance varies, particularly when classifying online medical datasets.
Conventional models such as k-nearest neighbors classify an instance by
comparing its Euclidean distances to each class without considering the
features' contextual information. Multiclass imbalanced classification brings
a lot of challenges in big data due to its data complexity. A typical solution
is to partition the multiclass data into binary classification problems and
then use balancing techniques to combine the distributed data. The data
imbalance rate in binary classification can be defined as the ratio of the
majority-class instance count to the minority-class instance count [6, 7].
Existing dictionary-based approaches do not give optimal results for
identifying protein names, because new protein names keep appearing and,
sometimes, hundreds of distinct names reference identical proteins. Medline is
a large repository of publicly available scientific literature. Model-based
clustering algorithms have been applied to document clustering [6–8], where
document clusters are represented by probabilistic models that are
conceptually separated from the data dimensions. Presently, graph-based
clustering models using statistical models are also successfully applied to
document clustering. These graph models are optimized using predefined
document measures on the directed graph. The hierarchical Latent Dirichlet
Allocation (hLDA) method was implemented in [9] as an unsupervised method that
generalizes LDA. Graphical ranking-based clustering algorithms have been
implemented to construct a sentence model graph in which each node is a
sentence of the overlaid documents. Traditional medical clustering algorithms
are not suitable because they work in batch mode, whereas the iterative
process merges each iteration's clusters into the existing clusters, which
leads to duplicate attributes.
3 Proposed Model
[Figure: Proposed model workflow — data filtering (Step 2), phase-based
clustering, and the ensemble hybrid classifier (Step 3), operating on
attribute sets Att_j1, Att_j2, …, Att_jn for each attribute Att_i.]
Mapper Phase:
Distributed Data Integration (combining two databases)
______________________________________________
Input: Medical datasets MDlist1 and MDlist2 (from MDList).
Output: Single integrated dataset D'.
Procedure:
// Step 1: score every cross-database attribute pair.
For each attribute Atti in MDlist1 do
    For each attribute Attj in MDlist2 do
        If (Atti != φ && Attj != φ) then
            If (Type(Atti) == Type(Attj)) then
                Sim(Atti, Attj) = P(Atti | Attj) * P(Attj) * Correlation(Atti, Attj);
                // Correlation(Atti, Attj) is the correlation between the two attributes.
                // Map each correlated attribute pair with its similarity measure:
                Map((Atti, Attj), Sim(Atti, Attj));
            End if
        End if
    Done
Done
// Step 2: integrate the attribute pairs with maximum similarity.
For each attribute Atti in MDlist1 do
    For each attribute Attj in MDlist2 do
        Select the attribute pair (Atti, Attj) with Maximum(Sim(Atti, Attj));
        D' = Integrate attributes Atti and Attj;
        Map((Atti, Attj), Sim(Atti, Attj));
    Done
Done
______________________________________________
In this algorithm, medical data from different sources are integrated using
relational attributes. A similarity measure is computed to find the attribute
relationships used for data integration.
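A plain-Python sketch of this mapper step follows, under loud assumptions: each database is a dict of numeric columns, and the paper's full Sim(Atti, Attj) formula is simplified to the absolute Pearson correlation factor (the probability terms are omitted); all names and sample values are illustrative.

```python
from statistics import mean

def pearson(x, y):
    # Pearson correlation between two equal-length numeric columns.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def map_attribute_pairs(db1, db2):
    """Mapper step: emit ((name_i, name_j), similarity) for every pair of
    non-empty, same-typed attribute columns across the two databases."""
    pairs = {}
    for name_i, col_i in db1.items():
        for name_j, col_j in db2.items():
            if col_i and col_j and type(col_i[0]) == type(col_j[0]):
                pairs[(name_i, name_j)] = abs(pearson(col_i, col_j))
    return pairs

def best_match(pairs):
    # Reducer-style selection: the attribute pair with maximum similarity.
    return max(pairs, key=pairs.get)

db1 = {"age": [63.0, 45.0, 52.0, 70.0]}
db2 = {"years": [62.0, 44.0, 53.0, 71.0], "bp": [120.0, 118.0, 135.0, 122.0]}
sims = map_attribute_pairs(db1, db2)
print(best_match(sims))   # ('age', 'years') — the near-identical columns
```

The maximally similar pairs returned by `best_match` are the ones the algorithm would integrate into D'.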
Mapper-Based Data Cleaning (data filtering method)
In this Mapper preprocessing mechanism, the integrated dataset is used to fill
continuous and nominal missing values. Since the distributed medical data may
contain numerical and nominal missing or inconsistent values, these values are
replaced with computed estimates. For numerical attributes, maximum likelihood
estimators can be used to replace null or inconsistent values. For nominal
attributes, a conditional posterior probability estimator can be used to
replace null or inconsistent values.
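A minimal sketch of these filling rules, with two simplifying assumptions: numeric gaps are replaced by the column mean (the maximum-likelihood estimate under a Gaussian assumption), and the conditional posterior estimator for nominal gaps is reduced to the unconditional most probable (modal) value; `None` marks a missing entry.

```python
from collections import Counter
from statistics import mean

def fill_missing(column):
    """Replace None entries: numeric columns get the mean,
    nominal columns get the most probable observed value."""
    observed = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)                          # ML mean estimate
    else:
        fill = Counter(observed).most_common(1)[0][0]  # modal value
    return [fill if v is None else v for v in column]

print(fill_missing([98.6, None, 101.2, 99.4]))  # numeric gap -> column mean
print(fill_missing(["A", "B", None, "A"]))      # nominal gap -> mode "A"
```

In a real mapper, this function would be applied per attribute column of each input split.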
To execute our hybrid ensemble model, we must distribute the filtered data
among different machines. The given data are partitioned across different
parallel machines rather than replicated, to minimize memory storage. Due to
the large size of the data, we take a random sample of the training data and
ensure the consistency of the ensemble model by repeated sampling. The input
data objects are distributed across the Hadoop mappers. The initial number of
clusters N and the representative objects R are selected randomly. These
parameters are placed in individual mappers, or in a common location accessed
by all the mappers. The distance from each object to each representative
object is measured, and each object is clustered with its nearest
representative. The reducer processes accept the clustered objects and update
the representative objects using the fuzzy membership values.
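One iteration of this clustering step might look like the following single-process sketch: the map step pairs each object with its fuzzy membership vector, and the reduce step updates each representative as a membership-weighted mean (fuzzy c-means style). The representative count N = 2, the fuzzifier m = 2, the 1-D absolute-distance measure and all sample values are assumptions, not specified by the paper.

```python
def memberships(obj, reps, m=2):
    """Fuzzy membership of obj in each cluster (fuzzy c-means style)."""
    d = [abs(obj - r) for r in reps]
    if any(di == 0 for di in d):            # object coincides with a representative
        return [1.0 if di == 0 else 0.0 for di in d]
    inv = [(1.0 / di) ** (2 / (m - 1)) for di in d]
    return [u / sum(inv) for u in inv]

def map_step(objects, reps):
    # Mapper: pair each object with its membership vector
    # (its nearest representative is the argmax entry).
    return [(o, memberships(o, reps)) for o in objects]

def reduce_step(mapped, reps, m=2):
    # Reducer: update each representative as a membership-weighted mean.
    new_reps = []
    for k in range(len(reps)):
        den = sum(u[k] ** m for _, u in mapped)
        num = sum((u[k] ** m) * o for o, u in mapped)
        new_reps.append(num / den if den else reps[k])
    return new_reps

objects = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]    # toy 1-D data objects
reps = [0.0, 10.0]                          # N = 2 initial representatives
for _ in range(5):                          # a few map/reduce iterations
    reps = reduce_step(map_step(objects, reps), reps)
print(reps)                                 # representatives settle near the two groups
```

In a Hadoop deployment, `map_step` and `reduce_step` would run as separate mapper and reducer tasks, with the updated representatives broadcast before the next iteration.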
Hadoop-based Ensemble Classifier (HBEC):
In this algorithm, the filtered data are clustered and patterns are extracted
using hybrid decision tree construction. In each phase, cross-defect metrics
and their relationships are evaluated using the pattern discovery process.
These patterns are used to identify the medical documents and their
dependencies in each partition. As the scaling parameter changes, different
decision patterns are evaluated at each iteration.
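The per-partition pattern extraction can be pictured as training one simple classifier per data partition and combining them by majority vote. The sketch below uses depth-one decision stumps as a stand-in for the paper's hybrid decision trees; the partitioning scheme, the stump learner and the voting rule are illustrative assumptions.

```python
from collections import Counter

def train_stump(rows):
    """Learn a depth-one threshold rule (feature, threshold, labels)
    by exhaustive search over observed split points."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            left = [y for xx, y in rows if xx[f] <= t]
            right = [y for xx, y in rows if xx[f] > t]
            if not left or not right:
                continue
            ly = Counter(left).most_common(1)[0][0]
            ry = Counter(right).most_common(1)[0][0]
            errors = sum(1 for xx, y in rows
                         if (ly if xx[f] <= t else ry) != y)
            if best is None or errors < best[0]:
                best = (errors, f, t, ly, ry)
    _, f, t, ly, ry = best
    return lambda x: ly if x[f] <= t else ry

def ensemble_predict(stumps, x):
    # Majority vote across the per-partition classifiers.
    return Counter(s(x) for s in stumps).most_common(1)[0][0]

data = [([1.0], "low"), ([2.0], "low"), ([8.0], "high"), ([9.0], "high"),
        ([1.5], "low"), ([8.5], "high")]
partitions = [data[0:4], data[2:6], data[0:2] + data[4:6]]  # overlapping splits
stumps = [train_stump(p) for p in partitions]
print(ensemble_predict(stumps, [1.2]))   # "low"
print(ensemble_predict(stumps, [8.8]))   # "high"
```

Each partition yields one decision pattern; the vote aggregates them, mirroring how HBEC combines patterns discovered across partitions.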
4 Performance Analysis
Table 1 Comparing the clustering accuracy of the proposed ensemble model with
other algorithms in terms of entropy, separation index and precision

Algorithm                     Avg_Cluster_Entropy   Avg_Separation_Index   Precision (%)
Hierarchical multi-class DT   0.61                  0.25                   78.34
P2P C4.5                      0.678                 0.26                   74.25
CART                          0.598                 0.198                  89.6
Neural networks               0.698                 0.473                  89.13
PDT                           0.498                 0.526                  83.5
Proposed ensemble model       0.3987                0.187                  94.76
Table 2 Comparing the proposed ensemble classifier performance with different
algorithms at different cluster counts

Algorithm                     5-cluster      10-cluster     30-cluster     40-cluster
                              accuracy (%)   accuracy (%)   accuracy (%)   accuracy (%)
Hierarchical multi-class DT   69             78.35          67.88          79.35
P2P C4.5                      74             84.5           82.34          71.45
CART                          89.57          81.56          79.67          81.46
Neural networks               82.56          69.35          71.64          82.45
PDT                           75.74          83.45          78.34          74.35
Proposed ensemble model       91.45          88.43          89.35          92.46
5 Conclusion
References
1. Yu, Z., Li, L., Liu, J.: Hybrid adaptive classifier ensemble. IEEE
Transactions on Cybernetics (2015) 177–190.
2. Al-Khateeb, T., Masud, M.M.: Recurring and novel class detection using
class-based ensemble for evolving data stream. IEEE Transactions on Knowledge
and Data Engineering (TKDE) (2015) 34–45.
3. Zhang, L., Suganthan, P.N.: Oblique decision tree ensemble via multisurface
proximal support vector machine. IEEE Transactions on Cybernetics (2015)
2165–2176.
4. Mendes-Moreira, J., Soares, C.: Ensemble approaches for regression: a
survey. ACM Computing Surveys 45(1) (2012) 123–136.
5. Mathe, C., Sagot, M.F., Schiex, T., Rouze, P.: Current methods of gene
prediction, their strengths and weaknesses. Nucleic Acids Research 30 (2002)
4103–4117.
6. Stein, L.: The case for cloud computing in genome informatics. Genome
Biology 11 (2010) 207.
7. Mason, C.E., Elemento, O.: Faster sequencers, larger datasets, new
challenges. Genome Biology 13 (2012) 314.
8. Tang, S., Lee, B.-S., He, B.: DynamicMR: a dynamic slot allocation
optimization framework for MapReduce clusters. IEEE Transactions 2(3) (2013)
333–345.
9. Raghupathi, W., Raghupathi, V.: Big data analytics in healthcare: promise
and potential. Health Information Science and Systems (2014) 1–10.