(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, 2011
Map Reduce for DC4.5 and Ensemble Learning in Distributed Data Mining
Dr. E. Chandra
Research Supervisor and Director, Department of Computer Science, D J Academy for Managerial Excellence, Coimbatore, Tamilnadu, India. crcspeech@gmail.com

P. Ajitha
Research Scholar and Assistant Professor, Department of Computer Science, D J Academy for Managerial Excellence, Coimbatore, Tamilnadu, India. ajitha@y7mail.com
Abstract: MapReduce, one of the distributed computing techniques, is integrated with the decision tree classifier C4.5 and with ensemble learning in a distributed environment. This paper proposes an algorithm that classifies and predicts data using MapReduce for DC4.5 with ensemble learning. The proposed algorithm improves accuracy and scalability, and noise handling in decision trees over distributed data is also addressed.

Keywords: C4.5, Distributed Decision Trees, MapReduce, Ensemble Learning
I. INTRODUCTION
Classification with decision trees plays a major role in both distributed and centralised environments. Centralised classification techniques are not efficient at handling large volumes of data. A deluge of information is available in today's world, and it has become necessary to analyse that information and mine knowledge from it; for this, distributed classification techniques come in handy. One such technique, C4.5 in a distributed environment, is discussed here, together with MapReduce, which also comes in handy in the distributed setting.

The C4.5 decision tree algorithm is one of the most popular techniques for predicting and classifying data. Handling massive data sets may lead to biased attribute selection, and the chances of encountering missing values are high. C4.5 selects attributes without bias and prunes the tree when it overfits; handling discrete and missing values with precision is also possible with C4.5. In a distributed environment C4.5 selects attributes across the sites, but its inherent disadvantage is that it is an unstable classifier, so two further techniques, ensemble learning and MapReduce, are utilised.

In ensemble learning, base classifiers are constructed from the data sets, and new data is classified by combining the predictions of the base classifiers. Ensembles pave the way to combining and perturbing many methods, increasing accuracy and improving scalability in a significant way; multiple versions of a classifier can be generated by varying the training set, the method, or the set of parameters.

The SPRINT algorithm [3] achieves scalable performance in parallel and distributed environments. Prodromidis [4] discusses meta-learning with respect to the distributed environment, JAM [5] targets grid and cloud computing environments, and Weka [6] supports grid-enabled cross-validation and testing. These are a few of the algorithms already specified for parallel and distributed environments with ensemble methods.

MapReduce [1] was proposed by Google for handling massive data sets. It provides two primary functions, (1) a Map function and (2) a Reduce function, for distributed, parallel and cloud computing environments. The basic work of MapReduce is to iterate over the input, compute key-value pairs for each part of the input, group all intermediate values by key, then iterate over the resulting groups and finally reduce each group. Implementation issues such as fault tolerance, load balancing and performance can also be handled in the MapReduce environment. A series of machine learning techniques are used with MapReduce to efficiently solve problems arising in large-scale distributed environments.

Recently, ensemble learning has become popular because of its inherent ability to train many learners and combine or group their results. In a distributed environment, ensemble learning can increase accuracy and reduce computing effort. Utilising the advantages of MapReduce, ensembles and C4.5, a new classification algorithm, DEC4.5-MR, is proposed here: distributed ensemble C4.5 with MapReduce, which will classify, construct and predict the data.

The rest of the paper is organised as follows. Section 2 discusses related work on C4.5, ensembles and MapReduce in distributed environments. Section 3 examines distributed C4.5 with ensembles and MapReduce. Section 4 presents the experimental evaluation, and Section 5 concludes the paper.
II. RELATED WORK

A. C4.5 Method
C4.5 is a successor of CLS and ID3. It generates a classifier in the form of a decision tree and generates rules based on the results. C4.5 uses a divide-and-conquer strategy, which is described in Figure 1. Two heuristic evaluation measures, information gain and the Gini index, are used in C4.5; both can handle discrete and missing values, and the precision of data handling is high. Numeric and nominal data can also be considered. After construction, pruning can be done to avoid overfitting, using the pessimistic post-pruning method. Suppose a node covers N instances, of which E are misclassified; the observed error rate f = E/N can be treated as the outcome of Bernoulli trials, which equation (1) specifies. The pessimistic error estimate can then be computed using equation (2).
\Pr\left[ \frac{f - q}{\sqrt{q(1-q)/N}} > z \right] = c    (1)

e = \frac{f + \frac{z^2}{2N} + z \sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}    (2)

where q is the true error rate at the node, f = E/N the observed error rate, and z the standard-normal deviate corresponding to the confidence level c.
The algorithm for the distributed DC4.5 decision tree is specified in Figure 1.

Input: A training set D, a node N;
Output: A decision tree with the root N;
1: If the instances in D belong to the same class, or the amount of instances in D is too few, set N as a leaf node and label the node N with the most frequent class in D;
2: Otherwise, choose a test attribute X with two or more outcomes based on a selecting criterion, and label the node N with X;
3: Partition D into subsets D1, D2, ..., Dn according to the outcomes of attribute X for each instance; generate N's n children nodes N1, N2, ..., Nn;
4: For every group (Di, Ni), recursively build a subtree with the root Ni;
5: Each and every group is integrated and a global tree is generated.

Fig 1. Algorithm for the DC4.5 decision tree.
The DC4.5 algorithm in Figure 1 shows how C4.5 is used in a distributed environment.
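To make steps 1-4 concrete, the following minimal Python sketch builds such a tree recursively over (attribute-dict, label) instances on a single site. The information-gain criterion and the min_instances stopping threshold are assumptions of the sketch, and step 5 (integrating the per-group trees into a global tree) is omitted:

```python
import math
from collections import Counter

def entropy(instances):
    """Shannon entropy of the class labels in (attributes, label) pairs."""
    counts = Counter(label for _, label in instances)
    return -sum(c / len(instances) * math.log2(c / len(instances))
                for c in counts.values())

def build_tree(instances, attributes, min_instances=2):
    labels = {label for _, label in instances}
    # Step 1: pure node or too few instances -> leaf with the majority class.
    if len(labels) == 1 or len(instances) <= min_instances or not attributes:
        return Counter(l for _, l in instances).most_common(1)[0][0]
    # Step 2: choose the test attribute X with the highest information gain.
    def gain(attr):
        parts = {}
        for inst in instances:
            parts.setdefault(inst[0][attr], []).append(inst)
        rest = sum(len(p) / len(instances) * entropy(p) for p in parts.values())
        return entropy(instances) - rest
    X = max(attributes, key=gain)
    # Steps 3-4: partition on X's outcomes and recurse into the children.
    partitions = {}
    for inst in instances:
        partitions.setdefault(inst[0][X], []).append(inst)
    remaining = [a for a in attributes if a != X]
    return (X, {outcome: build_tree(subset, remaining, min_instances)
                for outcome, subset in partitions.items()})
```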
B. Ensemble Learning in DDM

Ensemble learning methods in distributed data mining follow one of two approaches: mining inherently distributed data, or scaling up ensemble methods by partitioning the data and combining the results. When data is geographically distributed, handling it in a centralised environment is not efficient without perturbing the conventional methods. Majority voting and weighted voting can both be used to combine distributed decision trees, and the ensemble technique allows vertical partitioning, otherwise called heterogeneous data [7]. This technique increases accuracy and reduces processing time compared with centralised prediction, and scalability issues are also handled by ensemble methods. Bagging, boosting and the random subspace method are some of the ensemble algorithms. Here bagging, one of the most popular of them, is utilised, as it scales well in distributed environments [9] and is able to handle noise in the data [8].
Bagging ensemble algorithm
Input: L: a classification method, D: a training data set, m: the number of base classifiers;
Output: \varphi(*): an ensemble classifier;
1: for i = 1 to m
2:   D' = bootstrap sampling from the set D;
3:   \varphi_i(*) = L(D');
4: end for;  /* Y is a nominal class label set */
5: \varphi(*) = \arg\max_{y \in Y} \sum_{i=1}^{m} (\varphi_i(*) = y);

Fig 2. The Bagging Ensemble Algorithm.
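Figure 2 translates almost line for line into Python. In this minimal sketch, learn stands in for the base method L and is assumed to return a callable classifier:

```python
import random
from collections import Counter

def bagging(learn, D, m):
    """Figure 2 in Python: learn = base method L, D = training set,
    m = number of base classifiers; returns the ensemble phi(*)."""
    classifiers = []
    for _ in range(m):
        # Step 2: bootstrap sample -- |D| instances drawn with replacement.
        D_prime = [random.choice(D) for _ in range(len(D))]
        # Step 3: train the i-th base classifier on the bootstrap sample.
        classifiers.append(learn(D_prime))
    def phi(x):
        # Step 5: majority vote over the base classifiers' predictions.
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]
    return phi
```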
C. MapReduce Distributed Computing Model
The following architecture specifies the MapReduce distributed computing model and the operation of its two functions, Map and Reduce. The MapReduce architecture splits the work into map tasks over key-value pairs and local tasks, together with the necessary job scheduling. The Map function can be highly parallelised, and so can the Reduce function: Map functions usually deal with parallelised and distributed sub-missions, while the Reduce function usually collects the sub-missions and parallelises or distributes them according to need [10]. In short, the computing process of MapReduce is: (i) divide the input into large data sets; (ii) handle each (or several) data set(s) in one cluster node for intermediate processing; (iii) combine the intermediate results in a final cluster. The Map function reads input data sets in the format <key1, value1>. After the analysis, it generates an intermediate result <key2, value2> and submits it to a Reducer; the Reduce function then combines the results to get a final list <key3, value3> according to the list <key2, value2>.
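This flow from <key1, value1> through <key2, value2> to <key3, value3> can be simulated in a few lines of Python. The sketch below is a single-process illustration, not the Hadoop API, and the word-count payload is only an example:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Map over <key1, value1> records, group the intermediate
    <key2, value2> pairs by key (the shuffle/sort phase), then
    reduce each group to a <key3, value3> result."""
    intermediate = []
    for key1, value1 in inputs:
        intermediate.extend(map_fn(key1, value1))     # -> (key2, value2) pairs
    intermediate.sort(key=itemgetter(0))              # group all values by key2
    results = []
    for key2, group in groupby(intermediate, key=itemgetter(0)):
        values2 = [v for _, v in group]
        results.append(reduce_fn(key2, values2))      # -> (key3, value3)
    return results

# Example payload: word count over two small documents.
docs = [("d1", "map reduce map"), ("d2", "reduce")]
out = run_mapreduce(
    docs,
    map_fn=lambda k, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda word, ones: (word, sum(ones)),
)
print(out)  # [('map', 2), ('reduce', 2)]
```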
 
Fig 3. Hadoop MapReduce framework architecture.
During the Map process, a Combiner, which has a similar function to the Reducer, can be used to reduce results locally and improve combination efficiency. The transformation between the input and the output looks as follows [6].
map(key1, value1) → list(key2, value2)
reduce(key2, list(value2)) → list(key3, value3)
Hadoop is an open-source framework implementing the MapReduce parallel computing model for distributed programming. With the help of Hadoop, programmers can easily write parallel and distributed programs that run on computing clusters to deal with massive data [11]. The basic components of an application on Hadoop's MapReduce include a Mapper and a Reducer class, as well as a program that creates a JobConf. Some applications also include a Combiner class, which is in effect a local implementation of the Reducer. Hadoop implements a distributed file system, referred to as HDFS. HDFS is highly fault-tolerant and is designed for deployment on low-cost hardware. It provides high-throughput access to application data, which suits applications with large data sets, and it relaxes some POSIX requirements to allow streaming access to data in the file system. In addition, Hadoop implements the MapReduce distributed computing paradigm: MapReduce splits the mission of an application into small blocks of work, and HDFS establishes multiple replicas of data blocks for reliability, placing them on the compute nodes of server groups so that MapReduce can handle the data on the associated nodes. Figure 3 shows the Hadoop MapReduce framework [10].

III. PROPOSED ALGORITHM
 
A. Distributed C4.5 with Ensemble and MapReduce
The proposed algorithm, DC4.5 with ensemble and MapReduce, has three phases: a partition/map phase, a build-base-classifier phase, and a reduce/ensemble phase. The data set D is divided into n subsets {D1, D2, ..., Dn}, where the user determines the value of n. In the map phase, a base classifier BCi is generated into a classifier Ci with the DC4.5 algorithm. In the reduce/ensemble phase, the n base classifiers are assembled into the final classifier using bagging.
B. Types of Keys and Values
The types of the key and value sets of MReC4.5 are as follows:

key1: Text    value1: Instances
key2: Text    value2: Iterator of Classifiers
key3: Text    value3: Classifier

key1, key2 and key3 are all of the Text type offered by Hadoop, and their values are the file name associated with the input data set D. In the partition phase, when the data set D is split into m data sets, each data set is formatted as value1 with the Instances type, according to the input format of the C4.5 algorithm. In the map phase, we build a classifier model with the C4.5 algorithm and obtain a classifier model set value2, which belongs to the Iterator of Classifiers type. In the reduce phase, we assemble the classifiers from value2 to obtain a classifier model value3 with the Classifier type.
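For illustration only, these sets of types can be rendered as Python aliases; Text, Instances and Classifier here merely mimic the corresponding Hadoop and Weka-style types:

```python
from typing import Callable, Iterator, List, Tuple

Text = str                                  # stands in for Hadoop's Text type
Instances = List[Tuple[dict, str]]          # a formatted training subset
Classifier = Callable[[dict], str]          # a trained model: instance -> label

Pair1 = Tuple[Text, Instances]              # <key1, value1>: one input split
Pair2 = Tuple[Text, Iterator[Classifier]]   # <key2, value2>: map-phase output
Pair3 = Tuple[Text, Classifier]             # <key3, value3>: the final model
```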
C. Map/Reduce Phase
Figure 4 specifies the proposed algorithm for the map operation with respect to the C4.5 algorithm. A change is made to the original algorithm so that map and reduce operate on key-value pairs.
function mapper(key, value)
/* Build base-classifier */
1: Build a C4.5 classifier c with the data set value;
/* Submit intermediate results */
2: Emit(key, c);
3: Generate {D1, D2, ..., Dn} for all subsets;
4: Build and map with each classifier Ci;
5: Integrate the (key, c) pairs into one cluster.

Fig 4. The Map Operation.
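A minimal Python rendering of the mapper in Figure 4; build_c45 and emit are assumed stand-ins for the classifier builder and Hadoop's output collector, not names from the paper:

```python
def mapper(key, value, emit, build_c45):
    """Figure 4, steps 1-2: train a base classifier on this data
    subset and emit it as an intermediate (key, classifier) pair."""
    c = build_c45(value)   # step 1: build a C4.5 classifier with `value`
    emit(key, c)           # step 2: submit the intermediate result

# Steps 3-5 in outline: the partition phase generates {D1, ..., Dn},
# mapper() runs once per subset, and the reduce/ensemble phase then
# integrates the emitted (key, c) pairs, e.g.:
#   intermediate = []
#   for name, Di in subsets.items():
#       mapper(name, Di, lambda k, c: intermediate.append((k, c)), build_c45)
```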