Ubiquitous Computing and Communication Journal [ISSN 1992-8424]
 
 
CLASSIFICATION BASED ON ASSOCIATION-RULE MINING TECHNIQUES: A GENERAL SURVEY AND EMPIRICAL COMPARATIVE EVALUATION

"Alaa Al Deen" Mustafa Nofal and Sulieman Bani-Ahmad
Department of Information Technology
Al-Balqa Applied University
Jordan, Al-Salt, 19117
alano911@yahoo.co.uk
 
ABSTRACT
In this paper, classification and association rule mining algorithms are discussed and demonstrated. In particular, we consider the problem of association rule mining and investigate and compare several popular association rule algorithms. The classic problem of classification in data mining is also discussed. The paper then considers the use of association rule mining in classification, demonstrating a recently proposed algorithm for this purpose. Finally, a comprehensive experimental study on 13 UCI data sets is presented to evaluate and compare traditional and association rule based classification techniques with regard to classification accuracy, number of derived rules, rule features and processing time.

Keywords: Data mining, Classification, Association, Associative Classification, MMAC, CBA, C4.5, PART, RIPPER
 
1.  INTRODUCTION
Constructing fast and accurate classifiers for large data sets is an important task in data mining and knowledge discovery. There is growing evidence that merging classification and association rule mining can produce more efficient and accurate classification systems than traditional classification techniques [26]. In this paper, a recently proposed classification algorithm [37] is discussed in detail.
Classification is one of the most important tasks in data mining. There are many classification approaches for extracting knowledge from data, such as statistical [21], divide-and-conquer [15] and covering [6] approaches. Numerous algorithms have been derived from these approaches, such as Naïve Bayes [21], See5 [34], C4.5 [30], PART [14], Prism [6] and IREP [16]. However, traditional classification techniques often produce only a small subset of rules, and therefore usually miss detailed rules that might play an important role in some cases [29].
 
Another vital task in data mining is the discovery of association rules in a data set that pass certain user constraints [1, 2]. Classification and association rule discovery are similar, except that classification involves predicting one attribute (the class), while association rule discovery can predict any attribute in the data set. In the last few years, a new approach that integrates association rule mining with classification has emerged [26, 37, 22]. Several accurate and effective classifiers based on the associative classification approach have been presented recently, such as CPAR [39], CMAR [22], MMAC [37] and CBA [26]. Many experimental studies [26, 39, 37] showed that classification based on association rule mining is a high-potential approach that constructs more predictive and accurate classification systems than traditional classification methods like decision trees [30, 34]. Moreover, many of the rules found by associative classification methods cannot be found by traditional classification techniques.
In this paper, a recently proposed classification-based-on-association-rules technique is surveyed and discussed. It extends the basic idea of association rules [1] and integrates it with classification to generate a subset of effective rules; it has been named multi-class classification based on association rules (MMAC) [37]. It utilizes efficient techniques for discovering frequent itemsets and employs a rule-ranking method to ensure that general and detailed rules with high confidence are part of the classification system.
The main contribution of this paper is that several popular association rule mining techniques are theoretically compared in terms of a number of criteria. Further, a comparison of several classification algorithms is conducted. Moreover, the integration of association rule mining with classification is investigated through the recently proposed MMAC algorithm. Finally, an experimental study comparing MMAC with five popular classification algorithms is conducted using a group of real and artificial benchmark UCI datasets. More specifically, our testbed involves 13 artificial datasets and 10 real-world application datasets.
The major findings of the paper are:
 
 
 
- Some simple classification algorithms, like OneR, perform quite well on real-world application data, even though they perform poorly on artificial data sets.
- There is consistency in the classification accuracy and number of rules produced by the decision tree C4.5 and PART algorithms.
- Naïve Bayes and OneR are the fastest algorithms at constructing the classification system, due to the simplicity of their rule-construction methods.
- RIPPER, on the other hand, is the slowest algorithm at building the classification system, due to the optimization phase it employs to reduce the size of the rule set.
- In terms of accuracy, the MMAC algorithm is the best, probably due to the relatively large number of rules it can identify.

2. ASSOCIATION RULE MINING

Since the introduction of association rule mining by Agrawal, Imielinski and Swami in their paper "Mining association rules between sets of items in large databases" in 1993 [1], this area has remained one of the most active research areas in machine learning and knowledge discovery.
Presently, association rule mining is one of the most important tasks in data mining. It is considered a strong tool for market basket analysis, which aims to investigate the shopping behavior of customers in the hope of finding regularities [1]. In finding association rules, one tries to find groups of items that are frequently sold together, in order to infer the presence of items from the presence of other items in the customer's shopping cart. For instance, an association rule may state that "80% of customers who buy diaper and ice also buy cereal".
This kind of information may be beneficial and can be used for strategic decisions like item shelving, target marketing, sale promotions and discount strategies. Association rule mining is a valuable tool that has been used widely in various industries like supermarkets, mail ordering, telemarketing, insurance fraud detection, and many other applications where finding regularities is targeted.
The task of association rule mining over a market basket has been described in [1]. Formally, let D be a database of sales transactions, and let I = {i1, i2, …, im} be a set of binary literals called items. A transaction T in D contains a set of items called an itemset, such that T ⊆ I. Generally, the number of items in an itemset is called the length of the itemset, and itemsets of length k are denoted k-itemsets. Each itemset is associated with a statistical measure named support. The support of an itemset is the number of transactions in D that contain the itemset. An association rule is an expression X → Y, where X and Y are two itemsets and X ∩ Y = ∅. X is called the antecedent and Y is called the consequent of the association rule. An association rule X → Y has a measure of goodness named confidence, which can be defined as the probability that a transaction contains Y given that it contains X, and is given as support(X ∪ Y)/support(X).
Given the transactional database D, the association rule problem is to find all rules that have a support and confidence greater than certain user-specified thresholds, denoted minsupp and minconf, respectively.
The problem of generating all association rules from a transactional database can be decomposed into two subproblems [1].
 
Table 1: Transactional database

Transaction Id | Items                     | Time
I1             | bread, milk, juice        | 10:12
I2             | bread, juice, milk        | 12:13
I3             | milk, ice, bread, juice   | 13:22
I4             | bread, eggs, milk         | 13:26
I5             | ice, basket, bread, juice | 15:11
1. The generation of all itemsets with support greater than minsupp. These itemsets are called frequent itemsets; all other itemsets are called infrequent.
2. For each frequent itemset generated in step 1, the generation of all rules that pass the minconf threshold. For example, if the itemset XYZ is frequent, then we might evaluate the confidence of the rules XY → Z, XZ → Y and YZ → X.

For clarity, consider the database shown in Table 1, and let minsupp and minconf be 0.70 and 1.0, respectively. The frequent itemsets in Table 1 are {bread}, {milk}, {juice}, {bread, milk} and {bread, juice}. The association rules among these frequent itemsets that pass minconf are milk → bread and juice → bread.
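The Table 1 example can be checked mechanically. The following sketch (our illustration, not code from the paper) computes supports over the five transactions, keeps the 1- and 2-itemsets passing minsupp = 0.70, and tests both directions of each frequent 2-itemset against minconf = 1.0:

```python
from itertools import combinations

# Transactions from Table 1 (items only; the timestamps are irrelevant here).
transactions = [
    {"bread", "milk", "juice"},           # I1
    {"bread", "juice", "milk"},           # I2
    {"milk", "ice", "bread", "juice"},    # I3
    {"bread", "eggs", "milk"},            # I4
    {"ice", "basket", "bread", "juice"},  # I5
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

minsupp, minconf = 0.70, 1.0
items = sorted(set().union(*transactions))

# Enumerate all 1- and 2-itemsets and keep the frequent ones.
frequent = [frozenset(c)
            for k in (1, 2)
            for c in combinations(items, k)
            if support(frozenset(c)) >= minsupp]

# For each frequent 2-itemset, test both rule directions against minconf.
rules = []
for fs in (f for f in frequent if len(f) == 2):
    for x in sorted(fs):
        antecedent, consequent = frozenset([x]), fs - {x}
        if support(fs) / support(antecedent) >= minconf:
            rules.append((x, next(iter(consequent))))

print(rules)  # -> [('juice', 'bread'), ('milk', 'bread')]
```

With minconf = 1.0, only milk → bread and juice → bread survive; bread → milk and bread → juice fail because their confidence is 4/5 = 0.8.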
While the second step of association rule discovery, the generation of the rules, is a considerably straightforward problem given that the frequent itemsets and their supports are known [1, 2, 18, 23], the first step of finding frequent itemsets is a relatively resource-consuming problem that requires extensive computation and large resource capacity, especially if the size of the database and the itemsets are large [1, 28, 4]. Generally, for m distinct items in a customer transaction database D, there are 2^m possible itemsets. Consider, for example, a grocery store that carries 2100 distinct items. Then there are 2^2100 possible different combinations of potential frequent itemsets, known as candidate itemsets, some of which do not appear even once in the database; thus, usually only a small subset of this large number of candidate itemsets is frequent. This problem has been extensively investigated in the last decade for the purpose of improving the performance of candidate itemset generation [4, 28, 17, 23, 25, 40]. In this paper, we only consider a number of well-known association rule mining algorithms that contributed performance improvements to the first step of the mining process. The second step is not considered in this paper.
One of the first algorithms with significant improvements over previous association rule algorithms is the Apriori algorithm [2]. The Apriori algorithm introduced a key property named the "downward closure" of the support, which states that if an itemset passes minsupp then all of its subsets must also pass minsupp. This means that any subset of a frequent itemset has to be frequent; conversely, any superset of an infrequent itemset must be infrequent. Most of the classic association rule algorithms developed after the Apriori algorithm, such as [28, 4], have used this property in the first step of association rule discovery. Those algorithms are referred to as Apriori-like algorithms or techniques.
Apriori-like techniques such as [28, 4, 25] can successfully achieve a good level of performance whenever the size of the candidate itemsets is small. However, in circumstances with large candidate itemset sizes, a low minimum support threshold and long patterns, these techniques may still suffer from the following costs [17]:
 
- Holding a large number of candidate itemsets. For instance, to find a frequent itemset of size 50, one needs to derive more than 2^50 candidate itemsets. This is significantly costly in runtime and memory usage, regardless of the implementation method in use.
- Passing over the database multiple times to check a large number of candidate itemsets by pattern matching. Apriori-like algorithms require a complete pass over the database to find candidate items at each level. Thus, to find potential candidate itemsets of size n+1, a merge of all possible combinations of frequent itemsets of size n and a complete scan of the database to update the occurrence frequencies of candidate itemsets of size n+1 are performed. Repeatedly scanning the database at each level is significantly costly in processing time.
- Rare items with high confidence and low support in the database are basically ignored.
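The level-wise search that Apriori-like techniques share can be sketched as follows. This is a minimal illustration of the join, prune and scan steps assuming set-valued transactions; it is not an optimized implementation, and it deliberately exhibits the one-scan-per-level behavior criticized above:

```python
from itertools import combinations

def apriori(transactions, minsupp):
    """Level-wise frequent-itemset mining with downward-closure pruning."""
    n = len(transactions)

    def is_frequent(itemset):
        return sum(itemset <= t for t in transactions) / n >= minsupp

    # Level 1: frequent single items.
    items = set().union(*transactions)
    level = {frozenset([i]) for i in items if is_frequent(frozenset([i]))}
    frequent, k = set(level), 1
    while level:
        # Join step: merge frequent k-itemsets that differ in exactly one item.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step (downward closure): drop any candidate that has an
        # infrequent k-subset, before touching the database again.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # One complete database scan per level counts the surviving candidates.
        level = {c for c in candidates if is_frequent(c)}
        frequent |= level
        k += 1
    return frequent

# The Table 1 transactions with minsupp = 0.70 yield exactly five itemsets.
table1 = [
    {"bread", "milk", "juice"},
    {"bread", "juice", "milk"},
    {"milk", "ice", "bread", "juice"},
    {"bread", "eggs", "milk"},
    {"ice", "basket", "bread", "juice"},
]
result = apriori(table1, 0.70)
```

Note how the prune step removes {bread, milk, juice} without a database scan: its subset {milk, juice} is infrequent, so by downward closure the candidate cannot be frequent.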
3. CLASSIFICATION IN DATA MINING
 
3.1 Literature Review
Classification is presently considered one of the most common data mining tasks [14, 24, 30, 39]. Classifying real-world instances is something everyone practices throughout life: one can classify human beings based on their race, or categorize products in a supermarket based on consumers' shopping choices. In general, classification involves examining the features of a new object and assigning it to one of a predefined set of classes [38]. Given a collection of records in a data set, each record consists of a group of attributes, one of which is the class. The goal of classification is to build a model from classified objects in order to classify previously unseen objects as accurately as possible.
There are many classification approaches for extracting knowledge from data, such as divide-and-conquer [31], separate-and-conquer [15], covering and statistical approaches [24, 6]. The divide-and-conquer approach starts by selecting an attribute as a root node and then makes a branch for each possible value of that attribute. This splits the training instances into subsets, one for each possible value of the attribute. The same process is repeated until all instances that fall in one branch have the same classification or the remaining instances cannot be split any further. The separate-and-conquer approach, on the other hand, builds up rules in a greedy fashion, one by one. After a rule is found, all instances covered by the rule are deleted. The same process is repeated until the best rule found has a large error rate. Statistical approaches such as Naïve Bayes [21] use probabilistic measures, i.e., likelihood, to classify test objects. Finally, the covering approach [6] selects each of the available classes in turn and looks for a way of covering most of the training objects of that class, in order to come up with maximum-accuracy rules.
Numerous algorithms have been derived from these approaches, such as decision trees [32, 30], PART [14], RIPPER [7] and Prism [6]. While single-label classification, which assigns each rule in the classifier to the most obvious label, has been widely studied [30, 14, 7, 6, 19, 21], little work has been done on multi-label classification. Most of the previous research work to date on multi-label classification is related to text categorization [20]. In this paper, only traditional classification algorithms that generate rules with a single class are considered.
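The separate-and-conquer loop described above can be sketched with single-condition rules. This is a toy illustration of ours; the attribute names, the data and the min_acc stopping threshold are assumptions, not taken from [15]:

```python
def learn_rules(data, min_acc=0.9):
    """Separate-and-conquer: data is a list of (features: dict, label)."""
    rules, remaining = [], list(data)
    while remaining:
        best, best_score = None, (0.0, 0)
        # Candidate rules: every single condition (attr = value) -> label
        # taken from a remaining instance; score by accuracy, then coverage.
        for feats, label in remaining:
            for attr, value in feats.items():
                covered = [l for f, l in remaining if f.get(attr) == value]
                correct = sum(l == label for l in covered)
                score = (correct / len(covered), len(covered))
                if score > best_score:
                    best, best_score = (attr, value, label), score
        # Stop once the best rule's error rate grows too large.
        if best is None or best_score[0] < min_acc:
            break
        rules.append(best)
        attr, value, _ = best
        # "Separate": delete every instance the new rule covers.
        remaining = [(f, l) for f, l in remaining if f.get(attr) != value]
    return rules

# Hypothetical toy data (not from the paper's testbed).
toy = [
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"outlook": "rainy"}, "yes"),
]
rules = learn_rules(toy)
```

Each iteration "conquers" the highest-accuracy rule and "separates" out the instances it covers, exactly the greedy one-rule-at-a-time behavior the paragraph describes.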
3.2 The Classification Problem
Most of the research conducted on classification in data mining has been devoted to single-label problems. A traditional classification problem can be defined as follows: let D denote the domain of possible training instances and Y be a list of class labels, and let H denote the set of classifiers mapping D to Y. Each instance d ∈ D is assigned a single class y that belongs to Y. The goal is to find a classifier h ∈ H that maximizes the probability that h(d) = y for each test case (d, y). In multi-label problems, however, each instance d ∈ D can be assigned multiple labels y1, y2, …, yk with each yi ∈ Y, and is represented as a pair (d, (y1, y2, …, yk)), where (y1, y2, …, yk) is a list of ranked class labels from Y associated with the instance d in the training data. In this work, we only consider the traditional single-label classification problem.
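As a concrete instance of finding a classifier h with high training accuracy, here is a minimal sketch of the OneR idea mentioned among the paper's findings (our illustration; the toy data and attribute names are invented): pick the single attribute whose value-to-majority-class table makes the fewest errors on the training data.

```python
from collections import Counter, defaultdict

def one_r(train):
    """OneR sketch: train is a list of (features: dict, label).
    Returns (best_attribute, value -> predicted class map)."""
    best = None  # (correct_count, attribute, value_map)
    for attr in train[0][0]:
        by_value = defaultdict(Counter)
        for feats, label in train:
            by_value[feats[attr]][label] += 1
        # Majority label per attribute value, and how many instances it fits.
        value_map = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(c[value_map[v]] for v, c in by_value.items())
        if best is None or correct > best[0]:
            best = (correct, attr, value_map)
    _, attr, value_map = best
    return attr, value_map

def classify(model, feats):
    """h(d) = y: look up the instance's value of the chosen attribute."""
    attr, value_map = model
    return value_map.get(feats[attr])

# Hypothetical toy training set (attribute names are ours, not the paper's).
train = [
    ({"outlook": "sunny",    "windy": "yes"}, "no"),
    ({"outlook": "sunny",    "windy": "no"},  "no"),
    ({"outlook": "overcast", "windy": "yes"}, "yes"),
    ({"outlook": "overcast", "windy": "no"},  "yes"),
    ({"outlook": "rainy",    "windy": "yes"}, "yes"),
]
model = one_r(train)
```

The single-rule structure is what makes OneR so fast to build, consistent with the timing finding reported earlier in this paper.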
4. ASSOCIATIVE CLASSIFICATION
Generally, in association rule mining, any item that passes minsupp is known as a frequent itemset. If the frequent itemset consists of only a single attribute value, it is said to be a frequent one-item. For example, with minsupp = 20%, the frequent one-items in Table 4 are <(AT1, z1)>, <(AT1, z2)>, <(AT2, w1)>, <(AT2, w2)> and <(AT2, w3)>. Current associative classification techniques generate frequent items by making more than one scan over the training data set. In the first scan, they find the support of one-items, and then in each subsequent scan they start with the items found to be frequent in the previous scan in order to produce new possible frequent items involving more attribute values. In other words, frequent single items are used for the discovery of frequent two-items, frequent two-items are input for the discovery of frequent three-items, and so on.
Once the frequent items have been discovered, classification-based-on-association-rules algorithms extract a complete set of class association rules (CARs) for those frequent items that pass minconf.
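The one-item CAR extraction described above can be sketched as follows. This is a hypothetical illustration: the rows, thresholds and the restriction to single-condition rules are our assumptions, echoing the <(AT1, z1)> notation rather than reproducing MMAC itself.

```python
from collections import Counter

# Hypothetical training rows: (attribute, value) pairs plus a class label;
# AT1/AT2 follow the paper's notation, but the values and classes are made up.
rows = [
    ({"AT1": "z1", "AT2": "w1"}, "c1"),
    ({"AT1": "z1", "AT2": "w2"}, "c1"),
    ({"AT1": "z2", "AT2": "w1"}, "c2"),
    ({"AT1": "z1", "AT2": "w1"}, "c1"),
    ({"AT1": "z2", "AT2": "w3"}, "c2"),
]

def cars(rows, minsupp=0.20, minconf=0.80):
    """Extract single-condition class association rules (CARs): keep
    item -> class when the item passes minsupp and the rule passes minconf."""
    n = len(rows)
    item_count, pair_count = Counter(), Counter()
    for feats, label in rows:          # one scan finds all one-items
        for item in feats.items():
            item_count[item] += 1
            pair_count[(item, label)] += 1
    rules = []
    for (item, label), k in pair_count.items():
        if item_count[item] / n >= minsupp and k / item_count[item] >= minconf:
            rules.append((item, label, k / item_count[item]))
    return rules

result = cars(rows)
```

In this toy data, (AT2, w1) is a frequent one-item but yields no CAR: it appears with both classes, so neither direction reaches minconf.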
  
