You are on page 1of 5

Market basket analysis based on

text segmentation and association rule mining


XIE Wen-xiu
School of Information Engineering
Zhejiang Agriculture and Forestry
Unversity
Hangzhou, China
zjfcdqxxxwx@126.com
Qi Heng-nian
School of Information Engineering
Zhejiang Agriculture and Forestry
Unversity
Hangzhou, China
qihengnian@yahoo.com.cn
Huang Mei-li
School of Information Engineering
Zhejiang Agriculture and Forestry
Unversity
Hangzhou, China
meilihuang@zafu.edu.cn
AbstractMarket basket analysis helps to provide scientific
decision support for retail market by mining association rules
among items people purchased together. In this paper, we
propose an innovative market basket analysis method by
mining association rules on the items internal characteristics
which are obtained by using automatic words segmentation
technology. The method has been applied to a dynamic dishes
recommend system and validated by the experimental results.
Keywords-Market basket analysis; Text segmentation;
Association rule;
I. INTRODUCTION
Market basket analysis is a good way to provide
scientific decision support for retail market by mining
association relationships among items people purchased
together[1]. Mining results may not be interesting to people
since they dont show semantic associations among items.
However, the consequence may be different if association
rules are mined on items internal characteristics. In this
paper, we propose an innovative market basket analysis
method by mining association rules on items internal
characteristics which are obtained by using automatic words
segmentation technology.
The rest of this paper is organized as follows. Section
II reviews the general architecture of the market basket
analysis system. Section III describes the proposed method
to get association relationships between items themselves by
using item characteristic association rules. Section IV
demonstrates the effectiveness of the method through three
different experiments. Section V concludes the paper.
II THE MARKET BASKET ANALYSIS SYSTEM
ARCHITECTURE
Generally, the historical retail database is huge and the
time complexity of the association rule mining is high.
Therefore, real-time data mining is unpractical and it is
much better to establish the market basket analysis system
by applying the Client/Server architecture (Figure 1). The
roles of client and server in the architecture are as follows.
Figure 1. The Market Basket Analysis System Architecture
Server: generate characteristics automatically for each
item by using word segmentation technology; mine item
characteristic association rules and store the set of rules in
database.
Client: receive a set (sequence) of items and read the
item characteristic association rules; return a set of items
2010 First International Conference on Networking and Distributed Computing
978-0-7695-4207-2/10 $26.00 2010 IEEE
DOI 10.1109/ICNDC.2010.67
309
2010 First International Conference on Networking and Distributed Computing
978-0-7695-4207-2/10 $26.00 2010 IEEE
DOI 10.1109/ICNDC.2010.67
309
which is strongly associated to the received set (sequence).
A typical use of this market basket analysis system is
that in a restaurant equipped with electronic ordering system,
consumers can get scientific recommendations after each
step of ordering dishes with this system.
III. ASSOCIATION RULE ANALYSIS
After having mined the item characteristic association
rules, we introduce methods of how to find a set of items
that is strongly associated to a certain set (sequence) of
items when a client receives a request.
A. On Condition That the Request Format is a Set of
Items
Because the request formats (set or sequence) received
by clients are different, the method used to calculate a set of
items which is strongly associated to the request is different
as well. While the request format is a set of items, it will be
much simple.
Suppose A is a set of items described as follows:{
(Hongshaoniurougaijiaofan)}, and the item
characteristic association rules are:
{}== >{} [sup = 0.1; conf= 0.8]
{}==>{} [sup= 0.15; conf= 0.6]
The corresponding meaning for the second association
rule is: Among all the consumers who ordered dishes with
(Niurou) and (Hongshao), 60% of them ordered
dishes with(Jidan) as well. This is supported by 15% of
the whole trading records. The first rule can be rendered
similarly.
Because item (Jidantang) contains
characteristics (Tang) and (Jidan) which are
strongly associated to the characteristics of items in set A,
we can make a qualitative decision that item
(Jidantang) is strongly associated to the given set.
How to determine another set of items B to make sure
that every item in set B is strongly associated to set A when
market basket analysis system receives a set of items A?
Obviously, a quantitative method to measure the strength of
association relationship between set of items A and any
other item is needed.
A new concept is introduced for this purpose: Similarity
(g,A,p). Here, g represents a certain item, and A represents a
certain set (sequence) of items, p represents a certain
association rule. Similarity (g,A,p) is used to measure the
similarity between the process of recommend p.second
according to p.first and the process of recommend item g
according to a set (sequence) of items A quantitatively.
Similarity(g,A,p) is calculated as Equation(1):
Similarity(g,A,p) = (|p.first G characteristics in A|
* |p.second G characteristics in g|)
/ (|p.first| * |p.second|). (1)
To any set of items A and any item g which is excluded
from set A, the strength of association relationship between
g and A can be defined as function Elem_t Relation(A,g),
the pseudo code is:
Elem_t Strength_set(A, g){
Elem_t ret = 0;
For_each(p in P)
ret=ret+p.strength*Similarity(g,A,p);
return ret;
}
Here is the description of variable names and symbols:
P: a set of item characteristic association rules;
p: an item characteristic association rule;
p.first: antecedents of association rule p;
p.second: consequent of association rule p;
p.strength: strength of rule p, generally, represented by
the confidence of the rule.
B On Condition That the Request Format is a Sequence
of Items
When a client receives a sequence of items S, S can be
considered as a sequence of items a consumer purchased.
What a market basket analysis system should do is to
forecast consumers purchasing behavior and give scientific
suggestions.
Suppose S is a sequence of items, Snext is a new
sequence which is generated by inserting a new item in the
end of S. What the system should do next is to obtain a new
set of items B that is strongly associated to sequence Snext.
Define a sequence Spre, of which S is generated by
inserting a new item in the end. Define two sets of
association rules C and D. C is made up of those association
310 310
rules which have at least one characteristic in antecedent
that exists in sequence Spre. D is made up of those
association rules which have at least one characteristic in
antecedent that exists in sequence S. It is obvious that C is a
subset of D.
Consider following two calculating processes:
Process 1: Client receives a sequence of items Spre, set
of association rules C plays a role in the process of
generating set of items B.
Process 2: Client receives a sequence of items S, set of
association rules D plays a role in the process of generating
set of items B.
In the second process, set D-C is more important than
set C because set C has been considered in process 1.
Another concept is introduced: Weight (p,S). For a
certain sequence S and two different association rules p1
and p2, Weight(p1,S) > Weight(p2,S) is true when and
only when p1 is more important in the process of generating
set B.
To calculate Weight (p,S), a function f(x) which is
monotonically increasing in natural numbers is necessary.
Typically, f(x) can be defined as f(x) = x * x * x.
Weight(p,S) can be calculated as follows:
Elem_t Weight(p,S){
Elem_t sum = 0, t = 0;
for_each(attribute in p.first){
for (i = 1; i <= S.size(); i++){
if (attribute is in S[i]){
sum+=i; t++;
}
}
}
if (time==0) return 0;
return f(sum / time);
}
Similarly, when the request format is sequence of
items, to any sequence of items S and any item g which is
excluded from sequence S, the strength of association
relationship between g and S can be defined as function
Elem_t Strength_sequence(A,g), the pseudo code is shown
as follows:
Elem_t Strength_sequence(A,g){
Elem_t ret = 0;
For_each(p in P)
ret = ret + p.strength * Weight(p) *
Similarity(g,S,p);
return ret;
}
So far, we can use a quantitative way to measure the
strength of association relationship between any item g and
a set (sequence) of items S. Set of items B can be obtained
by selecting items which are highly associated to the set of
items A(sequence of items S) according to the practical
situation. It is regarded that there is certain association
relationship between set of items A (sequence of items S)
and set of items B.
IV EXPERIMENTS AND ANALYSIS OF THE
RESULTS
Three groups of data mining experiments were done
on a given huge historical sales database which is obtained
from a restaurant with 12613 trading records. We set
minimum support to 0.01 and minimum confidence to 0.1.
Results of the experiments are analyzed in the end of this
section.
A. First Experiment
The object of data mining in this experiment is the
original transaction database which contains 722 kinds of
goods and 63425 sale records including large numbers of
fruits and drinks. 119 association rules are obtained in this
experiment.
Here are 5 association rules with highest confidence.
Format of the rules is: Antecedents of association rule,
Antecedents count ==> Consequent of association rule,
Consequent count [confidence of the rule].
1) - 148 ==> - 139
[conf=0.94]
2) - - 154 ==> -
128 [conf=0.83]
3) - 316 ==> - 258
[conf=0.82 ]
4) - - 204 ==>
311 311
- 166 [conf=0.81]
5) - - 163 ==> -
128 [conf=0.79]
The advantage of this experiment is that most of
association rules have high support and confidence.
The disadvantage is that most items in association rules
are fruits and drinks which policy makers and consumers
are not interested in. Moreover, the process of data mining
lasts too long.
Why are most items in the antecedents and consequents
of association rules not mainstream food? The reason is that
there are only limited kinds of fruits and drinks in database,
thus it is at a high probability that consumers buy a certain
set of fruits and drinks. This kind of biased data in database
seriously affects the quality of data mining.
Another question is how to save time in the process of
data mining. Time complexity of discovering association
rules in a frequent item-set is an exponential function whose
parameter is the rank of the frequent item-set. So removing
uninterested data can help to save time in the process of data
mining.
B. Second Experiment
The object of data mining in the second experiment is
the transaction database after the process of data cleaning in
which non-interest data is removed. In the cleaned database,
there are 377 kinds of goods and 30156 sale records. 71
association rules are discovered in the process of data
mining on cleaned database.
Here are 5 association rules with highest confidence:
1) 66 ==> 24 [conf=0.36]
2) 90 ==> 29 [conf=0.32]
3) 83 ==> 24
[conf=0.29]
4) 145 ==> 38 [conf=0.26]
5) 135 ==> 34 [conf=0.25]
The advantage of this experiment is that most items in
data mining result are mainstream food. Time cost of the
data mining is greatly reduced compared to the first
experiment.
The disadvantage is that only a small part of the whole
items exist in data mining result. It is hard to get a
reasonable explanation for association rules.
A question raised from the experiment is whether there
are any similar association rules. Support and confidence of
most association rules in this experiment are both low
because of similar food in menu. Different rules may reflect
same features in consumers buying behavior, e.g., rule
115 ==> 28 [conf=0.24] and rule
166 ==> 32 [conf=0.19]. One feature
they both reflect is if one consumer choose goods (Yu),
the consumer will probably choose (Sijidou) next.
There are some association rules with high confidence but
low support which can not be obtained by association rule
mining algorithm but show similar feature in consumers
buying behavior which obtained rules reflect.
Item characteristic association rules such as whoever
likes hot food may also like soup can not be obtained in
traditional market basket analysis method. This kind of rules
will be obtained directly in the third experiment.
C. Third experiment
In this experiment, characteristics of each item are
generated by taking ICTCLAS as the tool of word
segmentation and taking catering lexicon together with daily
words as segmentation lexicon before mining association
rules on the items internal characteristics[5].
Before the process of data mining, the trading record
which is in the format of a set of items must be transformed
into the format of a set of items characteristics. A piece of
trading record is represented as { ID1,{G1G2Gm}},
when ID1 is record number,{G1G2Gm}is the set
of items people purchased together. Item Gi has ni
characteristics, the set of characteristics belong to Gi is
{Gi-1Gi-2Gi-ni}. After be transformed into a set
of items characteristic, the trading record is represented as
{ ID1,{ G1-1G1-2G1-n1,G2-1G2-2
G2-n2G3-1Gm-1Gm-2Gm-nm}}. If
there are same characteristics in the transformed record,
only one of them is retained.
There are 1873 pieces of association rules obtained in
the third experiment. Some of the prominent item
characteristic association rules are listed as follows:
1) 277 ==> 231 [conf=0.83]
312 312
2) 533 ==> 389 [conf=0.73]
3) 473 ==> 339 [conf=0.72]
4) 348 ==> 246 [conf=0.71]
5) 317 ==> 224 [conf=0.71]
The advantage of this experiment is that the number of
association rules is huge; most items have at least one
characteristic which is in the experiment result. Association
rules which can be explained easily can reflect the essence
of customers buying behavior.
The disadvantage is that if the name of item can not
reflect its characteristics, then the method is useless, e.g.
(Mayishangshu).
D. Experiment Conclusion
Experiment results were shown in Table 1.
TABLE 1. EXPERIMENT RESULTS
First Second Third
Number of association
rules
119 71 1873
Average confidence(all) 0.48 0.23 0.41
Average support(all) 0.29 0.03 0.16
Average confidence
(up 30%)
0.69 0.37 0.50
Average support
(up 30%)
0.51 0.07 0.19
Number of items in rules 82 127 349
Apriori 57 m 19 m 16 m
Time cost
FP_Growth 9 m 2.4 m 2 m
Association rule mining method used in the third
experiment can obtain association rules with higher support
and confidence. With same minimum support and
confidence, the third experiment results in much more
association rules than the first two experiments with less
time cost. If the time cost in the process of word
segmentation is not considered, time cost in the third
experiment is the least. The experiment results show that a
large number of association rules can be obtained with the
innovative market basket analysis method. In the cases
where the traditional analysis method cannot obtain enough
association rules that policy makers are interested in, the
innovative market basket analysis method can provide more
scientific decision support for retail market and consumers
buying behavior.
. CONCLUSIONS
We know that in some cases, market basket analysis
does not provide useful information if data-item of the
market basket analysis is the name of goods. In this paper,
we proposed a new market basket analysis method which
integrates words segmentation technology and association
rule mining technology. Characteristics of items can be
generated automatically before mining association rules by
using word segmentation technology. This method has been
applied to a restaurant equipped with electronic ordering
system to give recommendations to customers, where the
experiments were done. The experiment results show that
the method is efficient and valid.
REFERENCES
[1] J. Han and M. Kamber, Data mining concepts and techniques(2
nd
ed.), Morgan Kaufmann, pp. 227-275,2006.
[2] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules
between sets of items in large databases, In Proc. 1993
ACM-SIGMOD Int. Conf. Management of Data(SIGMOD 93), ACM
Press, May 1993, pp. 207-216.
[3] R. Agrawal and R. Srikant, Fast algorithms for mining association
rules, In proc. 1994 Int. Conf. Very Large Data Bases (VLDB 94),
ACM Press, Sept. 1994, pp. 487-499.
[4] J. Han, J. Pei, and Y. Yin, Mining frequent patterns without
candidate generation, In Proc. 2000 ACM-SIGMOD Int. Conf.
Management of Data(SIGMOD 00), ACM Press, May 2000, pp. 1-12.
[5] htt:`t'u:.o_
[6] H. Zhang, H. Yu, and D Xiong, HHMM-based Chinese lexical
analyzer ICTCLAS, In Proc. Second of SIGHAN Workshop on
Chinese Language Processing(ACL 03), ACL press, July 2003,
pp.184-187, doi:10.3115/1119250.1119280.
[7] H. Zhang and Q. Liu, Chinese Lexical Analysis Using Hierarchical
Hidden Markov Model, In Proc. Second of SIGHAN Workshop on
Chinese Language Processing(ACL 03), ACL press, July 2003,
pp.63-70.
313 313

You might also like