In this paper, we propose an innovative market basket analysis method by mining association rules on the items' internal characteristics. The method has been applied to a dynamic dishes recommend system and validated by the experimental results. Market basket analysis is a good way to provide scientific decision support for retail market by mining association relationships among items people purchased together.
In this paper, we propose an innovative market basket analysis method by mining association rules on the items' internal characteristics. The method has been applied to a dynamic dishes recommend system and validated by the experimental results. Market basket analysis is a good way to provide scientific decision support for retail market by mining association relationships among items people purchased together.
In this paper, we propose an innovative market basket analysis method by mining association rules on the items' internal characteristics. The method has been applied to a dynamic dishes recommend system and validated by the experimental results. Market basket analysis is a good way to provide scientific decision support for retail market by mining association relationships among items people purchased together.
XIE Wen-xiu School of Information Engineering Zhejiang Agriculture and Forestry Unversity Hangzhou, China zjfcdqxxxwx@126.com Qi Heng-nian School of Information Engineering Zhejiang Agriculture and Forestry Unversity Hangzhou, China qihengnian@yahoo.com.cn Huang Mei-li School of Information Engineering Zhejiang Agriculture and Forestry Unversity Hangzhou, China meilihuang@zafu.edu.cn AbstractMarket basket analysis helps to provide scientific decision support for retail market by mining association rules among items people purchased together. In this paper, we propose an innovative market basket analysis method by mining association rules on the items internal characteristics which are obtained by using automatic words segmentation technology. The method has been applied to a dynamic dishes recommend system and validated by the experimental results. Keywords-Market basket analysis; Text segmentation; Association rule; I. INTRODUCTION Market basket analysis is a good way to provide scientific decision support for retail market by mining association relationships among items people purchased together[1]. Mining results may not be interesting to people since they dont show semantic associations among items. However, the consequence may be different if association rules are mined on items internal characteristics. In this paper, we propose an innovative market basket analysis method by mining association rules on items internal characteristics which are obtained by using automatic words segmentation technology. The rest of this paper is organized as follows. Section II reviews the general architecture of the market basket analysis system. Section III describes the proposed method to get association relationships between items themselves by using item characteristic association rules. Section IV demonstrates the effectiveness of the method through three different experiments. Section V concludes the paper. II THE MARKET BASKET ANALYSIS SYSTEM ARCHITECTURE Generally, the historical retail database is huge and the time complexity of the association rule mining is high. Therefore, real-time data mining is unpractical and it is much better to establish the market basket analysis system by applying the Client/Server architecture (Figure 1). The roles of client and server in the architecture are as follows. Figure 1. The Market Basket Analysis System Architecture Server: generate characteristics automatically for each item by using word segmentation technology; mine item characteristic association rules and store the set of rules in database. Client: receive a set (sequence) of items and read the item characteristic association rules; return a set of items 2010 First International Conference on Networking and Distributed Computing 978-0-7695-4207-2/10 $26.00 2010 IEEE DOI 10.1109/ICNDC.2010.67 309 2010 First International Conference on Networking and Distributed Computing 978-0-7695-4207-2/10 $26.00 2010 IEEE DOI 10.1109/ICNDC.2010.67 309 which is strongly associated to the received set (sequence). A typical use of this market basket analysis system is that in a restaurant equipped with electronic ordering system, consumers can get scientific recommendations after each step of ordering dishes with this system. III. ASSOCIATION RULE ANALYSIS After having mined the item characteristic association rules, we introduce methods of how to find a set of items that is strongly associated to a certain set (sequence) of items when a client receives a request. A. On Condition That the Request Format is a Set of Items Because the request formats (set or sequence) received by clients are different, the method used to calculate a set of items which is strongly associated to the request is different as well. While the request format is a set of items, it will be much simple. Suppose A is a set of items described as follows:{ (Hongshaoniurougaijiaofan)}, and the item characteristic association rules are: {}== >{} [sup = 0.1; conf= 0.8] {}==>{} [sup= 0.15; conf= 0.6] The corresponding meaning for the second association rule is: Among all the consumers who ordered dishes with (Niurou) and (Hongshao), 60% of them ordered dishes with(Jidan) as well. This is supported by 15% of the whole trading records. The first rule can be rendered similarly. Because item (Jidantang) contains characteristics (Tang) and (Jidan) which are strongly associated to the characteristics of items in set A, we can make a qualitative decision that item (Jidantang) is strongly associated to the given set. How to determine another set of items B to make sure that every item in set B is strongly associated to set A when market basket analysis system receives a set of items A? Obviously, a quantitative method to measure the strength of association relationship between set of items A and any other item is needed. A new concept is introduced for this purpose: Similarity (g,A,p). Here, g represents a certain item, and A represents a certain set (sequence) of items, p represents a certain association rule. Similarity (g,A,p) is used to measure the similarity between the process of recommend p.second according to p.first and the process of recommend item g according to a set (sequence) of items A quantitatively. Similarity(g,A,p) is calculated as Equation(1): Similarity(g,A,p) = (|p.first G characteristics in A| * |p.second G characteristics in g|) / (|p.first| * |p.second|). (1) To any set of items A and any item g which is excluded from set A, the strength of association relationship between g and A can be defined as function Elem_t Relation(A,g), the pseudo code is: Elem_t Strength_set(A, g){ Elem_t ret = 0; For_each(p in P) ret=ret+p.strength*Similarity(g,A,p); return ret; } Here is the description of variable names and symbols: P: a set of item characteristic association rules; p: an item characteristic association rule; p.first: antecedents of association rule p; p.second: consequent of association rule p; p.strength: strength of rule p, generally, represented by the confidence of the rule. B On Condition That the Request Format is a Sequence of Items When a client receives a sequence of items S, S can be considered as a sequence of items a consumer purchased. What a market basket analysis system should do is to forecast consumers purchasing behavior and give scientific suggestions. Suppose S is a sequence of items, Snext is a new sequence which is generated by inserting a new item in the end of S. What the system should do next is to obtain a new set of items B that is strongly associated to sequence Snext. Define a sequence Spre, of which S is generated by inserting a new item in the end. Define two sets of association rules C and D. C is made up of those association 310 310 rules which have at least one characteristic in antecedent that exists in sequence Spre. D is made up of those association rules which have at least one characteristic in antecedent that exists in sequence S. It is obvious that C is a subset of D. Consider following two calculating processes: Process 1: Client receives a sequence of items Spre, set of association rules C plays a role in the process of generating set of items B. Process 2: Client receives a sequence of items S, set of association rules D plays a role in the process of generating set of items B. In the second process, set D-C is more important than set C because set C has been considered in process 1. Another concept is introduced: Weight (p,S). For a certain sequence S and two different association rules p1 and p2, Weight(p1,S) > Weight(p2,S) is true when and only when p1 is more important in the process of generating set B. To calculate Weight (p,S), a function f(x) which is monotonically increasing in natural numbers is necessary. Typically, f(x) can be defined as f(x) = x * x * x. Weight(p,S) can be calculated as follows: Elem_t Weight(p,S){ Elem_t sum = 0, t = 0; for_each(attribute in p.first){ for (i = 1; i <= S.size(); i++){ if (attribute is in S[i]){ sum+=i; t++; } } } if (time==0) return 0; return f(sum / time); } Similarly, when the request format is sequence of items, to any sequence of items S and any item g which is excluded from sequence S, the strength of association relationship between g and S can be defined as function Elem_t Strength_sequence(A,g), the pseudo code is shown as follows: Elem_t Strength_sequence(A,g){ Elem_t ret = 0; For_each(p in P) ret = ret + p.strength * Weight(p) * Similarity(g,S,p); return ret; } So far, we can use a quantitative way to measure the strength of association relationship between any item g and a set (sequence) of items S. Set of items B can be obtained by selecting items which are highly associated to the set of items A(sequence of items S) according to the practical situation. It is regarded that there is certain association relationship between set of items A (sequence of items S) and set of items B. IV EXPERIMENTS AND ANALYSIS OF THE RESULTS Three groups of data mining experiments were done on a given huge historical sales database which is obtained from a restaurant with 12613 trading records. We set minimum support to 0.01 and minimum confidence to 0.1. Results of the experiments are analyzed in the end of this section. A. First Experiment The object of data mining in this experiment is the original transaction database which contains 722 kinds of goods and 63425 sale records including large numbers of fruits and drinks. 119 association rules are obtained in this experiment. Here are 5 association rules with highest confidence. Format of the rules is: Antecedents of association rule, Antecedents count ==> Consequent of association rule, Consequent count [confidence of the rule]. 1) - 148 ==> - 139 [conf=0.94] 2) - - 154 ==> - 128 [conf=0.83] 3) - 316 ==> - 258 [conf=0.82 ] 4) - - 204 ==> 311 311 - 166 [conf=0.81] 5) - - 163 ==> - 128 [conf=0.79] The advantage of this experiment is that most of association rules have high support and confidence. The disadvantage is that most items in association rules are fruits and drinks which policy makers and consumers are not interested in. Moreover, the process of data mining lasts too long. Why are most items in the antecedents and consequents of association rules not mainstream food? The reason is that there are only limited kinds of fruits and drinks in database, thus it is at a high probability that consumers buy a certain set of fruits and drinks. This kind of biased data in database seriously affects the quality of data mining. Another question is how to save time in the process of data mining. Time complexity of discovering association rules in a frequent item-set is an exponential function whose parameter is the rank of the frequent item-set. So removing uninterested data can help to save time in the process of data mining. B. Second Experiment The object of data mining in the second experiment is the transaction database after the process of data cleaning in which non-interest data is removed. In the cleaned database, there are 377 kinds of goods and 30156 sale records. 71 association rules are discovered in the process of data mining on cleaned database. Here are 5 association rules with highest confidence: 1) 66 ==> 24 [conf=0.36] 2) 90 ==> 29 [conf=0.32] 3) 83 ==> 24 [conf=0.29] 4) 145 ==> 38 [conf=0.26] 5) 135 ==> 34 [conf=0.25] The advantage of this experiment is that most items in data mining result are mainstream food. Time cost of the data mining is greatly reduced compared to the first experiment. The disadvantage is that only a small part of the whole items exist in data mining result. It is hard to get a reasonable explanation for association rules. A question raised from the experiment is whether there are any similar association rules. Support and confidence of most association rules in this experiment are both low because of similar food in menu. Different rules may reflect same features in consumers buying behavior, e.g., rule 115 ==> 28 [conf=0.24] and rule 166 ==> 32 [conf=0.19]. One feature they both reflect is if one consumer choose goods (Yu), the consumer will probably choose (Sijidou) next. There are some association rules with high confidence but low support which can not be obtained by association rule mining algorithm but show similar feature in consumers buying behavior which obtained rules reflect. Item characteristic association rules such as whoever likes hot food may also like soup can not be obtained in traditional market basket analysis method. This kind of rules will be obtained directly in the third experiment. C. Third experiment In this experiment, characteristics of each item are generated by taking ICTCLAS as the tool of word segmentation and taking catering lexicon together with daily words as segmentation lexicon before mining association rules on the items internal characteristics[5]. Before the process of data mining, the trading record which is in the format of a set of items must be transformed into the format of a set of items characteristics. A piece of trading record is represented as { ID1,{G1G2Gm}}, when ID1 is record number,{G1G2Gm}is the set of items people purchased together. Item Gi has ni characteristics, the set of characteristics belong to Gi is {Gi-1Gi-2Gi-ni}. After be transformed into a set of items characteristic, the trading record is represented as { ID1,{ G1-1G1-2G1-n1,G2-1G2-2 G2-n2G3-1Gm-1Gm-2Gm-nm}}. If there are same characteristics in the transformed record, only one of them is retained. There are 1873 pieces of association rules obtained in the third experiment. Some of the prominent item characteristic association rules are listed as follows: 1) 277 ==> 231 [conf=0.83] 312 312 2) 533 ==> 389 [conf=0.73] 3) 473 ==> 339 [conf=0.72] 4) 348 ==> 246 [conf=0.71] 5) 317 ==> 224 [conf=0.71] The advantage of this experiment is that the number of association rules is huge; most items have at least one characteristic which is in the experiment result. Association rules which can be explained easily can reflect the essence of customers buying behavior. The disadvantage is that if the name of item can not reflect its characteristics, then the method is useless, e.g. (Mayishangshu). D. Experiment Conclusion Experiment results were shown in Table 1. TABLE 1. EXPERIMENT RESULTS First Second Third Number of association rules 119 71 1873 Average confidence(all) 0.48 0.23 0.41 Average support(all) 0.29 0.03 0.16 Average confidence (up 30%) 0.69 0.37 0.50 Average support (up 30%) 0.51 0.07 0.19 Number of items in rules 82 127 349 Apriori 57 m 19 m 16 m Time cost FP_Growth 9 m 2.4 m 2 m Association rule mining method used in the third experiment can obtain association rules with higher support and confidence. With same minimum support and confidence, the third experiment results in much more association rules than the first two experiments with less time cost. If the time cost in the process of word segmentation is not considered, time cost in the third experiment is the least. The experiment results show that a large number of association rules can be obtained with the innovative market basket analysis method. In the cases where the traditional analysis method cannot obtain enough association rules that policy makers are interested in, the innovative market basket analysis method can provide more scientific decision support for retail market and consumers buying behavior. . CONCLUSIONS We know that in some cases, market basket analysis does not provide useful information if data-item of the market basket analysis is the name of goods. In this paper, we proposed a new market basket analysis method which integrates words segmentation technology and association rule mining technology. Characteristics of items can be generated automatically before mining association rules by using word segmentation technology. This method has been applied to a restaurant equipped with electronic ordering system to give recommendations to customers, where the experiments were done. The experiment results show that the method is efficient and valid. REFERENCES [1] J. Han and M. Kamber, Data mining concepts and techniques(2 nd ed.), Morgan Kaufmann, pp. 227-275,2006. [2] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data(SIGMOD 93), ACM Press, May 1993, pp. 207-216. [3] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, In proc. 1994 Int. Conf. Very Large Data Bases (VLDB 94), ACM Press, Sept. 1994, pp. 487-499. [4] J. Han, J. Pei, and Y. Yin, Mining frequent patterns without candidate generation, In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data(SIGMOD 00), ACM Press, May 2000, pp. 1-12. [5] htt:`t'u:.o_ [6] H. Zhang, H. Yu, and D Xiong, HHMM-based Chinese lexical analyzer ICTCLAS, In Proc. Second of SIGHAN Workshop on Chinese Language Processing(ACL 03), ACL press, July 2003, pp.184-187, doi:10.3115/1119250.1119280. [7] H. Zhang and Q. Liu, Chinese Lexical Analysis Using Hierarchical Hidden Markov Model, In Proc. Second of SIGHAN Workshop on Chinese Language Processing(ACL 03), ACL press, July 2003, pp.63-70. 313 313