
The Apriori Algorithm for Mining Association Rules
Nguyễn Văn Chức - chucnv@ud.edu.vn

1. Association rules in data mining

In data mining, the purpose of association rules (AR) is to find relationships among items in large volumes of data. The basic ideas are summarized below.

Let T = {t1, t2, ..., tn} be a set of transactions; T is called the transaction database. Let I = {i1, i2, ..., im} be the set of all items. Each transaction ti is a set of items drawn from I, called an itemset. An itemset containing k items is called a k-itemset.

The purpose of association rules is to find associations or correlations among items. An association rule has the form X => Y, where X and Y are itemsets. In basket analysis, the rule X => Y can be read as: people who buy the items in X also tend to buy the items in Y. For example, if X = {Apple, Banana}, Y = {Cherry, Durian} and the rule X => Y holds, we can say that people who buy Apple and Banana also tend to buy Cherry and Durian.

From a statistical point of view, X is regarded as the independent variable and Y as the dependent variable.

Support and confidence are the two parameters used to measure an association rule.

The support of the rule X => Y is the fraction of transactions that contain all the items of both X and Y. For example, a support of 5% for the rule X => Y means that X and Y are bought together in 5% of the transactions. The support of the rule X => Y is computed as:

$$\mathrm{support}(X \Rightarrow Y) = P(X \cup Y) = \frac{n(X \cup Y)}{N}$$

where n(X ∪ Y) is the number of transactions containing every item of X and Y, and N is the total number of transactions.

The confidence of the rule X => Y is the probability of Y given X. For example, a confidence of 80% for the rule {Apple} => {Banana} means that 80% of the customers who buy Apple also buy Banana. The confidence of the rule X => Y is the conditional probability of Y given X:

$$\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{n(X \cup Y)}{n(X)}$$

where n(X) is the number of transactions containing X.

To select association rules, two criteria are usually applied: minimum support (min_sup) and minimum confidence (min_conf). Rules whose support and confidence are both greater than or equal to min_sup and min_conf respectively are called strong rules. Minimum support and minimum confidence are threshold values and must be fixed before the rules are generated.

An itemset whose frequency of occurrence is >= min_sup is called a frequent itemset.

Some types of association rules

- Binary association rules: Apple => Banana
- Quantitative association rules: weight in [70kg, 90kg] => height in [170cm, 190cm]
- Fuzzy association rules: weight in HEAVY => height in TALL

The most widely used algorithm for finding association rules is Apriori, which works with binary association rules.
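As a quick illustration of the support and confidence measures defined in this section, here is a minimal Python sketch. The transactions and the helper names (`count`, `support`, `confidence`) are made up for this example and do not come from the original text.

```python
# Hypothetical toy transactions, assumed for illustration only.
transactions = [
    {"Apple", "Banana", "Cherry"},
    {"Apple", "Banana"},
    {"Apple", "Durian"},
    {"Banana", "Cherry"},
    {"Apple", "Banana", "Durian"},
]


def count(itemset, transactions):
    """n(itemset): number of transactions containing every item of itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(x, y, transactions):
    """support(X => Y) = n(X u Y) / N."""
    return count(set(x) | set(y), transactions) / len(transactions)


def confidence(x, y, transactions):
    """confidence(X => Y) = n(X u Y) / n(X)."""
    n_x = count(x, transactions)
    if n_x == 0:
        raise ValueError("the antecedent X never occurs in the transactions")
    return count(set(x) | set(y), transactions) / n_x


print(support({"Apple"}, {"Banana"}, transactions))     # 0.6
print(confidence({"Apple"}, {"Banana"}, transactions))  # 0.75
```

In this toy data Apple appears in 4 of the 5 transactions and Apple together with Banana in 3, hence a support of 3/5 and a confidence of 3/4.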

2. The Apriori algorithm for generating association rules (Agrawal and Srikant, 1994)

The main idea of the Apriori algorithm is:

- Find all frequent itemsets: the frequent k-itemsets (itemsets of k items) are used to find the (k+1)-itemsets. First find the frequent 1-itemsets (denoted L1). L1 is used to find L2 (the frequent 2-itemsets), L2 is used to find L3 (the frequent 3-itemsets), and so on until no more frequent k-itemsets are found.
- From the frequent itemsets, generate the strong association rules (the rules that satisfy both min_sup and min_conf).

Apriori Algorithm

1. Scan the entire transaction database to obtain the support S of each 1-itemset, and compare S with min_sup to obtain the frequent 1-itemsets (L1).
2. Join Lk-1 with Lk-1 to generate the candidate k-itemsets. Remove the candidates that cannot be frequent, that is, those with a (k-1)-subset that is not in Lk-1.
3. Scan the transaction database to obtain the support of each candidate k-itemset, and compare S with min_sup to obtain the frequent k-itemsets (Lk).
4. Repeat from step 2 until the candidate set (C) is empty (no frequent itemsets are found).
5. For each frequent itemset I, generate all non-empty proper subsets s of I.
6. For each non-empty proper subset s of I, output the rule s => (I - s) if its confidence is >= min_conf.

For example, with I = {A1, A2, A5}, the non-empty proper subsets of I are {A1}, {A2}, {A5}, {A1, A2}, {A1, A5} and {A2, A5}, which give the following candidate rules:

{A1} => {A2, A5}, {A2} => {A1, A5}, {A5} => {A1, A2}
{A1, A2} => {A5}, {A1, A5} => {A2}, {A2, A5} => {A1}
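The subset enumeration in this example can be sketched in a few lines of Python. The function name `candidate_rules` is an assumption made for this illustration. It only lists the candidate rules s => (I - s), and the confidence test of step 6 is left out.

```python
from itertools import combinations


def candidate_rules(itemset):
    """Yield (antecedent, consequent) for each non-empty proper subset of itemset."""
    items = sorted(itemset)
    full = frozenset(items)
    # Sizes 1 .. len-1 exclude both the empty set and the itemset itself.
    for size in range(1, len(items)):
        for subset in combinations(items, size):
            antecedent = frozenset(subset)
            yield antecedent, full - antecedent


rules = list(candidate_rules({"A1", "A2", "A5"}))
for antecedent, consequent in rules:
    print(sorted(antecedent), "=>", sorted(consequent))
```

A frequent itemset with k items yields 2^k - 2 candidate rules, six in the case of {A1, A2, A5}.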

Example: Suppose we have the following transaction database (TDB):

The Apriori algorithm for mining association rules proceeds through the following steps:

We obtain the frequent itemset I = {B, C, E}. With min_conf = 80% this gives two association rules: {B, C} => {E} and {C, E} => {B}.
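The level-wise search of steps 1 to 4 can be sketched as follows. This is a minimal illustration, not the author's implementation. The four-transaction database `tdb` and the minimum support count of 2 are assumptions, chosen only because they are consistent with the frequent itemset {B, C, E} and the two rules stated above. The function name `apriori_levels` is hypothetical.

```python
from itertools import combinations


def apriori_levels(transactions, min_count):
    """Return [L1, L2, ...]; each level maps a frequent itemset to its count."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Step 1: count every 1-itemset and keep those meeting min_count (L1).
    singles = {frozenset([item]) for t in transactions for item in t}
    current = {s: count(s) for s in singles if count(s) >= min_count}
    levels = []
    k = 2
    while current:
        levels.append(current)
        # Step 2, join: unions of two (k-1)-itemsets that contain exactly k items.
        joined = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Step 2, prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = [c for c in joined
                      if all(frozenset(s) in current for s in combinations(c, k - 1))]
        # Step 3: scan the database to count the surviving candidates.
        current = {c: count(c) for c in candidates if count(c) >= min_count}
        # Step 4: the loop stops once no frequent k-itemset is left.
        k += 1
    return levels


# Assumed example data (hypothetical, not the table from the original text).
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
levels = apriori_levels(tdb, min_count=2)
counts = {s: c for level in levels for s, c in level.items()}

# Steps 5 and 6 for the single frequent 3-itemset, with min_conf = 80%.
itemset = frozenset({"B", "C", "E"})
strong = [(frozenset(s), itemset - frozenset(s))
          for r in range(1, len(itemset))
          for s in combinations(sorted(itemset), r)
          if counts[itemset] / counts[frozenset(s)] >= 0.8]
for lhs, rhs in strong:
    print(sorted(lhs), "=>", sorted(rhs))
```

Because every subset of a frequent itemset is itself frequent, the counts needed for the confidence test are already available in `counts` and no further database scan is required.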

Suppose we have a sales transaction database consisting of the following 5 transactions:

The Apriori algorithm finds the association rules in these sales transactions as follows:
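A compact end-to-end sketch of this process, with thresholds given as fractions, is shown below. The five baskets in `sales` are hypothetical: they are an assumption chosen to be consistent with the supports and confidences of rules R1 to R4 reported for this example, and they are not the original table. The function name `mine_rules` is likewise made up for this illustration.

```python
from itertools import combinations

# Hypothetical sales data, assumed for illustration only.
sales = [
    {"Beer", "Diaper", "Baby Powder", "Bread", "Umbrella"},
    {"Diaper", "Baby Powder"},
    {"Beer", "Diaper", "Milk"},
    {"Diaper", "Beer", "Detergent"},
    {"Beer", "Milk", "Cola"},
]


def mine_rules(transactions, min_sup, min_conf):
    """Return strong rules as (antecedent, consequent, support, confidence)."""
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Steps 1 to 4: frequent 1-itemsets first, then grow level by level.
    items = set().union(*transactions)
    level = {frozenset([i]) for i in items}
    level = {s for s in level if count(s) / n >= min_sup}
    counts = {s: count(s) for s in level}
    k = 2
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if count(c) / n >= min_sup}
        counts.update({c: count(c) for c in level})
        k += 1

    # Steps 5 and 6: split each frequent itemset into antecedent => consequent.
    rules = []
    for itemset, n_xy in counts.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                conf = n_xy / counts[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, n_xy / n, conf))
    return rules


rules = mine_rules(sales, min_sup=0.4, min_conf=0.7)
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), f"support={sup:.0%}", f"confidence={conf:.0%}")
```

Confidence is computed from integer counts rather than from the support fractions, which avoids small floating-point errors when comparing against min_conf.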

As a result, we obtain the following association rules (with min_sup = 40%, min_conf = 70%):

R1: Beer => Diaper (support = 60%, confidence = 75%)
R2: Diaper => Beer (support = 60%, confidence = 75%)
R3: Milk => Beer (support = 40%, confidence = 100%)
R4: Baby Powder => Diaper (support = 40%, confidence = 100%)

From the rules generated from the sales transactions above, we see that some rules are plausible, such as Baby Powder => Diaper, some need further analysis, such as Milk => Beer, and some seem hard to believe, such as Diaper => Beer. This example may produce unrealistic rules because the data used for the analysis (the transaction database, also called the training data) is very small.

The Apriori algorithm is used to discover positive (X => Y), binary association rules. It cannot discover negative association rules, such as "customers who buy item A usually do NOT buy item B" or "people who support view A usually do NOT support view B". Mining negative association rules has a very wide and interesting range of applications, especially in marketing, health care and social network analysis.

P.S. Please send all comments to chucnv@ud.edu.vn. Thank you!
