
Graduation project: Data mining from job websites

ACKNOWLEDGEMENTS
I would like to sincerely thank the lecturers of the Information Technology Department at Hải Phòng Private University, who have devotedly taught me over the past four years. I am also grateful for the encouragement of my family and friends, which, together with my own best efforts, made this work possible.
In particular, I wish to express my deep gratitude to Dr. Phùng Văn Ổn, who wholeheartedly guided and encouraged me in carrying out this project.
I welcome comments from all lecturers, friends and colleagues so that this project can be developed and improved further.
Hải Phòng, July 2010
Author
Nguyn Ngc Chu


TABLE OF CONTENTS

ACKNOWLEDGEMENTS
INTRODUCTION
Chapter 1: OVERVIEW OF DATA MINING AND KNOWLEDGE DISCOVERY
I. Overview of data mining
1. Organizing and exploiting traditional databases
2. Overview of knowledge discovery and data mining techniques (KDD: Knowledge Discovery and Data Mining)
II. Applying association rules to data mining
1. Association rule theory
2. Characteristics of association rules
3. Some basic algorithms for mining frequent itemsets
4. Generating rules from frequent itemsets
5. Evaluation and remarks
Chapter 2: INFORMATION SEARCH MODEL
1. Information search
2. The search engine model
2.1 Search engine
2.2 Agents
3. How search engines work
3.1 How robots work
3.2 Breadth-first traversal
3.3 Depth-first traversal
3.4 Depth limit
3.5 The transmission congestion problem
3.6 Limitations of robots
3.7 Analyzing the links in a web page
3.8 Recognizing Vietnamese character encodings
Chapter 3: EXPERIMENTAL APPLICATION: MINING DATA INTEGRATED FROM RECRUITMENT WEBSITES
1. The problem
1.1 Problem statement
1.2 Some well-known Vietnamese job websites
1.3 Database design
1.4 Data specification
1.5 Program illustration
1.6 Analysis and evaluation
1.7 Future development
CONCLUSION
REFERENCES



INTRODUCTION
In recent years, the ability to capture information has come to be regarded as the foundation of every production and business activity. Individuals or organizations that collect and understand information, and act on what can be derived from the information they already hold, will succeed in whatever they do.
The remarkable growth of databases in areas such as commerce and management has given rise to, and driven the development of, techniques for collecting, storing, analyzing and mining data. These go beyond ordinary simple operations such as counting and statistics and call for smarter, more efficient processing. Techniques that allow useful knowledge to be extracted from (large) databases are called data mining techniques. This project studies the basic concepts of data mining and association rules, and the application of association rule mining algorithms to large databases.
The project is structured as follows:
CHAPTER 1: OVERVIEW OF DATA MINING AND KNOWLEDGE DISCOVERY
Presents an overview of information exploitation and processing.
Introduces association rules and the methods for mining them.
Presents the Apriori algorithm and several other association rule mining algorithms.
CHAPTER 2: INFORMATION SEARCH MODEL
Presents the basic components of a search engine.
Presents the operating principles of search engines and some of their search algorithms.
CHAPTER 3: EXPERIMENTAL APPLICATION: MINING JOB DATA INTEGRATED FROM RECRUITMENT WEBSITES
This chapter applies data mining techniques to the problem of finding trends in the occupations chosen by candidates and in the recruitment carried out by companies.
Finally, the conclusion summarizes the results achieved and the directions for future development.

Chapter 1: OVERVIEW OF DATA MINING AND KNOWLEDGE DISCOVERY
I. Overview of data mining
1. Organizing and exploiting traditional databases
The use of computing tools to organize and exploit databases dates back to the 1960s. Since then, a great many databases have been organized, developed and exploited at every scale and in every field of human and social activity. According to published estimates, the amount of information in the world doubles every 20 months, and the size and number of databases grow even faster. With the development of electronics, the rapid progress of hardware producing large memories and fast processors, and the development of telecommunication systems, people have been building information systems that aim to automate all human activities. This has created a continuously growing stream of data, because even simple activities such as making a phone call or looking up a book in a library are carried out through computers.
The number of databases has become enormous, including very large databases of gigabytes and even terabytes that store business data such as customer information, sales data and account data. Many powerful database management systems, with rich and convenient tools, have helped people exploit these data resources effectively. The relational database model and the standard query language (SQL) have played a very important role in organizing and exploiting databases. Today no organization that uses computing in its work does without database management systems, reporting tools and query languages to exploit databases in support of its operational activities.
As the volume of data keeps growing, information systems have also become specialized and divided by application field, such as production, finance and business operations. Alongside the operational exploitation of data, success no longer depends on the throughput of information systems but on their flexibility and readiness to respond to real-world requirements: a database needs to deliver "knowledge" rather than just the data it contains. Decisions must be made as quickly as possible and must be accurate, based on the available data, while the volume of data doubling every 20 months affects both the time needed to decide and the ability to understand the full content of the data. At this point, traditional database models and SQL prove unable to do the job.
To obtain "knowledge" from this huge mass of data, techniques were developed that can consolidate data from different transaction systems and transform them into a stable, high-quality collection of databases used only for a few specific purposes. These techniques are collectively called data warehousing, and the resulting data environment is called a data warehouse.
A data warehouse alone, however, is not enough to yield knowledge. Data warehouses are used in several ways:
Traditional exploitation: the data warehouse is used to extract information with query and reporting tools.
Support for online analytical processing (OLAP): online analysis can analyze data and determine whether a hypothesis is true or false. However, online analysis cannot propose hypotheses.
Data mining technology emerged to meet these demands in science as well as in practice. It is one of the main applications of the data warehouse.
2. Overview of knowledge discovery and data mining techniques (KDD: Knowledge Discovery and Data Mining)
2.1 What are knowledge discovery and data mining?
If electrons and electromagnetic waves are the essence of traditional electronics, then data, information and knowledge are now the focus of a new field of research and application: knowledge discovery and data mining.
We usually regard data as a sequence of bits, or numbers and symbols, or "objects" that carry some meaning when sent to a program in a given form. We use bits to measure information and regard information as data from which redundancy has been filtered out, reduced to the minimum needed to characterize the data. We can regard knowledge as integrated information, comprising facts and the relationships among them. These relationships can be understood, discovered or learned. In other words, knowledge can be regarded as data with a high degree of abstraction and organization.
Knowledge discovery in databases is a process of identifying patterns or models in data that are valid, novel, useful and understandable. Data mining is one step in the knowledge discovery process, consisting of specialized data mining algorithms that, under acceptable computational efficiency constraints, find patterns or models in the data. In other words, the goal of knowledge discovery and data mining is to find patterns and/or models that exist in databases but are still hidden under mountains of data.
Definition: KDD is the non-trivial process of identifying valid, novel, potentially useful and understandable patterns in data.
Statisticians, for their part, view data mining as an analytical process designed to explore very large amounts of data in order to find consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of the data. This process consists of three basic stages: exploration, model building or pattern definition, and validation/verification.
2.2 The knowledge discovery process
The knowledge discovery process is summarized in Figure 1:

Figure 1: The knowledge discovery process
Step 1: Formulating, identifying and defining the problem. This means studying the application domain, formulating the problem from it, and identifying the tasks to be completed. This step determines whether useful knowledge can be extracted and allows the choice of data mining methods suited to the purpose of the application and the nature of the data.
Step 2: Collecting and preprocessing the data. This covers data collection and rough processing, also called data preprocessing, which removes noise, handles missing data, and transforms and reduces the data where necessary. This step usually takes the most time in the whole knowledge discovery process.

Step 3: Mining the data and extracting knowledge. This is data mining proper, in other words extracting the patterns and/or models hidden in the data. This stage is very important and includes deciding on the function, task and purpose of the data mining and on which mining method to use.
Step 4: Using the discovered knowledge. This means understanding the knowledge that has been found, in particular interpreting the descriptions and predictions. The steps above may be repeated several times, and the results may be averaged over all runs.
In summary, KDD is a process of extracting knowledge from a data warehouse, in which data mining is the most important stage.
2.3 Data mining methods
KDD has two essential goals: prediction and description.
Prediction: uses some variables or fields to predict hidden information or the future value of an attribute of interest.
Description: focuses on presenting the resulting model in a form that lets people gain insight into the data.
For these two main goals, the following methods are commonly used in data mining:
- Classification: learning a function that maps a data sample into one of a set of predefined classes.
- Regression: learning a function that maps a data sample to a real-valued prediction variable.
- Clustering: a general descriptive task that seeks a set of groups or categories describing the data. The groups may be mutually exclusive or hierarchical.
- Summarization: methods for finding a compact description of a subset of the data, often applied in exploratory data analysis and automated report generation.
- Dependency modeling: finding a model that describes the dependencies between variables or attributes at two levels: local dependencies in the structure of the model, and dependencies on a measure or estimate of some quantity.

- Change and deviation detection: focuses on the most significant changes in the data relative to normative or previously determined values.
- Model representation: the use of some language L to describe the patterns or models that can be mined. With a clear model description, machine learning can produce patterns that model the data accurately. However, if the model is too large, the predictive ability of machine learning is limited. This also makes the search more complex and the model harder to understand.
- Model evaluation: assessing and estimating detailed and standard models during knowledge discovery, asking whether the estimate predicts accurately and whether it satisfies the underlying logic. The estimate should be assessed by cross validation, characterizing the predictive accuracy, novelty, usefulness and understandability of the fitted models. Both logical and standard statistical methods can be used for model evaluation.
- Search method: consists of two components. (1) Parameter search: the algorithm must search for the parameters that optimize the model evaluation criteria, given the observed data and a fixed model representation. (2) Model search: this occurs as a loop over the parameter search method, in which the model representation is changed so that a family of models is considered.
2.4 Fields related to knowledge discovery and data mining
Knowledge discovery and data mining are related to many disciplines and fields: statistics, artificial intelligence, databases, algorithms, parallel and high-performance computing, knowledge acquisition for expert systems, data visualization, and others. In particular, knowledge discovery and data mining are very close to statistics, using statistical methods to model data and to detect patterns and rules. Data warehousing and online analytical processing (OLAP) tools are also closely related to knowledge discovery and data mining.
Data mining has many practical applications. Some typical ones are:
- Insurance, finance and the stock market: analyzing the financial situation and forecasting the prices of shares on the stock market; portfolios and prices, interest rates, credit card data, fraud detection, ...

- Data analysis and decision support.
- Medical treatment and health care: diagnostic information stored in hospital management systems; analysis of the relationships between symptoms, diagnoses and treatments (diet, medication, ...).
- Manufacturing and processing: processes, processing methods and incident handling.
- Text mining and Web mining: classification of documents and web pages, text summarization, ...
- Science: astronomical observation, gene data, biological data, searching and comparing genomes and genetic information, relationships between genes and certain hereditary diseases, ...
- Telecommunication networks: analysis of phone calls and of systems that monitor faults, incidents, quality of service, ...
II. Applying association rules to data mining
Predicting high-value information from large volumes of business data is becoming increasingly important to many organizations and companies. For example, managers and business people need to know the purchasing behavior patterns of customers, business trends, and so on. This information can be learned from the data already available.
One of the hardest problems in mining databases is the extremely large amount of data to be processed. Medium-sized companies may have collected from hundreds of megabytes to several gigabytes of data. Data mining applications usually perform fairly complex and time-consuming analyses over the entire database. Finding a fast and efficient algorithm for processing large volumes of data is therefore a major challenge.
This part presents the theoretical basis of rules and association rules and of data mining based on association rules, together with several algorithms related to association rules.
1. Association rule theory
Since its introduction in 1993, the association rule mining problem has received a great deal of attention from researchers. Today the mining of such rules is still one of the most popular pattern mining methods in knowledge discovery and data mining (KDD: Knowledge Discovery and Data Mining).

Briefly, an association rule is an expression of the form $X \Rightarrow Y$, where $X$ and $Y$ are sets of fields called items. The meaning of such a rule is fairly intuitive: given a database $D$ of transactions, where each transaction $T \in D$ is a set of items, $X \Rightarrow Y$ expresses that whenever a transaction $T$ contains $X$, it is likely to contain $Y$ as well. The rule confidence can be understood as the conditional probability $p(Y \subseteq T \mid X \subseteq T)$. The idea of mining association rules originates from the analysis of customer purchase data and the observation that "a customer who buys items x1 and x2 will buy item y with probability c%". The direct applicability of these rules to business problems, together with their inherent understandability even for people who are not data mining experts, has made association rules a popular mining method. Moreover, association rules are not restricted to dependency analysis in retail applications; they have been applied successfully to a wide range of business problems.
Discovering association rules between items in "basket" data is a very characteristic data mining problem. Basket data consist of the items bought by a customer together with information such as the purchase date, quantity and price. Association rules indicate which sets of items are most frequently bought together with other sets of items.
Many algorithms now exist for discovering association rules. The problem, however, is that too many scans of the database severely affect the efficiency and feasibility of an algorithm on large databases. For databases stored on disk, a scan causes a very large number of disk reads. For example, a 1 GB database requires about 125,000 block reads per scan (with a block size of 8 KB). An algorithm that makes 10 scans causes 1,250,000 block reads. Assuming an average read time of 12 ms per block, the I/O time required is 1,250,000 × 12 ms = 15,000 s, or roughly 4 hours.
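The arithmetic in this estimate can be reproduced with a short sketch. The function name `scan_cost` and its parameters are illustrative only, and the sizes use decimal units (1 GB = 10^9 bytes, 8 KB = 8,000 bytes), which is the assumption under which the figure of 125,000 blocks is obtained.

```python
def scan_cost(db_bytes, block_bytes, passes, ms_per_block):
    """Estimate the I/O cost of repeatedly scanning a disk-resident database.

    Returns (blocks read per pass, total block reads, total time in hours).
    """
    blocks_per_pass = db_bytes // block_bytes
    total_reads = blocks_per_pass * passes
    total_seconds = total_reads * ms_per_block / 1000
    return blocks_per_pass, total_reads, total_seconds / 3600


# Figures from the text: 1 GB database, 8 KB blocks, 10 passes, 12 ms per read.
blocks, reads, hours = scan_cost(10**9, 8 * 10**3, passes=10, ms_per_block=12)
print(blocks, reads, round(hours, 2))
```

With these inputs the sketch gives 125,000 blocks per pass, 1,250,000 reads in total and a little over 4 hours, which is why reducing the number of database passes is a central concern of the algorithms discussed later.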
In this part we examine a number of definitions and properties related to rules and association rules, and we also look at the meaning of association rules.
1.1 Association rules
a) Meaning of association rules: Association rules are an important area of data mining. They help us find relationships between the data items of a database. On the Internet, online job searching has become a trend: recruitment websites carry ever more information about job seekers and about companies that are hiring, driven by the needs of society. We can therefore look for recruitment trends and employment demand so that managers can identify the employment needs of society. Similarly, in telecommunications the range of services offered to customers keeps growing, so we can look for links between the use of different services in support of advertising and marketing. For example, to learn about customers' habits in using telecommunication services, one often asks: "Which services do customers tend to use together when they register at the customer care center?" The results can be used for marketing, for instance by listing services that are frequently used together next to each other, or by offering promotions on accompanying services.
b) Definition of association rules: Let $I = \{I_1, I_2, \dots, I_m\}$ be a set of $m$ items, also called attributes. The elements of $I$ are distinct. A set $X \subseteq I$ is called an itemset. If the cardinality of $X$ is $k$ (that is, $|X| = k$), then $X$ is called a k-itemset.
A transaction $T$ is defined as a subset of the items in $I$ ($T \subseteq I$). As with sets, a transaction contains no duplicate items, although this set property can be relaxed. The algorithms discussed later all assume that the items in a transaction, and in every other itemset, are sorted in lexicographic order.
Let $D$ be a database of $n$ transactions, each labeled with a unique transaction identifier (TID). A transaction $T \in D$ is said to support a set $X \subseteq I$ if it contains all the items of $X$, that is, $X \subseteq T$. In some cases the notation $T(X)$ is used for the set of transactions that support $X$. The notation support(X) (or supp(X), s(X)) denotes the percentage of transactions in $D$ that support $X$:

$$\mathrm{supp}(X) = \frac{|\{T \in D \mid X \subseteq T\}|}{|D|} \times 100\%$$

Example of a database D in transaction form: I = {A, B, C, D, E}, with transaction identifiers {1, 2, 3, 4, 5, 6} (the item D is distinct from the database D). The transactions are given in the following table:



Transaction ID (TID)    Itemset
1                       A B D E
2                       B C E
3                       A B D E
4                       A B C E
5                       A B C D E
6                       B C D
Table 1: An example database D in transaction form
We have: supp({A}) = 4/6 = 66.67%;
supp({ABDE}) = 3/6 = 50%;
supp({ABCDE}) = 1/6 = 16.67%; ...
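The support values above can be checked with a minimal sketch that encodes Table 1 directly. The names `TRANSACTIONS`, `supporting_tids` and `supp` are illustrative choices, not part of the original text; `supporting_tids` plays the role of T(X).

```python
# Transactions of Table 1, keyed by transaction ID (TID).
TRANSACTIONS = {
    1: {"A", "B", "D", "E"},
    2: {"B", "C", "E"},
    3: {"A", "B", "D", "E"},
    4: {"A", "B", "C", "E"},
    5: {"A", "B", "C", "D", "E"},
    6: {"B", "C", "D"},
}


def supporting_tids(itemset, transactions):
    """Return T(X): the IDs of the transactions that contain every item of X."""
    wanted = set(itemset)
    return {tid for tid, items in transactions.items() if wanted <= items}


def supp(itemset, transactions):
    """Return supp(X) as a fraction between 0 and 1."""
    return len(supporting_tids(itemset, transactions)) / len(transactions)


for name in ("A", "ABDE", "ABCDE"):
    print(f"supp({{{name}}}) = {supp(name, TRANSACTIONS):.2%}")
```

Here an itemset is written as a string of single-letter items, so `"ABDE"` stands for {A, B, D, E}. This shorthand only works because every item in the example is one character long.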
Frequent itemsets:
The minimum support $minsup \in (0, 1]$ is a value given in advance by the user. If an itemset $X \subseteq I$ has $\mathrm{supp}(X) \ge minsup$, then $X$ is called a frequent itemset (or large itemset). Frequent itemsets are the sets of interest to the algorithms, whereas sets that are not frequent are of no interest. In what follows we also use phrases such as "X has minimum support" or "X does not have minimum support" to say that $X$ does or does not satisfy $\mathrm{supp}(X) \ge minsup$.
Example: For the database D given in Table 1 and a threshold minsup = 50%, all the frequent itemsets are listed as follows:


Frequent itemsets                           Support (supp)
B                                           100% (6/6)
E, BE                                       83% (5/6)
A, C, D, AB, AE, BC, BD, ABE                67% (4/6)
AD, CE, DE, ABD, ADE, BCE, BDE, ABDE        50% (3/6)
Table 2: Frequent itemsets in the database of Table 1 with minimum support 50%
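As a cross-check of this table, the following brute-force sketch enumerates every non-empty subset of I over the transactions of Table 1 and keeps those reaching the 50% threshold. The function name `frequent_itemsets` is illustrative, and exhaustive enumeration is used only because the example is tiny; it is not one of the mining algorithms discussed in this chapter.

```python
from collections import defaultdict
from itertools import combinations

# Transactions of Table 1 (TIDs 1 to 6, in order).
TRANSACTIONS = [set("ABDE"), set("BCE"), set("ABDE"),
                set("ABCE"), set("ABCDE"), set("BCD")]


def frequent_itemsets(transactions, minsup):
    """Return {itemset name: support count} for all itemsets with supp >= minsup."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= minsup:
                result["".join(candidate)] = count
    return result


freq = frequent_itemsets(TRANSACTIONS, 0.5)

# Group the frequent itemsets by support count, highest first.
by_count = defaultdict(list)
for name, count in freq.items():
    by_count[count].append(name)
for count in sorted(by_count, reverse=True):
    print(f"{count}/6: {', '.join(by_count[count])}")
```

The enumeration finds 19 frequent itemsets in total, including ABDE with a support count of 3, consistent with supp({ABDE}) = 50% computed earlier.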
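Some properties (TC) of frequent itemsets:
TC 1. Support of subsets: if $A \subseteq B$, where $A$ and $B$ are itemsets, then $\mathrm{supp}(A) \ge \mathrm{supp}(B)$, because every transaction of $D$ that supports $B$ also supports $A$.
TC 2. If an itemset $A$ does not have minimum support on $D$, that is, $\mathrm{supp}(A) < minsup$, then no superset $B$ of $A$ is frequent, because $\mathrm{supp}(B) \le \mathrm{supp}(A) < minsup$.
TC 3. If an itemset $B$ is frequent on $D$, that is, $\mathrm{supp}(B) \ge minsup$, then every subset $A$ of $B$ is frequent on $D$, because $\mathrm{supp}(A) \ge \mathrm{supp}(B) \ge minsup$.
Definition of an association rule:
An association rule has the form $R: X \Rightarrow Y$, where $X$ and $Y$ are itemsets, $X, Y \subseteq I$ and $X \cap Y = \emptyset$. $X$ is called the antecedent and $Y$ the consequent of the rule.
A rule $X \Rightarrow Y$ has a support, supp. $\mathrm{supp}(X \Rightarrow Y)$ is defined as the proportion of transactions that support the attributes in both $X$ and $Y$, that is:

$$\mathrm{supp}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y)$$

A rule $X \Rightarrow Y$ has a confidence $c$ (conf). The confidence $c$ is defined as the likelihood that a transaction $T$ supporting $X$ also supports $Y$. In other words, $c$ is the percentage of transactions containing $Y$ among the transactions that contain $X$. The confidence is computed as follows:

$$\mathrm{conf}(X \Rightarrow Y) = p(Y \subseteq T \mid X \subseteq T) = \frac{p(Y \subseteq T \wedge X \subseteq T)}{p(X \subseteq T)} = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \times 100\%$$

We say that the rule $X \Rightarrow Y$ holds on $D$ if, for a given minimum support minsup and a given minimum confidence threshold minconf:

$$\mathrm{supp}(X \Rightarrow Y) \ge minsup \quad \text{and} \quad \mathrm{conf}(X \Rightarrow Y) \ge minconf$$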
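To make the two measures concrete, the sketch below evaluates a few rules on the transactions of Table 1. The function names and the thresholds (minsup = 50%, minconf = 80%) are chosen for illustration only, and the code assumes the antecedent occurs in at least one transaction so that the division is defined.

```python
# Transactions of Table 1 (TIDs 1 to 6, in order).
TRANSACTIONS = [set("ABDE"), set("BCE"), set("ABDE"),
                set("ABCE"), set("ABCDE"), set("BCD")]


def supp(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def rule_support(x, y, transactions):
    """supp(X => Y) = supp(X ∪ Y)."""
    return supp(set(x) | set(y), transactions)


def rule_confidence(x, y, transactions):
    """conf(X => Y) = supp(X ∪ Y) / supp(X); assumes supp(X) > 0."""
    return rule_support(x, y, transactions) / supp(x, transactions)


def rule_holds(x, y, transactions, minsup, minconf):
    """A rule holds when it reaches both the support and confidence thresholds."""
    return (rule_support(x, y, transactions) >= minsup
            and rule_confidence(x, y, transactions) >= minconf)


for x, y in (("A", "E"), ("B", "C"), ("C", "D")):
    print(f"{x} => {y}: supp={rule_support(x, y, TRANSACTIONS):.2f}, "
          f"conf={rule_confidence(x, y, TRANSACTIONS):.2f}, "
          f"holds={rule_holds(x, y, TRANSACTIONS, 0.5, 0.8)}")
```

In this example A ⇒ E holds (every transaction containing A also contains E), B ⇒ C has enough support but too little confidence, and C ⇒ D fails on support as well.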

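Note that if a rule $X \Rightarrow Y$ holds on $D$, then both $X$ and $Y$ must be frequent itemsets on $D$. When checking whether a rule holds, both its support and its confidence must be considered, because a rule may have confidence = 100% > minconf and still fail to reach the minimum support minsup.
1.2 Some properties of association rules
First, we assume that for a rule $X \Rightarrow Y$, $X$ may be empty, whereas $Y$ must always be non-empty and $Y \not\subseteq X$, because otherwise $X \cup Y = X$ and

$$\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} = 1$$

We have the following properties:
1) If $X \Rightarrow Z$ and $Y \Rightarrow Z$ hold on $D$, it does not necessarily follow that $X \cup Y \Rightarrow Z$ holds.
Consider the case where $X \cap Y = \emptyset$ and the transactions of $D$ support $Z$ if and only if they support either $X$ or $Y$, but never both. Then $\mathrm{support}(X \cup Y) = 0$, so the rule $X \cup Y \Rightarrow Z$ has zero support and cannot hold; its confidence is taken to be 0.
Similarly, from $X \Rightarrow Y$ and $X \Rightarrow Z$ we cannot infer $X \Rightarrow Y \cup Z$.
2) If the rule $X \cup Y \Rightarrow Z$ holds on $D$, then $X \Rightarrow Z$ and $Y \Rightarrow Z$ may not hold on $D$.
For example, suppose $Z$ is present in a transaction only if both $X$ and $Y$ are present in it, that is, $\mathrm{support}(X \cup Y) = \mathrm{support}(Z)$. If the supports of $X$ and $Y$ are sufficiently larger than $\mathrm{support}(X \cup Y)$, the two rules will not reach the required confidence.
However, if $X \Rightarrow Y \cup Z$ holds on $D$, then $X \Rightarrow Y$ and $X \Rightarrow Z$ also hold on $D$, because $\mathrm{support}(X \cup Y) \ge \mathrm{support}(X \cup Y \cup Z)$ and $\mathrm{support}(X \cup Z) \ge \mathrm{support}(X \cup Y \cup Z)$.
3) If $X \Rightarrow Y$ and $Y \Rightarrow Z$ hold on $D$, we cannot conclude that $X \Rightarrow Z$ also holds on $D$.
Suppose $T(X) \supset T(Y) \supset T(Z)$ and $\mathrm{confidence}(X \Rightarrow Y) = \mathrm{confidence}(Y \Rightarrow Z) = minconf$. Then

$$\mathrm{confidence}(X \Rightarrow Z) = \frac{|T(Z)|}{|T(X)|} = \frac{|T(Y)|}{|T(X)|} \cdot \frac{|T(Z)|}{|T(Y)|} = minconf^2 < minconf$$

because $minconf < 1$; that is, the rule $X \Rightarrow Z$ does not have minimum confidence.
4) If the rule $A \Rightarrow (L - A)$ does not have minimum confidence, then none of the rules $B \Rightarrow (L - B)$ has minimum confidence either, where $L$, $A$ and $B$ are itemsets and $B \subseteq A$.
Indeed, by property TC 1, since $B \subseteq A$ we have $\mathrm{support}(B) \ge \mathrm{support}(A)$, and by the definition of confidence:

$$\mathrm{confidence}(B \Rightarrow (L - B)) = \frac{\mathrm{support}(L)}{\mathrm{support}(B)} \le \frac{\mathrm{support}(L)}{\mathrm{support}(A)} < minconf$$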

Likewise, if the rule $(L - C) \Rightarrow C$ holds on $D$, then the rules $(L - K) \Rightarrow K$ with $K \subseteq C$ and $K \ne \emptyset$ also hold on $D$.
The association rule mining problem:
The association rule mining problem can be stated as follows [2][3][8]:
Given a set of items $I$, a transaction database $D$, a minimum support threshold minsup and a minimum confidence threshold minconf, find all association rules $X \Rightarrow Y$ on $D$ such that $\mathrm{support}(X \Rightarrow Y) \ge minsup$ and $\mathrm{confidence}(X \Rightarrow Y) \ge minconf$.
1.3 Classification of association rules
Depending on the context of the data attributes and on the methods used in the algorithms, the association rule mining problem can be divided into several groups. For example, if the values of the items are purely boolean, the task is called mining Boolean association rules; if the attributes take ranges of values into account (categorical or quantitative attributes, for instance), it is called mining quantitative association rules. We examine these groups in detail below.
Association rule mining has been studied and developed in many directions. Some proposals aim to speed up the algorithms, others aim to find more meaningful rules, and so on. The main directions are as follows.
Binary association rules (also called Boolean association rules): this was the first research direction in association rules. Most early work on association rules concerned binary association rules. In this form, the items (attributes) are considered only in terms of whether or not they appear in a transaction of the database, not "how much" they appear. This means that making 10 phone calls and making 1 call are treated as the same. The most representative algorithm for mining this kind of rule is Apriori and its variants. This is the simplest form of rule, and other kinds of rules can be converted to it by methods such as discretization or fuzzification. An example of this kind of rule: "long-distance call = 'yes' AND mobile call = 'yes' ⇒ international call = 'yes' AND call to service 108 = 'yes', with support 20% and confidence 80%".
Association rules with quantitative and categorical attributes (quantitative and categorical association rules): the attributes of real-world databases have very diverse types (binary, quantitative, categorical, and so on).

To discover association rules over such attributes, researchers have proposed a number of discretization methods that convert this kind of rule to binary form so that existing algorithms can be applied. An example of this kind of rule: "call method = 'automatic' AND call time ∈ '23:00:39..23:00:59' AND call duration ∈ '200..300' ⇒ long-distance call = 'yes', with support 23.53% and confidence 80%".
Association rules based on rough sets (mining association rules based on rough sets): searching for association rules using rough set theory.
Multi-level association rules: this approach also looks for rules of the form "buys a PC ⇒ buys an operating system AND buys office software, ..." instead of only overly specific rules such as "buys an IBM PC ⇒ buys the Microsoft Windows operating system AND buys Microsoft Office, ...". The former is a generalization of the latter, and the generalization can be made at several different levels.
Fuzzy association rules: because of the limitations encountered when discretizing quantitative attributes, researchers have proposed fuzzy association rules to overcome them and to bring association rules into a form that is more natural and closer to the user. An example of this kind of rule: "private subscriber = 'yes' AND call duration is long AND local charge = 'yes' ⇒ invalid charge = 'yes', with support 4% and confidence 85%". In this rule, the condition "call duration is long" on the left-hand side is a fuzzified attribute.
Association rules with weighted items: in practice, the attributes in a database do not always play equal roles. Some attributes receive more attention and are more important than others. For example, when studying monthly revenue, information about call duration and charging zone is much more important than information about the call method. During rule search, call duration and charging zone are given larger weights than the call method attribute. This is a very interesting research direction, and several researchers have proposed solutions to this problem. With weighted attributes we can mine "rare" rules, that is, rules with low support that are nevertheless especially meaningful.
Parallel mining of association rules: besides sequential association rule mining, computer scientists have also studied parallel algorithms for association rule discovery. Parallelization and distributed processing are needed because data sizes keep growing, which places demands on the processing speed and memory capacity of the system. Many different parallel algorithms have been proposed that are designed to be independent of the hardware. Besides research on variants of association rules, researchers have also focused on algorithms that speed up the search for frequent itemsets in a database.
In addition, there are other research directions in association rule mining, such as online association rule mining, and association rule mining connected online to multidimensional data warehouses through OLAP (Online Analytical Processing) technology, MOLAP (multidimensional OLAP), ROLAP (relational OLAP), ADO (ActiveX Data Objects) for OLAP, and so on.
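The discretization step mentioned above for quantitative and categorical attributes can be sketched as follows: each numeric value is mapped to the interval it falls into, and each attribute-value pair becomes one boolean item, so that a record turns into an ordinary transaction. The attribute names, the sample record and the bin edges below are hypothetical and serve only as an illustration; the text does not prescribe a particular discretization method.

```python
from bisect import bisect_right


def discretize(value, edges, name):
    """Map a numeric value to an interval item such as 'duration=[200,300)'.

    `edges` is a sorted list of bin boundaries; intervals are closed on the
    left and open on the right.
    """
    i = bisect_right(edges, value)
    low = "-inf" if i == 0 else edges[i - 1]
    high = "+inf" if i == len(edges) else edges[i]
    return f"{name}=[{low},{high})"


def to_transaction(record, numeric_bins):
    """Turn one record (attribute -> value) into a set of boolean items."""
    items = set()
    for attr, value in record.items():
        if attr in numeric_bins:
            items.add(discretize(value, numeric_bins[attr], attr))
        else:
            items.add(f"{attr}={value}")
    return items


# Hypothetical call record and bin edges (seconds), for illustration only.
call = {"method": "automatic", "duration": 250, "long_distance": "yes"}
bins = {"duration": [100, 200, 300]}
print(sorted(to_transaction(call, bins)))
```

After this conversion, every record is a set of items, and a binary association rule algorithm can be applied unchanged. The choice of bin edges strongly affects which rules are found, which is one of the limitations that fuzzy association rules try to address.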
1.4 Specification of the data mining problem
With the definitions above, we can describe the basic structure of an association rule mining algorithm. Although in practice the algorithms differ in some respects, they basically follow a common scheme, which can be summarized in two main phases:
Mining all frequent itemsets (large itemsets)
As noted earlier, the number of potentially frequent itemsets is exponential in the size of the item set. The basic approach in every algorithm is to generate a set of itemsets, called candidates, in the hope that they are frequent.
What every algorithm must take care of is keeping the set of candidates as small as possible, because it determines both the memory cost of storing the candidates and the time cost of checking whether each one is a frequent itemset.
To find which candidate itemsets are frequent, and what their exact supports are, the support of each candidate must be counted in a pass over the database (that is, by scanning each transaction of the database to count the supporting transactions for every candidate itemset).
The mining of frequent itemsets is repeated over a number of passes, with the aim of obtaining as the final result the frequent itemsets that best express the correlations between the items in the transaction database D.
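The first phase can be sketched as a level-wise loop: generate candidates, count their support in one pass over the database, keep the frequent ones, and build the next level from them. The sketch below is written in the spirit of Apriori but is a simplified illustration under stated assumptions, not necessarily the exact algorithm presented later; the function name and the use of Table 1 as input are choices made here.

```python
from itertools import combinations

# Transactions of Table 1 (TIDs 1 to 6, in order).
TRANSACTIONS = [frozenset("ABDE"), frozenset("BCE"), frozenset("ABDE"),
                frozenset("ABCE"), frozenset("ABCDE"), frozenset("BCD")]


def levelwise_frequent_itemsets(transactions, minsup):
    """Return {frequent itemset: support count}, found level by level."""
    n = len(transactions)
    # Level 1 candidates: every single item that occurs in the database.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One pass over the database counts the support of all candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (property TC 2).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


frequent = levelwise_frequent_itemsets(TRANSACTIONS, 0.5)
for itemset, count in sorted(frequent.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print("".join(sorted(itemset)), count)
```

The pruning step is what keeps the candidate set small: on this database only 5 candidate 3-itemsets and 1 candidate 4-itemset are ever counted, instead of all 10 and 5 possible ones.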
Khai ph lut kt hp (sinh ra cc lut kt hp mnh t cc tp mc ph
bin)
n tt nghip: Khai ph d liu t website vic lm

19

Once the final set of frequent itemsets has been determined, a second algorithm generates rules from each frequent itemset and computes their confidence from the support counts of the frequent itemset and of its subsets. For each frequent itemset X, every proper non-empty subset is chosen in turn as the antecedent of a rule, and the remaining items form the consequent. Since X itself is frequent, all of its subsets are frequent as well (property TC3, Section 1.1). Whether a generated rule is accepted depends on the minimum confidence (minconf) specified by the user: a rule is accepted if its confidence is greater than or equal to minconf. By property TC4, Section 1.2, if a rule is rejected, then no subset of its antecedent needs to be considered for generating further rules.
In general, the idea of rule generation can be described as follows:
If ABCD and AB are frequent itemsets, we decide whether the rule AB → CD is accepted by computing its confidence from the definition
conf(AB → CD) = support(ABCD) / support(AB).
If conf ≥ minconf, the rule is accepted. (Note that the rule automatically satisfies the support requirement, because support(AB → CD) = support(ABCD) ≥ minsup.)
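The confidence test can be illustrated with a minimal Python sketch. The support counts and the minconf value below are hypothetical numbers chosen only for illustration; the function name `confidence` is likewise an assumption, not part of the thesis.

```python
def confidence(support_count, antecedent, consequent):
    """Return conf(antecedent -> consequent) from stored support counts."""
    whole = frozenset(antecedent) | frozenset(consequent)
    return support_count[whole] / support_count[frozenset(antecedent)]

# Hypothetical support counts, for illustration only.
support_count = {
    frozenset("AB"): 4,
    frozenset("ABCD"): 3,
}
minconf = 0.7  # hypothetical threshold

conf = confidence(support_count, "AB", "CD")
accepted = conf >= minconf
print(conf, accepted)
```

Because both counts are already known after phase 1, no further database scan is needed to evaluate the rule.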
2. Characteristics of association rules
2.1 The rule search space
As explained above, we have to find all itemsets that meet the minsupp threshold. In practical applications, enumerating all subsets of I fails completely because the search space is too large: a linear increase in the number of items causes an exponential increase in the number of itemsets to be considered. For the special case I = {1, 2, 3, 4}, the search space can be drawn as a lattice, as in Figure 2.
The frequent itemsets lie in the upper part of the figure and the infrequent ones in the lower part. Although the support values of the itemsets are not shown explicitly, we assume that the bold border in the figure separates the frequent itemsets from the infrequent ones. The existence of such a border does not depend on any particular database D or minsupp value; it is guaranteed solely by the downward closure property of itemsets that meet the minsupp threshold.
The basic principle of the common algorithms is to use this border to narrow the search space efficiently. Once the border has been found, we can restrict ourselves to determining the support values of the itemsets above the border and ignore the itemsets below it.
Figure 2: Lattice for the set I = {1, 2, 3, 4}

Let map: I → {1, …, |I|} be a one-to-one mapping from the items x ∈ I to the natural numbers. The items can then be regarded as totally ordered by the relation "<" on the natural numbers. Furthermore, for X ⊆ I, let X.item: {1, …, |X|} → I, n ↦ X.item_n be a mapping in which X.item_n is the n-th element of X when the elements x ∈ X are sorted in ascending order under "<". The n-prefix of an itemset X, with n ≤ |X|, is defined as P = {X.item_m | 1 ≤ m ≤ n}.
Let the classes E(P), P ⊆ I, with E(P) = {X ⊆ I | |X| = |P| + 1 and P is a prefix of X}, be the nodes of a tree. Two nodes are connected by an edge if all itemsets of a class E can be generated by joining two itemsets of its parent class E', as in Figure 3.

Figure 3: Tree for the set I = {1, 2, 3, 4}
Together with the downward closure property of itemsets that meet the minsupp threshold, this implies the following: if the parent class E' of a class E does not contain at least two frequent itemsets, then E cannot contain any frequent itemset either. If we meet such a class E while traversing the tree top-down, we have reached the border between frequent and infrequent itemsets. We do not need to search beyond this border, so E and all of its descendant classes are removed from the search space. This procedure lets us limit the number of itemsets to be visited efficiently: we only have to determine the support values of the itemsets we visit while searching for the border between frequent and infrequent itemsets. Finally, the actual strategy for finding the border is our own choice. Today's common approaches use either breadth-first search (BFS) or depth-first search (DFS). With BFS, the support values of all (k-1)-itemsets are determined before the support values of the k-itemsets are computed. DFS, in contrast, descends recursively along the tree structure described above.
2.2 Rule support
In this section, an itemset that is potentially frequent, and whose support must therefore be determined during the lattice traversal, is called a candidate itemset. A common approach to determining the support value of an itemset is to count its occurrences in the database. For this purpose, a counter is created and initialized to 0 for every itemset under consideration. All transactions are then scanned, and whenever a candidate is found to be a subset of a transaction, its counter is incremented. Subset generation and candidate lookup are usually integrated and implemented with a hash tree or a similar data structure. In short, not all subsets of each transaction are generated, but only those that are contained in the candidates or that share a prefix with at least one candidate.
Another approach to determining the support values of candidates is set intersection. A TID (Transaction IDentifier) is a key that uniquely identifies a transaction. For a single item, the tidlist is the set of identifiers of the transactions that contain the item. Tidlists likewise exist for every itemset X and are written X.tidlist. The tidlist of a candidate C = X ∪ Y is given by C.tidlist = X.tidlist ∩ Y.tidlist. Tidlists are kept sorted in ascending order so that intersections are efficient.
Note that by buffering the tidlists of frequent candidates as intermediate results, the generation of tidlists for subsequent candidates can be sped up considerably. Finally, the actual support of a candidate is simply |C.tidlist|.
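The tidlist approach can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the four transactions are sample data, and the helper name `intersect_sorted` is hypothetical. It relies on the tidlists being sorted in ascending order, so that one merge pass suffices.

```python
def intersect_sorted(a, b):
    """Intersect two ascending TID lists with a single merge pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Sample transactions (TID -> items), for illustration only.
transactions = {
    1: {"A", "C", "D"},
    2: {"B", "C", "E"},
    3: {"A", "B", "C", "E"},
    4: {"B", "E"},
}

# Build one ascending tidlist per single item.
tidlist = {}
for tid in sorted(transactions):
    for item in transactions[tid]:
        tidlist.setdefault(item, []).append(tid)

# The tidlist of {B, C} is buffered and reused to obtain the tidlist of {B, C, E}.
bc = intersect_sorted(tidlist["B"], tidlist["C"])
bce = intersect_sorted(bc, tidlist["E"])
support_bce = len(bce)  # the support count is the tidlist length
```

Reusing `bc` when computing `bce` corresponds to the buffering of intermediate tidlists mentioned in the text.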
3. Some basic algorithms for mining frequent itemsets
This section briefly presents and systematizes the algorithms currently in common use for mining frequent itemsets. They are built on the basic principles of the previous section. Our goal is to bring out the differences between the approaches.
The algorithms considered here are systematized in Figure 4. They are classified by:
a) the strategy for traversing the search space (BFS or DFS);
b) the way the support values of itemsets are determined;
c) in addition, an algorithm may apply further optimizations to gain speed.

Figure 4: Classification of the algorithms
3.1 BFS algorithms (BFS: breadth-first search)
The best-known algorithm of this type is Apriori, which introduced the downward closure property of itemsets that meet the minsupp threshold. Apriori exploits this property by pruning candidates that have an infrequent subset before their support is counted. This optimization is possible because breadth-first search guarantees that the support values of all subsets of a candidate are known in advance. Apriori counts all candidates of size k in a single pass over the database. The core of the problem is to identify the candidates contained in each transaction, which is done with a structure called a hash tree. The items of each transaction are used to descend in the hash tree. Whenever a leaf is reached, we have found a set of candidates that share a common prefix contained in the transaction. These candidates are then looked up in the transaction, which has been encoded beforehand as a bitmap. On success, the counter of the candidate in the tree is incremented.
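Problem statement:
The association rule mining problem was first stated by Rakesh Agrawal, Tomasz Imielinski, and Arun Swami in 1993; the Apriori algorithm itself was published by Agrawal and Srikant in 1994. The problem is stated as follows: find every rule t whose support s satisfies s ≥ s_0 and whose confidence c satisfies c ≥ c_0, where s_0 and c_0 are two user-defined thresholds (s_0 = minsupp, c_0 = minconf). Let L_k denote the set of frequent k-itemsets and C_k the set of candidate k-itemsets (each member of either set has two fields: the itemset and its support).
The problem therefore consists of two tasks:
Find all frequent itemsets for a given minsupp.
Use the frequent itemsets to generate the association rules for a given minconf.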
Execution (passes):
The algorithm makes repeated passes over the data, in which the (k-1)-itemsets are used to find the k-itemsets. The first pass computes the support of every single item and determines which items are frequent (that is, meet the minimum support minsupp). Call the result L_1, the set of frequent 1-itemsets.
Each later pass starts from the result of the previous pass: the seed itemsets (for example L_1) are used to generate the potentially frequent (candidate) itemsets (for example the candidates for L_2), whose actual supports are then counted. Each pass also determines the seed itemsets for the next pass.
The loop continues to find L_3, ..., L_k until no more frequent itemsets are found.
Note:
Using L_{k-1} to find L_k involves two main steps:
Join step: the set C_k of candidate k-itemsets is generated by joining L_{k-1} with itself. Let l_1 and l_2 be itemsets in L_{k-1}, and let l_i[j] denote the j-th item of l_i. The items within an itemset or a transaction are assumed to be ordered. Two members of L_{k-1} are joined if they share their first k-2 items, that is:
(l_1[1] = l_2[1]) ∧ (l_1[2] = l_2[2]) ∧ ... ∧ (l_1[k-2] = l_2[k-2]) ∧ (l_1[k-1] < l_2[k-1]).
Prune step: C_k is a superset of L_k. Its members may or may not be frequent, but every frequent k-itemset is contained in C_k. A second scan of the database to count the support of each candidate in C_k yields L_k. Because C_k can be very large, Apriori first reduces it using the following properties: 1. every non-empty subset of a frequent itemset is frequent; 2. if an itemset is not frequent, then no itemset containing it is frequent.
3.1.1 Walkthrough of the Apriori algorithm
As stated above, algorithms for mining frequent itemsets make several passes over the database. In the first pass, the support of each individual item is counted and the frequent items are determined (those with support ≥ minsup). In each subsequent pass, we start from the frequent itemsets found in the previous pass, generate from them new potentially frequent itemsets (called candidate itemsets), and count the support of each candidate in one scan of the database. At the end of each pass, we determine which candidates are frequent, and these become the frequent itemsets used in the next pass. The process continues until no new frequent itemset is found.
To study the algorithms, we assume that the items in each transaction are sorted in lexicographic order (the term is used here for any agreed order on the items of the database). Each record of the database D can be regarded as a pair <TID, itemset>, where TID is the identifier of the transaction. The items within an itemset are also stored in lexicographic order: if the k items of a k-itemset c are written c[1], c[2], …, c[k], then c[1] < c[2] < … < c[k]. If c = X.Y and Y is an m-itemset, then Y is also called an m-extension of X. In storage, each itemset has an associated support-count field that holds the support count of the itemset.
The Apriori algorithm
Notation:
L_k: the set of frequent (large) k-itemsets, that is, the itemsets of cardinality k that have minimum support. Each member of this set has two fields: itemset and support-count.
C_k: the set of candidate k-itemsets. Each member of this set also has two fields: itemset and support-count.
The Apriori algorithm is stated as follows:
Input: a set of transactions D, minimum support threshold minsup
Output: L, the set of frequent itemsets in D
Method:
    L_1 = {large 1-itemsets};    // find all frequent 1-itemsets
    for (k = 2; L_{k-1} ≠ ∅; k++) do
    begin
        C_k = apriori-gen(L_{k-1});    // generate the candidates from L_{k-1}
        for (each transaction T ∈ D) do
        begin
            C_T = subset(C_k, T);    // the candidates in C_k that are subsets of T
            for (each candidate c ∈ C_T) do
                c.count++;    // increase the frequency counter by 1
        end;
        L_k = {c ∈ C_k | c.count ≥ minsup}
    end;
    return ∪_k L_k
In this algorithm, the first pass simply counts the support of the individual items. To determine the frequent 1-itemsets (L_1), only the items whose support is greater than or equal to minsup are kept.
Each later pass k (k > 1) consists of two stages. First, the large (k-1)-itemsets in L_{k-1} are used to generate the candidate itemsets C_k by calling the function Apriori_gen. Next, the database D is scanned to compute the support of each candidate in C_k. For counting to be fast, an efficient method is needed for determining which candidates in C_k are contained in a given transaction T.
Candidate generation in Apriori: the Apriori_gen function
The function Apriori_gen takes L_{k-1} (the set of large (k-1)-itemsets) as its argument and returns a superset of the set of all large k-itemsets. The algorithm for this function is as follows.
Input: L_{k-1}, the set of frequent itemsets of size k-1
Output: C_k, the set of candidates
Method:
function apriori-gen(L_{k-1}: frequent itemsets of size k-1)
begin
    C_k = ∅;
    for (each l_1 ∈ L_{k-1}) do
        for (each l_2 ∈ L_{k-1}) do
        begin
            if ((l_1[1] = l_2[1]) ∧ (l_1[2] = l_2[2]) ∧ ... ∧ (l_1[k-2] = l_2[k-2]) ∧ (l_1[k-1] < l_2[k-1])) then
            begin
                c = l_1 ⋈ l_2;    // join l_1 with l_2 to produce candidate c
                if has_infrequent_subset(c, L_{k-1}) then
                    remove(c)    // prune step (delete candidate c)
                else C_k = C_k ∪ {c};    // add c to C_k
            end;
        end;
    return C_k;
end;
The function that checks whether some (k-1)-subset of a candidate k-itemset is not frequent:
function has_infrequent_subset(c: candidate k-itemset; L_{k-1}: frequent (k-1)-itemsets)
begin
    // uses the frequent itemsets of the previous pass
    for (each (k-1)-subset s of c) do
        if s ∉ L_{k-1} then return TRUE;
    return FALSE;
end;
The function Apriori_gen can also be described by the following scheme:
Input: L_{k-1}, the set of large (k-1)-itemsets
Output: C_k, the set of candidate k-itemsets
Method:
    // join step
    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};
    // prune step
    for (each itemset c ∈ C_k) do
        for (each (k-1)-subset s of c) do
            if (s ∉ L_{k-1}) then
                delete c from C_k;
From this description, the function has two steps:
- Join step: this step joins L_{k-1} with L_{k-1}. The items of each itemset are assumed to be sorted in lexicographic order. If the first k-2 items (the prefix) of two (k-1)-itemsets i_1 and i_2 (i_1 ≠ i_2) are identical, a candidate k-itemset is created for C_k by taking this prefix together with the (k-1)-th items of i_1 and i_2 (the items may have to be re-sorted). The condition p.item_{k-1} < q.item_{k-1} simply prevents duplicate k-itemsets from being added to C_k.
- Prune step: this step follows the join step. Here we remove every k-itemset c ∈ C_k that has some (k-1)-subset not present in L_{k-1}. The reason is as follows: suppose s is a (k-1)-subset of c that is not in L_{k-1}. Then support(s) < minsup. On the other hand, by property p1.1, since c ⊇ s we have support(c) ≤ support(s) < minsup. Hence c cannot be a large itemset and must be removed from C_k.
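The join and prune steps can be sketched in Python as follows. This is a minimal sketch under assumptions: itemsets are represented as sorted tuples, the function name `apriori_gen` mirrors the pseudocode, and the two sample inputs are illustrative data rather than part of the thesis.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build candidate k-itemsets from frequent (k-1)-itemsets.

    prev_frequent: iterable of sorted tuples, all of the same length k-1.
    """
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for p in prev:
        for q in prev:
            # Join step: same (k-2)-prefix, and last item of p before last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must itself be frequent.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates

# Illustrative input: frequent 2-itemsets.
L2 = [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]
C3 = apriori_gen(L2)

# Illustrative input where the prune step removes a joined candidate.
L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4 = apriori_gen(L3)
```

In the second call, the join produces (1, 2, 3, 4) and (1, 3, 4, 5), and the prune step discards (1, 3, 4, 5) because its subset (1, 4, 5) is not in L3.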
Example: Let the set of items be I = {A, B, C, D, E} and the transaction database be
D = {<1, {A,C,D}>, <2, {B,C,E}>, <3, {A,B,C,E}>, <4, {B,E}>},
with minsup = 0.5 (that is, 2 transactions). Running the Apriori algorithm above gives the following trace.

Database D
TID    Items
1      {A, C, D}
2      {B, C, E}
3      {A, B, C, E}
4      {B, E}

Scan all of D to obtain C_1:
1-itemset    Count    Support
{A}          2        50%
{B}          3        75%
{C}          3        75%
{D}          1        25%
{E}          3        75%

Remove the itemsets with support < minsup to obtain L_1:
1-itemset    Count    Support
{A}          2        50%
{B}          3        75%
{C}          3        75%
{E}          3        75%

Join L_1 with L_1 and prune to obtain C_2, then scan all of D:
2-itemset    Count    Support
{A, B}       1        25%
{A, C}       2        50%
{A, E}       1        25%
{B, C}       2        50%
{B, E}       3        75%
{C, E}       2        50%

Remove the itemsets with support < minsup to obtain L_2:
2-itemset    Count    Support
{A, C}       2        50%
{B, C}       2        50%
{B, E}       3        75%
{C, E}       2        50%

Join L_2 with L_2 and prune to obtain C_3, then scan all of D:
3-itemset    Count    Support
{B, C, E}    2        50%

Remove the itemsets with support < minsup to obtain L_3:
3-itemset    Count    Support
{B, C, E}    2        50%

Figure 5: Example of the Apriori algorithm
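The trace of this example can be reproduced with a compact Python sketch of the whole level-wise loop. It is only an illustration under assumptions: the subset test is a plain linear scan over the candidates rather than the hash tree mentioned earlier, itemsets are sorted tuples, and minsup is passed as an absolute count (`minsup_count`, a hypothetical parameter name).

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    # Pass 1: count the single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= minsup_count}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = sorted(frequent)
        prev_set = set(prev)
        # Candidate generation: join on the (k-2)-prefix, then prune.
        candidates = [
            p + (q[-1],)
            for p in prev for q in prev
            if p[:-1] == q[:-1] and p[-1] < q[-1]
            and all(s in prev_set for s in combinations(p + (q[-1],), k - 1))
        ]
        # One database pass counts every candidate of size k.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(frequent)
        k += 1
    return result

# The example database D with minsup = 0.5, i.e. 2 of 4 transactions.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(D, minsup_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), s)):
    print(itemset, frequent[itemset])
```

The loop stops at k = 4, where the single frequent 3-itemset yields no candidates.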


3.1.2 Some variants of the Apriori algorithm
The AprioriTID algorithm extends the basic approach of Apriori. Instead of relying on the raw database, AprioriTID internally represents each transaction by the current candidates it contains. Below, C̄_k (C-bar) denotes this internal representation, as distinct from the candidate set C_k.
    L_1 = {large 1-itemsets};
    C̄_1 = database D;
    for (k = 2; L_{k-1} ≠ ∅; k++) do
    begin
        C_k = apriori_gen(L_{k-1});
        C̄_k = ∅;
        for all entries t ∈ C̄_{k-1} do
        begin
            // determine the candidates in C_k that are contained in the
            // transaction with identifier t.Tid (transaction code)
            C_t = {c ∈ C_k | (c - c[k]) ∈ t.Set_of_ItemSets ∧ (c - c[k-1]) ∈ t.Set_of_ItemSets};
            for all candidates c ∈ C_t do c.count++;
            if (C_t ≠ ∅) then C̄_k += <t.Tid, C_t>;
        end
        L_k = {c ∈ C_k | c.count ≥ minsup};
    end
    return ∪_k L_k;
This algorithm also uses the function apriori_gen to generate the candidates for each pass. However, for passes k > 1 it does not use the database D to count supports; it uses the set C̄_{k-1} instead. Each member of C̄_k has the form <Tid, {X_k}>, where each X_k is a potentially frequent k-itemset present in the transaction Tid. For k = 1, C̄_k corresponds to D, with each item i regarded as the itemset {i}. For k > 1, C̄_k is built by C̄_k += <t.Tid, C_t>. The member of C̄_k corresponding to transaction t is <t.Tid, {c ∈ C_k | c is contained in t}>. If a transaction does not contain any candidate k-itemset, C̄_k has no entry for that transaction. Consequently, the number of entries in C̄_k may be smaller than the number of transactions in the database, especially for large k. Moreover, for fairly large k each entry may be smaller than the corresponding transaction, because only a few candidates are contained in the transaction. For small k, however, an entry may be larger than the corresponding transaction, because an entry in C̄_k includes all candidate k-itemsets contained in the transaction.
The AprioriHybrid algorithm combines the two approaches above. There are also several Apriori(TID)-like algorithms that are designed to be implemented directly in SQL.
The DIC algorithm is yet another variant of Apriori. DIC relaxes the strict separation between counting and candidate generation: as soon as a candidate reaches the minsupp threshold, DIC starts generating further candidates based on it. To do this, DIC uses a prefix tree. In contrast to a hash tree, every node of a prefix tree (leaf or inner node) is assigned exactly one candidate or frequent itemset. Its use is also the opposite of a hash tree: whenever a node is reached, we can be sure that the itemset associated with that node is contained in the transaction. In addition, interleaving support counting and candidate generation reduces the number of database scans.
3.1.3 Improving the Apriori algorithm
As presented above, finding association rules consists of two phases:
1) Find the frequent itemsets for a given threshold minsupp ∈ (0, 1];
2) From the frequent itemsets found in step 1 and a given confidence threshold minconf ∈ (0, 1], list all association rules that meet the minconf threshold.
Most of the time in step 1 is spent determining whether a given itemset is frequent. In practice, it is not necessary to mine all frequent itemsets in the first step; it suffices to mine the closed frequent itemsets. This section presents the use of closure mappings to find the closed frequent itemsets. Because the set of closed frequent itemsets is much smaller than the set of all frequent itemsets, the running time of frequent itemset mining is reduced considerably.
Definition 1: Galois connection
Let R ⊆ I × T be a binary relation. For X ⊆ I and S ⊆ T, define the mappings:
t: I → T, t(X) = {y ∈ T | ∀x ∈ X, x R y}, ∀X ⊆ I.
- t(X) is the set of all transactions of T that contain all the attributes of X.
i: T → I, i(S) = {x ∈ I | ∀s ∈ S, x R s}, ∀S ⊆ T.
- i(S) is the set of all attributes of I that occur in all the transactions of S.
Example: Let the database D be
     A  B  C  D  E
1    1  1  0  1  1
2    0  1  1  0  1
3    1  1  0  1  1
4    1  1  1  0  1
5    1  1  1  1  1
6    0  1  1  1  0
Then: t(AB) = 1345; t(BCD) = 56; t(E) = 12345
i(123) = BE; i(345) = ABE; i(23) = BE
The pair of mappings (t, i) is called a Galois connection on T × I.
Definition 2: Composite mappings
For X ⊆ I and S ⊆ T, we define two composite mappings:
C_it: I → I, C_it(X) = i(t(X))
C_ti: T → T, C_ti(S) = t(i(S))
Example:
C_it(AB) = i(t(AB)) = i(1345) = ABE
C_ti(23) = t(i(23)) = t(BE) = 12345
Definition 3: Closure mapping
Given a set U, let subset(U) = {X | X ⊆ U}. A mapping f: subset(U) → subset(U) is called a closure mapping on U if for all subsets X, Y ⊆ U the following properties hold:
T1) Reflexivity: X ⊆ f(X)
T2) Monotonicity: if X ⊆ Y then f(X) ⊆ f(Y)
T3) Idempotence: f(f(X)) = f(X)
It can be seen that C_it and C_ti are closure mappings on the itemsets and on the sets of transactions, respectively.
Definition 4: Closure of an itemset
For X ⊆ I, the closure of X is X+ = C_it(X).
Example: For the database D above,
A+ = ABE, because C_it(A) = i(t(A)) = i(1345) = ABE
B+ = B, because C_it(B) = i(t(B)) = i(123456) = B
AC+ = ABCE, because C_it(AC) = i(t(AC)) = i(45) = ABCE
Definition 5: Closed frequent itemset
Let X ⊆ I be frequent with respect to the threshold minsupp. X is called a closed frequent itemset with respect to minsupp if X = X+ = C_it(X).
For example, in the database above, B and BC are closed frequent itemsets for minsupp = 0.4, because C_it(B) = B, C_it(BC) = BC, supp(B) = 1, and supp(BC) ≈ 0.67.
BCD is not a closed frequent itemset for minsupp = 0.4: C_it(BCD) = BCD, but supp(BCD) ≈ 0.33 < minsupp.
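The mappings t and i and the closure C_it can be checked directly on the six-transaction database of the example. The Python sketch below is a minimal illustration; the function names follow the notation of the definitions, and the convention that i of an empty transaction set returns all items is an assumption of this sketch.

```python
# The example database, one set of items per transaction identifier.
D = {
    1: set("ABDE"),
    2: set("BCE"),
    3: set("ABDE"),
    4: set("ABCE"),
    5: set("ABCDE"),
    6: set("BCD"),
}
ALL_ITEMS = set("ABCDE")

def t(itemset):
    """Transactions that contain every item of the itemset."""
    return {tid for tid, items in D.items() if set(itemset) <= items}

def i(tids):
    """Items that occur in every transaction of tids (all items if tids is empty)."""
    common = set(ALL_ITEMS)
    for tid in tids:
        common &= D[tid]
    return common

def closure(itemset):
    """C_it(X) = i(t(X))."""
    return i(t(itemset))

def is_closed_frequent(itemset, minsupp):
    """True if X equals its closure and supp(X) >= minsupp."""
    x = set(itemset)
    return closure(x) == x and len(t(x)) / len(D) >= minsupp

print(sorted(closure("A")), sorted(closure("AC")))
print(is_closed_frequent("BC", 0.4), is_closed_frequent("BCD", 0.4))
```

The two calls at the end mirror the examples of Definitions 4 and 5: BC is closed and frequent, while BCD is closed but falls below the threshold.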
Definition 6: Closure of a family of itemsets
Let K ⊆ subset(I) be a family of itemsets that meet minsupp. We define K+ = {X+ | X ∈ K} as the closure of the family K.
Algorithm 1: Find the closures of the items of I
Format: Fred_1_Item(T, I, minsupp)
Input: database D, minsupp, set of items I
Output: K+ = {X+ | X ∈ K, X+ = C_it(X) and supp(X) ≥ minsupp}
Method:
    K = ∅;
    for each X ∈ I do
        if (supp(X) ≥ minsupp) then
            K := K ∪ {C_it(X)}
        endif;
    endfor;
    return (K);
Algorithm 2: Find the closed sets, that is, Fix(C_it)
Format: Fix(T, I, minsupp)
Input: database D, minsupp, set of items I
    K = Fred_1_Item(T, I, minsupp)
Output: K+ = {X ∈ K | X = X+ and supp(X) ≥ minsupp}
Method:
    K+ := ∅;
    while (K ≠ K+) do
        K = K+;
        K_1 := {X ∪ Y | X, Y ∈ K};
        K_2 := ∅;
        for each X ∈ K_1 do
            K_2 := K_2 ∪ {X+}
        endfor
        Frequent(K_1, K_2, minsupp, K+);
        K := K;
    endwhile;
    return (K+);
Algorithm 3: Find the frequent sets of K
Format: Frequent(K_1, K, minsupp, K+)
Input: K ⊆ I, minsupp;
    K_1 = {X ∈ K computed in the previous step with supp(X) ≥ minsupp}
Output: K+ = {X ∈ K | supp(X) ≥ minsupp}
Method:
    K_2 := ∅;
    for each X ∈ K do
        if (∃Y ∈ K_1) and (X ⊆ Y) then
            K_1 := K_1 ∪ {X}
        else
            if not((∃Y ∈ K_2) and (X ⊇ Y)) then
                if supp(X) ≥ minsupp then
                    K_1 := K_1 ∪ {X}
                else
                    K_2 := K_2 ∪ {X};
                endif;
            endif;
    endfor;
    return (K_1);
Example: Consider the database D above, with I = {A, B, C, D, E} = ABCDE; T = {1, 2, 3, 4, 5, 6} = 123456; minsupp = 0.4 (corresponding to 3 transactions).
Applying Algorithm 1 gives K = {ABE, B, BC, BD, BE}.
Applying Algorithm 2 with input K = {ABE, B, BC, BD, BE}
gives the output K_2 = {ABCE, ABDE, BCD, BCE, BDE}.
Applying Algorithm 3 with input K_1 = {ABE, B, BC, BD, BE}
gives the output {ABE, B, BC, BD, BE, ABDE, BCE, BDE}.
Remark: The above presents an improvement to frequent itemset mining based on theoretical results about closure mappings and closures. The algorithm avoids finding all frequent itemsets and instead only has to find the smaller number of closed frequent itemsets, which considerably improves the computation speed when the data volume is large.
3.2 DFS algorithms (depth-first search)
Suppose that occurrences are counted on a candidate set of reasonable size; each candidate set then requires one scan of the database. For example, the BFS-based Apriori algorithm scans the database once for each candidate size k. With depth-first search (DFS), the candidate set consists only of the itemsets of a single node of the tree from Section 2.1. Obviously, if a database scan had to be performed for every node, the total cost would be enormous. Combining DFS with occurrence counting is therefore not really practical.
Recently, a new approach called FP-growth has been presented. In a preprocessing step, FP-growth derives a highly condensed representation of the transaction data, the FP-tree. The FP-tree is built by counting occurrences and DFS. In contrast to the earlier DFS approaches, FP-growth does not follow the nodes of the tree from above but descends directly to certain parts of the itemsets in the search space. In a second step, FP-growth uses the FP-tree to derive the support values of all frequent itemsets.
3.3 The DHP algorithm (Direct Hashing and Pruning)
Direct Hashing and Pruning is essentially a variant of the Apriori algorithm. Both algorithms generate candidate (k+1)-itemsets from the (possibly very numerous) k-itemsets, and both verify these candidates by counting their occurrences in the database (that is, by computing their support). The difference is that DHP uses hashing to eliminate unnecessary itemsets immediately, before the next candidate generation phase.
The candidate set generated early on, and the set of candidate 2-itemsets in particular, is the key factor in the efficiency of data mining, because in each step the frequent k-itemsets (L_k) are used to form the candidate (k+1)-itemsets (C_{k+1}) by joining L_k with itself. In general, the more itemsets C_k contains, the higher the processing cost of determining L_k. In Apriori, |C_2| = C(|L_1|, 2) = |L_1|(|L_1| - 1)/2, so determining L_2 from C_2 by scanning the whole database and testing every transaction against C_2 is very expensive. By building a considerably smaller C_2, DHP performs the counting on C_2 much faster than Apriori.
While DHP counts the support of C_k by scanning the database, it also accumulates the information needed for the candidate (k+1)-itemsets: all (k+1)-subsets of each transaction, after some trimming, are hashed into a hash table. Each entry of the hash table holds the number of itemsets that the hash function has mapped to it. This hash table is then used to determine C_{k+1}. To find C_{k+1}, the algorithm generates all (k+1)-itemsets from L_k as in Apriori, but adds a (k+1)-itemset to C_{k+1} only if it passes the filter based on the hash table. In this way the algorithm avoids generating superfluous members of C_{k+1} and reduces the cost of checking them when L_{k+1} is determined. Experiments show that DHP reduces the size of C_{k+1} considerably.
Input: Database D
Output: the frequent k-itemsets
/* Database = set of transactions;
   Items = set of items;
   transaction = <TID, {x | x ∈ Items}>;
   F_1 is the set of frequent 1-itemsets */
F_1 = ∅;
/* H_2 is the hash table for 2-itemsets */
for each transaction t ∈ Database do begin
    for each item x in t do
        x.count++;
    for each 2-itemset y in t do
        H_2.add(y);
end
for each item i ∈ Items do
    if i.count/|Database| ≥ minsupp
    then F_1 = F_1 ∪ {i};
H_2.prune(minsupp)
/* Find F_k, the set of frequent k-itemsets, k ≥ 2 */
for (k := 2; F_{k-1} ≠ ∅; k++) do begin
    // C_k: the set of candidate k-itemsets
    C_k = ∅;
    for each x ∈ {F_{k-1} * F_{k-1}} do
        if H_k.hassupport(x)
        then C_k = C_k ∪ {x};
    for each transaction t ∈ Database do begin
        for each k-itemset x in t do
            if x ∈ C_k
            then x.count++;
        for each (k+1)-itemset y in t do
            if ∀z | z = k-subset of y: H_k.hassupport(z)
            then H_{k+1}.add(y);
    end
    // F_k is the set of frequent k-itemsets
    F_k = ∅;
    for each x ∈ C_k do
        if x.count/|Database| ≥ minsupp
        then F_k = F_k ∪ {x};
    H_{k+1}.prune(minsupp)
end
Answer = ∪_k F_k
In the initialization step, while the occurrences of the 1-itemsets are counted, the occurrences of the hash values of the 2-itemsets are counted as well. Candidates are then removed if the corresponding hash value in the hash table is smaller than minSupp. A (k+1)-itemset of a transaction is added to the hash table H_{k+1} only if the hash values of all of its k-subsets meet minSupp in H_k. DHP also removes from the database the transactions that do not contain any frequent itemset, and removes the items that do not take part in any frequent itemset, after each step.
As the database grows, DHP is considerably faster than Apriori. The size of the gain, however, depends heavily on the size of the hash table.
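The effect of the hash filter on C_2 can be illustrated with a small Python sketch of the first DHP pass. It is a sketch under assumptions, not the published implementation: the bucket count `n_buckets`, the toy hash function `h`, and the four sample transactions are all chosen for illustration. Several pairs may share a bucket, so the filter can only over-estimate counts and never discards a truly frequent pair.

```python
from itertools import combinations

def dhp_candidate_pairs(transactions, minsup_count, n_buckets=7):
    """First DHP pass: count items, hash every 2-subset into a bucket table,
    and keep only pairs of frequent items whose bucket count reaches minsup."""
    item_count = {}
    bucket = [0] * n_buckets

    def h(pair):
        # Toy hash function on single-character items (an assumption of this sketch).
        return (ord(pair[0]) * 10 + ord(pair[1])) % n_buckets

    for t in transactions:
        for x in t:
            item_count[x] = item_count.get(x, 0) + 1
        for pair in combinations(sorted(t), 2):
            bucket[h(pair)] += 1

    f1 = sorted(x for x, n in item_count.items() if n >= minsup_count)
    apriori_c2 = list(combinations(f1, 2))  # what Apriori would have to count
    dhp_c2 = [p for p in apriori_c2 if bucket[h(p)] >= minsup_count]
    return apriori_c2, dhp_c2

# Sample transactions, for illustration only.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
apriori_c2, dhp_c2 = dhp_candidate_pairs(D, minsup_count=2)
print(len(apriori_c2), len(dhp_c2))
```

With a larger table fewer pairs collide, which is why the gain depends on the hash table size.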


3.4 The PHP algorithm (Perfect Hashing and Pruning)
In DHP, if we could define a hash table large enough that every itemset is mapped to its own distinct slot, the values in the hash table would give the actual number of occurrences of each itemset. In that case the occurrences of these itemsets would not have to be counted again.
It is also clear that the number of data rows to be scanned against a large collection of itemsets hurts performance. Reducing the number of transactions that have to be re-read, and dropping the items that no longer need to be considered, clearly improves data mining on large databases.
The proposed PHP algorithm uses a perfect hash table at every step and at the same time reduces the size of the database by cutting out the transactions that do not contain any frequent itemset. The algorithm works as follows. In the first step, the size of the hash table equals the number of items in the database. Each item is mapped to a distinct location in the hash table, hence the name perfect hashing. The add method of the hash table inserts a new entry with its count initialized to 1 if the entry does not yet exist in the table; otherwise the count is increased by 1. After the first step, the hash table holds the exact number of occurrences of each item in the database. With a single pass over the hash table (held in main memory), the algorithm easily produces the frequent 1-itemsets. After this step, the prune method of the hash table removes all entries whose support is smaller than minSupp.
In the following steps, the algorithm trims the database by discarding the transactions that do not contain any frequent itemset and by dropping all items that do not take part in any frequent itemset. It then generates the candidate k-itemsets and counts the occurrences of the k-itemsets. At the end of the step, D_k is the trimmed database, H_k holds the occurrence counts of the k-itemsets, and F_k is the set of frequent k-itemsets. The process continues until no further F_k is found.
This algorithm is clearly better than DHP, because once the hash table has been built, the occurrences of the candidate k-itemsets do not have to be counted again as in DHP. It is also better than Apriori because the database shrinks in every iteration, which improves efficiency when the database is large and the number of frequent itemsets is relatively small.
Input: Database
Output: the frequent k-itemsets
/* Database = set of transactions;
   Items = set of items;
   transaction = <TID, {x | x ∈ Items}>;
   F_1 is the set of frequent 1-itemsets */
F_1 = ∅;
/* H_1 is the hash table for 1-itemsets */
for each transaction t ∈ Database do begin
    for each item x in t do
        H_1.add(x);
end
for each itemset y in H_1 do
    if H_1.hassupport(y)
    then F_1 = F_1 ∪ {y};
H_1.prune(minsupp)
D_1 = Database
/* Find F_k, the set of frequent k-itemsets, k ≥ 2 */
k = 2;
repeat
    D_k = ∅;
    F_k = ∅;
    for each transaction t ∈ D_{k-1} do begin
        // w ranges over the (k-1)-subsets of the items in t
        if ∀w | w ∉ F_{k-1}
        then skip t;
        else
            items = ∅;
            for each k-itemset y in t do
                if ∀z | z = (k-1)-subset of y: H_{k-1}.hassupport(z)
                then begin
                    H_k.add(y);
                    items = items ∪ y;
                end
            D_k = D_k ∪ {t};
    end
    for each itemset y in H_k do
        if H_k.hassupport(y)
        then F_k = F_k ∪ {y};
    H_k.prune(minsupp)
    k++;
until F_{k-1} = ∅;
Answer = ∪_k F_k;
In addition, after each iteration D_k is a database that contains only the transactions holding some frequent itemset. The algorithm forms all k-subsets of the items in each transaction and inserts those whose (k-1)-subsets all have sufficient support in the hash table. Because the pruning is done while the candidate k-itemsets are being added to H_k, the hash table does not grow too large and can be kept in main memory.
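The database trimming used by DHP and PHP can be sketched as follows. This is a minimal Python illustration under assumptions: the function name `trim_database` is hypothetical, the frequent k-itemsets are given as frozensets, and the sample data is only for demonstration.

```python
def trim_database(transactions, frequent_k):
    """Keep only the transactions that contain at least one frequent k-itemset,
    and within them only the items that occur in some frequent k-itemset."""
    useful_items = set().union(*frequent_k) if frequent_k else set()
    trimmed = []
    for t in transactions:
        if any(f <= t for f in frequent_k):
            trimmed.append(t & useful_items)
    return trimmed

# Sample data, for illustration only.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
F3 = [frozenset("BCE")]
D3 = trim_database(D, F3)
print(D3)
```

Here two of the four transactions are dropped and item A disappears from the third one, so the next pass scans less data.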
4. Generating rules from the frequent itemsets
Once the frequent itemsets for the support threshold minSupp are available, we need to extract the rules whose confidence is at least minConf. To generate the rules, for each frequent itemset L we find the non-empty proper subsets of L. For each such subset s, we output the rule s → (L - s) if the ratio supp(L)/supp(s) is at least minConf.
    for each frequent itemset L
        generate all non-empty proper subsets s of L
        for each non-empty proper subset s of L
            output the rule "s → (L - s)" if support(L)/support(s) ≥ min_conf
where min_conf is the minimum confidence threshold.
Example: for the frequent itemset l = {abc}, the subsets are s = {a, b, c, ab, ac, bc}:
a → b, a → c, b → c
a → bc, b → ac, c → ab
ab → c, ac → b, bc → a
The problem is that when the cardinality |L| = n becomes large, the number of rules that can be generated from a frequent itemset L is far from small.
The number of rules generated from L is 2^n - 2 with |L| = n (for example, if |L| = 10, the confidence of 1022 generated rules has to be checked).
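The enumeration of the 2^n - 2 candidate rules can be sketched in Python. The support counts below are sample values for a three-item frequent itemset, and the function name `generate_rules` is an assumption of this sketch.

```python
from itertools import combinations

def generate_rules(itemset, support, minconf):
    """Return (antecedent, consequent, confidence) for every rule s -> (L - s)
    whose confidence support[L] / support[s] reaches minconf."""
    L = frozenset(itemset)
    rules = []
    for size in range(1, len(L)):
        for s in combinations(sorted(L), size):
            s = frozenset(s)
            conf = support[L] / support[s]
            if conf >= minconf:
                rules.append((s, L - s, conf))
    return rules

# Sample support counts, for illustration only.
support = {
    frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}
all_rules = generate_rules("BCE", support, minconf=0.0)
strong_rules = generate_rules("BCE", support, minconf=0.7)
for antecedent, consequent, conf in strong_rules:
    print(sorted(antecedent), "->", sorted(consequent), conf)
```

With minconf = 0 every one of the 2^3 - 2 = 6 candidate rules is returned, which makes the exponential growth in |L| easy to observe.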
4.1 Improvement 1: Reducing the number of rules that are generated and checked
The first difficulty to be solved is that even a small increase in |L| makes the number of generated rules grow exponentially, so that many more rules have to be checked.
Consider a rule r: X => Y that does not meet minConf. Then any rule r' obtained by adding an item i ∈ L to the right-hand side cannot meet minConf either:
If r: X => Y has conf(r) < minConf, then r': X => Y ∪ i (with i ∈ L) also has conf(r') < minConf.
Thus, for a fixed set X, generating and checking the rules r should start with the sets Y of one element, then the sets of two elements, three elements, and so on. Looking back at the frequent itemset problem, we see that finding the sets Y has much the same character as finding the frequent itemsets L: we only generate and check the confidence of an element y at level k if all of its subsets meet minConf (in other words, all of its subsets must already belong to the confident sets of the previous levels).
    for each X ⊂ L do
        if X ≠ ∅ then begin
            YS_1 = generate_1_itemset_has_confident(X, L\X);
            k = 2;
            while YS_{k-1} ≠ ∅ do
                CY_k = generate_k_itemset_from(YS_{k-1}, L\X);
                YS_k = DB.check_confident(X, CY_k);
                k = k + 1;
            endwhile
        end
    endfor;
Figure 6: Illustrative example, with L = {1, 2, 3, 5, 7}, X = {1, 3}, and the candidate consequents Y = {2}, {5}, {7}, {2, 5}, {2, 7}, {5, 7}, {2, 5, 7}
In the example above, the rule {1, 3} => {7} does not meet minConf, so the rules {1, 3} => {2, 7}, {1, 3} => {5, 7}, and {1, 3} => {2, 5, 7} do not need to be considered either.
With this observation, some of the improvements used for frequent itemset mining can be applied here as well. Note, however, that here the cardinality |L| is not very large, and that supp(X ∪ Y) and supp(X) can be regarded as already stored (see the PHP algorithm), so some of those improvements may be unnecessary.
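The level-wise search over consequents for a fixed antecedent X can be sketched in Python. This is an illustration under assumptions: the five transactions are hypothetical data chosen so that {1, 3} => {7} fails a minConf of 0.7, and the names `confident_consequents` and `support` are not from the thesis.

```python
from itertools import combinations

# Hypothetical transactions, for illustration only.
transactions = [
    {1, 2, 3, 5, 7}, {1, 2, 3, 5, 7}, {1, 2, 3, 5}, {1, 2, 3, 5}, {1, 3},
]

def support(itemset):
    """Support count of an itemset in the sample transactions."""
    return sum(1 for t in transactions if itemset <= t)

def confident_consequents(X, L, minconf):
    """Return the consequents Y within L - X with conf(X => Y) >= minconf,
    plus the number of confidence checks that were actually performed."""
    X = frozenset(X)
    rest = sorted(frozenset(L) - X)
    found, checked, prev = [], 0, set()
    for k in range(1, len(rest) + 1):
        if k == 1:
            cands = [frozenset([y]) for y in rest]
        else:
            # Keep a k-set only if all of its (k-1)-subsets were confident.
            cands = [
                frozenset(c) for c in combinations(rest, k)
                if all(frozenset(s) in prev for s in combinations(c, k - 1))
            ]
        level = set()
        for Y in cands:
            checked += 1
            if support(X | Y) / support(X) >= minconf:
                level.add(Y)
        if not level:
            break
        found.extend(sorted(level, key=sorted))
        prev = level
    return found, checked

found, checked = confident_consequents({1, 3}, {1, 2, 3, 5, 7}, minconf=0.7)
print([sorted(y) for y in found], checked)
```

On this data only four of the seven possible consequents are ever checked, because every consequent containing 7 is skipped once {1, 3} => {7} has failed.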
4.2 Improvement 1.a: Avoiding the generation of meaningless rules
Another property to keep in mind is that if a rule r: X => Y meets conf(r) ≥ minConf, then a rule r' obtained by adding an item i ∈ Y to the left-hand side also meets the confidence threshold minConf:
If r: X => Y has conf(r) ≥ minConf, then r': X ∪ i => Y also has conf(r') ≥ minConf.
Here the rule r' carries no practical meaning once we already have the rule r, so in most association rule applications we prefer not to consider it.
Thus, instead of considering the sets X ⊂ L in arbitrary order, we consider them in order of priority: first the sets X with 1 element, then the sets X with 2 elements, …, and finally the sets X with |L|-1 elements. Considering them in this order lets us detect early, and eliminate completely, the meaningless generated rules r': X' => Y, by marking such a rule r' as not meeting minConf whenever a rule X => Y that meets minConf has already been found with X ⊂ X'.
The algorithm is revised as follows:
    for k = 1 to |L|-1 do
        for each X ∈ generate_k_itemset(L) do
            YS_1 = generate_1_itemset_has_confident(X, L\X);
            YS_1 = Cache.FilterOutRedundantRules(X, YS_1);
            m = 2;
            while YS_{m-1} ≠ ∅ do
                CY_m = generate_k_itemset_from(YS_{m-1}, L\X);
                YS_m = DB.check_confident(X, CY_m);
                m = m + 1;
            endwhile
        endfor;
    endfor;
4.3 Other techniques for reducing the cost of computing confidence
To avoid re-scanning the database to compute confidence (which is no less expensive than scanning it to compute support), we can cache the supports of the frequent itemsets. The storage cost is clearly small compared with the cost of recomputing the confidence of each rule. We can also reuse the hash tree built by the PHP algorithm to obtain the support of any frequent itemset quickly.
5. Evaluation and remarks
In this part we have examined several frequent itemset mining algorithms, such as Apriori and AprioriTID. These algorithms all scale linearly with the size of the database; that is, their time, memory, and computational costs are all proportional to the size of the database D.
Chapter 2: INFORMATION RETRIEVAL MODEL
1. Information search
Imagine looking for a book in a library that has no catalogue. It is not an easy task. Searching for information on the Internet is much the same. Users start by following hyperlinks to new web pages and then identify the relevant documents that contain the information they need. Any unclear link can lead them away from the scope of the search. In a small, stable system, designing a guide for searching is not a problem. The World Wide Web, however, is a decentralized information environment made up of many different kinds of content, constantly changing and growing rapidly, so searching it is a challenge that takes considerable time.
Many tools, or intelligent search engines, now address this problem. They provide a fast search mechanism by maintaining an index of web pages. The job of the indexer is to classify web pages into information groups and to build a full-text index of all pages. Because the web changes constantly, indexing must be repeated periodically. Users only need to enter the keywords or topic they are interested in, and the search engine lists all relevant documents in order of how closely they match.
There are many kinds of search engine. Some search within a particular topic or type of information, for example software (www.softseek.com) or music (www.mp3search.com). Others cover general information.
Alongside the need to search is the need to keep up with changes on the web. These changes include new job postings on the Internet and breaking news. Tracking them helps candidates find suitable jobs, helps businesses find candidates who meet their requirements, and lets users know what has happened and is happening around them.
As noted above, maintaining the index of web pages (both the index of document categories and the full-text index of the documents) determines the quality of a search engine. To maintain this index, search engines continuously traverse web pages by following hyperlinks, and in doing so decide which documents to add to their index. The most important characteristic of the World Wide Web is its decentralized information model: anyone can add servers, information or hyperlinks. In such a changing environment, discovering new information is as important to a search engine as collecting relevant information.
Search engines identify the information users need through its URL. When examining a URL, the search engine uses the purpose of the search to decide whether the URL should be followed further, and stores its content if appropriate. After storing a document, the search engine marks the document as visited, finds all the links in it, and repeats the process for these new links. All of these steps affect how information is stored in the database.
2. Search engine model
[Figure: search engine model, with the components Search Engine, Internet, Robots, Query Server and Database]
A search engine consists of the following components:
- Main search engine module: controls all activity of the system.
- Information update module (Robots): responsible for locating and retrieving information about documents on the Internet that match the requests issued by the main module.
- Database: stores information about the documents, such as document content, the hyperlinks between them, ...
2.1 Search engine
A search engine discovers new documents by starting from a set of known documents, examining the hyperlinks that appear in them, following one of the links to a new document, and then repeating the whole process. Picture the web as a directed graph; searching then amounts to traversing that graph with some graph traversal algorithm. The search engine is responsible not only for deciding which document to visit next but also for deciding which types of document are visited at all.
2.2 Agents
To collect documents from the web, the search engine calls on Agents, also known as Robots. An agent takes a URL as input, and its task is to retrieve information about the document at that address. The result returned to the main module is either an object containing the content of the document at that address or an explanation of why the document could not be retrieved. Agents must be able to access different content types over common protocols such as HTTP, FTP, ...
Because waiting for a reply from a remote server can tie up system resources, agents are usually organized as separate processes that run in parallel. The main module manages these processes: when it discovers a new address, it finds an idle agent and assigns the task to it. When the agent finishes, it returns the result to the main module and sets its state to idle. The process continues until the allotted time runs out or no new addresses remain.
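The division of work between the main module and its agents can be sketched as follows. This is only an illustration under stated assumptions: it uses a thread pool from the Python standard library in place of separate processes, and the in-memory FAKE_WEB dictionary, its URLs and the function names are hypothetical stand-ins for real network access.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": URL -> (content, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("page A", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("page B", ["http://c.example/"]),
    "http://c.example/": ("page C", []),
}

def agent(url):
    """One agent task: return the document, or the reason it was not retrieved."""
    if url not in FAKE_WEB:
        return {"url": url, "ok": False, "reason": "not found"}
    content, links = FAKE_WEB[url]
    return {"url": url, "ok": True, "content": content, "links": links}

def crawl(seeds, workers=3):
    """Main module: hand new addresses to idle agents until none remain."""
    seen, results = set(seeds), {}
    frontier = list(seeds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            batch, frontier = frontier, []
            for doc in pool.map(agent, batch):   # agents run in parallel
                results[doc["url"]] = doc
                for link in doc.get("links", []):
                    if link not in seen:         # only new addresses are queued
                        seen.add(link)
                        frontier.append(link)
    return results

results = crawl(["http://a.example/", "http://missing.example/"])
```

The pool plays the role of the set of agents; a time limit, as mentioned in the text, could be added as a second stopping condition on the loop.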
3. How search engines work
As noted above, search engines use robots to build an index of web page content. Robots are programs that automatically follow the hyperlinks on web pages and collect the page data needed for indexing. They are called robots because they operate independently: they parse the hyperlinks themselves and follow them. Other names for this kind of program include spider, worm, wanderer, gatherer, ... A robot following links is much like a person viewing web pages in a browser.
You may ask why robots need to build an index of web pages at all, rather than searching only when the user enters a query. The reason is that a centralized index reduces the volume of data moving in and out of servers, and allows a large number of documents to be searched by many people at the same time. It also allows results to be listed in order of their relevance to the query.
Below we look more closely at how robots gather data for building the index, how they follow links on the Internet, how they index documents and how they update the index.
3.1 How robots work
A robot starts from a given page, usually the home page of some web site, reads the content of the page as a web browser would, and follows hyperlinks to other pages. Whether it moves on to another page depends on the system configuration. A robot may be restricted to crawling pages within a single server or a single domain name.
A search engine discovers new documents by starting from a set of known documents, examining the hyperlinks that appear in them, following one of the links to a new document, and then repeating the whole process. Picture the web as a directed graph; searching then amounts to traversing that graph with some graph traversal algorithm.
The figure below shows an example. Suppose document A on Server1 and document E on Server3 have been visited, and the search engine must now decide which new document to visit next. Document A links to documents B and C; document E links to documents D and F. The search engine chooses one of the documents B, C or D to visit next, based on the search request being carried out.

Figure 7: How a robot operates
3.2 Breadth-first traversal
The idea of breadth-first traversal is to gather all the pages around the starting point before following links further away from it. This is the approach robots most commonly take. If indexing is carried out across several servers, the request load is distributed evenly among them, which makes the search more efficient. This strategy also makes it easier to implement parallel processing in the system.
In the graph below, the starting page is in the centre and is shaded darkest. The next pages, shaded medium-dark, are indexed first, followed by the more lightly shaded pages and finally the white pages.

Figure 8: Breadth-first search model
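A minimal sketch of breadth-first traversal over a link graph is given below. The LINKS dictionary is a hypothetical site used only for illustration; each key is a page and each value is the list of pages it links to.

```python
from collections import deque

def bfs_order(graph, start):
    """Return pages in breadth-first order together with their depth."""
    depth = {start: 0}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()          # oldest discovered page first
        order.append(page)
        for nxt in graph.get(page, []):
            if nxt not in depth:        # skip pages already discovered
                depth[nxt] = depth[page] + 1
                queue.append(nxt)
    return order, depth

# Hypothetical link structure: page -> pages it links to.
LINKS = {"home": ["a", "b"], "a": ["c", "d"], "b": ["d", "e"], "c": ["f"]}
order, depth = bfs_order(LINKS, "home")
print(order)   # pages nearest to "home" come first
```

The queue is what guarantees that every page at depth 1 is indexed before any page at depth 2.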
3.3 Depth-first traversal
In this approach, the robot follows the first link on the starting page, then the first link on the second page, and so on. Once it has indexed the first link of each page, it moves on to the second links and those after them. Some simple robots use this method because it is easy to implement.

Figure 9: Depth-first search model
3.4 Depth limit
One issue for robots is the maximum depth they are allowed to reach when crawling a web site. In the depth-first example above, the starting page has depth 0, and the grey shading of the pages indicates three levels of links, at depths 1, 2 and 3. On some web sites the most important information is close to the home page, and deeper pages tend to be less relevant to the main topic. On others, the first levels consist mainly of links and the detailed content lies at deeper levels. In that case robots must make sure they index the detail pages, because these are valuable to people searching the site. Some robots index only the first few levels in order to save storage space.
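Depth-first traversal with a depth limit can be sketched as below. The LINKS graph and the function name dfs_limited are hypothetical, chosen only to illustrate the idea; a real robot would fetch pages rather than read a dictionary.

```python
def dfs_limited(graph, start, max_depth):
    """Depth-first crawl order that never follows links past max_depth."""
    order, seen = [], set()

    def visit(page, depth):
        seen.add(page)
        order.append((page, depth))
        if depth == max_depth:
            return                      # depth limit reached: do not go deeper
        for nxt in graph.get(page, []):
            if nxt not in seen:
                visit(nxt, depth + 1)   # follow the first unseen link first

    visit(start, 0)
    return order

# Hypothetical link structure: page -> pages it links to.
LINKS = {"home": ["a", "b"], "a": ["c", "d"], "b": ["d", "e"], "c": ["f"]}
order = dfs_limited(LINKS, "home", max_depth=2)
print(order)
```

With max_depth=2 the page "f", which sits at depth 3, is never indexed; raising the limit to 3 brings it in, at the cost of more storage.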
3.5 Network congestion
Web robots, like browsers, can open several connections to a web server to read data. This can overload servers by forcing them to answer a flood of robot requests. When checking server activity or analysing logs of external requests, a network administrator may notice a large number of requests coming from the same IP address and may block the robot from accessing the site.
Many web robots therefore enforce a delay between requests to the same server. This is extremely important when the robot operates from a single address and the server being indexed has limited bandwidth or is handling many requests at once.
For overloaded servers, especially those with large pages that rarely change, checking when the information was last updated is essential. Using the HTTP HEAD request or a conditional GET, robots can obtain metadata about a web page, including the time it was last modified. Robots then fetch only the pages that have changed rather than all pages, which significantly reduces the request load on the server.
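Both ideas, a per-server delay and a conditional request, can be sketched without touching the network. The class name HostThrottle, the user agent string and the five-second delay are assumptions for illustration; If-Modified-Since is the standard HTTP header for a conditional GET, to which a server may reply 304 Not Modified.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

class HostThrottle:
    """Tracks the last request time per host and enforces a minimum delay."""

    def __init__(self, min_delay):
        self.min_delay = min_delay      # seconds between requests to one host
        self.last_request = {}          # host -> time of last request (seconds)

    def wait_time(self, host, now):
        """Seconds the robot should still wait before contacting host."""
        last = self.last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record(self, host, now):
        self.last_request[host] = now

def conditional_headers(last_indexed, user_agent="ExampleRobot/0.1"):
    """Headers for a conditional GET based on when the page was last indexed."""
    return {
        "User-Agent": user_agent,
        "If-Modified-Since": format_datetime(last_indexed, usegmt=True),
    }

throttle = HostThrottle(min_delay=5.0)
throttle.record("example.com", now=100.0)
print(throttle.wait_time("example.com", now=102.0))   # still inside the delay

headers = conditional_headers(datetime(2010, 7, 5, tzinfo=timezone.utc))
print(headers["If-Modified-Since"])
```

Passing the clock value in as a parameter keeps the throttle easy to test; a real robot would pass time.monotonic() and sleep for the returned duration.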
3.6 Restrictions on robots
Whenever a robot accesses a web page on a server over HTTP, the request headers carry some information about the client and the type of information requested. Among these is the User-Agent field, which records the name of the client (the program sending the request), whether a browser or a robot program. Network administrators can use it to monitor robot activity.
For security reasons, a network administrator can also specify which directories robots may access and keep robots out of others, such as CGI directories, temporary directories and personal directories. All of this information is stored in the file robots.txt, which is placed in the root directory.
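A robot can check these rules before requesting a page. The sketch below uses urllib.robotparser from the Python standard library; the robots.txt content, the directory names and the user agent name are hypothetical examples.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as it might be served from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse rules without a network call

def allowed(url, user_agent="ExampleRobot"):
    """True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/cgi-bin/search"))
```

In a live crawl the same parser can load the file directly with set_url() and read() instead of parse().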
3.7 Parsing links in web pages
For many web pages, finding the links to other pages is easy. Links take the standard URL form: <A HREF = page.html> (for a file in the same directory on the same server) or <A HREF = http://ww.domain.com/page.html> (for files on other servers).
On some web sites, however, detecting these links is not so simple. JavaScript, frames, image maps and some other constructs can all prevent a robot from recognizing where the links are.
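For the simple case of standard anchor tags, link extraction can be sketched with the Python standard library. The class name LinkExtractor, the sample HTML and the URLs are assumptions for illustration; links produced by JavaScript, frames or image maps are not handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the absolute URLs of all <a href=...> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # HTMLParser lower-cases tag and attribute names
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

html = '<A HREF="page.html">local</A> <a href="http://www.example.org/x.html">remote</a>'
extractor = LinkExtractor("http://www.example.com/docs/index.html")
extractor.feed(html)
print(extractor.links)
```

Resolving relative links with urljoin covers both forms mentioned in the text: a file in the same directory and a file on another server.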
3.8 Detecting Vietnamese encodings
Vietnamese does not yet have a single encoding used nationwide. Each region is accustomed to its own Vietnamese encoding: the northern provinces tend to use ABC and VietWare, while the south tends to use VNI and the ĐHBK TP.HCM encoding. This causes difficulties when exchanging information between computers. When we receive a Vietnamese file from a machine that does not use the same encoding as ours, we have to convert it. If the source encoding is known, the job is simple: a small program that is given the source encoding can convert the text quickly. Common Vietnamese software such as VietWare and VNI includes a conversion function for known source encodings. The problem becomes harder when the source encoding is unknown and we have to guess automatically the encoding of the Vietnamese text sent to us. With the explosion of the Internet, information is exchanged over the network ever more frequently, so the need for automatic detection of Vietnamese encodings is great.
Consider any program running on a web server whose input is Vietnamese text received from clients in different regions using different encodings (for example newspaper and book lookup, song selection, or remote database query programs). Each of them needs to detect the encoding used by the client in order to understand the string it receives and respond to the client's request. Detecting Vietnamese encodings also lets us convert all documents on the network to a single standard encoding, which makes later processing more convenient.
Chapter 3: EXPERIMENTAL APPLICATION: MINING INTEGRATED DATA FROM RECRUITMENT WEBSITES
1. The problem:
1.1 Problem statement:
Driven by social demand, recruitment through job websites is now common, and information about job vacancies and job seekers is updated continuously. Vacancy information includes: sector, hiring company, job, salary, age, gender. Job-seeker information includes: sector, applicant, age, gender, job. Taken together, this information helps managers and universities understand companies' hiring trends, learners' career choices and the salary level of each sector, so that they can adjust accordingly.
Within the scope of this project, I apply data mining techniques to the job-vacancy and job-seeker databases in order to identify, by sector, the trends among job seekers and the hiring trends of companies, using the Apriori algorithm.
1.2 Some well-known Vietnamese job websites:
http://www.vietnamworks.com
- Job seeker, summary: full name, email address, highest qualification, current level, total years of experience, most recent job
- Job seeker, desired job: position, level, type, sector, workplace, desired salary
- Job vacancy, company overview: company overview, company size, company address
- Job vacancy, job details: job title, job description, general requirements, language of application, required skills, employment type, workplace, sector, minimum level, salary
http://www.tuyendungnhanh.com
- Job seeker, summary: full name, email address, highest qualification, personal skills, current level, total years of experience, most recent job
- Job seeker, desired job: position, job title, job description, current salary, desired salary, job type, desired sector, location
- Job vacancy, company overview: company, description, phone, size, operating criteria, email, website
- Job vacancy, job details: job title/position, number of openings, sector, work location, job description, minimum skills, minimum education level, required experience, gender requirement, form of employment, salary, probation period, other benefits, application requirements, application deadline

http://www.ungvien.com.vn
- Job seeker, summary: full name, email address, highest qualification, current level, total years of experience, most recent job
- Job seeker, desired job: position, level, type, sector, workplace, desired salary
- Job vacancy, company overview: company name, company summary, company address
- Job vacancy, job details: job title, sector, work location, number of openings, job description, experience and skills, education level, required experience, job type, salary

http://works.vn
- Job seeker, summary: full name, age, address, job title, requirements, abilities
- Job seeker, desired job: job type, workplace, sector, salary, education level, skills
- Job vacancy, company overview: overview, size, address
- Job vacancy, job details: job title, job description, requirements, job type, workplace, sector, minimum level, salary, contact, application deadline

http://www.timviecnhanh.com
- Job seeker, summary: full name, date of birth, gender, marital status, address, phone, education level, email
- Job seeker, desired job: job title, job description, salary, location, education level, experience
- Job vacancy, company overview: company, address, description, phone, size, operating criteria, website
- Job vacancy, job details: job title/position, number of openings, sector, work location, minimum skills, minimum education level, required experience, gender requirement, form of employment, salary
1.3 Database design:
With the explosion of information technology, online recruitment has become a better fit for candidates and employers than traditional recruitment. Candidates and employers only need to visit recruitment websites to find suitable jobs or suitable candidate profiles, and then exchange applications directly by email.
This form of recruitment also saves managers time in collecting employment information. Regulatory agencies can track society's employment needs, and knowledge and job trends can be extracted from the job database. The database is also a source of information that helps Hai Phong Private University identify occupational trends and so guide its training.
Collecting job information from web pages automatically makes collection fast and accurate. Because web sites are organized hierarchically, we have to store their paths (URLs) and some important information about each website. Building a database that holds the information needed for automatic data retrieval speeds up the work. The information to be stored includes: the website name, the basic links within the website, the data behind those links, ...
The database model is as follows:

Figure 10: Database model for retrieving data from websites
A study of the profiles on well-known Vietnamese recruitment websites shows that the information falls into two kinds: job vacancies and job seekers. Vacancy information includes: sector, hiring company, job, salary, age, gender. Job-seeker information includes: sector, applicant, age, gender, job, ...
Tables of the job-seeker model
Table Ngành (sector)
    MaNganh      Int
    TenNganh     Nvarchar(100)

Table of job-seeking information
    MaTTTim      Int
    MaNganh      Int
    TenUngVien   Nvarchar(50)
    Dotuoi       Int
    Gioitinh     Boolean
    TenCv        Nvarchar(30)
The relational database model is:

Figure 11: Job-seeker database model
The job-vacancy database is as follows:
Table Ngành (sector)
    MaNganh      Int
    TenNganh     Nvarchar(100)

Table of recruitment information
    MaTTTuyen    Int
    MaNganh      Int
    TenDN        Nvarchar(50)
    MucLuong     Money
    Gioitinh     Boolean
    TenCv        Nvarchar(30)
    Dotuoi       Int
The relational database model is:

Figure 12: Recruitment database model
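The column lists above can be tried out with a small in-memory database. This sketch uses SQLite through Python's sqlite3 module rather than the database system used in the thesis; the table names Nganh, ThongTinTimViec and ThongTinTuyenDung, the foreign keys, the SQLite column types and the sample row are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Nganh (
    MaNganh  INTEGER PRIMARY KEY,
    TenNganh TEXT NOT NULL
);
CREATE TABLE ThongTinTimViec (        -- job seekers (table name assumed)
    MaTTTim    INTEGER PRIMARY KEY,
    MaNganh    INTEGER NOT NULL REFERENCES Nganh(MaNganh),
    TenUngVien TEXT,
    Dotuoi     INTEGER,
    Gioitinh   INTEGER,               -- assumed coding: 1 = male, 0 = female
    TenCv      TEXT
);
CREATE TABLE ThongTinTuyenDung (      -- vacancies (table name assumed)
    MaTTTuyen INTEGER PRIMARY KEY,
    MaNganh   INTEGER NOT NULL REFERENCES Nganh(MaNganh),
    TenDN     TEXT,
    MucLuong  REAL,
    Gioitinh  INTEGER,
    TenCv     TEXT,
    Dotuoi    INTEGER
);
""")

# Hypothetical sample row, joined back to its sector name.
conn.execute("INSERT INTO Nganh VALUES (1, 'CNTT')")
conn.execute("INSERT INTO ThongTinTimViec VALUES (1, 1, 'A', 25, 1, 'Programmer')")
rows = conn.execute("""
    SELECT n.TenNganh, t.Dotuoi, t.Gioitinh, t.TenCv
    FROM ThongTinTimViec t JOIN Nganh n ON n.MaNganh = t.MaNganh
""").fetchall()
print(rows)
```

The join produces records of the form (sector, age, gender, job), which is the shape of the data used for mining later in this chapter.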
From the analysis above, the relational schema used to store the data for this problem is as follows:

Figure 13: Database model of the program
1.4 Data specification:
A practical consideration is that items are not always treated simply as "yes" or "no" when counting support; each item may carry a weight that describes its importance. The items considered so far are Boolean: they take the value 1 if the item is present in the transaction and 0 otherwise. Mining problems of this kind are known as mining Boolean association rules.
In practice, however, data tables often contain attributes that are not so simple. Attributes may be quantitative, such as salary or age, or categorical, such as sector name, job name or gender. They have to be discretized, which leads to the problem of mining quantitative association rules. As with the association-rule problems discussed earlier, the goal is to extract the association rules that meet the minimum support and minimum confidence thresholds.
Attributes must first be partitioned, after which quantitative attributes are easily mapped to Boolean attributes. If a categorical or quantitative attribute has only a few distinct values (gender, for example), it can be mapped as follows: an attribute with p distinct values becomes p new Boolean attributes. Each new Boolean attribute corresponds to a pair <attribute,value>. It has the value 1 if value occurs in the original record and 0 otherwise. If the number of distinct values of an attribute is large, the attribute is partitioned into intervals and each <attribute,interval> pair is mapped to one Boolean attribute. After the mapping, association rules can be mined on the new database with a Boolean association-rule algorithm.
In general, the following discretization methods can be used:
Case 1: If A is a discrete numeric attribute or a categorical attribute with a finite domain {V1, V2, ..., Vk} and k is small (<100), we transform it into k binary attributes A_V1, A_V2, ..., A_Vk. The value of a record in field A_Vi is True (or 1) if the record's original value of A equals Vi; otherwise A_Vi is False (or 0).
Case 2: If A is a continuous numeric attribute, or a discrete numeric or categorical attribute with a finite domain {V1, V2, ..., Vp} where p is large, we map it to q binary attributes <A: start1..end1>, <A: start2..end2>, ..., <A: startq..endq>. The value of a record in field <A: starti..endi> is True (or 1) if the record's original value of A lies in the interval [starti..endi]; otherwise <A: starti..endi> is False (or 0).
MaNganh  TenUngVien            Dotuoi  GioiTinh  TenCv
CNTT     Nguyn Vn dng         25      1         Programmer
CNTT     Nguyn Vn h           27      1         Programmer
CNTT     Nguyn Th Linh        24      0         Network administrator
CNTT     Nguyn Th Hng Ngn     23      0         Network administrator
CNTT     inh Mnh Dng          23      1         Technician
CNTT     Phm th Linh          23      0         Network administrator
CNTT     Phm Cng Tm           23      1         Technician
CNTT     Phm th thu h         23      0         Network administrator
CNTT     Trn thanh tng        23      1         Computer graphics
Kt       th h                 22      0         Accountant
Kt       Trn bc thy          26      0         Chief accountant
Kt       Trn th thy          23      0         Accountant
Kt       Trn th phng         23      0         Accountant
Kt       Phm thanh tng        25      1         Chief accountant
Kt       Phm thanh hng        25      1         Chief accountant
...
Table 5: Database of job-seeking information
Example: the data table above can be partitioned as follows.
The age attribute has many values, so it can be divided into the intervals <20, 20..23, 24..26 and >26. The new data set then has four attributes corresponding to age: job seekers aged under 20, aged 20 to 23, aged 24 to 26 and aged over 26. The other attributes can be partitioned in the same way, possibly with different intervals.
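The two mapping cases can be sketched in Python using the age intervals from this example. The function names, the generated attribute names, the numeric bounds 0 and 150 for the open-ended intervals and the sample record are assumptions made for illustration.

```python
def one_hot(attr, value, domain):
    """Case 1: one Boolean attribute A_Vi per value of a small domain."""
    return {f"{attr}_{v}": int(value == v) for v in domain}

def bin_intervals(attr, value, intervals):
    """Case 2: one Boolean attribute <A: start..end> per closed interval."""
    return {f"{attr}:{lo}..{hi}": int(lo <= value <= hi) for lo, hi in intervals}

# Age intervals <20, 20..23, 24..26, >26, written as closed integer ranges.
AGE_BINS = [(0, 19), (20, 23), (24, 26), (27, 150)]

# Hypothetical record with the attributes used in this chapter.
record = {"MaNganh": "CNTT", "Dotuoi": 25, "GioiTinh": 1}

encoded = {}
encoded.update(one_hot("MaNganh", record["MaNganh"], ["CNTT", "Kt"]))
encoded.update(one_hot("GioiTinh", record["GioiTinh"], [0, 1]))
encoded.update(bin_intervals("Dotuoi", record["Dotuoi"], AGE_BINS))
print(encoded)
```

Because the intervals do not overlap and cover every age, exactly one of the four age attributes is 1 for each record.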
The database mapped from the original one is therefore:
MaNganh  TenUngVien         Dotuoi<20  Dotuoi 20..23  Dotuoi 24..26  Dotuoi>26  GioiTinh  TenCv
CNTT     Nguyn Vn dng      0          0              1              0          Male      Programmer
CNTT     Nguyn Vn h        0          0              0              1          Male      Programmer
CNTT     Nguyn Th Linh     0          0              1              0          Female    Network administrator
CNTT     Nguyn th Ngn      0          1              0              0          Female    Network administrator
CNTT     inh Mnh Dng       1          0              0              0          Male      Technician
CNTT     Phm th Linh       0          1              0              0          Female    Network administrator
CNTT     Phm Cng Tm        1          0              0              0          Male      Technician
CNTT     Phm th thu h      0          1              0              0          Female    Network administrator
CNTT     Trn thanh tng     0          1              0              0          Male      Computer graphics
Kt       th h              0          0              1              0          Female    Accountant
Kt       Trn bc thy       0          1              0              0          Female    Chief accountant
Kt       Trn th thy       0          1              0              0          Female    Accountant
Kt       Trn th phng      0          0              1              0          Female    Accountant
Kt       Phm thanh tng     0          0              1              0          Male      Chief accountant
Kt       Phm thanh hng     0          0              1              0          Male      Chief accountant
...
Table 6: Data converted from quantitative to Boolean form
The mapping above can give rise to the following problems:
minsup: if the number of intervals for a quantitative attribute (or the number of distinct values for a categorical attribute) is large, the support of any single interval may be low. Splitting an attribute into too many intervals can therefore prevent the rules that contain it from reaching minimum support.
minconf: some information is lost when values are grouped into intervals. Some rules reach minconf only when one of their items has a single value or a very small interval, so this information may be lost. The loss of information grows as the interval size increases.
So if the intervals are too large (few intervals), some rules may fail to reach minimum confidence, while if the intervals are too small (many intervals), some rules may fail to reach minimum support.
To address these problems, all contiguous ranges over the quantitative attribute, or over all the partitions, are considered. The minsup problem is overcome by combining adjacent intervals or adjacent values. The minconf problem is overcome by increasing the number of intervals, which then no longer affects the minsup problem.
A simple method can be used to bring quantitative and categorical attributes into the same form. The values of a categorical attribute are mapped to a set of consecutive integers. For quantitative attributes that need no partitioning (that is, with few values), the values are mapped to consecutive integers in order of value. For quantitative attributes that are partitioned, the intervals are mapped to consecutive integers and the order of the intervals is preserved. These mappings turn each record in the database into a set of <attribute,value> pairs. Association-rule mining can then be carried out in the following steps:
Step 1: Determine the number of partitions for each quantitative attribute.
Step 2: For categorical attributes, map the attribute values to consecutive integers. For quantitative attributes that are not partitioned, map their values to consecutive integers in order of value. For quantitative attributes that are partitioned, map the intervals to consecutive integers, preserving their order. In this way the algorithm treats single values and ranges of values alike, as values of quantitative attributes.
Step 3: Find the support of each value of the categorical and quantitative attributes, then find all itemsets whose support is at least minsupport.
Step 4: Use the itemsets found to generate association rules.
Step 5: Identify the interesting rules and output them.
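Steps 3 and 4 can be sketched as a compact Apriori implementation in Python. This is an illustrative sketch, not the program described in this thesis: the function names, the six encoded records and the thresholds are assumptions, and each record is written as a set of one-letter symbols (one per attribute value).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    tx = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in tx for i in t}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # One pass over the data counts every candidate of size k.
        counts = {c: sum(1 for t in tx if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        current = {
            a | b for a in level for b in level
            if len(a | b) == k
            and all(frozenset(s) in level for s in combinations(a | b, k - 1))
        }
    return frequent

def rules(frequent, min_conf):
    """Generate (lhs, rhs, support, confidence) from the frequent itemsets."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[lhs]
                if conf >= min_conf:
                    out.append((lhs, itemset - lhs, supp, conf))
    return out

# Hypothetical encoded records: sector (E, F), age group (I, J, K), gender (N, M).
transactions = [
    {"E", "J", "N"}, {"E", "K", "N"}, {"E", "I", "M"},
    {"F", "J", "M"}, {"F", "I", "M"}, {"F", "J", "N"},
]
frequent = apriori(transactions, min_support=0.3)
strong = rules(frequent, min_conf=0.7)
for lhs, rhs, supp, conf in strong:
    print(sorted(lhs), "=>", sorted(rhs), round(supp, 3), round(conf, 3))
```

Each pass of the while loop scans all transactions once, which is the cost the thesis's evaluation of Apriori refers to.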
Thus, for the database of job-seeker profiles (MaNganh, TenUngVien, Dotuoi, GioiTinh, TenCv) collected from the major recruitment websites in the country, we can partition the attributes in the table into intervals and assign symbols as follows:
Sector (MaNganh):
    Construction              A
    Electricity               B
    Culture and tourism       C
    Finance and banking       D
    Information technology    E
    Accounting                F
    Administration            G
Age (Dotuoi):
    Dotuoi<20                 H
    Dotuoi 20..23             I
    Dotuoi 24..26             J
    Dotuoi>26                 K
Gender (GioiTinh):
    Male                      N
    Female                    M
TenUngVien         TenCv                   MaNganh  Dotuoi  GioiTinh
Nguyn Vn dng      Programmer              E        J       N
Nguyn Vn h        Programmer              E        K       N
Nguyn Th Linh     Administration          G        J       M
Nguyn Th Ngn      Administration          G        I       M
inh Mnh Dng       Technician              E        H       N
Phm th Linh       Network administrator   E        I       M
Phm Cng Tm        Technician              E        H       N
Phm th thu h      Administration          G        I       M
Trn thanh tng     Computer graphics       E        I       N
th h              Accountant              F        J       M
Trn bc thy       Chief accountant        F        I       M
Trn th thy       Accountant              F        J       M
Trn th phng      Accountant              F        I       M
Phm thanh tng     Chief accountant        F        J       N
Phm thanh hng     Chief accountant        F        J       N
...
The results of the trial run are shown below. They depend on minsupp and minconf; these results were obtained with minsupp=0.1 and minconf=0.1:

Association rule                               Supp     Conf
Information technology => Age [23-26]          0.7104   0.9023
Information technology => Age [<20]            0.4409   0.9266
Information technology => Age [20-23]          0.554    0.9687
Information technology => Male                 0.854    0.9885
Information technology => Age [>26]            0.5573   0.9765
Information technology => Female               0.4901   0.9654
Accounting => Age [20-23]                      0.4409   0.9605
Accounting => Age [23-26]                      0.6737   0.9722
Accounting => Age [>26]                        0.5081   0.9117
Accounting => Male                             0.5409   0.9166
Accounting => Female                           0.5737   1

From these results we observe that in information technology the candidates most in demand are typically aged 23 to 26 and are usually male, while in accounting and auditing the largest number recruited are typically aged 23 to 26 and are mainly female.
1.5 Program demonstration
The program is implemented in C#.NET, with the database designed on SQL Server 2005. It was run on Windows 7 with a Core 2 Duo T6500 2.1 GHz CPU, 2 GB of RAM and a 250 GB hard disk with 50 GB free.
The main screens of the program are:

Figure 14: Program interface

Figure 15: Generating association rules with the Apriori algorithm

Figure 16: Rules obtained
1.6 Analysis and evaluation
The program finds frequent itemsets and association rules using the Apriori algorithm. We make the following observations:
To determine the support of the candidate itemsets, Apriori always has to rescan all the transactions in the database. This consumes a great deal of time as the itemset size k grows, because the number of passes over the transactions increases.
During the initial passes of the algorithm, the candidate set Ck is very large and in most cases comparable in size to the original database, so the time spent in these passes is correspondingly large.
1.7 Future work
- Continue to refine and extend the program so that it can be fully applied in practice. The program follows the steps of the data mining process: 1. data selection (selecting and extracting the necessary data from the database); 2. data cleaning (removing duplicates and limiting value ranges); 3. data enrichment; 4. knowledge extraction (searching for and discovering association rules, presenting reports); 5. selecting useful data to apply in practice.
- To date, most algorithms for finding frequent itemsets rest on the assumption of a uniform minimum support (minsup); that is, all accepted itemsets have support above the same minimum. This is unrealistic, because many accepted exceptions have much lower support than the general trend (classification criteria and priorities differ). In addition, discretizing quantitative attributes by partitioning often produces a very large number of intervals. My next research direction is therefore fuzzy association rules, a topic that is attracting considerable interest.
- Study data mining algorithms in greater depth and apply them to data mining problems that matter today, such as job forecasting and guidance for business.
CONCLUSION
The project has covered data warehouses and the use of storage and knowledge discovery in data warehouses to support decision-making.
In theory, knowledge discovery consists of the following steps: formulating, identifying and defining the problem; collecting and preprocessing data; data mining; extracting knowledge; and using the knowledge discovered. Data mining methods include classification, decision trees, inference, ... These methods can be applied to ordinary data.
On algorithms, the project presents several algorithms and illustrates a classic algorithm for finding frequent itemsets and mining association rules: Apriori.
On the experimental implementation, the project presents data mining with the Apriori algorithm applied to forecasting the job-seeking trends of candidates and the hiring trends of businesses.
In carrying out the project I tried to focus on studying and consulting the relevant literature. However, given limited time and expertise, shortcomings are unavoidable. I look forward to comments and suggestions from teachers and from friends who share this interest, so that I can improve the results of this research.
