http://www.ebook.edu.vn
TABLE OF CONTENTS
1.1. Data mining and knowledge discovery
1.2. Clustering techniques in data mining
1.3. Web mining
1.4. Text data processing applied to Web data mining
[Figure: the knowledge discovery process. Steps: selection, preprocessing, transformation. Data states: raw data, selected data, preprocessed data, knowledge.]
[...] different kinds of data. Mixed data in particular keeps growing in data management systems, and handling it is one of the major challenges for data mining in the coming decades, especially for Web data mining.
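[...] the larger the value, the greater the similarity between the objects, and vice versa; the dissimilarity function is inversely related to the similarity function. Some similarity measures that apply to different data types [10][17][27]:

+ Interval-scaled attributes: the dissimilarity of two data objects x and y is determined by metrics such as the Minkowski, Euclidean, Manhattan, and maximum distances.

+ Binary attributes: use the following contingency table, where $\tau = \alpha + \beta + \gamma + \delta$:

          y: 1    y: 0
  x: 1    α       β
  x: 0    γ       δ

Common measures for binary attribute data:
- Simple matching coefficient: $d(x, y) = \frac{\alpha + \delta}{\tau}$
- Jaccard coefficient: $d(x, y) = \frac{\alpha}{\alpha + \beta + \gamma}$

+ Nominal attributes: the dissimilarity between two objects x and y is defined as $d(x, y) = \frac{p - m}{p}$, where m is the number of attributes on which x and y match and p is the total number of attributes.

+ Ordinal attributes: the dissimilarity between data objects with ordinal attributes is computed as follows. Suppose attribute i has $M_i$ ordered states $[1 \ldots M_i]$. Each attribute value can be replaced by its rank $r_i$, with $r_i \in \{1, \ldots, M_i\}$. Because each ordinal attribute has a different value range, all of them are mapped to the common range [0, 1] by applying the following transformation to each attribute:

$z_i^{(j)} = \frac{r_i^{(j)} - 1}{M_i - 1}$

where $r_i^{(j)}$ is the rank of the value of attribute i in object j. The interval-scaled measures can then be applied to the values $z_i^{(j)}$.

+ Ratio-scaled attributes: there are several ways to compute similarity between ratio-scaled attributes. One option is to apply a logarithmic transformation to each attribute $x_i$.

Depending on the specific data, different similarity models are used. Choosing a similarity measure that is appropriate, accurate, and objective is very important: it contributes to building a data clustering algorithm that is effective in both cluster quality and computational cost.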
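The measures listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the binary counts follow the usual convention (alpha: both values are 1, beta: x is 1 and y is 0, gamma: x is 0 and y is 1, delta: both are 0), which is assumed here.

```python
from math import inf

def minkowski(x, y, q=2):
    """Minkowski distance: q=1 is Manhattan, q=2 Euclidean, q=inf the maximum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if q == inf:
        return max(diffs)
    return sum(d ** q for d in diffs) ** (1.0 / q)

def binary_counts(x, y):
    """Contingency counts (alpha, beta, gamma, delta) for two 0/1 vectors."""
    pairs = list(zip(x, y))
    return (pairs.count((1, 1)), pairs.count((1, 0)),
            pairs.count((0, 1)), pairs.count((0, 0)))

def simple_matching(x, y):
    """Simple matching coefficient: (alpha + delta) / tau."""
    alpha, beta, gamma, delta = binary_counts(x, y)
    return (alpha + delta) / (alpha + beta + gamma + delta)

def jaccard(x, y):
    """Jaccard coefficient: alpha / (alpha + beta + gamma)."""
    alpha, beta, gamma, _ = binary_counts(x, y)
    return alpha / (alpha + beta + gamma)

def nominal_dissimilarity(x, y):
    """(p - m) / p, with m matching attributes out of p."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

def ordinal_to_unit(rank, num_states):
    """Map a rank in 1..num_states to [0, 1]."""
    return (rank - 1) / (num_states - 1)
```

After the ordinal values are mapped to [0, 1], `minkowski` can be applied to them like any interval-scaled attribute.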
[...] Web services and Web structure. This field has attracted the attention of many researchers. The Web mining process can be divided into the following subtasks: resource finding, data selection and preprocessing, generalization, and analysis.
[Figure: labels recovered from the original diagram: Structure data, Dynamic link.]
In addition, to improve processing quality, several studies have proposed algorithmic improvements that take the context of words into account by using phrases and grammar rather than only individual words [31].

1.4.2.1. Stop word removal
Natural language contains many words that only express sentence structure and carry no content, such as prepositions and conjunctions. Such words occur frequently in texts but are unrelated to the topic or content of the text. We can therefore remove them to reduce the dimensionality of the vector that represents the text. Such words are called stop words.

1.4.2.2. Zipf's law
To reduce the dimensionality of the text vector further, we rely on the following observation: many words occur only a few times. If the goal is to determine similarity and difference across the whole document collection, words with such low frequency have very little influence on the documents. Zipf's law can be stated as the formula

$r_t \cdot f_t \approx K$ (where K is a constant),

where $f_t$ is the frequency of word t and $r_t$ is its rank when words are sorted by decreasing frequency. The law can be rewritten as

$r_t \approx K / f_t$

In general, for a word that occurs only once in the collection we have $r_{max} = K$. For the distinct words that occur b times in the collection, the law gives $r_b \approx K / b$, so dividing the two relations yields $r_b / r_{max} \approx 1 / b$. Zipf's law thus shows that the distribution of distinct words in a collection is dominated by the words that occur least often.
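The two reduction steps above can be illustrated as follows. This is a minimal sketch: the stop-word list is a tiny hypothetical sample, the tokenizer is a plain whitespace split, and the `min_freq` threshold is an assumed parameter rather than a value from the text.

```python
from collections import Counter

# Hypothetical sample list; a real stop-word list would be much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}

def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop words that carry structure rather than content."""
    return [t for t in tokens if t not in stop_words]

def rank_frequency(tokens):
    """Rows (rank, word, frequency, rank * frequency), most frequent first.

    Under Zipf's law the last column stays roughly constant on large corpora.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ordered, start=1)]

def prune_rare(tokens, min_freq=2):
    """Remove words occurring fewer than min_freq times in the collection."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_freq]
```

On a realistic corpus, most distinct words fall below even a small `min_freq`, which is why pruning rare words shrinks the vector dimension so much.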
1.4.3.2. Frequency models

1.4.3.2.1. Term frequency model
In the term frequency (TF) model, the value of a word is computed from the number of times it occurs in the document. Let $tf_{ij}$ be the number of occurrences of term $t_i$ in document $d_j$; then $w_{ij}$ can be computed with one of the formulas in [31], for example:

$w_{ij} = 1 + \log(tf_{ij})$

The more often term $t_i$ occurs in document $d_j$, the more $d_j$ depends on $t_i$; in other words, $t_i$ carries more information in $d_j$.

1.4.3.2.2. Inverse document frequency method
In the inverse document frequency (IDF) model, the weight of a word is computed as follows [31]:

$w_{ij} = \log\left(\frac{n}{h_i}\right) = \log(n) - \log(h_i)$ if $t_i \in d_j$, and $w_{ij} = 0$ otherwise ($t_i \notin d_j$),

where n is the number of documents and $h_i$ is the number of documents that contain $t_i$. The fewer documents $t_i$ appears in, the more important it is; so when $t_i$ does appear in $d_j$, its weight is larger.

1.4.3.2.3. Combined TF-IDF model
In the TF-IDF model [31], each document $d_j$ is represented by a feature vector $(t_1, t_2, \ldots, t_m)$, where $t_i$ is a word or phrase in $d_j$. The order of the $t_i$ is based on the weight of each word. Parameters can be added to optimize the grouping process. The TF-IDF weight is:

$w_{ij} = tf_{ij} \times idf_i = [1 + \log(tf_{ij})] \cdot \log\left(\frac{n}{h_i}\right)$ if $t_i \in d_j$, and $w_{ij} = 0$ otherwise.

Typically we build a dictionary that leaves out very common words and words with low frequency. The weight $w_{ij}$ combines the frequency of term $t_i$ in document $d_j$ with the rarity of $t_i$ across the whole database. Hierarchical clustering and partitioning clustering (k-means) are the two techniques most often used for document clustering with the TF-IDF model.
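As an illustration of the TF-IDF weighting described above, the following sketch computes the weight [1 + log(tf)] * log(n / h) for every term and document, with tf the term count in the document, n the number of documents, and h the number of documents containing the term. The function name and the list-of-token-lists input format are assumptions made for the example, and natural logarithms are used since the text does not fix a base.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists.

    Returns (vocabulary, W) where W[i][j] is the TF-IDF weight of
    term i in document j; the weight is 0 when the term is absent.
    """
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    # h[t]: number of documents that contain term t
    h = {t: sum(1 for c in counts if t in c) for t in vocab}
    W = []
    for t in vocab:
        row = []
        for c in counts:
            tf = c[t]
            if tf > 0:
                row.append((1 + math.log(tf)) * math.log(n / h[t]))
            else:
                row.append(0.0)
        W.append(row)
    return vocab, W
```

Note that a term occurring in every document gets weight 0, because log(n / h) is 0 when h equals n: such a term does not help distinguish documents.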
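[...] the criterion function

$E = \sum_{i=1}^{k} \sum_{x \in C_i} D^2(x - m_i)$

attains its minimum value, where $m_i$ is the centroid of cluster $C_i$ and D is the distance between two objects. The k-means algorithm consists of the following basic steps: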
INPUT: A database of n objects and the number of clusters k.
OUTPUT: Clusters C_i (i = 1, ..., k) such that the criterion function E attains its minimum value.
Step 1: Initialization. Choose k objects m_j (j = 1, ..., k) from the data set as the initial centroids of the k clusters (the choice may be random or based on experience).
Step 2: Distance computation. For each object X_i (1 ≤ i ≤ n), compute its distance to each centroid m_j, j = 1, ..., k, and find the nearest centroid.
Step 3: Centroid update. For each j = 1, ..., k, update the centroid m_j as the mean of the vectors of the data objects assigned to cluster j.
Step 4: Stopping condition. Repeat steps 2 and 3 until the cluster centroids no longer change.
Figure 2.1. The k-means algorithm
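The four steps of Figure 2.1 map directly onto a short program. The sketch below assumes points are tuples of numbers, uses squared Euclidean distance, and adds two things the figure does not mention: a `max_iter` safety bound and a `seed` for reproducible random initialization.

```python
import random

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)              # Step 1: initialization
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:                           # Step 2: nearest centroid
            j = min(range(k), key=lambda c: squared_distance(x, centroids[c]))
            clusters[j].append(x)
        # Step 3: update centroids (an empty cluster keeps its old centroid)
        new_centroids = [mean(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:           # Step 4: stop when stable
            break
        centroids = new_centroids
    return centroids, clusters
```

Because the initial centroids are drawn at random, different seeds can lead to different final clusters on less clearly separated data.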
The computational complexity is O((n·k·d)·r), where d is the number of dimensions and r is the number of iterations. k-means is also very sensitive to noise and outliers in the data. Clustering quality depends heavily on the input parameters, namely the number of clusters k and the k initial centroids. Many algorithms inherit the idea of k-means and are applied in data mining to handle large data sets, such as k-medoid, PAM, CLARA, CLARANS, ...
- Case 1: Suppose $O_j$ currently belongs to the cluster represented by $O_m$, and $O_j$ is more similar to $O_{m,2}$ than to $O_p$ ($d(O_j, O_p) \ge d(O_j, O_{m,2})$), where $O_{m,2}$ is the medoid second most similar to $O_j$ among the medoids. If $O_m$ is replaced by the new medoid $O_p$, then $O_j$ will belong to the cluster represented by $O_{m,2}$. $C_{jmp} = d(O_j, O_{m,2}) - d(O_j, O_m)$ is non-negative.
- Case 2: $O_j$ currently belongs to the cluster represented by $O_m$, but $O_j$ is less similar to $O_{m,2}$ than to $O_p$. If $O_m$ is replaced by $O_p$, then $O_j$ will belong to the cluster represented by $O_p$. $C_{jmp} = d(O_j, O_p) - d(O_j, O_m)$ can be negative or positive.
- Case 3: Suppose $O_j$ currently does not belong to the cluster represented by $O_m$ but to the cluster represented by $O_{m,2}$, and $O_j$ is more similar to $O_{m,2}$ than to $O_p$. Then, if $O_m$ is replaced by $O_p$, $O_j$ stays in the cluster represented by $O_{m,2}$. Hence $C_{jmp} = 0$.
- Case 4: $O_j$ currently belongs to the cluster represented by $O_{m,2}$, but $O_j$ is less similar to $O_{m,2}$ than to $O_p$. Therefore, if $O_m$ is replaced by $O_p$, $O_j$ moves from the cluster of $O_{m,2}$ to the cluster of $O_p$. $C_{jmp} = d(O_j, O_p) - d(O_j, O_{m,2})$ is always negative.
- Combining the four cases above, the total cost of swapping $O_m$ with $O_p$ is:

$TC_{mp} = \sum_{j} C_{jmp}$
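The four cases above can be checked with a small function that sums the per-object costs for one candidate swap. This is a sketch under stated assumptions: `swap_cost` is a hypothetical name, `dist` is any user-supplied distance function, there are at least two medoids, and the sum runs over all objects (for objects that are themselves medoids the case formulas still give the right contribution).

```python
def swap_cost(data, medoids, m, p, dist):
    """Total cost TC_mp of replacing medoid m by the non-medoid p."""
    others = [o for o in medoids if o != m]
    total = 0.0
    for j in data:
        d_m = dist(j, m)                          # distance to the medoid being removed
        d_p = dist(j, p)                          # distance to the candidate medoid
        d_2 = min(dist(j, o) for o in others)     # nearest remaining medoid
        if d_m <= d_2:
            # j currently belongs to the cluster of m
            if d_p >= d_2:
                total += d_2 - d_m                # case 1: j moves to O_{m,2}
            else:
                total += d_p - d_m                # case 2: j moves to p
        else:
            # j currently belongs to another medoid, at distance d_2
            if d_p < d_2:
                total += d_p - d_2                # case 4: j moves to p
            # case 3: j stays where it is, contribution 0
    return total
```

A negative result means the swap lowers the total distance of objects to their medoids, so PAM would accept it.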
The computational complexity of PAM is $O(I \cdot k \cdot (n-k)^2)$, where I is the number of iterations. PAM is therefore inefficient in running time when k and n are large.
[...] neighbors is limited by a user-supplied parameter, Maxneighbor. The Numlocal parameter lets the user specify the number of local optima to search for. Not all neighbors are examined, only Maxneighbor of them. The CLARANS algorithm can be described as follows [10][19]:
INPUT: A data set of n objects, the number of clusters k, O, dist, numlocal, maxneighbor;
OUTPUT: k data clusters;
For i = 1 to numlocal do
Begin
    Randomly initialize k medoids (the current partition S);
    j = 1;
    while j < maxneighbor do
    Begin
        Randomly choose a neighbor R of S.
        Compute the dissimilarity cost (total distance) of the two neighbors S and R.
        If R has a lower cost, replace S with R and set j = 1; otherwise j++;
    End;
    Check whether the cost of partition S is smaller than the smallest cost so far; if so, update the smallest cost and record S as the best partition found so far.
End.
Figure 2.4. The CLARANS algorithm
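A compact Python rendering of Figure 2.4 is given below as a sketch. It assumes that a neighbor of a set of medoids is obtained by swapping exactly one medoid with one non-medoid object, that the cost of a partition is the sum of distances from each object to its nearest medoid, and that k is smaller than the number of distinct objects. The `seed` parameter is added only for reproducibility.

```python
import random

def partition_cost(data, medoids, dist):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(dist(x, m) for m in medoids) for x in data)

def clarans(data, k, dist, numlocal=2, maxneighbor=20, seed=0):
    rng = random.Random(seed)
    best_cost, best_medoids = float("inf"), None
    for _ in range(numlocal):
        current = rng.sample(data, k)             # random initial medoids
        current_cost = partition_cost(data, current, dist)
        j = 1
        while j < maxneighbor:
            # Random neighbor: swap one medoid with one non-medoid.
            non_medoids = [x for x in data if x not in current]
            neighbor = list(current)
            neighbor[rng.randrange(k)] = rng.choice(non_medoids)
            neighbor_cost = partition_cost(data, neighbor, dist)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                j = 1                             # restart the neighbor count
            else:
                j += 1
        if current_cost < best_cost:              # keep the best local optimum
            best_cost, best_medoids = current_cost, current
    return best_medoids, best_cost
```

Recomputing the full partition cost for each neighbor keeps the sketch short; an efficient implementation would compute only the cost difference of the swap.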
The computational complexity of CLARANS is $O(kn^2)$. The advantage of CLARANS is that its search space is not restricted as in CLARA, and for the same amount of time the quality of the resulting clusters is higher than with CLARA.
Using the CF-tree structure makes BIRCH fast at clustering and applicable to large data sets; BIRCH is particularly effective for data sets that grow over time. Its complexity is O(n). Its drawbacks are that the quality of the discovered clusters is not very good and that it is not suitable for high-dimensional data.
The complexity of the algorithm is $O(n^2 \log n)$. CURE is a reliable algorithm for discovering clusters of arbitrary shape and works well on two-dimensional data sets. However, it is very sensitive to parameters such as the number of representative objects and the shrink factor applied to the representative points. In general, BIRCH is better than CURE in complexity but worse in clustering quality.
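If we use global values for Eps and MinPts, DBSCAN may merge two clusters into one when the densities of the two clusters are close to each other. The average computational complexity is O(n log n), assuming a spatial index answers each neighborhood query in O(log n).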
[...] the discovered clusters have horizontal and vertical boundaries that follow the cell boundaries. STING uses a grid data structure that allows parallel processing; the complexity of computing the statistical quantities for the cells is O(n). After the hierarchical data structure has been built, the processing time for queries is O(g), where g is the total number of cells at the lowest level (g << n).
CLIQUE applies well to high-dimensional data, but it is very sensitive to the order of the input data. The computational complexity of CLIQUE is O(n).
$z_{ij}^{(t)} = E(z_{ij} \mid x) = \Pr(z_{ij} = 1 \mid x) = \frac{\pi_j^{(t)} f_j(x_i)}{\sum_{g=1}^{k} \pi_g^{(t)} f_g(x_i)}$

2) M-step: estimate the probability $\pi$:

$\pi_j^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} z_{ij}^{(t)}$
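To make the two updates concrete, the sketch below runs one E-step and one M-step for the mixing probabilities only. It assumes the component densities f_j are fixed, known functions (here one-dimensional Gaussians chosen for illustration); a full EM implementation would also re-estimate the density parameters in the M-step.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def e_step(xs, pis, densities):
    """z[i][j] = pi_j * f_j(x_i) / sum_g pi_g * f_g(x_i)."""
    z = []
    for x in xs:
        weighted = [pi * f(x) for pi, f in zip(pis, densities)]
        total = sum(weighted)
        z.append([w / total for w in weighted])
    return z

def m_step(z):
    """pi_j = (1 / n) * sum_i z[i][j]."""
    n = len(z)
    k = len(z[0])
    return [sum(row[j] for row in z) / n for j in range(k)]
```

Each row of `z` sums to 1, so the updated probabilities returned by `m_step` also sum to 1.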
EM can discover clusters of many different shapes. However, because the algorithm needs many iterations to determine good parameters, its computational cost is rather high.
Some limitations of COBWEB are that it assumes the probability distributions on individual attributes are statistically independent, and that the cost of computing, updating, and storing the probability distributions of the clusters is rather high. Improved versions of COBWEB include CLASSIT and AutoClass.
[...] documents that are similar in content are placed in the same group, and dissimilar documents are placed in different groups.
[Figure label: Pattern extraction]
3.1.2.1. Data selection
Basically, local texts are formatted and integrated into documents as required for mining, and are distributed across many Web services using information retrieval techniques.

3.1.2.2. Data preprocessing
To obtain good mining results we need data that are clear and accurate, with messy and redundant data removed. After preprocessing, the resulting data set usually has the following characteristics [16]:
- The data are unified, and mixed data are coerced into a common form.
- Irrelevant data, noise, and empty data are cleaned out. No data are lost and no data are duplicated.
- The dimensionality is reduced and knowledge discovery becomes more efficient through conversion, generalization, coercion of data, ...
- Irrelevant attributes are cleaned out to reduce the dimensionality of the data.

3.1.2.3. Text representation
Web text mining mines collections of HTML documents. We therefore have to transform the data and represent them in a form suitable for processing. The TF-IDF model is commonly used to vectorize the data. An important issue is that this representation leads to vectors of rather high dimensionality.

3.1.2.4. Feature word extraction
Feature extraction is a method that can address the high dimensionality of the feature vectors produced by text mining. Feature extraction can be based on a weighting function:
- Each feature word receives a confidence weight computed by a confidence weighting function. A high frequency of a feature word makes it likely that the word reflects the topic of the text, so it is assigned a larger confidence value. Moreover, if the word is a title, a keyword, or a phrase, it certainly has a larger confidence value.
- Feature extraction can also be based on principal component analysis from statistical analysis. The main idea of this method is to replace the full set of feature words with a small number of principal feature words in the description, thereby reducing the dimensionality.

3.1.2.5. Text mining
Once the texts have been collected, selected, and their basic features extracted, they form the basis for data mining. From there we can perform summarization, classification, clustering, analysis, and prediction.

3.1.2.5.1. Text summarization
Text summarization produces the main meaning of a text document, giving a summary description during synthesis. The user can then understand the main meaning of the text without having to browse the whole text. This method is used in particular by search engines, which usually need to present text excerpts. Many search engines offer predicted sentences during the search and when returning results; the best way to obtain the main meaning of a text or a set of texts is mainly to use several different algorithms.

3.1.2.5.2. Text classification
First, many documents are classified automatically, quickly and with high efficiency. Second, each text class is assigned to a suitable topic. The Naive Bayesian and k-nearest-neighbor classification methods are commonly used for mining text information. In text classification, the first step is to categorize the documents. Second, the features are determined from the number of features in the training document set. Finally, the classification of the test documents and the similarity of the classified documents are computed by some algorithm. Documents with high similarity to one another belong to the same class. Similarity is measured by a predefined evaluation function. If few documents are similar, the value is set to 0. If a document does not match any of the predefined classes, it is considered unsuitable.

3.1.2.5.3. Text clustering
The categories need not be determined in advance, but the documents must be grouped into several clusters. Within a cluster, the similarity among documents is required to be high; across clusters, the similarity is lower. Linkage-based ordering methods and hierarchical methods are commonly used in text clustering.
- First, divide the document set into initial clusters by optimizing an evaluation function according to some principle, R = {R1, R2, ..., Rn}, where n must be determined in advance.
- For each document in the document set W, W = {w1, w2, ..., wm}, compute its similarity to each initial cluster Rj, sim(wi, Rj), then choose the most similar cluster and put the document into that cluster Rj.
- Repeat the steps above until all documents have been placed into the determined clusters.
Figure 3.3. Partitioning clustering algorithm
3.1.2.5.4. Trend analysis and prediction
By analyzing Web documents, we can obtain the distribution relationships of particular data at each stage and can predict future development.

3.1.3. Evaluating pattern quality
Web data mining can be viewed as a machine learning process. The results of machine learning are knowledge patterns. An important part of machine learning is evaluating the resulting patterns. We usually split the document collection into a training set and a test set, then repeat learning and testing on the training set and the test set. Finally, the average quality is used to evaluate the quality of the model.
Figure 3.4. General architecture of Web usage mining
3.2.3.1. User session identification
User identification: all requests coming from the same client IP are treated as one user. Session identification: a new session is created when a new address is found, or when the time spent visiting a page exceeds the allowed threshold (for example, 30 minutes) for a given IP address.

3.2.3.2. Web logs and identifying user navigation sessions
Web log file service: a Web log file is a collection of records of user requests for documents on a Web site.

3.2.3.3. Problems in processing Web logs
- The information provided may be incomplete or not detailed.
- There is no information about the content of the visited pages.
- There are too many log records because of requests served through proxies.
- Log records are incomplete because of requests served by proxies.
- Filtering log entries.
- Estimating page visit time.

3.2.3.4. Methods for identifying sessions and Web accesses
Session identification: group a user's page references into sessions based on heuristic methods. A heuristic based on the IP address and a session timeout (for example, 30 minutes) is used to identify user sessions. This is the simplest method.
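The IP-plus-timeout heuristic described above can be sketched as follows. The record layout `(ip, timestamp_in_seconds, url)` and the function name are assumptions made for this example; real log formats carry more fields and need parsing first.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # 30 minutes, in seconds

def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group (ip, timestamp, url) records into sessions.

    Records from the same IP are treated as one user; a gap longer
    than `timeout` between consecutive requests starts a new session.
    """
    by_ip = defaultdict(list)
    for rec in sorted(records, key=lambda r: r[1]):
        by_ip[rec[0]].append(rec)
    sessions = []
    for recs in by_ip.values():
        current = [recs[0]]
        for prev, rec in zip(recs, recs[1:]):
            if rec[1] - prev[1] > timeout:
                sessions.append(current)   # close the session at the gap
                current = []
            current.append(rec)
        sessions.append(current)
    return sessions
```

As the list of problems above suggests, this heuristic merges different users behind one proxy into a single user, which is one reason it is only the simplest option.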
[...] Web log information can be transformed into transaction patterns suitable for later processing in different fields. Missing data, caused for example by local caches and proxy services, are supplemented or removed. Information in cookies and user registration information is processed together with the IP address, the browser name, and other temporarily stored information. Transaction identification: identify user sessions and transactions.

3.2.4.2. Data mining
Data mining methods from different fields, such as association rules, statistical analysis, path analysis, classification, and clustering, are used to discover user patterns.
+ Path analysis [8][9][22]: most frequently visited paths are laid out according to the physical graph of the Web site. By analyzing the paths followed during user access, we can learn the relationships between related paths in user accesses.
+ Association rules [8]: correlations among references to different files on the server are found using association rules.
+ Sequential patterns: patterns found across transactions and time sequences. They express a set of items that is followed by another item in the time-ordered transaction set.
+ Classification rules [22]: profiles of the items that belong to a particular group according to common attributes.
+ Cluster analysis: group together customers or data items that have similar characteristics. This helps develop and carry out online or offline customer marketing strategies, such as responding automatically to customers in a certain group, and it makes it possible to change a particular Web site dynamically for each customer.

3.2.4.3. Analysis and evaluation
Pattern analysis [22]: statistics, knowledge querying, and intelligent agents. Usability analysis, data queries oriented toward human consumption. Visualization: Web visualization uses Web path diagrams and presents OLAP directed graphs.
- Citation graph: one node per page, with an undirected edge between u and v if there is a third page w that links to both u and v.
- Assumption: a link from page u to page v is a recommendation of page v by page u. If u and v are connected by a link, it is very likely that the two Web pages have similar content.
3.3.2.1. The PageRank algorithm
Google is based on the PageRank algorithm, which indexes the links between Web sites and interprets a link from A to B as an endorsement of B by A. Links have different values. If A has many links pointing to it and C has few links pointing to it, then a link from A to B is worth more than a link from C to B. The value determined in this way is called the PageRank of a page and determines its position in the search results. Links can be analyzed more accurately and efficiently than traffic volume or page views, and they have become a measure of success and of changes in the ranking of pages.
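The text gives no formula for PageRank, so the sketch below uses the commonly published damped power iteration as an assumption: each page shares its current rank equally among its outgoing links, with a damping factor of 0.85. The function name, the dictionary input format, and the handling of pages without outgoing links are choices made for this example, not details taken from the text.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping a page to the list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A link from a highly ranked page passes on more value.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Page without outgoing links: spread its rank evenly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank
```

For example, with `links = {"A": ["B"], "B": ["A"], "C": ["A", "B"]}`, pages A and B end up with equal ranks well above that of C, which no page links to.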
PageRank is not based simply on the total number of inbound links. The basic approach of PageRank is that a document is considered more important the more documents link to it, but these inbound links do not all count equally. A document ranks high in terms of PageRank if other highly ranked documents link to it.

3.3.2.2. Clustering with the HITS algorithm
HITS is a more developed algorithm for ranking documents based on the link information among a set of documents.
- Authority: pages that provide important, trustworthy information on a given topic.
- Hub: pages that contain links to authorities.
- In-degree: the number of links pointing to a node, used to measure authority.
- Out-degree: the number of links going out of a node, used to measure hubness.
Authorities and hubs exhibit a mutually reinforcing relationship: a hub is better if it points to good authorities, and conversely an authority is better if it is pointed to by many good hubs.

Steps of the HITS method
Step 1: Determine a base set S. Take the set of documents returned by a standard search engine, called the root set R, and initialize S to R.
Step 2: Add to S all pages that are pointed to by any page in R. Add to S all pages that point to any page in R. For each page p in S, compute the authority score $a_p$ (vector a) and the hub score $h_p$ (vector h). For each node, initialize $a_p$ and $h_p$ to 1/n (n is the number of pages).
Step 3: In each iteration, compute the authority weight of each node in S with the formula:

$a_p = \sum_{q:\, q \to p} h_q$
Note that the hub weights are computed from the current authority weights, which in turn were computed from the previous hub weights.
Step 5: After the new weights have been computed for all nodes, the weights are normalized so that:

$\sum_{p \in S} (a_p)^2 = 1$ and $\sum_{p \in S} (h_p)^2 = 1$
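The iteration can be sketched as below. The authority update and the normalization follow the formulas in the text; the hub update (the sum of the authority weights of the pages a node points to) is the standard HITS rule and is assumed here, since the text only describes it in words. The function name and input format are hypothetical.

```python
import math

def hits(links, iterations=20):
    """links: dict mapping a page to the list of pages it points to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    auth = {p: 1.0 / n for p in pages}
    hub = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub weights of the pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Hub update (assumed standard rule), using the current authority weights.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        # Normalization: squared weights sum to 1.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub
```

In a small graph where `h1` points to `a1` and `a2` while `h2` points only to `a1`, page `a1` receives the higher authority score and `h1` the higher hub score.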
3.4. Applying data clustering algorithms to searching and clustering Web documents
Thanks to the continual improvement of search engines, in both search functionality and user interface, users can find information on the Web more easily. However, users often still have to browse tens or even thousands of Web pages before they find what they need. To address this problem, we can group the search results by topic, so that users can skip the groups they are not interested in and go straight to the group they care about. This helps users carry out their work more effectively. However, clustering data on the Web and choosing suitable topics that describe the content of the pages is not a simple problem. In this work, we look at using clustering techniques to cluster Web documents based on a repository of data that has already been searched and stored.
[...] is considered not relevant to the query and is removed from the result set. Next, we assign weights to the clusters and to the pages in the result set with the following algorithm:
INPUT: a data set D of pages, consisting of k clusters and k centroids
OUTPUT: the weights of the pages
BEGIN
    Each data cluster m with centroid C_m is assigned a weight ts_m. For any centroids C_i, C_j we always have ts_i > ts_j if C_i is more similar to the query than C_j.
    For each page p in cluster m, determine the page weight pw. For any pages p_i, p_j in the cluster we always have pw_i > pw_j if p_i is closer to the centroid than p_j.
END
Figure 3.7. Algorithm for weighting clusters and pages
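Figure 3.7 only constrains the ordering of the weights, not their values. One simple way to satisfy both ordering rules, assumed in the sketch below, is to use cosine similarity directly: the cluster weight is the similarity between the centroid and the query vector, and the page weight is the similarity between the page vector and its centroid. All names and the data layout are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def weight_clusters_and_pages(query, centroids, clusters):
    """clusters[m] is a list of (page_id, vector) pairs for cluster m."""
    cluster_weights = [cosine(query, c) for c in centroids]
    page_weights = {}
    for m, pages in enumerate(clusters):
        for page_id, vector in pages:
            page_weights[page_id] = cosine(vector, centroids[m])
    return cluster_weights, page_weights

def ranked_results(cluster_weights, clusters, page_weights):
    """Clusters by decreasing weight; pages by decreasing weight inside each."""
    order = sorted(range(len(clusters)), key=lambda m: -cluster_weights[m])
    return [[pid for pid, _ in sorted(clusters[m], key=lambda pv: -page_weights[pv[0]])]
            for m in order]
```

The ranked output lists the cluster closest to the query first, and within each cluster the pages closest to the centroid first.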
This approach therefore addresses the following issues:
+ Search results are divided into clusters by topic; depending on their specific needs, users identify the topic they want.
+ Searching and weighting pages focus mainly on page content rather than on the links between pages.
+ The problem of synonymous words and phrases in the user's query is addressed.
+ Clustering methods from data mining can be combined with existing search methods.
Currently, several data clustering algorithms are used in text clustering, such as partitioning algorithms (k-means, PAM, CLARA) and hierarchical algorithms (BIRCH, STC), ... In practice, when Web documents are clustered by content, a document may belong to several topic groups. To handle this, we can use clustering algorithms based on the fuzzy approach.
- Search for Web pages from Web sites that satisfy the query.
- Extract descriptive information from the pages and store it together with the corresponding URLs.
- Use data clustering techniques to cluster the Web pages automatically, so that pages within a cluster are more similar in content to one another than to pages outside the cluster.
[Figure: processing pipeline. Recovered stages: Web data; data search and extraction; preprocessing; result presentation.]
3.4.2.1. Searching for data on the Web
The main task of this stage is to use the set of search keywords to search for and return a set consisting of the full document text, title, summary description, URL, ... of the corresponding pages. To speed up processing, we search for these documents and store them in a data repository for use in the search process (similar to search engines such as Yahoo, Google, ...).

3.4.2.2. Data preprocessing
This is the process of cleaning the data and converting the documents into suitable data representations. It includes the following tasks: text normalization, stop word removal, merging words with the same stem, digitizing and representing the text, ...

3.4.2.2.1. Text normalization
This stage converts raw text into a form that makes later processing easier, simpler, more convenient, and more accurate than processing the raw text directly, while having little effect on the processing results.

3.4.2.2.2. Stop word removal
Texts contain words that carry little information for processing. Words with low frequency, and words with high frequency that are not important for processing, are removed. Some recent studies show that removing stop words can reduce the total number of words in a text by about 20-30%. Many words occur with high frequency but are not useful for data clustering. Words with excessively high frequency are also removed. For simplicity in practical applications, we can maintain a list of stop words and use Zipf's law to remove words whose frequency is low or too high.

3.4.2.2.3. Merging words with the same stem
Most languages have many words that share a common root and carry similar meanings. To reduce the dimensionality of the text representation, we merge words with the same stem into a single word. According to some studies [5], this merging reduces the dimensionality of the text representation by about 40-50%. For example, in English the words user, users, used, and using share the same stem and are reduced to use; the words engineering, engineered, and engineer share the same stem and are reduced to engineer.

3.4.2.3. Building the dictionary
Building the dictionary is a very important task in text vectorization. The dictionary consists of the distinct words and phrases in the entire data set. It is a table of words together with their indices in the dictionary, sorted in order.
Some papers [31] propose that, to improve clustering quality, the handling of phrases in different contexts should be considered. According to the proposal of Zamir [19][31], a dictionary of 500 entries is appropriate.

3.4.2.4. Tokenization, digitizing text, and document representation
Tokenization is a very important task in text representation. Tokenizing and vectorizing a document is the process of finding the words and replacing each one with its index in the dictionary. Here we can use one of the mathematical models TF, IDF, TF-IDF, ... to represent the text. We use a two-dimensional weight array W of size m x n, where n is the number of documents and m is the number of terms in the dictionary (the dimensionality); column j is the vector that represents document j in the database, and row i corresponds to term i in the dictionary. $W_{ij}$ is the weight of term i for document j. This stage counts the frequency of term $t_i$ in document $d_j$ and the number of documents that contain $t_i$. From these counts, the weight matrix W is built with the following formula.

TF-IDF weight formula:

$W_{ij} = [1 + \log(tf_{ij})] \cdot \log\left(\frac{n}{h_i}\right)$ if $t_i \in d_j$, and $W_{ij} = 0$ otherwise.
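The dictionary and the index replacement described above can be sketched as follows. The function names are hypothetical, documents are assumed to be already tokenized and cleaned, and `max_size` reflects the dictionary size of 500 entries mentioned in the text. Keeping the most frequent words when truncating is an assumption of this sketch.

```python
from collections import Counter

def build_dictionary(documents, max_size=500):
    """Map each kept word to its index; the words are stored in sorted order."""
    counts = Counter(token for doc in documents for token in doc)
    kept = sorted(counts, key=lambda w: (-counts[w], w))[:max_size]
    return {word: index for index, word in enumerate(sorted(kept))}

def to_indices(document, dictionary):
    """Replace each word by its dictionary index; unknown words are dropped."""
    return [dictionary[t] for t in document if t in dictionary]

def count_matrix(documents, dictionary):
    """m x n matrix: entry [i][j] counts term i in document j."""
    m, n = len(dictionary), len(documents)
    tf = [[0] * n for _ in range(m)]
    for j, doc in enumerate(documents):
        for i in to_indices(doc, dictionary):
            tf[i][j] += 1
    return tf
```

The count matrix holds the raw term frequencies; applying the TF-IDF weighting to each entry yields the weight matrix W.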
3.4.2.5. Document clustering
After searching, extracting the data, preprocessing, and representing the text, we use clustering techniques to cluster the documents.
INPUT: A set of n documents and k clusters.
OUTPUT: Clusters C_i (i = 1, ..., k) such that the criterion function attains its minimum value.
BEGIN
Step 1. Randomly initialize k vectors as the centroids of the k clusters.
Step 2. For each document d_j, determine its similarity to the centroid of each cluster using one of the common similarity measures (such as Dice, Jaccard, Cosine, Overlap, Euclidean, Manhattan). Determine the most similar centroid for each document and put the document into that cluster.
Step 3. Update the centroids. For each cluster, recompute the centroid as the mean of the document vectors in that cluster.
Step 4. Repeat steps 2 and 3 until the centroids no longer change.
END.
Figure 3.9. The k-means algorithm for clustering Web document content
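Step 2 of the algorithm leaves the choice of similarity measure open. The sketch below shows the four set-style measures in their commonly used vector-space forms; these generalized definitions are an assumption, since the text only names the measures. For 0/1 vectors they reduce to the familiar set formulas.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    denom = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    return dot(x, y) / denom if denom else 0.0

def dice(x, y):
    denom = dot(x, x) + dot(y, y)
    return 2 * dot(x, y) / denom if denom else 0.0

def jaccard(x, y):
    denom = dot(x, x) + dot(y, y) - dot(x, y)
    return dot(x, y) / denom if denom else 0.0

def overlap(x, y):
    denom = min(dot(x, x), dot(y, y))
    return dot(x, y) / denom if denom else 0.0

def most_similar_centroid(doc, centroids, sim=cosine):
    """Index of the centroid most similar to the document vector."""
    return max(range(len(centroids)), key=lambda j: sim(doc, centroids[j]))
```

Passing a different function as `sim` changes the assignment rule without touching the rest of the clustering loop.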
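- The complexity of the k-means algorithm is O((n·k·d)·r), where n is the number of data objects, k is the number of clusters, d is the number of dimensions, and r is the number of iterations. After the documents have been clustered, the result returned consists of the data clusters and their corresponding centroids.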
[Table body not recoverable; surviving values: 10, 15, 10, 15]
Table 3.2. Measured running time of the clustering algorithm
We see that the running time of the algorithm depends on the size of the data and on the number of clusters. In addition, k-means depends on the k initial centroids. If the k centroids are chosen well, both the quality and the running time improve considerably. The program interface and some typical code excerpts are presented in the appendix.
REFERENCES

Vietnamese-language references
[1] Cao Chính Nghĩa, Some issues in data clustering (in Vietnamese), Master's thesis, College of Technology, Vietnam National University, Hanoi, 2006.
[2] Hoàng Hải Xanh, On data clustering techniques in data mining (in Vietnamese), Master's thesis, Vietnam National University, Hanoi, 2005.
[3] Hoàng Thị Mai, Data mining by data clustering methods (in Vietnamese), Master's thesis, Hanoi National University of Education, 2006.

English-language references
[4] Athena Vakali, Web data clustering: Current research status & trends, Aristotle University, Greece, 2004.
[5] Bing Liu, Web mining, Springer, 2007.
[6] Brij M. Masand, Myra Spiliopoulou, Jaideep Srivastava, Osmar R. Zaiane, Web Mining for Usage Patterns & Profiles, ACM, 2002.
[7] Filippo Geraci, Marco Pellegrini, Paolo Pisati, and Fabrizio Sebastiani, A scalable algorithm for high-quality clustering of Web Snippets, Italy, ACM, 2006.
[8] Giordano Adami, Paolo Avesani, Diego Sona, Clustering Documents in a Web Directory, ACM, 2003.
[9] Hiroyuki Kawano, Applications of Web mining: from Web search engine to P2P filtering, IEEE, 2004.
[10] Ho Tu Bao, Knowledge Discovery and Data Mining, 2000.
[11] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, Jinwen Ma, Learning to Cluster Web Search Results, ACM, 2004.
[12] Jitian Xiao, Yanchun Zhang, Xiaohua Jia, Tianzhu Li, Measuring Similarity of Interests for Clustering Web-Users, IEEE, 2001.
[13] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, University of Illinois at Urbana-Champaign, 1999.
[14] Khoo Khyou Bun, Topic Trend Detection and Mining in World Wide Web, PhD thesis, Japan, 2004.
[15] Liu Jian-guo, Huang Zheng-hong, Wu Wei-ping, Web Mining for Electronic Business Application, IEEE, 2003.
[16] Lizhen Liu, Junjie Chen, Hantao Song, The research of Web Mining, IEEE, 2002.
[17] Maria Rigou, Spiros Sirmakessis, and Giannis Tzimas, A Method for Personalized Clustering in Data Intensive Web Applications, 2006.
[18] Miguel Gomes da Costa Júnior, Zhiguo Gong, Web Structure Mining: An Introduction, IEEE, 2005.
[19] Oren Zamir and Oren Etzioni, Web document Clustering: A Feasibility Demonstration, University of Washington, USA, ACM, 1998.
[20] Pawan Lingras, Rough Set Clustering for Web mining, IEEE, 2002.
[21] Periklis Andritsos, Data Clustering Techniques, University of Toronto, 2002.
[22] R. Cooley, B. Mobasher, and J. Srivastava, Web mining: Information and Pattern Discovery on the World Wide Web, University of Minnesota, USA, 1998.
[23] Raghu Krishnapuram, Anupam Joshi, and Liyu Yi, A Fuzzy Relative of the K-Medoids Algorithm with Application to Web Document and Snippet Clustering, 2001.
[24] Raghu Krishnapuram, Anupam Joshi, Olfa Nasraoui, and Liyu Yi, Low-Complexity Fuzzy Relational Clustering Algorithms for Web Mining, IEEE, 2001.
[25] Raymond Kosala and Hendrik Blockeel, Web Mining Research: A Survey, ACM, 2000.
[26] Rui Wu, Wansheng Tang, Ruiqing Zhao, An Efficient Algorithm for Fuzzy Web-Mining, IEEE, 2004.
[27] T.A. Runkler, J.C. Bezdek, Web mining with relational clustering, Elsevier, 2002.
[28] Tsau Young Lin, I-Jen Chiang, A simplicial complex, a hypergraph, structure in the latent semantic space of document clustering, Elsevier, 2005.
[29] Wang Jicheng, Huang Yuan, Wu Gangshan, and Zhang Fuyan, Web Mining: Knowledge Discovery on the Web, IEEE, 1999.
[30] Wang Bin, Liu Zhijing, Web Mining Research, IEEE, 2003.
[31] Wenyi Ni, A Survey of Web Document Clustering, Southern Methodist University, 2004.
[32] Yitong Wang, Masaru Kitsuregawa, Evaluating Contents-Link Coupled Web Page Clustering for Web Search Results, ACM, 2002.
[33] Zifeng Cui, Baowen Xu, Weifeng Zhang, Junling Xu, Web Documents Clustering with Interest Links, IEEE, 2005.