
Web Data Mining Using Clustering Techniques

http://www.ebook.edu.vn

TABLE OF CONTENTS

Chapter 1. OVERVIEW OF DATA MINING
1.1. Data mining and knowledge discovery
1.2. Clustering techniques in data mining
1.3. Web mining
1.4. Text data processing applied to Web data mining

Chapter 2. SOME DATA CLUSTERING TECHNIQUES
2.1. Partitioning clustering
2.2. Hierarchical clustering
2.3. Density-based clustering
2.4. Grid-based clustering
2.5. Model-based data clustering
2.6. Fuzzy data clustering

Chapter 3. WEB DATA MINING
3.1. Web content mining
3.2. Web usage mining
3.3. Web structure mining
3.4. Applying data clustering algorithms to searching and clustering Web documents

REFERENCES


CHAPTER 1. OVERVIEW OF DATA MINING

1.1. Data mining and knowledge discovery

1.1.1. Data mining
Data mining is a recently developed research field that aims to automatically extract new, useful, hidden information and knowledge from large databases for agencies, organizations, and enterprises, thereby boosting their production, business, and competitive capacity. Research results and successful applications in KDD show that data mining is a steadily developing field that brings many benefits, has strong prospects, and has clear advantages over traditional data search and analysis tools. In general terms, data mining can be defined as the process of searching for and discovering new, useful, hidden knowledge in large databases.

1.1.2. The knowledge discovery process

The knowledge discovery process can be divided into five steps as follows [10]:
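[Figure: raw data -> selection -> selected data -> preprocessing -> preprocessed data -> transformation -> transformed data -> data mining -> patterns -> evaluation and representation -> knowledge]

Figure 1.1. The knowledge discovery process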

1.1.3. Data mining and related fields

Data mining is related to statistics, machine learning, databases, algorithms, parallel computing, knowledge acquisition from expert systems, and abstract data. A defining characteristic of a knowledge discovery system is that it relies on methods, algorithms, and techniques from these different fields to mine data.

1.1.4. Techniques applied in data mining

KDD is an interdisciplinary field that includes data organization, machine learning, artificial intelligence, and other sciences. From the machine learning point of view, data mining techniques include supervised learning, unsupervised learning, and semi-supervised learning. Classified by the kind of problem to be solved, data mining covers the following techniques: classification and prediction, association rules, time series analysis, clustering, and concept description and summarization.

1.1.5. Main functions of data mining

The two main goals of data mining are description and prediction. In KDD, description receives more attention than prediction, in contrast to machine learning and pattern recognition applications, where prediction is usually the main goal. The main tasks of KDD are: class and concept description, association analysis, classification and prediction, clustering, outlier analysis, and evolution analysis.

1.1.6. Applications of data mining

Data mining is a field of wide interest and application. Typical applications include data analysis and decision support, medical treatment, text mining, Web mining, bioinformatics, finance and the stock market, insurance, and so on. Today, database management systems and software packages integrate data mining modules, for example SQL Server, Oracle, and Office 2007.

1.2. Clustering techniques in data mining

1.2.1. Overview of clustering
Data clustering is a data mining technique that searches for and discovers natural, hidden, significant clusters and patterns in large data sets, thereby providing useful information and knowledge for decision making. Data clustering remains an open and difficult problem, because many basic issues must be resolved completely and in a way that suits many different kinds of data. Mixed data in particular keeps growing in data management systems; handling it is one of the major challenges for data mining in the coming decades, especially for Web data mining.

1.2.2. Applications of data clustering

Data clustering is applied in many fields, such as commerce and science. Clustering techniques have been used in typical applications in the following areas [10][19]: commerce, biology, spatial data analysis, urban planning, earth science research, geography, Web mining, and so on.

1.2.3. Requirements for data clustering techniques

Most research on and development of clustering algorithms aims to satisfy the following basic requirements [10][19]: scalability; the ability to handle different data types; discovery of clusters of arbitrary shape; minimal domain knowledge needed to determine input parameters; low sensitivity to the order of the input data; the ability to handle highly noisy data; low sensitivity to input parameters; the ability to handle high-dimensional data; and being easy to understand, easy to implement, and practical.

1.2.4. Data types and similarity measures

1.2.4.1. Classification of data types by domain size
Attributes can be divided into two types: continuous attributes and discrete attributes.

1.2.4.2. Classification of data types by measurement scale
Commonly used data types include nominal, ordinal, interval, and ratio attributes. Measurement units affect clustering results; to overcome this, the data must be normalized.

1.2.4.3. Similarity and dissimilarity measures
Once the characteristics of the data are known, a suitable way of determining the "similarity" between objects is sought. These are functions that measure the likeness between pairs of data objects, used to compute either the similarity or the dissimilarity between them. The larger the value of the similarity function, the more alike the objects are, and vice versa; the dissimilarity function is inversely related to the similarity function. Some measures applied to different data types [10][17][27]:

+ Interval attributes: The dissimilarity of two data objects x and y is determined by metrics such as the Minkowski, Euclidean, Manhattan, and maximum distances.

+ Binary attributes: Based on the following parameter table, where each cell counts the attributes with the given pair of values and $\sigma = \alpha + \beta + \gamma + \delta$:

          y: 1    y: 0
  x: 1    α       β
  x: 0    γ       δ

Table 1.1. Parameter table for binary attributes

Common measures for binary attribute data:
- Simple matching coefficient: $d(x, y) = \frac{\alpha + \delta}{\sigma}$.
- Jaccard coefficient: $d(x, y) = \frac{\alpha}{\alpha + \beta + \gamma}$.

+ Nominal attributes: The dissimilarity between two objects x and y is defined as $d(x, y) = \frac{p - m}{p}$, where m is the number of attributes on which x and y match and p is the total number of attributes.

+ Ordinal attributes: The dissimilarity between data objects with ordinal attributes is computed as follows. The $M_i$ states of attribute i are ordered as $[1 \ldots M_i]$, so each attribute value can be replaced by its rank $r_i$, with $r_i \in \{1, \ldots, M_i\}$. Because each ordinal attribute can have a different value domain, the ranks are mapped onto the common domain [0, 1] by the following transformation for each attribute:

$z_i^{(j)} = \frac{r_i^{(j)} - 1}{M_i - 1}$

The dissimilarity formula for interval attributes is then applied to the values $z_i^{(j)}$; this is the dissimilarity of the ordinal attribute.

+ Ratio attributes: There are several ways to compute the similarity between ratio attributes. One option is to apply a logarithm to each attribute $x_i$.

Different similarity models are used depending on the specific data. Determining a suitable, accurate, and objective similarity measure is very important, and it contributes to building clustering algorithms that are effective in both clustering quality and computational cost.
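The measures above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are made up for this example, and the binary counts follow the usual contingency-table convention (alpha: both 1; beta: x=1, y=0; gamma: x=0, y=1; delta: both 0), which is an assumption where the extracted table is unreadable.

```python
def minkowski(x, y, q=2.0):
    """Minkowski distance of order q (q=1: Manhattan, q=2: Euclidean)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def maximum_distance(x, y):
    """Maximum (Chebyshev) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def binary_counts(x, y):
    """Return (alpha, beta, gamma, delta) for two 0/1 vectors."""
    alpha = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    beta = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    gamma = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    delta = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return alpha, beta, gamma, delta


def simple_matching(x, y):
    """Simple matching coefficient: (alpha + delta) / (alpha + beta + gamma + delta)."""
    alpha, beta, gamma, delta = binary_counts(x, y)
    return (alpha + delta) / (alpha + beta + gamma + delta)


def jaccard(x, y):
    """Jaccard coefficient: alpha / (alpha + beta + gamma); 0.0 if undefined."""
    alpha, beta, gamma, _ = binary_counts(x, y)
    denom = alpha + beta + gamma
    return alpha / denom if denom else 0.0


def nominal_dissimilarity(x, y):
    """(p - m) / p, with m matching attributes out of p."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p
```

For example, `minkowski((0, 0), (3, 4))` gives the Euclidean distance between the two points, and passing `q=1` gives the Manhattan distance.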

1.3. Web mining

1.3.1. Benefits of Web mining
With the rapid growth of information on the WWW, Web data mining is steadily becoming more important within data mining. People hope to obtain useful knowledge by searching, analyzing, synthesizing, and mining the Web. Such knowledge can help build effective Web sites that serve people better, especially in e-commerce. Discovering and analyzing useful information on the WWW with data mining techniques has become an important direction in knowledge discovery. How can the information a user needs be found? How can high-quality Web pages be obtained? These problems can be addressed more effectively by studying data mining techniques applied to the Web environment.

1.3.2. Web mining

There are many different definitions of Web mining, but they can be generalized as follows [5][30]: Web mining is the use of data mining techniques to automate the discovery and extraction of useful information from Web documents, services, and structure. The field has attracted the interest of many researchers. The Web mining process can be divided into the following subtasks: resource finding, data selection and preprocessing, generalization, and analysis.

1.3.3. Types of Web data

Web data types can be summarized by the following diagram:

[Figure: Web data divides into content data (free text, HTML files, XML files, dynamic content, multimedia), structure data (static links, dynamic links), usage data, and user profile data]

Figure 1.2. Classification of Web data

1.4. Text data processing applied to Web data mining

1.4.1. Text data
Text databases can be divided into two main types [14][20]:
+ Unstructured: ordinary text documents that we read every day in books, in newspapers, on the Internet, and so on.
+ Semi-structured: documents organized in a loose structure that still expresses the main content of the text, such as HTML documents and e-mail.

1.4.2. Some issues in text data processing

Some issues related to representing text with the vector space model:
- The vector space is a set of words. A word is a string of characters excluding spaces, newline characters, and punctuation; upper and lower case are not distinguished.
- Stemming: In many languages, many words share a root or are variants of a root word. Using the root reduces the number of words.

In addition, to improve processing quality, some studies have proposed algorithmic improvements that take the context of words into account by using phrases and grammar rather than only individual words [31].

1.4.2.1. Stop word removal
Natural language contains many words that only express sentence structure and carry no content, such as prepositions and conjunctions. These words occur frequently in documents but are unrelated to the topic or content of the document. They can therefore be removed to reduce the dimensionality of the document vector; such words are called stop words.

1.4.2.2. Zipf's law
To reduce the dimensionality of the document vector further, we rely on the following observation: many words occur only a few times in a document collection, and if the goal is to determine similarities and differences across the whole collection, words with low frequency have very little influence. Zipf's law is stated as the formula $r_t \cdot f_t \approx K$, where $r_t$ is the rank of word t when words are sorted by decreasing frequency, $f_t$ is its frequency, and K is a constant. The law can be rewritten as $r_t \approx K / f_t$. In general, for a word that occurs only once in the collection, $r_{max} = K$. For the distinct words that occur b times in the collection, $r_b \approx K / b$; dividing the two relations gives $r_b / r_{max} \approx 1 / b$. Zipf's law thus shows a notable property of the distribution of distinct words in a collection: it is dominated by the words that occur least often.
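A minimal sketch of the preprocessing described above: tokenization that ignores case and punctuation, stop word removal, and a rank-frequency table for inspecting Zipf's law (rank times frequency should be roughly constant on a large corpus). The stop word list here is a tiny made-up sample, not a real lexicon.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop words that only carry sentence structure."""
    return [t for t in tokens if t not in stop_words]


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]
```

On a realistic corpus, the last column of `rank_frequency` gives a quick way to see how closely the collection follows $r_t \cdot f_t \approx K$.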

1.4.3. Text data representation models

The best representation uses the distinct words extracted from the original document, and this choice of representation has a relatively small effect on the results.

1.4.3.1. Boolean model
This is a vector representation model in which the function f takes only two discrete values, true or false. The function f for term $t_i$ returns true if and only if $t_i$ appears in the document.

1.4.3.2. Frequency models

1.4.3.2.1. Term frequency model
In the term frequency (TF) model, word values are computed from the number of times the word occurs in the document. Let $tf_{ij}$ be the number of occurrences of word $t_i$ in document $d_j$; then $w_{ij}$ can be computed by one of several formulas [31], for example:

$w_{ij} = 1 + \log(tf_{ij})$

The more often term $t_i$ occurs in document $d_j$, the more $d_j$ depends on $t_i$; in other words, $t_i$ carries more information in $d_j$.

1.4.3.2.2. Inverse document frequency model
In the inverse document frequency (IDF) model, the weight of a word is computed by the following formula [31]:

$w_{ij} = \begin{cases} \log\left(\frac{n}{h_i}\right) = \log(n) - \log(h_i) & \text{if } t_i \in d_j \\ 0 & \text{otherwise } (t_i \notin d_j) \end{cases}$

Here n is the number of documents and $h_i$ is the number of documents that contain $t_i$. The fewer documents $t_i$ appears in, the more important it is, so its weight is larger when it does appear in $d_j$.

1.4.3.2.3. Combined TF-IDF model
In the TF-IDF model [31], each document $d_j$ is represented by a feature $(t_1, t_2, \ldots, t_n)$, where $t_i$ is a word or phrase in $d_j$. The order of the $t_i$ is based on the weight of each word. Parameters can be added to optimize the grouping process. The TF-IDF weight is:

$w_{ij} = \begin{cases} tf_{ij} \times idf_{ij} = [1 + \log(tf_{ij})] \log\left(\frac{n}{h_i}\right) & \text{if } t_i \in d_j \\ 0 & \text{otherwise } (t_i \notin d_j) \end{cases}$

Typically a dictionary is built to remove very common words and words with low frequency. The weight $w_{ij}$ is computed from the frequency of term $t_i$ in document $d_j$ and the rarity of $t_i$ across the whole database. Hierarchical clustering and partitioning clustering (k-means) are the two techniques commonly used for document clustering with the TF-IDF model.
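As a minimal sketch, the TF-IDF weight $[1 + \log(tf_{ij})] \log(n / h_i)$ can be computed as follows, where n is the number of documents and $h_i$ the number of documents containing the term. The function name and the dictionary-based output format are choices made for this example; natural logarithms are assumed.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Terms absent from a document are simply missing from its dict (weight 0).
    """
    n = len(docs)
    # h[t] = number of documents that contain term t
    h = Counter()
    for doc in docs:
        h.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (1 + math.log(count)) * math.log(n / h[term])
            for term, count in tf.items()
        })
    return weights
```

A term that appears in every document gets weight 0, since $\log(n / h_i) = \log(1) = 0$; this is how the model discounts very common words.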

Chapter 2. SOME DATA CLUSTERING TECHNIQUES

Data clustering techniques can be classified into several basic types according to their approach, as follows [10][19]:

2.1. Partitioning clustering

The main idea of this technique is to partition a given data set of n elements into k groups such that each data element belongs to exactly one group and each group contains at least one element. Locally optimal partitioning algorithms use a greedy strategy to search for a solution. The following are some classic algorithms that have been widely adopted and extended.

2.1.1. The k-means algorithm

The goal of k-means is to produce k clusters $\{C_1, C_2, \ldots, C_k\}$ from an initial data set of n objects in d-dimensional space, $X_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ for $i = 1, \ldots, n$, such that the criterion function

$E = \sum_{i=1}^{k} \sum_{x \in C_i} D^2(x, m_i)$

is minimized, where $m_i$ is the centroid of cluster $C_i$ and D is the distance between two objects. The k-means algorithm consists of the following basic steps:

INPUT: A database of n objects and the number of clusters k.
OUTPUT: Clusters $C_i$ (i = 1, ..., k) such that the criterion function E is minimized.
Step 1: Initialization. Select k objects $m_j$ (j = 1, ..., k) from the data set as the initial centroids of the k clusters (the selection can be random or based on experience).
Step 2: Distance computation. For each object $X_i$ ($1 \le i \le n$), compute its distance to each centroid $m_j$, j = 1, ..., k, then find the nearest centroid for each object.
Step 3: Centroid update. For each j = 1, ..., k, update the cluster centroid $m_j$ as the mean of the data object vectors assigned to it.
Step 4: Stopping condition. Repeat steps 2 and 3 until the cluster centroids no longer change.
Figure 2.1. The k-means algorithm

The computational complexity is $O(\tau \cdot n \cdot k \cdot d)$, where d is the number of dimensions and $\tau$ is the number of iterations. k-means is very sensitive to noise and outliers in the data. Clustering quality depends heavily on the input parameters, namely the number of clusters k and the k initial centroids. Many algorithms build on the idea of k-means to handle large data sets in data mining, such as k-medoids, PAM, CLARA, CLARANS, and so on.
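The four k-means steps can be sketched as below. This is a minimal illustration, assuming Euclidean distance and random initial centroids; the `seed` and `max_iter` parameters are additions for reproducibility and safety, not part of the algorithm description.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct objects as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Because the result depends on the initial centroids, a common practice is to run the function with several seeds and keep the run with the lowest value of the criterion function E.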

2.1.2. The PAM algorithm

PAM is an extension of k-means designed to handle noisy data and outliers effectively. Instead of centroids, PAM uses medoid objects to represent clusters; a medoid is the object located most centrally within its cluster. To determine the medoids, PAM starts by selecting k arbitrary medoid objects. After each step, PAM tries to swap a medoid $O_m$ with a non-medoid object $O_p$, provided that the swap improves clustering quality; the process ends when clustering quality no longer changes. To decide whether to swap $O_m$ and $O_p$, PAM uses the swap cost $C_{jmp}$ as its basis, where:
- $O_m$: the current medoid to be replaced;
- $O_p$: the new medoid that replaces $O_m$;
- $O_j$: a (non-medoid) data object that may move to another cluster;
- $O_{m,2}$: the current medoid other than $O_m$ that is closest to $O_j$.

PAM computes the swap value $C_{jmp}$ for every object $O_j$; it serves as the basis for swapping $O_m$ and $O_p$. $C_{jmp}$ is computed in one of four ways, depending on the case:

- Case 1: $O_j$ currently belongs to the cluster represented by $O_m$, and $O_j$ is more similar to $O_{m,2}$ than to $O_p$ ($d(O_j, O_p) \ge d(O_j, O_{m,2})$). Here $O_{m,2}$ is the medoid second most similar to $O_j$ among the medoids. If $O_m$ is replaced by the new medoid $O_p$, $O_j$ will belong to the cluster represented by $O_{m,2}$. $C_{jmp} = d(O_j, O_{m,2}) - d(O_j, O_m)$ is non-negative.
- Case 2: $O_j$ currently belongs to the cluster represented by $O_m$, but $O_j$ is less similar to $O_{m,2}$ than to $O_p$. If $O_m$ is replaced by $O_p$, $O_j$ will belong to the cluster represented by $O_p$. $C_{jmp} = d(O_j, O_p) - d(O_j, O_m)$ can be negative or positive.
- Case 3: $O_j$ currently belongs not to the cluster represented by $O_m$ but to the one represented by $O_{m,2}$, and $O_j$ is more similar to $O_{m,2}$ than to $O_p$. If $O_m$ is replaced by $O_p$, $O_j$ stays in the cluster represented by $O_{m,2}$. Therefore $C_{jmp} = 0$.
- Case 4: $O_j$ currently belongs to the cluster represented by $O_{m,2}$, but $O_j$ is less similar to $O_{m,2}$ than to $O_p$. If $O_m$ is replaced by $O_p$, $O_j$ moves from the cluster of $O_{m,2}$ to the cluster of $O_p$. $C_{jmp} = d(O_j, O_p) - d(O_j, O_{m,2})$ is always negative.

Combining the four cases, the total cost of swapping $O_m$ with $O_p$ is:

$TC_{mp} = \sum_j C_{jmp}$

The PAM algorithm consists of the following main steps:

INPUT: A data set of n elements and the number of clusters k.
OUTPUT: k clusters such that the quality of the partition is the best.
Step 1: Select k arbitrary medoid objects.
Step 2: Compute $TC_{mp}$ for all pairs of objects $O_m$, $O_p$, where $O_m$ is a medoid and $O_p$ is a non-medoid object.
Step 3: For each pair $O_m$ and $O_p$, compute $minO_m$, $minO_p$, and $TC_{mp}$. If $TC_{mp}$ is negative, replace $O_m$ by $O_p$ and return to Step 2. If $TC_{mp}$ is positive, go to Step 4.
Step 4: For each non-medoid object, determine the medoid most similar to it and assign the cluster label accordingly.
Figure 2.2. The PAM algorithm

The computational complexity of PAM is $O(I k (n-k)^2)$, where I is the number of iterations. PAM is therefore inefficient in computation time when k and n are large.
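A compact sketch of the swap search is shown below. It is a simplification: instead of evaluating the four cases object by object, it computes the total swap cost as the difference between the clustering cost after and before the swap, which aggregates the same per-object contributions. It also accepts the first improving swap rather than the best one, and it takes the first k objects as the arbitrary initial medoids; these are choices made for this example.

```python
import math


def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid (by index)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Return (medoid indices, labels) using swap-based medoid search."""
    medoids = list(range(k))  # Step 1: arbitrary initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):                    # position of medoid O_m
            for p in range(len(points)):      # candidate non-medoid O_p
                if p in medoids:
                    continue
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                cost = total_cost(points, candidate)
                if cost < best:               # swap cost (cost - best) is negative
                    medoids, best, improved = candidate, cost, True
    # Step 4: label every object with its nearest medoid.
    labels = [min(range(k), key=lambda j: math.dist(pt, points[medoids[j]]))
              for pt in points]
    return medoids, labels
```

Each pass over all (medoid, non-medoid) pairs recomputes the full cost, so this sketch is slower than a careful implementation, but it makes the role of the negative swap cost explicit.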

2.1.3. The CLARA algorithm

CLARA was proposed by Kaufman and Rousseeuw in 1990 to overcome the weakness of PAM when k and n are large. CLARA draws a sample from the data set of n elements, applies PAM to the sample, and finds the medoids of the sample. It has been observed that if the sample is drawn randomly, its medoids approximate the medoids of the whole data set. To reach a better approximation, CLARA draws several samples, clusters each of them, and selects the best clustering result. Experiments show that 5 samples of size 40 + 2k give good results. The CLARA algorithm is as follows:

INPUT: A database of n objects and the number of clusters k.
OUTPUT: k clusters.
1. For i = 1 to 5 do
Begin
2. Draw a random sample of 40 + 2k data objects from the data set and apply PAM to this sample to find the medoids representing the clusters.
3. For each object $O_j$ in the original data set, determine the most similar medoid among the k medoids.
4. Compute the average dissimilarity of the partition obtained in the previous step. If this value is smaller than the current minimum, use it as the new minimum; the set of k medoids found in this step is then the best so far.
End;
Figure 2.3. The CLARA algorithm

The computational complexity of the algorithm is $O(k(40+k)^2 + k(n-k))$.

2.1.4. The CLARANS algorithm

The basic idea of CLARANS is not to examine every possible replacement of a medoid by another object. Instead, it replaces a medoid immediately whenever the replacement improves clustering quality, without determining the optimal replacement. The number of neighbors examined is limited by the user-supplied parameter Maxneighbor, so not all neighbors are visited. The parameter Numlocal lets the user set the number of local optima to search for. The CLARANS algorithm can be described as follows [10][19]:

INPUT: A data set of n objects, the number of clusters k, O, dist, numlocal, maxneighbor;
OUTPUT: k clusters;
For i = 1 to numlocal do
Begin
  Randomly initialize k medoids (the current node S);
  j = 1;
  while j < maxneighbor do
  Begin
    Randomly select a neighbor R of S.
    Compute the dissimilarity (cost) of the two neighbors S and R.
    If R has a lower cost, replace S with R and set j = 1; otherwise j++;
  End;
  Check whether the cost of partition S is smaller than the smallest cost found so far. If so, update the smallest cost with this value; partition S is then the best partition so far.
End.
Figure 2.4. The CLARANS algorithm

The computational complexity of CLARANS is $O(kn^2)$. The advantage of CLARANS is that its search space is not restricted as it is in CLARA, and in the same amount of time the quality of the resulting clusters is higher than with CLARA.
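The randomized search can be sketched as follows. This is an illustrative sketch under stated assumptions: a "neighbor" of the current medoid set is taken to be a set that differs in exactly one medoid, the cost is the sum of distances to the nearest medoid, and the default parameter values and `seed` argument are arbitrary choices for the example.

```python
import math
import random


def clustering_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid (by index)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def clarans(points, k, numlocal=3, maxneighbor=20, seed=0):
    """Return (sorted medoid indices, cost) of the best local optimum found."""
    rng = random.Random(seed)
    n = len(points)
    best, best_cost = None, math.inf
    for _ in range(numlocal):
        current = rng.sample(range(n), k)          # random initial node S
        current_cost = clustering_cost(points, current)
        j = 1
        while j < maxneighbor:
            # Random neighbor R: replace one medoid with one non-medoid.
            neighbor = list(current)
            candidates = [p for p in range(n) if p not in current]
            neighbor[rng.randrange(k)] = rng.choice(candidates)
            neighbor_cost = clustering_cost(points, neighbor)
            if neighbor_cost < current_cost:
                current, current_cost, j = neighbor, neighbor_cost, 1
            else:
                j += 1
        if current_cost < best_cost:               # keep the best local optimum
            best, best_cost = current, current_cost
    return sorted(best), best_cost
```

Raising `maxneighbor` makes each local search more thorough, while raising `numlocal` restarts the search from more random starting points; both trade running time for quality.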

2.2. Hierarchical clustering

Hierarchical clustering arranges a given data set into a tree structure; this hierarchy tree is built recursively. The cluster tree can be built using two general approaches: top-down and bottom-up. Typical hierarchical clustering algorithms include CURE, BIRCH, Chameleon, AGNES, DIANA, and so on.

2.2.1. The BIRCH algorithm

The idea of the algorithm is to avoid storing all data objects of the clusters in memory and to store only summary statistics. For each cluster, BIRCH stores only a triple (n, LS, SS), where n is the number of objects in the cluster, LS is the sum of the attribute values of the objects in the cluster, and SS is the sum of squares of the attribute values of the objects in the cluster. These triples are called clustering features, CF = (n, LS, SS), and are kept in a tree called the CF tree. The CF tree is characterized by two parameters: the branching factor (B) and the threshold (T). BIRCH proceeds through the following phases:

INPUT: A database of n objects and the threshold T.
OUTPUT: k clusters.
Step 1: Scan all objects in the database and build an initial CF tree. Each object is inserted into the nearest leaf node to form a subcluster. If the diameter of this subcluster is larger than T, the leaf node is split. When an object is inserted into a leaf node, all nodes on the path to the root are updated with the necessary information.
Step 2: If the current CF tree does not fit in main memory, build a smaller CF tree by adjusting the parameter T (increasing T merges some subclusters into one, which makes the CF tree smaller). This step does not require re-reading the data from the start, yet it still produces a smaller tree.
Step 3: Perform clustering. The leaf nodes of the CF tree store the summary statistics of the subclusters. In this step, BIRCH uses these statistics to apply a clustering technique such as k-means and produce an initial clustering.
Step 4: Redistribute the data objects using the centroids of the clusters found in Step 3. This optional step rescans the data set and relabels each data object with its nearest centroid. It aims to label the initial data and to remove outliers.
Figure 2.5. The BIRCH algorithm

Using the CF tree makes BIRCH fast and applicable to large data sets; BIRCH is especially effective for data sets that grow over time. Its complexity is O(n). Its drawbacks are that the quality of the discovered clusters is not very good and that it is not suitable for high-dimensional data.
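The key property of the triple (n, LS, SS) is that it is additive: the feature of a merged cluster is the component-wise sum of the features being merged, and the centroid and radius can be recovered without the original objects. The sketch below illustrates this. It assumes SS is stored as a single scalar (the sum of squared norms); storing it per dimension is an equally valid convention. The class and method names are invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    n: int        # number of objects
    ls: list      # linear sum of the objects, per dimension
    ss: float     # sum of squared norms of the objects (scalar, by assumption)

    @classmethod
    def from_point(cls, p):
        return cls(1, list(p), sum(x * x for x in p))

    def merge(self, other):
        """CF of the union of two disjoint subclusters."""
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        """Root mean squared distance of the members from the centroid."""
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))
```

For instance, merging `CF.from_point((0, 0))` with `CF.from_point((2, 0))` yields a feature whose centroid is (1, 0) and whose radius is 1, which is what a threshold test against T would inspect before deciding to split a leaf.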

2.2.2. The CURE algorithm

CURE is an algorithm that uses the bottom-up strategy. Instead of using centroids or medoid objects to represent a cluster, CURE uses several objects to describe each cluster. These representative objects are initially chosen to be well scattered at different positions, and are then moved by shrinking them by a fixed ratio. At each step of the algorithm, the two clusters with the closest pair of representative objects are merged into one cluster. CURE is able to handle outliers. To work with large databases, CURE uses random sampling and partitioning. The CURE algorithm proceeds through the following basic steps:

Step 1. Select a random sample from the initial data set.
Step 2. Partition the sample into groups of equal size: the main idea is to partition the sample into p equal groups, each of size n'/p (where n' is the sample size).
Step 3. Cluster the points of each group: clustering is performed on each group until it is divided into n'/(pq) clusters (with q > 1).
Step 4. Remove outliers: first, clusters are formed until the number of clusters drops to a fraction of the initial number. Then, in case outliers were drawn during the initial sampling phase, the algorithm automatically removes the small groups.
Step 5. Cluster the partial clusters: the representative objects of the clusters are moved toward the cluster center, that is, they are replaced by objects closer to the center.
Step 6. Label the data with the corresponding cluster labels.
Figure 2.6. The CURE algorithm

The complexity of the algorithm is $O(n^2 \log n)$. CURE is reliable at discovering clusters of arbitrary shape and works well on two-dimensional data sets. However, it is very sensitive to parameters such as the number of representative objects and the shrinking factor of the representatives. In general, BIRCH is better than CURE in complexity but worse in clustering quality.

2.3. Density-based clustering

This method groups objects according to a given density function. Density is defined as the number of objects in the neighborhood of a data object within some threshold. In this approach, once a cluster has been identified, it keeps growing by adding new data objects as long as the number of neighbors of these objects exceeds a predefined threshold. Density-based clustering uses the density of the objects to determine the clusters and can discover clusters of arbitrary shape. Clusters can be viewed as high-density regions separated by regions of zero or low density. Density here means the number of neighboring objects. Typical density-based clustering algorithms include [2][3][13][20]: DBSCAN, OPTICS, DENCLUE, and so on.

2.3.1 The DBSCAN algorithm

The algorithm looks for objects whose number of neighbors exceeds a minimum threshold. A cluster is defined as the set of all objects that are density-connected to their neighbors. The steps of DBSCAN are as follows:

Step 1: Select an arbitrary object p.
Step 2: Retrieve all objects that are density-reachable from p with respect to Eps and MinPts.
Step 3: If p is a core point, form a cluster with respect to Eps and MinPts.
Step 4: If p is a border point, no point is density-reachable from p, and DBSCAN visits the next point of the data set.
Step 5: The process continues until all objects have been processed.
Figure 2.7. The DBSCAN algorithm

If global values of Eps and MinPts are used, DBSCAN may merge two clusters into one when their densities are nearly equal. The average computational complexity is O(n log n), assuming a spatial index that answers each neighborhood query in O(log n).
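A minimal sketch of DBSCAN is given below. It assumes Euclidean distance and a brute-force neighborhood query (O(n) per query, so O(n^2) overall, not the indexed O(n log n) variant). The label -1 for noise and the function names are conventions chosen for this example.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return a list of cluster labels (0, 1, ...) with NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also core: keep expanding
    return labels
```

With `eps=1.5` and `min_pts=3`, two tight groups of three points each come out as clusters 0 and 1, and an isolated point is labeled as noise.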

2.3.2. The OPTICS algorithm

OPTICS extends DBSCAN by reducing the number of input parameters. It computes an ordering of the objects, in increasing order, that supports automatic and interactive cluster analysis rather than producing one explicit clustering of the data set. This ordering describes the density-based clustering structure of the data and contains information equivalent to density-based clusterings over a range of input parameter settings. OPTICS considers the minimum radius needed to determine suitable neighbors for the algorithm. DBSCAN and OPTICS are similar in structure and have the same complexity: O(n log n).

2.3.3. The DENCLUE algorithm

DENCLUE is a clustering algorithm based on a set of density distribution functions. The main ideas of the algorithm are as follows [19]:
- The influence of an object on its neighborhood is described by an influence function.
- The global density of the data space is modeled analytically as the sum of the influence functions of all objects.
- Clusters are determined by high-density objects, where the high-density objects are the local maxima of the global density function.

DENCLUE depends heavily on the noise threshold and the density parameter. The computational complexity of DENCLUE is O(n log n). Density-based algorithms do not apply sampling to the data set as partitioning algorithms do, because sampling can increase complexity owing to the difference between the density of the objects in the sample and the density of the whole data set.
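The first two ideas can be illustrated with a short sketch. The text does not fix a particular influence function; the Gaussian kernel used below is one common choice and is an assumption of this example, as is the parameter name `sigma` for the density parameter.

```python
import math


def gaussian_influence(x, y, sigma):
    """Influence of object y at location x (Gaussian kernel, by assumption)."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))


def density(x, points, sigma):
    """Global density at x: the sum of the influences of all objects."""
    return sum(gaussian_influence(x, p, sigma) for p in points)
```

Evaluating `density` on a grid, or following its gradient uphill from each object, locates the local maxima that serve as cluster centers in this family of methods.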

2.4. Grid-based clustering

This method clusters data using a grid data structure and is mainly applied to spatial data. The grid-based approach does not move objects between cells; instead it builds several hierarchical levels of groups of objects within a cell. Clusters are not based on a distance measure but are determined by a predefined parameter. The advantage of grid-based clustering is its fast processing time, which is independent of the number of data objects in the original data set and depends instead on the number of cells in each dimension of the grid space. Typical grid-based clustering algorithms include [13][20]: STING, WaveCluster, CLIQUE, and so on.

2.4.1 The STING algorithm

STING was proposed by Wang, Yang, and Muntz in 1997. It decomposes the spatial data set into a finite number of cells using a hierarchical rectangular structure. The grid has several levels of cells, which form a hierarchy: each cell at a higher level is partitioned into cells at the next lower level. The statistical parameters for the attributes of the data objects in a cell are computed and stored from the statistical parameters of the lower-level cells. These parameters include the count, the mean, the maximum, the minimum, the standard deviation s, and so on. Data objects are inserted into the grid one by one, and the statistical parameters above are computed directly from these data objects. Spatial queries are answered by examining the appropriate cells at each level of the hierarchy. A spatial query is defined as a retrieval of information about spatial data and their relationships.

STING is highly scalable, but because it uses a multi-resolution approach it depends closely on the granularity of the lowest level. Multi-resolution is the ability to decompose the data set into different levels of detail. When cells of the grid are merged to form clusters, the child-level nodes are not merged appropriately, and the shapes of the discovered clusters have horizontal and vertical boundaries that follow the cell borders. The grid structure allows parallel processing. The complexity of computing the statistics for the cells is O(n). Once the hierarchical structure has been built, the query processing time is O(g), where g is the total number of cells at the lowest level (g << n).
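The way a higher-level cell derives its statistics from its child cells, without revisiting the data objects, can be sketched as below. This covers a single numeric attribute; the class and function names are invented for this example, and the population standard deviation is assumed.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    count: int
    mean: float
    std: float
    minimum: float
    maximum: float


def leaf_stats(values):
    """Statistics of a lowest-level cell, computed directly from its objects."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, math.sqrt(var), min(values), max(values))


def parent_stats(children):
    """Statistics of a higher-level cell, derived only from its child cells."""
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    # Each child's mean of squares is std^2 + mean^2; combine, then subtract mean^2.
    second_moment = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children) / n
    std = math.sqrt(max(second_moment - mean ** 2, 0.0))
    return CellStats(n, mean, std,
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))
```

Because `parent_stats` needs only the child summaries, a query can be answered level by level from the top of the hierarchy, which is what makes the query time depend on the number of cells rather than on the number of objects.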

2.4.2 The CLIQUE algorithm

CLIQUE was proposed by Agrawal, Gehrke, Gunopulos, and Raghavan in 1998. It automatically clusters subspaces of high-dimensional data, which allows better clustering than in the original space. The main steps of the algorithm are as follows:

Step 1: Partition the data set into rectangular units and find the dense units (that is, units that contain at least a given number of neighboring data objects).
Step 2: Identify the subspaces that contain clusters, using the Apriori principle.
Step 3: Merge these units to form the clusters.
Step 4: Determine the clusters: first find the dense one-dimensional cells, then the dense two-dimensional rectangles, then three-dimensional ones, and so on, until the dense k-dimensional rectangular units are found.
Figure 2.8. The CLIQUE algorithm

CLIQUE works well on high-dimensional data, but it is very sensitive to the order of the input data. The computational complexity of CLIQUE is O(n).

2.5. Model-based data clustering

This method tries to find good approximations of the model parameters so that the model fits the data as well as possible. Such methods can use either a partitioning or a hierarchical clustering strategy, depending on the structure or model they assume for the data set and on how they refine the model to identify the partitions. Typical algorithms include EM, COBWEB, and so on.

2.5.1. The EM algorithm

This algorithm finds maximum likelihood estimates of the parameters of a probabilistic model. It can be viewed as a model-based algorithm or as an extension of k-means. The algorithm has two processing steps: estimating the unlabeled data (the E step) and estimating the model parameters with maximum likelihood (the M step). Specifically, at iteration t the EM algorithm performs the following:

1) E step: Compute the values of the indicator variables based on the current model and the data:

$z_{ij}^{(t)} = E(z_{ij} \mid x) = \Pr(z_{ij} = 1 \mid x) = \frac{\pi_j^{(t)} f_j(x_i)}{\sum_{g=1}^{k} \pi_g^{(t)} f_g(x_i)}$

2) M step: Estimate the probabilities $\pi$:

$\pi_j^{(t+1)} = \frac{\sum_{i=1}^{n} z_{ij}^{(t)}}{n}$

EM can discover clusters of many different shapes. However, because the algorithm needs many iterations to find good parameters, its computational cost is quite high.
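A minimal sketch of the two steps for a one-dimensional mixture is given below. The responsibilities and the mixing-weight update follow the E and M steps described above; the choice of Gaussian component densities, the additional updates of the means and standard deviations, the small constant added to the standard deviation, and all function names are assumptions of this example.

```python
import math


def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def e_step(xs, weights, means, stds):
    """Responsibilities z[i][j] = Pr(object i belongs to component j)."""
    z = []
    for x in xs:
        num = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
        total = sum(num)
        z.append([v / total for v in num])
    return z


def m_step(xs, z):
    """Re-estimate mixing weights, means, and standard deviations."""
    n, k = len(xs), len(z[0])
    weights, means, stds = [], [], []
    for j in range(k):
        nj = sum(z[i][j] for i in range(n))
        mean = sum(z[i][j] * xs[i] for i in range(n)) / nj
        var = sum(z[i][j] * (xs[i] - mean) ** 2 for i in range(n)) / nj
        weights.append(nj / n)                 # mixing weight: sum of z over n
        means.append(mean)
        stds.append(math.sqrt(var) + 1e-6)     # small floor avoids a zero std
    return weights, means, stds


def em(xs, init_means, iters=20):
    k = len(init_means)
    weights, means, stds = [1.0 / k] * k, list(init_means), [1.0] * k
    for _ in range(iters):
        z = e_step(xs, weights, means, stds)
        weights, means, stds = m_step(xs, z)
    return weights, means, stds
```

A fixed iteration count is used here for simplicity; a practical implementation would usually stop when the log-likelihood stops improving.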

2.5.2. The COBWEB algorithm

The main steps of the algorithm are:
1) Initialize the tree with an empty node.
2) Add the nodes one at a time and update the tree accordingly each time.
3) Update the tree starting from the rightmost leaf in each case, then restructure the tree.
4) Make the update decision based on the partition and the category criterion functions. At each node, the algorithm considers four possible operations (Insert, Create, Merge, Split) and chooses the one that yields the best value of the category utility (CU) function.

Some limitations of COBWEB are that it assumes the probability distributions of the individual attributes are statistically independent, and that the cost of computing, updating, and storing the probability distributions of the clusters is quite high. Improved variants of COBWEB include CLASSIT and AutoClass.

2.6. Fuzzy data clustering

Normally, a clustering method partitions the initial data set into natural clusters in which each data object belongs to exactly one cluster; this is only suitable for discovering high-density, disjoint clusters. In practice, however, clusters may overlap. Fuzzy set theory is applied to clustering to handle this case, and the combination is called fuzzy clustering. In fuzzy clustering, the membership of data object $x_k$ in the i-th cluster, $u_{ik}$, takes a value in the interval [0, 1]. This idea was introduced by Ruspini (1969) and applied by Dunn in 1973 to build a fuzzy clustering method based on minimizing a criterion function. Bezdek (1982) generalized this method into the fuzzy c-means clustering algorithm, which uses fuzzy weights [10][13][20].

c-means is the fuzzy counterpart of k-means. The fuzzy c-means algorithm, FCM for short, has been applied successfully to a large number of clustering problems, for example in pattern recognition, image processing, and medicine. However, the biggest weakness of FCM is its sensitivity to noise and outliers, which means that the cluster centers can lie far from the true centers of the clusters. Many improvements have been proposed to address this weakness, including probability-based clustering (Keller, 1993), noise fuzzy clustering (Dave, 1991), clustering based on the $L_p$ norm operator (Kersten, 1999), and the ε-Insensitive Fuzzy c-means algorithm (εFCM).
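To make the notion of a membership degree in [0, 1] concrete, the sketch below shows the two update rules of the standard fuzzy c-means formulation. The text above does not spell these formulas out, so the rules, the fuzzifier parameter `m`, and the function names are assumptions of this example. Memberships are stored as `u[k][i]` (object k, cluster i).

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """u[k][i] = membership of points[k] in cluster i; each row sums to 1."""
    u = []
    for x in points:
        d = [math.dist(x, c) for c in centers]
        if 0.0 in d:
            # The object coincides with a center: full membership there.
            row = [1.0 if di == 0.0 else 0.0 for di in d]
            total = sum(row)
            row = [r / total for r in row]
        else:
            row = [1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in d)
                   for di in d]
        u.append(row)
    return u


def fcm_centers(points, u, m=2.0):
    """Each center is the mean of all objects weighted by membership ** m."""
    k, dim = len(u[0]), len(points[0])
    centers = []
    for i in range(k):
        w = [row[i] ** m for row in u]
        total = sum(w)
        centers.append([sum(wk * x[d] for wk, x in zip(w, points)) / total
                        for d in range(dim)])
    return centers
```

Alternating the two functions until the memberships stop changing gives the basic FCM iteration. Because every object contributes to every center, a distant outlier still pulls the centers toward it, which is the sensitivity discussed above.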

Chapter 3. WEB DATA MINING

Corresponding to the types of Web data, the approaches to Web mining can be divided as follows:

[Figure: Web mining divides into Web content mining (Web page content mining, search result mining), Web structure mining, and Web usage mining (general access pattern tracking, customized usage tracking)]

Figure 3.1. Classification of Web mining

3.1. Web content mining

Web content mining focuses on the automatic discovery of valuable online information sources. It can be approached in two different ways: information retrieval and data mining in large databases. Multimedia data mining is part of Web content mining and promises to extract high-level information and knowledge from the vast multimedia sources available online. This is a real challenge; the research area is still in its early stages, and much work remains to be done. There are many approaches to Web content mining, but this thesis considers it from two angles: mining search results and mining the content of HTML pages.

3.1.1. Mining search results

- Automatic document classification using a search engine: a search engine can centrally index the heterogeneous data on the Web.
- User-friendly visualization of search results: analyzing and clustering search results improves search effectiveness; that is, documents with similar content are placed in the same group, and dissimilar documents are placed in different groups.

3.1.2. Web text mining

Text mining is the application of data mining techniques to document collections in order to find meaningful knowledge hidden in them. The data can be structured, semi-structured, or unstructured [16]. The result of mining is not only the general state of each text document but also the classification and clustering of document collections for a particular purpose.

[Figure: Web data source -> preprocessing -> data representation -> applying data mining techniques -> pattern extraction -> evaluation and knowledge representation]

Figure 3.2. The Web text mining process

3.1.2.1. Data selection
Basically, local texts are formatted and integrated into documents as required for mining, and are distributed across many Web services using information retrieval techniques.

3.1.2.2. Data preprocessing
To obtain good mining results, the data must be clear and accurate, and messy or redundant data must be removed. After preprocessing, the resulting data set usually has the following characteristics [16]:
- The data is consistent; heterogeneous data has been coerced into a unified form.
- Irrelevant data, noise, and empty data have been cleaned out. No data is lost and none is duplicated.
- The number of dimensions has been reduced, and the effectiveness of knowledge discovery increased, by transforming, generalizing, and coercing the data.
- Irrelevant attributes have been removed to reduce the dimensionality of the data.

3.1.2.3. Text representation
Web text mining is the mining of collections of HTML documents. The data must therefore be transformed into a representation suitable for processing. The TF-IDF model is commonly used to vectorize the data. One important problem is that this representation leads to vectors with a rather large number of dimensions.

3.1.2.4. Feature word extraction
Feature extraction is a method for dealing with the large dimensionality of the feature vectors produced in text mining. Features can be extracted in two ways:
- Extraction based on a weighting function: each feature word receives a confidence weight computed by a confidence weighting function. A feature word that occurs with high frequency is very likely to reflect the topic of the text, so it is assigned a larger confidence value. Moreover, if it appears in a title, a keyword, or a phrase, it certainly deserves a larger confidence value.
- Extraction based on principal component analysis from statistical analysis: the main idea of this method is to replace the feature words with a small number of principal feature words that subsume them, thereby reducing the dimensionality.

3.1.2.5. Text mining
Once the text collection has been gathered, selected, and reduced to its basic features, it becomes the basis for data mining. From there we can perform extraction, classification, clustering, analysis, and prediction.

3.1.2.5.1. Text extraction
Text extraction produces the main meaning of a document, giving a summary description of the text. The user can then understand the main meaning of the text without having to browse all of it. This method is used in particular by search engines, which usually need to present an excerpt of the text. Many search engines produce such summary sentences when searching and returning results; the main meaning of a text or text collection is best obtained by using several different algorithms.

3.1.2.5.2. Text classification
First, many documents are classified automatically, quickly, and with high efficiency. Second, each class of text is placed under a suitable topic. The Naive Bayes and k-nearest-neighbor classification methods are commonly used to mine text information. In text classification, the first step is to classify the documents. The second step is to determine the features from the number of features of the training document set. Finally, some algorithm is used to compute and check the classification of the documents and the similarity of the classified documents. Documents that are highly similar to one another fall into the same class. Similarity is measured by a predefined evaluation function. If documents have little similarity, the value is set to 0. If a document does not match any of the predefined classes, it is considered unsuitable.

3.1.2.5.3. Text clustering
The topics do not need to be determined in advance, but the documents must be grouped into several clusters. Within a cluster the similarity between documents is required to be high; between clusters the similarity is lower. The linkage-based arrangement method and the hierarchical method are commonly used in text clustering.
- First, divide the document set into initial clusters by optimizing an evaluation function according to some principle, R = {R1, R2, ..., Rn}, where n must be determined in advance.
- For each document in the document set W = {w1, w2, ..., wm}, compute its similarity to each initial cluster Rj, sim(wi, Rj), then assign the document to the cluster Rj to which it is most similar.
- Repeat the steps above until all documents have been placed in clusters.
Figure 3.3. Partitioning clustering algorithm

The clustering result is stable and obtained quickly, but the initial elements and their number must be chosen well in advance.

3.1.2.5.4. Trend analysis and prediction
By analyzing Web documents, we can obtain the distribution of particular data over each period of time and predict how it will develop in the future.

3.1.3. Evaluating pattern quality
Web data mining can be viewed as a machine learning process. The result of machine learning is a set of knowledge patterns. An important part of machine learning is evaluating the resulting patterns. The document collection is usually split into a training set and a test set. Learning and testing are then repeated on the training and test sets. Finally, the average quality is used to assess the quality of the model.
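The repeated train/test procedure described in 3.1.3 can be sketched as k-fold evaluation. This is only one possible reading of the text: the fold layout, the accuracy measure, and the trivial majority-class learner used as a placeholder are all assumptions, and every name below is hypothetical.

```python
from collections import Counter

def k_fold_accuracy(samples, labels, train, predict, k=5):
    """Average accuracy over k train/test splits (requires len(samples) >= k)."""
    scores = []
    for fold in range(k):
        # Fold number `fold` tests on every k-th item starting at index `fold`.
        test_idx = set(range(fold, len(samples), k))
        train_x = [s for i, s in enumerate(samples) if i not in test_idx]
        train_y = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train(train_x, train_y)
        hits = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k

# Placeholder learner: always predicts the most frequent training label.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model

samples = list(range(10))
labels = ["a"] * 8 + ["b"] * 2
average_quality = k_fold_accuracy(samples, labels, train_majority, predict_majority, k=5)
```

Any classifier with the same `train` and `predict` shape, such as the Naive Bayes or k-nearest-neighbor methods mentioned earlier, could be plugged in instead of the placeholder.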

3.2. Web usage mining


Understanding the characteristics of Web users is very important for Web site designers. By mining the history of users' access patterns, we can determine not only how the Web is being used but also many other characteristics, such as user behavior. The navigation paths of Web users carry valuable information about how interested the users are in those Web sites. Web usage mining is the mining of Web accesses to discover the patterns of users accessing a Web site. The general architecture of the Web usage mining process is as follows:

Figure 3.4. General architecture of Web usage mining

3.2.1. Applications of Web usage mining


- Finding potential customers in e-commerce.
- E-government (e-Gov) and e-learning (e-Learning).
- Identifying potential advertisements.
- Improving the delivery quality of Internet information services.
- Improving the performance of Web server systems.
- Improving Web design and personalizing Web services.
- Detecting fraud and unauthorized intrusion in e-commerce and other Web services.
- Predicting user behavior during information searches by analyzing users' access sequences.

3.2.2. Techniques used in Web usage mining


Association rules: used to find Web pages that users frequently access together, and items that customers frequently choose together in e-commerce.
Clustering: users are clustered by their browsing patterns to find relationships between Web users and their behavior.
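To make the association-rule idea concrete, the sketch below counts how often pairs of pages occur in the same session and reports rules "lhs -> rhs" with their support and confidence. The thresholds, the session data, and the function name are hypothetical; a real system would typically use a full frequent-itemset algorithm rather than this pair-only count.

```python
from collections import Counter
from itertools import combinations

def page_pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for page pairs visited in the same session."""
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for pages in sessions:
        unique = sorted(set(pages))
        page_count.update(unique)
        pair_count.update(combinations(unique, 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # fraction of sessions containing both pages
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_count[lhs]  # share of lhs sessions that also contain rhs
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical sessions, each a list of visited pages.
sessions = [
    ["/home", "/news"],
    ["/home", "/news", "/sport"],
    ["/home", "/sport"],
    ["/news"],
]
rules = page_pair_rules(sessions)
```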

3.2.3. Issues in Web usage mining


Web usage mining involves two tasks. First, the Web log must be cleaned, defined, integrated, and transformed. Analysis and mining are then carried out on the result.
Existing problems:
- The physical structure of a Web site differs from the patterns in which users access it.
- It is very difficult to identify users, sessions, and transactions.
The problem of identifying user sessions and Web accesses:
User navigation sessions: a session is the group of actions performed by a user from the moment they enter a Web site until they leave it. The actions of users on a Web site are recorded and stored in a log file.

3.2.3.1. User session identification
User identification: all requests from the same client IP are treated as coming from the same user.
Session identification: for each IP address, a new session is created when a new address is seen or when the time spent on a page exceeds the allowed threshold (for example, 30 minutes).

3.2.3.2. Web logs and identifying user navigation sessions
Web log file: a Web log file is a collection of records of users' requests for documents on a Web site.

3.2.3.3. Problems in processing Web logs
- The information provided may be incomplete and lacking in detail.
- There is no information about the content of the pages visited.
- There are too many log records caused by requests served through proxies.
- Log records are incomplete because of requests served by proxies.
- Log entries must be filtered.
- The time spent on each page must be estimated.

3.2.3.4. Methods for identifying sessions and Web accesses
Session identification: a user's page references are grouped into sessions using heuristic methods. A heuristic based on the IP address and a session timeout (for example, 30 minutes) is used to identify user sessions. This is the simplest method.
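The IP-plus-timeout heuristic described above can be sketched as follows. The record layout (IP address, timestamp in seconds, URL) and the sample records are assumptions made for illustration; real log formats contain more fields and would need parsing first.

```python
from collections import defaultdict

def sessionize(records, timeout=30 * 60):
    """Group (ip, timestamp_seconds, url) records into sessions.

    Requests from one IP are treated as one user; a gap longer than
    `timeout` seconds between consecutive requests starts a new session.
    """
    by_ip = defaultdict(list)
    for ip, ts, url in records:
        by_ip[ip].append((ts, url))
    sessions = []
    for ip in sorted(by_ip):
        current = []
        for ts, url in sorted(by_ip[ip]):
            if current and ts - current[-1][0] > timeout:
                sessions.append((ip, current))
                current = []
            current.append((ts, url))
        sessions.append((ip, current))
    return sessions

# Hypothetical log records.
records = [
    ("10.0.0.1", 0, "/a"),
    ("10.0.0.1", 600, "/b"),
    ("10.0.0.1", 4000, "/c"),   # more than 30 minutes after the previous request
    ("10.0.0.2", 100, "/a"),
]
sessions = sessionize(records)
```

As section 3.2.3.3 notes, proxies and caches weaken this heuristic, since several users may share one IP address.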

3.2.4. The Web usage mining process


Web usage mining has three phases [22]: preprocessing, mining, and analysis, evaluation, and presentation of the data.

3.2.4.1. Data preprocessing
This phase covers user identification, identification of access activity, path completion, transaction identification, data integration, and data transformation. In this phase, Web log information can be transformed into transaction patterns suitable for later processing in different fields. Data affected by local caches and proxy services is supplemented or removed as needed. Information in cookies and user registration information is processed together with the IP address, the browser name, and temporarily stored information.
Transaction identification: identifying user sessions and transactions.

3.2.4.2. Data mining
Data mining methods from different fields, such as association rules, statistical analysis, path analysis, classification, and clustering, are used to discover user patterns.
+ Path analysis [8][9][22]: most frequently visited paths are laid out according to the physical graph of the Web site. By analyzing the paths followed during user access, we can learn how users move between related paths.
+ Association rules [8]: association rules are used to find correlations between references to different files available on the server.
+ Sequential patterns: patterns found across transactions and time sequences. They show that a set of items is followed by another item in the time-ordered set of transactions.
+ Classification rules [22]: a profile of the items belonging to a particular group according to their common attributes.
+ Cluster analysis: grouping together customers or data items that have similar characteristics. This helps in developing and carrying out online and offline customer marketing strategies, such as automatically replying to customers in a particular group, or dynamically changing a Web site for each customer.

3.2.4.3. Analysis and evaluation
Model analysis [22]: statistics, knowledge discovery, and intelligent agents. Feasibility analysis and data querying aimed at human consumption.
Visualization: Web visualization using Web path diagrams and OLAP directed graphs.

3.3. Web structure mining


The WWW is a global information system made up of all Web sites. Each page can be linked to many other pages. Hyperlinks carry semantics about the common topic of the pages. A hyperlink pointing to another Web page can be regarded as an endorsement of that page. It is therefore very useful to exploit this semantic information to obtain important information by analyzing the links between Web pages. The goal of Web structure mining is to discover structural information about the Web. Whereas Web content mining focuses mainly on the structure inside a document, Web structure mining tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, Web structure mining classifies Web pages and generates information such as the similarity and relationships between different Web sites. If a Web page is linked directly to another Web page, we want to discover the relationship between these pages.
+ Web link analysis is used for the following purposes: ordering documents to match a user's query, deciding which Web pages to include in the selection for a query, pagination, finding related pages, and finding duplicates on the Web.
- Link graph: each node is a page, and there is a directed edge from u to v if there is a hyperlink from Web page u to Web page v.

Figure 3.5. Web link graph

- Citation graph: each node is a page, and there is an undirected edge between u and v if a third page w links to both u and v.
- Assumption: a link from page u to page v is a recommendation of page v by page u. If u and v are connected by a link, the two Web pages are very likely to have similar content.

3.3.1. Similarity criteria


To discover a group of similar Web pages for mining, we must define the similarity of two nodes according to some criterion.
Criterion 1: for any two Web pages d1 and d2, we say that d1 and d2 are related if there is a link from d1 to d2 or from d2 to d1.
Criterion 2:
Co-citation: the similarity between d1 and d2 is measured by the number of pages that link to both d1 and d2.
Bibliographic coupling: the similarity between d1 and d2 is measured by the number of pages that both d1 and d2 point to.
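The two measures in Criterion 2 can be computed directly from a link graph. The sketch below assumes the graph is stored as a dictionary that maps each page to the set of pages it links to; the page names are hypothetical.

```python
def cocitation(links, d1, d2):
    """Number of pages that link to both d1 and d2."""
    return sum(1 for targets in links.values() if d1 in targets and d2 in targets)

def coupling(links, d1, d2):
    """Number of pages that both d1 and d2 link to."""
    return len(set(links.get(d1, ())) & set(links.get(d2, ())))

# Hypothetical link graph: page -> set of pages it links to.
links = {
    "w1": {"a", "b"},
    "w2": {"a", "b", "c"},
    "a": {"x", "y"},
    "b": {"y", "z"},
}
```

Here pages a and b are cited together by w1 and w2, and both point to y.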

3.3.2. Mining and managing Web communities


A Web community is a group of Web pages that share common topics of interest to users. The members of a Web community may not be aware of one another's existence. It is very important to recognize Web communities and to understand their development and characteristics. Identifying and understanding communities on the Web can be regarded as a task of mining and managing the Web.
Characteristics of Web communities:
- Web pages within the same community are more similar to one another than to pages outside the community.
- Each Web community forms a cluster of Web pages.
- Some Web communities are explicitly defined and known to everyone, such as the resources listed by Yahoo.
- Some Web communities are completely defined.
There are many methods for identifying communities, such as the topic search algorithm HITS, maximum flow and minimum cut, and the PageRank algorithm.

3.3.2.1. The PageRank algorithm
Google is based on the PageRank algorithm, which indexes the links between Web sites and interprets a link from A to B as an endorsement of B by A. Links have different values. If A has many links pointing to it and C has few, then a link from A to B is worth more than a link from C to B. The value determined in this way is called the PageRank of a page and determines its position in the search results. Links can be analyzed more accurately and efficiently than traffic volume or page views, and they have become a measure of success and of changes in page ranking.

Figure 3.6. Result of the PageRank algorithm
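The idea that a link from a highly linked page is worth more can be sketched with the commonly published iterative form of PageRank. The thesis does not give a formula, so the damping factor of 0.85, the fixed number of iterations, and the even redistribution of rank from pages without outgoing links are all assumptions of this sketch, as are the page names.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to a list of pages it links to (no duplicates)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p, targets in links.items():
            for t in targets:
                # Each page passes an equal share of its rank to every page it links to.
                new_rank[t] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: A and C both link to B, and B links back to A.
ranks = pagerank({"A": ["B"], "C": ["B"], "B": ["A"]})
```

In this small graph, C receives no inbound links and ends with the lowest rank, while B, which is endorsed by both A and C, ends with the highest.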

PageRank is not based simply on the total number of inbound links. Its basic approach is that a document is considered more important the more documents link to it, but inbound links do not all count equally. A document ranks high in terms of PageRank if other high-ranking documents link to it.

3.3.2.2. Clustering with the HITS algorithm
HITS is a more developed algorithm for ranking documents based on the link information among a set of documents.
- Authority: a page that provides important, trustworthy information on a given topic.
- Hub: a page that contains links to authorities.
- In-degree: the number of links pointing to a node, used to measure authority.
- Out-degree: the number of links going out from a node, used to measure hubness.

Authorities and hubs exhibit a mutually reinforcing relationship: a hub is better if it points to good authorities, and conversely an authority is better if it is pointed to by many good hubs.
The steps of the HITS method:
Step 1: Determine a base set S. Take the set of documents returned by a standard search engine, called the root set R, and initialize S to R.
Step 2: Add to S all pages that are pointed to by any page in R. Add to S all pages that point to any page in R. For each page p in S, an authority score a_p (vector a) and a hub score h_p (vector h) are computed. For every node, initialize a_p and h_p to 1/n (n is the number of pages).
Step 3: In each iteration, compute the authority weight of every node in S by the formula:

$$a_p = \sum_{q:\, q \to p} h_q$$

Step 4: In each iteration, compute the hub weight of every node in S by the formula:

$$h_p = \sum_{q:\, p \to q} a_q$$

Note that the hub weights are computed from the current authority weights, which were in turn computed from the previous hub weights.
Step 5: After the new weights for all nodes have been computed, normalize them so that:

$$\sum_{p \in S} (a_p)^2 = 1 \quad \text{and} \quad \sum_{p \in S} (h_p)^2 = 1$$

Repeat from step 3 until h_p and a_p no longer change.
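Steps 3 to 5 of HITS can be sketched as follows. The sketch assumes the base set S has already been built and is given as a dictionary mapping each page to the set of pages it links to. It runs a fixed number of iterations instead of testing for convergence, which is a simplification, and the page names are hypothetical.

```python
import math

def hits(links, iterations=50):
    """Return (authority, hub) weights for the pages in a link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    auth = {p: 1.0 / len(pages) for p in pages}
    hub = dict(auth)
    for _ in range(iterations):
        # Step 3: authority of p is the sum of hub weights of pages q that link to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Step 4: hub of p is the sum of the current authority weights of pages p links to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        # Step 5: normalize so that the squared weights sum to 1.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub

# Hypothetical base set: three hub pages pointing at two candidate authorities.
auth, hub = hits({"h1": {"a1", "a2"}, "h2": {"a1"}, "h3": {"a1"}})
```

Page a1 is pointed to by all three hubs and receives the largest authority weight, while h1, which points to both authorities, receives the largest hub weight.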

3.4. Applying data clustering algorithms to searching and clustering Web documents
The continuous improvement of search engines, in both search functionality and user interface, has made it easier for users to find information on the Web. However, users often still have to browse through dozens or even thousands of Web pages before they find what they need. To address this problem, we can group the search results by topic, so that users can skip the groups they are not interested in and go straight to the topic group they care about. This helps users do their work more effectively. However, clustering data on the Web and choosing a suitable topic that describes the content of the pages is not a simple problem. In this paper, we consider the use of clustering techniques to cluster Web documents based on a repository of data that has already been retrieved and stored.

3.4.1. A clustering-based approach


At present there are several ways to assess the importance of a Web page, such as PageRank and HITS. However, these methods rely mainly on the links between pages to determine the weight of a page. The importance of a page can also be assessed in a different way, by using the content of the documents to determine the weights: documents that are "close" to one another in content have comparable importance and belong to the same group. Suppose we are given a set S of Web pages. Find the pages in S that contain the content of the query to obtain the set R. Use a data clustering algorithm to partition R into k clusters (k fixed) such that the elements within a cluster are as similar to one another as possible and the elements of different clusters are dissimilar. Then place the elements of S-R into one of the k clusters established above. Elements that are similar to the centroid of a cluster (according to some fixed threshold) are placed in that cluster; elements that do not satisfy the threshold are considered irrelevant to the query and are removed from the result set. Next, we assign weights to the clusters and to the pages in the result set using the following algorithm:
INPUT: a data set D of pages, consisting of k clusters and k centroids
OUTPUT: the weights of the pages
BEGIN
Each cluster m, with centroid Cm, is assigned a weight tsm. For any two centroids Ci and Cj, tsi > tsj if Ci is more similar to the query than Cj.
Each page p in cluster m is assigned a page weight pw. For any two pages pi and pj in the cluster, pwi > pwj if pi is closer to the centroid than pj.
END
Figure 3.7. Algorithm for weighting clusters and pages
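One way to obtain weights that satisfy the two ordering conditions in Figure 3.7 is to use a similarity score directly as the weight. The sketch below uses cosine similarity for both the cluster weights and the page weights. This choice is an assumption, since the figure only constrains the ordering of the weights, and all names and vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def weight_clusters_and_pages(query, centroids, clusters):
    """centroids: list of vectors; clusters: list of dicts mapping page id -> vector."""
    # Cluster weight: similarity of the cluster centroid to the query vector.
    cluster_weights = [cosine(query, c) for c in centroids]
    # Page weight: similarity of the page vector to the centroid of its own cluster.
    page_weights = [
        {page: cosine(vec, centroids[m]) for page, vec in members.items()}
        for m, members in enumerate(clusters)
    ]
    return cluster_weights, page_weights

# Hypothetical two-term vector space with two clusters.
query = [1.0, 0.0]
centroids = [[1.0, 0.1], [0.0, 1.0]]
clusters = [{"p1": [1.0, 0.0], "p2": [1.0, 1.0]}, {"p3": [0.0, 2.0]}]
ts, pw = weight_clusters_and_pages(query, centroids, clusters)
```

Results can then be presented cluster by cluster in decreasing order of cluster weight, with the pages inside each cluster sorted by decreasing page weight.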

With this approach, the following problems are addressed:
+ The search results are divided into clusters by topic, and users pick the topic they need according to their specific requirements.
+ Searching and page weighting focus mainly on the content of the pages rather than on the links between pages.
+ The problem of synonymous words and phrases in the user's query is handled.
+ Clustering methods from the field of data mining can be combined with existing search methods.
Several data clustering algorithms are currently used in text clustering, such as partitioning algorithms (k-means, PAM, CLARA) and hierarchical algorithms (BIRCH, STC). In practice, when Web documents are clustered by content, a document may belong to several topic groups. To handle this, a fuzzy clustering algorithm can be used.

3.4.2. The document search and clustering process


Basically, the process of clustering search results consists of the following steps:

- Search for Web pages from Web sites that satisfy the content of the query.
- Extract descriptive information from the pages and store it together with the corresponding URLs.
- Use a data clustering technique to cluster the Web pages automatically, so that the pages in a cluster are more similar in content to one another than to pages outside the cluster.

Figure 3.8. Steps for clustering search results on the Web (Web data -> search and data extraction -> preprocessing -> data representation -> applying the clustering algorithm -> presenting results)

3.4.2.1. Searching for data on the Web
The main task of this stage is to use a set of search keywords to find and return a set consisting of the full text, title, summary description, URL, and so on of the matching pages. To speed up processing, these documents are retrieved and stored in a repository for use in later searches (similar to search engines such as Yahoo and Google).

3.4.2.2. Data preprocessing
This is the process of cleaning the data and converting the documents into a suitable data representation. The stage consists of the following tasks: text normalization, stop-word removal, merging words with the same stem, and digitizing and representing the text.

3.4.2.2.1. Text normalization
In this stage the raw text is converted into a form that makes later processing easier, simpler, more convenient, and more accurate than working directly on the raw text, while having little effect on the processing results.

3.4.2.2.2. Stop-word removal
A text contains words that carry little information for processing. Words that occur with low frequency, and words that occur with high frequency but are not important for processing, are removed. Some recent studies show that removing stop words can reduce the total number of words in a text by about 20-30%. Many words occur with high frequency but are not useful for data clustering, and words that occur with excessively high frequency are also removed. For simplicity in practical applications, we can maintain a list of stop words and use Zipf's law to remove words whose frequency is too low or too high.

3.4.2.2.3. Merging words with the same stem
Most languages have many words that share a common stem and carry similar meanings. To reduce the number of dimensions in the text representation, words with the same stem are merged into one word. According to some studies [5], this merging reduces the dimensionality of the text representation by about 40-50%. For example, in English the words user, users, used, and using share a stem and are reduced to use; the words engineering, engineered, and engineer share a stem and are reduced to engineer.

3.4.2.3. Building the dictionary
Building the dictionary is a very important task in text vectorization. The dictionary consists of the distinct words and phrases in the whole data set. It is a table of words together with their indices in the dictionary, sorted in order.
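The stop-word filtering and dictionary construction described above can be sketched as follows. The stop-word list, the tokenization rule, and the frequency thresholds are illustrative assumptions. In particular, the sketch filters by document frequency as a simple stand-in for the Zipf-based frequency cut-offs mentioned in the text, and it does not perform stemming.

```python
import re
from collections import Counter

# Illustrative stop-word list, not taken from the thesis.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def build_dictionary(documents, min_df=2, max_df_ratio=0.9):
    """Map each kept term to its index in the alphabetically sorted dictionary."""
    doc_freq = Counter()
    for text in documents:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        doc_freq.update(tokens - STOP_WORDS)
    limit = max_df_ratio * len(documents)
    # Drop terms that occur in too few or too many documents.
    kept = sorted(t for t, df in doc_freq.items() if min_df <= df <= limit)
    return {term: index for index, term in enumerate(kept)}

# Hypothetical miniature collection.
documents = [
    "Web mining of the web",
    "Clustering web documents",
    "Mining and clustering",
    "web data",
]
dictionary = build_dictionary(documents)
```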

Some papers [31] propose that, to improve the quality of data clustering, phrases should be handled in their different contexts. According to Zamir [19][31], a dictionary of 500 entries is suitable.

3.4.2.4. Word segmentation, text digitization, and document representation
Word segmentation is a very important task in text representation. Segmenting and vectorizing a document means finding its words and replacing each one with the index of that word in the dictionary. One of the mathematical models TF, IDF, TF-IDF, and so on can be used here to represent the text. We use a two-dimensional weight array W of size m x n, where n is the number of documents and m is the number of terms in the dictionary (the number of dimensions). Column j is the vector representing document j in the database, and row i corresponds to term i in the dictionary. Wij is the weight of term i for document j. This stage counts the frequency with which term ti occurs in document dj and the number of documents hi that contain ti. The weight matrix W is then built with the following formula.
Formula for the weights under the TF-IDF model:
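$$W_{ij} = \begin{cases} tf_{ij} \times idf_{ij} = [1 + \log(tf_{ij})] \times \log\left(\dfrac{n}{h_i}\right) & \text{if } t_i \in d_j \\ 0 & \text{otherwise } (t_i \notin d_j) \end{cases}$$

Here tf_ij is the frequency of term ti in document dj, h_i is the number of documents containing ti, and n is the number of documents.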
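The weight formula can be evaluated with a short sketch. It assumes that each document is given as a dictionary of raw term counts and that the logarithm is the natural logarithm, since the source does not state the base. The function name and sample counts are hypothetical.

```python
import math

def tfidf_matrix(doc_term_counts, vocabulary):
    """Return W with W[i][j] = weight of vocabulary[i] in document j (an m x n array)."""
    n = len(doc_term_counts)
    # h[t]: number of documents that contain term t.
    h = {t: sum(1 for counts in doc_term_counts if counts.get(t, 0) > 0) for t in vocabulary}
    return [
        [
            (1.0 + math.log(counts[t])) * math.log(n / h[t]) if counts.get(t, 0) > 0 else 0.0
            for counts in doc_term_counts
        ]
        for t in vocabulary
    ]

# Hypothetical term counts for three documents.
docs = [{"web": 2, "mining": 1}, {"web": 1}, {"cluster": 3}]
vocabulary = ["web", "mining", "cluster"]
W = tfidf_matrix(docs, vocabulary)
```

A term that occurs in every document gets weight 0 under this formula, because log(n / h_i) is then log(1).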

3.4.2.5. Document clustering
After searching, extracting the data, preprocessing, and representing the text, we apply a clustering technique to cluster the documents.
INPUT: A set of n documents and the number of clusters k.
OUTPUT: Clusters Ci (i = 1,..,k) such that the criterion function reaches its minimum.
BEGIN
Step 1. Randomly initialize k vectors as the centroids of the k clusters.
Step 2. For each document dj, determine its similarity to the centroid of each cluster using one of the common similarity measures (such as Dice, Jaccard, Cosine, Overlap, Euclidean, Manhattan). Find the most similar centroid for each document and place the document in that cluster.
Step 3. Update the centroids. For each cluster, recompute the centroid as the arithmetic mean of the document vectors in that cluster.
Step 4. Repeat steps 2 and 3 until the centroids no longer change.
END.
Figure 3.9. The k-means algorithm for clustering Web document content

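- The complexity of the k-means algorithm is O((n.k.d).r), where n is the number of data objects, k is the number of clusters, d is the number of dimensions, and r is the number of iterations. When the documents have been clustered, the result returned consists of the clusters and their corresponding centroids.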
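The steps of Figure 3.9 can be sketched in a few lines. The sketch uses cosine similarity, one of the measures listed in Step 2. Choosing the initial centroids by sampling k of the documents and stopping when the assignments no longer change are assumptions of this sketch, and the sample vectors are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def kmeans(vectors, k, max_iter=100, seed=0):
    """Return (assignment, centroids) where assignment[j] is the cluster of vectors[j]."""
    rng = random.Random(seed)
    # Step 1: pick k documents at random as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each document to its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # Step 4: stop once nothing changes.
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its member vectors.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical document vectors forming two obvious groups.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignment, centroids = kmeans(vectors, k=2)
```

Each iteration compares n documents with k centroids in d dimensions, which matches the O(n.k.d) cost per iteration stated above.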

3.4.6. Experimental results


+ The experimental data consists of Web pages taken from two main sources:
- Pages collected automatically from Web sites on the Internet. The search is carried out automatically using Yahoo; the program uses the URLs to fetch the full text of each document and stores it for later searches (the data comprises more than 4000 articles on the topics "data mining", "web mining", "Cluster algorithm", and "Sport").
- Selective search. This part was collected manually, mainly from the Web sites http://www.baobongda.com.vn/, http://bongda.com.vn/, http://vietnamnet.vn, and http://www.24h.com. It comprises more than 250 articles on the topic of football.
- Dictionary construction: after counting the frequency of the words in the document collection, we apply Zipf's law to remove words whose frequency is too high or too low, obtaining a dictionary of 500 words.

                                     Average time (seconds)
Number of      Number of     Preprocessing and          Document
documents      clusters      text representation        clustering
50             10            0.206                      0.957
50             15            0.206                      1.156
100            10            0.353                      2.518
100            15            0.353                      3.709
150            10            0.515                      4.553
150            15            0.515                      5.834
250            10            0.824                      9.756
250            15            0.824                      13.375

Table 3.2. Measured execution time of the clustering algorithm

The execution time of the algorithm depends on the size of the data and on the number of clusters. In addition, for k-means it also depends on the k initial centroids. If the k centroids are chosen well, both the quality and the execution time improve considerably. The program interface and some typical code fragments are presented in the appendix.

REFERENCES
Vietnamese-language references
[1] Cao Chính Nghĩa, Some issues in data clustering (in Vietnamese), Master's thesis, University of Technology, Vietnam National University, Hanoi, 2006.
[2] Hoàng Hải Xanh, On data clustering techniques in data mining (in Vietnamese), Master's thesis, Vietnam National University, Hanoi, 2005.
[3] Hoàng Thị Mai, Data mining using data clustering methods (in Vietnamese), Master's thesis, Hanoi National University of Education, 2006.
English-language references
[4] Athena Vakali, Web data clustering: Current research status & trends, Aristotle University, Greece, 2004.
[5] Bing Liu, Web mining, Springer, 2007.
[6] Brij M. Masand, Myra Spiliopoulou, Jaideep Srivastava, Osmar R. Zaiane, Web Mining for Usage Patterns & Profiles, ACM, 2002.
[7] Filippo Geraci, Marco Pellegrini, Paolo Pisati, and Fabrizio Sebastiani, A scalable algorithm for high-quality clustering of Web Snippets, Italy, ACM, 2006.
[8] Giordano Adami, Paolo Avesani, Diego Sona, Clustering Documents in a Web Directory, ACM, 2003.
[9] Hiroyuki Kawano, Applications of Web mining: from Web search engine to P2P filtering, IEEE, 2004.
[10] Ho Tu Bao, Knowledge Discovery and Data Mining, 2000.
[11] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, Jinwen Ma, Learning to Cluster Web Search Results, ACM, 2004.
[12] Jitian Xiao, Yanchun Zhang, Xiaohua Jia, Tianzhu Li, Measuring Similarity of Interests for Clustering Web-Users, IEEE, 2001.
[13] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, University of Illinois at Urbana-Champaign, 1999.
[14] Khoo Khyou Bun, Topic Trend Detection and Mining in World Wide Web, PhD thesis, Japan, 2004.
[15] Liu Jian-guo, Huang Zheng-hong, Wu Wei-ping, Web Mining for Electronic Business Application, IEEE, 2003.
[16] Lizhen Liu, Junjie Chen, Hantao Song, The research of Web Mining, IEEE, 2002.

[17] Maria Rigou, Spiros Sirmakessis, and Giannis Tzimas, A Method for Personalized Clustering in Data Intensive Web Applications, 2006.
[18] Miguel Gomes da Costa Júnior, Zhiguo Gong, Web Structure Mining: An Introduction, IEEE, 2005.
[19] Oren Zamir and Oren Etzioni, Web document Clustering: A Feasibility Demonstration, University of Washington, USA, ACM, 1998.
[20] Pawan Lingras, Rough Set Clustering for Web mining, IEEE, 2002.
[21] Periklis Andritsos, Data Clustering Techniques, University of Toronto, 2002.
[22] R. Cooley, B. Mobasher, and J. Srivastava, Web mining: Information and Pattern Discovery on the World Wide Web, University of Minnesota, USA, 1998.
[23] Raghu Krishnapuram, Anupam Joshi, and Liyu Yi, A Fuzzy Relative of the K-Medoids Algorithm with Application to Web Document and Snippet Clustering, 2001.
[24] Raghu Krishnapuram, Anupam Joshi, Olfa Nasraoui, and Liyu Yi, Low-Complexity Fuzzy Relational Clustering Algorithms for Web Mining, IEEE, 2001.
[25] Raymond Kosala and Hendrik Blockeel, Web Mining Research: A Survey, ACM, 2000.
[26] Rui Wu, Wansheng Tang, Ruiqing Zhao, An Efficient Algorithm for Fuzzy Web-Mining, IEEE, 2004.
[27] T.A. Runkler, J.C. Bezdek, Web mining with relational clustering, Elsevier, 2002.
[28] Tsau Young Lin, I-Jen Chiang, A simplicial complex, a hypergraph, structure in the latent semantic space of document clustering, Elsevier, 2005.
[29] Wang Jicheng, Huang Yuan, Wu Gangshan, and Zhang Fuyan, Web Mining: Knowledge Discovery on the Web, IEEE, 1999.
[30] Wang Bin, Liu Zhijing, Web Mining Research, IEEE, 2003.
[31] Wenyi Ni, A Survey of Web Document Clustering, Southern Methodist University, 2004.
[32] Yitong Wang, Masaru Kitsuregawa, Evaluating Contents-Link Coupled Web Page Clustering for Web Search Results, ACM, 2002.
[33] Zifeng Cui, Baowen Xu, Weifeng Zhang, Junling Xu, Web Documents Clustering with Interest Links, IEEE, 2005.
