1.1. INTRODUCTION TO DATA MINING AND KDD

1.1.1. Why data mining is needed

Over roughly the past decade, the amount of information stored on electronic media (hard disks, CD-ROMs, magnetic tape, etc.) has grown without pause, and this accumulation is happening at an explosive rate. It is estimated that the world's information doubles about every two years, and the number and size of databases are growing just as quickly. Put figuratively, we are drowning in data yet starving for knowledge. The question is whether anything can be extracted from these mountains of seemingly discarded data.

"Necessity is the mother of invention": data mining emerged as an effective answer to that question []. Many definitions of data mining exist and are discussed later. For now it can be understood as a knowledge technology that helps extract useful information from the data stores an organization or company accumulates during its operation.

1.1.2. What is data mining?

Data mining is defined as the process of distilling, or mining, knowledge from a large amount of data. A common analogy is extracting gold from rock and sand: data mining is likened to "panning sand for gold" in a large given collection of data. The term refers to finding a small, valuable set within a large volume of raw data. Several terms are used with a similar meaning: knowledge mining, knowledge extraction, data/pattern analysis, data archaeology, data dredging, and so on.

Definition: data mining is a set of techniques used to automatically explore and find the relationships among data in a huge, complex data set, and at the same time to find the latent patterns in that data set.

Data mining is one of the seven steps of the KDD (Knowledge Discovery in Databases) process. KDD is viewed as seven processes carried out in the following order:

1. Data cleaning (data cleaning & preprocessing): remove noise and unnecessary data.
2. Data integration: merge the data into data stores (data warehouses & data marts) after cleaning and preprocessing.
3. Data selection: select data from the data stores and convert it into a form suitable for knowledge discovery. This step also covers handling noisy data, incomplete data, etc.
4. Data transformation: convert the data into forms suitable for processing.
5. Data mining: one of the most important steps, in which intelligent methods are applied to extract data patterns.
6. Pattern evaluation (knowledge evaluation): assess the results found against some set of measures.
7. Knowledge presentation: use presentation and visualization techniques to show the results to the user.
Figure 1 - The steps in Data Mining & KDD

1.1.3. The main functions of data mining

Data mining is divided into the following main directions:

- Concept description: oriented toward describing, aggregating and summarizing concepts. Example: text summarization.
- Association rules: a fairly simple form of rule for representing knowledge. Example: among the 60% of men who enter a supermarket, if they buy beer then as many as 80% of them also buy dried beef. Association rules are widely applied in business, medicine, bioinformatics, finance and the stock market, etc.
- Classification and prediction: assign an object to one of a set of classes known in advance. Example: classifying geographic regions by weather data. This approach typically uses machine learning techniques such as decision trees and artificial neural networks. Classification is also called supervised learning.
- Clustering: group objects into clusters whose number and names are not known in advance. Clustering is also called unsupervised learning.
- Sequential/temporal pattern mining: similar to association rule mining, but with ordering and time taken into account. This approach is widely applied in finance and the stock market because of its strong predictive power.

1.1.4. Applications of data mining

Although data mining is a new approach, it has attracted a great deal of attention from researchers and developers thanks to its practical applications. Some typical applications are:

- Data analysis and decision support
- Medical treatment
- Text mining and Web mining
- Bioinformatics
- Finance and the stock market
- Insurance
- Pattern recognition
- etc.

1.2. HYPERTEXT AND FULLTEXT DATABASES

1.2.1. FullText databases

FullText data is a form of unstructured data in which the information consists only of text documents. Each document carries information about some topic, expressed through the content of all the words that make up the document. The meaning of each word in a document is not fixed; it depends on the context in which the word appears. The words in a document are connected to one another according to some language.
Among the kinds of data in use today, text is one of the most common. It is present everywhere and we encounter it constantly, so text-processing problems were posed long ago and remain open issues in text data mining. Notable problems include text retrieval, text classification, text clustering and text routing.

A full-text database is an unstructured database whose data consists of documents and the attributes of those documents. It is usually organized as a combination of two components: a conventional structured database (holding the properties of the documents) and the documents themselves.
Figure: a full-text database, consisting of a structured database that holds the properties of the documents, and the documents themselves.

The content of the documents is stored indirectly in the database, in the sense that the system only manages the address at which the content is stored.

Text databases can be divided into two kinds:

- Unstructured: the ordinary texts we read every day, expressed in natural human language and without any formatting structure. Example: a collection of books, journals and articles managed in an electronic library network.
- Semi-structured: texts organized with a loose structure, such as records with markup symbols, that still convey the main content of the text. Examples: HTML pages, email.

The split into two kinds is not clear-cut, however. Software systems often have to combine both, as in search engines or in text retrieval, one of the areas attracting the most attention today. Search systems such as Yahoo, Altavista and Google all organize data into groups and directories, and each group may contain many subgroups. Altavista also integrates an automatic translation program that can translate into many languages and gives fairly good results.

1.2.2. HyperText databases

According to the Oxford dictionary (Oxford English Dictionary Additions Series), hypertext is text that does not form a single continuous sequence and can be read in different orders; in particular, text and graphics that are linked to one another in such a way that the reader need not read them sequentially.

For example, when reading a book, the reader does not have to read every page from beginning to end but can jump ahead to later passages to look up the topics of interest. HyperText thus consists of non-sequential writing: it branches and lets readers choose how to read it.

In the ordinary sense, HyperText is a set of written pages connected by links, which allows readers to read them in different ways. We are already familiar with HTML pages, in which links point to other parts of the same page or to other pages, and the reader follows those links while reading.
HyperText is also a special form of text, so it can include continuous writing as well (the most common form of writing). Because HyperText is not constrained by sequence, new forms of presentation become possible, and a document can reflect its intended content better. Readers can also choose a way of reading that suits them, for example going deeper into a topic they care about.

The idea of creating a set of documents together with pointers to other documents, in order to link related documents, is a genuinely good and useful way to organize information. For writers, it removes worries about the order of presentation: they can break a topic into small parts and use links to show how the parts relate. For readers, it offers shortcuts through the information network and lets them decide which pieces of information are relevant enough to explore further. Compared with linear reading, HyperText gives a far more effective interface to the content.

From the standpoint of machine learning algorithms, HyperText offers the chance to look beyond a single document when classifying it, that is, to take into account the documents linked to it. Not every linked document is useful for classification, of course, especially since hyperlinks can point to many different kinds of document. There is nevertheless clear potential, still to be researched, in using the documents linked to a page to improve the accuracy of classifying that page.

Two HyperText concepts need attention:

- Hypertext Document: a single text document in a hypertext system. If the hypertext system is pictured as a graph, the documents correspond to the nodes.
- Hypertext Link: a reference that connects one HyperText document to another. Hyperlinks play the role of the edges in that graph.

HyperText is a common kind of data today and one for which demand for search and classification is very high. It is the prevailing kind of data on the Internet. A HyperText database holds semi-structured text because of the additional tags it contains: structural tags (title, introduction, body) and emphasis tags for text presentation (bold, italic, etc.).
Thanks to these tags we have an extra criterion, compared with full-text documents, for searching and classifying them. Based on the predefined tags, keywords can be given different priorities depending on where they appear. For example, when searching for documents about "people" we submit the keyword "people", and documents in which "people" appears in the title are considered closer to the request.
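The tag-based priority described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights in TAG_WEIGHTS, the class name KeywordScorer and the function keyword_score are hypothetical and do not come from the source; the parser is the standard library's html.parser.

```python
from html.parser import HTMLParser

# Hypothetical weights (an assumption, not from the source): a keyword in the
# title counts more than one in a heading, bold text, or ordinary body text.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5}
BODY_WEIGHT = 1.0


class KeywordScorer(HTMLParser):
    """Accumulate a tag-weighted count of one keyword over an HTML page."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.open_tags = []  # weighted tags that are currently open
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        words = (w.strip(".,;:!?") for w in data.lower().split())
        hits = sum(1 for w in words if w == self.keyword)
        if hits:
            # Use the strongest enclosing tag, or the body weight if none is open.
            weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=BODY_WEIGHT)
            self.score += hits * weight


def keyword_score(html, keyword):
    """Return the weighted number of occurrences of keyword in html."""
    scorer = KeywordScorer(keyword)
    scorer.feed(html)
    scorer.close()
    return scorer.score


page_a = "<html><head><title>People of Hanoi</title></head><body><p>City life.</p></body></html>"
page_b = "<html><head><title>City</title></head><body><p>Many people live here.</p></body></html>"
print(keyword_score(page_a, "people"), keyword_score(page_b, "people"))
```

With these illustrative weights, the page that carries the keyword in its title scores higher than the page that only mentions it in the body, which is the ordering the paragraph above describes.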
Comparing the characteristics of Fulltext data and Web page data

Although a Web page is a special form of FullText data, the two kinds of data differ in many ways. The observations below show the differences between Web data and FullText data. These differences in characteristics are the main reason the two kinds of data are mined differently (classification, search, etc.).

Figure: a diagram illustrating Hypertext Documents as nodes and Hypertext Links as the links between them.

The following comparison of the characteristics of Fulltext data and Web page data is presented in [2].

No. | Web page | Ordinary text (Fulltext)
1 | Semi-structured text. The content has a title section and tags that emphasize the meaning of words or phrases. | Usually unstructured text. The content offers no criterion on which an assessment can be based.
2 | The content is usually described briefly and concisely, with hyperlinks pointing the reader to other places with related content. | The content is usually very detailed and complete.
3 | The content contains hyperlinks that connect pages with related content to one another. | Ordinary text pages cannot link to the content of other pages.
1.3. TEXT MINING AND WEB MINING

As mentioned above, text mining and Web mining are among the important applications of data mining. This section looks at them in more depth.

1.3.1. Problems in text data mining

1. Text retrieval

a. Content

Text retrieval is the process of finding documents that match a user's request. Requests are expressed as queries, the simplest form of which is a set of keywords. A retrieval system can be pictured as sorting documents into two classes: one class of documents that satisfy the query and are returned, and one class of documents that do not satisfy it and are not shown. Real systems today do not work this way; instead they return a list of documents ordered by their importance with respect to the query. Typical examples are search engines such as Google and Altavista.

b. Process

Retrieval is divided into four main processes:

- Indexing: documents in raw form must be converted into some representation for processing. This is also called document representation; the representation must be structured and easy to process.
- Query formulation: the user must describe the information need in the form of a query. Queries must be expressed in a form common to retrieval systems, such as typing in the keywords to search for. Queries can also be formulated in natural language or by example, but these forms require more complex processing techniques. The great majority of current retrieval systems use keyword queries.
- Comparison: the system must compare the user's query clearly and completely against the documents stored in the database. It then decides which documents are closely relevant to the query and in what order to rank them. The system displays either the whole document or only part of it.
- Feedback: the results returned at first often do not satisfy the user, so a feedback process is needed that lets the user revise the request or enter a new one. The user can also interact with the system about the documents that do satisfy the request, and the system can update those documents. This is called relevance feedback.

Current search tools concentrate mainly on the first three processes; most still lack feedback or any handling of user-machine interaction.
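The first three processes above (indexing, query formulation, comparison) can be illustrated with a minimal sketch. It assumes the simplest possible representation, a bag of lowercase words, and scores a document by how many times the query keywords occur in it; the function names and the sample documents are hypothetical, and real systems use considerably richer scoring.

```python
from collections import Counter


def index_document(text):
    """Indexing: represent raw text as a bag of lowercase terms."""
    return Counter(text.lower().split())


def rank(query, documents):
    """Comparison: order matching documents by how often they contain the query terms."""
    q_terms = query.lower().split()  # query formulation: a list of keywords
    index = {doc_id: index_document(text) for doc_id, text in documents.items()}
    scores = {doc_id: sum(bag[t] for t in q_terms) for doc_id, bag in index.items()}
    # Drop non-matching documents; sort by descending score, then by document id.
    hits = [(doc_id, s) for doc_id, s in scores.items() if s > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "d1": "web mining and text mining",
    "d2": "text retrieval",
    "d3": "stock market",
}
print(rank("text mining", docs))
```

The result is a ranked list rather than a yes/no split, matching the observation that practical systems order documents by importance instead of sorting them into two classes.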
Feedback is now being studied widely, and within human-machine interface interaction a research direction known as the interface agent has emerged.

2. Text classification (Text Categorization)

a. Content

Text classification is the process of assigning documents to one or more classes defined in advance. Documents can be classified manually, by reading each one and assigning it to a class, but for a large number of documents this takes so much time and effort that it is not feasible. Automatic classification methods are therefore needed. They rely on machine learning methods from artificial intelligence (decision trees, Bayes, k nearest neighbors).

One of the most important applications of text classification is in text retrieval. From a classified data set, documents are indexed under their corresponding classes, and users can specify in their queries the topic or class of documents they want to search.

Another application is in text understanding. Classification can be used to filter documents, or parts of documents, that contain the data sought, without losing the complexity of natural language.

In text classification, membership of a class can be a true/false value (the document belongs to the class or does not) or a degree of membership (the document belongs to the class to some extent). When there are many classes, true/false classification means checking whether a document belongs to exactly one particular class or not.

b. Process

Text classification follows these steps:

- Indexing: the same as indexing in text retrieval. Here indexing speed matters, because some new documents may need to be processed in real time.
- Determining the class: as in text retrieval, classification requires a way of expressing how to decide which class a document belongs to, based on its representation. In a classification system this component is called the classifier (categorizer). It plays the role that queries play in a retrieval system, but whereas queries are transient, the classifier is used in a stable, long-term way.
- Comparison: in most classifiers, each document must be given a true/false assignment to some class.
The biggest difference from comparison in a text retrieval system is that each document is compared with a number of classes only once, and choosing the right decision also depends on the relationships among the document classes.

- Feedback (or adaptation): feedback plays two roles in a text classification system. First, classification requires a large number of documents that have been labeled by hand beforehand; these serve as training samples for building the classifier. Second, requirements cannot be changed as easily as in the feedback process of text retrieval; instead, users can inform the system maintainers that particular classes should be deleted, added or changed.

3. Other problems

Besides the two problems above, there are also the following:

- Text summarization
- Text clustering
- Term clustering
- Term classification
- Latent term indexing
- Text routing

In all of these text-processing problems, document representation plays a very large role, especially in retrieval, classification, clustering and routing.

1.3.2. Web data mining

a. Needs

The rapid development of the Internet and intranets has produced a huge volume of hypertext data (Web data). As the content and number of Web pages change and grow by the day and by the hour, finding information becomes ever harder for users. The need to search for information in an unstructured database has developed largely alongside the Internet itself, which has made people familiar with Web pages and their countless pieces of information.

In recent years the Internet has become one of the main channels for science, economic information, commerce and advertising. One reason for this growth is the low cost of publishing a Web page. Compared with other services, such as selling or advertising in a newspaper or magazine, a Web page costs far less and reaches millions of users around the world with much faster updates. The Web can be likened to an encyclopedia. Information on Web pages is diverse in both content and form. The Internet is like a virtual society: it covers every aspect of economic and social life, presented as text, images, sound, and so on.
This diversity and sheer volume of information, however, has created the problem of information overload. People cannot find for themselves the addresses of the Web pages that hold the information they need, so a utility is required that manages the content of Web pages and finds the addresses of pages whose content matches the searcher's request. Such utilities manage the data as unstructured objects. Familiar examples are Yahoo, Google and Altavista.

On the other hand, suppose we run Web pages on informatics, sports, socio-economics, construction and so on. Based on the content of the documents that customers view or download, classification tells us which content on our site customers concentrate on, so we can add more material on the topics they care about, and less on the others. On the customer side, the same analysis shows what each customer focuses on, so additional support can be offered to that customer.

Given these practical needs, Web page classification and search remain interesting problems that call for further research.

b. Difficulties

The World Wide Web acts as a huge, widely distributed central service providing information on every field of science, society, commerce and culture. The Web is a rich resource for data mining, but the following observations show that it also poses great challenges for data mining technology.

1. The Web seems too large to be organized into a data warehouse for data mining. Traditional databases are not very large and are usually stored in one place, whereas the Web is enormous, running to terabytes, changes continuously, and is distributed across a great many computers around the world. Studies of the size of the Web give figures such as the following: there are more than one billion Web pages available to users on the Internet; assuming an average page size of 5-10 KB, the total is on the order of 10 terabytes. The growth rate is equally striking: the number of Web pages doubled in the last two years and is expected to keep growing over the next two. Many organizations and communities put most of their public information on the Web. Building a data warehouse to store, copy or integrate the data on the Web is therefore close to impossible.

2. Web pages are far more complex than traditional text documents. Data in traditional databases is usually homogeneous (in language, format, etc.), whereas Web data is entirely heterogeneous. Web data covers many different languages (both the languages expressing the content and programming languages), many formats (text, HTML, PDF, images, sound, etc.) and many kinds of vocabulary (email addresses, links, zip codes, phone numbers). In other words, Web pages lack a uniform structure. The Web is regarded as a vast digital library, but its enormous number of documents is not arranged according to any particular standard: not by category, title, author, page count or content. This is a major challenge for finding the information one needs in such a library.

3. The Web is a highly dynamic information source. The Web changes not only in size; the information inside the pages is also updated continuously. According to one study of more than 500,000 Web pages over more than four months, 23% of the pages changed daily, and after a little more than 10 days 50% of the pages in the domain studied had disappeared, meaning their URLs no longer existed. News sites, stock markets, advertising companies and Web service centers update their pages regularly. Link information and access records are updated as well.

4. The Web serves a broad and diverse community of users. The Internet currently connects about 50 million workstations, and the user community is still expanding rapidly. Each user has different knowledge, concerns and interests. Most users, however, do not have a good knowledge of the structure of the information network or are not methodical about searching; they easily get "lost" while "groping" in the "darkness" of the network, or grow bored when a search returns only fragments of information that are not very useful.

5. Only a very small part of the information on the Web is truly useful. According to statistics, 99% of Web information is useless to 99% of Web users, yet the parts of the Web a user does not care about are still mixed into the search results. So how should the Web be mined in order to obtain the highest-quality pages by the user's own criteria?
We can thus see the differences between searching a traditional database and searching the Internet. These challenges have spurred research into mining and exploiting the resources on the Internet.

c. Advantages

Alongside these challenges, the Web also offers some advantages for Web mining.

1. The Web consists not only of pages but also of hyperlinks pointing from one page to another. When an author creates a hyperlink from his or her page to a page A, it means that A is useful for the topic under discussion. The more hyperlinks point to page A from other pages, the more important A is. The large volume of link information therefore provides rich information about the relevance, quality and structure of Web content, and is a major resource for Web mining.

2. A Web server usually records a log entry (Weblog entry) for every access to a Web page. The entry includes the URL, the IP address and a timestamp. Weblog data provides rich information about dynamic Web pages. With the URL, IP address and similar information, a multidimensional view can be built on top of the Weblog database. Multidimensional OLAP analysis can then report the top N users, the top N most-accessed Web pages, the time periods with the most accesses, and Web access trends.

d. The contents of Web mining

Following the analysis above of the characteristics and content of HyperText documents, Web data mining focuses on the components found in a Web page:

1. Web Content Mining. Mining the content of Web pages has two parts:
   a. Web Page Content: only the words in the text are used, without considering the links between documents. This is text mining.
   b. Search Result: after a search engine has found the Web pages that satisfy the user's request, there is another equally important task: ordering the results by how close they are to the content sought. This too is Web content mining.
2. Web Structure Mining: mining based on the hyperlinks between related documents.
3. Web Usage Mining:
   a. General Access Pattern Tracking: analyze the Web logs to discover users' access patterns on the site.
   b. Customized Usage Tracking: analyze each user's access patterns at each point in time to learn the Web access trends of each group of users at different times.
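The Weblog analysis mentioned above (top N users, top N pages, busiest time periods) can be sketched with plain counters. The log entries below are invented for illustration, and the function name top_n is hypothetical; a real deployment would read the server's actual log format and would likely use an OLAP tool rather than in-memory counting.

```python
from collections import Counter
from datetime import datetime

# Hypothetical Weblog entries (invented data): (URL, client IP, ISO timestamp).
LOG = [
    ("/sports", "10.0.0.1", "2004-05-01T09:15:00"),
    ("/sports", "10.0.0.2", "2004-05-01T09:40:00"),
    ("/economy", "10.0.0.1", "2004-05-01T10:05:00"),
    ("/sports", "10.0.0.1", "2004-05-01T21:30:00"),
    ("/culture", "10.0.0.3", "2004-05-01T09:55:00"),
]


def top_n(entries, n):
    """Return the n most-accessed pages, most active users, and busiest hours."""
    pages = Counter(url for url, _, _ in entries)
    users = Counter(ip for _, ip, _ in entries)
    hours = Counter(datetime.fromisoformat(ts).hour for _, _, ts in entries)
    return pages.most_common(n), users.most_common(n), hours.most_common(n)


top_pages, top_users, top_hours = top_n(LOG, 1)
print(top_pages, top_users, top_hours)
```

Each of the three counters corresponds to one dimension of the log (page, user, time); grouping by two of them at once would give the kind of multidimensional view the text refers to.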
Figure: The contents of Web mining. Web Mining branches into Web Content (Web Page Content, Search Result), Web Structure, and Web Usage (General Access Pattern, Customized Usage).

Chapter 2. SEARCH ENGINES

2.1. NEEDS

As mentioned in the previous part, the Internet is like a virtual society: it covers every aspect of economic and social life, presented as text, images, sound, and so on. Information on Web pages is diverse in both content and form. This diversity and volume, however, has created the problem of information overload. For any one user only a very small part of the information is useful; some people, for instance, care only about sports or culture pages and seldom about economics. People cannot find for themselves the addresses of the Web pages that hold the information they need, so a utility is required that manages the content of Web pages and finds the addresses of pages whose content matches the searcher's request. Familiar examples are Yahoo, Google and Altavista.

Search engines are systems built to accept users' search requests (usually a set of keywords), then analyze them, search an existing database, and return Web pages as results. Concretely, the user sends a query, in its simplest form a list of keywords, and the search engine returns a list of Web pages that are relevant to or contain those keywords. In more complex cases the query is a whole document, a passage of text, or a summary of a document.

2.2. STRUCTURE AND OPERATING MECHANISM

2.2.1. Overview of current search systems

We take the Google search system as a concrete example. This section gives an overview of how Google works. Later sections discuss the main applications (crawling, indexing, searching) and the data structures not covered here.

Most of Google is implemented in C and C++ and runs well on Solaris or Linux.

In Google, Web crawling (downloading Web pages) is carried out by several distributed Web crawlers. A URL server sends lists of URLs to be fetched to the crawlers. The fetched Web pages are sent to the storage server, which compresses them and stores them in the Repository. Every Web page has an associated ID number called a DocID. The indexing function is performed by the Indexer and the Sorter.
The Indexer performs the following functions: it reads the Repository, decompresses the documents and parses them. Each document is converted into a set of word occurrences called Hits. A Hit records the word, its position, an approximation of the font size, and capitalization. The Indexer distributes the Hits into a set of "Barrels". The Indexer also performs another important function: it parses all the hyperlinks in every page and stores the important information about them in a separate file. This file holds enough information to determine where each link points from and to, together with the text of the link.

To summarize:
- The Crawler downloads Web pages and stores them in the repository.
- The Indexer reads the repository, decompresses and parses the documents, encodes them as Hits, and arranges the Hits into "Barrels". It also parses all hyperlinks and stores them in a file.

2.2.2. Structure of search systems

Current search engines are usually organized into three modules:

- Indexing module: discovers Web pages on the Internet, analyzes them, and stores them in the database.
- Searching module: queries the databases to return the list of documents that satisfy a user request (given as a query consisting of a set of keywords).
- User interface module: takes the results from the searching module.

The following sections look at each module and its tasks in detail.
Figure 2.3 - Architecture model of the Google search engine

a. Indexing module

The indexing module performs the following tasks:

1. Parse the documents and index all the keywords in them (number of occurrences, positions of occurrence).
2. Build the link graph among the hypertext documents (forward links and back links).
3. Compute the PageRank importance of all documents from the hypertext link structure (Google).

Each task is examined in detail below.

a.1. Web Crawler (following hyperlinks across the Web)

Most search engines depend on programs called crawlers, which supply the data (the Web pages) the search engine works on. Crawlers are small programs belonging to the search engine that browse the Web, much as a person does when following links from one Web page to another.

Crawlers are given a set of starting URLs. They parse the links found in those pages and pass the information to the crawler control component. This component decides which link to visit next and sends the decision back to the crawler (in some search engines the crawler performs this function itself). Crawlers also pass the pages they find into the page repository, and keep visiting other Web pages on the Internet until resources are exhausted.

In short, the crawler module retrieves pages from the network and downloads them; the pages are then indexed by the indexing module and pushed into the database. The process repeats until the crawler is told to stop, with the control component deciding which Web page to visit next.

A standard search engine has to address two main problems in the crawler module:

- The number of Web pages is very large, so the crawler cannot download them all and selects only the "important" ones. Which pages count as important, and how is importance computed?
- Because the content of Web pages changes continuously, the crawler must regularly revisit the pages it has downloaded to pick up the changes. Moreover, pages change at different rates, so the crawler must consider carefully which pages to revisit and which to skip.

Problem 1: Importance

Given a Web page P, its importance can be computed in the following ways:

1. Given a query Q, the importance of P is defined as the "textual similarity" between P and Q. Q and P are each represented by an n-dimensional vector $v = (w_1, w_2, \ldots, w_n)$, where $w_i$ corresponds to the i-th word in the vocabulary; for example, $w_i$ can be the number of occurrences of the i-th word. The similarity between P and Q is the cosine of the angle between the two vectors. Call the importance obtained this way IS(P).
2. A page that many other pages link to is more important, so another measure of the importance of P is the number of links pointing to P. Call the importance obtained this way IB(P).
3. Importance can be computed from the URL itself: a page whose address ends in ".com" or contains the word "home" is considered more important. Call the importance obtained this way IL(P).
4. A further method is to count the number of times users access the page during some period of time. Call the importance obtained this way IU(P).

The final importance of page P is a combination of the measures above in some proportion:

$IC(P) = k_1 \cdot IS(P) + k_2 \cdot IB(P) + k_3 \cdot IL(P) + k_4 \cdot IU(P)$

where $k_1, k_2, k_3, k_4$ and the query Q are given.

Problem 2: Refreshing downloaded pages

There are two strategies for refreshing downloaded pages:

1. Uniform periodic refresh: the crawler revisits all pages at the same frequency f, regardless of how often they change. All pages are treated equally however they change. "Regular refresh" here means, for example, that after every 10,000 downloaded pages the PageRank values and the index of words in URLs are recomputed.
2. Proportional refresh: the more a page changes, the more often it is refreshed. Example: pages $e_1, e_2, \ldots, e_n$ change $k_1, k_2, \ldots, k_n$ times respectively, and each page is revisited at a rate proportional to its change count.

a.2. Indexing

The indexer module examines all the words in each Web page stored in the page repository and records the URLs of the pages that contain each word. The result is a very large index table, with which the system can supply the URLs of all matching pages on request.

The indexer module and the collection analysis module in Figure 1 build various indexes over the downloaded Web pages. The indexer module builds two basic indexes: a text (content) index and a structure (link) index. Using these two indexes and the pages in the repository, the collection analysis module builds a number of other useful indexes.
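Returning to the page-importance measures defined for the crawler in a.1, the combined score IC(P) can be sketched as follows. The weights in the default tuple k, the binary form of the URL score and all function names are assumptions made for illustration; the source only states that the four measures are combined in some proportion.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """IS: cosine of the angle between the word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def url_score(url):
    """IL: 1.0 if the URL ends in '.com' or contains 'home', else 0.0 (assumed form)."""
    return 1.0 if url.endswith(".com") or "home" in url else 0.0


def combined_importance(page_text, query, backlinks, url, visits,
                        k=(1.0, 0.1, 0.5, 0.01)):
    """IC(P) = k1*IS(P) + k2*IB(P) + k3*IL(P) + k4*IU(P), with illustrative weights."""
    k1, k2, k3, k4 = k
    return (k1 * cosine_similarity(page_text, query)
            + k2 * backlinks          # IB: number of links pointing to the page
            + k3 * url_score(url)     # IL: score derived from the URL
            + k4 * visits)            # IU: number of user visits in some period


score = combined_importance("web mining", "web mining", backlinks=10,
                            url="http://example.com", visits=100)
print(round(score, 6))
```

In practice the four components live on very different scales (a cosine is at most 1, while backlink and visit counts are unbounded), so the weights, or a normalization step, largely determine which measure dominates.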
Below we briefly describe a few kinds of index, focusing on their structure and how they are used.

Link index

To build the link index, the crawled portion of the Web is modeled as a graph with nodes and edges, in which the nodes are Web pages and the edges between nodes are the links between pages. The index is built from the nodes and edges of this graph.
Figure 1.2 - Graph illustrating the nodes (Hypertext documents) and edges (links) in a collection of Hypertext documents

The structural information most commonly used by search algorithms in retrieval systems is the information taken from linked pages; the links provide an efficient way to reach neighboring information. Small graphs with hundreds or even thousands of nodes can be represented with almost any data structure, but doing the same for a graph with millions of nodes is a major challenge.

Text index

Although link-based techniques are used to improve the quality and relevance of search results, term-based retrieval (searching for pages that contain the keywords) is still the main method of identifying the Web pages relevant to a query. Indexing to support term-based queries can be implemented with any traditional access method that searches over the whole content of the documents. Search engines use an inverted index to represent the documents. The inverted index is the traditional choice of index structure for Web pages.

For example, suppose we have the following three documents:

document 1: computer science
document 2: computer is about live
document 3: to live or not to live

The index file is built as follows:
- Take all the words that appear in the three documents.
- Store them in alphabetical order.
- Store the information about the documents (document ID, URL, title, short description, etc.).

The result is an inverted index file, a list of the following entries:

Term | Doc ID | Position | URL | Title | Description
about | 2 | 3 | ... | ... | ...
computer | 1 | 1 | ... | ... | ...
computer | 2 | 1 | ... | ... | ...
is | 2 | 2 | ... | ... | ...
live | 2 | 4 | ... | ... | ...
live | 3 | 2 | ... | ... | ...
live | 3 | 6 | ... | ... | ...
not | 3 | 4 | ... | ... | ...
or | 3 | 3 | ... | ... | ...
science | 1 | 2 | ... | ... | ...
to | 3 | 1 | ... | ... | ...
to | 3 | 5 | ... | ... | ...

A search algorithm usually also uses information about how a term occurs in the Web page, for example whether the term is in bold (inside a <B> tag) or in a heading (inside <H1> or <H2> tags). To incorporate this information, an extra field called the payload is added; it encodes the additional information about the occurrences of terms in the document. This information is later used by the ranking algorithm.
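The index construction described above can be reproduced with a short sketch. It uses the three example documents from the text and stores, for each term, the list of (document ID, position) pairs; the function names are hypothetical, and the URL, title, description and payload fields are left out for brevity.

```python
from collections import defaultdict

# The three example documents from the text, keyed by document ID.
DOCS = {
    1: "computer science",
    2: "computer is about live",
    3: "to live or not to live",
}


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, position) pairs, terms in alphabetical order."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Positions are 1-based, as in the table above.
        for position, term in enumerate(docs[doc_id].lower().split(), start=1):
            index[term].append((doc_id, position))
    return dict(sorted(index.items()))


def lookup(index, term):
    """Return the sorted IDs of the documents that contain the term."""
    return sorted({doc_id for doc_id, _ in index.get(term.lower(), [])})


index = build_inverted_index(DOCS)
for term, postings in index.items():
    print(term, postings)
print(lookup(index, "live"))
```

Answering a keyword query then needs only a dictionary lookup per term instead of a scan over every document, which is the reason this structure is the usual choice for term-based retrieval.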
Inverted index

The inverted index is stored as a database file of records. Building a database to hold the inverted index for a data set as large as the Web pages on the Internet requires a highly flexible distributed architecture. In the Web setting there are two basic strategies for partitioning the inverted index across a set of nodes so that it can be stored in a distributed way.

The first is the local inverted file (IFL). In an IFL organization, each node stores the inverted lists for a distinct small subset of the Web pages held in the page repository. When a search request arrives, the search query component broadcasts it to all nodes, and each node returns its own list of the pages that contain the words sought.

The second is the global inverted file (GFL). In a GFL organization, the inverted index is partitioned by term, so each query server stores the inverted lists for a small subset of the terms in the data set. For example, in a system with two query servers A and B, A stores the inverted lists for all words beginning with the letters a to o, and B stores the lists for the remaining words, from p to z. When the search query component wants the pages that contain the word "people", it only has to ask server B.

Main data structures

The indexer module takes the pages downloaded by the crawler and held in the repository, indexes them, and stores the result in the database. The database is created during indexing. The following is the main database structure in most search engines:

a. A keyword file made up of records, each with at least two fields: keyword ID and keyword (figure a). The keywords are established during indexing: read the document file, extract a keyword, and check whether it is already in the keyword file. If it is not, create a new record in the keyword file, which assigns it a keyword ID. If it is, take the existing ID. The ID obtained is then used to create the next record.

b. A file of the documents managed by the system, made up of records, one per document, each with at least these fields: document ID, document name (URL), and the address on the system's machine where the document file is held (the cache of the Web page) (figure b).

c. A file recording the occurrences of keywords in documents, made up of records with three fields each: document ID, keyword ID, and the position at which the keyword occurs in the document (figure c). This is the inverted index file itself.

Database organization: a hash structure keyed on the vocabulary terms.

Challenges

- Building an inverted index file involves preprocessing the pages into small pieces, sorting them by term and position, and finally writing out the sorted pieces as a set of inverted lists. The time allowed for building the index is not especially strict, but when working with a collection of Web pages some index files become hard to manage, demand large resources (such as memory), and often take a long time to complete. A comparison with traditional retrieval systems shows that, for the system under study, a repository of 40 million Web pages represents only about 4% of all indexable Web pages, yet is already larger than the 100 GB TREC-7 collection used as a standard retrieval benchmark.
- Because the content of Web pages changes quickly, rebuilding the index files is essential to keeping the pages fresh. Part of the crawler's job is to refresh the downloaded pages, and the index files are rebuilt in parallel with that work.
- Finally, the storage format of the inverted index file must be designed carefully. A compressed index file can improve query performance, because more of the index fits in memory; the drawback is the time spent on decompression.

a.3. Computing PageRank

Search systems have two important features that help them produce highly accurate results. First, they use the link structure of the Web to compute an importance value for each Web page (PageRank). Second, they use the links to rank the results. It is the graph of links among Web pages that allows PageRank to be computed quickly.

PageRank is defined as follows. Suppose page A has pages $T_1, T_2, \ldots, T_n$ pointing to it. The parameter d is a damping factor with a value between 0 and 1; it is usually set to d = 0.85. C(A) is the number of links going out of page A. The PageRank of A is then:

$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$
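The PageRank of a page represents a probability distribution over the pages of a given collection, so the PageRank values of all pages in the collection are intended to sum to 1. (Strictly, the sum equals 1 when the first term of the formula is (1 - d)/N for a collection of N pages; with the term (1 - d) as written, the values sum to N, provided every page has at least one outgoing link.)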
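The iterative computation of the formula can be sketched as follows. This is an illustrative toy, assuming a link graph small enough to fit in a dictionary; the function name `pagerank`, the graph `toy_web`, and the choice to let pages without out-links pass on no rank are assumptions of the sketch, not part of the definition above.

```python
def pagerank(out_links, d=0.85, iterations=20):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A.

    out_links maps every page to the list of pages it links to; every
    target is assumed to be a key of out_links as well.
    """
    pages = list(out_links)
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {page: 1.0 - d for page in pages}
        for source, targets in out_links.items():
            if not targets:
                continue  # a page without out-links passes on nothing in this sketch
            share = d * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

In the toy graph, page C is linked from both A and B and ends up with a higher rank than B, which is linked only from A.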
[Figure: pages V1, V2, ..., Vm each link to page U; each page Vi passes R(Vi)/N(Vi) of its rank on to U]
Figure 2.2

The computation is repeated until it converges. With d = 0.85, about 20 iterations are needed for a few million pages. Computing PageRank for 26 million Web pages on a medium-sized workstation takes a few hours.

2.3. DISADVANTAGES OF SEARCH ENGINES

1. They are fully automatic systems. The user plays no role in the search process, and there is no feedback mechanism through which users could update the search parameters to make later searches more effective.

2. They treat all keywords as equally important, so keywords cannot be given different weights. In large search engines such as Google or Yahoo, the query "System Information" makes the system look for all Web pages related to the two words System and Information. If a user wants to search for "Computer Story" with the word Computer mattering more than the word Story (for example, weight 0.8 for Computer and 0.2 for Story), a search system that supports this has to be built.

3. They pay no attention to the nature of text processing, in particular to synonyms and words with several meanings. Many documents are relevant to the content being sought but do not contain the query keywords, only synonyms of them, and such documents are missed by the search. Because most engines search by keyword on top of an index of the Web pages (index-based search engines), hundreds of documents may contain the same query keyword. The engine then returns a large number of documents, many of which have little or nothing to do with the content being sought.

2.4. A NEW SEARCH PROBLEM

Every day billions of people access the Internet, and as many run searches on various search engines. If the information from each of these searches were collected, it would form an enormous source of data, and knowing how to use it would make many useful things possible. The search task in ordinary search engines simply satisfies the user's information need and does not yet exploit what can be learned from the user in each search. Below is a proposed problem that would add to the capabilities of search engines, together with a direction for solving it in the future.

Problem: Based on the documents that users view or download, analysis tells us which content users concentrate on within our collection of Web pages. We can then add more documents of the kind users care about, and conversely.
On the user side, the analysis also tells us which topics a user tends to focus on, so that the user can be given further support.

Approach: Build a database of documents with a ClassificationID field that states which domain each document belongs to, based on an earlier analysis (by classification). Build a database about the users: before accessing the database, a user is asked to register an account with name, age, address and so on. We also add two important fields, occupation and education level (assume the accuracy of this information is c%). Registering an account is optional for the user. Each time a user then accesses the database, the documents accessed are recorded in the user information table. Finally, from the information about the accessed documents and about the users, a decision tree algorithm is used to generate rules that state which domain users with a given occupation and education level are interested in, with confidence threshold c.

2.5. CONCLUSION

Chapter 3. THE CLASSIFICATION PROBLEM

3.1. PROBLEM STATEMENT

People naturally tend to divide things into parts and classes. In the same way, a classification algorithm is simply a mapping from an existing database to a specific set of values, based on one attribute or a set of attributes of the data.
Researchers generally define text classification as the assignment of predefined topics to text documents on the basis of their content. Text classification supports information retrieval, information extraction, text filtering, and the automatic routing of documents to predefined topics. Documents are classified with supervised machine learning. The data set is split into a training set and a test set: a model is first built from the training examples, and its accuracy is then checked on the test set.

The figure below gives a framework for text classification with three main stages. The first stage is document representation: the text data is converted into some structured form and the given examples are gathered into a training set. The second stage applies machine learning techniques to the training examples just represented, so the representation produced in the first stage is the input of the second. The third stage adds extra knowledge supplied by the user to improve the accuracy of the document representation or of the learning process. Many machine learning methods can be used in the second stage: Bayesian network models, decision trees, k-nearest neighbours, neural networks, SVM, and others.

[Figure: input data -> classification algorithm -> Class 1, Class 2, ..., Class n]
3.2. DOCUMENT REPRESENTATION METHODS

3.2.1. Document representation methods in full-text databases

There are three typical full-text database models: the logic model, the syntactic model, and the vector model.

a. The syntactic analysis model

a.1. Storage rules:
- Every document is parsed, and the parse returns detailed information about the topic of the document.
- The topics of each document are then indexed. Indexing a topic works like indexing a document, except that only the words that occur in the topic are indexed.
- Documents are managed through these topics so that they can be found on request; search queries are based on the topics.

a.2. Search rules:
A search query relies on the indexed topics, so the topics must be indexed first. Indexing a topic means indexing all the words in that topic. An incoming query can itself be parsed to yield a topic, and the search is then carried out on that topic.
The main processing component of a database built on this model is therefore the system that parses documents and recognises their content.

a.3. Advantages and disadvantages

Advantages
Once the topics are available, searching by this method is efficient and simple, because it is fast and accurate. For languages with a simple grammar, the analysis can reach a high and acceptable level of accuracy.

Disadvantages
The quality of a system based on this method depends entirely on the quality of the system that parses documents and recognises their content. In practice such a system is very complex to build, depends on the characteristics of each language, and in most cases does not yet reach high accuracy.

b. The logic model

In this model the meaningful words of a document are indexed, and the content of the document is managed through these index entries.

b.1. Storage rules
- Every document is indexed as follows: collect the meaningful words in the documents, that is, the words that carry the main information about the stored documents. Index the incoming documents against this keyword list. For every keyword in the list, store the positions at which it occurs in each document and the name of each document that contains it.

For example, take two documents with IDs VB1 and VB2.
"Cộng hòa xã hội chủ nghĩa Việt Nam" (VB1)
"Việt Nam dân chủ cộng hòa" (VB2)

They are represented as follows:

Term    Doc ID (position of occurrence)
Cộng    VB1(1), VB2(5)
hòa     VB1(2), VB2(6)
xã      VB1(3)
hội     VB1(4)
chủ     VB1(5), VB2(4)
nghĩa   VB1(6)
Việt    VB1(7), VB2(1)
Nam     VB1(8), VB2(2)
dân     VB2(3)

b.2. Search rules:
A search query is given in logical form, that is, as a set of operators (AND, OR, ...) applied to words or phrases. The search uses the index table that has been built, and the result is the set of documents that satisfy all the conditions.

b.3. Advantages and disadvantages

Advantages
- Searching is fast and simple. Suppose we search for the word "computer". The system scans the index table to reach the corresponding index entry, if "computer" exists in the system. Because the index table is kept in alphabetical order, a lookup by binary search takes $O(\log_2 n)$ comparisons, where n is the number of words in the index table. The index entry then gives the documents that contain the word. A query involving k words therefore needs about $k \log_2 n$ comparisons.
- Queries are fast and flexible. Special characters can be used in a query without affecting the complexity of the search. For example, searching for "ta%" returns the documents that contain "ta", "tao", "tay", ..., that is, the words beginning with "ta". The character % is called a wildcard character. In addition, logical operators let the search words be combined into flexible queries. For example, to search for [tôi, ta, tao], the brackets [] express a search for any one of the words in the group. This is simply a flexible notation for the OR operator of Boolean algebra, instead of writing: find the documents that contain "tôi" or "ta" or "tao".

Disadvantages
- The searcher needs some expertise in the field being searched. Because the query is in logical form, the result also has a logical (Boolean) value: a document is returned only if it satisfies every condition given. To find a document by content, the searcher must therefore know the document quite precisely.
- Indexing the documents is time-consuming and complex.
- The index tables take up storage space.
- The documents found are not ordered by relevance.
- The index tables are inflexible. When the vocabulary changes (words added, deleted, ...), the index entries must change too.

c. The vector space model

c.1. Storage rules

A typical way of representing documents in general is to use a vector space. In this representation every document is represented by a vector. Each component of the vector corresponds to a distinct term of the original document collection (corpus) and is assigned a value given by a function f that measures the density of the term in the document.

Documents can be represented with single words as terms and with f counting the number of occurrences of each word. This is called the bag-of-words representation. For example, a document vb1 is represented by a vector $V(v_1, v_2, \dots, v_n)$, where $v_i$ is the number of occurrences of the i-th keyword ($t_i$) in vb1. Consider the following two documents:
vb1: Computer is not only computer
vb2: Computer is life

Word      Vector for vb1  Vector for vb2
computer  2               1
is        1               1
life      0               1
not       1               0
only      1               0

There are many criteria for choosing the function f, and they produce different weight values. Some of them are given below.

Boolean model

Suppose a database contains m documents $D = \{d_1, d_2, \dots, d_m\}$. Each document is represented as a vector over n terms $T = \{t_1, t_2, \dots, t_n\}$. Let $W = (w_{ij})$ be the weight matrix, where $w_{ij}$ is the value of term $t_i$ in document $d_j$. The Boolean model is the simplest model and is defined as follows:

$$w_{ij} = \begin{cases} 0 & \text{if } t_i \text{ does not occur in } d_j \\ 1 & \text{otherwise} \end{cases}$$
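The count-based (bag-of-words) and Boolean weightings can be sketched as follows. This is a small illustration under simple assumptions: words are split on whitespace and lowercased, the vocabulary is sorted alphabetically, and the function name `term_vectors` is hypothetical.

```python
from collections import Counter

def term_vectors(documents):
    """Return (vocabulary, count_vectors, boolean_vectors) for a list of texts."""
    tokenized = [text.lower().split() for text in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    count_vectors = []
    for words in tokenized:
        counts = Counter(words)
        # One component per vocabulary word: its number of occurrences.
        count_vectors.append([counts[word] for word in vocabulary])
    # Boolean model: 1 if the term occurs in the document, 0 otherwise.
    boolean_vectors = [[1 if c > 0 else 0 for c in vec] for vec in count_vectors]
    return vocabulary, count_vectors, boolean_vectors

docs = ["Computer is not only computer", "Computer is life"]
vocabulary, counts, booleans = term_vectors(docs)
print(vocabulary)
print(counts)
print(booleans)
```

The only difference between the two weightings on these documents is the entry for "computer" in the first document, which is 2 as a count and 1 as a Boolean value.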
For example, take the following two documents:

vb1: Computer is not only computer
vb2: Computer is life

Word      Vector for vb1  Vector for vb2
computer  1               1
is        1               1
life      0               1
not       1               0
only      1               0

Frequency model

The frequency model sets the entries of the matrix $W = (w_{ij})$ to positive numbers based on the frequency of the terms in a document or on the frequency of documents in the database. There are three common methods.

Term frequency method (TF)

The value of a term is computed from the number of times the term occurs in the document. Let $tf_{ij}$ be the number of occurrences of term $t_i$ in document $d_j$. Then $w_{ij}$ is given by one of:

$$w_{ij} = tf_{ij}, \qquad w_{ij} = 1 + \log(tf_{ij}), \qquad w_{ij} = \sqrt{tf_{ij}}$$

Inverse document frequency method (IDF)

The value of a term is computed as follows, where m is the number of documents and $df_i$ is the number of documents that contain term $t_i$:

$$w_{ij} = \log\frac{m}{df_i} = \log(m) - \log(df_i)$$

TF.IDF method

This method combines TF and IDF. The weight matrix is computed as follows:

$$w_{ij} = \begin{cases} \left[1 + \log(tf_{ij})\right]\log\dfrac{m}{df_i} & \text{if } tf_{ij} \ge 1 \\ 0 & \text{if } tf_{ij} = 0 \end{cases}$$

c.2. Search rules

Incoming queries are mapped to a vector $Q(q_1, q_2, \dots, q_n)$.
The coefficients of the vocabulary words in this vector differ: the more significant a word is for the content being sought, the larger its coefficient. $q_i = 0$ when the word is not in the list of search words. $q_i \ne 0$ when the word is in the list of search words, and the larger $q_i$ is, the more relevant the word is to the document content. In other words, the system gives priority to documents that contain search words with high coefficients.

Example: if the word "Machine" matters more than the word "Computer" in the content being sought, we can set $q_k = 2$ and $q_h = 1$ in the vector Q, with $t_k$ = "Machine" and $t_h$ = "Computer".

Given a vocabulary, we can thus determine a vector for each document, and each incoming query has a corresponding vector whose coefficients are fixed in advance. Searching and management are carried out on these vectors. Representing both documents and queries as vectors leads to the following way of storing and searching full-text documents:

1. Each document is encoded as a vector.
2. The documents are classified according to these vectors.
3. Each incoming query is also encoded as a vector.

The search is carried out by multiplying the query vector in turn with the vector of each document. The result is every document that is relevant to the query.

c.3. Advantages and disadvantages

Advantages
- The returned documents can be ordered by their relevance to the requested content, because the comparison yields a relevance score for every document.
- Queries are easy to formulate and do not require the searcher to be an expert in the subject.
- Storage and search are simpler than in the logic method. The searcher can choose how many of the most relevant documents are returned.

Disadvantages
- Searching is fairly slow when the vocabulary is large, because the computation runs over all the document vectors.
- Representing the vectors with natural-number coefficients increases search accuracy but greatly reduces computation speed, because the vector products have to be computed on natural or real numbers. Storing the vectors is also costly and complex.
- The system is inflexible in how it stores keywords. A very small change in the vocabulary table means either re-vectorising all the stored documents or ignoring the newly added meaningful words in the documents encoded earlier.
However, given the advantages of the method, this small error can be neglected, because the set of meaningful words is usually encoded fairly completely before the documents themselves are encoded. The vector method therefore remains of interest and in use.
- A further disadvantage is that the dimension of each vector is very large, since it equals the number of distinct words in the document collection. For example, the number of words may range from $10^3$ to $10^5$ in a small document collection and is larger still in large collections, especially on the Web.

Remedy: Several methods exist for reducing the dimension of the vectors. A simple and effective one is to remove stop words. Stop words are words that express sentence structure rather than document content, such as conjunctions and prepositions. They occur very often in text but are unrelated to the topic and content of the document. Removing them reduces the dimension of the representation vectors without harming search effectiveness.

Some examples of stop words:
Vietnamese  English
và          a
hoặc        the
cũng        do
            about

3.2.2. Document representation methods in hypertext databases

Chapter 1 described the difficulties of searching Web data and the differences between the structure of a traditional document and that of a hypertext document. Because of these difficulties, the way data is represented in a search engine is very important. How should Web pages be represented so that an enormous number of them can be stored, searched quickly, and matched accurately to the user's request?

a. Representing hypertext documents in search engines (inverted index)

The Indexer module takes the pages that the Crawler has downloaded into the Repository, indexes them, and stores the result in the database. The database is created during indexing. The following is the core database structure in most search engines:
- A keyword file made up of records, each with at least two fields: keyword ID and keyword. The keywords are established during indexing.
- A file of the documents managed by the system, with one record per document and at least the following fields: document ID, document name (URL), and the address on the system machine where the document file is stored (the cache of the Web page).
- A file of keyword occurrences in documents, each record having three fields: document ID, keyword ID, and the position of this keyword in the document.

Advantages: The positions at which words occur are represented (it is known in which kinds of tags a keyword occurs, and whether it occurs in the title or in the body). Information about the importance of keywords is stored.

Disadvantages: The frequency of keywords is not represented, so Web pages cannot be searched by content.

b. Representing hypertext documents with the vector model

In his doctoral thesis, Seán Slattery [May 2002, CMU-CS-02-142] gives four ways of representing a hypertext document with the vector model.

Approach 1
Ignore all link information between neighbouring documents and represent only the content of the document itself. This is the bag-of-words representation. If the content of the neighbouring documents is known to be completely independent of the class, this representation is a good choice. In practice, neighbouring documents provide a good deal of information that is useful for classification, so this representation is not effective.
Approach 2
The simplest way to use the content of the neighbouring documents is to concatenate the content of the document to be represented with the content of all its neighbours into a super_document. The components of the representation vector are then the frequencies of the keywords in the super_document. The drawback of this representation is that it blurs the distinction between the document under consideration and its neighbours, which introduces a lot of noise into classification. It works well only when the linked documents have the same topic as the document to be classified.

Approach 3
In this representation the vector is split into two parts: the first part represents the keywords in the document to be classified, and the second part represents the keywords that occur in all of its neighbouring documents. This fixes the drawback of the previous approach, because the target document is no longer diluted by its neighbours. If the neighbouring documents are useful for classification, their content is still easy to access. The drawback is that the dimension of the vector is large.

Approach 4
This representation works as follows:
- Find the number of neighbouring pages in the whole hypertext collection under consideration; let d be the number of neighbours.
- Structure the representation vector into d+1 parts. The first part represents the document to be classified directly. Parts 2 to d+1 represent the neighbouring documents, one part per neighbour.
The resulting vector is clearly very large, and it does not follow a single rule: there are many ways to order parts 2 onwards. This variety in the representation makes it hard to choose the data examples used for building the model.

From these approaches we can make the following observations about representing hypertext documents with the vector model.

Advantages:
- The potential information in the hyperlinks is exploited.
- The frequency of words is represented, so documents can be searched by closeness of content.

Disadvantages:
- The positions at which words occur are not represented. Information about the importance of a keyword is therefore lost, for example the fact that a keyword in the title or in a bold tag matters more than one in another position.
- The dimension of the vector is very large.

c. Representing hypertext documents with the relational model

The relational model is a natural representation for hypertext documents.
We can easily construct a binary relation (the link between documents) whose first argument is the name of the document that contains the hyperlink and whose second argument is the name of the document it points to.

a) What is a relation?

To understand the advantages of relational learning, we first compare it with propositional algorithms, which work on independent examples or entities. All that a propositional learner needs to know about the training examples is the description of, or information about, each example itself. When it classifies an example, a propositional learner likewise looks only at the information about that example and ignores its relationships with other examples.

A relational representation includes a propositional representation (such as the vector model, a bag of words, or a set of words) plus information about the relationships between examples. For instance, if our training examples are "people", a propositional representation describes only information such as the name, age, job, and salary of each person. A relational representation describes all of this plus further information such as employer-employee relationships or marriage relationships.

A relational representation therefore gives us the chance to search the whole rich space of relationships. If we believe that related examples can be a useful source of information for classifying some examples, the relational representation is appropriate. If, on the contrary, the related examples provide no further useful information, the relational representation cannot do better than the propositional representation.

Relational representation for hypertext

The relations are:

link_to(page, page): This relation expresses the hyperlinks, that is, the reference structure between the pages of the whole Web collection. We can state that page 15 contains a hyperlink to page 37 as follows: link_to(page15, page37).

has_word(page): This relation provides information about the content of each Web page. Only the words of interest (those later chosen as keywords) are represented. For example, has_computer(A) means that page A contains the word "computer".

Negation can also be expressed: not(link_to(page15, page37)) means that page15 does not link to page37, and not(has_computer(A)) means that page A does not contain the word "computer".

Example: Consider the following two Web pages A and B:
[Figure: page A contains the words List, Vector, Common; page B contains the words Jame, Paul, Link and links to A]

Suppose A is a student home page in the collection of Web pages of a university. Page A is then represented as follows:

A :- has_common(A), has_list(A), has_vector(A), link_to(B,A), has_jame(B), has_link(B), has_paul(B), not(has_home(A))

In words, this translates into the following rule: a page that contains the keywords list, vector, common but not the keyword home, and that is linked from a page containing the words jame, paul, link, is a student home page.

3.3. MACHINE LEARNING METHODS

3.3.1. The Bayes classification algorithm

The Bayes classification algorithm is one of the most typical classification algorithms in data mining and knowledge discovery. Its main idea is to compute the posterior probability that event c belongs to class x from the prior probability that c belongs to class x under condition T.
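In symbols:

$$p(c \mid x, \tau) = \sum_{T \in \tau} p(c \mid x, T)\, p(T \mid x)$$

Let V be the set of all vocabulary words. Suppose there are n document classes: $C_1, C_2, \dots, C_n$.

Each class $C_i$ has a probability $p(C_i)$ and a threshold $CtgTsh_i$. Let $p(C \mid Doc)$ be the probability that document Doc belongs to class C. Given a class C and a document Doc, if the computed probability $p(C \mid Doc)$ is greater than or equal to the threshold of C, then Doc is assigned to class C.

The document Doc is represented as a vector whose size is the number of keywords in the document. Each component holds a word of the document and the frequency of that word in the document. The algorithm works on the vocabulary V, the vector representing Doc, and the documents already in each class. It computes $p(C \mid Doc)$ and decides which class Doc belongs to.

The probability $p(C \mid Doc)$ is computed as follows:

$$p(C \mid Doc) = \frac{p(C)\prod_{F_j \in V} p(F_j \mid C)^{TF(F_j \mid Doc)}}{\sum_{i=1}^{n} p(C_i)\prod_{F_j \in V} p(F_j \mid C_i)^{TF(F_j \mid Doc)}} \qquad (1)$$

with:

$$p(F_j \mid C) = \frac{1 + TF(F_j \mid C)}{|V| + \sum_{i=1}^{|V|} TF(F_i \mid C)} \qquad (2)$$

where:
|V| : the number of words in the set V
$F_j$ : the j-th keyword of the vocabulary
$TF(F_j \mid Doc)$ : the frequency of the word $F_j$ in the document Doc (including synonyms)
$TF(F_j \mid C)$ : the frequency of the word $F_j$ in class C (the number of times $F_j$ occurs in all documents of class C)
$p(F_j \mid C)$ : the conditional probability that the word $F_j$ occurs in a document of class C

Formula (2) uses the Laplace probability estimate. The 1 in its numerator avoids a zero probability when the frequency of a word $F_j$ in class C is 0, that is, when $F_j$ does not occur in class C.

To reduce the complexity and the time of the computation, note that a given document Doc does not contain every word of the vocabulary V. Hence $TF(F_j \mid Doc) = 0$ when $F_j$ belongs to V but not to Doc, and so $p(F_j \mid C)^{TF(F_j \mid Doc)} = 1$. Formula (1) can therefore be rewritten as:

$$p(C \mid Doc) = \frac{p(C)\prod_{F_j \in Doc} p(F_j \mid C)^{TF(F_j \mid Doc)}}{\sum_{i=1}^{n} p(C_i)\prod_{F_j \in Doc} p(F_j \mid C_i)^{TF(F_j \mid Doc)}}$$

with $p(F_j \mid C)$ given by the Laplace estimate (2).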
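A minimal sketch of this classifier is given below. It assumes that the class priors $p(C_i)$ are estimated as the fraction of training documents in each class (the text does not say how they are obtained), it works in log space to avoid underflow, and the function names `train` and `posterior` as well as the toy training data are hypothetical.

```python
import math
from collections import Counter

def train(labelled_docs):
    """labelled_docs: list of (class_name, list_of_words) pairs."""
    class_doc_counts = Counter()
    term_counts = {}      # class -> Counter holding TF(F_j | C)
    vocabulary = set()
    for label, words in labelled_docs:
        class_doc_counts[label] += 1
        term_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    total_docs = sum(class_doc_counts.values())
    # Assumption of this sketch: p(C) is the share of training documents in C.
    priors = {c: n / total_docs for c, n in class_doc_counts.items()}
    return priors, term_counts, vocabulary

def posterior(doc_words, priors, term_counts, vocabulary):
    """Return p(C | Doc) for every class, using only the words present in Doc."""
    doc_tf = Counter(w for w in doc_words if w in vocabulary)
    log_scores = {}
    for c, prior in priors.items():
        total_in_class = sum(term_counts[c].values())
        log_score = math.log(prior)
        for word, tf in doc_tf.items():
            # Laplace estimate: (1 + TF(F_j | C)) / (|V| + sum_i TF(F_i | C))
            p_word = (1 + term_counts[c][word]) / (len(vocabulary) + total_in_class)
            log_score += tf * math.log(p_word)
        log_scores[c] = log_score
    # Normalise so that the posteriors over all classes sum to 1.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}

training = [
    ("sport", ["ball", "goal", "team"]),
    ("sport", ["team", "win"]),
    ("tech", ["computer", "code"]),
    ("tech", ["computer", "network", "code"]),
]
model = train(training)
probabilities = posterior(["team", "goal"], *model)
print(probabilities)
```

To apply the threshold rule, a document would be assigned to every class C whose value in `probabilities` is at least the threshold chosen for C.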
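Classification therefore does not rely on the whole vocabulary but only on the keywords that occur in the document Doc.

3.3.2. The k-nearest-neighbours algorithm

This algorithm does not rely on the vocabulary. It still uses the threshold CtgTsh, however, and follows the steps described above: it selects k documents at random and computes the probability $p(C \mid Doc)$ from the similarity between Doc and the k selected documents. The probability is computed as follows:

$$p(C_i \mid Doc) = \frac{\sum_{j=1}^{k} Sm(Doc, D_j)\, P(C_i \mid D_j)}{\sum_{l=1}^{n}\sum_{j=1}^{k} Sm(Doc, D_j)\, P(C_l \mid D_j)} \qquad (3)$$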
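where:
n : the number of classes
k : the number of documents selected for comparison
$P(C_i \mid D_j)$ : has the value 0 or 1 and states whether document $D_j$ belongs to class $C_i$. This value is used because a document may belong to more than one class.
$Sm(Doc, D_j)$ : the degree of similarity between the document Doc and the selected document $D_j$, computed as the cosine of the angle between the two vectors that represent Doc and $D_j$:

$$Sm(Doc, D_j) = \cos(Doc, D_j) = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i^2}\,\sqrt{\sum_i Y_i^2}} \qquad (4)$$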
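The similarity measure and a similarity-weighted class score can be sketched as follows. The normalisation over all classes in `knn_class_scores` is an assumption of this sketch, as are the function names and the toy vectors; the sketch also assumes that every label of a neighbour appears in the list of classes.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_x * norm_y)

def knn_class_scores(doc, neighbours, classes):
    """neighbours: list of (vector, set_of_classes) for the k selected documents."""
    raw = {c: 0.0 for c in classes}
    for vector, labels in neighbours:
        sim = cosine_similarity(doc, vector)
        for c in labels:  # P(C_i | D_j) is 1 for these classes and 0 otherwise
            raw[c] += sim
    total = sum(raw.values())
    if total == 0:
        return raw
    return {c: score / total for c, score in raw.items()}

selected = [([2, 1, 0], {"a"}), ([0, 1, 2], {"b"})]
scores = knn_class_scores([1, 1, 0], selected, ["a", "b"])
print(scores)
```

A neighbour that belongs to several classes contributes its similarity to each of them, which mirrors the remark that a document may belong to more than one class.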
Documents are represented in this algorithm exactly as in the Bayes classification algorithm above, that is, by the keywords $F_i$ and their frequencies $X_i$. In formula (4):
$X_i$ is the frequency of the i-th keyword (counting the synonyms that occur in the document Doc)
$Y_i$ is the frequency of the i-th keyword (counting the synonyms that occur in the document $D_j$)

3.3.3. Classification with decision trees

Decision tree learning is a widely used method for inductive learning from a large sample. It approximates target functions with discrete values. A decision tree can also be converted into an equivalent representation of the knowledge as If-then rules. Among decision tree learning algorithms, ID3 and C4.5 are the two best known. The ID3 algorithm is as follows.

ID3 (Examples, Target_attribute, Attributes)
1. Create a root node Root for the decision tree.
2. If all Examples are positive, return the single-node tree Root with label +.
3. If all Examples are negative, return the single-node tree Root with label -.
4. If Attributes is empty, return the single-node tree Root labelled with the most common value of Target_attribute in Examples.
5. Otherwise
   Begin
   5.1. A <= the attribute from Attributes that best classifies Examples
   5.2. The decision attribute for Root <= A
   5.3. For each possible value v_i of A
      5.3.1. Add a new branch below Root, corresponding to the test A = v_i.
      5.3.2. Let Examples_vi be the subset of Examples that have value v_i for A.
      5.3.3. If Examples_vi is empty
         - Then below this new branch add a leaf node labelled with the most common value of Target_attribute in Examples
         - Otherwise below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A}).
   End
   Return Root.

The best attribute is the one with the largest information gain.

Machine learning with decision trees is very effective because it can handle a large number of attributes, and because a system of learned rules can be extracted from the decision tree.

3.3.4. The relational learning algorithm FOIL

a. The concept of a Horn clause

A Horn clause is a clause with at most one positive literal, of the form:

$$H \lor \lnot L_1 \lor \lnot L_2 \lor \dots \lor \lnot L_n$$

Here $H, L_1, L_2, \dots, L_n$ are called positive literals, and $\lnot L_1, \lnot L_2, \dots, \lnot L_n$ are called negative literals. Written as a rule:

$$(L_1 \land L_2 \land \dots \land L_n) \Rightarrow H$$

This form is called a first-order rule. $L_1, L_2, \dots, L_n$ are called the preconditions and H is called the conclusion.

Examples of first-order rules:
If Parents(x,y) then Ancestor(x,y)
If (Parents(x,z) ^ Ancestor(z,y)) then Ancestor(x,y)
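To make the two Ancestor rules concrete, the sketch below applies them repeatedly to a set of Parents facts until nothing new can be derived. The function name `ancestors` and the names in the fact set are hypothetical and serve only as an illustration.

```python
def ancestors(parent_facts):
    """Apply the two Ancestor rules until no new fact can be derived.

    parent_facts is a set of (x, y) pairs, each meaning Parents(x, y).
    """
    # Rule 1: Parents(x, y) => Ancestor(x, y)
    ancestor = set(parent_facts)
    changed = True
    while changed:
        changed = False
        for (x, z) in parent_facts:
            for (z2, y) in list(ancestor):
                # Rule 2: Parents(x, z) and Ancestor(z, y) => Ancestor(x, y)
                if z == z2 and (x, y) not in ancestor:
                    ancestor.add((x, y))
                    changed = True
    return ancestor

facts = {("Ann", "Bob"), ("Bob", "Carl"), ("Carl", "Dana")}
derived = ancestors(facts)
print(sorted(derived))
```

Each pass adds the conclusions whose preconditions are already known, so the loop stops once the set of Ancestor facts is closed under both rules.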
Here Parents and Ancestor are called predicates.

b. The FOIL algorithm

FOIL was proposed and developed by Quinlan (Quinlan, 1990). FOIL learns from data sets that contain only two classes, positive examples and negative examples, and it learns a description of the positive class. The input of FOIL consists of preconditions and conclusions. The output is a set of rules generated from them. At each step FOIL adds one literal to the preconditions of the rule being learned. The algorithm uses the function Foil_Gain to choose one literal from the set of candidate literals.

FOIL is a non-incremental learner that hill-climbs, using a metric based on information theory, to build rules that cover the data. FOIL has two main stages:
1. Separate stage: start a new rule.
2. Conquer stage: conjoin literals to build the body of the clause.

The separate stage of the algorithm starts a new rule, while the conquer stage builds a conjunction of literals as the body of the rule. Each rule describes some subset of the positive examples and no negative examples. Note that FOIL has two operators: start a new rule with an empty body, and add a literal to the end of the current rule. FOIL stops adding literals when no negative example is covered by the rule, and it keeps starting new rules until every positive example is covered by some rule. The positive examples covered by a clause are removed from the training set, and the process continues to learn further clauses from the remaining examples. It ends when no positive examples are left.

The first-level design of FOIL is as follows:

1. Let POS be the set of positive examples.
2. Let NEG be the set of negative examples.
3. Set NewClauseBody to empty.
4. While POS is not empty do:
   Separate: (start a new rule)
   5. Remove from POS all examples that satisfy NewClauseBody.
   6. Reset NEG to the original set of negative examples.
   7. Reset NewClauseBody to empty.
   While NEG is not empty do:
      Conquer: (build the clause body)
      8. Choose a literal L.
      9. Conjoin L to NewClauseBody.
      10. Remove from NEG the negative examples that do not satisfy L.

FOIL uses hill-climbing to add to a rule the literal with the largest information gain. For each variabilisation of a predicate P, FOIL measures the information gain. To select the literal with the highest gain, it needs to know how many of the current positive and negative tuples are satisfied by the variabilisations of every extensionally defined predicate.
FOIL's information gain is computed as:

$$Gain(L) = T^{++}\left(\log_2\frac{P_1}{P_1 + N_1} - \log_2\frac{P_0}{P_0 + N_0}\right)$$

$P_0$ and $N_0$ are the numbers of positive and negative examples before the literal L is added to the clause.
$P_1$ and $N_1$ are the numbers of positive and negative examples after the literal L is added to the clause.
$T^{++}$ is the number of positive examples covered both before and after the literal is added (that is, the examples that are true under both the rule R and the rule R', where R' is R with the literal L added).

The following example illustrates the FOIL algorithm. We want to learn the relation Granddaughter(x,y) from the predicates Granddaughter, Father, Male, Female and the constants Victor, Sharon, Bob, Tom.

Example set: the assertions over the predicates Granddaughter, Father, Male, Female and the constants Victor, Sharon, Bob, Tom. The positive examples are Granddaughter(Victor, Sharon), Father(Sharon, Bob), Father(Tom, Bob), Female(Sharon), Father(Bob, Victor). All other assertions are negative (for example, ¬Granddaughter(Tom, Bob), ¬Father(Victor, Victor), ...).

To choose the literals of a rule, FOIL considers the different bindings of the variables x, y, z, t to the constants above.
- Step 1: the initial rule is Granddaughter(x,y) with an empty body. The binding {x/Victor, y/Sharon} gives a positive example, because Granddaughter(Victor, Sharon) is true in the training data. The other 15 bindings correspond to negative examples, because no matching assertion is found in the training set.
- At each later stage, the rule is formed from the set of bindings that yield positive and negative examples. Whenever a literal is added to the rule, the sets of positive and negative examples change. For instance, if the next literal added to the rule is Father(y,z), the binding {x/Victor, y/Sharon} above is replaced by the new binding {x/Victor, y/Sharon, z/Bob}, which corresponds to a positive example. At each step the numbers of negative and positive examples are counted in order to obtain the information gain Foil_Gain(L,R).
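The gain formula is easy to evaluate directly. The sketch below is a plain transcription of it; the function name `foil_gain`, the convention of returning 0 when no positive bindings are covered, and the "after" counts in the usage example are assumptions made for illustration only.

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """Gain = t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0))).

    p0, n0: positive and negative bindings before adding the literal.
    p1, n1: positive and negative bindings after adding the literal.
    t: positive bindings covered both before and after.
    """
    if p0 == 0 or p1 == 0:
        return 0.0  # convention of this sketch: no positives covered, no gain
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Before: 1 positive and 15 negative bindings, as for the initial rule.
# After: 1 positive and 3 negative bindings (made-up counts for illustration).
gain = foil_gain(p0=1, n0=15, p1=1, n1=3, t=1)
print(gain)
```

A literal that removes negative bindings while keeping the positive ones raises the fraction of positives and therefore has a positive gain; FOIL picks the candidate literal with the largest value.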
CHAPTER 4. EXPERIMENTAL SYSTEM

4.1. RELATED RESEARCH

The experimental system combines the strengths of the solutions proposed in earlier research on search and on text classification. The content and results of that research are summarised below.

1. [Seán Slattery (May 2002, CMU-CS-02-142)] Doctoral thesis "HyperText Classification"

In his doctoral thesis the author compares machine learning algorithms applied to Web page classification, together with the corresponding representations:
1. Naive Bayes with documents represented as a bag of words
2. k-nearest neighbours with the frequency model (TF-IDF) for representing Web pages
3. The FOIL algorithm with each document represented as a set of words (links between documents are ignored)
4. The FOIL algorithm with each document represented as a set of words, taking into account the link information in the documents

The author implemented and tested these approaches and reported the results, using recall and precision as evaluation criteria.
Approach 4 performs best, with clearly higher recall and precision. The author then builds a new hypertext classifier that uses the FOIL-PILFS algorithm with documents represented in the relational model.

2. [Đoàn Sơn] Master's thesis "A method using fuzzy logic and its application to full-text data mining"

In this thesis the author classifies documents using a fuzzy-logic representation of the documents and a decision tree learning algorithm. This way of solving the problem has some advantages: using fuzzy concepts reduces the number of attribute dimensions, which reduces the computation time of decision tree learning. The representation also has limitations: people may have to spend a lot of effort on building the topics, the concepts, and the relationships between them.

3. [Bùi Quang Minh] The Vietseek search engine. Research report of the special science project of Vietnam National University, Hanoi, code QG 02-02.

In the Vietseek search engine the documents are organised into a database. Vietseek builds all three kinds of index (TextIndex, StructureIndex and UtilityIndex). The Vietseek database has two parts:
Part 1: Data about Web documents, domains, and words, stored in tables of a MySQL database.
Part 2: Index data, stored separately with its own structure. Because this part requires high speed, it is not stored in the MySQL database but in 300 separate binary files.
Vietseek searches for the phrases entered and returns the documents that contain the key phrases. It does not perform classification.

4. [Phạm Thị Thanh Nam] Master's thesis "Some solutions for the search problem in hypertext databases"

From the index database already built by Vietseek, the author constructs vectors that represent Web pages, whose components are the frequencies of the keywords in the document under consideration. The thesis proposes several algorithms:
- Listing the Web pages that are "closest in meaning" to a given Web page or search phrase under the criterion "closeness of content". Closeness of content is obtained by comparing the representation vectors with one another.
- The importance of a Web page, based on its links with other Web pages and on the frequency of the search keywords in the page.
- Combining closeness of content and page importance into a single criterion called the combined value. The results are displayed in order of combined value.
Nhn xt Tuy cng trnh u tin [San Slattery] gii thiu kh tng quan v cc phng php phn lp v phn tch mt s kt qu th nghim, nhng ni chung c bn cng trnh nghin cu ni trn cha thc s cp ti vn thit k v ci t nhng gii php thc s tinh t gii quyt vn t ng ngha v a ngn ng i vi h thng phn lp trong CSDL Web. Thc hin vic kho st nhng gii php cho vn ny v ci t th nghim l mt cng vic nghin cu c ngha. Tn ti mt s thut ton in hnh gii quyt bi ton phn lp trong cc CSDL vn bn. Vic ci t th nghim v nh gi hiu qu hot ng ca mt s thut ton phn lp in hnh nh vy trong mt CSDL web thc s (khong vn trang ) c th c coi nh nhng bc i cn thit u tin trong vic xy dng v pht trin cc my tm kim ting Vit. 4.2. XUT MT CCH T CHC CSDL V THUT TON P DNG Theo nhng phng php biu din vn bn HyperText v ang c s dng, nghin cu, ta c nhn xt tng qut sau: cch biu din vn bn HyperText trong cc my tm kim c u im l khai thc c nhng thng tin quan trng v v tr xut hin ca t kha, t xp hng c cc trang Web tm c theo th t gn vi ni dung t kha cn tm, nhng cha thy cp n tn s xut hin ca cc t kha trong vn bn. Nn vic tm theo ni dung l kh thc hin c. Cn vi cch biu din theo m hnh Vector ca Sen Slattery [2002] th b qua thng tin v v tr xut hin ca cc t kha, mt thng tin rt quan trng cho phn lp vn bn. Hn na nu theo cch biu din 2, vn bn gc cn phn lp s b m nht i trong tp hp cc vn bn lin qua n n, v phn lp s mt chnh xc nht l khi cc vn bn lin quan khng c cng ch . Cn vi cch biu din 3 v 4, s chiu ca vector s rt ln v c rt nhiu thnh phn lp (chnh l cc t xut hin lp i lp li trong tp cc vn bn lin quan). T nhng u nhc im ca cc phng php trn, ti a ra mt cch biu din ring. t ng chnh vn l da trn m hnh vector, ng thi trong cch xy dng file t kha c tnh n cc t ng ngha 4.2.1. t bi ton Tn ti mt tp cc vn bn HyperText cho trc, mi lp cha cc ti liu (di dng *.html) thuc cng mt th loi. Xy dng h thng vi chc nng: c mt ti liu mi, yu cu h thng phn ti liu vo mt lp thch hp. 4.2.2. 
Document representation: use a frequency-based vector model that takes into account the importance of the position at which keywords appear, together with the links between pages.
The vector for a Web page A is built as follows:
- For each Web page A, collect the Web pages that link to A and the pages that A points to.
- Count the number of times each keyword appears in A and in the pages related to A. Let count[i] be the number of occurrences of the i-th keyword in the vector that represents page A.
  If keyword i appears in the body tag (<body></body>), increase count[i] by only 1.
  If keyword i appears in the title tag (<title></title>), increase count[i] by 3.
  After page A has been counted, multiply count[i] by 3 (the weight of the document being represented). Then continue counting in the linked pages, using the same position weights as for document A; the weight of the related documents is 1.
Thus, this representation combines several kinds of information: the incoming and outgoing links of the HyperText document, so that neighboring documents are taken into account while the original document still receives its own weight; the number of times each keyword occurs in the document; and the positions at which the keywords occur in the document.

4.2.3. Database design
The HyperText documents are encoded into 3 tables in an Access database.
1. Table 1: the keyword table (KeyWords):
Field Name    Data Type      Description
KeyWordID     Auto Number    Keyword ID
KeyWord       Text           Keyword
Synonymous    Memo           Synonyms of the keyword
Keyword (KeyWord): the content is a word in English, so it must satisfy the following conditions. An English word is treated as a single unit (one syllable), and each unit is a string of the characters a-z, A-Z. Words in a sentence are separated by spaces or by any characters (period, comma, colon, ...) that are not in a-z, A-Z.
Synonyms (Synonymous): a memo field of the form (word1, word2, ..., word n). The synonyms therefore share the same ID (KeyWordID) as the keyword.
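The word rule and the Synonymous field described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `tokenize`, `parse_synonyms`, and `build_lookup` are hypothetical, and lower-casing words before lookup is an assumption that the text does not specify.

```python
import re

# A word is a maximal run of a-z / A-Z; any other character separates words.
WORD_RE = re.compile(r"[A-Za-z]+")


def tokenize(text):
    """Split text into words according to the a-z, A-Z rule."""
    return WORD_RE.findall(text)


def parse_synonyms(memo):
    """Parse a Synonymous memo such as '(car, auto, automobile)' into a list."""
    inner = memo.strip().strip("()")
    return [word.strip() for word in inner.split(",") if word.strip()]


def build_lookup(rows):
    """Map every keyword and synonym to its KeyWordID.

    rows is an iterable of (KeyWordID, KeyWord, Synonymous) tuples.
    Lower-casing is an assumption made for this sketch.
    """
    lookup = {}
    for keyword_id, keyword, memo in rows:
        lookup[keyword.lower()] = keyword_id
        for synonym in parse_synonyms(memo):
            lookup[synonym.lower()] = keyword_id
    return lookup
```

With such a lookup, a keyword and all of its synonyms resolve to one KeyWordID, which is how synonyms end up contributing to the same vector component.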
2. Table 2: the document table (Documents)
Field Name    Data Type      Description
DocID         Auto Number    Document ID
DocName       Text           Document name
CacheAdd      Text           Cache address
Vector        Memo           Vector that represents the document

Vector: a field of type Memo. Each vector has the form:
(ID of keyword 1, number of occurrences in the title, total number of occurrences in the document);(ID of keyword 2, number of occurrences in the title, total number of occurrences in the document);...
The number of components of the vector is the number of keywords that appear in the Web page being represented, not the full set of keywords in the KeyWords table, so the dimensionality of the vector is reduced considerably. Each component of the vector records how many times a keyword occurs in the document and where it occurs.
Example: a vector of the form (1,1,4);(2,1,4);(4,2,7) means:
Keyword 1 occurs 4 times, 1 of which is in the title.
Keyword 2 occurs 4 times, 1 of which is in the title.
Keyword 4 occurs 7 times, 2 of which are in the title.
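To make the Vector field format concrete, here is a minimal Python sketch that converts between the memo string and a dictionary. The helper names `parse_vector` and `format_vector` are hypothetical and are not part of the system described here.

```python
def parse_vector(memo):
    """Turn '(1,1,4);(2,1,4)' into {KeyWordID: (title_count, total_count)}."""
    vector = {}
    for part in memo.split(";"):
        part = part.strip().strip("()")
        if not part:
            continue  # tolerate an empty memo or a trailing ';'
        keyword_id, title_count, total_count = (int(x) for x in part.split(","))
        vector[keyword_id] = (title_count, total_count)
    return vector


def format_vector(vector):
    """Serialize the dictionary back into the memo format, ordered by KeyWordID."""
    return ";".join(
        f"({keyword_id},{title_count},{total_count})"
        for keyword_id, (title_count, total_count) in sorted(vector.items())
    )
```

Keeping only the keywords that actually occur, as a dictionary does here, mirrors the sparse storage that the Vector field uses.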
3. Table 3: the links between documents (LINKS)
Field Name    Data Type    Description
DocID1        Number       ID of the document that contains the link
DocID2        Number       ID of the document that is linked to
DocID1 is the ID of a document that links to the document whose ID is in DocID2.
4. Table 4: the probabilities of the classes
Field Name     Data Type               Description
ClassName      Text                    Class name
Probability    Number (from 0..100)    Probability of the class

4.2.4. Design of the program modules

1. Module that parses Web pages to create the KEYWORDS table
Algorithm:
Input: the documents used to create keywords
While (not all documents have been read) do
  1. Read the next document
  2. While (the document has not been read completely) do
     2.1. Read the next word
     2.2. Insert it into the database
     End
End.
Output: the keyword file
The Synonymous field is filled in by hand for each keyword. The module also provides functions for adding keywords by hand and for deleting unnecessary keywords.

2. Module that obtains the cache address (CacheAddress) of each training document and generates the document ID (DocID), which are added to the first fields of the DOCUMENTS table. The Vector field is created later by module 4.
Algorithm:
Input: the documents used for training
While (not all documents have been read) do
  1.1. Read the cache address of the document and insert it into the database
  1.2. Read the document name and insert it into the database
End
The document ID increases automatically.

3. Module that creates the LINKS table. To create the LINKS table, the DOCUMENTS table must exist first so that the ID (DocID) of each corresponding document can be looked up.
Algorithm:
1. Read from the directory on the hard disk that contains the documents
2. Set the variable TenTM = [path of the directory]
3. While (not all documents have been parsed) do
   3.1. Take the next document in the directory together with its cache address (CacheAdd).
   3.2. Use CacheAdd to find the DocID of this document in the DOCUMENTS table; this gives DocID1
      3.2.1. Parse the document to obtain the hyperlink tags, which are phrases of the form href=[name of the document pointed to]; suppose there are N such tags.
      3.2.2. For i = 1 to N do
         3.2.2.1. Concatenate TenTM and [name of the document pointed to] to obtain the cache address, then search DOCUMENTS for the DocID; this gives DocID2
         3.2.2.2. Insert the two DocIDs obtained above into the DocID1 and DocID2 fields of the LINKS table
         End.
   End
4. Return the LINKS table in the database
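The LINKS module above can be sketched in Python with the standard `html.parser` module. This is a simplified illustration, not the module as implemented: the names `HrefCollector`, `extract_hrefs`, and `build_links` are hypothetical, pages are passed in memory rather than read from disk, and the DOCUMENTS table is replaced by a plain dictionary from cache address to DocID.

```python
import os
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


def build_links(folder, pages, doc_ids):
    """Return (DocID1, DocID2) pairs for links between known documents.

    pages maps file name -> HTML text; doc_ids maps cache address -> DocID
    (a stand-in for the DOCUMENTS table, assumed for this sketch).
    """
    links = []
    for file_name, html in pages.items():
        doc_id1 = doc_ids.get(os.path.join(folder, file_name))
        if doc_id1 is None:
            continue
        for target in extract_hrefs(html):
            # Folder path + linked file name gives the target's cache address.
            doc_id2 = doc_ids.get(os.path.join(folder, target))
            if doc_id2 is not None:
                links.append((doc_id1, doc_id2))
    return links
```

Links to pages outside the collection are simply skipped in this sketch, since they have no DocID to store.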
4. Module that creates the vector for each document and stores it in the Vector field of the DOCUMENTS table.
Algorithm:
1. Read DocID and CacheAdd from the DOCUMENTS table in the database
2. While (not all records have been read) do
   2.1. Use CacheAdd to read the document from the hard disk
   2.2. Set DocID_curence = DocID
   2.3. Set vector = ""
   2.4. For each keyword in the KEYWORDS table do
      2.4.1. Set total_occurence = 0; header_occurence = 0
      2.4.2. Parse the document into words; if a word is not yet in the KEYWORDS table, add it
      2.4.3. While (the document has not been read completely) do
         - If (word = keyword) or (word = a synonym of keyword), and word lies inside the <head> tag, then total_occurence += 3 and header_occurence += 1;
         - If (word = keyword) or (word = a synonym of keyword), and word does not lie inside the <head> tag, then total_occurence += 1;
         End.
      2.4.4. total_occurence *= 3; header_occurence *= 3;
      2.4.5. Read all documents that the current document links to (outgoing). Repeat the counting steps used for the current document, adding to total_occurence and header_occurence
      2.4.6. Read all documents that link to the current document (incoming). Repeat the counting steps used for the current document, adding to total_occurence and header_occurence
      2.4.7. If (total_occurence != 0) then vector += "(" + KeyWordID + "," + header_occurence + "," + total_occurence + ");"
      End.
   2.5. Store vector in the Vector field of the DOCUMENTS record where DocID = DocID_curence.
3. End.
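The weighting used by this module can be illustrated with a short Python sketch. It rests on several assumptions that should be checked against the real system: each page is given as a (title, body) pair of plain text, `lookup` maps lower-cased keywords and synonyms to KeyWordID, a title hit adds 3 to the total and a body hit adds 1, the page's own counts are multiplied by 3 and those of linked pages by 1, and each component is written as (KeyWordID, title count, total count). The names `count_page` and `build_vector` are hypothetical.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[A-Za-z]+")
TITLE_WEIGHT = 3      # a title occurrence adds 3 to the total
BODY_WEIGHT = 1       # a body occurrence adds 1
OWN_WEIGHT = 3        # weight of the page being represented
NEIGHBOUR_WEIGHT = 1  # weight of pages linked to or from it


def count_page(title, body, lookup):
    """Return {KeyWordID: [title_hits, weighted_total]} for one page."""
    counts = defaultdict(lambda: [0, 0])
    for word in WORD_RE.findall(title):
        keyword_id = lookup.get(word.lower())
        if keyword_id is not None:
            counts[keyword_id][0] += 1
            counts[keyword_id][1] += TITLE_WEIGHT
    for word in WORD_RE.findall(body):
        keyword_id = lookup.get(word.lower())
        if keyword_id is not None:
            counts[keyword_id][1] += BODY_WEIGHT
    return counts


def build_vector(page, neighbours, lookup):
    """Combine the page and its linked pages into one Vector memo string."""
    vector = defaultdict(lambda: [0, 0])
    weighted = [(page, OWN_WEIGHT)] + [(n, NEIGHBOUR_WEIGHT) for n in neighbours]
    for (title, body), weight in weighted:
        for keyword_id, (title_hits, total) in count_page(title, body, lookup).items():
            vector[keyword_id][0] += weight * title_hits
            vector[keyword_id][1] += weight * total
    return ";".join(
        f"({k},{hits},{total})" for k, (hits, total) in sorted(vector.items()) if total != 0
    )
```

Only keywords with a non-zero total are written out, so the stored vector stays as sparse as the Vector field description requires.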
5. Module that performs classification.
Input: the set of documents to be classified.
While (not all documents have been read) do
  Read in the document to be classified
  1. Parse the document into a vector, as in the module that creates the Vector field of the DOCUMENTS table
  2. Combine it with the vectors of the documents in the database and apply one of the machine-learning classification algorithms.
End

4.2.5. Analysis of the system functions
a. Main functions of the system
b. Detailed functions
- Database creation
- Classification and search

4.2.6. Evaluation of the experimental system
a. Some example results on the experimental system
The system runs and has produced some initial results:
- The database system was built as described above:
  + Documents are parsed to extract keywords
  + The links between hypertext documents within a hypertext are represented
  + Documents are encoded as vectors and stored in the database
- A given hypertext document can be classified
- A hypertext document whose content is close to an input document can be searched for
b. Limitations of the system
Because of limited time, the system still has some limitations:
- The keywords are not yet complete and have not yet been filtered
- Only one document can be classified at a time (this will be improved if time permits)
- Accuracy is not yet high because accurate training data is not yet available.