
Chapter 1.

OVERVIEW OF WEB DATA MINING


1.1. INTRODUCTION TO DATA MINING AND KDD
1.1.1. Why data mining is needed
Over roughly the past decade, the amount of information stored on electronic devices (hard disks, CD-ROMs, magnetic tape, etc.) has grown without pause. This accumulation of data is happening at an explosive rate. It is estimated that the amount of information worldwide doubles about every two years, and the number and size of databases grow rapidly along with it. Figuratively speaking, we are drowning in data yet starving for knowledge. The question is whether anything useful can be extracted from these mountains of seemingly discarded data.
"Necessity is the mother of invention": Data Mining emerged as an effective answer to the question just raised []. There are many definitions of Data Mining, which are discussed later. For now it can be understood as a knowledge technology that helps extract useful information from the data stores accumulated during the operation of a company or organisation.
1.1.2. What is data mining?
Data mining is defined as a process of distilling or discovering knowledge from a large amount of data. A frequently used analogy is extracting gold from rock and sand: data mining is likened to "panning for gold" in a large given collection of data. The term refers to finding a small, valuable set within a large quantity of raw data. Several terms in current use have a similar meaning, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology and data dredging.
Definition: Data mining is a set of techniques used to automatically exploit and find the relationships among data in a huge and complex data set, and at the same time to find the patterns hidden in that data set.
Data mining is one of the seven steps of the KDD (Knowledge Discovery in Databases) process. KDD is viewed as seven processes in the following order:
1. Data cleaning and preprocessing: removing noise and unnecessary data.
2. Data integration: merging data into data warehouses and data marts after it has been cleaned and preprocessed.
3. Data selection: selecting data from the data warehouses and then converting it into a form suitable for knowledge discovery. This process also covers handling noisy data, incomplete data, etc.
4. Data transformation: the data is converted into forms suitable for processing.
5. Data mining: one of the most important steps, in which intelligent methods are applied to distil patterns from the data.
6. Pattern evaluation (knowledge evaluation): assessing the results found by means of some measures.
7. Knowledge presentation: using visualisation and representation techniques to present the results to the user.



Figure 1 - Steps in Data Mining & KDD
1.1.3. Main functions of data mining
Data mining is divided into the following main directions:
- Concept description: oriented towards describing, synthesising and summarising concepts. Example: text summarisation.
- Association rules: rules that represent knowledge in a fairly simple form. Example: 60% of the men who enter a supermarket buy beer, and as many as 80% of those also buy dried beef. Association rules are widely applied in business, medicine, bioinformatics, finance and the stock market, etc.
- Classification and prediction: assigning an object to one of a set of classes known in advance. Example: classifying geographic regions by weather data. This approach usually uses machine learning techniques such as decision trees, artificial neural networks, etc. Classification is also called supervised learning.
- Clustering: grouping objects into clusters, where neither the number nor the names of the clusters are known in advance. Clustering is also called unsupervised learning.
- Sequential/temporal pattern mining: similar to association rule mining but with the addition of order and time. This approach is widely applied in finance and the stock market because of its strong predictive power.
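The percentages in the association rule example correspond to what is usually called the support and confidence of a rule. The sketch below shows how the two measures could be computed for the rule "beer implies dried beef". The baskets are invented for illustration, and the function names `support` and `confidence` are assumptions of this sketch, not part of the original text.

```python
from typing import FrozenSet, List

def support(transactions: List[FrozenSet[str]], items: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(transactions: List[FrozenSet[str]],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """Among transactions containing `antecedent`, the share that also contain `consequent`."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    frozenset({"beer", "dried beef"}),
    frozenset({"beer", "dried beef", "bread"}),
    frozenset({"beer", "dried beef"}),
    frozenset({"beer", "dried beef", "milk"}),
    frozenset({"beer"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk"}),
    frozenset({"bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"eggs"}),
]

rule_support = support(baskets, frozenset({"beer", "dried beef"}))
rule_confidence = confidence(baskets, frozenset({"beer"}), frozenset({"dried beef"}))
```

In this toy data four of the ten baskets contain both items, and four of the five baskets with beer also contain dried beef.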
1.1.4. Applications of data mining
Although data mining is a new approach, it has attracted great interest from researchers and developers thanks to its practical applications. Some typical applications are:
- Data analysis and decision support
- Medical treatment
- Text mining and Web mining
- Bioinformatics
- Finance and the stock market
- Insurance
- Pattern recognition
- etc.
1.2. HYPERTEXT AND FULLTEXT DATABASES
1.2.1. FullText databases
FullText data is a form of unstructured data in which the information consists solely of text documents. Each document carries information about some subject, expressed through the content of all the words that make it up. The meaning of a word in a document is not fixed: it depends on the context in which the word appears. The words in a document are connected to one another according to some language.
Among the kinds of data available today, text is one of the most common. It is present everywhere and we encounter it constantly. Text processing problems were therefore posed long ago and remain among the central problems of text mining today. Notable problems include text retrieval, text classification, text clustering and text routing.
A full-text database is an unstructured database whose data consists of documents and document attributes. It is usually organised as a combination of two components: a conventional structured database (holding the attributes of the documents) and the documents themselves.

[Figure: a Full-Text database is made up of a structured database holding the document attributes, and the documents.]

The content of a document is stored in the database indirectly, in the sense that the system manages only the address at which the content is stored.
Text databases can be divided into two types:
- Unstructured: the ordinary texts we read every day, expressed in natural human language and without any formatting structure. Example: a collection of books, journals and articles managed in a digital library network.
- Semi-structured: texts organised in a loosely structured form, such as records with text markup symbols, which still convey the main content of the text. Examples: HTML pages, email, ...
However, the division into two types is not clear-cut. Software systems usually have to combine both, as in search engines or in text retrieval, one of the most active fields today. Search systems such as Yahoo, Altavista and Google, for example, all organise data into groups and directories, and each group may contain many subgroups. Altavista additionally integrates an automatic translation program that can translate into many languages with fairly good results.
1.2.2. HyperText databases
According to the Oxford dictionary (Oxford English Dictionary Additions Series), Hypertext is defined as follows: text that does not have to be read as a single continuous sequence and can be read in different orders, in particular text and graphics that are linked to one another in such a way that the reader need not read continuously. For example, when reading a book the reader does not have to go through every page from beginning to end but can skip ahead to later passages to consult the topics of interest.
HyperText thus consists of non-sequential writing: it branches and lets readers choose how to read according to their wishes. In the ordinary sense, HyperText is a set of written pages connected to one another by links, which allows readers to read in different ways. We are already familiar with HTML pages, which contain links pointing to different parts of the same page or to other pages, and the reader follows these links to read the document.
In addition, HyperText is a special form of text, so it can also contain sequential writing (the most common form of writing). Because HyperText is not constrained by sequentiality, new forms of presentation can be created, and a document can therefore reflect its intended content better. Readers can also choose a way of reading that suits them, for example by going deep into a topic they care about. The idea of creating a set of documents together with pointers to other documents, so as to link a set of related documents, is a genuinely good and very useful way to organise information. For writers, it removes concerns about the order of presentation: they can organise the subject into small parts and then use links to show how these parts relate to one another.
For readers, it makes it possible to take shortcuts through the information network and decide which pieces of information relate to the topic they want to pursue. Compared with linear reading, that is reading in sequence, HyperText gives us a far more effective interface to the information content. From the viewpoint of machine learning algorithms, HyperText offers the opportunity to look beyond a single document when classifying it, that is, to take into account the documents linked to it. Of course, not all documents linked to it are useful for classification, especially since hyperlinks can point to many different kinds of documents. Nevertheless there is clearly potential, still to be researched, in using the documents linked to a page to improve the accuracy of classifying that page.
There are two HyperText concepts we need to consider:
- Hypertext Document: a single text document within a hypertext system. If the hypertext system is pictured as a graph, the documents correspond to the nodes.
- Hypertext Link: a reference that connects one HyperText document to another. Hyperlinks play the role of the edges in that graph.
HyperText is a common kind of data today, and one for which the demand for search and classification is very high. It is the prevailing kind of data on the Internet. A HyperText database holds semi-structured documents, because tags are present: structural tags (title, introduction, body) and presentation tags that emphasise text (bold, italic, ...). Thanks to these tags we have an additional criterion (compared with fulltext documents) for searching and classifying them. Based on the predefined tags, keywords can be given different priorities depending on where they appear. For example, when searching for documents related to "people" we use "people" as the search keyword, and documents in which the keyword "people" appears in the title will be closer to the search request.
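The idea of giving a keyword a different priority depending on the tag it appears in can be sketched with Python's standard `html.parser` module. The weights in `TAG_WEIGHTS` are arbitrary values chosen for this illustration, and `KeywordScorer` and `score_page` are hypothetical names. The sketch ignores void tags such as `<br>`, which a real parser would need to handle.

```python
from html.parser import HTMLParser

# Hypothetical weights: a keyword in the title counts more than one in bold or body text.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "b": 2.0}
DEFAULT_WEIGHT = 1.0

class KeywordScorer(HTMLParser):
    """Sums tag-dependent weights over every occurrence of one keyword."""

    def __init__(self, keyword: str):
        super().__init__()
        self.keyword = keyword.lower()
        self.open_tags = []   # stack of currently open tags
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the most recent matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        count = data.lower().split().count(self.keyword)
        if count:
            # Use the highest weight among the tags enclosing this text.
            weight = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
                         default=DEFAULT_WEIGHT)
            self.score += count * weight

def score_page(html: str, keyword: str) -> float:
    scorer = KeywordScorer(keyword)
    scorer.feed(html)
    scorer.close()
    return scorer.score

page_a = "<html><head><title>People of the web</title></head><body><p>Some text</p></body></html>"
page_b = "<html><head><title>Misc</title></head><body><p>people and more people</p></body></html>"
```

With these weights, `page_a` scores higher for "people" than `page_b`, even though `page_b` mentions the word twice, because the single occurrence in `page_a` is in the title.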

Comparison of the characteristics of Fulltext data and Web page data
Although a Web page is a special form of FullText data, there are many differences between the two kinds of data. The observations below show how Web data differs from FullText data. These differences in characteristics are the main reason why mining the two kinds of data (classification, search, ...) differs.
[Figure: Hypertext Documents illustrated as nodes and Hypertext Links as the links between them]
A comparison of the characteristics of Fulltext data and Web page data, as presented in [2], follows.
No. | Web page | Ordinary text (Fulltext)
----|----------|-------------------------
1 | Semi-structured text. The content has a title section and tags that emphasise the meaning of words or phrases. | Usually unstructured text. Its content offers no criterion on which to base an assessment.
2 | The content of Web pages is usually described briefly and concisely, with hyperlinks directing the reader to other places with related content. | The content of ordinary text is usually very detailed and complete.
3 | The content of Web pages contains hyperlinks that connect pages with related content to one another. | Ordinary text pages cannot link to the content of other pages.

1.3. TEXT MINING AND WEB MINING
As mentioned above, text mining and Web mining are among the important applications of data mining. This section examines these problems in more depth.
1.3.1. Problems in text mining
1. Text retrieval
a. Content
Text retrieval is the process of finding documents according to a user's request. Requests are expressed as queries, the simplest form of which is a set of keywords. One can picture a text retrieval system as sorting documents into two classes: one containing the documents that satisfy the query, which are shown, and one containing the documents that do not, which are hidden. Practical systems today do not present results this way. Instead they return a list of documents ranked by their importance with respect to the query. Typical examples are search engines such as Google and Altavista.
b. Process
The retrieval process is divided into four main stages:
- Indexing: raw documents must be converted into some representation for processing. This stage is also called document representation. The representation must be structured and easy to process.
- Query formulation: users must describe their information needs in the form of a query. Queries must be expressed in a form that retrieval systems commonly accept, such as entering the keywords to search for. There are also methods that formulate queries in natural language or by example, but these require more complex processing techniques. The great majority of current retrieval systems use keyword queries.
- Comparison: the system must compare the user's query clearly and completely against the documents stored in the database. It then decides which documents are relevant to the query and in what order. The system displays either the whole document or only part of it.
- Feedback: the results first returned often do not satisfy the user's request, so a feedback stage is needed in which users can change their requests or enter new ones. Users can also indicate to the system which documents satisfy their request, and the system can update its results accordingly. This is called relevance feedback.
Current search tools concentrate mainly on the first three stages, and most of them still lack a feedback stage or any handling of user-machine interaction. Feedback is now being widely studied, and within human-machine interface interaction a research direction known as interface agents has emerged.
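The first three stages (indexing, query formulation and comparison) can be illustrated with a very small sketch. It represents each document as a bag of words and ranks documents by how often they contain the query keywords. This scoring rule is a simplification chosen for illustration, not the ranking used by any particular search engine, and the documents and function names are hypothetical.

```python
from collections import Counter

def index_document(text: str) -> Counter:
    """Indexing stage: represent a raw document as a bag of lowercase words."""
    return Counter(text.lower().split())

def search(index: dict, query: str) -> list:
    """Comparison stage: rank documents by total occurrences of the query keywords."""
    keywords = query.lower().split()
    scores = {doc_id: sum(bag[k] for k in keywords) for doc_id, bag in index.items()}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # Documents that match no keyword are not shown at all.
    return [doc_id for doc_id, score in ranked if score > 0]

docs = {
    "d1": "web mining finds patterns in web data",
    "d2": "text mining works on plain documents",
    "d3": "cooking recipes for the weekend",
}
index = {doc_id: index_document(text) for doc_id, text in docs.items()}
results = search(index, "web mining")
```

A feedback stage could be layered on top by letting the user reformulate the query and calling `search` again.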
2. Text classification (Text Categorization)
a. Content
Text classification is the process of assigning documents to one or more predefined classes. Documents can be classified manually, that is, by reading each document and assigning it to a class. This costs a great deal of time and effort for large numbers of documents and is therefore not feasible. Automatic classification methods are needed instead. Automatic classification uses machine learning methods from artificial intelligence (decision trees, Bayes, k-nearest neighbours).
One of the most important applications of text classification is in text retrieval. Starting from a classified data set, documents are indexed under their corresponding classes. Users can then specify, through their queries, the topic or class of documents they wish to search.
Another application of text classification is in text understanding. Text classification can be used to filter documents, or parts of documents, that contain the data being sought, without losing the complexity of natural language.
In text classification, membership of a class can be assigned a true/false value (the document belongs or does not belong to the class) or computed as a degree of membership (the document belongs to the class to a certain degree). When there are several classes, true/false classification amounts to deciding whether a document belongs to exactly one particular class or not.
b. Process
Text classification follows these steps:
- Indexing: indexing documents works just as it does in text retrieval. Here indexing speed plays an important role, because some new documents may need to be processed in real time.
- Determining the classifier: as in text retrieval, text classification requires a way to express how a document is determined to belong to a class, based on its representation. In a text classification system this is called the classifier (categorizer). It plays the same role as the queries in a retrieval system. But whereas queries are transient, a classifier is used in a stable, long-term way throughout the classification process.
- Comparison: in most classifiers, each document is required to be assigned true or false for a given class. The biggest difference from the comparison stage of a text retrieval system is that each document is compared against a number of classes only once, and choosing the appropriate decision also depends on the relationships between the document classes.
- Feedback (or adaptation): feedback plays two roles in a text classification system. First, classification requires a large number of documents that have been labelled by hand beforehand, which are used as training samples to help build the classifier. Second, in text classification it is not easy to change the requirements as in the feedback stage of text retrieval. Instead, users can inform the system maintainer about deleting, adding or changing particular document classes as they require.
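To make the distinction between true/false and graded class membership concrete, the sketch below uses a k-nearest-neighbours classifier, one of the methods named above. The similarity measure (number of shared words) and the training texts are simplifications invented for this illustration. A practical system would use a proper document representation and a far larger hand-labelled training set.

```python
from collections import Counter

def overlap(a: str, b: str) -> int:
    """Number of word occurrences shared by two texts (a crude similarity)."""
    return sum((Counter(a.lower().split()) & Counter(b.lower().split())).values())

def membership(text: str, label: str, training: list, k: int = 3) -> float:
    """Graded membership: share of the k most similar training texts labelled `label`."""
    neighbours = sorted(training, key=lambda ex: -overlap(text, ex[0]))[:k]
    return sum(1 for _, lab in neighbours if lab == label) / k

# Hand-labelled training texts, invented for illustration.
training = [
    ("football match score goal", "sport"),
    ("tennis match final score", "sport"),
    ("stock market price falls", "economy"),
    ("market inflation price index", "economy"),
]

degree = membership("goal in the final match", "sport", training)
belongs = degree >= 0.5   # true/false decision derived from the graded value
```

Here two of the three nearest training texts are labelled "sport", so the graded membership is two thirds and the true/false decision is positive.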
3. Other problems
Besides the two problems above, there are the following:
- Text summarisation
- Text clustering
- Term clustering
- Term classification
- Latent term indexing
- Text routing
In all the text processing problems listed above, document representation plays a very large role, especially in retrieval, classification, clustering and routing.
1.3.2. Web data mining
a. Needs
The rapid growth of the Internet and intranets has produced an enormous volume of hypertext data (Web data). As the content and number of Web pages on the Internet change and grow by the day and by the hour, finding information becomes ever harder for users. The need to search for information in an unstructured database has developed largely alongside the Internet itself. With the Internet, people have become used to Web pages and a boundless supply of information. In recent years the Internet has become one of the main channels for science, economic information, commerce and advertising. One reason for this growth is the low cost of publishing a Web page on the Internet. Compared with other services such as buying, selling or advertising in a newspaper or magazine, a Web page costs far less and reaches millions of users all over the world much faster. The Web can be thought of as an encyclopaedia. The information on Web pages is diverse in both content and form. The Internet can be seen as a virtual society, containing information on every aspect of economic and social life, presented as text, images, sound, ...
However, this diversity and sheer quantity of information has given rise to the problem of information overload. People cannot find by themselves the addresses of the Web pages containing the information they need. A utility is therefore required that manages the content of Web pages and makes it possible to find the addresses of pages whose content matches the searcher's request. Such utilities manage data as unstructured objects. We are already familiar with several of them: Yahoo, Google, Altavista, ...
On the other hand, suppose we have Web pages on topics such as computing, sport, economy and society, and construction. Based on the content of the documents that customers view or download, classification tells us which content on our site customers concentrate on, so that we can add more documents on the topics they care about, and the reverse for topics they ignore. On the customer side, the analysis also tells us which topics a given customer focuses on, so that further support can be offered to that customer. Because of these practical needs, Web page classification and search remain interesting problems that call for further research.
b. Difficulties
The World Wide Web serves as a huge, widely distributed information centre that provides information on every field: science, society, commerce, culture, ... The Web is a rich resource for data mining. The following observations show that the Web poses great challenges for data mining technology.
1. The Web seems too large to be organised into a data warehouse for data mining
Traditional databases are not very large and are usually stored in one place. The Web, in contrast, is very large, reaching terabytes, changes continuously, and is moreover distributed across a great many computers all over the world. Studies of the size of the Web have given figures such as the following: there are currently more than one billion Web pages available to users on the Internet. Assuming an average page size of 5-10 KB, their total size is on the order of 10 terabytes. The growth rate of Web pages is truly impressive: the number of pages doubled in the last two years and will continue to grow in the next two. Many organisations and communities put most of their public information on the Web. Building a data warehouse to store, replicate or integrate the data on the Web is therefore close to impossible.
2. The complexity of Web pages is far greater than that of traditional text documents
The data in traditional databases is usually homogeneous (in language, format, ...), whereas Web data is entirely heterogeneous. Web data includes many different languages (both languages that express content and programming languages), many different formats (text, HTML, PDF, images, sound, ...) and many different kinds of vocabulary (email addresses, links, ZIP codes, telephone numbers).
In other words, Web pages lack a uniform structure. The Web is regarded as a vast digital library, yet the enormous number of documents in that library is not arranged according to any particular standard: not by category, title, author, page count or content. This is a very great challenge for finding the information one needs in such a library.
3. The Web is a highly dynamic information source
The Web changes not only in size: the information in the pages themselves is also updated continuously. According to one study of more than 500,000 Web pages over more than four months, 23% of the pages changed daily, and within about 10 days 50% of the pages in the domain had disappeared, meaning that their URLs no longer existed. News sites, stock markets, advertising companies and Web service centres update their pages regularly. In addition, link information and access records are also updated continually.
4. The Web serves a large and diverse user community
The Internet currently connects about 50 million workstations, and the user community is still expanding rapidly. Each user has different knowledge, concerns and interests. Most users, however, do not have a good understanding of the structure of the information network, or are not aware of what a search involves. They easily get "lost" while "groping" in the "darkness" of the network, or grow bored with searches that return only fragments of not very useful information.
5. Only a very small part of the information on the Web is truly useful
According to statistics, 99% of the information on the Web is useless to 99% of Web users. Yet the parts of the Web that are of no interest still get swept into the results returned by a search. How, then, should the Web be mined to obtain the highest-quality pages by the user's own criteria?
We can thus see the differences between searching in a traditional database and searching on the Internet. The challenges above have spurred research into mining and using the resources on the Internet.
c. Advantages
Alongside the challenges above, the Web also offers some advantages for Web mining.
1. The Web consists not only of pages but also of hyperlinks pointing from one page to another. When an author creates a hyperlink from his page to a page A, it means that A is useful for the topic under discussion. The more hyperlinks from other pages point to page A, the more important A is. The large amount of page link information therefore provides a wealth of information about the relevance, quality and structure of Web content, and is thus a major resource for Web mining.
2. A Web server usually registers a log entry (Weblog entry) for every access to a Web page. The entry includes the URL, the IP address and a timestamp. Weblog data provides rich information about dynamic Web pages. With the information on URLs, IP addresses and so on, a multidimensional view can be constructed on top of the Weblog database. Multidimensional OLAP analysis can then produce the top N users, the top N most accessed Web pages, the time periods with the most visitors and Web access trends.
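A very reduced version of this kind of Weblog analysis can be sketched by counting entries along each dimension. The log entries below are invented, and real server log formats differ. The function `top_n` is a hypothetical name, and the sketch is plain aggregation, not a full OLAP engine.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log entries (URL, IP address, timestamp); real Weblog formats vary.
log = [
    ("/index.html", "10.0.0.1", "2004-03-01 09:15:00"),
    ("/news.html",  "10.0.0.2", "2004-03-01 09:40:00"),
    ("/index.html", "10.0.0.1", "2004-03-01 14:05:00"),
    ("/index.html", "10.0.0.3", "2004-03-02 09:30:00"),
    ("/sport.html", "10.0.0.1", "2004-03-02 20:10:00"),
]

def top_n(entries, n):
    """Aggregate the log along three dimensions: user (IP), page (URL) and hour of day."""
    users = Counter(ip for _, ip, _ in entries)
    pages = Counter(url for url, _, _ in entries)
    hours = Counter(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour
                    for _, _, ts in entries)
    return users.most_common(n), pages.most_common(n), hours.most_common(n)

top_users, top_pages, busiest_hours = top_n(log, 1)
```

Each returned list holds `(value, count)` pairs, so `top_pages` gives the most accessed URL together with its number of hits.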
d. Areas of Web mining
Following the analysis above of the characteristics and content of HyperText documents, Web data mining concentrates on the components found in Web pages. These are:
1. Web content mining
Web content mining has two parts:
a. Web Page Content
This uses only the words in the document, without considering the links between documents. It is exactly text mining.
b. Search Result
Mining of search results. In a search engine, once the Web pages satisfying the user's request have been found, an equally important task remains: ordering the results by how close they are to the content being sought. This too is Web content mining.
2. Web Structure Mining
Mining based on the hyperlinks between related documents.
3. Web Usage Mining
a. General Access Pattern Tracking:
Analysing Web logs to discover users' access patterns on a Web site.
b. Customized Usage Tracking:
Analysing the access patterns of each user over time to learn the Web access trends of each group of users at different times.

[Figure: The areas of Web mining. Web Mining divides into Web Structure, Web Content (Web Page Content, Search Result) and Web Usage (General Access Pattern, Customized Usage).]
Chapter 2. SEARCH ENGINES
2.1. NEEDS
As mentioned in the previous chapter, the Internet is like a virtual society, containing information on every aspect of economic and social life, presented as text, images, sound, ... The information on Web pages is diverse in both content and form. This diversity and sheer quantity of information, however, gives rise to information overload. For each user only a very small part of the information is useful: some people, for instance, care only about sport or culture pages and have little interest in economics. People cannot find by themselves the addresses of the Web pages containing the information they need. A utility is therefore required that manages the content of Web pages and makes it possible to find the addresses of pages whose content matches the searcher's request. We are already familiar with several such utilities: Yahoo, Google, Altavista, ...
Search engines are systems built to accept users' search requests (usually a set of keywords), analyse them, search an existing database and return the results to the user as Web pages. Specifically, the user submits a query, in its simplest form a list of keywords, and the search engine returns a list of Web pages that are relevant to or contain those keywords. In more complex cases the query is an entire document, a passage of text or a summary of a document.
2.2. STRUCTURE AND OPERATION
2.2.1. Overview of current search systems
We take the Google search engine as a concrete example.
This section gives an overview of how the Google search system works. Later sections discuss the main applications (crawling, indexing, searching) and the data structures that this section does not cover.
Most of Google is implemented in C and C++ and runs well on Solaris or Linux. In Google, Web crawling (downloading Web pages) is done by several distributed Web crawlers. A URL server sends lists of URLs to be fetched to the crawlers. The Web pages that are fetched are sent to the store server. The store server compresses the pages and stores them in the Repository. Every Web page has an associated ID number called a DocID. The indexing function is performed by the Indexer and the Sorter. The Indexer performs the following functions: it reads the Repository, uncompresses the documents and parses them. Each document is converted into a set of word occurrences called Hits. Hits record the word, its position in the document, an approximation of the font size, and capitalisation. The Indexer distributes these Hits into a set of "Barrels". The Indexer performs another important function: it parses all the hyperlinks in every page and stores important information about them in an anchors file. This file contains enough information to determine which page each link points from and to, along with the text of the link.
In summary, the Crawler downloads Web pages and stores them in the repository. The Indexer reads the repository, uncompresses and parses the documents, encodes them as Hits and sorts these into "Barrels". It also parses all the hyperlinks and stores them in a file.
2.2.2. Structure of search systems
Current search engines are usually organised into the following three modules:
- Indexing module: crawls the Web pages on the Internet, analyses them and stores them in the database.
- Searching module: queries the databases to return the list of documents that satisfy a user request (in the form of a query consisting of a set of keywords).
- User interface module: takes the results from the searching module.
The following looks in detail at each module and its tasks.

Figure 2.3_Architecture of the Google search engine
a. Indexing module
The indexing module performs the following tasks:
1. Parsing documents and indexing all the keywords in each document (number of occurrences, positions of occurrence).
2. Building the link graph between hypertext documents (forward links and backward links).
3. Computing the PageRank importance of all documents based on the hyperlink structure (Google(TM)).
Each task is examined in detail below.
a.1. Following hyperlinks across the Web (Web Crawler)
Crawler(s): most search engines rely on programs called Crawlers, which supply the data (the Web pages) that the search engine works on. Crawlers are small programs belonging to the search engine that browse the Web. Their work resembles that of a person browsing the Web, following links to reach different pages. Crawlers are given initial URLs, parse the links found in the pages and pass that information to the crawler control. This control unit decides which links to visit next and sends the result back to the Crawlers (in some search engines the crawlers perform this function themselves). The Crawlers also pass the retrieved pages into the page repository, and continue visiting other Web pages on the Internet until resources are exhausted.
The Crawler module thus retrieves pages from the network and downloads them. The pages are then indexed by the indexing module and pushed into the database. This process repeats until the Crawler is told to stop. The control unit also decides which Web page is visited next.
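The crawl loop described above can be sketched without any network access by replacing the Web with a small in-memory dictionary. Everything in `FAKE_WEB` is invented. The breadth-first visiting order is only one possible policy for the crawler control, chosen here for simplicity.

```python
from collections import deque

# A tiny in-memory stand-in for the Web: URL -> URLs linked from that page (all hypothetical).
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}

def crawl(seed_urls, max_pages):
    """Visit pages breadth-first from the seeds and return the repository of fetched pages."""
    frontier = deque(seed_urls)   # URLs scheduled by the crawler control
    repository = {}               # URL -> extracted links (stands in for the page content)
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in repository or url not in FAKE_WEB:
            continue              # already fetched, or page unavailable
        links = FAKE_WEB[url]     # "download" the page and parse its links
        repository[url] = links
        frontier.extend(link for link in links if link not in repository)
    return repository

pages = crawl(["http://a.example/"], max_pages=3)
```

The `max_pages` limit plays the role of the stopping decision. A real crawler would replace the dictionary lookup with an HTTP request and would order the frontier by page importance.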
A standard search engine has to address two main issues in the crawler module:
- The number of Web pages is very large, so the Crawler cannot download all of them and selects only the "important" pages. Which pages count as important, and how is importance computed?
- Because the content of Web pages changes continuously, the crawler must regularly revisit the pages it has downloaded in order to pick up the changes. Moreover, pages change at different rates, so the crawler must consider carefully which pages need to be revisited and which can be skipped.
Issue 1: Importance
Given a Web page P, its importance can be computed in the following ways:
1. Given a query Q, the importance of P is defined as the "textual similarity" between P and Q. Represent P and Q as two n-dimensional vectors $v = (w_1, w_2, \dots, w_n)$, where $w_i$ stands for the i-th word of the vocabulary. For instance, $w_i$ can be the number of occurrences of the i-th word. The similarity between P and Q is the cosine of the angle between the two vectors. Call the importance obtained by this method IS(P).
2. A page that many other pages link to is more important, so one way of computing the importance of page P is to count the links pointing to P. Call the importance obtained by this method IB(P).
3. Importance can be computed from the URL itself. A Web page whose address ends in ".com" or contains the word "home" is considered more important. Call the importance obtained by this method IL(P).
4. A further method is to count the number of times users access the page within some period of time. Call the importance obtained by this method IU(P).
Finally, the importance of page P is a combination of the importance values computed in the ways above, in some proportion:
$IC(P) = k_1 \cdot IS(P) + k_2 \cdot IB(P) + k_3 \cdot IL(P) + k_4 \cdot IU(P)$ (where $k_1, k_2, k_3, k_4$ and the query Q are given)
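The combined measure can be sketched as follows. The cosine computation follows the definition of IS(P). The rule in `location_score` and the default weights `k` are arbitrary choices made for this illustration. IB(P) and IU(P) are passed in as plain numbers, which a real system would first normalise to a common scale.

```python
import math
from collections import Counter

def cosine_similarity(text_p: str, text_q: str) -> float:
    """IS(P): cosine of the angle between the word-count vectors of P and Q."""
    p, q = Counter(text_p.lower().split()), Counter(text_q.lower().split())
    dot = sum(p[t] * q[t] for t in p)
    norm = (math.sqrt(sum(c * c for c in p.values()))
            * math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0

def location_score(url: str) -> float:
    """IL(P): 1.0 if the URL ends in ".com" or contains "home", else 0.0 (a crude assumption)."""
    return 1.0 if url.rstrip("/").endswith(".com") or "home" in url else 0.0

def combined_importance(i_s, i_b, i_l, i_u, k=(0.4, 0.3, 0.1, 0.2)):
    """IC(P) = k1*IS(P) + k2*IB(P) + k3*IL(P) + k4*IU(P); the default k values are arbitrary."""
    return k[0] * i_s + k[1] * i_b + k[2] * i_l + k[3] * i_u

i_s = cosine_similarity("web mining and web search", "web mining")
i_l = location_score("http://shop.example.com/")
```

For the two short texts above, the vectors share the words "web" and "mining", giving a cosine of 3 divided by the square root of 14.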
Issue 2: Refreshing downloaded pages
There are two strategies for refreshing downloaded pages:
1. Uniform periodic refresh of all pages: the crawler revisits all pages at the same frequency f, regardless of how often they change. That is, all pages are treated equally however they change. Regular refresh here means that after, say, 10,000 pages have been downloaded, PageRank and the index of the words in each URL are recomputed.
2. Proportional refresh: the more a page changes, the higher its refresh frequency. Example: if pages $e_1, e_2, \dots, e_n$ change $k_1, k_2, \dots, k_n$ times respectively, the refresh frequency of $e_i$ is made proportional to $k_i$.
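The two refresh strategies can be compared with a short sketch that divides a fixed daily budget of revisits among the pages. The budget, the change counts and the function names are assumptions made for this illustration.

```python
def uniform_refresh(pages, visits_per_day):
    """Strategy 1: every page gets the same revisit frequency f, whatever its change rate."""
    f = visits_per_day / len(pages)
    return {page: f for page in pages}

def proportional_refresh(change_counts, visits_per_day):
    """Strategy 2: the revisit frequency of page e_i is proportional to its change count k_i."""
    total = sum(change_counts.values())
    return {page: visits_per_day * k / total for page, k in change_counts.items()}

# Hypothetical change counts k_i observed for pages e1..e3 over some period.
changes = {"e1": 6, "e2": 3, "e3": 1}
uniform = uniform_refresh(changes, visits_per_day=30)
proportional = proportional_refresh(changes, visits_per_day=30)
```

With the same budget of 30 visits, the uniform strategy gives each page 10 visits, while the proportional strategy shifts most visits to the page that changes most often.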
a.2. Indexing
The Indexer module examines all the words in every Web page stored in the page repository and records the URLs of the pages that contain each word. The result is a very large index table, and thanks to this table the module can supply the URLs of all matching pages on request. The two modules indexer and collection analysis in Figure 1 are responsible for building the various indexes over the downloaded Web pages. The Indexer module builds two basic kinds of index: a text (content) index and a structure (link) index. Using these two indexes and the Web pages in the repository, the collection analysis module builds a number of other useful indexes. Below we briefly describe some kinds of index, concentrating on their structure and how they are used.
Link index
To build the link index, the crawled portion of the Web is modelled as a graph of nodes and edges, in which the nodes are Web pages and the edges between nodes are the links between pages. The index is then built from the nodes and edges of that graph.

Figure 1.2_Graph illustrating the nodes (Hypertext documents) and the edges (links) in a collection of Hypertext documents
Usually, the structural information most commonly used by the search algorithms in retrieval systems is the information taken from linked pages, and it is the links themselves that give efficient access to this neighbourhood information. Small graphs of hundreds or even thousands of nodes can be represented by almost any data structure, but doing the same for a larger graph of millions of nodes is a major challenge.
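For a small graph, neighbourhood information can be kept in two dictionaries, one for outgoing links and one for incoming links. The sketch below is only an in-memory illustration with invented page names. It does not address the scaling problem for graphs with millions of nodes mentioned above.

```python
from collections import defaultdict

def build_link_index(edges):
    """Return (outlinks, inlinks): the pages each page points to, and the pages pointing to it."""
    outlinks, inlinks = defaultdict(set), defaultdict(set)
    for source, target in edges:
        outlinks[source].add(target)
        inlinks[target].add(source)
    return outlinks, inlinks

# Hypothetical crawled links as (source page, target page) pairs.
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p4", "p3")]
outlinks, inlinks = build_link_index(edges)
backlink_count = len(inlinks["p3"])   # number of pages linking to p3
```

The size of a page's incoming set is the backlink count that a link-based importance measure relies on.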
Text Index
Although link-based techniques are used to improve the quality and relevance of the results found, term-based retrieval (searching for the pages that contain the keywords) remains the primary method for identifying the Web pages relevant to a query. Indexing in support of term-based queries can be implemented with any traditional access method for searching over the full content of documents. Search engines use an inverted index to represent documents. The inverted index is the traditional choice of index structure for Web pages.
For example, suppose we have the following three documents:
document 1: computer science
document 2: computer is about live
document 3: to live or not to live
The index file is created as follows:
- Take all the words that occur in the three documents
- Store them in alphabetical order a, b, c, ...
- Store the information about each document (including document ID, URL, title, short description, ...)
The result is an inverted index file, which is a list of the following information:
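Term     | Doc ID | Position | URL | Title | Description
---------|--------|----------|-----|-------|------------
about    | 2      | 3        | ... | ...   | ...
computer | 1      | 1        | ... | ...   | ...
computer | 2      | 1        | ... | ...   | ...
is       | 2      | 2        | ... | ...   | ...
live     | 2      | 4        | ... | ...   | ...
live     | 3      | 2        | ... | ...   | ...
live     | 3      | 6        | ... | ...   | ...
not      | 3      | 4        | ... | ...   | ...
or       | 3      | 3        | ... | ...   | ...
science  | 1      | 2        | ... | ...   | ...
to       | 3      | 1        | ... | ...   | ...
to       | 3      | 5        | ... | ...   | ...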
A search algorithm, however, usually also uses information about how a term occurs in the Web page, for example whether the term is in bold (inside a <B> tag) or in a heading (inside an <H1> or <H2> tag). To incorporate this information, a new field called the payload is added, which encodes the additional information about the occurrences of terms in the document. This information is later used by the ranking algorithm.
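The construction of the inverted index for the three example documents can be sketched as below. Each posting holds only the document ID and the word position. The URL, title, description and payload fields are left out to keep the sketch short. The function name is an assumption of this illustration.

```python
from collections import defaultdict

documents = {
    1: "computer science",
    2: "computer is about live",
    3: "to live or not to live",
}

def build_inverted_index(docs):
    """Map each term to its postings: (document ID, 1-based word position) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id].lower().split(), start=1):
            index[term].append((doc_id, position))
    return dict(sorted(index.items()))   # terms in alphabetical order

inverted = build_inverted_index(documents)
for term, postings in inverted.items():
    print(term, postings)
```

Looking up `inverted["live"]` returns the postings for document 2 at position 4 and document 3 at positions 2 and 6, which matches the rows of the table above.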
Inverted index
The inverted index is stored as a database file of records. Building a database that holds the inverted index for a very large collection, such as the set of web pages on the Internet, requires a highly flexible distributed architecture. In the Web setting there are two basic strategies for partitioning the inverted index across a set of nodes so that it can be stored in a distributed fashion.
The first is the local inverted file (IFL).
In the IFL organisation, each node stores the inverted lists for a distinct subset of the pages held in the page repository. When a search request arrives, the query component broadcasts it to all nodes, and each node returns its own list of pages containing the search terms.
The second is the global inverted file (GFL).
In the GFL organisation, the inverted index is partitioned by term, so each query server stores the inverted lists for a subset of the terms in the collection. For example, in a system with two query servers A and B, A stores the inverted lists for all terms beginning with the letters a to o, and B stores those for the remaining terms, p to z. So when the query component looks for pages containing the word "people", it only needs to ask server B.
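The difference between the two partitioning strategies can be illustrated with a small routing sketch. The partition table and the function names below are hypothetical and only mirror the two-server example in the text.

```python
# Hypothetical partition table: (first letter, last letter, server name).
PARTITIONS = [("a", "o", "A"), ("p", "z", "B")]

def servers_for_term_gfl(term):
    """Global inverted file: only the server that owns the term's first letter is asked."""
    first = term[0].lower()
    return [name for low, high, name in PARTITIONS if low <= first <= high]

def servers_for_term_ifl(term, nodes=("A", "B")):
    """Local inverted file: every node indexes its own pages, so all nodes are asked."""
    return list(nodes)

print(servers_for_term_gfl("people"))
print(servers_for_term_ifl("people"))
```

With a global partition a single-term query touches one server, whereas with a local partition the query is always broadcast and the partial result lists have to be merged.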
Main data structures
The Indexer module takes the pages that the Crawler has downloaded into the Repository, indexes them, and stores the result in a database. The database is created during indexing. Its main structure, in most search engines, is as follows:
a. A keyword file of records, each with at least two fields: keyword ID and keyword (figure a). Keywords are established during indexing: read the document file, extract each keyword, and check whether it is already in the keyword file. If it is not, create a new record in the keyword file, which also assigns the keyword its ID. If it is, fetch its ID. The ID obtained is used when creating the next record.
b. A document file that manages the documents in the system, with one record per document and at least these fields: document ID, document name (URL), and the location on the system's machine of the stored document file (the cache of the web page) (figure b).
c. An occurrence file that records where keywords occur in documents. Each record has three fields: document ID, keyword ID, and the position of the keyword in the document (figure c). This is the inverted index file.
Database organisation: a hash structure keyed on the vocabulary terms.
Challenges
- Building an inverted index file involves pre-processing the pages into smaller parts, sorting them by term and position, and finally writing the sorted parts out as a set of inverted lists. Index build time is not critical. For a collection of web pages, however, some index files become hard to manage, demand large resources (such as memory), and often take a long time to complete. A comparison with traditional information retrieval systems shows that, for the system under study, the repository holds 40 million web pages. Although this is only 4% of the indexable Web, it is already larger than the standard information retrieval test collection (the TREC-7 collection), which is 100 GB.
- Because the content of web pages changes quickly, the index file has to be rebuilt to keep the pages fresh. Part of the Crawler's job is to update the downloaded pages, and the index files are rebuilt in parallel with this work.
- Finally, the storage format of the inverted index file must be designed carefully. A compressed index file can improve query performance because a larger part of the index fits in memory; the price is the time spent on decompression.
a.3. Computing PageRank
Search engines have two important features that help them return highly accurate results. First, they use the link structure of the Web to compute an importance value for each page (PageRank). Second, they use links to rank the results (Ranking). It is the map of links between web pages that allows PageRank to be computed quickly.
PageRank is defined as follows:
Suppose page A is pointed to by pages $T_1, T_2, \dots, T_n$. The parameter $d$ is a damping factor with a value between 0 and 1; we usually set $d = 0.85$. $C(A)$ is the number of links going out of page A. The PageRank of A is then:

$$PR(A) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)}\right)$$
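Because the PageRank of a page represents a probability distribution over the pages of a given collection, the PageRank values of all pages in the collection sum to 1. (With the constant term $(1-d)$ used above, this holds after each value is divided by the number of pages $N$, or equivalently when the constant term is taken as $(1-d)/N$.)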

[Figure 2.2: pages V_1, V_2, ..., V_m link to page U; each page V_i passes the share R_{V_i} / N_{V_i} of its rank to U.]
The computation is iterated until it converges. With d = 0.85, about 20 iterations are needed for a collection of a few million pages, and computing PageRank for 26 million web pages on a medium-sized workstation takes a few hours.
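The iteration can be sketched as follows. This is a minimal illustration of the formula PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) on a hypothetical three-page graph; the function name `pagerank`, the graph, and the fixed iteration count are assumptions, and the sketch assumes every page has at least one outgoing link and no duplicate links.

```python
def pagerank(links, d=0.85, iterations=20):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items()
                           if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

With this form of the formula the values sum to the number of pages; dividing each value by that number gives a distribution that sums to 1.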
2.3. DISADVANTAGES OF SEARCH ENGINES
1. They are automatic search systems in which the user plays no role in the search process; there is no mechanism for user feedback to update the search parameters and improve later searches.
2. All keywords are treated as equally important, so keywords cannot be given different weights. In large search engines such as Google or Yahoo, entering "System Information" returns all pages related to the two words "System" and "Information". If a user wants to search for "Computer Story" with "Computer" mattering more than "Story" (say, a weight of 0.8 for "Computer" and 0.2 for "Story"), a search system that supports this has to be built.
3. They do not take into account the nature of text processing, in particular synonymy and polysemy.
Many documents are relevant to the topic sought but do not contain the entered keywords, only synonyms of them; such documents are missed by the search.
Because most engines search by keyword over an index of web pages (index-based search engines), hundreds of documents may contain the entered keywords, so the engine returns a large number of documents, many of which are barely relevant or not relevant at all to the topic sought.
2.4. A NEW SEARCH PROBLEM
Every day an enormous number of people access the Internet, and nearly as many run searches on various search engines. If the information from each search were recorded, we would have a huge source of information that, used well, could do much useful work. Ordinary search engines merely satisfy the customer's information need and do not exploit the information the customer supplies with every search. Below is a proposed extension to search-engine functionality and a direction for solving it in the future.
Problem:
Based on the documents a customer views or downloads, analysis tells us which content customers tend to focus on within our collection of web pages, so that we can add more documents of the kinds customers are interested in, and vice versa. On the customer side, the analysis also tells us which topics each customer focuses on, so that we can offer additional support.
Approach:
Build a database of documents with a ClassificationID field that indicates which domain each document belongs to, based on a prior analysis (by classification).
Build a database of customers: before accessing the database, the customer is asked to register an account with name, age, address, and so on. We add two important fields, occupation and education level (assume the accuracy of this information is c%). Registration is optional for the customer. Then, each time the customer accesses the database, we record the documents accessed in the customer information table. Finally, from the information on the accessed documents and on the customers, a decision-tree algorithm generates rules that state which domain customers of a given occupation and education level are interested in, with confidence threshold c.
2.5. CONCLUSION
Chapter 3. THE CLASSIFICATION PROBLEM
3.1. PROBLEM STATEMENT
People naturally tend to divide things into different parts and classes. In the same way, a classification algorithm is simply a mapping from an existing database onto some specific range of values, based on one attribute or a set of attributes of the data.






Researchers consistently define text classification as the assignment of predefined topics to text documents based on their content. Text classification supports information retrieval, information extraction, text filtering, and the automatic routing of documents to predefined topics. Text classification uses supervised machine learning. The data set is split into a training set and a test set: a model is first built from the examples in the training set, and its accuracy is then checked on the test set.
The following figure gives a framework for text classification with three main stages. The first stage is document representation, that is, converting the text data into some structured form and gathering the given examples into a training set. The second stage applies machine learning techniques to the training examples just represented, so the representation produced in stage one is the input of stage two. The third stage adds extra knowledge supplied by the user to improve accuracy in document representation or in learning.
In stage two, many machine learning methods have been applied: Bayesian network models, decision trees, k-nearest neighbours, neural networks, SVM, ...
[Figure: input data -> classification algorithm -> Class 1, Class 2, ..., Class n]
3.2. DOCUMENT REPRESENTATION METHODS
3.2.1. Document representation methods in FullText databases
There are three typical FullText database models: the logic model, the syntactic model, and the vector model.
a. The syntactic analysis model
a.1. Storage rules:
- Every document must be parsed, and the parse returns detailed information about the document's topic.
- The topics of each document are then indexed. Indexing topics works like indexing documents, except that only the words that appear in the topic are indexed.
- Documents are managed through these topics so that they can be found on request; search queries are based on these topics.
a.2. Search rules:
Search queries are matched against the indexed topics, so the topics must be indexed first. Indexing topics works like indexing all the words in the topic.
The query itself can be parsed to yield a topic, and the search is then carried out on that topic.
The core component of a database system built on this model is therefore the system that parses documents and infers their content.
a.3. Advantages and disadvantages
Advantages
Once topics are available, searching by this method is quite effective and simple, because it is fast and accurate.
For languages with simple grammar, the analysis can reach a high, acceptable level of accuracy.
Disadvantages
The quality of such a system depends entirely on the quality of the parsing and content-inference system. In practice, building that system is very complex, depends on the characteristics of each language, and in most cases does not yet reach high accuracy.
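b. The logic model
In this model the meaningful words of a document are indexed, and the document's content is managed through those index entries.
b.1. Storage rules
- Each document is indexed as follows:
Collect the meaningful words in the documents, that is, the words that carry the main information about the stored documents.
Index the incoming documents against this keyword list. For each keyword in the list, store the positions at which it occurs in each document together with the name of the document that contains it.
For example, take two documents with IDs VB1 and VB2:

    Cộng hòa xã hội chủ nghĩa Việt Nam (VB1)
    Việt Nam dân chủ cộng hòa (VB2)

The resulting representation is an index that lists, for each term, the documents and positions where it occurs.
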
b.2. Search rules:
Search queries are expressed in logical form, that is, as a set of operators (AND, OR, ...) applied to words or phrases. The search uses the index table built earlier, and the result is the set of documents that satisfy all the conditions.
b.3. Advantages and disadvantages
Advantages
- Searching is fast and simple. Suppose we search for the word "computer". The system scans the index table to find the corresponding index entry, if "computer" exists in the system. This lookup is fast and simple when the index table has been sorted alphabetically beforehand: a binary search over the sorted table costs on the order of $\log_2 n$ comparisons, where n is the number of words in the index table (sorting the table beforehand costs on the order of $n \log_2 n$). The index entry found tells us which documents contain the word. A search involving k words therefore needs on the order of $k \log_2 n$ operations.
- Queries are fast to write and flexible.
Special characters can be used in a query without affecting the complexity of the search. For example, searching for "ta%" returns the documents containing "ta", "tao", "tay", ..., that is, words that begin with "ta".
The character % is called a wildcard character.
In addition, logical operators let the search words be organised flexibly. For example, to search for [tôi, ta, tao], the brackets [] express a search for any one of the words in the group. This is simply a flexible way of writing the OR operator of Boolean algebra, instead of writing "find documents that contain 'tôi' or 'ta' or 'tao'".
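Index for the two example documents VB1 and VB2 (each entry gives the document ID with the position of the term in parentheses):

Term    Document ID (position)
Cộng    VB1(1), VB2(5)
hòa     VB1(2), VB2(6)
xã      VB1(3)
hội     VB1(4)
chủ     VB1(5), VB2(4)
nghĩa   VB1(6)
Việt    VB1(7), VB2(1)
Nam     VB1(8), VB2(2)
dân     VB2(3)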
Disadvantages:
- The searcher needs expertise in the domain being searched.
Because the query is in logical form, the result is also logical (Boolean): a document is returned only if it satisfies every condition entered. To find documents by content, the user must therefore know the documents precisely.
- Indexing the documents is time-consuming and complex.
- The index tables take up storage space.
- The documents found are not ranked by relevance.
- The index tables are inflexible. When the vocabulary changes (additions, deletions, ...), the index has to change accordingly.
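The search rules of the logic model can be sketched over the two example documents VB1 and VB2. This is an illustrative sketch only: the function names are hypothetical, terms are lowercased and split on whitespace, positions are omitted, and the wildcard is restricted to the prefix form "ta%".

```python
import bisect

docs = {
    "VB1": "Cộng hòa xã hội chủ nghĩa Việt Nam",
    "VB2": "Việt Nam dân chủ cộng hòa",
}

# Term -> set of document IDs (positions are left out to keep the sketch short).
postings = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        postings.setdefault(term, set()).add(doc_id)
vocabulary = sorted(postings)  # sorted once so that lookups can use binary search

def lookup(term):
    """Exact-match lookup by binary search over the sorted vocabulary."""
    term = term.lower()
    i = bisect.bisect_left(vocabulary, term)
    if i < len(vocabulary) and vocabulary[i] == term:
        return set(postings[term])
    return set()

def search_and(*terms):
    """Documents that contain every term."""
    return set.intersection(*(lookup(t) for t in terms))

def search_or(*terms):
    """Documents that contain at least one term, as in the [tôi, ta, tao] form."""
    return set().union(*(lookup(t) for t in terms))

def search_prefix(prefix):
    """Wildcard query such as 'ch%': every term that starts with the prefix."""
    prefix = prefix.lower()
    result = set()
    i = bisect.bisect_left(vocabulary, prefix)
    while i < len(vocabulary) and vocabulary[i].startswith(prefix):
        result |= postings[vocabulary[i]]
        i += 1
    return result

print(search_and("việt", "dân"))
print(search_or("xã", "dân"))
print(search_prefix("ch"))
```

The result sets are unordered, which reflects the disadvantage noted above: the logic model returns matching documents without ranking them.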
c. The vector space model
c.1. Storage rules
One of the typical ways to represent documents in general is the vector space. In this representation, each document is represented by a vector. Each component of the vector corresponds to a distinct term of the original document collection (corpus) and is assigned a value given by a function f of the term's density in the document.
Documents can be represented with single words as terms and with f as the number of occurrences; this is called the bag-of-words representation.
For example, document vb1 is represented by a vector $V = (v_1, v_2, \dots, v_n)$, where $v_i$ is the number of occurrences of the i-th keyword $t_i$ in vb1.
Consider the following two documents:

    Document 1: Computer is not only computer
    Document 2: Computer is life

Term      Vector for document 1   Vector for document 2
computer  2                       1
is        1                       1
life      0                       1
not       1                       0
only      1                       0

There are many criteria for choosing f, which give rise to different weight values. Some of these criteria are described next.
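The Boolean model
Suppose a database contains m documents, $D = \{d_1, d_2, \dots, d_m\}$. Each document is represented as a vector over n terms $T = \{t_1, t_2, \dots, t_n\}$. Let $W = (w_{ij})$ be the weight matrix, where $w_{ij}$ is the weight of term $t_i$ in document $d_j$.
The Boolean model is the simplest model and is defined as follows:

$$w_{ij} = \begin{cases} 0 & \text{if } t_i \text{ does not occur in } d_j \\ 1 & \text{otherwise} \end{cases}$$
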
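For example, for the two documents "Computer is not only computer" (document 1) and "Computer is life" (document 2):

Term      Vector for document 1   Vector for document 2
computer  1                       1
is        1                       1
life      0                       1
not       1                       0
only      1                       0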

The frequency model
The frequency model sets the entries of the matrix $W = (w_{ij})$ to positive numbers based on the frequency of terms in a document or on the frequency of documents in the database. There are three common methods:
The term-frequency method (TF, Term Frequency)
The weight of a term is based on the number of times it occurs in the document. Let $tf_{ij}$ be the number of occurrences of term $t_i$ in document $d_j$; then $w_{ij}$ is given by one of:

$$w_{ij} = tf_{ij} \quad\text{or}\quad w_{ij} = 1 + \log(tf_{ij}) \quad\text{or}\quad w_{ij} = \sqrt{tf_{ij}}$$
The inverse-document-frequency method (IDF, Inverse Document Frequency)
The weight of a term is given by:

$$w_{ij} = \log\frac{m}{df_i} = \log(m) - \log(df_i)$$

where m is the number of documents and $df_i$ is the number of documents that contain term $t_i$.
The TF.IDF method
This method combines the TF and IDF methods. The weight matrix is computed as follows:

$$w_{ij} = \begin{cases} \left[1 + \log(tf_{ij})\right]\log\dfrac{m}{df_i} & \text{if } tf_{ij} \geq 1 \\ 0 & \text{if } tf_{ij} = 0 \end{cases}$$
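The three weighting schemes can be compared on the two example documents "Computer is not only computer" and "Computer is life". The sketch below is illustrative: the function names are hypothetical, and the natural logarithm is an assumption, since the text does not fix the base of the logarithm.

```python
import math
from collections import Counter

docs = ["computer is not only computer", "computer is life"]
tf = [Counter(d.split()) for d in docs]      # tf[j][t]: occurrences of term t in document j
vocab = sorted(set().union(*tf))             # the distinct terms, in alphabetical order
m = len(docs)                                # number of documents
df = {t: sum(1 for counts in tf if counts[t] > 0) for t in vocab}

def count_vector(j):
    """Bag-of-words weights: w = tf."""
    return [tf[j][t] for t in vocab]

def boolean_vector(j):
    """Boolean weights: 1 if the term occurs in the document, 0 otherwise."""
    return [1 if tf[j][t] else 0 for t in vocab]

def tfidf(term, j):
    """TF.IDF weight: (1 + log tf) * log(m / df) if tf >= 1, else 0."""
    f = tf[j][term]
    if f == 0:
        return 0.0
    return (1 + math.log(f)) * math.log(m / df[term])

print(vocab)
print(count_vector(0), count_vector(1))
print(boolean_vector(0), boolean_vector(1))
print([round(tfidf(t, 0), 3) for t in vocab])
```

Terms that occur in every document, such as "computer" and "is" here, receive a TF.IDF weight of zero because log(m / df) is zero for them.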
c.2. Search rules
A query is mapped to a vector $Q = (q_1, q_2, \dots, q_n)$ whose coefficients differ from term to term: the more significant a term is for the content sought, the larger its coefficient.
$q_i = 0$ when the term is not among the words sought.
$q_i \neq 0$ when the term is among the words sought, and the larger $q_i$ is, the more strongly the term relates to the content sought. The system therefore gives priority to documents that contain search terms with high coefficients.
Example: if the term "Machine" matters more than the term "Computer" in the content sought, we can set $q_k = 2$ and $q_h = 1$ in the vector Q, with $t_k$ = "Machine" and $t_h$ = "Computer".
Given a vocabulary, we can thus determine a vector for each document, and each query yields a corresponding vector with predetermined coefficients. Searching and management are carried out on these vectors.
Representing the content of documents and queries as vectors gives a new way of searching and storing Full-Text documents:
1. Each document is encoded as a vector.
2. The documents are classified according to these vectors.
3. Each query is also encoded as a vector.
The search is carried out by multiplying the query vector with the vector of each document in turn.
The result is every document that is relevant to the query.
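The search rule, multiplying the query vector with each document vector, can be sketched as follows. The vocabulary, the third document, and the function names are hypothetical and serve only to illustrate the "Machine" versus "Computer" weighting from the example.

```python
from collections import Counter

vocab = ["computer", "is", "life", "machine", "not", "only"]

def vectorize(text):
    """Encode a text as a vector of term counts over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

docs = {
    "d1": "computer is not only computer",
    "d2": "computer is life",
    "d3": "machine is not computer",   # hypothetical extra document
}

# Query vector: "machine" has weight 2, "computer" weight 1, every other term 0.
query = [2 if t == "machine" else 1 if t == "computer" else 0 for t in vocab]

def score(q, d):
    """Inner product of the query vector and a document vector."""
    return sum(qi * di for qi, di in zip(q, d))

ranking = sorted(docs, key=lambda name: score(query, vectorize(docs[name])), reverse=True)
print(ranking)
```

Because every document receives a numeric score, the results can be ordered by relevance, which the logic model cannot do.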
c.3. Advantages and disadvantages
Advantages
- The returned documents can be ranked by their relevance to the request, because the computation returns a relevance score for every document.
- Formulating queries is easy and does not require the searcher to have deep expertise in the subject.
- Storage and search are simpler than in the logic model. The searcher can choose how many of the most relevant documents to return.
Disadvantages
- Search is rather slow when the vocabulary is large, because it must compute over the vectors of all documents.
- Representing the vectors with natural-number coefficients improves search accuracy but slows the computation considerably, because the vector multiplications are carried out over natural or real numbers. Storing the vectors is also costly and complex.
- The system is inflexible in storing keywords. Even a small change to the vocabulary table requires either re-vectorising all stored documents or ignoring the newly added meaningful words in documents encoded earlier. Given the advantages of the model, however, this small error can be tolerated, because in practice the set of meaningful words is encoded fairly completely before the documents are encoded. The vector method therefore remains of interest and in use.
- A further disadvantage is the very high dimensionality of each vector, because the dimension equals the number of distinct words in the collection. The number of words may be $10^3$ to $10^5$ for a small collection and much larger for large collections, especially on the Web.
Remedy: several methods for reducing the vector dimension are in use. A simple and effective one is to remove stop words.
Stop words are words that express sentence structure rather than content, for example conjunctions and prepositions. Such words occur very often in documents but are unrelated to the topic and content. Removing them therefore reduces the dimension of the vectors without affecting search effectiveness.
Some examples of stop words:

Vietnamese   English
và           a
hoặc         the
cũng         do
             about
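Stop-word removal can be sketched in a few lines. The stop-word set below contains only the examples listed in the table, and the function name is hypothetical; a real system would use a much longer list.

```python
# Stop words taken from the example table above (not a complete list).
STOP_WORDS = {"a", "the", "do", "about", "và", "hoặc", "cũng"}

def remove_stop_words(text):
    """Return the lowercased tokens of `text` that are not stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]

print(remove_stop_words("The computer is about a life"))
```

Every removed word is one dimension fewer in each document vector.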
3.2.2. Document representation methods in HyperText databases
Chapter 1 described the difficulties of searching Web data and the differences between the structure of a traditional document and that of a HyperText document. Because of these difficulties, the way data is represented in search engines is very important. How should web pages be represented so that an enormous number of them can be stored and the search engine can search quickly and return accurate results to the user?
a. Representing HyperText documents in search engines (inverted index)
The Indexer module takes the pages that the Crawler has downloaded into the Repository, indexes them, and stores the result in a database. The database is created during indexing. Its main structure, in most search engines, is as follows:
- A keyword file of records, each with at least two fields: keyword ID and keyword. Keywords are established during indexing.
- A document file that manages the documents in the system, with one record per document and at least these fields: document ID, document name (URL), and the location on the system's machine of the stored document file (the cache of the web page).
- An occurrence file that records where keywords occur in documents. Each record has three fields: document ID, keyword ID, and the position of the keyword in the document.
Advantages: the positions of words are represented (we know in which kinds of tag a keyword appears, and whether it appears in the title or in the body), so information about the importance of keywords is kept.
Disadvantages: the frequency of keywords is not represented, so searching web pages by content is not supported.
b. Representing HyperText documents with the vector model
In his doctoral thesis, Sean Slattery [May 2002, CMU-CS-02-142] presents four ways of representing HyperText documents with the vector model.
Method 1
Ignore all link information between neighbouring documents and represent only the content of the document itself. This is the bag-of-words representation.
If the content of neighbouring documents is known to be entirely independent of the class, this representation is a good choice. In practice, neighbouring documents provide a good deal of information that is useful for classification, so this representation is not effective.
Method 2
The simplest way to use the content of neighbouring documents is to combine the content of the document with that of all its neighbours into a super_document. The components of the vector are then the frequencies of the keywords in the super_document.
The limitation of this representation is that it blurs the distinction between the document and its neighbours, which introduces noise during classification. It works well only when the linked documents share the topic of the document being classified.
Method 3
The vector is split into two parts: the first part represents the keywords in the document being classified, and the second part represents the keywords that appear in all of its neighbours.
This overcomes the weakness of the previous method by not diluting the target document with its neighbours. If the neighbours are useful for classification, their content is still readily accessible. The disadvantage is the large dimension of the vector.
Method 4
This representation works as follows:
- Determine the number of neighbouring pages in the hypertext collection under consideration; call this number d.
- Structure the vector in d+1 parts:
    The first part represents the document being classified directly.
    Parts 2 to d+1 represent the neighbouring documents, one part per neighbour.
The resulting vector is clearly very large and, moreover, does not follow a single rule: there are many ways to order the parts from the second onward. This variety makes it hard to select the training data.
From these representations we can make the following observations about representing HyperText documents with the vector model.
Advantages:
- It exploits the potential information in hyperlinks.
- It represents word frequency, so it supports searching for documents by similarity of content.
Disadvantages:
- It does not represent the positions of words, so it loses important information about keywords. For example, a keyword that appears in the title or in bold tags is more important than one that appears elsewhere.
- The dimension of the vector is very large.
c. Representing HyperText documents with the relational model
The relational model is a natural representation for HyperText documents. It is easy to construct a binary relation (the links between documents) whose first argument is the name of the document that contains the hyperlink and whose second argument is the name of the document pointed to.
a) What is a relation?
To understand the advantages of relational learning, we first compare it with propositional algorithms, which work on independent examples or entities. All that a propositional learner needs to know about a training example is the description of that example itself. Likewise, when it classifies an example, a propositional learner considers only the information of that example and ignores its relationships to other examples.
A relational representation includes the propositional representation (such as the vector model, bag of words, or set of words) together with information about the relationships between examples. For instance, if our training examples are people, a propositional representation describes only information such as each person's name, age, job, and salary, whereas a relational representation includes all of that plus further information such as employer-employee or marriage relationships.
A relational representation clearly offers the chance to search the whole rich space of relationships. If we believe that related examples can be a useful source of information for classifying an example, the relational representation is appropriate. If, on the other hand, related examples provide no additional information, the relational representation cannot do better than the propositional representation.
Relational representation for HyperText
The relations are:
link_to(page, page): this relation expresses the hyperlinks, that is, the link structure between the pages of the whole Web collection. That page 15 contains a hyperlink to page 37 is written link_to(page15, page37).
has_word(page): this relation gives information about the content of each web page. Only the words of interest (those later chosen as keywords) are represented. For example, has_computer(A) means that page A contains the word "computer".
Negation can also be expressed: not(link_to(page15, page37)) means that page15 does not link to page37, and not(has_computer(A)) means that page A does not contain the word "computer".
Example: consider two web pages A and B. Page A contains the words List, Vector, and Common; page B contains the words Jame, Paul, and Link, and links to page A.
Suppose A is a student home page in the set of web pages of a university.
Page A is then represented as follows:
A :- has_common(A), has_list(A), has_vector(A), link_to(B,A), has_jame(B), has_link(B), has_paul(B), not(has_home(A))
In words, this translates to the following rule: a page that contains the keywords list, vector, and common but not the keyword home, and that is linked to by a page containing the words jame, paul, and link, is a student home page.
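3.3. MACHINE LEARNING METHODS
3.3.1. The Bayes classification algorithm
The Bayes classifier is one of the most typical classification algorithms in data mining and knowledge discovery. Its main idea is to compute the posterior probability that event c belongs to class x from the prior probabilities of c belonging to class x under condition T:

$$p(c \mid x, \tau) = \sum_{T \in \tau} p(c \mid x, T)\, p(T \mid x)$$

Let V be the set of all vocabulary words.
Suppose there are n document classes $C_1, C_2, \dots, C_n$.
Each class $C_i$ has a probability $p(C_i)$ and a threshold $CtgTsh_i$.
Let $p(C \mid Doc)$ be the probability that document Doc belongs to class C.
Given a class C and a document Doc, if the computed probability $p(C \mid Doc)$ is greater than or equal to the threshold of C, then Doc is assigned to class C.
Doc is represented as a vector whose size is the number of keywords in the document. Each component holds a word of the document and the frequency of that word in the document. The algorithm works on the vocabulary V, the vector that represents Doc, and the documents already in each class; it computes $p(C \mid Doc)$ and decides which class Doc belongs to.
The probability $p(C \mid Doc)$ is computed by formula (1):

$$p(C \mid Doc) = \frac{p(C)\prod_{F_j \in V} P(F_j \mid C)^{TF(F_j \mid Doc)}}{\sum_{i=1}^{n} p(C_i)\prod_{F_j \in V} P(F_j \mid C_i)^{TF(F_j \mid Doc)}} \qquad (1)$$

with:

$$P(F_j \mid C) = \frac{1 + TF(F_j \mid C)}{|V| + \sum_{i=1}^{|V|} TF(F_i \mid C)}$$
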
where:
$|V|$: the number of words in the set V
$F_j$: the j-th keyword of the vocabulary
$TF(F_j \mid Doc)$: the frequency of the word $F_j$ in document Doc (including synonyms)
$TF(F_j \mid C)$: the frequency of the word $F_j$ in class C (the number of times $F_j$ occurs in all documents of class C)
$P(F_j \mid C)$: the conditional probability that the word $F_j$ occurs in a document of class C
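The formula for $P(F_i \mid C)$ uses the Laplace probability estimate. The 1 in its numerator avoids a zero value when the frequency of the word $F_i$ in class C is 0, that is, when $F_i$ does not occur in class C.
To reduce the complexity and the time of the computation, note that a given document Doc does not contain every word of the vocabulary V. Hence $TF(F_i \mid Doc) = 0$ when $F_i$ belongs to V but does not occur in Doc, and so $P(F_i \mid C)^{TF(F_i \mid Doc)} = 1$. Formula (1) can therefore be rewritten as:

$$p(C \mid Doc) = \frac{p(C)\prod_{F_j \in Doc} P(F_j \mid C)^{TF(F_j \mid Doc)}}{\sum_{i=1}^{n} p(C_i)\prod_{F_j \in Doc} P(F_j \mid C_i)^{TF(F_j \mid Doc)}} \qquad (2)$$

with:

$$P(F_j \mid C) = \frac{1 + TF(F_j \mid C)}{|V| + \sum_{i=1}^{|V|} TF(F_i \mid C)}$$

Classification therefore does not rely on the whole vocabulary but only on the keywords that occur in document Doc.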
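The classifier can be sketched as a multinomial naive Bayes model with the Laplace estimate, restricted to the words that occur in the document as described above. The training data, the function names, and the threshold value 0.6 are hypothetical, and synonym handling is left out.

```python
from collections import Counter

def train(labelled_docs):
    """Return the class priors p(C), the per-class term counts TF(F | C), and the vocabulary V."""
    class_counts = Counter(label for label, _ in labelled_docs)
    term_counts = {c: Counter() for c in class_counts}
    for label, text in labelled_docs:
        term_counts[label].update(text.lower().split())
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}
    vocab = {t for counts in term_counts.values() for t in counts}
    return priors, term_counts, vocab

def posterior(text, priors, term_counts, vocab):
    """p(C | Doc), using only the vocabulary words that occur in Doc."""
    doc_tf = Counter(t for t in text.lower().split() if t in vocab)
    scores = {}
    for c, prior in priors.items():
        denom = len(vocab) + sum(term_counts[c].values())
        score = prior
        for term, freq in doc_tf.items():
            # Laplace estimate: the 1 in the numerator avoids zero probabilities.
            score *= ((1 + term_counts[c][term]) / denom) ** freq
        scores[c] = score
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}

# Hypothetical training data.
training = [
    ("sport", "football match goal"),
    ("sport", "goal team match"),
    ("tech", "computer software code"),
    ("tech", "computer network code"),
]
priors, term_counts, vocab = train(training)
p = posterior("computer code goal", priors, term_counts, vocab)
CTG_TSH = 0.6  # hypothetical threshold, the same for every class
print(p, [c for c, value in p.items() if value >= CTG_TSH])
```

For long documents the product of many small probabilities can underflow; summing logarithms instead is the usual remedy, at the cost of departing from the literal form of the formula.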
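3.3.2. The k-nearest-neighbour algorithm
This algorithm does not rely on the vocabulary. It still uses the thresholds CtgTsh, however, and follows the steps described above: it picks k documents at random and computes the probability $p(C \mid Doc)$ from the similarity between document Doc and the k chosen documents. The probability $p(C \mid Doc)$ is computed by formula (3):

$$p(C \mid Doc) = \frac{\sum_{j=1}^{k} Sm(Doc, D_j)\, P(C \mid D_j)}{\sum_{i=1}^{n}\sum_{j=1}^{k} Sm(Doc, D_j)\, P(C_i \mid D_j)} \qquad (3)$$
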
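where:
n: the number of classes
k: the number of documents chosen for comparison
$P(C_i \mid D_j)$: takes the value 0 or 1 and indicates whether document $D_j$ belongs to class $C_i$. This value is needed because a document may belong to more than one class.
$Sm(Doc, D_j)$: the similarity between document Doc and the chosen document $D_j$, computed as the cosine of the angle between the two vectors that represent Doc and $D_j$:

$$Sm(Doc, D_j) = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i^2}\,\sqrt{\sum_i Y_i^2}} \qquad (4)$$
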
Documents are represented in this algorithm exactly as in the Bayes classification algorithm above, that is, by keywords $F_i$ and their corresponding frequencies $X_i$.
In formula (4):
$X_i$ is the frequency of the i-th keyword (counting the synonyms that occur) in document Doc.
$Y_i$ is the frequency of the i-th keyword (counting the synonyms that occur) in document $D_j$.
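The similarity-weighted vote can be sketched as follows. The neighbour documents and their labels are hypothetical, the k documents are passed in explicitly rather than drawn at random, synonyms are not handled, and the function names are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two term-frequency vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_posterior(doc, neighbours, classes):
    """Similarity-weighted class scores, normalised over all classes.

    `neighbours` is a list of (text, set_of_labels); a document may carry several labels.
    """
    x = Counter(doc.lower().split())
    sims = [(cosine(x, Counter(text.lower().split())), labels)
            for text, labels in neighbours]
    raw = {c: sum(s for s, labels in sims if c in labels) for c in classes}
    total = sum(raw.values())
    return {c: (v / total if total else 0.0) for c, v in raw.items()}

# Hypothetical chosen documents (k = 3).
neighbours = [
    ("computer code", {"tech"}),
    ("goal match", {"sport"}),
    ("computer network", {"tech"}),
]
p = knn_posterior("computer code", neighbours, ["tech", "sport"])
print(p)
```

A document is then assigned to every class whose score reaches that class's threshold CtgTsh, as in the Bayes classifier.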
3.3.3. Classification with decision trees
Decision-tree learning is a widely used method for inductive learning from a large sample. It approximates target functions with discrete values. A decision tree can also be converted into an equivalent representation as a set of if-then rules. Among decision-tree learning algorithms, ID3 and C4.5 are the two best known. The ID3 algorithm is as follows.
ID3(Examples, Target_attribute, Attributes)
1. Create a root node Root for the decision tree.
2. If all Examples are positive, return the single-node tree Root with label +.
3. If all Examples are negative, return the single-node tree Root with label -.
4. If Attributes is empty, return the single-node tree Root labelled with the most common value of Target_attribute in Examples.
5. Otherwise Begin
    5.1. A <= the attribute from Attributes that best classifies Examples
    5.2. The decision attribute for Root <= A
    5.3. For each possible value v_i of A
        5.3.1. Add a new branch below Root for the test A = v_i.
        5.3.2. Let Examples_vi be the subset of Examples that have value v_i for A.
        5.3.3. If Examples_vi is empty
            - then below this new branch add a leaf node labelled with the most common value of Target_attribute in Examples
            - otherwise below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A}).
End
Return Root.
The best attribute is the one with the highest information gain.
Decision-tree learning is very effective because it can work with a large number of attributes and, moreover, a system of learned rules can be extracted from the decision tree.
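A compact Python sketch of ID3 with information gain is given below. It is an illustration under simplifying assumptions: examples are dictionaries, branches are created only for attribute values that actually occur in the examples (so the empty-subset case of step 5.3.3 never arises), and the sample data on occupation and education level is hypothetical.

```python
import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the target attribute over a list of example dictionaries."""
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(examples, attr, target):
    """Reduction in entropy obtained by splitting the examples on `attr`."""
    n = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    """Return a nested dict {attribute: {value: subtree or label}} or a bare label."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:           # all examples share one label
        return labels[0]
    if not attributes:                  # no attribute left: most common label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, target, remaining)
    return tree

# Hypothetical customer records: occupation and education level versus topic of interest.
examples = [
    {"job": "student", "level": "bachelor", "topic": "tech"},
    {"job": "student", "level": "master", "topic": "tech"},
    {"job": "teacher", "level": "master", "topic": "science"},
    {"job": "teacher", "level": "bachelor", "topic": "science"},
]
tree = id3(examples, "topic", ["job", "level"])
print(tree)
```

Each path from the root to a leaf of the returned tree reads directly as an if-then rule.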
3.3.4. The relational learning algorithm FOIL
a. Horn clauses
A Horn clause is a clause with at most one positive literal. It has the following form:
H \/ (¬L1) \/ (¬L2) \/ ... \/ (¬Ln)
where H, L1, L2, ..., Ln are positive literals and ¬L1, ¬L2, ..., ¬Ln are negative literals.
Written as a rule:
(L1 ^ L2 ^ ... ^ Ln) => H. This form is called a first-order rule.
L1, L2, ..., Ln are called the preconditions and H is called the conclusion.
Examples of first-order rules:
If Parents(x,y) then Ancestor(x,y)
If (Parents(x,z) ^ Ancestor(z,y)) then Ancestor(x,y).
Here Parents and Ancestor are called predicates.
b. The FOIL algorithm
FOIL was proposed and developed by Quinlan (Quinlan, 1990). FOIL learns from data sets that contain only two classes, positive examples and negative examples, and it learns a description of the positive class. Its input consists of preconditions and conclusions, and its output is a set of rules generated from them. At each step FOIL adds one literal to the preconditions of the rule being learned, using the function Foil_Gain to choose a literal from the set of candidate literals.
FOIL is a non-incremental learner that uses hill climbing with a metric based on information theory to build rules that cover the data. FOIL has two main stages:
1. Separate stage: begin a new clause.
2. Conquer stage: conjoin literals to build the body of the clause.
The separate phase of the algorithm starts a new rule, while the conquer phase builds a conjunction of literals as the body of the rule. Each rule describes some subset of the positive examples and no negative examples. Note that FOIL has two operators: start a new rule with an empty body, and add a literal to the current rule. FOIL stops adding literals when no negative example is covered by the rule, and it keeps starting new rules until every positive example is covered by some rule.
The positive examples covered by a clause are removed from the training set, and the process continues to learn further clauses from the remaining examples; it ends when no positive examples remain.
The top-level design of FOIL is as follows:
1. Let POS be the set of positive examples.
2. Let NEG be the set of negative examples.
3. Set NewClauseBody to empty.
4. While POS is not empty do:
    Separate: (begin a new rule)
    5. Remove from POS all examples that satisfy NewClauseBody.
    6. Reset NEG to the original set of negative examples.
    7. Reset NewClauseBody to empty.
    While NEG is not empty do:
        Conquer: (build the clause body)
        8. Choose a literal L.
        9. Conjoin L to NewClauseBody.
        10. Remove from NEG the examples that do not satisfy L.
FOIL uses hill climbing to add to a rule the literal with the highest information gain. For each variabilisation of a predicate P, FOIL measures the information gain. To select the literal with the highest gain, it needs to know how many of the current positive and negative tuples are satisfied by the variabilisations of each extensionally defined predicate.
FOIL's information gain is computed as:

$$Gain(L) = T^{++}\left(\log_2\frac{P_1}{P_1 + N_1} - \log_2\frac{P_0}{P_0 + N_0}\right)$$

$P_0$ and $N_0$ are the numbers of positive and negative examples before literal L is added to the clause.
$P_1$ and $N_1$ are the numbers of positive and negative examples after literal L is added to the clause.
$T^{++}$ is the number of positive examples covered both before and after the literal is added (that is, the number of examples that satisfy both rule R and rule R', where R' is R with literal L added).
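The gain formula can be written directly as a function. The function name `foil_gain` and its argument order are hypothetical. The counts in the usage line are an illustration worked out by hand for a rule Granddaughter(x,y) over four constants, assuming one positive and 15 negative bindings before a literal Father(y,z) is added and one positive and 11 negative bindings afterwards.

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """FOIL gain: t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0))).

    p0, n0: positive and negative bindings before the literal is added
    p1, n1: positive and negative bindings after the literal is added
    t:      positive bindings of the old rule still covered by the new rule
    """
    if p0 == 0 or p1 == 0:
        return 0.0  # nothing positive is covered, so the literal brings no gain
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

gain = foil_gain(p0=1, n0=15, p1=1, n1=11, t=1)
print(round(gain, 3))
```

FOIL evaluates this quantity for every candidate literal and adds the one with the largest value to the rule body.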
The following example illustrates the FOIL algorithm.
We want to learn the relation Granddaughter(x,y) from the predicates Granddaughter, Father, Male, Female and the constants Victor, Sharon, Bob, Tom.
Examples: assertions over the predicates Granddaughter, Father, Male, Female and the constants Victor, Sharon, Bob, Tom, in which the positive examples are Granddaughter(Victor, Sharon), Father(Sharon, Bob), Father(Tom, Bob), Female(Sharon), Father(Bob, Victor). All remaining assertions are negative (for instance ¬Granddaughter(Tom, Bob), ¬Father(Victor, Victor), ...).
To choose the literals for the rule, FOIL considers the different bindings of the variables x, y, z, t to these constants. At the start the rule is just:
- Step 1:
Initial rule: Granddaughter(x,y)
The binding {x/Victor, y/Sharon} gives a positive example, because Granddaughter(Victor, Sharon) is true in the training data.
The other 15 bindings correspond to negative examples, because no matching assertion is found in the training set.
- At each subsequent stage, the rule is built from the set of bindings that yield the positive and negative examples. Each time a literal is added to the rule, the sets of positive and negative examples change. For example, if the next literal added to the rule is Father(y,z), then instead of the binding {x/Victor, y/Sharon} above, the binding {x/Victor, y/Sharon, z/Bob} now corresponds to a positive example. At each step the numbers of positive and negative examples are counted to obtain the information gain Foil_Gain(L,R).

CHAPTER 4. EXPERIMENTAL SYSTEM
4.1. RELATED RESEARCH
The experimental system builds on the combined strengths of the solutions in earlier research on search and text classification. The content and results of that research are summarised below.
1. [Sean Slattery (May 2002, CMU-CS-02-142)] Doctoral thesis "HyperText Classification"
In his doctoral thesis, the author compares machine learning algorithms applied to web page classification together with their corresponding representations:
1. Naive Bayes with the bag-of-words document representation.
2. k-nearest neighbours with the frequency model (TF-IDF) for representing web pages.
3. The FOIL algorithm with a set-of-words representation for each document (ignoring the links in each document).
4. The FOIL algorithm with a set-of-words representation that also takes the link information in the documents into account.
The author implemented and tested these approaches and reported results, with recall and precision as the evaluation criteria.

Approach 4 performed best, with markedly higher recall and precision.
The author then built a new HyperText classifier that uses the FOIL-PILFS algorithm with a relational document representation.
2. [on Sn] Lun vn thc s Phng php s dng Logic m v ng dng trong
khai ph d liu FullText
Trong lun vn ny, tc gi thc hin phn lp vn bn s dng cch biu din vn
bn bng phng php s dng Logic m v ng dng thut ton hc cy quyt nh.
Vi cch gii quyt bi ton nh vy cho ta thy mt s u im: S dng cc
khi nim m lm gim s chiu ca cc thuc tnh, dn n lm gim thi gian
tnh ton khi hc cy quyt nh.
Tuy nhin cch biu din ny cn c mt s mt hn ch, l vic con ngi c
th s tn nhiu cng sc cho vic xy dng ch , cc khi nim v mi lin quan
gia chng.
3. [Bùi Quang Minh] The Vietseek search engine. Research report of a special scientific project at the level of Vietnam National University, Hanoi (ĐHQGHN), code QG 02-02.
In the Vietseek search engine, documents are organized into a database. Vietseek builds all three kinds of index (TextIndex, StructureIndex and UtilityIndex). The Vietseek database is divided into two parts:
Part 1: Data about Web documents, domains and words, stored in the tables of a MySQL database.
Part 2: Index data, stored separately with its own structure. Because this part requires high speed, it is not kept in the MySQL database but in 300 separate binary files.
Vietseek searches by the phrase entered and returns the documents that contain the key phrases; it does not yet perform classification.
4. [Phạm Thị Thanh Nam] Master's thesis "Some solutions to the search problem in hypertext databases".
Starting from the index database already built by Vietseek, the author constructs vectors that represent the Web pages; each vector component is the frequency of a keyword in the document under consideration.
The thesis proposes several algorithms:
- Listing the Web pages "closest in meaning" to a given Web page or search phrase, according to a "content closeness" criterion. Content closeness is obtained by comparing the representation vectors with one another.
- Computing the importance of a Web page from its links to other Web pages and from the frequency of the search keywords in the page.
- Combining content closeness and page importance into a single criterion called the combined value. Results are displayed in order of combined value.
Remarks
Although the first work [Seán Slattery] gives a fairly broad overview of classification methods and analyzes some experimental results, none of the four works above really addresses the design and implementation of refined solutions to the problems of synonyms and multiple languages in a classification system for Web databases. Surveying solutions to these problems and implementing them experimentally is therefore a worthwhile piece of research.
Several typical algorithms exist for the classification problem in text databases. Implementing some of these algorithms and evaluating their performance on a real Web database (about ten thousand pages) can be seen as a necessary first step in building and developing Vietnamese search engines.
4.2. A PROPOSED DATABASE ORGANIZATION AND THE ALGORITHMS APPLIED
From the hypertext representation methods that have been used and studied so far, we can make the following general observation. The representation used in search engines has the advantage of exploiting important information about the positions at which keywords occur, so that the pages found can be ranked by how close they are to the content of the query keywords. It does not, however, appear to consider how often the keywords occur in the document, so searching by content is hard to carry out.
The vector-model representation of Seán Slattery [2002], on the other hand, ignores the positions at which keywords occur, which is very important information for text classification. Moreover, with representation 2 the original document to be classified is diluted within the set of documents related to it, and classification loses accuracy, especially when the related documents do not share its topic. With representations 3 and 4 the vector dimension becomes very large and many components are repeated (the words that occur again and again across the set of related documents).
Given these strengths and weaknesses, I propose a representation of my own. The main idea is still based on the vector model, while the keyword file is built so that synonyms are taken into account.
4.2.1. Problem statement
A set of hypertext documents is given, organized into classes; each class contains documents (in *.html form) of the same category. Build a system with the following function:
Read a new document and have the system assign it to an appropriate class.
4.2.2. Document representation:
Use the frequency-based vector model, taking into account the importance of the positions at which keywords occur, together with the links between pages.
The vector for a Web page A is built as follows:
- For each Web page A, collect the Web pages that link to A and the pages that A points to.
- Count the occurrences of each keyword in A and in the pages related to A. Let count[i] be the number of occurrences of the i-th keyword in the vector representing page A.
If keyword i occurs in the body tag (<body></body>), increase count[i] by 1.
If keyword i occurs in the title tag (<title></title>), increase count[i] by 3.
After page A has been counted, multiply count[i] by 3 (the weight of the document being represented), then continue counting in the linked pages, using the same position weights as in document A; the weight of each related document is 1.
In summary, this representation combines several kinds of information: the incoming and outgoing links of the hypertext document; the neighboring documents, while still giving the original document a higher weight; the number of times each keyword occurs in the document; and the positions at which the keywords occur.
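The weighting rules of this section can be sketched as follows. The sketch assumes each page has already been split into a list of title words and a list of body words; the function names, constants and sample pages are illustrative and not taken from the original system.

```python
TITLE_WEIGHT = 3     # an occurrence inside <title>
BODY_WEIGHT = 1      # an occurrence inside <body>
OWN_DOC_WEIGHT = 3   # weight of page A itself
NEIGHBOR_WEIGHT = 1  # weight of each page linked to or from A

def page_count(title_words, body_words, keyword):
    """Position-weighted count of one keyword inside a single page."""
    return (TITLE_WEIGHT * title_words.count(keyword)
            + BODY_WEIGHT * body_words.count(keyword))

def weighted_count(page, neighbors, keyword):
    """count[i] for page A: A's own count times 3, plus each neighbor's count times 1.

    page and each neighbor are (title_words, body_words) pairs of word lists.
    """
    total = OWN_DOC_WEIGHT * page_count(*page, keyword)
    for neighbor in neighbors:
        total += NEIGHBOR_WEIGHT * page_count(*neighbor, keyword)
    return total

# Hypothetical pages, already split into lowercase words.
page_a = (["football", "news"], ["football", "match", "football"])
linked = [(["sport"], ["football", "league"])]
count_football = weighted_count(page_a, linked, "football")
print(count_football)
```

In this example page A scores 3 for the title occurrence plus 2 for the two body occurrences, which is multiplied by 3 to give 15; the single linked page adds 1, for a final count of 16.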
4.2.3. Database design.
The hypertext documents are encoded into 3 tables in an Access database.
1. Table 1: the keyword table (KeyWords)

Field Name     Data Type     Description
KeyWordID      AutoNumber    Keyword ID
KeyWord        Text          Keyword
Synonymous     Memo          Synonyms of the keyword

Keyword (KeyWord): the content is an English word, so it must satisfy the following conditions. An English word is treated as a single syllable, that is, one string of the characters a-z, A-Z. Words in a sentence are separated by spaces or by any other characters (period, comma, colon, ...) that are not in a-z, A-Z.
Synonyms (Synonymous): a Memo field of the form (word1, word2, ..., wordn). The synonyms therefore share the same ID (KeyWordID) as the keyword.
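The word rule and the shared keyword ID can be illustrated with the sketch below. It is only an approximation under stated assumptions: words are lowercased before comparison (the text does not say whether matching is case-sensitive), the Synonymous field is read as a plain string, and the function names and sample rows are hypothetical.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")

def tokenize(text: str) -> list[str]:
    """Split text into words: maximal runs of a-z / A-Z, lowercased (assumption)."""
    return [w.lower() for w in WORD_RE.findall(text)]

def build_synonym_index(rows):
    """Map every keyword and each of its synonyms to the keyword's ID.

    rows: iterable of (KeyWordID, KeyWord, Synonymous), where Synonymous
    is a string such as "(word1, word2)" or an empty string.
    """
    index = {}
    for keyword_id, keyword, synonymous in rows:
        index[keyword.lower()] = keyword_id
        for synonym in tokenize(synonymous):
            # Keep the first ID seen if a synonym is listed twice.
            index.setdefault(synonym, keyword_id)
    return index

# Hypothetical rows of the KeyWords table.
rows = [(1, "football", "(soccer)"), (2, "culture", "")]
index = build_synonym_index(rows)
ids = [index.get(w) for w in tokenize("Soccer, culture; and Football!")]
print(ids)
```

Words that are neither a keyword nor a synonym map to None here, so "soccer" and "Football" both resolve to keyword 1 while "and" is ignored.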

2. Table 2: the document table (Documents)

Field Name     Data Type     Description
DocID          AutoNumber    Document ID
DocName        Text          Document name
CacheAdd       Text          Cache address
Vector         Memo          Vector representing the document

Vector: a Memo field; each vector has the form:
(ID of keyword 1, number of occurrences in the title, total number of occurrences in the document); (ID of keyword 2, number of occurrences in the title, total number of occurrences in the document); ...
The number of components of a vector is the number of keywords that occur in the Web page being represented, not the whole set of keywords in the KeyWords table, so the dimension of the vector is greatly reduced. Each component records how many times a keyword occurs in the document and where it occurs.
Example: the vector (1,1,4);(2,1,4);(4,2,7) means that keyword 1 occurs 4 times, 1 of them in the title; keyword 2 occurs 4 times, 1 of them in the title; and keyword 4 occurs 7 times, 2 of them in the title.

DocID    Cache Address              Vector
1        C:\data\sport\s1.htm       (1,1,4); (3,1,4); (4,2,7); ...
2        C:\data\sport\s2.htm       (1,2,7); (2,1,4); (3,2,8); ...
3        C:\data\culture\ct3.htm    (1,2,6); (5,1,4); (7,2,7); ...
4        C:\data\culture\c4.htm     (2,1,4); (3,1,4); (4,2,7); ...
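The Memo-field format of the Vector column can be read and written with a few lines of code. The sketch below assumes the exact "(id,title,total);..." layout described in this section; the function names are hypothetical.

```python
import re

TRIPLE_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)")

def parse_vector(text: str) -> dict[int, tuple[int, int]]:
    """Parse "(id,title,total);..." into {keyword_id: (title_count, total_count)}."""
    return {int(k): (int(t), int(n)) for k, t, n in TRIPLE_RE.findall(text)}

def format_vector(vector: dict[int, tuple[int, int]]) -> str:
    """Serialize back to the Memo-field form, ordered by keyword ID."""
    return ";".join(f"({k},{t},{n})" for k, (t, n) in sorted(vector.items()))

vector = parse_vector("(1,1,4);(2,1,4);(4,2,7)")
print(vector[4])              # title and total counts for keyword 4
print(format_vector(vector))
```

Storing only the keywords that actually occur keeps each record short, at the cost of parsing the string every time two documents are compared.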

3. Table 3: the links between documents (LINKS)

Field Name     Field Type    Description
DocID1         Number        ID of the document that contains the link
DocID2         Number        ID of the document that is linked to

DocID1 holds the IDs of the documents that link to the documents whose IDs are in DocID2.
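As a rough illustration of how the three tables above fit together, the sketch below recreates them in SQLite. This is a stand-in only: the original system uses Access, and the UNIQUE and REFERENCES constraints, the SQL type names and the sample rows are assumptions added for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE KeyWords (
    KeyWordID  INTEGER PRIMARY KEY AUTOINCREMENT,
    KeyWord    TEXT NOT NULL UNIQUE,
    Synonymous TEXT
);
CREATE TABLE Documents (
    DocID    INTEGER PRIMARY KEY AUTOINCREMENT,
    DocName  TEXT,
    CacheAdd TEXT UNIQUE,
    Vector   TEXT
);
CREATE TABLE Links (
    DocID1 INTEGER REFERENCES Documents(DocID),
    DocID2 INTEGER REFERENCES Documents(DocID)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Two hypothetical documents; DocID is assigned automatically (1, then 2).
conn.execute("INSERT INTO Documents (DocName, CacheAdd) VALUES (?, ?)",
             ("s1.htm", r"C:\data\sport\s1.htm"))
conn.execute("INSERT INTO Documents (DocName, CacheAdd) VALUES (?, ?)",
             ("s2.htm", r"C:\data\sport\s2.htm"))
conn.execute("INSERT INTO Links (DocID1, DocID2) VALUES (1, 2)")

# Documents that document 1 links to.
outgoing = conn.execute("SELECT DocID2 FROM Links WHERE DocID1 = 1").fetchall()
print(outgoing)
```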

4. Table 4: class probabilities

Field Name     Field Type           Description
ClassName      Text                 Class name
Probability    Number (0..100)      Probability of the class

4.2.4. Program module design
1. Module that parses the Web pages to create the KEYWORDS table
Algorithm:
Input: the documents used to create the keywords
While (not all documents have been read) do
    1. Read the next document
    2. While (the document has not been read completely) do
        2.1. Read the next word
        2.2. Insert it into the database
    End
End.
Output: the keyword file
The Synonymous field is filled in by hand for each keyword.
The module also lets the user add keywords by hand and delete keywords that are not needed.
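A minimal sketch of this module is shown below. It assumes the documents are available as plain-text strings, that a word already present is not inserted again (the pseudocode does not say so explicitly), and that words are lowercased; an in-memory dictionary stands in for the KEYWORDS table, and the function name is hypothetical.

```python
import re

def collect_keywords(documents):
    """Assign an auto-incremented ID to every distinct word, in first-seen order.

    documents: iterable of plain-text strings (markup already removed).
    Returns {word: KeyWordID}.
    """
    keyword_ids = {}
    for text in documents:                            # while documents remain
        for word in re.findall(r"[A-Za-z]+", text):   # read word by word
            word = word.lower()
            if word not in keyword_ids:               # insert only new words
                keyword_ids[word] = len(keyword_ids) + 1
    return keyword_ids

# Hypothetical training texts.
docs = ["Football match today", "Culture and football"]
keywords = collect_keywords(docs)
print(keywords)
```

Because every distinct word becomes a keyword, common words such as "and" end up in the table too, which is why manual deletion of unnecessary keywords is offered alongside this module.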
2. Module that obtains the cache address (CacheAdd) of each training document and generates the document ID (DocID), filling in the first fields of the DOCUMENTS table. The Vector field is created later by module 4.
Algorithm:
Input: the documents used for training
While (not all documents have been read) do
    1.1. Read the cache address of the document
         Insert it into the database
    1.2. Read the name of the document
         Insert it into the database
End
The document ID is incremented automatically.
3. Module that creates the LINKS table. Before the LINKS table can be created, the DOCUMENTS table must exist, so that the ID (DocID) of each document can be looked up.
Algorithm:
1. Read the folder that contains the documents from the hard disk
2. Set the variable TenTM = [path of the folder]
3. While (not all documents have been analyzed) do
    3.1. Take the next document in the folder together with its cache address (CacheAdd).
    3.2. Look up the DocID of this document in the DOCUMENTS table by its CacheAdd, giving DocID1
        3.2.1. Parse the document to extract the hyperlink tags, that is, the phrases of the form href=[name of the document pointed to]; suppose there are N such tags.
        3.2.2. For i=1 to N do
            3.2.2.1. Concatenate TenTM and [name of the document pointed to] to obtain a cache address, then search DOCUMENTS for its DocID, giving DocID2
            3.2.2.2. Insert the two DocIDs obtained above into the fields DocID1 and DocID2 of the LINKS table
        End.
End
4. Return the LINKS table in the database
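The link-extraction steps can be sketched with Python's standard HTML parser. To keep the example runnable, the pages are passed in memory rather than read from disk, and a dictionary from cache address to DocID stands in for the DOCUMENTS table; the class name, function name and sample data are assumptions, and links to files outside the collection are simply skipped.

```python
from html.parser import HTMLParser
import ntpath

class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def build_links(folder, pages, doc_ids):
    """Return (DocID1, DocID2) pairs.

    folder:  folder path (TenTM in the algorithm)
    pages:   {file name: HTML text} for the files in that folder
    doc_ids: {CacheAdd: DocID}, as stored in DOCUMENTS
    """
    links = []
    for name, html in pages.items():
        doc_id1 = doc_ids[ntpath.join(folder, name)]
        collector = HrefCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Folder path + target name gives the target's cache address.
            doc_id2 = doc_ids.get(ntpath.join(folder, href))
            if doc_id2 is not None:  # skip targets outside the collection
                links.append((doc_id1, doc_id2))
    return links

# Hypothetical two-page collection.
folder = r"C:\data\sport"
pages = {
    "s1.htm": '<html><body><a href="s2.htm">next</a> <a href="x.htm">?</a></body></html>',
    "s2.htm": "<html><body>no links</body></html>",
}
doc_ids = {r"C:\data\sport\s1.htm": 1, r"C:\data\sport\s2.htm": 2}
links = build_links(folder, pages, doc_ids)
print(links)
```

`ntpath` is used so that the Windows-style cache addresses are joined the same way on any operating system.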

4. Module that builds the vector for each document and stores it in the Vector field of the DOCUMENTS table.
Algorithm:
1. Read DocID and CacheAdd from the DOCUMENTS table in the database
2. While (not all records have been read)
    2.1. Use CacheAdd to read the document from the hard disk
    2.2. Set DocID_curence = DocID
    2.3. Set vector = ""
    2.4. Take each keyword in the KEYWORDS table in turn for comparison
        2.4.1. While (not all keywords have been processed)
            2.4.1.1. Set total_occurence = 0; header_occurence = 0
            2.4.1.2. Parse the document to obtain each word: word
            2.4.1.3. If word is not yet in the KEYWORDS table, add it
            2.4.1.4. While (the document has not been read completely)
                - If ((word = keyword) or (word = a synonym)) and (word is inside the <head> tag) then total_occurence += 3 and header_occurence += 1
                - If ((word = keyword) or (word = a synonym)) and (word is not inside the <head> tag) then total_occurence += 1
            End.
            2.4.1.5. total_occurence *= 3;
                     header_occurence *= 3;
            2.4.1.6. Read all documents that the current document links to (outgoing)
                     Repeat the analysis steps used for the current document, adding to the two variables total_occurence and header_occurence
            2.4.1.7. Read all documents that link to the current document (incoming)
                     Repeat the analysis steps used for the current document, adding to the two variables total_occurence and header_occurence
            2.4.1.8. If (total_occurence != 0) then vector += "(" + KeyWordID + "," + header_occurence + "," + total_occurence + ");"
        End.
    2.5. Update DOCUMENTS set Vector = vector where DocID = DocID_curence.
3. End.
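The vector-building steps can be sketched as below. The sketch follows the weighting of section 4.2.2 (words inside <title> count 3, all other words count 1, the page itself is weighted 3 and each linked page 1) and writes each component as (keyword ID, title count, total count). It works on HTML strings held in memory; the class and function names, the lowercasing and the sample pages are assumptions made for illustration.

```python
from html.parser import HTMLParser
from collections import Counter
import re

class TitleBodyText(HTMLParser):
    """Separate the words inside <title> from all other words."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_words, self.other_words = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        words = [w.lower() for w in re.findall(r"[A-Za-z]+", data)]
        (self.title_words if self.in_title else self.other_words).extend(words)

def count_page(html, word_to_id):
    """Return (title_counts, total_counts) per keyword ID for one page.

    A title hit adds 1 to the title count and 3 to the total;
    any other hit adds 1 to the total.
    """
    parser = TitleBodyText()
    parser.feed(html)
    title, total = Counter(), Counter()
    for word in parser.title_words:
        if word in word_to_id:
            title[word_to_id[word]] += 1
            total[word_to_id[word]] += 3
    for word in parser.other_words:
        if word in word_to_id:
            total[word_to_id[word]] += 1
    return title, total

def build_vector(html, neighbor_htmls, word_to_id):
    """Own page weighted by 3, each linked page by 1; returns "(id,title,total);..."."""
    title, total = count_page(html, word_to_id)
    title = Counter({k: 3 * v for k, v in title.items()})
    total = Counter({k: 3 * v for k, v in total.items()})
    for neighbor in neighbor_htmls:
        n_title, n_total = count_page(neighbor, word_to_id)
        title.update(n_title)   # Counter.update adds the counts
        total.update(n_total)
    return ";".join(f"({k},{title[k]},{total[k]})" for k in sorted(total))

# Hypothetical keyword table (synonym "soccer" shares ID 1) and pages.
word_to_id = {"football": 1, "soccer": 1, "culture": 2}
page = "<html><head><title>Football</title></head><body>Soccer and football</body></html>"
neighbor = "<html><body>culture football</body></html>"
result = build_vector(page, [neighbor], word_to_id)
print(result)
```

Only keywords with a non-zero total appear in the output, which matches the rule that a vector lists just the keywords occurring in the page or its neighbors.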

5. Classification module.
Input: the set of documents to be classified.
While (not all documents have been read) do
    Read in the next document to be classified
    1. Convert the document into a vector, as in the module that creates the Vector field of the DOCUMENTS table
    2. Combine it with the vectors of the documents in the database and apply one of the machine learning algorithms to classify it.
End
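The module leaves the choice of learning algorithm open. As one possible instance, the sketch below uses k-nearest neighbors with cosine similarity over sparse keyword vectors; this is an illustrative choice, not necessarily the algorithm used in the experimental system. The vectors keep only the total counts, and the function names and training data are hypothetical.

```python
import math
from collections import Counter

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse vectors {keyword_id: weight}."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(vector: dict, training: list, k: int = 3) -> str:
    """Majority class among the k training vectors most similar to `vector`.

    training: list of (class_name, sparse_vector) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine(vector, item[1]), reverse=True)
    votes = Counter(class_name for class_name, _ in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training vectors (total counts only), labeled by class name.
training = [
    ("sport", {1: 4, 3: 4, 4: 7}),
    ("sport", {1: 7, 2: 4, 3: 8}),
    ("culture", {1: 6, 5: 4, 7: 7}),
    ("culture", {5: 3, 7: 9}),
]
label = classify({3: 5, 4: 2}, training, k=3)
print(label)
```

The new document shares keywords 3 and 4 only with the two "sport" vectors, so they rank first and the majority vote among the three nearest neighbors goes to "sport".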
4.2.5. Analysis of the system's functions
a. Main functions of the system
b. Detailed functions
- Database creation
- Classification and search
4.2.6. Evaluation of the experimental system
a. Some example results from the experimental system
The system runs and gives some initial results:
- The database was built as described above:
+ The documents are parsed to extract keywords
+ The links between the hypertext documents in a collection are represented
+ The documents are encoded as vectors and stored in the database
- A given hypertext document can be classified
- A hypertext document whose content is close to an input document can be searched for
b. Limitations of the system
Because of time constraints, the system still has several limitations:
- The keywords are not yet complete and have not been filtered
- Only one document can be classified at a time (this will be fixed if time allows)
- Accuracy is not yet high because accurate training data is lacking.
