
THAI NGUYEN UNIVERSITY
UNIVERSITY OF INFORMATION AND COMMUNICATION TECHNOLOGY

QUYNH ANH

A STUDY OF THE KNUTH-MORRIS-PRATT ALGORITHM AND ITS APPLICATIONS

Major: Computer Science
Code: 60.48.01

MASTER'S THESIS IN COMPUTER SCIENCE

SCIENTIFIC SUPERVISOR: Assoc. Prof. Dr. TRUNG TUAN

Thai Nguyen, 2014

Digitized by the Learning Resource Center

http://www.lrc-tnu.edu.vn/

TABLE OF CONTENTS

LIST OF SYMBOLS AND ABBREVIATIONS
LIST OF FIGURES AND TABLES
INTRODUCTION
CHAPTER 1. STRING MATCHING
1.1. The concept of string matching
1.2. History of development
1.3. Approaches
1.4. Applications of string matching
1.5. Forms of string matching
1.5.1. Single-pattern matching
1.5.2. Multi-pattern matching
1.5.3. Extended pattern matching
1.5.4. Exact matching
1.5.5. Approximate matching
1.5.5.1. Problem statement
1.5.5.2. Approaches to approximate matching
1.5.5.3. Similarity between two strings
1.5. Some pattern matching algorithms
1.5.1. The Brute Force algorithm
1.5.2. The Karp-Rabin algorithm
1.5.3. The BM (Boyer-Moore) algorithm
1.5.4. Other algorithms
1.6. String matching with finite automata
1.6.1. Finite automata
1.6.1.1. Deterministic finite automata (DFA)
1.6.1.2. Nondeterministic finite automata (NFA)
1.6.2. String-matching automata
1.6.2.1. Introduction
1.6.2.2. Algorithm for constructing a string-matching automaton
1.7. Chapter conclusion
CHAPTER 2. THE KNUTH-MORRIS-PRATT STRING MATCHING ALGORITHM
2.1. The KMP algorithm
2.1.1. Introduction to the algorithm
2.1.2. The partial match table
2.1.3. Complexity of the KMP algorithm
2.2. The fuzzy KMP algorithm
2.2.1. Pattern-matching automaton
2.2.2. Algorithm
2.2.2.1. Algorithm for building TFuzz
2.2.2.2. Pattern search algorithm based on the TFuzz table
2.2.3. Comparison of KMP and the fuzzy KMP algorithm
2.3. The fuzzy KMP-BM algorithm
2.3.1. Idea of the algorithm
2.3.2. Fuzzy pattern-matching automaton
2.3.2.1. Introduction
2.3.2.2. Operation of the fuzzy pattern-matching automaton
2.3.3. Search algorithm
2.4. Chapter conclusion
CHAPTER 3. APPLYING THE KMP ALGORITHM TO INFORMATION SEARCH IN TEXT
3.1. The problem of pattern search in text
3.1.1. Pattern search
3.1.2. Information retrieval
3.1.2.1. Introduction
3.1.2.2. Commonly used information retrieval models
3.2. The Lucene open-source library
3.2.1. Introduction
3.2.2. Steps for using Lucene
3.3. Text information search application
3.4. Implementing the experimental program
3.4.1. Solutions and technologies used
3.4.2. Program content
3.4.3. Experimental results
3.4.3.1. Main interface of the program
3.4.3.2. Test results of the program when searching with the keyword "Văn bản"
3.5. Conclusion of Chapter 3
CONCLUSION
REFERENCES


LIST OF SYMBOLS AND ABBREVIATIONS

BM: Boyer-Moore algorithm
DFA: Deterministic Finite Automata
DOC: Document
FA: Finite Automata
HTML: HyperText Markup Language
IDF: Inverse Document Frequency
KMP: Knuth-Morris-Pratt
LAN: Local Area Network
NFA: Nondeterministic Finite Automata
TF: Term Frequency


LIST OF FIGURES AND TABLES

Figure 1.1. Transition diagram of a DFA
Figure 1.2. Description of a DFA
Table 1.1. Example transition function of a DFA
Figure 1.3. Diagram of an NFA
Figure 1.4. Moving through a string
Table 1.2. Example state-transition function of an NFA
Figure 1.5. String matching example
Figure 1.6. Example of a string-matching automaton
Table 2.1. Partial match table
Table 2.2. Another example
Table 2.3. Worst-case pattern for the KMP algorithm
Table 2.4. The next table
Table 2.5. The TFuzz table
Table 2.6. Illustrative example
Figure 2.1. Moving the pointer along the pattern
Table 2.7. Occurrences of pattern P found in file S by KMP and by the fuzzy approach
Figure 2.2. General idea of the fuzzy KMP-BM algorithm
Figure 3.1. Model for representing and comparing information
Figure 3.2. Vector space model
Table 3.1. Score calculation
Figure 3.3. Lucene indexing model
Figure 3.4. Model of the text information search application
Figure 3.5. Main interface of the program
Figure 3.6. Search results of the program


INTRODUCTION

1. Rationale for the topic

Computers are now used in almost every field and contribute significantly to economic, social, scientific, and technological development. They were created to serve specific human purposes. All computer processing ultimately aims to extract useful information, and within that processing one particularly important problem is searching large volumes of information with high accuracy and in the shortest possible time.

As information technology has spread, the number of electronic documents has grown daily, and the stored documents now amount to billions of pages. Exploiting this enormous repository to find the information one needs has become an everyday, practical requirement of users. One of the main difficulties, however, is locating precisely the required information within the repository. Search systems have been developed in succession to support this task.
The search systems first developed and put into use were predominantly keyword-based. Many of them, such as Google, Bing, and Yahoo!, operate effectively on the Internet. Most of these search engines, however, are commercial products whose source code is kept secret. Desktop search systems such as Windows Search and Google Desktop meet part of users' needs and are free for personal use, but they work only on a small scale and go no further than keyword search over titles and summaries.

An effective approach to this problem is full-text matching and search. One of the classic string matching algorithms is the KMP algorithm. KMP is still relatively unfamiliar and little used in Vietnam for managing, storing, and processing large volumes of data, yet it is efficient and accurate. Based on this approach and on the guidance of my supervisor, I chose the topic "String matching and the Knuth-Morris-Pratt algorithm".
2. Subject and scope of the research

- The concepts of string matching.
- The concepts underlying the KMP string matching algorithm.
- Some applications of the KMP algorithm.

3. Research direction

- Study Knuth-Morris-Pratt search and its application to searching for information in text.
- Study technological solutions for implementing an experimental program.

4. Main contents

The thesis consists of an introduction, three chapters, a conclusion, a table of contents, and a list of references. The three chapters cover the following:

- Chapter 1 presents the concept of string matching, the main approaches, the forms of matching, and several pattern matching algorithms.
- Chapter 2 presents the KMP algorithm, the fuzzy KMP algorithm, and the fuzzy KMP-BM algorithm.
- Chapter 3 presents the problem of searching for information in text and the implementation of an experimental program.
5. Research method

- Survey published work on information search algorithms and data mining, in particular research results related to information search algorithms.
- Run experiments with the KMP search algorithm on sample data, then comment on and evaluate the results.

6. Scientific significance

The thesis studies information search techniques and algorithms, which provide a basis for supporting forecasting, planning, and the analysis of management, professional, and operational data.


CHAPTER 1. STRING MATCHING

1.1. The concept of string matching

String matching is a fundamental technique in text processing. Almost every text editor and word processor needs a mechanism for matching strings in the current document. Integrating string matching algorithms is a basic step in software development, and such algorithms are implemented on most operating systems.

Although data is now stored in many different forms, text remains the main form for storing and exchanging information. In many fields, such as matching, information extraction, and bioinformatics, large amounts of data are usually stored in linear files. Moreover, the volume of collected data grows very quickly, which calls for efficient algorithms for processing and matching text data.

String matching compares one or more strings (usually called patterns) with a text in order to find the positions and the number of occurrences of those strings in the text.
We formalize the string matching problem as follows. The text is an array T[1..n] of length n and the pattern is an array P[1..m] of length m; the elements of T and P are characters drawn from a finite alphabet $\Sigma$. For example, we may have $\Sigma = \{0, 1\}$ or $\Sigma = \{a, b, \dots, z\}$. The character arrays P and T are often called strings. A string w is a prefix (suffix) of a string x, written $w \sqsubset x$ ($w \sqsupset x$), if x = wy (x = yw) for some string y. For brevity, $P_k$ denotes the k-character prefix P[1..k] of the pattern P[1..m]. The pattern P occurs with shift s in the text T (equivalently, P occurs beginning at position s + 1 in T) if $0 \le s \le n - m$ and T[s+1..s+m] = P[1..m], that is, if T[s+j] = P[j] for $1 \le j \le m$. Such an s is called a valid shift. The string matching problem is to find all valid shifts with which a given pattern P occurs in a given text T.

Example: the pattern P = abaa occurs once in the text T = abcabaabcabac, at shift s = 3. A simple way to solve the problem is to find all valid shifts with a loop that checks the condition P[1..m] = T[s+1..s+m] for each of the n - m + 1 possible values of s.
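The definitions above can be illustrated with a minimal Python sketch. The function names (`is_prefix`, `is_suffix`, `valid_shifts`) are chosen here for illustration and do not come from the thesis; because Python strings are 0-indexed, T[s+1..s+m] corresponds to `text[s:s+m]`.

```python
def is_prefix(w: str, x: str) -> bool:
    """True when x = w + y for some string y."""
    return x[:len(w)] == w


def is_suffix(w: str, x: str) -> bool:
    """True when x = y + w for some string y."""
    return len(w) <= len(x) and x[len(x) - len(w):] == w


def valid_shifts(text: str, pattern: str) -> list:
    """All shifts s with 0 <= s <= n - m and T[s+1..s+m] = P[1..m]."""
    n, m = len(text), len(pattern)
    # range() is empty when m > n, so no shift is reported in that case.
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]


if __name__ == "__main__":
    print(valid_shifts("abcabaabcabac", "abaa"))  # [3]
```

The single shift reported for the example is s = 3, that is, the occurrence beginning at position 4 of the text.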

1.2. History of development

In 1970, S. A. Cook proved a theoretical result implying the existence of a pattern matching algorithm whose worst-case running time is proportional to M + N.

D. E. Knuth and V. R. Pratt followed the construction Cook had used in proving his theorem and obtained a relatively simple algorithm. J. H. Morris discovered the same algorithm at about the same time.

Knuth, Morris, and Pratt did not publish their algorithm until 1977, and in the meantime R. S. Boyer and J. S. Moore discovered an algorithm that is much faster in many applications.

In June 1975, Alfred V. Aho and Margaret J. Corasick introduced the Aho-Corasick multi-pattern string matching algorithm in Communications of the ACM 18.

In 1980, Nigel Horspool introduced a string matching algorithm similar in spirit to KMP but with the order of comparison reversed (Software - Practice & Experience, 10(6):501-506).

In March 1987, R. M. Karp and M. O. Rabin introduced an algorithm almost as simple as Brute Force whose expected running time is proportional to m + n (IBM Journal of Research and Development, vol. 31, no. 2).


1.3. Approaches

There are four main approaches to string matching algorithms:

- Classical algorithms: based mainly on comparisons between characters. Typical examples include Brute Force and the naive algorithm.
- Suffix automaton algorithms: use a suffix automaton data structure to recognize all suffixes of the pattern. Typical examples include Knuth-Morris-Pratt, Boyer-Moore, and Horspool.
- Bit-parallel algorithms: exploit the inherent parallelism of bit operations to perform several operations at once. A typical example is Shift-Or.
- Hashing algorithms: use hashing to avoid the quadratic number of character comparisons. A typical example is Karp-Rabin.

Computational complexity: in practice there are many kinds of alphabets, such as binary, DNA, alphabetic, and numeric, and each leads to a different complexity. The computational cost grows with the length of the pattern, the length of the text, and the size of the alphabet.

String matching algorithms usually run in two steps:

- Preprocessing: process the pattern and initialize the data structures.
- Matching: search for the pattern in the text.
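The two-step structure can be seen in a small Python sketch of the Shift-Or bit-parallel algorithm mentioned in this section. This is an illustrative sketch under simple assumptions, not code from the thesis: the names `shift_or_preprocess` and `shift_or_search` are hypothetical, Python's unbounded integers stand in for a machine word, and reported positions are 0-based.

```python
def shift_or_preprocess(pattern):
    """Preprocessing step: build one bit mask per pattern character.

    Bit i of masks[c] is 0 exactly when pattern[i] == c.
    """
    m = len(pattern)
    all_ones = (1 << m) - 1
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    return masks, all_ones


def shift_or_search(text, pattern):
    """Matching step: return 0-based start positions of all occurrences."""
    m = len(pattern)
    if m == 0:
        return []
    masks, all_ones = shift_or_preprocess(pattern)
    state = all_ones  # bit i == 0 means pattern[0..i] matches a suffix of the text read so far
    hits = []
    for pos, ch in enumerate(text):
        # Shifting brings in a 0 at bit 0; OR-ing the mask keeps a 0 only where ch fits.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            hits.append(pos - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_or_search("abcabaabcabac", "abaa"))  # [3]
```

All the pattern-dependent work happens once in `shift_or_preprocess`; the matching loop then does a constant number of word operations per text character, as long as the pattern fits in one machine word.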

1.4. Applications of string matching

String matching is one of the basic problems of computer science. It is widely used in many applications and fields, such as:

- The search function in text editors and web browsers.
- Search tools such as Google Search and Yahoo Search.
- Molecular biology, for example matching patterns in DNA and proteins.
- Database matching.
- Matching over noisy channels within an acceptable tolerance.
- Matching patterns or traces of attacks, intrusions, and malicious software.
- Network security and information security.

1.5. Forms of string matching

Classifying matching algorithms by the characteristics of the pattern gives the following forms: single-pattern matching, multi-pattern matching (the pattern is a set of strings), extended pattern matching, and regular expression matching, each with two approaches: exact and approximate matching.

1.5.1. Single-pattern matching

Given a pattern string P of length m, $P = P_1 P_2 \dots P_m$, and a string of length n, $S = S_1 S_2 \dots S_n$ (S is usually long, a text), over the same alphabet A, find all occurrences of P in S.

Pattern matching algorithms commonly use the notions of prefix, suffix, and factor (substring) of a string, defined as follows. Given three strings x, y, z, we say that x is a prefix of the string xy, a suffix of the string yx, and a factor (substring) of the string yxz.

The crudest algorithm, and one that has been widely used, is Brute Force. It simply tries each starting position in S in turn and compares the text at that position with the pattern P. Although it is slow, with worst-case time proportional to the product m·n, on the strings that arise in many practical applications its actual running time is usually proportional to m + n. A further advantage is that it suits the architecture of most computer systems.

Many single-pattern algorithms have been proposed to date, the most classic being KMP.

Pattern matching algorithms can be seen as following three general approaches, depending on how the text is scanned for the pattern. The speed of an algorithm is evaluated in terms of the size of the pattern P and of the alphabet A.

In the first approach, the characters of the text S are read one by one; at each position, after comparing with a character of the pattern, the algorithm updates its state so as to recognize a possible occurrence of the pattern. Two typical algorithms of this kind are KMP and Shift-Or.

The second approach slides a window along S and matches the pattern inside this window. At each window position, the algorithm looks for a suffix of the window that is also a suffix of the pattern P. The BM algorithm is typical of this approach, and Horspool is a simplified variant of it.

The third approach appeared more recently and has produced algorithms that are efficient in practice for sufficiently long patterns. It resembles the second approach, but at each step it looks for the longest suffix of the window that is a factor of the pattern. The first algorithm of this kind is BDM; when P is short enough, BNDM is a simpler and more efficient version. For long patterns, the BOM algorithm is considered the fastest.
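As an illustration of the sliding-window (second) approach, here is a minimal Python sketch of the Horspool variant. It is a sketch under stated assumptions rather than the thesis's code: the name `horspool_search` is hypothetical, positions are 0-based, and the window is compared with the pattern by a slice comparison instead of an explicit right-to-left loop.

```python
def horspool_search(text, pattern):
    """Return 0-based start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # For each character among the first m-1 pattern characters, the distance
    # from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    end = m - 1  # text index aligned with the last pattern character
    while end < n:
        if text[end - m + 1:end + 1] == pattern:
            hits.append(end - m + 1)
        # Slide the window according to the text character under its last cell;
        # characters absent from the table allow a full jump of m positions.
        end += shift.get(text[end], m)
    return hits


if __name__ == "__main__":
    print(horspool_search("abcabaabcabac", "abaa"))  # [3]
```

Unlike the first approach, the window here can skip text characters entirely: a character that does not occur in the pattern moves the window by m positions at once.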
1.5.2. Multi-pattern matching

Given a pattern P consisting of a set of keywords $w_1, w_2, \dots, w_r$ and an input string $S = S_1 S_2 \dots S_n$ over the same alphabet A, find the occurrences of the keywords $w_i$ in S.

A simple way to find several keywords in a string is to run the fastest single-pattern algorithm once for each keyword. This is clearly inefficient when the number of keywords is large.

All three single-pattern approaches above have been extended to multi-pattern search. Two typical algorithms following the first approach are the well-known Aho-Corasick algorithm, whose speed advantage is considerable when there are many keywords, and the Multiple Shift-And algorithm, which is effective when the total length of the pattern P is very small [2].

The second approach includes the well-known Commentz-Walter algorithm, which combines the ideas of Boyer-Moore and Aho-Corasick; it is fast in theory but not efficient in practice. Set Horspool is an extension of the Horspool algorithm. Finally, the Wu-Manber algorithm, which mixes the suffix search approach with a kind of hash function, is considered fast in practice.

In the third approach, BOM has been extended to SBOM, and BNDM has been extended to Multiple BNDM in the same way as Shift-Or.
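To make the first multi-pattern approach concrete, the following is a compact Python sketch of an Aho-Corasick style automaton. It is an illustrative sketch, not the thesis's implementation: the names `build_automaton` and `find_keywords` are hypothetical, transitions are stored in dictionaries, and positions are 0-based.

```python
from collections import deque


def build_automaton(keywords):
    """Build the goto, failure and output tables for a set of keywords."""
    goto = [{}]   # goto[state][char] -> next state (state 0 is the root)
    fail = [0]    # failure link of each state
    out = [[]]    # keywords recognised on entering each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first traversal: failure links of shallow states are known
    # before they are needed for deeper states.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, keywords):
    """Return (start position, keyword) pairs for every occurrence in text."""
    goto, fail, out = build_automaton(keywords)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits


if __name__ == "__main__":
    print(find_keywords("ushers", ["he", "she", "his", "hers"]))
```

The text is scanned only once regardless of the number of keywords, which is the advantage over running a single-pattern algorithm once per keyword.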
1.5.3. Extended pattern matching

In many applications the pattern is not simply a sequence of characters. The following extensions are commonly encountered:

The simplest extension allows the pattern to be a sequence of classes, or sets, of characters, numbered 1, 2, ..., m. Any character in the i-th class can be taken as the i-th character of the pattern.

The second extension is bounded-length gaps: some positions in the pattern are designated to match any text sequence whose length lies within a predefined range. This is often used in bioinformatics applications, for example when searching for PROSITE patterns.

The third extension uses optional and repeatable characters. In an occurrence of the pattern in the text, optional characters may or may not appear, while repeatable characters may appear once or several times.

The problems arising from these three extensions and from their combinations are solved by adapting the Shift-Or and BNDM algorithms, using bit parallelism to simulate a nondeterministic automaton that finds all occurrences of the pattern.
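The first extension (classes of characters) needs only a small change to a bit-parallel matcher: the mask of a character marks every pattern position whose class contains it. The Python sketch below uses the Shift-And formulation; the names `class_masks` and `search_with_classes` are hypothetical, each class is given as a string of allowed characters, and positions are 0-based.

```python
def class_masks(classes):
    """Bit i of masks[c] is 1 when character c belongs to the i-th class."""
    masks = {}
    for i, allowed in enumerate(classes):
        for ch in allowed:
            masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def search_with_classes(text, classes):
    """Return 0-based start positions where every text character fits its class."""
    m = len(classes)
    if m == 0:
        return []
    masks = class_masks(classes)
    accept = 1 << (m - 1)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        # Shift-And step: extend every partial match by one position,
        # start a new one at bit 0, and keep only those that ch fits.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits


if __name__ == "__main__":
    # Pattern [bc][a][rt] matches "bar", "bat", "car" and "cat".
    print(search_with_classes("a cat, a bar, a car", ["bc", "a", "rt"]))  # [2, 9, 16]
```

Only the preprocessing changes; the matching loop is identical to the one used for an ordinary pattern, which is why the bit-parallel algorithms adapt so easily to this extension.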
1.5.4. Exact matching

The task is to find one (or more) positions at which a string P[1..m] (the pattern) occurs exactly in a longer string or text T[1..n], with m <= n. For example, the string "abc" occurs in the string "abcababc" at positions 1 and 6.

The problem is stated formally as follows. Let $\Sigma$ be a finite set of characters. Normally the characters of both the pattern and the text belong to $\Sigma$. Depending on the application, $\Sigma$ may be the usual English alphabet from A to Z, a binary set with only the two elements 0 and 1 ($\Sigma = \{0, 1\}$), or the set of DNA characters in biology ($\Sigma = \{A, C, G, T\}$).

The simplest method considers each position i of the text in turn, from 1 to n-m+1, compares T[i..(i+m-1)] with P[1..m] character by character, and reports the result. This method is also called naive string search. Its procedure is specified below:

NAIVE_STRING_MATCHER(T, P)
    n ← length[T]
    m ← length[P]
    for s ← 1 to n - m + 1 do
        j ← 1
        while j ≤ m and T[s + j - 1] = P[j] do
            j ← j + 1
        if j > m then
            return s        // s is the position found
    return false            // no position matches

The average complexity of the algorithm is O(n+m), but the worst-case complexity is O(n·m), for example when matching the pattern "aaaab" in the string "aaaaaaaaab".
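The worst-case behaviour can be checked by counting character comparisons. The Python sketch below is an illustration only: the name `naive_search_counting` is hypothetical, and, unlike the procedure above, it reports all matching positions (1-based, as in the text) rather than stopping at the first.

```python
def naive_search_counting(text, pattern):
    """Return (1-based match positions, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    hits = []
    for s in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[s + j] != pattern[j]:
                break
            j += 1
        if j == m:
            hits.append(s + 1)  # convert the 0-based shift to a 1-based position
    return hits, comparisons


if __name__ == "__main__":
    print(naive_search_counting("aaaaaaaaab", "aaaab"))  # ([6], 30)
```

For the worst-case example, n = 10 and m = 5, so there are n - m + 1 = 6 window positions and each one costs m = 5 comparisons, giving 30 comparisons in total: the first five windows fail only on their last character and the sixth is a full match.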

1.5.5. Approximate matching

1.5.5.1. Problem statement

Approximate pattern matching is the problem of finding occurrences of a pattern in a text where the "match" between the pattern and its occurrence may contain up to k "errors" (k is a given bound). Examples of such errors include typing or spelling errors in information retrieval systems, mutations of gene sequences or measurement errors in bioinformatics, and data transmission errors in signal processing systems. Because errors can hardly be avoided in computer systems, approximate matching has become increasingly important.

In particular, users of information retrieval systems now expect near matches as well, or at least relevant results when the pattern or the text contains mistakes. In this setting errors may have many different causes, such as:

- A misspelled query, or a search string whose syntax does not agree with the text.
- Printing errors, spelling mistakes, wrong punctuation.
- Morphological variation of words in some languages.
- Inaccurate data entered into the database, which often happens with personal names and addresses.
- Imprecise information supplied by the searcher, who knows only "roughly" what to look for.

A challenge for today's information retrieval systems is therefore to meet this need for "flexible" matching.

The general approximate matching problem is stated as follows: given a text T of length n and a pattern P of length m over the same alphabet A, find the positions in the text that match the pattern with at most k errors.
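As a concrete reading of this problem statement, the Python sketch below uses the classical dynamic-programming formulation in which an error is one insertion, deletion, or substitution of a character. That error model is an assumption made here for illustration (the statement above leaves the kind of error open), and the name `approx_search` is hypothetical. It reports the 1-based positions in the text where an occurrence with at most k errors ends.

```python
def approx_search(text, pattern, k):
    """Return 1-based end positions of matches of pattern with at most k errors."""
    m = len(pattern)
    # col[i] = fewest errors needed to match pattern[:i] against some
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text, start=1):
        prev_diag = col[0]  # old value of col[i - 1]
        col[0] = 0          # an occurrence may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra character in the text
                      col[i - 1] + 1)    # pattern character missing from the text
            prev_diag, col[i] = col[i], new
        if col[m] <= k:
            ends.append(pos)
    return ends


if __name__ == "__main__":
    print(approx_search("surgery", "survey", 2))  # [5, 6, 7]
```

With k = 0 the function reduces to exact matching and reports the end position of each exact occurrence. The running time is O(m·n) and only one column of the table is kept in memory.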

1.5.5.2. Approaches to approximate matching

Current approximate matching algorithms fall into four classes:

1) Algorithms based on dynamic programming: this is the earliest approach and has been used to compute the edit distance.

2) Algorithms using matching automata: these first build, as a function of the pattern P and the number of errors k, a nondeterministic finite automaton. This approach has attracted much attention because its worst-case time complexity is O(n), although it requires more space.

3) Algorithms using bit parallelism: this approach has produced many efficient algorithms by exploiting the inherent parallelism of bit operations on a machine word of the processor. Bit parallelism is generally used to parallelize other techniques, such as simulating a nondeterministic automaton or filling the dynamic programming matrix. It works well for short patterns and gives a considerable speed-up over implementations that do not exploit the parallelism of registers. Bit-parallel algorithms include BPR and BPD, which simulate a nondeterministic finite automaton, and BPM, which simulates the dynamic programming algorithm.

4) Algorithms using filtering: these try to narrow the search space by discarding parts of the text that certainly contain no segment matching the pattern. This is generally achieved by applying exact matching to small pieces of the pattern. The two most efficient algorithms of this kind are PEX and ABNDM. In PEX, the pattern is split into k + 1 pieces and a multi-pattern search is run on these pieces, since at least one piece must appear unchanged in any occurrence. ABNDM is an extension of BNDM that simulates a nondeterministic finite automaton for approximate matching. In general, filtering algorithms work better when the ratio k/m is small; for large k/m, bit-parallel algorithms are considered better.

Approximate versions have also been developed for multi-pattern matching. MultiHash works only for k = 1 but is very efficient when the number of patterns is large; MultiPEX is the most efficient algorithm when k/m is small; MultiBP builds the NFAs of all the patterns and uses the result as a filter, which is the best choice for medium values of k/m.

Approximate approaches to extended patterns and regular expressions include: a dynamic programming algorithm for regular expressions; an algorithm using a nondeterministic finite automaton that tolerates errors; and a bit-parallel algorithm based on the BPR method.
1.5.5.3. Similarity between two strings

Approximate matching requires a distance function that measures the similarity between two strings. Similarity here means that the two strings differ in a few places through errors that can be recognized by eye, without regard to meaning (for example OCR, optical character recognition, errors), such as "Việt Nam" versus "Việt Nan" or "Việtt Nan". Common techniques for measuring the similarity of two strings include the longest common substring, the longest common subsequence, and the edit distance. Many applications use variants of these distance functions.

1) Edit distance: for two strings x and y, the edit distance Edit distance(x, y) is the smallest number of edit operations needed to transform x into y (computing it is relatively costly). The larger the edit distance, the more the two strings differ (the lower their similarity), and vice versa. Edit distance is commonly used in spelling and speech checking. Depending on which edit operations are allowed, different kinds of edit distance are obtained, for example:

- Hamming distance: the only operation is substitution of a character.
- Levenshtein distance: the operations are insertion, deletion, and substitution of a character.
- Damerau distance: the operations are insertion, deletion, substitution, and transposition of adjacent characters.

2) Longest common substring (or longest common factor): a string w is a substring, or factor, of a string x if x = uwv (u and v may be empty). The string w is a common factor of two strings x and y if w is a factor of both x and y. The longest common factor of x and y, written LCF(x, y), is a common factor of maximum length.

3) Longest common subsequence: a subsequence of a string x is a sequence of characters obtained by deleting zero, one, or more characters from x. A common subsequence of two strings x and y is a subsequence of both x and y. A common subsequence of x and y of maximum length is called the longest common subsequence LCS(x, y). Its length can be used to compute the edit distance between x and y when only insertions and deletions are allowed (a substitution then counts as one deletion plus one insertion). With m and n the lengths of x and y:

$$\mathrm{LevDistance}(x, y) = m + n - 2\,\mathrm{length}(\mathrm{LCS}(x, y))$$

For the Levenshtein distance proper, which also allows substitutions at unit cost, this value is only an upper bound.
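The distances of this section can be compared on small inputs with the following Python sketch. All function names are hypothetical and the code is an illustration rather than the thesis's implementation. Two assumptions should be noted: the transposition option implements the restricted variant in which no substring is edited more than once (often called optimal string alignment), and `indel_distance` is the value m + n - 2·length(LCS(x, y)), which counts insertions and deletions only and therefore can exceed the Levenshtein distance.

```python
def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x, y, transpositions=False):
    """Levenshtein distance; with transpositions=True, adjacent swaps cost 1."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            if (transpositions and i > 1 and j > 1
                    and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(x, y):
    """m + n - 2 * length(LCS(x, y)): edit distance with insertions and deletions only."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


if __name__ == "__main__":
    print(edit_distance("survey", "surgery"))   # 2
    print(indel_distance("survey", "surgery"))  # 3
```

For "survey" and "surgery" the Levenshtein distance is 2 (substitute v by g, insert r), while the LCS-based value is 3, which illustrates the difference between the two measures.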

1.5. Some pattern matching algorithms

1.5.1. The Brute Force algorithm

The Brute Force algorithm checks every position of the text from 1 to n-m+1. After each attempt it shifts the pattern one character to the right, until the whole text has been examined. It needs neither a preprocessing phase nor auxiliary arrays for the search. Its computational complexity is O(n*m).
function IsMatch(const X: string; m: integer;
                 const Y: string; p: integer): boolean;
{ True when X[1..m] equals Y[p..p+m-1] }
var i: integer;
begin
  IsMatch := false;
  Dec(p);
  for i := 1 to m do
    if X[i] <> Y[p + i] then Exit;
  IsMatch := true;
end;

procedure BF(const X: string; m: integer; const Y: string; n: integer);
var i: integer;
begin
  for i := 1 to n - m + 1 do
    if IsMatch(X, m, Y, i) then
      Output(i); { report that the pattern was found at position i of the text }
end;

1.5.2. The Karp-Rabin algorithm

The Karp-Rabin approach differs little from the standard search. A hash function is used to avoid unnecessary comparisons: instead of comparing the pattern at every position of the text, we compare only those windows whose characters "look like" the pattern, that is, whose hash value equals that of the pattern.

The hash function must be easy to compute on a string and, above all, cheap to recompute when the window moves, so that it has little effect on the running time of the algorithm. The hash function chosen here is:

$$\mathrm{hash}(w[i..i+m-1]) = h = \left(w[i]\,d^{m-1} + w[i+1]\,d^{m-2} + \dots + w[i+m-1]\,d^{0}\right) \bmod q$$

After the window is shifted by one character, the hash is recomputed simply as:

$$h \leftarrow \left((h - w[i]\,d^{m-1})\,d + w[i+m]\right) \bmod q$$

Here we can choose d = 2, so that a*2 is computed as a shl 1. Moreover, with q = MaxLongint the operation mod q need not be performed explicitly, because integer overflow in the computation itself acts as a very fast modulo operation (strictly, wrap-around reduces modulo the word size, and this relies on overflow checking being disabled).

Preprocessing in Karp-Rabin takes O(m) time. The search time is O(m*n) in the worst case, because the hash function can be "fooled" many times, when different windows share the pattern's hash value, and then brings no benefit. These are special cases, however; in practice the running time of KR is usually proportional to n + m. Moreover, KR extends easily to two-dimensional patterns and texts, which makes it more useful than the other algorithms for image processing.
procedure KR(const X: string; m: integer;
             const Y: string; n: integer);
var
  dM, hx, hy: longint;
  i, j: integer;
begin
  { dM = d^(m-1) with d = 2 }
  dM := 1;
  for i := 1 to m - 1 do dM := dM shl 1;
  { hash of the pattern X and of the first window of the text Y }
  hx := 0;
  hy := 0;
  for i := 1 to m do
  begin
    hx := (hx shl 1) + Ord(X[i]);
    hy := (hy shl 1) + Ord(Y[i]);
  end;
  j := 1;
  while j <= n - m do
  begin
    if hx = hy then
      if IsMatch(X, m, Y, j) then Output(j); { IsMatch is defined in the Brute Force section }
    hy := ((hy - Ord(Y[j]) * dM) shl 1) + Ord(Y[j + m]); { rehash }
    Inc(j);
  end;
  { last window, j = n - m + 1 }
  if hx = hy then
    if IsMatch(X, m, Y, j) then Output(j);
end;
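For comparison, the rolling hash can also be sketched in Python with an explicit modulus instead of integer overflow. This is an illustrative sketch, not the thesis's code: the name `karp_rabin` and the default values d = 256 and q = 1000000007 are assumptions made here, and reported positions are 1-based as in the Pascal listing.

```python
def karp_rabin(text, pattern, d=256, q=1_000_000_007):
    """Return 1-based positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(d, m - 1, q)  # d^(m-1) mod q, weight of the window's first character
    hp = ht = 0
    for i in range(m):
        hp = (hp * d + ord(pattern[i])) % q
        ht = (ht * d + ord(text[i])) % q
    hits = []
    for s in range(n - m + 1):
        # Equal hashes only suggest a match; the window is verified character
        # by character so that hash collisions cannot produce false positives.
        if hp == ht and text[s:s + m] == pattern:
            hits.append(s + 1)
        if s < n - m:
            # Remove the leftmost character, shift, and append the next one.
            ht = ((ht - ord(text[s]) * high) * d + ord(text[s + m])) % q
    return hits


if __name__ == "__main__":
    print(karp_rabin("abcabaabcabac", "abaa"))  # [4]
```

Choosing a very small modulus, for example d = 2 and q = 7, makes collisions frequent; the results stay correct because of the verification step, but the running time then approaches the O(m*n) worst case described above.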

1.5.3. The BM (Boyer-Moore) algorithm

A common approach in single-pattern matching algorithms is to scan all characters of the input string S sequentially, one character at a time. The BM algorithm, in contrast, can make long jumps along S. It is therefore regarded as one of the fastest algorithms in practice and an effective choice for common applications such as text editors.

The basic idea of the algorithm is to use a "sliding window". The window is a substring of length m of the input string S (m is the length of the pattern P) that is compared with the pattern at a given moment. Each time, P is compared with a window of S character by character from right to left. When a mismatch is found, the window slides to the right over a portion of S (equivalently, P is shifted to the right). The best case occurs when the mismatch happens at $P_m$ and the mismatched character $S_k$ does not occur in P at all; the window can then safely slide m positions to the right, and the search resumes by comparing $P_m$ with $S_{k+m}$.

Suppose that at some moment the window under consideration is $S_{k-m+1} S_{k-m+2} \dots S_k$ and we begin by comparing $P_m$ with $S_k$.

(1) Suppose $P_m \ne S_k$. There are two possibilities:

- If the rightmost occurrence of the character $S_k$ in P is at position m - g, we can shift P right by g positions so that $P_{m-g}$ is aligned with $S_k$, and then restart the matching by comparing $P_m$ with $S_{k+g}$.
- If the character $S_k$ does not occur in P, we can shift P right by m positions. This is the longest possible shift that still misses no occurrence of the pattern.

(2) Suppose the last m - i characters of P match the last m - i characters of the window ending at $S_k$. If i = 0, we have found an occurrence of P. Otherwise, if i > 0 and $P_i \ne S_{k-m+i}$, consider two possibilities:

- (2a) If the rightmost occurrence of the character $S_{k-m+i}$ in P is at position i - g, then P is shifted right by g positions so that $P_{i-g}$ is aligned with $S_{k-m+i}$, and a new matching pass begins by comparing $P_m$ with $S_{k+g}$. If $P_{i-g}$ lies to the right of $P_i$ (that is, g < 0), P is shifted right by only 1 position.
- (2b) Suppose the matched suffix $\mathrm{suf}_i(P) = P_{i+1} \dots P_m$ also occurs in P as the substring $P_{i+1-g} P_{i+2-g} \dots P_{m-g}$ with $P_{i-g} \ne P_i$ (if there are several such occurrences, choose the rightmost). Then P is shifted right by g positions, possibly farther than in case (2a), so that $P_{i+1-g} P_{i+2-g} \dots P_{m-g}$ is aligned with $S_{k-m+i+1} S_{k-m+i+2} \dots S_k$, and a new matching pass begins by comparing $P_m$ with $S_{k+g}$.

Thus, whenever $P_i \ne S_j$, the pattern P is shifted right by some number of positions. The algorithm uses two tables, d1 and d2, to compute this shift.

Table d1 covers cases (1) and (2a): for each character c, d1[c] = m - i where i is the largest index such that $c = P_i$, and d1[c] = m if c does not occur in the pattern P.
Bng d2 bao hm trng hp (2b): Vi mi i, 1 i
nh l: d2 i = min g + m - i| g
= Pk) vi i

1 v (g

i hoc Pi-g

m, d2 i c xc

Pi) v ((g

k hoc Pk-g

m)

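Several ways of computing the table d2 have been proposed. The algorithm below for computing the shift table d2 is due to Knuth, with a modification by Mehlhorn. It uses a function $f$ with the property $f[m] = m + 1$ and, for $1 \le j < m$,
$$f[j] = \min\{\, i \mid j < i \le m \text{ and } P_{i+1} P_{i+2} \ldots P_m = P_{j+1} P_{j+2} \ldots P_{m+j-i} \,\}.$$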
Algorithm for computing the shift table d2
procedure computed2();
{ P[1..m] is the pattern; assumption: f and d2 are declared with index range 1..m+1 }
var i, j, k: integer;
begin
  for i := 1 to m do d2[i] := 2 * m - i;
  j := m; k := m + 1;
  while j > 0 do
  begin
    f[j] := k;
    while (k <= m) and (P[j] <> P[k]) do
    begin
      d2[k] := min(d2[k], m - j);
      k := f[k];
    end;
    j := j - 1; k := k - 1;
  end;
  for i := 1 to k do d2[i] := min(d2[i], m + k - i);
  j := f[k];
  while k <= m do
  begin
    while k <= j do
    begin
      d2[k] := min(d2[k], j - k + m);
      k := k + 1;
    end;
    j := f[j];
  end;
end;

The BM algorithm for finding the occurrences of pattern P in the input string S
procedure BM();
var i, j: integer;
    counter: integer;
begin
  j := m; counter := 0;
  while j <= n do
  begin
    i := m;
    { compare right to left while the characters agree }
    while (i > 0) and (S[j] = P[i]) do
    begin i := i - 1; j := j - 1; end;
    if i = 0 then
    begin
      { record an occurrence of the pattern at position j + 1 }
      counter := counter + 1;
      j := j + m + 1;
    end
    else j := j + max(d1[S[j]], d2[i]);
  end;
  { report counter }
end;

Complexity of the algorithm: the time complexity is $O(m + n)$ and the space complexity is $O(m)$.
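As a rough illustration of the right-to-left scan and of the d1 rule only, the following Python sketch follows the pointer convention of the BM procedure above. It is a simplification, not the full algorithm: the d2 table is replaced by the trivial shift m - i + 1, which always moves the window at least one position. The function name and the use of a dictionary for d1 are assumptions made for this example.

```python
def bm_bad_char_search(text, pattern):
    """Return 1-based start positions of pattern in text (d1 rule only)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # d1[c] = m - (rightmost 1-based index of c in pattern); m if c is absent
    d1 = {}
    for idx, c in enumerate(pattern, start=1):
        d1[c] = m - idx
    hits = []
    j = m                      # 1-based text pointer, aligned with P_m
    while j <= n:
        i = m
        while i > 0 and text[j - 1] == pattern[i - 1]:
            i -= 1
            j -= 1
        if i == 0:
            hits.append(j + 1)  # occurrence starts at position j + 1
            j += m + 1          # move the window one position to the right
        else:
            # stand-in for d2[i]: m - i + 1 guarantees forward progress
            j += max(d1.get(text[j - 1], m), m - i + 1)
    return hits
```

For instance, on the text abcabaabcabac and the pattern abaa, the first mismatch is against the character c, which does not occur in the pattern, so the text pointer advances by m = 4 at once.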
1.5.4. Other algorithms

The algorithms described above are not all of the string-searching algorithms that exist, but they represent most of the ideas used to solve the string-searching problem.
Algorithms that compare the pattern from left to right are usually improved (or degraded) variants of the Knuth-Morris-Pratt algorithm and of automaton-based algorithms, for example Forward Dawg Matching, Apostolico-Crochemore, Not So Naive, and others.
Algorithms that compare the pattern from right to left are all variants of the Boyer-Moore algorithm. BM is very efficient in practice, but its theoretical complexity is $O(m \cdot n)$. Several refinements therefore aim at a better theoretical complexity: the Apostolico-Giancarlo algorithm marks characters that have already been compared so that they are not compared again, and the Turbo-BM algorithm evaluates previous information more tightly so that it can shift further and repeat less. Other refinements of BM do not reduce the theoretical complexity but rely on heuristics to search faster in practice. In addition, some algorithms combine the BM search process with an automaton in the hope of better results.
Algorithms that compare the pattern in a specific order:

* The Galil-Seiferas and Crochemore-Perrin algorithms split the pattern into two parts, checking the right part first and then the left part, both from left to right.
* The Colussi and Galil-Giancarlo algorithms split the pattern into two sets and search each set in a different direction.
* The Optimal Mismatch and Maximal Shift algorithms order the pattern positions according to character frequency and the achievable shift.
* The Skip Search, KMP Skip Search and Alpha Skip Search algorithms use the distribution of characters to decide where the pattern may start in the text.
Algorithms that compare the pattern in any order: these algorithms may compare the pattern with the window in an arbitrary order. They all have very simple implementations and usually use the bad-character idea of the Boyer-Moore algorithm.

1.6. String matching with finite automata

1.6.1. Finite automata

A finite automaton (FA) is a computational model of a system described in terms of inputs and outputs. At any moment the system is in one of a finite number of internal configurations called states. Each state summarizes the information about past inputs that is needed to determine the transitions on subsequent inputs [5].
In computer science there are many finite-state systems, and the theory of finite automata is a useful design tool for them. One example is a switching circuit such as the control unit of a computer. A switching circuit consists of a finite number of input gates, each of which takes one of the two values 0 or 1. These input values determine two different voltage levels at the output gate.
Each state of a switching network with $n$ gates is one of the $2^n$ assignments of 0 and 1 to the gates. Switching circuits are designed in this way, so they can be viewed as finite-state systems. Commonly used programs, such as text editors or the lexical analyzers of compilers, are also designed as finite-state systems. For example, a lexical analyzer scans all the characters of a program to find the groups of characters that correspond to a variable name, a constant, a keyword, and so on. During this process the lexical analyzer needs to remember only a finite amount of information, such as the initial characters that form a keyword. The theory of finite automata is widely used in the design of efficient string-processing tools.
A computer can itself be viewed as a finite-state system. The current state of the central processing unit, main memory and auxiliary storage at any moment is one of a very large but finite number of states. The human brain is also a finite-state system, because the number of nerve cells (neurons) is bounded, at most perhaps $2^{35}$.
The most important reason for studying finite-state systems is the naturalness of the concept and its wide applicability in practice. Finite automata are divided into two kinds: deterministic (DFA) and nondeterministic (NFA). Both kinds recognize exactly the regular sets. A deterministic finite automaton recognizes a language more directly than a nondeterministic one, but it is usually larger than the equivalent nondeterministic finite automaton.
1.6.1.1. Deterministic finite automata (DFA)
A deterministic finite automaton (DFA) consists of a finite set of states and a set of transitions from state to state on input symbols chosen from some alphabet.
Each input symbol has exactly one transition out of each state (possibly back to the state itself). One state, usually denoted $q_0$, is the start state (the state in which the automaton begins). Some states are designated as final, or accepting, states.
A directed graph, called a transition diagram, is associated with a DFA as follows: the vertices of the graph are the states of the DFA; if there is a transition from state $q$ to state $p$ on input $a$, then the diagram has an arc labeled $a$ from $q$ to $p$. The DFA accepts a string $x$ if the sequence of transitions corresponding to the symbols of $x$ leads from the start state to one of the final states.
As an example, the transition diagram of a DFA is shown below. The start state $q_0$ is marked by an arrow labeled "Start". There is a single final state, which in this case is also $q_0$, indicated by a double circle. This automaton accepts all strings of 0s and 1s in which both the number of 0s and the number of 1s are even.

Figure 1.1. Transition diagram of a DFA

Formally, a finite automaton is defined as a 5-tuple $(Q, \Sigma, \delta, q_0, F)$, where:
* $Q$ is a finite set of states.
* $\Sigma$ is a finite input alphabet.
* $\delta$ is the transition function mapping $Q \times \Sigma \to Q$; that is, $\delta(q, a)$ is the state reached by the transition from state $q$ on input symbol $a$.
* $q_0 \in Q$ is the start state.
* $F \subseteq Q$ is the set of final states.
We picture a DFA as a finite control which, in some state of $Q$, reads a sequence of symbols $a$ from $\Sigma$ written on a tape.

Figure 1.2. Description of a DFA

In one move, a DFA in state $q$ reading input symbol $a$ on the tape enters the state determined by the transition function $\delta(q, a)$ and moves its head one symbol to the right. If $\delta(q, a)$ is one of the final states, the DFA accepts the string written on the input tape to the left of the head, not including the symbol at the position to which the head has just moved. Only when the head has moved past the end of the string on the tape does the DFA accept the entire string.
To describe formally the behavior of a DFA on a string, the transition function is extended so that it applies to a state and a string rather than to a state and a single symbol. The extended function $\delta$ is a mapping $Q \times \Sigma^* \to Q$, where $\delta(q, w)$ is the state the DFA reaches from state $q$ on string $w$:
1. $\delta(q, \varepsilon) = q$
2. $\delta(q, wa) = \delta(\delta(q, w), a)$ for all strings $w$ and input symbols $a$.
Where:
* $Q$ is the set of states. The symbols $q$ and $p$ (with or without subscripts) denote states, and $q_0$ is the start state.

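* $\Sigma$ is the input alphabet. The symbols $a$, $b$ (with or without subscripts) and digits denote input symbols.
* $\delta$ is the transition function.
* $F$ is the set of final states.
* $w$, $x$, $y$ and $z$ (with or without subscripts) denote strings of input symbols.
A string $w$ is accepted by the finite automaton $M = (Q, \Sigma, \delta, q_0, F)$ if $\delta(q_0, w) = p$ for some $p \in F$. The language accepted by $M$, denoted $L(M)$, is the set $L(M) = \{\, w \mid \delta(q_0, w) \in F \,\}$.
Example: following the formal definition, consider the DFA $M = (Q, \Sigma, \delta, q_0, F)$ with $Q = \{q_0, q_1, q_2, q_3\}$, $\Sigma = \{0, 1\}$, $F = \{q_0\}$ and the transition function $\delta$ given below:
Table 1.1. Example of the transition function of a DFA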

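Suppose the string $w = 110101$ is given as input to $M$.

We have $\delta(q_0, 1) = q_1$ and $\delta(q_1, 1) = q_0$, so $\delta(q_0, 11) = \delta(\delta(q_0, 1), 1) = \delta(q_1, 1) = q_0$.
Next, $\delta(q_0, 0) = q_2$, so $\delta(q_0, 110) = \delta(\delta(q_0, 11), 0) = q_2$.
Continuing, $\delta(q_0, 1101) = q_3$ and $\delta(q_0, 11010) = q_1$.
Finally, $\delta(q_0, 110101) = q_0 \in F$.
(Equivalently, $\delta(q_0, 110101) = \delta(q_1, 10101) = \delta(q_0, 0101) = \delta(q_2, 101) = \delta(q_3, 01) = \delta(q_1, 1) = q_0 \in F$.)
Hence 110101 belongs to $L(M)$. It can be proved that $L(M)$ is the set of all strings with an even number of 0s and an even number of 1s.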
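The trace above can be replayed with a short Python sketch that stores the transition function as a table and applies the extended transition function symbol by symbol. Five of the table entries are taken directly from the worked example; the remaining three (for (q1, 0), (q2, 0) and (q3, 1)) are assumptions inferred from the statement that the automaton accepts exactly the strings with an even number of 0s and an even number of 1s. The names DELTA, run_dfa and accepts are chosen for this example.

```python
# Transition table: (state, symbol) -> next state
DELTA = {
    ("q0", "0"): "q2", ("q0", "1"): "q1",
    ("q1", "0"): "q3",                     # assumed entry
    ("q1", "1"): "q0",
    ("q2", "0"): "q0",                     # assumed entry
    ("q2", "1"): "q3",
    ("q3", "0"): "q1",
    ("q3", "1"): "q2",                     # assumed entry
}
FINAL = {"q0"}


def run_dfa(w, start="q0"):
    """Extended transition function: state reached from start on string w."""
    q = start
    for a in w:
        q = DELTA[(q, a)]
    return q


def accepts(w):
    """True if the DFA ends in a final state after reading w."""
    return run_dfa(w) in FINAL
```

Calling run_dfa on the prefixes 11, 110, 1101 and 11010 reproduces the intermediate states of the derivation above.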
As the description above shows, a transition table can also be used to describe the state transitions of a finite automaton. In the transition table, the rows correspond to the states of the automaton and the columns to the symbols of the input alphabet. The transition table suggests a data structure for representing a finite automaton, and it also shows that the operation of a DFA can easily be simulated by a computer program, for instance with a loop.
In general, the set $Q$ of a DFA represents the memory states of the automaton during language recognition, so the memory of the automaton is finite. Moreover, the transition function $\delta$ is total and single-valued, so every step of the automaton is uniquely determined. Because of these two properties, the DFA described above is called a deterministic finite automaton.
1.6.1.2. Nondeterministic finite automata (NFA)
Consider a modification of the DFA model that allows zero, one or more than one transition from a state on the same input symbol. This new model is called a nondeterministic finite automaton (NFA).
A string of input symbols $a_1 a_2 \ldots a_n$ is accepted by an NFA if there exists a sequence of transitions, corresponding to the input string, that leads from the start state to a final state. For example, the string 01001 is accepted by the automaton in the figure below because there is a sequence of transitions through the states $q_0, q_0, q_0, q_3, q_4, q_4$ labeled 0, 1, 0, 0, 1 respectively. This NFA accepts all strings containing two consecutive 0s or two consecutive 1s.

Figure 1.3. Transition diagram of an NFA

A deterministic finite automaton can be regarded as a special case of an NFA in which every state has exactly one transition on each input symbol. Thus in a DFA, for an input string $w$ and a state $q$, there is exactly one path labeled $w$ starting at $q$. To determine whether $w$ is accepted by the DFA it suffices to check this path. For an NFA, however, there may be many paths labeled $w$, and all of them must be checked to see whether at least one leads to a final state.
Like a DFA, an NFA operates with a finite control reading an input tape. At any moment, however, the control may be in any number of states. When there is a choice of next state, for instance from state $q_0$ on input symbol 0 in the figure above, we imagine that copies of the automaton run simultaneously. Each possible next state corresponds to one copy of the automaton whose control is in that state.
For example, for the string 01001 we have:

Figure 1.4. Moves of the NFA on the string

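Formally, a nondeterministic finite automaton (NFA) is a 5-tuple $(Q, \Sigma, \delta, q_0, F)$ in which $Q$, $\Sigma$, $q_0$ and $F$ have the same meaning as for a DFA, but $\delta$ is a transition function mapping $Q \times \Sigma \to 2^Q$.
Here $\delta(q, a)$ is the set of all states $p$ such that there is a transition labeled $a$ from state $q$ to $p$.
To describe conveniently the behavior of the automaton on strings, the transition function is extended to a mapping $Q \times \Sigma^* \to 2^Q$ as follows:
1. $\delta(q, \varepsilon) = \{q\}$
2. $\delta(q, wa) = \{\, p \mid \text{there is a state } r \in \delta(q, w) \text{ such that } p \in \delta(r, a) \,\} = \delta(\delta(q, w), a)$
3. $\delta(P, w) = \bigcup_{q \in P} \delta(q, w)$ for all $P \subseteq Q$.
The language $L(M)$, where $M$ is the nondeterministic finite automaton $(Q, \Sigma, \delta, q_0, F)$, is the set $L(M) = \{\, w \mid \delta(q_0, w) \text{ contains a state in } F \,\}$.
Example: consider the transition diagram of Figure 1.3. Following the formal definition, we have the NFA $M = (\{q_0, q_1, q_2, q_3, q_4\}, \{0, 1\}, \delta, q_0, F)$ with the transition function $\delta$ given below:
Table 1.2. Example of the state-transition function of an NFA

Consider the input string $w = 01001$.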


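We have: $\delta(q_0, 0) = \{q_0, q_3\}$

$\delta(q_0, 01) = \delta(\delta(q_0, 0), 1) = \delta(\{q_0, q_3\}, 1) = \delta(q_0, 1) \cup \delta(q_3, 1) = \{q_0, q_1\}$
Similarly, we can compute:
$\delta(q_0, 010) = \{q_0, q_3\}$
$\delta(q_0, 0100) = \{q_0, q_3, q_4\}$
and $\delta(q_0, 01001) = \{q_0, q_1, q_4\}$
Since $q_4 \in F$, we have $w \in L(M)$.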
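The set-valued computation above can be sketched in Python by tracking the set of states the NFA may be in after each symbol. The transition table used here is an assumption: it is a standard textbook NFA for "two consecutive 0s or two consecutive 1s" that agrees with every value computed in the example, but the full table and the final-state set {q2, q4} are not recoverable from the text itself. The names NFA_DELTA, NFA_FINAL, nfa_states and nfa_accepts are chosen for this example.

```python
# (state, symbol) -> set of next states; missing keys mean the empty set
NFA_DELTA = {
    ("q0", "0"): {"q0", "q3"}, ("q0", "1"): {"q0", "q1"},
    ("q1", "1"): {"q2"},
    ("q2", "0"): {"q2"}, ("q2", "1"): {"q2"},
    ("q3", "0"): {"q4"},
    ("q4", "0"): {"q4"}, ("q4", "1"): {"q4"},
}
NFA_FINAL = {"q2", "q4"}   # assumed set of final states


def nfa_states(w, start="q0"):
    """Extended transition function: set of states reachable on string w."""
    current = {start}
    for a in w:
        # union of delta(q, a) over all states q the NFA may currently be in
        current = set().union(*(NFA_DELTA.get((q, a), set()) for q in current))
    return current


def nfa_accepts(w):
    """True if some reachable state after reading w is final."""
    return bool(nfa_states(w) & NFA_FINAL)
```

Tracking a set of states in this way is exactly the "copies of the automaton running simultaneously" picture described earlier.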
1.6.2. String-matching automata

1.6.2.1. Introduction

String matching is the task of searching for, and determining exactly, the positions at which a chosen object, called the pattern, occurs in a text. The text and the patterns are character strings, that is, finite sequences of characters drawn from a finite alphabet.
The string-matching problem is formalized as follows: the text is an array T[1..n] of length n and the pattern is an array P[1..m] of length m; the elements of T and P are characters drawn from a finite alphabet $\Sigma$. For example, $\Sigma = \{0, 1\}$ or $\Sigma = \{a, b, \ldots, z\}$. The character arrays P and T are often called strings. A string $w$ is a prefix (suffix) of a string $x$, denoted $w \sqsubset x$ ($w \sqsupset x$), if $x = wy$ ($x = yw$) for some string $y$. For brevity, $P_k$ denotes the $k$-character prefix P[1..k] of the pattern P[1..m].
We say that the pattern P occurs with shift s in the text T (or, equivalently, that P occurs beginning at position s + 1 in T) if $0 \le s \le n - m$ and T[s + 1..s + m] = P[1..m] (that is, T[s + j] = P[j] for $1 \le j \le m$). The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.
Example: the pattern P = abaa occurs once in the text T = abcabaabcabac, at shift s = 3. For this problem there is an obvious simple method: find all valid shifts with a loop that checks the condition P[1..m] = T[s+1..s+m] for each of the n - m + 1 possible values of s.
Procedure NAIVE-STRING-MATCHER(T, P)
Begin
  { n = length(T), m = length(P) }
  for s := 0 to n - m do
    if P[1..m] = T[s+1..s+m] then
      write('pattern occurs at position ', s + 1)
End;
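A direct Python transcription of the naive matcher is given below as a sketch. It returns the list of valid shifts s (0-based offsets) rather than printing them; the function name and the return type are choices made for this example.

```python
def naive_string_matcher(text, pattern):
    """Return every valid shift s with text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    # try each of the n - m + 1 candidate shifts in turn
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]
```

On the example from the text, naive_string_matcher("abcabaabcabac", "abaa") yields the single shift 3.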

Figure 1.5. Example of string matching

The figure above illustrates the operation of the naive string matcher for the pattern P = aab and the text T = acaabc; vertical lines connect corresponding regions found to match (shown shaded), and a jagged line connects the first mismatched character found, if any. One occurrence of the pattern is found, at shift s = 2, shown in part (c).
Evaluation: the procedure NAIVE-STRING-MATCHER has complexity $O((n - m + 1)m)$ in the worst case. As will be seen, NAIVE-STRING-MATCHER is not an optimal procedure for this problem. String-matching automata are more efficient because they examine each text character exactly once.
There is a string-matching automaton for every pattern P; this automaton must be constructed from the pattern before it can be used to search the text. Figure 1.6 illustrates the state-transition diagram of the string-matching automaton that accepts all strings ending in ababaca. State 0 is the start state, and state 7 is the only accepting state. A directed edge from state $i$ to state $j$ labeled $a$ represents $\delta(i, a) = j$. The right-going edges form the "spine" of the automaton, drawn heavy in the figure, and correspond to successful matches between pattern and input characters. The left-going edges correspond to failed matches. Some edges corresponding to failed matches are not shown; by convention, if a state $i$ has no outgoing edge labeled $a$ for some $a \in \Sigma$, then $\delta(i, a) = 0$. Part (b) shows the corresponding transition function $\delta$ and the pattern string P = ababaca. The entries corresponding to successful matches between pattern and input characters are shown shaded. Part (c) illustrates the operation of the automaton on the text T = abababacaba. Under each text character T[i] is the state the automaton is in after processing the prefix $T_i$. One occurrence of the pattern is found, ending at position 9.

Figure 1.6. Example of a string-matching automaton

1.6.2.2. Algorithm for constructing the string-matching automaton

First, the transition function $\delta$ is built from a given pattern P[1..m] as follows:
Procedure COMPUTE-TRANSITION-FUNCTION(P, Σ)
Begin
  m := length(P);
  For q := 0 to m do
    For each character a in Σ do
    Begin
      k := min(m + 1, q + 2);
      repeat k := k - 1
      until (Pk is a suffix of Pq a);
      δ(q, a) := k
    End;
End;

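The procedure above considers all states $q$ and all characters $a$; the value $\delta(q, a)$ is the largest $k$ such that $P_k$ is a suffix of $P_q a$. The largest possible value of $k$ is $\min(m, q + 1)$; $k$ is then decreased until the condition holds. The running time of this procedure is $O(m^3 |\Sigma|)$, where $|\Sigma|$ is the number of characters in $\Sigma$. The matching procedure that runs the automaton on a text is as follows: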
Procedure FINITE-AUTOMATON-MATCHER(T, δ, m)
Begin
  n := length(T);
  q := 0;
  for i := 1 to n do
  Begin
    q := δ(q, T[i]);
    if q = m then writeln('pattern occurs with shift: ', i - m);
  End;
End;

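Evaluation: the total time to find all occurrences of a pattern of length $m$ in a text of length $n$ over an alphabet $\Sigma$ is $O(n + m|\Sigma|)$, provided the transition function is computed in $O(m|\Sigma|)$ time by a more efficient construction than the procedure above.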
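The two procedures above can be sketched together in Python. This sketch mirrors the simple construction (trying k from min(m, q + 1) downward), not a faster one; storing the transition function in a dictionary keyed by (state, character) and treating missing entries as state 0 are assumptions made for this example.

```python
def compute_transition_function(pattern, alphabet):
    """delta[(q, a)] = largest k such that pattern[:k] is a suffix of pattern[:q] + a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)
            while k > 0 and not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


def finite_automaton_matcher(text, delta, m):
    """Return all shifts s at which the pattern of length m occurs in text."""
    q = 0
    shifts = []
    for i, ch in enumerate(text, start=1):
        q = delta.get((q, ch), 0)   # characters outside the alphabet lead to state 0
        if q == m:
            shifts.append(i - m)
    return shifts
```

With the pattern ababaca over the alphabet {a, b, c} and the text abababacaba, the matcher reports the shift 2, which corresponds to the occurrence ending at position 9 described for Figure 1.6.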

1.7. Chapter conclusion

This chapter has presented the basics of string matching: the foundational concepts, the main approaches, the forms of matching and several pattern-matching algorithms. This material is the basis for studying the KMP algorithm in the next chapter. The main algorithms have been introduced, including ones that rely on randomization. The class of randomized problems is of great practical significance.

CHAPTER 2. THE KNUTH-MORRIS-PRATT STRING-MATCHING ALGORITHM

2.1. The KMP algorithm
2.1.1. Introduction to the algorithm

The algorithm was published in 1977 by two professors at Stanford University, USA (one of the few universities ranked at the top in computer science worldwide, together with MIT and CMU, also in the USA, and Cambridge in the UK): Donald Knuth (Turing Award, 1974) and Vaughan Ronald Pratt. The algorithm is also known as KMP, after the initials of its three inventors; the letter "M" stands for Professor J. H. Morris, who is also very well known in computer science.
The main idea of the method is as follows: while searching for the position of the pattern P in the source string T, when a mismatch is found the search moves to the next candidate position, and the subsequent search reuses information from the previous one so that unnecessary cases are not examined again [8].
Example: find the pattern W = ABCDABD in the string S = ABC ABCDAB ABCDABCDABDE. At any moment the state of the algorithm is determined by two integer variables, m and i: m is the position in S at which the current comparison with W begins, and i is the index in W of the character currently being compared. At the start, the state is:
m: 0
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0

The characters of W are compared with the corresponding characters of S, moving on to the next characters as long as they match. S[0] and W[0] are both 'A', so i is incremented:

m: 0
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i:  1

S[1] and W[1] are both 'B', so i is incremented again:

m: 0
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i:   2

S[2] and W[2] are both 'C', so i is increased to 3:

m: 0
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i:    3
In the fourth step, however, S[3] is a space while W[3] = 'D', a mismatch. Rather than restarting the comparison at S[1], note that no 'A' occurs between positions 0 and 3 of S other than at position 0; hence, from the characters already compared, it is known that no match can begin at any of those positions. The search therefore moves to the next character, setting m = 4 and i = 0.

m:     4
S: ABC ABCDAB ABCDABCDABDE
W:     ABCDABD
i:     0

Continuing as before, the common string "ABCDAB" is matched, but at W[6] (S[10]) there is again a mismatch. From the comparisons just made, however, the algorithm has passed over "AB", which could be the beginning of a new match, so the comparison restarts from there. Since these characters are known to match the two characters before the current position, they need not be checked again; the search resumes with m = 8, i = 2.

m:         8
S: ABC ABCDAB ABCDABCDABDE
W:         ABCDABD
i:           2
This comparison fails immediately, because W does not contain a space character; so m is increased to 11 and i is set to 0.

m:            11
S: ABC ABCDAB ABCDABCDABDE
W:            ABCDABD
i:            0

Once again the two strings agree on the characters "ABCDAB", but the next character, 'C', does not match 'D' in W. As before, m is set to 15 and i to 2, and the comparison continues.

m:                15
S: ABC ABCDAB ABCDABCDABDE
W:                ABCDABD
i:                  2

This time a complete match is found, beginning at S[15].
The partial match table T determines where to resume matching after a comparison has failed. The array T is organized so that, if a match attempt beginning at S[m] fails when comparing S[m + i] with W[i], then the next possible match begins at index m + i - T[i] in S (T[i] is the amount by which the search must back up after a mismatch). Although the next match attempt begins at index m + i - T[i], as in the example above the first T[i] characters after that position need not be compared again, so the comparison simply continues from W[T[i]]. We have T[0] = -1, which indicates that if W[0] does not match there is nothing to back up and a new comparison starts at the next character. The following is sample pseudocode for the KMP search algorithm.
algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an integer (the zero-based position in S at which W is found)
    define variables:
        an integer, m ← 0
        an integer, i ← 0
        an array of integers, T
    while m + i is less than the length of S, do:
        if W[i] = S[m + i],
            let i ← i + 1
            if i equals the length of W,
                return m
        otherwise,
            if T[i] > -1,
                let m ← m + i - T[i], i ← T[i]
            else
                let i ← 0, m ← m + 1
    return the length of S
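The pseudocode above translates almost line for line into Python. In this sketch the partial match table T is passed in as an argument, which is a choice made for the example; the value used in the test below, [-1, 0, 0, 0, 0, 1, 2], is the table for W = ABCDABD derived in the text.

```python
def kmp_search(S, W, T):
    """Return the zero-based position of the first occurrence of W in S,
    or len(S) if W does not occur. T is the partial match table of W."""
    m = 0   # start of the current match attempt in S
    i = 0   # index of the current character in W
    while m + i < len(S):
        if W[i] == S[m + i]:
            i += 1
            if i == len(W):
                return m
        elif T[i] > -1:
            # update m first: the new start depends on the old value of i
            m = m + i - T[i]
            i = T[i]
        else:
            i = 0
            m += 1
    return len(S)
```

Note that m must be updated before i in the fallback branch, since the new value of m depends on the old value of i.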

Given the array T, the search part of the Knuth-Morris-Pratt algorithm has complexity $O(k)$, where $k$ is the length of the string S. Apart from the fixed overhead of entering and leaving the function, all computation is performed in the while loop, so it suffices to bound the number of iterations of this loop; to do so, the nature of the array T must be examined. By definition, the array is constructed so that if a match attempt that began at S[m] fails when comparing S[m + i] with W[i], then the next possible match begins at S[m + (i - T[i])]. In particular, the next possible match must begin at an index higher than m, so T[i] < i.
From this it follows that the loop executes at most 2k times. In each iteration it executes one of the two branches of the loop. The first branch increases i and leaves m unchanged, so the index m + i of the character of S being compared increases. The second branch adds i - T[i] to m, and as shown this is always a positive number. Thus the position m, the start of the current potential match, increases. The loop ends when m + i = k; therefore each branch of the loop can be taken at most k times, since the branches increase m + i and m respectively, and $m \le m + i$: if m = k, then certainly $m + i \ge k$, and since these quantities increase by unit steps at most, m + i = k must have held at some earlier point, at which the algorithm would already have terminated. Hence the loop executes at most 2k times, and the complexity of the search algorithm is $O(k)$.
2.1.2. The partial match table

The purpose of the table is to allow the algorithm to compare each character of S no more than once. The key observation about linear search that makes this possible is that, having compared some segment of the main string against an initial segment of the pattern, we know exactly at which positions a new potential match could begin before the current position. In other words, the pattern is "pre-searched" against itself to produce a list of fallback positions that skip as many hopeless characters as possible without losing any potential match [8].
For each position in W, we want the length of the longest segment that matches an initial segment of W and ends just before (not including) that position; this is how far the search may back up to continue matching. Thus T[i] is the length of the longest such segment ending at the element W[i - 1]. By convention the empty string has length 0. For a mismatch at the very first character of the pattern (where backing up is impossible), we set T[0] = -1.
Example: consider the string W = "ABCDABD". The table-building algorithm will be seen to have much in common with the main search algorithm. We set T[0] = -1. To compute T[1], we need a proper suffix of "A" that is also a prefix of W; there is none, so T[1] = 0. Similarly, T[2] = 0 and T[3] = 0.
Now consider the character W[4] = 'A', which equals the first character W[0]. Since T[i] is the length of the longest segment matching a prefix of W that ends at W[i - 1], we have T[4] = 0 and T[5] = 1. Similarly, the character W[5] equals W[1], so T[6] = 2. This gives the following table:
Table 2.1. The partial match table

i      0   1   2   3   4   5   6
W[i]   A   B   C   D   A   B   D
T[i]  -1   0   0   0   0   1   2

A more complex example is shown in the following table.

Table 2.2. Another example (␣ denotes a space)

i      0   1   2   3   4   5   6   7   8   9  10  11  12
W[i]   P   A   R   T   I   C   I   P   A   T   E   ␣   I
T[i]  -1   0   0   0   0   0   0   0   1   2   0   0   0

i     13  14  15  16  17  18  19  20  21  22  23
W[i]   N   ␣   P   A   R   A   C   H   U   T   E
T[i]   0   0   0   1   2   3   0   0   0   0   0

The table-building algorithm:

algorithm kmp_table:
    input:
        an array of characters, W
        an array of integers, T
    output:
        the array T
    define variables:
        an integer, pos ← 2
        an integer, cnd ← 0
    let T[0] ← -1, T[1] ← 0
    while pos is less than the length of W, do:
        (first case: the substring continues)
        if W[pos - 1] = W[cnd],
            let T[pos] ← cnd + 1, pos ← pos + 1, cnd ← cnd + 1
        (second case: it does not, but we can fall back)
        otherwise, if cnd > 0, let cnd ← T[cnd]
        (third case: we have run out of candidates; note that cnd = 0)
        otherwise, let T[pos] ← 0, pos ← pos + 1
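A Python sketch of the table-building pseudocode is given below. It follows the three cases exactly; the function name and the handling of patterns shorter than two characters are assumptions added so that the sketch runs on any input.

```python
def kmp_table(W):
    """Build the partial match table T for the pattern W."""
    T = [0] * len(W)
    if not W:
        return T
    T[0] = -1
    pos, cnd = 2, 0
    while pos < len(W):
        if W[pos - 1] == W[cnd]:
            # first case: the candidate prefix continues
            T[pos] = cnd + 1
            pos += 1
            cnd += 1
        elif cnd > 0:
            # second case: fall back to a shorter candidate
            cnd = T[cnd]
        else:
            # third case: no candidate left
            T[pos] = 0
            pos += 1
    return T
```

Running it on "ABCDABD" and on "PARTICIPATE IN PARACHUTE" reproduces the T rows of Tables 2.1 and 2.2.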

The complexity of the table-building algorithm is $O(n)$, where $n$ is the length of W. Apart from some initialization, all the work is done in the while loop, so it suffices to show that this loop runs in $O(n)$ time, which is done by examining the quantities pos and pos - cnd together. In the first case, pos - cnd is unchanged, since pos and cnd both increase by one. In the second case, cnd is replaced by T[cnd], which, as seen above, is always strictly less than cnd, so pos - cnd increases. In the third case, pos increases and cnd does not, so both pos and pos - cnd increase. Since $pos \ge pos - cnd$, this means that at each step either pos or a lower bound on pos increases; the algorithm terminates when pos = n, so it must terminate after at most 2n iterations of the loop, since pos - cnd starts at 2. Hence the complexity of the table-building algorithm is $O(n)$.
2.1.3. Complexity of the KMP algorithm

Since the two parts of the algorithm have complexities $O(k)$ and $O(n)$ respectively, the complexity of the whole algorithm is $O(n + k)$.
As the example above shows, the algorithm gains its advantage over simpler string-matching algorithms because it can skip characters that have already been examined. The less it has to back up, the faster it runs, and this is reflected in the table T by the presence of zeros. A word such as "ABCDEFG" works well with this algorithm because it contains no repetition of its beginning, so its table consists only of zeros with a -1 at the start. By contrast, for the word W = "AAAAAAA" the algorithm performs poorly, because the table is:

Table 2.3. The worst-case pattern for the KMP algorithm

i      0   1   2   3   4   5   6
W[i]   A   A   A   A   A   A   A
T[i]  -1   0   1   2   3   4   5

This is the worst possible pattern for the table T, and it can be paired with a text such as S = "AAAAAABAAAAAABAAAAAAA". In this case the algorithm tries to match each 'A' against each 'B' before giving up; the result is the maximum possible number of steps, approaching twice the number of characters of S as the number of repetitions of "AAAAAAB" grows. Although the table-building routine is fast for this word (it has no effect, since no backtracking ever helps), it runs only once for a given W, whereas the search routine may run many times. If each time the word W is searched for in a string like S, the overall cost is as high as it can be. By way of comparison, this particular combination is a best case for the Boyer-Moore string-matching algorithm.

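2.2. The fuzzy KMP algorithm

2.2.1. The pattern-matching automaton

Since the fuzzy value is the length of the longest prefix of the pattern P that has appeared in S, the automaton has the set of integers {0, 1, ..., m} as its set of states. The fuzzy pattern-matching automaton operates as follows [4]:
* Initially the pointer on S is j = 0. No prefix of the pattern has appeared there, so the start state of the automaton is $q_0 = 0$.
* S is scanned one character at a time, starting from $S_1$. If the state of the automaton is $q$, then after reading the character $S_j$ the new state (corresponding to position $j$ in S) is $q' = \delta(q, S_j)$, where $\delta$ is the transition function of the automaton.
* At position $j$ in S, if the state of the automaton is $q$, then the longest prefix of P that has appeared in S has length $q$. If $q = m$, an occurrence of the pattern is signalled, beginning at position $j - m + 1$.
The fuzzy automaton model must be constructed appropriately to meet the matching requirements above.
The fuzzy pattern-matching automaton is the tuple $A(P) = (A, Q, q_0, \delta, F)$, where:
* Input alphabet: $A = A_P \cup \{\#\}$
* Set of states: $Q = \{0, 1, \ldots, m\}$
* Start state: $q_0 = 0$
* Final state: $F = m$
* Transition function $\delta: Q \times A \to Q$, $\delta(q, a) = TFuzz(q, a)$.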

2.2.2. The algorithm

When implementing the algorithm, care should be taken to choose a data structure that allows fast access to the TFuzz table.
Let A[0..k] be the array holding the alphabet A of the automaton, where k is the number of distinct characters in the pattern P. The array is sorted in increasing order of characters, and A[k] = '#'. For convenient access to the letters of A, an array index can be used to give the position of a letter in the alphabet [4]: index[c] = i if c = A[i], and index[c] = k if $c \notin \{A[0], A[1], \ldots, A[k-1]\}$. TFuzz is an array [0..m, 0..k], in which TFuzz[i, j] is the new fuzzy value when fuzzy value i meets a character x with index[x] = j.

The detailed algorithms for building the TFuzz table and for searching with it are as follows:
2.2.2.1. Algorithm for building TFuzz
procedure initTFuzz();
var i, j, t: integer;
begin
  { column k is the '#' column: a character outside the pattern resets the fuzzy value }
  for i := 0 to m do
    TFuzz[i, k] := 0;
  for j := 0 to k do TFuzz[0, j] := 0;
  TFuzz[0, index[P1]] := 1;
  for i := 0 to m do
    for t := 0 to k - 1 do
    begin
      if i = m then j := next[i + 1]
      else j := i + 1;
      { follow next until a position of P matching A[t] is found, or none is left }
      while (j > 0) and (Pj <> A[t]) do j := next[j];
      TFuzz[i, index[A[t]]] := j;
    end;
end;

2.2.2.2. Algorithm for searching for the pattern using the TFuzz table

procedure FPM();
var j, counter: integer;
    fuz: array [0..50] of integer; { fuzzy values }
begin
  j := 1; counter := 0; fuz[0] := 0;
  while (there is still a character Sj to read) do
  begin
    fuz[j] := TFuzz[fuz[j-1], index[Sj]];
    if fuz[j] = m then
    begin
      counter := counter + 1;
      { record position j - m + 1 }
    end;
    j := j + 1; { advance to the next character }
  end; {while}
  { report counter }
  if counter = 0 then { record position 0 };
end;
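The table-driven search can be sketched in Python as follows. Note the assumptions: this sketch does not use the next array of initTFuzz; instead it fills the table from the standard border (failure) array of the pattern, which gives the transition "longest prefix of P that is a suffix of what has been read". The names build_tfuzz and fuzzy_search, and the dictionary used in place of the index array, are chosen for this example. Column k plays the role of '#'.

```python
def build_tfuzz(pattern):
    """Return (tfuzz, index, k): tfuzz[q][col] is the new fuzzy value."""
    m = len(pattern)
    letters = sorted(set(pattern))
    index = {c: i for i, c in enumerate(letters)}
    k = len(letters)                      # column k stands for '#'
    # border[q] = length of the longest proper border of pattern[:q]
    border = [0] * (m + 1)
    b = 0
    for q in range(2, m + 1):
        while b > 0 and pattern[q - 1] != pattern[b]:
            b = border[b]
        if pattern[q - 1] == pattern[b]:
            b += 1
        border[q] = b
    tfuzz = [[0] * (k + 1) for _ in range(m + 1)]
    for q in range(m + 1):
        for c, col in index.items():
            if q < m and pattern[q] == c:
                tfuzz[q][col] = q + 1
            elif q > 0:
                # reuse the row of the border state, already filled in
                tfuzz[q][col] = tfuzz[border[q]][col]
    return tfuzz, index, k


def fuzzy_search(text, pattern):
    """Return 1-based start positions of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    tfuzz, index, k = build_tfuzz(pattern)
    fuz = 0
    hits = []
    for j, ch in enumerate(text, start=1):
        fuz = tfuzz[fuz][index.get(ch, k)]   # one table lookup per character
        if fuz == m:
            hits.append(j - m + 1)
    return hits
```

Each text character costs exactly one table lookup, which is the property the fuzzy approach relies on.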

Example: with the pattern P = aababaab, A = {a, b, #} and $A_P$ = {a, b}.

The next table is as follows:
Table 2.4. The next table

next[i]

The TFuzz table computed from the next array is as follows:


Table 2.5. The TFuzz table

Q \ A   a   b   #
0       1   0   0
1       2   0   0
2       0   3   0
3       4   0   0
4       2   5   0
5       6   0   0
6       7   0   0
7       2   8   0
8       4   0   0

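The matching process on the data stream S = aabaababaababaababaab is as follows:

j = 11: occurrence recorded at position 11 - 8 + 1 = 4
j = 16: occurrence recorded at position 16 - 8 + 1 = 9
j = 21: occurrence recorded at position 21 - 8 + 1 = 14

Table 2.6. Illustration of the example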

2.2.3. Comparing KMP with the fuzzy KMP algorithm

Suppose that a prefix of P of length i - 1 has appeared in S up to position j - 1, that is, $P(i - 1) = \mathrm{suf}_{i-1}(S(j - 1))$, or equivalently the fuzzy value at position j - 1 is i - 1. Considering the character $S_j$, two cases can arise:

Case 1: $S_j = P_i$ (the fuzzy value at position j is i). Both i and j are increased by 1. In this case KMP runs at the same speed as the fuzzy approach.
Case 2: $S_j \ne P_i$ (the fuzzy value at position j is not i).

Figure 2.1. Moving the pointer on the pattern

In KMP, the pointer j on S stays where it is and only the pointer on the pattern is moved (by the statement i := next[i]). Time is therefore spent following the next table, possibly several times. For example, with P = aababaab, the next table from the earlier example is used to find occurrences of the pattern in the character stream S as follows:

S = a a c ...             j = 3
P = a a b a b a a b       i = 3
first shift:              i = next[i] = 2
second shift:             i = next[i] = 0

Here i := next[i] is executed twice and $S_j$ is compared with $P_i$ twice (for different values of i).

In general, next may be followed many times while the pointer on S stays fixed. This is noticeably slower than the fuzzy approach, in which each character $S_j$ read causes exactly one update of the fuzzy value through the automaton: fuz[j] = TFuzz(fuz[j-1], $S_j$). This statement executes very quickly if TFuzz is represented as an array and is carefully precomputed from the pattern P.
The following results compare the speed of finding occurrences of a pattern P in a large data file S with the KMP algorithm and with the fuzzy approach, on an IBM PC running at 233 MHz.
Table 2.7. Results of finding occurrences of pattern P in file S with KMP and with the fuzzy approach

No.  Pattern P    Size of file S   T_KMP    T_Fuzzy-KMP
1)   aababcab     1400 KB          17% s    11% s
2)   MDSVF6V      140.000 KB       35 s     30 s
3)   bacabccaa    1200 KB          16% s    10% s
4)   S068FAB50    140.000 KB       37 s     30 s

2.3. The fuzzy KMP-BM algorithm

2.3.1. Idea of the algorithm

In the Boyer-Moore (BM) algorithm, the characters of the pattern P are scanned from right to left, starting from $P_m$. When a mismatch is encountered, say $P_i = a$ while $S_j = b$, the algorithm decides how far to move the pointer on the pattern. The shift associated with each character of P, should a mismatch occur there, is determined in the preprocessing step for P. In the bad-character rule of BM, the case that allows the best (longest) shift is when the character b does not occur in P. Combining this detail with KMP-style matching gives a general fuzzy algorithm in the style of both KMP and BM, in which the fuzzy value is still computed from the TFuzz function while long jumps are made over the target string, giving high search efficiency [4].
The idea of the algorithm is as follows. Let ptr be the pointer on the target string S (initially ptr = 0, and the fuzzy value there is 0, indicating that no part of the pattern has been found). At each step a block w of m + 1 consecutive characters of S, starting at position ptr, is examined; this block is called the observed block, and $w_i$ denotes the i-th character of w (with $w_1 = S_{ptr}$). Using the TFuzz table, the fuzzy value on reading $w_1$ (that is, $S_{ptr}$) is computed and denoted n1; at the same time the next jump, from which the new block w will be taken, is determined and denoted n2. If n1 is the fuzzy value at $S_{ptr}$, then $\mathrm{suf}_{n1}(S(ptr)) = P(n1)$. The following cases arise:
* If n1 = m, the pattern occurs starting at position ptr - m + 1 in S. To avoid missing overlapping occurrences of the pattern, set n2 = 1.
* If n1 is greater than the fuzzy value at the position examined just before ptr, there is hope of finding the pattern, so n2 = 1.
* In the remaining cases, only the prefix P(n1) matches the suffix of length n1 of S(ptr). If matching were to succeed within the observed block w, the character $P_m$ would be matched against the character $S_{ptr+m-n1}$ (that is, $w_{1+m-n1}$). Therefore, if $S_{ptr+m-n1}$ is not a character that occurs in P, a long jump can be made so that the new w starts at position ptr + m - n1 + 1 in S, while still guaranteeing that no occurrence of the pattern is missed.
1+m-n1

http://www.lrc-tnu.edu.vn/

S ha bi Trung tm Hc liu
S

ptr
P(n1)

m+1

ptr+m-n1

- 55 -

Hnh 2.2. tng chung ca thut ton KMP-BM m

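2.3.2. Fuzzy pattern-matching automaton

2.3.2.1. Introduction

Let P be a pattern of length m over the alphabet A, and let $A_P$ be the set of characters that occur in P. The fuzzy pattern-matching automaton is $A(P) = (A^k, Q, q_0, F, \delta)$, where:
- $A^k$ is the input alphabet; each input letter is a string of k characters over A, with k = m + 1.
- Q is the finite set of states, $Q = \{q = (n_1, n_2) \mid n_1, n_2 \in \mathbb{N},\ 0 \le n_1 \le m,\ 1 \le n_2 \le k\}$; $n_1$ is the fuzzy degree at the current position and $n_2$ is the jump from the current position to the next one.
- $q_0$ is the initial state, $q_0 = (0, 1)$.
- F is the final state, $F = (m, 1)$.
- The transition function is $\delta: Q \times A^k \to Q$, $(q, w) \mapsto q' = \delta(q, w)$.

For $q = (n_1, n_2)$, the state $q' = (n_1', n_2')$ is determined as follows:

1. If $n_2 > 1$, set $n_1 = 0$.

2. Compute $n_1' = TFuzz(n_1, w_1)$.

3. If $n_1' = m$ or $n_1' > n_1$, then $n_2' = 1$. Otherwise ($n_1' < m$ and $n_1' \le n_1$): if $w_{1+m-n_1'} \in A_P$ then $n_2' = 1$, else $n_2' = 1 + m - n_1'$.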

2.3.2.2. Operation of the fuzzy pattern-matching automaton

Let P be a pattern of length m and S a target string of length n over the alphabet A, and let A(P) be the automaton defined above. We use A(P) to recognise all positions where P occurs in S and to count the total number of occurrences.

The basic automaton-based algorithm is described as follows. We use the notation:
- j is the observation pointer on S.
- q.n1, q.n2 are the two components of state q.
- w is the observed block starting at position j of the target string S, assuming that m characters # have been appended to the end of S.
- qold is the state of the automaton before w is read.
- q is the state of the automaton after w is applied, q = δ(qold, w).
- counter counts the occurrences of the pattern.

Step 1: Initialisation
    j := 0; counter := 0; qold.n1 := 0; qold.n2 := 1;
Step 2:
    while j + qold.n2 ≤ n do          {the next position read must lie inside S}
        j := j + qold.n2;
        read the observed block w;     {w_1 = S_j}
        {compute q = δ(qold, w)}
        if qold.n2 > 1 then qold.n1 := 0; endif;
        q.n1 := TFuzz(qold.n1, w_1);
        q.n2 := 1;
        if q.n1 = m then
            record the occurrence of the pattern at position j - m + 1;
            counter := counter + 1;
        else if q.n1 < m and q.n1 ≤ qold.n1 then
            if w_{1+m-q.n1} ∉ A_P then q.n2 := 1 + m - q.n1; endif;
        endif;
        endif;
        qold := q;
    end while;
2.3.3. Search algorithm

procedure GFSearching(); {search for the pattern}
var apr: array [1..N] of integer;
    counter, j, n1, n2, n1p, n2p: integer;   {n1p, n2p stand for n1', n2'}
begin
    j := 0; n1 := 0; n2 := 1; counter := 0;
    while j + n2 <= n do                      {the next position read must lie inside S}
    begin
        j := j + n2;
        if n2 > 1 then n1 := 0;
        n1p := TFuzz[n1, index[S[j]]];
        n2p := 1;
        if n1p = m then
        begin
            counter := counter + 1;
            apr[counter] := j - m + 1;        {start position of this occurrence}
        end
        else if (n1p < m) and (n1p <= n1) then
        begin
            if j + m - n1p > n then exit;     {no room left for another occurrence}
            {index[c] = k is taken to mark a character c that does not occur in P}
            if index[S[j + m - n1p]] = k then n2p := 1 + m - n1p;
        end;
        n1 := n1p; n2 := n2p;
    end;
end;
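
The procedure above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of [4]: it assumes that TFuzz(n, c) is the length of the longest prefix of P that is a suffix of P(n) followed by c (the usual KMP automaton), it uses 0-based positions, and the names build_tfuzz and fuzzy_kmp_bm_search are hypothetical.

```python
def build_tfuzz(pattern):
    """Assumed TFuzz table: table[n][c] is the length of the longest prefix
    of `pattern` that is a suffix of pattern[:n] + c."""
    m = len(pattern)
    # fail[n] = length of the longest proper border of pattern[:n]
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    table = [dict() for _ in range(m + 1)]
    for n in range(m + 1):
        for c in set(pattern):
            if n < m and pattern[n] == c:
                table[n][c] = n + 1              # the match is extended
            elif n == 0:
                table[n][c] = 0
            else:
                table[n][c] = table[fail[n]][c]  # fall back to a shorter border
    return table


def fuzzy_kmp_bm_search(pattern, text):
    """Return the 0-based start positions of all occurrences of pattern."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    tfuzz = build_tfuzz(pattern)
    in_pattern = set(pattern)                    # plays the role of A_P
    hits = []
    j, n1, n2 = -1, 0, 1                         # pointer, fuzzy degree, jump
    while j + n2 < n:
        j += n2
        if n2 > 1:
            n1 = 0                               # a long jump restarts matching
        new_n1 = tfuzz[n1].get(text[j], 0)       # characters outside P give 0
        new_n2 = 1
        if new_n1 == m:
            hits.append(j - m + 1)
        elif new_n1 <= n1:
            look = j + m - new_n1                # where P_m would have to land
            if look >= n:
                break                            # no room for another occurrence
            if text[look] not in in_pattern:
                new_n2 = 1 + m - new_n1          # BM-style long jump
        n1, n2 = new_n1, new_n2
    return hits


print(fuzzy_kmp_bm_search("aba", "ababa"))
print(fuzzy_kmp_bm_search("abc", "xxxxabcxxabc"))
```

In the second call the first character read is x, which gives fuzzy degree 0; the character three positions ahead is also outside the pattern, so the search jumps four positions at once instead of reading every character.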

2.4. Chapter conclusion

This chapter presented the KMP string-matching algorithm, the fuzzy KMP algorithm and the fuzzy KMP-BM algorithm. These results are the basis for solving the problem of information retrieval on text in the next chapter.

CHAPTER 3. APPLYING THE KMP ALGORITHM TO INFORMATION RETRIEVAL ON TEXT

3.1. The problem of pattern searching in text
3.1.1. Pattern searching

Text is the most common form of data representation in information systems. Text searching is a central problem in document management. A more basic and more general form of it is string searching, or string matching. The notion of a string here is broad: it may be a text string made of letters, digits and special characters, a binary string, or a gene sequence. String searching is the problem of finding a pattern with certain properties in a given sequence of symbols, so it is also called pattern-string matching or, for short, pattern matching. The simplest form is to find the occurrences of a given string in another string (called the target string).

Recently, string matching has become more important and has drawn more attention because of the rapid growth of information extraction systems and bioinformatics systems. Another reason is that people today not only face a huge amount of information but also pose increasingly complex search requests. A pattern is no longer just a character string; it may contain wildcard characters, gaps and regular expressions. A match is no longer just an exact occurrence of the pattern; some differences between the pattern and its occurrence in the text may be allowed. Hence, besides the classical problem of exact searching, an interesting research direction has emerged: approximate searching.

Pattern-matching algorithms can be divided into two groups. The first consists of online algorithms, in which only the pattern is preprocessed (usually with an automaton or using combinatorial properties of strings), while the text is not. The second preprocesses the text by building a data structure over it (indexing). Many applications need the second solution even though fast online algorithms exist, because they must handle a volume of text so large that no online algorithm can process it efficiently. Searching on an index is in fact also based on online searching.

To reduce redundancy when storing and transmitting data, a common solution is data compression. Compression makes files occupy less storage and reduces transmission time and cost, but it destroys most of the structure of the data, which makes searching and information extraction difficult. The simplest but very time-consuming way (hardly feasible for very large texts) is to decompress everything and then search with a classical pattern-matching algorithm. Better solutions now exist along two main directions: compressed pattern matching and pattern matching in the compressed domain. Compressed pattern matching compresses the pattern first and then searches for it in the compressed text, whereas pattern matching in the compressed domain works on text that is compressed part by part. Compressing text data is essentially an encoding process that turns source messages (over the source alphabet A) into codewords (over the code alphabet B); the reverse process is decoding. A search algorithm on compressed text can therefore be applied to text encrypted as character blocks. However, for security reasons, encrypted texts need search algorithms that do not leak information during the search itself.
3.1.2. Information retrieval

3.1.2.1 Introduction

The purpose of retrieval is to return to the user a set of information that satisfies their need. We call the information need the query and the selected pieces of information the documents. Every approach to information retrieval has two main components: techniques for representing information (queries and documents), and a method for comparing these representations. The goal is to automate the examination of documents by computing the correlation between queries and documents. This automatic process succeeds when it returns results similar to those a human would produce when comparing the query with the documents.

Figure 3.1. Model of representing and comparing information

Let Q be the domain of the query representation function q, that is, the set of possible queries, and let R be its range, the unified space in which information is represented. Let D be the domain of the document representation function d, that is, the set of documents; its range is also R. The domain of the comparison function c is $R \times R$ and its range is [0,1], the real numbers from 0 to 1. In an ideal retrieval system:

$$c(q(query), d(doc)) = j(query, doc), \quad \forall\, query \in Q,\ \forall\, doc \in D$$

where $j: Q \times D \to [0,1]$ represents the user's judgement of the relationship between the information in the query and the information in the document.

There are several measures for evaluating how well an information retrieval system returns results. All of them require a document collection and a query on that collection, and assume that each document is either relevant or not relevant to the query.

Precision: the ratio of the relevant documents returned to all documents returned:

$$\text{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{returned documents}\}|}{|\{\text{returned documents}\}|}$$

Recall: the ratio of the relevant documents returned to all relevant documents:

$$\text{Recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{returned documents}\}|}{|\{\text{relevant documents}\}|}$$

Fall-out: the ratio of the non-relevant documents returned to all non-relevant documents:

$$\text{Fall-out} = \frac{|\{\text{non-relevant documents}\} \cap \{\text{returned documents}\}|}{|\{\text{non-relevant documents}\}|}$$
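
The three measures can be computed directly from sets of document identifiers. The following is a small sketch under assumed data: the function name retrieval_metrics and the ten document ids are hypothetical and serve only to illustrate the definitions.

```python
def retrieval_metrics(returned, relevant, collection):
    """Precision, recall and fall-out for one query, from sets of doc ids."""
    returned, relevant, collection = set(returned), set(relevant), set(collection)
    non_relevant = collection - relevant
    correct = returned & relevant
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    fall_out = (len(returned & non_relevant) / len(non_relevant)
                if non_relevant else 0.0)
    return precision, recall, fall_out


collection = range(1, 11)      # ten hypothetical documents
relevant = {1, 2, 3, 4}        # documents that truly match the query
returned = {1, 2, 5}           # documents the system returned
precision, recall, fall_out = retrieval_metrics(returned, relevant, collection)
print(precision, recall, fall_out)
```

Here two of the three returned documents are relevant (precision 2/3), two of the four relevant documents were found (recall 1/2), and one of the six non-relevant documents was returned (fall-out 1/6).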

3.1.2.2 Commonly used information retrieval models

Matching (Boolean) model: in this model each query is viewed as a logical expression over terms, in which terms are combined with the logical operators AND, OR and NOT. Each document is viewed as a set of terms. Each term holds a list of the documents that contain it, called its posting list; matching traverses the posting lists to check whether a document contains a term.
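
As a rough illustration of Boolean matching on posting lists, the sketch below stores each posting list as a Python set of document ids. The postings and the helper name docs are made-up examples, not data from this thesis.

```python
# Hypothetical posting lists: term -> ids of the documents containing it
postings = {
    "kmp": {1, 2, 4},
    "lucene": {2, 3},
    "fuzzy": {4},
}


def docs(term):
    """Posting list of a term (empty if the term is not indexed)."""
    return postings.get(term, set())


and_result = docs("kmp") & docs("lucene")   # kmp AND lucene
or_result = docs("kmp") | docs("fuzzy")     # kmp OR fuzzy
not_result = docs("kmp") - docs("lucene")   # kmp AND NOT lucene

print(sorted(and_result), sorted(or_result), sorted(not_result))
```

Each operator maps to one set operation, and the answer is simply a set of documents with no ranking among them.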
Scoring method: the matching model returns only a Boolean value, whether or not the terms are present in the document, so the results are not ranked. As a result, the results may not match what the user wants. To improve the model, a score is computed for each result, based on the weights of the terms in the document.

The frequency of term t in document d is denoted $tf_{t,d}$. The inverse document frequency of term t over a collection of N documents is $idf_t = \log(N / df_t)$, where $df_t$ is the number of documents that contain t. The weight of term t in document d is then $\text{tf-idf}_{t,d} = tf_{t,d} \times idf_t$, and the score of document d is the sum of the weights of the query terms that appear in d:

$$Score(q,d) = \sum_{t \in q} \text{tf-idf}_{t,d}$$

For example, in a collection of 1000 documents, 100 documents contain the term "tin" and 150 documents contain the term "học". Suppose the first document d contains 3 occurrences of "tin" and 4 occurrences of "học". Then the score of the query q = "tin học" on document d (with base-10 logarithms) is:

$$Score(q,d) = tf_{tin,d} \times idf_{tin} + tf_{học,d} \times idf_{học} = tf_{tin,d} \log(N/df_{tin}) + tf_{học,d} \log(N/df_{học})$$
$$= 3 \log(1000/100) + 4 \log(1000/150) \approx 6.30$$
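
The worked example can be checked with a few lines of Python. This is a sketch of the scoring rule described above, assuming base-10 logarithms; the function name tf_idf_score is hypothetical. With these assumptions the example evaluates to about 6.30.

```python
import math


def tf_idf_score(query_terms, tf_in_doc, df, n_docs):
    """Sum of tf * log10(N / df) over the query terms that occur in the document."""
    score = 0.0
    for term in query_terms:
        tf = tf_in_doc.get(term, 0)
        if tf > 0 and df.get(term, 0) > 0:
            score += tf * math.log10(n_docs / df[term])
    return score


score = tf_idf_score(
    ["tin", "học"],
    tf_in_doc={"tin": 3, "học": 4},    # occurrences in document d
    df={"tin": 100, "học": 150},       # number of documents containing each term
    n_docs=1000,
)
print(round(score, 2))  # 6.3
```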

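Vector space model: the matching model and the scoring method above do not take into account the role of the terms in the query. For example, two documents containing the sentences "Mary is quicker than John" and "John is quicker than Mary" have the same terms in the same quantities, although the roles of the terms are completely different. The vector space model represents a document d as a vector V(d) of term frequencies. (Because such a vector ignores word order, the two example sentences map to the same vector; the model distinguishes documents by how strongly each term is represented, not by the order of the words.) With this representation we can compute the degree of correlation between two documents:

$$sim(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{|V(d_1)|\,|V(d_2)|}$$

A query q is likewise viewed as a vector V(q) of the frequencies of the query terms. The degree of correlation between the two vectors is computed as the cosine of the angle between them:

$$\cos\theta = \frac{V(q) \cdot V(d)}{|V(q)|\,|V(d)|}$$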
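
A minimal sketch of the cosine measure on term-frequency vectors is given below. The tokenisation (lower-casing and splitting on whitespace) and the names term_vector and cosine_similarity are assumptions made for this illustration only.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector of a text (lower-cased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u)   # a Counter returns 0 for missing terms
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


d1 = term_vector("Mary is quicker than John")
d2 = term_vector("John is quicker than Mary")
q = term_vector("quicker John")

print(cosine_similarity(d1, d2))
print(cosine_similarity(q, d1))
```

The two example sentences contain the same terms with the same counts, so their vectors coincide and the first similarity is 1 (up to rounding); the query shares two of the five terms of d1, which gives 2/√10 ≈ 0.63.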

Figure 3.2. The vector space model

For example, consider the query "best car insurance" on a collection of N = 1,000,000 documents in which the document frequencies of the four terms auto, best, car and insurance are 5,000, 50,000, 10,000 and 1,000 respectively. The score is computed as in the following table:

Table 3.1. Score calculation

Thus the score is Score(q,d) = 0 + 0 + 0.82 + 2.46 = 3.28.

3.2. The open-source library Lucene

3.2.1. Introduction

In 1998, Doug Cutting, formerly an employee of Excite and Yahoo and now working at the Apache Software Foundation, began building the open-source information retrieval library Lucene. The goal was to develop it into a complete document search library that lets application developers easily integrate search functionality into their systems [10].

Lucene is an information retrieval library with high processing capacity and scalability that can be integrated into applications. Lucene is an open-source project originally developed in Java; today it is available in many other languages such as Delphi, Perl, C#, C++, Python, Ruby and PHP.

The library provides the basic functions needed for indexing and searching. To use Lucene, the data must already be available. The data may be a collection of PDF or Word files or HTML web pages, or data stored in a database management system such as MS SQL Server or MySQL. With Lucene, the existing data can be indexed so that full-text search can later be performed on it.

Figure 3.3. The Lucene indexing model

Indexing component: it comprises the functional parts that handle index creation, taking text as input and producing an index as output. Lucene only works on text whose content has already been extracted as plain characters. It allows indexing on each field of a document and allows a weight to be set for each field to raise its importance at search time.
- Directory: defines the storage area, that is, where the index is stored on external storage and in RAM during index creation.
- Document and Field: define a document and its fields as used for indexing; they are also used to retrieve results for the Search component.
- Analyzer: processes and tokenises the text to extract its content, normalises it and removes unnecessary terms, in preparation for indexing.
- IndexWriter: the main part of the Indexing component. It creates a new index or opens an existing one, and then adds to or updates the content of the index.

3.2.2. Steps for using Lucene

1. Describe the objects to be indexed: Lucene treats each object to be indexed as a Document. Each Document may have several Fields, and each Field corresponds to one attribute of the object. For example, suppose we want to search HTML web pages. The objects to be indexed are then the HTML pages, and their attributes may be the host, the path, the title, the metadata and the content of the page itself. For each Field you can choose whether or not to index it. If a Field is indexed, you can search on it. Fields that are not indexed are usually those that are unimportant for searching and serve mainly to present the returned results.
2. Indexing: build utility functions that convert the original data into the data described in the Document. For example, if the original data are PDF or Word files, there must be functions that can read these formats and convert them into the corresponding text strings. Indexing is a fairly complex operation. First the text is analysed into keywords while unused words are removed (stop words; in English, words such as a, an, the are stop words). The keywords are then used to build an inverted index, which is stored in segments in a form convenient for later searching. An inverted index stores, for a given word, the list of documents that contain it. It is called inverted because normally one stores, for a given document, the list of words it contains. For example, for the keyword Lucene we store the list of web pages A, B, C that contain this keyword. Later, when a user types the keyword Lucene, this list makes it possible to locate quickly the web pages that contain it. With an ordinary index, all web pages in the database would have to be scanned to find them, which is very time-consuming when the amount of data is large.
3. Searching: once the data have been indexed, searches can be run on them. Full-text search lets you search by a list of keywords combined with logical operators.
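
The inverted index described in step 2 can be sketched in a few lines of Python. This is not Lucene's implementation or API; the function build_inverted_index, the stop-word set and the three sample pages are hypothetical and only illustrate the data structure.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the"}   # the English stop words mentioned in the text


def build_inverted_index(documents):
    """Map each keyword to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:
                index[token].add(doc_id)
    return index


pages = {
    "A": "Lucene is a search library",
    "B": "the Lucene index",
    "C": "a KMP search",
}
index = build_inverted_index(pages)
lucene_pages = sorted(index.get("lucene", set()))
print(lucene_pages)
```

Looking up a keyword is a single dictionary access, whereas without the index every page would have to be scanned for every query.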
Lucene is not an application or a complete search engine that users can use right away. It is only a library that provides the most important components of a search engine: indexing and querying. Because it provides only these core components, users have great flexibility in applying it to their own products and in making improvements to suit their needs.

3.3. Application: information retrieval on text

The application consists of two components:
1. Indexing component: its main functions are selecting the data to be indexed, analysing the documents, creating the index and saving it to index files, and updating the index when content is added or changed.
2. Search component: applies the KMP algorithm to perform the search and displays the list of results with links to the original documents.

Figure 3.4. Model of the text information retrieval application

3.4. Implementation of the test program

3.4.1. Solution and technologies used

- Technology and tools: the application is built with Microsoft Visual Studio 2010 on the .NET Framework 4.0. The open-source library Lucene is used to read, analyse and index the stored documents.
- Algorithm: the KMP algorithm is used for pattern matching.

3.4.2. Program content

The program uses the IFilter library to read and parse Microsoft Office files and to extract the text from them. IFilter is a library built into Windows (from Windows 2000 onward).

Function that reads and parses a text file so that its content can be indexed:

// Requires: using System.Text; using System.Runtime.InteropServices;
// IFilter, STAT_CHUNK, IFILTER_INIT, CHUNKSTATE, Constants and loadIFilter
// come from the IFilter wrapper code, which is not shown in this listing.
public static string Parse(string filename)
{
    IFilter filter = null;
    try
    {
        StringBuilder plainTextResult = new StringBuilder();
        filter = loadIFilter(filename);
        STAT_CHUNK ps = new STAT_CHUNK();
        IFILTER_INIT mFlags = 0;
        uint i = 0;
        filter.Init(mFlags, 0, null, ref i);
        // Walk through the chunks of the file; 0 means a chunk was returned.
        int resultChunk = filter.GetChunk(out ps);
        while (resultChunk == 0)
        {
            if (ps.flags == CHUNKSTATE.CHUNK_TEXT)
            {
                uint sizeBuffer = 60000;
                int resultText = 0;
                // Read the text of the current chunk piece by piece.
                while (resultText == Constants.FILTER_S_LAST_TEXT || resultText == 0)
                {
                    sizeBuffer = 60000;
                    System.Text.StringBuilder sbBuffer =
                        new System.Text.StringBuilder((int)sizeBuffer);
                    resultText = filter.GetText(ref sizeBuffer, sbBuffer);
                    if (sizeBuffer > 0 && sbBuffer.Length > 0)
                    {
                        string chunk = sbBuffer.ToString(0, (int)sizeBuffer);
                        plainTextResult.Append(chunk);
                    }
                }
            }
            resultChunk = filter.GetChunk(out ps);
        }
        return plainTextResult.ToString();
    }
    finally
    {
        // Always release the COM object, even if parsing fails.
        if (filter != null)
            Marshal.ReleaseComObject(filter);
    }
}

- Function that builds the comparison table:
public static int[] BuildTable(string p)
{
    // result[i] = length of the longest proper prefix of p[0..i]
    // that is also a suffix of p[0..i]; result[0] stays 0.
    int[] result = new int[p.Length];
    for (int i = 1; i < p.Length; i++)
    {
        // The substring from p[0] to p[i]
        string s = p.Substring(0, i + 1);
        var prefixes = Enumerable.Range(1, s.Length - 1)
            .Select(a => s.Substring(0, a)).ToList();
        var suffixes = Enumerable.Range(1, s.Length - 1)
            .Select(a => s.Substring(a, s.Length - a)).ToList();
        // Prefixes are listed by increasing length, so the last common one is the longest.
        var common = prefixes.Intersect(suffixes).LastOrDefault();
        result[i] = (common == null) ? 0 : common.Length;
    }
    return result;
}

- Search function:
// `pattern` is assumed to be a static field of the enclosing class (its
// declaration is not shown in this listing), and x = BuildTable(pattern).
private static int SearchKMP(int[] x, string s)
{
    int n = s.Length;
    int l = x.Length;
    if (l == 0) return 0;        // an empty pattern matches at position 0
    int j = 0;                   // number of pattern characters matched so far
    for (int i = 0; i < n; i++)
    {
        // On a mismatch, fall back using the table instead of moving back in s.
        while (j > 0 && s[i] != pattern[j])
            j = x[j - 1];
        if (s[i] == pattern[j])
            j++;
        if (j == l)
            return i - l + 1;    // Found match, return position of its first letter
    }
    return -999; // not found
}
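
For comparison, the same two steps (building the table and scanning the text) can be written compactly in Python. This is a sketch of the standard KMP algorithm that reports every occurrence rather than only the first; it is not the program's actual code, and the names build_table and kmp_search_all are hypothetical.

```python
def build_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search_all(pattern, text):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    table = build_table(pattern)
    matches = []
    j = 0                                  # pattern characters matched so far
    for i, c in enumerate(text):
        while j > 0 and c != pattern[j]:
            j = table[j - 1]               # reuse the part that still matches
        if c == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = table[j - 1]               # continue, allowing overlaps
    return matches


print(build_table("aababcab"))
print(kmp_search_all("aba", "ababa"))
```

The table is computed in time linear in the pattern length, unlike the substring-based BuildTable listing, and the text pointer never moves backwards, so the search itself is linear in the length of the text.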

3.4.3. Experimental results

3.4.3.1. Main interface of the program

Figure 3.5. Main interface of the program

- Select the folder that contains the files (doc, xls, ppt, html, txt) to be searched.
- Click the "Load data" button to index the text files in the folder.
- Enter the keyword to search for in the documents.
- The program reads the content of the text files in the folder and then matches the keyword against the content of the documents (full-text search) using the KMP algorithm.
- The program lists the files whose content contains the keyword. Selecting one of the result files opens it so that its content can be viewed.

3.4.3.2. Test results of the program when searching with the keyword "Văn bản"

Figure 3.6. Search results of the program

3.5. Conclusion of Chapter 3

This chapter presented the pattern-searching problem and information retrieval on text. The KMP algorithm was applied to build a simple test program in the C# programming language on the Windows operating system, and the program was tested with several search key phrases on stored text files.

CONCLUSION
Assessment of the results:
In the course of the research and implementation, the thesis achieved the following results:
- It introduced the basic concepts of string matching, the approaches, the forms of matching and several pattern-matching algorithms.
- It presented the KMP algorithm, the fuzzy KMP algorithm and the fuzzy KMP-BM algorithm.
- It implemented the KMP algorithm in the C# programming language on the Windows operating system and then tested searching with several key phrases on stored text files.
Limitations:
- The test program is still simple. It can search only a few basic formats: doc, ppt, xls, html, txt. Searching is not yet supported for some formats: pdf, docx, xlsx, pptx.
- The program only searches on the local machine; it does not yet support searching over a LAN or the Internet.
Future work:
Based on the results achieved, the author proposes the following work for the near future:
- Continue to resolve the remaining issues in the test program, such as the handling of input data and building a friendlier, easier-to-use interface.
- Continue to study applications and develop the program to support searching over a LAN and searching on the Internet through a website.
REFERENCES

Vietnamese
[1] Đặng Huy Ruận (2011), Lý thuyết thuật toán, NXB Đại học Quốc gia Hà Nội.
[2] Robert Sedgewick (1994), Cẩm nang thuật toán, Tập 1: Các thuật toán thông dụng, NXB Khoa học và Kỹ thuật.
[3] Nguyễn Hữu Điển (2006), Một số vấn đề về thuật toán, NXB Giáo dục.
[4] Vũ Thành Nam, Phan Trung Huy, Nguyễn Thị Thanh Huyền (2005), Mã tích đàn hồi và tìm kiếm trên văn bản mã hoá sử dụng thuật toán so mẫu theo tiếp cận mờ, Báo cáo khoa học tại Hội nghị Ứng dụng toán học toàn quốc lần 2, Hà Nội, 12/2005.
[5] Phan Thị Tươi (1986), Trình biên dịch 1986 (Chương 3: Bộ phân tích từ vựng), NXB Giáo dục.
English
[6] Thomas H. Cormen (2009), Introduction to Algorithms, MIT Press.
[7] Christian Charras, Thierry Lecroq (2000), Handbook of Exact String-Matching Algorithms.
[8] Donald Knuth, James H. Morris, Jr., Vaughan Pratt (1977), Fast pattern matching in strings, SIAM J. Comput., Vol. 6, No. 2, June 1977.
[9] Aho A.V. (1992), Algorithms for finding patterns in strings, Chapter 5 of Jan van Leeuwen (ed.), Handbook of Theoretical Computer Science "Algorithms and Complexity", The MIT Press, pp. 255-300.
[10] Erik Hatcher, Michael McCandless, Otis Gospodnetic (2008), Lucene in Action, Apache Jakarta Project Management Committee.