You are on page 1of 7

Hash: A String Matching Algorithm

Author: Le Khac Minh Tue

1. Gii thiu:
a. Hon cnh:
Mt lp nhng bi ton rt c quan tm trong khoa hc my tnh ni chung v lp
trnh thi c ni ring, l x l xu chui. Trong lp bi ton ny, ngi ta thng rt hay
phi i mt vi mt bi ton: tm kim xu chui.
b. Pht biu bi ton:
i. Cho mt on vn bn, gm k t.
ii. Cho mt on mu, gm k t.
iii. My tnh cn tr li cu hi: on mu xut hin bao nhiu ln trong
on vn bn v ch ra cc v tr xut hin .
c. Thut ton:
C rt nhiu thut ton c th gii quyt bi ton ny. Ngi vit xin tm
tt 2 thut ton ph bin c dng trong cc k thi lp trnh:
i. Brute-force approach:
Vi mt cch tip cn trc tip, chng ta c th thu c thut ton
gii. Tuy nhin phc tp ca n l rt ln trong trng hp xu nht.
Thut ton brute-force so khp tt c cc v tr xut hin ca on mu
trong on vn bn. C th phc tp cho thut ton ny l ().
ii. Knuth-Morris-Pratt algorithm
Hay cn c vit tt l KMP, c pht minh vo nm 1974, bi
Donald Knuth, Vaughan Pratt v James H. Morris. Thut ton ny s
dng mt correction-array, l mt thut ton rt hiu qu, c phc
tp l ( + ).
d. Mc ch bi vit:
Trong bi vit ny, ngi vit ch tp trung vo mt thut ton. Tc gi xin gi thut ton
ny l Hash. Theo nh bn thn ngi vit nh gi, y l thut ton rt hiu qu c bit
l trong thi c. N hiu qu bi 3 yu t: tc thc thi, linh ng trong vic s dng (ng
dng hiu qu) v s n gin trong ci t.
u tin, ngi vit xin c trnh by v thut ton ny. Sau , ngi vit s trnh by
mt vi ng dng, cch s dng v pht trin thut ton Hash trong cc bi ton tin hc.
2. Thut ton Hash:
a. K hiu:
i. Tp hp cc ch ci c s dng: .
1

Hash: A String Matching Algorithm


Author: Le Khac Minh Tue

on vn bn: [1. . ],
on mu: [1. . ],
on con t n ca mt xu : [. . ]
Chng ta cn tm ra tt c cc v tr (1 + 1) tha mn:
[. . + 1] = .
b. M t thut ton:
ii.
iii.
iv.
v.

n gin, gi s rng = {, , , }, ngha l ch gm cc ch ci Latin in


thng. biu din mt xu, thay v dng ch ci, chng ta s chuyn sang biu din dng
s. V d: xu c vit di dng s l mt dy gm 4 s: (0,2,25,3). Nh vy, mt
xu c biu din di dng mt s h c s 26. T y suy ra, 2 xu bng nhau khi v
ch khi biu din ca 2 xu h c s 10 ging nhau.
y chnh l t tng ca thut ton: i 2 xu t h c s 26 ra h c s 10, ri em
so snh h c s 10. Tuy nhin, chng ta nhn thy rng, khi i 1 xu ra biu din h c
s 10, biu din ny c th rt ln v nm ngoi phm vi lu tr s nguyn ca my tnh.
khc phc iu ny, chng ta chuyn sang so snh 2 biu din ca 2 xu h c
s 10 sau khi ly phn d cho mt s nguyn ln. C th hn: nu biu din trong h thp
phn ca xu l v biu din trong h thp phn ca xu l , chng ta s coi bng
khi v ch khi = , trong l mt s nguyn ln.
D dng nhn thy vic so snh vi ri kt lun c bng vi
hay khng l sai. = ch l iu kin cn bng ch cha phi
iu kin . Tuy nhin, chng ta s chp nhn lp lun sai ny trong thut ton Hash. V
coi iu kin cn nh iu kin . Trn thc t, lp lun sai ny c nhng lc dn n so
snh xu khng chnh xc v chng trnh b chy ra kt qu sai. Nhng cng thc t cho
thy rng, khi chn l mt s nguyn ln, s lng nhng trng hp sai rt t, v ta c
th coi Hash l mt thut ton chnh xc.
n gin trong vic trnh by tip thut ton, chng ta s gi biu din ca mt
xu trong h thp phn sau khi ly phn d cho l m Hash ca xu . Nhc li, 2 xu
bng nhau khi v ch khi m Hash ca 2 xu bng nhau.
Tr li bi ton ban u, chng ta cn ch ra xut hin nhng v tr no trong .
lm c vic ny, chng ta ch cn duyt qua mi v tr xut pht c th ca trong .
Gi s v tr l , chng ta s kim tra [. . + 1] c bng vi hay khng. kim tra
iu ny, chng ta cn tnh c m Hash ca on [. . + 1] v m Hash ca xu .
tnh m Hash ca xu chng ta ch cn lm n gin nh sau:
hashP = 0
for (i : 1 .. n)
hashP = (hashP * 26 + P[i] - 'a') mod base

Hash: A String Matching Algorithm


Author: Le Khac Minh Tue

Phn kh hn ca thut ton Hash l: Tnh m Hash ca mt on con t n


[. . ] ca xu (1 ).
hnh dung cho n gin, xt v d sau: Xt xu v biu din ca n di c s
26: (4,1,2,5,1,7,8). Chng ta cn ly m Hash ca on con t phn t th 3 n phn t th
6, ngha l cn ly m Hash ca xu (2,5,1,7). Nhn thy, ly c xu [3. .6], ch cn ly
s [1. .6] l (4,1,2,5,1,7) tr cho s ([1. .2] ni thm (0,0,0,0)) l (4,1,0,0,0,0) ta s thu c
(2,5,1,7). Tng t, ly c m Hash ca xu [3. .6], ch cn ly m Hash ca [1. .6]tr
i (m Hash ca [1. .2] nhn vi 264 ).
ci t tng ny, chng ta cn khi to 26 (0 ) v m Hash
ca tt c nhng tin t ca , c th l m Hash ca nhng xu [1. . ] (1 ).
pow[0] = 1
for (i : 1 .. m)
pow[i] = (pow[i-1] * 26) mod base
hashT[0] = 0
for (i : 1 .. m)
hashT[i] = (hashT[i-1] * 26 + T[i] - 'a') mod base

Trn on code trn, chng ta thu c mng [] (lu li 26 ) v mng


[] (lu li m Hash ca [1. . ]).
ly m Hash ca [. . ] ta vit hm sau:
function getHashT(i, j):
return (hashT[j] - hashT[i - 1] * pow[j - i + 1] + base * base) mod base

Bi ton chnh c gii quyt, v y l chng trnh chnh:


for (i : 1 .. m - n +1)
if hashP = getHashT(i, i + n - 1):
print("Match position: ", i)

c. M chng trnh:
Chng trnh sau, ti vit bng ngn ng C++, l li gii cho bi trn h
thng chm bi trc tuyn VOJ.
#include <iostream>
#include <cstdio>
#include <cstring>
#define FOR(i,a,b) for(int i=a;i<=b;i++)
#define base 1000000003LL
#define ll long long
#define maxn 1000111
using namespace std;
ll POW[maxn],hashT[maxn];

Hash: A String Matching Algorithm


Author: Le Khac Minh Tue

ll getHashT(int i,int j) {
return (hashT[j]-hashT[i-1]*POW[j-i+1]+base*base)%base;
}
int main() {
string T,P;
cin >> T >> P;
int m=T.size(),n=P.size();
T=" "+T;P=" "+P;
POW[0]=1;
FOR(i,1,m) POW[i]=(POW[i-1]*26) % base;
FOR(i,1,m) hashT[i]=(hashT[i-1]*26+T[i]-'a') % base;
ll hashP=0;
FOR(i,1,n) hashP=(hashP*26+P[i]-'a') % base;
FOR(i,1,m-n+1) if(hashP==getHashT(i,i+n-1)) printf("%d ",i);
}

d. nh gi:
phc tp ca thut ton l ( + ). Nhng iu quan trng l: chng ta c th kim
tra 2 xu c ging nhau hay khng trong (1). y l iu to nn s linh ng cho thut
ton Hash. Ngoi s linh ng v tc thc thi, chng ta c th thy ci t thut ton ny
thc s rt n gin nu so vi cc thut ton x l xu khc.
3. ng dng:
Nh cp trn, thut ton ny s c trng hp chy sai. Tt nhin, bn cnh vic
s dng Hash, cn c nhiu thut ton x l xu chui khc, mang li s chnh xc tuyt i.
Ti tm gi nhng thut ton l thut ton chun. Vic ci t thut ton chun c th
mang li mt tc chy chng trnh cao hn, chnh xc ca chng trnh ln hn. Tuy
nhin, ngi lm bi s phi tr gi l s phc tp khi ci t cc thut ton chun .
S dng Hash khng ch gip ngi lm bi d dng ci t hn m quan trng ch:
Hash c th lm c nhng vic m thut ton chun khng lm c. Sau y, ti s xt
mt vi v d chng minh iu ny.
a. Longest palindrome substring
Bi ton t ra nh sau: Bn c cho mt xu di ( 50 000). Bn cn tm
di ca xu i xng di nht gm cc k t lin tip trong . (Xu i xng l xu c t 2
chiu ging nhau).
Mt thut ton chun khng th p dng vo bi ton ny l thut ton KMP. Ngoi
KMP ra, c 2 thut ton chun c th p dng c. Thut ton th nht l s dng
thut ton Manachar tnh bn knh i xng ti tt c v tr trong xu. Thut ton th 2
l s dng Suffix Array v LCP (Longest Common Prefix) cho xu c ni bi v xu

Hash: A String Matching Algorithm


Author: Le Khac Minh Tue

vit theo th t ngc li. 2 thut ton ny u khng d dng ci t, v nm ngoi


phm vi bi vit, nn ti ch nu s qua m khng i vo chi tit.
By gi, chng ta s xt thut ton khng chun l thut ton Hash. n gin, chng
ta xt trng hp di ca xu i xng l l (trng hp chn x l hon ton tng t).
Gi s xu i xng di l di nht c di l . D thy, trong xu tn ti xu i
xng di 2, 4, Tuy nhin, xu khng tn ti xu i xng di + 2, + 4,
Nh vy, tha mn tnh cht chia nh phn. Chng ta s chia nh phn tm di ln
nht c th. Vi mi di , chng ta cn kim tra xem trong xu c tn ti mt xu con l
xu i xng di hay khng. lm vic ny, ta duyt qua tt c tt c cc xu con
di trong .
Bi ton cn li l: kim tra xem [. . ](1 ; ( + 1) 2 = 1) c phi l
xu i xng hay khng.
Cch lm nh sau. Gi l xu vit theo th t ngc li. Bng thut ton Hash, chng
ta c th kim tra c mt xu con no ca c bng mt xu con no ca hay
khng. Nh vy, chng ta cn kim tra [. . ] c bng [ + 1. . + 1] hay khng vi
l tm i xng, ni cch khc =

+
.
2

Nh vy bi ton c gii. phc tp cho

cch lm ny l ( log()).
b. k-th alphabetical cyclic
Bi ton t ra nh sau: Bn c cho mt dy 1 , 2 , , ( 50 000). Sp xp hon
v vng quanh ca dy ny theo th t t in. C th, cc hon v vng quanh ca dy ny
l (1 , 2 , , ), (2 , 3 , , , 1 ), (3 , 4 , , , 1 , 2 ),... Dy ny c th t t in nh hn
dy kia nu s u tin khc nhau ca dy ny nh hn dy kia. Yu cu bi ton l: In ra
dy c th t t in ln th .
Nu tip cn mt cch trc tip, chng ta s sinh ra tt c cc dy hon v vng quanh,
ri sau dng mt thut ton sp xp sp xp li chng theo th t t in, cui cng
ch vic in ra dy th sau khi sp xp. Tuy nhin phc tp ca thut ton ny l rt ln
v khng th p ng c yu cu v thi gian. C th, cch ny c phc tp l (2
log()), y l tch ca phc tp ca sp xp v phc tp ca mi php so snh dy.
Vn gi t tng l sp xp li tt c cc dy hon v vng quanh ri in ra dy ng v
tr th , chng ta c gng ci tin phc tp ca vic so snh th t t in ca 2 dy.
Nhc li nh ngha v th t t in ca 2 dy: Xt 2 dy v c cng s phn t. Gi
v tr th l v tr u tin t tri sang m . < < . Nh vy, ta phi tm
on tin t ging nhau di nht ca v , ri so snh k t tip theo. tm c on
tin t ging nhau di nht, ta c th s dng Hash kt hp vi chia nh phn.
5

Hash: A String Matching Algorithm


Author: Le Khac Minh Tue

gii c bi ny, cn s dng thm mt k thut nh na: Thay v sinh ra tt c cc


hon v vng quanh, chng ta ch cn nhn i dy a ln, dy mi s c 2 phn t
(1 , 2 , , , 1 , 2 , , ). Mt hon v vng quanh s l mt dy con lin tip di ca
dy nhn i ny.
c. Longest substring and appear at least k times
Bi ton t ra nh sau: Bn c cho xu di ( 50000), bn cn tm ra xu con
ca c di ln nht, v xu con ny xut hin t nht ln.
Bi ton ny c th c gii bng Suffix Array, tuy nhin cch ci t phc tp v khng
phi trng tm ca bi vit nn ti s khng nu ra y.
Tip tc bn n thut ton Hash thay th thut ton chun. Nhn xt rng, gi s
di ln nht tm c l , th vi mi , lun tn ti xu c di xut hin t nht
ln. Tuy nhin, vi mi > , khng tn ti xu c di xut hin t nht ln (do l
ln nht). Nh vy, tha mn tnh cht chia nh phn. Chng ta c th p dng thut ton
tm kim nh phn tm ra ln nht.
By gi, vi mi khi ang chia nh phn, chng ta s phi kim tra liu c tn ti xu
con no xut hin t nht ln hay khng. iu ny c lm rt n gin, bng cch sinh
mi m Hash ca cc xu con di trong . Sau sp xp li cc m Hash ny theo
chiu tng dn, ri kim tra xem c mt on lin tip cc m Hash no ging nhau di
hay khng.
Nh vy, phc tp chia nh phn l (log()), phc tp ca sp xp l
( log()), vy phc tp ca c bi ton l ( log()2 ).
4. Tng kt:
a. Thut ton:
tng thut ton Hash da trn vic i t h c s ln sang h thp phn, so snh
hai s thp phn ln bng cch so snh phn d ca chng vi mt s ln.
b. u im:
u im ca thut ton Hash l ci t rt d dng. Linh ng trong ng dng v c
th thay th cc thut ton chun hm h khc.
c. Nhc im:
Nhc im ca thut ton Hash l tnh chnh xc. Mc d rt kh sinh test c th
lm cho thut ton chy sai, nhng khng phi l khng th. V vy, nng cao tnh chnh
xc ca thut ton, ngi ta thng dng nhiu modulo khc nhau so snh m Hash (v
d nh dng 3 modulo mt lc).

Hash: A String Matching Algorithm


Author: Le Khac Minh Tue

5. Cc ngun tham kho:


http://en.wikipedia.org/wiki/String_searching_algorithm
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algo
rithm
http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
http://vn.spoj.com/problems/SUBSTR/
http://vn.spoj.com/problems/PALINY/
http://acm.sgu.ru/problem.php?contest=0&problem=426
http://vn.spoj.com/problems/DTKSUB/
http://en.wikipedia.org/wiki/Alphabetical_order

You might also like