
Chapter I: INTRODUCTION

1.1 BENEFITS OF DATA COMPRESSION


One of the main functions of a computer is to process and store data. Besides fast processing, people also care about storing as much data as possible while saving memory and reducing storage cost. In theory storage devices have no limit, but today the need to handle many files, and many kinds of data within a single file, makes files quite large.
In recent years computer networks have become widespread around the world. The arrival of networks has fulfilled the dream of conquering the distance between people. The benefits that networks provide are rich and varied across many areas of society, such as supplying and exchanging information between computers, between computers and servers, or between servers. This raises the question of how to minimise the time and cost of exchanging data over a network. In other words, besides improving the quality of network transmission equipment, we must also devise methods that make data transmission more efficient.
All of these issues gave rise to the concept of data compression. One of the earliest forms of data compression is Braille, a writing system that uses symbol coding so that blind people can read and write. Today data compression brings many benefits, typically:
+ For searching: searching compressed data is sometimes faster than searching uncompressed data, because less data is stored, so fewer search operations are needed and the information density is higher.
+ Data compression is especially effective for network transmission. Compressed data costs less to transmit, and the effective transmission speed rises because the same amount of information takes less time to send.
Given these advantages, data compression is the most sensible way to reduce cost for the user.
In short, the benefits of data compression centre on minimising the cost of buying storage devices and of moving information across a network.
* Some requirements that arise when compressing data:
+ The algorithms must first of all reduce storage cost.
+ The algorithms must run quickly and efficiently.
Different kinds of information call for different compression techniques with different effectiveness. For example, compressing text files typically saves 20% - 50%, and binary files about 50% - 90%, whereas for random files the space saved is very small or practically nil (for example *.exe files, ...). Because data differ in redundancy, applying one compression method uniformly to everything is unnecessary and inflexible.
1.2 THE CONCEPT OF DATA
In a given problem, data consist of a set of basic elements that we call atomic data (atoms). An atom may be a digit or a character, but it may also be a number or a word, depending on the problem.
1.3 LOSSLESS COMPRESSION
Lossless compression is a compression model that preserves the information throughout the compression process. This can be explained as follows: suppose the source data is D and the compressed data is D'. Decompressing D' yields D'', and D'' is identical to the original D. This technique is usually applied to data such as text, where exactness matters. In mathematical terms it is a one-to-one mapping from a set X to a set Y in which each element $X_i \in X$ corresponds to exactly one element $Y_i \in Y$.
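The round-trip property D'' = D can be checked mechanically. The sketch below uses Python's standard `zlib` module purely as a stand-in lossless codec (the text does not name one); `roundtrip` and `source` are illustrative names.

```python
import zlib

def roundtrip(data: bytes) -> bytes:
    """Compress D into D', then decompress D' into D''."""
    compressed = zlib.compress(data)
    return zlib.decompress(compressed)

source = b"abracadabra " * 20          # D: a highly redundant sample
restored = roundtrip(source)           # D''
print(restored == source)              # a lossless codec must give back D exactly
print(len(zlib.compress(source)), "bytes instead of", len(source))
```

Any codec for which this equality can fail on some input is, by the definition above, not lossless.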
1.4 LOSSY COMPRESSION
Alongside lossless compression there is also lossy compression. Lossy compression is a compression model in which preserving the data exactly is not a priority. That is, given a data set D and its compressed form D', decompression yields a set D'' that differs from the original D. This technique is commonly applied to image files, where it generally has little effect on the appearance of the image. Expressed mathematically, it has the following model:
$F(x) \to \{y_1, y_2, \ldots\}$
1.5 THE COMPRESSION AND DECOMPRESSION PROCESS
Data compression is a process with two stages:
1) The compression stage.
2) The decompression stage.
They can be illustrated as follows:
a) Compression stage:
Data -> Encoding -> Packing -> Compressed data.
b) Decompression stage:
Compressed data -> Decoding the code -> Original data.
The two stages are opposites of each other. During compression, the encoding module cuts the source text into segments and assigns each segment a symbol that identifies it. Conversely, during decompression the decoding module uses the codes produced by the encoding module to find the corresponding segments. This lookup is carried out over the many segments produced during compression and decompression. The set of these segments is called the dictionary.
Chapter II: SOURCE MODEL CONCEPTS
2.1 PROBLEM STATEMENT
Consider the transmission of a message from A to B.
Figure H1.1
From the point of view of encoding and decoding, a system that transmits a message from A to B can be classified into the following kinds of model:
2.2 Static model (static modeling)
In the static model, the encoder and the decoder, together called the codebook, are shared by the sender and the receiver of the message. This model resembles a telegraph system: two operators at the two ends share the same codebook, and if both work reliably, the message that was encoded and sent is the same as the message obtained after decoding. The drawback of this model is that when new characters or new words appear in a message, they must be added to the codebook on both sides, and of course the telegraph line itself has to be used to make the correction and update.
2.3 Semi-adaptive model (semiadaptive modeling)
In this model, before sending the message, operator A reads it in advance, notes the words that occur many times, and encodes the message according to their frequencies. The result is a codebook specific to the message being sent. Operator A sends the encoded message together with its codebook. Operator B receives the codebook and the encoded message and uses the codebook to decode and recover the original. The drawback of this model is that a codebook must be sent with every message.
2.4 Adaptive model (adaptive modeling)
In this model operator A sends the message to B one character at a time. Both operators record the words already transmitted, and they agree on a rule: any word transmitted for the second time within the same text is added to the codebook and given a position in it; from then on, whenever that word occurs again, it can be sent using the new code.
Of course the system assumes that the operators work reliably and accurately.
These models are the basis on which different compression algorithms are designed.
2.5 Models that generate messages (finite-context models)
To generate a message one normally starts from some alphabet. Knowing the alphabet alone is not enough, however: to generate, encode and compress messages efficiently we need to know more, for example the rules by which the message is formed, such as the rules for forming the strings in the message and the relations between characters and groups of characters within a string or a sentence, for instance the order in which characters and groups of characters follow one another. Such information has to be generalised and tied together in the form of a model. A model that generates messages is called a source model. The source model plays an important role because it lets us determine the probabilities of the events that occur while the message is generated, and this makes encoding and compression more efficient.
2.5.1 Order-0 finite-context model
Let the alphabet be $A = \{a_1, a_2, \ldots, a_n\}$.
The order-0 finite-context model is the simplest model: each character of the alphabet A has a fixed probability $p_i = p(a_i)$, $i = 1, 2, \ldots, n$, with $p_1 + p_2 + \cdots + p_n = 1$; the $p_i$ do not depend on the position or order of $a_i$ in the message.
Example: A = {a, b, c}; p(a) = 1/2; p(b) = 1/4; p(c) = 1/4.
2.5.2 Order -1 finite-context model
The order -1 finite-context model is the model in which every character of the alphabet A has the same fixed probability: $p(a_i) = p = 1/n$ for $i = 1, 2, \ldots, n$, with $p_1 + p_2 + \cdots + p_n = 1$; the $p_i$ do not depend on the position or order of $a_i$ in the message.
Example: A = {a, b, c, d}; p(a) = 1/4; p(b) = 1/4; p(c) = 1/4; p(d) = 1/4.
2.5.3 Order-n finite-context model (n >= 1)
The order-n finite-context model is the model in which the probability of each character of A depends on the n characters that precede it.
Models of this kind are fairly common. In Vietnamese, for example, the letter u has probability 0.99 once the letter q has appeared.
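As an illustration of an order-1 model, the following sketch estimates p(next character | previous character) from bigram counts. The function name `order1_model` and the toy text are assumptions made for this example, not part of the source.

```python
from collections import Counter, defaultdict

def order1_model(text: str) -> dict:
    """Estimate p(next | previous) from the bigram counts of `text`."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {c: n / sum(followers.values()) for c, n in followers.items()}
        for prev, followers in counts.items()
    }

model = order1_model("quan qua que quy")
print(model["q"])   # in this toy text every 'q' is followed by 'u'
```

An order-n model generalises this by keying the table on the previous n characters instead of one.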
2.5.4 Finite-state models
A finite-state machine (FSM) model is a tuple
$M = \langle A, S, P, \delta \rangle$
where A is a finite alphabet;
S is a finite set of states;
P is the table of probabilities, whose entry $p_{ik}$ is the probability that the machine moves from state $s_i$ to state $s_k$ on some symbol a of A. For each i the $p_{ik}$ satisfy $\sum_k p_{ik} = 1$;
$\delta$ is the state-transition function: $\delta(a, s_i) = s_k$ with probability $p_{ik}$.
A state machine can be pictured as a directed graph.
Example 1
Let the state machine be described by the graph in Figure H1.31.
Figure H1.31
Then the string x = abcaab is generated by the machine with probability
$p = 0.5 \times 0.3 \times 0.2 \times 0.5 \times 0.5 \times 0.2 = 0.0015$.
Example 2
Figure H1.32
The string x = aabba has probability $1/2^5$.
The state-machine model is also known as the Markov model, after the Russian mathematician who introduced it.
Note:
- The state model described above corresponds to a directed graph in which every arc carries a label and a transition probability.
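The probability of a string under a state machine is the product of the transition probabilities along its path. Since Figure H1.31 is not available, the sketch below simply multiplies the six factors quoted in Example 1, using exact fractions to avoid rounding; `path_probability` is an illustrative name.

```python
from fractions import Fraction
from math import prod

def path_probability(step_probs):
    """Multiply the transition probabilities met along one path."""
    return prod(step_probs, start=Fraction(1))

# Factors for x = abcaab as quoted in Example 1 (0.5, 0.3, 0.2, 0.5, 0.5, 0.2).
steps = [Fraction(n, 10) for n in (5, 3, 2, 5, 5, 2)]
p = path_probability(steps)
print(p, float(p))   # 3/2000, i.e. 0.0015
```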
2.5.5 Grammar models
The state-machine model has some limitations, for example when it must describe strings with deep nesting such as the string (a+b*(c+d)-e). It has been shown that a finite-state model needs n+2 states to describe an arithmetic expression with n pairs of parentheses, so the number of states becomes very large when the number of pairs is large. This is the weakness of the finite-state model, and it is why the finite-state model cannot be used to represent natural language. To overcome this weakness the grammar model was introduced.
A grammar model has the form $G = \langle N, T, P, k \rangle$, where
N is a finite set of variables (nonterminals);
k is an element of N, called the start symbol;
T is a finite set of letters (terminals);
P is a set of production rules of the form $\alpha \to \beta$ with $\alpha, \beta \in (N \cup T)^+$.
We call $L(G) = \{ w \mid k \Rightarrow^{*} w,\ w \in T^+ \}$ the language generated by the grammar G. The notation $k \Rightarrow^{*} w$ means that, starting from k, w is obtained after applying rules of P some number of times; the notation $w \in T^+$ means that w is a string over the alphabet T.
Example
Figure H1.12
The probability that the grammar G of this example generates the string aaaa is
$3/4 \times 1/2 \times 1/2 \times 1/2 = 3/32$.
2.5.6 State model: introduced by A. A. Markov (1856-1922).
* Definition:
A state model is a directed graph in which every edge has a label and a weight, the weight being the probability of making that transition in that direction; the probabilities of the transitions leaving any vertex of the graph always sum to 1.
Example:
[Figure: state diagram with labelled, weighted edges; edge labels as extracted: a 0.5, a 0.2, a 0.2, b 0.3, d 0.3, c 0.5, a 0.2, c 0.8]
With the model above we can obtain different text strings; for any text string, the information stream generated by the model is some walk in the graph.
A state model is called deterministic if there is a sufficiently large integer B such that every sequence of edge labels generated by the model with length greater than B determines a unique sequence of vertices and edges that it passes through. A sufficiently long information stream therefore corresponds to a particular walk along the vertices and edges.
The entropy of the information stream is then defined through the entropy of that walk.
What, then, is entropy? We now turn to this concept.
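Definition of entropy:
Entropy is the average difficulty of guessing one piece of information emitted in a state of the source model.
Formula for entropy:
$H = -\sum_{i=1}^{n} P_i \log_2 P_i$
How to compute it:
1. Determine the number of states.
2. Find the matrix of state-transition probabilities.
3. Find the eigenvector (it gives the probability of being in each state).
4. Compute the entropy of each state.
5. Obtain the entropy of the source by summing, over the states, the probability of the state times the state's own entropy.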
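Example: suppose we have the following model:
[Figure: two-state diagram; edge weights as extracted: a = 30, b = 7, c = 12, c = 26, c = 26, d = 7]
Step 1: Number of states = 2.
Step 2: Transition matrix.
\[ a_{11} = \frac{30+7+12}{(30+7+12)+(19+7)} = \frac{49}{75}, \qquad a_{21} = \frac{19+7}{(30+7+12)+(19+7)} = \frac{26}{75}, \]
\[ a_{12} = \frac{26}{26} = 1, \qquad a_{22} = 0, \]
\[ P = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} 49/75 & 1 \\ 26/75 & 0 \end{pmatrix}. \]
Step 3: The eigenvalues are the roots of the equation
\[ \det \begin{pmatrix} 49/75 - \lambda & 1 \\ 26/75 & -\lambda \end{pmatrix} = 0. \]
Expanding gives
\[ \lambda^2 - \frac{49}{75}\lambda - \frac{26}{75} = 0, \]
whose roots are $\lambda_1 = 1$ and $\lambda_2 = -\frac{26}{75}$.
The eigenvector satisfies
\[ \begin{pmatrix} 49/75 - \lambda & 1 \\ 26/75 & -\lambda \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \]
where $X_1$ and $X_2$ are the probabilities of being in states 1 and 2 and so must be $\geq 0$; we therefore choose $\lambda = \lambda_1 = 1$.
Solving this system with $\lambda = 1$ and $X_1 + X_2 = 1$ gives
\[ X_1 = \frac{75}{101}, \qquad X_2 = \frac{26}{101}. \]
Step 4: Entropy of each state.
- State 1:
\[ E_1 = \frac{30}{75}\log_2\frac{75}{30} + \frac{7}{75}\log_2\frac{75}{7} + \frac{12}{75}\log_2\frac{75}{12} + \frac{26}{75}\log_2\frac{75}{26} = 1.80096, \]
where $75 = 30 + 7 + 12 + 26$.
- State 2:
\[ E_2 = \frac{19}{19+7}\log_2\frac{19+7}{19} + \frac{7}{19+7}\log_2\frac{19+7}{7} = 0.84036. \]
Step 5:
\[ E = X_1 E_1 + X_2 E_2 = \frac{75}{101} \times 1.80096 + \frac{26}{101} \times 0.84036 = 1.55368. \]
Conclusion: the entropy of the source is 1.55368.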
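The five steps of the worked example can be checked numerically. The sketch below takes the edge weights and stationary probabilities from the example; `state_entropy` and the variable names are illustrative, and the eigenvector is written in closed form for the two-state case rather than computed by a general solver.

```python
from math import log2

def state_entropy(weights):
    """Entropy (in bits) of one state, from the weights of its outgoing edges."""
    total = sum(weights)
    return sum(w / total * log2(total / w) for w in weights)

# Edge weights leaving each state, as used in the worked example.
out_state1 = [30, 7, 12, 26]   # 49 of 75 stay in state 1, 26 of 75 go to state 2
out_state2 = [19, 7]           # every edge returns to state 1

# Stationary probabilities for eigenvalue 1: X2 = (26/75) * X1 and X1 + X2 = 1.
x1 = 75 / 101
x2 = 26 / 101

e1 = state_entropy(out_state1)
e2 = state_entropy(out_state2)
source_entropy = x1 * e1 + x2 * e2
print(round(e1, 5), round(e2, 5), round(source_entropy, 5))
# 1.80096 0.84036 1.55368
```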
Chapter III:
BASIC METHODS AND TECHNIQUES OF DATA COMPRESSION
3.1. DEFINITION OF DATA COMPRESSION
Data compression is essentially a form of encoding that records data in a form that takes less memory while still allowing the original data to be recovered.
3.2. SOME TYPES OF CODE
3.2.1 Symbol codes:
Definition: a system of agreed codes used to recognise a series of different events is called a symbol code. Symbol coding is one form of data compression.
Symbol codes appear around us every day.
Example: official government documents often use abbreviations such as:
CV - công văn (official letter); QĐ - quyết định (decision)
TTCP - Thủ tướng Chính phủ (Prime Minister) ...
In developed countries, standardised symbol systems allow symbols to be used easily and economically in fields such as telegraphy, telex and e-mail. Across the areas where symbol codes are applied, they convey information more fully and are easy to remember, so the amount of information carried by a symbol code is very large.
Some characteristics of symbol codes:
+ They save memory and time.
+ The primitiveness of the code.
+ The relativity of the code system.
Thanks to these characteristics, symbol codes are widely used in areas such as data management, population management and the management of trade in goods.
* The primitiveness of the code is understood as follows:
One of the fields that uses symbol codes most is data administration. Data are attributes of the managed objects and fall into two kinds:
* Managing things:
Things are the subjects that we call entities.
* Managing events:
Events are activities; the records of the interactions between things are called forms.
The symbol code itself serves as the address used to access the data store. At the same time a symbol code expresses only an attribute that uniquely identifies a given object. The primitiveness of a symbol code is thus the way in which things or events are recorded.
+ Relativity:
To look up a piece of information in a database we rely on keys that characterise the information, but such a search cannot prove that the database contains only one item with that key; other items may have the same key. This is the relativity of symbol codes, and it gives rise to a number of problems.
Example:
- The standard Vietnamese encodings on PCs in our country: because code designers overused the relativity of symbol codes when creating a standard alphabet for computers, Vietnam ended up with three different Vietnamese encodings for three regions:
the North - ABC; the Central region - Vietware; the South - VNI.
- The telegraph code invented by Morse. This system has only two symbols, dot and dash, and the dots and dashes are separated by gaps.
3.2.2 Packing codes
Packing is part of data compression: every coding method includes a packing step. Data are normally written as bytes of 8 bits (0/1), so packing here means the way bytes are formed from a sequence of 0/1 bits. Most of the algorithms we use to simulate data compression output a sequence of 0/1 bits, which is then packed into bytes.
Example: packing the letters "abcdu" into a sequence of bytes.
a b c d u
100 10001 01 001 0010111010011010101
Among the packing techniques available, the most common is quantitative (fixed-size) packing.
+ The concept of quantitative packing:
Quantitative packing means fixing in advance the number of bytes used for each item written to the compressed file.
Characteristics of quantitative packing:
+ The compression method is inflexible.
+ Compression efficiency is not high.
+ Running time is fast.
Example:
Consider the output of the LZ78 algorithm for the string "aaabbabaabaaabab":
(0 + a) (1 + a) (0 + b) (3 + a) (4 + a) (5 + a) (4 + b).
If every item is written as one byte, the compressed file can end up larger than the file before compression, and decompression also becomes harder. Packing is therefore an important step in data compression.
In the quantitative method we must adopt certain conventions so that the packed items take as little memory as possible. In the example above, packing the numbers into 1 - 2 bytes each alongside the characters already saves a good deal of space, so compression efficiency improves. Under this convention the output is: 0a1a0b3a4a5a4b.
Besides quantitative packing there is also automatic packing.
* Definition:
Automatic packing means that the program itself forms the bytes according to some given criteria. Automatic packing is used extensively in Huffman compression.
Some characteristics of automatic packing:
+ The compression method becomes flexible.
+ Compression efficiency is high.
+ Running time is relatively slow.
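The packing step itself, forming bytes from a sequence of 0/1 bits, can be sketched as follows. The zero-padding of the last byte and the need to remember the true bit count are assumptions of this example; the text does not specify how the final partial byte is handled.

```python
def pack_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, zero-padding the last byte."""
    padding = (-len(bits)) % 8
    padded = bits + "0" * padding
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

def unpack_bits(data: bytes, bit_count: int) -> str:
    """Recover the first `bit_count` bits that were packed into `data`."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return bits[:bit_count]

bits = "1001000101001"                 # 13 bits of coder output
packed = pack_bits(bits)               # two bytes: 0x91 0x48
print(packed.hex(), unpack_bits(packed, len(bits)) == bits)
```

The bit count has to travel with the packed data (or the code must mark its own end), otherwise the padding bits would be decoded as if they were data.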
3.2.3 Approximate codes
Approximate codes are widely used for images and sound. One of their most common applications is the mobile telephone. A mobile telephone transmits information after compressing it: it does not transmit the sound of the voice itself but rather "the configuration of a mouth", from which the voice is reproduced for the listener. This method can compress a speech signal from 8000 bytes/second down to 50 bytes/second. The same technique is also applied to video telephones.
3.2.4 Run-length encoding (RLE)
Among the methods used to compress data, the simplest is run-length encoding.
+ Coding principle:
The basic principle of this method is to detect a character that is repeated consecutively more times than some fixed threshold. In that case the run is replaced by 3 characters:
The first character is a special character announcing that a special sequence follows.
The second character gives the repeat count.
The third character is the repeated character.
The idea of the method is thus to replace a run with a shorter sequence whenever it reaches a given threshold; the threshold is usually 4.
+ Example:
Source sequence: ..SSOOOOONNNLLLLLLAAANNN..
Sequence obtained after compression (CS denotes the special character):
..SS CS 5 O NNN CS 6 L AAANNN..
This method has several weaknesses, however. First, the special character must not occur in the source data. If it does occur as data, decompression becomes ambiguous.
Attention must also be paid to the repeat count. If the count is stored in 1 byte, the number of repetitions of a character cannot exceed 255.
When a character is repeated more than 255 times, the count has to be stored in 2 bytes. In that case the maximum count rises to 65535. In practice, however, the number of repetitions of a character is usually at most 255, so a 2-byte count wastes 1 byte.
From these characteristics we see that run-length encoding brings little benefit when compressing text files.
For image files, however, this method is very effective, because a black-and-white image is a sequence of alternating runs of 0s (black pixels) and 1s (white pixels). Counting the consecutive 0s and 1s is then fairly easy, and the encoding needs neither a special character nor an indication of which character is repeated: it is enough to record the lengths of the alternating runs.
+ Example:
Source: 0000000000 11111111 00000 111111
=> encoded as: 10 8 5 6
If, unlike in the example above, the first pixel of the image is not black, the code sequence must start with a count of 0.
+ Example: Source: ... 00..0 (280 times) 11111100 ...
Compressed:
... 255 0 25 6 2 ...
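The run counting for black-and-white images can be sketched as below. The convention that the code starts with a run of 0s and that a run longer than 255 is split by inserting a zero-length run of the other colour follows the examples above; the function names are illustrative.

```python
def encode_runs(pixels: str, max_run: int = 255) -> list:
    """Encode a string of '0'/'1' pixels as alternating run lengths, starting with 0s."""
    runs = []
    expected = "0"            # the code always begins with a run of black (0) pixels
    count = 0
    for p in pixels:
        if p == expected and count < max_run:
            count += 1
        elif p == expected:   # run too long for one counter: emit it plus an empty run
            runs.extend([count, 0])
            count = 1
        else:                 # colour change: close the current run
            runs.append(count)
            expected = p
            count = 1
    runs.append(count)
    return runs

def decode_runs(runs) -> str:
    """Rebuild the pixel string from alternating run lengths."""
    out, color = [], "0"
    for n in runs:
        out.append(color * n)
        color = "1" if color == "0" else "0"
    return "".join(out)

print(encode_runs("0" * 10 + "1" * 8 + "0" * 5 + "1" * 6))   # [10, 8, 5, 6]
print(encode_runs("0" * 280 + "111111" + "00"))              # [255, 0, 25, 6, 2]
```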
For colour images each colour is represented by an integer. To mark where colours repeat, a special character must be inserted, followed by the repeat count and the repeated value. In other words, a colour image can be encoded with pairs whose first element is the repeat count and whose second element is a colour, with a special character (for example #) signalling the presence of such a pair.
+ Example:
Source: 11111122223344444444
Compressed: #61 #42 #23 #84
A colour that occurs only once (a value whose repeat count is 1) does not need to be accompanied by the special character.
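+ Run-length encoding algorithm:
* Compression:
  Source file: f
  Special character: db
  Repeat count: dem
  db := 255;
  dem := 1;
  Read the first character of f into ktlap;
  while not eof(f) do
  begin
    Read the next character into kt;
    if (kt = ktlap) and (dem < 255) then
      dem := dem + 1
    else
    begin
      { the current run has ended: write it out }
      if dem > 3 then
      begin
        Write db;
        Write dem;
        Write ktlap;
      end
      else
        for i := 1 to dem do
          Write ktlap;
      dem := 1;
      ktlap := kt;
    end;
  end;
  { after the loop, write out the last run in the same way }
* Decompression:
  Compressed file: f
  Special character: db
  Repeat count: dem
  db := 255;
  while not eof(f) do
  begin
    Read a character of f into kt;
    if kt = db then
    begin
      Read the next character into dem;
      Read the next character into ktlap;
      for i := 1 to dem do
        Write ktlap;
    end
    else
      Write kt;   { an ordinary character is copied unchanged }
  end;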
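A runnable counterpart of the pseudocode might look like the sketch below. The constants mirror `db = 255` and the test `dem > 3`. One deliberate departure, which is an assumption of this example and not part of the source algorithm: a data byte equal to the marker is always written as a triple, so that decompression stays unambiguous even when the marker occurs in the data.

```python
ESCAPE = 255      # marker byte (db in the pseudocode)
THRESHOLD = 3     # runs longer than this are replaced by a triple

def rle_compress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # measure the run starting at i, capped at what one count byte can hold
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run > THRESHOLD or data[i] == ESCAPE:
            out += bytes([ESCAPE, run, data[i]])     # marker, count, repeated byte
        else:
            out += data[i:i + run]                   # short run: copy unchanged
        i += run
    return bytes(out)

def rle_decompress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESCAPE:
            out += bytes([data[i + 2]]) * data[i + 1]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

sample = b"SSOOOOONNNLLLLLLAAANNN"
packed = rle_compress(sample)
print(packed, rle_decompress(packed) == sample)
```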
This method treats the special character as one that never occurs in the source data. That assumption is not sound in general, because a source file may contain every character of the ASCII table. For this reason the method is used mostly for compressing images.
3.2.5 Adaptive frequency encoding (AFE)
This method adapts the code of each character progressively according to its frequency of occurrence.
If a character is coded on 5 bits at the start of compression, it may later be coded on 1 bit or on 9 bits, depending on whether it occurs often or rarely.
In this method compression and decompression must proceed in parallel. The two modules share the same initial code table for each byte transmitted; the codes then change with the frequency of the characters, following a common modification rule.
To this end a code is assigned to every byte and stored in a reference table, or dictionary. This table is the basis for coding each character and is present in both the compression module and the decompression module.
The code of each byte has 2 parts: a header and a body. The header usually has 3 bits and the body has 1 - 8 bits, depending on the byte being coded and above all on how often it occurs. The header gives the length of the body. When decoding there is no need to process the stream bit by bit to separate the transmitted bytes, as in the Huffman method. To decode, it is enough to read the first 3 bits to determine the length n of the body and then read the next n bits to recover the code that was sent.
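The header/body framing can be illustrated with a small decoder. The mapping from the 3-bit header to the body length (header value h means a body of h + 1 bits, covering 1 to 8) is an assumption made for this sketch; the text only states the field widths. The adaptive update of the code table is not shown.

```python
def split_afe_codes(bits: str) -> list:
    """Split a bit string into code bodies, using a 3-bit length header before each body."""
    bodies = []
    pos = 0
    while pos + 3 <= len(bits):
        n = int(bits[pos:pos + 3], 2) + 1    # assumed mapping: 000..111 -> length 1..8
        pos += 3
        body = bits[pos:pos + n]
        if len(body) < n:                    # incomplete trailing code: stop
            break
        bodies.append(body)
        pos += n
    return bodies

stream = "000" + "1" + "010" + "101" + "111" + "11110000"
print(split_afe_codes(stream))   # ['1', '101', '11110000']
```

Each body would then be looked up in the shared code table to recover the byte it stands for.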
3.3. COMPRESSION MODELS
A good compression algorithm can be obtained when the model that generates the source is known. Finding the exact model that generated the data, however, is not possible. Is there then a way to compress without knowing the source model? There are two approaches:
- Static model: a model obtained by studying the statistical characteristics of texts and then applying them.
- Adaptive model: a model that starts from some general model and is then adjusted step by step.
- Characteristics of these models:
+ The model relies on statistics from a very large number of texts of the same kind and applies them to a new text. The advantage of this approach is that the compressor runs fast.
+ Based on the assumed compression model, the upcoming values are estimated and compression is carried out.
3.3.1. Compression with a source model
A feature of this kind of compression is that the compressor knows the source that generates the information. The characteristic algorithms for compression with a source model are:
+ The Fano - Shannon algorithm
+ The Huffman algorithm.
3.3.2. Compression without a source model
A typical example of compression without a source model is natural language. We constantly find ourselves with a text but no source model. To find the best compression algorithm we must discover the regularities of the text. Research suggests two approaches:
a. The global approach:
This approach is also called the statistical method. It rests on a subtle observation: what happened in the past will happen again in the future. This is also accepted in everyday life, as reflected in proverbs handed down over generations such as: "A night in the fifth month is over before you lie down; a day in the tenth month is dark before you have laughed", ...
To obtain a model of the information stream we must therefore impose some conditions on it. Here we consider only sources that are random, take finitely many values, and in which the occurrence of a piece of information depends on the values that precede it.
This approach has two variants:
+ The static variant.
+ The dynamic variant.
The static variant encodes in 2 steps:
1. Find a suitable model from statistics.
2. Compress as in the case where a model is available.
The dynamic variant:
The algorithms of the dynamic variant are based on the following intuition. Suppose we compress data without knowing which character will come next. If we knew the probability of each possible next character, we would in effect know the compression model and would have a compression algorithm. The problem, however, is how to predict which character will occur, and how often, when we can rely only on the characters that have already occurred.
Consider the following example:
Suppose we have a text consisting of "aabbababab... bbaaba". As long as the text has not ended we cannot be sure that it contains only the 2 characters "a" and "b", because some other character such as "s" or "t" may still appear. The dynamic variant must therefore always keep a fallback. What, then, is the probability of a character that has not yet occurred? We can say that "the probability of a character that has not yet occurred is very small, but must still be greater than 0".
This key point is called the zero-frequency problem. It is still being studied and has no proven solution so far.
- Dynamic compression algorithm
1. Gather statistics on the characters seen so far in the text to build a new model.
2. Compress using the model just built in step one.
3. Repeat steps one and two until the end of the text.
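One simple way to keep every probability above zero is add-one (Laplace) estimation, shown in the sketch below. This is a swapped-in illustration of the zero-frequency issue, not a method prescribed by the text; the class name `AdaptiveModel` and the alphabet size are assumptions.

```python
from collections import Counter

class AdaptiveModel:
    """Order-0 adaptive model with add-one estimates, so unseen symbols keep p > 0."""

    def __init__(self, alphabet_size: int = 256):
        self.alphabet_size = alphabet_size
        self.counts = Counter()
        self.total = 0

    def probability(self, symbol) -> float:
        # every symbol starts with a virtual count of 1
        return (self.counts[symbol] + 1) / (self.total + self.alphabet_size)

    def update(self, symbol) -> None:
        # step 1 of the dynamic algorithm: refresh the statistics after each symbol
        self.counts[symbol] += 1
        self.total += 1

model = AdaptiveModel(alphabet_size=4)      # toy alphabet {a, b, s, t}
for ch in "aabbab":
    p = model.probability(ch)               # step 2 would code ch with probability p
    model.update(ch)
print(model.probability("a"), model.probability("s"))   # 0.4 0.1
```

Encoder and decoder can run the same updates in lockstep, so no model has to be transmitted.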
In short, global (statistical) techniques are rarely applied in practice, because the algorithms run slowly: programs written this way must perform a large number of computations, and the compression achieved is not high.
b. The bottom-up approach:
This approach rests on the observation that the present moment resembles some moment in the past. Put differently, it relies on definite regularities. These regularities are collected into a dictionary, so the approach can be called the dictionary method; it is examined more closely in the next part.
Chapter IV: DICTIONARY CODING METHODS
4.1. Static dictionary coding (static dictionary encoders)
A static coding dictionary, or static dictionary for short, is so called because it is built independently of any particular text. Compression based on a static dictionary is generally not very effective, but it has the advantage of simplicity: the dictionary does not have to be transmitted along with the text.
A typical example of a static dictionary is the (extended, 8-bit) ASCII code table. It codes 256 characters, each with a code of the same length, 8 bits or one byte.
A static dictionary can be improved in several ways. Two adjacent characters can be combined under a single code, called a digram. The encoding and decoding algorithms are clearly very simple and fast.
For example, the ASCII table can be divided into two parts: 96 characters (94 printable characters, space and newline), while the remaining 160 codes are used for digrams, that is, pairs of characters. We might, for instance, code the digrams {*a, *t, *s, th, he, ...}.
In general, if q characters are considered important or frequently used, 256 - q digrams are needed to fill the dictionary. There are two ways of choosing them. The first is to examine representative texts and determine the 256 - q most common digrams. The second is to fix two character sets S1 and S2 and build a digram for every pair formed from these two sets. The dictionary is then full if $|S_1| \times |S_2| = 256 - q$.
Figure H4.1
Figure H4.1 illustrates this idea; the pointers that refer to the other tables use 4 bits.
The digram idea can be extended in a static dictionary by using n-grams, which are simply fragments of n consecutive characters. The difficulty with a static n-gram scheme is that the choice of the longest phrases to put in the dictionary depends on the particular text.
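A static digram dictionary can be sketched as a single 8-bit codebook holding the single characters first and the digrams after them. The character set and the digram list below are toy choices for illustration, not the 96/160 split of the text.

```python
def build_codebook(singles: str, digrams: list) -> dict:
    """Assign one 8-bit code to each single character, then to each digram."""
    if len(singles) + len(digrams) > 256:
        raise ValueError("an 8-bit code has only 256 entries")
    entries = list(singles) + list(digrams)
    return {entry: code for code, entry in enumerate(entries)}

def digram_encode(text: str, codebook: dict) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in codebook:   # prefer a digram when one matches
            out.append(codebook[pair])
            i += 2
        else:
            out.append(codebook[text[i]])
            i += 1
    return bytes(out)

def digram_decode(data: bytes, codebook: dict) -> str:
    inverse = {code: entry for entry, code in codebook.items()}
    return "".join(inverse[b] for b in data)

book = build_codebook("abcdefghijklmnopqrstuvwxyz ", ["th", "he", " t"])
coded = digram_encode("the theme", book)
print(len(coded), digram_decode(coded, book))   # 6 codes for 9 characters
```

Because both sides hold the same fixed codebook, nothing besides the coded bytes has to be sent.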
4.2. Semi-adaptive dictionary coding (semiadaptive dictionary encoders)
The n-gram idea of static dictionaries leads naturally to building a dictionary that holds codes for fragments of the text itself, one that captures the idiosyncratic style of the given text. This situation is very common in scientific and technical documents. Compression with a semi-adaptive dictionary is often very effective; its drawback is that the dictionary has to be sent along with the coded text.
Note that determining an optimal dictionary for a given text is an NP-complete problem. There are, however, many heuristic algorithms that find near-optimal solutions.
Suppose the maximum number of entries in the dictionary M is fixed in advance as |M|, and each entry of M is coded with log |M| bits. The coding is then optimal when the phrases in the dictionary are equally frequent (equifrequent). The core problem is therefore an algorithm that selects phrases so that their frequencies are about the same.
Consider the following algorithm:
First compute the frequency of each single character, then the frequencies of digrams, trigrams and so on. Initialise the dictionary with all the single characters. Find the digram with the highest frequency, add it to the dictionary, and reduce the frequencies of the characters of that pair. Then find the digram with the second highest frequency and add it at the next position in the dictionary. Continue until the highest trigram frequency exceeds the frequencies of the remaining digrams. That trigram is then added to the dictionary, and the frequencies of its two digrams and three single characters are reduced accordingly. The process is repeated until the dictionary is full. Along the way some entries may be removed from the dictionary if their frequency drops too low.
For example, *of*th has a high frequency and is added to the dictionary; but if *of*the is added as well, the earlier frequency of *of*th drops, because *of*th mostly occurs before e.
Note that the choice of codes for the dictionary entries has to balance compression ratio against coding speed.
4.3. Adaptive dictionary coding (adaptive dictionary encoders)
Adaptive dictionary coding is also known as Ziv-Lempel coding. In 1967 a paper describing a technique for building semi-adaptive dictionaries was published. In 1977 Jacob Ziv and Abraham Lempel developed this idea into adaptive dictionary coding. Their algorithm has been improved many times and has become the Ziv-Lempel family of algorithms, LZ compression for short. The core idea of the LZ family is that a string is replaced by a pointer to a place where that string has occurred before. A pointer has the form (m, l): it points to position m of the input x, and l is the length of the string in the input, that is, the string x[m .. m+l-1].
The main advantages of the Ziv-Lempel family are a high compression ratio and fast decompression.
Example:
The pointer (7,2) denotes the characters at positions 7 and 8 of the input.
The string abbaabbbabab is then coded as abba(1,3)(3,2)(8,3), as shown in Figure H1.8.
Figure H4.3
Note:
The encoding is recursive (a pointer may refer to text that the pointer itself is producing), yet decoding is unambiguous.
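Decoding such a pointer stream can be sketched directly from the definition of (m, l). Positions are 1-based as in the text; the token representation (plain strings for literals, tuples for pointers) is an assumption of this example.

```python
def decode_pointers(tokens) -> str:
    """Expand literals and (m, l) pointers; m is a 1-based position in the output so far."""
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.extend(token)                 # literal characters
        else:
            m, l = token
            for k in range(l):
                # copy one character at a time so a pointer may overlap its own output
                out.append(out[m - 1 + k])
    return "".join(out)

print(decode_pointers(["abba", (1, 3), (3, 2), (8, 3)]))   # abbaabbbabab
```

The last pointer, (8,3), reads position 10 while it is still being written, which is the recursive case mentioned in the note.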
There are many algorithms in the LZ family. They differ mainly in two respects: how far back a pointer may reach, and which substrings within that range a pointer may refer to.
We now look at a few algorithms of the LZ family.
4.3.1 LZ77
LZ77 is the earliest published algorithm of the LZ family (1977).
In this scheme, pointers denote phrases inside a window of fixed size N that precedes the coding position. There is a maximum length F for the substrings that can be replaced by a pointer. LZ77 uses a sliding window. The idea is as follows:
A window of size N (typically N is at most 8192) slides over the input S, advancing by up to F positions at each step, where F is the size of the lookahead buffer, typically F = 10 or 20.
Initially we imagine that the N-F characters of the window have already been coded, but at the first step this part is empty and the first F characters of S are in the lookahead buffer.
During coding, pointers of the form (i, j, a) are used, where i is the relative address (offset) of the longest matching substring, measured from the lookahead buffer, j is the length of that longest match, and a is the first character that does not belong to the match. Attaching the character a to the pointer ensures that coding can proceed even when no substring starting with a is found for the lookahead buffer.
Figure H4.3.1a
Notes:
o The memory needed by the algorithm equals the size of the window.
o The offset i can be coded with log(N-F) bits.
o The length j can be coded with log F bits.
o The time for each coding step is bounded by (N-F)*F comparisons, which is a constant, so the running time of the algorithm is O(n), where n is the length of the input S.
o Decoding is very fast. It uses the same window as coding, except that instead of searching for a string in the window it copies substrings from the window according to the pointers produced by the encoder.
o The dictionary M changes at every coding step: it consists of the set of strings in the window whose length is less than the length of the lookahead buffer. The number of strings can be computed from the alphabet and the number of strings of length at most F over that alphabet, subject to certain restrictions. The code for each string can be produced in two ways. It can be an index into the dictionary, which may select among (N-F)*F*q phrases. Alternatively it can be a pointer with 3 components that identify one of N-F positions in the window, a string length from 0 to F-1, and a character of the alphabet.
Example:
Take the following parameters:
N = 11 is the window size; F = 4 is the size of the lookahead buffer.
Input S = abcabcbbababcabc.
Figure H4.3.1b illustrates how the window slides and where the lookahead buffer lies for the input S.
Figure H4.3.1b
[Figure: flowchart of the algorithm]
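The scheme can be sketched with a brute-force search over the N-F already-coded characters. Several details are assumptions of this example rather than statements of the source: the offset i is taken as the distance back from the current position, matches are limited to F-1 characters so that the literal a always fits, and the first longest match found is used.

```python
def lz77_encode(s: str, window: int = 11, lookahead: int = 4):
    """Encode s as (offset, length, next_char) triples."""
    search_size = window - lookahead              # the N - F already-coded characters
    pos, out = 0, []
    while pos < len(s):
        start = max(0, pos - search_size)
        best_len, best_off = 0, 0
        max_len = min(lookahead - 1, len(s) - pos - 1)   # leave room for the literal
        for cand in range(start, pos):
            length = 0
            while length < max_len and s[cand + length] == s[pos + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        out.append((best_off, best_len, s[pos + best_len]))
        pos += best_len + 1
    return out

def lz77_decode(triples) -> str:
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])     # copy from `off` characters back (may overlap)
        out.append(ch)
    return "".join(out)

S = "abcabcbbababcabc"
triples = lz77_encode(S)              # N = 11, F = 4 as in the example
print(triples[:4])                    # [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (3, 3, 'b')]
print(lz77_decode(triples) == S)
```

The inner double loop is the (N-F)*F comparison bound mentioned in the notes; practical encoders replace it with hashing or a tree.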
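4.3.2 LZR
LZR is like LZ77, except that a pointer may denote any position in the part of the input S that has already been coded. In other words, the parameter N equals the size of the input S. The components i and j of a pointer (i, j, a) are then arbitrarily large integers.
The limitation of LZR is that the size of the dictionary is unbounded. The memory needed for coding grows as well, since it must hold at least the whole input S. The search time can be of the order of $O(n^2)$.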
4.3.3 LZ78
LZ78 is an adaptive dictionary compression algorithm that is important from both the theoretical and the practical point of view. Instead of using pointers to places where a string occurred earlier, the text is parsed into phrases, each of which is the longest phrase seen before plus one extra character. Each phrase is coded as the index of its prefix and one character. The new phrase is then added to the list of phrases available for matching.
Example:
The input aaabbabaabaaabab is parsed into 7 phrases, as shown in Figure H4.3.4.
Figure H4.3.4
Each phrase is coded as a previously seen phrase plus one character. For example the last phrase, the three characters bab, is coded as phrase 4 followed by the character b. Phrase 0 is the empty string.
Note that LZ78 places no limit on how far back it can refer; that is, there is no frame or window restricting what a pointer can point to. This means that a very large number of phrases may be stored during coding. When p phrases have been parsed, a pointer is coded with $\lceil \log p \rceil$ bits. In practice the dictionary cannot keep growing, because memory may run out; this is not a problem, since the dictionary can be cleared and coding continues as if a new input had begun.
In practice LZ78 can be implemented with a trie, which is a multiway tree; searching for a phrase amounts to inserting a node into the tree. Each node is associated with one phrase and is the end of the path from the root that spells that phrase. Each node is numbered with the index of the phrase it represents. Inserting a new phrase produces a phrase that is longer than an earlier one, so every input character coded corresponds to traversing one arc of the tree. Figure H4.3.4 shows the data structure produced when the input is parsed.
Figure H4.3.3
In theory the compression ratio of LZ78 approaches the optimum as the size of the input grows. For LZ78 the dictionary M is the set of all phrases parsed so far.
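The parsing rule can be sketched with a plain Python dictionary standing in for the trie. Function names are illustrative; the handling of an input that ends in the middle of a known phrase is an assumption of this example.

```python
def lz78_encode(s: str):
    """Parse s into (prefix_index, char) pairs; phrase 0 is the empty string."""
    dictionary = {"": 0}
    out = []
    phrase = ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch                                  # keep extending the match
        else:
            out.append((dictionary[phrase], ch))          # longest known phrase + one char
            dictionary[phrase + ch] = len(dictionary)     # new phrase gets the next index
            phrase = ""
    if phrase:                                            # input ended inside a known phrase
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decode(pairs) -> str:
    phrases = [""]
    for index, ch in pairs:
        phrases.append(phrases[index] + ch)
    return "".join(phrases[1:])

pairs = lz78_encode("aaabbabaabaaabab")
print(pairs)
# [(0, 'a'), (1, 'a'), (0, 'b'), (3, 'a'), (4, 'a'), (5, 'a'), (4, 'b')]
print(lz78_decode(pairs))
```

The seven pairs correspond to the phrases a, aa, b, ba, baa, baaa, bab; the last one is phrase 4 (ba) plus b, as stated in the text.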
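4.3.4 LZH
LZH is an algorithm that combines the Ziv-Lempel and Huffman compression techniques; coding is carried out in two phases.
4.3.5 LZFG
This is the technique used most in practice. The algorithm offers fast coding and decoding and a good compression ratio. In concept LZFG uses the LZ78 technique, with pointers that may start only at the boundary of a previously parsed phrase. This means that one phrase is added to the dictionary for each phrase coded. It differs from LZ78 in that a pointer contains a length component that is not restricted and gives the number of characters of the phrase to copy. The coded characters are kept in a window (as in LZ77), and phrases that fall outside the window are deleted from the trie.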
Chapter V:
INTRODUCTION TO SOME DATA COMPRESSION ALGORITHMS
5.1. COMPRESSION ALGORITHMS WITH A SOURCE MODEL:
+ The Fano - Shannon algorithm.
+ The Huffman algorithm, dynamic Huffman.
5.2. ALGORITHMS BASED ON THE DICTIONARY TECHNIQUE:
5.1.1 - The Fano - Shannon algorithm
The Fano - Shannon algorithm is essentially a coding method based on statistics of how often each symbol occurs in the data. It builds a Fano - Shannon tree in the following steps:
a. Build the table of occurrence frequencies of each symbol in the source file.
b. Sort the table in decreasing order of frequency, so that the most frequent symbol is at the top of the table and the least frequent at the bottom.
c. Split the table into 2 parts so that the total frequency of the upper part and the total frequency of the lower part are as close as possible.
d. Assign 0 to the upper part and 1 to the lower part. That is, the codes of the symbols in the upper part start with bit 0 and the codes of the symbols in the lower part start with bit 1.
e. Split each part into 2 again in the same way. After each split a 0 or a 1 is appended to the codes.
The process continues until every group contains a single symbol.
Worked example:
Let a source have the states {e, a, i, o, u, ?} (where ? stands for the sixth symbol) with probabilities (e, 0.3); (a, 0.2); (i, 0.1); (o, 0.2); (u, 0.1); (?, 0.1).
Symbol  Frequency  Code
A       0.2        01
E       0.3        00
I       0.1        101
O       0.2        100
U       0.1        110
?       0.1        111
The grouping is as follows:
Symbol  Frequency  First split  Second split  Code
E       0.3        0.5          0.3           00
A       0.2                     0.2           01
O       0.2        0.5          0.3           100
I       0.1                                   101
U       0.1                     0.2           110
?       0.1                                   111
To compress a file with the Fano - Shannon method, the first task is to read the source file and count how often each symbol occurs. Once the statistics are complete, the frequency table is sorted in decreasing order of frequency.
The code table of the symbols is sent to the decompression program as follows: each symbol uses 3 bytes. The first byte carries the character being coded, the next 4 bits (the first 4 bits of the second byte) carry the length of the code, and the remainder carries the code itself, up to its maximum length.
Decompression must use the code table received from the compression algorithm. Decoding relies on the observation that no code is the beginning of another code.
Assessment: for the reasons above, the Fano - Shannon algorithm is classed among the statistical algorithms. On the whole, however, it is a rather inefficient method and is little used.
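Steps a to e can be sketched recursively as below. Weights are given in tenths so that the comparisons are exact, "?" is a placeholder for the sixth symbol, and the tie-breaking rule (when two cuts balance equally well, take the later one) is an assumption chosen because it reproduces the grouping of the worked example.

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, weight) pairs sorted by decreasing weight."""
    codes = {sym: "" for sym, _ in symbols}

    def split(group):
        if len(group) < 2:
            return
        total = sum(w for _, w in group)
        running, best_cut, best_diff = 0, 1, None
        for cut in range(1, len(group)):
            running += group[cut - 1][1]
            diff = abs(total - 2 * running)       # |upper sum - lower sum|
            if best_diff is None or diff <= best_diff:
                best_cut, best_diff = cut, diff   # on ties, prefer the later cut
        upper, lower = group[:best_cut], group[best_cut:]
        for sym, _ in upper:
            codes[sym] += "0"
        for sym, _ in lower:
            codes[sym] += "1"
        split(upper)
        split(lower)

    split(list(symbols))
    return codes

# Frequencies of the worked example in tenths.
table = [("e", 3), ("a", 2), ("o", 2), ("i", 1), ("u", 1), ("?", 1)]
print(shannon_fano(table))
# {'e': '00', 'a': '01', 'o': '100', 'i': '101', 'u': '110', '?': '111'}
```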
5.1.2 - The Huffman algorithm.
a) Principle:
The principle of the Huffman method is to encode the bytes of the source
data file as binary strings. It produces variable-length codes, each of
which is a sequence of bits. It is also a statistical compression method:
symbols that occur more often receive shorter codes.
Huffman codes have an important property: the code of one symbol cannot
be a prefix of the code of another symbol.
If a symbol is encoded by the bit pattern 101, then the pattern 10110
cannot be the code of another symbol in the source file. Consequently,
the decoder reads bits one at a time until they form the code of some
symbol.
b) Algorithm.
The Huffman coding tree is built by a different algorithm from the
Fano - Shannon one. The Fano - Shannon tree is built top-down, by
repeatedly splitting in two and assigning one bit to each part, and the
work ends when no further split is possible. The Huffman tree, by
contrast, is built bottom-up: it starts from the leaves of the tree and
finishes at the root.
Example:
Given a source model with the following states and frequencies:
(A, 0.2); (E, 0.3); (I, 0.1); (O, 0.2); (U, 0.1); (?, 0.1), we obtain:
    Symbol  Frequency  Code
    A       0.2        10
    E       0.3        01
    I       0.1        001
    O       0.2        11
    U       0.1        0000
    ?       0.1        0001
Step 1: Group the 2 letters with the lowest frequencies to form a
compound letter. After each grouping the number of letters drops by 1.
    e -> 0.3   e -> 0.3        e -> 0.3             {a, o} -> 0.4         {{{u, ?}, i}, e} -> 0.6
    a -> 0.2   a -> 0.2        {{u, ?}, i} -> 0.3   e -> 0.3              {a, o} -> 0.4
    o -> 0.2   o -> 0.2        a -> 0.2             {{u, ?}, i} -> 0.3
    i -> 0.1   {u, ?} -> 0.2   o -> 0.2
    u -> 0.1   i -> 0.1
    ? -> 0.1
Step 2: Build the branching tree in the reverse order of the grouping
process, with the left branch coded 0 and the right branch coded 1.
    {{{{u, ?}, i}, e}, {a, o}}
      0: {{{u, ?}, i}, e}
         0: {{u, ?}, i}
            0: {u, ?}
               0: u
               1: ?
            1: i
         1: e
      1: {a, o}
         0: a
         1: o
The symbol codes are therefore:  u -> 0000;  e -> 01
                                 ? -> 0001;  a -> 10
                                 i -> 001;   o -> 11
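Compression algorithm:
Step 1: Find the two symbols with the smallest weights and merge them
into one; the weight of the new symbol is the sum of the weights of the
two merged symbols.
Step 2: While the list contains more than one symbol, repeat step 1;
otherwise go to step 3.
Step 3: Split the final symbol back apart to form the binary tree, with
the convention that the left branch is coded 0 and the right branch is
coded 1.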
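The three steps can be sketched with a priority queue. The function name huffman_codes, the integer percentage weights and the letter x standing in for the sixth symbol of the worked example are assumptions of this sketch. Ties between equal weights are broken by insertion order here, which need not match the grouping shown in the example, so individual codes may differ from the table while the average code length stays the same.

```python
import heapq
from itertools import count


def huffman_codes(weights):
    """Return {symbol: code} built bottom-up from {symbol: weight}."""
    order = count()  # tie-breaker, so the heap never compares tree nodes
    heap = [(w, next(order), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    # Steps 1 and 2: merge the two lightest entries until one tree remains.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(order), (left, right)))
    root = heap[0][2]
    codes = {}

    # Step 3: walk back down; left branch is 0, right branch is 1.
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # a one-symbol source gets "0"

    walk(root, "")
    return codes


weights = {"e": 30, "a": 20, "o": 20, "i": 10, "u": 10, "x": 10}
codes = huffman_codes(weights)
# Average code length in bits per symbol (weights are percentages).
average = sum(weights[s] * len(c) for s, c in codes.items()) / 100
```

For these weights the average length is 2.5 bits per symbol, the same as for the code table of the example (0.3*2 + 0.2*2 + 0.2*2 + 0.1*3 + 0.1*4 + 0.1*4).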
Decompression algorithm:
Step 1: Read the compressed file bit by bit, walking down the binary
tree built earlier, until a leaf is reached. Write the symbol at that
leaf to the decompressed file.
Step 2: While the compressed file is not exhausted, repeat step 1;
otherwise go to step 3.
Step 3: End of the algorithm.
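The bit-by-bit tree walk can be illustrated as follows. The sketch uses the code table of the worked example, with x standing in for the sixth symbol; the helper names build_tree and decode and the nested-dictionary tree are assumptions, not part of the original program.

```python
def build_tree(codes):
    """Turn {symbol: code} into a binary tree of nested dictionaries."""
    root = {}
    for symbol, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = symbol  # the last bit leads to a leaf
    return root


def decode(bits, root):
    """Walk the tree one bit at a time, emitting a symbol at each leaf."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):  # reached a leaf
            out.append(node)
            node = root                 # restart at the root
    if node is not root:
        raise ValueError("bit stream ends in the middle of a code")
    return "".join(out)


codes = {"u": "0000", "x": "0001", "i": "001", "e": "01", "a": "10", "o": "11"}
tree = build_tree(codes)
encoded = "".join(codes[s] for s in "eau")  # "01" + "10" + "0000"
decoded = decode(encoded, tree)
```

Because no code is a prefix of another, the decoder never needs to look ahead: the first leaf it reaches is always the right symbol.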
Some limitations of Huffman coding:
+ Huffman coding can only be applied once the frequencies of the symbols
are known.
+ Huffman coding only removes the redundancy in the symbol distribution.
+ Static Huffman requires a binary tree covering all possible symbols to
be built in advance. This takes considerable time, because the type of
data to be compressed is not known beforehand.
+ Decompression is complicated because the length of a code is not known
in advance, only once the symbol itself has been found.
5.1.3. The dynamic Huffman algorithm:
The dynamic Huffman algorithm uses two balanced binary trees and a
window.
The window concept:
The window is a very large number of symbols that are added to the
Huffman binary tree while the text is being compressed.
During compression the symbols are counted, and the compressor is
derived from these counts.
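Compression algorithm:
Step 1: Initialise the Huffman code table (front tree) with the special
symbol, whose count is 1.
Step 2: Build the ordered code table according to the balancing rule.
Step 3: While not eof(f) do
Begin
  Getchar -> ch
  If ch is in the ordered code table then
  Begin
    Write the 0/1 code of the symbol and of the special symbol to the
    output file
    Delete the symbol from the ordered code table and add it to the
    Huffman code table (in the front tree) with a count of 1.
  End Else
  Begin
    If ch is in the Huffman code table then
    Begin
      Increase the count of the symbol by one
      If the count of the symbol is greater than the count of the symbol
      just above it then
      Begin
        Swap the two symbols
      End;
      Write the 0/1 code of the symbol and of the special symbol to the
      output file
    End;
  End;
  If the number of symbols in the Huffman code table >= the window size then
  Begin
    Pick a symbol and decrease its count by one
    If the symbol below it is the special symbol then
      Decrease the count of the special symbol by one
    Begin
      If the count of the symbol after the decrease < the count of the
      symbol below it then
      Begin
        Swap them
      End;
    End;
    If the count of the symbol is zero then
    Begin
      Remove the symbol from the Huffman code table.
      Increase the number of symbols in the ordered tree by one.
      Decrease the number of symbols in the Huffman tree by one.
      Add the symbol just removed from the Huffman code table to the
      first position of the ordered code table.
    End;
  End;
  Rebuild the ordered code table and the Huffman codes for the Huffman
  code table according to the balancing rule.
End;
Step 4: Stop the program.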
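Decompression algorithm:
Step 1: Initialise the ordered code table according to the balancing rule.
Step 2: While not eof(f) do
Begin
  Read the symbols one at a time, appending them to a string, until the
  special symbol is met.
  If the special symbol points to the ordered code table then
  Begin
    Look the symbol up in the ordered code table and write it to the
    output file.
    Add the symbol to the Huffman code table with a count of 1.
  End;
  If the special symbol points to the Huffman code table then
  Begin
    Look the symbol up in the Huffman code table and write it to the
    output file.
    Increase the count of the symbol by 1.
    If the count of the symbol is greater than the count of the symbol
    above it then
    Begin
      Swap them.
    End;
  End;
  If the number of symbols in the Huffman code table >= the window size then
  Begin
    Pick a symbol and decrease its count by 1.
    If the count of the symbol < the count of the symbol before it then
    Begin
      If the symbol below it is the special symbol then
        Decrease the count of the special symbol by 1.
      Swap the two symbols.
    End;
    If the count of the symbol after the decrease is zero then
    Begin
      Increase the number of symbols in the ordered tree by one.
      Decrease the number of symbols in the Huffman tree by one.
      Remove the symbol from the Huffman code table.
      Add the symbol just removed to the ordered code table.
    End;
  End;
  Rebuild the codes for the ordered code table and the Huffman code table
  according to the balancing rule.
End;
Step 3: End.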
Evaluation: Encoding and decoding are relatively slow because the binary
tree has to be built to match the input data. Huffman compression is
commonly used to compress text files and large files.
5.2.1. The LZ78 algorithm:
Instead of reporting the position of a repeated passage in the past, the
LZ78 code numbers all phrases, so that each phrase records the number of
the earlier phrase it repeats together with the one symbol that makes it
differ from that earlier phrase. Each new phrase is thus an earlier
phrase plus one further symbol, which is exactly what makes the new
phrase different from the earlier ones.
Example:
Suppose we have the following text: aaabbabaabaaabab.
The LZ78 algorithm splits it into the following phrases:
    Input   a    aa   b    ba   baa  baaa  bab
    Phrase  1    2    3    4    5    6     7
    Output  0+a  1+a  0+b  3+a  4+a  5+a   4+b
The compressed form is therefore: (0,a);(1,a);(0,b);(3,a);(4,a);(5,a);(4,b).
Next, the parsed output is used to encode the numbers required, with a
fixed-length one-byte code for the symbols.
The LZ78 process can be represented as a tree:
    0
    +-- a --> 1
    |         +-- a --> 2
    +-- b --> 3
              +-- a --> 4
                        +-- a --> 5
                        |         +-- a --> 6
                        +-- b --> 7
As the example shows, representing the LZ78 dictionary as a tree data
structure makes the processing simpler.
Compression algorithm:
Step 1: Read one symbol -> ch, number the phrase ch as 1, add the symbol
to the dictionary, w:=ch;
Step 2: While not eof(f) do
Begin
  Read the next symbol -> ch; w:=ww+ch;
  If w is in the dictionary then ww:=w
  Else Begin
    Code(w,j);
    Write j and ch to the compressed file.
    Add w to the dictionary.
  End;
End;
Step 3: Stop the program.
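For comparison, here is a minimal sketch of LZ78 phrase parsing in its usual textbook form, which reproduces the pairs of the worked example. It is not a transcription of the chapter's own program: the name lz78_compress, the use of a Python dictionary for the phrase table and the handling of input that ends inside a known phrase are assumptions.

```python
def lz78_compress(text):
    """Return the list of (phrase_index, symbol) pairs for text."""
    dictionary = {"": 0}   # phrase -> index; the empty phrase is 0
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch   # keep extending the current match
        else:
            # Emit the longest known phrase plus the symbol that ends it.
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit it with an empty symbol.
        pairs.append((dictionary[phrase], ""))
    return pairs


pairs = lz78_compress("aaabbabaabaaabab")
```

For the example text this gives (0,a), (1,a), (0,b), (3,a), (4,a), (5,a), (4,b), matching the output row of the table.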
Decompression algorithm:
Step 1: Read the dictionary information stored in the compressed file,
tl:=false;
Step 2: While not eof(f) do
Begin
  Read the next byte -> b
  Decode(b,s,t);
  If tl=false then w:=w+s
  Else w:=ww+s;
  TIMCHU(w,t);  { TIMCHU presumably looks w up in the dictionary }
  If t=false then
  Begin
    Write s to the decompressed file
    Add s to the dictionary
  End Else
  Begin
    ww:=s;
  End;
End;
Step 3: Stop the program.
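A textbook LZ78 decoder rebuilds the phrase list while it reads the pairs, so no dictionary has to be stored in the file. The sketch below assumes the (index, symbol) pair format of the worked example; lz78_decompress is a hypothetical name, and the sketch does not follow the pseudocode above line by line.

```python
def lz78_decompress(pairs):
    """Rebuild the text from a list of (phrase_index, symbol) pairs."""
    phrases = [""]          # index 0 is the empty phrase
    out = []
    for index, ch in pairs:
        # Each pair names an earlier phrase and the symbol that extends it.
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)  # the new phrase gets the next number
    return "".join(out)


pairs = [(0, "a"), (1, "a"), (0, "b"), (3, "a"), (4, "a"), (5, "a"), (4, "b")]
restored = lz78_decompress(pairs)
```

Applied to the pairs of the example, the decoder returns the original text aaabbabaabaaabab.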
Evaluation: Overall, LZ78 is a good text compression algorithm with a
relatively fast running time; however, its potential savings are not
fully exploited.
5.2.2. The LZW algorithm:
This algorithm is a development of LZ78. As seen with LZ78, storing the
symbol that follows each phrase tends to waste memory, so compression
efficiency is not high. LZW handles this by dropping the symbol after
each phrase, so the output for each phrase contains only a pointer. It
does so by preparing in advance a list of phrases that covers the symbols
of the input alphabet and then extending that alphabet; in other words,
it uses additional symbols to represent strings of ordinary symbols. To
apply LZW to 8-bit ASCII, the alphabet is extended by using codes of
9 bits or more: the 256 additional codes that a 9-bit code provides are
used to store strings determined from the strings in the source. The
algorithm will not achieve a high compression ratio under the following
conditions:
+ The source is not homogeneous and its redundancy characteristics
change over the course of the file.
+ The source is considerably longer than the capacity of the string
table.
Compression algorithm:
Step 1: Count the symbols to build the dictionary, write it to the
compressed file, t:=false
Read the first symbol -> w
Step 2: While not eof(f) do
Begin
  Read one symbol -> ch
  if t=false then w:=w+ch
  else
  begin
    w:=ww+ch; t:=false;
  end;
  TIMCHU(w,tl);  { TIMCHU presumably looks w up in the dictionary }
  If tl=false then
  Begin
    Code(,j);
    Write j to the compressed file.
    Add w to the dictionary.
    w:=ch;
  end else
  begin
    ww:=w;
    t:=true;
  end;
end;
Step 3: Code(ch), Stop the program.
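The idea of emitting only pointers can be seen in a compact sketch of textbook LZW. It assumes a table preloaded with all 256 single-byte strings rather than a dictionary built by counting and written to the file, returns the codes as plain integers instead of packing them into 9-bit fields, and puts no limit on the table size. The name lzw_compress is hypothetical, and the sketch does not mirror the flag variables of the pseudocode above.

```python
def lzw_compress(data: bytes):
    """Return the list of integer codes for data."""
    # Codes 0..255 stand for the single bytes; new strings start at 256.
    table = {bytes([i]): i for i in range(256)}
    codes = []
    w = b""
    for value in data:
        ch = bytes([value])
        if w + ch in table:
            w += ch                    # keep extending the current match
        else:
            codes.append(table[w])     # output only the pointer
            table[w + ch] = len(table) # learn the new string
            w = ch
    if w:
        codes.append(table[w])         # flush the last match
    return codes


codes = lzw_compress(b"aaabbabaabaaabab")
```

For the 16-symbol text of the LZ78 example the sketch emits 10 codes, each of which would fit in 9 bits.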
Decompression algorithm:
Step 1: Read the dictionary information from the compressed file, read
the next byte, decompress it and assign it to w, t:=false;
Step 2: While not eof(f) do
Begin
  Read the next byte -> b
  Decode(b,s,t);
  If t=true then
  Begin
    For i:=1 to length(s) do
    Begin
      If t=false then w:=w+s(i)
      Else begin
        w:=ww+s(i); t:=false; end;
      TIMCHU(w,t);  { TIMCHU presumably looks w up in the dictionary }
      If t=false then
      Begin
        Add to the dictionary.
        Write to the decompressed file.
        w:=s(i);
      End;
    end;
  End
  Else
  Begin
    Write to the decompressed file;
    w:=w+w(i);
    add w to the dictionary
  End;
End;
Step 3: Decode(b,s,t); write s to the decompressed file. Stop the program.
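A textbook LZW decoder rebuilds the same table as the encoder, one entry behind it. The sketch below assumes the same conventions as a standard encoder with 256 preloaded single-byte entries and integer codes; lzw_decompress is a hypothetical name, and the code list is what such an encoder produces for the text of the LZ78 example.

```python
def lzw_decompress(codes):
    """Rebuild the byte string from a list of integer LZW codes."""
    table = {i: bytes([i]) for i in range(256)}
    if not codes:
        return b""
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The encoder used a string it had only just added:
            # it must be w followed by the first byte of w.
            entry = w + w[:1]
        else:
            raise ValueError("invalid LZW code: %d" % code)
        out.append(entry)
        # Mirror the encoder: previous string plus first byte of this one.
        table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out)


codes = [97, 256, 98, 98, 97, 259, 260, 256, 262, 98]
restored = lzw_decompress(codes)
```

The second code, 256, is not yet in the table when it arrives, which exercises the special case in the elif branch.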
Evaluation: LZW overcomes the memory waste that the preceding algorithms
could not avoid. The algorithm runs fast if a tree structure (binary or
ternary) is used. Another advantage is its high compression efficiency.
