
VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY


Nguyễn Thạc Huy
A STUDY OF LANGUAGE MODELS
USING THE BLOOM FILTER METHOD
UNDERGRADUATE THESIS, FULL-TIME PROGRAM

Major: Information Technology
Supervisor: Dr. Nguyễn Văn Vinh
HANOI - 2010
Nguyễn Thạc Huy - Graduation thesis
Abstract
Language models are an important component of applications such as speech recognition, word segmentation and statistical machine translation, and they are usually built from n-grams. In this thesis we study language models built on the Bloom Filter data structure. Instead of storing the whole n-gram set as traditional models do, this kind of language model uses a special encoding procedure that lets the stored n-gram statistics share bits efficiently, which saves a considerable amount of memory. After a brief overview of language modelling, we study two Bloom Filter-based data structures: the Log-Frequency Bloom Filter and the Bloom Map. Through experiments, we show the advantages of Bloom Filter-based language models in both storage size and practical effectiveness, here within the statistical machine translation system Moses [21].
Table of contents
ABSTRACT
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF ABBREVIATIONS
LIST OF FIGURES
INTRODUCTION
CHAPTER 1 - Overview of language models
1.1 N-gram
1.2 Building a language model
1.2.1 Maximum likelihood estimation (MLE)
1.2.2 Smoothing methods
1.2.2.1 Kneser-Ney
1.2.2.2 Modified Kneser-Ney
1.2.2.3 Stupid Backoff
1.3 Evaluating language models
1.3.1 Perplexity
1.3.2 MSE
CHAPTER 2 - Bloom Filter-based data structures
2.1 Probabilistic data structures (PDS)
2.2 Hash functions
2.3 The basic Bloom Filter
2.4 Language models using Bloom Filters
2.4.1 Log-frequency Bloom Filter
2.4.2 Sub-sequence filtering
2.4.3 Bloom Map
CHAPTER 3 - Experiment: Building LMs with RandLM and SRILM
3.1 Corpora
3.2 Smoothing algorithm
3.3 Building LMs with SRILM and RandLM
CHAPTER 4 - Experiment: Statistical machine translation with Moses
4.1 Statistical machine translation
4.1.1 Introduction to statistical machine translation
4.1.2 Phrase-based statistical machine translation
4.1.3 BLEU score
4.2 Baseline System
4.3 Corpora
4.4 Experimental results
CONCLUSION
APPENDIX
Acknowledgements
First of all, I would like to sincerely thank my lecturer, Dr. Nguyễn Văn Vinh, for his dedicated guidance throughout my specialised internship and the research for this thesis. I would also like to thank Mr. Tống Tùng Khánh and Mr. Vương Hoài Thu of the Digital Content Solution group at Lạc Việt Computing Corporation, who enthusiastically helped me with this topic and contributed many valuable comments that improved the thesis. Without the guidance of my supervisor and these colleagues, I could not have completed this work.
The encouragement of my parents and my siblings has been a great source of motivation and support. I am also very grateful to all my university friends who shared the meaningful years of student life at the University of Engineering and Technology, VNU Hanoi. I wish you all good graduation results and success in life.
List of abbreviations
BF : Bloom Filter
BF-LM : Bloom Filter-based language model
LF-BF-LM : Log-Frequency Bloom Filter language model
LM : Language model
MKN : Modified Kneser-Ney smoothing
MLE : Maximum likelihood estimation
MSE : Mean squared error
MT : Machine translation
NLP : Natural language processing
PDS : Probabilistic data structure
RDS : Randomized data structure
SMT : Statistical machine translation
List of figures
Figure 1: Second-order Markov model
Figure 2: Example of a hash function
Figure 3: Example of a hash table. Collisions in a hash table
Figure 4: Training a Bloom Filter
Figure 5: Querying a Bloom Filter
Figure 6: One-sided error in a Bloom Filter
Figure 7: Increasing LM size improves the BLEU score
Figure 8: Architecture of an SMT system
Figure 9: Illustration of phrase-based statistical machine translation
Introduction
The language model (LM) is an important component of many applications such as machine translation and speech recognition. An LM tries to model natural language as accurately as possible. Many studies and experiments [19, 28] show that the larger the training corpus and the higher the order of the model, the more accurate the model becomes.
Building large corpora used to be very difficult. With the explosion of the Internet, however, the amount of available text is enormous, and it would be wasteful not to exploit it. As a result, the corpora used to train LMs have grown dramatically in recent years, to the point where the models no longer fit in the memory of even large machines with many gigabytes of RAM. This defeats the effort to model natural language more accurately by combining large corpora with traditional model representations: the corpus has to be cut down so that the LM fits in memory, which contradicts the original purpose of collecting ever larger corpora. This limitation requires researchers to find other ways of modelling language if they still want to exploit the advantages of large corpora.
One solution is to give up exactness and accept the loss of a certain amount of information when building the language model from the corpus. That is, instead of lossless LMs we use lossy LMs. Research on lossy LMs has produced a new class of data structures called Randomized Data Structures (RDS), also known as Probabilistic Data Structures (PDS). Typical examples are the Skip List [33], Sparse Partition [16], Lossy Dictionary [31] and Bloom Filter [4]. In Vietnam there has also been some research on language modelling [39], but it has been limited to standard language models. This thesis studies Bloom Filter-based language models, motivated by the notable recent advances in using this data structure for language modelling [35, 36, 37]. It focuses on the memory and storage savings of this kind of LM and on its effectiveness compared with standard LMs [34], evaluated through a concrete application: the statistical machine translation system Moses.
Chapter 1 presents the basic background on language models: n-grams, the smoothing algorithms used in language models, and the measures used to evaluate a language model.
Chapter 2 studies the Bloom Filter-based data structures used for language modelling, specifically the Log-Frequency Bloom Filter and the Bloom Map.
Chapter 3 reports experiments in building language models on an English corpus and a Vietnamese corpus.
Chapter 4 briefly introduces statistical machine translation and reports translation experiments with the open-source system Moses using the language models built in Chapter 3.
Chapter 1
Overview of language models
A language model (LM) is a probability distribution over word sequences, estimated from a monolingual corpus. LMs are used in many natural language processing tasks, for example statistical machine translation, speech recognition, handwriting recognition and spelling correction. In essence, an LM is a function that takes a sequence of words as input and returns a score estimating the probability that a native speaker would say that sequence. A good language model therefore scores grammatical, fluent sentences higher than the same words in random order, as in the following example (a Vietnamese sentence meaning "it is sunny today" and a scrambled version of it):
Pr("hôm nay trời nắng") > Pr("trời nắng nay hôm")
1.1 N-gram
The most common way to model language in an LM is through n-grams. In an n-gram model we treat a text as a sequence of adjacent words, $w_1, w_2, \ldots, w_{N-1}, w_N$, and decompose the probability of the sequence with the chain rule:
$$
\Pr(w_1, w_2, \ldots, w_N) = \Pr(w_1)\,\Pr(w_2 \mid w_1)\,\Pr(w_3 \mid w_1, w_2) \cdots \Pr(w_N \mid w_1, w_2, \ldots, w_{N-1}) = \prod_{i=1}^{N} \Pr(w_i \mid w_1^{i-1})
$$
so each word is conditioned on all the words that precede it (we call these the history of the event, or of the word).
However, using all preceding words to predict the next word is not feasible, for two reasons. First, it is computationally impractical: each prediction would consume too much time and too many system resources. Second, in many cases, after reading only a few words of the history we already find that the sentence has never been seen before, so even when the full history of a word is known, its probability may still be unknown. Instead, language models usually approximate the probability under the Markov assumption that the next word depends only on a few preceding words [25]. A Markov model of order n assumes that only the n preceding words provide context for the word being predicted. The number of words the LM looks at determines its order: 1-gram (unigram), 2-gram (bigram), 3-gram (trigram) and 4-gram (fourgram) models correspond to Markov models of order zero, one, two and three, respectively. For example, to estimate the 3-gram probability of a word $w_i$ with a second-order Markov model, we condition on the two preceding words:
$$
\Pr(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx \Pr(w_i \mid w_{i-2}, w_{i-1})
$$
Figure 1: Second-order Markov model (word sequence $w_{i-3}, w_{i-2}, w_{i-1}, w_i, w_{i+1}$)
In general, let $w_{i-n+1}^{i}$ denote an n-gram of length n ending in the word $w_i$. The n-gram estimate of the probability of a sequence of length N is then:
$$
\Pr(w_1^N) \approx \prod_{i=1}^{N} \Pr(w_i \mid w_{i-n+1}^{i-1})
$$
1.2 Building a language model

To build (train) a language model we need a reasonably large monolingual corpus and a statistical estimator that models the probability mass of the corpus. The estimators used by LMs all rely, in different ways, on n-gram frequencies, so we need to count the occurrences of every n-gram from 1-grams up to the order of the model being trained.
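The counting step described above can be sketched as follows. This is a minimal illustration rather than the counting tool used in the experiments; the function name `count_ngrams` and the toy sentence are assumptions made for the example.

```python
from collections import Counter


def count_ngrams(tokens, max_order):
    """Count every n-gram of order 1..max_order in a list of tokens."""
    counts = Counter()
    for n in range(1, max_order + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical toy corpus, used only for illustration.
tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, 3)
```

Here `counts[("the",)]` is the unigram count of "the" and `counts[("the", "cat")]` is a bigram count; a real corpus would be processed sentence by sentence, usually with sentence-boundary markers added.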
1.2.1 Maximum likelihood estimation (MLE)
We can use the n-gram counts to build a maximum likelihood estimate (MLE) model from the relative frequencies of the n-grams in the corpus. Under MLE, the probability that a given unigram appears next is simply its relative frequency in the corpus:
$$
\Pr_{MLE}(w_i) = \frac{c(w_i)}{\sum_{i'} c(w_{i'})}
$$
where $c(w_i) = |w_i|$ is the number of occurrences of the word $w_i$ in the corpus. The method gets its name because it maximises the likelihood that the model assigns to the training corpus. For example, in the Brown corpus^1, a corpus of one million words, the word "Chinese" occurs 400 times. The probability that an MLE language model assigns to the unigram "Chinese" is therefore
$$
\Pr_{MLE}(\text{Chinese}) = \frac{400}{1000000} = 0.0004
$$
The conditional probability of a general n-gram of order greater than 1 is:
$$
\Pr_{MLE}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})}
$$
that is, how often the word follows a history of length n - 1. Continuing the example, the probability of the bigram "Chinese food" is the number of times "food" follows "Chinese" divided by c(Chinese) = 400. In the Brown corpus the phrase "Chinese food" occurs 120 times, so:
$$
\Pr_{MLE}(\text{food} \mid \text{Chinese}) = \frac{120}{400} = 0.3
$$
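The two MLE formulas can be checked on the numbers quoted in the text. The sketch below is only an illustration: the function names and the dictionary layout (tuples of tokens mapped to counts) are assumptions, and the counts are the ones given in the example above.

```python
from collections import Counter


def mle_unigram(counts, word, total_tokens):
    """Pr_MLE(word) = c(word) / total number of tokens."""
    return counts[(word,)] / total_tokens


def mle_conditional(counts, history, word):
    """Pr_MLE(word | history) = c(history + word) / c(history)."""
    history = tuple(history)
    if counts[history] == 0:
        return 0.0  # unseen history: MLE has no estimate, return 0 here
    return counts[history + (word,)] / counts[history]


# Counts quoted in the text for the one-million-word corpus.
counts = Counter({("Chinese",): 400, ("Chinese", "food"): 120})
p_unigram = mle_unigram(counts, "Chinese", 1_000_000)
p_bigram = mle_conditional(counts, ["Chinese"], "food")
```

With these counts `p_unigram` evaluates to 0.0004 and `p_bigram` to 0.3, matching the worked example. Any n-gram absent from `counts` receives probability 0, which is the weakness that smoothing addresses.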
1.2.2 Smoothing methods
Although MLE is easy to understand and easy to use for estimating model probabilities, in practice we run into the data sparseness problem: however large the training corpus is, it is only a finite sample of the infinitely many possible sentences of a natural language. An LM that uses MLE alone therefore assigns zero probability to many perfectly good n-grams. To reduce this problem, more sophisticated statistical estimators are normally used instead of plain MLE. These methods are called smoothing or discounting: part of the probability mass of observed events is set aside for events that have never been seen. Exactly what to take the mass from and how to discount it is still an active research topic. For example, the most classical smoothing method is Add-one smoothing [13], in which a constant of at most 1 is added to the count of every vocabulary item in the corpus.
^1 http://icame.uib.no/brown/bcm.html
Two important concepts in smoothing language models are backoff and interpolation. When the LM meets an unknown n-gram, it computes the probability from the (n-1)-gram; if the (n-1)-gram was not seen during training either, the LM uses the (n-2)-gram, and so on until a probability can be computed. This process is called backoff and is defined as follows:
$$
\Pr_{BO}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
\delta(w_{i-n+1}^{i-1})\,\Pr_{LM}(w_i \mid w_{i-n+1}^{i-1}) & \text{if } c(w_{i-n+1}^{i}) > 0 \\
\alpha\,\Pr_{BO}(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise}
\end{cases}
$$
where $\delta$ is a discount factor based on the frequency of the history $w_{i-n+1}^{i-1}$ and $\alpha$ is the backoff parameter. When the vocabulary is large enough, we may want to assign zero probability to out-of-vocabulary (OOV) words at the unigram level, for example when we have a domain-specific dictionary and do not want to share the probability mass of its entries (common nouns, special numbers and so on) with OOV words. Alternatively, we can smooth the LM and reserve a small amount of probability mass for OOV words at the unigram level.
Interpolation combines n-gram statistics across all orders of the LM. If the LM has order n, the recursive interpolation formula is:
$$
\Pr_{I}(w_i \mid w_{i-n+1}^{i-1}) = \lambda\,\Pr_{LM}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda)\,\Pr_{I}(w_i \mid w_{i-n+2}^{i-1})
$$
where $\lambda$ is a weight that determines which order of the LM has the greatest influence on the result. The weights used across all n-gram orders sum to one. There are many ways to set these weights; in simple interpolation the $\lambda$ values are scaled down according to the n-gram order. More often, however, they are computed depending on the context, that is, on the frequency of the n-grams in the history. The weights are not estimated from the training data but from a separate held-out set, which is used only to train parameters, in this case the $\lambda$ values. Note the fundamental difference between the two methods: interpolation uses information from the lower orders even when the count of the n-gram in question is non-zero, whereas backoff only searches down to the nearest order with a non-zero count.
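The difference between the two schemes can be made concrete with a small sketch. It is deliberately simplified and is not the estimator used by any particular toolkit: `lam`, `delta` and `alpha` are fixed illustrative constants (in practice they are tuned or derived so that the distribution normalises), `Pr_LM` is taken to be the relative frequency, and the toy counts are hypothetical.

```python
def rel_freq(counts, total, history, word):
    """Relative frequency c(history + word) / c(history).

    An empty history gives the unigram estimate c(word) / total.
    """
    denom = counts.get(history, 0) if history else total
    if denom == 0:
        return 0.0
    return counts.get(history + (word,), 0) / denom


def interpolated(counts, total, history, word, lam=0.7):
    """Always mix the current order with the lower orders."""
    p = rel_freq(counts, total, history, word)
    if not history:
        return p
    lower = interpolated(counts, total, history[1:], word, lam)
    return lam * p + (1 - lam) * lower


def backoff(counts, total, history, word, delta=0.9, alpha=0.4):
    """Use the highest order with a non-zero count, else shorten the history."""
    if not history:
        return rel_freq(counts, total, history, word)
    if counts.get(history + (word,), 0) > 0:
        return delta * rel_freq(counts, total, history, word)
    return alpha * backoff(counts, total, history[1:], word, delta, alpha)


# Hypothetical toy counts: 4 tokens in total.
counts = {("a",): 2, ("b",): 1, ("c",): 1, ("a", "b"): 1, ("a", "c"): 1}
total = 4

seen_i = interpolated(counts, total, ("a",), "b")    # seen bigram, unigram still mixed in
seen_b = backoff(counts, total, ("a",), "b")         # seen bigram, unigram ignored
unseen_i = interpolated(counts, total, ("a",), "a")  # unseen bigram
unseen_b = backoff(counts, total, ("a",), "a")       # unseen bigram, falls back to unigram
```

For the seen bigram ("a", "b") the interpolated value still depends on the unigram frequency of "b", while the backoff value depends only on the bigram statistics; for the unseen bigram both fall back on the unigram.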
The following subsections present some of the most widely used smoothing methods, namely Kneser-Ney [17] and Google's Stupid Backoff [5].
1.2.2.1 Kneser-Ney
The Kneser-Ney (KN) smoothing algorithm was developed by Reinhard Kneser and Hermann Ney and published in 1995 [17]. In KN, the probability of a unigram is not proportional to its frequency but to the number of distinct words that precede it.
As an illustration, the bigram "San Francisco" is very common in a book on the history of the city of San Francisco. With such a high bigram frequency, simple methods would also give high unigram frequencies to both "San" and "Francisco". In KN, however, Pr(Francisco) may be very low, because "Francisco" almost always follows "San". Since lower-order LMs are mainly used to compute the backoff probabilities of higher-order LMs, KN reclaims the probability mass that earlier algorithms waste on such words and gives it to events that are more likely to occur.
We first define the number of distinct words preceding a word:
$$
N_{1+}(\bullet\, w_i) = \left|\{ w_{i-1} : c(w_{i-1} w_i) > 0 \}\right|
$$
The notation $N_{1+}$ denotes the number of words that occur one or more times, and the symbol $\bullet$ stands for any word. Instead of the raw frequency used in MLE, each word is counted by the number of distinct words that precede it. The KN unigram probability is therefore:
$$
\Pr_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{\sum_{i'} N_{1+}(\bullet\, w_{i'})}
$$
that is, the number of distinct left contexts of $w_i$ divided by the total number of distinct left contexts of all unigrams in the corpus.
For higher orders, this probability is computed as:
$$
\Pr_{KN}(w_i \mid w_{i-n+2}^{i-1}) = \frac{N_{1+}(\bullet\, w_{i-n+2}^{i})}{\sum_{i'} N_{1+}(\bullet\, w_{i-n+2}^{i-1} w_{i'})}
$$
where the numerator is
$$
N_{1+}(\bullet\, w_{i-n+2}^{i}) = \left|\{ w_{i-n+1} : c(w_{i-n+1}\, w_{i-n+2}^{i}) > 0 \}\right|
$$
and the denominator sums the same count over all n-grams of the same length that share the context $w_{i-n+2}^{i-1}$. The full KN model is interpolated and has the form:
$$
\Pr(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{ c(w_{i-n+1}^{i}) - D,\, 0 \}}{\sum_{i'} c(w_{i-n+1}^{i-1} w_{i'})} + \frac{D}{\sum_{i'} c(w_{i-n+1}^{i-1} w_{i'})}\, N_{1+}(w_{i-n+1}^{i-1}\, \bullet)\, \Pr_{KN}(w_i \mid w_{i-n+2}^{i-1})
$$
with:
$$
N_{1+}(w_{i-n+1}^{i-1}\, \bullet) = \left|\{ w_i : c(w_{i-n+1}^{i-1} w_i) > 0 \}\right|
$$
the number of distinct words that follow the history $w_{i-n+1}^{i-1}$, and D the discount parameter.
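The unigram case of Kneser-Ney, which counts distinct left contexts instead of raw frequency, can be sketched as follows. The function name `continuation_unigram` and the bigram counts are hypothetical; they only mirror the "San Francisco" illustration.

```python
from collections import defaultdict


def continuation_unigram(bigram_counts):
    """Pr_KN(w) = N1+(. w) / sum over w' of N1+(. w')."""
    preceders = defaultdict(set)
    for (prev, word), c in bigram_counts.items():
        if c > 0:
            preceders[word].add(prev)  # record each distinct left context once
    total = sum(len(ctx) for ctx in preceders.values())
    return {w: len(ctx) / total for w, ctx in preceders.items()}


# Hypothetical counts: "Francisco" is frequent but follows only "San".
bigrams = {
    ("San", "Francisco"): 50,
    ("the", "city"): 3,
    ("a", "city"): 2,
    ("this", "city"): 1,
}
p_kn = continuation_unigram(bigrams)
```

Although "Francisco" occurs 50 times and "city" only 6 times, `p_kn["Francisco"]` is 0.25 and `p_kn["city"]` is 0.75, because "city" appears after three different words and "Francisco" after only one.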
1.2.2.2 Modified Kneser-Ney
Modified Kneser-Ney (MKN) smoothing was developed from KN by Chen and Goodman and published in 1999 [11], four years after KN. KN uses absolute discounting: a single value D, 0 < D < 1, is subtracted from every non-zero count. MKN improves on KN by using different discount values in different cases, depending on the count of each n-gram. The general MKN formula is:
$$
\Pr_{MKN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) - D\big(c(w_{i-n+1}^{i})\big)}{\sum_{i'} c(w_{i-n+1}^{i-1} w_{i'})} + \gamma(w_{i-n+1}^{i-1})\, \Pr_{MKN}(w_i \mid w_{i-n+2}^{i-1})
$$
where:
$$
D(c) =
\begin{cases}
0 & \text{if } c = 0 \\
D_1 & \text{if } c = 1 \\
D_2 & \text{if } c = 2 \\
D_{3+} & \text{if } c \ge 3
\end{cases}
$$
and:
$$
Y = \frac{n_1}{n_1 + 2 n_2}, \qquad
D_1 = 1 - 2Y\frac{n_2}{n_1}, \qquad
D_2 = 2 - 3Y\frac{n_3}{n_2}, \qquad
D_{3+} = 3 - 4Y\frac{n_4}{n_3}
$$
where $n_j$ is the total number of n-grams with count j in the model of order n being interpolated. All distributions must sum to one, therefore:
$$
\gamma(w_{i-n+1}^{i-1}) = \frac{\sum_{j \in \{1, 2, 3+\}} D_j\, N_j(w_{i-n+1}^{i-1}\, \bullet)}{\sum_{i'} c(w_{i-n+1}^{i-1} w_{i'})}
$$
where $N_1$, $N_2$ and $N_{3+}$ are the numbers of distinct words following the history with a count of exactly one, exactly two, and three or more, respectively.
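The three MKN discounts depend only on the count-of-counts $n_1, \ldots, n_4$, so they are easy to compute once the n-grams have been counted. The following sketch assumes the standard Chen and Goodman estimates $Y = n_1/(n_1 + 2n_2)$, $D_1 = 1 - 2Y n_2/n_1$, $D_2 = 2 - 3Y n_3/n_2$ and $D_{3+} = 3 - 4Y n_4/n_3$; the function name and the toy counts are hypothetical.

```python
from collections import Counter


def mkn_discounts(ngram_counts):
    """Return (D1, D2, D3plus) computed from the count-of-counts n1..n4."""
    # count_of_counts[j] = number of distinct n-grams seen exactly j times
    count_of_counts = Counter(ngram_counts.values())
    n1, n2, n3, n4 = (count_of_counts[j] for j in (1, 2, 3, 4))
    if min(n1, n2, n3) == 0:
        raise ValueError("n1, n2 and n3 must be non-zero to estimate the discounts")
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3plus = 3 - 4 * y * n4 / n3
    return d1, d2, d3plus


# Hypothetical n-gram counts giving n1 = 4, n2 = 2, n3 = 1, n4 = 1.
toy_counts = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2, "f": 2, "g": 3, "h": 4}
d1, d2, d3plus = mkn_discounts(toy_counts)
```

For these toy counts Y = 0.5, so the discounts come out as 0.5, 1.25 and 1.0. On very small corpora the count-of-counts can be zero, which is why the sketch raises an error instead of dividing by zero.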
1.2.2.3 Stupid Backoff
Kneser-Ney and Modified Kneser-Ney are effective in practice, but they are fairly complex to compute, and the amount of computation becomes very large on big data sets such as Google's Trillion Words n-gram corpus.
Google uses a simple smoothing algorithm called Stupid Backoff, which uses the relative frequencies of the n-grams directly:
$$
S(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
\dfrac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})} & \text{if } c(w_{i-n+1}^{i}) > 0 \\
\alpha\, S(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise}
\end{cases}
$$
where $\alpha$ is a constant set to 0.4. The recursion ends as soon as the unigram level is reached:
$$
S(w_i) = \frac{c(w_i)}{N}
$$
where N is the size of the training corpus. Brants [5] reports that with enough data, the quality of Stupid Backoff approaches that of MKN smoothing. The symbol S is used instead of P to emphasise that the method returns relative scores rather than normalised probabilities.
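Stupid Backoff is short enough to write out directly. The sketch below follows the recursion described in this section with the backoff constant 0.4; the function name, the dictionary layout and the counts are assumptions made for the example, not Google's implementation.

```python
def stupid_backoff(counts, total_tokens, history, word, alpha=0.4):
    """Score S(word | history); counts maps tuples of tokens to frequencies."""
    ngram = history + (word,)
    if not history:
        # Unigram level: the recursion stops here.
        return counts.get(ngram, 0) / total_tokens
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[history]
    # Unseen n-gram: drop the oldest history word and apply the penalty alpha.
    return alpha * stupid_backoff(counts, total_tokens, history[1:], word, alpha)


# Hypothetical counts for a one-million-token corpus.
counts = {("Chinese",): 400, ("food",): 200, ("Chinese", "food"): 120}
total_tokens = 1_000_000

s_seen = stupid_backoff(counts, total_tokens, ("Chinese",), "food")
s_unseen = stupid_backoff(counts, total_tokens, ("food",), "Chinese")
```

`s_seen` is the plain relative frequency 120/400, while `s_unseen` is 0.4 times the unigram frequency of "Chinese". Because no mass is subtracted from seen n-grams, the scores for a given history do not sum to one, which is why they are called scores rather than probabilities.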
1.3 Evaluating language models
1.3.1 Perplexity
Once an LM has been trained, we need to evaluate its quality. The most accurate way to evaluate a language model is to test it in a real application. In speech recognition, for example, we can compare two language models by running the recogniser twice, once with each model, and seeing which gives the more accurate output. This is very time-consuming, however, so we need a tool that can assess a model quickly. Perplexity (PP) [3] is the measure most commonly used for this purpose.
Perplexity is essentially a transformation of the cross entropy of the model. Cross entropy is an upper bound on entropy. Entropy is a basic concept of information theory that measures the amount of information in data as a measure of uncertainty. If a random variable x ranges over a set X with probability distribution p, the entropy of x is defined as:
$$
H(x) = -\sum_{x \in X} p(x) \log_2 p(x)
$$
For example, when a coin is tossed, x can only be heads or tails and the probability is $p = 0.5$ in both cases. When a six-sided die is rolled, the range of possible outcomes is wider and each probability is $p = \frac{1}{6}$. Because rolling a die is more uncertain, its entropy is higher than that of tossing a coin.
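The coin and die comparison can be verified numerically. This is a small sketch of the entropy formula with base-2 logarithms; the function name `entropy` is an assumption made for the example.

```python
import math


def entropy(probs):
    """H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


coin = entropy([0.5, 0.5])   # fair coin
die = entropy([1 / 6] * 6)   # fair six-sided die
```

The fair coin has an entropy of exactly 1 bit, while the fair die has log2(6), about 2.585 bits, confirming that the die is the more uncertain experiment.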
The cross entropy of a model is an information measure between two probability distributions. For a distribution q that we use to model a distribution p, the cross entropy is defined as:
$$
H(p, q) = -\sum_{x \in X} p(x) \log_2 q(x)
$$
The Shannon-McMillan-Breiman theorem [3] shows that for both entropy and cross entropy we can drop the term p if the sequence of values x is long enough. To obtain the per-word cross entropy we simply divide by the total number of words:
$$
H(p, q) = -\frac{1}{n} \sum_{x} \log_2 q(x) = -\frac{1}{n} \log_2 q(x_1^n)
$$
Perplexity is defined as $PP = 2^{H(p, q)}$. Because cross entropy is an upper bound on entropy, $H(p, q) \ge H(p)$, using cross entropy in perplexity means we never underestimate the true entropy of the model. The perplexity of a model is measured on a test set. In practice, perplexity is the first measure used to evaluate a language model, and it can be seen as a function of both the language and the model. As a function of the model, it measures how accurately the model describes the language. As a function of the language, it measures the complexity of the language.
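Given the probability a model assigns to each word of a test text, the per-word cross entropy and the perplexity follow directly from the definitions above. The sketch is illustrative: the function name `perplexity` and the probabilities passed to it are hypothetical.

```python
import math


def perplexity(word_probs):
    """PP = 2 ** H, where H = -(1/n) * sum(log2 q(w_i)) over the test words."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(q) for q in word_probs) / n
    return 2 ** cross_entropy


# A model that gives every test word probability 1/4 ...
pp_uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# ... versus a model that is more confident on the same four words.
pp_better = perplexity([0.5, 0.5, 0.25, 0.5])
```

A model that assigns probability 1/4 to every word has a perplexity of 4, as if it were choosing uniformly among four words at each step; the more confident model gets a lower perplexity. In a real evaluation every probability must be non-zero, which is one more reason smoothing is needed.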
1.3.2 MSE
Lossy LMs do not guarantee exact probabilities, because they store the data incompletely and therefore distort the normal probability distribution. For this reason, entropy-based measures such as perplexity cannot be used to evaluate the quality of such models. We can, however, take a model that preserves a proper probability distribution as a reference and measure how far a lossy LM deviates from it. This can be done with the mean squared error (MSE) between the lossy LM and the lossless LM, both trained and tested on the same corpora:
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (X'_i - X_i)^2
$$
where $X_i$ is the probability of event i in the lossless LM and $X'_i$ is the probability of the same event in the lossy LM.
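The comparison between a lossless and a lossy model can be sketched as below. The function averages the squared differences over the n test events, which is one common convention and an assumption of this example; the two probability lists are hypothetical and stand for the scores the two models assign to the same test n-grams.

```python
def mse(lossless_probs, lossy_probs):
    """Mean squared error between two models' probabilities for the same events."""
    if len(lossless_probs) != len(lossy_probs):
        raise ValueError("both models must be scored on the same events")
    n = len(lossless_probs)
    return sum((xp - x) ** 2 for x, xp in zip(lossless_probs, lossy_probs)) / n


# Hypothetical probabilities for four test n-grams.
lossless = [0.10, 0.20, 0.05, 0.30]
lossy = [0.10, 0.25, 0.05, 0.20]
error = mse(lossless, lossy)
```

An MSE of zero means the lossy model reproduces the reference model exactly on the test events; larger values indicate a larger distortion caused by the lossy encoding.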
Chapter 2
Bloom Filter-based data structures
Since its inception, language modelling has advanced considerably, together with ever better smoothing algorithms [5]. Nevertheless, LMs still face several challenges. The main one is how to build a model that represents natural language well by using more data and higher n-gram orders (n = 6, 7, 8, ...) without becoming too expensive to compute and while using little memory. A corpus such as Google's is too large (24 GB compressed) to fit in ordinary RAM. This pushes researchers to find an alternative to the traditional n-gram representation if they want to exploit large corpora without resorting to expensive conventional solutions such as the supercomputing clusters in Google's distributed computing environment.
In this chapter we study a data structure that can partly meet these requirements: the Bloom Filter (BF) [4], which uses a form of lossy encoding. The idea of the BF is that instead of storing all the n-grams, we store only a randomised representation of the set. Lossy encoding is a common technique in multimedia storage, for example JPEG compression for images, MP3 for audio and MPEG for video. Part of the data is lost during encoding, but the resulting representation still carries most of the useful information after decoding.
The Bloom Filter is a probabilistic data structure originally designed only to answer the question "Is the element x a member of the set S?" If the answer is yes we call it a HIT, otherwise a MISS. Two kinds of error can occur when answering this query: false positives and false negatives. A false positive occurs when the queried object is not in S, $x \notin S$, but the result is a HIT. A false negative is the opposite: an object $x \in S$ is reported as a MISS although it is in fact a member. A data structure that can make only one of these two kinds of error is said to have a one-sided error, and a two-sided error otherwise. The BF is a data structure with one-sided error only.
This data structure requires far less storage than the information-theoretic (entropy) lower bound for exact storage, yet its error rate is low and can be determined in advance. The original Bloom Filter does not support storing key-value pairs. However, Talbot and Osborne [35, 36, 37] have proposed ways of incorporating values into Bloom Filter language models. How this is done is described in the rest of this chapter.
2.1 Probabilistic data structures (PDS)
An important step in the design of a program is finding a suitable way to store and process data. Careful evaluation and selection of the data structures used in a program matters a great deal: the right choice can significantly improve performance, save resources and make the system easier to maintain in the future; with a poor data structure, on the other hand, the system may be limited by excessive computation, may run unstably, or may not work at all on large data sets.
There are many kinds of data structure, suited to different purposes. Some are just general-purpose containers, while others are designed for special applications and reach their full performance only under certain conditions.
In many cases the corpus is so large that no current supercomputer can manage it, and no standard data structure can store it. For example, in statistical machine translation, Google astonished the NLP community in 2006 by releasing a huge n-gram corpus^2. With about 3 billion words and a size of 24 GB compressed, this corpus is too large even for the memory of supercomputers. We could of course store it on hard disk, but in statistical machine translation (SMT), for example, a language model may be queried hundreds of thousands of times per sentence, so this is clearly not a practical option. We need a different approach for corpora of this size.
^2 http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
One feasible approach is to accept a lossy representation of a large corpus instead of trying to represent it exactly (losslessly). That is, by using suitable techniques:
i) A controlled amount of data is lost.
ii) The damage to data integrity caused by the lost data can be considered small compared with the storage space saved. At the same time, data that could not previously be handled (unusable in programs because the set was too large, lookups were too slow, and so on) now becomes manageable.
This approach has developed into a new class of data structures called Randomised Data Structures (RDS), also known as Probabilistic Data Structures (PDS). In a PDS the data is carefully and optimally encoded in a lossy form, and the word "randomised" indicates that these data structures rely on encoding algorithms with a degree of randomness.
A randomised algorithm can be defined as "an algorithm that makes arbitrary, non-predetermined choices during its computation" [14]. Some data is lost when it is encoded into a PDS. The information is nevertheless stored in such a way that the new form of the data remains about as useful as its exact (lossless) representation.
Many probabilistic data structures have been studied, developed and applied in recent years [30]. Examples include the Skip List [33], Sparse Partition [16], Lossy Dictionary [31], and a data structure that was proposed long ago but is again receiving much attention: the Bloom Filter [24, 35, 36, 37].
Given the advantages of the Bloom Filter, such as its speed and its considerable memory savings [24], we chose to study this data structure and present it in this thesis. The basic Bloom Filter is introduced later in this chapter. It is followed by a simple extension that allows {key, value} pairs to be stored, the Logarithmic Frequency Bloom Filter (or log-frequency Bloom Filter) [35], and by a more sophisticated later variant, the Bloom Map [37].
2.2 Hash functions
Hash functions are a very important component of the Bloom Filter. Before looking at the BF data structure in detail in the following sections, this section gives a brief overview of hash functions.
A hash function maps elements of one set to another (usually smaller) set:
$$
h : U \subseteq \{0, 1\}^{w} \rightarrow \{0, 1\}^{b}
$$
Figure 2: Example of a hash function. Character strings are converted into representative signatures.
INPUT          OUTPUT
"Cây"      ->  hash function  ->  DFA2C4ED
"Cây to"   ->  hash function  ->  BE34C87A
"Cây cam"  ->  hash function  ->  7CD4ADE
The elements to be hashed come from a set S of size n, which lies in the original universe U, with $S \subseteq U$ and $U \subseteq \{0, 1\}^{w}$. The representation of an element in the b-bit range is called the signature or fingerprint of the data. The function h must be deterministic: if the same data is passed through h several times, the result is always the same. Conversely, if two elements produce different outputs under h, we can conclude that the two elements are different.
A key $k_i$ is passed to the hash function $h(k_i)$, and the result points to a cell in a table of m cells, $\{0, 1, \ldots, m - 1\}$ (called a hash table), which holds the value $a_i$. A notable property of such a hash table is that lookup time does not depend on the size of the data set encoded in the table. Figure 3 illustrates the structure of a hash table.
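The two properties stated above, determinism and a fixed output range, can be illustrated with Python's standard `hashlib` module. This is only a sketch: the choice of SHA-1, the 32-bit signature width and the function names `signature` and `table_index` are assumptions made for the example, not the hash functions used in the thesis.

```python
import hashlib


def signature(text, bits=32):
    """Map an arbitrary string to a b-bit integer signature."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


def table_index(key, m):
    """Map a key to one of the m cells {0, 1, ..., m - 1} of a hash table."""
    return signature(key) % m


sig_a = signature("Cây cam")
sig_b = signature("Cây cam")     # same input, same signature
cell = table_index("Cây cam", 16)
```

`hashlib` is used here rather than Python's built-in `hash()` because the built-in string hash is randomised between interpreter runs by default, which would break the requirement that the same input always yields the same signature.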
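[Figure 3: Example of a hash table. Collisions in a hash table. Keys from $S \subseteq U$ are hashed to cells of the table: $h_b(k_2)$ and $h_b(k_4)$ point to the cells holding $a_2$ and $a_4$, while $h_b(k_1) = h_b(k_3)$ collide in the same cell.]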
2.3 The basic Bloom Filter
[...] storage, a constant query time that does not depend on the size of the set being represented, and a controllable one-sided error rate.
[Figure 4: Training a Bloom Filter. An element $x_i$ is passed through the hash functions $h_{k1}$, $h_{k2}$ and $h_{k3}$, and the corresponding positions in the m-bit array are set to 1.]
A BF represents a set $S = \{x_1, x_2, \ldots, x_n\}$ of n elements drawn at random from a universe U of size N ($N \gg n$). The bulk of the memory used by a BF is a bit array of size m. Before training, this array contains only zeros. To train the BF we use k independent hash functions, each of which maps an input x to a position in the m-bit array, $h_j(x) \in \{0, 1, \ldots, m - 1\}$ for $j = 1, \ldots, k$. Each element x of the set S is hashed k times with these k hash functions, and the bit at the position given by each hash function is set to 1. Thus for every element $x \in S$, k positions are "switched on"; if one of these bits is already 1, it simply stays that way. In other words, once a bit has been set, it is never changed again during training. Note that the k hash functions of one element may not point to k distinct positions in the bit array; the BF does not avoid such collisions. Because of this mechanism, bits in the array can be shared among elements, which gives the BF its considerable memory savings but also a non-zero false positive probability.
[Figure 5: Querying a Bloom Filter. The query items $x_i$ and $x_j$ are passed through $h_{k1}$, $h_{k2}$ and $h_{k3}$ and the corresponding bits of the m-bit array are checked: $x_i$ is a hit, $x_j$ is a miss.]
To test whether an element belongs to the set encoded in the BF, we pass it through the same k hash functions. If all the bits referenced by the k hash functions are 1, we assume that the element is in the set; if any of them is 0, we know for certain that it is not. Why emphasise "assume" and "for certain"? Because true members of S are always identified correctly, but a false positive occurs when the k corresponding bits were in fact set by other elements of the training set while the element in question is not a member. This is called a one-sided error.
Note that when we use a Bloom Filter we do not actually store the set S but only a representation of it, B, a bit array of size m. The elements of S are therefore effectively lost: we cannot recover the set from the Bloom Filter. However, if we only care about whether an element is in S, the Bloom Filter can answer this while saving a considerable amount of memory and keeping the query time constant.
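The training and query procedures described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in RandLM: the class name `BloomFilter`, its method names, and the trick of obtaining k hash functions by salting SHA-256 with the index j are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """An m-bit array with k hash functions; supports add and membership test."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Simulate k hash functions by salting one hash with the index j.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Training: switch on the k bits; bits that are already 1 stay 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Query: HIT only if all k bits are 1, MISS as soon as one bit is 0.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(m=1024, k=3)
trained = ["hôm nay", "nay trời", "trời nắng"]
for ngram in trained:
    bf.add(ngram)
hits = [ngram in bf for ngram in trained]
```

Every trained n-gram is reported as a HIT, since its k bits were set during training, so there are no false negatives. A query for an n-gram that was never added usually returns a MISS, but it can return a HIT if other elements happened to set all k of its bits.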
Mt u im ng ch na ca Bloom Filter, nh c trnh by sau y, l
mc d khng th trnh khi, nhng li-mt-pha ca Bloom Filter li l yu t m ta c
th kim sot c.
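The training and query procedures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the k independent hash functions are simulated by salting SHA-256 with the index of the hash function, and the class name, the array size and the example n-grams are hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability only

    def _positions(self, item: str):
        # Assumption: the k "independent" hash functions are simulated by
        # salting SHA-256 with the index i of the hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1  # a bit that is already 1 simply stays 1

    def __contains__(self, item: str) -> bool:
        # A single 0 bit proves absence; all 1 bits only suggest membership.
        return all(self.bits[pos] for pos in self._positions(item))


trained = ["the cat", "cat sat", "sat on"]
bf = BloomFilter(m=1024, k=3)
for ngram in trained:
    bf.add(ngram)

print(all(ngram in bf for ngram in trained))  # stored items are always found
print("on the" in bf)  # usually False, but a false positive is possible
```

The sketch keeps one byte per bit for clarity; a practical implementation would pack the bits and use cheaper hash functions.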
Figure 6: One-sided error in a Bloom Filter (an m-bit array probed by h_k1(x_k), h_k2(x_k), h_k3(x_k): the element x_k ∉ S still produces a hit)
If we assume that the hash functions are uniformly distributed and fully random, that is, they select positions in the m-bit array with equal probability, then the probability that a given bit is set to 1 by one hash function is 1/m. The probability that a given bit is not set to 1 by one hash function is:
$$1 - \frac{1}{m}$$
The probability that a bit is still 0 after a single element has been passed through the k hash functions is:
$$\left(1 - \frac{1}{m}\right)^{k}$$
The probability that a bit is still 0 after training is:
$$\left(1 - \frac{1}{m}\right)^{kn}$$
Thus, after training, the probability that a particular bit is 1 is:
$$1 - \left(1 - \frac{1}{m}\right)^{kn}$$
Chart 1: Error rate vs. storage space (n held fixed) [24]
Chart 2: Error rate vs. number of keys (m held fixed) [24]
If an element is not a member of S, the probability that all k of its hash functions nevertheless point to bits equal to 1, causing a false positive, is:
$$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}$$
Using the limit that defines the base of the natural logarithm, this is approximately:
$$\left(1 - \exp\left(-\frac{kn}{m}\right)\right)^{k}$$
To minimise the error probability, we differentiate this expression with respect to k, which gives the optimal number of hash functions:
$$k = \ln 2 \cdot \frac{m}{n}$$
that is, half of the bits in the BF array should be set to 1 after training. This leads to a false positive probability of:
$$\left(\frac{1}{2}\right)^{k} \approx 0.6185^{m/n}$$
Thus, for fixed n, the error probability falls as m grows (more storage is used), and for fixed m, the error rate rises as n grows. Charts 1 and 2 show how the error rate relates to the storage space and to the number of keys (the number of elements in S).
The structure of a Bloom Filter does not allow the stored elements to be enumerated or retrieved directly. Nor can an item be removed from a Bloom Filter without corrupting other elements. Nevertheless, the BF remains one of the most widely used probabilistic data structures in NLP because of its simplicity and efficiency. Applications of the BF include spell checking, database applications [12] and network routing [6].
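The error analysis above can be checked numerically. The following sketch evaluates the exact false positive expression, the optimal number of hash functions and the 0.6185^(m/n) approximation. The values m = 8000 and n = 1000 (8 bits per key) are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Exact form (1 - (1 - 1/m)^(kn))^k from the derivation above."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued optimum k = ln 2 * m / n."""
    return math.log(2) * m / n


m, n = 8000, 1000            # assumed sizes: 8 bits per stored key
k_star = optimal_k(m, n)     # real-valued optimum
k = round(k_star)            # a filter needs a whole number of hash functions
rate = false_positive_rate(m, n, k)
bound = 0.6185 ** (m / n)    # approximation at the optimum

print(f"k* = {k_star:.3f}, k = {k}, rate = {rate:.5f}, 0.6185^(m/n) = {bound:.5f}")
```

Because k has to be rounded to an integer, the rate obtained in practice is close to, but not exactly equal to, the approximation at the real-valued optimum.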
2.4 Language models using Bloom Filters
In 2007, Talbot and Osborne first introduced the use of a lossy data structure such as the BF for language modelling. They also proposed ideas for reducing the false positive rate and for integrating both keys and values in a Bloom Filter, publishing two data structures: the Log-Frequency Bloom Filter [35, 36] and the Bloom Map [37].
2.4.1 Log-frequency Bloom Filter
N-gram statistics can be stored efficiently in a BF because of the Zipf-like distribution (Zipf's law)³ of words in a natural language corpus. According to this law, in a typical corpus the most frequent word occurs about twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word, and so on. In the Brown corpus, for example, "the" is the most frequent word and accounts for 7% of all word occurrences. In line with Zipf's law, the second-ranked word "of" accounts for slightly over 3.5%. Only about 135 words make up half of all word occurrences in the Brown corpus (according to Wikipedia). In other words, a small number of events occur very frequently while most other events are rare.
³ See also Wikipedia: http://en.wikipedia.org/wiki/Zipfs_law
To turn the BF into a data structure that stores key-value pairs, the key is the n-gram and the value is the number of times the n-gram occurs in the corpus. To minimise the number of bits required, an algorithm called log-frequency encoding is used. The count c(x) of each n-gram is first quantised to qc(x) using:
$$qc(x) = 1 + \lfloor \log_b c(x) \rfloor$$
This means that n-gram counts are compressed exponentially by log-frequency encoding. However, because these events differ so widely in frequency, the proportions of the model are preserved almost intact in the BF-LM. The size of the range of qc(x) is determined by the base b in the formula above.
Each n-gram is stored in the BF together with an integer j that runs from 1 to qc(x). At test time, the frequency of an n-gram is retrieved by counting upwards from 1. Each time all k hash functions point to bits equal to 1 in the BF array, the value for the n-gram is incremented by 1. This continues until some hash function points to a 0 bit in the array or the maximum count is reached. The current count minus one is then returned, and this is an upper bound on qc(x) for that n-gram. For most n-grams this loop runs only once or twice thanks to log-frequency encoding. The one-sided error of the Bloom Filter guarantees that the quantised value qc(x) cannot be larger than the value returned. Training and testing are shown in Algorithm 1 and Algorithm 2 [35].
Algorithm 1: BF training algorithm
- Input: S_train, {h_1, ..., h_k} and BF = ∅
- Output: Bloom Filter
for all x in S_train do
    c(x) = frequency of n-gram x in S_train
    qc(x) = quantised value of c(x)
    for j = 1 to qc(x) do
        for i = 1 to k do
            h_i(x) = hash of the event {x, j} under h_i
            BF[h_i(x)] = 1
        end for
    end for
end for
return BF
Algorithm 2: BF test algorithm
- Input: x, MAXQCOUNT, {h_1, ..., h_k} and BF
- Output: upper bound on the value of c(x) in S_train
for j = 1 to MAXQCOUNT do
    for i = 1 to k do
        h_i(x) = hash of the event {x, j} under h_i
        if BF[h_i(x)] = 0 then
            return E[c(x) | qc(x) = j - 1]
        end if
    end for
end for
The number of occurrences of the n-gram is then estimated using:
$$E[c(x) \mid qc(x) = j] \approx \frac{b^{j-1} + b^{j} - 1}{2}$$
A smoothing algorithm is then applied to obtain the actual probability used in computation [36].
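Algorithms 1 and 2 and the count estimate can be put together in a short Python sketch. It is not the RandLM implementation: the hash functions are simulated with salted SHA-256, the quantiser avoids floating-point logarithms by repeated multiplication, and all names, sizes and counts are assumptions made for illustration.

```python
import hashlib


def positions(ngram: str, j: int, k: int, m: int):
    # Assumption: the k hashes of the event {x, j} are simulated with salted SHA-256.
    for i in range(k):
        digest = hashlib.sha256(f"{i}|{j}|{ngram}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % m


def quantise(count: int, base: float) -> int:
    """qc(x) = 1 + floor(log_b c(x)), computed without floating-point logs."""
    qc, threshold = 1, base
    while threshold <= count:
        qc += 1
        threshold *= base
    return qc


def expected_count(j: int, base: float) -> float:
    """E[c(x) | qc(x) = j] ~ (b^(j-1) + b^j - 1) / 2; 0 if no level was found."""
    return 0.0 if j == 0 else (base ** (j - 1) + base ** j - 1) / 2


def train(counts: dict, m: int, k: int, base: float) -> bytearray:
    bits = bytearray(m)
    for ngram, count in counts.items():
        for j in range(1, quantise(count, base) + 1):
            for pos in positions(ngram, j, k, m):
                bits[pos] = 1
    return bits


def query(bits: bytearray, ngram: str, k: int, base: float, max_qcount: int) -> float:
    j = 0  # largest level for which every probed bit was 1
    while j < max_qcount and all(bits[p] for p in positions(ngram, j + 1, k, len(bits))):
        j += 1
    return expected_count(j, base)


counts = {"the cat": 5, "cat sat": 1, "sat on the": 40}  # hypothetical counts
bits = train(counts, m=4096, k=3, base=2.0)
estimate = query(bits, "the cat", k=3, base=2.0, max_qcount=16)
print(estimate)  # never below the estimate for the true quantised level
```

With base 2, a count of 5 falls in the level covering counts 4 to 7, so the estimate for a stored n-gram can be too high because of false positives but cannot drop below the value for its true level.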
2.4.2 Sub-sequence filtering
Standard n-gram language models store the conditional probability of an n-gram in a given context. Most of these models also use some form of interpolation to combine the conditional probability of the n-gram in question with lower-order n-gram probabilities. Depending on the smoothing method, we may also need auxiliary statistics for each n-gram, such as the number of suffixes (for Witten-Bell and Kneser-Ney smoothing) or of context prefixes (for Kneser-Ney smoothing and Stupid Backoff). We can use a single BF to store these statistics, but we must make clear which type each one is (raw frequency, prefix count, suffix count, ...) by using a different set of k hash functions for each type.
There are two reasons for storing these statistics directly in the BF rather than storing precomputed probabilities: (i) the efficiency of the encoding scheme above relies on a Zipf-like frequency distribution; this holds for n-gram statistics in natural language corpora, but may not hold for their estimated probabilities; (ii) by using the corpus statistics directly, we can save storage space and also reduce the error rate by exploiting other intermediate information derived from the corpus.
The error analysis above considered only the false positive rate of the BF. In practice, unlike conventional data structures, the accuracy of a BF model also depends on other factors in the system and on how the model is queried.
We can exploit the monotonicity of the n-gram event space in natural language corpora to set an upper bound on the frequency of every n-gram. This reduces the number of iterations of the outer loop in the test algorithm (Algorithm 2). Specifically, if the lower-order n-grams are stored in the BF, we can say that an n-gram cannot exist if any of its sub-sequences does not exist; this idea is called sub-sequence filtering [35]. Because the frequency storage scheme of the BF never underestimates the frequency of an event, the frequency of an n-gram cannot exceed the frequency of its least frequent sub-sequence:
$$c(w_1, \dots, w_n) \le \min\{c(w_1, \dots, w_{n-1}),\ c(w_2, \dots, w_n)\}$$
This filter reduces the error rate of BF language models with interpolated smoothing by using the smallest value returned by the lower-order models as an upper bound for the higher-order models.
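The bound above can be illustrated with a small Python sketch. The `lookup` callable stands in for a BF-LM count query and is hypothetical; here it is backed by a plain dictionary in which one count has been inflated on purpose to imitate a false positive.

```python
def subsequence_bound(ngram: tuple, lookup) -> int:
    """min{c(w_1..w_{n-1}), c(w_2..w_n)} for an n-gram with n >= 2."""
    return min(lookup(ngram[:-1]), lookup(ngram[1:]))


def filtered_count(ngram: tuple, lookup) -> int:
    """Count of an n-gram, capped by its two (n-1)-gram sub-sequences."""
    if len(ngram) == 1:
        return lookup(ngram)
    bound = subsequence_bound(ngram, lookup)
    if bound == 0:
        return 0  # a missing sub-sequence rules the n-gram out without querying it
    return min(lookup(ngram), bound)


# Hypothetical stored counts; the trigram count is deliberately too high.
stored = {
    ("the",): 10, ("cat",): 4, ("sat",): 3,
    ("the", "cat"): 3, ("cat", "sat"): 2,
    ("the", "cat", "sat"): 9,
}


def lookup(ngram: tuple) -> int:
    return stored.get(ngram, 0)


print(filtered_count(("the", "cat", "sat"), lookup))  # capped by the bigram counts
print(filtered_count(("cat", "dog"), lookup))         # "dog" was never stored
```

In a BF-LM the same bound could also serve as the loop limit in Algorithm 2 in place of MAXQCOUNT, which is where the saving in hash evaluations would come from.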
2.4.3 Bloom Map
The Bloom Map [37] builds on the work of Chazelle on a structure called the Bloomier Filter [10].
The Bloom Map was designed to store {key, value} pairs efficiently in a memory-saving data structure. Unlike the LF-BF, the Bloom Map is not restricted to encoding positive integers only. Suppose we have a key-value map to encode:
$$M = \{(x_1, v(x_1)), \dots, (x_i, v(x_i)), \dots\}, \quad i = 1, \dots, n$$
with keys
$$X = \{x_1, \dots, x_n\}$$
drawn from a universe U of size u (u ≫ n), and values belonging to a fixed set
$$V = \{v_1, v_2, \dots, v_b\}$$
where every element of V must be the value of at least one key in X. We assume that the distribution of values over keys is represented by the following fixed vector:
$$\vec{p} = (p_1, p_2, \dots, p_b)$$
with $\sum_{i=1}^{b} p_i = 1$. This data structure, consisting of a key-value map and a probability distribution vector $\vec{p}$, is called a $\vec{p}$-map.
The first solution proposed for building a randomised $\vec{p}$-map is the simple Bloom Map. It is a refinement of the Bloom Filter: instead of using k independent random hash functions, we use an array of arrays of independent random hash functions.
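$$
[h_{i,j}(x)] =
\begin{array}{llll||l}
 & & & & \text{values} \\
h_{1,1}(x) & h_{1,2}(x) & \cdots & h_{1,k_1}(x) & v_1 \\
h_{2,1}(x) & h_{2,2}(x) & \cdots & h_{2,k_2}(x) & v_2 \\
\vdots & & & & \vdots \\
h_{i,1}(x) & h_{i,2}(x) & \cdots & h_{i,k_i}(x) & v_i \\
\vdots & & & & \vdots \\
h_{b,1}(x) & h_{b,2}(x) & \cdots & h_{b,k_b}(x) & v_b
\end{array}
$$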
In such a matrix, the number of rows is fixed and equals the number of values (i = 1, ..., b). The number of columns differs from row to row, that is, it is a function of the row index: the number of entries in row i is the number k_i of hash functions chosen for value v_i. In other words, for each $i \in [b]$ we choose k_i hash functions $h_{i,j}: U \to [m]$. Suppose we want to store the key-value pair (x, v_i). To do so, we compute $h_{i,j}(x)$ for all $j \in [k_i]$. That is, to store x in row i of the hash matrix, we set each of the k_i corresponding bits in the Bloom Filter to 1:
$$B[h_{i,j}(x)] \leftarrow 1$$
At test time, to check whether an element $x \in U$ is present in B, we test x against every row of the hash matrix $[h_{i,j}(x)]$:
$$h_{i,j}(x) \quad \forall i \in [b],\ j \in [k_i]$$
If x is present in B, the value returned is the one associated with a row i in which every bit addressed by the hash functions of that row is 1. That is:
$$qval(x) = \Big\{\, i \in [b] \ \Big|\ \bigwedge_{j=1}^{k_i} B[h_{i,j}(x)] = 1 \,\Big\}$$
If $qval(x) = \emptyset$, return $\perp$.
If $qval(x) \neq \emptyset$, return the value $v_c$, with $c = \max\{qval(x)\}$.
Building and querying a simple Bloom Map are shown in Algorithm 3 and Algorithm 4.
Error rates:
We have the set of keys that share a given value:
$$X_i = \{x \in X \mid v(x) = v_i\}$$
We are interested in three kinds of error:
- False positive: $x \in U \setminus X$ is assigned a value $v_i \in V$ (a value is found in the Bloom Map for a key that was never inserted into it).
- False negative: a key $x \in X_i$ is wrongly assigned the value $\perp$ (a value inserted into the Bloom Map during training is not found at test time).
- Misassignment: $x \in X_i$ is wrongly assigned a value $v \in V \setminus \{v_i\}$ (a value is returned when key x is queried, but it is not the correct one).
If all the key-value pairs inserted into the Bloom Filter during training belong to M, the value $\perp$ is never returned for them; in other words, the simple Bloom Map has no false negatives. However, by the same argument as for the basic Bloom Filter, false positives always remain possible.
Algorithm 3: Building a simple Bloom Map
- Input: map M, bit array B, set of hash functions H = {h_{i,j} | i ∈ [b], j ∈ [k_i]}
- Output: simple Bloom Map (B, H)
B ← 0
for (x, v_i) ∈ M do
    for j ∈ [k_i] do
        B[h_{i,j}(x)] ← 1
    end for
end for
Algorithm 4: Querying a simple Bloom Map
- Input: simple Bloom Map (B, H), a key x ∈ U
- Output: v ∈ V ∪ {⊥}
qval ← ∅
for i ∈ [b] do
    if ∧_{j=1}^{k_i} B[h_{i,j}(x)] = 1 then
        qval ← qval ∪ {i}
    end if
end for
if qval ≠ ∅ then
    c ← max qval
    return v_c
else
    return ⊥
end if
Moreover, more than one row of the hash matrix may satisfy the conjunction ∧ in the formula, yielding an incorrect index i that also happens to be the largest one, so misassignment can occur as well.
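Algorithms 3 and 4 can be sketched in Python as follows. This is a minimal illustration rather than the implementation from [37]: value indices are 0-based, the hash functions h_{i,j} are simulated with salted SHA-256, every k_i is assumed to be at least 1, and `None` plays the role of ⊥. The class name and the example keys are hypothetical.

```python
import hashlib


class SimpleBloomMap:
    """Simple Bloom Map: row i of the hash matrix has k[i] hash functions."""

    def __init__(self, m: int, hashes_per_value) -> None:
        self.m = m
        self.k = list(hashes_per_value)  # k[i] >= 1 hash functions for value index i
        self.bits = bytearray(m)

    def _positions(self, key: str, i: int):
        # Assumption: h_{i,j}(x) is simulated by salting SHA-256 with (i, j).
        for j in range(self.k[i]):
            digest = hashlib.sha256(f"{i}|{j}|{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key: str, i: int) -> None:
        for pos in self._positions(key, i):
            self.bits[pos] = 1

    def query(self, key: str):
        # qval: all rows whose bits are all 1; the largest index wins.
        qval = [i for i in range(len(self.k))
                if all(self.bits[p] for p in self._positions(key, i))]
        return max(qval) if qval else None  # None stands for the "not found" symbol


bm = SimpleBloomMap(m=2048, hashes_per_value=[2, 3, 4])
bm.insert("the cat", 0)
bm.insert("cat sat", 2)
print(bm.query("cat sat"))  # index of the stored value (2 is the largest row)
print(bm.query("the cat"))  # at least 0; a larger index would be a misassignment
```

Taking the maximum of qval is what makes misassignment possible: a stored key is always found in its own row, but a spurious match in a higher row overrides it.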
Let m be the number of bits in B, $\rho$ the proportion of bits that are still 0 after training, and $\hat{\rho} = E[\rho]$. Talbot and Osborne [37] showed that both $f^{+}(B)$, the false positive rate, and $f^{*}_{i}(B)$, the misassignment rate, are bounded above by:
$$\sum_{i=1}^{b} (1 - \rho)^{k_i}$$
Through an optimisation based on Lagrange multipliers, the following quantities are obtained:
Size of the bit filter:
$$m = n \log e \,(\log 1/\epsilon + H(\vec{p}))$$
Number of hash functions for the i-th value:
$$k_i = \log 1/\epsilon + \log 1/p_i$$
Number of bits per key:
$$\log e \,(\log 1/\epsilon + H(\vec{p}))$$
where $\epsilon > 0$ is a user-specified constant that bounds the false positive rate from above:
$$f^{+}(B) \le \epsilon$$
For the details of how these quantities are derived, the reader is referred to [37].
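To get a feel for these quantities, the sketch below evaluates them for a toy value distribution. The text does not state the base of the logarithms; base 2 is assumed here so that the results are in bits. The distribution, the number of keys and the function names are hypothetical, and in practice each k_i would have to be rounded to a whole number.

```python
import math


def entropy(p) -> float:
    """H(p) in bits (assumption: logarithms are taken to base 2)."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)


def bloom_map_parameters(n: int, p, eps: float) -> dict:
    log_inv_eps = math.log2(1 / eps)
    k = [log_inv_eps + math.log2(1 / p_i) for p_i in p]  # k_i = log 1/eps + log 1/p_i
    bits_per_key = math.log2(math.e) * (log_inv_eps + entropy(p))
    return {"m": n * bits_per_key, "k": k, "bits_per_key": bits_per_key}


# Hypothetical setting: one million keys, three values, eps = 1/256.
params = bloom_map_parameters(n=1_000_000, p=[0.5, 0.25, 0.25], eps=1 / 256)
print(params["k"])
print(round(params["bits_per_key"], 3))
```

Frequent values receive fewer hash functions than rare ones, which is why the average cost per key depends on the entropy of the value distribution rather than on the number of distinct values.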
Fast Bloom Map:
The Bloom Map described above is only the foundation for a fast Bloom Map, which is based on a binary tree and retains the properties of the basic Bloom Map: no false negatives, and controllable false positive and misassignment rates. This data structure comes in two forms, called the standard Bloom Map and the fast Bloom Map. The fast Bloom Map uses slightly more storage but has the advantage of probing fewer bits when a key-value pair is queried. The average number of bits required per key is:
$$\log e \,(\log 1/\epsilon + 2H(\vec{p}) + 2)$$
and the number of bits probed during a query is:
$$3 \quad (\text{when } x \in U \setminus X)$$
$$3\log(b - i + 1) + 2\log 1/p_i + \log 1/\epsilon + 2 \quad (\text{when } x \in X_i)$$
Note: From here on in this thesis, language models that use Bloom Filter based data structures are referred to collectively as BF-LM. A language model built on the Log-Frequency Bloom Filter is abbreviated LF-BF-LM; one built on the Bloom Map is called BloomMap-LM.
Chapter 3
Experiments:
Building LMs with RandLM and SRILM
All experiments reported in this thesis were run on the same machine, configured as follows:
CPU: Intel Core2Duo 1.86GHz x 2
RAM: 2GB DDR2
Operating system: Linux Mint 8 Helena 32-bit
In this chapter we describe experiments in building language models with two tools, RandLM and SRILM [34].
- The BF-LM language models were built with the open-source tool RandLM⁴. This tool was developed for building compact LMs using probabilistic data structures, typically the Bloom Filter. Once compiled, it produces two executables, buildlm and querylm. buildlm is used to build language models, while querylm is used to query an LM, returning n-gram statistics or conditional log probabilities.
- The standard, lossless language models were built with the SRI Language Modelling Toolkit (SRILM)⁵. SRILM is an open-source project comprising many programs, C++ libraries and supporting scripts for building and testing language models for speech recognition and other applications. It supports many kinds of language model based on n-gram statistics. SRILM has been under development since 1995 at the SRI Speech Technology and Research Laboratory, and is still being maintained and extended by many researchers in the NLP community.
⁴ The RandLM source code can be downloaded free of charge from: http://sourceforge.net/projects/randlm/
⁵ The SRILM documentation and source code can be downloaded free of charge from: http://www.speech.sri.com/projects/srilm/
3.1 Corpora
These experiments use two corpora: a large monolingual English corpus and a small Vietnamese corpus. The English corpus is the main one; the LMs built from it are used in the statistical machine translation experiments with Moses in the next chapter.
English corpus:
The monolingual English corpus is compiled from many different sources and was used at the fifth workshop at ACL 2010⁶. It contains 48.6 million sentences and takes up 5.7 GB uncompressed; the sentences in the corpus have been randomly shuffled. More detailed statistics are given in Table 1.
Table 1: Statistics of the News Shuffle corpus
Size                       5.7 GB
Gzip                       2.4 GB
Number of sentences        48,653,883
Number of words            1,133,192,971
Average sentence length    23
This is a large corpus. Because of limits on time and on the resources of the test machine, only part of it was extracted and used for the experiments in this thesis (the largest subset used is 1 GB of text). The whole corpus was lowercased with the script lowercase.perl (part of a set of supporting scripts described in more detail in the next chapter).
scripts/lowercase.perl < working-dir/corpus/news_shuffled > working-dir/corpus/news_lowercased
The corpus subsets extracted and preprocessed in this way were used to train 3-gram and 4-gram language models with the SRILM and RandLM toolkits. Their sizes are shown in Table 2.
Table 2: Statistics of the English corpus subsets used to build LMs (Set 1 to Set 4)
                                   Set 1         Set 2         Set 3          Set 4
Number of sentences                2,137,817     4,275,634     6,413,452      8,551,269
Number of words                    49,817,252    99,590,072    149,379,391    199,163,279
Average sentence length (words)    23            23            23             23
Vocabulary size                    409,785       584,968       717,013        824,974
Size                               0.25 GB       0.50 GB       0.75 GB        1.00 GB
Gzip                               103.4 MB      206.8 MB      310.1 MB       413.5 MB
⁶ http://statmt.org/wmt10/translation-task.html/
Vietnamese corpus:
A small monolingual Vietnamese corpus was also used, in order to corroborate the results by testing on more than one corpus.
This corpus was built from articles in the online edition of the Lao Động newspaper⁷, covering many fields such as science, economics, sport and culture [1]. Its statistics are listed in the table below:
Table 3: Statistics of the Vietnamese corpus from the online Lao Động newspaper
General statistics
Size                       78.4 MB
Gzip                       25.9 MB
Number of sentences        557,736
Number of words            12,516,790
Average sentence length    22.4
N-gram statistics
1-gram                     257,446
2-gram                     2,639,657
3-gram                     1,428,292
4-gram                     1,198,019
⁷ Lao Động online newspaper: http://www.laodong.com.vn/
3.2 Smoothing algorithms
The lossless language models were trained with modified Kneser-Ney (MKN) smoothing. Since these experiments are concerned only with the characteristics and performance of the data structures (lossless and lossy) and do not require the exact probability of each event, Google's Stupid Backoff algorithm was used when building the BF-LMs because it is fast and simple. Moreover, the quality of Stupid Backoff smoothing has been shown to approach that of MKN on large corpora [5].
3.3 Building LMs with SRILM and RandLM
English corpus:
The SRILM language models were built from the corpus subsets above with the following command:
./ngram-count -order 3 -interpolate -kndiscount \
    -text /corpus/news.lowercased_1GB.gz \
    -lm model_sri_1.00GB_3-grams.txt
This command produces a 3-gram language model in the file model_sri_1.00GB_3-grams.txt from the corpus file news.lowercased_1GB.gz (SRILM accepts compressed files as input and output). SRILM can also be asked to build higher-order models such as 4-gram, 5-gram or beyond, but memory use rises very quickly as the order increases. SRILM keeps the intermediate n-gram counts in RAM. On the test machine described above (2 GB of RAM) we were able to build a 3-gram language model from the 1 GB subset, but not a 4-gram model from the same data, because RAM ran out during n-gram counting. The -kndiscount option tells SRILM to use modified Kneser-Ney smoothing. Other smoothing algorithms available in SRILM include Good-Turing and Witten-Bell.
Building a model took from about 10 minutes for the 0.25 GB subset (Set 1) to several hours for the 1 GB subset (Set 4). Once the model is built, the number of n-grams of each order can be read from the header of the language model file just created (head -n 5 model_sri_1.00GB_3-grams.txt).
Table 4: Number of n-grams in each corpus subset
         1-gram     2-gram        3-gram
Set 1    409,806    6,322,122     4,648,704
Set 2    585,002    9,720,557     9,294,600
Set 3    717,048    12,354,288    13,813,750
Set 4    825,026    14,549,604    18,171,077
With RandLM, a language model can be built in three ways: i) from a pre-tokenised corpus; ii) from a set of n-grams paired with their counts (n-gram/count pairs); iii) from a previously built backoff language model in ARPA format (the format of language models produced by SRILM). A model built in the first or second way is called a CountRandLM; one built in the third way is called a BackoffRandLM. A CountRandLM can use either StupidBackoff or Witten-Bell smoothing. A BackoffRandLM can use any smoothing method supported by SRILM. For example, we build a 3-gram BloomMap-LM from the 1 GB subset with the following command:
./buildlm -struct BloomMap -falsepos 8 -values 8 \
    -output-prefix randlm_3-grams_1.00GB \
    -input-path /corpus/news.lowercased_1GB.gz \
    -order 3 -working-mem 1500
The -falsepos option sets the false positive rate of the data structure; for example, -falsepos 8 gives an error rate of 1/2^8.
The -values option sets the quantisation range of the model; the base of the logarithm used in quantisation is 2^(1/values), so -values 8 gives a base of 2^(1/8).
The -order option sets the order of the n-gram model.
The -input-path option gives the path to the corpus used to build the LM.
The -struct option deserves special mention: it selects the data structure used to build the language model. RandLM currently supports two data structures, the Log-Frequency Bloom Filter (-struct LogFreqBloomFilter) and the Bloom Map (-struct BloomMap). Using RandLM, we build language models with both data structures in order to compare their size and effectiveness.
The option -working-mem 1500 allows 1500 MB to be used while sorting the n-grams.
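The effect of the -falsepos and -values settings can be made concrete with a little arithmetic. The sketch below only restates the relations given above (error rate 1/2^falsepos, logarithm base 2^(1/values)) and combines the base with the quantisation formula qc(x) = 1 + ⌊log_b c(x)⌋ from Section 2.4.1; whether RandLM applies exactly that formula internally is an assumption, and the function names are hypothetical.

```python
import math


def falsepos_rate(falsepos: int) -> float:
    """Error rate 1 / 2^falsepos implied by the -falsepos setting."""
    return 2.0 ** -falsepos


def quantisation_base(values: int) -> float:
    """Logarithm base 2^(1/values) implied by the -values setting."""
    return 2.0 ** (1.0 / values)


def quantised_level(count: int, values: int) -> int:
    # Assumption: qc = 1 + floor(log_b count) with b = 2^(1/values),
    # which equals 1 + floor(values * log2(count)).
    return 1 + math.floor(values * math.log2(count))


print(falsepos_rate(8))                 # 1/256
print(round(quantisation_base(8), 4))   # a base only slightly above 1
print(quantised_level(1_000_000, 8))    # levels needed for a count of one million
print(quantised_level(1_000_000, 10))   # a finer base needs more levels
```

A larger -values setting gives a base closer to 1, so counts are quantised more finely at the cost of more levels per n-gram.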
The files produced by the command above are:
- randlm_3-grams_1.00GB.BloomMap            Language model
- randlm_3-grams_1.00GB.counts.sorted.gz    N-gram statistics
- randlm_3-grams_1.00GB.stats.gz            Count statistics
- randlm_3-grams_1.00GB.vcb.gz              Vocabulary file
Both the .stats.gz and .counts.sorted.gz files can be passed back in for reuse, which avoids recomputing them when further LMs are built from the same corpus. This is very useful, because experiments often require building an LM several times with different parameter values. For example:
./buildlm -struct BloomMap -falsepos 8 -values 10 -order 3 \
    -output-prefix randlm_3-grams_1.00GB_values10 \
    -input-path randlm_3-grams_1.00GB.counts.sorted.gz \
    -input-type counts \
    -stats-path randlm_3-grams_1.00GB.stats.gz \
    -working-mem 1500
builds a new BloomMap-LM from the same corpus as before but with a different quantisation setting (-values 10), without recomputing the statistics files.
Building BF-LMs with RandLM takes longer than building standard language models of the same order from the same data with SRILM; RandLM needed roughly 20 hours to finish a 3-gram language model from the 1 GB subset (Set 4). RandLM keeps everything on disk, so counting and sorting also take longer. In return, RandLM can build higher-order language models from more data than SRILM. For example, on the test machine RandLM successfully built a 4-gram language model from 1 GB of training data, whereas SRILM could not. Although RandLM trains more slowly than SRILM, this is not a serious problem, because a language model only needs to be built once. Moreover, the Bloom Filter language models built by RandLM take up far less memory than the standard models from SRILM. Table 5 lists the sizes of the 3-gram language models produced by the two tools (uncompressed) for corpus subsets of different sizes.
Table 5: Sizes of the different types of LM on each corpus subset
                Set 1       Set 2       Set 3       Set 4
BloomMap        52.5 MB     86.6 MB     114.4 MB    138.8 MB
Log-Freq BF     63.4 MB     102.1 MB    136.8 MB    181.5 MB
SRILM           290.7 MB    511.6 MB    710.4 MB    893.4 MB
The table shows that the language models produced by RandLM are only about 1/6 the size of the standard language models produced by SRILM when the Bloom Map is used, and about 1/5 when the Log-Frequency Bloom Filter is used. With the same RandLM build parameters, LMs using the LF-BF structure are larger than those using the Bloom Map (by about 20 to 30%).
Chart 3: Sizes of the LMs produced by RandLM and SRILM (vertical axis: size in MB, 0 to 900; horizontal axis: Set 1 to Set 4; series: LF-BF-LM, BloomMap-LM, SRI-LM)
Chart 3 shows that the more training data is used, the more storage is saved; that is, the ratio between the size of the standard LM and the size of the model built by RandLM from the same training corpus keeps growing. How much this saving in size reduces the effectiveness of the LM compared with the standard LM, however, can only be answered by experiment.
Vietnamese corpus:
The results of building LMs of order 2, 3 and 4 from the Vietnamese corpus are shown in the chart below:
Chart 4: Sizes of the Vietnamese LMs (vertical axis: size in MB, 0 to 180; horizontal axis: 2-gram, 3-gram, 4-gram; series: LF-BF-LM, BloomMap-LM, SRI-LM)
These LMs were built with different n-gram orders, from 2-gram to 4-gram. The results in the chart once again show a large difference in size between the standard SRILM language models and the RandLM ones.
Chapter 4
Experiments:
Statistical machine translation with Moses
The models built above are used for statistical machine translation with the open-source Moses system [21]. The translation output is then evaluated with the BLEU score. This lets us compare the effectiveness of language models based on the Bloom Filter with that of traditional standard language models.
4.1 Statistical machine translation
4.1.1 Introduction to machine translation and statistical machine translation
N-gram language models are widely used in natural language processing tasks such as speech processing, spell checking, handwriting recognition and machine translation. This section illustrates one concrete application of LMs by outlining how an LM is used in a phrase-based statistical machine translation system [22].
Machine translation (MT) is a field with a long history, dating from the 1950s, and it has developed strongly from the 1980s to the present. There are now many well-known commercial MT systems around the world, such as Systran and Kant, as well as open systems, notably Google's, which support dozens of common language pairs such as English-French, English-Chinese, English-Japanese and Chinese-Japanese. Approaches to MT fall into four main classes: direct translation, transfer-based translation, interlingua translation and statistical MT. Transfer-based and interlingua translation rely mainly on syntax; they have a fairly long history of development and are still widely used in many commercial systems. Systems of this kind achieve quite good results for language pairs with similar syntax, such as English-French and English-Spanish, but are still limited for language pairs with different syntax, such as English-Chinese and English-Japanese.
In Vietnam, English-Vietnamese and Vietnamese-English translation face similar difficulties because of differences in grammatical structure and semantic ambiguity. The first commercial transfer-based English-Vietnamese translation system in Vietnam was EVTran [23]. Much research aimed at improving translation quality is still under way to adapt to the characteristics of different language pairs.
Statistical machine translation (SMT) has proved to be a very promising approach, with clear advantages over traditional syntax-based methods in many machine translation experiments. Instead of building dictionaries and transfer rules by hand, such a system builds its dictionaries and rules automatically from statistics gathered from data. For this reason, statistical MT is highly portable and can be applied to any language pair. The SMT approach was first proposed by Brown in 1990 using the noisy channel model and has dominated MT in recent years.
In direct translation, each word is translated from the source language into the target language. In transfer-based translation, we first parse the input sentence, then apply transfer rules to convert its structure in the source language into a structure in the target language, and finally produce the complete translated sentence. In interlingua translation, the input sentence is analysed into an abstract semantic representation called an interlingua, and we then try to construct the target sentence that best matches this interlingua. Statistical machine translation takes a completely different approach: its ability to translate comes from statistical models trained on bilingual corpora. The general architecture of an SMT system is shown in Figure 8.
Brown's model (also known as the IBM model) [7] represents translation as a noisy channel model with three components: a translation model, which relates corresponding words and phrases across languages; a language model (LM), which represents the target language; and a decoder, which combines the translation model and the language model to carry out the translation.
The LM is usually given a higher weight than the other components of the translation system, because the monolingual corpus used to train the LM is much larger than the bilingual corpus and is therefore more reliable. Och [28] showed that increasing the size of the LM improves the BLEU score, the standard measure of machine translation quality. Figure 7, taken from [19], shows how translation quality improves as the LM grows.
Figure 7: Increasing the size of the LM improves the BLEU score [19]
In Brown's first model, the translation model is word-to-word [8] and only allows one word in the source language to be mapped to one word in the target language. In practice, however, this mapping can be one-to-one, one-to-many, many-to-many or one-to-none. Many researchers have therefore improved the quality of SMT by using phrase-based translation [22][26].
Figure 8: Architecture of an SMT system [20] (components shown: source language (f), preprocessing, decoder, postprocessing, target language (e), language model Pr(e), translation model Pr(f | e))
4.1.2 Phrase-based statistical machine translation
Figure 9: Illustration of phrase-based statistical machine translation (example: the English sentence "That songwriter wrote many romantic songs" aligned phrase by phrase with its Vietnamese translation)
In phrase-based translation, a sequence of consecutive words (a phrase) is translated into the target language, and the source and target phrases may differ in length. Figure 9 illustrates the method: the input sentence is split into a number of phrases; each phrase is translated into the target language; the phrases are then reordered in some way and joined together. The result is the translated sentence in the target language.
Let the source language be f and the target language be e. We try to maximise the probability $\Pr(e \mid f)$ in the hope of obtaining the best translation. In practice many correct translations exist for the same sentence; our aim is to find the target sentence e that best fits the given source sentence f. Phrase-based translation uses the noisy channel model; applying Bayes' rule gives:
$$\arg\max_{e} \Pr(e \mid f) = \arg\max_{e} \frac{\Pr(f \mid e)\,\Pr(e)}{\Pr(f)}$$
Since Pr(f) does not depend on e, the problem becomes that of finding the sentence e that maximises $\Pr(f \mid e)\,\Pr(e)$. Building the language model requires a large monolingual corpus, while the translation model requires a good bilingual corpus. The decoder splits the source sentence into phrases and generates the possible translations of each phrase with the help of the phrase table.
To produce a translation, the source sentence is split into I consecutive phrases $\bar{f}_1^I$. We assume that the probability distribution over these segmentations is uniform. Each phrase $\bar{f}_i$ in $\bar{f}_1^I$ is translated into a corresponding target-language phrase $\bar{e}_i$. The target-language phrases may be reordered. Phrase translation is modelled by the probability distribution $\phi(\bar{f}_i \mid \bar{e}_i)$.
Reordering of the output phrases is modelled by the probability distribution $d(a_i - b_{i-1})$, where $a_i$ is the start position of the source phrase translated into the i-th target phrase, and $b_{i-1}$ is the end position of the source phrase translated into the (i-1)-th target phrase. Here we use a very simple reordering model:
$$d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|}$$
with an appropriate value for the parameter $\alpha$.
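The phrase translation probabilities and the reordering penalty just described can be combined per phrase, as the following Python sketch suggests. It is a toy illustration, not the Moses scoring code: the probabilities, the source spans and the value of α are invented, source positions are 1-based, and b_0 = 0 is assumed so that starting at the first source word costs nothing.

```python
import math


def distortion(a_i: int, b_prev: int, alpha: float) -> float:
    """d(a_i - b_{i-1}) = alpha^{|a_i - b_{i-1} - 1|}."""
    return alpha ** abs(a_i - b_prev - 1)


def phrase_log_score(phrases, alpha: float) -> float:
    """Sum of log(phi * d) over phrases listed in target order.

    Each phrase is (phi, a_i, b_i): its translation probability and the
    start and end positions of its source phrase (1-based, inclusive).
    """
    total, b_prev = 0.0, 0  # assumption: b_0 = 0
    for phi, a_i, b_i in phrases:
        total += math.log(phi) + math.log(distortion(a_i, b_prev, alpha))
        b_prev = b_i
    return total


# Hypothetical segmentation of a six-word source sentence into three phrases.
monotone = [(0.5, 1, 2), (0.4, 3, 3), (0.6, 4, 6)]
reordered = [(0.5, 1, 2), (0.6, 4, 6), (0.4, 3, 3)]  # last two phrases swapped

print(phrase_log_score(monotone, alpha=0.5))
print(phrase_log_score(reordered, alpha=0.5))
```

With the same phrase probabilities, the monotone order scores higher because every jump in the source position is penalised by a factor of α per skipped word.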
To calibrate the length of the translation, we introduce a factor $\omega$ when generating the sentence in the target language. This factor is tuned during the search for the best translation. The further it exceeds 1, the longer the target-language sentence becomes.
In summary, the best translation $e_{best}$ generated from the source sentence under the model in [22] is:
$$e_{best} = \arg\max_{e} \Pr(e \mid f) = \arg\max_{e} \Pr(f \mid e)\,\Pr\nolimits_{LM}(e)\,\omega^{length(e)}$$
where $\Pr(f \mid e)$ is decomposed into:
$$\Pr(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(a_i - b_{i-1})$$
4.1.3 BLEU score
The quality of translation systems can be assessed manually by people or automatically. Manual evaluation assigns scores to translated sentences based on their fluency and accuracy. Most people regard this as the most accurate method. However, manual evaluation takes a great deal of time, especially when many language models and many systems have to be compared. In fairness, each method has its own strengths and weaknesses. Although automatic evaluation cannot capture every aspect of translation quality, it can quickly tell us roughly how good a system is and whether its quality has improved after an enhancement or a parameter change. In practice both methods are used side by side, and BLEU, proposed by Papineni in 2002 [32], is currently the most widely used measure of translation quality.
BLEU scores a translation by comparing the system output with reference translations. Although [9] showed that BLEU often does not correlate well with human judgements across different types of system, it can still be fairly accurate when evaluating a single system or similar systems. For this reason, BLEU is used in this thesis as the measure of translation quality, and hence as the basis for comparing the different types of language model.
4.2 Baseline system
We built the translation system using GIZA++ 2.0⁸ [29], SRILM [34] and Minimum Error Rate Training (MERT) [27] to align words, build the language model and optimise the weights used in translation, respectively. The language model used in training is a 3-gram model with modified Kneser-Ney smoothing. MERT was run on the development set used at WMT 2008, which consists of 2000 German-English sentence pairs (statistics in Table 7). The phrase table produced by training is 800.8 MB in size; a lexical reordering table [15][38] of 186.5 MB was also produced.
While building and testing this translation system we used a number of supporting scripts⁹:
- The tokeniser tokenizer.perl
- The script lowercase.perl, which converts all text to lowercase
- The SGML wrapper wrap-xml.perl, which packages data in the XML format of the NIST BLEU scoring system
- The NIST MTeval script, version 11b, mteval-v11b.pl, used to compute the BLEU score
⁸ http://code.google.com/p/giza-pp/
⁹ Downloaded from http://www.statmt.org/wmt08/scripts.tgz
Script MTeval: ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
4.3 Corpora
The translation system was trained on version 3 of the German-English Europarl parallel corpus [18], which contains 1.2 million sentences in each language. After removing sentences longer than 40 words from both languages, however, only about 1 million sentence pairs remain (268,000 pairs are lost). This filtering is needed because training with GIZA++ takes a very long time when there are many long sentences. Full statistics for the filtered corpus are given in Table 7.
The English language model used in training the translation system was built from the monolingual English Europarl corpus (see Table 6 for detailed statistics).
Table 6: Detailed statistics of the monolingual English Europarl corpus used to build the LM for training the translation system
Size                       200.6 MB
Gzip                       62.7 MB
Number of sentences        1,412,546
Number of words            38,280,717
Average sentence length    27
Vocabulary size (words)    100,795
Table 7: Statistics of the German-English parallel corpus used to train, tune and evaluate¹⁰ the translation system
                                                 German        English
Training       Number of sentences                     997,575
               Number of words                   20,341,901    21,432,529
               Average sentence length (words)   20.4          21.5
               Vocabulary size (words)           226,387       74,581
Development    Number of sentences                     2,000
               Number of words                   55,118        58,761
               Average sentence length (words)   27.6          29.4
               Vocabulary size (words)           8,796         6,118
Evaluation     Number of sentences                     2,000
               Number of words                   54,232        58,055
               Average sentence length (words)   27.1          29.0
               Vocabulary size (words)           8,669         6,058
¹⁰ For evaluation, the data consists of the German source text and the English reference translation.
4.4 Experimental results
The translation system was tested with the SRILM and RandLM language models by
translating 2000 German sentences. Translating these 2000 sentences took 98 minutes
with the SRILM language model, 124 minutes with BloomMap-LM and 117 minutes with
LF-BF-LM. Translation with the BF-LMs is therefore roughly 1.2 to 1.3 times slower than
with the standard language model (124/98 is about 1.27 and 117/98 is about 1.19). This
slowdown is acceptable when weighed against the memory saved by using LMs based on the
Bloom Filter.
Table 8: Time needed to translate 2000 German sentences with different types of LM

LM type         Translation time (minutes)
SRI-LM          98
BloomMap-LM     124
LF-BF-LM        117
To evaluate the translation output, we use the BLEU score. After translation, the
output is therefore wrapped in the XML format of the NIST MT scoring system; Figure 9
shows an example of this format. The MTeval script takes three inputs to evaluate a
translation: a file containing the source-language text, a file containing the
system's translation in the target language, and a file containing a reference
translation.

<tstset setid="wmt08-de-en-nc-test" srclang="German" trglang="English">
<DOC docid="Speigel-doc1" sysid="UMD_de_en_primary">
<seg id="1"> TRANSLATED ENGLISH TEXT </seg>
<seg id="2"> TRANSLATED ENGLISH TEXT </seg>
...
</DOC>
<DOC docid="Speigel-doc2" sysid="UMD_de_en_primary">
<seg id="13"> TRANSLATED ENGLISH TEXT </seg>
<seg id="14"> TRANSLATED ENGLISH TEXT </seg>
...
</DOC>
</tstset>
Figure 9: XML format of NIST MT

The BLEU scores of the translations obtained with the different LMs are shown in
Table 9. All of these language models were built from the Set 4 corpus, which contains
1 GB of English text. When 3-gram models are used throughout, the system using the
SRI-LM language model scores higher than the systems using the BF-LMs. The gap is
small, however: SRI-LM scores 3.5% higher than BloomMap-LM and 4% higher than LF-BF-LM,
so for the same n-gram order the scores can be regarded as comparable. Moreover, as
noted earlier, on the machine used for these experiments a 4-gram language model can
only be built as a BF-LM. Using this 4-gram BF-LM (based on the Bloom Map data
structure) in the translation system gives a score of 19.93, clearly higher than the
18.25 obtained with the SRI-LM language model.
Table 9: BLEU scores of the translations obtained with different LMs

LM                     LM size      BLEU score
SRI-LM 3-gram          893.4 MB     18.25
BloomMap-LM 3-gram     138.8 MB     17.63
LF-BF-LM 3-gram        181.5 MB     17.55
BloomMap-LM 4-gram     302.2 MB     19.93
We know that LF-BF-LMs are clearly larger than BloomMap-LMs (181.5 MB versus 138.8 MB
for the 3-gram models in Table 9). In actual translation, however, the BLEU score of
the system using LF-BF-LM is not higher than that of the system using BloomMap-LM at
the same n-gram order; the BloomMap-LM score is in fact slightly better. Furthermore,
the translation times with the two model types differ only slightly (117 versus 124
minutes). These results show the advantage of the Bloom Map data structure over the
Log-Frequency Bloom Filter: it uses less memory and gives slightly better translation
quality at a comparable translation time.
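The comparisons in this section reduce to two small calculations, sketched below with the figures from Tables 8 and 9. The helper names `ratio` and `relativeDiffPercent` are hypothetical. The 3.5% and 4% gaps quoted above are consistent with taking the BF-LM score as the reference value; that choice of reference is an assumption, since the text does not state it.

```cpp
#include <cassert>

// How many times larger `value` is than `baseline`.
double ratio(double value, double baseline) {
  assert(baseline != 0.0);
  return value / baseline;
}

// Difference between `value` and `reference`, as a percentage of `reference`.
double relativeDiffPercent(double value, double reference) {
  assert(reference != 0.0);
  return (value - reference) / reference * 100.0;
}
```

For example, `ratio(124.0, 98.0)` is about 1.27 and `ratio(117.0, 98.0)` about 1.19 for the translation times, `relativeDiffPercent(18.25, 17.63)` is about 3.5 and `relativeDiffPercent(18.25, 17.55)` about 4.0 for the 3-gram BLEU scores, and `ratio(893.4, 138.8)` shows that the 3-gram SRI-LM is about 6.4 times larger than the 3-gram BloomMap-LM.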
Conclusion
In the chapters of this thesis, we have presented the theory of, and experiments
with, language models built on two Bloom Filter data structures: the Log-Frequency
Bloom Filter and the Bloom Map. The outstanding advantage of these data structures is
that they save a considerable amount of memory by sharing bits in storage. This comes
at the price of a non-zero error probability, but that probability is a factor that
can be controlled. The experimental results show that Bloom Filter language models
perform approximately as well as standard lossless LMs, although their query speed is
lower. More importantly, they make it possible to build higher-order LMs from larger
corpora, and so meet the requirement of saving resources while still exploiting the
knowledge contained in large corpora.
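The trade-off between shared bits and a non-zero error probability can be seen in a minimal sketch of a plain Bloom Filter. It is neither the Log-Frequency Bloom Filter nor the Bloom Map, and it is not RandLM code: the class name `BloomFilter`, its methods, and the double-hashing scheme built on `std::hash` are all illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// A plain Bloom Filter over strings (illustrative sketch only).
class BloomFilter {
 public:
  BloomFilter(std::size_t num_bits, std::size_t num_hashes)
      : bits_(num_bits, false), num_hashes_(num_hashes) {
    assert(num_bits > 0 && num_hashes > 0);
  }

  // Set the num_hashes_ bits selected by the item; bits are shared by all items.
  void insert(const std::string& item) {
    for (std::size_t i = 0; i < num_hashes_; ++i) bits_[index(item, i)] = true;
  }

  // Returns false only if the item was never inserted. A true result may be
  // a false positive caused by bits that other items have set.
  bool mayContain(const std::string& item) const {
    for (std::size_t i = 0; i < num_hashes_; ++i) {
      if (!bits_[index(item, i)]) return false;
    }
    return true;
  }

 private:
  // Assumed scheme: derive the i-th position from two base hashes.
  std::size_t index(const std::string& item, std::size_t i) const {
    const std::size_t h1 = std::hash<std::string>()(item);
    const std::size_t h2 = std::hash<std::string>()(item + "#salt") | 1;
    return (h1 + i * h2) % bits_.size();
  }

  std::vector<bool> bits_;
  std::size_t num_hashes_;
};
```

Inserted n-grams are always reported as present, so the only possible error is a false positive. Its probability falls as `num_bits` grows relative to the number of inserted items, which is the sense in which the error rate is controllable.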
In the future, I would like to continue studying language models based on
probabilistic data structures (PDS) and to apply them to building English-Vietnamese
and Vietnamese-English statistical machine translation systems with large corpora. The
application of these models to other problems also remains an open field of research.
APPENDIX
RandLM query program
1. GenStats.h
#ifndef GENSTATS_H
#define GENSTATS_H

#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

#include "RandLMParams.h"
#include "RandLMTool.h"
#include "RandLM.h"

namespace randlm {

  class GenStats {
  public:
    // Constructor: remembers argv and loads the LM whose path is argv[1]
    GenStats(int argc, char ** argv) {
      assert(argc > 1);
      inParam = argv;
      randlm_ = NULL;
      count_randlm_ = NULL;
      test_data_ = NULL;
      vocab = NULL;
      order_ = 0;
      corpus_data_ = false;
      getcounts_ = false;
      outputFile = "";
      // call load() outside assert() so that it still runs when NDEBUG is defined
      const bool loaded = load();
      assert(loaded);
      (void) loaded;
    }

    // Destructor (count_randlm_ is only a cast of randlm_, so it is not deleted)
    ~GenStats() {
      delete randlm_;
      delete test_data_;
    }

    // Split a string into tokens at any of the delimiter characters
    static void Tokenize(const std::string& str,
                         std::vector<std::string>& tokens,
                         const std::string& delimiters = ":") {
      // Skip delimiters at beginning.
      std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
      // Find the first delimiter after the token start.
      std::string::size_type pos = str.find_first_of(delimiters, lastPos);
      while (std::string::npos != pos || std::string::npos != lastPos) {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters. Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the delimiter that ends the next token.
        pos = str.find_first_of(delimiters, lastPos);
      }
    }

    // reads ngrams from file and writes their scores to stdout and the output file
    bool query();
    // check and format user's input
    std::vector<std::string> formatTestInfo(std::string info);
    // set test info
    bool setTestInfo(std::vector<std::string> testInfo);

  private:
    // load RandLM file into memory
    bool load();

    RandLM* randlm_;              // RandLM file
    CountRandLM* count_randlm_;   // use this if return only counts
    TestCorpus* test_data_;       // Test data
    Vocab* vocab;                 // Vocabulary info
    int order_;                   // order of LM
    bool corpus_data_;            // input file is corpus or ngrams ?
    bool getcounts_;              // if true, return only counts
    char ** inParam;              // argv
    std::string outputFile;       // output file
  };

}
#endif // GENSTATS_H
2. GenStats.cpp
#include <algorithm>   // std::max
#include <cassert>
#include <ctime>
#include <fstream>
#include <iostream>
#include <stdint.h>    // uint64_t
using namespace std;
#include "GenStats.h"

namespace randlm {

  // Query
  bool GenStats::query() {
    assert(test_data_ != NULL);
    WordID sentence[Corpus::kMaxSentenceWords];
    const clock_t begin = clock();   // CPU time at the start of the query
    int len = 0;
    int found = 0;
    uint64_t counter = 0;            // number of n-grams queried
    long sentenceNo = 0;
    bool out = false;
    ofstream output;

    // open output file for writing
    if (outputFile != "") {
      out = true;
      output.open(outputFile.c_str());
    }

    // query as sentences
    if (corpus_data_) {
      while (test_data_->nextSentence(&sentence[0], &len)) {
        // echo the sentence
        cout << "SENTENCE No." << sentenceNo + 1 << ": ";
        if (out)
          output << "SENTENCE No." << sentenceNo + 1 << ": ";
        for (int i = 0; i < len; i++) {
          cout << vocab->getWord(sentence[i]) << " ";
          if (out) output << vocab->getWord(sentence[i]) << " ";
        }
        cout << endl;
        if (out) output << endl;
        if (len < 3) // <s> </s> + at least one word
          continue;

        // slide a window of order_ words over the sentence
        for (int i = 1; i < len; ++i) {
          int start = std::max(0, i - order_ + 1);
          // skip windows that are shorter than the LM order
          if ((i - start + 1) < order_) continue;
          // echo the n-gram
          for (int j = 0; j < i - start + 1; j++) {
            cout << vocab->getWord(sentence[start + j]) << " ";
            if (out) output << vocab->getWord(sentence[start + j]) << " ";
          }
          if (getcounts_) { // return counts
            cout << ": "
                 << count_randlm_->getCount(&sentence[start], i - start + 1)
                 << std::endl;
            if (out)
              output << ": "
                     << count_randlm_->getCount(&sentence[start], i - start + 1)
                     << std::endl;
          } else { // return probs
            cout << ": "
                 << randlm_->getProb(&sentence[start], i - start + 1, &found)
                 << std::endl;
            if (out)
              output << ": "
                     << randlm_->getProb(&sentence[start], i - start + 1, &found)
                     << std::endl;
          }
          ++counter;
        }
        cout << endl;
        if (out) output << endl;
        sentenceNo++;
      }
    // query as ngrams
    } else {
      while (test_data_->nextSentence(&sentence[0], &len)) {
        assert(len <= order_);
        for (int j = 0; j < len; j++) {
          cout << vocab->getWord(sentence[j]) << " ";
          // write to the file only if one was requested
          if (out) output << vocab->getWord(sentence[j]) << " ";
        }
        cout << ": ";
        if (out) output << ": ";
        if (getcounts_) { // return counts
          cout << count_randlm_->getCount(&sentence[0], len) << std::endl;
          if (out) output << count_randlm_->getCount(&sentence[0], len)
                          << std::endl;
        } else { // return probs
          cout << randlm_->getProb(&sentence[0], len, &found) << std::endl;
          if (out) output << randlm_->getProb(&sentence[0], len, &found)
                          << std::endl;
        }
        ++counter;
      }
    }
    output.close();
    std::cerr << "Time elapsed: "
              << double(clock() - begin) / CLOCKS_PER_SEC << std::endl;
    return true;
  }// end query

  // Load LM
  bool GenStats::load() {
    assert(randlm_ == NULL);
    // read and trim path
    string lmpath = inParam[1];
    RandLMUtils::trim(lmpath);
    // load
    RandLMFile fin(lmpath, std::ios::in);
    RandLMInfo* info = new RandLMInfo(&fin);
    randlm_ = RandLM::initRandLM(info, &fin, 1);
    info = NULL;
    assert(randlm_ != NULL);
    std::cerr << "Loaded RandLM." << std::endl;
    // read LM's order
    order_ = randlm_->getOrder();
    return true;
  }// end load

  // check and format test info
  vector<string> GenStats::formatTestInfo(string in) {
    // return vector 'fail' if input not valid
    vector<string> arr, fail;
    string info = in;
    fail.push_back(" ");
    // forget the output file of a previous query
    outputFile = "";
    // at least 2 params
    if (info.find(":") == string::npos)
      return fail;
    // write output to file or not ?
    if (info.find(">") != string::npos) {
      vector<string> tmp;
      Tokenize(info, tmp, ">");
      if (tmp.size() == 2) {
        outputFile = tmp[1];
        RandLMUtils::trim(outputFile);
        info = tmp[0];
      } else {
        return fail;
      }
    }
    // test valid params
    Tokenize(info, arr);
    for (size_t i = 0; i < arr.size(); i++)
      RandLMUtils::trim(arr[i]);
    if (arr.size() != 2 && arr.size() != 3)
      return fail;
    if (arr[1] != "corpus" && arr[1] != "ngrams" &&
        arr[1] != "c" && arr[1] != "n")
      return fail;
    if (arr.size() == 3)
      if (arr[2] != "1" && arr[2] != "0" &&
          arr[2] != "true" && arr[2] != "false")
        return fail;
    // all params are valid, return formatted info vector
    return arr;
  }// end formatTestInfo

  bool GenStats::setTestInfo(vector<string> testInfo) {
    // if corpus data then add <s> </s> symbols
    string test_type = testInfo[1];
    if (test_type == "c") test_type = "corpus";
    if (test_type == "n") test_type = "ngrams";
    corpus_data_ = test_type == InputData::kCorpusFileType;
    vocab = randlm_->getVocab();
    assert(vocab != NULL);
    // release the test data of a previous query before loading new data
    delete test_data_;
    test_data_ = new TestCorpus(testInfo[0], vocab, order_, corpus_data_);
    assert(test_data_ != NULL);
    // return only counts or smoothed probs
    getcounts_ = (testInfo.size() == 3) ?
      (RandLMUtils::StringToBool(testInfo[2])) : false;
    // if return counts, cast RandLM into CountRandLM
    if (getcounts_) {
      count_randlm_ = dynamic_cast<CountRandLM*>(randlm_);
      assert(count_randlm_ != NULL);
    }
    return true;
  }// end setTestInfo

}// end
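The window logic of the sentence mode in query() can be tried without RandLM using the sketch below. The function name `fullOrderNgrams` is hypothetical and plain strings stand in for word IDs; the loop bounds follow the listing above, so the window always ends at position 1 or later and windows shorter than the LM order are skipped.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Return every n-gram of exactly `order` words, in sentence order.
std::vector<std::string> fullOrderNgrams(const std::vector<std::string>& words,
                                         int order) {
  assert(order > 0);
  std::vector<std::string> ngrams;
  const int len = static_cast<int>(words.size());
  for (int i = 1; i < len; ++i) {                  // index of the last word
    const int start = std::max(0, i - order + 1);  // index of the first word
    if (i - start + 1 < order) continue;           // window still too short
    std::string ngram;
    for (int j = start; j <= i; ++j) {
      if (j > start) ngram += " ";
      ngram += words[j];
    }
    ngrams.push_back(ngram);
  }
  return ngrams;
}
```

For the sentence `<s> a b c </s>` and order 3, the function returns the three trigrams `<s> a b`, `a b c` and `b c </s>`, which are the n-grams the program would score in corpus mode.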
3. GenStatsMain.cpp
#include <cassert>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
#include "GenStats.h"
using namespace randlm;

int main(int argc, char ** argv) {
  // the path to the RandLM file is expected as the first argument
  if (argc < 2) {
    cerr << "Usage: " << argv[0] << " <path_to_randlm_file>" << endl;
    return 1;
  }
  // init genstats, load LM into RAM
  GenStats genstats(argc, argv);
  string testInfo = "";
  vector<string> infoArr;
  cout << endl << "GENERATE STATS (counts, probabilities)" << endl;
  cout << "Query: <path_to_test_file>:<corpus | ngrams>"
          "[:<return_only_counts_?>]" << endl;
  cout << "Example: test1:corpus:true" << endl;
  cout << "Type 'quit' or 'exit' to stop program." << endl;
  do {
    // ask again until the input is valid or the user quits
    do {
      infoArr.clear();
      cout << "Info of test: ";
      if (!getline(cin, testInfo))
        testInfo = "quit";   // end of input: stop instead of looping forever
      if (testInfo != "exit" && testInfo != "quit")
        // validate user's input
        infoArr = genstats.formatTestInfo(testInfo);
      else break;
    } while (infoArr.size() <= 1);
    // quit
    if (testInfo == "exit" || testInfo == "quit")
      break;
    // set test info
    genstats.setTestInfo(infoArr);
    // process queries (called outside assert() so it runs even with NDEBUG)
    const bool ok = genstats.query();
    assert(ok);
    (void) ok;
  } while (1);
  return 0;
}
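References
Vietnamese:
[1] Nguyễn Văn Vinh. Xây dựng chương trình dịch tự động Anh-Việt bằng phương pháp
dịch thống kê (Building an English-Vietnamese automatic translation program using
statistical translation). Master's thesis, College of Technology, Vietnam National
University, Hanoi, 2005.
English: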
[2] Algoet, P. H. and Cover, T. M. A sandwich proof of the Shannon-McMillan-
Breiman Theorem. The Annals of Probability, 1988, 16(2): pages 899-909.
[3] Bahl, L. R., Baker J. K., Jelinek F. and Mercer R. L. Perplexity - a measure of the
difficulty of speech recognition tasks. Acoustical Society of America Journal 62, 1977,
pages 63-66.
[4] Bloom, B. H. Space/time trade-offs in hash coding with allowable errors.
Commun. ACM, 1970.
[5] Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. Large language models in
machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning
(EMNLP-CoNLL), 2007, pages 858-867.
[6] Broder, A. and Mitzenmacher, M. Network applications of bloom filters: A survey.
In Proc. of Allerton Conference, 2002.
[7] Brown P. F., Cocke J., Della Pietra V., Della Pietra S., Jelinek F., Lafferty J. D.,
Mercer R. L., and Roossin P. S. A statistical approach to machine translation.
Computational Linguistics, 1990, 16(2): pages 79-85.
[8] Brown et al. The Mathematics of Statistical Machine Translation: Parameter
Estimation. Computational Linguistics, 19(2), 1993.
[9] Callison-Burch, Chris, Miles Osborne, and Philipp Koehn. Re-evaluating the role of
Bleu in machine translation research. In EACL 2006: Proceedings of the Eleventh
Conference of the European Chapter of the Association for Computational Linguistics,
2006.
[10] Chazelle, B., Kilian, J., Rubinfeld, R., and Tal, A. The bloomier filter: an efficient
data structure for static support lookup tables. In SODA '04: Proceedings of the
fifteenth annual ACM-SIAM symposium on Discrete algorithms, Philadelphia, PA, USA.
Society for Industrial and Applied Mathematics, 2004, pages 30-39.
[11] Chen, S. and Goodman, J. An empirical study of smoothing techniques for
language modeling. Computer Speech & Language, 1999, 13: pages 359-393.
[12] Costa, L. H. M. K., Fdida, S., and Duarte, O. C. M. B. Incremental service
deployment using the hop-by-hop multicast routing protocol. IEEE/ACM Trans. Netw.,
2006, 14(3): pages 543-556.
[13] de Laplace, M. A Philosophical Essay on Probabilities. Dover Publications,
1996.
[14] Fallis, D. The reliability of randomized algorithms. Br J Philos Sci, 2000, 51(2):
pages 255-271.
[15] Galley M. and Manning C. D. A simple and effective hierarchical phrase
reordering model. In Proceedings of the 2008 Conference on Empirical Methods in
Natural Language Processing, Honolulu, Hawaii, October. Association for Computational
Linguistics, 2008, pages 848-856.
[16] Golin, M., Raman, R., Schwarz, C., Smid, M., and C, S. J. C. Randomized data
structures for the dynamic closest-pair problem. In Proc. 4th ACM-SIAM Sympos.
Discrete Algorithms, 1993, pages 301-310.
[17] Kneser, R. and Ney, H. Improved backing-off for m-gram language modelling. In
Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, 1995,
volume 1, pages 181-184.
[18] Koehn, P. Europarl: A multilingual corpus for evaluation of machine translation,
2003.
Available at http://people.csail.mit.edu/koehn/publications/europarl.ps.
[19] Koehn, P. Empirical Methods in Natural Language Processing. From course
slides at http://www.inf.ed.ac.uk/teaching/courses/emnlp/, 2007.
[20] Koehn, P. and Chris Callison-Burch. Introduction to statistical machine
translation. ESSLLI 2005 tutorial, 2005.
[21] Koehn, P. and Hoang, H. Factored translation models. In Proceedings of the 2007
Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning (EMNLP-CoNLL), 2007, pages 868-876.
[22] Koehn, P., Och, F. J., and Marcu, D. Statistical phrase-based translation. In
NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the
Association for Computational Linguistics on Human Language Technology, Morristown,
NJ, USA. Association for Computational Linguistics, 2003, pages 48-54.
[23] Le Khanh Hung. One method of interlingual translation. In National Conference
on IT Research, Development and Applications of ICT, 2003.
[24] Levenberg, A. D. Bloom filter and lossy dictionary based language models.
Dissertation, master of science, School of Informatics, University of Edinburgh, 2007.
[25] Manning, C. D. and Schütze, H. Foundations of Statistical Natural Language
Processing. The M.I.T. Press. Massachusetts, 1999.
[26] Masao Utiyama. A survey of statistical machine translation. Lecture slides, Kyoto
University, 2006.
[27] Och, F. J. Minimum error rate training in statistical machine translation. In ACL
'03: Proceedings of the 41st Annual Meeting on Association for Computational
Linguistics, Morristown, NJ, USA. Association of Computational Linguistics, 2003,
pages 160-167.
[28] Och, F. The Google Statistical Machine Translation System for the 2005 NIST MT
Evaluation. Oral presentation at the 2005 NIST MT Evaluation workshop, 2005.
[29] Och F. J. and Hermann Ney. Improved statistical alignment models. In
Proceedings of ACL, 2000.
[30] Pagh, A., Pagh, R., and Rao, S. S. An optimal bloom filter replacement. In SODA
'05: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms,
Philadelphia, PA, USA. Society for Industrial and Applied Mathematics, 2005, pages
823-829.
[31] Pagh, R. and Rodler, F. F. Lossy dictionaries. In ESA '01: Proceedings of the 9th
Annual European Symposium on Algorithms, London, UK. Springer-Verlag, 2001, pages
300-311.
[32] Papineni K., S. Roukos, T. Ward, and W. J. Zhu. Bleu: a method for automatic
evaluation of machine translation. In Proc. of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL), Philadelphia, PA, July, 2002, pages
311-318.
[33] Pugh, W. Skip lists: A probabilistic alternative to balanced trees. Commun.
ACM, 1990, 33(6): pages 668-676.
[34] Stolcke, A. SRILM - an extensible language modeling toolkit. In Proc. Intl. Conf.
on Spoken Language Processing, 2002.
[35] Talbot, D. and Osborne, M. Randomised language modelling for statistical
machine translation. In Proceedings of the 45th Annual Meeting of the Association of
Computational Linguistics, Prague, Czech Republic. Association for Computational
Linguistics, 2007a, pages 512-519.
[36] Talbot, D. and Osborne, M. Smoothed Bloom filter language models: Tera-scale
LMs on the cheap. In Proceedings of the 2007 Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning
(EMNLP-CoNLL), 2007b, pages 468-476.
[37] Talbot, D. and Talbot, J. Bloom maps. In Proceedings of the Fourth Workshop on
Analytic Algorithmics and Combinatorics (ANALCO). Society for Industrial and Applied
Mathematics, 2008.
[38] Tillmann C. A unigram orientation model for statistical machine translation. In
Daniel Marcu, Susan Dumais and Salim Roukos, editors, Proceedings of HLT-NAACL
2004: Short Papers, Boston, Massachusetts, USA, May 2 - May 7. Association for
Computational Linguistics, 2004, pages 101-104.
[39] To Hong Thang. Building language model for Vietnamese and its application.
Dissertation, Bachelor of IT, College of Technology, Vietnam National University, 2008.