UNIVERSITY OF SCIENCE, HO CHI MINH CITY

FACULTY OF INFORMATION TECHNOLOGY

COURSE: INTRODUCTION TO NATURAL LANGUAGE PROCESSING

FINAL REPORT
TOPIC: HANDLING VOCABULARY GAPS
BY EXTRACTING LOANWORD PAIRS
FROM A PARALLEL CORPUS

Nguyn Thanh Dng - MSSV : 1112056


16/12/2014

1 INTRODUCTION
In today's world, where globalization has become a familiar concept, exchange and mutual learning between countries and regions matter more and more. Translation has therefore become increasingly common, helping people overcome language barriers and communicate freely.
However, expert translators are not always available when we face an unfamiliar language, and no expert can master every language. There are more than 6,500 languages in the world, and not all of them have professional translators. To ensure that information is exchanged accurately, translation teams have to be built to translate texts, documents, and speech from one language into another. Building such teams can be very costly and impractical in many situations. Automatic translation has therefore become a field that attracts considerable research interest, with many published papers.
Because automatic translation involves too many related problems, this report focuses on words that are missing from the corpus, and specifically on building a transliteration system for words that do not occur in the corpus. The report relies mainly on the paper "Integrating an Unsupervised Transliteration Model into Statistical Machine Translation" (2014) by Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp Koehn.
Objectives and scope:
The objective of this report is to study the handling of out-of-vocabulary (OOV) words, and specifically methods for transliterating words between English and Vietnamese, in order to improve the performance of automatic translation under the statistical approach.

2 MACHINE TRANSLATION (MT)


2.1 WHAT IS MACHINE TRANSLATION?
Machine translation is an important direction in natural language processing. It is the process by which a computer automatically translates a text from one language (for example English) into another (for example Vietnamese). In this case English is the source language and Vietnamese is the target language. For any translation to succeed, the prerequisite is that the meaning of the original text is fully conveyed in the target text. Rather than simply translating word by word, a machine translation system must be able to analyze all components of the text and determine the relationships among words, sentences, and so on, in order to improve translation quality. This requires processing of grammar, syntax, semantics, and more.
Translation by humans and translation by machines share the same challenges. For example, no two people or two computers translating the same text will produce exactly the same translation. Every translation needs to be checked after completion to make sure it meets the requirements of the task. The biggest problem for machine translation, however, is how a computer can automatically produce translations of good quality.
There are several approaches to machine translation. The two main ones are rule-based machine translation and statistical machine translation.

2.2 RULE-BASED MACHINE TRANSLATION


Rule-based machine translation uses a large set of rules, written under expert supervision, to translate texts from the source language into the corresponding target language. The process requires a large lexicon, syntactic and semantic information, and the rule set itself. Rule-based systems are usually adjusted and supervised by humans to improve translation quality.
The advantage of rule-based machine translation is that human expertise yields reasonably good translations with predictable results. Translation errors can be fixed by changing the dictionary.
The drawback of this approach is the cost and time needed to design and operate it. In addition, adding new rules can introduce ambiguity into the translation process.

2.3 STATISTICAL MACHINE TRANSLATION


Statistical machine translation (SMT) uses statistical models whose parameters are derived from monolingual and bilingual corpora. The learned model consists of word and phrase translations together with their probabilities.
Translated sentences are generated by matching the words and phrases of the source text against a database of existing translations, which produces all possible translations. These candidates are then scored by their probabilities, and a search algorithm selects the translation with the best score.
The advantage of this approach is that it needs almost no tuning by human experts. Statistical machine translation also makes the translation process much faster and more efficient than earlier methods, and statistical systems are easy to update and improve.
The challenges of statistical machine translation are controlling the learning process and tracing translation errors back to their source. The approach also requires large data sets and hardware powerful enough to run the translation model.

2.4 DIFFERENCES BETWEEN RULE-BASED AND STATISTICAL MACHINE TRANSLATION

Rule-based machine translation works well across different translation domains and produces consistent translations of predictable quality. Customizing the dictionary improves quality and helps translate the terminology that has been added to the data. However, the output may lack flexibility and fluency, and the customization needed to reach acceptable quality can be long and expensive. Run-time performance is high and does not depend on the computer's hardware.
Statistical machine translation reaches good quality when a large, high-quality corpus is available. The translations are fluent enough to satisfy readers. However, translation quality is inconsistent and unpredictable. Training is fully automatic and cheap. Training on a general corpus that does not belong to a specific domain leads to lower quality. Finally, statistical machine translation needs a certain level of hardware to build and manage large translation models.
Table 2.1 summarizes the differences between the two approaches.

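Table 2.1: Comparison of rule-based and statistical machine translation

Rule-based                                     | Statistical
-----------------------------------------------+-----------------------------------------
Consistent, predictable translation quality    | Unpredictable translation quality
Handles out-of-domain text                     | Handles out-of-domain text poorly
Knows grammatical rules                        | Does not know grammar
High performance                               | Requires powerful hardware
Consistent across versions                     | Differs between versions
Translations lack fluency                      | Fluent translations
High development and tuning cost               | Fast, low-cost development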

2.5 PROBLEMS ENCOUNTERED IN STATISTICAL MACHINE TRANSLATION


In practice, statistical machine translation runs into the following obstacles:
2.5.1 SENTENCE ALIGNMENT
In a parallel corpus, one sentence in one language may be translated into several sentences in the other language, and vice versa.
2.5.2 STATISTICAL ANOMALIES
Real-world training data can override the translation of some words, for example proper nouns. For instance, "Tôi đi đến Đà Lạt" ("I am going to Đà Lạt") may be translated as "Tôi đi đến Đà Nẵng" ("I am going to Đà Nẵng") because the phrase "đi đến Đà Nẵng" is overrepresented in the training set.
2.5.3 DATA DILUTION
When a statistical model for machine translation is built from training data in one domain, translation quality can degrade when the model is used in another domain. For example, "neural network" can mean different things in biomedicine and in computer science. Using a training set from one domain for a different, specific domain can dilute the data. Statistical machine translation systems are usually built for a specific domain, but a goal of current research is to make a single statistical model applicable to many domains.
2.5.4 IDIOMS
Depending on the corpus used, idioms may not be translated according to their actual meaning. For example, "kick the bucket" translated literally gives "đá cái xô" in Vietnamese, whereas the correct meaning of the idiom is "qua đời" (to pass away).

2.5.5 DIFFERENT WORD ORDERS


Languages differ in word order. One way to handle this is to label the subject (S), verb (V), and object (O) of each sentence and rearrange them into SVO or VSO order depending on the language. Beyond the order of S, V, and O, there are other word-order differences, such as the position of noun modifiers or the order of words in questions versus statements.
In statistical machine translation, the translation model can only handle short sequences, so word order has to be handled explicitly by the system designer. One solution is to add a reordering model in which the position of each word is determined from its neighbouring words. The candidate positions can be ranked, and the model then chooses the best one.
2.5.6 OUT-OF-VOCABULARY WORDS (OOV WORDS)
Statistical machine translation systems usually store different words as independent units, with no representation of the relationships between them. As a result, words or phrases that do not occur in the training set cannot be translated. This can be caused by gaps in the training data, by a change in the domain in which the system is used, or by morphological variation.
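The definition above can be made concrete with a minimal sketch. It assumes whitespace tokenization and lowercasing, and the toy training sentences and the function name `find_oov` are illustrative only, not part of the report:

```python
def find_oov(sentence, vocabulary):
    """Return the tokens of `sentence` that never occurred in the training data."""
    return [tok for tok in sentence.lower().split() if tok not in vocabulary]


# Toy training data (assumption): the vocabulary is every token seen in training.
training_sentences = ["the house is small", "the car is big"]
vocabulary = {tok for s in training_sentences for tok in s.split()}

oov = find_oov("the house in Hanoi is big", vocabulary)
print(oov)  # ['in', 'hanoi']
```

A real system would pass such tokens through untranslated, which is the case the transliteration approach in this report targets.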

3 COMMON TERMS IN MACHINE TRANSLATION


Word: a unit that carries independent meaning, is made up of one or more morphemes, and has a naming function, e.g. I - am - reading - my - books.
Phrase: two or more words that are grammatically or semantically related, e.g. bức thư (letter), mạng máy tính (computer network), computer system.
Sentence: words and phrases that are grammatically or semantically related and whose basic function is to convey information, e.g. I am reading my books.
Text: a system of sentences linked to one another in form, grammar, semantics, and pragmatics.
Morphology: the relationship between a linguistic unit and the form in which that unit is built.
Grammar: the relationship between a linguistic unit and the other linguistic units related to it.
Semantics: the relationship between a linguistic unit and its content (its meaning).
Pragmatics: the relationship between a linguistic unit and the purpose for which it is used.
Word order: the expression of the linear nature of language, in other words the order of the constituents of a sentence.
Part of speech: the classes into which the words of a language are grouped by grammatical nature. Words in the same class share general semantic features (beside their specific lexical meanings), grammatical categories, patterns of inflection and word formation, and syntactic functions. Vietnamese parts of speech include nouns, verbs, adjectives, numerals, pronouns, adverbs, conjunctions, particles, interjections, and function words.
Noun: a word whose general lexical meaning denotes things, or abstract concepts regarded as things; it usually serves as subject or object in a sentence.
Verb: a word that expresses an action, state, or process; it usually carries the grammatical categories of tense, person, mood, aspect, and voice (in Indo-European languages) and usually serves as the predicate of a sentence.
Adjective: a part of speech that expresses a property (quality, characteristic, colour); it usually carries grammatical categories such as degree of comparison, gender, and number, and its basic syntactic functions are attribute and predicate. In isolating languages an adjective can serve directly as the predicate.

Pronoun: a part of speech that stands for or replaces a person or thing mentioned in communication, or some other part of speech such as a noun or adjective, or even a sentence constituent or whole preceding sentences.
Content word: a word with independent lexical meaning that can serve as a constituent of a sentence. Nouns, verbs, adjectives, and numerals are content words.
Function word: a word that cannot serve as a sentence constituent on its own and is used to express semantic and syntactic relations between content words, or to add grammatical meaning to a word. Conjunctions and particles are function words.
Synonyms: words that are close in meaning, differ in sound, belong to the same part of speech, and differ in the nuance with which they express the same concept.
Homonyms: words that sound the same but have different meanings.
Context-free grammar: a grammar in which every production rule has the form

$$V \rightarrow w \qquad (3.1)$$

where $V$ is a nonterminal symbol and $w$ is a string of terminal and nonterminal symbols ($w$ may be empty).
Nonterminal symbols: include sentence, noun phrase, verb phrase, and prepositional phrase.
Terminal symbols: include pronoun, noun, determiner, verb, adjective, preposition, and conjunction.

4 THE OUT-OF-VOCABULARY (OOV) PROBLEM AND APPROACHES TO IT

This report concentrates on the handling of OOV words. Approaches to OOV words in earlier work include:
- Translating named entities using monolingual and bilingual resources (Al-Onaizan and Knight, 2002).
- Linking an out-of-vocabulary (OOV) word to an in-vocabulary (INV) word that may be related to it (Nizar Habash, 2009).

Most earlier work handles OOV words in a pre-processing or post-processing stage by replacing each OOV word with its best transliteration. In these approaches the module that transliterates OOV words sits outside the main component of the translation system. One reason is that transliteration training data is not available for many languages.
This report studies the paper "Integrating an Unsupervised Transliteration Model into Statistical Machine Translation" by Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp Koehn (2014). In that paper the authors use an unsupervised model to extract training data from the aligned word pairs of a parallel corpus, and then use these pairs as training data for transliteration. They integrate this unsupervised transliteration model into Moses, one of the most widely used toolkits for statistical machine translation. At the time of writing, the method in that paper is regarded as the best available approach.

5 BACKGROUND


5.1 LANGUAGE MODEL
The language model measures the fluency of a translated sentence and is an important component of statistical machine translation. It influences word choice, word order, and other decisions. Concretely, the language model assigns every sentence a probability that expresses how likely the sentence is to occur in the language. A language model $p_{LM}$ prefers translations with correct word order:

$$p_{LM}(\text{the house is small}) > p_{LM}(\text{small the is house}) \qquad (5.1)$$
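Suppose a sentence $e$ can be split into words $e = e_1 e_2 \dots e_m$. The probability of $e$ can be defined as a product of conditional probabilities:

$$\mathrm{pr}(e) = \prod_{j=1}^{m} \mathrm{pr}(e_j \mid e_1, \dots, e_{j-1}) \qquad (5.2)$$

The obstacle in building a language model is that there are far too many conditional probabilities to estimate. To simplify the computation, we restrict how much of the preceding context a word depends on. In the simplest case a word is assumed to be independent of all preceding words:

$$\mathrm{pr}(e_j \mid e_1, \dots, e_{j-1}) = \mathrm{pr}(e_j) \qquad (5.3)$$

which leads to

$$\mathrm{pr}(e) = \prod_{j=1}^{m} \mathrm{pr}(e_j) \qquad (5.4)$$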

The probability $\mathrm{pr}(e_j)$ can be estimated by taking a large corpus and counting words. The open question is how many preceding words the model should condition on. One of the best-known answers is the N-gram model. It uses the Markov assumption to factor the probability of a sentence into a product of per-word probabilities, where the probability of each word depends only on a limited number of preceding words. An N-gram of size one is called a unigram, of size two a bigram, and of size three a trigram.
Consider the trigram model. The probability of a word is conditioned on the two words immediately before it:

$$\mathrm{pr}(e_j \mid e_1, \dots, e_{j-1}) = \mathrm{pr}(e_j \mid e_{j-2}, e_{j-1}) \qquad (5.5)$$

Some trigrams never occur in the training corpus, which gives $\mathrm{pr}(e_j \mid e_{j-2}, e_{j-1}) = 0$ and therefore $\mathrm{pr}(e) = 0$. To avoid this, a smoothed trigram model is used, in which $\mathrm{pr}(e_j \mid e_{j-2}, e_{j-1})$ is computed with the approximation

$$\mathrm{pr}(e_j \mid e_{j-2}, e_{j-1}) = \lambda_1 T(e_j \mid e_{j-2}, e_{j-1}) + \lambda_2 B(e_j \mid e_{j-1}) + \lambda_3 U(e_j) + \lambda_4 \qquad (5.6)$$

where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are constants with $\lambda_1 \gg \lambda_2 \gg \lambda_3 \gg \lambda_4$, so that the most specific estimate receives the largest weight and $\lambda_4$ only guarantees a small non-zero probability, and
T: trigram probability
B: bigram probability
U: unigram probability
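The smoothed trigram formula (5.6) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the toy corpus, the sentence-boundary markers `<s>` and `</s>`, the function names, and the weights `(0.95, 0.04, 0.008, 0.002)` are all chosen for the example and are not taken from the report.

```python
from collections import Counter


def train(sentences):
    """Collect unigram, bigram and trigram counts plus the history counts they are divided by."""
    uni, bi, tri, ctx1, ctx2 = Counter(), Counter(), Counter(), Counter(), Counter()
    total = 0
    for sentence in sentences:
        toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(toks)):
            u, v, w = toks[i - 2], toks[i - 1], toks[i]
            uni[w] += 1
            bi[(v, w)] += 1
            tri[(u, v, w)] += 1
            ctx1[v] += 1        # how often v was a one-word history
            ctx2[(u, v)] += 1   # how often (u, v) was a two-word history
            total += 1
    return uni, bi, tri, ctx1, ctx2, total


def smoothed_prob(w, u, v, model, lambdas=(0.95, 0.04, 0.008, 0.002)):
    """Eq. (5.6): lambda1*T + lambda2*B + lambda3*U + lambda4 (weights are illustrative)."""
    uni, bi, tri, ctx1, ctx2, total = model
    t = tri[(u, v, w)] / ctx2[(u, v)] if ctx2[(u, v)] else 0.0
    b = bi[(v, w)] / ctx1[v] if ctx1[v] else 0.0
    un = uni[w] / total
    l1, l2, l3, l4 = lambdas
    return l1 * t + l2 * b + l3 * un + l4


def sentence_prob(sentence, model):
    """Product of smoothed trigram probabilities over the sentence, as in Eq. (5.2) with (5.5)."""
    toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for i in range(2, len(toks)):
        prob *= smoothed_prob(toks[i], toks[i - 2], toks[i - 1], model)
    return prob


model = train(["the house is small", "the house is big", "the car is small"])
good = sentence_prob("the house is small", model)
bad = sentence_prob("small the is house", model)
```

With this toy corpus the fluent word order receives the higher score, and the scrambled sentence still receives a non-zero probability thanks to the lower-order terms and the constant.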

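5.2 TRANSLATION MODEL


The goal of the translation model is to compute the probability that a given target sentence is produced when translating a source sentence. There are three main approaches to translation models:
- Word-based translation models.
- Phrase-based translation models.
- Syntax-based translation models.

5.3 DECODER
The task of the decoder is to find the best target-language sentence $e$ for a source-language sentence $v$, that is, the $e$ that maximizes $P(e)\,P(v \mid e)$.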
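The product maximized by the decoder follows from Bayes' rule. A short derivation, using the same symbols $e$ and $v$ and the notation $\hat{e}$ (introduced here) for the chosen translation:

```latex
\hat{e}
  = \arg\max_{e} P(e \mid v)
  = \arg\max_{e} \frac{P(e)\, P(v \mid e)}{P(v)}
  = \arg\max_{e} P(e)\, P(v \mid e)
```

The denominator $P(v)$ does not depend on $e$ and can be dropped. $P(e)$ is supplied by the language model of section 5.1 and $P(v \mid e)$ by the translation model of section 5.2.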

5.4 EXPECTATION MAXIMIZATION


Statistical models such as Markov models and Bayesian networks are commonly used to model data, and their parameters are learned from observed data. Sometimes, however, the training data is incomplete, which prevents the parameters from being learned directly. The Expectation Maximization (EM) algorithm is used to estimate the parameters of a statistical model when the data is incomplete.
EM is the general name for an iterative algorithm that starts from initial values for the model parameters and adjusts them in every iteration so as to increase the likelihood of the observed data. The loop stops when the estimates converge.
Specific instances of EM are applied to several problems in language processing. For example:
- To estimate the parameters of a Hidden Markov Model (HMM), the variant of EM called Baum-Welch or Forward-Backward is used.
- To estimate the parameters of a probabilistic grammar, the variant of EM called Inside-Outside is used.
- For smoothing in HMM-based models, the variant of EM known as linear interpolation is used.
Example: a coin-flipping experiment
Consider a simple experiment with two coins A and B whose probabilities of landing heads are $\theta_A$ and $\theta_B$. The goal is to estimate $\theta = (\theta_A, \theta_B)$ by repeating the following procedure 5 times: pick one of the two coins at random (with equal probability), then flip that coin 10 times. In total there are 50 coin flips.
Suppose that during the experiment we record two vectors $x = (x_1, x_2, \dots, x_5)$ and $z = (z_1, z_2, \dots, z_5)$, where $x_i \in \{0, 1, \dots, 10\}$ is the number of heads in round $i$ and $z_i \in \{A, B\}$ identifies the coin chosen in round $i$. In this setting the data is complete: the values of all random variables in the model (the outcome of each round and the coin used in each round) are known.
A simple way to estimate $\theta_A$ and $\theta_B$ is then to use the proportion of heads for each coin:

$$\hat{\theta}_A = \frac{\text{number of heads using coin A}}{\text{total number of flips using coin A}} \qquad (5.7)$$

and

$$\hat{\theta}_B = \frac{\text{number of heads using coin B}}{\text{total number of flips using coin B}} \qquad (5.8)$$

This procedure is known as maximum likelihood estimation. It judges the quality of a model by the probability the model assigns to the observed data.
Now consider a harder case in which we are given the head counts $x$ but not the identity of the coin used in each round (we do not know $z$). We treat $z$ as a hidden, or latent, variable. Parameter estimation in this setting is the incomplete-data case. Computing the proportion of heads per coin is no longer possible, because we do not know which coin was used in each round. However, if we had some way of completing the data (here, correctly guessing which coin was used in each of the 5 rounds), we could replace parameter estimation on incomplete data with maximum likelihood estimation on complete data.

The EM algorithm for this case can be described as follows:
- Initialize $\theta$.
- Expectation (E): using the current $\theta$, estimate for each round of 10 flips how likely it is that coin A or coin B was used.
- Maximization (M): treating the completed data as if it were correct, apply maximum likelihood estimation to obtain a new $\theta$.
- Repeat the E and M steps until the parameters converge.

In summary, EM alternates between guessing a probability distribution over completions of the missing data (E step) and re-estimating the model parameters from the completed data (M step). The name "Expectation" comes from the fact that we do not need an explicit distribution over all completions, only the expected statistics obtained from them. The name "Maximization" comes from the fact that re-estimating the model can be seen as maximizing the expected likelihood of the data. Figure 5.1 shows the concrete computation for the example above.
Maximum likelihood: for each round of 10 flips, maximum likelihood counts the heads and tails of coins A and B and computes the heads probability of each coin from these counts.
Expectation maximization:
(1) EM starts by initializing the parameters, here $\theta_A^{(0)} = 0.60$, $\theta_B^{(0)} = 0.50$.
(2) E step: a probability distribution over the two coins is computed for each round from the current parameter values. The values in the table are the expected numbers of heads and tails under this distribution.
(3) M step: new parameter values are computed from the newly completed data.
(4) Steps 2 and 3 are repeated until the values converge.
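The four steps above can be sketched as a short program. The head counts `[5, 9, 8, 4, 7]` are an assumption made for illustration (the report's figure with the actual data is not reproduced in the text); only the initial values 0.60 and 0.50 come from the description above. The sketch uses soft assignments, that is, each round contributes to both coins in proportion to its posterior probability.

```python
def em_two_coins(heads, flips_per_round, theta_a, theta_b, iterations):
    """Estimate the heads probabilities of two coins when the coin identities are hidden."""
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h in heads:
            t = flips_per_round - h
            # E step: posterior probability that this round used coin A or coin B
            # (both coins are equally likely a priori; the binomial coefficient cancels).
            like_a = theta_a ** h * (1.0 - theta_a) ** t
            like_b = theta_b ** h * (1.0 - theta_b) ** t
            w_a = like_a / (like_a + like_b)
            w_b = 1.0 - w_a
            # Expected head and tail counts attributed to each coin.
            heads_a += w_a * h
            tails_a += w_a * t
            heads_b += w_b * h
            tails_b += w_b * t
        # M step: maximum likelihood estimates from the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b


observed_heads = [5, 9, 8, 4, 7]  # assumed data: heads per round of 10 flips
first_a, first_b = em_two_coins(observed_heads, 10, 0.60, 0.50, iterations=1)
final_a, final_b = em_two_coins(observed_heads, 10, 0.60, 0.50, iterations=20)
```

With these assumed counts, the first iteration moves the estimates to roughly 0.71 and 0.58; further iterations keep separating the two coins until the values stop changing.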

Figure 5.1: Parameter estimation for complete and incomplete data. (a) Maximum likelihood estimation. (b) Expectation maximization.

5.5 THE BLEU EVALUATION METHOD


BLEU is an algorithm for evaluating the quality of machine translation. The main idea is to compare the machine's translation with reference translations produced by human experts. The closer the machine translation is to an expert's translation, the higher its quality.
The BLEU score is computed on translated segments, usually sentences, by comparing them with a set of corresponding segments translated by humans. These values are then averaged over the whole corpus to give the quality of the entire translated text. Intelligibility and grammatical correctness are not taken into account by BLEU.
BLEU is designed to measure precision at the level of a whole text, so it is not very effective for judging the quality of individual sentences.
A BLEU score is a number in the interval [0, 1]. The closer the score is to 1, the better the translation, and vice versa.
5.5.1 MAIN IDEA
- The closer a machine translation is to a professional human translation, the better it is.
- BLEU was proposed as a substitute for human evaluation when one is needed.
5.5.2 REQUIREMENTS
- Several good human translations are needed as references.
- The score depends on modified n-gram precision.
- There is a penalty for translations that are too short or overly condensed.
5.5.3 ILLUSTRATIVE EXAMPLE
Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
A quick reading already suggests that candidate 1 is better. Now consider the reference translations:

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.

To show that candidate 1 is better than candidate 2, we proceed as follows:
- Count the matching n-grams.
- Matches are counted regardless of position.
- An n-gram may match the references more than once.
- The counting does not depend on the language.

Counting the matching n-grams (here unigrams) against the three references:
Candidate 1: n-gram precision count 17 (17 of its 18 words occur in some reference).
Candidate 2: n-gram precision count 8 (8 of its 14 words occur in some reference).

A problem arises when a word is repeated too many times, as in the following case:
Candidate: the the the the the the the
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat
N-gram precision: 7/7 (incorrect)
The solution is that a reference word must be considered used up once it has been matched.
Algorithm for modified n-gram precision:
- For each word of the candidate, count the largest number of times it occurs in any single reference; call this $m_{max}$.
- For each word, take its number of occurrences $m_w$ in the candidate; if it exceeds $m_{max}$, clip it to $m_{max}$. Sum the clipped counts $m_w$ over all words.
- Divide the sum of the clipped counts by the total number of words in the candidate.
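The clipping procedure above can be sketched as follows. The function names are illustrative, and the references are lowercased with punctuation removed by hand, which is an assumption about preprocessing rather than something the report specifies.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n=1):
    """Return (clipped matches, total candidate n-grams) for one candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()  # m_max: largest count of each n-gram in any single reference
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
clipped, total = modified_precision(candidate, references)
print(clipped, total)  # 2 7
```

The word "the" occurs at most twice in a single reference, so only 2 of the 7 candidate words are credited, giving a modified unigram precision of 2/7 instead of 7/7.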

A high 1-gram precision indicates that the translation is adequate, that is, it covers the content well. For n-gram precision with n > 1, a high value indicates that the translation is fluent. The best maximum n-gram length for evaluating a translation is four.
A remaining problem with this precision measure is that it tends to favour short translations. For example, the candidate "of the" has unigram and bigram precision of 1 when compared with the references above. To deal with this, a penalty is imposed on translations that are shorter than the reference, called the brevity penalty (BP). When a translation is at least as long as the reference, BP = 1; otherwise BP < 1.

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} \qquad (5.9)$$

where $c$ is the length of the candidate and $r$ is the effective length of the reference.
The BLEU score is then computed as

$$BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

where $p_n$ is the n-gram precision and $w_n$ is a weight, set to $1/N$.
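A minimal sentence-level sketch of the two formulas above. Two choices here are assumptions, not taken from the report: the effective reference length `r` is the reference length closest to the candidate length (ties go to the shorter reference), and the score is defined as 0 when some n-gram precision is 0, since the logarithm is then undefined.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision p_n with counts clipped per reference maximum."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())


def brevity_penalty(c, r):
    """Eq. (5.9): 1 if the candidate is longer than r, else exp(1 - r/c)."""
    return 1.0 if c > r else math.exp(1.0 - r / c)


def bleu(candidate, references, max_n=4):
    """BLEU = BP * exp(sum_n w_n log p_n) with uniform weights w_n = 1/max_n."""
    precisions = [clipped_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # assumption: score 0 when any precision is 0
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(c, r) * math.exp(log_avg)


refs = [
    "it is a guide to action that ensures that the military will forever heed party commands".split(),
    "it is the guiding principle which guarantees the military forces always being under the command of the party".split(),
    "it is the practical guide for the army always to heed directions of the party".split(),
]
short_score = bleu("of the".split(), refs, max_n=2)
exact_score = bleu(refs[2], refs)
```

For the candidate "of the", both precisions are 1, so the whole score reduces to the brevity penalty, which is close to zero; a candidate identical to one of the references scores 1.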

5.5.4 ADVANTAGES AND DISADVANTAGES
Advantages:
- Fast and cheap; little human effort is needed for checking.
- Can be used during development to track changes in the performance of the translation system.
Disadvantages:
- Evaluation of individual sentences is usually poor and unstable.
- BLEU scores each word against all references together. This can be worse than scoring against each reference separately and taking the best.
- BLEU requires exact word matches. Synonyms are not taken into account.
- All words are given equal weight.

5.6 TRANSLATION VERSUS TRANSLITERATION


Translation is the process of converting text from one language into another so as to achieve correspondence in meaning. Transliteration, in contrast, is the process of converting a word between two languages so as to achieve correspondence in spelling. Examples are Vietnamese loanwords such as vi rút (from English virus), sơ mi (from French chemise), and văn hóa (borrowed from Chinese). For historical reasons, most loanwords in Vietnamese come from Chinese, French, and English. Over time, Vietnamese speakers have adapted the pronunciation of these words to their own phonological system.
The next section presents the main method for integrating an unsupervised transliteration model into statistical machine translation. The method consists of the following steps. First, an unsupervised transliteration mining system extracts transliteration pairs to serve as training data. Next, these transliteration pairs are used to train the transliteration model. Finally, the transliteration system is integrated directly into the machine translation system in three ways: i) replacing each OOV word with its 1-best transliteration in a post-decoding step, ii) choosing the best transliteration from the n-best list using transliteration-model and language-model features in a post-decoding step, iii) supplying a transliteration phrase table to the decoder so that it can consider all features and choose the best transliteration to replace the OOV word.

6 MAIN METHOD


6.1 TRANSLITERATION MINING
The main obstacle in building a transliteration system is the lack of training data. However, it is reasonable to assume that any parallel corpus contains a certain number of transliteration pairs. Such pairs can therefore be extracted to form a training set. Earlier methods are mainly supervised or semi-supervised:
- Tarek Sherif and Grzegorz Kondrak. 2007. Bootstrapping a Stochastic Transducer for Arabic-English Transliteration Extraction. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic.
- Sittichai Jiampojamarn, Kenneth Dwyer, Shane Bergsma, Aditya Bhargava, Qing Dou, Mi-Young Kim, and Grzegorz Kondrak. 2010. Transliteration Generation and Mining with Limited Training Resources. In Proceedings of the 2010 Named Entities Workshop, Uppsala, Sweden.
- Kareem Darwish. 2010. Transliteration Mining with Phonetic Conflation and Iterative Training. In Proceedings of the 2010 Named Entities Workshop, Uppsala, Sweden.
- Ali El Kahki, Kareem Darwish, Ahmed Saad El Din, and Mohamed Abd El-Wahab. 2012. Transliteration Mining Using Large Training and Test Sets. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 12.
These methods require training data to be available. Several approaches based on unsupervised mining have been proposed:
- Jae-Sung Lee and Key-Sun Choi. 1998. English to Korean Statistical Transliteration for Information Retrieval. Computer Processing of Oriental Languages, 12(1):17-37.
- Wen-Pin Lin, Matthew Snover, and Heng Ji. 2011. Unsupervised Language-Independent Name Translation Mining from Wikipedia Infoboxes. In Proceedings of the First Workshop on Unsupervised Learning in NLP, pages 43-52, Edinburgh, Scotland, July. Association for Computational Linguistics.
- Hassan Sajjad, Alexander Fraser, and Helmut Schmid. 2012. A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining. In Proceedings of the 50th Annual Conference of the Association for Computational Linguistics, Jeju, Korea.

This report studies the method of Sajjad et al., described below.


Unsupervised transliteration mining model
Sajjad et al. build on the work of Li Haizhou et al. (2004). Transliteration is viewed as a process that takes a character string in the source language and constructs the corresponding character string in the target language. The process has two parts: segmenting the source string (a word seen as a string of characters) into transliteration units (characters), and linking units of the source string to units of the target string by aligning them.
The characters of a source word and a target word can be aligned in many ways. Let $a$ denote one particular alignment of a source word $e$ and a target word $f$. The function $Align(e, f)$ returns the set of all possible alignment sequences of the word pair $(e, f)$. The joint probability $p_1(e, f)$ of a word pair is the sum of the probabilities of all its alignment sequences:

$$p_1(e, f) = \sum_{a \in Align(e, f)} p(a) \qquad (6.1)$$

The transliteration system is trained on a list of transliteration pairs. The alignment within each pair is learned with the Expectation Maximization (EM) method. We use a simple unigram model in which a character alignment sequence consists of 0-1, 1-1, and 1-0 alignments between the source word $e$ and its transliteration $f$. A single unit of aligned characters is called a "multigram" and is denoted by $q$. A sequence of multigrams forms one alignment of the source word and the target word. The probability of a multigram sequence $a$ is the product of the probabilities of the multigrams it contains:

$$p(a) = p(q_1, q_2, \dots, q_{|a|}) = \prod_{j=1}^{|a|} p(q_j) \qquad (6.2)$$

While the transliteration system is trained on a set of transliteration pairs, the transliteration mining system learns from data that contains both transliterations and non-transliterations. The transliteration mining model therefore consists of two sub-models: a transliteration model and a non-transliteration model. The transliteration model $p_1(e, f)$ handles transliteration pairs. The second sub-model, the non-transliteration model $p_2(e, f)$, handles pairs that are not transliterations. Interpolating with the non-transliteration model lets the transliteration model concentrate on modelling transliterations during EM training. After training, transliteration pairs are assigned a higher probability by the transliteration model and a lower probability by the non-transliteration model, and the reverse holds for non-transliteration pairs. This property is exploited to identify transliterations.
In a non-transliteration pair, the characters of the source word and the target word are unrelated. We model such a pair as a random source word and a random target word occurring together. The non-transliteration model generates the characters at random from two unigram models. It is defined as follows:

$$p_2(e, f) = p_E(e)\, p_F(f) \qquad (6.3)$$

$$p_E(e) = \prod_{i=1}^{|e|} p_E(e_i) \qquad (6.4)$$

$$p_F(f) = \prod_{i=1}^{|f|} p_F(f_i) \qquad (6.5)$$

The transliteration mining model is an interpolation of the transliteration model $p_1(e, f)$ and the non-transliteration model $p_2(e, f)$:

$$p(e, f) = (1 - \lambda)\, p_1(e, f) + \lambda\, p_2(e, f) \qquad (6.6)$$

where $\lambda$ is the prior probability of non-transliteration.
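Equations (6.1) to (6.6) can be sketched with a small dynamic program that sums over all 0-1, 1-1, and 1-0 alignments without enumerating them. All probability tables below are toy values invented for the example, the function names are hypothetical, and an empty string stands for the missing side of a 0-1 or 1-0 multigram.

```python
def transliteration_prob(e, f, multigram_prob):
    """p1(e, f), Eq. (6.1)-(6.2): sum over all alignment sequences via a forward table."""
    rows, cols = len(e) + 1, len(f) + 1
    forward = [[0.0] * cols for _ in range(rows)]
    forward[0][0] = 1.0
    for i in range(rows):
        for j in range(cols):
            if i > 0 and j > 0:  # 1-1 multigram: one source char aligned to one target char
                forward[i][j] += forward[i - 1][j - 1] * multigram_prob.get((e[i - 1], f[j - 1]), 0.0)
            if i > 0:            # 1-0 multigram: source char with no counterpart
                forward[i][j] += forward[i - 1][j] * multigram_prob.get((e[i - 1], ""), 0.0)
            if j > 0:            # 0-1 multigram: target char with no counterpart
                forward[i][j] += forward[i][j - 1] * multigram_prob.get(("", f[j - 1]), 0.0)
    return forward[-1][-1]


def non_transliteration_prob(e, f, source_unigram, target_unigram):
    """p2(e, f), Eq. (6.3)-(6.5): characters generated independently by two unigram models."""
    prob = 1.0
    for ch in e:
        prob *= source_unigram.get(ch, 0.0)
    for ch in f:
        prob *= target_unigram.get(ch, 0.0)
    return prob


def mining_prob(p1, p2, lam):
    """p(e, f), Eq. (6.6): interpolation with prior lam for non-transliteration."""
    return (1.0 - lam) * p1 + lam * p2


# Toy parameters (assumptions for illustration only).
multigram_prob = {("a", "x"): 0.5, ("a", ""): 0.1, ("", "x"): 0.1}
p1 = transliteration_prob("a", "x", multigram_prob)          # 0.5 + 0.1*0.1 + 0.1*0.1
p2 = non_transliteration_prob("a", "x", {"a": 0.2}, {"x": 0.1})
p = mining_prob(p1, p2, lam=0.5)
```

The forward table has one cell per pair of prefix lengths, so the sum in (6.1) costs time proportional to the product of the two word lengths.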

6.2 TRANSLITERATION MODEL


Once transliteration mining is complete, we have the data needed to train the transliteration model. The mined word pairs are split into character units and used for training. The transliteration model assumes that source and target are generated monotonically, so no reordering is needed. The model uses the basic phrase-translation features (direct and inverse phrase translation, and lexical weighting), a language-model feature (built from the target side of the mined corpus), and word and phrase penalties. The feature weights are tuned on 1,000 transliteration pairs.

6.3 INTEGRATION INTO THE MACHINE TRANSLATION MODEL


Three methods are tested for integrating the transliteration model into the machine translation toolkit:
- Replacing each OOV word with its best transliteration in a post-decoding step.
- Choosing the best transliteration from the set of candidate transliterations in a post-decoding step, using features of the transliteration model and the language model.
- Supplying the transliteration table to the decoder while it translates, so that it considers all features when choosing the best transliteration to replace the OOV word.
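The first two methods in the list above can be sketched on a token sequence in which OOV words are still untranslated. Everything here is hypothetical: the n-best list, its log scores, the toy bigram table, and the single weight `lm_weight` stand in for the tuned features of a real system, and the third method is not shown because it happens inside the decoder.

```python
def replace_oov_1best(tokens, vocabulary, nbest):
    """Method 1: swap each OOV token for its highest-scoring transliteration."""
    output = []
    for tok in tokens:
        if tok not in vocabulary and tok in nbest:
            best, _ = max(nbest[tok], key=lambda cand: cand[1])
            output.append(best)
        else:
            output.append(tok)
    return output


def replace_oov_rescored(tokens, vocabulary, nbest, bigram_logprob, lm_weight=1.0):
    """Method 2: rescore the n-best list with a language model score in context."""
    output = []
    for tok in tokens:
        if tok not in vocabulary and tok in nbest:
            prev = output[-1] if output else "<s>"
            best, _ = max(
                nbest[tok],
                key=lambda cand: cand[1] + lm_weight * bigram_logprob(prev, cand[0]),
            )
            output.append(best)
        else:
            output.append(tok)
    return output


# Toy inputs (assumptions): one Hindi token left untranslated in English output.
vocabulary = {"he", "lives", "in"}
nbest = {"मुंबई": [("mumbay", -0.3), ("mumbai", -0.4)]}  # (candidate, log score)
toy_lm = {("in", "mumbai"): -1.0}


def bigram_logprob(prev, word):
    """Hypothetical bigram language model with a flat penalty for unseen pairs."""
    return toy_lm.get((prev, word), -10.0)


tokens = ["he", "lives", "in", "मुंबई"]
method1 = replace_oov_1best(tokens, vocabulary, nbest)
method2 = replace_oov_rescored(tokens, vocabulary, nbest, bigram_logprob)
```

In this toy case the transliteration score alone prefers "mumbay", while the language model context tips the choice to "mumbai", which illustrates why a low 1-best precision need not hurt when a good candidate is somewhere in the n-best list.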


6.4 MODEL ESTIMATION


This section discusses parameter estimation for the transliteration model $p_1(e, f)$ and the non-transliteration model $p_2(e, f)$.
The non-transliteration model consists of two unigram models. Their parameters are estimated from the source words and the target words of the training data and do not change during EM training.
For the transliteration model, we restrict ourselves to the character alignment types 0-1, 1-1, and 1-0 with a unigram model. The Expectation Maximization algorithm is used to train the model, maximizing the probability of the training data. In the E step, EM computes the expected counts of the multigrams, and in the M step the multigram probabilities are re-estimated from these counts. The two steps are repeated until convergence. For the first EM iteration, the multigram probabilities are initialized with a uniform distribution and $\lambda$ is set to 0.5.
The expected count of a multigram $q$ is computed by multiplying the posterior probability of each alignment $a$ by the number of occurrences of $q$ in $a$, and summing these values over all alignments of all word pairs.
$$c(q) = \sum_{i=1}^{N} \sum_{a \in Align(e_i, f_i)} \frac{(1 - \lambda)\, p_1(a, e_i, f_i)}{p(e_i, f_i)}\, n_q(a) \qquad (6.7)$$

Here $n_q(a)$ is the number of times the multigram $q$ occurs in the sequence $a$, and $p(e_i, f_i)$ is defined by equation 6.6. The new estimate of the multigram probability is:

$$p(q) = \frac{c(q)}{\sum_{q'} c(q')} \qquad (6.8)$$

Similarly, the expected count of non-transliterations is computed by summing the posterior probabilities of non-transliteration over the word pairs:

$$c_{ntr} = \sum_{i=1}^{N} p_{ntr}(e_i, f_i) = \sum_{i=1}^{N} \frac{\lambda\, p_2(e_i, f_i)}{p(e_i, f_i)} \qquad (6.9)$$

$\lambda$ is then re-estimated by dividing the expected count of non-transliterations by $N$.
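The re-estimation of $\lambda$ in (6.9) is small enough to sketch directly. The sketch assumes that $p_1$ and $p_2$ have already been computed for each word pair; the two pairs below use invented values, the first behaving like a transliteration and the second like an unrelated pair.

```python
def reestimate_lambda(pair_probs, lam):
    """One update of lambda: expected non-transliteration count divided by N (Eq. 6.9)."""
    expected_non_translit = 0.0
    for p1, p2 in pair_probs:
        p = (1.0 - lam) * p1 + lam * p2       # Eq. (6.6)
        expected_non_translit += lam * p2 / p  # posterior of non-transliteration
    return expected_non_translit / len(pair_probs)


# Toy (p1, p2) values for two word pairs (assumptions for illustration only).
pair_probs = [(0.52, 0.02), (0.001, 0.05)]
new_lambda = reestimate_lambda(pair_probs, lam=0.5)
```

The first pair contributes a posterior near 0 and the second a posterior near 1, so the updated value stays close to one half for this balanced toy data.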


7 EVALUATION
The authors experiment on 7 language pairs: Arabic, Bengali, Farsi, Hindi, Russian, Telugu, and Urdu, each paired with English. For Arabic and Farsi the data comes from TED talks. For the Indian languages the data is the Indic parallel corpus. For Russian the authors use the WMT-13 data.
Transliteration mining: transliteration pairs are extracted from the word-aligned parallel corpus. Only 1-1 alignments are used. Before mining, the word pairs are cleaned by removing digits, symbols, pairs in which either word has fewer than 3 characters, and words containing foreign characters. Mining is run for 10 EM iterations. The number of extracted transliteration pairs is shown in Table 7.1.
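Table 7.1: Size of the data sets and number of extracted transliteration pairs (Train_tr)

Lang | Train_tm | Train_tr | Dev  | Test1 | Test2
-----+----------+----------+------+-------+------
AR   | 152K     | 6795     | 887  | 1434  | 1704
BN   | 24K      | 1916     | 775  | 1000  | -
FA   | 79K      | 4039     | 852  | 1185  | 1116
HI   | 39K      | 4719     | 1000 | 1000  | -
RU   | 2M       | 302K     | 1501 | 1502  | 3000
TE   | 45K      | 4924     | 1000 | 1000  | -
UR   | 87K      | 9131     | 980  | 883   | -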
Transliteration system: before evaluating the integration into the machine translation system, the authors evaluate the transliteration system built from the extracted word pairs. The test data is Arabic-English (1,799 pairs), Hindi-English (2,394 pairs), and Russian-English (1,859 pairs). Table 7.2 shows the precision and recall of the transliteration system.

Table 7.2: Precision and recall of the transliteration system

                   | AR    | HI    | RU
-------------------+-------+-------+------
Precision (1-best) | 20.0% | 25.3% | 46.1%
Recall (100-best)  | 80.2% | 79.3% | 87.5%

The 1-best precision of the transliteration model is fairly low. This is caused by noise in the training data and by word pairs that are not correct transliterations. Precision could be improved by tightening the probability threshold during mining. However, since the final goal is to improve machine translation, this is not the main concern here: for the overall quality of the translation system, recall matters more than precision. The final evaluation step follows.


Experiments with the translation system: Table 7.3 presents the evaluation of the three integration methods described above, together with the number of OOV words in each test set. BLEU is used to score each method. Method 1, which replaces each OOV word with its best transliteration, improves BLEU by about +0.13 on average. This small gain can be attributed to the low 1-best precision of the transliteration system in Table 7.2. Method 2, which transliterates OOV words in a post-decoding step, yields an average improvement of +0.39 points. Method 3, which integrates the transliteration phrase table into the decoder during translation, raises the average only marginally further, to +0.41. However, the difference in quality between methods 3 and 2 is not clear-cut, since method 2 outperforms method 3 in half of the cases.

Table 7.3: Evaluation of the translation system, B0 = Baseline
Lang  Test     B0     M1     M2     M3     OOV
AR    iwslt11  26.75  +0.12  +0.36  +0.25  587
AR    iwslt12  29.03  +0.10  +0.30  +0.27  682
BN    jhu12    16.29  +0.12  +0.42  +0.46  1239
FA    iwslt11  20.85  +0.10  +0.40  +0.31  559
FA    iwslt12  16.26  +0.04  +0.20  +0.26  400
HI    jhu12    15.64  +0.21  +0.35  +0.47  1629
RU    wmt12    33.95  +0.24  +0.55  +0.49  434
RU    wmt13    25.98  +0.25  +0.40  +0.23  799
TE    jhu12    11.04  -0.09  +0.40  +0.75  2343
UR    jhu12    23.25  +0.24  +0.54  +0.60  827
Avg            21.9   +0.13  +0.39  +0.41  950
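As a quick arithmetic check on Table 7.3, the sketch below recomputes the averages and counts how often method 2 beats method 3. The row values are copied from the table; the pairing of test sets with languages follows the order of the rows, and the variable names are chosen for this example only.

```python
# (language, test set, B0, M1, M2, M3, OOV) copied from Table 7.3
rows = [
    ("AR", "iwslt11", 26.75, 0.12, 0.36, 0.25, 587),
    ("AR", "iwslt12", 29.03, 0.10, 0.30, 0.27, 682),
    ("BN", "jhu12",   16.29, 0.12, 0.42, 0.46, 1239),
    ("FA", "iwslt11", 20.85, 0.10, 0.40, 0.31, 559),
    ("FA", "iwslt12", 16.26, 0.04, 0.20, 0.26, 400),
    ("HI", "jhu12",   15.64, 0.21, 0.35, 0.47, 1629),
    ("RU", "wmt12",   33.95, 0.24, 0.55, 0.49, 434),
    ("RU", "wmt13",   25.98, 0.25, 0.40, 0.23, 799),
    ("TE", "jhu12",   11.04, -0.09, 0.40, 0.75, 2343),
    ("UR", "jhu12",   23.25, 0.24, 0.54, 0.60, 827),
]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


avg_b0 = round(mean(r[2] for r in rows), 1)
avg_gain = {name: round(mean(r[i] for r in rows), 2)
            for name, i in (("M1", 3), ("M2", 4), ("M3", 5))}
avg_oov = round(mean(r[6] for r in rows))
m2_beats_m3 = sum(1 for r in rows if r[4] > r[5])

print(avg_b0, avg_gain, avg_oov)
print(m2_beats_m3, "of", len(rows), "test sets favour method 2 over method 3")
```

The recomputed averages agree with the Avg row, and the win count matches the statement that method 2 is better than method 3 in half of the cases.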


8 IMPLEMENTATION

8.1 POPULAR STATISTICAL MACHINE TRANSLATION TOOLS
Many statistical machine translation tools are available today, both free and commercial. Some of the most popular are listed below:
Moses (Koehn et al., 2007) is an open-source toolkit that is widely used in statistical machine translation. It was designed mainly for phrase-based translation but now also supports hierarchical models. Moses provides tools for the whole machine translation pipeline, supports several model types, and comes with comprehensive documentation.

Joshua (Li et al., 2009) is written in Java and implements the full pipeline for hierarchical machine translation. Besides hierarchical rule tables, it can also extract syntactic rules.

cdec (Dyer et al., 2010) is a flexible translation framework.

Ncode (Crego et al., 2011) implements a translation model based on the n-gram approach. Word reordering is handled by building a lattice in a preprocessing step, which is then passed to the translation model.

Phrasal (Cer et al., 2010) is an open-source machine translation toolkit written in Java and based on phrase-based translation. Phrasal can extract and translate discontinuous phrases.

NiuTrans (Xiao et al., 2012) is written in C++ and supports phrase-based, hierarchical phrase-based and syntax-based translation models.

Jane (Wuebker et al., 2012) is an open-source machine translation toolkit that supports phrase-based and hierarchical phrase-based statistical machine translation. It is written in C++ and provides algorithms and data structures suited to machine translation.

8.2 USING MOSES

8.2.1 INSTALLATION
This section describes how to install Moses on Fedora, a Linux-based operating system. The steps are as follows:
Use the su command to obtain administrator privileges.
Install the required packages with yum install [package name]. The packages to install are:

git
subversion


automake
libtool
gcc-c++
zlib-devel
python-devel
bzip2-devel
8.2.2 TEST RUN
First, download the sample models with the following commands:
cd /mosesdecoder
wget http://www.statmt.org/moses/download/sample-models.tgz
tar xzf sample-models.tgz
cd sample-models

Then run a test:
cd /mosesdecoder/sample-models
/mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out

Here the file in contains the German sentence "das ist ein kleines haus". If the program works correctly, the file out will contain the English sentence "this is a small house".
8.2.3 USING THE TRANSLITERATION MODULE
To use the transliteration module in Moses with method 2 of the paper, edit the config file as follows:
transliteration-module = "yes"
post-decoding-transliteration = "yes"
language-model-file = /path to the file containing the language model/

To use method 3, which integrates transliteration into the decoding process, add the following lines:
in-decoding-transliteration = "yes"
transliteration-file = /file containing the words to be transliterated/


With the post-decoding method, the list of OOV words is generated automatically by the decoder. With the in-decoding method, the user must supply the list of words to be transliterated. The in-decoding method is useful when a word already occurs in the corpus but may be treated as a loanword in some contexts. Conversely, the post-decoding method is used when the user does not want to add any words to the corpus.
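To illustrate what an OOV list contains, here is a minimal sketch that assumes an OOV word is simply a test token never seen on the source side of the training data. In Moses the decoder produces this list itself for the post-decoding method; the function `find_oov` below is a hypothetical stand-in and not part of Moses.

```python
def find_oov(train_sentences, test_sentences):
    """Return test tokens absent from the training vocabulary, in first-seen order."""
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    oov, seen = [], set()
    for sent in test_sentences:
        for tok in sent.split():
            if tok not in vocab and tok not in seen:
                seen.add(tok)        # report each OOV type only once
                oov.append(tok)
    return oov


# Toy data (hypothetical): whitespace-tokenised source-side sentences.
train = ["das ist ein haus", "das ist klein"]
test = ["das ist ein kleines haus", "kleines auto"]
print(find_oov(train, test))
```

A list produced this way could also serve as a starting point for the word list that the in-decoding method expects, with known words added by hand where transliteration is preferred.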

8.3 CORPORA

One well-known open corpus collection is OPUS. Its data is extracted from open documents on the internet and segmented into sentences. A parallel corpus typically consists of a source-language file, a target-language file, and a file recording the sentence alignment between the two.
OPUS website: http://opus.lingfil.uu.se/
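As a minimal sketch of how such a corpus might be loaded, the code below assumes the simplest layout: two plain-text files with one sentence per line, where line i of the source file is the translation of line i of the target file. The separate alignment file mentioned above is not handled, and the function name `read_parallel` and the file names are hypothetical.

```python
import os
import tempfile


def read_parallel(src_path, tgt_path):
    """Read two line-aligned files and return a list of (source, target) pairs."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        src = [line.rstrip("\n") for line in fs]
        tgt = [line.rstrip("\n") for line in ft]
    if len(src) != len(tgt):
        raise ValueError("source and target files have different numbers of lines")
    return list(zip(src, tgt))


# Demo on a tiny corpus written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "corpus.de")   # hypothetical file names
    tgt_file = os.path.join(tmp, "corpus.en")
    with open(src_file, "w", encoding="utf-8") as f:
        f.write("das ist ein kleines haus\ndas ist ein haus\n")
    with open(tgt_file, "w", encoding="utf-8") as f:
        f.write("this is a small house\nthis is a house\n")
    pairs = read_parallel(src_file, tgt_file)

for src, tgt in pairs:
    print(src, "|||", tgt)
```

The length check matters because a single missing line shifts every later pair and silently corrupts training data.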


9 REFERENCES

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. 2001. Bleu: a Method for Automatic Evaluation of Machine Translation. IBM Research Division.
Đinh Điền. 2006. Giáo trình xử lý ngôn ngữ tự nhiên. NXB Đại học Quốc gia TPHCM.
Philipp Koehn. 2009. Statistical Machine Translation. Pages 33-62. Cambridge University Press.
Haizhou Li, Zhang Min, and Su Jian. 2004. A joint source-channel model for machine transliteration. In ACL 04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain.
Chuong B Do and Serafim Batzoglou. 2008. What is the expectation maximization algorithm? In Nature Biotechnology 26, 897-899.
Hassan Sajjad, Alexander Fraser, and Helmut Schmid. 2012. A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining. In Proceedings of the 50th Annual Conference of the Association for Computational Linguistics, Jeju, Korea.
Nadir Durrani, Hassan Sajjad, Hieu Hoang, Philipp Koehn. 2014. Integrating an Unsupervised Transliteration Model into Statistical Machine Translation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden.
http://www.systransoft.com/systran/corporate-profile/translation-technology/what-is-machine-translation/
http://michaelnielsen.org/blog/introduction-to-statistical-machine-translation/
http://www.loqate.com/technology/transliteration/


TABLE OF CONTENTS
1 Introduction
2 Machine Translation (MT)
2.1 What is machine translation?
2.2 Rule-based machine translation
2.3 Statistical machine translation
2.4 Differences between rule-based and statistical machine translation
2.5 Problems in statistical machine translation
2.5.1 Sentence alignment
2.5.2 Statistical anomalies
2.5.3 Data dilution
2.5.4 Idioms
2.5.5 Different word orders
2.5.6 Out-of-vocabulary words (OOV words)
3 Common terms in machine translation
4 The out-of-vocabulary (OOV) word problem and approaches
5 Related background
5.1 Language model
5.2 Translation model
5.3 Decoder
5.4 Expectation Maximization
5.5 The BLEU evaluation method
5.5.1 Main idea
5.5.2 Requirements
5.5.3 Worked example
5.5.4 Advantages and disadvantages
5.6 Distinguishing translation from transliteration
6 Main method
6.1 Transliteration mining
6.2 Transliteration model
6.3 Integration into the machine translation model
6.4 Model estimation
7 Evaluation
8 Implementation
8.1 Popular statistical machine translation tools
8.2 Using Moses
8.2.1 Installation
8.2.2 Test run
8.2.3 Using the transliteration module
8.3 Corpora
9 References
