Integrating An Unsupervised Transliteration Model Into Statistical Machine Translation - NLP Final Report
FINAL REPORT
TOPIC: HANDLING OUT-OF-VOCABULARY WORDS
BY EXTRACTING LOANWORD PAIRS
FROM A PARALLEL CORPUS
1 INTRODUCTION
In today's world, where globalization has become a familiar concept, exchange and mutual learning between countries and regions matter more and more. Translation has therefore become increasingly common, helping people overcome language barriers and communicate freely. However, expert translators are not always at hand when we face an unfamiliar language, and no expert is fluent in every language. There are more than 6,500 languages in the world, and not all of them have experts available to translate them. To ensure that information is exchanged accurately, translation teams have to be built to translate texts, documents, and speech from one language into another. Building such teams can be very costly and is impractical in many situations. Automatic translation has therefore become a field that attracts considerable research attention, with a large number of published papers. Because automatic translation covers too many problems to treat at once, our group chose to study the handling of words missing from the corpus, specifically the construction of a transliteration system for words that do not appear in the corpus. We rely mainly on the paper "Integrating an Unsupervised Transliteration Model into Statistical Machine Translation (2014)" by Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp Koehn.
Objectives and scope of the project:
The objective of the project is to study the handling of out-of-vocabulary (OOV) words, specifically methods for handling transliterated words between English and Vietnamese, in order to improve the performance of statistical machine translation.
Rule-based machine translation works well across different translation domains and produces stable translations whose quality can be predicted. Customizing the dictionary raises quality and helps translate the terms that have been added to the data. However, the output can lack flexibility and fluency, and the customization needed to reach an acceptable quality level can be long and expensive. It performs well and does not depend on the computer's hardware.
Statistical machine translation achieves good quality when a large, high-quality corpus is available. Its translations read fluently and meet readers' expectations. However, translation quality is unstable and cannot be predicted in advance. Training is fully automatic and inexpensive. Training on a general corpus that does not belong to a specific domain leads to lower quality. Finally, statistical machine translation requires a certain level of hardware to build and manage large translation models.
Table 2.1 shows the differences between the two approaches above.
[Table 2.1: rule-based versus statistics-based machine translation]
$$pr(e) = \prod_{j=1}^{m} pr(e_j \mid e_1, \ldots, e_{j-1}) \tag{5.2}$$
From this it follows that

$$pr(e) = \prod_{j=1}^{m} pr(e_j) \tag{5.4}$$
Some trigrams never appear in the training corpus, so there exists a probability $pr(e_j \mid e_{j-1}, e_{j-2}) = 0$, which in turn gives $pr(e) = 0$. To remedy this ...
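The zero-probability problem can be illustrated with a minimal sketch. The toy corpus is hypothetical, and add-one (Laplace) smoothing is only an assumption chosen for illustration; the report's own remedy is not recoverable from this text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy training corpus (not taken from the report).
corpus = [
    "this is a small house".split(),
    "this is a big house".split(),
]

trigram_counts = Counter(g for sent in corpus for g in ngrams(sent, 3))
bigram_counts = Counter(g for sent in corpus for g in ngrams(sent, 2))
vocab = {w for sent in corpus for w in sent}


def trigram_prob(w2, w1, w, smooth=False):
    """Estimate pr(w | w2, w1), where w2 and w1 are the two preceding words.

    smooth=False: plain relative frequency, which is 0 for unseen trigrams.
    smooth=True:  add-one smoothing (an assumption made for this sketch).
    """
    num = trigram_counts[(w2, w1, w)]
    den = bigram_counts[(w2, w1)]
    if smooth:
        return (num + 1) / (den + len(vocab))
    return num / den if den else 0.0


def sentence_prob(tokens, smooth=False):
    """Multiply the trigram probabilities of a sentence."""
    p = 1.0
    for w2, w1, w in ngrams(tokens, 3):
        p *= trigram_prob(w2, w1, w, smooth)
    return p


# "is a house" never occurs in the corpus, so the unsmoothed product is 0.
unseen = "this is a house".split()
print(sentence_prob(unseen))
print(sentence_prob(unseen, smooth=True))
```

With smoothing, the unseen trigram receives a small non-zero probability, so a single unseen trigram no longer wipes out the probability of the whole sentence.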
5.3 DECODER
The task of the decoder is to find the best target-language sentence e, translated from the source-language sentence v, such that the value P(e) P(v|e) is maximal.
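The maximization above can be sketched as a plain argmax over a candidate list. This is only an illustration under strong assumptions: the candidate sentences and all scores below are made up, and a real decoder searches a very large hypothesis space instead of enumerating a short list. Log-probabilities are used so that the product P(e) P(v|e) becomes a sum.

```python
# Hypothetical log-scores for a toy example (all values are made up).
# lm_logprob[e]       stands for log P(e), the language model score.
# tm_logprob[(v, e)]  stands for log P(v|e), the translation model score.
source = "das ist ein kleines haus"

lm_logprob = {
    "this is a small house": -4.0,
    "this is a little house": -5.5,
    "small house this is": -9.0,
}

tm_logprob = {
    (source, "this is a small house"): -2.0,
    (source, "this is a little house"): -2.5,
    (source, "small house this is"): -1.5,
}


def decode(v, candidates):
    """Return the candidate e maximizing log P(e) + log P(v|e)."""
    return max(candidates, key=lambda e: lm_logprob[e] + tm_logprob[(v, e)])


best = decode(source, list(lm_logprob))
print(best)
```

Note that the third candidate has the best translation model score but loses overall because the language model penalizes its word order.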
... EM is known as Inside-Outside.
For smoothing in HMM-based models, we use a version of ...
... in each set of 10 tosses.
Maximization (M): once the completed data above has been obtained, assume that it is correct and apply maximum likelihood estimation to obtain new parameter values.
Repeat the E and M steps until the parameters converge.
In summary, the EM algorithm alternates between two steps: guessing a probability distribution that fills in the model's missing data (E step), and then re-estimating the model parameters from the completed data (M step). The name Expectation comes from the fact that we do not need a specific probability distribution over the whole data, only the expected probabilities obtained from completing the data. The name Maximization comes from the fact that re-estimating the model can be seen as maximizing the expected probability of the data. Next, we work through the concrete computation for the example above using Figure 5.1.
Maximum likelihood: for each set of 10 coin tosses, maximum likelihood counts the number of tails and heads for coins A and B, and from these counts computes the heads and tails probabilities of each coin.
Expectation maximization:
(1) EM starts by initializing the parameters, here $\theta_A^{(0)} = 0.60$ and $\theta_B^{(0)} = 0.50$.
(2) E step: a probability distribution is formed from the values computed so far. The values ... filled in.
(4) Keep repeating steps 2 and 3 until the values converge.
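The E and M steps for the two-coin example can be sketched as follows. The head counts are hypothetical (the data behind Figure 5.1 is not reproduced in this text), and the sketch assumes that each set of tosses is equally likely a priori to come from coin A or coin B. Only the initial values 0.60 and 0.50 come from the text.

```python
def em_two_coins(head_counts, flips_per_set, theta_a, theta_b, iterations=10):
    """Run EM for two coins with unknown heads probabilities.

    head_counts:   number of heads observed in each set of tosses
    flips_per_set: number of tosses in every set
    Which coin produced each set is the hidden (missing) information.
    """
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h in head_counts:
            t = flips_per_set - h
            # E step: posterior weight of each coin for this set.
            # The binomial coefficient is the same for both coins and cancels.
            like_a = theta_a ** h * (1 - theta_a) ** t
            like_b = theta_b ** h * (1 - theta_b) ** t
            w_a = like_a / (like_a + like_b)
            w_b = 1.0 - w_a
            # Fill in the data with expected (fractional) counts.
            heads_a += w_a * h
            tails_a += w_a * t
            heads_b += w_b * h
            tails_b += w_b * t
        # M step: maximum likelihood re-estimation from the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b


# Hypothetical data: heads observed in five sets of 10 tosses.
head_counts = [5, 9, 8, 4, 7]
theta_a, theta_b = em_two_coins(head_counts, 10, 0.60, 0.50)
print(round(theta_a, 2), round(theta_b, 2))
```

The loop uses a fixed number of iterations for simplicity; a convergence check on the change in the two parameters would match step (4) more closely.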
Figure 5.1: Parameter estimation for complete and incomplete data. (a) Maximum likelihood estimation. (b) Expectation maximization.
... when necessary.
5.5.2 REQUIREMENTS
Needs several good human translations to serve as references.
Relies on modified n-gram precision.
Applies a penalty to translations that are too short or excessively abbreviated.
5.5.3 ILLUSTRATIVE EXAMPLE
Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Even at a glance, candidate 1 is clearly the better translation. Now consider the reference translations:
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.
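The intuition that candidate 1 is better can be checked with modified (clipped) n-gram precision, the quantity BLEU builds on. The sketch below is a minimal illustration: lowercasing and the letters-only tokenizer are assumptions of this sketch, not something the report specifies.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens only (an assumption of this sketch)."""
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams).

    Each candidate n-gram count is clipped to the largest count of that
    n-gram in any single reference.
    """
    cand = ngram_counts(tokenize(candidate), n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(tokenize(ref), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())


candidate_1 = ("It is a guide to action which ensures that the military "
               "always obey the commands of the party.")
candidate_2 = ("It is to insure the troops forever hearing the activity "
               "guidebook that party direct.")
references = [
    "It is a guide to action that ensures that the military will forever "
    "heed Party commands.",
    "It is the guiding principle which guarantees the military forces "
    "always being under the command of the Party.",
    "It is the practical guide for the army always to heed directions of "
    "the party.",
]

print(modified_precision(candidate_1, references, 1))
print(modified_precision(candidate_2, references, 1))
```

For unigrams, every word of candidate 1 except "obey" is found in some reference, while most words of candidate 2 are not.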
$$BP = \begin{cases} 1, & \text{if } c > r \\ e^{(1 - r/c)}, & \text{if } c \le r \end{cases} \tag{5.9}$$

$$BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$
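The two formulas can be combined in a short sketch. Here c is taken to be the candidate length, r the reference length, and p_n the modified n-gram precisions. The uniform weights w_n = 1/N and the convention of returning 0 when some p_n is 0 are assumptions of this sketch.

```python
import math


def brevity_penalty(c, r):
    """BP = 1 if c > r, otherwise exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(precisions, c, r, weights=None):
    """Combine modified n-gram precisions p_1..p_N into a BLEU score.

    precisions: list of p_n values in [0, 1]
    weights:    list of w_n; uniform 1/N if omitted (an assumption)
    """
    n = len(precisions)
    if weights is None:
        weights = [1.0 / n] * n
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; treat the score as 0
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(c, r) * math.exp(log_sum)


# Hypothetical precisions and lengths, for illustration only.
print(bleu([0.8, 0.6, 0.4, 0.3], c=18, r=16))
print(bleu([0.8, 0.6, 0.4, 0.3], c=12, r=16))
```

The second call uses the same precisions but a shorter candidate, so the brevity penalty lowers the score.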
5.5.4 ADVANTAGES AND DISADVANTAGES
Advantages:
Fast and inexpensive; it does not need much manpower for checking.
Can be used during development to track changes in the performance of the translation system.
Disadvantages:
Scores for individual sentences are usually poor and unstable.
BLEU measures similarity against the whole set of reference translations, which may leave some factors out of consideration.
All words are scored with equal weight.
2012. Transliteration Mining Using Large Training and Test Sets. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12.
The methods above require training data to be available in advance. Several approaches based on unsupervised mining have also been proposed:
Jae-Sung Lee and Key-Sun Choi. 1998. English to Korean Statistical Translitera-
$$p_1(e, f) = \sum_{a \in Align(e, f)} p(a) \tag{6.1}$$

$$p(a) = \prod_{j=1}^{|a|} p(q_j) \tag{6.2}$$
$$p_E(e) = \prod_{i=1}^{|e|} p_E(e_i) \tag{6.4}$$

$$p_F(f) = \prod_{i=1}^{|f|} p_F(f_i) \tag{6.5}$$
$$c(q) = \sum_{i=1}^{N} \sum_{a \in Align(e_i, f_i)} \frac{(1 - \lambda)\, p_1(a, e_i, f_i)}{p(e_i, f_i)}\, n_q(a) \tag{6.7}$$

$$p(q) = \frac{c(q)}{\sum_{q'} c(q')} \tag{6.8}$$
Similarly, we compute the expected count of non-transliterations by summing the posterior probabilities of non-transliteration over the word pairs:

$$c_{ntr} = \sum_{i=1}^{N} p_{ntr}(e_i, f_i) = \sum_{i=1}^{N} \frac{\lambda\, p_2(e_i, f_i)}{p(e_i, f_i)} \tag{6.9}$$

... divided by N.
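The re-estimation of the mixing weight can be sketched as below. This is a simplified illustration under stated assumptions: the mixture is taken to be p(e, f) = (1 - λ) p_1(e, f) + λ p_2(e, f), which is consistent with (6.7) and (6.9) but is not shown explicitly in the surviving text; the scores p_1 and p_2 are made-up numbers and are held fixed, whereas the full method also re-estimates the transliteration model on every iteration.

```python
def reestimate_lambda(p1, p2, lam, iterations=20):
    """Iteratively re-estimate the non-transliteration prior lambda.

    p1[i]: transliteration model score of word pair i (held fixed here)
    p2[i]: non-transliteration model score of word pair i (held fixed here)
    """
    n = len(p1)
    for _ in range(iterations):
        c_ntr = 0.0
        for a, b in zip(p1, p2):
            mixture = (1 - lam) * a + lam * b  # assumed form of p(e_i, f_i)
            c_ntr += lam * b / mixture         # posterior of non-transliteration
        lam = c_ntr / n                        # expected count divided by N
    return lam


# Hypothetical scores: the first two pairs look like transliterations
# (p1 much larger than p2), the third does not.
p1 = [1e-3, 1e-3, 1e-9]
p2 = [1e-8, 1e-8, 1e-6]
lam = reestimate_lambda(p1, p2, lam=0.5)
print(round(lam, 3))
```

With one non-transliteration among three pairs, the estimate settles close to one third.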
7 EVALUATION
Here the authors experiment with 7 language pairs: Arabic, Bengali, Farsi, Hindi, Russian, Telugu, and Urdu, each paired with English. For Arabic and Farsi, the data comes from TED talks. For the Indic languages, the data is the Indic parallel corpus. For Russian, the authors use the WMT-13 data.
Transliteration mining: transliteration pairs are extracted from the parallel corpus using the aligned word pairs. Only 1-to-1 alignments are used here. Before mining, the word pairs are cleaned by removing digits, symbols, word pairs in which either word has fewer than 3 characters, and words containing foreign characters. Mining is run for 10 EM iterations. The number of extracted transliteration pairs is shown in Table 7.1.
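The cleaning step described above can be sketched as a simple filter. This is a minimal illustration, not the authors' code: deciding whether a character belongs to a script by looking at the prefix of its Unicode name is an assumption of this sketch, and the word pairs are made up.

```python
import unicodedata


def in_script(word, script):
    """True if every character is a letter or combining mark of `script`.

    `script` is a Unicode name prefix such as "LATIN" or "CYRILLIC"
    (a heuristic chosen for this sketch).
    """
    for ch in word:
        if unicodedata.category(ch)[0] not in ("L", "M"):
            return False  # digit, punctuation, or symbol
        if not unicodedata.name(ch, "").startswith(script):
            return False  # character from a foreign script
    return True


def clean_pairs(pairs, src_script, tgt_script, min_len=3):
    """Keep only word pairs that pass the length and script checks."""
    kept = []
    for src, tgt in pairs:
        if len(src) < min_len or len(tgt) < min_len:
            continue
        if in_script(src, src_script) and in_script(tgt, tgt_script):
            kept.append((src, tgt))
    return kept


# Hypothetical 1-to-1 aligned word pairs (Russian, English).
pairs = [
    ("москва", "moscow"),   # kept
    ("мы", "we"),           # dropped: fewer than 3 characters
    ("2014", "2014"),       # dropped: digits
    ("moscow", "moscow"),   # dropped: source is not Cyrillic
    ("москва", "mos-cow"),  # dropped: contains a symbol
]
print(clean_pairs(pairs, "CYRILLIC", "LATIN"))
```

Combining marks are accepted alongside letters so that scripts written with vowel signs, such as Devanagari, are not rejected by the letter check.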
Table 7.1: Number of words in the data and number of extracted transliteration pairs
Lang  Train_tm  Train_tr  Dev   Test1  Test2
AR    152K      6795      887   1434   1704
BN    24K       1916      775   1000   -
FA    79K       4039      852   1185   1116
HI    39K       4719      1000  1000   -
RU    2M        302K      1501  1502   3000
TE    45K       4924      1000  1000   -
UR    87K       9131      980   883    -
Transliteration system: before evaluating its integration into the machine translation system, the authors evaluate the transliteration system built from the extracted pairs. The test data consists of Arabic-English (1799 pairs), Hindi-English (2394 pairs), and Russian-English (1859 pairs). Table 7.2 shows the precision and recall of the transliteration mining system.
Table 7.2: Precision and recall of the transliteration mining system
                    AR     HI     RU
Precision (1-best)  20.0%  25.3%  46.1%
Recall (100-best)   80.2%  79.3%  87.5%
The 1-best precision of the transliteration model is rather low. This is caused by noise in the training data and by imperfect transliteration pairs. Precision could be improved by tightening the probability threshold used during mining. However, since the end goal here is to improve machine translation, we do not need to focus entirely on this issue. Recall is known to matter more than precision for the overall quality of the translation system. Next, we move on to the final evaluation step.
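The two measures in Table 7.2 can be sketched as follows, under an assumed reading of the table: 1-best precision is the fraction of test words whose top-ranked candidate equals the reference, and n-best recall is the fraction whose reference appears anywhere in the top n candidates. The test words and candidate lists are hypothetical.

```python
def one_best_precision(nbest, gold):
    """Fraction of source words whose first candidate matches the reference."""
    hits = sum(1 for src, cands in nbest.items()
               if cands and cands[0] == gold[src])
    return hits / len(nbest)


def nbest_recall(nbest, gold, n=100):
    """Fraction of source words whose reference is among the top n candidates."""
    hits = sum(1 for src, cands in nbest.items() if gold[src] in cands[:n])
    return hits / len(nbest)


# Hypothetical n-best transliteration lists, best candidate first.
nbest = {
    "word_1": ["london", "landon", "lundun"],
    "word_2": ["pares", "paris", "baris"],
    "word_3": ["maskva", "moskva", "moskwa"],
}
gold = {"word_1": "london", "word_2": "paris", "word_3": "moscow"}

print(one_best_precision(nbest, gold))
print(nbest_recall(nbest, gold, n=100))
```

The example shows why the two numbers can differ widely: a correct transliteration that is ranked second does not count for 1-best precision, but it is still available to the translation system through the n-best list.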
Lang: AR BN FA HI RU TE UR Avg
OOV: 587 682 1239 559 400 1629 434 799 2343 827 950
git
subversion
automake
libtool
gcc-c++
zlib-devel
python-devel
bzip2-devel
8.2.2 TRIAL RUN
First, download the sample models with the following commands:
cd /mosesdecoder
wget http://www.statmt.org/moses/download/sample-models.tgz
tar xzf sample-models.tgz
cd sample-models
Then run a test:
cd /mosesdecoder/sample-models
/mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out
Here the file in contains the German sentence "das ist ein kleines haus". If the program works correctly, the result written to the file out is the English sentence "this is a small house".
8.2.3 USING THE TRANSLITERATION MODULE
To use the transliteration module in Moses following method 2 of the paper mentioned above, edit the config file as follows:
transliteration-module = "yes"
post-decoding-transliteration = "yes"
language-model-file = /path to the file containing the language model/
TPHCM.
Philipp Koehn. 2009. Statistical Machine Translation. Pages 33-62. Cambridge University Press.
Haizhou Li, Zhang Min, and Su Jian. 2004. A joint source-channel model for machine transliteration. In ACL '04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain.
Chuong B Do and Serafim Batzoglou. 2008. What is the expectation maximiza-
Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp Koehn. 2014. Integrating an Unsupervised Transliteration Model into Statistical Machine Translation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden.
http://www.systransoft.com/systran/corporate-profile/translation-technology/what-is-machine-translation/
http://michaelnielsen.org/blog/introduction-to-statistical-machine-translation/
http://www.loqate.com/technology/transliteration/
TABLE OF CONTENTS
1 Introduction
7 Evaluation