
A Study of Revision Rules for the Output of the MST Method in Vietnamese Dependency Parsing


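Nguyễn Lê Minh
Japan Advanced Institute of Science and Technology

Hoàng Thị Điệp
University of Engineering and Technology, Vietnam National University, Hanoi

Trần Mạnh Kế
University of Engineering and Technology, Vietnam National University, Hanoi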

Abstract
Syntactic parsing plays an important role in text processing because it is an intermediate step in many larger tasks such as text summarization, machine translation, and automatic question answering. Dependency parsing has recently attracted the attention of many natural language processing research groups worldwide, because the dependency relation between two lexical items can help with disambiguation and because dependency syntax can model languages with free word order. In this report we present the Maximum Spanning Tree (MST) method for dependency parsing of Vietnamese sentences and use a rule-based tree revision component to improve the MST output. Finally, we report experimental results on a corpus of 450 Vietnamese sentences and propose directions for developing the MST method for this task.

1 Introduction

1.1 State of research on automatic dependency parsing for Vietnamese

Dependency parsing^1 has attracted the attention of the natural language processing research community in recent years [8], because dependency syntax is a sentence representation with many applications in complex tasks such as information extraction and text summarization. However, the approaches to this problem all rely on machine learning and require a corpus richly annotated with part-of-speech and dependency information, so to our knowledge no study of Vietnamese dependency parsing has been published so far.
1.2 Dependency syntax

Dependency syntax is a syntactic structure in which lexical items are linked to one another by asymmetric binary relations called dependencies [5]. A dependency relation can be given a name to make the link between the two items explicit.
Figure 2 illustrates the dependency structure of a Vietnamese sentence. Following the common convention in the dependency literature, the item at the tail of an arrow is the governing word, called the head, and the item at the arrowhead is the subordinate word, called the dependent.
Following [7], the structure can also be defined formally: the dependency structure of a given sentence is a directed graph whose root is an artificial node, usually inserted to the left of the sentence, and whose remaining nodes are the lexical items of the sentence. The graph has the following properties:
1. It is weakly connected (in the directed sense).
2. Every lexical item has exactly one incoming edge (except root, which has none).
3. It contains no cycles.
4. If the sentence has n items (including root), the graph has exactly (n-1) edges.

^1 English term: dependency parsing
Thanks to this modeling, dependency syntax can represent languages with free word order (see also Section 2.3). This is something that phrase structure syntax^2, which suits languages with many strict rules of sentence formation, cannot do. This does not mean, however, that languages with fixed word order must be analyzed only with phrase structure, or that languages with free word order must be analyzed only with dependency structure [10].
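The four graph properties listed in this section can be checked mechanically. The following is a minimal sketch, assuming a tree is stored as a list `heads` where `heads[j]` is the head index of item j and `heads[0]` is `None` for the artificial root; this representation and the function name are illustrative assumptions, not part of the original study.

```python
def is_dependency_tree(heads):
    """Check the dependency-graph properties for a head list.

    heads[j] is the head index of item j (1..n-1); heads[0] must be None
    because the artificial root has no incoming edge.  Storing one head per
    item already gives "exactly one incoming edge" and "n-1 edges"; what is
    left to verify is that every item reaches the root without a cycle.
    """
    n = len(heads)
    if n == 0 or heads[0] is not None:
        return False
    for j in range(1, n):
        seen = set()
        node = j
        while node != 0:
            if node in seen:
                return False  # walking up the heads loops: a cycle
            seen.add(node)
            head = heads[node]
            if head is None or not (0 <= head < n):
                return False  # missing or out-of-range head
            node = head
    return True
```

For example, `[None, 2, 0, 2]` describes a three-word sentence whose second word depends on root and governs the other two.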
1.3 The automatic dependency parsing problem

Dependency parsing means finding the dependency tree of a sentence. The goal of this study is to find the method that produces the most accurate dependency tree for an input Vietnamese sentence, that is, one that maximizes both the number of correct arcs in the tree and the number of correctly labeled arcs.

1.4 Summary of the approach taken in this report

Figure 1 describes how this study determines the dependency tree of a Vietnamese sentence. The process has two steps: (1) build a weighted directed graph using a weight model and reduce parsing to the problem of finding a maximum spanning tree^3 in that graph [7]; (2) automatically detect errors in the MST output tree and select suitable tree revision rules [9].
[Figure 1 diagram: input sentence -> MST parser (M1: edge-weight model of the graph, trained with MIRA) -> MST output -> reviser (M2: model trained with a multiclass perceptron) -> final output]

Figure 1 Diagram of the dependency parsing process studied

Model M1 is produced by the MIRA^4 machine learning method [11] from the training data. Model M2 is produced by a multiclass perceptron [11] trained on the combination of the MST output and the training data.

1.5 Outline of the report

In the following sections we first present several features of Vietnamese grammar (drawn mainly from linguistic references) that may be relevant to automatic dependency parsing. We then describe how the MST dependency parser is built and how the dependency tree reviser is built to improve its results. The evaluation method, the metrics, and the initial experimental results of these methods on Vietnamese are presented at the end of the report.

^2 English term: phrase structure syntax
^3 English term: Maximum Spanning Tree, abbreviated MST
^4 MIRA stands for Margin Infused Relaxed Algorithm

2 Some relevant features of Vietnamese grammar

This section presents several grammatical features of Vietnamese, both from the linguistic point of view (analyticity, isolating morphology, and word order [1]) and from the point of view of automatic dependency parsing (the projectivity condition [5] and the part of speech of the predicate [6]). Vietnamese grammar has many other features, but in this study we collect only those that may be relevant to dependency parsing.

Table 1 Summary of Vietnamese grammatical features

Feature                          Vietnamese
Analyticity                      yes
Isolating morphology             yes
Word order                       SVO
Projectivity condition           holds for most sentences, but not all
Part of speech of the predicate  verb, adjective, noun, some linking words
2.1 Analyticity [2]

An analytic language^5 is a language whose grammar and meaning are conveyed mainly through particles and word order rather than through inflection. The opposite of an analytic language is a synthetic language^6. Greek, Latin, German, Italian, Russian, Polish, and Czech are typical examples of synthetic languages. According to [2], Vietnamese, together with a number of other Southeast Asian languages (except Malay) and Chinese, is an analytic language.

2.2 Isolating morphology [2, 3]

The notion of an isolating language^7 is not identical to that of an analytic language. An isolating language is one in which most morphemes are free morphemes that qualify as words in their own right. The degree of isolation is measured by the ratio of the number of morphemes to the number of words. Isolating languages are common in Southeast Asia, including Vietnam, and Classical Chinese is also isolating.
2.3 Word order^8 [4]

In linguistics, word order typology is the study of how languages arrange the constituents of a sentence relative to one another and of how these arrangements are related. For most languages that have a major word class of nouns, a basic word order can be defined in terms of the finite verb (V) and its arguments, the subject (S) and the object (O). This gives six basic orders: SVO, SOV, VSO, VOS, OSV, and OVS. Vietnamese grammar belongs to the SVO type.
Besides the orders above, there is a notable class of languages with free word order, for example Latin, Czech, Hungarian, Polish, and Russian. These languages require more sophisticated methods in automatic dependency parsing.

^5 English term: analytic language
^6 English term: synthetic language
^7 English term: isolating language
^8 English term: word order

2.4 The projectivity condition^9 [5]

The projectivity condition for a dependency graph is stated formally in the lecture notes [5] as follows. A dependency graph is projective when

$$ i \to j \;\Longrightarrow\; i \to^{*} i' \quad \text{for every } i' \text{ such that } i < i' < j \text{ or } j < i' < i. $$

In words: if the j-th word depends on the i-th word, then every word i' lying between i and j must depend (possibly indirectly) on the i-th word.

Figure 2 Example of a Vietnamese sentence that does not satisfy the projectivity condition

Most sentences in our corpus (Section 5.1) satisfy the projectivity property described above, but Vietnamese still has compound sentences that are not projective, as illustrated in Figure 2. Such cases clearly need attention when studying dependency parsing algorithms for Vietnamese.
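The projectivity condition of this section translates directly into a check on a head list. The sketch below is illustrative only; it assumes the same hypothetical representation as before (`heads[j]` is the head of item j, `heads[0]` is `None` for root) and that the input is already a well-formed tree.

```python
def is_projective(heads):
    """Return True if every arc i -> j dominates all items between i and j.

    heads[j] is the head of item j; heads[0] is None for the artificial root.
    The input is assumed to be a valid tree (no cycles).
    """
    def dominates(i, k):
        # i ->* k holds if walking up from k through the heads reaches i
        while k is not None:
            if k == i:
                return True
            k = heads[k]
        return False

    for j in range(1, len(heads)):
        i = heads[j]
        for between in range(min(i, j) + 1, max(i, j)):
            if not dominates(i, between):
                return False
    return True
```

In the second test tree used below, item 1 depends on item 3 while item 2, which lies between them, depends directly on root, so the arc 3 -> 1 violates the condition.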
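2.5 Part of speech of the predicate in Vietnamese sentences

The notion of the key word of a sentence in dependency parsing (the item that depends on the artificial root node) corresponds to the notion of the predicate in linguistics. In English the predicate is always a verb, but in Vietnamese the part of speech of the predicate varies widely. The examples below are taken from chapter 1, section 2.2, "Các kiểu câu cơ bản của tiếng Việt" (Basic sentence types of Vietnamese), of the book Ngữ pháp Việt Nam [6]. The predicate is the word or phrase in bold.

Part of speech of the predicate   Example
verb                              Gip a cho T t bo.
adjective                         Trăng sáng quá. (The moon is so bright.)
noun                              Em bé này sáu tuổi. (This child is six years old.)
linking word "là"                 Anh này là thợ mộc. (This man is a carpenter.)
linking word "bằng"               Cái áo này bằng lụa. (This shirt is made of silk.)
linking words "tại", "do", "bởi"  Việc này tại nó. (This matter is because of him.)
                                  Hàng này do họ làm. (These goods are made by them.)
linking word "để"                 Bn y ung nc.
linking word of location          Ông tôi ở ngoài vườn. (My grandfather is out in the garden.)
linking word "như"                nh hoa vng.
linking word "của"                Xe này của Giáp. (This vehicle is Giáp's.)
                                  Hàng này của họ làm. (These goods are of their making.)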

3 Building a dependency parser with the MST approach

In [7], Ryan McDonald proposes a graph-based approach that reduces dependency parsing to the problem of finding a maximum spanning tree of a weighted directed graph (the MST problem). MST comes in two versions: first order and second order. First-order MST is simpler, and experiments on our Vietnamese corpus show that it also gives better results, so within the scope of this study we restrict ourselves to first-order MST.

^9 English term: projectivity
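3.1 Reduction to the MST problem

For each sentence x we define a graph $G_x$ with vertex set $V_x$ and edge set $E_x$ as follows:

$$ V_x = \{x_0 = \mathit{root}, x_1, \dots, x_n\} $$
$$ E_x = \{(i, j) : x_i \neq x_j,\ x_i \in V_x,\ x_j \in V_x \setminus \{\mathit{root}\}\} $$

McDonald [7] shows that finding the highest-scoring (projective) dependency tree is equivalent to finding the maximum (projective) spanning tree of $G_x$ rooted at the artificial node root. Here the score of a tree factors into the sum of the scores of its individual edges, a factorization that has been shown to be simple and effective. This factorization is what gives first-order MST its name. The features presented in Section 3.2 and the algorithms presented in Section 3.3 are likewise the versions associated with first-order MST.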
3.1.1 Scoring an edge

The score of edge (i, j) is the dot product of the feature vector representing the edge and a weight vector:

$$ s(i, j) = w \cdot f(i, j) $$

Here f(i, j) is shorthand for f(x, i, j), since it also contains features of the sentence x.

The score of a dependency tree y for sentence x is therefore

$$ s(x, y) = \sum_{(i,j) \in y} s(i, j) = \sum_{(i,j) \in y} w \cdot f(i, j) $$
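As a small illustration of the edge-factored score, the sketch below assumes that the features of an edge are stored as a list of strings (binary features) and that the weights are a plain dictionary. These data structures and names are assumptions made for the example, not the representation used in MST.

```python
def edge_score(w, feats):
    """s(i, j) = w . f(i, j) for a sparse binary feature vector.

    w maps feature names to weights; feats lists the features that fire.
    """
    return sum(w.get(f, 0.0) for f in feats)


def tree_score(w, edge_feats, tree):
    """s(x, y): sum of the scores of all edges (i, j) in the tree y.

    edge_feats maps each candidate edge (i, j) to its feature list.
    """
    return sum(edge_score(w, edge_feats[edge]) for edge in tree)


# Hypothetical weights and features for a two-word sentence
w = {"hp=V": 1.5, "hp,dp=V|P": 0.5, "hp=ROOT": 2.0}
edge_feats = {
    (0, 2): ["hp=ROOT", "dp=V"],
    (2, 1): ["hp=V", "hp,dp=V|P"],
}
total = tree_score(w, edge_feats, [(0, 2), (2, 1)])
```

Because the score is a sum over edges, changing one edge of the tree changes the total only by the difference between the two edge scores, which is what makes the spanning-tree search possible.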

a) Basic unigram features
   xi-word, xi-pos
   xi-word
   xi-pos
   xj-word, xj-pos
   xj-word
   xj-pos

b) Basic bigram features
   xi-word, xi-pos, xj-word, xj-pos
   xi-pos, xj-word, xj-pos
   xi-word, xj-word, xj-pos
   xi-word, xi-pos, xj-pos
   xi-word, xi-pos, xj-word
   xi-word, xj-word
   xi-pos, xj-pos

c) Part-of-speech features between the two items
   xi-pos, b-pos, xj-pos

   Part-of-speech features surrounding the two items
   xi-pos, xi-pos+1, xj-pos-1, xj-pos
   xi-pos-1, xi-pos, xj-pos-1, xj-pos
   xi-pos, xi-pos+1, xj-pos, xj-pos+1
   xi-pos-1, xi-pos, xj-pos, xj-pos+1

Figure 3 Features used in first-order MST^10

^10 In this figure, word denotes the lexical item, pos its part of speech, +1 the position to the right, and -1 the position to the left.

3.2 Features examined

The experimental results reported in this study use simple feature vectors f (illustrated in Figure 3) that do not yet include the Vietnamese-specific properties discussed in Section 2. Specifically, for an arc (i, j) we consider:
+ Groups a and b: the part of speech and the lexical item of the two ends of arc (i, j), in unigram and bigram contexts.
+ In addition, if item i or item j has more than 5 characters, a feature built from its 5-character prefix.
+ Group c: to complement the dependency-tree context (groups a and b), we consider items in the sentence context, namely the part of speech of the items lying between item i and item j, plus the part of speech of the items immediately to the right and left of item i and item j.
The author of [7] tried many variations and showed experimentally that this feature set is the most effective one for English dependency parsing.
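The feature templates of Figure 3 can be sketched as a function that turns an arc into a list of feature strings. The feature naming scheme (`a1`, `b1`, `c0`, ...), the padding symbol, and the function name below are assumptions made for illustration; the actual MST implementation may encode features differently.

```python
def edge_features(words, tags, i, j):
    """Feature strings for a candidate arc i -> j (head i, dependent j).

    words and tags include the artificial root at position 0.
    """
    def pos(k):
        # Part of speech at position k, or a padding symbol outside the sentence
        return tags[k] if 0 <= k < len(tags) else "<pad>"

    wi, pi, wj, pj = words[i], tags[i], words[j], tags[j]
    templates = {
        # (a) basic unigram features
        "a1": (wi, pi), "a2": (wi,), "a3": (pi,),
        "a4": (wj, pj), "a5": (wj,), "a6": (pj,),
        # (b) basic bigram features
        "b1": (wi, pi, wj, pj), "b2": (pi, wj, pj), "b3": (wi, wj, pj),
        "b4": (wi, pi, pj), "b5": (wi, pi, wj), "b6": (wi, wj), "b7": (pi, pj),
        # (c) part of speech of the neighbours of the two items
        "c1": (pi, pos(i + 1), pos(j - 1), pj),
        "c2": (pos(i - 1), pi, pos(j - 1), pj),
        "c3": (pi, pos(i + 1), pj, pos(j + 1)),
        "c4": (pos(i - 1), pi, pj, pos(j + 1)),
    }
    feats = [name + "=" + "|".join(vals) for name, vals in templates.items()]
    # (c) part of speech of every item strictly between i and j
    for b in range(min(i, j) + 1, max(i, j)):
        feats.append("c0=" + "|".join((pi, tags[b], pj)))
    # extra 5-character prefix feature for items longer than 5 characters
    for name, word in (("pre_i", wi), ("pre_j", wj)):
        if len(word) > 5:
            feats.append(name + "=" + word[:5])
    return feats


words = ["<root>", "Tôi", "tiếp_tục", "giúp"]
tags = ["ROOT", "P", "MD", "V"]
```

With these inputs, `edge_features(words, tags, 3, 1)` yields the 17 template features plus one in-between feature for the item at position 2.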
3.3 Algorithms for finding the dependency tree

Assume that the weights of the graph $G_x$ have been set (Section 3.1).

3.3.1 The Eisner algorithm for the projective case

a) Idea
The Eisner algorithm is a bottom-up dynamic-programming chart parsing algorithm with time complexity $O(n^3)$. It improves on the CKY chart parsing algorithm, which has time complexity $O(n^5)$, by parsing the left dependents of an item independently of its right dependents and combining them later.

Figure 4 The cubic-time Eisner parsing algorithm

Figure 4 illustrates the algorithm. The symbols r, s, and t denote the start and end indices of chart items, and h1, h2 denote the indices of the heads of chart items. Initially all items are complete, which is shown by right-angled triangles. The algorithm then creates incomplete items spanning the words from h1 to h2 (with h1 the head of h2). Such an item is eventually completed. As in other CKY-style parsing, larger items are built bottom-up from pairs of smaller items.

b) Pseudocode
Figure 5 is the pseudocode McDonald [7] gives for the Eisner algorithm. C[s][t][d][c] is the dynamic-programming table that stores the score of the best subtree from position s to position t, $s \le t$, with direction d and completeness value c. The variable $d \in \{\leftarrow, \rightarrow\}$ indicates the direction of the subtree (whether it gathers left or right dependents). If $d = \leftarrow$ then t is the head of the subtree; if $d = \rightarrow$ then s is the head. The variable $c \in \{0, 1\}$ indicates whether a subtree is complete (c=1, no more dependents can be added) or incomplete (c=0, still to be completed).
The line marked (*) means that to find the best score for an incomplete left subtree we only need to find the index r, $s \le r < t$, that gives the highest possible score when joining two complete subtrees.

Figure 5 Pseudocode of the Eisner algorithm

Under the constraint that there is a single root at the left of the sentence, the score of the best tree for the whole sentence is $C[1][n][\rightarrow][1]$.
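Since the pseudocode figure did not survive, the following is a minimal runnable sketch of a first-order Eisner parser that follows the description of the table C[s][t][d][c] above. It is a reconstruction under stated assumptions, not the code of [7]: positions are 0-based with the artificial root at index 0, `scores[h][m]` is a hypothetical dense matrix of arc scores, and backpointers are added so that the tree itself can be recovered.

```python
def eisner(scores):
    """Best projective tree for scores[h][m] (arc h -> m); node 0 is root.

    Returns a head list with heads[0] = None.
    """
    n = len(scores)
    LEFT, RIGHT = 0, 1  # LEFT: t is the head of the span; RIGHT: s is the head
    # C[s][t][d][c] with c = 1 complete, c = 0 incomplete; B holds split points
    C = [[[[0.0, 0.0], [0.0, 0.0]] for _ in range(n)] for _ in range(n)]
    B = [[[[None, None], [None, None]] for _ in range(n)] for _ in range(n)]

    def best(candidates):
        # candidates yields (value, split point); the first maximum wins
        return max(candidates, key=lambda pair: pair[0])

    for k in range(1, n):
        for s in range(n - k):
            t = s + k
            # Incomplete items: join two complete halves and add one arc
            val, r = best((C[s][r][RIGHT][1] + C[r + 1][t][LEFT][1], r)
                          for r in range(s, t))
            into_s = scores[t][s] if s != 0 else float("-inf")  # root has no head
            C[s][t][LEFT][0], B[s][t][LEFT][0] = val + into_s, r
            C[s][t][RIGHT][0], B[s][t][RIGHT][0] = val + scores[s][t], r
            # Complete items: an incomplete item absorbs a complete one
            C[s][t][LEFT][1], B[s][t][LEFT][1] = best(
                (C[s][r][LEFT][1] + C[r][t][LEFT][0], r) for r in range(s, t))
            C[s][t][RIGHT][1], B[s][t][RIGHT][1] = best(
                (C[s][r][RIGHT][0] + C[r][t][RIGHT][1], r)
                for r in range(s + 1, t + 1))

    heads = [None] * n

    def backtrack(s, t, d, c):
        if s == t:
            return
        r = B[s][t][d][c]
        if c == 0:
            if d == LEFT:
                heads[s] = t
            else:
                heads[t] = s
            backtrack(s, r, RIGHT, 1)
            backtrack(r + 1, t, LEFT, 1)
        elif d == LEFT:
            backtrack(s, r, LEFT, 1)
            backtrack(r, t, LEFT, 0)
        else:
            backtrack(s, r, RIGHT, 0)
            backtrack(r, t, RIGHT, 1)

    backtrack(0, n - 1, RIGHT, 1)
    return heads


# Hypothetical scores for a three-word sentence (index 0 is root)
example_scores = [[0.0] * 4 for _ in range(4)]
example_scores[0][2] = 10.0
example_scores[2][1] = 5.0
example_scores[2][3] = 5.0
example_heads = eisner(example_scores)
```

The three nested loops over k, s, and the split point r give the cubic running time mentioned above. Note that this sketch, like the basic algorithm, allows root to take more than one dependent.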
3.3.2 The Chu-Liu-Edmonds algorithm for the non-projective case

Figure 6 The Chu-Liu-Edmonds algorithm for finding the maximum spanning tree of a directed graph

Figure 6 is Georgiadis's sketch of the Chu-Liu-Edmonds algorithm. In words: for each vertex in the graph, the algorithm greedily selects the incoming edge with the highest weight. If the selected edges form a tree, that tree is the maximum spanning tree. Otherwise the selected edges must contain a cycle. The procedure in the figure detects a cycle, contracts it into a single vertex, and recomputes the weights of the edges entering and leaving the cycle.
The author also shows that a maximum spanning tree of the contracted graph is equivalent to a maximum spanning tree of the original graph, so the algorithm can call itself recursively on the new graph. In its simplest form the algorithm runs in $O(n^3)$ time. MST uses Tarjan's improved version, which has time complexity $O(n^2)$ on dense graphs [7].
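The verbal description above can be made concrete with the simple recursive form of the algorithm. The sketch below is illustrative and makes several assumptions that are not in the original: nodes are integers, `scores` is a dictionary from arcs `(head, dependent)` to weights, every non-root node has at least one incoming arc, and no arc enters root. It is the straightforward $O(n^3)$ variant, not Tarjan's faster version.

```python
def find_cycle(heads):
    """Return the nodes of one cycle in the head map, or None."""
    for start in heads:
        path, node = [], start
        while node in heads and node not in path:
            path.append(node)
            node = heads[node]
        if node in path:
            return path[path.index(node):]
    return None


def chu_liu_edmonds(scores, root=0):
    """Maximum spanning arborescence; returns {dependent: head}."""
    nodes = {h for h, _ in scores} | {m for _, m in scores}
    # Step 1: greedily pick the best incoming arc of every non-root node
    heads = {}
    for m in nodes:
        if m != root:
            heads[m] = max((s, h) for (h, d), s in scores.items()
                           if d == m and h != m)[1]
    cycle = find_cycle(heads)
    if cycle is None:
        return heads
    # Step 2: contract the cycle into a fresh node c and rescore its arcs
    c = max(nodes) + 1
    in_cycle = set(cycle)
    cycle_score = sum(scores[(heads[v], v)] for v in cycle)
    new_scores, origin = {}, {}
    for (h, m), s in scores.items():
        if m == root or (h in in_cycle and m in in_cycle):
            continue
        if m in in_cycle:    # arc entering the cycle: it replaces one cycle arc
            key, val = (h, c), s - scores[(heads[m], m)] + cycle_score
        elif h in in_cycle:  # arc leaving the cycle
            key, val = (c, m), s
        else:
            key, val = (h, m), s
        if key not in new_scores or val > new_scores[key]:
            new_scores[key], origin[key] = val, (h, m)
    # Step 3: solve the contracted graph, then expand c back into the cycle
    contracted = chu_liu_edmonds(new_scores, root)
    result = {}
    for m, h in contracted.items():
        orig_head, orig_dep = origin[(h, m)]
        result[orig_dep] = orig_head
    for v in cycle:
        result.setdefault(v, heads[v])  # keep cycle arcs except the broken one
    return result


# Hypothetical weights: 0 = root, 1..3 = words; greedy choice gives cycle 1 <-> 2
demo = {(0, 1): 9, (0, 2): 10, (0, 3): 9, (1, 2): 20, (2, 1): 30,
        (2, 3): 30, (3, 2): 0, (1, 3): 3, (3, 1): 11}
demo_tree = chu_liu_edmonds(demo)
```

In the demo graph the greedy step selects 2 -> 1 and 1 -> 2, which form a cycle; after contraction the best entering arc is 0 -> 2, so the cycle is broken at node 2 and the final tree has total weight 70.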
3.4 Labeling dependency relations

3.4.1 Option 1: labeling jointly with finding the dependency tree

This is the option used in MST. We modify the scoring function of arc (i, j), which amounts to modifying the feature vector f so that it contains information about the name t of the dependency relation:

$$ s(i, j, t) = w \cdot f(i, j, t) $$
$$ s(x, y) = \sum_{(i,j,t) \in y} w \cdot f(i, j, t) $$

The author shows that, once w is fixed, the label t satisfying

$$ t = \arg\max_{t'} \; w \cdot f(i, j, t') $$

is also the label of arc (i, j) in the maximum spanning tree. It is therefore enough to build a table $b_t(i, j)$ that stores the best label for each arc and to use $s(i, j, b_t(i, j))$ during parsing.
Although this option uses shared knowledge to infer both the dependency tree and the relation names, it is fundamentally limited to local analysis, since it considers only features of individual edges of the tree. Moreover, with complexity $O(n^3 + |T| n^2)$ in the projective case and $O(|T| n^2)$ in the non-projective case, this option is not optimal when the number |T| of dependency relation names is very large.
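3.4.2 Option 2: labeling after the dependency tree has been found

Here we look for the label of each arc once the tree y over sentence x is known. An effective model tested by the author of [7] assigns labels to a sequence of arcs, corresponding to the sequence of dependents of an item i.
Let $x_{j_1}, \dots, x_{j_M}$ be the dependents of $x_i$, with corresponding dependency relation names $t_{(i,j_1)}, \dots, t_{(i,j_M)}$. The best label sequence for item $x_i$ is

$$ (t_{(i,j_1)}, \dots, t_{(i,j_M)}) = \bar{t}_{(i)} = \arg\max_{\bar{t}} \; s(\bar{t}, i, y, x) $$

Applying a first-order Markov factorization to the scoring function gives

$$ \bar{t}_{(i)} = \arg\max_{\bar{t}} \sum_{m=2}^{M} s(t_{(i,j_m)}, t_{(i,j_{m-1})}, i, y, x) $$

The Viterbi algorithm is then used to find the best sequence.

The scoring function is tied to a feature vector consisting of features of the edge under consideration, features of the other edges that share the same parent node, and features of the sentence context.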
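The Viterbi search over label sequences can be sketched as follows. The callback `score(m, t, t_prev)` is a hypothetical stand-in for the scoring function s above, with the head, tree, and sentence fixed; as an assumption of this sketch, it is also called for the first dependent with `t_prev=None`, and may simply return 0 there to match the sum from m=2 exactly.

```python
def viterbi_labels(num_deps, labels, score):
    """Best label sequence for the dependents of one head.

    score(m, t, t_prev) scores label t for the m-th dependent (0-based)
    given the label t_prev of the previous dependent (None when m == 0).
    """
    if num_deps == 0:
        return []
    best = [{t: score(0, t, None) for t in labels}]
    back = [{}]
    for m in range(1, num_deps):
        best.append({})
        back.append({})
        for t in labels:
            # Best previous label for ending in t at position m
            prev = max(labels, key=lambda p: best[m - 1][p] + score(m, t, p))
            best[m][t] = best[m - 1][prev] + score(m, t, prev)
            back[m][t] = prev
    # Follow the backpointers from the best final label
    seq = [max(labels, key=lambda t: best[-1][t])]
    for m in range(num_deps - 1, 0, -1):
        seq.append(back[m][seq[-1]])
    return seq[::-1]


def toy_score(m, t, t_prev):
    # Hypothetical scorer: prefer a subject first, and avoid repeating a label
    if t_prev is None:
        return 1.0 if t == "SBJ" else 0.0
    return 1.0 if t != t_prev else 0.0


toy_sequence = viterbi_labels(3, ["SBJ", "OBJ"], toy_score)
```

The running time is linear in the number of dependents and quadratic in the number of labels, which is why the first-order Markov factorization makes sequence labeling practical.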
3.5 Learning the weight model with MIRA

3.5.1 Reasons for choosing MIRA

All the algorithms above rely on the weight vector w. This vector is learned from the training data with the MIRA machine learning method. The properties that make MIRA suitable for dependency parsing and for Vietnamese are:
1) It is a discriminative learning method^11.
2) Unlike the current best methods (such as CRFs^12 and M3Ns^13), which all learn in batch mode, MIRA learns online. Properties 1 and 2 help produce models that work well when Vietnamese data is scarce.
3) Classification divides into many subproblems, one of which is structured learning with linear classifiers. Dependency parsing is a structured learning problem, and MIRA is one of the few machine learning methods that solve this problem effectively.
4) Given the current model, each MIRA update needs only an inference step and Hildreth's algorithm to solve a quadratic program. It does not need the complex forward-backward or inside-outside algorithms used by CRFs, nor the complex distribution and optimization computations of CRFs and M3Ns [7].
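3.5.2 The MIRA approach

MIRA can be seen as an online approximation of SVMs^14.

SVMs for structured learning:

$$ \min \lVert w \rVert \quad \text{such that} \quad s(x, y) - s(x, y') \ge L(y, y') \quad \text{for all } (x, y) \in T,\ y' \in \mathrm{parses}(x) $$

MIRA (at each update of w, choose the new weight vector closest to the old one):

$$ w^{(i+1)} = \arg\min_{w^{*}} \lVert w^{*} - w^{(i)} \rVert \quad \text{such that} \quad s(x_t, y_t) - s(x_t, y') \ge L(y_t, y') \ \text{(scores computed with } w^{*}\text{)} \quad \text{for all } y' \in \mathrm{parses}(x_t) $$

Figure 7 Comparison of MIRA and SVMs

Here L(y, y') is the loss of y' with respect to y, computed as the number of items in y' whose incoming arc differs from that in y, and parses(x) is the space of all possible dependency trees for sentence x.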
3.5.3 Approximating MIRA with k-best MIRA to avoid an exponential number of constraints

The margin constraints are applied only to the k dependency trees y' with the highest s(x, y'):

$$ w^{(i+1)} = \arg\min_{w^{*}} \lVert w^{*} - w^{(i)} \rVert \quad \text{such that} \quad s(x_t, y_t) - s(x_t, y') \ge L(y_t, y') \ \text{(scores computed with } w^{*}\text{)} \quad \text{for all } y' \in \mathrm{best}_k(x_t, w^{(i)}) $$

Figure 8 k-best MIRA

Figure 8 shows general k-best MIRA; in MST the author uses only k=1.

^11 English term: discriminative learning
^12 CRFs stands for Conditional Random Fields
^13 M3Ns stands for Maximum Margin Markov Networks
^14 SVMs stands for Support Vector Machines
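For k=1 there is a single margin constraint, and the quadratic program has a closed-form solution, so Hildreth's algorithm is not needed in that special case. The sketch below illustrates this closed form under stated assumptions: `feats_gold` and `feats_pred` are hypothetical sparse dictionaries holding the summed feature counts of the gold tree $y_t$ and of the current best tree $y'$, and `loss` is $L(y_t, y')$.

```python
def mira_update_1best(w, feats_gold, feats_pred, loss):
    """Smallest change to w that makes s(gold) - s(pred) >= loss.

    w, feats_gold, feats_pred are dicts from feature name to value.
    w is updated in place and returned.
    """
    # Difference of the two tree feature vectors: f(x, y_t) - f(x, y')
    diff = dict(feats_gold)
    for name, value in feats_pred.items():
        diff[name] = diff.get(name, 0.0) - value
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return w  # identical feature vectors: nothing to update
    margin = sum(w.get(name, 0.0) * v for name, v in diff.items())
    # Step size of the single-constraint quadratic program
    tau = max(0.0, (loss - margin) / norm_sq)
    for name, v in diff.items():
        w[name] = w.get(name, 0.0) + tau * v
    return w


weights = mira_update_1best({}, {"a": 1.0}, {"b": 1.0}, loss=2.0)
```

After the update in the example, the gold tree outscores the predicted tree by exactly the loss, and repeating the same update leaves the weights unchanged because the constraint is already satisfied.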

4 Revising the output of MST

To raise the accuracy of the dependency parser, we apply tree revision rules to the output of MST. The solution used here is the approach proposed by Giuseppe Attardi in [9]: treat the revision rules as classification labels, which turns the problem of revising the output of a dependency parser into a classification problem.

4.1 Reduction to a classification problem

4.1.1 Atomic revisions

The author defines a fixed set of atomic revisions on the tree (shown in Table 2), each of which amounts to changing the head of an item $x_i$ (because in a dependency tree every item has exactly one head).
Table 2 Atomic revisions on the tree

Symbol  Atomic revision
r       set the head to root
u       move the head up to the parent of the head
-n      move the head to the n-th item to the left
+n      move the head to the n-th item to the right
[       set the head to the head of the preceding constituent
]       set the head to the head of the following constituent
>       set the head to the first item of the preceding constituent
<       set the head to the first item of the following constituent
d--     move the head down to its leftmost child
d++     move the head down to its rightmost child
d-1     move the head down to its first left child
d+1     move the head down to its first right child
dP      move the head down to the item with part of speech P

4.1.2 Revision rules

Several atomic revisions often have to be applied to one item to obtain the desired result, so the author introduces the notion of a revision rule. A revision rule is a sequence of at most 4 atomic revisions.

4.1.3 Formal statement of the problem

Let y = (x, E) be the dependency tree of sentence x. A revision rule is a mapping $r: E \to E$ that turns an arc e = (i, t, j) into an arc e' = (i, t, s). The revised tree is r(y) = (x, E') where $E' = \{r(e) : e \in E\}$. By treating each revision rule as a label, we turn tree revision into the problem of finding a label sequence for the items of sentence x, one label per item.
There are two options for building E': (1) apply all rules simultaneously; (2) apply each rule separately to create an intermediate tree and then continue searching for revision rules on that intermediate tree. Because option 2 can produce intermediate structures that are not trees, this study uses option 1 only.
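To illustrate option 1, the sketch below applies one rule per item, with every rule reading the original MST output tree. Only the atomic revisions r, u, -n, and +n are covered, and the sketch assumes that the offsets of -n and +n are counted from the item being revised; that reading, the list-based rule encoding, and the function names are assumptions made for the example rather than details given in [9].

```python
def apply_atomic(op, token, head, heads):
    """Return the new head of `token` after one atomic revision.

    heads is the original tree (heads[0] is None for the artificial root).
    """
    if op == "r":
        return 0
    if op == "u":
        return heads[head] if head != 0 else 0  # root has no parent
    if op[0] in "+-" and op[1:].isdigit():
        # Assumption: the offset is counted from the revised item itself
        target = token + int(op)
        if 1 <= target < len(heads) and target != token:
            return target
        return head  # out of range: leave the head unchanged
    raise ValueError("atomic revision not covered by this sketch: " + op)


def revise(heads, rules):
    """Apply all rules simultaneously; rules[j] is the rule for item j.

    An empty rule keeps the current head.  A rule has at most 4 atomic moves.
    """
    new_heads = list(heads)
    for j in range(1, len(heads)):
        if len(rules[j]) > 4:
            raise ValueError("a revision rule has at most 4 atomic revisions")
        head = heads[j]
        for op in rules[j]:
            head = apply_atomic(op, j, head, heads)
        new_heads[j] = head
    return new_heads


mst_output = [None, 2, 0, 2, 3]
rules = [[], ["r"], [], ["-1"], ["u"]]
revised = revise(mst_output, rules)
```

Because every rule is evaluated against the original tree, the order in which items are processed does not matter. As the text notes for intermediate structures, nothing in this procedure guarantees that the combined result is still a tree, so a validity check on the output is advisable.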
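4.2 Learning the model with a multiclass perceptron

4.2.1 Idea

The training data are pairs of a feature vector $f_i$ and a rule $r_i$ ($f_i$ can be generated from $y_i$; $r_i$ can be generated from the pair of trees consisting of the manually annotated tree $y^M_i$ and the MST output tree $y_i$). Note that each pair (f, r) corresponds to one item in the tree.
Our goal is to learn a function $C: F \to R$, where R is the space of revision rules and $r_1$ denotes the rule that leaves the tree unchanged.
A multiclass perceptron is used for this task. Each rule r has its own weight vector $w_r$, so the problem reduces to learning the weight vectors $w_r$:

$$ C(f) = \arg\max_{r} \langle f, w_r \rangle $$

The $w_r$ learned during training are normalized by averaging:

$$ w_r = \frac{1}{T} \sum_{t=1}^{T} w_r^{t} $$

where T is the number of training samples.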
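The classifier C and the averaging step can be sketched as a small averaged multiclass perceptron. This is an illustration under assumptions, not the implementation used in the study: features are strings, the data are hypothetical, and the average is taken over all updates of several passes rather than over a single pass of T samples.

```python
from collections import defaultdict


def train_averaged_perceptron(samples, rules, epochs=5):
    """samples: list of (feature list, gold rule); returns averaged weights."""
    w = {r: defaultdict(float) for r in rules}
    total = {r: defaultdict(float) for r in rules}
    steps = 0
    for _ in range(epochs):
        for feats, gold in samples:
            # C(f) = argmax_r <f, w_r> with the current weights
            guess = max(rules, key=lambda r: sum(w[r][f] for f in feats))
            if guess != gold:
                for f in feats:
                    w[gold][f] += 1.0
                    w[guess][f] -= 1.0
            # Accumulate the weights after every sample for the average
            steps += 1
            for r in rules:
                for f, value in w[r].items():
                    total[r][f] += value
    return {r: {f: v / steps for f, v in total[r].items()} for r in rules}


def classify(avg, feats):
    """Pick the rule whose averaged weight vector scores the features highest."""
    return max(avg, key=lambda r: sum(avg[r].get(f, 0.0) for f in feats))


# Hypothetical training pairs: "keep" leaves the head unchanged, "u" moves it up
train = [(["bias", "a"], "u"), (["bias", "b"], "keep")]
model = train_averaged_perceptron(train, ["keep", "u"])
```

Averaging makes the final weights less sensitive to the last few updates, which matters when the training set is as small as the one described in Section 5.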

4.2.2 Features examined

The features used to train the revision model are the lexical item, the part of speech, and the dependency relation name of the following objects: the current node and its parent, grandparent, great-grandparent, children, preceding item, and following item. In addition, feature pairs that occur more than 10 times in the training data are also considered.

4.3 Revising a tree with the trained model

When a tree $y_q$ is fed into the reviser, the feature vectors f for the items of $y_q$ are generated first. The remaining work is to combine f with the $w_r$ to revise the tree item by item. The complexity of the reviser is O(n).

5 Experiments on Vietnamese

5.1 Experimental data

The corpus used in the experiments consists of 450 Vietnamese sentences drawn at random from articles in many different sections of the online newspaper Vietnamnet.
The data were preprocessed (spelling errors corrected), manually annotated with part-of-speech and dependency information, and formatted according to the standard of the CoNLL-X 2006 international workshop [8].
The part-of-speech tag set and the dependency relation tag set are described in more detail in the documentation accompanying the corpus.
ID  FORM      LEMMA     CPOSTAG  POSTAG  FEATS  HEAD  DEPREL  PHEAD  PDEPREL
1   Tôi       tôi       P        P       _      3     NP-SBJ  _      _
2   sẽ        sẽ        R        R       _      3     ADVP    _      _
3   tiếp_tục  tiếp_tục  MD       MD      _      7     S       _      _
4   giúp      giúp      V        V       _      3     VP      _      _
5   ,         ,         SYM      SYM     _      3     DEP     _      _
6   Tâm       tâm       NP       NP      _      7     NP-SBJ  _      _
7   nói       nói       V        V       _      0     ROOT    _      _
8   ,         ,         SYM      SYM     _      7     DEP     _      _
9   tới       tới       IN       IN      _      3     PP      _      _
10  khi       khi       NN       NN      _      9     SBAR    _      _
11  họ        họ        P        P       _      13    NP-SBJ  _      _
12  không     không     R        R       _      13    ADVP    _      _
13  cần       cần       V        V       _      10    S       _      _
14  tôi       tôi       P        P       _      13    NP-OBJ  _      _
15  nữa       nữa       R        R       _      13    ADVP    _      _
16  .         .         SYM      SYM     _      7     DEP     _      _

Figure 9 Example of data in the CoNLL-X 2006 format for the sentence in Figure 2
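Data in the ten-column layout of Figure 9 can be read with a few lines of code. The sketch below assumes the standard CoNLL-X conventions (one token per line, tab-separated fields, a blank line between sentences); the function names and the dictionary representation are illustrative choices.

```python
FIELDS = ("id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel")


def read_conllx(text):
    """Parse CoNLL-X text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        token = dict(zip(FIELDS, cols))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


def heads_of(sentence):
    """Head list with a None entry at index 0 for the artificial root."""
    return [None] + [token["head"] for token in sentence]
```

The head list returned by `heads_of` is the representation on which tree-level checks and attachment scores can then be computed.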

5.2 Evaluation method and metrics

5.2.1 Evaluation method

Because the data require laborious manual processing, we have not yet been able to build a large corpus. To make the evaluation as reliable as possible with the 450 sentences available, we propose a flexible use of cross validation^15.
a) Evaluating MST
The data are split into 10 parts for cross validation.
b) Evaluating MST after revision
The data are split into 10 parts, denoted T1, ..., T10. To test the reviser on T1, we rotate MST over the 9 remaining parts (training MST on 8 parts and testing it on 1 part) and then pool the test outputs as the training data for the reviser.
The same is done for the other 9 parts, and the accuracies are averaged.
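The nested rotation in b) is easy to get wrong, so the following sketch spells out which parts play which role. It only builds the plan of part indices; the round-robin assignment of sentences to parts and the function names are assumptions for illustration, since the text does not say how the 450 sentences were divided.

```python
def make_parts(n_items, k=10):
    """Split item indices 0..n_items-1 into k parts (round-robin assumption)."""
    return [list(range(i, n_items, k)) for i in range(k)]


def revision_plan(k=10):
    """For each held-out part t, list the MST rounds on the other k-1 parts.

    plan[t] is a list of (mst_train_parts, mst_test_part) pairs.  The MST
    outputs on all mst_test_part sets are pooled to train the reviser, which
    is then tested on part t.
    """
    plan = {}
    for t in range(k):
        rest = [p for p in range(k) if p != t]
        plan[t] = [([q for q in rest if q != p], p) for p in rest]
    return plan


parts = make_parts(450)
plan = revision_plan()
```

With 10 parts, each held-out part requires 9 MST training rounds on 8 parts each, so the full evaluation of the reviser involves 90 MST training runs.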
5.2.2 Metrics

We use the two standard metrics for dependency parsing: UAS (Unlabeled Attachment Score), the accuracy when dependency relation names are ignored, and LAS (Labeled Attachment Score), the accuracy when dependency relation names are taken into account.
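Both metrics reduce to counting tokens. The sketch below assumes that gold and predicted analyses are given as parallel lists of `(head, label)` pairs and that every token is counted, punctuation included; whether the original evaluation excluded punctuation is not stated, so this is an assumption.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as fractions between 0 and 1.

    gold and pred are lists of (head, label) pairs, one per token,
    concatenated over all evaluated sentences.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses differ in length")
    if not gold:
        return 0.0, 0.0
    # UAS: the head is correct; LAS: head and label are both correct
    correct_heads = sum(1 for g, p in zip(gold, pred) if g[0] == p[0])
    correct_both = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct_heads / len(gold), correct_both / len(gold)


gold_tokens = [(2, "NP-SBJ"), (0, "ROOT"), (2, "NP-OBJ"), (2, "DEP")]
pred_tokens = [(2, "NP-SBJ"), (0, "ROOT"), (2, "ADVP"), (1, "DEP")]
uas, las = attachment_scores(gold_tokens, pred_tokens)
```

Since a labeled arc can only be correct if its head is correct, LAS never exceeds UAS.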
5.3 Experimental results

Table 3 Comparison of MST results before and after revision

Method                      UAS      LAS
First-order MST             67.70%   63.11%
First-order MST + revision  66.49%   61.76%

^15 English term: cross validation

Thus the accuracy of the parser actually decreases after revision, by 1.21 points UAS and 1.35 points LAS. We attribute this to two causes: the features used in the revision step differ from those used in MST, and the training data are too small (the corpus contains only about 2,200 distinct tokens, whereas a Vietnamese dictionary has about 11 thousand words). Although this is a first experimental system for Vietnamese and the corpus is limited, its accuracy is fairly close to the range reported for MST on several languages listed in [12] (LAS from 70.98% to 80.29% and UAS from 75.53% to 84.80%). This suggests that MST is a feasible direction for Vietnamese dependency parsing.

6 Conclusion

As one of the first studies of automatic dependency parsing for Vietnamese sentences, this paper presents the problem in detail. On the linguistic side, we collect the particular properties of Vietnamese grammar that can be modeled and added to the feature vectors to raise parser accuracy. The paper also proposes a dependency parsing model for Vietnamese that combines two models with promising results on English: the MST model and the dependency tree revision model. Initial experiments on the Vietnamese corpus that we built in the CoNLL-X 2006 format show that accuracy after revision decreases by about 2% in relative terms. In future work, the MST features themselves could be used in the revision component to make the system more consistent. The multiclass perceptron in the revision component could also be replaced by MIRA, to exploit the strengths of this discriminative learning method for structured learning and its suitability for limited language resources, which remain a major problem in Vietnamese language processing today.

References

[1] Wikipedia (accessed 24 May 2008). Vietnamese syntax. http://en.wikipedia.org/wiki/Vietnamese_syntax
[2] Wikipedia (accessed 24 May 2008). Analytic language. http://en.wikipedia.org/wiki/Analytic_language
[3] Wikipedia (accessed 24 May 2008). Isolating language. http://en.wikipedia.org/wiki/Isolating_language
[4] Wikipedia (accessed 24 May 2008). Word order. http://en.wikipedia.org/wiki/Word_order
[5] Ryan McDonald, Joakim Nivre (2007). Introduction to Data-Driven Dependency Parsing. Introductory Course, ESSLLI 2007.
[6] Diệp Quang Ban (2005). Ngữ pháp tiếng Việt. NXB Giáo Dục.
[7] Ryan McDonald (2006). Discriminative Training and Spanning Tree Algorithms for Dependency Parsing. University of Pennsylvania.
[8] CoNLL-X Shared Task: Multi-lingual Dependency Parsing. http://nextens.uvt.nl/~conll/
[9] G. Attardi, M. Ciaramita (2007). Tree revision learning for dependency parsing. Proceedings of HLT-NAACL 2007, Rochester.
[10] EAGLES (1996). Recommendations for the Syntactic Annotation of Corpora. http://www.ilc.cnr.it/EAGLES96/segsasg1/segsasg1.html
[11] K. Crammer and Y. Singer (2003). Ultraconservative Online Algorithms for Multiclass Problems. Journal of Machine Learning Research 3: pp. 951-991.
[12] J. Nivre, J. Hall, S. Kübler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret (2007). The CoNLL 2007 Shared Task on Dependency Parsing. Conference on Empirical Methods in Natural Language Processing and Natural Language Learning.
