
SỬ DỤNG BỘ GÁN NHÃN TỪ LOẠI XÁC SUẤT QTAG CHO VĂN BẢN TIẾNG VIỆT
A Case Study of the Probabilistic Tagger QTAG for Tagging Vietnamese Texts


Nguyễn Thị Minh Huyền, Vũ Xuân Lương, Lê Hồng Phương

Abstract
In this paper we describe in detail our experiments on tagging Vietnamese texts with QTAG, a language-independent probabilistic tagger. We use two part-of-speech (POS) tagsets at two different levels of granularity. Automatic tagging relies on a lexicon that lists the possible POS tags of each word and on a manually tagged corpus. We also describe the pre-processing step required for POS tagging, namely text tokenization (word segmentation).
Keywords: POS, lexicon, corpus, tokenization, probabilistic tagging, QTAG

INTRODUCTION

One of the fundamental problems of linguistic analysis is classifying words into part-of-speech (POS) classes according to how they are actually used in the language. Each part of speech corresponds to a particular morphology and a particular grammatical role. POS tagsets may vary with the notion of lexical unit that is adopted and with the linguistic information a given application needs to exploit [19]. In general, a word can belong to several parts of speech, and interpreting its meaning correctly depends on whether its part of speech has been identified correctly. Tagging a text means determining the part of speech of every word within that text. Once a body of texts has been tagged, that is, annotated with parts of speech, it can be widely used in information retrieval systems, speech synthesis applications, speech recognition systems, and machine translation systems.

For Vietnamese texts, POS tagging faces many difficulties. In particular, the classification of Vietnamese words itself remains controversial, and there is no agreed standard [3], [5], [8], [13], [18]. Our research serves two purposes at once: on the one hand, it is an effort to build tools for processing Vietnamese texts by computer in support of technological applications; on the other hand, these tools actively support linguists who study Vietnamese. In this report we present our approach and the results obtained in a first experiment with a purely probabilistic automatic tagger.

THE PART-OF-SPEECH TAGGING PROBLEM

In this section we give an overview of POS tagging techniques and of the steps involved in tagging Vietnamese texts. The tagging process can be divided into three steps [15]:

1. Segmenting the character string into a sequence of words. This step may be simple or complex depending on the language and on the notion of lexical unit. For English or French, for example, segmentation relies mostly on white space, although compounds and function-word phrases remain whose treatment is debated. In Vietnamese, white space is even less reliable as a marker of lexical unit boundaries, because compound words are very frequent.
2. A priori tagging, that is, finding for each word the set of all POS tags it may carry. This tag set can be obtained from a dictionary database or from a manually tagged corpus. A new word that has not appeared in the language resources can be given a default tag or the set of all tags. In inflectional languages, the morphology of the word is also used to guess its POS class.
3. Deciding the tagging result. This is the disambiguation step: for each word, the tag that best fits the context is selected from its a priori tag set. Many methods exist for this; the main distinction is between methods based on grammatical rules, of which Brill's method [2] is the best-known representative, and probabilistic methods [4]. There are also systems that use neural networks [16], hybrid systems that combine probabilistic computation with grammatical constraints [6], and tiered tagging [17].

In terms of language resources, the POS analysis methods in common use today rely on one of the following:

- A dictionary and disambiguation grammars [14].
- A tagged corpus [4], possibly accompanied by hand-built grammatical rules [2].
- An untagged corpus accompanied by linguistic information such as the POS tagset and a description of the relations between parts of speech and suffixes [10].
- An untagged corpus, with a POS tagset that is itself built automatically by statistical computation [11]. In this case the resulting tagset is hard to predict in advance.

POS taggers that use a dictionary and a grammar are close to syntactic parsers. Learning systems use a corpus to learn how to guess the part of speech of each word [1]. Since the mid-1980s these systems have been widely deployed, because building a sample corpus is much less costly than building a high-quality dictionary and a complete set of grammatical rules. Some systems use both a dictionary, to list the possible parts of speech of a word, and a sample corpus, to resolve ambiguity. Our tagger belongs to this group.

Taggers are usually evaluated by the accuracy of their output: accuracy = (number of correctly tagged words) / (total number of words in the text). The best current taggers reach an accuracy of 98% [15].

To apply this research to automatic POS tagging of Vietnamese, our group carried out the following steps:

1. Build a lexicon and choose the criteria for determining parts of speech during lexical analysis. Almost all entries in the lexicon carry POS information.
2. Build a tool that segments a text into lexical units.
3. Build a corpus in which POS ambiguity has been resolved by hand, after all possible tags have been assigned to each word automatically.
4. Build an automatic POS tagger based on the POS information in the lexicon and on the tag-combination rules learned from the tagged sample corpus.

In the rest of this report we present steps 1, 2 and 4 in turn.

BUILDING THE LEXICON AND DEFINING THE VIETNAMESE POS TAGSET

Within the national-level project KC01, "Research and development of technology for Vietnamese speech recognition, speech synthesis and language processing", the research group is building Vietnamese language resources comprising a lexicon and a corpus annotated with the parts of speech of its lexical units. These resources are intended to be of high quality, to follow international standards for data representation (cf. ISO TC37/SC4, http://www.tc37sc4.org), and to be easy to update and extend.

@ietLex

The lexicon

In Vietnamese, alongside units that are clearly words or clearly fixed expressions, such as idioms (sơn cùng thuỷ tận, tay xách nách mang...) and set phrases (lên lớp, lên mặt, ra vẻ), there are units that some consider words and others consider fixed expressions (such as xe lăn đường, máy quay đĩa, làm ruộng, lạnh ngắt, suy cho cùng, ...). Word boundaries in Vietnamese are a complex issue, and opinions differ in many cases [8]. To build our language resources we adopt the notion of lexical unit used in the Từ điển tiếng Việt dictionary [7] (compiled by the Institute of Linguistics). Throughout this dictionary, the approach to collecting vocabulary, standardising spelling and annotating parts of speech is clear and consistent. In addition, we added to the lexicon rarely used lexical units that occur in the corpus but were not collected in the dictionary. We also added newly emerged lexical units (not yet collected in the dictionary), together with frequently occurring names of people, places and organisations, to make processing easier.

Spelling in [7] follows the Regulations on Vietnamese spelling and Vietnamese terminology in textbooks, issued under Decision no. 240/QĐ of 5 March 1984 by the Minister of Education (for example the spelling of the vowel "-i", the spelling of "-uy", the placement of tone marks, the spelling of scientific terms, and the use of the letters f, j, w, z in foreign loanwords). In practice, Vietnamese texts are still inconsistent in where they place the tone mark in syllables that contain a medial glide. We therefore normalise texts to be consistent with the dictionary before passing them to the segmentation and tagging programs.

Building the POS tagset

Parts of speech reflect the different positions of words in the grammatical system. Reflecting all grammatical relations accurately would require a very large tagset, but the more tags there are, the harder tagging becomes. A compromise is therefore needed to obtain a tagset that is not too large and still of good quality. We chose to work with two tagsets. The first is a set of 8 parts of speech (noun, verb, adjective, pronoun, adverb, conjunction, particle, interjection) on which the linguistic community broadly agrees; it is presented in the grammar Ngữ pháp tiếng Việt [18] and is annotated on every entry in [7]. The second tagset is built by dividing each of these parts of speech into subcategories. Initially we used the subcategorisation of [18] directly.

The tags chosen in this way are then recorded in full in the lexicon, which serves as the database from which the program automatically determines whether a word extracted from a text is a noun, a verb, and so on, and whether a verb is intransitive or transitive, and so on. Alongside the lexicon is a corpus that we tagged by hand after running the segmentation program and looking up, for each word, all the tags found in the lexicon. While determining the tag of each word in actual texts, we found it necessary to add several POS tags to avoid cases where a word carries several tags at once (for example, a transitive verb of feeling or an intransitive verb of feeling). Building the sample set is thus also a process of refining the POS classification. We currently work with a finer-grained tagset of 47 parts of speech, plus one tag for words whose part of speech is undetermined.

WORD SEGMENTATION IN VIETNAMESE TEXTS

Problem statement. Given any Vietnamese sentence, split it into lexical units (words), or indicate which syllables are not in the dictionary (detection of new lexical units).

To solve this problem we use a data set consisting of a table of Vietnamese syllables (about 6,700 syllables) and a Vietnamese lexicon (about 30,000 words). The dictionaries are stored as text files encoded in TCVN or in precomposed Unicode (UTF-8). The program is written in Java and is open source (contact the authors).

Solution steps
1. Build a syllable automaton that recognises all Vietnamese syllables.
2. Build a lexical automaton that recognises all Vietnamese words.
3. Using these automata, build the graph corresponding to the sentence to be analysed, and use a graph search algorithm to list the possible segmentations.

The alphabet of the syllable automaton is the Vietnamese alphabet, and each transition is labelled with one character. For example, for the three syllables phương, pháp and trình we obtain the syllable-recognising automaton shown in Figure 1.

Figure 1. Building the syllable automaton


Algorithm for building the syllable automaton

Input: syllable dictionary
Output: syllable automaton
Algorithm:

1. Create the initial state q0;
2. Loop until the data are exhausted, taking one syllable at a time. Let the characters of the syllable be c_0, c_1, ..., c_{n-1}.
   a. p := q0; i := 0;
   b. While (i <= n-1):
      i. Take character c_i;
      ii. Among the transitions leaving state p, look for the one labelled c_i. If such a transition (p, q) exists:
         1. i := i + 1;
         2. p := q;
      iii. If no such transition (p, q) exists, exit loop b.
   c. For j from i to n-1:
      i. Create a new state q and mark it as non-final;
      ii. Add the transition (p, q) labelled c_j;
      iii. p := q;
   d. Mark q as a final state;

The lexical automaton is built in the same way, with one difference: instead of labelling each transition with a syllable, we label it with the number of the (final) state at which the syllable automaton recognises that syllable, in order to reduce the size of the lexical automaton. For example, take the two words phương pháp and phương trình, and suppose that feeding the syllables phương, pháp and trình through the syllable automaton leads to final states numbered n1, n2 and n3. The corresponding transitions of the lexical automaton are then labelled n1, n2 and n3 (Figure 2).

Figure 2. Building the lexical automaton
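The two constructions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the class `Automaton` and the helpers `build_syllable_automaton`, `build_word_automaton` and `recognize_word` are hypothetical names, states are simply numbered in order of creation, and the automaton is stored as a plain trie.

```python
from typing import Dict, Hashable, Iterable, List, Optional


class Automaton:
    """Deterministic automaton stored as a trie (illustrative sketch)."""

    def __init__(self) -> None:
        # transitions[p] maps an arc label to the target state q; state 0 is q0
        self.transitions: List[Dict[Hashable, int]] = [{}]
        self.final: List[bool] = [False]

    def add(self, labels: Iterable[Hashable]) -> int:
        """Insert a label sequence and return the number of its final state."""
        p = 0
        for label in labels:
            q = self.transitions[p].get(label)
            if q is None:
                # No arc (p, q) carries this label: create a non-final state
                q = len(self.transitions)
                self.transitions.append({})
                self.final.append(False)
                self.transitions[p][label] = q
            p = q
        self.final[p] = True
        return p

    def accept(self, labels: Iterable[Hashable]) -> Optional[int]:
        """Return the final state reached, or None if the input is rejected."""
        p: Optional[int] = 0
        for label in labels:
            p = self.transitions[p].get(label)
            if p is None:
                return None
        return p if self.final[p] else None


def build_syllable_automaton(syllables: Iterable[str]) -> Automaton:
    fsa = Automaton()
    for syllable in syllables:
        fsa.add(syllable)  # one arc per character
    return fsa


def build_word_automaton(lexicon: Iterable[str], syllable_fsa: Automaton) -> Automaton:
    fsa = Automaton()
    for word in lexicon:
        # Arcs carry syllable-automaton state numbers instead of syllables
        ids = [syllable_fsa.accept(s) for s in word.split()]
        if None not in ids:
            fsa.add(ids)
    return fsa


def recognize_word(word: str, syllable_fsa: Automaton, word_fsa: Automaton) -> bool:
    ids = [syllable_fsa.accept(s) for s in word.split()]
    return None not in ids and word_fsa.accept(ids) is not None


# The example from the text: three syllables, two words
syllable_fsa = build_syllable_automaton(["phương", "pháp", "trình"])
word_fsa = build_word_automaton(["phương pháp", "phương trình"], syllable_fsa)
print(recognize_word("phương trình", syllable_fsa, word_fsa))
```

Because the two words share the syllable phương, the lexical automaton in this sketch needs only four states (the initial state plus one per arc n1, n2, n3), which mirrors the sharing shown in Figure 2.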


Algorithm for building the lexical automaton

Input: lexicon, syllable automaton
Output: lexical automaton
Algorithm:

1. Create the initial state q0;
2. Loop until the data are exhausted, taking one entry word at a time. Let the syllables of word be s_0, s_1, ..., s_{n-1};
3. Use the syllable automaton to recognise these syllables, obtaining the numbers of the corresponding (final) states m_0, m_1, ..., m_{n-1}.
   a. p := q0; i := 0;
   b. While (i <= n-1):
      i. Take the number m_i;
      ii. Among the transitions leaving state p, look for the one labelled m_i. If such a transition (p, q) exists:
         1. i := i + 1;
         2. p := q;
      iii. If no such transition (p, q) exists, exit loop b.
   c. For j from i to n-1:
      i. Create a new state q and mark it as non-final;
      ii. Add the transition (p, q) labelled m_j;
      iii. p := q;
   d. Mark q as a final state.

Once the two automata have been built, we write them to two typed files for use in the segmentation step. If each character (char) is written to the file in 2 bytes (Unicode) and each integer (int) takes 4 bytes, the file holding the syllable automaton is 146 KB and the file holding the lexical automaton is 1 MB.

The idea of the segmentation algorithm is to reduce sentence segmentation to path finding in an unweighted directed graph. Suppose the input sentence is a sequence of n+1 syllables s_0, s_1, ..., s_n. We build a graph with n+2 vertices v_0, v_1, ..., v_n, v_{n+1}, ordered on a line from left to right, in which there is an arc from v_i to v_j (i < j) if the syllables s_i, s_{i+1}, ..., s_{j-1}, in that order, form a word. Each distinct segmentation of the sentence then corresponds to a path in the graph from the first vertex v_0 to the last vertex v_{n+1}. In practice, the most appropriate segmentation usually corresponds to the path with the fewest arcs.
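The graph formulation above can be sketched as follows. This is only an illustration under assumptions: the function name `segment` is hypothetical, the lexicon is a plain Python set rather than the lexical automaton, vertices are numbered 0..n for a sentence of n syllables, and the toy lexicon assumes that each single syllable of the example is itself an entry.

```python
from typing import List, Set


def segment(syllables: List[str], lexicon: Set[str]) -> List[List[str]]:
    """Return every segmentation that uses the fewest words (shortest paths)."""
    n = len(syllables)
    inf = float("inf")
    # dist[i] = fewest arcs needed to go from vertex i to the last vertex n
    dist = [inf] * (n + 1)
    dist[n] = 0
    # arcs[i] = vertices j > i such that syllables[i:j] form a lexicon word
    arcs: List[List[int]] = [[] for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if " ".join(syllables[i:j]) in lexicon:
                arcs[i].append(j)
                dist[i] = min(dist[i], 1 + dist[j])

    def paths(i: int) -> List[List[str]]:
        # Enumerate shortest paths from vertex i; only called when dist[i] is finite
        if i == n:
            return [[]]
        result = []
        for j in arcs[i]:
            if dist[j] + 1 == dist[i]:  # this arc lies on a shortest path
                word = " ".join(syllables[i:j])
                result.extend([word] + rest for rest in paths(j))
        return result

    # An unreachable last vertex means some syllable sequence is unknown
    return paths(0) if dist[0] != inf else []


toy_lexicon = {"thuộc", "địa", "bàn", "thuộc địa", "địa bàn"}
for option in segment("thuộc địa bàn".split(), toy_lexicon):
    print(" | ".join(option))
```

With this toy lexicon the sketch lists two shortest paths of two arcs each, "thuộc | địa bàn" and "thuộc địa | bàn", and it returns an empty list when a syllable cannot be covered by any lexicon entry, since the last vertex is then unreachable.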


When the sentence is ambiguous, the graph has more than one shortest path from the first vertex to the last. We list all shortest paths in the graph, produce all the corresponding segmentations, and let the user decide which one to choose according to the meaning or the context. For example, for a sentence containing the phrase "thuộc địa bàn" we obtain the graph shown in Figure 3.

Figure 3. An ambiguous case

This phrase is ambiguous between thuộc địa and địa bàn, so we obtain two segmentations: "thuộc địa / bàn" and "thuộc / địa bàn". Many such ambiguous phrases can be found in Vietnamese, for example "tổ hợp âm tiết", "bằng chứng cớ", ...

If the sentence contains a syllable that is not in the dictionary, the syllable automaton clearly cannot recognise it. As a result, the graph built from the sentence is not connected. Using this property, a disconnected graph makes it easy to detect that the unrecognised syllable is not in the syllable dictionary, that is, it is either misspelled or a new syllable (lexical unit).

Evaluation of results

With this approach, the problem of word segmentation in Vietnamese sentences is essentially solved, in particular the problem of isolating word combinations that correspond to a single lexical unit, which are usually fixed phrases, fixed expressions or idioms. For input sentences with lexical ambiguity, that is, with more than one possible segmentation, the program lists all possible segmentations and leaves the choice to the user. The correct segmentation is always among the listed options. Below are some input sentences and the corresponding segmentation results.

1. Nó | là | một | bản | tuyên ngôn | đặc sắc | của | chủ nghĩa nhân đạo | , một | tiếng | chuông | cảnh tỉnh | trước | hiểm họa | lớn lao | của | hành tinh | trước | sự | điên rồ | của | những | kẻ | cuồng tín
2. Trong khi | các | thành phần | tư bản chủ nghĩa | có | những | bước | phát triển | mạnh | hơn | thời kì | trước | thì | thế lực | của | giai cấp | địa chủ | vẫn | không hề | suy giảm.

Several difficult problems therefore remain for further study.

The first is resolving segmentation ambiguity: one correct option must be chosen among several. Feasible approaches to this problem include:

- Using grammatical rules written by linguists: parse the sentence under each possible segmentation and discard the segmentations that are syntactically wrong.
- Using a probabilistic-statistical method: collect statistics from a reasonably large Vietnamese corpus to estimate the probabilities of adjacent pairs or triples of parts of speech or of words, then choose the segmentation with the lowest probability of error.

Our existing Vietnamese syntactic parser can also recognise some lexically ambiguous sentences. For example, the sentence "bản sao chép mờ" has two possible segmentations, bản | sao chép and bản sao | chép; the parser finds that both are syntactically correct and produces the two corresponding parse trees. For the sentence "anh ấy rất thuộc địa bàn", although the phrase thuộc địa bàn has two segmentations, thuộc | địa bàn and thuộc địa | bàn, the parser accepts only one of them and produces the parse corresponding to that segmentation. The other segmentation is therefore wrong.

The second problem is handling proper names, abbreviations and names of foreign origin in the sentence. At present the segmentation program does not recognise expressions such as Nguyễn Văn A, Đại học Khoa học Tự nhiên, or ĐT. 8.20.20.20, 1.000$, 0,05%...

EXPERIMENTS WITH THE QTAG TAGGER FOR VIETNAMESE

QTAG is a probabilistic tagger developed by the Corpus Research group at the University of Birmingham and provided free of charge for research purposes (http://www.clg.bham.ac.uk/staff/oliver/software/tagger/). We modified this software to adapt it to Vietnamese texts and to allow it to use a lexicon with POS information in addition to a tagged corpus. With the agreement of its author, O. Mason, we have released the Vietnamese version of QTAG together with the language resources (vnQTAG) at http://www.loria.fr/equipes/led/outils.php.

The probabilistic tagging method

The idea of probabilistic POS tagging is to determine the probability distribution over the joint space of word sequences $S_w$ and tag sequences $S_t$. Once this distribution is known, disambiguating the parts of speech of a word sequence reduces to choosing the tag sequence that maximises the conditional probability $P(S_t \mid S_w)$ associating the tag sequence with the given word sequence. By Bayes' formula:

$$P(S_t \mid S_w) = \frac{P(S_w \mid S_t)\,P(S_t)}{P(S_w)}$$

Here the word sequence $S_w$ is known, so in practice it suffices to maximise $P(S_w \mid S_t)\,P(S_t)$. For every sequence $S_t = t_1 t_2 \dots t_N$ and every sequence $S_w = w_1 w_2 \dots w_N$:

$$P(w_1 w_2 \dots w_N \mid t_1 t_2 \dots t_N) = P(w_1 \mid t_1 t_2 \dots t_N)\,P(w_2 \mid w_1, t_1 t_2 \dots t_N) \cdots P(w_N \mid w_1 \dots w_{N-1}, t_1 t_2 \dots t_N)$$

$$P(t_1 t_2 \dots t_N) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1 t_2) \cdots P(t_N \mid t_1 \dots t_{N-1})$$

Simplifying assumptions are introduced to reduce the probabilistic model to a finite number of parameters. For each $P(w_i \mid w_1 \dots w_{i-1}, t_1 t_2 \dots t_N)$, the probability of a word given a tag sequence is assumed to be fully determined by the word's own tag, that is, $P(w_i \mid w_1 \dots w_{i-1}, t_1 t_2 \dots t_N) = P(w_i \mid t_i)$. The probability $P(w_1 w_2 \dots w_N \mid t_1 t_2 \dots t_N)$ then depends only on elementary probabilities of the form $P(w_i \mid t_i)$:

$$P(w_1 w_2 \dots w_N \mid t_1 t_2 \dots t_N) = P(w_1 \mid t_1)\,P(w_2 \mid t_2) \cdots P(w_N \mid t_N)$$

For the probabilities $P(t_i \mid t_1 \dots t_{i-1})$, the probability of a tag is assumed to be fully determined by the tags in a neighbourhood of fixed size $k$, that is:

$$P(t_i \mid t_1 \dots t_{i-1}) = P(t_i \mid t_{i-k} \dots t_{i-1})$$

Taggers generally take $k$ equal to 1 (bigram) or 2 (trigram). This probabilistic model is thus equivalent to a hidden Markov model in which the hidden states are the POS tags (or sequences of $k$ tags if $k > 1$) and the visible (observable) states are the words in the dictionary. Given a manually tagged sample corpus, the parameters of this model are easily estimated from relative frequencies, and the most probable tag sequence can then be found with the Viterbi algorithm.

The QTAG tagger

1.1.1 Training data

QTAG is a trigram tagger. It combines two sources of information: a word dictionary that lists each word with its possible tags and their frequencies, and a matrix of the tag triples that can occur consecutively in a text, with their frequencies. Both kinds of data are easily obtained from the tagged sample corpus. Punctuation marks and other symbols in the text are treated as lexical units whose tag is the punctuation mark itself.
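As a rough illustration of how these two resources can be collected from a tagged corpus, consider the sketch below. It is not QTAG's actual code: the function name `train`, the dummy tag `<s>`, the one-letter tags and the toy corpus are all assumptions made for the example, and padding is applied only at the start of each sentence.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Collect a word/tag frequency lexicon and tag-trigram counts."""
    lexicon = defaultdict(Counter)  # word -> Counter mapping tag -> frequency
    trigrams = Counter()            # (t1, t2, t3) -> frequency
    for sentence in tagged_sentences:
        tags = ["<s>", "<s>"]       # two dummy tags pad the sentence start
        for word, tag in sentence:
            lexicon[word][tag] += 1
            tags.append(tag)
        for i in range(2, len(tags)):
            trigrams[tuple(tags[i - 2:i + 1])] += 1
    return lexicon, trigrams


# Hypothetical two-sentence corpus with made-up one-letter tags
corpus = [
    [("tôi", "P"), ("nhìn", "V"), ("tranh", "N")],
    [("tranh", "N"), ("đẹp", "A")],
]
lexicon, trigrams = train(corpus)
print(dict(lexicon["tranh"]), trigrams[("P", "V", "N")])
```

Storing raw counts rather than probabilities keeps the sketch simple; relative frequencies can be derived from the counts when tagging.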
1.1.2 POS tagging algorithm

Algorithmically, QTAG works on a window of 3 words, after 2 dummy words have been added at the beginning and at the end of the text. Words are read one at a time and added to the window as it moves from left to right, one position at a time. The tag assigned to each word as it leaves the window is the final result. The tagging procedure is as follows:

1. Read the next token
2. Look up the token in the dictionary
3. If it is not found, assign all possible tags to the token
4. For each possible tag
5. compute Pw = P(tag|token), the probability that the token has this tag
6. compute Pc = P(tag|t1,t2), the probability that the tag occurs after the tags t1, t2, which are the tags of the two tokens preceding the token
7. compute Pw,c = Pw * Pc, combining the two probabilities
8. Repeat the computation for the other two tags in the window

After each recomputation (3 times for each word), the resulting probabilities are combined to give the overall probability of the tag assigned to the word. Because these values are usually small, they are computed as base-10 logarithms. The probability computed for each tag of a word expresses the confidence of that tag assignment for the word.
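Steps 4 to 7 can be sketched for a single token as follows. This is a deliberately simplified illustration, not QTAG's implementation: it scores a token once instead of three times across the window, the function name `tag_token` and the toy counts are hypothetical, and the add-one smoothing of the contextual probability is an assumption introduced here so that unseen trigrams do not produce log10(0).

```python
import math
from collections import Counter


def tag_token(token, prev_tags, lexicon, trigrams, all_tags):
    """Score each candidate tag of `token` given the two preceding tags."""
    # Unknown tokens may carry any tag, with equal weight
    candidates = lexicon.get(token) or {t: 1 for t in all_tags}
    total_w = sum(candidates.values())
    t1, t2 = prev_tags
    context_total = sum(trigrams[(t1, t2, t)] for t in all_tags)
    scores = {}
    for tag, freq in candidates.items():
        p_w = freq / total_w  # Pw = P(tag | token)
        # Pc = P(tag | t1, t2), with add-one smoothing (an assumption)
        p_c = (trigrams[(t1, t2, tag)] + 1) / (context_total + len(all_tags))
        scores[tag] = math.log10(p_w) + math.log10(p_c)  # log10(Pw * Pc)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical counts for illustration only
toy_lexicon = {"bàn": {"N": 3, "V": 1}}
toy_trigrams = Counter({("P", "V", "N"): 4, ("P", "V", "V"): 1})
toy_tags = ["N", "V", "P"]
best, scores = tag_token("bàn", ("P", "V"), toy_lexicon, toy_trigrams, toy_tags)
print(best, scores)
```

Working in logarithms turns the product Pw * Pc into a sum, which avoids numerical underflow when many small probabilities are combined.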
1.1.3 Running the tagger

After the word dictionary and the matrix of transition probabilities between tags have been built from the training data, QTAG takes as input a text that has already been segmented into words, one word per line. The program can print the sequence of POS tags with the corresponding probability for each word in the text, or only the final result, that is, the most probable tag.

Using QTAG for Vietnamese

1.1.4 Training data

The linguistics research group of the Lexicography Centre (Trung tâm Từ điển học) built a sample database comprising:

- A lexicon of 37454 entries, each accompanied by the list of all parts of speech it may have; units whose part of speech is undetermined are tagged X.
- Texts of several different genres (Vietnamese and foreign literature, science, newspapers), tagged by hand, comprising 63732 word tokens with 48 POS tags, together with a number of tags for punctuation marks and other symbols.
1.1.5 Experiments

As described above, the original QTAG tagger works only with a tagged sample corpus, which is used to "train" the probabilistic model. During tagging, when it meets a new unit (a word, a number, a mathematical symbol, ...) that did not appear in the sample set, QTAG assumes that the unit may carry any of the tags that appeared in the training set. Since our database includes an independent lexicon, we made the following changes:

1. All entries in our lexicon, as well as all entries in the training set, are loaded into the tagger's vocabulary.
2. When a new unit is met in the text to be tagged, the tagger checks whether it is a number or a proper name and, if so, assigns the number or proper-name tag.
3. In addition, the module that guesses the part of speech of a new word from its suffix, which is not applicable to Vietnamese, was removed.

Our experimental method is to take part of the tagged corpus as the training set for the probabilistic model. We then apply this model to tag the remaining texts automatically and compare the results with the sample data. The experiments were carried out for the 2 POS tagsets presented in section 3. For each of these levels we ran several experiments, corresponding to sample sets that differ in size and in writing style.
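The comparison between automatic tags and the hand-tagged reference amounts to the accuracy ratio used throughout the paper (correctly tagged words divided by total words). A minimal sketch follows; the function name `accuracy` and the two short tag sequences are made up for illustration and are not taken from the experiments.

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the reference tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    if not gold:
        raise ValueError("cannot evaluate an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical reference and system output for a five-token sentence
gold_tags = ["Nc", "Vto", "Nn", ",", "Vs"]
system_tags = ["Nc", "Vt", "Nn", ",", "Vs"]
print(f"{accuracy(system_tags, gold_tags):.2%}")
```

Punctuation tokens are counted here like any other token, in line with the paper's treatment of punctuation marks as lexical units.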
1.1.6 Evaluation of results

The program is implemented in Java, runs in any environment, and can use either (precomposed) Unicode or TCVN encoding for Vietnamese. The program code is only about 16 KB, and the source code is easy to modify and reuse. Training or tagging with about 32000 word tokens takes about 30 seconds in either case. When the XML output format is selected, the tagging result for a sentence looks like the following example:

<w pos="Nc">hồi</w>
<w pos="Vto">lên</w>
<w pos="Nn">sáu</w>
<w pos=",">,</w>
<w pos="Vs">có</w>
<w pos="Nu">lần</w>
<w pos="Pp">tôi</w>
<w pos="Jt">đã</w>
<w pos="Vt">nhìn</w>
<w pos="Vt">thấy</w>
<w pos="Nn">một</w>
<w pos="Nt">bức</w>
<w pos="Nc">tranh</w>
<w pos="Jd">tuyệt</w>
<w pos="Aa">đẹp</w>

where: Nc - individual noun, Vto - directional transitive verb, Nn - quantity noun, Vs - existential verb, Nu - unit noun, Pp - personal pronoun, Jt - temporal adverb, Vt - transitive verb, Nt - classifier noun, Jd - degree adverb, Aa - qualitative adjective.

The best experimental results with the sample sets built so far reach an accuracy of ~94% for the first tagset (9 lexical tags and 10 tags for symbols), whereas the second tagset reaches only ~85% (48 lexical tags and 10 tags for symbols). Table 1 illustrates the tagging results for the first tagset: the figure given for each test is the accuracy. Without the lexicon (using only the tagged sample corpus), the results reach only ~80% and ~60% respectively.

The results of these initial experiments also suggest the following observations:

1. For the same initial sample size, because the level-2 tagset is much larger than the level-1 tagset, the error rate at level 2 is considerably higher than at level 1.
2. As expected, when texts of the same writing style are processed, the larger the sample set, the lower the error rate.
3. Sample sets containing texts of different writing styles affect the tagging results.

Table 1. POS tagging results at level 1 ("training" marks the texts used as the training set in each test)

Text / Genre                                   Word units   Test 1     Test 2     Test 3     Test 4
Chuyện tình 1 / Vietnamese novel               16787        91.53%     89.75%     training   training
Chuyện tình 2 / Vietnamese novel               14698        91.78%     90.39%     94.28%     93.82%
Hoàng tử bé / Foreign story                    18663        training   10.48%     training   training
Lược sử thời gian / Science book               11626        90.44%     training   91.42%     training
Muối của rừng / Vietnamese short story         3573         90.68%     11.42%     91.04%     91.32%
Những bài học / Vietnamese short story         8244         91.45%     10.24%     92.90%     92.89%
Công nghệ / Newspaper                          1162         88.81%     9.90%      89.24%     89.67%
Average accuracy                                            91.25%     89.77%     92.70%     93.04%

CONCLUSION

We have presented an approach to automatic POS tagging of Vietnamese texts. Although the accuracy of these initial results is not yet very high, they are promising for further research. Using the tagging results obtained, we will continue to extend the database of tagged sample texts and so improve the quality of the tagger. This database is also particularly useful for studying Vietnamese grammar. Studying grammar on the basis of tagged texts in turn helps us adjust the POS tagset so that the parts of speech best express the grammatical characteristics of lexical units. In addition, automatic word segmentation and automatic POS tagging tools actively support linguists in discovering linguistic phenomena that deserve study. In the hope of widening interest in this research, we are ready to share all the resources and tools we have built with the Vietnamese language processing research community.

REFERENCES

1. Abney S., "Part-of-Speech Tagging and Partial Parsing", in Young S. and Bloothooft G. (Eds), Corpus-Based Methods in Language and Speech Processing, Kluwer Academic Publishers, Dordrecht (The Netherlands), 1997.
2. Brill E., "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging", Computational Linguistics, 21(4), December 1995, p. 543-565.
3. Cao Xuân Hạo, Tiếng Việt - mấy vấn đề ngữ âm, ngữ pháp, ngữ nghĩa, NXB Giáo dục, 2000.
4. Dermatas E., Kokkinakis G., "Automatic Stochastic Tagging of Natural Language Texts", Computational Linguistics, 21(2), 1995, p. 137-163.
5. Diệp Quang Ban, Hoàng Văn Thung, Ngữ pháp tiếng Việt (2 volumes), NXB Giáo dục, 1999.
6. El-Bèze M., Spriet T., "Etiquetage probabiliste et contraintes syntaxiques", Actes de la conférence sur le Traitement Automatique du Langage Naturel (TALN95), Marseille, France, 14-16/6/1995.
7. Hoàng Phê (ed.), Từ điển tiếng Việt 2002, Nhà xuất bản Đà Nẵng - Trung tâm Từ điển học.
8. Hữu Đạt, Trần Trí Dõi, Đào Thanh Lan, Cơ sở tiếng Việt, NXB Giáo dục, 1998.
9. Kupiec J., "Robust Part-of-Speech Tagging Using a Hidden Markov Model", Computer Speech and Language, vol. 6, 1992, p. 225-242.
10. Levinger M., Ornan U., Itai A., "Learning morpho-lexical probabilities from an untagged corpus with an application to Hebrew", Computational Linguistics, 21(3), 1995, p. 383-404.
11. MacMahon J.G., Smith F.J., "Improving statistical language model performance with automatically generated word hierarchies", Computational Linguistics, 19(2), 1993, p. 313-330.
12. Mason O., Tufis D., "Tagging Romanian Texts: a Case Study for QTAG, a Language Independent Probabilistic Tagger", 1st International Conference on Language Resources and Evaluation (LREC98), Granada (Spain), 28-30 May 1998, p. 589-596.
13. Nguyễn Tài Cẩn, Ngữ pháp tiếng Việt, NXB Đại học Quốc gia Hà Nội, 1998.
14. Oflazer K., "Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction", Computational Linguistics, 22(1), 1996, p. 73-89.
15. Paroubek P., Rajman M., "Etiquetage morpho-syntaxique", Ingénierie des langues, chapitre 5, Hermes Science Europe, 2000.
16. Schmid H., "Part-of-Speech Tagging with Neural Networks", International Conference on Computational Linguistics, Kyoto, Japan, 1994, p. 172-176.
17. Tufis D., "Tiered Tagging and combined classifier", in Jelinek F. and Nöth E. (Eds), Text, Speech and Dialogue, Lecture Notes in Artificial Intelligence 1692, Springer, 1999.
18. Uỷ ban Khoa học Xã hội Việt Nam, Ngữ pháp tiếng Việt, NXB Khoa học Xã hội, Hà Nội, 1993.
19. Vergne J., Giguet E., "Regards théoriques sur le tagging", 5e conférence sur le Traitement Automatique du Langage Naturel (TALN98), Paris, 10-12 juin 1998.
