
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

HUYNH TAN TRUNG

A TEXT RECOGNITION AND CLASSIFICATION SYSTEM

MASTER'S THESIS IN INFORMATION TECHNOLOGY

Major: COMPUTER SCIENCE
Code: 60 48 01

SCIENTIFIC SUPERVISOR: Dr. TRAN THAI SON

Ho Chi Minh City - 2007

Acknowledgements

First of all, I would like to thank the University of Information Technology for organizing this program and giving me the opportunity to acquire new knowledge and the time to complete this Master's thesis.

I would also like to thank Dr. Tran Thai Son, who guided and encouraged me wholeheartedly so that I could complete this thesis.

I sincerely thank the lecturers who passed on valuable knowledge to us during the Master's course and the thesis work.

I sincerely thank my classmates for their help and encouragement while I was working on this thesis; in particular I thank Nguyen Thi Ngoc Hop, who helped me a great deal in completing it.

Finally, I dedicate this work to my family and loved ones, who cared for, taught and encouraged me so that I could achieve today's result.

REVIEWER'S COMMENTS

.....
Date: ... 2007
Reviewer

CHAPTER I. INTRODUCTION
I.1. Introduction
I.2. Overview of text classification and existing research
I.3. Objectives of the thesis
I.4. Research content
I.5. Results achieved

CHAPTER II. THEORETICAL BACKGROUND
II.1. Some definitions concerning text and language
II.1.1. Levels in language
II.1.2. Relations in language
II.2. Classification of languages
II.2.1. By origin
II.2.2. By typology
II.2.3. By word order
II.3. Characteristics of English
II.4. Summary of text classification methods for English
II.4.1. Naive Bayes (NB)
II.4.2. k-Nearest Neighbor (kNN)
II.4.3. Support Vector Machine (SVM)
II.4.4. Neural Network (NNet)
II.4.5. Linear Least Square Fit (LLSF)
II.4.6. Centroid-based vector
II.5. Basic characteristics of Vietnamese
II.6. Contrastive comparison of English and Vietnamese
II.7. Summary of text classification methods for Vietnamese
II.7.1. Maximum Matching: forward/backward
* Advantages
* Limitations
II.7.2. Transformation-based Learning (TBL)
* Content
* Advantages
* Limitations
II.7.3. Word segmentation model using WFST and a neural network
* Content
* Advantages
* Limitations
II.7.4. Dynamic programming
* Content
* Advantages
* Limitations
II.8. Description of the method used in the proposal
II.8.1. Choosing the approach for the thesis
II.8.2. Kernels for text strings
II.8.3. Theoretical basis of the Support Vector Machine (SVM)
II.8.4. Training the SVM
II.8.5. Classifying text

CHAPTER III. PROBLEM DESCRIPTION AND SOLUTION
III.1. Requirements for text classification
III.2. Program structure
III.2.1. Step 1: Data preprocessing
III.2.2. Step 2: Sentence segmentation
III.2.3. Step 3: Word segmentation
III.2.4. Step 4: Part-of-speech tagging to assign weights
III.2.5. Step 5: Using the algorithm to classify the text being read
III.3. Steps carried out in the program
III.3.1. Data preprocessing
III.3.2. Sentence segmentation
III.3.3. Word segmentation
III.3.4. Tagging to assign weights
III.3.5. Training
III.3.6. Text classification

CHAPTER IV. EXPERIMENTAL PROGRAM
IV.1.1. Data preparation
IV.1.2. Program description
IV.1.1. Installation
IV.1.2. Some program screens
IV.1.3. Installation
IV.1.4. Notes on data preparation
IV.1.5. Experimental results

CHAPTER V. CONCLUSION
CHAPTER VI. REFERENCES
CHAPTER VII. APPENDIX
VII.1. Database structure of the program
VII.2. Text recognition results
VII.3. Features of the text classification samples (excerpt)

CHAPTER I. INTRODUCTION

I.1. Introduction

Consider the following situations, which arise frequently in practice.

In the current era of explosive growth in information technology, the volume of digitized data used for storing and exchanging information has become enormous. This data is highly varied: plain-text files, MS Word documents, PDF documents, e-mail, HTML, and so on. Document files are stored on local machines or transmitted over the Internet. As time passes and/or the number of users grows rapidly, the files multiply, and at some point their number exceeds what can be managed by hand. Finding a particular document again then becomes difficult and complicated, especially when the searcher does not remember the exact sentences that appear in it.

Information on the Internet is abundant and rich enough to meet almost every information need. It is also updated and changed continuously, so a search returns a very large amount of matching information that is not yet organized into usable material for the searcher. Sorting the results by category (document group) by hand takes a great deal of time and effort.

These needs gave rise to the requirement for a text recognition and classification system that classifies documents first and only then performs the search. Many research works and practical applications already address text classification, but none fully meets users' needs, so the study and refinement of text classification algorithms and methods continues.

With the aim of contributing to research on text classification and its practical application, this thesis carries out the following tasks:
- Study and summarize a number of existing text classification methods (for English and Vietnamese) and then give some remarks and assessments.
- Study and apply to Vietnamese text classification a fairly recent theory: classification using string kernels together with the Support Vector Machine (SVM).
- Deliver an experimental computer program, with evaluation results, for text classification using string kernels combined with the Support Vector Machine (SVM).

I.2. Overview of text classification and existing research

Text recognition and classification is one of the classic problems in text data processing. Text data processing includes:
- Spelling checking (spelling checker)
- Grammar checking (grammar checker)
- Thesaurus
- Text analysis (text analyzer)
- Text classification
- Text summarization
- Speech synthesis (voice synthesis)
- Speech recognition (voice recognition)
- Automatic translation
- ...

Text classification analyzes the content of a document and then decides which of a set of predefined document groups it belongs to. Accurate classification therefore has to meet the following requirements:
- The documents in a given group must share some common criteria.
- Analysis of a document must capture its content well enough to identify those criteria in the document.
- Deciding the category of a document by comparison with the document groups requires well-defined quantitative measures, so that the group to which the document belongs can be determined accurately.

Text classification is therefore clearly a text data mining task. In data mining, text classification methods are based on decision methods such as Bayesian decision, decision trees, nearest neighbors, neural networks, ... These methods give acceptable results and are used in practice. Research on classifying Vietnamese text, however, is neither long-standing nor extensive, because Vietnamese has features that differ from English: words do not inflect, grammatical meaning lies outside the word, word boundaries are not determined by whitespace alone, and so on (see Section II.5, Basic characteristics of Vietnamese). A fair number of studies on this problem are listed in the references.

I.3. Objectives of the thesis

Because the problem is broad and the time available for the project is limited, the research objectives of this thesis focus on the following points:
- Study text classification techniques and a number of text classification methods, describe the essential requirements of each method and derive its advantages and drawbacks. The methods studied here are considered relatively new and have already been applied in domestic research projects.
- Study and apply techniques for processing the Vietnamese language:
  o A word segmentation method for Vietnamese (this thesis uses Maximum Matching: forward/backward, with some modifications to increase accuracy).
  o A method for analyzing and identifying Vietnamese text (this thesis uses the Support Vector Machine (SVM) based on the theory of string kernels).
- Build an experimental implementation of Vietnamese text recognition and classification based on the above work on word segmentation, string kernels and SVM.
- Draw conclusions that can be compared with other methods already in use, and outline directions for resolving the remaining problems.

I.4. Research content

The research in this thesis follows the objectives above closely:
- Study text analysis methods that have been proposed recently or that are widely used in practice.
- Based on the results of this study, select a new method for text classification: string kernels combined with the Support Vector Machine (SVM).
- Study methods for analyzing and segmenting sentences and words in Vietnamese, and state the advantages and drawbacks of each method.
- Based on the study of Vietnamese sentence and word analysis, propose a new way to increase the accuracy of that analysis and show that it is more accurate than the earlier approaches. Using the proposed sentence-word analysis together with string kernels and SVM, build an experimental program that brings the studied material together.
- To speed up programming and improve the effectiveness of the approach, reuse computation programs available as open source code. Specifically, the implementation uses the Vietnamese database of Dinh Dien; the program that reads and extracts text from PDF files is open source from http://sourceforge.net/; and the Support Vector Machine (SVM) program is the one by Chih-Jen Lin, available at http://www.csie.ntu.edu.tw/~cjlin

The conclusions are mainly experimental: they identify the parameters that can be used, so that the results can be compared with the methods and results reported by other authors.

I.5. Results achieved

The research and implementation work of this thesis achieved the following results:
- Studied and absorbed the text classification techniques currently used in practice.
- Mastered text classification with string kernels combined with the Support Vector Machine (SVM).
- Applied results from natural language processing research to a text classification program.
- Proposed a more accurate and faster approach to analyzing Vietnamese sentences.
- Built an experimental program that classifies Vietnamese text files.
- Drew conclusions and made recommendations for speeding up the program and limiting the errors that may occur.

CHAPTER II. THEORETICAL BACKGROUND

II.1. Some definitions concerning text and language

II.1.1. Levels in language

In order from smallest to largest, the units of language are:
- Phoneme: the smallest unit of sound that builds up the language and distinguishes the material form (the sound) of one unit from another, e.g. k-a-d (card); b-i-g (big).
- Morpheme: the smallest meaning-bearing unit (grammatical or lexical meaning), made up of phonemes, e.g. read-ing; book-s.
- Word: an independent meaning-bearing unit made up of one or more morphemes, with a naming function, e.g. I-am-reading-my-books.
- Phrase: two or more words related grammatically or semantically, e.g. "bức thư" (a letter), "mạng máy tính" (computer network), computer system.
- Sentence: words or phrases related grammatically or semantically whose basic function is to convey a message, e.g. I am reading my books.
- Text: a system of sentences linked together in form, vocabulary, semantics and pragmatics.

II.1.2. Relations in language

Each of these units in turn forms a subsystem within the larger system of the language. Each subsystem (made up of units of the same kind) is called a level, because the subsystems govern one another. Examples: the sentence level, the word level, the morpheme level, the phoneme level. The units of a language are related in complex ways and in many forms, but three relations are central:
- Hierarchical relation: a higher-level unit always contains lower-level units, and a lower-level unit is always contained in a higher-level one. Example: a sentence contains words.
- Syntagmatic relation: links language units into a chain when the language is in use. This is the linearity of language: units must follow one another in the speech stream, producing combinations called syntagms (syntagmes). Examples: "this book", "this book is interesting".
- Associative relation: a relation that strings together an element that is present with absent elements standing behind it, which can in principle replace it. Example: in "I read a book (newspaper, magazine, ...)", the words newspaper and magazine are equivalent to book and can replace it.

II.2. Classification of languages

II.2.1. By origin

Based on origin (diachronic study), the language families are:
- Indo-European: the Indic, Iranian, Baltic, Slavic, Romance, Greek and Germanic branches (Germanic includes German, English, Dutch, ...).
- Semitic: the Semitic, Egyptian, Cushitic and Berber branches.
- Turkic: Turkish, Azerbaijani, Tatar.
- Sino-Tibetan: the Chinese, Tibetan and Burmese branches.
- Austric: the Austro-Tai and Austroasiatic branches. Austroasiatic contains the Nahali, Munda, Nicobarese and Mon-Khmer groups. Mon-Khmer contains the Viet-Muong subgroup, to which Vietnamese belongs.

II.2.2. By typology

Based on the present-day features of languages (synchronic study), languages are divided, approximately, into the following types:
- Inflecting languages (flexional): this type includes German, Latin, Greek, English, French, Russian and Arabic.
- Agglutinating languages (agglutinate): one or more affixes are attached mechanically, one after another, to a root, and each affix always carries exactly one grammatical meaning. Examples: Turkish, Mongolian, Japanese, Korean.
- Isolating languages (isolate): also called amorphous, non-inflecting, monosyllabic or syllabic languages. This type includes Vietnamese, Chinese, ... and languages of Southeast Asia.
- Polysynthetic languages (polysynthetic): also called incorporating languages. This type carries features of the types above.

II.2.3. By word order

In terms of sentence-level word order, English and Vietnamese belong to the same type, S V O: in a normal (unmarked) sentence the constituents appear in the order S (subject), V (verb), O (object).

Example: "Tôi nhìn anh ấy" and "I see him" are both S V O.

According to statistics:
- SVO languages account for 32.4-41.8%, including English, French, Vietnamese, ...
- SOV languages account for 41-51.8%, for example Japanese.
- VSO languages account for 2-4%.
- VOS languages account for 9-18%.
- OSV languages account for about 1%.

Word order is the expression of the linearity of language. In the narrow sense it is the order of the constituents S, V, O as above. In the broad sense it is the order of elements at three levels of language units:
- Word: the order of syllables, morphemes and word elements within a compound. Example: Cha-Mẹ / Mẹ-Cha (father-mother / mother-father).
- Phrase: the order of words within a phrase, such as the order of modifiers in a noun phrase or of complements in a verb phrase.
- Sentence: the order of the constituents S, V, O in the sentence.

Some languages share the same sentence-level word order (English and Vietnamese are both SVO) but differ in the order of words inside phrases. For example, in English the adjective precedes the noun, whereas in Vietnamese it follows the noun.

II.3. Characteristics of English

English is classified as an inflecting language (flexion), with the following characteristics:
- In use, words change form, and grammatical meaning lies inside the word. Example: "I see him" and "he sees me".
- The main grammatical device is affixation. Example: learning and learned.
- Forming words by adding affixes to a root is very common. Example: anticomputerizational (anti-compute-er-ize-ation-al).
- Morphemes combine tightly, and the boundaries between morphemes are hard to determine.
- Word boundaries are identified by whitespace or punctuation.

II.4. Summary of text classification methods for English

English is currently one of the most widely used languages in the world, so text classification for English has been studied extensively. Only a few methods that are in use and have proved fairly effective are described here.

II.4.1. Naive Bayes (NB)

NB is a probability-based classification method widely used in machine learning (Mitchell, 1996; Joachims, 1997; Jason, 2001). It was first used for classification by Maron in 1961 and later became popular in many areas, such as search engines (described by Rijsbergen in 1970) and mail filters (described by Sahami in 1998).

* Idea

The basic idea of the Naive Bayes approach is to use the conditional probabilities between words and topics to predict the topic probability of a document to be classified. The key point is the assumption that all words in the document occur independently of one another. Under this assumption NB does not model the dependence of several words on a topic and does not combine words when judging the topic, so it runs faster than methods whose complexity is exponential.

* Formula

The goal is to compute Pr(C_j | d'), the probability that document d' belongs to class C_j. By Bayes' rule, d' is assigned to the class C_j with the highest Pr(C_j | d'). The following formula (proposed by Joachims in 1997) is used:

$$H_{BAYES}(d') = \arg\max_{C_j \in C} \frac{\Pr(C_j)\prod_{i=1}^{|d'|}\Pr(w_i \mid C_j)}{\sum_{C' \in C}\Pr(C')\prod_{i=1}^{|d'|}\Pr(w_i \mid C')} = \arg\max_{C_j \in C} \frac{\Pr(C_j)\prod_{w \in F}\Pr(w \mid C_j)^{TF(w,d')}}{\sum_{C' \in C}\Pr(C')\prod_{w \in F}\Pr(w \mid C')^{TF(w,d')}}$$

where:
- TF(w_i, d') is the number of occurrences of word w_i in document d'.
- |d'| is the number of words in document d'.
- w_i is a word in the feature space F, whose dimension is |F|.
- Pr(C_j) is computed from the proportion of documents of each class in the training set:

$$\Pr(C_j) = \frac{\|C_j\|}{\|C\|} = \frac{\|C_j\|}{\sum_{C' \in C}\|C'\|}$$

- Pr(w_i | C_j) is computed with the Laplace estimator (presented by Vapnik in 1982):

$$\Pr(w_i \mid C_j) = \frac{1 + TF(w_i, C_j)}{|F| + \sum_{w' \in F} TF(w', C_j)}$$
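The estimates above can be turned into a small classifier. The sketch below is only an illustration under stated assumptions: the toy training documents and the function names (`train_nb`, `word_prob`, `classify_nb`) are hypothetical, documents are assumed to be already tokenized, and words outside the feature space F are skipped. It works with log-probabilities and drops the denominator of H_BAYES, which is the same for every class and so does not change the arg max.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label). Returns priors, per-class term counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    term_counts = {label: Counter() for label in class_docs}
    for tokens, label in docs:
        term_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}            # feature space F
    priors = {c: n / len(docs) for c, n in class_docs.items()}   # Pr(C_j) = ||C_j|| / ||C||
    return priors, term_counts, vocab

def word_prob(word, label, term_counts, vocab):
    """Laplace estimate: (1 + TF(w, C_j)) / (|F| + sum over w' of TF(w', C_j))."""
    counts = term_counts[label]
    return (1 + counts[word]) / (len(vocab) + sum(counts.values()))

def classify_nb(tokens, priors, term_counts, vocab):
    """Return the class maximizing log Pr(C_j) + sum_i log Pr(w_i | C_j)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for w in tokens:
            if w in vocab:  # assumption: words outside F are ignored
                score += math.log(word_prob(w, label, term_counts, vocab))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set: two topics, two documents each.
train = [
    (["goal", "match", "team"], "sport"),
    (["team", "coach", "goal"], "sport"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
]
priors, term_counts, vocab = train_nb(train)
predicted = classify_nb(["team", "goal", "bank"], priors, term_counts, vocab)
print(predicted)
```

Summing logarithms instead of multiplying probabilities is a common design choice here: it avoids numerical underflow on long documents while leaving the selected class unchanged.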

Other NB variants include ML Naive Bayes, MAP Naive Bayes, Expected Naive Bayes and Bayesian Naive Bayes (described by Jason in 2001).

Naive Bayes is a very effective tool in some cases. The results can be very poor if the training data is sparse and the prediction parameters (such as the feature space) are of low quality. In general it is a linear classification algorithm suited to multi-topic text classification. NB has the advantages of simple implementation, high speed, easy updating with new training data, and a high degree of independence from the training set, so several training sets can be combined. However, besides assuming independence between words, NB also needs an optimal threshold to give good results. To improve the performance of NB, methods such as multiclass boosting and ECOC (presented by Berger in 1999 and described again by Ghani in 2000) can be used in combination.

II.4.2. k-Nearest Neighbor (kNN)

kNN is a well-known traditional statistical approach that has been studied in pattern recognition for more than four decades (according to Dasarathy, 1991). It is rated as one of the best methods (on the Reuters data set, version 21450) and has been used since the early days of text classification (presented by Masand in 1992, Yang in 1994 and Iwayama in 1995).

* Idea

To classify a new document, the algorithm computes the distance (Euclidean, cosine, ...) from every document in the training set to the new document and finds the k closest documents (the k "neighbors"). These distances are then used to weight the topics. The weight of a topic is the sum of the distances of those documents among the k neighbors that have that topic; a topic that does not occur among the k neighbors has weight 0. The topics are then sorted by decreasing weight, and the topics with high weights are chosen as the topics of the document to be classified.

* Formula
The weight of topic c_j for document $\vec{x}$ is:

$$W(\vec{x}, c_j) = \sum_{\vec{d}_i \in kNN} sim(\vec{x}, \vec{d}_i)\, y(\vec{d}_i, c_j) - b_j$$

where:
- $y(\vec{d}_i, c_j) \in \{0, 1\}$: y = 0 means that document $\vec{d}_i$ does not belong to topic c_j, and y = 1 means that it does.
- $sim(\vec{x}, \vec{d}_i)$ is the similarity between the document to be classified $\vec{x}$ and document $\vec{d}_i$. The cosine measure can be used to compute it:

$$sim(\vec{x}, \vec{d}_i) = \cos(\vec{x}, \vec{d}_i) = \frac{\vec{x} \cdot \vec{d}_i}{\|\vec{x}\| \, \|\vec{d}_i\|}$$
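The topic weighting above can be sketched in a few lines. This is an illustrative sketch, not the thesis implementation: documents are assumed to be sparse term-weight dictionaries, the toy training set and the names `cosine` and `knn_scores` are hypothetical, and the per-topic thresholds b_j default to zero when none are supplied.

```python
import math

def cosine(x, d):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in x.items())
    norm = math.sqrt(sum(w * w for w in x.values())) * math.sqrt(sum(w * w for w in d.values()))
    return dot / norm if norm else 0.0

def knn_scores(x, training, k, thresholds=None):
    """W(x, c_j) = sum over the k nearest d_i of sim(x, d_i) * y(d_i, c_j) - b_j."""
    thresholds = thresholds or {}  # assumption: b_j = 0 unless given
    neighbours = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    scores = {}
    for vec, topic in neighbours:
        scores[topic] = scores.get(topic, 0.0) + cosine(x, vec)
    return {c: s - thresholds.get(c, 0.0) for c, s in scores.items()}

# Hypothetical toy training set of (vector, topic) pairs.
training = [
    ({"goal": 1.0, "team": 1.0}, "sport"),
    ({"team": 1.0, "coach": 1.0}, "sport"),
    ({"bank": 1.0, "market": 1.0, "loan": 1.0}, "finance"),
]
scores = knn_scores({"team": 1.0, "goal": 1.0, "bank": 1.0}, training, k=2)
best = max(scores, key=scores.get)
print(best, scores)
```

Topics that do not occur among the k neighbours are simply absent from `scores`, which corresponds to a weight of 0 before thresholding.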

The classification threshold b_j of topic c_j is learned automatically from a validation set of documents taken from the training set.

To choose the best value of k, the algorithm has to be run experimentally with many different values. The larger k is, the more stable the algorithm and the lower the error (according to Yang, 1997). The best value used on both the Reuters and Ohsumed data sets is k = 45.

II.4.3. Support Vector Machine (SVM)

The Support Vector Machine (SVM) is a very effective classification approach introduced by Vapnik in 1995 to solve two-class pattern recognition problems using the Structural Risk Minimization principle (according to Vapnik).

* Idea

Given a training set represented in a vector space in which each document is a point, the method finds the best decision hyperplane h that separates the points in this space into two distinct classes, the + class and the - class. The quality of the hyperplane is determined by the distance (called the margin) from the nearest data point of each class to the plane. The larger the margin, the better the decision plane and the more accurate the classification. The aim of the SVM algorithm is to find the largest margin.

* Formula

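SVM is in essence an optimization problem: the goal is to find a space H and a decision hyperplane h on H such that the classification error is as low as possible. The equation of the hyperplane containing vector $\vec{d}_i$ in the space is:

$$\vec{d}_i \cdot \vec{w} + b = 0$$

The decision function $h(\vec{d}_i) = \mathrm{sign}(\vec{d}_i \cdot \vec{w} + b)$ thus represents the assignment of $\vec{d}_i$ to one of the two classes. Let $y_i \in \{\pm 1\}$, with $y_i = +1$ when document $\vec{d}_i$ belongs to the + class and $y_i = -1$ when it belongs to the - class. To obtain the hyperplane h we have to solve the following problem:

Find $\min \|\vec{w}\|$ with $\vec{w}$ and b satisfying:

$$\forall i \in \{1, \dots, n\}: \quad y_i (\vec{d}_i \cdot \vec{w} + b) \ge 1$$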
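To make the constraint and the objective concrete, the following sketch checks two candidate hyperplanes against the margin constraint y_i (d_i . w + b) >= 1 and compares their margins 1 / ||w||. It is only an illustration: the four 2-D points, the two candidate (w, b) pairs and the function names are hypothetical, and no optimizer is run here.

```python
import math

def decision(w, b, d):
    """Signed value d . w + b; its sign is the predicted class h(d)."""
    return sum(wi * di for wi, di in zip(w, d)) + b

def satisfies_margin(w, b, docs, labels):
    """True if y_i * (d_i . w + b) >= 1 holds for every training point."""
    return all(y * decision(w, b, d) >= 1 - 1e-12 for d, y in zip(docs, labels))

def margin(w):
    """Distance from the hyperplane to the closest admissible point: 1 / ||w||."""
    return 1.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical, linearly separable toy data.
docs = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [+1, +1, -1, -1]

w_a, b_a = (0.5, 0.5), -1.0   # candidate A: smaller ||w||
w_b, b_b = (1.0, 1.0), -2.0   # candidate B: same direction, larger ||w||

for name, (w, b) in {"A": (w_a, b_a), "B": (w_b, b_b)}.items():
    print(name, satisfies_margin(w, b, docs, labels), round(margin(w), 3))
```

Both candidates satisfy the constraints, but candidate A has the smaller norm and therefore the larger margin, which is why minimizing ||w|| under the constraints selects it.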

The SVM problem can be solved with Lagrange multipliers, which transform it into a form with equality constraints. An interesting property of SVM is that the decision plane depends only on the support vectors, whose distance to the decision plane is $1 / \|\vec{w}\|$. If the other points are removed, the algorithm still gives the same result. This property distinguishes SVM from algorithms such as kNN, LLSF, NNet and NB, in which all the data in the training set is used to optimize the result. Good SVM implementations include SVMLight (presented by Joachims in 1998) and Sequential Minimal Optimization (SMO) (presented by Platt in 1998).

II.4.4. Neural Network (NNet)

NNet has been studied intensively in artificial intelligence. Wiener used NNet for text classification with two approaches: a flat architecture (no hidden layer) and a three-layer neural network (with one hidden layer) (Wiener, 1995). Both systems use a separate neural network for each topic. NNet learns a nonlinear mapping from input elements, such as words or the vector model of a document, to a specific topic. The drawback of NNet is the large amount of time needed to train the network.

* Idea

A neural network model has three main components: the architecture, the cost function and the search algorithm. The architecture defines the functional form that relates the inputs to the outputs.

Flat architecture: the simplest classification network (also called a logistic network) has a single output unit with logistic activation and no hidden layer, so its functional form is equivalent to a logistic regression model. The search algorithm adjusts the network model to fit the training set. For example, the weights of a logistic network can be learned by gradient descent in weight space or by the iterated reweighted least squares algorithm, the traditional algorithm for logistic regression.

Modular architecture: using one or more hidden layers of nonlinear activation functions allows the network to establish relationships between the input and output variables. Each hidden layer learns to re-represent the input data by discovering higher-level features from combinations of the features at the previous level.

Figure: Modular architecture. The outputs of each sub-network are the inputs to the meta-topic network and are multiplied together to predict the final topic.

* Formula

The work of Wiener et al. (1995) follows the framework of regression models: the relation from the input features to the assigned topic is learned from the data set. To keep the analysis linear in the inputs, the authors use the following sigmoid function as the transfer function in the neural network:

$$p = \frac{1}{1 + e^{-\eta}}$$

where $\eta = \beta^T x$ is the combination of the input features, and p must satisfy $p \in (0, 1)$.
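A single logistic output unit of this kind can be written directly from the formula. The sketch below is illustrative only; the function name `logistic_unit` and the example weights are hypothetical, and the two-branch form is just a numerically safe way of evaluating the same sigmoid.

```python
import math

def logistic_unit(beta, x):
    """p = 1 / (1 + exp(-eta)) with eta = beta^T x; the result lies in (0, 1)."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)          # same value, written to avoid overflow for very negative eta
    return e / (1.0 + e)

# Hypothetical weights and feature vector: eta = 1*3 + (-2)*1 = 1.
p = logistic_unit([1.0, -2.0], [3.0, 1.0])
print(p)
```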

II.4.5. Linear Least Square Fit (LLSF)

LLSF is a mapping approach developed by Yang and Chute in 1992. Yang and Chute first tested LLSF on synonym identification and then used it for classification in 1994. Yang's experiments show that the classification performance of LLSF can match that of the classic kNN method.

* Idea

LLSF uses regression to learn from the training set and the existing topics. The training set is represented as pairs of input and output vectors:
- The input vector of a document consists of its words and their weights.
- The output vector consists of the topics, with binary weights, of the document corresponding to the input vector.

Solving the system formed by the input/output vector pairs yields the matrix of word-category regression coefficients.

* Formula

$$F_{LS} = \arg\min_{F} \|FA - B\|^2$$

where A and B are the matrices representing the training data (their columns are the input and output vectors, respectively). F_LS is the resulting matrix, which defines a mapping from an arbitrary document to a vector of weighted topics.
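For reference, the minimization has a well-known closed form. The derivation below assumes that the norm is the Frobenius norm (an assumption, since the text does not name it); A^+ denotes the Moore-Penrose pseudoinverse, and the second expression additionally assumes that A A^T is invertible.

```latex
\begin{aligned}
\frac{\partial}{\partial F}\,\|FA - B\|_F^2 &= 2\,(FA - B)\,A^{T} = 0
\quad\Longrightarrow\quad F\,A A^{T} = B A^{T}, \\
F_{LS} &= B A^{+}, \qquad
A^{+} = A^{T}\,(A A^{T})^{-1} \ \text{ when } A A^{T} \text{ is invertible.}
\end{aligned}
```

When A A^T is singular, B A^+ is still a minimizer (the one of smallest norm), which is why implementations commonly compute the pseudoinverse through a singular value decomposition.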

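Sorting the topic weights gives a ranked list of topics that can be assigned to the document to be classified. Applying a threshold to the topic weights then selects the appropriate topics for the input document. As in kNN, the system learns the optimal threshold for each topic automatically. Although LLSF and kNN differ statistically, the two methods share this step of learning optimal thresholds.

II.4.6. Centroid-based vector

This is a simple classification method that is easy to implement and fast, because its complexity is linear, O(n) (presented by Han in 2000).

* Idea

Each class in the training data is represented by a centroid vector. The class of any test document is determined by finding the centroid vector closest to the vector representing the test document; the class of the test document is the class that this centroid represents. Distance is measured with the cosine measure.

* Formula

The centroid vector of class i is:

$$\vec{C}_i = \frac{1}{|\{i\}|} \sum_{\vec{d}_j \in \{i\}} \vec{d}_j$$

The distance measure between vector $\vec{x}$ and $\vec{C}_i$ is:

$$\cos(\vec{x}, \vec{C}_i) = \frac{\vec{x} \cdot \vec{C}_i}{\|\vec{x}\| \, \|\vec{C}_i\|}$$

where:
- $\vec{x}$ is the vector of the document to be classified.
- {i} is the set of documents belonging to topic C_i.
- The topic of $\vec{x}$ is the C_x that satisfies $\cos(\vec{x}, \vec{C}_x) = \max_i \cos(\vec{x}, \vec{C}_i)$.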
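The centroid method can be sketched as follows. This is a minimal illustration under stated assumptions: the centroid is taken to be the arithmetic mean of the class's document vectors, vectors are dense lists of equal length, and the toy data and function names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of dense document vectors of equal length."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(x, c):
    """Cosine measure between two dense vectors."""
    dot = sum(a * b for a, b in zip(x, c))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in c))
    return dot / norm if norm else 0.0

def classify_centroid(x, centroids):
    """Return the label whose centroid has the largest cosine with x."""
    return max(centroids, key=lambda label: cosine(x, centroids[label]))

# Hypothetical toy training data: class label -> document vectors.
training = {
    "sport": [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0]],
    "finance": [[0.0, 0.0, 3.0], [0.0, 1.0, 2.0]],
}
centroids = {label: centroid(vectors) for label, vectors in training.items()}
label = classify_centroid([1.0, 1.0, 0.5], centroids)
print(label)
```

Classification costs one cosine computation per class, independent of the number of training documents, which is consistent with the linear complexity mentioned above.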

II.5. Basic characteristics of Vietnamese

Vietnamese is classified as an isolating language (isolate), also called amorphous, non-inflecting or monosyllabic, with the following main characteristics:
- In use, words do not change form, and grammatical meaning lies outside the word. Example: "Tôi nhìn anh ấy" (I look at him) and "anh ấy nhìn tôi" (he looks at me).
- The main grammatical devices are word order and function words. Example: "gạo xay" (milled rice) and "xay gạo" (to mill rice).
- There is a special unit, the morphosyllable, whose phonetic shell coincides exactly with the syllable. This unit is the Vietnamese morpheme, also called "tiếng" (according to Dinh Dien there are about 10,000 of them, but according to a survey by the Vietnam Blind Association for its audio-book program there are only about 3,000).
- Word boundaries are not determined by whitespace alone, unlike in inflecting languages. Example: "học sinh học sinh học". This makes morphological analysis (word segmentation) of Vietnamese difficult. Identifying word boundaries is important as the basis for later processing such as spelling checking, word tagging and word frequency statistics.
- There is a special class of words, classifiers (also called classifier nouns), that accompany nouns, as in "cái bàn" (a table), "cuốn sách" (a book), "bức thư" (a letter), ...
- Phonetically, every Vietnamese syllable carries one of six tones (ngang, sắc, huyền, hỏi, ngã, nặng). Tone is a suprasegmental phoneme.
- Vietnamese words exhibit reduplication, as in "lấp lánh", "lung linh", ... There is also the phenomenon of spoonerism ("nói lái"), which arises because the link between the initial consonant and the rhyme of a syllable is loose, as in "hiện đại" -> "hại điện".

II.6. Contrastive comparison of English and Vietnamese

The analysis above shows that English and Vietnamese differ in many respects (because of language type and culture), for example in phonetics, morphemes, word boundaries, lexicalization (e.g. ox = "bò đực", "anh" = elder brother, ...), parts of speech, word order (adjective and noun) and sentence structure (subject-predicate and subject-predicate phrases), ... Language processing models for English therefore cannot be applied unchanged to Vietnamese. They have to be adjusted on the basis of a contrastive comparison of the two languages.

II.7. Summary of text classification methods for Vietnamese

II.7.1. Maximum Matching: forward/backward

* Content

Maximum Matching is also called Left Right Maximum Matching (LRMM). The method scans a phrase or sentence from left to right, chooses the dictionary word with the most syllables, and then continues in the same way with the next word until the end of the sentence. The algorithm was presented by Chih-Hao Tsai in 2000.

Simple form: used to resolve single-word ambiguity. Suppose there is a string of characters (corresponding to a string of syllables in Vietnamese) C1, C2, ..., Cn. Start at the beginning of the string. First check whether C1 is a word, then check whether C1C2 is a word, and continue until the longest word is found. The most plausible word is the longest one. Choose that word, then repeat the procedure on the rest of the string until the whole sequence of words has been determined.

Complex form: the rule is that the most plausible segmentation is the three-word chunk of maximum length. The algorithm starts as in the simple form. If an ambiguous segmentation is found (for example, C1 is a word and C1C2 is also a word), the following characters are examined to find all possible three-word chunks beginning with C1 or C1C2. For example, suppose the chunks are:
- C1 | C2 | C3C4
- C1C2 | C3C4 | C5
- C1C2 | C3C4 | C5C6

The longest chunk is the third one, so its first word (C1C2) is chosen. The steps are repeated until the complete word sequence is obtained.

* Advantages
- Phrases and sentences such as "hợp tác xã || mua bán" and "thành lập || nước || Việt Nam || dân chủ || cộng hòa" are easily segmented correctly.
- The segmentation is simple and fast and needs only a dictionary.
- For Chinese, this method reaches an accuracy of 98.41% (according to Chih-Hao Tsai, 2000).
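The simple form of the algorithm can be sketched as below. This is an illustrative sketch rather than the thesis implementation: the small dictionary, the `max_len` limit on word length in syllables, and the function name `forward_max_match` are all assumptions, and a single syllable is always accepted as a word so that the scan cannot stall on out-of-dictionary input.

```python
def forward_max_match(syllables, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest dictionary word."""
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + length])
            # Assumption: a lone syllable is accepted even if it is not in the dictionary.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical mini-dictionary of multi-syllable words.
dictionary = {"hợp tác xã", "hợp tác", "mua bán", "học sinh", "sinh học"}

print(forward_max_match("hợp tác xã mua bán".split(), dictionary))
print(forward_max_match("học sinh học sinh học".split(), dictionary))
```

The first call yields the segmentation "hợp tác xã || mua bán". The second call shows how the greedy choice behaves on an ambiguous sentence: it commits to "học sinh" twice and leaves a lone "học", because the simple form never reconsiders an earlier decision.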

* Limitations
- The accuracy of the method depends entirely on the completeness and correctness of the dictionary.
- The method segments incorrectly in cases such as "học sinh || học sinh || học", "một || ông || quan tài || giỏi" and "trước || bàn là || một || ly || nước".

II.7.2. Transformation-based Learning (TBL)

* Content

This approach is based on an annotated corpus. To train the computer to recognize Vietnamese word boundaries, the machine "learns" from a corpus of tens of thousands of Vietnamese sentences in which the word boundaries have been marked correctly. After learning, the machine has determined the parameters (the probabilities) needed by the word recognition model.

* Advantages
- The distinctive feature of the method is its ability to derive the rules of the language by itself.
- It has the advantages of the rule-based approach (since in the end it also relies on the derived rules) while overcoming the drawback of having experts build the rules by hand.
- The rules are tested on the spot to assess their accuracy and effectiveness (on the training corpus).

* Limitations
- The method "uses a corpus labelled with linguistic tags to learn those rules automatically" (according to Dinh Dien, 2004). However, building a corpus that fully meets the criteria for a Vietnamese corpus is clearly very difficult and costly in time and effort.
- The system needs a fairly long training period before it can derive a reasonably complete set of rules.
- The implementation is complex.

II.7.3. Word segmentation model using WFST and a neural network

* Content

The weighted finite-state transducer (WFST) model was applied by Richard to Chinese word segmentation. The basic idea is to use a WFST whose weights are the occurrence probabilities of each word in the corpus. The WFST is used to traverse the sentence under consideration, and the traversal with the largest weight (that is, the highest probability) gives the chosen segmentation. This solution was also applied by Dinh Dien (2001), together with a neural network for disambiguation. The Vietnamese word segmentation system has two layers: a WFST layer which, besides segmenting words, also handles issues specific to Vietnamese such as reduplicated words and proper names, and a neural network layer used to resolve ambiguity where it exists.

[Figure: diagram of the WFST system]

WFST layer: consists of three steps.

- Building the weighted dictionary: in the WFST model, word segmentation is viewed as a stochastic transduction. We describe the dictionary D as a weighted finite-state transducer. Let:
  H be the set of Vietnamese orthographic words (also called syllables);
  P be the part of speech (POS) of a word.
  Each arc of D is either:
  from an element of H to an element of H, or
  from $\varepsilon$ (the end-of-word symbol) to an element of P.
  The labels in D carry an estimated cost given by the formula Cost = -log(f/N), where f is the frequency of the word and N is the size of the sample. For new, unseen words the author applies the Good-Turing conditional probability estimate (Baayen) to compute the weight.

- Generating the candidate segmentations: to limit the combinatorial explosion that arises when generating all possible word sequences from the sequence of syllables in a sentence, the author proposes combining generation with a dictionary lookup. When a segmentation is found to be unsuitable (not in the dictionary, not a reduplicative word, not a proper noun), the branches originating from that segmentation are discarded.

- Choosing the optimal segmentation: once a list of possible segmentations of the sentence is available, the author selects the segmentation with the smallest weight, as follows.

Example: input = "Tốc độ truyền thông tin sẽ tăng cao"

Dictionary
  tốc độ         8.68
  truyền         12.31
  truyền thông   12.31
  thông tin      7.24
  tin            7.33
  sẽ             6.09
  tăng           7.43
  cao            6.95
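The selection of the cheapest segmentation from the dictionary costs above can be sketched as a small dynamic program. This is only an illustration under stated assumptions: it searches dictionary entries directly rather than building an actual transducer, and the names `cost` and `best_segmentation` are hypothetical.

```python
import math


def cost(frequency, corpus_size):
    """Estimated cost of a word: Cost = -log(f / N)."""
    return -math.log(frequency / corpus_size)


def best_segmentation(syllables, costs):
    """Return (total_cost, words) for the cheapest way to split `syllables`
    into dictionary words, using dynamic programming over prefix lengths."""
    n = len(syllables)
    best = [(math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = " ".join(syllables[start:end])
            if word in costs and best[start][0] < math.inf:
                total = best[start][0] + costs[word]
                if total < best[end][0]:
                    best[end] = (total, best[start][1] + [word])
    return best[n]


# Costs copied from the example dictionary.
costs = {"tốc độ": 8.68, "truyền": 12.31, "truyền thông": 12.31,
         "thông tin": 7.24, "tin": 7.33, "sẽ": 6.09, "tăng": 7.43, "cao": 6.95}

total, words = best_segmentation("tốc độ truyền thông tin sẽ tăng cao".split(), costs)
print(" # ".join(words), round(total, 2))
```

Because every prefix keeps only its cheapest segmentation, the number of candidates examined grows quadratically with sentence length instead of exponentially.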

Id(D)*D* = "Tốc độ # truyền thông # tin # sẽ # tăng # cao." 48.79 (8.68 + 12.31 + 7.33 + 6.09 + 7.43 + 6.95 = 48.79)
Id(D)*D* = "Tốc độ # truyền # thông tin # sẽ # tăng # cao." 48.70 (8.68 + 12.31 + 7.24 + 6.09 + 7.43 + 6.95 = 48.70)

The optimal segmentation is therefore "Tốc độ # truyền # thông tin # sẽ # tăng # cao."

Neural network layer: the neural network model proposed by the author is used to evaluate three part-of-speech sequences: NNV, NVN, VNN (N: Noun, V: Verb). The model is trained on exactly those sentences whose segmentation is still ambiguous after passing through the first model.

* Advantages

- Accuracy above 97% (as reported by Đinh Điền in 2001).

- The model returns the segmentation together with a confidence (probability).
- Thanks to the neural network layer, the model can disambiguate the cases in which the WFST layer produces several candidates with equal scores.
- The method gives quite high accuracy, because the author aims at very accurate word segmentation as a foundation for machine translation.

* Limitations

- As with TBL, building the corpus is very laborious, but it is genuinely necessary for the author's later goal of machine translation.

II.7.4. Dynamic programming method

* Content

The dynamic programming method presented by Le An Ha in 2003 uses only a raw corpus to obtain word frequency statistics, which increases the reliability of the computation. The computation starts from units that are certain, such as sentences and chunks delimited by punctuation (commas, hyphens, semicolons), since these units are unambiguous in both written and spoken text. The author then tries to maximize the probability of each chunk by enumerating its possible segmentations. The final segmentation is the one that gives the chunk the highest probability. In other words, for a chunk to be segmented we must find the combination of words making up the chunk that has maximal probability. In this computation, however, the author runs into a combinatorial explosion when analysing the raw corpus. To solve this problem the author uses dynamic programming, because the maximal probability of a smaller chunk then has to be computed only once and is reused afterwards.

* Advantages

- No accurately annotated corpus is needed.

* Limitations

- In the experiments the author stopped at segmenting words of up to three syllables, because the input corpus was still rather small.
- The rate of correct words is 51% and the rate of acceptable words is 65% (according to Le An Ha). This is relatively low compared with the other segmentation methods mentioned above.

II.8. Description of the method used in this work

II.8.1. Choosing the approach for the thesis

After studying the methods used for text recognition and classification, it is clear that each has its own advantages and disadvantages and that none achieves perfect results. Looking for another method that may perform better is therefore worthwhile.

The author decided to combine two methods: the Support Vector Machine (SVM) and string kernels.

String kernels were chosen for the following reasons:

- This is a new method, and at the time of writing few works had dealt with string kernels.

- The analysis performed by string kernels fits Vietnamese well: in Vietnamese, words do not inflect, and grammatical meaning lies outside the word and depends on word order, while string kernels are based on comparing the distances between words in a sentence. The theory of string kernels is described in detail in a later section.

The Support Vector Machine (SVM) was chosen because practical experiments show that SVM classifies quite well in text classification as well as in many other applications (such as handwriting recognition, face detection in images, regression estimation, ...). Compared with other classification methods, the performance of SVM is comparable or significantly better.

Combining the two methods may therefore give the best results for Vietnamese text classification.

II.8.2. Kernels for text strings

In this section we describe a kernel between two texts. The idea is to compare the texts by means of the substrings they contain: the more substrings they have in common, the more similar they are. Importantly, these substrings need not be contiguous, and the degree of contiguity of a substring in the text determines how much weight it carries in the comparison.

Example: the substring c-a-r is present in both the word "card" and the word "custard", but with different weights. Each substring is one dimension of the feature space, and the value of that coordinate depends on how frequently and how compactly the substring occurs in the text. To deal with non-contiguous substrings, a decay factor $\lambda \in (0, 1)$ is used to weight the presence of a given feature in the text (see Definition 1 for details).

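Example: consider simple texts consisting of the words cat, car, bat, bar. If we consider only k = 2, we obtain an 8-dimensional feature space, in which the words are mapped as follows:

            c-a          c-t          a-t          b-a          b-t          c-r          a-r          b-r
$\phi$(cat) $\lambda^2$  $\lambda^3$  $\lambda^2$  0            0            0            0            0
$\phi$(car) $\lambda^2$  0            0            0            0            $\lambda^3$  $\lambda^2$  0
$\phi$(bat) 0            0            $\lambda^2$  $\lambda^2$  $\lambda^3$  0            0            0
$\phi$(bar) 0            0            0            $\lambda^2$  0            0            $\lambda^2$  $\lambda^3$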
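The mapping in the table can be reproduced by brute-force enumeration of all index pairs. The sketch below is an illustration only and is practical just for very short strings; the function names `ssk_features`, `evaluate` and `explicit_kernel` are assumptions introduced here, and each feature is stored as the list of exponents of $\lambda$ contributed by its occurrences.

```python
from collections import defaultdict
from itertools import combinations


def ssk_features(s, k):
    """Explicit feature map: for every length-k subsequence u of s, collect
    the exponents l(i) = i_k - i_1 + 1 of lambda, one per occurrence."""
    features = defaultdict(list)
    for idx in combinations(range(len(s)), k):
        u = "-".join(s[i] for i in idx)
        features[u].append(idx[-1] - idx[0] + 1)
    return dict(features)


def evaluate(features, lam):
    """Turn the stored exponents into numeric feature values."""
    return {u: sum(lam ** e for e in exps) for u, exps in features.items()}


def explicit_kernel(s, t, k, lam):
    """Inner product of the two explicit feature vectors."""
    fs = evaluate(ssk_features(s, k), lam)
    ft = evaluate(ssk_features(t, k), lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)


for word in ("cat", "car", "bat", "bar"):
    print(word, ssk_features(word, 2))
```

With `lam = 0.5`, `explicit_kernel("car", "cat", 2, 0.5)` evaluates the single shared feature c-a, giving $0.5^2 \cdot 0.5^2 = 0.5^4$.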

Thus the unnormalized kernel between car and cat is $K(\text{car}, \text{cat}) = \lambda^4$, whereas the normalized version is obtained as follows: $K(\text{car}, \text{car}) = K(\text{cat}, \text{cat}) = 2\lambda^4 + \lambda^6$, and hence $K(\text{car}, \text{cat}) = \lambda^4 / (2\lambda^4 + \lambda^6) = 1 / (2 + \lambda^2)$.

Note that a text normally contains more than one word, so mapping a whole text into a feature space amounts to concatenating all the words and spaces (ignoring punctuation) into a single sequence.

Example: we can compute the similarity between the two parts of a famous saying by Kant:

K("science is organized knowledge", "wisdom is organized life")

The values of the kernel for k = 1, 2, 3, 4, 5, 6 are: K1 = 0.580, K2 = 0.580, K3 = 0.478, K4 = 0.439, K5 = 0.406, K6 = 0.370.

However, for substrings of size k > 4 and texts of normal size, computing the relevant features directly may be impractical (even for texts of moderate size), so using the explicit representation is clearly infeasible. Nevertheless, a kernel over such features can be defined and computed very efficiently using dynamic programming techniques.

To derive the kernel we start from the features and compute their inner product. In this case there is no need to prove that it satisfies Mercer's conditions (symmetry and positive semi-definiteness), because they follow automatically from the fact that it is an inner product. This kernel is called the string subsequence kernel (SSK) and serves as a basis for applications in biology. It maps strings to feature vectors indexed by all k-tuples of characters. A k-tuple has a non-zero entry if it occurs as a subsequence anywhere (not necessarily contiguously) in the string. The weight of the feature is the sum, over the occurrences of the k-tuple, of a decay factor determined by the length of each occurrence.

* Definitions

Definition 1 (String subsequence kernel, SSK). Let $\Sigma$ be a finite alphabet. A string is a finite sequence of characters from $\Sigma$, including the empty sequence. For strings s, t, we denote by $|s|$ the length of the string $s = s_1 \ldots s_{|s|}$, and by $st$ the string obtained by concatenating s and t. The string $s[i:j]$ is the substring $s_i \ldots s_j$ of s. We say that u is a subsequence of s if there exist indices $\mathbf{i} = (i_1, \ldots, i_{|u|})$ with $1 \le i_1 < \ldots < i_{|u|} \le |s|$ such that $u_j = s_{i_j}$ for $j = 1, \ldots, |u|$, or $u = s[\mathbf{i}]$ for short. The length $l(\mathbf{i})$ of the subsequence in s is $i_{|u|} - i_1 + 1$. We denote by $\Sigma^n$ the set of all finite strings of length n, and by $\Sigma^*$ the set of all strings:

$$\Sigma^* = \bigcup_{n=0}^{\infty} \Sigma^n \qquad (1)$$

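We now define the feature spaces $F_n = \mathbb{R}^{\Sigma^n}$. The feature mapping $\phi$ for a string s is given by defining the coordinate $\phi_u(s)$ for each string $u \in \Sigma^n$. We define

$$\phi_u(s) = \sum_{\mathbf{i} : u = s[\mathbf{i}]} \lambda^{l(\mathbf{i})} \qquad (2)$$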

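for some $\lambda \le 1$. These features measure the number of occurrences of subsequences in the string s, weighted according to their lengths. The inner product of the feature vectors of two strings s and t is therefore a sum over all common subsequences, weighted by their frequency of occurrence and their lengths, as follows:

$$K_n(s,t) = \sum_{u \in \Sigma^n} \phi_u(s)\,\phi_u(t) = \sum_{u \in \Sigma^n} \sum_{\mathbf{i} : u = s[\mathbf{i}]} \lambda^{l(\mathbf{i})} \sum_{\mathbf{j} : u = t[\mathbf{j}]} \lambda^{l(\mathbf{j})} = \sum_{u \in \Sigma^n} \sum_{\mathbf{i} : u = s[\mathbf{i}]} \sum_{\mathbf{j} : u = t[\mathbf{j}]} \lambda^{l(\mathbf{i}) + l(\mathbf{j})}$$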

Computing these features directly would take $O(|\Sigma|^n)$ time and space, since that is the number of features involved. Clearly, most of the features will have non-zero components for large texts. To compute the kernel efficiently, we introduce an auxiliary function that supports a recursive computation of the kernel. Let

$$K'_i(s,t) = \sum_{u \in \Sigma^i} \sum_{\mathbf{i} : u = s[\mathbf{i}]} \sum_{\mathbf{j} : u = t[\mathbf{j}]} \lambda^{|s| + |t| - i_1 - j_1 + 2},$$

for $i = 1, \ldots, n-1$, that is, the length is measured from the start of the particular subsequence to the end of the strings s and t, instead of just $l(\mathbf{i})$ and $l(\mathbf{j})$. We can now define a recursion for $K'_i$ and from it compute $K_n$.

Definition 2 (Recursive computation of the subsequence kernel).

$$K'_0(s,t) = 1, \quad \text{for all } s, t,$$
$$K'_i(s,t) = 0, \quad \text{if } \min(|s|, |t|) < i,$$
$$K_i(s,t) = 0, \quad \text{if } \min(|s|, |t|) < i,$$
$$K'_i(sx,t) = \lambda K'_i(s,t) + \sum_{j : t_j = x} K'_{i-1}(s, t[1:j-1])\,\lambda^{|t| - j + 2}, \quad i = 1, \ldots, n-1,$$
$$K_n(sx,t) = K_n(s,t) + \sum_{j : t_j = x} K'_{n-1}(s, t[1:j-1])\,\lambda^2.$$
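The recursion of Definition 2 can be transcribed almost literally using memoization over prefix lengths. The sketch below is illustrative rather than optimized: the function name `ssk` and the use of `functools.lru_cache` are choices made here, and because it relies on Python recursion it is only suitable for short strings.

```python
from functools import lru_cache


def ssk(s, t, n, lam):
    """String subsequence kernel K_n(s, t) with decay factor lam,
    computed with the recursion of Definition 2."""

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # K'_i evaluated on the prefixes s[:ls] and t[:lt].
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]                      # last character of the s-prefix
        total = lam * k_prime(i, ls - 1, lt)
        for j in range(1, lt + 1):         # j is a 1-based position in t
            if t[j - 1] == x:
                total += k_prime(i - 1, ls - 1, j - 1) * lam ** (lt - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        # K_n evaluated on the prefixes s[:ls] and t[:lt].
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                total += k_prime(n - 1, ls - 1, j - 1) * lam ** 2
        return total

    return k(len(s), len(t))


print(ssk("car", "cat", 2, 0.5))
print(ssk("car", "car", 2, 0.5))
```

For the words car and cat with n = 2 the recursion returns $\lambda^4$, the same value as the explicit feature vectors of the earlier example.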

Note that we need the auxiliary function $K'$ because the interior gaps of the subsequences have to be penalized. The correctness of the recursion follows from observing how the length of the strings grows, incurring a factor $\lambda$ for each extra unit of length. Thus, in the formula for $K'_i(sx, t)$, the first term has one fewer character and therefore requires a single factor $\lambda$, while the second term has $|t| - j + 2$ fewer characters. In the last formula the second term requires only two additional characters, one for s and one for $t[1:j-1]$, because x is the last character of the n-sequence. If we want to compute $K_n(s,t)$ for a range of values of n, we only need to compute $K'_i(s,t)$ and then apply the last recursion step for each $K_n(s,t)$ that is needed, using the stored values of $K'_i(s,t)$. We can of course create a kernel $K(s,t)$ that combines the different $K_n(s,t)$ with different (positive) weights for each n.

Once such a kernel has been created, it is natural to normalize it in order to remove any bias arising from the texts. This can be done efficiently by normalizing the feature vectors in the feature space. We therefore create a new embedding $\hat{\phi}(s) = \phi(s) / \|\phi(s)\|$, which gives rise to the kernel

$$\hat{K}(s,t) = \langle \hat{\phi}(s), \hat{\phi}(t) \rangle = \left\langle \frac{\phi(s)}{\|\phi(s)\|}, \frac{\phi(t)}{\|\phi(t)\|} \right\rangle = \frac{\langle \phi(s), \phi(t) \rangle}{\|\phi(s)\|\,\|\phi(t)\|} = \frac{K(s,t)}{\sqrt{K(s,s)\,K(t,t)}}$$

Computational efficiency of the SSK (string subsequence kernel)

The SSK computes the similarity between texts s and t in time proportional to $n|s||t|^2$, where n is the length of the subsequence. This is evident from the recursion in Definition 2: the outermost loop of the recursion runs over the sequence length, and for each length and each additional character in s and t a sum over the sequence t has to be evaluated. However, the computation of the SSK can be sped up. We now present an efficient recursive computation of the SSK that reduces the complexity to $O(n|s||t|)$, first by evaluating

$$K''_i(sx,t) = \sum_{j : t_j = x} K'_{i-1}(s, t[1:j-1])\,\lambda^{|t| - j + 2}$$

and observing that $K'_i(s,t)$ can then be evaluated with the $O(|s||t|)$ recursion

$$K'_i(sx,t) = \lambda K'_i(s,t) + K''_i(sx,t).$$

We see that

$$K''_i(sx,tu) = \lambda^{|u|} K''_i(sx,t),$$

provided that x does not occur in u, while

$$K''_i(sx,tx) = \lambda \left( K''_i(sx,t) + \lambda K'_{i-1}(s,t) \right).$$

Combining these observations, we see that $K'_i(s,t)$ can be computed with an $O(|s||t|)$ recursion. The whole kernel can therefore be evaluated in $O(n|s||t|)$ time.

II.8.3. Theoretical basis of the Support Vector Machine (SVM)

The basic property that determines the ability of a classifier is its generalization performance, that is, its ability to classify new data on the basis of the knowledge accumulated during training. A training algorithm is considered good if, after training, the generalization performance of the resulting classifier is high. Generalization performance depends on two parameters: the training error and the capacity of the learning machine. The training error is the classification error rate on the training set. The capacity of the learning machine is determined by the Vapnik-Chervonenkis dimension (VC dimension). The VC dimension is an important notion for a family of separating functions (that is, a classifier). It is defined as the maximum number of points that the family of functions can separate completely in the object space. A good classifier is one that has the lowest capacity (that is, it is the simplest) while keeping the training error small. The SVM method is built on this idea.

Consider the simplest classification problem, classification into two classes, with the sample set

$$\{(x_i, y_i) \mid i = 1, 2, \ldots, N,\; x_i \in \mathbb{R}^m\}$$

in which the samples are object vectors classified as positive or negative samples:

- The positive samples are the samples $x_i$ that belong to the domain of interest and are labelled $y_i = 1$;
- The negative samples are the samples $x_i$ that do not belong to the domain of interest and are labelled $y_i = -1$.

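Figure 1. A hyperplane separating the positive samples from the negative samples.

In this case the SVM classifier is a hyperplane that separates the positive samples from the negative samples with maximum separation, where the separation, also called the margin, is determined by the distance between the positive samples and the negative samples closest to the hyperplane (Figure 1). This hyperplane is called the optimal-margin hyperplane. Hyperplanes in the object space have the equation $w^T x + b = 0$, where w is the weight vector and b is the bias. When w and b change, the orientation of the hyperplane and its distance from the origin change. The SVM classifier is defined as follows:

$$f(x) = \mathrm{sign}(w^T x + b) \qquad (1)$$

where $\mathrm{sign}(z) = +1$ if $z \ge 0$ and $\mathrm{sign}(z) = -1$ if $z < 0$. If $f(x) = +1$ then x belongs to the positive class (the domain of interest); conversely, if $f(x) = -1$ then x belongs to the negative class (the other domains).

The SVM learning machine is a family of hyperplanes that depends on the parameters w and b. The goal of the SVM method is to estimate w and b so as to maximize the margin between the positive and negative classes. Different values of the margin give different families of hyperplanes, and the larger the margin, the lower the capacity of the learning machine. Maximizing the margin therefore amounts to finding a learning machine with the smallest capacity. The classification is optimal when the classification error is minimal. If the training set is linearly separable, we have the following constraints:

$$w^T x_i + b \ge +1 \quad \text{if } y_i = +1 \qquad (2)$$
$$w^T x_i + b \le -1 \quad \text{if } y_i = -1 \qquad (3)$$

The two hyperplanes with equations $w^T x + b = \pm 1$ are called the supporting hyperplanes (the dashed lines in Figure 1). To construct the optimal-margin hyperplane we have to solve the following quadratic programming problem:

Maximize:

$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

subject to the constraints:

$$\alpha_i \ge 0, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$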

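where the Lagrange coefficients $\alpha_i$, $i = 1, 2, \ldots, N$, are the variables to be optimized. The vector w is computed from the solution of the above quadratic programming problem as follows:

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i$$

To determine the bias b, we choose a sample $x_i$ with $\alpha_i > 0$ and then use the Karush-Kuhn-Tucker (KKT) condition:

$$\alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0$$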
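The way w and b are recovered from the Lagrange coefficients can be illustrated on a tiny problem. The sketch below is a hedged example, not part of the thesis: the helper names are hypothetical, and the two samples and their multipliers (0.25 each) are a hand-worked toy case chosen so that the result can be checked by inspection.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, labels, samples):
    """w = sum_i alpha_i * y_i * x_i."""
    dim = len(samples[0])
    return [sum(a * y * x[d] for a, y, x in zip(alphas, labels, samples))
            for d in range(dim)]


def bias(alphas, labels, samples, w):
    """Pick a sample with alpha_i > 0; KKT then gives y_i (w.x_i + b) = 1,
    hence b = y_i - w.x_i because y_i is +1 or -1."""
    for a, y, x in zip(alphas, labels, samples):
        if a > 0:
            return y - dot(w, x)
    raise ValueError("no support vector found")


def classify(x, w, b):
    """f(x) = sign(w.x + b), with sign(0) = +1."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical two-point problem whose multipliers can be worked out by hand.
samples = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]

w = weight_vector(alphas, labels, samples)
b = bias(alphas, labels, samples, w)
print(w, b, classify((2.0, 0.0), w, b), classify((-3.0, 1.0), w, b))
```

In this toy case both samples are support vectors and both satisfy $y_i (w^T x_i + b) = 1$ exactly.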

The samples $x_i$ with $\alpha_i > 0$ are the samples closest to the decision hyperplane (they satisfy equality in (2) and (3)) and are called support vectors. The support vectors are the most important elements of the training set, because with the support vectors alone we can still construct the same optimal-margin hyperplane as with the full training set. If the training set is not linearly separable, the problem can be handled in two ways.

The first way uses a soft-margin hyperplane, which allows some training samples to lie on the wrong side of the separating hyperplane, or on the correct side but inside the region between the separating hyperplane and the corresponding supporting hyperplane. In this case the Lagrange coefficients of the quadratic programming problem have an additional positive upper bound C, a parameter chosen by the user. This parameter corresponds to the penalty for misclassified samples.

The second way uses a nonlinear mapping $\Phi$ to map the input data points into a new space of higher dimension. In this space the data points become linearly separable, or can be separated with fewer errors than in the original space. A linear decision surface in the new space corresponds to a nonlinear decision surface in the original space. The original quadratic programming problem then becomes:

Maximize:

$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$$

subject to the constraints:

$$0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$

where k is a kernel function satisfying:

$$k(x_i, x_j) = \Phi(x_i)^T \cdot \Phi(x_j)$$

By using a kernel function we do not need to know the mapping $\Phi$ explicitly. Moreover, by choosing a suitable kernel we can build many different classifiers. For example, the polynomial kernel $k(x_i, x_j) = (x_i^T x_j + 1)^p$ leads to a polynomial classifier, the Gaussian kernel $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ leads to an RBF (Radial Basis Functions) classifier, and the sigmoid kernel $k(x_i, x_j) = \tanh(\kappa\, x_i^T x_j + \delta)$, where tanh is the hyperbolic tangent, leads to a two-layer sigmoid neural network (one hidden layer of neurons and one output neuron). An advantage of SVM training over other training methods, however, is that most of the parameters of the learning machine are determined automatically during training.

Various authors have used different methods to solve the SVM training problem, but recently most authors use Sequential Minimal Optimization (SMO). This algorithm uses a training subset (also called the working set) of the smallest possible size, consisting of two Lagrange coefficients. The smallest quadratic programming problem must contain two Lagrange coefficients because the Lagrange coefficients have to satisfy the equality constraint (11), $\sum_i \alpha_i y_i = 0$. The SMO method also has a number of heuristics for choosing the two Lagrange coefficients to optimize at each step.

II.8.4. Training the SVM

Training an SVM means solving the SVM quadratic programming problem. Numerical methods for solving this problem require storing a matrix whose size is the square of the number of training samples. In practical problems this is infeasible, because the training set is usually very large (up to tens of thousands of samples). Many different algorithms have been developed to solve this problem. They are based on decomposing the training set into groups of data, which means that the large quadratic programming problem is decomposed into quadratic programming problems of smaller size. These algorithms then check the KKT conditions to determine the optimal solution.

II.8.5. Text classification

To carry out classification, training methods are used to build a classifier from sample documents; the classifier is then used to predict the class of new documents (whose topic is unknown). Some commonly used term weighting methods are:

1. Term frequency (TF): the weight of a term is the frequency with which it occurs in the document. This weighting says that a term is important for a document if it occurs many times in that document.

2. TFIDF: the weight of a term is the product of the term frequency TF and the inverse document frequency of the term, which is given by the formula

$$\mathrm{IDF} = \log(N / \mathrm{DF}) + 1 \qquad (13)$$

where N is the size of the training document set and DF is the document frequency, that is, the number of documents in which the term occurs. The TFIDF weight incorporates the document frequency DF into the TF weight. The fewer documents a term occurs in (corresponding to a small DF), the better it discriminates between documents.
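The TFIDF weighting can be sketched as follows. This is a minimal illustration under stated assumptions: documents are assumed to be already tokenized, the natural logarithm is used because the text does not state the base, and the function name `tfidf_weights` is hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """documents: list of token lists. Returns one {term: weight} dict per
    document, with weight = TF * IDF and IDF = log(N / DF) + 1."""
    n_docs = len(documents)
    document_frequency = Counter()
    for tokens in documents:
        # Count each term once per document.
        document_frequency.update(set(tokens))
    weights = []
    for tokens in documents:
        term_frequency = Counter(tokens)
        weights.append({
            term: tf * (math.log(n_docs / document_frequency[term]) + 1)
            for term, tf in term_frequency.items()
        })
    return weights


docs = [["bóng đá", "trận đấu", "bóng đá"], ["bóng đá", "kinh tế"]]
for row in tfidf_weights(docs):
    print(row)
```

A term that occurs in every document gets IDF = 1, so its weight reduces to its plain term frequency.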

CHAPTER III. PROBLEM DESCRIPTION AND SOLUTION

III.1. Requirements for text classification

The main requirement of text classification is that, after processing, a document is assigned to one of a set of predefined document groups. For documents that cannot be assigned, or that are ambiguous, the program must point this out and allow the user to decide manually which group the document belongs to. Once this has been decided, the result must be fed back into the recognition system so that the program can recognize similar documents the next time.

The key issue here is reading the content and analysing its semantics in order to determine the type of the document; if this part is done well, the need for manual reassignment is greatly reduced. The problem to be solved can therefore be seen as reading the content, analysing the content that was read, and then choosing an algorithm that decides the group of the document.

III.2. Program structure

Based on the research described in Chapter II and the requirements of the problem described in section III.1, recognizing a document accurately requires the following steps:

III.2.1. Step 1: Data preprocessing

The purpose of this step is to clean the input data reasonably well so that the later steps work better. The step only converts the input into a plain character string (text), so its requirements are:

Input: the document files to be analysed (PDF, TXT, DOC, HTML, HTM)
Output: a plain character string (text only)

III.2.2. Step 2: Sentence splitting

The purpose of this step is to split a plain text document into sentences.

Input: the plain text string of the document
Output: a vector containing the sentences extracted from the document

III.2.3. Step 3: Word segmentation

Words are extracted from the sentences obtained. The words here are Vietnamese words, which is a point that requires attention.

Input: a sentence
Output: a vector containing the meaningful words in the sentence

III.2.4. Step 4: Labelling and weighting

Labelling quantifies the words in the document.

Input: a vector of words
Output: a vector containing the labelled words

III.2.5. Step 5: Applying the classification algorithm to the document

This is the main step of the program.

Input: the vector of words and the reference data of the document groups
Output: the group of the document

III.3. Steps performed in the program

III.3.1. Data preprocessing

Task: read the contents of the data files and convert the documents to be checked into plain text, that is, remove elements such as images, tags (in the case of web pages) and formatting information.

To unify the format of the documents, all documents must use a single encoding. The encoding chosen is Unicode, so before converting a document into a character string (text), the first thing to do is convert every document that uses a non-Unicode font to Unicode. Since most documents today use Unicode, and since identifying the Vietnamese font used in a document automatically is rather difficult, this conversion is done manually, that is, it is decided by the user. A tool for converting from other fonts to Unicode is bundled with the program, but dedicated converters such as Unikey or Vietkey can also be used.

The data must be cleaned of all non-text information, such as images, sound and text formatting. How this is done depends on the type of the input file:

- If the input is a plain text file (txt), all the data are taken.
- If the input is a rich text file (rtf), the text is extracted using the rtf control in the program; the control takes the name of the .rtf file (including the path) as input and outputs plain text.
- If the input is an MS Word file (doc), Microsoft.Office.Core is used for the conversion; with this tool, converting a Microsoft Word file to text takes a single function call.
- If the input is a PDF file, the PDFbox control is used to read the file and discard the attributes the program does not need, such as images, sound and formatting, keeping only the text.

- If the input consists of (htm) or (html) files, cleaning means removing the formatting tags, the hyperlinks and the image links.

Removing the formatting information of web pages

Web pages are currently designed according to the HTML standard, which includes tags that format the content components of the page. They can be grouped as follows:

- Tags for general information about the page: <TITLE>, <!DOCTYPE>, <HEAD>, <HTML>, ...
- Tags for regions, line breaks, columns, ...: <BR>, <DIV>, <BLOCKQUOTE>, <TABLE>, <TR>, <TD>, ...
- Tags for lists: <LI>, <MENU>, <OL>, ...
- Tags for text formatting and effects: <FONT>, <B>, <I>, <A>, <MARQUEE>, <STRIKE>, ...
- Tags for processing: <SCRIPT>, <APPLET>, <CODE>, <FRAME>, <EMBED>, <STYLE>, <INPUT>, ...

This formatting and processing information must be removed, keeping only the textual information that the web page intends to convey to the reader.

Removing unnecessary auxiliary text regions

After the formatting and processing information has been removed and the textual information has been extracted, that information still contains auxiliary, unnecessary material that must also be removed. Besides its main content, a web page usually contains a lot of other material such as advertisements, notices, headings, menus, ... A suitable way is therefore needed to skip the unnecessary content and keep only the main content, from which the sentences used to build a summary are extracted. This step is described in detail in later sections. Extracting the narrative passages of a document while skipping fragmentary parts such as headings and links can be done in two ways:

- Way 1. Use heuristics or machine learning to derive rules that extract the narrative parts of the document.
- Way 2. Use grammatical properties to discard the parts that do not form sentences, for example passages that contain no verb or that contain more than 4 words whose part of speech cannot be determined.

The remaining cleaning of the data covers:

- runs of more than one space;
- line breaks;
- blank lines;
- unusual characters.
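The tag removal and whitespace cleaning described above can be sketched with Python's standard `html.parser` module. This is an illustrative stand-in only, since the thesis program uses its own controls; the class name `TextExtractor` and the function `html_to_text` are assumptions, and the sketch does not attempt to separate main content from menus or advertisements.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words from adjacent blocks apart; it can also
    # split a word that is interrupted by an inline tag.
    text = " ".join(parser.parts)
    # Collapse runs of spaces, line breaks and blank lines into one space.
    return re.sub(r"\s+", " ", text).strip()


page = "<html><body><b>Xin</b>   chào<br>\n\n<script>var x = 1;</script>bạn</body></html>"
print(html_to_text(page))
```

A production cleaner would also need to decide how to treat link text and image alternative text, which this sketch simply keeps or ignores as the parser delivers them.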

III.3.2. Sentence splitting

The text is scanned sequentially and a sentence break is made when a sentence-ending character is met, namely "." (period), "!" (exclamation mark) or "?" (question mark), provided that the next character (possibly after some whitespace) is an uppercase letter. This rule eliminates cases that are not sentence breaks, such as:

- A "." that is not a sentence break but a separator inside a number. This works because a period inside a number must be followed by a digit, not by an uppercase letter.
- A "." that is part of an ellipsis inside a sentence and does not yet end the sentence.

Some examples:

- The text "Hôm nay là một ngày đẹp trời. Chúng ta sẽ đi cắm trại ngoài trời" is split between the word "trời" and the word "Chúng" into two sentences.
- The text "Trong vườn có 1.200 cây các loại, trong đó đa số là cây ăn trái như cam, quýt, đào, lê, mận, ... và một số cây cảnh như cau, tùng, ..." forms a single sentence.
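The basic rule can be sketched as a single scan over the text. This is a minimal illustration that covers only the rule stated so far (a terminator followed by an uppercase letter); the function name `split_sentences` is hypothetical.

```python
def split_sentences(text):
    """Cut after '.', '!' or '?' only when the next non-space character is an
    uppercase letter."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch in ".!?":
            rest = text[i + 1:].lstrip()
            if rest and rest[0].isupper():
                sentences.append(text[start:i + 1].strip())
                start = i + 1
    # Whatever is left after the last break is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Hôm nay là một ngày đẹp trời. Chúng ta sẽ đi cắm trại ngoài trời"))
print(split_sentences("Trong vườn có 1.200 cây các loại, trong đó đa số là cây ăn trái như cam, quýt, ..."))
```

Python's `str.isupper` also recognizes accented Vietnamese capitals such as "Đ", so no separate character table is needed for this check.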

The rule above still does not distinguish all occurrences of the period. The following cases, in which a period occurs without ending a sentence, are handled additionally:

- Links or web addresses (URL). Identifying signs: the string contains the character "." or "/" and one of the following substrings (only the substrings most common in web addresses are listed here): http, .com, .net, .edu, .vn, .org, .htm, .html, .asp, .jsp, .php, .gif, .jpg, .bmp, .pdf, .ps, .txt, .exe, .wav, .m3u, .mp3.

  Example: http://www.vnuit.edu.vn

- Abbreviations. The list of abbreviations handled is: GS., PGS., TS., VS., TSKH., NCS., ThS., BS., NS., DS., YS., LS., KS., CN., GĐ., PGĐ., TP., Tp., KCN.

- Strings containing several consecutive periods, for example:

  - Version strings (e.g. version 1.2.1). Strings of this kind contain many digits.
  - IP addresses (e.g. 172.9.10.1). Strings of this kind also contain many digits.
  - Strings that define the format of some notation (e.g. "the version of this program must be written in the form Vx.x.x.x").

III.3.3. Word segmentation

Word segmentation is the most important part of the program: whether the program can classify correctly and accurately depends on whether the segmentation is right or wrong. Given the characteristics of Vietnamese presented above (section II.3), in particular the fact that Vietnamese words cannot be separated by whitespace, choosing a segmentation method is rather difficult. As analysed above (section II.4), every single method has its own strengths and weaknesses, so this thesis uses a hybrid method to obtain better segmentation. The method is as follows:

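[Figure: flowchart of the hybrid word segmentation procedure. Input sentence -> check in the sentence dictionary (yes: use the stored word analysis; no: look up the words in the word dictionary) -> remove meaningless words using the stop word dictionary -> word list -> check synonyms using the synonym dictionary -> correction by the user -> list of words segmented from the sentence.]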

(a) For an input sentence, the program checks whether this sentence pattern already exists in the stored data. If it does, the stored segmentation of this sentence pattern is used (using Transformation-based Learning, TBL).

(b) If the sentence pattern is not yet stored, the program reads the first syllable and looks at the next one. If the first syllable together with the next one is in the database, the program reads the following syllable, and so on until it reads a syllable for which the sequence of syllables is no longer in the data. It then stops and takes the sequence read so far as a word. In other words, the program scans a phrase or sentence from left to right and selects the word with the most syllables that appears in the dictionary, then continues in the same way with the next word until the end of the sentence (using the Maximum Matching method, also called Left Right Maximum Matching, LRMM).

(c) After step (b), the program checks for and removes the words that are merely connective or descriptive and carry no meaning in the sentence (stop words), keeping only the most meaningful words.

(d) Before they are passed on for analysis, these words go through a synonym check for words with the same meaning but different forms, such as "túc cầu" and "bóng đá", or "địa cầu" and "trái đất", so that all such words are reduced to a single canonical form.

(e) When samples are being collected, after step (b) the program presents the result so that the user can decide whether the segmentation is right or wrong. If the user has to correct it, the sentence pattern is stored in the data so that the next time it is encountered the program segments it automatically as in step (a).

This is the general procedure. In practice, to increase speed (the lookup performed by Maximum Matching is rather slow), there are two different procedures:

- Procedure for processing samples: exactly as described above, where:
  o the Vietnamese dictionary is a combination of two dictionaries: the 78,000-word dictionary of Đinh Điền and the dictionary of the Vikass program (a program for classifying electronic news);
  o the dictionary of function words (stop words) is also taken from the ViKass program;
  o the synonym dictionary was built by the author.
- Procedure for running the classification program: to increase speed, the steps differ slightly:
  o the Vietnamese dictionary consists only of the words used to evaluate the classification, combined with the synonym dictionary;
  o the stop word removal step is skipped, because the function words have in effect already been removed by taking only the words that belong to the set of sample words (in the procedure above).

This method is not actually the best one (compared, for example, with Transformation-based Learning or the WFST and neural network model), but it has the advantage of being fast. The justification for it is that, from the point of view of string kernels, the more substrings two texts have in common the more similar they are, and these substrings are built from the meaningful words of a sentence. This hybrid method is therefore used provisionally in this thesis.

III.3.4. Labelling and weighting

Labelling and weighting quantifies the words in a document; it is this quantification that lets the program determine which group the selected document belongs to. The labelling also has a decisive effect on the classification result. It is carried out as follows. After the document has been read, its words are arranged in a table with the following information:

- Feature word.
- Number of occurrences in the document: a count of how often the word occurs in the document.
- Maximum distance: the largest weighted average of the distance from the feature word to the start of the sentence (see the example below).
- Minimum distance: the smallest weighted average of the distance from the feature word to the start of the sentence (see the example below).
- Average distance: the mean weighted average of the distance from the feature word to the start of the sentence (see the example below).

Example: suppose a document contains the feature word "đặc trưng".

In one sentence the feature word occurs at position 3:

  Feature word   Occurrences   Maximum distance   Minimum distance   Average distance
  đặc trưng      1             3                  3                  3

In another sentence the feature word occurs at position 5. Since this distance is greater than the stored maximum distance, the maximum distance and the average distance are recomputed:

  Maximum distance (new) = (1x3 + 1x5)/2 = 4
  Average distance (new) = (1x3 + 1x5)/2 = 4
  Occurrences (new) = 1 + 1 = 2

  Feature word   Occurrences   Maximum distance   Minimum distance   Average distance
  đặc trưng      2             4                  3                  4

In yet another sentence the feature word occurs at position 2. Since this distance is smaller than the stored minimum distance, the minimum distance and the average distance are recomputed:

  Minimum distance (new) = (2x3 + 1x2)/3 = 2.66
  Average distance (new) = (2x4 + 1x2)/3 = 3.33
  Occurrences (new) = 2 + 1 = 3

  Feature word   Occurrences   Maximum distance   Minimum distance   Average distance
  đặc trưng      3             4                  2.66               3.33

This continues until the end of the document. The maximum and minimum distances are then recomputed by offsetting the average distance by the arithmetic mean of the distances of the two values from the average. In the case above this offset is:

  ((4 - 3.33) + (3.33 - 2.66))/2 = (4 - 2.66)/2 = 0.67
  Maximum distance = 3.33 + 0.67 = 4
  Minimum distance = 3.33 - 0.67 = 2.66

  Feature word   Maximum distance   Minimum distance   Average distance
  đặc trưng      4                  2.66               3.33

However, since the distance of a word in a document is a positive integer, the program takes:

- as the maximum value, the smallest positive integer greater than or equal to the maximum distance;
- as the minimum value, the largest positive integer less than or equal to the minimum distance.

  Feature word   Maximum distance   Minimum distance   Average distance
  đặc trưng      4                  2                  3.33
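The update rule of the worked example can be sketched as a small class. This is an interpretation rather than the thesis code: the class name `DistanceStats` is hypothetical, exact fractions are used to avoid rounding noise, and refreshing the average for positions that fall between the current minimum and maximum is an assumption, since the text only shows the two out-of-range cases.

```python
import math
from fractions import Fraction


class DistanceStats:
    """Running statistics for one feature word, following the worked example."""

    def __init__(self, first_position):
        self.count = 1
        self.max = self.min = self.avg = Fraction(first_position)

    def add(self, position):
        n = self.count
        if position > self.max:
            self.max = (n * self.max + position) / (n + 1)
        elif position < self.min:
            self.min = (n * self.min + position) / (n + 1)
        # Assumption: the average is refreshed on every occurrence, including
        # positions that fall between min and max (the text does not say).
        self.avg = (n * self.avg + position) / (n + 1)
        self.count = n + 1

    def finalize(self):
        """Centre the range on the average, then round outwards to integers."""
        half_width = (self.max - self.min) / 2
        return math.floor(self.avg - half_width), math.ceil(self.avg + half_width)


stats = DistanceStats(3)
stats.add(5)
stats.add(2)
print(stats.count, float(stats.max), float(stats.min), float(stats.avg), stats.finalize())
```

The pair returned by `finalize` is the (minimum, maximum) integer range that is later stored for the feature word.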

We now have the weighting table for one word of the document. It is clear, however, that if labelling and weighting is carried out for all meaningful words in the document, the word vector of the document has a very high dimension, and the computation requires a very powerful computer (in practice it is very slow on ordinary computers). To speed up the program the size of these vectors must be reduced, but reducing them makes the recognition of documents less accurate. To balance accurate evaluation against the reduction of the dimension of the document vectors, the thesis provisionally adopts the following rules.

- For training, the following are taken from each sample document:
  o Only the meaningful words that occur most commonly, that is, the meaningful words with the highest number of occurrences. When selecting sample words, the program sorts the feature words by number of occurrences in descending order and takes only 1/3 of the total number of words, starting from the most frequent.
  o Among the feature words selected in this way, any word whose number of occurrences is less than 1/3 of that of the most frequent word is removed.
  o The program accepts the following error: if two words have the same number of occurrences but one of them has to be dropped, the program drops one at random and keeps the other, regardless of the meaning of either word.
- For a document to be classified:
  o Only the meaningful words that appear in the list of feature words selected from the sample documents are taken.
  o Words whose distance range does not intersect the distance range of a word in the list of sample feature words are removed. That is, a word whose maximum and minimum distances do not lie within the maximum and minimum distances of any sample feature word is removed from the list of words used for classification.
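The selection rule for sample words can be sketched as below. The sketch rests on several assumptions that the text leaves open: one third of the word count is rounded down (with a minimum of one word), ties are broken alphabetically instead of at random so that the result is reproducible, and the function name `select_feature_words` is hypothetical.

```python
def select_feature_words(counts):
    """counts: {word: occurrences}. Keep the most frequent third of the words,
    then drop any whose count is below one third of the top count."""
    if not counts:
        return []
    # Ties are broken alphabetically here to keep the sketch deterministic;
    # the text says the program drops one of the tied words at random.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    keep = max(1, len(ranked) // 3)  # rounding down is an assumption
    top_count = ranked[0][1]
    return [word for word, c in ranked[:keep] if c * 3 >= top_count]


example = {"a": 9, "b": 6, "c": 2, "d": 2, "e": 1, "f": 1, "g": 1, "h": 1, "i": 1}
print(select_feature_words(example))
```

In the example, nine words give a quota of three; the third-ranked word occurs only twice, which is less than one third of the top count of nine, so it is dropped as well.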

With this labelling and weighting, the sample documents become the vectors $VM_i(M_0(x_0, y_0), M_1(x_1, y_1), M_2(x_2, y_2), \ldots, M_m(x_m, y_m))$ and the documents to be classified become the vectors $V_i(C_0(x_0, y_0), C_1(x_1, y_1), C_2(x_2, y_2), \ldots, C_n(x_n, y_n))$, where:

- $VM$ is the vector of a sample document;
- $V_i$ is the vector of a document to be classified;
- $M_i$ is the i-th feature term of the sample document;
- $C_i$ is the i-th feature term of the document to be classified; note that $C_i \in \{M_j : j = 0, \ldots, m\}$;
- $x_i$ is the minimum distance of the i-th feature word in the document;
- $y_i$ is the maximum distance of the i-th feature word in the document.

Comparing these vectors with the Support Vector Machine (SVM) method shows which of the sample groups the document to be classified is closest to, and from this result we know which group the document belongs to.

III.3.5. Training
The sample documents are htm web pages taken from http://vietnamnet.vn/. To keep the work simple and not overly time-consuming, the thesis treats these pages as correctly classified and makes the following assumptions:
• The documents are classified into separate groups. In reality, the documents on http://vietnamnet.vn/ are not classified precisely: the classes overlap, so a document belonging to one class may have features of another class.
• The distribution of documents in one group does not affect the distribution of documents in another group. This assumption makes it possible to turn the classification problem with several overlapping classes into classification problems with separate classes.

The words used to represent documents are usually called features. To improve classification speed and accuracy, words that carry no meaning for classification are removed during preprocessing. These are typically words that occur too rarely. Removing them, however, may not reduce the number of features significantly. With a large number of features the classifier learns the training set exactly, yet in many cases it predicts poorly on new documents. Avoiding this requires a sample set large enough to train the classifier, but collecting a sample set that is large enough for the number of features is usually hard in practice. For classification to be effective in practice, the number of features must therefore be reduced. There are many effective feature selection methods. This thesis uses the mutual information method, which uses the mutual information between each word and each document class to choose the best words. The mutual information between word t and class c is computed as follows:

\[ MI(t, c) = \sum_{t \in \{0,1\}} \sum_{c \in \{0,1\}} P(t, c) \log \frac{P(t, c)}{P(t)\, P(c)} \]

where P(t, c) is the probability that word t and class c occur together, P(t) is the probability that word t occurs, and P(c) is the probability that class c occurs. The global MI measure (computed over the entire training set) for word t is:

\[ MI_{avg}(t) = \sum_{i} P(c_i)\, MI(t, c_i) \]
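As a rough illustration of the two formulas, the following Python sketch estimates the probabilities from document counts. The function names, the count-based estimates, and the base-2 logarithm are assumptions of this sketch; the thesis does not state the logarithm base or how the probabilities are estimated.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """MI(t, c) from a 2x2 table of document counts.

    n11: documents in class c containing t   n10: documents outside c containing t
    n01: documents in class c without t      n00: documents outside c without t
    Probabilities are estimated as count / total (an assumption).
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),  # t present, c present
        (n10, n11 + n10, n10 + n00),  # t present, c absent
        (n01, n01 + n00, n11 + n01),  # t absent,  c present
        (n00, n01 + n00, n10 + n00),  # t absent,  c absent
    )
    mi = 0.0
    for n_tc, n_t, n_c in cells:
        if n_tc > 0:  # a zero cell contributes nothing to the sum
            mi += (n_tc / n) * math.log2(n_tc * n / (n_t * n_c))
    return mi


def average_mutual_information(docs_with_term, docs_total):
    """MI_avg(t) = sum_i P(c_i) * MI(t, c_i).

    docs_with_term[c]: documents of class c that contain t
    docs_total[c]:     all documents of class c
    """
    n = sum(docs_total.values())
    n_t = sum(docs_with_term.get(c, 0) for c in docs_total)
    total = 0.0
    for c, n_c in docs_total.items():
        n11 = docs_with_term.get(c, 0)
        n10 = n_t - n11
        n01 = n_c - n11
        n00 = n - n_c - n10
        total += (n_c / n) * mutual_information(n11, n10, n01, n00)
    return total


# A word that appears in every document of class A and in none of class B.
print(average_mutual_information({"A": 50, "B": 0}, {"A": 50, "B": 50}))
```

A word spread evenly over the classes scores 0, while a word that perfectly separates two equally sized classes scores 1 bit, which is why high-MI words are preferred as features.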

Feature selection can discard many important words and so lose a great deal of information, which can reduce classification accuracy considerably. In practice, according to Joachims's experiments, very few features are irrelevant and most carry some information, so a good classifier should be trained with as many features as possible. The SVM algorithm, however, can adjust its classification capacity automatically and so maintains good generalization performance even in a high-dimensional data space (a very large number of features) with a limited number of sample documents. For feature selection the thesis therefore takes 1/3 of the sample words collected from all documents chosen as samples, on condition that the least frequent word in the selected list occurs at least 1/3 as many times as the most frequent selected word. With this choice the number of samples is small enough that the vector space does not grow too large, while documents are still classified accurately.

III.3.6. Document classification
The test data are also htm and html files, taken from http://vnexpress.net/Vietnam/Home/ and downloaded as files to the hard disk. The sections taken are those whose index categories are fairly close to the sample data.
As stated above, the training data are provisionally treated as correctly classified, and the classified files are likewise treated as normalized. The data from these files serve as the sample vectors for classifying the documents used afterwards.

Each document goes through preprocessing, sentence splitting and word segmentation as described in section III.2. After word segmentation it is represented as a vector whose components (dimensions) are the weights of the words:
V(C0(x0,y0), C1(x1,y1), C2(x2,y2), ..., Cn(xn,yn))
where:
V is the vector of the document to be classified
Ci is the i-th feature word of the document to be classified
xi is the smallest distance value of the i-th feature word in the document
yi is the largest distance value of the i-th feature word in the document

Here the thesis ignores word order and other grammatical issues (following string kernel theory). For each chosen number of features, the documents are represented as sparse vectors using TFIDF word weighting. Each sparse vector consists of two arrays:
• An integer array holding the indices of the feature words with non-zero values; each index is the index number of the sample feature word in the sample document database.
• A real-valued array holding the corresponding non-zero values, namely the largest and smallest distances of the feature word as it occurs in the document.
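A minimal sketch of the two-array sparse representation follows. The thesis does not say how the smallest and largest distances of one word are laid out in the value array, so this sketch assumes two consecutive slots per feature word (index 2k+1 for the smallest distance and 2k+2 for the largest). The helper names are hypothetical; the output line follows the plain-text `label index:value` format that LIBSVM reads, with indices in ascending order.

```python
def to_sparse(doc_features, feature_index):
    """Build (indices, values) arrays holding only non-zero entries.

    doc_features:  word -> (smallest distance, largest distance)
    feature_index: word -> 0-based position in the sample database
    Assumed layout: slot 2k+1 = smallest, slot 2k+2 = largest.
    """
    entries = []
    for word, (smallest, largest) in doc_features.items():
        k = feature_index.get(word)
        if k is None:          # not a sample feature word: skip it
            continue
        if smallest != 0:
            entries.append((2 * k + 1, float(smallest)))
        if largest != 0:
            entries.append((2 * k + 2, float(largest)))
    entries.sort()             # ascending indices
    indices = [i for i, _ in entries]
    values = [v for _, v in entries]
    return indices, values


def to_libsvm_line(label, indices, values):
    """Format one document as a 'label index:value ...' text line."""
    pairs = [f"{i}:{v:g}" for i, v in zip(indices, values)]
    return " ".join([str(label)] + pairs)


index = {"quoc hoi": 0, "chinh phu": 1}
doc = {"chinh phu": (7.5, 103.0), "quoc hoi": (0.0, 92.0), "khac": (1.0, 2.0)}
idx, val = to_sparse(doc, index)
print(to_libsvm_line(1, idx, val))
```

Storing only the non-zero slots keeps each document vector as short as the number of feature words it actually contains, regardless of how large the sample vocabulary is.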

Sparse vectors are used because the number of words appearing in each document is very small compared with the total number of words in use. This saves memory and also speeds up the computation considerably. To classify documents with the SVM method, the thesis uses the functions of the library at http://www.csie.ntu.edu.tw/~cjlin/libsvm/. These functions perform the computation on the vectors of the data read in and the vectors of the training set. The output takes the form of a result matrix, which the program interprets in order to assign documents to the document groups described above. When the result is not clear-cut, the program marks the document as undetermined and lists all the near matches, and the user decides the document group. If the user wants to use that result for training, the features of the document are converted into samples and stored in the training data, which makes the training data more complete and improves the results of later runs.
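The handling of unclear results can be sketched as a simple margin rule over per-group scores. This is an assumption made for illustration only: the thesis does not state how it decides that a result is "not clear-cut", and the function name `decide`, the score dictionary and the `min_margin` threshold are all hypothetical.

```python
def decide(scores, min_margin=0.1):
    """Return (label, candidates) from a mapping group -> score.

    If the best score beats the runner-up by less than min_margin, the
    document is reported as undetermined (label None) together with every
    group whose score is within min_margin of the best, so that the user
    can choose. The margin rule itself is an assumption of this sketch.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    if len(ranked) > 1 and best_score - ranked[1][1] < min_margin:
        close = [label for label, s in ranked if best_score - s < min_margin]
        return None, close
    return best_label, [best_label]


print(decide({"ChinhTri": 0.62, "KinhTe": 0.30, "CNTT": 0.08}))
print(decide({"ChinhTri": 0.41, "KinhTe": 0.38, "CNTT": 0.21}))
```

The first call returns a single group; the second returns `None` with the two close candidates, which corresponds to the case where the program asks the user to decide.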


CHAPTER IV. EXPERIMENTAL PROGRAM
The methods used in the program are described in Chapter II and the way they are carried out is described in Chapter III, so the algorithm used in the program follows those methods and procedures.
IV.1.1. Data preparation
The sample data are downloaded from the website http://vietnamnet.vn/ using the program Teleport Pro, available at http://tenmax.com/. When downloading these pages, the option "Replicate the directory structure of remote server" must be set so that the program retrieves the full structure of the site and reproduces it on the local machine. This makes the documents easier to classify.


After download, the data are split into the following sections:

Section     Files     Total size (text only, bytes)
ChinhTri    1,374     28,383,744
CNTT        13,439    204,711,680
GiaoDuc     4,271     89,755,136
KhoaHoc     1,807     30,454,784
KinhTe      3,337     61,110,272
Total       24,228    414,415,616

All the data use Unicode fonts, so no font conversion is needed.


The test data are taken from http://vnexpress.net/Vietnam/Home/, again using Teleport Pro with the option "Replicate the directory structure of remote server", so that the structure matches that of the sample data.
The Vietnamese dictionary is a text file on the hard disk. It is drawn from two dictionaries, that of Dinh Dien (77107 words) and that of the Vikass program (73901 words). After the words common to both are removed, the dictionary used in the program has 107773 words.
The dictionary of meaningless words (stop words) is taken from the Vikass program and has 805 words.
IV.1.2. Program description
The program uses Microsoft .NET technology, so it fully supports Vietnamese Unicode, and it is written for Windows (WinForms).
The databases used in the program are the program dictionaries and the dictionary of training data. They are stored as XML so that the program can run without depending too heavily on the test system.
All the data are embedded in the executable (.exe) and library (.dll) files, so the related data (for example the dictionaries mentioned above) are unpacked the first time the program is used.
IV.1.1. Installing the program
The minimum requirement for running the program is that the computer has .NET Framework 1.1 and .NET Framework 2.0 installed (both provided free of charge by Microsoft).
To work with Unicode, an operating system that supports Unicode should be used, such as Windows 2000, Windows XP or later.
The test data must be available as files on the local machine or on the network. The program does not yet perform online checking and has not been packaged as a small module (plugin) that can be attached to other programs, but the algorithm can readily be applied in other programs.
IV.1.2. Some program screens

Figure 4.1.2.1. The main working screen, with the following functions:
• Document analysis: reads one or more documents and reports the group of each document (or that the group could not be determined).
• Training: runs the sample files and analyzes their features, extracting the most basic features and adding them to the list used as samples for analyzing document data and assigning documents to groups.
• Dictionary: the Vietnamese dictionary, the dictionary of unimportant words (stop words) and the synonym dictionary.
• Sample database: the database of samples used for classification and the parameters that accompany them.

Figure 4.1.2.2. Data analysis screen: manual data analysis, which can also be used to check the correctness of the document grouping program.

Figure 4.1.2.3. Data analysis screen: analysis results.
This is the main working part of the program. It has two parts: data analysis and grouping results. Figure 4.1.2.2 is the analysis screen, which consists of the following parts:
• Part (1) selects the folder containing the data files to be grouped. After choosing the folder, the user chooses to load the data, which reads the document files into memory ready for analysis. All files in the current folder and in all of its descendant folders are selected.
• Part (2) lists the files selected in the folder. Before analysis the user can select or deselect files and remove documents that are not needed. The user can read these files in their original form or read only their text (text only), and can also analyze a single selected file to inspect the sentence analysis, the word segmentation and the feature parameters (the feature words of the document being analyzed).
• Part (3) shows the text read from the selected file, displayed as plain text only.
• Part (4) shows the list of sentences in the document and, for each sentence, the words obtained by word segmentation. It lets the user check whether the words were segmented correctly and with the right meaning.
• Part (5) lists all the words extracted from the document. These are the feature words of the document, shown with their vector parameters so that the user can assess the document and decide its group by hand (when the program cannot identify it).
• Part (6) is the button that analyzes all the checked document files. The program then analyzes all the selected files automatically and shows the results as in Figure 4.1.2.3.


Figure 4.1.2.4. Training screen
The training screen is much like the analysis screen, except that on the training screen the user must choose the document type under which the samples are stored in the table of standard document samples.
Note that training uses the Vietnamese dictionary, whereas analysis uses the dictionary of sample words of the classification groups.

Figure 4.1.2.5. Dictionary screen
The dictionary screen shown here is that of the Vietnamese dictionary. It lets the user add, edit and delete dictionary entries. The synonym dictionary screen and the stop word dictionary screen are similar.
IV.1.3. Installing the program
The minimum requirement for running the program is that the computer has .NET Framework 1.1 and .NET Framework 2.0 installed (both provided free of charge by Microsoft).
To work with Unicode, an operating system that supports Unicode should be used, such as Windows 2000, Windows XP or later.
The test data must be available as files on the local machine or on the network; the program does not yet perform online checking.
IV.1.4. Notes on preparing the data
When preparing the sample data, the most important task is to review the document files used as samples. Although the thesis treats the data on the sample website as correctly organized, articles that do not belong still have to be removed, and this work essentially has to be done by a person. For example, in the CNTT (information technology) category, the shortest of the analyzed articles reads:

Ten things you do not want to hear from technical support staff
21:24' 09/07/2004 (GMT+7)
Technical support staff who guide computer users over the telephone often run into awkward situations that force them to come up with odd tricks when faced with customers who are hard to help.

10. Do you have a... sledgehammer or a brick nearby?

9. ... Yes! That's right, even I... can't fix it!

8. So... what are you wearing?

7. Ugh... that is really bad!

6. We can help you fix it, but you will need a machete, a roll of adhesive tape and a car battery.

5. I'm sorry, Dave. I'm afraid I can't do that.

3. Please hold for a second... Mom! Timmy is hitting me!

2. OK, turn to page 523 of your copy of the Dianetics handbook.

1. Please hold for... Bill Gates's lawyer.

The actual address of this article is http://www.vnn.vn/cntt/itpark/2004/07/174595/. On reading it, it is clear that this is not an article about information technology but simply a joke, so it must be removed from the CNTT category.
The longest article is the following:

Many other readers expressed their frustration and pointed out that not only FPT but many other ADSL providers in Vietnam currently have the same problem. Actual service quality is far from what is advertised and does not meet the specifications of "broadband"!

For readers' convenience, and so that we have more grounds for further inquiry, the feedback is provisionally divided into two groups: readers' opinions on the quality of FPT's ADSL service, and opinions on the quality of other service providers in Vietnam.

Not only FPT: do other ISPs also "have problems"?

Name: Dng c Thnh Address: 2D22 TT Đại học Thủy Lợi Email: thanhlan7479@yahoo.com Subject: Not only FPT's ADSL

Content: I run an internet service business. I currently use a 2MB line from VDC (MegaVNN - Maxi, maximum speed 2Mbps/640Kbps) but see the same symptoms as on the FPT line. At first downloads ran at about 250-300kb/s, but now they are sometimes 15-60kb/s. I raised this with 800126 and was told to check whether my machines had a virus, and that it was probably because I had too many computers connected (10 machines). That is not right, because I had always used 10 computers to access the Internet. I checked for viruses and the line was still slow; I tried connecting a single machine and it was just as slow. I think VDC is probably just as Mr. Anh said: "Nobody states minimum figures or stable speed figures in the contract." If even the provider says so, what can we say? Whom can we ask, so that we get the service we are entitled to?!

Name: thanhloi Address: Email: thanhloi2003gl@yahoo.com Subject: Content: Gia Lai has only one ADSL subscriber line, from VNN, rented through the Gia Lai post office. It is not just FPT; the VNN line is the same. It is fairly stable only from 10 at night to 6 the next morning; from 6 in the morning onward it is very slow, at times even worse than Dial up 1269. I run an internet cafe, and the other internet cafes I asked are just as slow. I don't understand how the Gia Lai post office does business. If Gia Lai had another service we would probably have dropped this line already. When asked, the post office says it doesn't know. Are customers' rights protected at all?

The third reader comment is reproduced as written, in Vietnamese without diacritics:

Ho ten: Hoang Cong Xuong Dia chi: 115 nguyen Thien thuat,thi xa Hung yen,tinh Hung yen Email: urethane@hn.vnn.vn
Tieu de: ve toc do truy cap ADSL Noi dung: Toi thi khong xai FPT, ma xai VNPT o co quan va o nha rieng nhung so fan cung khong hon gi. Toi dong y voi cac ban la nha cung cap chi quang cao, con thuc te toc do chi dat 1/5-1/10 tham chi nhieu thoi diem khong ket noi duoc. Xem phim thuong bi dung hinh nhieu. Toi nghi quang cao phai di doi voi nang luc, do do tien thue bao thuc te cung khong re!

The article goes on, but only an excerpt is shown here. Its actual address is http://www.vnn.vn/cntt/2005/11/512827/. The full article is longer than the excerpt, but it clearly has several problems:
• Although it touches on IT, it is a set of replies (FAQ) about ADSL networks and has few of the features of an IT article.
• It contains many passages written in Vietnamese without diacritics, so the analysis discards these Vietnamese words entirely and the number of feature words actually obtained is very small.

Such articles therefore also need to be removed.
An example of an article in the middle of the range (with files sorted by size from largest to smallest):

Microsoft delays the release of Windows Vista until 1/2007
10:38' 22/03/2006 (GMT+7)
The software giant plans to delay the consumer release of Windows Vista until early next year, instead of the earlier target of the second half of 2006.

Source: CNET
However, Microsoft still commits to releasing the version of Vista for business customers this coming November. Microsoft's share price fell nearly 3% immediately after the news.

Vista, the most significant Windows upgrade since Windows XP (launched 5 years ago), was originally expected to arrive in 2005. But Microsoft has postponed it repeatedly, from early 2006 to late 2006 and now all the way to 2007.

According to analysis by the research firm Gartner, this 8-10 week delay in releasing Vista could affect the entire technology industry, from computer makers to chip makers, distribution systems, researchers and investors.

The stock market, worried about the effect of this event on computer sales, saw a chain of price drops for Intel, HP and Dell.

All this time, Windows Vista has held the position of "star", leading the way for a series of Microsoft products and services (recently or soon to be announced), from the next-generation Xbox 360 game console to the new Office software.

The success of these new products is vital to Microsoft in maintaining and consolidating its position in the software market, at a time when Microsoft shares have barely moved for the past 5 years.

Windows, the operating system on as many as 90% of computers sold worldwide, is Microsoft's number one goose that lays the golden eggs. For that reason, Microsoft's forced postponement of the Vista release date has left not only Microsoft but a whole series of other technology companies in confusion. "We can't think of what to do right now. We will just have to let it be."

For its part, Microsoft explains that the delay is because it wants to raise the quality of Vista further, especially in security, and that computer makers themselves did not want the release either, since a new operating system version launched right in the shopping season would cause instability in the market.

New "dreams" for Windows

Besides tightened security, the new Windows operating system will have a new interface with 3D scrolling between windows. These windows can be displayed transparently so that users can see the information underneath.

In addition, it can play and record high-definition television on the computer, and lets users search for documents stored on the hard disk as well as on the Internet more easily and effectively.

According to earlier information, Microsoft also plans to release as many as 8 different versions of Vista, aimed at different groups of computer users, instead of dividing them by hardware specification as is traditional.

The original address is http://www.vnn.vn/cntt/2006/03/552699/. This article is an IT news page with enough information to meet the requirements.
The same applies to pages in GiaoDuc, ChinhTri and so on, so there are two ways of taking samples:
• The user selects the files in advance and specifies their type.
• The program takes the files in a folder that has been designated as belonging to a given group, but does not use files that are too large or too small.

In practice, the program takes files from the selected folder and uses only a portion of the files in that folder. It excludes entirely (does not use as sample documents) files that are too large and/or too small, that is, files whose size lies too far from the middle of the size range toward either extreme (largest/smallest). With this approach the number of sample files decreases but the samples become more characteristic.
As for the feature words, in theory the meaningful words that occur most often are more significant than words that occur less often. When the program was run, however, monosyllabic feature words (made up of a single syllable) turned out to occur quite often without providing the discrimination needed. In practice the program therefore selects words in the following order of priority:
• Words with more than one syllable get higher priority.
• Words with more occurrences get priority.

However, the number of feature words taken does not exceed 1/3 of all the feature words extracted from the document, and among the feature words taken, the least frequent word must occur at least 1/3 as many times as the most frequent one.
Because the program does not use a dedicated database such as MS SQL Server or MS Access to store temporary data and uses XML storage as its database instead, searching and filtering data is slower than with a dedicated database. The algorithm must therefore be kept short and must minimize repeated searching and filtering. The program performs three lookups when analyzing the meaningful words of a sentence:
• Filter meaningful words against the dictionary
• Filter synonyms that are written differently
• Remove meaningless words (words with no feature value, stop words)

The program shortens these three steps before analyzing the words of a sentence, as follows:
• Add the synonyms that are written differently to the dictionary used for word lookup
• Remove from the dictionary any word that exists in the stop word dictionary
• Index the dictionary words to speed up lookup
• Look up words of two or more syllables first, using the Maximum Matching method (forward/backward)
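The forward variant of Maximum Matching can be sketched as follows. This is an illustrative version under stated assumptions, not the thesis code: the function name, the `max_len` limit of four syllables and the sample dictionary are hypothetical, and the input is assumed to be a sentence already split into syllables on whitespace. A Python `set` stands in for the indexed dictionary.

```python
def forward_maximum_matching(syllables, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the
    longest run of syllables found in the dictionary; an unknown single
    syllable is emitted as a word of its own.
    """
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for size in range(longest, 0, -1):      # try longest candidates first
            candidate = " ".join(syllables[i:i + size])
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


dictionary = {"cong nghe", "thong tin", "cong nghe thong tin", "phat trien"}
sentence = "cong nghe thong tin phat trien nhanh".split()
print(forward_maximum_matching(sentence, dictionary))
```

Trying multi-syllable candidates before single syllables matches the priority stated above; the backward variant scans from the end of the sentence in the same way.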

IV.1.5. Experimental results
After training, the program was tested on html document files downloaded from http://vnexpress.net/, in the following folders:

Section       Files    Total size (text only, bytes)
The-Gioi      137      5,130,498
Vi-tinh       297      25,539,072
Xa-Hoi        134      6,181,187
Kinh-Doanh    38       1,290,156

To summarize the results, the test data are provisionally treated as correctly organized as well, so the Xa-Hoi (society) section is compared against the ChinhTri section of the standard samples. The results are:

Item                                               Files
Document files to analyze                          134
Files identified as politics                       87
Files whose type could not be determined           10
Files identified as a type other than politics     37

After the first run, the files identified as another type and the files whose type could not be determined were rechecked, with the following findings:

Item                                                          Files
Undetermined files whose content is in fact politics          2
Files identified as another type but judged to be politics    15

The total number of files of the correct type is therefore 87 + 2 + 15 = 104 files.
Identified: 87 of 104 files = 83.65%


The same procedure was applied to the other document types. Over all document types the overall result is about 82.63%.
Remarks on the results:
The experimental result is not especially high, but it is not much worse than other methods either, particularly for classifying Vietnamese documents. This can be fully explained by the following causes:
• The standard samples come from a single website whose classification is assumed to be correct, whereas in reality these documents are certainly not organized as accurately as hoped.
• The samples used as the standard are only treated as standard, with an attempt to make them reasonably so (only the document files in the middle of the range of all documents considered standard are taken), so the standard data are not fully correct either.
• The feature data of a document are obtained mainly by statistical methods, so there will certainly be cases where a value counts as a feature without truly characterizing the document type. If another method determined the features more accurately, the results would certainly be higher.


CHAPTER V. CONCLUSION

The author has built a document classification program. Because time was limited, the program's convenience features are still modest, but the program uses recent theories that are widely applied in practice, namely string kernels and the Support Vector Machine (SVM), so in theoretical terms it represents a certain step forward. The results are not clearly better than those of similar programs, but given that the training data were not normalized and therefore remain ambiguous, they are entirely acceptable. For document types that are clearly distinct, such as documents on IT or on economics, the program performs clearly better.
The program also shows that improving the effectiveness of a program requires applying several theories in combination.
Directions for improving the program:
• Use a more accurate sentence analysis method
• More sample data can ensure higher accuracy
• Sample data must be selected carefully when collected, to avoid ambiguity between samples that limits the results
Directions for future research:
• Add a Vietnamese semantic analyzer to increase accuracy
• Study further algorithms to supplement the analysis database
• Study a mechanism for checking web pages on the network, to support online search and classification
Practical application:
The program is beginning to be deployed for search and classification at the Records and Archives Center of the People's Committee of Ba Ria - Vung Tau province.



CHAPTER VII. APPENDIX

VII.1. Database structure of the program
The dictionary table and the stop word table each have a single column, the word column.
Synonym table:
Tu       Holds the synonym
Nghia    Holds the word with the corresponding meaning in the dictionary
Table holding the sample document data:
Tu           Holds the feature word
Nhom         Holds the document group
TrungBinh    Mean distance of the word in the sentence
LonNhat      Largest distance of the word in the sentence
NhoNhat      Smallest distance of the word in the sentence

VII.2. Document recognition results

Columns: A = section to identify; B = label it should receive; C = number of files; D = identified; E = not identified; F = identified as another type; G = re-identified from E; H = re-identified from F; I = summary (%).

A             B           C      D      E     F      G     H     I
The Gioi      Chinhtri    137    74     11    52     14    16    71.15
Vi tinh       cntt        297    203    21    73     5     27    86.38
Xa hoi        Chinhtri    134    87     10    37     2     15    83.65
Kinh Doanh    Kinh te     38     31     1     6      1     3     88.57
TOTAL                     606    395    43    168    22    61    82.63
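The summary column can be recomputed from the other columns with the same arithmetic as in section IV.1.5, I = D / (D + G + H). The short Python check below uses only the figures from the table; the function name `recognition_rate` is chosen for this sketch.

```python
def recognition_rate(identified, from_undetermined, from_other_type):
    """Percentage I = D / (D + G + H), as in 87 / (87 + 2 + 15)."""
    correct_total = identified + from_undetermined + from_other_type
    return 100.0 * identified / correct_total


# (D, G, H) per row of the table above.
rows = {
    "The Gioi": (74, 14, 16),
    "Vi tinh": (203, 5, 27),
    "Xa hoi": (87, 2, 15),
    "Kinh Doanh": (31, 1, 3),
    "TOTAL": (395, 22, 61),
}
for name, (d, g, h) in rows.items():
    print(f"{name:<12}{recognition_rate(d, g, h):.2f}")
```

The four per-section values agree with column I. The overall value comes out as 82.64 when rounded to two decimals, which suggests the 82.63 in the table was truncated rather than rounded.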


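VII.3. Features of the document classification samples (excerpt)

ChinhTri
Word                             Mean value    Largest value    Smallest value
Chính phủ (government)           34.63         103.09           7.53
Chủ tịch (chairman)              26.10         59.11            3.42
đại biểu (delegate)              19.83         52.93            4.84
đầu tư (investment)              48.99         113.46           9.81
phát triển (development)         158.85        340.31           10.61
Quốc hội (National Assembly)     29.61         91.98            6.06
Thủ tướng (Prime Minister)       17.95         56.88            2.69
thực hiện (carry out)            41.01         73.86            4.80
Tổ chức (organization)           43.13         87.67            0.00
vấn đề (issue)                   136.62        243.02           6.76
Việt Nam (Vietnam)               35.87         87.38            6.44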

CNTT
Word                               Mean value    Largest value    Smallest value
có thể (can)                       16.27         33.22            3.06
công nghệ (technology)             16.17         32.06            3.88
công ty (company)                  17.12         34.76            3.72
cung cấp (supply)                  16.76         34.70            4.76
di động (mobile)                   16.10         31.75            6.02
dịch vụ (service)                  17.23         70.20            5.52
điện thoại (telephone)             20.25         35.59            4.13
khách hàng (customer)              18.07         53.42            7.12
máy tính (computer)                18.18         32.26            0.92
phản hồi (feedback)                11.44         21.92            3.00
phần mềm (software)                15.54         38.51            4.67
phát triển (development)           19.54         36.87            4.74
Sản phẩm (product)                 20.14         35.87            0.00
sử dụng (use)                      18.37         37.14            2.79
thế giới (world)                   18.40         54.74            8.61
Thị trường (market)                16.61         33.45            2.48
thông tin (information)            19.91         35.67            0.77
viễn thông (telecommunications)    19.27         37.45            6.10
Việt Nam (Vietnam)                 18.96         39.76            5.68

GiaoDuc
Word                        Mean value    Largest value    Smallest value
chương trình (program)      21.67         60.09            6.42
có thể (can)                18.68         47.72            4.80
đào tạo (training)          24.80         59.40            2.78
giáo dục (education)        29.17         84.64            6.12
giáo viên (teacher)         20.25         56.53            2.89
học sinh (pupil)            19.96         44.96            0.93
sinh viên (student)         23.11         50.43            5.11
Thí sinh (candidate)        16.76         75.20            4.54
tổ chức (organization)      21.99         57.79            7.25
tốt nghiệp (graduation)     20.87         62.82            7.87
Việt Nam (Vietnam)          25.80         72.97            3.90

KinhTe
Word                         Mean value    Largest value    Smallest value
Chứng khoán (securities)     17.04         42.47            0.00
cổ phiếu (shares)            11.88         40.08            2.77
có thể (can)                 17.59         114.29           5.33
công ty (company)            17.79         45.37            3.65
đầu tư (investment)          19.35         43.81            5.76
doanh nghiệp (enterprise)    19.25         54.05            5.89
giao dịch (transaction)      13.70         37.97            4.56
Ngân hàng (bank)             18.44         41.70            4.94
phát triển (development)     22.75         44.22            2.55
thị trường (market)          18.24         42.01            6.66
Việt Nam (Vietnam)           20.11         44.25            1.77

KhoaHoc
Word                        Mean value    Largest value    Smallest value
bệnh nhân (patient)         18.84         42.59            3.02
Bệnh viện (hospital)        33.04         362.59           4.72
có thể (can)                74.13         609.13           4.80
điều trị (treatment)        21.00         40.76            3.21
gia cầm (poultry)           23.36         55.00            9.01
khoa học (science)          42.68         198.87           3.72
môi trường (environment)    19.61         67.66            6.81
nghiên cứu (research)       22.54         125.17           3.09
nguy cơ (risk)              24.42         137.85           2.68
phản hồi (feedback)         21.26         120.38           4.96
phát hiện (discovery)       15.46         67.70            3.60
phát triển (development)    41.54         201.24           3.03
Phẫu thuật (surgery)        23.68         73.74            2.70
sản xuất (production)       32.64         292.88           3.03
sử dụng (use)               20.62         35.10            2.59
tế bào (cell)               19.07         70.85            8.04
thế giới (world)            19.42         37.36            6.90
trường hợp (case)           15.72         45.35            4.57
tử vong (death)             21.96         48.83            4.75
Việt Nam (Vietnam)          35.70         235.59           3.17
Y tế (health care)          26.65         483.37           4.77
