
Electric Power University (Trường Đại học Điện Lực)

Topic 3: Text Summarization

TABLE OF CONTENTS
6.1. Conclusion
6.2. Future work
REFERENCES

Final report for the Machine Learning course


PREFACE
Although the rapid growth of information on the Internet only began in the last decade of the previous century, methods for processing textual information, such as summarization, information extraction, classification, and document indexing, date back to around 1959-1960. Text summarization still attracts considerable interest from researchers, and the annual summarization workshops (DUC) consistently address the search for the best way to summarize text. The earliest studies all used sentence extraction based on word and phrase frequency (Luhn, 1958), the position of a sentence in the document (Baxendale, 1958), and key phrases (Edmundson, 1969). Estimating the importance of a word with the tf*idf frequency model is still one of the main methods in use today.

One challenging problem that has drawn attention in recent years is producing a summary for a set of documents related in content, referred to here as Vietnamese text summarization. Vietnamese text summarization is recognized as a highly complex problem. Most people assume that it simply means applying a summarization technique to a given document. This is not accurate: the greatest challenge is that the input may contain semantic ambiguity among the contents of a document, so producing a good summary is extremely difficult [EWK].

By choosing the topic "Vietnamese text summarization using an unsupervised method", we focus on studying, surveying, evaluating, and proposing a summarization method suited to the Vietnamese language. The chapters below present our analysis and development work in detail, from which we draw conclusions and directions for future work.


CHAPTER 1 - OVERVIEW OF THE TEXT SUMMARIZATION PROBLEM


1.1. Problem overview:

Project requirement: summarize Vietnamese text using an unsupervised method. In this project we study the algorithm behind the unsupervised method and apply it to text summarization in a C# environment.

1.2. The automatic text summarization problem:

According to Inderjeet Mani, the goal of automatic text summarization is to take an information source, extract content from it, and present the most important content to the user in a condensed form and in a manner sensitive to the needs of the user or the application. Producing a summary whose quality matches one written by a human, without restricting the application domain, is considered extremely difficult. For this reason, work on summarization usually targets one specific type of text or one specific type of summary.

2.1. Basic concepts and classification of summarization:

2.1.1. Basic concepts:
- Compression rate: a measure of how much information is condensed into the summary, computed as

$$\text{CompressionRate} = \frac{\text{SummaryLength}}{\text{SourceLength}}$$

where SummaryLength is the length of the summary and SourceLength is the length of the source document.
- Salience or relevance: the weight assigned to a piece of information in the document, expressing its importance to the document as a whole or its relevance to the user's application.
- Coherence: a summary is coherent if all of its parts form a unified whole in terms of content and no parts repeat one another.

2.1.2 Classification of summarization:
There are many ways to classify summarization, and every classification is only relative because it depends on the basis chosen. This report classifies summarization on three bases: the format and content of the input, the format and content of the output, and the purpose of the summary.

Classification by input format and content answers the question "What is being summarized?". It yields several sub-classifications:
- Document type (article, news bulletin, letter, report, ...). Summarizing an article differs from summarizing a letter or a scientific report because each type has its own characteristics.
- Document format: summarization of free-form text versus summarization of structured text. For structured text, summarizers usually use a model learned from previously built structural templates.
- Number of inputs: single-document versus multi-document summarization. In single-document summarization the input is one document, while in multi-document summarization the input is a set of related documents, such as news items about the same event, web pages on the same topic, or a cluster returned by a clustering process.
- Data domain: summarization for a specific domain, such as healthcare or education, versus summarization for general-domain data.

Classification by purpose clarifies how the summary is produced, what it is for, and whom it serves:
- By audience: a summary for experts differs from a summary for ordinary readers.
- By use: a summary used in information retrieval (IR) differs from a summary used for sorting.
- Indicative versus informative: an indicative summary indicates the kind of information, for example that a document is marked "top secret", whereas an informative summary conveys the content of the information.
- Query-based versus generic: a generic summary covers the entire content of the document. A query-based summary depends on a query supplied by the user or a program, and is commonly used to summarize the results returned by a search engine.

Classification by output also has several variants.
- By language: summarization can be classified by the languages a system is able to handle:

Monolingual summarization: the system can summarize only one language, such as Vietnamese or English.
Multilingual summarization: the system can summarize documents in several languages, but the output is in the same language as the input.
Cross-lingual summarization: the system can produce output in a language different from that of the input.
- By output format: table, paragraph, or keywords.

Beyond these two classifications, a widely used output-based distinction is between extractive summarization (Extract) and abstractive summarization (Abstract).
Extractive summarization: the output consists entirely of important parts taken from the input document.
Abstractive summarization: the output does not reuse the components of the input verbatim. Instead, a new summary is written from the important information.

Extractive systems are currently more widely used and give better results than abstractive ones. The difference arises because the subproblems of abstractive summarization, such as semantic representation, inference, and natural language generation, are considered hard and have so far produced fewer promising results than sentence extraction. According to Dragomir R. Radev (University of Michigan, USA), no abstractive summarization system has yet reached maturity, and current abstractive systems usually build on an existing extraction component. Such systems are often known as text compaction systems.


Text compaction: a kind of summarization that truncates or abbreviates the important information after it has been extracted.

Although many kinds of summarization can be distinguished on different bases, single-document and multi-document summarization continue to receive the most attention from researchers in automatic summarization.

2.2. Overview of text summarization:

Single-document summarization, like other summarization problems, is an automatic process whose input is one document and whose output is a short passage describing the main content of that document. A single document may be a web page, an article, or a file in a specific format (for example .doc or .txt). Single-document summarization is a stepping stone to multi-document summarization and to more complex summarization problems, which is why the earliest summarization methods were all single-document methods. These methods also fall into two groups: extractive and abstractive.

Extractive summarization: most methods of this kind extract salient sentences or phrases from the text and combine them into a summary. Early studies typically used features such as the position of a sentence in the document, the frequency of words and phrases, or cue phrases to compute a weight for each sentence, and then selected the highest-weighted sentences for the summary [Lu58, Ed69]. More recent techniques use machine learning and natural language processing to find the important parts of a document. Machine learning approaches include the method of Kupiec, Pedersen, and Chen (1995), which uses a Bayes classifier to combine features [PKC95], and the work of Lin and Hovy (1997), which applies machine learning to identify the positions of important sentences in a document [LH97]. Natural language analysis has also been applied, for example the use of the WordNet lexical network by Barzilay and Elhadad in 1997 [BE97].

Abstractive summarization: methods that do not rely on extraction to produce a summary can be regarded as abstractive. Approaches include information extraction, ontologies, and information fusion and compression. Among the abstractive methods that give good results are those based on information extraction: they use predefined templates for an event or storyline, and the system automatically fills the templates and then generates the summary. Although the results are good, these methods usually apply only within a particular domain.


CHAPTER 2 - THEORETICAL BACKGROUND AND DEVELOPMENT TOOLS


2.1. Introduction to the C# language:
C# is derived from C and C++, but it was built on a more advanced foundation. Microsoft started from C and C++ and added new features to make the language easier to use. Many of these features closely resemble features found in Java. Microsoft set out several goals when designing the language:

1. C# is a simple language:
- C# removes some of the complexity and confusion of C++ and Java.
- C# closely resembles C / C++ in appearance, syntax, expressions, and operators.
- C# features are taken directly from C / C++ but improved to make the language simpler.

2. C# is a modern language. It has the characteristics of a modern language, such as:
- Exception handling
- Automatic garbage collection
- Extended data types
- Source code security

3. C# is an object-oriented language. It supports all the features of an object-oriented language:
- Encapsulation
- Inheritance
- Polymorphism

4. C# is a powerful and flexible language:
- With C#, we are limited only by ourselves. The language imposes no constraints on what can be done.
- C# is used for many kinds of projects, such as word processors, graphics applications, and spreadsheets, and even compilers for other languages.
- C# uses a limited set of keywords. Most keywords describe information, but this does not make C# any less powerful. The language can be used for almost any task.

5. C# is a modular language:
- C# source code is written in classes, and each class contains its member methods.
- Classes and their member methods can be reused in other applications or programs.

6. C# will become popular: C# brings the power of C++ together with the ease of Visual Basic.
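The object-oriented features listed under point 3, together with exception handling from point 2, can be illustrated with a minimal sketch. The classes `Document`, `Article`, and `Email` are hypothetical and exist only for this example. They are not part of the report's program.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical classes for illustration only; not taken from the report's program.
public abstract class Document
{
    // Encapsulation: the text is private and exposed read-only through a property.
    private readonly string text;

    protected Document(string text)
    {
        // Exception handling: reject invalid input at construction time.
        if (text == null) throw new ArgumentNullException("text");
        this.text = text;
    }

    public string Text { get { return text; } }

    // Polymorphism: each subclass supplies its own description.
    public abstract string Describe();
}

// Inheritance: Article and Email reuse everything defined in Document.
public class Article : Document
{
    public Article(string text) : base(text) { }
    public override string Describe() { return "Article, " + Text.Length + " characters"; }
}

public class Email : Document
{
    public Email(string text) : base(text) { }
    public override string Describe() { return "Email, " + Text.Length + " characters"; }
}

public static class OopDemo
{
    public static void Main()
    {
        var documents = new List<Document> { new Article("News text"), new Email("Hello") };
        foreach (Document d in documents)
        {
            // The call is dispatched at run time to the subclass implementation.
            Console.WriteLine(d.Describe());
        }
    }
}
```

The loop in `Main` works with the base type only, which is the practical benefit of polymorphism: new document kinds can be added without changing the loop.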

2.2. Introduction to Access:


1. What is a database?
A database is a collection of related information. For example, if you gather all your photographs together, you have a photo database. If you gather all photographs on the same subject, you have either a database in its own right or a subset of the whole database. When a database is small (for example, your insurance policies), you can manage the information by ordinary means, using traditional methods such as a card file or a simple list on paper. As the database grows, however, managing it becomes harder. It would be very difficult, for example, to manage the customer database of a large company by hand. This is where a computer and a database management system become useful. Database management software helps you manage information faster and more easily. In Access, a database holds not only information organized in tables but also relationships, queries, forms, reports, and programming commands. Some common Access terms are explained below.

2. What is a table?
In Access, tables hold the actual information in the database, and a database may contain more than one table. The information in one table may relate to the information in others. For example, one table might hold a record for every door lock in a building, another a list of all the keys for those locks, and a third the names of everyone who holds a key. All three tables contain related information, so together they form a database.


3. What is a query?
Working with a large database usually means working with particular parts of the data. Suppose you have a company database and want to see the names of all customers who live in Hà Nội. This is a case for a query, which asks a question of the data such as "Which customers live in Hà Nội?". A query is therefore a request for the information in the database that you want to see. For example, if the database contains the names of all customers who bought a particular product, a query can list those customers, and another query can restrict the list to customers who are children. In essence, a query limits or filters the information in a database. When you filter data with a query, Access displays only the information that satisfies it.

Why use queries? When you work with only part of a database, queries make it easy to return the records that meet a given criterion. Access lets you make queries as clear, specific, or complex as you wish.

4. What is a form?
A database exists to store information. Once you have decided what information the database holds, you need a place to enter data and later to view, add to, or change it. You can use Datasheet view for these operations, or you can create a form displayed on screen for entering, viewing, and changing information. A form can display the information in a table and also add buttons, text boxes, labels, and other objects that make data entry easier.

5. What is a record?
A record is a self-contained block of information, such as the data about one employee or one customer. A table is made up of many records. For example, in a table describing a collection of baseball cards, one record holds the information about a single card. Records are normally laid out as rows in a table, and Access presents them as rows.

6. What is a field?
Tables are made up of records, and records are made up of fields. A field is therefore the smallest unit of information in a database. For example, in a table holding a telephone directory, each record represents a different person or business, and each record is in turn made up of individual fields such as name, address, and telephone number.
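The relationship between table, record, field, and query can be sketched in C# with an in-memory collection and LINQ standing in for Access. The `Customer` type and the sample rows are invented for this illustration and do not come from the report's database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One record of a hypothetical Customers table; each property is a field.
public class Customer
{
    public string Name { get; set; }
    public string City { get; set; }
}

public static class QueryDemo
{
    // The "query": keep only the records whose City field matches, then project the Name field.
    public static List<string> NamesInCity(IEnumerable<Customer> table, string city)
    {
        return table.Where(c => c.City == city)
                    .Select(c => c.Name)
                    .ToList();
    }

    public static void Main()
    {
        // The "table" is a collection of records (sample data, assumed for the example).
        var customers = new List<Customer>
        {
            new Customer { Name = "An", City = "Hà Nội" },
            new Customer { Name = "Bình", City = "Đà Nẵng" },
            new Customer { Name = "Chi", City = "Hà Nội" }
        };

        // "Which customers live in Hà Nội?"
        foreach (string name in NamesInCity(customers, "Hà Nội"))
        {
            Console.WriteLine(name);
        }
    }
}
```

As with an Access query, the filter leaves the underlying table untouched and only limits what is returned.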


CHAPTER 3: APPROACHES TO WORD SEGMENTATION


3.1. Word-based approaches:
Word-based approaches aim to separate complete words in a sentence. They can be divided into three groups: statistics-based, dictionary-based, and hybrid (combining several methods in the hope of gaining the advantages of each).

Statistics-based approach: relies on information such as the frequency of words in an initial training set. Because it depends on the training corpus, this approach is flexible and useful across many domains.

Dictionary-based approach: the idea is that phrases taken from the text must match entries in a dictionary, so a separate dictionary is needed for each domain of interest. The "full word / phrase" variant needs a complete dictionary in order to separate all words and phrases in the text, while the "component" variant uses a component dictionary, which contains only the components of words and phrases, such as morphemes and simple words. The dictionary-based approach is limited by its total reliance on the dictionary. Building a complete dictionary is hard in practice because it demands a great deal of time and effort. Using a component dictionary eases this difficulty, because complete words and phrases are then formed from morphemes, simple words, and other words.

Hybrid approach: combines different approaches in order to inherit the advantages of several techniques and improve results. It usually combines the statistics-based and dictionary-based approaches to exploit the strengths of both. However, the hybrid approach costs more processing time and disk space and is more expensive overall.

3.2. Character-based approaches:
In Vietnamese the smallest morpheme is the syllable (tiếng), which is formed from several characters of the alphabet. This approach simply extracts a fixed number of syllables from the text, treating each syllable as a character and taking either one at a time (unigram) or several at a time (n-gram). It has produced useful results, as shown by several published studies. Lê An Hà [2003] built a 10 MB raw corpus and used dynamic programming to maximize the probability of phrases. H. Nguyễn [2005], instead of using a raw corpus, treated the Internet as a huge corpus, collected statistics from it, and used a genetic algorithm to search for the best segmentation. Other authors have published related work. Comparing the two, H. Nguyễn's method gives better segmentation results than Lê An Hà's but takes longer to run. The main advantages of the n-gram approach are that it is simple and easy to apply, and that it keeps the cost of indexing and of processing many queries low. Published studies suggest that among character-based approaches, two-character (bigram) segmentation is a suitable choice.

3.3. Current methods for Vietnamese word segmentation

3.3.1. Maximum Matching: Forward / Backward
Maximum Matching (MM) is also called LRMM (Left Right Maximum Matching). In this method we scan a phrase or sentence from left to right, choose the word with the most syllables that appears in the dictionary, and repeat until the end of the sentence.

Simple form: used to resolve single-word ambiguity. Suppose we have a string of characters C1, C2, ..., Cn. Starting at the beginning of the string, we first check whether C1 is a word, then whether C1C2 is a word, and continue in this way until the longest word is found.
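The simple form of maximum matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the segmenter used in the report's program: the input is assumed to be already split into syllables by spaces, the dictionary is a plain `HashSet<string>`, and the names `MaxMatch`, `Segment`, and `maxSyllables` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class MaxMatch
{
    // Greedy left-to-right maximum matching over syllables.
    // dictionary holds known multi-syllable words written with single spaces.
    // maxSyllables (assumed default 4) bounds the longest candidate tried.
    public static List<string> Segment(string sentence, HashSet<string> dictionary, int maxSyllables = 4)
    {
        string[] syllables = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var words = new List<string>();
        int i = 0;
        while (i < syllables.Length)
        {
            int taken = 1; // fall back to a single syllable when nothing longer matches
            for (int len = Math.Min(maxSyllables, syllables.Length - i); len > 1; len--)
            {
                string candidate = string.Join(" ", syllables, i, len);
                if (dictionary.Contains(candidate))
                {
                    taken = len; // longest dictionary word starting at position i
                    break;
                }
            }
            words.Add(string.Join(" ", syllables, i, taken));
            i += taken;
        }
        return words;
    }

    public static void Main()
    {
        var dictionary = new HashSet<string> { "học sinh", "sinh học" };
        List<string> words = Segment("học sinh học sinh học", dictionary);
        Console.WriteLine(string.Join(" | ", words));
    }
}
```

Because the scan is greedy, the ambiguous sequence in `Main` is always cut as "học sinh" first, even where "sinh học" would be the better reading. This is the kind of ambiguity that the more complex form of the method tries to address.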

Complex form: the rule here is to segment in chunks, usually choosing the three-word chunk of maximum length. The algorithm starts from the simple form. If it detects an ambiguous segmentation, for example C1 is a word and C1C2 is also a word, it looks at the following characters in C1, C2, ..., Cn to find all three-word chunks that begin with C1 or C1C2. Suppose we obtain the following chunks:
- C1 | C2 | C3C4
- C1C2 | C3C4 | C5
- C1C2 | C3C4 | C5C6
The longest chunk is the third one, so its first word (C1C2) is chosen. The steps are repeated until the whole string has been segmented.

Remarks: this method is simple and fast and needs only a dictionary. Its weakness is also the dictionary: segmentation accuracy depends entirely on how complete and accurate the dictionary is.

3.3.2. Transformation-based Learning (TBL):
This method relies on an annotated corpus. So that the computer can recognize word boundaries and segment correctly, it is trained on sample sentences in which the correct word boundaries have been marked. The method is conceptually simple: the machine learns from the sample sentences, derives the rules of the language, and then applies them to sentences that match the rules it has derived. Clearly, to segment correctly in every case, the method needs a truly comprehensive Vietnamese corpus and long training in order to derive a complete set of rules.

3.3.3. Word segmentation with WFST and a neural network:
The Weighted Finite State Transducer (WFST) model has been applied to word segmentation since 1996. The basic idea is to apply a WFST whose weights are the occurrence probabilities of each word in the corpus. The WFST is run over the sentence, and the most probable words are chosen as the segmentation. This method was also used in the published work of Đinh Điền [2001], who combined a WFST with a neural network to resolve ambiguity. That system consists of a WFST layer, which segments words and handles features specific to Vietnamese such as reduplicated words and proper names, and a neural network layer, which resolves any semantic ambiguity remaining after segmentation. The two layers are described below.

3.3.3.1 The WFST layer has 3 steps:
o Step 1: Build the weighted dictionary. In the WFST model, segmentation is treated as a probabilistic state transition. The dictionary D is described as a weighted finite-state transducer graph. Let:
- H be the set of Vietnamese orthographic syllables (tiếng).
- P be the part of speech of a word.
Each arc of D can be:
- From an element of H to an element of H.
The labels in D represent a cost estimated by the formula

$$\text{Cost} = -\log\left(\frac{f}{N}\right)$$

where f is the frequency of the word and N is the size of the sample.
o Step 2: Build the candidate segmentations. To limit the combinatorial explosion when generating possible word sequences from the syllables of a sentence, the author additionally uses a dictionary: if a candidate segmentation is found to be unsuitable (not in the dictionary, not a reduplicated word, not a proper noun, ...), all branches that start from it are discarded.


o Step 3: Choose the best segmentation. Given the list of possible segmentations of the sentence, the author selects the one with the smallest weight (cost).

3.3.3.2 The neural network layer:
This model resolves ambiguity in segmentation by comparing candidates against the dictionary.

Remarks: according to the author's published results, the model achieves accuracy above 97%. The neural network, combined with the dictionary, resolves the ambiguity that arises when a sentence can be split into several different word sequences, by eliminating the unsuitable words. As with TBL, the key requirement of this model is a sufficiently complete training corpus.

3.3.4. Vietnamese word segmentation based on Internet statistics and a genetic algorithm
IGATEC (Internet and Genetics Algorithm based Text Categorization for Documents in Vietnamese) was proposed by H. Nguyễn in 2005 as a new approach to word segmentation for text classification that needs neither a dictionary nor a training corpus. The approach combines a genetic algorithm with statistics obtained from the Internet. The author describes a segmentation system with the following components.

a. Online Extractor: this component obtains the occurrence frequency of words by querying a well-known search engine such as Google or Yahoo. The author then uses the formulas below to compute mutual information, which serves as the basis of the fitness function for the GA engine.

Probability that words appear on the Internet:

$$p(w) = \frac{\text{count}(w)}{\text{MAX}}, \qquad p(w_1 \,\&\, w_2) = \frac{\text{count}(w_1 \,\&\, w_2)}{\text{MAX}}$$

where MAX = 4 × 10^9, count(w) is the number of documents on the Internet found to contain the word w, and count(w1 & w2) is the number found to contain both w1 and w2.

Dependency probability of one word on another:

$$p(w_1 \mid w_2) = \frac{p(w_1 \,\&\, w_2)}{p(w_2)}$$

Mutual information of a compound word made up of n syllables (cw = w1 w2 ... wn):

$$MI(cw) = \frac{p(w_1 \,\&\, w_2 \,\&\, \dots \,\&\, w_n)}{\sum_{j=1}^{n} p(w_j) - p(w_1 \,\&\, w_2 \,\&\, \dots \,\&\, w_n)}$$
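The probability and mutual information formulas above can be checked with a small calculation. The sketch below assumes the hit counts have already been obtained. It does not call any search engine, and the counts used in `Main` are made-up numbers. The names `MutualInformation`, `P`, and `Mi` are hypothetical.

```csharp
using System;
using System.Linq;

public static class MutualInformation
{
    // Size of the indexed web assumed in the text: MAX = 4 * 10^9.
    public const double Max = 4e9;

    // p(w) = count(w) / MAX, where count is a document hit count.
    public static double P(long count)
    {
        return count / Max;
    }

    // MI(cw) = p(w1 & ... & wn) / (sum_j p(wj) - p(w1 & ... & wn))
    public static double Mi(long[] syllableCounts, long jointCount)
    {
        double joint = P(jointCount);
        double sum = syllableCounts.Sum(c => P(c));
        double denominator = sum - joint;
        // Guard against a zero or negative denominator from inconsistent counts.
        return denominator > 0 ? joint / denominator : 0.0;
    }

    public static void Main()
    {
        // Hypothetical hit counts for two syllables and for the pair together.
        long[] counts = { 1000, 2000 };
        long together = 500;
        Console.WriteLine(Mi(counts, together));
    }
}
```

Since every probability is divided by the same MAX, the constant cancels in MI and the value depends only on the ratio of the counts. A candidate word whose syllables rarely occur apart receives a higher MI.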

b. GA Engine for Text Segmentation: each individual in the population is represented by a string of 0 and 1 bits, where each bit stands for one syllable of the text and each run of identical bits stands for one segment. The individuals are initialized randomly, with each segment limited to within 5 syllables. The GA engine then applies mutation and crossover to increase the fitness of the individuals and reach the best possible segmentation.

3.4. Conclusion
Having reviewed several approaches to Vietnamese word segmentation, we note that published studies all show that word-based methods achieve fairly high accuracy. This is due to large training sets with accurately marked word boundaries, which make it possible to learn good segmentation rules for other texts. However, the performance of these methods clearly depends entirely on the training corpus. To overcome the dependence on a dictionary, we propose to use the approach of H. Nguyễn (described in detail in a later section) for word segmentation.

Character-based approaches are easy to implement and relatively fast, but they are less accurate than word-based approaches. They generally suit applications that do not need perfect segmentation accuracy, such as spam filtering and firewalls. Overall, if the accuracy of this approach can be improved, it is entirely feasible and could replace word-based segmentation, because it does not require building a corpus, a task that demands considerable effort, time, and the support of experts in many fields.


CHAPTER 4 - TEXT SUMMARIZATION USING AN UNSUPERVISED METHOD


4.1. Approaches to the text summarization problem:
As noted above, text summarization is a natural language processing problem. Natural language analysis distinguishes several depths of processing, in this order: the morphological level, the syntactic level, the semantic level, and finally the pragmatic level. Approaches to summarization can likewise be classified by the depth of processing performed during summarization, although only three levels apply: morphological, syntactic, and semantic.

Morphological level: the units compared at this level are phrases, sentences, or paragraphs. Methods at this level usually use similarity measures based on the vector space model, applying TF.IDF weights to words and sentences. The MMR summarization method [CG98] is the best-known method at this level.

Syntactic level: the comparison at this level is based on analyzing corresponding grammatical structures in the text. Methods at this level focus on the grammatical structure of sentences or phrases within each paragraph of the document. The method proposed by Barzilay and co-authors in 1999 [BME99] belongs to this level.

Semantic level: processing at this level concentrates on analyzing named entities, the relationships between entities, and the events involving them, in order to determine the importance of information. The method proposed by McKeown and Radev in 1995 [MR95] is an example of summarization at this level.

Based on the characteristics of each approach, Inderjeet Mani gives the following comparison of the three levels of summarization.

| Processing level | Characteristics | Advantages | Disadvantages |
| Morphological | Makes heavy use of similarity measures between lexical items | Very widely used; handles redundancy well | Cannot describe other features; weak at synthesizing information |
| Syntactic | Compares the syntax trees of sentences or phrases in the text | Can detect equivalent concepts in phrases; allows information to be synthesized | Cannot describe other features; requires extending the rules for comparing syntax trees |
| Semantic | Compares filled-in document templates | Can describe many different features | Templates must be created in advance for each domain |
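The morphological-level similarity measure mentioned in this section, a vector space model with TF.IDF weights, can be sketched as below. This is an illustrative sketch only. It assumes the text units are already tokenized and uses tf × ln(N / df) as the weight, which is one common TF.IDF variant and not necessarily the one used by MMR or by the report's program. The names `TfIdfCosine`, `Vectors`, and `Cosine` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdfCosine
{
    // Builds one TF.IDF vector per tokenized unit (sentence or paragraph).
    // Assumed weighting: tf * ln(N / df).
    public static List<Dictionary<string, double>> Vectors(List<string[]> units)
    {
        // Document frequency: number of units that contain each term.
        var df = new Dictionary<string, int>();
        foreach (string[] unit in units)
        {
            foreach (string term in unit.Distinct())
            {
                df[term] = df.ContainsKey(term) ? df[term] + 1 : 1;
            }
        }

        var result = new List<Dictionary<string, double>>();
        foreach (string[] unit in units)
        {
            var vector = new Dictionary<string, double>();
            foreach (var g in unit.GroupBy(t => t))
            {
                vector[g.Key] = g.Count() * Math.Log((double)units.Count / df[g.Key]);
            }
            result.Add(vector);
        }
        return result;
    }

    // Cosine of the angle between two sparse vectors; 0 when either vector is all zeros.
    public static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = 0;
        foreach (var pair in a)
        {
            double other;
            if (b.TryGetValue(pair.Key, out other)) dot += pair.Value * other;
        }
        double normA = Math.Sqrt(a.Values.Sum(v => v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => v * v));
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }
}
```

A call such as `TfIdfCosine.Cosine(vectors[0], vectors[1])` returns a value between 0 and 1 for non-negative weights. Units that share only very common terms score low because those terms receive small IDF weights.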

4.2. Evaluating summarization results:
Evaluating summaries is difficult at present. Judgments by language experts are considered the best form of evaluation, but they are very costly. Alongside manual evaluation by experts, automatic evaluation of summaries is therefore receiving a great deal of attention. Since 2000, NIST has held the DUC conference once a year to carry out large-scale evaluations of summarization systems. The aim of automatic evaluation is to find a measure that agrees as closely as possible with human judgments.

Recall at different compression rates is a reasonable measure, although it does not show differences in system performance. The coverage measure is therefore computed as

$$C = R \times E$$

Here R is the sentence recall given by

$$R = \frac{\text{number of units covered}}{\text{total number of units in the model summary}}$$

and E is the completeness ratio, ranging from 0 to 1 (1 for all, 3/4 for most, 1/2 for some, 1/4 for hardly any, 0 for none). DUC 2002 used a length-adjusted version of the coverage measure, C':

$$C' = \alpha \cdot C + (1 - \alpha) \cdot B$$

where B is the brevity and α is a parameter reflecting relative importance. The categories for E were also changed to 100%, 80%, 60%, 40%, 20%, and 0% respectively.

The ROUGE method. BiLingual Evaluation Understudy (BLEU) [KST02] is a method proposed by the machine translation community for the automatic evaluation of machine translation systems. It is fast, language-independent, and correlates with human judgments. Recall Oriented Understudy of Gisting Evaluation (ROUGE) [LH03] is a method introduced by Lin and Hovy in 2003 that builds on similar ideas. It uses n-grams to measure the agreement between the output of a summarization model and a set of evaluation data. The method has given promising results and is highly regarded by the summarization research community.
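The n-gram comparison behind ROUGE can be illustrated with a minimal recall computation. This sketch assumes a single reference summary and pre-tokenized input, and it omits features of the official ROUGE toolkit such as multiple references and stemming. The names `RougeN` and `Recall` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RougeN
{
    // Counts every n-gram (n consecutive tokens joined by a space) in the token sequence.
    private static Dictionary<string, int> NgramCounts(string[] tokens, int n)
    {
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + n <= tokens.Length; i++)
        {
            string gram = string.Join(" ", tokens, i, n);
            counts[gram] = counts.ContainsKey(gram) ? counts[gram] + 1 : 1;
        }
        return counts;
    }

    // Recall = matching n-grams (clipped to the candidate's count) / n-grams in the reference.
    public static double Recall(string[] candidate, string[] reference, int n)
    {
        Dictionary<string, int> candidateCounts = NgramCounts(candidate, n);
        Dictionary<string, int> referenceCounts = NgramCounts(reference, n);
        int total = referenceCounts.Values.Sum();
        if (total == 0) return 0.0;

        int matched = 0;
        foreach (var pair in referenceCounts)
        {
            int inCandidate;
            if (candidateCounts.TryGetValue(pair.Key, out inCandidate))
            {
                matched += Math.Min(inCandidate, pair.Value);
            }
        }
        return (double)matched / total;
    }
}
```

For the candidate "the cat sat on the mat" and the reference "the cat is on the mat", five of the six reference unigrams and three of the five reference bigrams are matched, so the measure drops as n grows and word order starts to matter.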

4.3. Text summarization using an unsupervised method:
Unsupervised learning is a machine learning approach that seeks a model fitting the observations. It is useful for data compression: fundamentally, every data compression algorithm relies, explicitly or implicitly, on a probability distribution over a set of inputs. Another form of unsupervised learning is data clustering.

4.3.1 Training phase:

Input: D = {d1, ..., dn}: collection of documents.
Output: the calculated F(wi).
Processing: segment the documents in D into 2 word sets: the noun set and the set of other words. Calculate F(wi) for the noun set by:

$$F(w_i) = \frac{N_D(w_i)}{N_D}$$

4.3.2 Testing phase:

Input: d: original document; r: rate of summary.
Output: d': summary of the document.
Preprocessing:
- d is segmented into a set of sentences S = {s1, s2, ..., sn}.
- In each sentence:
+ Segment into 2 word sets: the noun set and the set of other words (not nouns).
+ Calculate I(wi) for the noun set by:

$$I(w_i) = \frac{N_S(w_i) + F(w_i)}{\sum_{w_j \in d} w_j}$$

+ P(si) = 1/i;

4.3.3 Algorithm:
V = ∅;
For each sentence, calculate the weight of the sentence: W(si) = Σ I(wi) + P(si);
Sort the sentences si by descending weight.
While (length(V) < length(d) * r%) V = V + si;
Arrange all selected sentences in the order of the original document.
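The training phase, testing phase, and selection loop above can be put together in one short sketch. It is an illustration under several assumptions that the text does not state: noun detection is done elsewhere and passed in, N_D(w) is read as the number of training documents whose noun set contains w, N_D as the number of training documents, N_S(w) as the number of occurrences of w in the sentence, the denominator of I(w) as the total word count of d, and length as a word count. The names `NounSummarizer`, `Train`, and `Summarize` are hypothetical and do not come from the report's program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NounSummarizer
{
    // Training: F(w) = N_D(w) / N_D over the noun sets of the training documents.
    public static Dictionary<string, double> Train(List<HashSet<string>> nounSetsPerDocument)
    {
        var f = new Dictionary<string, double>();
        foreach (HashSet<string> nouns in nounSetsPerDocument)
        {
            foreach (string w in nouns)
            {
                f[w] = f.ContainsKey(w) ? f[w] + 1 : 1;
            }
        }
        foreach (string w in f.Keys.ToList())
        {
            f[w] /= nounSetsPerDocument.Count;
        }
        return f;
    }

    // sentenceNouns[i] lists the nouns of sentence i+1 (repeats allowed);
    // wordCounts[i] is that sentence's length in words.
    // Returns the zero-based indices of the chosen sentences in original order.
    public static List<int> Summarize(List<string[]> sentenceNouns, int[] wordCounts,
                                      Dictionary<string, double> f, int ratePercent)
    {
        int totalWords = wordCounts.Sum();
        var weights = new double[sentenceNouns.Count];
        for (int i = 0; i < sentenceNouns.Count; i++)
        {
            double sum = 0;
            foreach (var g in sentenceNouns[i].GroupBy(w => w))
            {
                double fw;
                f.TryGetValue(g.Key, out fw);            // nouns unseen in training get F = 0
                sum += (g.Count() + fw) / totalWords;    // I(w) = (N_S(w) + F(w)) / |d|
            }
            weights[i] = sum + 1.0 / (i + 1);            // W(s_i) = sum of I(w) + P(s_i), P(s_i) = 1/i
        }

        double budget = totalWords * ratePercent / 100.0;
        var selected = new List<int>();
        int length = 0;
        // Take sentences by descending weight until the summary reaches r% of the document.
        foreach (int i in Enumerable.Range(0, weights.Length).OrderByDescending(k => weights[k]))
        {
            if (length >= budget) break;
            selected.Add(i);
            length += wordCounts[i];
        }
        selected.Sort(); // restore the order of the original document
        return selected;
    }
}
```

Under these assumptions the position term P(si) = 1/i is large compared with the noun term for short documents, so early sentences tend to be selected first. Whether that matches the behaviour intended by the report depends on how the original formulas define their counts.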


CHAPTER 5 - PROGRAM DEMO


5.1 Program interface:
Main form:

Figure 5.1: Main interface of the program. It gives an overview of the program, including the management, training, and experiment functions.


Text training form

Figure 5.2: Text training. The input is a set of .txt passages from the fields of education, economics, sports, and computing.


Text summarization form

Figure 5.3: Text summarization. The input is a full .txt document from one of these fields, or possibly a short story. The output is a summarized document whose length depends on the percentage of the original text the user chooses to keep.

Management forms

Figure 5.4: Training set management


Figure 5.5: Topic word management

Figure 5.6: Original word management



5.2. Database:
The database is designed in Microsoft Access.

DataIndex table:
Table 5.1: DataIndex

Dictionary table:
Table 5.2: Dictionary

Field table:
Table 5.3: Field

Test table:


Table 5.4: Test

Test_Instructor table:
Table 5.5: Test_Instructor

TopicWord table:
Table 5.6: TopicWord

5.3. Program experiments:

Data description:
Input: full .txt passages from the fields of information technology, education, economics, and sports.
Output: the input text shortened according to the chosen rate.
Experimental results:

Figure 5.8: Summarization result


Figure 5.9: Training result

5.4. Program code

5.4.1. Sentence splitting code
#region "Split a document into sentences"
public int tachcau(RichTextBox rtb, ListView lv)
{
    lv.Items.Clear();
    ListViewItem item;
    const char s1 = '.';
    char[] delimiters = new char[] { s1 };
    int ctr = 0;
    // Handle sentence-ending marks and line breaks (!?;:...)
    string text = Sentence.thaythe(rtb.Text);
    foreach (String subString in text.Split(delimiters))
    {
        ctr++;
        if (subString.Length > 1) // skip the empty fragment after the last period
        {
            item = new ListViewItem(ctr.ToString()); // column 0: sentence number
            item.SubItems.Add(subString);            // column 1: sentence text
            item.SubItems.Add("_");                  // column 2: placeholder, filled in later
            item.SubItems.Add("_");                  // column 3: placeholder, filled in later
            lv.Items.Add(item);
        }
    }
    return lv.Items.Count;
}
#endregion

5.4.2. Word segmentation code


private List<string> Tachtu(ListView lvLookResult)
{
    List<string> list = new List<string>();
    ListViewItem item;
    lvLookResult.Items.Clear();

    for (int j = 0; j < LvSentence.Items.Count; j++)
    {
        string sentence = Sentence.thaythe(LvSentence.Items[j].SubItems[1].Text);
        WordSegmentation Wsg = new WordSegmentation();

        // Break the sentence into its units and copy them into a string array.
        ArrayList Arr1 = Wsg.Voice(Sentence.thaythe(sentence));
        string[] str = new string[Arr1.Count];
        for (int i = 0; i < Arr1.Count; i++)
        {
            str[i] = Arr1[i].ToString();
        }

        // Find candidate words, then extract the ones to keep.
        ArrayList list1 = Wsg.GetWord(str, Wsg.voice);
        ArrayList list2 = Wsg.Trichrut(list1);

        // Mark the word boundaries in the sentence and normalize the result in three passes.
        string kq = Wsg.tachroi(list2, sentence);
        string str1 = Wsg.cauchuan1(kq);
        string str2 = Wsg.cauchuan2(str1);
        list.Add(Wsg.cauchuan3(str2));

        // Show the segmented sentence and its word count (words are separated by '/').
        item = new ListViewItem((j + 1).ToString());
        item.SubItems.Add(list[j]);
        ArrayList arr = SplitText.Spliter(list[j], '/');
        item.SubItems.Add(Convert.ToString(arr.Count));
        LvSentence.Items[j].SubItems[3].Text = Convert.ToString(arr.Count);
        lvLookResult.Items.Add(item);
    }
    return list;
}
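The `WordSegmentation` class used above is not listed in the report, so its algorithm is unknown. Purely as an illustration of how a '/'-delimited segmentation can be produced, here is a minimal sketch of greedy longest matching against a lexicon of multi-syllable words. This is a swapped-in technique, not the authors' method; `MaxMatchSegmenter`, `Segment`, the `lexicon` parameter and the `maxSyllables` limit are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class MaxMatchSegmenter
{
    // Greedy longest-match segmentation over whitespace-separated syllables.
    // `lexicon` holds known multi-syllable words, lower-cased, with syllables
    // joined by a single space. Words in the result are separated by '/'.
    public static string Segment(string sentence, ISet<string> lexicon, int maxSyllables = 4)
    {
        string[] syl = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var words = new List<string>();
        int i = 0;
        while (i < syl.Length)
        {
            int take = 1; // fall back to a single syllable
            for (int n = Math.Min(maxSyllables, syl.Length - i); n > 1; n--)
            {
                string candidate = string.Join(" ", syl, i, n);
                if (lexicon.Contains(candidate.ToLowerInvariant()))
                {
                    take = n; // longest lexicon entry starting at syllable i
                    break;
                }
            }
            words.Add(string.Join(" ", syl, i, take));
            i += take;
        }
        return string.Join("/", words);
    }
}
```

The word count of a sentence is then the number of '/'-separated parts, which is the quantity the listing above stores in the fourth column of `LvSentence`.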


5.4.3. Summarization code
private void button1_Click(object sender, EventArgs e)
{
    try
    {
        Rate = Convert.ToInt32(cbRate.Text);
    }
    catch (Exception)
    {
        MessageBox.Show("You have not selected a summarization ratio!");
        return;
    }

    // Split the source text into sentences, segment them into words, and score each sentence.
    tachcau(vanbangoc, LvSentence);
    listView1.Items.Clear();
    List<double> list1 = Ws(Tachtu(lvTachtu));
    for (int i = 0; i < LvSentence.Items.Count; i++)
    {
        LvSentence.Items[i].SubItems[2].Text = Convert.ToString(list1[i]);
    }

    // Extract the important sentences from the text.
    rtbtomtat.ResetText();
    groupBox12.Text = "Reduction Text:";

    // Word budget of the summary: total word count scaled by the chosen ratio (in percent).
    int len = 0;
    for (int i = 0; i < LvSentence.Items.Count; i++)
    {
        len += Convert.ToInt32(LvSentence.Items[i].SubItems[3].Text);
    }
    len = (int)(len * Rate / 100);

    // Append the selected sentences to the summary in their original order.
    List<string> list = Sentence.NummberSentence(LvSentence, len);
    for (int j = 0; j < LvSentence.Items.Count; j++)
    {
        foreach (var item in list)
        {
            if (Convert.ToInt32(LvSentence.Items[j].SubItems[0].Text) == Convert.ToInt32(item))
            {
                rtbtomtat.Text += LvSentence.Items[j].SubItems[1].Text + ". ";
            }
        }
    }
}
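The handler delegates the actual choice of sentences to `Sentence.NummberSentence`, which is not listed in the report. One plausible reading, offered only as an assumption, is a greedy selection: take sentences in descending score order while they still fit the word budget, then restore the original order. The sketch below shows that idea without any UI; `ScoredSentence`, `BudgetSelector` and `Select` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ScoredSentence
{
    public int Index;      // 1-based position in the source text
    public string Text;
    public double Score;
    public int WordCount;
}

public static class BudgetSelector
{
    // Assumed behaviour: greedy selection by score under a word budget.
    public static List<ScoredSentence> Select(IList<ScoredSentence> sentences, int ratePercent)
    {
        // Same budget formula as the handler: total words * rate / 100 (integer arithmetic).
        int budget = sentences.Sum(s => s.WordCount) * ratePercent / 100;

        var chosen = new List<ScoredSentence>();
        int used = 0;
        foreach (var s in sentences.OrderByDescending(x => x.Score))
        {
            if (used + s.WordCount <= budget) // keep the sentence only if it still fits
            {
                chosen.Add(s);
                used += s.WordCount;
            }
        }

        // Emit the summary in source order so it reads naturally.
        return chosen.OrderBy(s => s.Index).ToList();
    }
}
```

Returning the chosen sentences as objects also avoids the nested loop that matches sentence numbers as text, although that is a design preference rather than something the report prescribes.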


private List<double> Ws(List<string> sentence)
{
    progressBar1.Value = 0;
    List<double> list = new List<double>();
    string fullpath = Application.StartupPath + "\\Documents";

    foreach (var cauchuan in sentence)
    {
        string cau = cauchuan;
        FindWord(cau);
        if (cau != "")
        {
            ArrayList arr = SplitText.Spliter(cau, '/');
            lvTopicWord.Items.Clear();

            // Add every word to lvTopicWord with the default I value "0" (ordinary word)
            // and its Lv value from the n-gram model.
            for (int i = 0; i < arr.Count; i++)
            {
                string current = arr[i].ToString();
                // The first word is paired with the sentence-start marker ". ";
                // every later word is paired with the word before it.
                string Pb = (i == 0) ? ". " : arr[i - 1].ToString();
                string Pab = (i == 0) ? ". " + current : Pb + " " + current;
                double Lv = N_Gram.Lv(Pab, Pb, fullpath);

                ListViewItem item = new ListViewItem(current);
                item.SubItems.Add("0");
                item.SubItems.Add(Lv.ToString());
                lvTopicWord.Items.Add(item);
            }

            // Look up topic words: a word that also appears in lvketqua gets its I value
            // from Algorithms.I.
            for (int i = 0; i < lvTopicWord.Items.Count; i++)
            {
                for (int j = 0; j < lvketqua.Items.Count; j++)
                {
                    if (lvTopicWord.Items[i].SubItems[0].Text.ToLower() == lvketqua.Items[j].SubItems[0].Text.ToLower())
                    {
                        lvTopicWord.Items[i].SubItems[1].Text = Algorithms.I(
                            Convert.ToDouble(lvketqua.Items[j].SubItems[2].Text),
                            Convert.ToDouble(lvketqua.Items[j].SubItems[4].Text),
                            Convert.ToDouble(lvketqua.Items[j].SubItems[5].Text));
                        break;
                    }
                }
            }

            // Sentence score = sum over its words of (I + Lv).
            double word = 0;
            for (int j = 0; j < lvTopicWord.Items.Count; j++)
            {
                word += Convert.ToDouble(lvTopicWord.Items[j].SubItems[1].Text)
                      + Convert.ToDouble(lvTopicWord.Items[j].SubItems[2].Text);
                ListViewItem item1 = new ListViewItem(lvTopicWord.Items[j].SubItems[0].Text);
                item1.SubItems.Add(lvTopicWord.Items[j].SubItems[1].Text);
                item1.SubItems.Add(lvTopicWord.Items[j].SubItems[2].Text);
                listView1.Items.Add(item1); // show the per-word values
            }
            list.Add(word);
        }
        else
        {
            MessageBox.Show("Please run word segmentation on the original sentences first.", "Message");
        }
        progressBar1.Value += (100 / sentence.Count);
    }
    return list;
}
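The scoring routine calls `N_Gram.Lv(Pab, Pb, fullpath)` with a two-word string, the preceding word, and a corpus folder, but the body of `N_Gram.Lv` is not included in the report. A common way to score such a pair is the relative-frequency estimate count(previous current) / count(previous), and the sketch below illustrates that estimate over an in-memory corpus string. Treat it as an assumption about the general idea, not as the report's formula; `BigramScore`, `Count` and `Conditional` are hypothetical names.

```csharp
using System;

public static class BigramScore
{
    // Counts non-overlapping occurrences of `pattern` in `corpus` (ordinal comparison).
    public static int Count(string corpus, string pattern)
    {
        if (string.IsNullOrEmpty(pattern)) return 0;
        int count = 0, pos = 0;
        while ((pos = corpus.IndexOf(pattern, pos, StringComparison.Ordinal)) >= 0)
        {
            count++;
            pos += pattern.Length;
        }
        return count;
    }

    // Relative-frequency estimate of P(current | previous):
    // count("previous current") / count("previous").
    public static double Conditional(string corpus, string previous, string current)
    {
        int prevCount = Count(corpus, previous);
        if (prevCount == 0) return 0.0; // unseen context: fall back to 0
        return (double)Count(corpus, previous + " " + current) / prevCount;
    }
}
```

Because this sketch counts raw substrings, a short word is also counted inside longer ones; counting over tokenized text and adding smoothing would be the usual refinements.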


CHAPTER 6 - CONCLUSION AND FUTURE WORK


6.1. Conclusion:
Given the current practical demand for text summarization applications, this project studies the text summarization problem in general and single-document summarization in particular. The specific results achieved are:

- Surveyed and studied Vietnamese text summarization using an unsupervised method.

- Built an algorithm for a program that summarizes Vietnamese text by extracting sentences.

- Tested the proposed demo, which gave promising initial results.

6.2. Future work:

Building on these initial experimental results, the program still needs work to improve its performance and the quality of its output. The following gaps should be addressed:

- Build a broad, large-scale corpus to support training.

- Improve the text-processing functions so that they are more accurate, return results faster and perform better overall.

- Extend the program with additional features.


