
VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY

Nguyễn Cẩm Tú

RECOGNIZING ENTITY TYPES IN VIETNAMESE TEXT TO SUPPORT THE SEMANTIC WEB AND ENTITY-ORIENTED SEARCH

UNDERGRADUATE GRADUATION THESIS, FULL-TIME PROGRAM
Major: Information Technology

HANOI - 2005


Supervisor: Dr. Hà Quang Thụy
Co-supervisor: Phan Xuân Hiếu, M.Sc.


Acknowledgements

First of all, I would like to express my deepest gratitude to my supervisors, Dr. Hà Quang Thụy and Phan Xuân Hiếu, M.Sc., who guided me wholeheartedly throughout my research and the writing of this thesis.

I am also deeply grateful to the teachers who taught me over the past four years. The knowledge I gained at university will support me in the years ahead.

I would like to thank the members of the "Data Mining" seminar group, among them Nguyễn Trí Thành, M.Sc., Tào Thị Thu Phượng, M.Sc., Vũ Bội Hằng, B.Sc., and Nguyễn Thị Hương Giang, B.Sc., for their valuable technical advice during my research.

Finally, I thank all my friends, and especially my parents and my younger brother, who were always ready to encourage me and help me through difficult times.

Student
Nguyễn Cẩm Tú

Abstract

Entity type recognition is a basic step in information extraction from text and in natural language processing. It is widely used in machine translation, text summarization, natural language understanding and the recognition of entity names in biology and medicine. It is particularly useful for automatically integrating objects and entities found on the Web into semantic ontologies and knowledge bases.

This thesis presents a solution for recognizing entity types in Vietnamese text on the Web. After reviewing several approaches, I chose a machine learning approach and built an entity type recognition system based on Conditional Random Fields (CRF; Lafferty, 2001). The strength of CRFs is that they handle sequential data and can integrate hundreds of thousands, or even millions, of highly diverse features of the data to support classification. Experiments on Vietnamese text show that the classification process achieves very promising results.


Table of contents

Acknowledgements
Abstract
Table of contents
Abbreviations
Introduction
Chapter 1. The entity type recognition problem
    1.1. Information extraction
    1.2. The entity type recognition problem
    1.3. Modeling the entity type recognition problem
    1.4. Significance of the entity type recognition problem
Chapter 2. Approaches to the entity type recognition problem
    2.1. The manual approach
    2.2. Hidden Markov models (HMM)
        2.2.1. Overview of HMMs
        2.2.2. Limitations of hidden Markov models
    2.3. Maximum entropy Markov models (MEMM)
        2.3.1. Overview of maximum entropy Markov models (MEMM)
        2.3.2. The "label bias" problem
    2.4. Chapter summary
Chapter 3. Conditional Random Field (CRF)
    3.1. Definition of a CRF
    3.2. The maximum entropy principle
        3.2.1. The conditional entropy measure
        3.2.2. Constraints on the model distribution
        3.2.3. The maximum entropy principle
    3.3. Potential functions of CRF models
    3.4. A labeling algorithm for sequence data
    3.5. How CRFs solve the "label bias" problem
    3.6. Chapter summary
Chapter 4. Parameter estimation for CRF models
    4.1. Iterative methods
        4.1.1. The GIS algorithm
        4.1.2. The IIS algorithm
    4.2. Numerical optimisation methods
        4.2.1. First-order numerical optimisation techniques
        4.2.2. Second-order numerical optimisation techniques
    4.3. Chapter summary
Chapter 5. An entity type recognition system for Vietnamese
    5.1. Experimental environment
        5.1.1. Hardware
        5.1.2. Software
        5.1.3. Experimental data
    5.2. The entity type recognition system for Vietnamese
    5.3. Training parameters and experimental evaluation
        5.3.1. Training parameters
        5.3.2. Evaluating entity type recognition systems
        5.3.3. 10-fold cross validation
    5.4. Feature selection
        5.4.1. Lexical context templates
        5.4.2. Context templates expressing word characteristics
        5.4.3. Regular expression context templates
        5.4.4. Dictionary context templates
    5.5. Experimental results
        5.5.1. Results of the 10 experiments
        5.5.2. The experiment with the best result
        5.5.3. Average over the 10 experiments
        5.5.4. Remarks
Conclusion
Appendix: Output of the Vietnamese entity type recognition system
References

Abbreviations

Term                            Abbreviation
Conditional Random Field        CRF
Hidden Markov model             HMM
Maximum entropy Markov model    MEMM

Introduction

Tim Berners-Lee, the inventor of the World Wide Web, describes the Semantic Web as the future of the World Wide Web, one that combines content understandable by humans with content processable by machines. The success of the Semantic Web depends largely on ontologies and on Web pages annotated according to those ontologies. While the benefits of the Semantic Web are substantial, building ontologies by hand is extremely difficult. The solution is to use information extraction techniques in general, and entity type recognition in particular, to automate part of the ontology construction process. When ontologies and entity type recognition systems are integrated into a search engine, they increase search precision and enable entity-oriented search, overcoming some weaknesses of today's keyword-based search engines.

Aware of the benefits of information extraction in general and entity type recognition in particular, I chose entity type recognition for Vietnamese as the topic of my thesis.

The thesis is organized into five chapters:

Chapter 1 introduces the information extraction problem, the entity type recognition problem and its applications.

Chapter 2 presents several approaches to entity type recognition: the manual approach and the machine learning methods HMM and MEMM. Manual approaches are costly in time and effort and are not portable. Machine learning methods such as HMM and MEMM overcome these weaknesses but have problems specific to each model. An HMM cannot integrate overlapping features, even though such features are very useful for labeling sequence data. In some special cases an MEMM suffers from the "label bias" problem, the tendency to ignore the observations when a state has few outgoing transitions.

Chapter 3 introduces the definition of a CRF; the maximum entropy principle, a method for estimating probability distributions from data and the basis for choosing the "potential functions" of CRF models; and the Viterbi algorithm for labeling sequence data. Because a CRF is both a "conditional distribution" and a "global distribution", it overcomes the weaknesses of other machine learning models such as HMM and MEMM in labeling and segmenting sequence data.

Chapter 4 presents methods for estimating the parameters of a CRF model, such as the IIS and GIS algorithms and gradient-based methods such as conjugate gradient, quasi-Newton and L-BFGS. Among these, L-BFGS is regarded as the best and converges fastest.

Chapter 5 presents the entity type recognition system for Vietnamese based on the CRF model, proposes feature selection methods for recognizing entity types in Vietnamese text, and reports some experimental results.

Chapter 1. The entity type recognition problem

The main topic of this thesis is applying the CRF model to entity type recognition for Vietnamese. This chapter gives an overview of information extraction [30][31][32], describes the entity type recognition problem in detail [13][15][30][31], and presents its applications.

1.1. Information extraction

Unlike full text understanding, information extraction systems only try to recognize certain kinds of information of interest. Information can be extracted from text at several levels, such as identifying entities (Element Extraction), identifying relations between entities (Relation Extraction), identifying and tracking events and scenarios (Event and Scenario Extraction and Tracking), and resolving co-reference (Co-reference Resolution). The techniques used in information extraction include segmentation, classification, association and clustering.
October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a superimportant shift for us in terms of code access. Richard Stallman, founder of the Free Software Foundation, countered saying

IE

NAME                TITLE      ORGANIZATION
Bill Gates          CEO        Microsoft
Bill Veghte         VP         Microsoft
Richard Stallman    founder    Free Soft..

Figure 1: An information extraction system

The output of an information extraction system is usually a set of templates, each containing a fixed number of slots that have been filled with information.

At the level of semantic information extraction, a template is an instance of an event in which the participating entities play specific roles. For example, at MUC-7 [31] (Seventh Message Understanding Conference), one required scenario template covered missile and rocket launch events in 100 New York Times articles. The participating systems had to fill in this template so that questions about the time, place and other details of the launch events mentioned in the articles could be answered.

1.2. The entity type recognition problem

People, times, places, numbers and so on are the basic objects of a text in any language. The main goal of entity type recognition is to identify these objects and thereby help us understand the text.

Entity type recognition is the simplest of the information extraction problems, yet it is the most basic step before more complex problems in the field can be tackled. Clearly, before the relations between entities can be identified, we must identify the entities that take part in those relations.

Although it is the most basic problem in information extraction, a large number of ambiguous cases still make entity type recognition difficult. Some concrete examples:

- "Bình Định và HAGL đều thua ở AFC Champions League" (Bình Định and HAGL both lost in the AFC Champions League).
  o Here "Bình Định" must be marked as an organization (a football team) rather than a location.
  o The word "Bình" starts the sentence, so its capitalization carries little information.
- When is "Hồ Chí Minh" used as a person name, and when as a place name?

Entity type recognition in Vietnamese text is harder than in English for the following reasons:

- There is a lack of training data and of lookup resources comparable to WordNet for English.
- There is a lack of grammatical information (POS) and of phrase information, such as noun phrases and verb phrases, for Vietnamese, although this information plays a very important role in entity type recognition.

Consider the following example:

"Cao Xumin, Chủ tịch Phòng Thương mại Xuất nhập khẩu thực phẩm của Trung Quốc, cho rằng cách xem xét của DOC khi đem so sánh giá tôm của Trung Quốc và giá tôm của Ấn Độ là vi phạm luật thương mại"

(Cao Xumin, Chairman of the China Chamber of Commerce for Import and Export of Foodstuffs, said that the DOC's approach of comparing Chinese shrimp prices with Indian shrimp prices violates trade law.)

We want this passage to be marked up as follows:

"<PER>Cao Xumin</PER>, Chủ tịch <ORG>Phòng Thương mại Xuất nhập khẩu thực phẩm</ORG> của <LOC>Trung Quốc</LOC>, cho rằng cách xem xét của <ORG>DOC</ORG> khi đem so sánh giá tôm của <LOC>Trung Quốc</LOC> và giá tôm của <LOC>Ấn Độ</LOC> là vi phạm luật thương mại"

The example exposes some of the difficulties a Vietnamese entity type recognition system faces when labeling data (see the appendix):

- The phrase "Phòng Thương mại Xuất nhập khẩu thực phẩm" is the name of an organization, but not every word in it is capitalized.
- Knowing that "Phòng Thương mại Xuất nhập khẩu thực phẩm" is a noun phrase acting as the subject of the sentence would be very useful for predicting the entity type correctly. However, Vietnamese lacks systems that automatically predict grammatical functions and phrases, so entity type recognition is much harder than in English.

1.3. Modeling the entity type recognition problem

Entity type recognition in text means finding answers to the questions: who?, when?, where?, how much? and so on. It is a specific case of labeling sequence data in which every label except O consists of a prefix B_ or I_ (meaning the beginning of, or inside, an entity name) combined with a label name.
Table 1: Entity types

Label    Meaning
PER      Person name
ORG      Organization name
LOC      Location name
NUM      Number
PCT      Percentage
CUR      Currency
TIME     Date, time
MISC     Other entity types besides the 7 types above
O        Not an entity

Example: the label sequence for the phrase "Phan Văn Khải" is "B_PER I_PER I_PER".

With 8 entity types including MISC, we therefore have 17 labels (8*2+1). In essence, labeling data is a special case of text classification in which the classes are the labels to be assigned to the data.
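The label scheme above can be sketched in a few lines of Python. This is only an illustration: the function names `build_label_set` and `to_bio` are hypothetical, and splitting on whitespace is an assumption made so that each syllable of "Phan Văn Khải" gets its own label, as in the example.

```python
import re

ENTITY_TYPES = ["PER", "ORG", "LOC", "NUM", "PCT", "CUR", "TIME", "MISC"]


def build_label_set(entity_types):
    """Return the O label plus a B_ and an I_ label for every entity type."""
    labels = ["O"]
    for name in entity_types:
        labels.append("B_" + name)
        labels.append("I_" + name)
    return labels


# Matches either a tagged span such as <PER>...</PER> or a run of untagged text.
_SPAN = re.compile(r"<(?P<tag>[A-Z]+)>(?P<inside>.*?)</(?P=tag)>|(?P<outside>[^<]+)")


def to_bio(tagged_text):
    """Convert inline-tagged text into a list of (token, label) pairs."""
    pairs = []
    for match in _SPAN.finditer(tagged_text):
        if match.group("tag"):
            # First token of an entity gets B_, the remaining tokens get I_.
            for position, token in enumerate(match.group("inside").split()):
                prefix = "B_" if position == 0 else "I_"
                pairs.append((token, prefix + match.group("tag")))
        else:
            for token in match.group("outside").split():
                pairs.append((token, "O"))
    return pairs
```

Calling `to_bio("<PER>Phan Văn Khải</PER>")` yields the three labels of the example, and `build_label_set(ENTITY_TYPES)` has the 17 entries counted above.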

1.4. Significance of the entity type recognition problem

A good entity type recognition system can be applied in many different areas. Specifically, it can be used to:

- Support the Semantic Web. The Semantic Web consists of Web pages that can represent data "intelligently", where "intelligent" refers to the ability to combine, classify and reason over the data. The success of the Semantic Web depends on ontologies [] and on the growth of Web pages annotated with metadata that follows these ontologies. Although the benefits of ontologies are substantial, building them automatically is extremely difficult. For this reason, tools that automatically extract information from Web pages to populate ontologies, such as an entity type recognition system, are essential.
- Build entity-oriented search engines. A user can quickly find Web pages about "Clinton", a place in North Carolina, without having to browse hundreds of Web pages about President Bill Clinton.
- Serve as a preprocessing step that simplifies problems such as machine translation and text summarization.
- Act as a basic component of more complex information extraction problems, as mentioned above.
- Let a user skim the person names, place names and company names mentioned in a document before reading it.
- Index books automatically. In books, most index entries are entity types.

An entity type recognition system for Vietnamese lays the groundwork for information extraction from Vietnamese documents and supports Vietnamese language processing in general. Applying the system to build an ontology of entities in Vietnamese would lay the foundation for a new generation of the Web: the "Vietnamese Semantic Web".

Chapter 2. Approaches to the entity type recognition problem

There are many different approaches to entity type recognition. This chapter introduces several of them together with their strengths and weaknesses, and so explains why we chose a CRF-based method to build the entity type recognition system for Vietnamese.

2.1. The manual approach

A typical example of the manual approach is the Proteus entity type recognition system from New York University, which took part in MUC-6. The system is written in Lisp and relies on a large number of rules. Below are some examples of rules used by Proteus, together with their exceptions:

    Title Capitalized_Word => Title Person_Name
        Correct: Mr. Johns, Gen. Schwarzkopf
        Exception: Mrs. Fields Cookies (a company)

    Month_name number_less_than_32 => Date
        Correct: February 28, July 15
        Exception: Long March 3 (the name of a Chinese rocket)

In practice, each of these rules has a large number of exceptions. Even when the designers try to handle every exception they can think of, some cases only show up once the system is put to use. Moreover, building a rule-based extraction system is very labor-intensive. Such a system typically requires several months of work from a programmer with considerable linguistic experience, and even more when we want to move to another domain or another language.

The answer to these limitations is to build a system that can somehow "learn by itself". This reduces the involvement of linguistic experts and makes the system more portable. Many machine learning methods can be applied to entity type recognition, such as hidden Markov models (Hidden Markov Models - HMM), maximum entropy Markov models (Maximum Entropy Markov Models - MEMM) and the Conditional Random Field (CRF) model. CRF models are described in detail in the next chapter. Here we only consider HMMs and MEMMs together with their strengths and weaknesses.
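To make the weakness of hand-written rules concrete, the two rules quoted above can be approximated with regular expressions. This is a rough Python sketch, not the Proteus implementation (which is written in Lisp): the list of titles, the rule names and the `apply_rules` helper are assumptions made for illustration.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Rule 1: Title Capitalized_Word => Title Person_Name
# (the set of titles here is an illustrative assumption)
PERSON_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr|Gen)\.\s+[A-Z][a-z]+")

# Rule 2: Month_name number_less_than_32 => Date
DATE_RULE = re.compile(r"\b(?:%s)\s+(?:[12]?\d|3[01])\b" % MONTHS)


def apply_rules(text):
    """Return (label, matched text) pairs, person matches first, then dates."""
    found = [("PER", m.group(0)) for m in PERSON_RULE.finditer(text)]
    found += [("DATE", m.group(0)) for m in DATE_RULE.finditer(text)]
    return found
```

The rules fire correctly on "Gen. Schwarzkopf" and "February 28", but they also fire on the two exceptions: `apply_rules("Long March 3 was launched")` reports a date, and `apply_rules("Mrs. Fields Cookies opened a store")` reports a person. Each such exception would need another hand-written rule.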

2.2. Hidden Markov models (HMM)

The hidden Markov model [7][13][19] was introduced and studied in the late 1960s and early 1970s. Since then it has been widely applied in speech recognition, bioinformatics and natural language processing.

2.2.1. Overview of HMMs

An HMM is a probabilistic finite state machine whose parameters are the state transition probabilities and the probabilities of emitting an observation at each state. The states of an HMM are regarded as hidden beneath the observations generated by the model. An observation sequence is generated through a series of state transitions that starts in one of the start states and stops in an end state. At each state, one element of the observation sequence is emitted before the model moves to the next state.

In entity type recognition, each state corresponds to one of the labels B_PER, B_LOC, I_PER, ... and the observations are the words of the sentence. Although these classes do not literally generate words, a class assigned to a word can be regarded as having generated that word in some way. We can therefore find the state sequence (the sequence of entity type classes) that best describes the observation sequence (the sequence of words) by computing

\[ P(S \mid O) = \frac{P(S, O)}{P(O)} \tag{2.1} \]

Here S is the hidden state sequence and O is the known observation sequence. P(O) can be computed efficiently with the forward-backward algorithm [19], and it does not depend on S, so finding the sequence S* that maximizes P(S|O) is equivalent to finding the S* that maximizes P(S,O).

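An HMM can be modeled as a directed graph as follows:

[Diagram: a chain of states S1 -> S2 -> S3 -> ... -> Sn-1 -> Sn, with a directed edge from each state St down to its observation Ot]

Figure 2: Directed graph describing an HMM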

Here S_i is the state at time t=i in the state sequence S, and O_i is the observation at time t=i in the sequence O. Using the first-order Markov property (the current state depends only on the immediately preceding state) and the assumption that the observation at time t depends only on the state at time t, the probability P(S,O) can be computed as follows:

\[ P(S, O) = P(S_1)\, P(O_1 \mid S_1) \prod_{t=2}^{n} P(S_t \mid S_{t-1})\, P(O_t \mid S_t) \tag{2.2} \]
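The factorization in (2.2) can be sketched in Python as follows. All probabilities in the tables are invented for illustration, and the three-label state set is an assumption made to keep the example small. The `viterbi` function finds the sequence that maximizes P(S,O) by dynamic programming, and `brute_force` checks it by enumerating every state sequence.

```python
from itertools import product

STATES = ["B_PER", "I_PER", "O"]

# Hypothetical parameters, chosen by hand for illustration only.
start = {"B_PER": 0.3, "I_PER": 0.0, "O": 0.7}
trans = {
    "B_PER": {"B_PER": 0.1, "I_PER": 0.6, "O": 0.3},
    "I_PER": {"B_PER": 0.1, "I_PER": 0.4, "O": 0.5},
    "O": {"B_PER": 0.3, "I_PER": 0.0, "O": 0.7},
}
emit = {
    "B_PER": {"Bill": 0.8, "Clinton": 0.1, "spoke": 0.1},
    "I_PER": {"Bill": 0.1, "Clinton": 0.8, "spoke": 0.1},
    "O": {"Bill": 0.05, "Clinton": 0.05, "spoke": 0.9},
}


def joint_probability(states, observations):
    """P(S, O) as the product of factors in equation (2.2)."""
    prob = start[states[0]] * emit[states[0]][observations[0]]
    for t in range(1, len(states)):
        prob *= trans[states[t - 1]][states[t]] * emit[states[t]][observations[t]]
    return prob


def viterbi(observations):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] holds the best probability and path among sequences ending in s.
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            prob, path = max(
                ((best[p][0] * trans[p][s] * emit[s][obs], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


def brute_force(observations):
    """Same result by enumerating every state sequence (exponential cost)."""
    candidates = product(STATES, repeat=len(observations))
    return max(
        ((joint_probability(list(c), observations), list(c)) for c in candidates),
        key=lambda item: item[0],
    )
```

With these toy parameters, `viterbi(["Bill", "Clinton", "spoke"])` selects the path B_PER, I_PER, O, the same path that `brute_force` finds.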

The optimal state sequence, the one that best describes a given observation sequence, can be found by dynamic programming using the Viterbi algorithm [19].

2.2.2. Limitations of hidden Markov models
In the paper "Maximum Entropy Markov Models for Information Extraction and Segmentation" [5], Andrew McCallum identifies two problems that traditional HMMs in particular, and generative models in general, face when labeling sequence data.

First, to compute the probability P(S, O) in (2.1), we normally have to enumerate all possible state sequences S and observation sequences O. The sequences S can be enumerated because the number of states is finite, but in some applications the sequences O cannot be enumerated because the observations are extremely rich and diverse. To get around this, an HMM has to assume that observations are independent of one another: the observation at time t depends only on the state at that time. For sequence labeling problems, however, we would prefer more flexible representations of the observations, such as describing each observation by a set of features that need not be independent of one another. For example, when classifying questions and answers in a FAQ list, the features could be the words themselves, the length of the line, the number of whitespace characters, whether the current line is indented, the number of non-alphabetic characters, features describing grammatical functions, and so on. Clearly these features are not necessarily independent of one another.

The second problem that generative models face when applied to sequence classification is that they use a joint probability to model problems that are conditional in nature. For these problems it is more appropriate to use a conditional model that computes P(S|O) directly instead of P(S, O) as in formula (2.1).
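The FAQ example can be sketched as a small feature extractor in Python. The function name and dictionary keys are hypothetical, and counting as "non-alphabetic" every character that is neither a letter nor whitespace is an assumption.

```python
def line_features(line):
    """Describe one line of a FAQ document by several overlapping features."""
    return {
        "words": line.split(),
        "length": len(line),
        "num_whitespace": sum(1 for ch in line if ch.isspace()),
        "is_indented": line[:1].isspace(),
        # Characters that are neither letters nor whitespace.
        "num_non_alpha": sum(
            1 for ch in line if not ch.isalpha() and not ch.isspace()
        ),
    }
```

For the line `"  2.1 What is a CRF?"` this gives a length of 20, 6 whitespace characters, 4 non-alphabetic characters and `is_indented` set to true. The features clearly overlap: a longer line tends to have more words and more whitespace, which is exactly the kind of dependence an HMM's independence assumption cannot accommodate.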

2.3. Maximum entropy Markov models (MEMM)

McCallum proposed a new Markov model, the MEMM [5] (Maximum Entropy Markov Model), as an answer to the problems of the traditional Markov model.

2.3.1. Overview of maximum entropy Markov models (MEMM)

An MEMM replaces the state transition probabilities and the observation emission probabilities of an HMM with a single probability function P(S_i | S_{i-1}, O_i): the probability that the current state is S_i given that the previous state is S_{i-1} and the current observation is O_i. The MEMM takes the observations as given, so we do not need to model the probability of generating them. The only thing of interest is the state transition probabilities. Compared with an HMM, the current observation here depends not only on the current state but may also depend on the previous state. In other words, the current observation is tied to the state transition rather than to individual states as in the traditional HMM.
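[Diagram: a chain of states S1 -> S2 -> S3 -> ... -> Sn-1 -> Sn, with a directed edge from each observation Ot into its state St]

Figure 3: Directed graph describing an MEMM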

Applying the first-order Markov property, the probability P(S|O) can be computed with the formula:

\[ P(S \mid O) = P(S_1 \mid O_1) \prod_{t=2}^{n} P(S_t \mid S_{t-1}, O_t) \tag{2.3} \]

An MEMM treats the observations as given conditions rather than as elements generated by the model, as in an HMM. The state transition probabilities can therefore depend on diverse features of the observation sequence. These features are not restricted by an independence assumption as in an HMM, and they play an important role in determining the next state.

Write P_{S_{i-1}}(S_i | O_i) = P(S_i | S_{i-1}, O_i). Applying the maximum entropy method (discussed in chapter 3), McCallum determines that the state transition distribution has the following exponential form:

\[ P_{S_{i-1}}(S_i \mid O_i) = \frac{1}{Z(O_i, S_{i-1})} \exp\left( \sum_{a} \lambda_a f_a(O_i, S_i) \right) \tag{2.4} \]

Here \lambda_a are the parameters to be trained (estimated); Z(O_i, S_{i-1}) is the normalization factor that makes the transition probabilities from state S_{i-1} to all adjacent states S_i sum to 1; and f_a(O_i, S_i) is a feature function at position i in the observation sequence and in the state sequence.

Each feature function f_a(O_i, S_i) takes two arguments: the current observation O_i and the current state S_i. McCallum defines a = <b, s>, where b is a binary feature that depends only on the current observation and s is a state. Here is an example of a feature b:

\[ b(O_i) = \begin{cases} 1 & \text{if the current observation is ``the''} \\ 0 & \text{otherwise} \end{cases} \]

The feature function f_a(O_i, S_i) fires when b(O_i) fires and the current state takes a specific value s:

\[ f_a(O_i, S_i) = \begin{cases} 1 & \text{if } b(O_i) = 1 \text{ and } S_i = s \\ 0 & \text{otherwise} \end{cases} \]
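The exponential form in (2.4) can be illustrated with a short Python sketch. The weights, the second observation test `b_is_capitalized` and the three-label state set are all hypothetical. Keeping one weight table per previous state is one way to read the subscript S_{i-1} in (2.4).

```python
import math

STATES = ["B_PER", "I_PER", "O"]


def b_is_the(observation):
    """Binary observation feature b from the text: the observation is 'the'."""
    return 1 if observation == "the" else 0


def b_is_capitalized(observation):
    """A second, hypothetical observation feature."""
    return 1 if observation[:1].isupper() else 0


# Hypothetical weights lambda_a, one table per previous state.
# Each key a = (b, s) pairs an observation test b with a target state s.
WEIGHTS = {
    "O": {(b_is_the, "O"): 2.0, (b_is_capitalized, "B_PER"): 1.5},
    "B_PER": {(b_is_capitalized, "I_PER"): 2.5, (b_is_the, "O"): 1.0},
    "I_PER": {(b_is_capitalized, "I_PER"): 1.0, (b_is_the, "O"): 1.0},
}


def feature_value(b, s, observation, state):
    """f_a(O_i, S_i) = 1 if b(O_i) = 1 and S_i = s, else 0."""
    return b(observation) if state == s else 0


def transition_distribution(prev_state, observation):
    """P(S_i | S_{i-1}, O_i) for every candidate S_i, as in equation (2.4)."""
    scores = {}
    for state in STATES:
        total = sum(
            weight * feature_value(b, s, observation, state)
            for (b, s), weight in WEIGHTS[prev_state].items()
        )
        scores[state] = math.exp(total)
    z = sum(scores.values())  # Z(O_i, S_{i-1})
    return {state: score / z for state, score in scores.items()}
```

For example, `transition_distribution("O", "the")` puts most of its mass on O, while `transition_distribution("O", "Bill")` favours B_PER. In both cases the probabilities sum to 1 over the next states, which is the per-state normalization the following subsection examines.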

To label data, an MEMM finds the state sequence S that maximizes P(S|O) in formula (2.3). As in an HMM, the sequence S is found with the Viterbi algorithm.

2.3.2. The "label bias" problem

In some special cases, MEMMs and other models that define a separate probability distribution for each state can suffer from the "label bias" problem [15][17]. Consider the following simple state transition scenario:

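[Diagram: a finite-state model with two paths from state 0 to state 5. Upper path: 0 -(r)-> 1 -(i)-> 2 -(b)-> 5, spelling "rib". Lower path: 0 -(r)-> 3 -(o)-> 4 -(b)-> 5, spelling "rob"]

Figure 4: The "label bias" problem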

Suppose we need to determine the state sequence for the observation sequence "rob". The correct state sequence S is 0345, and we expect the probability P(0345|rob) to be larger than P(0125|rob).

Applying formula (2.3), we have:

P(0125|rob) = P(0) * P(1|0,r) * P(2|1,o) * P(5|2,b)

The transition probabilities from a state to its adjacent states sum to 1. So although state 1 has never seen the observation o, it has no choice but to move to state 2, which means P(2|1,x) = 1 for any observation x. In general, states whose transition distribution has low entropy (few outgoing transitions) tend to pay less attention to the current observation.

We also have P(5|2,b) = 1, and therefore P(0125|rob) = P(0) * P(1|0,r). Similarly, P(0345|rob) = P(0) * P(3|0,r). If the word "rib" occurs more often than "rob" in the training set, then P(3|0,r) is smaller than P(1|0,r), so P(0345|rob) is smaller than P(0125|rob). The state sequence S = 0125 is therefore always chosen, whether the observation sequence is "rib" or "rob".

In 1991, Léon Bottou proposed two solutions to this problem. The first is to merge states 1 and 3 and delay the branching until a discriminating observation is seen (here, i and o). This is a special case of converting a nondeterministic automaton into a deterministic one. The difficulty is that, even when this conversion is possible, it can cause a combinatorial explosion in the number of automaton states.

The second solution Bottou proposed is to start with a fully connected graph of states and let the training procedure decide on a suitable structure for the model. Unfortunately, this solution gives up the ordered structure of the model, a property that is very useful in information extraction [5].

A more appropriate solution is to consider the whole state sequence as a unit and to allow some transitions in the sequence to play a decisive role in choosing the state sequence. This means that the probability of the whole state sequence does not have to be conserved across transitions, but can change at a transition depending on the observation there. In the example above, the transitions at states 1 and 3 could then have more influence on which state sequence is chosen than the transition at state 0.
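The "rib"/"rob" argument can be checked numerically with a small Python sketch. The training set (three occurrences of "rib" and one of "rob") is an assumption, and the transition probabilities are estimated by simple counting rather than by a maximum entropy fit. The per-state normalization, which is what causes label bias, is kept. The start state 0 is taken to have probability 1.

```python
from collections import Counter

# Outgoing transitions of each state in the model of Figure 4.
SUCCESSORS = {0: [1, 3], 1: [2], 2: [5], 3: [4], 4: [5]}

# Hypothetical training set: "rib" is seen three times, "rob" once.
TRAINING = [("rib", [0, 1, 2, 5])] * 3 + [("rob", [0, 3, 4, 5])]


def train(training):
    """Count how often each (state, observation, next state) triple occurs."""
    counts = Counter()
    for word, path in training:
        for i, obs in enumerate(word):
            counts[(path[i], obs, path[i + 1])] += 1
    return counts


def transition_probability(counts, prev, obs, nxt):
    """P(nxt | prev, obs), normalised over the successors of prev only."""
    total = sum(counts[(prev, obs, s)] for s in SUCCESSORS[prev])
    if total == 0:
        # Unseen observation: the probability mass must still sum to 1 over
        # the successors, so a state with one successor always gives it 1.
        return 1.0 / len(SUCCESSORS[prev])
    return counts[(prev, obs, nxt)] / total


def path_probability(counts, path, word):
    """Product of the local transition probabilities, as in equation (2.3)."""
    prob = 1.0
    for i, obs in enumerate(word):
        prob *= transition_probability(counts, path[i], obs, path[i + 1])
    return prob
```

With `counts = train(TRAINING)`, the wrong path 0125 receives probability 0.75 for the observation "rob" while the correct path 0345 receives only 0.25: states 1 and 3 pass on all of their incoming mass regardless of whether they see "i" or "o".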

2.4. Chapter summary

This chapter introduced the approaches to entity type recognition: the manual approach and the machine learning approaches (HMM and MEMM). The manual approach is costly in effort and time and is not portable. An HMM cannot integrate the rich features of the observation sequence into the classification process, and an MEMM suffers from the "label bias" problem. The analysis of each method shows the need for a model that is truly suited to labeling sequence data in general and to entity type recognition in particular.

Chapter 3. Conditional Random Field (CRF)

CRFs [6][11][12][15][16][17] were first introduced in 2001 by Lafferty and colleagues. Like an MEMM, a CRF is a model based on conditional probability, and it can integrate diverse features of the observation sequence to support classification. Unlike an MEMM, however, a CRF is an undirected graphical model. This lets a CRF define the probability distribution of the whole state sequence given the observation sequence, instead of a distribution over each state given the previous state and the current observation as in an MEMM. Because of this modeling choice, a CRF can solve the "label bias" problem.

This chapter gives the definition of a CRF, some parameter estimation methods for CRF models, and a modified Viterbi algorithm for finding the state sequence that best describes a given observation sequence.

Notation conventions:

- Uppercase letters X, Y, Z, ... denote random variables.
- Bold lowercase letters x, y, t, s, ... denote vectors, such as the vector representing an observation sequence or the vector representing a label sequence.
- A bold lowercase letter with a subscript denotes one component of a vector. For example, x_i is the component at position i of the vector x.
- Non-bold lowercase letters such as x, y, ... denote single values, such as one observation or one state.
- S: the finite set of states of a CRF model.

3.1. Definition of a CRF

Let X be a random variable whose values are data sequences to be labeled, and let Y be a random variable whose values are the corresponding label sequences. Each component Y_i of Y is a random variable that takes values in the finite state set S. In entity type recognition, X ranges over natural language sentences, Y is a random sequence of entity names corresponding to these sentences, and each component Y_i of Y ranges over the set of all entity name labels (person name, place name, ...).

Let G = (V, E) be an undirected graph without cycles, where V is the set of vertices and E is the set of undirected edges connecting the vertices. The vertices V represent the components of the random variable Y, with a one-to-one mapping between a vertex v and a component Y_v of Y. We say that (Y|X) is a conditional random field (Conditional Random Field - CRF) when, conditioned on X, the random variables Y_v obey the Markov property with respect to the graph G:

\[ P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \in N(v)) \tag{3.1} \]

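Here N(v) is the set of all vertices adjacent to v. A CRF is thus a random field that is globally conditioned on X. In sequence processing problems, G is simply a chain: G = (V = {1, 2, ..., n}, E = {(i, i+1)}).

Write X = (X_1, X_2, ..., X_n) and Y = (Y_1, Y_2, ..., Y_n). The graphical model for a CRF has the form:

[Diagram: a chain of label nodes Y1 - Y2 - Y3 - ... - Yn-1 - Yn joined by undirected edges, with the observation sequence X connected to every Yi]

Figure 5: Undirected graph describing a CRF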

Let C be the set of all complete subgraphs (cliques) of the graph G, the graph that represents the structure of a CRF. Applying the Hammersley-Clifford result [14] for Markov random fields, we can factorize p(y|x), the probability of a label sequence given the observation sequence, into a product of potential functions:

\[ P(\mathbf{y} \mid \mathbf{x}) = \prod_{A \in C} \psi_A(A \mid \mathbf{x}) \tag{3.2} \]

In sequence processing problems the graph representing the structure of a CRF is a straight line, as in Figure 5, so the set C must be the union of E and V, where E is the set of edges of G and V is the set of vertices of G. In other words, each subgraph A consists of either a single vertex or a single edge of G.
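As a small illustration of this statement, the cliques of a chain with n label positions can be listed directly. The function name `chain_cliques` is hypothetical.

```python
def chain_cliques(n):
    """Return the vertex cliques and the edge cliques of a chain of n labels."""
    vertices = [(i,) for i in range(1, n + 1)]
    edges = [(i, i + 1) for i in range(1, n)]
    return vertices, edges
```

For n = 4 this gives 4 vertex cliques and 3 edge cliques, so the product in (3.2) has 2n - 1 = 7 factors.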

3.2. The maximum entropy principle

Lafferty et al. [17] determine the potential functions of CRF models based on the maximum entropy principle [1][3][8][29]. Maximum entropy is a principle for estimating probability distributions from a set of training data.

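3.2.1. The conditional entropy measure

Entropy measures the uniformity, or uncertainty, of a probability distribution. The conditional entropy of a model distribution p(y|x) over "a state sequence given an observation sequence" has the following form:

\[ H(p) = -\sum_{\mathbf{x},\mathbf{y}} \tilde{p}(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) \tag{3.3} \]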
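Equation (3.3) can be evaluated directly on a toy distribution. In this Python sketch the dictionaries `p_x` and `p_y_given_x` are hypothetical inputs, and for brevity x and y are single items rather than whole sequences.

```python
import math


def conditional_entropy(p_x, p_y_given_x):
    """H(p) = -sum over x, y of p~(x) * p(y|x) * log p(y|x), equation (3.3)."""
    total = 0.0
    for x, px in p_x.items():
        for pyx in p_y_given_x[x].values():
            if pyx > 0:  # the term 0 * log 0 is taken to be 0
                total -= px * pyx * math.log(pyx)
    return total
```

A model that is uniform over two labels for every x has entropy log 2, the maximum for two labels, while a model that always puts probability 1 on a single label has entropy 0.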

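3.2.2. Constraints on the model distribution

The constraints on the model distribution are set up by collecting statistics of features extracted from the training data. Here is an example of such a feature:

\[ f = \begin{cases} 1 & \text{if the previous word is ``ông'' and the current label is B\_PER} \\ 0 & \text{otherwise} \end{cases} \]

The set of features captures the important information in the training data. The expectation of feature f under the empirical probability distribution is written as:

\[ E_{\tilde{p}(\mathbf{x},\mathbf{y})}[f] \equiv \sum_{\mathbf{x},\mathbf{y}} \tilde{p}(\mathbf{x}, \mathbf{y})\, f(\mathbf{x}, \mathbf{y}) \tag{3.4} \]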

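Here \tilde{p}(x, y) is the empirical distribution of the training data. Suppose the training data consists of N pairs D = {(x_i, y_i)}, each made up of an observation sequence and a label sequence. The empirical distribution of the training data is then computed as:

\[ \tilde{p}(\mathbf{x}, \mathbf{y}) = \frac{1}{N} \times \text{number of times } \mathbf{x}, \mathbf{y} \text{ occur together in the training set} \]

The expectation of feature f under the model distribution is:

\[ E_{p}[f] \equiv \sum_{\mathbf{x},\mathbf{y}} \tilde{p}(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x})\, f(\mathbf{x}, \mathbf{y}) \tag{3.5} \]

The model distribution agrees with the empirical distribution only if, for every feature, the expectation under the empirical distribution equals the expectation under the model distribution:

\[ E_{\tilde{p}(\mathbf{x},\mathbf{y})}[f] = E_{p}[f] \tag{3.6} \]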
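The two expectations in (3.4) and (3.5) can be computed on a toy data set. This Python sketch makes several simplifying assumptions: each x is a local context (previous word, current word) rather than a whole sequence, each y is a single label, and the four training pairs are invented.

```python
from collections import Counter

LABELS = ["B_PER", "B_LOC", "O"]

# Hypothetical training pairs (x, y), with x = (previous word, current word).
DATA = [
    (("ông", "Khải"), "B_PER"),
    (("ông", "Khải"), "B_PER"),
    (("ông", "ấy"), "O"),
    (("thăm", "Huế"), "B_LOC"),
]


def f(x, y):
    """Example feature: the previous word is 'ông' and the label is B_PER."""
    previous_word, _current_word = x
    return 1 if previous_word == "ông" and y == "B_PER" else 0


def empirical_expectation(data, feature):
    """Equation (3.4), with p~(x, y) = count(x, y) / N."""
    n = len(data)
    return sum(count / n * feature(x, y) for (x, y), count in Counter(data).items())


def model_expectation(data, feature, p_y_given_x):
    """Equation (3.5), with p~(x) = count(x) / N."""
    n = len(data)
    marginal = Counter(x for x, _ in data)
    return sum(
        count / n * p_y_given_x(y, x) * feature(x, y)
        for x, count in marginal.items()
        for y in LABELS
    )


def uniform_model(y, x):
    """A model that ignores x and spreads probability evenly over the labels."""
    return 1.0 / len(LABELS)
```

On this data the empirical expectation of f is 0.5, while the uniform model gives 0.25, so the uniform model violates constraint (3.6) and has to be adjusted.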

Equation (3.6) expresses one constraint on the model distribution. If we choose n features from the training data, we obtain n corresponding constraints on the model distribution.

3.2.3. The maximum entropy principle

Let P be the space of all conditional probability distributions, and let n be the number of features extracted from the training data. P' is the subset of P defined as follows:

\[ P' = \left\{ p \in P \;\middle|\; E_{p}(f_i) = E_{\tilde{p}}(f_i) \quad \forall i \in \{1, 2, 3, \ldots, n\} \right\} \tag{3.7} \]

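[Figure with four panels (a), (b), (c), (d), each showing the space P: (a) P alone, (b) P with constraint C1, (c) P with constraints C1 and C2, (d) P with constraints C1 and C2]

Figure 6: Model constraints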

P is the space of all probability distributions. Case (a): no constraints. Case (b): one constraint C1; the models p that satisfy it lie on the line C1. Case (c): two constraints C1 and C2 that intersect; the model p satisfying both is the intersection of the lines C1 and C2. Case (d): two constraints C1 and C2 that do not intersect; no model p satisfies both.

The central idea of the maximum entropy principle is to find a model distribution that obeys every assumption known from the empirical data and makes no further assumption. That is, the model distribution must satisfy all constraints drawn from the data and otherwise stay as close as possible to the uniform distribution. In mathematical terms, we seek a model distribution p(y|x) that satisfies two conditions: it belongs to the set P' of (3.7), and it maximizes the conditional entropy (3.3). For each feature f_i we introduce a Lagrange multiplier λ_i and define the Lagrangian L(p, λ) as:

$$L(p, \lambda) = H(p) + \sum_i \lambda_i \left( E_{p}[f_i] - E_{\tilde{p}}[f_i] \right) \qquad (3.8)$$

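By the theory of Lagrange multipliers, the distribution p(y|x) that maximizes the entropy H(p) subject to the n constraints $E_{\tilde{p}(x,y)}[f_i] = E_{p}[f_i]$ is also an extremum of L(p, λ). From (3.8) we obtain:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big) \qquad (3.9)$$

where Z(x) is the normalization factor that ensures $\sum_y p(y \mid x) = 1$ for every x:

$$Z(x) = \sum_y \exp\Big( \sum_i \lambda_i f_i(x, y) \Big) \qquad (3.10)$$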
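As a minimal sketch of equations (3.9) and (3.10), the function below turns a list of features and weights into a normalized conditional distribution. The feature `is_bill_person` and its weight are hypothetical and serve only to show the shape of the computation.

```python
import math

def maxent_distribution(x, labels, features, weights):
    """Return {y: p(y|x)} following (3.9), with Z(x) computed as in (3.10)."""
    scores = {y: math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
              for y in labels}
    z = sum(scores.values())            # normalization factor Z(x)
    return {y: s / z for y, s in scores.items()}

# Hypothetical single feature and weight, for illustration only.
def is_bill_person(x, y):
    return 1.0 if x == "Bill" and y == "B_PER" else 0.0

dist = maxent_distribution("Bill", ["B_PER", "O"], [is_bill_person], [math.log(3.0)])
print(dist)
```

With the weight log 3, the unnormalized scores are 3 and 1, so the model assigns probability 3/4 to B_PER. With all weights at zero the distribution is uniform, which is the maximum entropy solution when no constraint is active.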

3.3. Potential functions of CRF models

Applying the maximum entropy principle, Lafferty et al. show that the potential function of a CRF has exponential form:

$$\psi_A(A \mid x) = \exp\Big( \sum_k \gamma_k f_k(A \mid x) \Big) \qquad (3.11)$$

Here f_k is a feature of the observation sequence and γ_k is a weight indicating how informative the feature f_k is. There are two kinds of features, transition features (denoted t) and state features (denoted s), depending on whether A is a subgraph consisting of one edge or one vertex of G. Substituting the potential functions into (3.2) and adding a normalization factor Z(x), so that the probabilities of all label sequences for a given observation sequence sum to 1, we obtain:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \sum_k \lambda_k\, t_k(y_{i-1}, y_i, x) + \sum_i \sum_k \mu_k\, s_k(y_i, x) \Big) \qquad (3.12)$$

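Here x and y are the observation sequence and the corresponding state sequence; t_k is a feature of the entire observation sequence and of the states at positions i-1 and i of the state sequence; s_k is a feature of the entire observation sequence and of the state at position i. For example:

$$s_i = \begin{cases} 1 & \text{if } x_i = \text{Bill and } y_i = \text{B\_PER} \\ 0 & \text{otherwise} \end{cases}$$

$$t_i = \begin{cases} 1 & \text{if } x_{i-1} = \text{Bill},\ x_i = \text{Clinton},\ y_{i-1} = \text{B\_PER and } y_i = \text{I\_PER} \\ 0 & \text{otherwise} \end{cases}$$

The normalization factor Z(x) is computed as:

$$Z(x) = \sum_{y} \exp\Big( \sum_i \sum_k \lambda_k\, t_k(y_{i-1}, y_i, x) + \sum_i \sum_k \mu_k\, s_k(y_i, x) \Big) \qquad (3.13)$$

θ = (λ_1, λ_2, ..., μ_1, μ_2, ...) is the vector of model parameters. Its value is estimated with the parameter estimation methods presented later.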
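To illustrate (3.12) and (3.13), the sketch below scores every label sequence of a two-word sentence by brute-force enumeration. It uses only the two example features s and t given above, with assumed weights λ = μ = 1, and it ignores start and stop transitions. Enumeration is exponential in the sentence length, so this is a teaching aid rather than a practical implementation.

```python
import itertools
import math

LABELS = ["B_PER", "I_PER", "O"]

def s_feature(y_i, x, i):
    """State feature from the text: x_i = "Bill" and y_i = B_PER."""
    return 1.0 if x[i] == "Bill" and y_i == "B_PER" else 0.0

def t_feature(y_prev, y_i, x, i):
    """Transition feature from the text: "Bill Clinton" labeled B_PER I_PER."""
    return 1.0 if (x[i - 1] == "Bill" and x[i] == "Clinton"
                   and y_prev == "B_PER" and y_i == "I_PER") else 0.0

def score(y, x, lam, mu):
    """Exponent of (3.12), restricted to one transition and one state feature."""
    total = sum(mu * s_feature(y[i], x, i) for i in range(len(x)))
    total += sum(lam * t_feature(y[i - 1], y[i], x, i) for i in range(1, len(x)))
    return total

def crf_probability(y, x, lam=1.0, mu=1.0):
    """P(y|x) as in (3.12); Z(x) of (3.13) is obtained by enumerating all label sequences."""
    z = sum(math.exp(score(c, x, lam, mu))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x, lam, mu)) / z

x = ["Bill", "Clinton"]
all_sequences = list(itertools.product(LABELS, repeat=len(x)))
best = max(all_sequences, key=lambda y: crf_probability(y, x))
print(best, crf_probability(best, x))
```

Because Z(x) sums over whole label sequences, the probabilities of all nine candidate sequences add up to 1, and the sequence (B_PER, I_PER) receives the highest probability since both features fire for it.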

3.4. Labeling algorithm for sequence data

At each position i of the observation sequence, we define an |S| × |S| transition matrix:

$$M_i(x) = [\, M_i(y', y, x) \,] \qquad (3.14)$$

$$M_i(y', y, x) = \exp\Big( \sum_k \lambda_k\, t_k(y', y, x) + \sum_k \mu_k\, s_k(y, x) \Big) \qquad (3.15)$$

Here M_i(y', y, x) is the unnormalized score, playing the role of a transition probability, of moving from state y' to state y given the observation sequence x. The state sequence y* that best describes the observation sequence x is the solution of:

$$y^* = \arg\max_{y}\, p(y \mid x) \qquad (3.16)$$

The sequence y* is found with a modified Viterbi algorithm. Define ∂_i(y) as the highest probability among all state sequences of length i that end in state y, given the observation sequence x.

[Figure 7: One step of the modified Viterbi algorithm]

Suppose ∂_i(y_k) is known for every state y_k in the state set S of the model, and ∂_{i+1}(y_j) must be determined. From Figure 7 we obtain the recurrence:

$$\partial_{i+1}(y_j) = \max_{y_k \in S} \big( \partial_i(y_k) \cdot M_{i+1}(y_k, y_j, x) \big) \qquad (3.17)$$

Let $Pre_i(y) = \arg\max_{y'} \big( \partial_{i-1}(y') \cdot M_i(y', y, x) \big)$. Suppose the observation sequence x has length n. The state sequence y* is recovered by backtracking:

Step 1: over all states y, find $y^*(n) = \arg\max_y \partial_n(y)$; set i ← n and y ← y*(n).

Loop: while i > 1:
    y ← Pre_i(y)
    i ← i - 1
    y*(i) ← y

The resulting y* is the sequence with the largest probability p(y*|x), that is, the label sequence that best fits the given observation sequence.
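The recurrence and the backtracking steps above can be sketched as follows. The function `toy_M` and the three-word sentence are hypothetical stand-ins for the scores M_i(y', y, x) of a trained model; the code works with products of unnormalized scores, as in the text, whereas a practical implementation would normally work in log space to avoid underflow.

```python
import math

def viterbi(n, states, M):
    """Return the highest-scoring state sequence of length n.

    M(i, y_prev, y) is the unnormalized score M_i(y', y, x) for position i
    (1-based); y_prev is None at the first position.
    """
    best = {y: M(1, None, y) for y in states}   # score of the best path ending in y
    back = []                                   # back[i - 2][y] plays the role of Pre_i(y)
    for i in range(2, n + 1):
        new_best, pre = {}, {}
        for y in states:
            y_prev = max(states, key=lambda s: best[s] * M(i, s, y))
            pre[y] = y_prev
            new_best[y] = best[y_prev] * M(i, y_prev, y)
        best = new_best
        back.append(pre)
    y = max(states, key=lambda s: best[s])      # Step 1: best final state
    path = [y]
    for pre in reversed(back):                  # backtracking loop
        y = pre[y]
        path.append(y)
    return path[::-1]

X = ["Bill", "Clinton", "said"]                 # hypothetical observation sequence

def toy_M(i, y_prev, y):
    """Hypothetical scores standing in for exp(sum of weighted features)."""
    word, s = X[i - 1], 0.0
    if word == "Bill" and y == "B_PER":
        s += 2.0
    if word == "Clinton" and y_prev == "B_PER" and y == "I_PER":
        s += 2.0
    if word == "said" and y == "O":
        s += 1.0
    return math.exp(s)

decoded = viterbi(len(X), ["B_PER", "I_PER", "O"], toy_M)
print(decoded)
```

The running time is O(n |S|^2), compared with |S|^n for enumerating every label sequence.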

3.5. CRFs can resolve the label bias problem

The globally normalized nature of CRFs lets these models avoid the label bias problem described in Section 2.3.2. From a modeling standpoint, a CRF can be viewed as a probabilistic finite-state machine with unnormalized weights, each weight attached to a state transition. Because the weights are not normalized, different transitions can carry different degrees of importance. Any state can therefore amplify or dampen the probability mass passed on to its successors, while the global normalization factor still guarantees that the values finally assigned to whole state sequences form a valid probability distribution. In [17], Lafferty and colleagues ran an experiment with 2000 training samples and 500 test samples, all containing ambiguous cases like the example in Section 2.3.2. The error rate of the CRF was 4.6%, against 42% for the MEMM, which shows that MEMMs fail to pick the correct branch when label bias occurs.

3.6. Chapter summary

This chapter introduced the fundamentals of CRFs: the definition of a CRF, the labeling algorithm for sequence data, and the maximum entropy principle used to derive the potential functions of CRF models. It also showed that CRFs can resolve the label bias problem. Applications of CRFs to sequence-processing problems [5][9] show that CRFs handle this kind of data better than other machine learning models such as HMMs and MEMMs.

Chapter 4. Parameter estimation for CRF models


The technique used to estimate the parameters of a CRF model is to maximize the likelihood of the training data under the model. Suppose the training data consists of N pairs, each made of an observation sequence and the corresponding state sequence, D = {(x^(i), y^(i))}, i = 1...N. The likelihood of the conditional model p(y|x, θ) on the training set is:

$$L(\theta) = \prod_{x,y} p(y \mid x, \theta)^{\tilde{p}(x,y)} \qquad (4.1)$$

Here θ = (λ_1, λ_2, ..., μ_1, μ_2, ...) are the model parameters and $\tilde{p}(x,y)$ is the empirical joint distribution of x and y in the training set. The maximum likelihood principle states that the best model parameters are those that maximize the likelihood function:

$$\theta_{ML} = \arg\max_{\theta} L(\theta) \qquad (4.2)$$

θ_ML ensures that the data observed in the training set receive high probability under the model. In other words, the parameters that maximize the likelihood bring the model distribution as close as possible to the empirical distribution of the training set. Because computing θ directly from (4.1) is difficult, we instead find the θ that maximizes the logarithm of the likelihood, usually called the log-likelihood:

$$l(\theta) = \sum_{x,y} \tilde{p}(x,y)\, \log p(y \mid x, \theta) \qquad (4.3)$$

Because the logarithm is monotonic, this does not change the value of θ that is chosen. Substituting the CRF form of p(y|x, θ) into (4.3) gives:

$$l(\theta) = \sum_{x,y} \tilde{p}(x,y) \Big[ \sum_{i=1}^{n+1} \lambda \cdot t + \sum_{i=1}^{n} \mu \cdot s \Big] - \sum_{x} \tilde{p}(x)\, \log Z(x) \qquad (4.4)$$

Here λ = (λ_1, λ_2, ..., λ_n) and μ = (μ_1, μ_2, ..., μ_m) are the parameter vectors of the model, t is the vector of transition features (t_1(y_{i-1}, y_i, x), t_2(y_{i-1}, y_i, x), ...), and s is the vector of state features (s_1(y_i, x), s_2(y_i, x), ...).

The log-likelihood of a CRF model is a concave, smooth function over the whole parameter space. Concavity means the global maximum can be found by setting every component of the gradient of the log-likelihood to zero. Each component of the gradient is the derivative of the log-likelihood with respect to one model parameter. Differentiating with respect to λ_k gives:

$$\frac{\partial l(\theta)}{\partial \lambda_k} = \sum_{x,y} \tilde{p}(x,y) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x) - \sum_{x} \tilde{p}(x) \sum_{y} p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x) = E_{\tilde{p}(x,y)}[t_k] - E_{p(y \mid x, \theta)}[t_k] \qquad (4.5)$$

Setting this expression to zero amounts to imposing a constraint on the model: the mean of t_k under the distribution $\tilde{p}(x)\, p(y \mid x, \theta)$ equals the mean of t_k under the empirical distribution $\tilde{p}(x, y)$.
Mathematically, estimating the parameters of a CRF model is the problem of maximizing the log-likelihood. This chapter presents several methods for doing so: iterative methods (IIS and GIS) and numerical optimization methods (conjugate gradient, Newton-type methods, ...).

4.1. Iterative methods

Iterative methods refine the model distribution step by step by updating the model parameters according to:

$$\lambda_k \leftarrow \lambda_k + \delta\lambda_k \qquad (4.6)$$

Here the increments δλ_k are chosen so that the likelihood moves closer to its maximum. Lafferty et al. [17] propose two iterative algorithms for estimating CRF parameters, IIS and GIS. This section first describes the general iterative scheme and then examines the two algorithms in detail.

Suppose we have a model p(y|x, θ) with θ = (λ_1, λ_2, ..., μ_1, μ_2, ...). The goal of the iterative methods is to find a new parameter set θ + δθ for which the log-likelihood is larger than for the old parameter set, where δθ = (δλ_1, δλ_2, ..., δμ_1, δμ_2, ...). In other words, we need an update rule that brings the log-likelihood as close to its maximum as possible. The update is repeated until the log-likelihood converges, that is, until the absolute change in the log-likelihood falls below a chosen threshold. For a CRF model, the change in the log-likelihood is bounded below by an auxiliary function A(θ, δθ) defined as:

$$A(\theta, \delta\theta) \equiv \sum_{x,y} \tilde{p}(x,y) \Big[ \sum_{i=1}^{n+1} \sum_k \delta\lambda_k\, t_k(y_{i-1}, y_i, x) + \sum_{i=1}^{n} \sum_k \delta\mu_k\, s_k(y_i, x) \Big] + 1 - \sum_{x} \tilde{p}(x) \sum_{y} p(y \mid x, \theta) \Big[ \sum_{i=1}^{n+1} \sum_k \frac{t_k(y_{i-1}, y_i, x)}{T(x,y)} \exp\big(\delta\lambda_k T(x,y)\big) + \sum_{i=1}^{n} \sum_k \frac{s_k(y_i, x)}{T(x,y)} \exp\big(\delta\mu_k T(x,y)\big) \Big] \qquad (4.7)$$

where T(x, y) is the sum of all features of the observation sequence and its label sequence (x, y):

$$T(x,y) \equiv \sum_{i=1}^{n+1} \sum_k t_k(y_{i-1}, y_i, x) + \sum_{i=1}^{n} \sum_k s_k(y_i, x) \qquad (4.8)$$

Since l(θ + δθ) - l(θ) ≥ A(θ, δθ), maximizing A(θ, δθ) maximizes a guaranteed lower bound on the increase of the log-likelihood. The iterative procedure for finding the parameters that maximize the likelihood is:

Initialize the λ_k
Repeat until convergence:
    Solve ∂A(θ, δθ)/∂δλ_k = 0 for each parameter λ_k
    Update the parameters: λ_k ← λ_k + δλ_k

Setting the partial derivative of A(θ, δθ) with respect to δλ_k to zero gives the following equation:

$$E_{\tilde{p}(x,y)}[t_k] \equiv \sum_{x,y} \tilde{p}(x,y) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x) \qquad (4.9)$$

$$= \sum_{x,y} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k T(x,y)\big) \qquad (4.10)$$

From these equations the increments δλ_k and δμ_k can be computed. IIS [2][15] and GIS [15] are two special cases of the iterative scheme; each algorithm chooses the increment vector used to update the parameters in a different way.
4.1.1. The GIS algorithm

Let C be the largest value of T(x, y) over all x, y in the training data. Define a global feature, that is, a feature not attached to any edge or vertex of the graph describing the CRF:

$$g(x,y) \equiv C - \sum_{i=1}^{n+1} \sum_k t_k(y_{i-1}, y_i, x) - \sum_{i=1}^{n} \sum_k s_k(y_i, x) \qquad (4.11)$$

Adding a feature normally changes the probability distribution of the model. The global feature g(x, y), however, depends entirely on the features already in the model, so it adds no new constraint and leaves the model distribution unchanged. What it does change is the value of T(x, y): once the global feature is counted, T(x, y) always equals the constant C. If the features take only the values 0 and 1, T(x, y) is simply the number of active features in the model. Under the assumption T(x, y) = C, Lafferty et al. [15][17] show that equation (4.10) can be solved analytically. Taking the logarithm of both sides of (4.10):

$$\log E_{\tilde{p}(x,y)}[t_k] = \log\Big( \sum_{x,y} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x)\, \exp(\delta\lambda_k C) \Big) = \log E_{p(y \mid x, \theta)}[t_k] + \delta\lambda_k C \qquad (4.12)$$

which gives:

$$\delta\lambda_k = \frac{1}{C} \log \frac{E_{\tilde{p}(x,y)}[t_k]}{E_{p(y \mid x, \theta)}[t_k]} \qquad (4.13)$$

The convergence speed of GIS depends on the magnitude of C: the larger C is, the smaller the update steps and the slower the convergence; conversely, the smaller C is, the faster the convergence.
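The GIS update (4.13) can be tried on a deliberately small model. The sketch below assumes a single context with two labels and one indicator feature per label, so exactly one feature fires for every outcome and T(x, y) = C = 1 without needing the global feature. The data are invented for illustration.

```python
import math

LABELS = ["A", "B"]
# Hypothetical training labels for a single context: three "A" and one "B".
DATA = ["A", "A", "A", "B"]

# One indicator feature per label, so exactly one feature is active: T(x, y) = C = 1.
FEATURES = [lambda y: 1.0 if y == "A" else 0.0,
            lambda y: 1.0 if y == "B" else 0.0]
C = 1.0

def model(weights):
    """p(y) in exponential form for the current weights."""
    scores = {y: math.exp(sum(w * f(y) for w, f in zip(weights, FEATURES)))
              for y in LABELS}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def gis_step(weights):
    """One GIS iteration: add (1/C) * log(empirical / model expectation) to each weight."""
    p = model(weights)
    updated = []
    for w, f in zip(weights, FEATURES):
        empirical = sum(f(y) for y in DATA) / len(DATA)
        expected = sum(p[y] * f(y) for y in LABELS)
        updated.append(w + math.log(empirical / expected) / C)   # equation (4.13)
    return updated

weights = [0.0, 0.0]
for _ in range(5):
    weights = gis_step(weights)
fitted = model(weights)
print(fitted)
```

Here the first step already reproduces the empirical proportions (0.75 and 0.25) and later steps leave the weights unchanged. With a larger C, each step is scaled down by 1/C, which is the slow-down described above.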
4.1.2. The IIS algorithm

The idea of IIS is to rewrite equation (4.10) as a polynomial in exp(δλ_k) and solve that polynomial with the Newton-Raphson method to obtain δλ_k. To put (4.10) in polynomial form, Lafferty et al. use the approximation:

$$T(x,y) \approx T(x) = \max_{y} T(x,y) \qquad (4.14)$$

Substituting T(x) for T(x, y) in equation (4.10) gives:

$$E_{\tilde{p}(x,y)}[t_k] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k T(x)\big) \qquad (4.15)$$

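Partition the set of pairs (x, y) into disjoint subsets according to the value m = T(x), with m ranging from 0 to T_max, where $T_{max} = \max_x T(x)$. Equation (4.15) can then be rewritten as:

$$E_{\tilde{p}(x,y)}[t_k] = \sum_{m=0}^{T_{max}} \; \sum_{\{x,y \,\mid\, T(x)=m\}} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x)\, [\exp(\delta\lambda_k)]^m \qquad (4.16)$$

Define a_{k,m} as the expectation of t_k over the pairs (x, y) with T(x) = m:

$$a_{k,m} = \sum_{x,y} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x)\, \delta(m, T(x)) \qquad (4.17)$$

where δ(m, T(x)) is defined as:

$$\delta(m, T(x)) = \begin{cases} 1 & \text{if } T(x) = m \\ 0 & \text{otherwise} \end{cases}$$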

Equation (4.16) can then be written as:

$$E_{\tilde{p}(x,y)}[t_k] = \sum_{m=0}^{T_{max}} a_{k,m}\, [\exp(\delta\lambda_k)]^m \qquad (4.18)$$

Solving equation (4.18) with the Newton-Raphson method yields δλ_k.
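A minimal sketch of the Newton-Raphson step for (4.18) follows. Writing β = exp(δλ_k), the equation becomes a polynomial in β with non-negative coefficients a_{k,m}. The coefficient values and the target expectation below are made up for illustration; the function name is hypothetical.

```python
import math

def solve_iis_increment(a, target, beta=1.0, tol=1e-12, max_iter=100):
    """Solve sum_m a[m] * beta**m = target for beta > 0 and return log(beta).

    a[m] plays the role of a_{k,m} in (4.17); target is the empirical
    expectation of t_k; the result is the increment delta lambda_k.
    """
    for _ in range(max_iter):
        value = sum(a_m * beta ** m for m, a_m in enumerate(a)) - target
        slope = sum(m * a_m * beta ** (m - 1) for m, a_m in enumerate(a) if m > 0)
        step = value / slope
        beta -= step                     # Newton-Raphson update
        if abs(step) < tol:
            break
    return math.log(beta)

# Hypothetical coefficients: a_0 = 0, a_1 = 0.2, a_2 = 0.1, empirical expectation 0.6.
delta = solve_iis_increment([0.0, 0.2, 0.1], 0.6)
print(delta, math.exp(delta))
```

For these coefficients the equation is 0.1 β² + 0.2 β = 0.6, whose positive root is β = √7 - 1, so δλ_k = log(√7 - 1). Starting from β = 1 (that is, δλ_k = 0) is a natural choice because the polynomial is increasing and convex for β > 0.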

4.2. Numerical optimization methods

Numerical optimization techniques [15][28] use the gradient vector of the log-likelihood to find its extremum. This section covers two kinds of techniques: first-order and second-order optimization.

4.2.1. First-order numerical optimization

First-order techniques use only the information contained in the gradient of the objective function to move the estimate step by step toward the point where the gradient is zero and the function reaches its extremum. Two first-order methods can be used to estimate the parameters of a CRF model; both are variants of the non-linear conjugate gradient algorithm. Instead of repeatedly searching along a single direction, as hill-climbing methods do, conjugate direction methods generate a set of non-zero vectors, the conjugate set, and maximize the function along each of these directions in turn. Non-linear conjugate gradient methods are the special case of the conjugate direction technique in which each conjugate vector, or search direction, is generated from the previous search direction alone rather than from all earlier members of the conjugate set. Specifically, each new search direction p_j is a linear combination of the steepest-ascent direction (the gradient of the objective) and the previous search direction p_{j-1}. Each iteration of the conjugate gradient algorithm moves the parameters of the objective along the current conjugate vector using the update rule:

$$\theta^{(j+1)} = \theta^{(j)} + \alpha^{(j)} p_j \qquad (4.19)$$

where α^(j) is the optimal step size.

Two first-order methods are well suited to estimating CRF parameters: the Fletcher-Reeves and Polak-Ribière-Positive algorithms. The two algorithms share the same structure and differ only in how they choose the search direction and the optimal step size.
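The difference between the two variants comes down to the coefficient β that mixes the new gradient with the previous search direction. The sketch below uses the standard textbook formulas for the Fletcher-Reeves and Polak-Ribière-Positive coefficients; the thesis does not spell them out, and the function names are chosen here for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def beta_fletcher_reeves(g_new, g_old):
    """Fletcher-Reeves: ratio of squared gradient norms."""
    return dot(g_new, g_new) / dot(g_old, g_old)

def beta_polak_ribiere_positive(g_new, g_old):
    """Polak-Ribiere, clipped at zero (the "Positive" variant)."""
    diff = [a - b for a, b in zip(g_new, g_old)]
    return max(0.0, dot(g_new, diff) / dot(g_old, g_old))

def next_direction(g_new, p_old, beta):
    """New search direction for maximization: gradient plus beta times the old direction."""
    return [g + beta * p for g, p in zip(g_new, p_old)]

g_old, g_new = [1.0, 0.0], [0.5, 0.0]
print(beta_fletcher_reeves(g_new, g_old), beta_polak_ribiere_positive(g_new, g_old))
```

When consecutive gradients point the same way and shrink, as in this example, the Polak-Ribière-Positive coefficient drops to zero and the method restarts from the plain gradient direction, whereas Fletcher-Reeves keeps part of the old direction.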

4.2.2. Second-order numerical optimization

In addition to the gradient, second-order techniques improve on first-order ones by using curvature information, that is, the second derivatives of the objective, when computing parameter updates. The second-order update rule is derived from the second-order Taylor expansion of l(θ + δ):

$$l(\theta + \delta) \approx l(\theta) + \delta^{T} G(\theta) + \tfrac{1}{2}\, \delta^{T} H(\theta)\, \delta \qquad (4.20)$$

G(θ) and H(θ) are the gradient vector and the Hessian matrix (the matrix of second partial derivatives) of the log-likelihood l(θ). Setting the derivative of the approximation (4.20) with respect to δ to zero gives the parameter update:

$$\delta^{(k)} = -H^{-1}(\theta^{(k)})\, G(\theta^{(k)}) \qquad (4.21)$$

Here k is the index of the current iteration. Although updating the parameters this way converges very quickly, inverting the Hessian is expensive in time, especially for large problems such as those in natural language processing. Second-order methods that compute the inverse Hessian directly are therefore unsuitable for estimating CRF parameters. Quasi-Newton methods are a special case of second-order optimization. They resemble Newton's method but do not compute the Hessian directly; instead, they build a model of the Hessian at each iteration by measuring the change in the gradient. The key idea of quasi-Newton methods is to replace the Hessian in the Taylor expansion (4.20) with an approximation B(θ). The parameter update changes accordingly:

$$\delta^{(k)} = -B^{-1}(\theta^{(k)})\, G(\theta^{(k)}) \qquad (4.22)$$

At each iteration, B^{-1}(θ) is updated to reflect the changes in the parameters since the previous iteration. Rather than being recomputed from scratch, B^{-1}(θ) only needs to be adjusted at each step to reflect the curvature measured in the previous iteration:

$$B^{-1}(\theta^{(k)}) \big( G(\theta^{(k)}) - G(\theta^{(k-1)}) \big) = \delta^{(k-1)} \qquad (4.23)$$

Approximating the Hessian with B(θ) makes each quasi-Newton iteration much cheaper than an iteration of the classical Newton method, so in practice the quasi-Newton method reaches the optimum faster. The limited-memory quasi-Newton method (L-BFGS) [11] is a refinement of the quasi-Newton method designed to work when the available memory is limited. Recent experiments show that L-BFGS clearly outperforms other methods, including GIS, IIS and conjugate gradient, at maximizing the log-likelihood.
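To make the secant condition (4.23) concrete, the sketch below applies the standard BFGS formula for updating an inverse-Hessian approximation and checks that the updated matrix maps the gradient change onto the parameter change. This is the textbook BFGS update in its usual minimization convention (to maximize the log-likelihood it would be applied to its negative); it is not code from FlexCRFs, and the vectors are arbitrary illustrative values.

```python
def bfgs_inverse_update(H, s, y):
    """Standard BFGS update of an inverse-Hessian approximation H.

    s is the change in parameters and y the change in gradient between two
    iterations. The result H_new satisfies the secant equation H_new y = s.
    """
    n = len(s)
    rho = 1.0 / sum(yi * si for yi, si in zip(y, s))
    # V = I - rho * y s^T
    V = [[(1.0 if i == j else 0.0) - rho * y[i] * s[j] for j in range(n)]
         for i in range(n)]
    Vt = [[V[j][i] for j in range(n)] for i in range(n)]

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    core = matmul(matmul(Vt, H), V)          # V^T H V
    return [[core[i][j] + rho * s[i] * s[j] for j in range(n)] for i in range(n)]

H0 = [[1.0, 0.0], [0.0, 1.0]]                # start from the identity
s = [1.0, 2.0]                               # hypothetical parameter change
y = [2.0, 1.0]                               # hypothetical gradient change
H1 = bfgs_inverse_update(H0, s, y)
H1_times_y = [sum(H1[i][j] * y[j] for j in range(2)) for i in range(2)]
print(H1_times_y)
```

L-BFGS avoids storing the full matrix altogether: it keeps only the most recent (s, y) pairs and applies the same update implicitly, which is what makes it usable for models with very many parameters.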

4.3. Chapter summary

This chapter addressed parameter estimation for CRF models by maximum likelihood and introduced several methods for maximizing the log-likelihood: IIS, GIS, conjugate gradient, quasi-Newton and L-BFGS. Among these methods, L-BFGS is reported to clearly outperform the others.

Chapter 5. A named entity recognition system for Vietnamese

A named entity recognition system for Vietnamese would contribute significantly to Vietnamese language processing and to the understanding of Vietnamese text. Although named entity recognition is a basic task in information extraction and natural language processing, it is still a relatively new problem for Vietnamese. Despite the difficulties posed by the characteristics of Vietnamese and the pioneering nature of this line of research, my initial experiments on Vietnamese achieved very encouraging results.

5.1. Experimental environment

5.1.1. Hardware

Celeron III machine, 768 MHz CPU, 128 MB RAM

5.1.2. Software

FlexCRFs is a CRF framework for sequence labeling tasks such as POS tagging and noun phrase chunking. It is an open-source tool developed by Phan Xuân Hiếu (MSc) and Dr. Nguyễn Lê Minh (JAIST, Japan). My named entity recognition system for Vietnamese is built on top of this framework.

5.1.3. Experimental data

The experimental data consists of 50 business news articles (nearly 1400 sentences) taken from http://www.vnexpress.net. The raw data is passed through a preprocessor that strips the HTML tags and converts the text from UTF-8 to unaccented Vietnamese in Telex encoding. The data is then labeled by hand for use in the experiments.

5.2. The named entity recognition system for Vietnamese

The steps for labeling a Vietnamese Web page are illustrated in the figure below.

[Figure 8: Structure of the named entity recognition system. Pipeline: Input (HTML) → Preprocessing → Feature selection → FlexCRF framework → Restoration + tagging → Output (HTML)]

5.3. Training parameters and experimental evaluation

5.3.1. Training parameters

Some of the FlexCRF framework options used for training:

Table 2: Training parameters

Parameter             Value   Meaning
init_lamda_val        1.0     Initial value of the model parameters
num_iterations        55      Number of training iterations
f_rare_threshold      1       Only features whose frequency exceeds this value are included in the CRF model
cp_rare_threshold             Only context predicates whose frequency exceeds this value are included in the CRF model
eps_log_likelihood    0.01    FlexCRF uses L-BFGS to estimate the model parameters. This value gives the stopping condition of the training loop: training stops when |log_likelihood(t) - log_likelihood(t-1)| < 0.01, where t and t-1 are the t-th and (t-1)-th iterations

5.3.2. Evaluating named entity recognition systems

The quality of a named entity recognition system is evaluated with three measures: precision, recall and F-measure. They are computed as follows:

$$rec = \frac{correct}{correct + incorrect + missing} \qquad (5.1)$$

$$pre = \frac{correct}{correct + incorrect + spurious} \qquad (5.2)$$

$$F = \frac{2 \cdot pre \cdot rec}{pre + rec} \qquad (5.3)$$

The values correct, incorrect, missing and spurious are defined in the following table:

Table 3: Values used to evaluate a named entity recognition system

Value       Meaning
Correct     Number of cases labeled correctly
Incorrect   Number of cases labeled incorrectly
Missing     Number of cases that were missed
Spurious    Number of spurious cases

A named entity recognition system can be evaluated at the label level or at the phrase level. To see the difference, consider an example: suppose the system labels the phrase "Phan Văn Khải" as B_PER I_PER O. At the label level, the system gets 2 of the 3 labels right, so precision is 2/3. At the phrase level, we want the whole phrase to be marked as a person name, that is, the label sequence should be B_PER I_PER I_PER, so precision at the phrase level is 0/1 (there is one entity phrase and the system marked no phrase correctly).
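The two evaluation levels can be reproduced on the example above with a short sketch. It assumes the B_/I_/O label scheme used in the thesis, counts a phrase as correct only when its boundaries and type both match, and computes precision as matches over system output and recall as matches over the manual annotation. The helper names are chosen here for illustration.

```python
def extract_chunks(labels):
    """Return the set of (start, end, type) spans encoded by B_/I_ labels."""
    chunks, start, kind = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):      # sentinel closes the last chunk
        continues = start is not None and label == "I_" + kind
        if start is not None and not continues:
            chunks.add((start, i, kind))
            start, kind = None, None
        if label.startswith("B_"):
            start, kind = i, label[2:]
    return chunks

def precision_recall_f(n_match, n_model, n_manual):
    """Precision = match / model, recall = match / manual, F as in (5.3)."""
    pre = n_match / n_model if n_model else 0.0
    rec = n_match / n_manual if n_manual else 0.0
    f = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f

gold = ["B_PER", "I_PER", "I_PER"]     # "Phan Văn Khải" labeled by hand
pred = ["B_PER", "I_PER", "O"]         # system output from the example

label_precision = sum(g == p for g, p in zip(gold, pred)) / len(pred)
gold_chunks, pred_chunks = extract_chunks(gold), extract_chunks(pred)
chunk_scores = precision_recall_f(len(gold_chunks & pred_chunks),
                                  len(pred_chunks), len(gold_chunks))
print(label_precision, chunk_scores)
```

The label-level precision is 2/3, while the phrase-level scores are all zero because the predicted span covers only the first two words.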
5.3.3. 10-fold cross-validation

The system is tested with 10-fold cross-validation. The experimental data is divided into 10 equal parts; in turn, 9 parts are used for training and the remaining part for testing. The results of the 10 runs are recorded and evaluated as a whole.
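A minimal sketch of the 10-fold procedure is shown below. How the thesis actually assigned sentences or articles to folds is not stated; the round-robin assignment here is an assumption made for illustration.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    folds = [items[i::k] for i in range(k)]          # round-robin assignment (assumed)
    for held_out in range(k):
        test = folds[held_out]
        train = [item for j, fold in enumerate(folds) if j != held_out for item in fold]
        yield train, test

documents = list(range(50))                          # stand-ins for the 50 articles
splits = list(k_fold_splits(documents, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With 50 items each run trains on 45 and tests on 5, and every item is used for testing exactly once across the 10 runs.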

5.4. Feature selection

Selecting features from the training data is the most important task and determines the quality of a named entity recognition system. The more carefully the features are chosen, the higher the accuracy of the system. Because Vietnamese lacks part-of-speech (POS) information and lookup resources, the features must be chosen carefully and sensibly in order to approach the accuracy of systems built for English. A feature at position i of the observation sequence has two parts: the context information at position i of the observation sequence, and the corresponding label information. Feature selection therefore amounts to choosing context predicate templates, which capture the information of interest at any position of the observation sequence. Applying these templates at a position of the observation sequence yields the context predicates at that position. Each context predicate at position i, combined with the label at that position, gives one feature of the observation sequence at i. Thus, once the templates are fixed, thousands of features can be extracted automatically from the training data. As a first step, I propose the following context predicate templates:

5.4.1. Lexical context templates

Table 4: Lexical context templates

Template    Meaning
w:0, w:1    The observation at the current position and at the position immediately after it

Example: applying this template at position 1 of the sequence "3000 USD" gives the context predicate w:0:USD. Suppose that in the training data the word USD in this sequence is labeled I_CUR. Combining the label with the context predicate yields the following feature of the observation sequence:

$$g_k = \begin{cases} 1 & \text{if the current word is USD and the label is I\_CUR} \\ 0 & \text{otherwise} \end{cases}$$
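The way a template turns into context predicates, and then into features, can be sketched as follows. The `w:offset:word` string format mirrors the example above; the function name and the pairing of a predicate with a label as a tuple are assumptions for illustration and do not reflect the FlexCRFs file format.

```python
def context_predicates(tokens, i, offsets=(0, 1)):
    """Apply the lexical template w:0, w:1 at position i (0-based)."""
    predicates = []
    for offset in offsets:
        j = i + offset
        if 0 <= j < len(tokens):                 # skip offsets that fall outside the sequence
            predicates.append(f"w:{offset}:{tokens[j]}")
    return predicates

tokens = ["3000", "USD"]
labels = ["B_CUR", "I_CUR"]                      # assumed manual labels for this sequence

# A feature pairs one context predicate with the label at the same position.
features = [(pred, labels[i])
            for i in range(len(tokens))
            for pred in context_predicates(tokens, i)]
print(context_predicates(tokens, 1), features)
```

At position 1 only w:0:USD is produced, because there is no word after "USD"; paired with the label I_CUR it gives the feature g_k described above.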

5.4.2. Context templates describing word characteristics

Table 5: Context templates describing word characteristics

Template               Meaning
initial_cap            The word starts with a capital letter (possibly an entity)
all_cap                The word consists entirely of capital letters (possibly an ORG, e.g. EU, WTO, ...)
contain_percent_sign   The word contains the % character (possibly a PCT entity)
first_obsrv            First word of the sentence (capitalization carries no information)
uncaped_word           The word is lowercase (probably not an entity)
valid_number           The current word is a valid number, e.g. 123; 12.4
mark                   Punctuation such as period, comma, colon
4_digit_number         Very likely a year, e.g. "năm 2005" (year 2005)

5.4.3. Regular expression context templates

Table 6: Regular expression context templates

Template                                 Example      Meaning
^[0-9]+/[0-9]+/[0-9]+$                   12/04/2005   Date
^[0-9]+/[0-9]+$                          22/5         Date or fraction
^[0-9][0-9][0-9][0-9]$                   2005         Year
^(T|t)hứ (hai|ba|tư|năm|sáu|bảy|)$       Thứ hai      Day of the week
^(C|c)hủ nhật$                                        Day of the week
^[0-9]%$                                 7%           Percentage
^([0-9]|[A-Z])+$                         3COM         Company name
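The numeric and uppercase patterns of Table 6 can be applied to a token with Python's `re` module, as in the sketch below. The predicate names (`date`, `year`, and so on) are labels invented here, and the two day-of-the-week patterns are left out because they span a space and depend on how the text is tokenized and encoded.

```python
import re

# Patterns copied from Table 6; the names on the left are hypothetical labels.
PATTERNS = [
    ("date", re.compile(r"^[0-9]+/[0-9]+/[0-9]+$")),
    ("date_or_fraction", re.compile(r"^[0-9]+/[0-9]+$")),
    ("year", re.compile(r"^[0-9][0-9][0-9][0-9]$")),
    ("percent", re.compile(r"^[0-9]%$")),
    ("company_name", re.compile(r"^([0-9]|[A-Z])+$")),
]

def regex_predicates(token):
    """Return the names of all patterns that match the token."""
    return [name for name, pattern in PATTERNS if pattern.match(token)]

for token in ["12/04/2005", "22/5", "2005", "7%", "3COM"]:
    print(token, regex_predicates(token))
```

Two things become visible when the patterns are run as written: a token such as "2005" fires both the year and the company-name predicate, and the percentage pattern accepts only a single digit before the % sign, so "17%" would not match. A CRF can still learn from overlapping predicates, since each one simply becomes a separate feature.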

5.4.4. Dictionary context templates

Templates of this kind look words up in predefined lists. The context predicates they produce are very useful for recognizing entity types. Whereas English has lookup resources such as www.babyname.com (for English names), Vietnamese has no comparable resources, so I had to collect and build these lists from scratch. This is very time-consuming work, so I have only listed a few typical cases as a pilot and have not yet exploited their full potential.

Table 7: Dictionary context templates

Template      Examples
first_name    Nguyễn, Trần, Lê ...
last_name     Hoa, Lan, Thắng ...
mid_name      Thị, Văn, Đình
Verb          Sẽ, đã, phát biểu, nói ...
Time_marker   Sáng, trưa, chiều, tối
Loc_noun      Thị trấn, tỉnh, huyện, thủ đô, đảo, ...
Org_noun      Công ty, tổ chức, tổng công ty ...
Per_noun      Ông, bà, anh, chị, ...

5.5. Experimental results

5.5.1. Results of the 10 runs

[Figure 9: Precision, recall and F-measure over the 10 experimental runs (x-axis: run 1 to 10; y-axis: 0 to 100)]

5.5.2. The run with the best result

Table 8: Label-level evaluation, best run

Label     Manual   Model   Match   Pre. (%)   Rec. (%)   F-measure (%)
O         2132     2134    2101    98.4536    98.546     98.4998
B_LOC     91       97      83      85.567     91.2088    88.2979
I_LOC     55       59      51      86.4407    92.7273    89.4737
B_ORG     52       53      47      88.6792    90.3846    89.5238
B_TIME    58       67      54      80.597     93.1034    86.4
I_TIME    26       25      22      88         84.6154    86.2745
B_PER     13       13      12      92.3077    92.3077    92.3077
B_NUM     29       28      27      96.4286    93.1034    94.7368
I_NUM     3        2       2       100        66.6667    80
B_PCT     5        5       5       100        100        100
I_ORG     59       36      33      91.6667    55.9322    69.4737
B_CUR     12       12      11      91.6667    91.6667    91.6667
I_CUR     21       20      19      95         90.4762    92.6829
I_PER     15       18      15      83.3333    100        90.9091
B_MISC    10       7       5       71.4286    50         58.8235
I_MISC    4        3       3       100        75         85.7143
I_PCT     0        6       0       0          0          0
AVG1.                              90.5981    85.3586    87.9003
AVG2.     2585     2585    2490    96.325     96.325     96.325

Table 9: Phrase-level evaluation, best run

Chunk   Manual   Model   Match   Pre. (%)   Rec. (%)   F-measure (%)
PER     13       13      12      92.31      92.31      92.31
LOC     91       97      82      84.54      90.11      87.23
ORG     52       53      40      75.47      76.92      76.19
PCT     5        5       5       100        100        100
MISC    10       7       5       71.43      50.00      58.82
NUM     29       28      27      96.43      93.10      94.74
TIME    58       67      54      80.60      93.10      86.40
CUR     12       12      11      91.67      91.67      91.67
AVG1.                            86.55      85.90      86.23
AVG2.   270      282     236     83.69      87.41      85.51

[Figure 10: Increase of the F-measure over the training iterations (x-axis: number of L-BFGS training iterations, 1 to 53; y-axis: F1-measure score (%), 0 to 100)]

[Figure 11: Increase of the log-likelihood over the training iterations (x-axis: number of L-BFGS training iterations, 1 to 53; y-axis: log-likelihood, -70000 to 0)]

5.5.3. Average over the 10 runs

Table 10: Label-level evaluation, average over the 10 runs

Measure     Value (%)
Precision   82.59756
Recall      79.89403
F-measure   81.18363

Table 11: Phrase-level evaluation, average over the 10 runs

Measure     Value (%)
Precision   81.855
Recall      79.351
F-measure   80.537

5.5.4. Remarks

The initial experiments with the Vietnamese named entity recognition system gave fairly promising results. Many ambiguous cases remain because of the difficulties discussed in Chapter 1, but I believe that with a larger training set, richer lookup resources and better features, the system can reach higher accuracy in the future.

Conclusion

Problems addressed in this thesis

The thesis systematizes theoretical background on information extraction and the named entity recognition problem, and presents, analyzes and evaluates several approaches to that problem. It proposes a CRF-based solution for named entity recognition in Vietnamese, tests it experimentally and obtains very promising results. The main points of the thesis are as follows.

Chapter 1 gives an overview of information extraction and the named entity recognition problem, formulates the problem as a sequence labeling task and surveys its applications, which shows the need for a named entity recognition system for Vietnamese.

Chapter 2 reviews different approaches to named entity recognition: hand-crafted methods, HMMs and MEMMs. It analyzes and evaluates each method in depth, showing the inflexibility of hand-crafted methods, the poverty of the features available to HMMs, and the label bias problem of MEMMs. These assessments explain why I chose CRFs as the basis for the Vietnamese named entity recognition system.

Chapter 3 defines CRFs and introduces the maximum entropy principle and the labeling algorithm for sequence data. It also argues that CRFs are the most suitable model for named entity recognition: they can integrate rich and diverse features of the observation sequence, and their global normalization avoids the label bias problem of MEMMs.

Chapter 4 surveys parameter estimation methods for CRF models: iterative methods (IIS, GIS) and gradient-based methods such as conjugate gradient, quasi-Newton and L-BFGS. Of these, L-BFGS is rated best, and it is the method that FlexCRFs, a CRF framework, uses to estimate model parameters.

Chapter 5 presents the Vietnamese named entity recognition system and proposes feature selection methods for recognizing entity types in Vietnamese text. It also reports the results of the system over several experimental runs.

Future work

Although the system's results could be better, limited time meant I stopped at an average of about 80%. I will continue this research to improve the system, and I believe the result can rise to roughly 90% at the phrase level. Building on the current system, I plan to extend and refine the entity types, for example by subdividing location entities into countries, rivers, and so on. I also plan to study and build a system that recognizes relations between entities, such as a person's birthplace or a person's position in a company or organization; to build an ontology of locations, organizations, ... for Vietnamese; and to integrate the ontology and the named entity recognition system into the Vietnamese search engine Vinahoo to support entity-oriented search.

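Appendix: Output of the Vietnamese named entity recognition system

Legend (each entity type is highlighted in its own color in the system output):

Entity type   Meaning
LOC           Location name
ORG           Organization name
PER           Person name
PCT           Percentage
TIME          Date, time
CUR           Currency
NUM           Number
MISC          Other entity types

Colors listed: brown, crimson, sea blue, green, purple, light blue, orange.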

Kt qu sau khi h thng gn nhn mt s chui d liu quan st Th nm,16/12/2004,15:11 GMT+7. Cao Xumin , Ch tch Phng Thng mi Xut Nhp khu thc phm ca Trung Quc , cho rng , cch xem xt ca DOC khi em so snh gi tm ca Trung Quc vi gi tm ca n l vi phm lut thng mi . m bo li ch ca Nh nc v doanh nghip, sau thi im bn giao ti sn , VMS mi c th tin hnh kim k v thu t chc t vn xc nh gi tr doanh nghip . EU thc y quan h thng mi vi Trung Quc ( 24/02 ). Hip hi cht lng Thng Hi phng vn 2.714 khch hng 29 siu th quanh thnh ph trong thng qua. Th tng Trung Quc n Gia Bo va cho bit , nm nay nc ny s gim tc tng trng kinh t xung cn 8% so vi con s 9,4% trong nm 2004 nhm t c s pht trin n nh hn . Hng cng s m rng mng li ca mnh sang Australia v Canada. OPEC gi nguyn sn lng khai thc du. Theo k hoch , vng 2 ca cuc thi ln ny vi 6 i chi s t chc ng thi Hong Kong , TP HCM v Australia .

' i din thng mi EU khng nn lnh o WTO ' ( 12/03 ) . VN min th thc cho cng dn 4 nc Bc u ( 20/04 ) . Gi du th gii gim nh sau tuyn b ca OPEC ( 25/02 ) . TP HCM t chc ngy hi du lch nhn dp 30/4 ( 21/04 ) . Trc thc trng ny , nhng du khch n l hi m khng t phng trc ch cn cch thu cc khch sn pha ngoi , cch xa trung tm thnh ph . Khi gia nhp WTO , mi trng u t ca Trung Quc c v " mi trng cng " ( c s h tng ) ln " mi trng mm " ( c ch chnh sch ) s c ci thin hn na , Trung Quc s tr thnh mt trong nhng "im nng " thu ht u t nc ngoi ca th gii . - C th chng ta s lm g y nhanh tin gia nhp WTO? Nht khuyn co cng dn ca h Trung Quc ch n an ninh khi lm sng biu tnh bt u cch y vi tun. N lc ca Trung Quc gia nhp WTO ( 28/12 ) . " C rt nhiu thanh nin Nht hiu bit v Trung Quc " . Trung Quc m mn cuc chin thp mi ( 14/01 ) . Thm 2 cng ty u gi c phn qua sn H Ni ( 12/04 ) . Khi lng giao dch khng c bin ng ln so vi tun trc khin th trng vn th nm ngang . S nng bng ca th trng vng en trong nhng ngy qua khin gii phn tch a ra nhn nh , th trng nhin liu ngy cng nhy cm vi nhng nhn t v m nh chnh sch ca T chc cc nc xut khu du m ( OPEC ) , nhu cu s dng ca nhng ngi khng l nh M , Trung Quc v n . Du th ch cn 50 USD /thng (14/04). Hi thng 12 nm ngoi , Tng thng M George Bush , ngi tho ngi cuc chin tranh thp vi EU v mt s nc chu , cng phi d b thu sut cao sau nhiu ln WTO a ra li cnh co . Bc di t CEPT n WTO ( 04/01 ) . L trnh chun b gia nhp WTO ca Vit Nam ( 22/12 ) . Trn thc t , Chnh ph Trung Quc nhiu tin ca cho ngnh thp trong nc , ng thi khng qun cnh bo bng mi cch s ln t cc i th khc , t nht l trong vng 10 nm ti . V lu di, t nay cho n thng 3 sang nm, doanh thu ca ton Thai Airways s gim khong 2-3% do Phuket l mt trong nhng th trng chnh.


Ngay sau khi thm ha xy ra , sn bay Phuket ng ca vi gi v hot ng li sau 6 gi. Tnh n hm qua , 60% khch du lch nc ngoi hy ch khch sn v khu ngh dng Phuket .

47

Ti liu tham kho


[1]. [2]. [3]. [4]. [5]. A.Berger, A.D.Pietra, and J.D.Pietra.A maximum entropy approach to natural langauge processing. Computational Linguistics, 22(1):39-71, 1996. Adam Berger. The Improved Iterative Scaling Algorithm: A gentle Introdution. School of Computer Science, Carnegie Mellon University Andrew Borthwick. A maximum entropy approach to Named Entity Recognition. New York University, 1999 Andrew McCallum. Efficiently Inducing Features of Conditional Random Fields. Computer Science Department. University of Massachusetts. A.McCallum, D.Freitag, and F. Pereira. Maximum entropy markov models for information extraction and segmentation. In Proc. Iternational Conference on Mechine Learning, 2000, pages 591-598. Andrew McCallum, Khashayar Rohanimanesh, and Charles Sutton. Dynamic Conditional Random Fields for Jointly Labeling Multiple Sequences. Department of Computer Science, University of Massachusetts Andrew Moore. Hidden Markov Models Tutorial Slides. A.Ratnaparkhi.A maximum entropy model for part-of-speech tagging.In Proc. Emparical Methods for Natural Language Processing, 1996. Basilis Gidas. Stochastic Graphical Models and Applications, 2000. University of Minnesota.

[6].

[7]. [8]. [9].

[10]. David Barber. An Introduction to Graphical Models. [11]. Dong C.Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization.Mathematical Programming 45 (1989),pp.503-528. [12]. F.Sha and F.Pereira.Shallow parsing with conditional random fields. In Proc. Human Language Technology/ the Association for Computational Linguistics North American Chapter, 2003. [13]. GuoDong Zhou, Jian Su. Named Entity Recognition using an HMM-based Chunk Tagger. [14]. Hammersley, J., & Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.
48

[15]. Hanna Wallach. Efficient Training of Conditional Random Fields. University Of Edinburgh, 2002 [16]. Hieu Phan, Minh Nguyen, Bao Ho Japan Advanced Institute of Science and Technology,Japan , and Susumu Horiguchi- Tokosu University, Japan. Improving Discriminative Sequential Learning with Rare-but-Important Associations. SIGKDD 05 Chicago, II, USA, 2005. [17]. J.Lafferty, A.McCallum, and F.Pereira.Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001. [18]. John Lafferty, Yan Liu, Xiaojin Zhu, School of Computer Science Carnegie Mellon University, Pittsburgh, PA 15213. Kernel Conditonal Random Fields: Representation, Clique Selection and Semi-Supervised Learning. CMS-CS-04-115, February 5, 2004. [19]. Rabiner.A tutorial on hidden markov models and selected applications in speech recognition. In Proc. the IEEE, 77(2):257-286, 1989. [20]. Robert Malouf, Alfa-Informatica Rijksuniversiteit Groningen, Postbus 716 9700AS Groningen The Newtherlands. A comparison of Algorithms for maximum entropy parameter estimation. [21]. Ronald Schoenberg. Optimization with the Quasi-Newton Method, September 5, 2001. [22]. Sunita Sarawagi, William W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. [23]. Trausti Kristjansson, Aron Cullota, Paul viola, Adrew McCallum. Interactive Information Extraction with Constrained Conditionial Random Fields. [24]. Xuming He, Richard S. Zemel, Miguel . Carreira-Perpinan, Department of Computer Science, University of Toronto. Multiscale Conditional Random Fields for Image Labeling. [25]. Yasemin Altun and Thomas Hofmann, Department of Computer Science, Brown University, Providence, RI. Large Margin Methods for Label Sequence Learning.


[26]. Yasemin Altun, Alex J. Smola, Thomas Hofmann. Exponential Families for Conditional Random Fields.
[27]. Walter F. Mascarenhas. The BFGS method with exact line searches fails for non-convex objective functions. Published May 7, 2003.
[28]. Web site: http://web.mit.edu/wwmatch . Optimization.
[29]. Web site: http://www.mtm.ufsc.br/ . Shannon Entropy.
[30]. Web site: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html . Information about the sixth Message Understanding Conference.
[31]. Web site: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html . Information about the seventh Message Understanding Conference.
[32]. William W. Cohen, Andrew McCallum. Slides: Information Extraction from the World Wide Web, KDD 2003.


[1]. Andrew Borthwick. A maximum entropy approach to Named Entity Recognition. Doctor of Philosophy, New York University, September 1999.
[2]. A. McCallum, D. Freitag, F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML 2000, pages 591-598.
[3]. Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45 (1989), pp. 503-528.
[4]. GuoDong Zhou, Jian Su. Named Entity Recognition using an HMM-based Chunk Tagger. ACL, Philadelphia, July 2002, pp. 473-480.
[5]. Hanna Wallach. Efficient Training of Conditional Random Fields. Doctor of Philosophy, University of Edinburgh, 2002.
[6]. Hieu Phan, Minh Nguyen, Bao Ho, and Susumu Horiguchi. Improving Discriminative Sequential Learning with Rare-but-Important Associations. ACM SIGKDD, Chicago, IL, USA, August 21-24, 2005 (to appear).
[7]. J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. ICML, pages 282-290, 2001.
[8]. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proc. the IEEE, 77(2):257-286, 1989.
[9]. William W. Cohen, Andrew McCallum. Slides: Information Extraction from the World Wide Web, KDD 2003.
[10]. P.X. Hieu, N.L. Minh. http://www.jaist.ac.jp/~hieuxuan/flexcrfs/flexcrfs.html
[11]. Website: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html

