
Layout Analysis and Recognition of Vietnamese Official Document Images

ACKNOWLEDGEMENTS
To complete this thesis and acquire the knowledge we have today, we would first like to thank the Board of Rectors and all the lecturers of the Faculty of Information Technology, Nong Lam University, Ho Chi Minh City, for their dedicated teaching and for sharing their knowledge and valuable experience throughout our studies and research at the university.

We sincerely thank our supervisor, Mr. Nguyn c Thnh, for his dedicated guidance, attention and encouragement while we carried out this thesis.

We also express our deep gratitude to our families and friends, who encouraged us and did everything they could to help us in our studies and in life.

Although we have tried our best to complete the thesis well, some mistakes are unavoidable, and we hope for the understanding and feedback of our lecturers and friends.

We wish all our lecturers and friends good health and success.

The student team
V i Bnh
Nguyn Th T Mi
Nguyn Thy Giang

Supervisor (GVHD): MSc. Nguyn c Thnh

Students (SVTH): Bnh, Mi, Giang

TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
ABSTRACT
PROBLEM STATEMENT
THE OTSU METHOD
USING MORPHOLOGICAL TRANSFORMS TO ESTIMATE DOCUMENT SKEW
PROBLEM STATEMENT
EXISTING APPROACHES
DESCRIPTION OF THE METHOD
EXPERIMENTAL RESULTS
ROTATING BINARY DOCUMENT IMAGES
PROBLEM STATEMENT
DESCRIPTION OF THE METHOD
CONCLUSION
SUMMARY
PROBLEM STATEMENT
EXISTING BLOCK SEGMENTATION METHODS
DESCRIPTION OF THE METHOD
HORIZONTAL BLOCK SEGMENTATION
VERTICAL BLOCK SEGMENTATION
CONCLUSIONS AND REMARKS FROM THE EXPERIMENTAL RESULTS
PROBLEM STATEMENT
DESCRIPTION OF THE METHOD
USING MORPHOLOGICAL TRANSFORMS TO SMEAR TEXT LINES
TAKING THE PROJECTION PROFILE OF EACH TEXT BLOCK ALONG THE OY AXIS
DETERMINING THE TEXT LINES IN EACH BLOCK
CONCLUSION
PROBLEM STATEMENT
OTHER APPROACHES
DESCRIPTION OF THE METHOD
JOINING DIACRITICS TO CHARACTERS
JOINING CHARACTERS WITHIN A WORD
SUMMARY
PROBLEM STATEMENT
DESCRIPTION OF THE METHOD
CONCLUSIONS AND SOME EXPERIMENTAL RESULTS
BUILDING THE GROUND TRUTH AND A TOOL FOR EVALUATING THE ACCURACY OF DOCUMENT SEGMENTATION ALGORITHMS
EXPORTING THE RESULTS
EXPORTING THE RESULTS AS AN XML FILE
EXPORTING THE RESULTS AS AN MS WORD FILE
PROBLEM STATEMENT
THEORETICAL BASIS OF ARTIFICIAL NEURAL NETWORKS AND THE BACK-PROPAGATION ALGORITHM
9.2.1. MAIN COMPONENTS OF A NEURAL NETWORK
THE ARTIFICIAL NEURAL NETWORK MODEL
9.2.2. COMMONLY USED ACTIVATION FUNCTIONS
STRUCTURE OF A FEED-FORWARD NETWORK
THE BACK-PROPAGATION ALGORITHM
DESCRIPTION OF THE METHOD


LIST OF FIGURES
Figure 0.1: Baseline, ascenders and descenders
Figure 0.2: Types of connected components
Figure 1.3: The OCR system and its role in document layout analysis
Figure 1.4: Processing model of an OCR application
Figure 2.5: (a) Illustration of a real document
Figure 3.6: An example of text lines that tend to touch each other because of diacritics
Figure 3.7: The left-most-bottom and bottom-most-left points of a connected component
Figure 3.8: An example of a document image and its profiles: (a) the original document image, (b) the bottom profile, (c) the left profiles, (d) and (e) the angle distribution histograms obtained from (b) and (c)
Figure 3.9: The different skew-angle ranges used to estimate a suitable skew angle for the structuring element
Figure 3.10: Some examples of applying closing and opening with skewed structuring elements. Figures 3.10a and 3.10d are the input images. Figures 3.10b and 3.10e are the results of applying the preprocessing step, the coarse estimation and the closing to Figures 3.10a and 3.10d respectively. Figures 3.10c and 3.10f are the results of applying the opening to Figures 3.10b and 3.10e respectively.
Figure 3.11: A long connected component in the image coordinate system
Figure 3.12: Comparison of the proposed method with Chen's method after applying the coarse estimation, on 900 Latin-script images rotated by 9 arbitrary skew angles
Figure 3.13: Comparison of the proposed method with Chen's method after applying the coarse estimation, on all experimental images rotated by 9 arbitrary skew angles
Figure 3.14: Comparison of the proposed method with Chen's method after applying the coarse estimation, on the UW English I database of 900 images rotated by 9 arbitrary skew angles
Figure 3.15: Illustration of the holes that appear in an image after rotation
Figure 3.16: Illustration of dividing an image into blocks
Figure 3.17: Converting a 3x3 block into a decimal number
Figure 3.18: Illustration of a skewed original image
Figure 3.19: The image of Figure 3.18 rotated with the conventional method
Figure 3.20: The image of Figure 3.18 after rotation with the block-based rotation method
Figure 4.21: An example of an official document with the standard regions commonly used by administrative agencies in Vietnam
Figure 4.22: The deskewed original document image used for block segmentation
Figure 4.23: Horizontal projection profile of the document image in Figure 4.22
Figure 4.24: An example of a line segment affecting the block segmentation of a document
Figure 4.25: The document image after horizontal block segmentation
Figure 4.26: A text block after horizontal segmentation
Figure 4.27: Vertical projection profile of the text block in Figure 4.26
Figure 4.28: Result of the vertical segmentation of the text block in Figure 4.26
Figure 4.29: (a) Two blocks merged into one
Figure 4.30: Figure 4.22 with the blocks segmented by the proposed method
Figure 5.31: The original document image after block segmentation, ready for line segmentation
Figure 5.32: The document image of Figure 5.31 after smearing
Figure 5.33: Illustration of overlapping lines
Figure 5.34: Projection profile of a text block
Figure 5.35: (a) A line that was cut without extending its border
Figure 5.36: The document image after line segmentation
Figure 6.37: Illustration of the position of a diacritic relative to its character
Figure 6.38: Illustration of the concepts DXMERGE and DYMERGE
Figure 6.39: (a) Original image
Figure 6.40: (a) Illustration of a letter S that has lost pixels and is split into 3 connected components
Figure 6.41: (a) Illustration of a letter split into 2 connected components
Figure 6.42: A text line whose characters have been joined with their diacritics
Figure 6.43: A text line after word segmentation
Figure 7.44: Illustration of touching characters
Figure 7.45: Illustration of the projection along the x axis of the touching characters in Figures 7.44a and 7.44b
Figure 7.46: Illustration of the result of cutting the touching characters in Figures 7.44a and 7.44b
Figure 8.47: The relationships between ground truth and detection
Figure 8.48: Structure model of the file saved in MS Word format
Figure 8.49: Blocks that share the same horizontal row
Figure 9.50: Model of the brain and a biological neural network
Figure 9.51: Model of an artificial neuron
Figure 9.52: Model of a feed-forward neural network
Figure 9.53: Computational model of a neuron
Figure 9.54: Computational model of a general neural network
Figure A.55: Morphological transforms
Figure A.56: Illustrations of dilation with some basic structuring elements


LIST OF TABLES
Table 3.1: Accuracy of the coarse estimation
Table 3.2: Accuracy of Chen's method [3] after applying the coarse estimation
Table 3.3: Accuracy of the proposed method
Table 3.4: Accuracy of Chen's method after applying the coarse estimation, on the UW English I database of 900 images rotated by 9 arbitrary skew angles
Table 3.5: Accuracy of the proposed method on the UW English I database of 900 images rotated by 9 arbitrary skew angles
Table 4.6: Accuracy statistics of the block segmentation algorithm
Table 8.7: Accuracy evaluation coefficients
Table 8.8: Experimental results
Table 9.9: Comparison of the capabilities of the human brain and the computer


LIST OF ABBREVIATIONS

1. OCR (Optical Character Recognition): character recognition.
2. DAS (Document Analysis Systems): systems that analyze documents.
3. Base line: the reference line on which a text line sits (see Figure 0.1).
4. Ascenders: the upper parts of characters that rise above the height of ordinary lowercase characters (see Figure 0.1).
5. Descenders: the lower parts of characters that lie below the base line (see Figure 0.1).

Figure 0.1: Baseline, ascenders and descenders

6. TPLT (thành phần liên thông, connected component): a set of mutually neighboring pixels. There are two types: 4-connected components and 8-connected components.
7. 4-connected component: each pixel has 4 neighboring pixels, as in Figure 0.2(a).
8. 8-connected component: each pixel has 8 neighboring pixels, as in Figure 0.2(b).

Figure 0.2: Types of connected components: (a) 4-connected component, (b) 8-connected component
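The difference between the two neighborhood types can be made concrete with a small sketch. The code below is only an illustration, not code from the thesis: the function name `count_components` and the toy image are assumptions, and a plain breadth-first search is used to count the connected components of foreground pixels under each connectivity.

```python
from collections import deque

# Neighbour offsets (row, column) for the two connectivity types.
NEIGHBOURS_4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
NEIGHBOURS_8 = NEIGHBOURS_4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(image, neighbours):
    """Count connected components of foreground (1) pixels with a BFS."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # A new, unvisited foreground pixel starts a new component.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


# Hypothetical toy image: two foreground pixels that touch only at a corner.
diagonal = [
    [1, 0],
    [0, 1],
]
components_4 = count_components(diagonal, NEIGHBOURS_4)
components_8 = count_components(diagonal, NEIGHBOURS_8)
```

With 4-connectivity the two diagonal pixels form separate components, whereas with 8-connectivity they belong to the same one.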


ABSTRACT
Document layout analysis is a very important step in an OCR system. Many factors strongly affect the results of analysis and recognition: font size, typeface, line spacing, the fairly complex layout of some documents, and the presence of noise and diacritics (especially in Vietnamese documents). Recognizing a document image involves many steps: converting the input image to grayscale, binarization, skew correction, block segmentation, line segmentation, word segmentation, character segmentation and finally text recognition.

In this thesis we present binarization, skew angle estimation and block segmentation for images of Vietnamese official documents, followed by line, word and character segmentation and recognition. In addition, we build a ground truth to evaluate the accuracy of the block segmentation algorithm, and we implement export of the results as XML and MS Word files.

For binarization we apply the Otsu method. For skew estimation we propose a new method based on morphological transforms to determine the skew angle of the document, and then apply a block-based rotation to deskew the input document. Next, block segmentation is performed by analyzing the vertical and horizontal projection profiles. From the resulting blocks we segment lines by smearing the text lines and then taking the horizontal projection to find suitable cut lines that separate the lines within a block. To find the words in each line, we propose a new method that relies on the Otsu method to find a suitable threshold for separating the words on a line, which also prepares the ground for character segmentation. In character segmentation we treat a character as including its accompanying diacritic. At this step we separate touching characters into individual characters using the projection profile along the x axis, cutting at the positions where the pixel density is low. After character segmentation, we build an artificial neural network trained with back-propagation to recognize the text.

The results of layout analysis, layout reconstruction and recognition can be exported in two ways: as an XML file or as an MS Word file. In the field of image recognition and processing, exporting results to XML is a currently accepted standard. In this thesis, however, we also allow the recognition results to be exported to an MS Word file, which makes it easier for users to edit and search the content. We also build an algorithm that evaluates the accuracy of the block segmentation algorithm.

We tested the skew correction method on a database of 1080 images, consisting of 900 images in Latin-script languages and 180 images in other languages such as Chinese, Thai and Arabic, and on the UW English I image database, a standard database. The accuracy is 99% on the 900 Latin-script document images, 96.67% on the 1080-image database and 96.63% on the UW English I database. For block segmentation, we built a ground truth and tested the method on a database of 100 images collected from the incoming and outgoing official documents of the Faculty of Information Technology, Nong Lam University, Ho Chi Minh City, obtaining an accuracy of 90.54% and a correct block detection rate of 84.20%. For line, word and character segmentation and for recognition, we have not yet been able to run tests and report experimental results. The results of these steps are nevertheless fairly good and meet the needs of layout reconstruction and recognition within the thesis as a whole.


Chapter 1 INTRODUCTION

Today, because of the prevalence of personal computers, which have made desktop publishing extremely common, the number of documents stored on paper has grown considerably. Billions upon billions of pages are produced around the world every year in many forms, such as books, magazines, newsletters, newspapers, letters, forms and memos. At the same time, storing, distributing and retrieving information on paper requires a great deal of effort and can even be impossible to do by hand. This creates a need to convert existing paper documents into machine-readable forms that can be manipulated by word processing or online information retrieval systems. Computers offer great power and flexibility in automatic searching, and near-instant access to documents regardless of their physical location. They also provide security and make large-scale verification easy.

There are many ways to carry out this conversion. The simplest is to retype the content of the document on a keyboard. This is not feasible, however, because it takes a lot of time and is highly error-prone. Another solution is to build an OCR (Optical Character Recognition) system (see Figure 1.3). With this approach, documents are scanned into images and then converted to ASCII/Unicode text by the OCR system. However, implementing an OCR system that produces accurate results automatically, without any subsequent correction, is extremely difficult. Many factors affect OCR results, such as font size, skew angle, noise, diacritics and the complexity of the document layout. These factors can be handled in the preprocessing stage, but the intermediate results of preprocessing have a significant effect on the accuracy of the final results of OCR systems. One of the important preprocessing steps is page segmentation of the document image, that is, determining the physical structure of a document as a set of blocks, which may be text regions, pictures or tables. Here we are only concerned with text regions.

In this thesis we address the problem of document layout analysis. We also propose an entirely new method for determining the skew angle of an image, then segment the document into separate blocks, followed by line, word and character segmentation, and finally build a neural network for character recognition. We also build a ground truth and implement an algorithm to evaluate the accuracy of the block segmentation method. The final results of layout analysis and recognition are exported to files in two formats, XML and MS Word.

The rest of this report is organized as follows. Chapter 2 presents image binarization based on the Otsu method. Chapter 3 proposes a method that uses morphological transforms to estimate the skew angle of a document image. Chapter 3 also presents a block-based image rotation, which reduces the holes that appear in the rotated image and makes the results of the later stages more accurate. Chapter 4 presents the segmentation method for images of Vietnamese official documents. Chapter 5 presents text line segmentation based on the projection profile, which describes the distribution of black pixels over the lines of the document. Chapter 6 presents a new word segmentation method, which relies on the Otsu method to find a suitable distance for joining the characters of a word. The separation of touching characters is presented in Chapter 7. Chapter 8 describes how the ground truth and the tool for evaluating the accuracy of document segmentation algorithms are built, and also covers exporting the results as XML and MS Word files. Chapter 9 briefly introduces artificial neural networks trained with back-propagation and builds a network to recognize the document content. Finally, Chapter 10 summarizes the results achieved and outlines future work.

Figure 1.3: The OCR system and its role in document layout analysis

The processing model for analyzing and recognizing a Vietnamese document is as follows:


Figure 1.4: Processing model of an OCR application


Chapter 2 BINARIZATION OF DOCUMENT IMAGES

PROBLEM STATEMENT
In practice, the document image we receive for processing is a color image. To carry out analysis and recognition, we therefore need to convert it into a binary image in which each pixel takes one of two values, 0 or 255. First, the input color image is converted into a grayscale image with gray levels from 0 to 255, computed from the RED, GREEN and BLUE values of the input image. Then the gray level of each pixel is compared with a given threshold to decide whether that pixel becomes 0 or 255, where 0 represents black and 255 represents white. In this chapter we use the method proposed by Otsu [26] to find a suitable threshold for each input image.

THE OTSU METHOD
First, computing the gray-level statistics of the original image gives a gray-level histogram with two peaks: one peak corresponds to the text regions and the other to the background of the image. According to Otsu, the best threshold $k^*$ is the value at which the variance $\sigma_b^2$ between the two segments of the histogram is maximal. The value $\sigma_b^2$ is defined as follows:

$$\sigma_b^2 = a_1 (m_1 - m_t)^2 + a_2 (m_2 - m_t)^2, \tag{2.1}$$

where $m_t$ is the mean gray level of the whole image. Substituting $m_t = a_1 m_1 + a_2 m_2$ and $a_1 + a_2 = 1$, we obtain:

$$\sigma_b^2 = a_1 a_2 (m_1 - m_2)^2, \tag{2.2}$$
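The step from (2.1) to (2.2) can be written out explicitly. The short derivation below is a sketch that uses only the two substitutions stated in the text, namely $m_t = a_1 m_1 + a_2 m_2$ and $a_1 + a_2 = 1$.

```latex
\begin{align*}
m_1 - m_t &= m_1 - a_1 m_1 - a_2 m_2 = (1 - a_1)\, m_1 - a_2 m_2 = a_2 (m_1 - m_2), \\
m_2 - m_t &= m_2 - a_1 m_1 - a_2 m_2 = (1 - a_2)\, m_2 - a_1 m_1 = -a_1 (m_1 - m_2), \\
\sigma_b^2 &= a_1 a_2^2 (m_1 - m_2)^2 + a_2 a_1^2 (m_1 - m_2)^2 \\
           &= a_1 a_2 (a_1 + a_2)(m_1 - m_2)^2 = a_1 a_2 (m_1 - m_2)^2 .
\end{align*}
```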


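Here, $m_1$ and $m_2$ denote the mean gray levels of segment 1 and segment 2 respectively (see Figure 2.5), and $a_1$ and $a_2$ are the occurrence frequencies of the two segments. The ratio $a_j$ of the area of segment $j$ to the total area is computed as follows:

$$a_j = \sum_{i \in C_j} p_i, \qquad j = 1, 2, \tag{2.3}$$

that is, the total probability over segment $j$. Here $p_i$ is the number of occurrences of the $i$-th gray level divided by the total number of occurrences of all gray levels, so that

$$\sum_{i=0}^{I-1} p_i = 1, \tag{2.4}$$

where $I$ is the total number of gray levels. For document images, $I$ is usually 256. $C_1$ ($C_2$) denotes the set of all gray levels less than or equal to (greater than) the threshold $k$. Note that the mean value $m_j$ is computed as follows:

$$m_j = \frac{\sum_{i \in C_j} i \, p_i}{a_j}, \qquad j = 1, 2, \tag{2.5}$$

that is, the mean gray level over segment $j$.

The best threshold $k^*$ is determined by finding the peak of $\sigma_b^2$.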
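The threshold search described by equations (2.2) to (2.5) can be sketched in a few lines. This is a minimal illustration, assuming the image is given as a flat list of integer gray levels in the range 0 to 255; the function names `otsu_threshold` and `binarize` and the sample data are hypothetical and are not taken from the thesis.

```python
def otsu_threshold(gray_levels, num_levels=256):
    """Return the threshold k* that maximises a1 * a2 * (m1 - m2)^2 (Eq. 2.2)."""
    # Histogram and probabilities p_i, which sum to 1 (Eq. 2.4).
    hist = [0] * num_levels
    for g in gray_levels:
        hist[g] += 1
    total = len(gray_levels)
    p = [h / total for h in hist]
    total_mean = sum(i * p[i] for i in range(num_levels))

    best_k, best_var = 0, -1.0
    a1 = 0.0         # a_1: total probability of C1 = {i <= k} (Eq. 2.3)
    weighted1 = 0.0  # running sum of i * p_i over C1
    for k in range(num_levels - 1):
        a1 += p[k]
        weighted1 += k * p[k]
        a2 = 1.0 - a1
        if a1 <= 0.0 or a2 <= 0.0:
            continue  # one segment is empty, so its mean is undefined
        m1 = weighted1 / a1                 # Eq. 2.5 for j = 1
        m2 = (total_mean - weighted1) / a2  # Eq. 2.5 for j = 2
        var_b = a1 * a2 * (m1 - m2) ** 2
        if var_b > best_var:
            best_k, best_var = k, var_b
    return best_k


def binarize(gray_levels, k):
    """Map levels <= k to 0 (black) and levels > k to 255 (white)."""
    return [0 if g <= k else 255 for g in gray_levels]


# Hypothetical bimodal sample: dark "text" pixels and bright "background" pixels.
pixels = [10] * 50 + [12] * 50 + [200] * 60 + [205] * 40
k_star = otsu_threshold(pixels)
binary = binarize(pixels, k_star)
```

The loop updates the cumulative sums incrementally, so all candidate thresholds are evaluated in a single pass over the histogram rather than recomputing the segment means from scratch for each $k$.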


Figure 2.5: (a) Illustration of a real document (b) Gray-level histogram with the best gray threshold k* (c) Image obtained after binarization with the threshold k* found


Chapter 3 SKEW CORRECTION OF DOCUMENT IMAGES

3.1 USING MORPHOLOGICAL TRANSFORMS TO ESTIMATE DOCUMENT SKEW

3.1.1 PROBLEM STATEMENT
Recognizing and processing document images, as in most software that uses OCR (Optical Character Recognition) or in document analysis systems (DAS), involves many complex stages, one of which is estimating the skew angle of the whole document. Doing so makes the subsequent recognition steps easier. Document skew can be introduced by copying, printing, faxing or scanning. In most approaches to the OCR problem, a skewed document seriously affects the following steps, such as block segmentation, layout analysis and the OCR recognition algorithm, even when the skew angle is as small as about 5°. Many approaches address the problem to varying degrees, such as the methods proposed by Baird [2] or by Hinds and colleagues [12]. However, they all run into certain difficulties (poor accuracy, skew angles that are too large, and so on).

There are two basic criteria for evaluating skew correction of document images. The first is the limit on the estimated angle, for example a document skew limited to the range [-10°, 10°]. The second is the number of skew angles in the whole document, that is, whether the document has one skew angle or several. In this thesis we only consider documents with a single skew angle. Some skew estimation methods impose constraints on the input document image, such as font size, line spacing or the language of the document. Even the layout may be constrained: for example, some algorithms require a certain number of connected components to be text, or require very little noise.

In this thesis we propose an algorithm based on morphological transforms to estimate document skew. Our algorithm is particularly suitable for documents with diacritics, such as Vietnamese or French. In such documents, diacritics, ascenders, descenders and noise make neighboring lines tend to touch each other (see Figure 3.6). This is what causes earlier skew estimation methods to fail. By applying morphological transforms, diacritics and noise are separated from the document image, which makes it easier to determine the text lines. Removing noise and diacritics with morphological transforms may lose some information from the document. This loss is not significant, however, because the skew angle of a document is still characterized by its text lines even after the ascenders and descenders have been removed.

This chapter is organized as follows: Section 3.1.1 states the problem and Section 3.1.2 reviews existing approaches. In Section 3.1.3 we describe the proposed method in detail and apply it to documents to determine the skew angle accurately. The parameters and experimental results are presented in Section 3.1.4. Finally, Section 3.1.5 concludes the discussion of the method.

Figure 3.6: An example of text lines that tend to touch each other because of diacritics
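The claim that morphological transforms strip diacritics and noise while keeping the body of a text line can be illustrated with a toy example. The sketch below is not the method proposed in this chapter: it only applies a plain opening (erosion followed by dilation) with a horizontal line structuring element, and the function names, element length and image are assumptions chosen for illustration.

```python
def erode_h(image, length):
    """Erosion by a 1 x length horizontal line whose origin is its left end."""
    width = len(image[0])
    return [[1 if x + length <= width and all(row[x:x + length]) else 0
             for x in range(width)] for row in image]


def dilate_h(image, length):
    """Dilation by the same 1 x length horizontal line."""
    width = len(image[0])
    out = [[0] * width for _ in image]
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel:
                for i in range(x, min(x + length, width)):
                    out[y][i] = 1
    return out


def open_h(image, length):
    """Opening: erosion followed by dilation with the same element."""
    return dilate_h(erode_h(image, length), length)


# Toy image: a 2-pixel mark (standing in for a diacritic) above a
# 10-pixel run (standing in for the body of a text line).
page = [
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
]
opened = open_h(page, 5)
```

Any foreground run shorter than the structuring element disappears, while runs at least as long as the element are restored exactly, which is why the small mark is removed and the line body survives.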

3.1.2 EXISTING APPROACHES
Many approaches have been described and classified in the literature. In this section we give a very brief description, analysis and summary of most existing methods. They can be classified by their main technique: projection profile analysis [2, 14, 15, 16], grouping of connected components [19, 22, 31], the Hough transform [12, 18, 30], morphological transforms [6, 10, 21, 29], and several other variants [8, 9, 23, 24, 27].

Baird [2] uses projection profiles to estimate document skew. In this method, the profile is built from the midpoints of the bottom edges of the bounding boxes of the connected components. The objective function is the sum of squares of the profile bins. The skew angle is found by recursively narrowing the angle interval that contains it until the exact angle is determined.

Ishitani [14] analyzes the projection profile of the document image. A set of parallel lines is defined, and the profile records the transitions between black and white pixels along these lines. The angle of the lines is varied so as to maximize the variance of the projection. This method also copes with large non-text regions.

Kanai and Bagdanov [15] propose a skew estimation method for JBIG-compressed documents. In this method, the rightmost point of each black run whose lower neighbor is not black is selected. These points are extracted using the CCITT4 compression standard and then processed in compressed, bit-level form, and the white points are found with techniques similar to a simple two-state decoding algorithm.

Kavallieratou [16] combines projection profiles with the Wigner-Ville distribution (WVD). The main idea is that the profile of an upright page has high peaks whose slopes are much steeper than those in the profiles of skewed pages. In this method, the maximum intensity of the Wigner-Ville distribution of the horizontal profile of the document is used as the criterion for the estimated skew angle. The WVD of the profile expresses how often the angles occur. In this case, the number of occurrences increases with the page height, and the maximum of the WVD lies between 0° and 180°. The method can be applied to documents with skew angles between -89° and 89°, and also to handwritten documents. However, like other projection-profile algorithms, it has difficulty selecting the text regions whose profiles are to be analyzed, especially for documents with complex (non-Manhattan) layouts. In such documents the text and picture regions are not separated from each other, whereas this method is only suitable for homogeneous text regions. In other words, horizontal projection profiles are not reliable for non-text regions, for non-homogeneous regions or for regions whose text lines are not aligned.

O'Gorman [22] proposes a different method, called the docstrum, which finds the document skew by grouping neighboring connected components. The main idea is to find the k nearest neighbors of each connected component. The angles of these pairs of connected components are then accumulated in a histogram, from which the initial skew angle of the document image is determined. The angles between nearest neighbors are used because, in practice, these neighbors are usually connected components on the same line. Each connected component is represented by its centroid. The least-squares method is applied to find the angle of each text line, and finally the angle of the whole document is estimated from the angles of all these text lines. The method can be applied to document images with any skew angle. However, it is very sensitive to noise, so a preprocessing step is needed to filter the document. In addition, traversing the neighboring connected components is very time-consuming, and the method only works when the document consists purely of text.

Following the same approach, Lu and Tan [19] improve this method by restricting the size of the neighboring connected components. Neighboring connected components of suitable size that form chains of a certain length are selected, and the skew angle of the document image is determined from these chains. The advantage of this method is that it applies to any skew angle and any language. However, for noisy document images and for documents with diacritics, such as Vietnamese documents, the accuracy of the algorithm is considerably affected.
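The projection-profile criterion attributed to Baird [2] earlier in this section, the sum of squares of the profile bins, can be sketched as follows. This is a simplified illustration under stated assumptions rather than Baird's implementation: the input is a list of (x, y) points standing in for the bounding-box bottom midpoints, the search is a plain exhaustive scan of candidate angles instead of a recursive refinement, and all function names and the synthetic data are hypothetical.

```python
import math


def profile_energy(points, angle_deg):
    """Sum of squared bin counts of the profile projected along angle_deg."""
    a = math.radians(angle_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    bins = {}
    for x, y in points:
        # Position of the point across lines that run at angle_deg.
        row = round(y * cos_a - x * sin_a)
        bins[row] = bins.get(row, 0) + 1
    return sum(n * n for n in bins.values())


def estimate_skew(points, max_angle=10.0, step=0.5):
    """Try every angle in [-max_angle, max_angle]; keep the sharpest profile."""
    n_steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda ang: profile_energy(points, ang))


def synthetic_lines(angle_deg, n_lines=3, spacing=20, length=200):
    """Hypothetical test data: points on parallel lines rotated by angle_deg."""
    a = math.radians(angle_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    return [(x * cos_a - k * spacing * sin_a, x * sin_a + k * spacing * cos_a)
            for k in range(n_lines) for x in range(length)]


skew = estimate_skew(synthetic_lines(3.0))
```

When the projection direction matches the skew, all points of a line fall into a single bin, so the sum of squares peaks; at any other angle the points spread over several bins and the sum drops.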

Yuan et al. [31] also proposed an approach based on computing the skew angle between connected components. However, instead of grouping neighboring components, this method computes the angle of every unique pair of connected components and accumulates the results. Among the high peaks of the resulting histogram, the most suitable one is then chosen as the skew angle of the whole document.

Regarding the Hough transform, Hinds [12] proposed a method for input images scanned at 300 dpi and then reduced to 75 dpi to speed up processing. In this method, each black pixel is replaced by a white pixel, except the left-most pixel of a run, which is replaced by the length of that run. This is similar to image compression. The run lengths are then accumulated and the Hough transform is applied for angles between -15° and 15° with an accuracy of 0.5°. Finally, the skew angle of the document image is obtained by maximizing the accumulated value over the pairs (ρ, θ). This method can only be applied to documents with a font size smaller than 24.

Le et al. [18] also use the Hough transform to determine the skew angle. However, to increase speed and improve accuracy, a heuristic function is added to classify the connected components and discard the non-text ones. The Hough transform is then applied to the bottom points of the connected components. With the same technique, the approach proposed by Yu and Jain [30] applies a hierarchical Hough transform to the centroids of the connected components instead of their bottom points. This method adapts to many kinds of documents, such as technical documents and handwritten documents. However, its biggest disadvantage is a very long computation time, especially for documents that contain noise.

In approaches based on morphological transforms, the text lines are shaped by operations such as closing and opening. Using morphological transforms is convenient because noise is removed in the process, which suits documents with diacritics such as Vietnamese and French. In the method of Chen et al. [6], closing and opening with different structuring elements are used. After these operations the text lines become elongated blobs, and another method is applied to determine the direction of the text lines. During this process some wrong directions may appear; they are caused by noise and by non-text connected components. A further algorithm, called good lines selection, is therefore used: the lines whose direction is close to the base direction of the whole document are selected. Finally, the skew angle of the whole document is estimated from the selected directions. However, this method only applies to documents with a skew of up to 5°, with an accuracy of 0.5° (tested on the UW English Document Image Database).

Das and Chanda [10] also apply closing and opening to the text lines, using two structuring elements: a line and a small square. The opened document image is scanned vertically to record the pixels where a transition from 1 to 0 occurs, which correspond to the baselines of the text lines. Lines longer than a given threshold are selected, and the angle of the whole document is the median of the angles of these lines. The limitation of this method is that it only works well for document images with a skew angle below 15°.

Najman [21] implements the morphological operations in a different way. The main idea is to find the optimal rotation angle of the structuring elements by maximizing the area of the straight streaks produced by the morphological operations. In this approach, Run-Length Smoothing closing (RLSA) is also used to optimize the rotation angle of the structuring element. This rotation angle is also the skew angle of the whole document.

The biggest weakness of the three methods just described is that they depend on the character size, the line spacing, the spacing between neighboring characters, and so on. These algorithms therefore rely heavily on empirical parameters and cannot determine them automatically.

In another approach, Chen and Wang [8] proposed a method based on maximizing the variance of the number of transitions from black to white pixels and vice versa (transition counts). In this method, the transition-counts variance (TCV) is computed for each angle in the range from -45° to 45°. First, a region of 256 x 256 pixels in the middle of the document is selected and the total number of transitions in this region is counted. If the total exceeds a given threshold, the region is used to compute the TCV. Otherwise, the region is shifted vertically and horizontally across the document until a suitable region, that is, a region containing text, is found. The basic idea behind the TCV is that the transition-count profile shows pronounced peaks at the skew angle of the document and much weaker ones at other angles. The angle with the largest TCV therefore characterizes the skew angle of the document.

Chou [9] proposed a method based on piecewise coverings of objects such as text lines, images, forms, or tables. The document is first divided into disjoint regions, called slabs, which are bounded by parallelograms. These parallelograms are drawn by scanning the image at many different angles. The skew angle of the document is then determined by measuring the size of the regions that are not bounded by the parallelograms. The algorithm is limited to documents with a skew angle in the range [-15°, 15°]. Another weakness is that the parallelograms are created by testing many rotation angles, so the method spends a great deal of time on this repeated procedure.

Okun [23], in one variant of the skew-correction algorithm, introduced skew detection based on the shape of documents containing Latin/Cyrillic letters. The angle of neighboring connected components is estimated from the bounding shapes of these components. By processing each pair of neighboring components, the algorithm accumulates votes for each skew angle, and the skew angle of the document is estimated as the angle that collects the majority of the votes. In addition, to increase accuracy, the non-text parts of the document image are removed.


Okun et al. [24] proposed another method that offers four optional techniques for estimating the skew angle. The first technique counts the skew angles of the connected components in an angle histogram; the peak of the histogram is the skew angle sought. The second technique finds the skew angle by extracting the text lines and accumulating their angles in an angle histogram. The third option is to choose one of the two techniques above, while the fourth combines the first two to obtain the most accurate skew angle.

Shi et al. [27] proposed an algorithm that uses horizontal fuzzy run-lengths. In this method, the input document image is scanned from left to right and from right to left to create the horizontal fuzzy run-lengths of the image, which reveal the text lines. Text regions are then selected, and each text line is represented by a shape formed from connected components. A simple algorithm determines the skew angle of each text region, and the overall skew angle of the document is estimated by minimizing the distance between the shapes that represent the text regions. One limitation of this algorithm is that it uses too many user parameters. Moreover, the method requires the document image to be oriented from top to bottom.

In this thesis we also use morphological transforms to estimate the skew angle of a document image. However, unlike other methods, in particular the methods of [6, 10], our method suits all kinds of documents with any skew angle from -90° to 90°; that is, it does not depend on the skew angle. Furthermore, almost all parameters in this method are computed from the input document image. Our method is therefore parameter-independent in the sense that its parameters are computed automatically.

METHOD DESCRIPTION

The main idea of this method was presented by us in reference [29] and can be summarized as follows. First comes preprocessing, the step that filters out noise, diacritics, and large connected components. During this step, parameters such as the characteristic height and width of the characters are determined automatically from the input document. Next, the coarse estimation algorithm determines the interval in which the skew angle of the document falls. Finally, with the parameters found in the first step, we apply closing and opening to the text lines so that they form blobs, which facilitates the subsequent angle computation. A simple algorithm is then used to determine the angle of each text line, and the skew angle of the whole document is derived from the angles of the text lines.
3.1.3.1. PREPROCESSING

In this step we build the histograms of the widths and of the heights of all connected components in the document. The most frequent width and height of the connected components, called W and H, are found as the peaks of these histograms. W and H are thus the characteristic width and height of the characters in the document. When filtering diacritics and noise, connected components whose width and height are both smaller than T0 × min{W, H} are regarded as noise or diacritics. That is, for each connected component c(w, h), where w and h are its width and height, if max{w, h} ≤ T0 × min{W, H} then c is removed from the document under consideration. Large connected components are removed in a similar way: a component c(w, h) is called large when min{w, h} ≥ (1/T0) × max{W, H}, and it is also removed from the document image. We tested many different values of T0 on many document images and found that the best value of T0 is 1/4.

3.1.3.2. COARSE ESTIMATION

After preprocessing, we obtain two images called the bottom profile and the left profile. The bottom profile is created by replacing each connected component with its bottom-most-left point; similarly, the left profile is built from the left-most-bottom points of the connected components (see Figure 3.2). For angles in the range [-45°, 45°], the bottom-most-left points characterize the baselines of the document. When the skew angle is large, however, the left-most-bottom points of the connected components represent the baselines better (see Figures 3.3(a), 3.3(b), 3.3(c)).
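As an illustration of the two profile points, the following minimal sketch picks them for a single connected component. It assumes image coordinates (x to the right, y downward) and reads "bottom-most-left" as the left-most pixel on the lowest row and "left-most-bottom" as the lowest pixel in the left-most column; the function names are hypothetical and not taken from the thesis.

```python
def bottom_most_left(pixels):
    """Left-most pixel on the lowest row of the component.

    `pixels` is an iterable of (x, y) tuples with y growing downward (assumed).
    """
    return min(pixels, key=lambda p: (-p[1], p[0]))


def left_most_bottom(pixels):
    """Lowest pixel in the left-most column of the component."""
    return min(pixels, key=lambda p: (p[0], -p[1]))


component = [(0, 0), (0, 1), (1, 2), (2, 2)]
bottom_point = bottom_most_left(component)  # (1, 2)
left_point = left_most_bottom(component)    # (0, 1)
```

Replacing every component by one of these points yields a sparse point set per profile, which is what the angle statistics of the coarse estimation operate on.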

Figure 3.7: The left-most-bottom and bottom-most-left points of a connected component

Figure 3.8: An example of a document image and its profiles. In this series, (a) is the original document image, (b) is the bottom profile, (c) is the left profile, and (d) and (e) are the angle-distribution histograms obtained from (b) and (c)

In each profile (bottom or left), the angle of every pair of neighboring points is computed and accumulated in an angle histogram (see Figures 3.3(d) and 3.3(e)). The neighbors of a point p in a profile image are found by scanning all points (except p) inside a rectangle of size (2W, 2H) centred at p, where W and H are taken from the preprocessing step. The values of W and H depend on the input document image, so our method relies only on dimensionless parameters. Figure 3.3 shows an example of the angle histograms of the left profile and the bottom profile. The main purpose of the coarse estimation is to find a 20° interval that contains the true skew angle of the document. The reason for choosing 20° as the width of the estimation interval is explained in Section 3.1.3.3. In each profile we compute the black area of each interval of the histogram, and the interval with the largest area among the 9 intervals of that histogram is selected. Of the two intervals found in this way, we choose the one with the larger area; this is the interval to which the skew angle belongs. In Figure 3.3, the selected interval is the one found from the left profile (Figure 3.3(c)).

3.1.3.3. APPLYING MORPHOLOGICAL TRANSFORMS

To make the description of our proposed method easier to follow, we first briefly present the basic definitions of the morphological operations.


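The dilation, erosion, opening, and closing of a binary image $I$ by a structuring element $E$ are denoted $I \oplus E$, $I \ominus E$, $I \circ E$, and $I \bullet E$ respectively, and are defined as follows:

$$ I \oplus E = \{\, z \in \mathbb{Z}^2 \mid z = x + y \text{ for some } x \in I \text{ and } y \in E \,\} \tag{3.1} $$

$$ I \ominus E = \{\, x \in \mathbb{Z}^2 \mid x + y \in I \text{ for every } y \in E \,\} \tag{3.2} $$

$$ I \circ E = (I \ominus E) \oplus E \tag{3.3} $$

$$ I \bullet E = (I \oplus E) \ominus E \tag{3.4} $$

The k-fold dilation of the structuring element $E$ is:

$$ (k \oplus E) = \underbrace{E \oplus E \oplus \dots \oplus E}_{k \text{ times}}, \quad k \ge 1 \tag{3.5} $$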
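The definitions above translate almost literally into set operations. The sketch below is a minimal illustration, assuming that an image and a structuring element are both represented as Python sets of integer (x, y) tuples; the function names are illustrative and this is not the thesis implementation.

```python
def dilate(image, element):
    """Dilation: every sum x + y with x in the image and y in the element."""
    return {(x + ex, y + ey) for (x, y) in image for (ex, ey) in element}


def erode(image, element):
    """Erosion: every x such that x + y lies in the image for all y in the element."""
    ex0, ey0 = next(iter(element))
    # Any valid x satisfies x + (ex0, ey0) in image, so only these candidates need checking.
    candidates = {(px - ex0, py - ey0) for (px, py) in image}
    return {
        (x, y)
        for (x, y) in candidates
        if all((x + ex, y + ey) in image for (ex, ey) in element)
    }


def opening(image, element):
    """Erosion followed by dilation."""
    return dilate(erode(image, element), element)


def closing(image, element):
    """Dilation followed by erosion."""
    return erode(dilate(image, element), element)


def k_fold(element, k):
    """Dilate the element with itself so that it appears k times in total (k >= 1)."""
    result = set(element)
    for _ in range(k - 1):
        result = dilate(result, element)
    return result


line = {(0, 0), (1, 0), (2, 0)}           # a 1x3 horizontal element
words = {(0, 0), (1, 0), (4, 0), (5, 0)}  # two short runs separated by a 2-pixel gap
joined = closing(words, line)             # the gap between the runs is filled
```

In this toy example the closing joins the two runs into one, while an opening with the same element would remove both runs because neither is as long as the element.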

In this step we apply closing and opening to the text lines. Closing connects the characters within a word and the words within a line; opening removes very small connected components as well as the ascenders and descenders of the characters. As a result, the text lines become elongated blobs. To perform closing and opening as effectively as possible, however, the size and shape of the structuring elements must be determined precisely. For this we propose a simple computation, described as follows. The midpoint of the interval that contains the skew angle, found in the coarse estimation step, is used as the rotation angle of the structuring element. For example, in Figure 3.3 the skew angle falls in [30°, 50°], so the rotation angle of the structuring element is 40°. The reason for dividing the range of rotation angles into 9 parts of 20° each is that a structuring element rotated by an angle θ suits all documents whose skew angle lies in [θ - 10°, θ + 10°], that is, a span of 20°. Our experiments, based on observation and testing on a large number of document images, show that determining the rotation angle of the structuring elements is very important, because it makes the results of closing and opening correct. With a suitable structuring element, only the words on the same line are merged, while words on different lines remain separate (see Figure 3.5).
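The coarse estimate and the midpoint rule can be sketched as follows. This is a simplified illustration under several assumptions: profile points are (x, y) tuples in image coordinates, the angle of a pair is measured with `atan2` on those coordinates (so its sign convention may differ from the thesis), each unordered pair is counted once, and the "area" of an interval is taken to be the number of pairs falling into it. All names are hypothetical.

```python
import math

BIN_WIDTH = 20  # degrees; 180 / 20 gives the 9 intervals


def angle_histogram(points, W, H):
    """Count neighbour-pair orientations in nine 20-degree bins over [-90, 90)."""
    hist = [0] * (180 // BIN_WIDTH)
    for i, (x1, y1) in enumerate(points):
        for x2, y2 in points[i + 1:]:
            dx, dy = x2 - x1, y2 - y1
            if (dx == 0 and dy == 0) or abs(dx) > W or abs(dy) > H:
                continue  # same point, or outside the 2W x 2H window centred on the point
            angle = math.degrees(math.atan2(dy, dx))
            # Fold the direction of the pair into an orientation in [-90, 90).
            if angle >= 90:
                angle -= 180
            elif angle < -90:
                angle += 180
            hist[int((angle + 90) // BIN_WIDTH)] += 1
    return hist


def coarse_estimate(bottom_profile, left_profile, W, H):
    """Return ((low, high), midpoint) of the strongest 20-degree interval."""
    best_count, best_bin = -1, 0
    for profile in (bottom_profile, left_profile):
        hist = angle_histogram(profile, W, H)
        k = max(range(len(hist)), key=hist.__getitem__)
        if hist[k] > best_count:
            best_count, best_bin = hist[k], k
    low = -90 + BIN_WIDTH * best_bin
    return (low, low + BIN_WIDTH), low + BIN_WIDTH / 2


# Six points on a line rising at 40 degrees, spaced 10 pixels apart.
points = [(10 * i * math.cos(math.radians(40)), 10 * i * math.sin(math.radians(40)))
          for i in range(6)]
interval, element_angle = coarse_estimate(points, [], W=12, H=12)
```

For these synthetic points the selected interval is [30°, 50°] and the structuring element would be rotated by its midpoint, 40°, matching the example in the text.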

Figure 3.9: The different skew-angle intervals used to estimate a suitable rotation angle for the structuring element

Let $I$ be the image obtained after removing noise, diacritics, and large connected components. The image $I_{co}$ is produced as follows:

$$ I_{co} = \left( I \bullet (m \oplus E_c)^{\theta} \right) \circ (n \oplus E_o)^{\theta} \tag{3.6} $$

Here the structuring elements of size 1×3 and 2×2 are chosen for $E_c$ and $E_o$ respectively; m and n are given by max{W / 2z, H / z} and max{W / 3z, H / 2z}; z is the appropriate reduction factor of the image, computed as z = min{W / 4, H / 5}; θ is obtained from the coarse estimation algorithm; and $(m \oplus E_c)^{\theta}$ and $(n \oplus E_o)^{\theta}$ are the results of rotating the structuring elements $(m \oplus E_c)$ and $(n \oplus E_o)$ by the angle θ (Figure 3.5 illustrates the image $I_{co}$). Once again, the size and the rotation angle of the structuring elements are clearly determined automatically, based only on the input image. Thanks to this automatic computation, the algorithm we propose can estimate the skew of documents with an arbitrary skew angle.
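A small worked example of these parameter formulas may help. The sketch below assumes that "W / 2z" is to be read as W / (2z), and likewise for the other fractions; it returns the raw values and leaves open how they are rounded before being used as repetition counts, since the text does not specify this. The function name is hypothetical.

```python
def element_parameters(W, H):
    """Return (z, m, n) from the characteristic character width W and height H."""
    z = min(W / 4, H / 5)
    m = max(W / (2 * z), H / z)        # repetitions for the closing element E_c
    n = max(W / (3 * z), H / (2 * z))  # repetitions for the opening element E_o
    return z, m, n


z, m, n = element_parameters(W=16, H=20)  # z = 4.0, m = 5.0, n = 2.5
```

With W = 16 and H = 20 the reduction factor is 4, so the closing element is repeated five times and the opening element between two and three times, depending on the rounding chosen.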

Figure 3.10: Some examples of closing and opening with rotated structuring elements. Figures 3.5a and 3.5d are the input images. Figures 3.5b and 3.5e are the results of applying preprocessing, coarse estimation, and closing to Figures 3.5a and 3.5d respectively. Figures 3.5c and 3.5f are the results of applying opening to Figures 3.5b and 3.5e respectively.

3.1.3.4. FINE ESTIMATION

After closing and opening, the text lines of the image have been filled in with black and are treated as connected components. In this step we propose a simple algorithm to estimate the direction of every connected component and of the whole document. Let o be a connected component, that is, o = {(xi, yi), i = 1, ..., n}. Let pi(xi, yi) be an arbitrary point of o. We need to find the angle θ* of the connected component o (see Figure 3.6).


Figure 3.11: An elongated connected component in the image coordinate system

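Let $p'_i$ be the result of rotating $p_i$ by an angle $\theta$ about the centre $c(x_c, y_c)$ of o, that is, $p'_i = (x'_i, y'_i)$, where

$$ x'_i = (x_i - x_c)\cos\theta + (y_c - y_i)\sin\theta + x_c $$

$$ y'_i = (y_i - y_c)\cos\theta + (x_i - x_c)\sin\theta + y_c $$

Let $dy_i$ be the signed distance of $p'_i$ from the horizontal line through the centre. It can be computed as follows:

$$ dy_i = y'_i - y_c = (y_i - y_c)\cos\theta + (x_i - x_c)\sin\theta \tag{3.7} $$

Let $T(\theta)$ be the sum of the squares of $dy_i$, $i = 1, 2, \dots, n$:

$$ T(\theta) = \sum_{i=1}^{n} dy_i^2 = \sum_{i=1}^{n} \left[ (y_i - y_c)\cos\theta + (x_i - x_c)\sin\theta \right]^2 \tag{3.8} $$

where $c(x_c, y_c)$ is the centre of the connected component o. The angle $\theta^*$ of a connected component o (with respect to the x axis) is determined by:

$$ \theta^* = \arg\min_{\theta} \left[ T(\theta) \right] \tag{3.9} $$

$T(\theta)$ attains an extremum where its derivative is zero, that is, $T'(\theta) = 0$. We have:

$$ T(\theta) = \sum_{i=1}^{n} \left[ (y_i - y_c)^2 \cos^2\theta + (x_i - x_c)^2 \sin^2\theta + 2 (x_i - x_c)(y_i - y_c) \sin\theta\cos\theta \right] = A\cos^2\theta + B\sin^2\theta + 2C\sin\theta\cos\theta \tag{3.10} $$

where:

$$ A = \sum_{i=1}^{n} (y_i - y_c)^2 \tag{3.11} $$

$$ B = \sum_{i=1}^{n} (x_i - x_c)^2 \tag{3.12} $$

$$ C = \sum_{i=1}^{n} (x_i - x_c)(y_i - y_c) \tag{3.13} $$

Therefore,

$$ T'(\theta) = (B - A)\sin 2\theta + 2C\cos 2\theta \tag{3.14} $$

$$ T'(\theta) = 0 \;\Leftrightarrow\; \theta = \begin{cases} \dfrac{1}{2}\arctan\!\left[ \dfrac{2C}{A - B} \right] & \text{if } A \ne B \\[2ex] \dfrac{\pi}{4} & \text{if } A = B \end{cases} \tag{3.15} $$

Because the tangent function has period $\pi$, the equation $T'(\theta) = 0$ has two solutions:

$$ \theta_1 = \theta \quad \text{or} \quad \theta_2 = \theta + \frac{\pi}{2} \tag{3.16} $$

Therefore,

$$ \theta^* = \begin{cases} \theta_1 & \text{if } T(\theta_1) \le T(\theta_2) \\ \theta_2 & \text{otherwise} \end{cases} \tag{3.17} $$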
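The closed-form solution derived above can be checked numerically with a short sketch. It assumes that the centre c of a component is its centroid and that the component is given as a list of (x, y) points; the final normalisation of the angle into (-90°, 90°] is an addition made here for readability and is not stated in the text. The function name is hypothetical.

```python
import math


def component_angle(points):
    """Return (theta_star, T(theta_star) / n) for a list of (x, y) points."""
    n = len(points)
    xc = sum(x for x, _ in points) / n  # centre taken as the centroid (assumption)
    yc = sum(y for _, y in points) / n
    A = sum((y - yc) ** 2 for _, y in points)
    B = sum((x - xc) ** 2 for x, _ in points)
    C = sum((x - xc) * (y - yc) for x, y in points)

    def T(theta):
        s, c = math.sin(theta), math.cos(theta)
        return A * c * c + B * s * s + 2 * C * s * c

    theta1 = math.pi / 4 if A == B else 0.5 * math.atan(2 * C / (A - B))
    theta2 = theta1 + math.pi / 2
    best = theta1 if T(theta1) <= T(theta2) else theta2
    if best > math.pi / 2:  # same orientation, reported in (-90, 90] degrees
        best -= math.pi
    return best, T(best) / n


# Eleven points on a straight line that rises at 30 degrees in (x, y).
line_points = [(t * math.cos(math.radians(30)), t * math.sin(math.radians(30)))
               for t in range(-5, 6)]
theta_star, residual = component_angle(line_points)
```

With these formulas θ* is the rotation that brings the component parallel to the x axis, so the line above, which rises at 30° in (x, y), yields θ* of about -30° with a residual close to zero.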

Since the equation has two roots θ1 and θ2, substituting them into T(θ) gives two values T(θ1) and T(θ2), and we choose the root that makes T(θ) smaller. After applying this algorithm, each connected component is characterized by a pair (θ*, T(θ*)/n), where n is the number of points in the component. A connected component is considered reliable if the ratio T(θ*)/n is smaller than a predefined threshold T1. In our experiments we set T1 to 0.007. Only the reliable connected components are kept for the next processing step; the others are discarded.

The coarse estimation step yields an angle interval. Because the coarse estimate can be inaccurate, we widen this interval by a preset margin of 2° on each side, and the skew angle of the document is assumed to fall within the widened interval. In our experiments, the coarse estimate can be wrong when the skew angle of the document is close to the boundary between two adjacent intervals. We also observed that in such cases the true skew angle lies no more than 2° beyond the boundary, which is why the margin is set to 2°. Reliable connected components whose direction falls outside the widened interval are removed. The widened interval is then divided into smaller intervals, each 0.1° wide, and a histogram of the angle distribution of all remaining connected components is computed over these small intervals. Finally, the peak of this histogram is chosen as the skew angle of the whole document.

EXPERIMENTAL RESULTS

In our experiments we tested the proposed algorithm on a data set of 1080 images generated from 120 images, each rotated by 9 random angles between -90° and 90°. This gives 900 images of Latin-script documents and 180 images of documents in other languages such as Chinese, Japanese, Arabic, and Thai. The documents were scanned at resolutions from 150 to 300 dpi and have arbitrary skew angles from -90° to 90°. The accuracy of the coarse estimation is shown in Table 3.1. In this table, the accuracy of the coarse estimation is the proportion of images for which the interval containing the skew angle was identified correctly.
Table 3.1: Accuracy of the coarse estimation

            Latin-script documents (900 images)    All documents (1080 images)
Accuracy    97.00%                                 96.30%


In this section we compare the experimental results of our proposed method with those of other methods. The difference from the other morphology-based methods [6, 10, 21] is that our method can be used without any restriction on the skew angle. Moreover, none of those methods offers a general solution for computing the size of the structuring elements; each of them uses a single structuring element for every document. To compare with one of these methods, however, we need to apply the coarse estimation in order to find a suitable structuring element for the morphological transforms.

After careful study, we chose the method proposed by Chen et al. [6] as a representative of the morphology-based methods for the comparison. We chose it because it is the most characteristic example of applying morphological operations to document image recognition and processing. The method proposed by Najman [21] uses a recursive procedure to determine the skew angle, so it is only suitable for documents whose skew angle lies within a small range. In the method proposed by Das and Chanda [10], after the morphological transforms are applied, all points where the image changes from black to white are detected, and the angle of the whole document is computed from these points. However, these transitions do not provide accurate information when the skew angle is close to 90° (even though our implementation of this algorithm used suitable structuring elements). This means that using transitions is only appropriate for documents with a small skew angle (about 15°), and this limit is not raised by additionally applying the morphological operations. The original method of Chen et al. [6] also applies only to documents with a skew angle within 5°. For the comparison, we therefore improved this method by applying the coarse estimation to determine a suitable rotation angle for the structuring element, and then used the method of Chen et al. to compute the angle of the whole document. The experimental results of this improved method and of our proposed method are shown in Tables 3.2 and 3.3 respectively.
Table 3.2: Accuracy of the method of Chen [3] after applying the coarse estimation

                   Latin-script documents (900 images)    All documents (1080 images)
Accuracy           94.78%                                 89.26%
Angle deviation    0.13°                                  0.15°

Table 3.3: Accuracy of the proposed method

                   Latin-script documents (900 images)    All documents (1080 images)
Accuracy           99.00%                                 96.67%
Angle deviation    0.15°                                  0.15°

Figure 3.12: Comparison of the proposed method with the method of Chen after applying the coarse estimation, on 900 Latin-script images rotated by 9 arbitrary skew angles

Figure 3.13: Comparison of the proposed method with the method of Chen after applying the coarse estimation, on all test images rotated by 9 arbitrary skew angles

In Tables 3.2 and 3.3, the accuracy of the skew estimation is the ratio of the number of correctly estimated images to the total number of test images, with a tolerance of 0.5°. That is, if the estimated skew angle of a document differs from its true skew angle by more than 0.5°, the result is counted as wrong. The experimental results also show that the largest error of the coarse estimation is only one interval, and that the largest deviation of the proposed method in the fine estimation is 0.85° (better than the improved method of Chen [3]). Therefore, with a tolerance of 0.85°, the accuracy of our proposed method would be 100%. We also selected 900 images from the UW English I database, an internationally recognized benchmark image database, to test the method of Chen after applying the coarse estimation and the method we propose. The results are shown in Tables 3.4 and 3.5.
Table 3.4: Accuracy of the method of Chen after applying the coarse estimation on the UW English I database, 900 images rotated by 9 arbitrary skew angles

                                       UW English I database (900 images)
Accuracy of the coarse estimation      99.89%
Accuracy of the fine estimation        89.77%
Angle deviation                        0.15°

Table 3.5: Accuracy of the proposed method on the UW English I database, 900 images rotated by 9 arbitrary skew angles

                                       UW English I database (900 images)
Accuracy of the coarse estimation      99.89%
Accuracy of the fine estimation        96.63%
Angle deviation                        0.14°

Figure 3.14: Comparison of the proposed method with the method of Chen after applying the coarse estimation, on the UW English I database of 900 images rotated by 9 arbitrary skew angles

We also tested the OmniPage 12.0 software [25] on part of our test data (91 document images). The results show that OmniPage 12.0 can only detect skew angles within a few limited ranges, such as [-16°, 16°], [76°, 90°], and [-90°, -76°]. In summary, the experimental results above show that our method is parameter-independent, because almost all of its parameters are dimensionless and are computed automatically. In addition, our method does not depend on the skew angle or on the resolution of the input document image.

A ROTATION METHOD FOR BINARY DOCUMENT IMAGES

PROBLEM STATEMENT

Once the skew angle of the document has been determined, the next task is to rotate the original image by that angle. Rotating the document image is a very important step: it is the basis for analyzing and building the layout, as well as for recognizing the text later on. The accuracy of the rotation strongly affects the results of the subsequent steps. Many methods have been proposed for image rotation, for example rotation based on the affine transform, the method proposed by Cheng, the 3-pass method, the method proposed by Jiang, and the black run method. However, a common limitation of these methods is that pixels are lost during rotation because of numerical rounding, which leaves holes (pitting) in the image (see Figure 3.10).


Figure 3.15: Illustration of the pitting effect after rotation

In this thesis we implement the block-based rotation method proposed by Sung Chen, Yung Mok Baek, and In Cheol Kim [7].

METHOD DESCRIPTION

The method presented here is fast and effective, and in particular it reduces pitting. It consists of three main steps:
1. Create and store the PMPs.
2. Divide the original image into blocks.
3. Rotate the image.

3.2.2.1. CREATING AND STORING THE PMPs

The idea of this method is to divide the original image into blocks of a predefined size and to create a set of blocks that have already been rotated by the given angle. For each block in the image, the corresponding pre-rotated block is pasted in, without rotating the block again. This saves the time spent on rotation and therefore speeds up the algorithm.

There are two kinds of PMPs, of different sizes, both built for the angle by which the document must be rotated:

PMP 9x9: created by taking a 9x9 square in which all pixels are black and rotating it by the rotation angle.

PMPs 3x3: a 3x3 square whose pixels are arbitrarily black or white has 2^9 - 1 possible configurations (excluding the case where all pixels are white). Rotating these squares by the rotation angle gives the 511 corresponding 3x3 PMPs.

3.2.2.2. DIVIDING THE IMAGE INTO BLOCKS

First, the original image is divided into square blocks of size 9x9. After this division, three cases arise. Blocks that contain only white pixels do not need to be considered during rotation. For blocks that contain only black pixels, the division is complete. Blocks that contain both black and white pixels are divided further into blocks of size 3x3, in such a way that each block overlaps the previous block by one column and each block overlaps the block above it by one row.
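The three cases can be sketched as follows. This is an illustration only: it assumes the image is a list of rows of 0/1 values (1 = black) whose height and width are multiples of 9, it reads the one-column and one-row overlap as a stride of 2 between 3x3 blocks (giving 4 x 4 = 16 sub-blocks per 9x9 block), and it skips all-white 3x3 sub-blocks because no PMP exists for them. The function name is hypothetical.

```python
def split_into_blocks(image, big=9, small=3):
    """Return (full_blocks, small_blocks) as lists of (top, left) corners.

    `image` is a list of rows of 0/1 values (1 = black); its height and width
    are assumed to be multiples of `big` (padding is not handled here).
    """
    full_blocks, small_blocks = [], []
    step = small - 1  # consecutive 3x3 blocks share one column or one row
    for top in range(0, len(image), big):
        for left in range(0, len(image[0]), big):
            cells = [image[top + r][left + c] for r in range(big) for c in range(big)]
            if not any(cells):
                continue  # all white: ignored during rotation
            if all(cells):
                full_blocks.append((top, left))  # all black: kept as one 9x9 block
                continue
            for r in range(0, big - small + 1, step):
                for c in range(0, big - small + 1, step):
                    sub = [image[top + r + i][left + c + j]
                           for i in range(small) for j in range(small)]
                    if any(sub):  # all-white 3x3 blocks have no PMP
                        small_blocks.append((top + r, left + c))
    return full_blocks, small_blocks


# One black block, one white block, and one block with a single black pixel.
image = [[1] * 9 + [0] * 9 + [0] * 9 for _ in range(9)]
image[0][18] = 1
full_blocks, small_blocks = split_into_blocks(image)
```

In this example the first block is kept whole, the second is ignored, and the third contributes a single 3x3 sub-block at its top-left corner.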

Figure 3.16: Illustration of dividing the image into blocks

3.2.2.3. ROTATING THE IMAGE

Once the original image has been divided into blocks, there are two kinds of blocks: blocks of size 9x9 and blocks of size 3x3. For a 9x9 block, we rotate its centre by the rotation angle to obtain a new point A, and then paste the 9x9 PMP at the target position so that the centre of the PMP coincides with A. For a 3x3 block, similarly, we only need to rotate its centre and then find the corresponding PMP to paste in. The corresponding PMP is found as follows: the pixels of the block are encoded as a binary number (0 stands for a white pixel, 1 for a black pixel), and this number is converted to decimal. The resulting number is the position of the PMP in the buffer that stores the PMPs created in step 1.
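The lookup described above amounts to reading the nine pixels as a 9-bit number. The sketch below assumes a row-major bit order with the first pixel as the most significant bit; the order actually used in the thesis is defined by its figure and may differ. The function names and the `patterns` table are illustrative.

```python
def pattern_index(block):
    """Encode a 3x3 block of 0/1 pixels (1 = black) as an integer in 0..511."""
    index = 0
    for row in block:
        for pixel in row:
            index = (index << 1) | pixel  # row-major, first pixel most significant
    return index


def pattern_from_index(index, size=3):
    """Inverse of pattern_index: rebuild the block from its integer code."""
    bits = [(index >> shift) & 1 for shift in range(size * size - 1, -1, -1)]
    return [bits[r * size:(r + 1) * size] for r in range(size)]


# Codes 1..511 cover every 3x3 pattern except the all-white one.
patterns = {code: pattern_from_index(code) for code in range(1, 512)}
```

A buffer of pre-rotated patches indexed by this code then lets each 3x3 block be replaced by a single table lookup instead of a per-pixel rotation.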

Figure 3.17: Converting a 3x3 block to a decimal number

Because the 3x3 blocks are cut from the 9x9 blocks with overlap, some pixels coincide after rotation. The value at such pixels is obtained with the OR operation, so the pitting effect is reduced considerably.
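The pasting step with OR can be sketched as follows. The code assumes a canvas stored as a list of rows of 0/1 values, square patches with an odd side length, and a standard rotation formula in image coordinates whose direction may not match the thesis convention; generating the rotated patches themselves is not shown. All names are hypothetical.

```python
import math


def rotate_centre(row, col, origin_row, origin_col, theta):
    """Rotate a block centre about an origin by theta radians; returns (row, col)."""
    dx, dy = col - origin_col, row - origin_row
    new_col = origin_col + dx * math.cos(theta) - dy * math.sin(theta)
    new_row = origin_row + dx * math.sin(theta) + dy * math.cos(theta)
    return round(new_row), round(new_col)


def paste_or(canvas, patch, centre_row, centre_col):
    """OR a square patch (odd side length) into canvas, centred at the given pixel."""
    half = len(patch) // 2
    for i, patch_row in enumerate(patch):
        for j, value in enumerate(patch_row):
            r, c = centre_row - half + i, centre_col - half + j
            if 0 <= r < len(canvas) and 0 <= c < len(canvas[0]):
                canvas[r][c] |= value  # black wins wherever patches overlap


canvas = [[0] * 4 for _ in range(4)]
paste_or(canvas, [[1, 1, 1], [1, 0, 1], [1, 1, 1]], 1, 1)
paste_or(canvas, [[0, 0, 0], [0, 1, 0], [0, 0, 0]], 1, 1)
```

Because pixels are combined with OR, a later patch can fill a hole left by an earlier one but can never erase a black pixel, which is the property the text relies on to reduce pitting.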

Figure 3.18: An example of a skewed original image

Figure 3.19: Image 3.13 rotated with the conventional method

Figure 3.20: Image 3.13 after being rotated with the block-based rotation method

CONCLUSION

In this thesis we use the block-based rotation described above for document images whose rotation angle has been estimated beforehand. This rotation method is not only accurate but also reduces pitting, so it helps increase the accuracy of layout analysis and of character recognition in the subsequent steps. Block-based rotation is also one of the fastest image rotation methods currently available, so applying it considerably speeds up the whole process of deskewing a document image.

SUMMARY

In this chapter we introduced a new method for estimating the skew angle of a document image based on morphological operations. We proposed a coarse-to-fine estimation algorithm to find the skew angle of the document. In the coarse estimation, we compute the angles between adjacent connected components, and the angle interval of the document is determined from the statistics of these angles. In the fine estimation, we use closing and opening to fill in the gaps between characters and between words on the same text line. The text lines then take the characteristic shape of elongated blobs, and their angles are computed with the formula derived above. From these, the skew angle of the whole document is determined.

Combining morphological transforms with the coarse estimation brings several advantages when estimating the skew angle of a document. First, the method works without any detailed information about the document, such as the character size or the line spacing; these parameters are computed from each individual image. Second, the algorithm we propose works with arbitrary skew angles. Finally, the experimental results show that the proposed method can estimate the skew angle not only for documents in Latin script but also for documents in other languages such as Chinese, Japanese, and Arabic. In addition, in this thesis we use a new rotation method that reduces pitting during rotation, which makes the stages of block, line, word, and character segmentation and of recognition more accurate.


BLOCK SEGMENTATION OF DOCUMENTS

PROBLEM STATEMENT

Layout analysis is an especially important preprocessing step in OCR systems. It is the process of dividing the document image into homogeneous blocks, that is, blocks that each contain only one kind of information: text, image, table, and so on. In many cases the accuracy of layout analysis strongly affects the accuracy of the OCR system. Within the scope of this thesis, we focus on block segmentation for Vietnamese official documents. The blocks are divided according to the basic conventions of an official document as commonly used by administrative agencies in Vietnam.

In practice, many methods have been proposed for analyzing the layout of an arbitrary document image. In this thesis, however, we are only concerned with the layout analysis of administrative official documents in Vietnam. We therefore propose a simple method based on the method of G. Nagy, S. Seth, and M. Viswanathan [20], improved to better suit the administrative documents of our country. This method is presented in Section 4.3 of this chapter. The typical layout of an administrative official document in our country is given below. It usually consists of 8 main parts:

Issuing agency
National title
Date of issue
Document title
Addressee
Document body
Receiving agency
Signature and seal


Figure 4.21: An example of an official document with the standard regions commonly used by administrative agencies in Vietnam (regions shown: issuing agency; national title; document title; date; addressee; document body; receiving agency; signature and seal)

This chapter is organized as follows. Section 4.2 presents some existing block segmentation methods. Section 4.3 describes in detail the block segmentation method introduced in this first section. Section 4.4 presents some conclusions and remarks on the method, together with the experimental results.

SOME EXISTING BLOCK SEGMENTATION METHODS

There are currently two main approaches to block segmentation. Top-down algorithms [20], [1], [28] start from the whole document and find the blocks, then use the blocks to find the lines, the words, and finally the characters. The second approach is bottom-up [22], [17], [5]. In contrast to the first, it starts from the small connected components to find the characters, then the words, then the lines, and from the lines it finds the blocks.

O'Gorman [22] uses the bottom-up approach for block segmentation. Each connected component is linked to its k nearest connected components. Each such pair is characterized by the distance d and the angle between the centroids of the two components. The spacing between words and between lines is determined from a plot of the relationship between the distance and the angle (called the docstrum). The lines are then determined from these spacings, and the blocks are formed by grouping one or more lines together according to their spacing characteristics. An advantage of this method is that the input document does not need to be deskewed; however, O'Gorman does not report the accuracy of the algorithm on document images.


Another representative of the bottom-up approach is the method of S. Chen [5]. First, a closing operation is applied to the binary image. Each segment found is then classified as a word or an image depending on its size. Next, the algorithm groups words into lines using statistics on the distances between words. Lines are grouped into blocks using statistics on how similar the blocks are in height and width. The drawback of this method is that it depends too heavily on empirical parameters such as character size and font, so its accuracy is not high.

K. Kise [17] proposed a block segmentation method based on Voronoi diagrams. The Voronoi diagram yields the bounding boxes of the components of an input image whose layout is not Manhattan. The skew angle is determined by treating each pixel as a "neural" unit, merging neighboring units into larger ones, and repeating this until the text blocks are separated. The greatest advantage of this method is that it can segment a document without any empirical parameters. Its running time, however, is very long, because it starts from individual pixels and works up through characters, words and lines before it reaches blocks.

A. Antonacopoulos [1] proposed a top-down method, which can be described as follows. First, the binary document is smeared. Unlike Morphology-based smearing, the smearing here is applied in the vertical direction. After smearing, the lines of one block stick together. Blocks are separated by white space, and this white space is used to split the blocks apart. This method, however, is very slow and requires many empirical parameters.

Nguyn c Thnh [28] also proposed a top-down page segmentation method in his master's thesis. First, the input document image is shrunk until the regions stick together. An evaluation formula proposed by the author is then used to determine the attributes of each region. A region can be an image, a region that is fully segmented, or a region that needs further segmentation. The next step enlarges the document image again to carry out the segmentation.

G. Nagy [20] is one of the representative works of the second, top-down approach. In this method, G. Nagy uses X-Y projection profiles that characterize the document structure. Based on these profiles, nested blocks are split into smaller blocks. We apply this method to segment blocks in official-document images. Official-document images have their own characteristics: a fairly simple structure, few pictures, and varying distances between lines and between words. We therefore adapt the method to suit Vietnamese official documents. The block segmentation method is presented in more detail in Section 4.3 of this chapter.

DESCRIPTION OF THE METHOD

The block segmentation method we implement can be summarized as follows. In the first step, we split blocks horizontally, using several parameters determined during the skew estimation presented earlier. In the second step, we split vertically within the blocks obtained from the horizontal split. In the next step, we project horizontally once more on the blocks found in the second step. Once the blocks have been separated, blocks of unsuitable size are filtered out, giving the final result.

HORIZONTAL BLOCK SEGMENTATION

After a document image has been straightened by the deskewing step presented in Chapter 3, we scan the document in the horizontal direction. In practice, noise appears both when the document image is created and when it is rotated, and this noise affects the accuracy of block segmentation. To improve the algorithm, the input image is first noise-filtered: runs that are too small or too large to be characteristic of the distribution of characters are removed. Based on our experiments, we remove every connected component whose width is larger or smaller than the threshold T = [coefficient missing in the source] × W, or whose height is larger or smaller than the threshold T = [coefficient missing in the source] × H.

On the filtered document, we scan from top to bottom and from left to right, accumulating the number of black pixels on each pixel row. The black-pixel counts are plotted as a graph whose vertical axis is the height of the document and whose horizontal axis is the number of black pixels counted on a row. This graph is the profile that represents the distribution of the text blocks (see Figure 4.3).
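The row-wise accumulation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class name `HorizontalProjection` and the `boolean[][]` image representation (`true` meaning a black pixel) are assumptions made for the example.

```java
// Hypothetical helper: horizontal projection profile of a binary image.
final class HorizontalProjection {

    private HorizontalProjection() {}

    /**
     * Counts the black pixels on every pixel row.
     * image[y][x] == true is assumed to mean "black pixel at column x, row y".
     */
    static int[] profile(boolean[][] image) {
        int[] blackPerRow = new int[image.length];
        for (int y = 0; y < image.length; y++) {
            for (boolean black : image[y]) {
                if (black) {
                    blackPerRow[y]++;
                }
            }
        }
        return blackPerRow;
    }
}
```

Rows that belong to a text line produce large counts, while the white space between lines and blocks produces zeros or near-zeros, which is what the later cut detection relies on.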

Figure 4.22: The deskewed original document image used for block segmentation

Figure 4.23: Horizontal projection profile of the document image in Figure 4.2. (a) Initial profile. (b) Profile after removing straight-line segments and smoothing.

After the projection profile has been computed, a smoothing step is applied to connect the diacritics to the main body of each text line, which makes the cut points easier to locate accurately. In tests on many administrative official-document images, we found that a suitable smoothing threshold is 2.

Some document images contain long straight lines or runs of repeated decorative symbols. In the horizontal projection profile these produce tall but isolated peaks, usually no more than H / 2 wide. They sometimes cause regions that are actually separate to stick together, which degrades the result of block segmentation. In this project we therefore remove these peaks from the horizontal projection profile to increase the accuracy of the algorithm.

Figure 4.24: An example of a straight line affecting the block segmentation of a document

In the figure above, the correct result would be two separate blocks, but the straight line causes the two blocks to be merged into one.

After the steps above, the points at which to split blocks horizontally are determined from the final profile. Lines are considered to belong to the same block when the distance between them is smaller than 2 × H. Hence, whenever the distance between two lines is larger than 2 × H, we obtain a new cut for horizontal block segmentation.

The result of horizontal block segmentation is the set of regions of the document split in the horizontal direction. Each of these blocks may contain several blocks laid out side by side. Each horizontal block is therefore split further in the vertical direction. The following figure shows the result of horizontal block segmentation.

Figure 4.25: The document image after horizontal block segmentation.
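The 2 × H rule for horizontal cuts can be sketched as below. This is an illustrative sketch under stated assumptions, not the thesis code: the class name `RowGapSplitter`, the use of a zero count to mean "white row", and the half-open `[start, end)` ranges are choices made for the example. The source does not say what happens when the gap is exactly 2 × H; here such a gap keeps the lines in one block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split a horizontal projection profile into blocks.
final class RowGapSplitter {

    private RowGapSplitter() {}

    /**
     * Returns [start, end) row ranges. Inked rows separated by a run of
     * white rows longer than maxGap (for example 2 * H) start a new block.
     */
    static List<int[]> split(int[] profile, int maxGap) {
        List<int[]> blocks = new ArrayList<>();
        int start = -1;    // first inked row of the open block, -1 if none
        int lastInk = -1;  // last inked row seen so far
        for (int y = 0; y < profile.length; y++) {
            if (profile[y] == 0) {
                continue;  // white row: only matters through the gap length
            }
            if (start >= 0 && y - lastInk - 1 > maxGap) {
                blocks.add(new int[] {start, lastInk + 1});
                start = -1;
            }
            if (start < 0) {
                start = y;
            }
            lastInk = y;
        }
        if (start >= 0) {
            blocks.add(new int[] {start, lastInk + 1});
        }
        return blocks;
    }
}
```

A caller would pass the smoothed profile and `2 * H`, where H is the estimated character height from the skew-estimation stage.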

VERTICAL BLOCK SEGMENTATION

Figure 4.26: A text block after horizontal splitting

Each horizontal block found in the previous step is scanned in the vertical direction. For each pixel column we count the black pixels. The per-column counts are plotted as a graph called the vertical projection profile, whose Oy axis is the number of black pixels in each column and whose Ox axis spans the width of the document image.

Figure 4.27: Vertical projection profile of the text block in Figure 4.6, with the cuts chosen for splitting the block marked

From this profile we determine the points at which to split the block vertically. Words are considered to belong to the same block if the distance between them does not exceed 3 × W. Hence, if the white gap between two blocks is larger than 3 × W, they are split into two blocks in the vertical direction.

Figure 4.28: Result of vertically splitting the text block in Figure 4.6
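The vertical pass can be sketched in the same spirit. The sketch below is an assumption-laden illustration, not the authors' code: `VerticalSplitter`, the `boolean[][]` image (`true` = black) and the half-open column ranges are invented for the example, and a column is treated as white only when its count is exactly zero.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: vertical projection and cuts inside one horizontal block.
final class VerticalSplitter {

    private VerticalSplitter() {}

    /** Counts black pixels per column, restricted to rows [top, bottom). */
    static int[] columnProfile(boolean[][] image, int top, int bottom) {
        int width = image.length == 0 ? 0 : image[0].length;
        int[] counts = new int[width];
        for (int y = top; y < bottom; y++) {
            for (int x = 0; x < width; x++) {
                if (image[y][x]) {
                    counts[x]++;
                }
            }
        }
        return counts;
    }

    /** Returns [left, right) column ranges, cutting at white gaps wider than maxGap (for example 3 * W). */
    static List<int[]> cut(int[] profile, int maxGap) {
        List<int[]> ranges = new ArrayList<>();
        int left = -1;     // first inked column of the open range, -1 if none
        int lastInk = -1;  // last inked column seen so far
        for (int x = 0; x < profile.length; x++) {
            if (profile[x] == 0) {
                continue;
            }
            if (left >= 0 && x - lastInk - 1 > maxGap) {
                ranges.add(new int[] {left, lastInk + 1});
                left = -1;
            }
            if (left < 0) {
                left = x;
            }
            lastInk = x;
        }
        if (left >= 0) {
            ranges.add(new int[] {left, lastInk + 1});
        }
        return ranges;
    }
}
```

Restricting the column profile to the rows of one horizontal block is what makes this a step of an X-Y cut: each pass only sees the sub-image produced by the previous pass.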

4.3.3. SECOND HORIZONTAL BLOCK SEGMENTATION

A document does not always have exactly one block on each horizontal band, so after segmentation two or more blocks may still be merged into one (see Figure 4.9(a)). The usual remedy is to keep splitting until no further split is possible. Because the structure of an official document is fairly simple, in this project a single additional horizontal pass is enough to resolve these cases.

Figure 4.29: (a) Two blocks merged into one. (b) Result after the second horizontal split.

After the blocks have been found and split, we have a set of separate text blocks. Documents nevertheless always contain characteristic noise blocks (such as staples and ink smears), which must be removed. Based on our experiments, blocks smaller than 5H × 5W are rejected.
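The size filter can be sketched as follows. The source only says that blocks "smaller than 5H × 5W" are rejected. Reading this as "narrower than 5W and shorter than 5H at the same time" is an assumption of this sketch, as are the class name `BlockFilter` and the `{left, top, width, height}` block layout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: drop noise blocks after segmentation.
final class BlockFilter {

    private BlockFilter() {}

    /**
     * Keeps a block unless its width is below 5 * w AND its height is below 5 * h.
     * Each block is {left, top, width, height}; w and h are the estimated
     * character width and height.
     */
    static List<int[]> dropNoise(List<int[]> blocks, double w, double h) {
        List<int[]> kept = new ArrayList<>();
        for (int[] b : blocks) {
            boolean tooSmall = b[2] < 5 * w && b[3] < 5 * h;
            if (!tooSmall) {
                kept.add(b);
            }
        }
        return kept;
    }
}
```

Requiring both dimensions to be small keeps single text lines, which are short but wide. A stricter reading (either dimension too small) would discard them.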

Figure 4.30: Figure 4.2 with the blocks found by the proposed method

CONCLUSIONS AND REMARKS FROM THE EXPERIMENTAL RESULTS

In this chapter we presented a simple method for segmenting blocks in images of the administrative official documents commonly found in Vietnam. As stated at the beginning of this chapter, an official-document image is usually divided into eight parts, and the content part may itself be divided into several smaller parts. For a standard official-document image that contains all of these parts, block segmentation should therefore produce at least 8 blocks. Each block corresponds to one part of the document, while the content part may be split into several blocks.

We tested the method on many official-document images and obtained promising results. Most of the basic blocks in the documents are found. When the text blocks of a document are not clearly separated, our method merges them and treats them as one homogeneous block. Handwriting on the document also sometimes changes the layout and therefore the result of the algorithm. The results of our evaluation on 100 official-document images are given below. The terms used are explained in more detail in Chapter 8.

Table 4.6: Accuracy statistics of the block segmentation algorithm

Measure                            Value
Block segmentation accuracy        90.54%
Correct detection                  84.20%
Miss detection                     0.00%
False detection                    2.25%
Splitting detection                1.04%
Merging detection                  5.22%
Spurious detection                 7.28%

Block segmentation is the first step of document layout analysis. Its accuracy strongly affects the result of the whole layout analysis process, so block segmentation deserves proper attention.

TEXT LINE SEGMENTATION

PROBLEM STATEMENT

A layout analysis algorithm essentially divides the document image into blocks, each of which represents one text region. Each block is then divided into lines, each line into one or more words, and each word into characters. In Chapter 4 we completed the first step of layout analysis, which is to find the text blocks. In this chapter we determine the lines inside each of the text blocks found.

DESCRIPTION OF THE METHOD

Many methods have been proposed for text line segmentation. Within the scope of this project we propose a very simple method based on Morphology operations and projection profiles. The method consists of three basic steps:

1. Use Morphology operations to smear the text lines.
2. Compute the projection profile of each text block along the Oy axis.
3. Determine the text lines in each block.

USING MORPHOLOGY OPERATIONS TO SMEAR THE TEXT LINES

Line detection in our algorithm relies mainly on the black pixels and the density of their distribution. In this first step, the text lines are smeared into long, thin streaks in order to increase the number of black pixels in each line. Note that the smearing is used only to locate the cuts between lines and does not alter the original document image. It improves the accuracy of line detection, especially for the last line of a block, which often contains only one, two or very few words.

The lines are smeared in much the same way as in the skew estimation stage. The only difference is that here the smearing is applied to a document that has already been deskewed.

Figure 5.31: The original document image after block segmentation, ready for line segmentation

Figure 5.32: The document image of Figure 5.1 after smearing

After this step the document is covered by long, thin black streaks, which makes the text lines easier to detect.

COMPUTING THE PROJECTION PROFILE OF EACH TEXT BLOCK ALONG THE OY AXIS

Each text block found in the input document image can be regarded as a block with a simple, homogeneous layout. The lines in a block are separate from one another, and no line lies in the gap between two other lines (the situation illustrated in Figure 5.3 below). This avoids the case where lines stick together into a single mass.

Figure 5.33: Illustration of interleaved lines

Each text block found during block segmentation can therefore be split into separate text lines easily using the projection described here. The projection of a block whose lines have been smeared is computed in the same way as the horizontal projection used in block segmentation: we scan from top to bottom and from left to right and accumulate the number of black pixels on each pixel row. The distribution of black pixels per row is plotted in a graph whose vertical axis is the height of the text and whose horizontal axis is the number of black pixels counted on a row. The result is a chart of the distribution of black pixels over the text lines.

Figure 5.34: Projection profile of a text block

The distribution of the text lines in the block is clearly visible in the profile. The text lines are determined from this profile in the next step.

DETERMINING THE TEXT LINES IN EACH BLOCK

Determining the text lines is an indispensable step in building and analyzing the document layout. Using the profile obtained in the previous step, we determine the cuts that produce the text lines. Because noise is removed when the text is smeared with Morphology operations, the projection profile reflects only the distribution of the text lines. The smearing also removes diacritics, ascenders and descenders. To keep line segmentation accurate and to avoid losing the diacritics of Vietnamese text, we therefore extend the boundary of each line by a margin T. Based on our experiments, T = [coefficient missing in the source] × H, where H is the average character height.

Figure 5.35: (a) A line cut without boundary extension. (b) A line cut with boundary extension.

Figure 5.36: The document image after line segmentation

The figure above shows a passage of text after line segmentation. The output of line segmentation is a list of lines, which is the input to the word and character segmentation that follow in the layout analysis process.

CONCLUSION

Line segmentation is the step that follows block segmentation in document layout analysis. In this project we proposed the fairly simple line segmentation method presented above. Although simple, the method is quite effective for images of Vietnamese official documents. Tested on the 100 official-document images used for block segmentation, it achieved fairly good accuracy.

WORD SEGMENTATION

PROBLEM STATEMENT

After the document has been split into lines, we continue by splitting each line into words. This is an important step and the basis for character segmentation and recognition. In this project we use Otsu's method to find the characteristic distance between characters, that is, a suitable threshold for separating the words within a line. This chapter is organized as follows: Section 6.2 reviews other approaches to word segmentation, Section 6.3 describes the proposed method in detail, and Section 6.4 concludes.

OTHER APPROACHES

Some methods use a predefined threshold [11]. Characters are classified as belonging to the same word or to different words by comparing the distance along the x axis between neighboring characters with this threshold. This method is fairly easy to implement. Because document layouts are so varied, however, it is difficult to fix one threshold for all kinds of documents. Moreover, the distance between characters of the same word can differ from line to line, which is clearly visible when the text is justified.

Chen [5] proposed a word segmentation method based on morphology operations. The main idea is to determine the pixels (both white and black) that belong to the same word. This method, however, needs a sufficiently large database of standard document images on which to compute its statistics.

Some approaches use the average distance along the x axis between all characters in a line as the threshold that separates characters of the same word from characters of different words. The advantage is that the threshold is computed automatically and depends only on the line being processed. However, a line usually contains many more gaps between characters than gaps between words, so the average tends to be close to the distance between characters within a word. This can lead to wrong word boundaries when a line has few words and the words are long.

In this thesis we propose a word segmentation method based on Otsu's automatic thresholding method [26]. After collecting the distances between characters into an array, we use Otsu's method to find a suitable threshold for word segmentation. The method is described in detail below.

DESCRIPTION OF THE METHOD

Our word segmentation method can be summarized as follows. In the first step, each diacritic is attached to the character it belongs to, so that a character is taken to include its diacritics. In the second step, the resulting characters are sorted from left to right. We then collect the distances between characters and use Otsu's method to determine the distance between words. Finally, based on this distance, the characters are joined into complete words.

ATTACHING DIACRITICS TO CHARACTERS

In Vietnamese, a diacritic can only lie above or below a character, and it is always centered on the character or shifted to the right (see Figure 6.1). For example, the circumflex, acute, grave, hook above and tilde always lie above the character, while the dot below lies under it.

Figure 6.37: Position of diacritics relative to the character

Hence, a connected component that is covered by another connected component along the Ox axis and lies above or below it is attached to that component, provided that the coverage ratio along the Ox axis is larger than 9/10, or that the coverage ratio is larger than 1/3 and the distance between the centers is smaller than the width of the character. In other words, if 90% of the width of a BoundingBox is covered, it is certainly a diacritic of the character whose BoundingBox covers it. Otherwise the coverage ratio along the width must exceed 1/3 and the box must not be shifted to the left by more than the width of the base character.

The coverage ratio is computed as follows. Let b1 and b2 be the BoundingBoxes of any two connected components. DxMerge is the overlap of the two connected components along the Ox axis and, similarly, DyMerge is their overlap along the Oy axis.

Figure 6.38: Illustration of DxMerge and DyMerge

DxMerge and DyMerge are computed as follows:

\[ DxMerge = \max(left_1, left_2) - \min(right_1, right_2) \tag{6.1} \]

\[ DyMerge = \max(top_1, top_2) - \min(bottom_1, bottom_2) \tag{6.2} \]

where top1, left1, bottom1, right1 and top2, left2, bottom2, right2 are the top, left, bottom and right coordinates of b1 and b2 respectively. The coverage ratio r along the Ox axis is then:

\[ r = \frac{|DxMerge| + 1}{\min(w_1, w_2)} \tag{6.3} \]

where w1 and w2 are the widths of b1 and b2 respectively.
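A minimal sketch of the coverage ratio (6.3) and the attachment rule for diacritics is given below. Several points are assumptions of the sketch rather than statements from the source: the class name `BoxOverlap`, inclusive pixel coordinates with width = right - left + 1, and the decision to return 0 when the boxes do not overlap on the Ox axis (the absolute value in (6.3) would otherwise give a positive ratio for disjoint boxes).

```java
// Hypothetical helper: coverage ratio along the Ox axis and the diacritic rule.
final class BoxOverlap {

    private BoxOverlap() {}

    /**
     * r = (|DxMerge| + 1) / min(w1, w2), with
     * DxMerge = max(left1, left2) - min(right1, right2).
     * Coordinates are assumed inclusive, so a box spans right - left + 1 pixels.
     */
    static double coverageX(int left1, int right1, int left2, int right2) {
        int dxMerge = Math.max(left1, left2) - Math.min(right1, right2);
        if (dxMerge > 0) {
            return 0.0; // assumption: no horizontal overlap means no coverage
        }
        int w1 = right1 - left1 + 1;
        int w2 = right2 - left2 + 1;
        return (Math.abs(dxMerge) + 1) / (double) Math.min(w1, w2);
    }

    /** Attach when r > 9/10, or when r > 1/3 and the centers are closer than the character width. */
    static boolean shouldAttach(double r, double centerDistance, double charWidth) {
        return r > 0.9 || (r > 1.0 / 3.0 && centerDistance < charWidth);
    }
}
```

For a character spanning columns 10 to 29 and a mark spanning columns 14 to 21, DxMerge is 14 - 21 = -7, so r = (7 + 1) / 8 = 1: the mark is fully covered and is attached.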

Figure 6.39: (a) Original image. (b) BoundingBoxes of the connected components. (c) Image (a) after the diacritics have been attached.

Most documents lose pixels during binarization, which causes some characters to break into several connected components. After the diacritics have been attached, such a character is still split into several small BoundingBoxes, which must be joined to form a complete character. Connected components that cover one another along the Ox axis with a coverage ratio larger than 0.4, and that also cover one another along the Oy axis, are joined together. The coverage ratio along the Ox axis is computed with the same formula as above.

Figure 6.40: (a) The letter S with missing pixels, split into 3 connected components. (b) BoundingBoxes of the connected components. (c) BoundingBox of the letter S after being joined into one character.

JOINING CHARACTERS INTO WORDS

After the diacritics have been attached, we obtain a list of BoundingBoxes. The list is sorted from left to right by the left edge of each BoundingBox. Because pixels are lost during binarization, some marks that are normally joined to their character are separated from it, so that the character is split into two parts (the example characters are missing in the source). After sorting, we therefore examine each pair of adjacent BoundingBoxes to decide whether they come from one split character. We observe that the ratio between the area of the mark and the area of the base character is always smaller than a fixed value (missing in the source), and that the distance between them is always smaller than 1/5 of the character width. In addition, the BoundingBox on the right, which is the mark, sits higher than the BoundingBox on the left, which is the character.

Let b1 and b2 be two BoundingBoxes sorted from left to right. The area ratio r of b1 and b2 is computed as:

\[ r = \frac{area(b_2)}{area(b_1)} \tag{6.4} \]

Figure 6.41: (a) A letter split into 2 connected components. (b) BoundingBoxes of the connected components. (c) BoundingBox of the letter after being joined into one character.

After these steps we have the BoundingBoxes that represent the characters of the words. From this list, the distances between adjacent BoundingBoxes are collected into an array, and Otsu's method is applied to the array to determine a suitable distance for joining characters into a word. The Otsu-based statistic is not accurate when a few isolated characters lie very far from the others, so we check the threshold and, if it is unreliable, recompute it from the character width. A threshold is considered unreliable when it is larger than the character width, in which case the threshold used for word segmentation is set to 2/5 of the character width. The reason is that, in practice, the distance between two words is always smaller than or equal to the character width.

Once a suitable threshold has been found, the characters are examined from left to right: if the distance between two adjacent characters is smaller than or equal to the threshold, they belong to the same word, otherwise they belong to different words.
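The thresholding step can be sketched as follows. This is a generic one-dimensional Otsu threshold over integer gap widths, followed by the fallback to 2/5 of the character width described above. It is not the authors' code. The class name `WordGapThreshold`, the clamping of negative gaps to zero and the convention that gaps less than or equal to the threshold fall in the "same word" class are assumptions of the sketch.

```java
// Hypothetical helper: Otsu threshold on the gaps between adjacent characters.
final class WordGapThreshold {

    private WordGapThreshold() {}

    /** Returns the gap t that maximizes the between-class variance of {gap <= t} and {gap > t}. */
    static int otsu(int[] gaps) {
        int max = 0;
        for (int g : gaps) {
            max = Math.max(max, g);
        }
        int[] hist = new int[max + 1];
        long sumAll = 0;
        for (int g : gaps) {
            int v = Math.max(0, g); // assumption: overlapping boxes count as gap 0
            hist[v]++;
            sumAll += v;
        }
        long total = gaps.length;
        long count0 = 0;
        long sum0 = 0;
        double bestVariance = -1.0;
        int best = 0;
        for (int t = 0; t < max; t++) {
            count0 += hist[t];
            sum0 += (long) t * hist[t];
            long count1 = total - count0;
            if (count0 == 0 || count1 == 0) {
                continue; // one class is empty, no split at this t
            }
            double mean0 = (double) sum0 / count0;
            double mean1 = (double) (sumAll - sum0) / count1;
            double variance = (double) count0 * count1 * (mean0 - mean1) * (mean0 - mean1);
            if (variance > bestVariance) {
                bestVariance = variance;
                best = t;
            }
        }
        return best;
    }

    /** Falls back to 2/5 of the character width when the Otsu threshold exceeds that width. */
    static double wordThreshold(int[] gaps, double charWidth) {
        int t = otsu(gaps);
        return t > charWidth ? 0.4 * charWidth : t;
    }
}
```

Two adjacent characters whose gap is at most `wordThreshold(...)` are then placed in the same word, and a larger gap starts a new word.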

Figure 6.42: A text line whose characters have had their diacritics attached.

Figure 6.43: A text line after word segmentation.

SUMMARY

Word segmentation is an important processing step that forms the basis for character segmentation and recognition. In this project we proposed a new method that collects the distances between characters and then uses Otsu's method to determine a suitable threshold, so that words can be separated quickly and accurately. The method is new, simple to implement, and gives fairly good results. We tested it on a number of official-document images and found the results to be fairly accurate, which provides a good basis for character segmentation.

CHARACTER SEGMENTATION

PROBLEM STATEMENT

After word segmentation, the characters in each word must be determined. In practice, the connected components extracted from a word can serve as its characters. Because of the nature of Vietnamese, however, characters may carry diacritics, so the connected components that make up one character must be merged. This merging was described in the section on attaching diacritics to characters in Chapter 6.

Besides diacritics, another issue in recognizing Vietnamese text is touching characters. Touching is caused by scanning, faxing or photocopying, and results in different characters being joined into one connected component. For example, the two characters i and n (in), when touching, may be mistaken for m or iii, and the characters c and l (cl) may be mistaken for d. Touching characters reduce the accuracy of recognition methods (see Figure 7.1).

Figure 7.44: Illustration of touching characters

Some methods combine the cutting of touching characters with recognition [3]. Recognition of printed Vietnamese characters is itself a complex topic, however, so in this thesis we present a simple method for separating touching characters that are not italic. The method relies purely on the projection profile of the touching characters on the x axis, from which the cuts are determined. To simplify later recognition, we do not distinguish a character from its diacritics during character segmentation but treat them as a single unit.

Looking at the cases of touching characters, we found that most of them fall into one of two types: characters touching at the foot (Figure 7.1a) and characters touching at the head (Figure 7.1b). In this thesis we therefore propose two treatments, one for each case.

DESCRIPTION OF THE METHOD

The proposed method can be summarized as follows. First, the projection of the touching characters onto the x axis is computed and denoted h. From this projection, the regions where the pixel density is low, that is, smaller than or equal to the threshold Th, are identified. A simple test is then used to decide whether a position with low pixel density is a touching point.

Figure 7.45: Projection onto the x axis of the touching characters in Figures 7.1a and 7.1b

Let x be a position where the pixel density is low, that is:

\[ h(x) \le T_1 \tag{7.1} \]

where h(x) is the value of the projection at position x. If x is a position where characters touch at the foot, it must satisfy the following properties. Consider a rectangle of size (dx, dy) with dx = 2T1 and dy = dx. The rectangle is placed on the baseline of the touching characters with its center at position x. Position x is regarded as a foot-touching position if the ratio of black pixels to area in both the left half (dx/2, dy) and the right half (dx/2, dy) of the rectangle is larger than or equal to a predefined threshold T2 and if, in addition, there is no black run (a contiguous sequence of pixels) outside the rectangle, extending upward along the y axis into the upper part of the touching characters, whose length is smaller than or equal to a threshold T3.

For characters that touch at the head (because of a diacritic), the criterion for deciding whether x is a head-touching position is as follows. Let hmax be the largest value of the projection in the interval [x - dx, x]. Then x is a cut point if:

\[ h_{max} - h(x) \ge T_4 \quad \text{and} \quad h(x-1) \ge h(x), \; h(x+1) \ge h(x) \tag{7.2} \]

The values dx, dy, T1, T2, T3 and T4 are computed automatically from the estimated width (W) and height (H) of the characters in the same text line. The figure below illustrates the result of cutting touching characters.

Figure 7.46: Result of cutting the touching characters of Figures 7.1a and 7.1b
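The head-touching test (7.2) can be sketched as below. The relation symbols of (7.2) are damaged in the source. The sketch assumes the reading "hmax - h(x) >= T4, and x is a local minimum of h", and the class name `TouchingCut` is likewise hypothetical.

```java
// Hypothetical helper: test whether column x is a cut point for head-touching characters.
final class TouchingCut {

    private TouchingCut() {}

    /**
     * h is the projection of the touching characters on the x axis.
     * x is a cut point if the projection drops by at least t4 below its
     * maximum over [x - dx, x] and x is a local minimum of h.
     */
    static boolean isHeadCut(int[] h, int x, int dx, int t4) {
        if (x <= 0 || x >= h.length - 1) {
            return false; // both neighbors are needed for the local-minimum test
        }
        int from = Math.max(0, x - dx);
        int hMax = 0;
        for (int i = from; i <= x; i++) {
            hMax = Math.max(hMax, h[i]);
        }
        return hMax - h[x] >= t4 && h[x - 1] >= h[x] && h[x + 1] >= h[x];
    }
}
```

In a full implementation dx and T4 would be derived from the estimated character width W and height H of the line. The source does not give the exact formulas, so they are left as parameters here.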

CONCLUSION AND EXPERIMENTAL RESULTS

Character segmentation is an important stage and the basis of text recognition. For languages with diacritics such as Vietnamese, and for documents that become noisy when scanned, cutting touching characters is especially important. In this project we presented a simple but very effective method for this problem. The method was verified on a data set of 195 cases of touching characters containing 398 touching characters in total. It separated 334 characters correctly, a rate of 84.42%. This provides a good basis for accurate text recognition.

BUILDING THE GROUND TRUTH AND A TOOL FOR EVALUATING THE ACCURACY OF THE PAGE SEGMENTATION ALGORITHM

Ground truth is the reference data set used to evaluate the accuracy of algorithms such as page segmentation, line segmentation, word segmentation and character segmentation. In it, each region, line, word and character is characterized by the rectangle that surrounds it, which we call a bounding box. To evaluate the accuracy of the block segmentation algorithm described in Chapter 4, we built ground truth for block segmentation and implemented an evaluation algorithm. The ground truth was built on 100 images of incoming and outgoing official documents of the Faculty of Information Technology, Nong Lam University.

The output of the block detection algorithm is a set of rectangles surrounding the text blocks, also called bounding boxes here. To compute the accuracy of the algorithm, we compare the bounding boxes found by the algorithm with the true bounding boxes of the document [27]. Let g = {G1, G2, ..., GN} be the set of N true bounding boxes and d = {D1, D2, ..., DM} be the set of M bounding boxes found by the algorithm.

Before evaluating the algorithm, we introduce the states that describe the relation between the two sets of bounding boxes g and d. The relations are:

Mis-detection (1-0 relation): a bounding box is present in g but absent from d. The algorithm missed this bounding box.

False-detection (0-1 relation): a bounding box is absent from g but present in d. The algorithm produced a superfluous bounding box.

Correct-detection (1-1 relation): both g and d contain this bounding box. The algorithm detected it correctly.

Splitting-detection (1-m relation): g contains one bounding box, but in d that box is split into several bounding boxes.

Merging-detection (m-1 relation): g contains several bounding boxes, but in d they are merged into one.

Spurious-detection (m-m relation): covers all remaining relations.

The similarity between two bounding boxes A and B is denoted s(A, B) and is defined as:

\[ s(A, B) = \frac{Area(A \cap B)}{Area(A)} \tag{8.1} \]

where:

Area(A) is the area of bounding box A.
Area(A ∩ B) is the area of the overlap between A and B.

The similarity therefore expresses how much of bounding box A is covered by bounding box B. Using this similarity measure, we define two mappings, g(·) on the elements of g and d(·) on the elements of d, as follows:

\[ g(G_i) = \{ D_j \in d \mid G_i = \arg\max_{X \in g} s(D_j, X) \} \tag{8.2} \]

\[ d(D_j) = \{ G_i \in g \mid D_j = \arg\max_{X \in d} s(G_i, X) \} \tag{8.3} \]

That is, g(Gi) is the set of boxes Dj in d for which Gi is the box in g with the highest coverage ratio (similarity), and d(Dj) is the set of boxes Gi in g for which Dj is the box in d with the highest similarity.
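The similarity (8.1) and the arg max used in (8.2) and (8.3) can be sketched as follows. This is only an illustration: the class name `BoxSimilarity`, the `{left, top, right, bottom}` layout with exclusive right and bottom edges, and the choice to return -1 when no candidate overlaps are assumptions of the sketch.

```java
// Hypothetical helper: similarity s(A, B) = Area(A ∩ B) / Area(A) and best match.
final class BoxSimilarity {

    private BoxSimilarity() {}

    /** Area of a box {left, top, right, bottom}; right and bottom are exclusive. */
    static long area(int[] r) {
        return (long) Math.max(0, r[2] - r[0]) * Math.max(0, r[3] - r[1]);
    }

    /** Fraction of box a that is covered by box b. Note that s(a, b) != s(b, a) in general. */
    static double similarity(int[] a, int[] b) {
        int[] inter = {
            Math.max(a[0], b[0]), Math.max(a[1], b[1]),
            Math.min(a[2], b[2]), Math.min(a[3], b[3])
        };
        long areaA = area(a);
        return areaA == 0 ? 0.0 : (double) area(inter) / areaA;
    }

    /** Index of the candidate X that maximizes s(box, X), or -1 if nothing overlaps. */
    static int bestMatch(int[] box, int[][] candidates) {
        int best = -1;
        double bestScore = 0.0;
        for (int i = 0; i < candidates.length; i++) {
            double score = similarity(box, candidates[i]);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }
}
```

Under these assumptions, g(Gi) collects every detected box whose `bestMatch` among the ground-truth boxes is Gi, and d(Dj) collects every ground-truth box whose `bestMatch` among the detected boxes is Dj.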
Using the two mappings g(·) and d(·), the relation between the elements of g and d can be determined. The relations are defined as follows:

1. If there exists a Gi such that s(Gi, Dj) = 0 for j = 1, 2, ..., M, the relation is called a mis-detection.
2. If there exists a Dj such that s(Dj, Gi) = 0 for i = 1, 2, ..., N, the relation is called a false-detection.
3. A correct-detection holds between Gi and Dj if and only if g(Gi) = {Dj} and d(Dj) = {Gi}.
4. A splitting-detection holds between Gi and {Dj1, Dj2, ..., Djm} if and only if:
   g(Gi) = {Dj1, Dj2, ..., Djm};
   there exists a D0 ∈ g(Gi) with d(D0) = {Gi}, and d(D) = ∅ for every D ∈ g(Gi) with D ≠ D0;
   for every D ∉ g(Gi), Gi ∉ d(D).
5. A merging-detection holds between {Gi1, Gi2, ..., Gim} and Dj if and only if:
   d(Dj) = {Gi1, Gi2, ..., Gim};
   there exists a G0 ∈ d(Dj) with g(G0) = {Dj}, and g(G) = ∅ for every G ∈ d(Dj) with G ≠ G0;
   for every G ∉ d(Dj), Dj ∉ g(G).
6. All remaining cases are called spurious-detection.

The figure below gives a geometric representation of the relations described above. Circles denote the true blocks and squares denote the blocks found by the algorithm. An arrow from a true block to a detected block represents the mapping d(·). Conversely, an arrow from a detected block to a true block represents the mapping g(·).

Figure 8.47: The relations between ground truth and detection

Once the relations between g and d have been determined, we count the miss, false, correct, splitting, merging and spurious relations. Let N10, N01 and N11 be the numbers of miss, false and correct relations. Let \(N^g_{1m}\), \(N^g_{m1}\) and \(N^g_{mm}\) be the numbers of blocks in g that have a 1-m, m-1 and m-m relation with blocks in d. Similarly, \(N^d_{1m}\), \(N^d_{m1}\) and \(N^d_{mm}\) are the numbers of blocks in d that have a 1-m, m-1 and m-m relation with blocks in g. We have:

\[ N = N_{10} + N_{11} + N^g_{1m} + N^g_{m1} + N^g_{mm} \tag{8.4} \]

\[ M = N_{01} + N_{11} + N^d_{1m} + N^d_{m1} + N^d_{mm} \tag{8.5} \]

\[ N^g_{1m} \le N^d_{1m} \tag{8.6} \]

\[ N^g_{m1} \ge N^d_{m1} \tag{8.7} \]

In this project, block segmentation is performed on official documents, whose text blocks are separate from one another. A splitting in g is therefore a merging in d and, similarly, a merging in g is a splitting in d. We have:

\[ N^g_{m1} = N^d_{1m} \tag{8.8} \]

\[ N^g_{1m} = N^d_{m1} \tag{8.9} \]

The accuracy of the block segmentation algorithm is denoted k and is measured by:

\[ k = \min(k_1, k_2) \tag{8.10} \]

where:

\[ k_1 = \left( w_{10} N_{10} + w_{11} N_{11} + w_{1m} N^g_{1m} + w_{m1} N^g_{m1} + w_{mm} N^g_{mm} \right) / N \tag{8.11} \]

\[ k_2 = \left( w_{01} N_{01} + w_{11} N_{11} + w_{1m} N^d_{1m} + w_{m1} N^d_{m1} + w_{mm} N^d_{mm} \right) / M \tag{8.12} \]

and the weights with subscripts 10, 01, 11, 1m, m1 and mm (written w here) are the coefficients of the miss, false, correct, splitting, merging and spurious relations. Whichever k is larger is chosen as the accuracy of the block segmentation algorithm. Based on our experiments, we chose the coefficients given in the following table:
Table 8.7: Coefficients used to evaluate accuracy

Relation      10 (miss)   01 (false)   11 (correct)   1m (splitting)   m1 (merging)   mm (spurious)
Coefficient   0.0         0.0          1.0            0.5              0.5            0.0

After building the ground truth and evaluating the accuracy on 100 images of Vietnamese official documents, taken from the incoming and outgoing documents of the Faculty of Information Technology, Nong Lam University, Ho Chi Minh City, we obtained the following results:

Table 8.8: Experimental results

Measure                            Value
Block segmentation accuracy        90.54%
Correct detection                  84.20%
Miss detection                     0.00%
False detection                    2.25%
Splitting detection                1.04%
Merging detection                  5.22%
Spurious detection                 7.28%
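The accuracy measure of (8.10) to (8.12) can be sketched as follows. The sketch follows the formula k = min(k1, k2). The class name `SegmentationAccuracy`, the array layout and the example counts in the usage note are hypothetical and are not taken from the experiments.

```java
// Hypothetical helper: weighted accuracy of block segmentation.
final class SegmentationAccuracy {

    private SegmentationAccuracy() {}

    /**
     * Weighted share of relation counts on one side (g or d).
     * Both arrays are ordered {miss or false, correct, 1-m, m-1, m-m};
     * the sum of the counts plays the role of N (or M).
     */
    static double weighted(double[] weights, int[] counts) {
        double sum = 0.0;
        int total = 0;
        for (int i = 0; i < counts.length; i++) {
            sum += weights[i] * counts[i];
            total += counts[i];
        }
        return total == 0 ? 0.0 : sum / total;
    }

    /** k = min(k1, k2), with k1 computed on the g side and k2 on the d side. */
    static double accuracy(double[] weightsG, int[] countsG, double[] weightsD, int[] countsD) {
        return Math.min(weighted(weightsG, countsG), weighted(weightsD, countsD));
    }
}
```

With weights {0, 1, 0.5, 0.5, 0} on both sides, made-up counts {1, 6, 1, 2, 0} for g give k1 = 7.5 / 10 = 0.75, and made-up counts {1, 6, 3, 1, 0} for d give k2 = 8 / 11, so k = 8 / 11.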

EXPORTING THE RESULTS

The algorithm that evaluates the accuracy of page segmentation requires ground truth written in XML format as its input. XML is also becoming a standard that is widely used in most OCR software. We therefore built a tool that exports the results as XML. In addition, to meet the needs of use, storage and search, we built a tool that exports the results as a Word file, which is more convenient for users.

EXPORTING THE RESULTS AS AN XML FILE

In image analysis and recognition problems the results are usually exported as an XML file, because an XML file can store not only the content but also other properties of the document. Exporting the results to XML is also the basis for evaluating the accuracy of the algorithm against the ground truth. A result XML file has the following structure:

<?xml version="1.0"?>
<page skew="">
  <point name="centerImage" x="" y=""/>
  <region>
    <point name="topleft" x="" y=""/>
    <point name="topright" x="" y=""/>
    <point name="bottomleft" x="" y=""/>
    <point name="bottomright" x="" y=""/>
    <line>
      <point name="topleft" x="" y=""/>
      <point name="topright" x="" y=""/>
      <point name="bottomleft" x="" y=""/>
      <point name="bottomright" x="" y=""/>
      <word>
        <point name="topleft" x="" y=""/>
        <point name="topright" x="" y=""/>
        <point name="bottomleft" x="" y=""/>
        <point name="bottomright" x="" y=""/>
        <character>
          <point name="topleft" x="" y=""/>
          <point name="topright" x="" y=""/>
          <point name="bottomleft" x="" y=""/>
          <point name="bottomright" x="" y=""/>
        </character>
      </word>
    </line>
  </region>
</page>

Page is the object that describes the whole document. Its skew attribute is the skew angle of the document in degrees. A document has many regions (region), each region has many lines (line), each line has many words (word) and each word has many characters (character). Every region, line, word and character has four points that determine its position: topleft, topright, bottomleft and bottomright. Each point has x and y coordinates in screen coordinates, which give the position on the document image before deskewing. Besides the four position points, a character also contains the recognized character as its content.

To implement this we use the following packages:

import org.w3c.dom.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;

First, to create an XML file, we create a Document object:

DocumentBuilderFactory fac = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = fac.newDocumentBuilder();
Document doc = parser.newDocument();

Next, we create the Elements page, region, line, word and character in turn with the createElement(String tagName) method of Document.

To add an attribute to an Element we use the method setAttribute(String name, String value). To add child elements to a parent Element we use the method appendChild(Node newChild). To save the XML file we use a Transformer object:

TransformerFactory transFac = TransformerFactory.newInstance();
Transformer tran = transFac.newTransformer();
tran.setOutputProperty(OutputKeys.METHOD, "xml");

The Transformer's transform() method takes a Source and a Result. Here the Source is a DOMSource that wraps the XML document to be saved, and the Result is a StreamResult that wraps the output File, so that the document is written to a file:

// File is java.io.File
Source src = new DOMSource(doc);
Result dest = new StreamResult(new File(destFileName));
tran.transform(src, dest);
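Putting these calls together, a minimal end-to-end sketch might look like the following. It builds a page with a single region and serializes it to a string instead of a file so that the result is easy to inspect. The class name `LayoutXmlSketch`, the helper `point` and the choice of one rectangular region are assumptions made for the example; only the element and attribute names come from the structure described in the text.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: build <page><region>...</region></page> and serialize it.
final class LayoutXmlSketch {

    private LayoutXmlSketch() {}

    /** Creates <point name="..." x="..." y="..."/>. */
    static Element point(Document doc, String name, int x, int y) {
        Element p = doc.createElement("point");
        p.setAttribute("name", name);
        p.setAttribute("x", Integer.toString(x));
        p.setAttribute("y", Integer.toString(y));
        return p;
    }

    /** Returns the XML text of a page with one rectangular region. */
    static String buildPage(double skew, int left, int top, int right, int bottom) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element page = doc.createElement("page");
            page.setAttribute("skew", Double.toString(skew));
            doc.appendChild(page);

            Element region = doc.createElement("region");
            region.appendChild(point(doc, "topleft", left, top));
            region.appendChild(point(doc, "topright", right, top));
            region.appendChild(point(doc, "bottomleft", left, bottom));
            region.appendChild(point(doc, "bottomright", right, bottom));
            page.appendChild(region);

            Transformer tran = TransformerFactory.newInstance().newTransformer();
            tran.setOutputProperty(OutputKeys.METHOD, "xml");
            tran.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            tran.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped so that callers of this sketch need no checked exceptions.
            throw new IllegalStateException("could not build layout XML", e);
        }
    }
}
```

Lines, words and characters would be nested inside the region in the same way, each with its own four point elements.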

In the transform call, doc is the object that holds the structure of the XML file and destFileName is the name of the result file.

XML is a fairly common way of storing data, and in an OCR application in particular it is a very effective export format. Storing the results in an XML file makes it convenient to keep the properties of the document that are produced during processing, and it is also the basis for using the ground truth to evaluate the accuracy of algorithms such as block, line, word and character segmentation. However, to give an intuitive view of the recognition result as a real document, and to make searching and editing easy, we also present a simple method for exporting the results to an MS Word file. This method is described in the next section.

EXPORTING THE RESULTS AS AN MS WORD FILE

Although exporting the result of document image layout analysis to an XML file follows a standard that is widely accepted today, an XML file is hard for a user to read. To make things easier for users and to give them an intuitive view of the result of layout analysis and character recognition, we export the result to files that can be read easily with Microsoft Word. The method we use is as follows. As presented above, the structure of a document image is analyzed into the following parts:

Figure 8.48: Structure model of the file saved in MS Word format

Based on these results, the export to a Word file is straightforward. First, we examine the blocks and group those that lie on the same horizontal band. A block is treated as a paragraph if no other block shares its horizontal band. Blocks on the same horizontal band are placed in the columns of a table. A line in the exported document corresponds to a line found by line segmentation, a word corresponds to a word found by word segmentation and, similarly, each character is a character produced by character segmentation.

Figure 8.49: Blocks that share the same horizontal row

The first task when exporting to a Word file is to create a document instance and open it so that data can be written to it.

    // Create the document, attach an RTF writer to the output file, and open it.
    Document document = new Document();
    RtfWriter2.getInstance(document, new FileOutputStream(fileName));
    document.open();

Once this document has been created and opened for writing, we generate the output data. The input to the export process is the set of blocks, so we first determine which blocks lie on each horizontal row. A row that contains only one block is treated as a paragraph in the output; its data are the characters recognised in that block. For a row that contains more than one block we create a table whose number of columns equals the number of blocks in the row, and the data of each block are placed in the corresponding column. Finally, once all data have been written, we close the document to release memory. Microsoft Word is currently the most widely used word processor, so the result of our layout analysis is presented as a Word file. This gives users an overall view of the analysis and recognition stages described above, and it also makes it easy for them to evaluate, correct, store, and search the results.


APPLYING ARTIFICIAL NEURAL NETWORKS TO THE RECOGNITION OF PRINTED VIETNAMESE CHARACTERS

PROBLEM STATEMENT

Today no one can deny the extremely important role of computers in scientific and technical research as well as in everyday life. Computers have achieved remarkable things and solved problems that once seemed intractable. More and more people ask whether computers have already become smarter than humans. We will not answer that question. Instead, we point out the main differences between the way a computer works and the way the human brain works.

A computer, however powerful, must follow a precise program planned in advance by experts. The more complex the problem, the more elaborate the programming. Humans, in contrast, work by learning and practising. While working, a person can make associations, connect one fact with another and, most importantly, create. Because of this associative ability, humans can easily do many things that take a great deal of effort to program on a computer, such as recognition or solving crossword puzzles. A child can learn to recognise and classify the objects around them and to tell what is food and what is a toy. An ordinary person can guess a few words in a crossword. It is very hard to teach a computer to do these things. Try designing a computer that can!

Scientists have long recognised these advantages of the human brain and have tried to imitate it in order to build computers that can learn, recognise, and classify. Artificial neural networks (ANNs) arose from these efforts. ANNs are a broad research field that has developed strongly only in about the last 15 years. Despite many encouraging results, ANNs are still far from matching the human brain.

THEORETICAL BASIS OF ARTIFICIAL NEURAL NETWORKS AND THE BACKPROPAGATION ALGORITHM

ANNs were first introduced in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts, but the technology of the time did not allow them to take the idea very far. In recent years ANN simulation has emerged and developed. Applied research has been carried out in electrical engineering, electronics, manufacturing, medicine, the military, economics and, most recently, construction project management. In Vietnam, research on applying ANNs to construction management began only a few years ago and still needs to be developed.

An artificial neural network is an information-processing model derived from the study of biological nervous systems, in particular the human one, and it processes information in a way similar to the brain. It consists of a large number of highly interconnected processing elements that work together to solve a specific problem. Like people, ANNs learn from experience, store that experience, and use it in appropriate situations. The following table compares the capabilities of the human brain and a computer.

Table 9.9: Comparison of the capabilities of the human brain and a computer

                        Computer             Human brain
  Computational units   10^5 logic gates     10^11 neurons
  Memory                                     10^11 neurons, 10^14 synapses
  Processing time       10^-8 s              10^-3 s
  Throughput            10^9 bit/s           10^14 synapses/s

9.2.1. MAIN COMPONENTS OF A NEURAL NETWORK

Figure 9.50: Model of the brain and a biological neural network

A biological neuron typically has the following main components. The soma is the body of the neuron. The dendrites are long, thin fibres attached to the soma; they carry data (in the form of electrical impulses) to the soma for processing. Inside the soma the data are combined. To a first approximation this combination can be regarded as a sum of all the data the neuron receives. A second kind of signal-carrying fibre attached to the soma is the axon. Unlike dendrites, axons can emit electrical impulses; they carry signals from the neuron to other places. The axon fires an impulse only when the potential in the soma exceeds a certain threshold; otherwise it stays at rest.

The axon connects to the dendrites of other neurons through special junctions called synapses. When the potential at a synapse rises because of impulses emitted by the axon, the synapse releases chemicals (neurotransmitters) that open "gates" on the dendrites and let ions pass through. This flow of ions changes the potential on the dendrites and creates the data impulses that propagate to other neurons.

The operation of a neuron can be summarised as follows: the neuron sums all the input potentials it receives and fires an impulse if the sum exceeds a certain threshold. Neurons are connected to each other at synapses. A synapse is called strong when it lets signals pass easily to other neurons; a weak synapse, in contrast, passes signals with difficulty.

Synapses play a very important role in learning. When we learn, the activity of the synapses is reinforced, which creates many strong connections between neurons. One could say that the better a person learns, the more synapses they have and the stronger those synapses are; in other words, the connections between neurons become more numerous and more responsive. Keep this principle in mind, because we will use it in the learning process of ANNs.


ARTIFICIAL NEURAL NETWORK MODEL

Figure 9.51: Model of an artificial neuron

This neuron works as follows. Suppose there are N inputs; the neuron then has N weights, one for each input connection. The neuron computes the weighted sum of all its inputs: it multiplies the first input by the weight on the first input connection, the second input by the weight on the second connection, and so on, and then adds up all the products. The larger the weight on a connection, the stronger the signal that passes through it, so a weight can be seen as the counterpart of a synapse in a biological neuron. The output of this neuron can be written as:

$$ out = g(in) = g\Big(\sum_{j} w_j a_j\Big) \tag{9.1} $$
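As a small illustration of formula (9.1), the sketch below computes the weighted sum and applies an activation function g passed in by the caller. The class and method names and the example numbers are assumptions made for illustration only.

```java
import java.util.function.DoubleUnaryOperator;

final class NeuronSketch {

    /** in = sum over j of w[j] * a[j]. */
    static double weightedSum(double[] w, double[] a) {
        if (w.length != a.length) {
            throw new IllegalArgumentException("one weight per input is required");
        }
        double in = 0.0;
        for (int j = 0; j < w.length; j++) {
            in += w[j] * a[j];
        }
        return in;
    }

    /** out = g(in), with the activation function g supplied by the caller. */
    static double output(double[] w, double[] a, DoubleUnaryOperator g) {
        return g.applyAsDouble(weightedSum(w, a));
    }

    public static void main(String[] args) {
        double[] w = {0.5, -1.0, 2.0};
        double[] a = {1.0, 1.0, 1.0};
        // Example activation: fire (1.0) when the sum is non-negative.
        System.out.println(output(w, a, x -> x >= 0.0 ? 1.0 : 0.0));
    }
}
```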

9.2.2. COMMONLY USED ACTIVATION FUNCTIONS

Step function (threshold function):

$$ step_{\theta}(x) = \begin{cases} 1 & \text{if } x \ge \theta \\ 0 & \text{if } x < \theta \end{cases} \tag{9.2} $$

Sign function:

$$ sign_{\theta}(x) = \begin{cases} +1 & \text{if } x \ge \theta \\ -1 & \text{if } x < \theta \end{cases} \tag{9.3} $$

Sigmoid function:

$$ sigmoid(x) = \frac{1}{1 + e^{-(x + \theta)}} \tag{9.4} $$
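The three activation functions can be written down directly. In this sketch `theta` stands for the threshold (or offset) parameter of each function; that reading of the parameter, the extra derivative helper (which is convenient later for backpropagation), and the class name are assumptions.

```java
final class ActivationSketch {

    /** Step (threshold) function: 1 when x reaches the threshold, else 0. */
    static double step(double x, double theta) {
        return x >= theta ? 1.0 : 0.0;
    }

    /** Sign function: +1 when x reaches the threshold, else -1. */
    static double sign(double x, double theta) {
        return x >= theta ? 1.0 : -1.0;
    }

    /** Sigmoid with offset theta: 1 / (1 + e^-(x + theta)). */
    static double sigmoid(double x, double theta) {
        return 1.0 / (1.0 + Math.exp(-(x + theta)));
    }

    /** Derivative of the sigmoid with respect to x: s * (1 - s). */
    static double sigmoidDerivative(double x, double theta) {
        double s = sigmoid(x, theta);
        return s * (1.0 - s);
    }
}
```

Only the sigmoid is differentiable everywhere, which is why it is the usual choice when the weights are trained by gradient descent.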

STRUCTURE OF A FEED-FORWARD NETWORK

In 1986 Rumelhart and McClelland extended the Perceptron to the multilayer Perceptron (MultiLayer Perceptron, MLP), also known as the Feed-Forward network. A Feed-Forward network consists of one or more layers of neurons in which signals travel in one direction only, from the input through the layers to the output. Every Feed-Forward network must have at least an input layer (note that the input layer is not counted as a layer of the network) and an output layer, and it may have several hidden layers. Each neuron in one layer is connected to all neurons in the next layer. A model of a Feed-Forward network is shown below:

Figure 9.52: Model of a Feed-Forward neural network
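A forward pass through such a fully connected network can be sketched as below. The weight layout `layers[l][j][i]` (weight from input i to neuron j of layer l), the choice of the sigmoid as activation function, the absence of bias terms, and the class name are all assumptions of this sketch.

```java
final class FeedForwardSketch {

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** Output of one layer; weights[j][i] links input i to neuron j. */
    static double[] layerOutput(double[][] weights, double[] input) {
        double[] out = new double[weights.length];
        for (int j = 0; j < weights.length; j++) {
            double sum = 0.0;
            for (int i = 0; i < input.length; i++) {
                sum += weights[j][i] * input[i];
            }
            out[j] = sigmoid(sum);
        }
        return out;
    }

    /** Propagates the input through every layer in order, input to output. */
    static double[] forward(double[][][] layers, double[] input) {
        double[] signal = input;
        for (double[][] weights : layers) {
            signal = layerOutput(weights, signal);
        }
        return signal;
    }

    public static void main(String[] args) {
        // 3 inputs, one hidden layer with 2 neurons, 1 output neuron.
        double[][][] layers = {
            {{0.2, -0.1, 0.4}, {0.7, 0.3, -0.5}},
            {{0.6, -0.3}}
        };
        System.out.println(forward(layers, new double[]{1.0, 0.0, 1.0})[0]);
    }
}
```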


Each neuron receives signals from the neurons of the layer immediately before it and scales each signal by the corresponding weight. The net input of the neuron is the sum of these weighted values. This sum is then passed through a limiting function, usually a threshold function, to determine the output value, which is sent to all neurons in the next layer. To use the network to solve a problem, we therefore assign values to the neurons of the first layer (the input layer); these values are then propagated through all layers of the network to produce the output value.

THE BACKPROPAGATION ALGORITHM

The backpropagation algorithm is used to adjust the connection weights so that the total error E is as small as possible:

$$ E = \sum_{i=1}^{N} \big(t(x_i, w) - y(x_i)\big)^2 \tag{9.5} $$

where:
t(x_i, w): the value given by the sample set
y(x_i): the output value of the network

We first consider a single neuron. Every neuron has input and output values, and each input has a weight that measures how strongly that input influences the neuron. The backpropagation algorithm adjusts these weights so that the error $e_j = t_j - y_j$ is as small as possible. To do this we must first determine the position of each neuron, that is, which neurons belong to a hidden layer and which belong to the output layer. We use the following notation:
w_ij: the weight of neuron j for input i
u_j: the net value computed at neuron j (the weighted sum of its inputs)


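Figure 9.53: Computation model of a neuron

The error of neuron j at iteration n:

$$ e_j(n) = t_j(n) - y_j(n) \tag{9.6} $$

The total squared error of the network:

$$ E(n) = \frac{1}{2} \sum_{j=1}^{k} e_j^2(n) \tag{9.7} $$

At neuron j the weighted sum of the inputs is:

$$ u_j(n) = \sum_{i=0}^{p} w_{ij}(n)\, x_i(n) \tag{9.8} $$

The output value of neuron j:

$$ y_j(n) = f_j(u_j(n)) \tag{9.9} $$

The derivative of the error with respect to each weight w_ij is computed with the chain rule:

$$ \frac{\partial E(n)}{\partial w_{ij}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \cdot \frac{\partial e_j(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial u_j(n)} \cdot \frac{\partial u_j(n)}{\partial w_{ij}(n)} \tag{9.10} $$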

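where:

$$ \frac{\partial E(n)}{\partial e_j(n)} = \frac{\partial \Big( \frac{1}{2} \sum_{j=1}^{k} e_j^2(n) \Big)}{\partial e_j(n)} = e_j(n) \tag{9.11} $$

$$ \frac{\partial e_j(n)}{\partial y_j(n)} = \frac{\partial \big(t_j(n) - y_j(n)\big)}{\partial y_j(n)} = -1 \tag{9.12} $$

$$ \frac{\partial y_j(n)}{\partial u_j(n)} = f_j'(u_j(n)) \tag{9.13} $$

$$ \frac{\partial u_j(n)}{\partial w_{ij}(n)} = \frac{\partial \Big( \sum_{i=0}^{p} w_{ij}(n)\, x_i(n) \Big)}{\partial w_{ij}(n)} = x_i(n) \tag{9.14} $$

Substituting (9.11) to (9.14) into (9.10) gives:

$$ \frac{\partial E(n)}{\partial w_{ij}(n)} = -e_j(n)\, f_j'(u_j(n))\, x_i(n) \tag{9.15} $$

The weight adjustment, with learning rate $\eta$, is:

$$ \Delta w_{ij} = -\eta \frac{\partial E(n)}{\partial w_{ij}(n)} = \eta\, e_j(n)\, f_j'(u_j(n))\, x_i(n) \tag{9.16} $$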

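Let:

$$ \delta_j = -\frac{\partial E(n)}{\partial u_j(n)} = -\frac{\partial E(n)}{\partial e_j(n)} \cdot \frac{\partial e_j(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial u_j(n)} = e_j(n)\, f_j'(u_j(n)) \tag{9.17} $$

Then:

$$ \Delta w_{ij} = \eta\, \delta_j(n)\, x_i(n) \tag{9.18} $$

From this we obtain the weight update formula:

$$ w_{ij}(n+1) = w_{ij}(n) + \Delta w_{ij}(n) \tag{9.19} $$

The weight adjustment can therefore be computed from the formulas above. However, we still need to know which layer the neuron belongs to (a hidden layer or the output layer), because this determines how the adjustment factor $\delta_j$ is computed.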


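Figure 9.54: General computation model of a neural network

Case 1: the neuron is an output node (indexed k below). We have:

$$ \delta_k = -\frac{\partial E(n)}{\partial u_k(n)} = -\frac{\partial E(n)}{\partial e_k(n)} \cdot \frac{\partial e_k(n)}{\partial y_k(n)} \cdot \frac{\partial y_k(n)}{\partial u_k(n)} = e_k(n)\, f_k'(u_k(n)) \tag{9.20} $$

$$ \Delta w_{jk} = \eta\, \delta_k(n)\, y_j(n) \tag{9.21} $$

Case 2: neuron j is a hidden node:

$$ \delta_j = -\frac{\partial E(n)}{\partial u_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial u_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)}\, f_j'(u_j(n)) \tag{9.22} $$

where:

$$ E(n) = \frac{1}{2} \sum_{k=1}^{q} e_k^2(n) \tag{9.23} $$

Then:

$$ \frac{\partial E(n)}{\partial y_j(n)} = \frac{\partial \Big( \frac{1}{2} \sum_{k=1}^{q} e_k^2(n) \Big)}{\partial y_j(n)} = \sum_{k=1}^{q} e_k(n) \frac{\partial e_k(n)}{\partial y_j(n)} \tag{9.24} $$

$$ \frac{\partial e_k(n)}{\partial y_j(n)} = \frac{\partial e_k(n)}{\partial u_k(n)} \cdot \frac{\partial u_k(n)}{\partial y_j(n)} \tag{9.25} $$

$$ \frac{\partial e_k(n)}{\partial u_k(n)} = \frac{\partial \big(t_k(n) - y_k(n)\big)}{\partial u_k(n)} = \frac{\partial \big(t_k(n) - f_k(u_k(n))\big)}{\partial u_k(n)} = -f_k'(u_k(n)) \tag{9.26} $$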


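We have:

$$ u_k(n) = \sum_{j=0}^{m} w_{jk}(n)\, y_j(n) \tag{9.27} $$

$$ \frac{\partial u_k(n)}{\partial y_j(n)} = \frac{\partial \Big( \sum_{j=0}^{m} w_{jk}(n)\, y_j(n) \Big)}{\partial y_j(n)} = w_{jk}(n) \tag{9.28} $$

$$ \frac{\partial E(n)}{\partial y_j(n)} = -\sum_{k=1}^{q} e_k(n)\, f_k'(u_k(n))\, w_{jk}(n) \tag{9.29} $$

From the output-node case we have:

$$ \delta_k = -\frac{\partial E(n)}{\partial u_k(n)} = e_k(n)\, f_k'(u_k(n)) \tag{9.30} $$

$$ \frac{\partial E(n)}{\partial y_j(n)} = -\sum_{k=1}^{q} \delta_k(n)\, w_{jk}(n) \tag{9.31} $$

Therefore:

$$ \delta_j(n) = f_j'(u_j(n)) \sum_{k=1}^{q} \delta_k(n)\, w_{jk}(n) \tag{9.32} $$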

From the formulas above we can state the general rule:

$$ \Delta w_{ij}(n) = \eta \cdot \delta_j(n) \cdot x_i(n) $$

where $\Delta w_{ij}(n)$ is the weight adjustment, $\eta$ is the learning rate, $x_i(n)$ is the input value of the neuron, and $\delta_j(n)$ is given as follows.

If neuron j is an output node:

$$ \delta_j(n) = e_j(n)\, f_j'(u_j(n)) \tag{9.33} $$

If neuron j is a hidden node:

$$ \delta_j(n) = f_j'(u_j(n)) \sum_{k=1}^{q} \delta_k(n)\, w_{jk}(n) \tag{9.34} $$
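The two cases can be combined into a single training step for a network with one hidden layer. The sketch below is not the thesis implementation: it assumes sigmoid activations (so that f'(u) = y(1 - y)), omits bias terms, and uses the hypothetical layouts `wIH[i][j]` (input i to hidden j) and `wHO[j][k]` (hidden j to output k).

```java
final class BackpropSketch {
    final double[][] wIH; // wIH[i][j]: weight from input i to hidden neuron j
    final double[][] wHO; // wHO[j][k]: weight from hidden neuron j to output k
    final double eta;     // learning rate

    BackpropSketch(double[][] wIH, double[][] wHO, double eta) {
        this.wIH = wIH; this.wHO = wHO; this.eta = eta;
    }

    static double f(double u) { return 1.0 / (1.0 + Math.exp(-u)); }

    /** y_j = f(sum_i w[i][j] * in[i]) for every neuron j of one layer. */
    static double[] layer(double[][] w, double[] in) {
        double[] y = new double[w[0].length];
        for (int j = 0; j < y.length; j++) {
            double u = 0.0;
            for (int i = 0; i < in.length; i++) u += w[i][j] * in[i];
            y[j] = f(u);
        }
        return y;
    }

    /** One update for sample (x, t); returns E(n) measured before the update. */
    double train(double[] x, double[] t) {
        double[] yh = layer(wIH, x);
        double[] yo = layer(wHO, yh);
        double error = 0.0;
        double[] deltaK = new double[yo.length];
        for (int k = 0; k < yo.length; k++) {
            double e = t[k] - yo[k];
            error += 0.5 * e * e;
            deltaK[k] = e * yo[k] * (1.0 - yo[k]);            // output node, (9.33)
        }
        double[] deltaJ = new double[yh.length];
        for (int j = 0; j < yh.length; j++) {
            double sum = 0.0;
            for (int k = 0; k < yo.length; k++) sum += deltaK[k] * wHO[j][k];
            deltaJ[j] = yh[j] * (1.0 - yh[j]) * sum;          // hidden node, (9.34)
        }
        // Both delta vectors use the old weights; only now are weights changed.
        for (int j = 0; j < yh.length; j++)
            for (int k = 0; k < yo.length; k++) wHO[j][k] += eta * deltaK[k] * yh[j];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < yh.length; j++) wIH[i][j] += eta * deltaJ[j] * x[i];
        return error;
    }
}
```

Computing the hidden-layer deltas before touching `wHO` matters: formula (9.34) refers to the weights that produced the current output, not the updated ones.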

Thus, depending on the activation function, the adjustment for each weight can be computed easily with the backpropagation algorithm.

DESCRIPTION OF THE METHOD

In this project we recognise the characters in Vietnamese documents, so we have to recognise the 224 characters used in Vietnamese: upper-case letters, lower-case letters, digits, and a number of special characters. In Vietnamese some characters have very similar upper-case and lower-case forms, such as o and O, s and S, p and P, and there are other confusable pairs such as l (lower-case L) and 1 (the digit one), or the characters - and _. To keep the neural network from learning such characters incorrectly, we build two neural networks: one recognises upper-case letters, digits, and special characters, and the other recognises lower-case letters.

We build these networks as Feed-Forward networks trained with the backpropagation algorithm. Each network has three layers: an input layer, an output layer, and one hidden layer. Every character found by the character segmentation step described above is normalised to an image of 30x30 pixels, so the input layer must have 900 nodes. The input values are taken from the pixels of the normalised character image: a node has the value 1.0 if the corresponding pixel is black and 0 otherwise. The network for lower-case letters has to recognise 93 characters, so its output layer has 93 nodes. The output layer of the other network has 131 nodes, one for each of its 131 characters. The number of hidden nodes in each network is the arithmetic mean of the numbers of input and output nodes. After building the two networks, we train them so that they can recognise the characters in Vietnamese documents.
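The input encoding and the layer sizes described above can be sketched as follows. The row-major ordering of pixels and the rounding down of the hidden-layer size (the mean of 900 and 93 is not an integer) are assumptions of this sketch, as are the class and method names.

```java
final class InputEncodingSketch {
    static final int SIZE = 30; // normalised character image is 30x30 pixels

    /**
     * Flattens a normalised character image into the 900 input values.
     * black[row][col] is true for a black pixel; row-major order is assumed.
     */
    static double[] toInputVector(boolean[][] black) {
        double[] in = new double[SIZE * SIZE];
        for (int r = 0; r < SIZE; r++) {
            for (int c = 0; c < SIZE; c++) {
                in[r * SIZE + c] = black[r][c] ? 1.0 : 0.0;
            }
        }
        return in;
    }

    /** Mean of the input and output node counts, rounded down (assumed). */
    static int hiddenNodes(int inputs, int outputs) {
        return (inputs + outputs) / 2;
    }

    public static void main(String[] args) {
        System.out.println(hiddenNodes(SIZE * SIZE, 93));  // lower-case network
        System.out.println(hiddenNodes(SIZE * SIZE, 131)); // upper-case, digits, specials
    }
}
```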


SUMMARY

In this project we analyse and build the layout of images of Vietnamese official documents and recognise their content. First, we convert the colour input image to a grayscale image using the basic colour components Red, Green, and Blue. We then apply Otsu's method to find the most suitable gray threshold. Binarisation yields a binary image represented as a pixel matrix with two values, 0 and 255, where 0 stands for black and 255 for white.

For the skew of the document image, we propose a coarse-to-fine skew estimation algorithm. Noise, diacritics, and large connected components are first removed in a preprocessing step. The range of angles that contains the skew angle is then determined by a coarse estimate. Using morphological closing and opening, the gaps between characters in a word and between words in a line are filled, so that each text line becomes an elongated stroke that is treated as one connected component. The skew angle of each of these components is then measured, and a simple statistical method is applied to obtain the skew angle of the document. Compared with other methods, our skew estimation method has several advantages: it is almost independent of empirical parameters, since most of its parameters are dimensionless and computed automatically from the input image; it can be applied to documents with an arbitrary skew angle; and it does not depend on the language of the document.

To complete an OCR system for official documents, the next steps are layout analysis and recognition: block segmentation, line segmentation, word segmentation, character segmentation, and character recognition. After the document has been deskewed, its text blocks are determined.

In this thesis we consider block segmentation only for images of official documents. The document is first projected horizontally along the Oy axis to split it into blocks in the horizontal direction. For each block found, the vertical projection along the Ox axis is then used to split it into smaller blocks. However, because the structure of these documents is fairly complex, two or more blocks may still lie inside a single segmented block. To overcome this, we apply horizontal splitting once more to the blocks found. After this step the blocks are separated from each other. We built ground truth for 100 images of Vietnamese official documents and implemented an algorithm to evaluate the accuracy of block segmentation. The results show that the proposed block segmentation method works fairly well, with an accuracy of about 90.54%; moreover, because the method is simple, it runs quite fast. Its parameters are also dimensionless.

For line segmentation on the blocks found, we apply a smearing method based on morphological transforms and then use the horizontal projection profile to separate the lines. When the document image is smeared, black pixels on the same line tend to increase, whereas black pixels that do not lie on text lines tend to disappear. This makes the profile of black pixels per line in the next step easy to use, and the lines are determined more reliably. We tested the method on the official-document images used for block segmentation and found the results promising; they meet the needs of the following steps, namely word segmentation, character segmentation, and recognition.

We also propose a new method for word segmentation. Because Vietnamese text contains many diacritics, we first attach the diacritics to their characters, so that each character consists of the base character and its diacritics. Then, using Otsu's method on the histogram of distances, we determine the characteristic distance between words and use it to separate the words.


Because Vietnamese uses diacritics, in the character segmentation stage we treat a character as including the diacritics that accompany it. Each character is joined with its diacritics to form a single character, and we then split characters that touch each other. The method for splitting touching characters presented here is based on the projection profile along the x axis: the cut between touching characters is placed at a position where the pixel density is low. We tested this method on 195 cases and obtained an accuracy of 84.42%. This provides a good basis for a more accurate recognition stage.

For recognition, we built a neural network trained with the backpropagation algorithm to recognise characters. In addition, we export the results as XML and MS Word files. In the ground-truth part we use XML files to store the information needed to evaluate the accuracy of the algorithms, and the MS Word file is produced after recognition to give an intuitive view of the result and to allow easy editing and searching. However, the recognition stage is not yet complete in this project, so we cannot yet evaluate its accuracy. The results obtained in this project are a good basis for building a complete OCR application that addresses the storage and processing of the growing number of administrative documents in Vietnam.


[1] Antonacopoulos, A.: Page Segmentation Using the Description of the Background. Computer Vision and Image Understanding 70:3 (1998) 350-369
[2] Baird, H.S.: Skew Angle of Printed Documents. In: Proc. of SPSE's 40th Annual Conference and Symposium on Hybrid Image Systems, Rochester, NY (1987) 21-24
[3] Breuel, T.M.: Segmentation of Handprinted Letter Strings using a Dynamic Programming Algorithm. In: Proc. 6th Int. Conf. on Document Analysis and Recognition, Seattle, USA (2001) 821-826
[4] Cattoni, R., Coianiz, T., Messelodi, S., Modena, C.M.: Geometric Layout Analysis Techniques for Document Image Understanding: A Review. Technical Report 9703-09, ITC-IRST, Trento, Italy (1998)
[5] Chen, S.: Document Layout Analysis Using Recursive Morphological Transforms. PhD thesis, Univ. of Washington, 1995
[6] Chen, S., Haralick, R.M., Phillips, I.T.: Automatic Text Skew Estimation in Document Images. In: Proc. 3rd Int. Conf. on Document Analysis and Recognition, Montréal, Canada (1995) 1153-1156
[7] Chen, S., Baek, Y.M., and Kim, I.C.: .. US patents.
[8] Chen, Y.K., Wang, J.F.: Skew Detection and Reconstruction Based on Maximization of Variance of Transition-counts. Pattern Recognition 33 (2000) 195-208
[9] Chou, C.H., Chu, S.Y., Chang, F.: Estimation of Skew Angles for Scanned Documents Based On Piecewise Covering by Parallelograms. Pattern Recognition 40:2 (2007) 443-455
[10] Das, A., Chanda, B.: A Fast Algorithm for Skew Detection of Document Images Using Morphology. Int. J. Document Analysis and Recognition 4 (2001) 109-114
[11] Gatos, B., Konidaris, T., Ntzios, K., Pratikakis, I., Perantonis, S.J.: A Segmentation-free Approach for Keyword Search in Historical Typewritten Documents. In: Proc. 8th Int. Conf. on Document Analysis and Recognition, Seoul, South Korea (2005) 54-58
[12] Hinds, S.C., Fisher, J.L., D'Amato, D.P.: A Document Skew Detection Method Using Run-Length Encoding and the Hough Transform. In: Proc. 10th Int. Conf. on Pattern Recognition, Atlantic City, NJ, USA (1990) 464-468
[13] Hull, J.J.: Document Image Skew Detection: Survey and Annotated Bibliography. Document Analysis Systems II, Hull, J.J., and Taylor, S.L. (eds.), World Scientific (1998) 40-64
[14] Ishitani, Y.: Document Skew Detection Based On Local Region Complexity. In: Proc. 2nd Int. Conf. on Document Analysis and Recognition, Tsukuba, Japan (1993) 49-52
[15] Kanai, J., Bagdanov, A.D.: Projection Profile Based Skew Estimation Algorithm for JBIG Compressed Images. Int. J. Document Analysis and Recognition (1998) 43-51
[16] Kavallieratou, E., Fakotakis, N., Kokkinakis, G.K.: Skew Angle Estimation for Printed and Handwritten Documents Using the Wigner-Ville Distribution. Image and Vision Computing 20 (2002) 813-824
[17] Kise, K., Sato, A., and Iwata, M.: Segmentation of Page Images Using the Area Voronoi Diagram. Computer Vision and Image Understanding 70:3 (1998) 370-382
[18] Le, D.S., Thoma, G.R., Wechsler, H.: Automated Page Orientation and Skew Angle Detection for Binary Document Images. Pattern Recognition 27:10 (1994) 1325-1344
[19] Lu, Y., Tan, C.L.: Improved Nearest Neighbor Based Approach to Accurate Document Skew Estimation. In: Proc. 7th Int. Conf. on Document Analysis and Recognition, Edinburgh, Scotland (2003) 503-507
[20] Nagy, G., Seth, S., and Viswanathan, M.: A Prototype Document Analysis System for Technical Journals. Computer 25 (1992) 10-22
[21] Najman, L.: Using Mathematical Morphology for Document Skew Estimation. In: Proc. of SPIE Conf. on Document Recognition and Retrieval XI, San Jose, California, USA (2004) 182-191
[22] O'Gorman, L.: The Document Spectrum for Page Layout Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 15:11 (1993) 1162-1173
[23] Okun, O.: Geometrical Approach to Skew Detection for Documents Containing the Latin/Cyrillic Characters. In: Proc. of SPIE Conf. on Vision Geometry VIII, Denver, Colorado, USA (1999) 357-365
[24] Okun, O., Pietikäinen, M., Sauvola, J.: Document Skew Estimation without Angle Range Restriction. Int. J. Document Analysis and Recognition (1999) 132-144
[25] OmniPage Pro, Version 12.0
[26] Otsu, N.: A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Systems, Man, and Cybernetics (1979) 62-66
[27] Shi, Z., Govindaraju, V.: Skew Detection for Complex Document Images Using Fuzzy Run-Length. In: Proc. 7th Int. Conf. on Document Analysis and Recognition, Edinburgh, Scotland (2003) 715-719
[28] Thanh, N.D.: A Robust Document Layout Analysis Algorithm for Vietnamese Documents. Master thesis, Asian Institute of Technology, 2005
[29] Thanh, N.D., Binh, V.D., Mi, N.T.T., Giang, N.T.: A Robust Document Skew Estimation Algorithm Using Mathematical Morphology. To appear at: the 19th IEEE International Conference on Tools with Artificial Intelligence, Patras, Greece, 2007
[30] Yu, B., Jain, A.: A Robust and Fast Skew Detection Algorithm for Generic Documents. Pattern Recognition 29:10 (1996) 1599-1629
[31] Yuan, B., Tan, C.L.: Skewscope: The Textual Document Skew Detector. In: Proc. 7th Int. Conf. on Document Analysis and Recognition, Edinburgh, Scotland (2003) 49-53


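MORPHOLOGICAL TRANSFORMS

Morphological operations are commonly used in image processing. Each image is regarded as a set of points. Let A, B, C, K be sets in the two-dimensional space $\mathbb{Z}^2$ with origin O. Below, $A_t = \{a + t \mid a \in A\}$ denotes the translation of A by t, $A^c$ the complement of A, and $\breve{B} = \{-b \mid b \in B\}$ the reflection of B.

1. Erosion: the erosion of a set A by a structuring element B is denoted $A \ominus B$ and is defined as:

$$ A \ominus B = \{ x \in \mathbb{Z}^2 \mid x + b \in A \text{ for every } b \in B \} \tag{1} $$

Erosion can also be understood as:

$$ A \ominus B = \bigcap_{b \in B} A_{-b} \tag{2} $$

The extension property of erosion (for $O \in B$) is:

$$ A \ominus B \subseteq A \tag{3} $$

If $A \subseteq B$, erosion has the following properties:

$$ A \ominus C \subseteq B \ominus C \tag{4} $$

$$ C \ominus B \subseteq C \ominus A \tag{5} $$

2. Dilation: the dilation of a set A by a structuring element B is denoted $A \oplus B$ and is defined as:

$$ A \oplus B = \{ c \in \mathbb{Z}^2 \mid c = a + b \text{ for some } a \in A \text{ and } b \in B \} \tag{6} $$

Dilation can also be understood as:

$$ A \oplus B = \bigcup_{b \in B} A_{b} \tag{7} $$

Dilation is commutative and associative:

$$ A \oplus B = B \oplus A \tag{8} $$

$$ A \oplus (B \oplus C) = (A \oplus B) \oplus C \tag{9} $$

Dilation is the dual of erosion. The duality of the two transforms is described by:

$$ A \oplus B = (A^c \ominus \breve{B})^c \tag{10} $$

where $A^c$ is the complement of A. The extension property of dilation (for $O \in B$) is:

$$ A \subseteq A \oplus B \tag{11} $$

If $A \subseteq B$, dilation has the following property:

$$ A \oplus C \subseteq B \oplus C \tag{12} $$

3. Opening: the opening of a set A by a structuring element B is denoted $A \circ B$ and is defined as:

$$ A \circ B = (A \ominus B) \oplus B \tag{13} $$

Opening can also be understood as:

$$ A \circ B = \bigcup \{ B_t \mid B_t \subseteq A,\ t \in \mathbb{Z}^2 \} \tag{14} $$

The extension property of opening is:

$$ A \circ B \subseteq A \tag{15} $$

Furthermore, if a structuring element K is morphologically decomposed into two smaller structuring elements G and H, that is, $K = G \oplus H$, then:

$$ A \circ (G \oplus H) \subseteq A \circ G \tag{16} $$

4. Closing: the closing of a set A by a structuring element B is denoted $A \bullet B$ and is defined as:

$$ A \bullet B = (A \oplus B) \ominus B \tag{17} $$

Closing can also be understood as:

$$ A \bullet B = \{ x \mid \breve{B}_t \cap A \neq \emptyset \text{ for every } t \text{ such that } x \in \breve{B}_t \} \tag{18} $$

As in the case of dilation, closing is the dual of opening. The duality of the two operations is expressed by:

$$ A \bullet B = (A^c \circ \breve{B})^c \tag{19} $$

The extension property of closing is:

$$ A \subseteq A \bullet B \tag{20} $$

If a structuring element K is morphologically decomposed into two smaller structuring elements G and H, that is, $K = G \oplus H$, then:

$$ A \bullet G \subseteq A \bullet (G \oplus H) \tag{21} $$

Figure A.55: Morphological transforms. (a) Original image (b) Structuring element (c) Result of applying closing to image (a) (d) Result of applying opening to image (c)

In summary, erosion, dilation, closing, and opening are invariant under translation of the input image:

$$ A_t \ominus B = (A \ominus B)_t \tag{22} $$

$$ A_t \oplus B = (A \oplus B)_t \tag{23} $$

$$ A_t \circ B = (A \circ B)_t \tag{24} $$

$$ A_t \bullet B = (A \bullet B)_t \tag{25} $$

However, they do not all behave in the same way under translation of the structuring element:

$$ A \ominus B_t = (A \ominus B)_{-t} \tag{26} $$

$$ A \oplus B_t = (A \oplus B)_{t} \tag{27} $$

$$ A \circ B_t = A \circ B \tag{28} $$

$$ A \bullet B_t = A \bullet B \tag{29} $$

5. n-fold dilation: the n-fold dilation of a set K is denoted $K^{(n)}$ and is defined as:

$$ K^{(n)} = \begin{cases} \{O\} & \text{if } n = 0 \\ \underbrace{K \oplus K \oplus \dots \oplus K}_{n} & \text{if } n = 1, 2, 3, \dots \end{cases} \tag{30} $$

The extension property of n-fold dilation (for $O \in K$) is:

$$ K^{(i)} \subseteq K^{(j)} \quad \text{when } i < j \tag{31} $$

Figure A.56: Illustrations of n-fold dilation for some basic structuring elements (columns: K, n = 1, n = 2, n = 3, n = 4).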
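The four basic operations can be sketched on a binary image stored as a boolean array. In this sketch a structuring element is a list of (row, column) offsets relative to its origin, and pixels outside the image are treated as background; both conventions, and all names, are assumptions rather than the thesis implementation.

```java
final class MorphologySketch {

    /** Erosion: keep (y, x) only if every offset of b lands on a set pixel. */
    static boolean[][] erode(boolean[][] a, int[][] b) {
        int h = a.length, w = a[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                boolean all = true;
                for (int[] o : b) {
                    int yy = y + o[0], xx = x + o[1];
                    if (yy < 0 || yy >= h || xx < 0 || xx >= w || !a[yy][xx]) {
                        all = false;
                        break;
                    }
                }
                out[y][x] = all;
            }
        }
        return out;
    }

    /** Dilation: stamp the structuring element onto every set pixel. */
    static boolean[][] dilate(boolean[][] a, int[][] b) {
        int h = a.length, w = a[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!a[y][x]) continue;
                for (int[] o : b) {
                    int yy = y + o[0], xx = x + o[1];
                    if (yy >= 0 && yy < h && xx >= 0 && xx < w) out[yy][xx] = true;
                }
            }
        }
        return out;
    }

    /** Opening: erosion followed by dilation. */
    static boolean[][] open(boolean[][] a, int[][] b) {
        return dilate(erode(a, b), b);
    }

    /** Closing: dilation followed by erosion. */
    static boolean[][] close(boolean[][] a, int[][] b) {
        return erode(dilate(a, b), b);
    }
}
```

With a horizontal three-pixel element `{{0, -1}, {0, 0}, {0, 1}}`, closing fills a one-pixel gap between two set pixels on the same row, which is the effect the skew estimation step relies on when it merges characters into line-shaped components; opening with the same element removes isolated pixels.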
