
VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY


Nguyễn Tiến Thanh


ENTITY RELATION EXTRACTION ON VIETNAMESE WIKIPEDIA BASED ON SYNTACTIC PARSE TREES


UNDERGRADUATE GRADUATION THESIS (FULL-TIME PROGRAM)

Major: Information Technology


HANOI - 2010














Supervisor: Assoc. Prof. Dr. Hà Quang Thụy
Co-supervisor: MSc. Nguyễn Thu Trang




ACKNOWLEDGEMENTS
First of all, I would like to express my deepest thanks and gratitude to Assoc. Prof. Dr. Hà Quang Thụy, MSc. Nguyễn Thu Trang and BSc. Trần Nam Khánh, who guided me wholeheartedly throughout the work on this graduation thesis.
I sincerely thank the lecturers who created favourable conditions for me to study and do research at the University of Engineering and Technology.
I would also like to thank MSc. Trần Mai Vũ and the colleagues and students at the KT-Sislab laboratory, who helped me a great deal in collecting and processing the data. I thank my classmates in K51CA and K51CHTTT for their support and encouragement throughout my studies at the university.
Finally, I want to express my boundless gratitude to my family and friends, the loved ones who were always by my side and encouraged me during the work on this graduation thesis.
My sincere thanks to all!


Hanoi, 21 May 2010
Student


Nguyễn Tiến Thanh






Abstract
Semantic relation extraction (relation extraction for short) is regarded as a basic problem of natural language processing and has attracted great interest from researchers and major conferences worldwide [1, 9, 41]. In Vietnam, the problem still poses many challenges because of the complexity of the Vietnamese language and the incompleteness of linguistic resources.
Based on an analysis of the strengths and weaknesses of relation extraction methods, this thesis applies a feature-based relation extraction method to the problem. Features expressing a relation are extracted from Vietnamese syntactic parse trees and then fed into an SVM classifier to find the corresponding relation type, from which instances of the relation are extracted. In addition, to reduce the effort of building the training set, the thesis exploits the rich structure of the data on Vietnamese Wikipedia to build the training set semi-automatically.
Initial experimental results on several relation types show that the extraction model of the system reaches an average F1 measure of 86.4%. This indicates that the model is promising and applicable in practice.















TABLE OF CONTENTS

Acknowledgements
Abstract
Table of contents
List of tables
List of figures
List of abbreviations
Introduction
Chapter 1. Overview of the semantic relation extraction problem
1.1. Semantic relations
1.1.1. Concept
1.1.2. Classification of semantic relations
1.2. The semantic relation extraction problem
1.3. Applications
Summary of Chapter 1
Chapter 2. Approaches to semantic relation extraction
2.1. Unsupervised learning for relation extraction
2.2. Supervised learning for relation extraction
2.2.1. The link grammar method
2.2.2. Feature-based extraction
2.2.3. Kernel-based extraction
2.3. Semi-supervised learning for relation extraction
2.3.1. The DIPRE method
2.3.2. The Snowball method
2.4. Remarks
Summary of Chapter 2
Chapter 3. A relation extraction model for Vietnamese Wikipedia based on syntactic parse trees
3.1. Characteristics of Wikipedia
3.1.1. Entities in Wikipedia
3.1.2. Infobox
3.1.3. Categories
3.2. Vietnamese syntactic parse trees
3.2.1. Syntactic parsing
3.2.2. Basic components of a Vietnamese syntactic parse tree
3.3. A parse-tree-based relation extraction model on Vietnamese Wikipedia
3.3.1. Problem statement
3.3.2. Solution idea
3.3.3. Building the training set
3.3.4. Architecture of the relation extraction system
Summary of Chapter 3
Chapter 4. Experiments and evaluation
4.1. Experimental environment
4.1.1. Hardware configuration
4.1.2. Software tools
4.2. Experimental data
4.3. Experiments
4.3.1. Description of the implementation
4.3.2. Building the training set from Vietnamese Wikipedia
4.3.3. Generating feature vectors
4.3.4. The SVM classifier
4.4. Evaluation
4.4.1. System evaluation
4.4.2. Evaluation method
4.4.3. Test results
4.5. Remarks
Conclusion
Appendix
References



List of tables
Table 1-1: The 15 relations in WordNet
Table 1-2: The 22 types of semantic relations according to Roxana Girju
Table 2-1: Shortest paths
Table 2-2: Some features obtained from dependency paths
Table 3-1: Attributes of the feature vector
Table 4-1: Hardware configuration
Table 4-2: List of software used
Table 4-3: Evaluation values for the classification system
Table 5-1: Labels used in the syntactic parse tree

List of figures
Figure 1: Example of a linkage (1)
Figure 2: Example of a linkage (2)
Figure 3: Example of a pattern
Figure 4: Example of an entity pair produced by pattern matching
Figure 5: Example of a syntactic parse tree
Figure 6: Features obtained from a syntactic parse tree
Figure 7: Illustration of a dependency graph
Figure 8: Sample relations extracted
Figure 9: Architecture of the Snowball system
Figure 10: Example of a Vietnamese syntactic parse tree
Figure 11: The process of building the training set
Figure 12: Representation structure of infobox information
Figure 13: Relation extraction model on Wikipedia
Figure 14: Subtree representing the relation "thành_lập" (founding)
Figure 15: Example of a search on Wikipedia
Figure 16: Training data statistics for the relation "date of birth"
Figure 17: Test results for the relation "year of founding"
Figure 18: Test results for the relation "rector"
Figure 19: Test results for the relation "date of birth"
Figure 20: Comparison of the average results of the three relations

List of abbreviations

Word or phrase | Abbreviation
A Library for Support Vector Machines | LibSVM
Dual Iterative Pattern Relation Expansion | DIPRE
Support vector machine | SVM
Wikipedia | Wiki

Introduction
Semantic relation extraction (or relation extraction) is regarded as a basic problem of natural language processing. Its task is to extract the semantic relations between concepts, or to use predefined relations to find information that serves other processing steps. Relation extraction is widely applied to problems such as ontology construction [15, 16, 19, 22], question answering systems [22, 29], text-to-image generation [11] and finding disease-gene associations [27]. As a result, relation extraction has not only attracted great interest from researchers and major international conferences in recent years, such as Coling/ACL and Senseval, but is also part of important international projects in data mining such as ACE (Automatic Content Extraction), DARPA EELD (Evidence Extraction and Link Discovery), ARDA-AQUAINT (Question Answering for Intelligence) and ARDA NIMD (Novel Intelligence from Massive Data).
In Vietnam, the problem still poses many challenges because of the complexity of the Vietnamese language and the incompleteness of linguistic resources. Based on an analysis of relation extraction methods, this thesis proposes a supervised learning model that extracts entity relations using syntactic parse trees on the Vietnamese Wikipedia data domain. Initial experimental results show that the model is promising and well suited to practical use.
The thesis is organized into 4 chapters:
Chapter 1: Gives an overview of the semantic relation extraction problem and the related concepts.
Chapter 2: Presents the approaches to the relation extraction problem. For each machine learning paradigm (supervised, unsupervised and semi-supervised), the thesis introduces several representative models. This is an important methodological basis for the model the thesis applies to relation extraction on the Vietnamese Wikipedia data domain.
Chapter 3: Based on an analysis of the strengths and weaknesses of the methods presented in Chapter 2, the thesis chooses the feature-based relation extraction method, under the supervised learning approach, to solve the problem. The features of a relation are extracted from Vietnamese syntactic parse trees and then fed into a classifier using the SVM algorithm to find the corresponding relation type, from which instances of the relation are extracted. In addition, to reduce the effort of building the training set, the richly structured data on Vietnamese Wikipedia is used. The chapter presents the characteristics of Wikipedia and Vietnamese syntactic parse trees, and proposes a relation extraction model based on syntactic parse trees.
Chapter 4: Experiments, results and evaluation. The experiments cover building the training set and extracting relations with the SVM classifier.
Conclusion and future work: Summarizes the main results of the thesis, points out what still needs to be improved and outlines directions for future research.


















Chapter 1. Overview of the semantic relation extraction problem
The main content of this thesis is the proposal of an entity relation extraction model based on syntactic parse trees on the Vietnamese Wikipedia data domain. This chapter introduces the concept of semantic relations, the semantic relation extraction problem and its applications. This is an important theoretical basis for defining the goals and scope of the proposed model.
1.1. Semantic relations
1.1.1. Concept
Identifying semantic relations is a meaningful field that has received much attention from researchers in linguistics as well as natural language processing. Many definitions of semantic relations have been proposed. In a narrow sense, Birger Hjorland [42] defines a semantic relation as follows:
"A semantic relation is a relation in meaning between two or more concepts, where a concept is expressed as a word or phrase."
Example: Consider the sentence "The University of Engineering and Technology was established by decision of the Prime Minister on 25 May 2004." We then say that (University of Engineering and Technology, 25 May 2004) stands in the semantic relation "date of establishment".
In this thesis, where no confusion arises, a semantic relation is simply called a relation.
Identifying the relations between concepts is an important issue in information retrieval. It enriches the semantics of a sentence or a document collection. Moreover, when searching for a piece of information, one can also obtain information about other related matters. Therefore, to retrieve accurate information, we need to know the types of relations and study methods for identifying them.
1.1.2. Classification of semantic relations
Semantic relations express the relations between concepts and are represented as a hierarchical structure through those relations. In [17], Iris Hendrickx et al. surveyed the field and showed that classifications of semantic relations are very diverse, depending on semantic characteristics as well as on the purpose and target of the approach. This section introduces two classification systems that are widely used in relation extraction: WordNet and the classification system of Girju.
WordNet [16, 39] is an online dictionary of English developed by lexicographers at Princeton University (USA). WordNet contains 100,000 concepts, including nouns, verbs, adjectives and adverbs, linked to one another through 15 relations (described in Table 1-1).
Table 1-1: The 15 relations in WordNet
No. | Semantic relation | Concepts linked by the relation | Example
1 | Hypernymy (is-a) | Noun - Noun; Verb - Verb | Cat is-a feline; Manufacture is-a make
2 | Hyponymy (reverse is-a) | Noun - Noun; Verb - Verb | Feline reverse is-a cat; Manufacture reverse is-a make
3 | Is-part-of | Noun - Noun | Leg is-part-of table
4 | Has-part | Noun - Noun | Table has-part leg
5 | Is-member-of | Noun - Noun | UK is-member-of NATO
6 | Has-member | Noun - Noun | NATO has-member UK
7 | Is-stuff-of | Noun - Noun | Carbon is-stuff-of coal
8 | Has-stuff | Noun - Noun | Coal has-stuff carbon
9 | Cause-to | Verb - Verb | To develop cause-to to grow
10 | Entail | Verb - Verb | To snore entail to sleep
11 | Attribute | Adjective - Noun | Hot attribute temperature
12 | Synonymy (synset) | Noun - Noun; Verb - Verb; Adjective - Adjective; Adverb - Adverb | Car synonym automobile; To notice synonym to observe; Happy synonym content; Mainly synonym primarily
13 | Antonymy | Noun - Noun; Verb - Verb; Adjective - Adjective; Adverb - Adverb | Happiness antonymy unhappiness; To inhale antonymy to exhale; Sincere antonymy insincere; Always antonymy never
14 | Similarity | Adjective - Adjective | Abridge similarity shorten
15 | See-also | Verb - Verb; Adjective - Adjective | Touch see-also touch down; Inadequate see-also unsatisfactory

WordNet is commonly used to look up semantic relations. Through these relations, a word in WordNet can also be linked to other concepts.
Roxana Girju [10] proposed a system of 22 types of semantic relations, shown in Table 1-2. Among them, several important relations are often used to express how concepts relate: hyponymy/hypernymy (is-a), meronymy/holonymy (part-whole), synonymy and antonymy.

Table 1-2: The 22 types of semantic relations according to Roxana Girju
No. | Semantic relation | Description | Example
1 | HYPERNYMY (IS-A) | An entity/event/state is a subclass of another entity/event/state | daisy flower; large company, such as Microsoft
2 | PART-WHOLE (MERONYMY) | An entity/event/state is a part of another entity/event/state | door knob; the door of the car
3 | CAUSE | An event/state causes another event/state to occur | malaria mosquitos; death by hunger; The earthquake generated a big tsunami
4 | INSTRUMENT | An entity is used as a means/instrument | pump drainage; He broke the box with a hammer.
5 | MAKE / PRODUCE | An entity creates/produces another entity | honey bees; GM makes cars
6 | KINSHIP | An entity is related to another entity by blood or marriage | boy's sister; Mary has a daughter
7 | POSSESSION | An entity owns another entity | family estate; the girl has a new car.
8 | SOURCE / FROM | The origin of an entity | olive oil
9 | PURPOSE | A state or action results from another state or event | migraine drug; He was quiet in order not to disturb her.
10 | LOCATION/SPACE | A spatial relation between two entities or between an entity and an event | field mouse; I left the keys in the car
11 | TEMPORAL | The time associated with an event | 5 o'clock tea; the store opens at 9 am
12 | EXPERIENCER | The feeling or state of an entity | desire for chocolate; Mary's fear.
13 | MEANS | The means by which an event is carried out | bus service; I go to school by bus.
14 | MANNER | The way in which an event occurs | hard-working immigrants; performance with passion
15 | TOPIC | One object is characteristic of another object | they argued about politics
16 | BENEFICIARY | An entity benefits from a state or event | customer service; I wrote Mary a letter.
17 | PROPERTY | A property of an entity/event or state | red rose; the juice has a funny color.
18 | THEME | An entity is described by/in an action or another event | music lover
19 | AGENT | The agent performing an action | the investigation of the police
20 | DEPICTION-DEPICTED | An entity is depicted in another entity | the picture of the girl
21 | TYPE | A word or concept is a type of another word or concept | member state; framework law
22 | MEASURE | An entity expresses the quantity of some entity/event | 70-km distance; The jacket costs $60; a cup of sugar
1.2. The semantic relation extraction problem
According to [9, 36, 41], relation extraction is regarded as an important part of information extraction. Viewed at a high level of abstraction, a set of sentences or documents is a collection of concepts, entities and the relations between them. Entities and concepts are expressed as words or phrases. The semantic relations between them are hidden in the links between these concepts or entities. Discovering these relations is very important for natural language processing problems.
Roxana Girju [10] stated the semantic relation extraction problem as follows: "Given concepts or entities as input, and a collection of unstructured documents such as web pages, documents and news, we need to determine the semantic relations between them."
Roxana Girju [10] gives the following example of semantic relation extraction.
Given a passage whose entities/concepts have been labelled:
[Saturday's snowfall]_TEMP topped [a record in Hartford, Connecticut]_LOC with [the total of 12.5 inches]_MEASURE, [the weather service]_TOPIC said. The storm claimed its fatality Thursday when [a car driven by a [college student]_PART-WHOLE]_THEME skidded on [an interstate overpass]_LOC in [the mountains of Virginia]_LOC/PART-WHOLE and hit [a concrete barrier]_PART-WHOLE, police said.
The semantic relation extraction system then returns the relations that may hold between these entities/concepts, namely:

TEMP (Saturday, snowfall)
LOC (mountains, Virginia)
PART-WHOLE/LOC (mountains, Virginia)
LOC (Hartford Connecticut, record)
PART-WHOLE (concrete, barrier)
LOC (interstate, overpass)
PART-WHOLE (student, college)
TOPIC (weather, service)
THEME (car, driven by a college student)
MEASURE (total, 12.5 inches)
1.3. Applications
Semantic relation extraction is applied in many different fields. The first to mention is the construction of knowledge bases, typically ontologies, the core component of the Semantic Web. While the benefits of the Semantic Web are very large, building ontologies by hand is extremely difficult. The solution is to use information extraction in general, and relation extraction in particular, to automate part of the ontology construction process. Many studies have addressed this issue, for example [15, 16, 19, 22].
Semantic relation extraction is also widely used in question answering systems. Some question answering systems are built on the automatic extraction of words, concepts and relations. For example, Kim et al. [22] proposed the question answering system OntotripleQA, which applies semantic relation extraction to entities in a manually labelled ontology.
In addition, relation extraction is applied in image processing, for example text-to-image generation [11]. Relation extraction is also a powerful tool in biotechnology, for example in finding disease-gene relations and protein-protein interactions [27].

Summary of Chapter 1
This chapter gave an overview of the concepts related to the semantic relation extraction problem, several types of semantic relations and their notable applications. The next chapter focuses on the typical methods for modelling the semantic relation extraction problem and the corresponding solutions.


















Chapter 2. Approaches to semantic relation extraction
Relation extraction is regarded as an important part of information extraction [9] and has received growing attention from the natural language processing and machine learning communities. Current approaches concentrate on using machine learning methods to perform the extraction automatically. All three kinds of machine learning (unsupervised, supervised and semi-supervised) have their own strengths.
Moreover, in recent studies [8, 12, 13, 17, 21], the syntactic parse tree of a sentence is considered an important source of information for relation extraction. Therefore, for each machine learning paradigm, this chapter introduces several representative models. This is an important methodological basis for the model the thesis applies to relation extraction on the Vietnamese Wikipedia data domain.
2.1. Unsupervised learning for relation extraction
Unsupervised learning essentially models relations by applying clustering algorithms to them. There are several ways [1, 7, 12, 18] to represent the relation between two entities/concepts, the most common being a feature vector. The core problem is how to select good, effective features. Jinxiu Chen et al. [18] proposed a solution based on building an entropy function to rank the features, and from it derived an algorithm that selects the features and the optimal number of clusters. The details are as follows.
First, Jinxiu Chen et al. introduce some notation:
Let $P = \{p_1, p_2, \dots, p_N\}$ be the set of all context vectors in which the entity pair $E_1$ and $E_2$ co-occurs. Here, the context consists of all words appearing before, between and after the entity pair.
Let $W = \{w_1, w_2, \dots, w_M\}$ be the feature set, consisting of all words that appear in $P$.
Suppose $p_n$ ($1 \le n \le N$) lies in the feature space $W$ (of dimension $M$). The similarity between vectors $p_i$ and $p_j$ is given by:

$$S_{i,j} = \exp(-\alpha \cdot D_{i,j})$$

where:
- $D_{i,j}$ is the Euclidean distance between $p_i$ and $p_j$,
- $\alpha = -\ln 0.5 / \bar{D}$ is a positive constant obtained empirically,
- $\bar{D}$ is the average distance between the $p_i$.

The entropy of the data set $P$ with $N$ data points is then defined as:

$$E = -\sum_{i=1}^{N} \sum_{j=1}^{N} \left( S_{i,j} \log S_{i,j} + (1 - S_{i,j}) \log(1 - S_{i,j}) \right) \qquad (2.1)$$

Next, to select a subset of important features from $W$, the features are ranked by their importance to the clustering. The ranking function rests on the assumption that a feature is unimportant if its presence obscures the separability of the data set [18]. The importance $I(w_k)$ of each feature is determined by the entropy of the data set after feature $w_k$ has been removed.
Based on the observation that a feature is the least important if removing it makes $E$ smallest, the features are sorted by importance, giving the set $W^r = \{f_1, \dots, f_M\}$.
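
The entropy of formula (2.1) and the resulting feature ranking can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [18]: the function names `dataset_entropy` and `rank_features` are hypothetical, the average distance is assumed to be taken over distinct pairs, and terms with similarity exactly 0 or 1 are skipped because they contribute nothing in the limit.

```python
import math


def euclidean(p, q):
    """Euclidean distance D_ij between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dataset_entropy(points):
    """Entropy E of formula (2.1) for a list of numeric context vectors."""
    n = len(points)
    if n < 2:
        return 0.0
    dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Assumption: the average distance is taken over distinct pairs (i < j).
    pair_dists = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    mean_dist = sum(pair_dists) / len(pair_dists)
    if mean_dist == 0:
        return 0.0  # all vectors coincide; nothing to separate
    alpha = -math.log(0.5) / mean_dist
    entropy = 0.0
    for i in range(n):
        for j in range(n):
            s = math.exp(-alpha * dist[i][j])
            # s = 1 (or 0) contributes nothing in the limit, so skip it
            if 0.0 < s < 1.0:
                entropy -= s * math.log(s) + (1 - s) * math.log(1 - s)
    return entropy


def rank_features(points):
    """Feature indices ordered from most to least important."""
    m = len(points[0])
    entropy_without = []
    for k in range(m):
        reduced = [list(p[:k]) + list(p[k + 1:]) for p in points]
        entropy_without.append(dataset_entropy(reduced))
    # Removing the least important feature leaves the smallest entropy,
    # so a larger entropy after removal means a more important feature.
    return sorted(range(m), key=lambda k: entropy_without[k], reverse=True)
```

With two vectors at distance $d$, the average distance is $d$, every off-diagonal similarity equals 0.5, and the entropy is $2 \ln 2$, which is a convenient sanity check for the constant $\alpha$.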
Finding the best feature subset $F$ then becomes a search over the space $\{(f_1, \dots, f_k),\ 1 \le k \le M\}$, that is, finding

$$\hat{F}_k = \arg\max_{F \subseteq W^r} \{ \mathrm{criterion}(F, k) \}$$

Let $P^\mu$ be a subset of entity pairs sampled from the full set of entity pairs $P$. The size of $P^\mu$ is $\alpha N$ (here $\alpha$ denotes the sampling ratio, with $\alpha = 0.9$).
Let $C$ (or $C^\mu$) be the connectivity matrix of size $|P| \times |P|$ (or $|P^\mu| \times |P^\mu|$) built from the clustering result on $P$ (or $P^\mu$), where:

$$c_{i,j} = \begin{cases} 1 & \text{if entity pairs } p_i \text{ and } p_j \text{ lie in the same cluster} \\ 0 & \text{otherwise} \end{cases}$$

The stability $M(C^\mu, C)$, that is, the consistency between the clustering results on $C^\mu$ and $C$, is then computed as:

$$M(C^\mu, C) = \frac{\sum_{i,j} 1\{C^\mu_{i,j} = C_{i,j} = 1,\ p_i \in P^\mu,\ p_j \in P^\mu\}}{\sum_{i,j} 1\{C_{i,j} = 1,\ p_i \in P^\mu,\ p_j \in P^\mu\}} \qquad (2.2)$$

However, $M(C^\mu, C)$ tends to decrease as the number of clusters $k$ grows, so a small $k$ would tend to be chosen as the number of clusters. To avoid this, an independent random predictor $\rho_k$ is used to normalize $M(C^\mu, C)$. It is obtained by splitting the data into $k$ clusters at random, $q$ times for each value of $k$. The objective $M(C^\mu_{F,k}, C_{F,k})$ is computed by formula (2.2), and:

$$M^{norm}_{F,k} = \frac{1}{q} \sum_{i=1}^{q} M(C^{\mu_i}_{F,k}, C_{F,k}) - \frac{1}{q} \sum_{i=1}^{q} M(C^{\mu_i}_{F,\rho_k}, C_{F,\rho_k}) \qquad (2.3)$$
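
To make formulas (2.2) and (2.3) concrete, the following Python sketch computes the connectivity matrix, the stability of one subsample and the normalized score from cluster labels that are assumed to be already available. The function names and the `(full_labels, sub_labels, sub_indices)` run format are hypothetical choices for this illustration, and the sums are assumed to include the diagonal terms $i = j$.

```python
def connectivity(labels):
    """Connectivity matrix: entry (i, j) is 1 when items i and j share a cluster."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]


def stability(full_labels, sub_labels, sub_indices):
    """M(C^mu, C) of formula (2.2).

    full_labels: cluster label of every pair in P
    sub_labels:  cluster label of every pair in the subsample P^mu
    sub_indices: position in P of each subsample item, in the same order
    """
    c = connectivity(full_labels)
    c_mu = connectivity(sub_labels)
    agree = total = 0
    for a, i in enumerate(sub_indices):
        for b, j in enumerate(sub_indices):
            if c[i][j] == 1:
                total += 1
                if c_mu[a][b] == 1:
                    agree += 1
    return agree / total


def normalized_stability(cluster_runs, random_runs):
    """Formula (2.3): mean stability of the clustering minus that of the random predictor.

    Each run is a tuple (full_labels, sub_labels, sub_indices).
    """
    m_cluster = sum(stability(*run) for run in cluster_runs) / len(cluster_runs)
    m_random = sum(stability(*run) for run in random_runs) / len(random_runs)
    return m_cluster - m_random
```

A subsample that reproduces the partition of the full data scores 1, while a labelling that separates every item keeps only the diagonal agreements, which is why subtracting the random predictor's score matters when comparing different values of $k$.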
This function is carried out in the following 8 steps:
Function: criterion(F, k, P, q)
Input: feature subset F, number of clusters k, set of entity pairs P and sampling frequency q
Output: a score for the quality of F and k
Procedure:
1. Run the k-means algorithm with k clusters, as given in the input, on the set of pairs $P_F$
2. Build the connectivity matrix $C_{F,k}$ from the clustering result above
3. Use the random predictor $\rho_k$ to assign a label to each pair in $P_F$
4. Build the connectivity matrix $C_{F,\rho_k}$ for all of $P_F$
5. Build q subsets of the full set of entity pairs by randomly choosing $\alpha N$ of the N original pairs ($0 \le \alpha \le 1$)
6. For each subset, perform the clustering as in steps 2, 3 and 4, yielding $C^\mu_{F,k}$ and $C^\mu_{F,\rho_k}$
7. Compute $M_{F,k}$ to evaluate the quality of k using formula (2.3)
8. Return $M_{F,k}$

Finally, the model selection algorithm for relation extraction is as follows:
Input: data set D with labelled entities $(E_1, E_2)$
Output: the feature subset and the number of relation types (model order)
Procedure:
1. Find all contexts of all entity pairs in D. Call this context set P
2. Rank the features according to formula (2.1)
3. Determine the range $(K_l, K_h)$: the possible number of relation clusters (lowest to highest)
4. Set the estimated number of relation types $k = K_l$
5. Select the features using the algorithm criterion(F, k, P, q)
6. Store $\hat{F}_k$, $k$ and the corresponding quality score $M_{F,k}$
7. If $k < K_h$, increase k by 1 and return to step 5; otherwise go to step 8
8. Choose the k and the feature subset $\hat{F}_k$ with the largest value among the scores $M_{F,k}$

2.2. Supervised learning for relation extraction
The problem of extracting the semantic relation between two entities can also be treated as a classification problem solved with machine learning. Instances of a relation are converted into a set of features $f_1, f_2, \dots, f_N$, forming an N-dimensional feature vector. During learning, classification algorithms are applied to the input entities to determine their relation class, from which the possible relations are extracted.
According to G. Zhou and M. Zhang [32], the models can be divided into three main groups: methods based on generative models, methods based on kernel functions (tree kernels) and feature-based methods.
2.2.1. The link grammar method
This method was proposed by researchers at the Max Planck Institute in 2006. In principle, it can extract any relation. The system was evaluated on 3 relations: birthdate, synonymy and instanceOf.
The method relies on several basic concepts of link grammar [12, 40]:
A linkage is an undirected planar graph in which:
- The nodes of the graph are the words of the sentence.
- The edges between nodes are called links.
- The labels of the edges are called connectors and are taken from a finite set of symbols.

A link grammar is a set of rules specifying which connector links a word to the word after or before it: <word - connectors> or <connectors - word>. For example, the word "was" in Figure 1 has <subj_link - was> and <was - compl_link>.
Every linkage of a sentence is generated by the link grammar.

Figure 1: Example of a linkage (1)

Figure 2: Example of a linkage (2)
A linkage expresses a relation R if the sentence it describes contains a pair of entities that stand in relation R. For example, Figure 2 expresses the possession relation: "London has an airport".
A pattern is a linkage in which two words (or phrases) have been replaced by placeholders. For example, replacing "Chopin" by "X" and "composers" by "Y" in Figure 1 gives the pattern shown in Figure 3.

Figure 3: Example of a pattern


Figure 4: Example of an entity pair produced by pattern matching

The (unique) shortest path from one placeholder to the other is called a bridge (the bold path in Figure 3). The bridge does not include the placeholders themselves.
A pattern is said to match a linkage if the bridge of the pattern appears in the linkage (nouns and adjectives are allowed to differ).
When a pattern matches a linkage, we say the pattern produces a pair of words (or phrases). The pair occupies the positions in the linkage that correspond to the placeholders in the pattern. For example, the pair "Mozart" and "composers" appears in a linkage at the positions of the placeholders "X" and "Y" of the pattern in Figure 4. We say the pattern produces the entity pair <Mozart - composers>.
To carry out the learning, Fabian M. Suchanek et al. [15] classify word pairs into the following three categories:
- A pair can be an example for the target relation. For instance, for the birthdate relation the examples are a list of people's names and their dates of birth:
<Frederic Chopin - 1810>
<Wolfgang Amadeus Mozart - 1756>
- A pair can be a counterexample, that is, a pair that cannot stand in the relation. For instance, for the birthdate relation counterexamples can be inferred from examples: if <Chopin - 1810> is an example, then <Chopin - 2000> is clearly a counterexample.
- A pair can be a candidate for the target relation. For instance, for the birthdate relation only pairs of the form <person name - date> can be candidates.
- A pair may belong to none of the three categories above.
Based on these concepts, the relation extraction system consists of 3 main processing phases:
Phase 1: Discovery phase: identify the patterns that express the target relation
- In all sentences, find the linkages in which the example pairs appear.
- Replace these pairs by placeholders to create patterns. The patterns obtained at this point are called positive patterns.
Example: given the sentence "Chopin was born in 1810", the pattern "X was born in Y" is generated.
- Go through the sentences once more and find all sentences whose linkage matches a positive pattern while the entity pair produced by the match is a counterexample. Replacing these pairs by placeholders gives patterns called negative patterns.
Example: on the second pass, the sentence "Chopin was born in 2000" is found, whose pair <X - Y> is the counterexample <Chopin - 2000>, so the pattern "X was born in Y" obtained from it is added to the set of negative patterns.
Phase 2: Training phase: learn to recognize positive patterns with a machine learning model
- A statistical learning model is applied to learn the concept of a positive pattern from the sets of positive and negative patterns.
- The result of this phase is a classifier that labels a pattern as positive or negative.
- The classification algorithm used is k-nearest neighbours (kNN) or SVM.
Phase 3: Testing phase:
- For each linkage, create all possible patterns by replacing the corresponding pair of words (or phrases) by placeholders.
- If the pair has the form of a candidate and the pattern is classified as positive, the pair is accepted as a new element of the target relation.
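
The discovery phase can be illustrated with a deliberately simplified Python sketch. It is not the method of [15]: real patterns are link grammar linkages matched through their bridge, whereas here a sentence is just a whitespace-split token sequence, entities are assumed to be single tokens, and a pattern matches only a sentence of the same length. All function names are hypothetical.

```python
def make_pattern(tokens, x, y):
    """Replace entity tokens x and y by placeholders; None if either is absent."""
    if x not in tokens or y not in tokens:
        return None
    return tuple("X" if t == x else "Y" if t == y else t for t in tokens)


def match(pattern, tokens):
    """Return the (X, Y) pair produced when the pattern matches, else None."""
    if len(pattern) != len(tokens):
        return None
    x = y = None
    for p, t in zip(pattern, tokens):
        if p == "X":
            x = t
        elif p == "Y":
            y = t
        elif p != t:
            return None
    return (x, y)


def discover(sentences, examples, counterexamples):
    """Collect positive patterns, then the patterns that also yield counterexamples."""
    positive, negative = set(), set()
    for sentence in sentences:
        tokens = sentence.split()
        for x, y in examples:
            pattern = make_pattern(tokens, x, y)
            if pattern is not None:
                positive.add(pattern)
    # Second pass: a match that produces a counterexample marks the pattern negative.
    for sentence in sentences:
        tokens = sentence.split()
        for pattern in positive:
            if match(pattern, tokens) in counterexamples:
                negative.add(pattern)
    return positive, negative


sentences = ["Chopin was born in 1810", "Chopin was born in 2000"]
positive, negative = discover(sentences, {("Chopin", "1810")}, {("Chopin", "2000")})
```

In this toy run the pattern ("X", "was", "born", "in", "Y") ends up in both sets, mirroring the two examples in the text; deciding which patterns are reliable is what the training phase is for.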
2.2.2. Feature-based extraction methods
In this approach, a feature vector expressing the semantic relation between two entities M1 and M2 is derived from the context surrounding them. According to Abdulrahman Almuhareb [4], feature vectors fall into two main types: features based on the words neighbouring M1 and M2, and features based on the grammatical relation between M1 and M2. This thesis focuses on the second type.
In this type, the order in which the entities appear is also distinguished; for example, M1 Parent-Of M2 differs from M2 Parent-Of M1. For each entity pair, lexical, syntactic and semantic information is used as the features that represent the relation.
G. Zhou and M. Zhang [32] list 8 kinds of features commonly used in this approach:
Word features: depending on the position of the word, they are divided into 4 groups:
- Words of M1 and M2: among these, the head word is considered more important and carries more information. The head word of M1 (M2) is the last word of the phrase expressing M1 (M2). When the phrase contains a preposition, the head word is the last word before the preposition. For example, for the M1 phrase "University of Michigan", the head word is "University".
- Words between M1 and M2, divided into 3 groups:
o The first word in between
o The last word in between
o The remaining words
- Words before M1 and words after M2: only the two words immediately before M1 and immediately after M2 are considered, divided into 2 groups:
o The first word before M1 and the first word after M2
o The second word before M1 and the second word after M2
The word features therefore consist of:
- WM1: the set of words in M1
- HM1: the head word of M1
- WM2: the set of words in M2
- HM2: the head word of M2
- HM12: the combination of HM1 and HM2
- WBNULL: when no word lies in between
- WBFL: the only word in between, when exactly one word lies in between
- WBF: the first word in between, when at least two words lie between M1 and M2
- WBL: the last word in between, when at least two words lie between M1 and M2
- WBO: the words between M1 and M2 other than the first and the last
- BM1#1: the first word before M1
- BM1#2: the second word before M1
- AM2#1: the first word after M2
- AM2#2: the second word after M2
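The word features listed above can be sketched as follows. This is only an illustration: the function names, the token-span representation of M1 and M2, and the tiny preposition list are assumptions, not part of the cited method.

```python
# Assumption: a tiny illustrative preposition list; a real system would use POS tags.
PREPOSITIONS = {"of", "in", "at", "on", "for"}

def head_word(mention):
    """Last word of the mention, or the last word before the first preposition."""
    for i, token in enumerate(mention):
        if i > 0 and token.lower() in PREPOSITIONS:
            return mention[i - 1]
    return mention[-1]

def word_features(tokens, m1, m2):
    """tokens: list of words; m1, m2: (start, end) spans, end exclusive, M1 before M2."""
    w1, w2 = tokens[m1[0]:m1[1]], tokens[m2[0]:m2[1]]
    between = tokens[m1[1]:m2[0]]
    feats = {"WM1": w1, "HM1": head_word(w1), "WM2": w2, "HM2": head_word(w2)}
    feats["HM12"] = feats["HM1"] + "_" + feats["HM2"]
    if not between:
        feats["WBNULL"] = True
    elif len(between) == 1:
        feats["WBFL"] = between[0]
    else:
        feats["WBF"], feats["WBL"] = between[0], between[-1]
        feats["WBO"] = between[1:-1]
    before, after = tokens[:m1[0]], tokens[m2[1]:]
    if len(before) >= 1:
        feats["BM1#1"] = before[-1]
    if len(before) >= 2:
        feats["BM1#2"] = before[-2]
    if len(after) >= 1:
        feats["AM2#1"] = after[0]
    if len(after) >= 2:
        feats["AM2#2"] = after[1]
    return feats
```

For the tokens of "The University of Michigan is located in Ann Arbor today", with M1 = "University of Michigan" and M2 = "Ann Arbor", the sketch yields HM1 = "University", WBF = "is" and WBL = "in".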
Entity type features: five entity types are considered: PERSON, ORGANIZATION, COMPANY, LOCATION and GPE. This feature has the following attributes:
- ET12: the entity types of M1 and M2
- EST12: the entity subtypes of M1 and M2
- EC12: the entity classes of M1 and M2

Mention level features: these describe the mention under consideration; for example, M1 or M2 may be a NAME, a NOMINAL or a PRONOUN. This feature has two attributes:
- ML12: the combination of the mention levels of M1 and M2
- MT12: the combination of the LDC mention types of M1 and M2
Overlap features: the attributes of this feature are
- #MB: the number of mentions in between
- #WB: the number of words in between
- M1 > M2 or M1 < M2: whether M2 is contained in M1 or M1 is contained in M2
On their own, the overlap features above are usually too general to be effective. They are therefore combined with other attributes:
- ET12 (or EST12) + M1 > M2
- ET12 (or EST12) + M1 < M2
- HM12 + M1 > M2
- HM12 + M1 < M2
Phrase-based features: these are considered critical in relation extraction. Other methods take this information from the full parse tree, whereas this method separates phrase chunking from full parsing. Here, the phrases are extracted from the parse tree. Most phrase features concern the head words of the phrases between M1 and M2. As with the word features, phrase features are divided into 3 groups:
- Phrase heads between M1 and M2, with 3 subgroups:
o The first phrase between M1 and M2
o The last phrase between M1 and M2
o The phrases in between
- Phrase heads before M1, covering 2 phrases:
o The first phrase before M1
o The second phrase before M1
- Phrase heads after M2, covering 2 phrases:
o The first phrase after M2
o The second phrase after M2
This feature is therefore represented by the following 11 attributes:
- CPHBNULL: no phrase lies between M1 and M2
- CPHBFL: the only phrase head, when exactly one phrase lies in between
- CPHBF: the first phrase head in between, when at least two phrases lie between M1 and M2
- CPHBL: the last phrase head in between, when at least two phrases lie between M1 and M2
- CPHBO: the other phrase heads between M1 and M2 (excluding CPHBF and CPHBL)
- CPHBM1#1: the first phrase head before M1
- CPHBM1#2: the second phrase head before M1
- CPHAM2#1: the first phrase head after M2
- CPHAM2#2: the second phrase head after M2
- CPP: the path of phrase labels connecting M1 to M2
- CPPH: the path of phrase labels connecting M1 to M2, counting only phrase heads (when at least 2 phrases lie in between)
Dependency tree features: these include the words, part-of-speech tags and phrase labels of M1 and M2 based on the dependency tree, which is extracted from the full parse tree. The dependency tree is built from the phrase head information of the Collins parser by linking every constituent of a phrase to the head of that phrase. Flags indicate whether M1 and M2 belong to the same noun phrase, verb phrase or prepositional phrase. The attributes of this feature are:
- ET1DW1: the combination of the entity type and the dependent word of M1
- H1DW1: the combination of the head word and the dependent word of M1
- ET2DW2: the combination of the entity type and the dependent word of M2
- H2DW2: the combination of the head word and the dependent word of M2
- ET12SameNP: ET12 combined with whether M1 and M2 are in the same noun phrase.
- ET12SamePP: ET12 combined with whether M1 and M2 are in the same prepositional phrase.
- ET12SameVP: ET12 combined with whether M1 and M2 are in the same verb phrase.
Parse tree features: these represent information obtained from the full parse tree, with the attributes:
- PTP: the path of phrase labels (with duplicates removed) connecting M1 and M2 in the parse tree
- PTPH: the path of phrase labels (with duplicates removed) connecting M1 and M2 in the parse tree, counting only phrase heads
Features from semantically rich resources: semantic information from resources such as WordNet is used to classify important words into different semantic lists corresponding to the relations of interest. This information is very useful for handling sparse data in relation extraction. The resources include:
- Country name list: contains country names together with their provinces and cities. Two attributes represent this feature:
o ET1Country: the entity type of M1 when M2 is a country name
o CountryET2: the entity type of M2 when M1 is a country name
- List of words expressing family relations: covers 6 kinds of relation: parent, grandparent, spouse, sibling, other family relations, and other relations. Two attributes represent this information:
o ET1SC2: the combination of the entity type of M1 and the semantic class of M2 when M2 is a subtype of social relation
o SC1ET2: the combination of the entity type of M2 and the semantic class of M1 when the first argument is a kind of family relation
Nanda Kambhatla [21] trained a maximum entropy model on features obtained from feature streams such as those described above to perform relation extraction.

Figure 5: Example of a parse tree


Figure 6: Features obtained from the parse tree
2.2.3. Kernel-based extraction methods
This approach resembles the feature-based approach in that it also represents a relation as a feature vector. The fundamental difference is that it focuses on how to construct an effective kernel function for SVM classification, rather than on which features to select.
Razvan C. Bunescu and Raymond J. Mooney [8] proposed a relation extraction method based on the observation that the information expressing the relation between two named entities in the same sentence is captured by the shortest path between the two entities in the dependency graph [35].
The method rests on two assumptions:
- The relations to be extracted hold between entities located in the same sentence.
- The presence or absence of a relation is independent of the text preceding and following the sentence under consideration.
This means that only relations described within the sentence containing the two entities of interest are extracted.
Furthermore, a sentence is viewed as a dependency graph whose nodes correspond to the words of the sentence and whose directed edges connect two words that depend on each other by grammatical function: an adjective modifying a noun in a noun phrase (several → stations), a compound noun (pumping → stations), or an adverb modifying a verb (recently → raided), as in the example in Figure 7.


Figure 7: Illustration of a dependency graph
On the undirected graph obtained from this dependency graph, we find the shortest path between the two entities. Some example shortest paths are shown in Table 2-1.
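Finding the shortest path can be sketched with a breadth-first search over the undirected version of the dependency graph. The function name `shortest_path` and the hand-written edge list for "Protesters seized several pumping stations" are assumptions made for illustration; in practice the edges come from a dependency parser.

```python
from collections import deque

def shortest_path(edges, source, target):
    """Breadth-first search; edges are (dependent, head) pairs treated as undirected."""
    graph = {}
    for dependent, head in edges:
        graph.setdefault(dependent, set()).add(head)
        graph.setdefault(head, set()).add(dependent)
    queue, parent = deque([source]), {source: None}
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:      # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None                          # the two words are not connected

# Hypothetical dependency edges for the example sentence.
EDGES = [("protesters", "seized"), ("stations", "seized"),
         ("several", "stations"), ("pumping", "stations")]
```

Calling `shortest_path(EDGES, "protesters", "stations")` returns the three-word path through "seized".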
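Table 2-1: Shortest paths

This path is the most compact representation of the relation between two entities. A dependency path is represented as a sequence of words. Using part-of-speech information and entity types, a feature vector is generated for each dependency path. For example, for the path protesters → seized ← stations in Table 2-1, we have:

$$
\begin{bmatrix}\text{protesters}\\ \text{NNS}\\ \text{Noun}\\ \text{PERSON}\end{bmatrix}
\times [\rightarrow] \times
\begin{bmatrix}\text{seized}\\ \text{VBD}\\ \text{Verb}\end{bmatrix}
\times [\leftarrow] \times
\begin{bmatrix}\text{stations}\\ \text{NNS}\\ \text{Noun}\\ \text{FACILITY}\end{bmatrix}
$$

This gives a total of 48 (= 4 x 1 x 3 x 1 x 4) features for this path, for example:
Table 2-2: Some features obtained from the dependency path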
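The count of 48 can be checked by enumerating the Cartesian product of the attribute sets along the path. The variable names and the string form of each feature are assumptions made for this sketch.

```python
from itertools import product

# One attribute list per position on the path protesters -> seized <- stations.
PATH = [
    ["protesters", "NNS", "Noun", "PERSON"],
    ["->"],
    ["seized", "VBD", "Verb"],
    ["<-"],
    ["stations", "NNS", "Noun", "FACILITY"],
]

# Each feature picks exactly one attribute per position.
FEATURES = [" ".join(choice) for choice in product(*PATH)]
```

Each element of `FEATURES` is one generalisation of the path, such as "PERSON -> seized <- FACILITY".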

The kernel proposed by Razvan C. Bunescu and Raymond J. Mooney [7] is defined as follows.
Let x = x_1 x_2 ... x_m and y = y_1 y_2 ... y_n be two relation instances, where x_i denotes the set of attributes of the word at position i in the relation instance. The kernel is the number of features that x and y have in common, computed as:

$$
K(x, y) =
\begin{cases}
0, & m \neq n \\
\prod_{i=1}^{n} c(x_i, y_i), & m = n
\end{cases}
$$

where $c(x_i, y_i) = |x_i \cap y_i|$ is the number of attributes shared by x and y at position i.
Example: consider two instances of the LOCATED relation:
1. "his actions in Brcko", and
2. "his arrival in Beijing".
The corresponding dependency paths are:
1. his → actions ← in ← Brcko
2. his → arrival ← in ← Beijing
In this case:
x = [x_1 x_2 x_3 x_4 x_5 x_6 x_7], where x_1 = {his, PRP, PERSON}, x_2 = {→}, x_3 = {actions, NNS, Noun}, x_4 = {←}, x_5 = {in, IN}, x_6 = {←}, x_7 = {Brcko, NNP, Noun, LOCATION}
y = [y_1 y_2 y_3 y_4 y_5 y_6 y_7], where y_1 = {his, PRP, PERSON}, y_2 = {→}, y_3 = {arrival, NN, Noun}, y_4 = {←}, y_5 = {in, IN}, y_6 = {←}, y_7 = {Beijing, NNP, Noun, LOCATION}
By the formula above, K(x, y) = 3 x 1 x 1 x 1 x 2 x 1 x 3 = 18.
An SVM with this kernel is then used to classify relations, and from the classification the relations of interest are extracted.
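The kernel value in the example can be reproduced with a few lines of code. The function name `path_kernel` and the representation of each position as a Python set are assumptions of this sketch; `math.prod` requires Python 3.8 or later.

```python
from math import prod

def path_kernel(x, y):
    """Product over positions of the number of shared attributes; 0 if lengths differ."""
    if len(x) != len(y):
        return 0
    return prod(len(xi & yi) for xi, yi in zip(x, y))

X = [{"his", "PRP", "PERSON"}, {"->"}, {"actions", "NNS", "Noun"}, {"<-"},
     {"in", "IN"}, {"<-"}, {"Brcko", "NNP", "Noun", "LOCATION"}]
Y = [{"his", "PRP", "PERSON"}, {"->"}, {"arrival", "NN", "Noun"}, {"<-"},
     {"in", "IN"}, {"<-"}, {"Beijing", "NNP", "Noun", "LOCATION"}]
```

Note that a single position with no shared attribute makes the whole product zero, so two paths are similar only if they agree at every position.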
2.3. Semi-supervised learning for relation extraction
2.3.1. The DIPRE method
In 1998 [7][1], Brin introduced DIPRE, a semi-supervised method for extracting semantic relation patterns. The method was tested on the author - book relation, starting from an initial data set of about 5 examples of this relation. DIPRE expanded this initial set into a list of about 15,000 books.
DIPRE is described as follows:
Input: a set of sample relation pairs S = {<A_i, B_i>}. In the case above, the sample set is S = {<author_i, book_i>}. This set is called the seed set.
Output: the set R of extracted relation pairs.
Processing:
- The target relation set R is initialized from the seed set S.
- Find all sentences that contain the components of the initial seeds.
- From the sentences found, derive relation patterns between the components of the initial seeds. Brin defined the initial pattern very simply: keep about m characters before the first component, called the prefix; keep n characters after the second component, called the suffix; and keep the k characters between the two components, called the middle. A relation pattern has the form [order, author, book, prefix, suffix, middle], where order records the order in which author and book appear in the sentence (order = 1 if author precedes book, and 0 otherwise).
- From the patterns that have not yet been labelled, a new set of seeds <A, B> is obtained; these new seeds are added to the seed set of the relation.
- Return to step 2 to find new seeds and patterns, until the set ...
Illustrative example for the author - book relation above:
Input:
- The initial seed set S = {<Arthur Conan Doyle, The Adventures of Sherlock Holmes>}.
- A set of documents containing the initial seeds
Processing:
- The target relation R is set to S
- Determine the relation patterns.
A relation pattern has the form [order, author, book, prefix, suffix, middle].
From the document set, we collect the sentences that contain the initial seeds. From these sentences, relation patterns are extracted (as in Figure 8).
This yields a set of patterns:
[0, Arthur Conan Doyle, The Adventures of Sherlock Holmes, Read, online or, by]
[1, Arthur Conan Doyle, The Adventures of Sherlock Holmes, now that Sir, in 1892, wrote]


Figure 8: The extracted sample relations
Once this pattern set is available, we match the middle, prefix and suffix components of the patterns against each other in order to group them and discard duplicates. This gives representative patterns for each group, of the form:
[most common prefix words, author, middle, book, most common suffix words]
The extracted pattern is:
[sir, Arthur Conan Doyle, wrote, The Adventures of Sherlock Holmes, in 1892]
- Generating new seeds.
From the complete patterns, we consider patterns with some components left open, for example: [Sir, ???, wrote, ???, in 1892].
Such patterns are used to search other documents: "Sir Arthur Conan Doyle wrote Speckled Band in 1892, that is around 662 years apart which would make the stories ..."
From the sentences found, new seeds can be extracted: (Arthur Conan Doyle, Speckled Band)
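The seed-generation step can be sketched by turning an open pattern such as [Sir, ???, wrote, ???, in 1892] into a regular expression. The function names are hypothetical, and the sketch assumes that the open slots are separated from the fixed parts by whitespace; Brin's original system uses more restrictive slot expressions.

```python
import re

def pattern_to_regex(prefix, middle, suffix):
    """Compile [prefix, ???, middle, ???, suffix] into a regex with two named slots."""
    return re.compile(
        re.escape(prefix) + r"\s+(?P<author>.+?)\s+" + re.escape(middle)
        + r"\s+(?P<book>.+?)\s+" + re.escape(suffix)
    )

def find_seeds(pattern, text):
    """Return every (author, book) pair that the pattern captures in the text."""
    return [(m.group("author"), m.group("book")) for m in pattern.finditer(text)]

TEXT = ("Sir Arthur Conan Doyle wrote Speckled Band in 1892, "
        "that is around 662 years apart")
```

Applying `find_seeds(pattern_to_regex("Sir", "wrote", "in 1892"), TEXT)` recovers the new seed from the example sentence.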
The method is highly effective on HTML data, both for identifying patterns and for generating new seeds. Building on the idea of DIPRE, Agichtein and Gravano proposed Snowball [14] in 2000. Snowball works on unstructured data, introduces a confidence measure for the generated relation patterns and the newly generated seeds, and adds named entity recognition. This method is presented in more detail in the next section.
2.3.2. The Snowball method
Snowball [14][1] is a relation extraction system in which the quality of the newly generated patterns and seeds is evaluated during processing. The algorithm was tested on the organization - location relation, with initial seeds such as: Microsoft - Redmond, IBM - Armonk, Boeing - Seattle, Intel - Santa Clara.

Figure 9: Architecture of the Snowball system
The basic architecture of Snowball is illustrated in Figure 9 and described as follows:
Input:
- A document collection D (the training set).
- An initial seed set S = {<A_i, B_i>} consisting of sample relation pairs, for example the <organization - location> pairs presented above.
Output: the set of extracted relations
Processing:
Step 1: Find the occurrences of the relation pairs in the data
- For each seed <A_i, B_i>, find the sentences that contain both A_i and B_i. The system analyses, filters and extracts patterns. As in DIPRE, when a sentence matches the expression * A_i * B_i *, the phrase before A_i is called the prefix, the phrase between A_i and B_i the middle, and the phrase after B_i the suffix.
Step 2: Find the occurrences of the entities in the data
- Snowball clusters the patterns using the Match function, which estimates the similarity between patterns, together with a similarity threshold t_sim for grouping. This reduces the number of patterns and makes them more general.
- Let (prefix1, middle1, suffix1) and (prefix2, middle2, suffix2) be the context vectors of pattern1 and pattern2. The similarity Match(pattern1, pattern2) is defined as:
Match(pattern1, pattern2) = (prefix1 · prefix2) + (suffix1 · suffix2) + (middle1 · middle2)
- The patterns found are checked against the original data to see whether they produce new seeds <A, B>. A new pair <A, B> falls into one of the following cases:
o Positive: <A, B> is in the seed list
o Negative: exactly one of A and B appears in the seed list.
o Unknown: neither A nor B appears in the seed list. The Unknown set is regarded as the set of new seeds for the next iteration.
Step 3: Generate new patterns
- Snowball computes the precision of each pattern from its Positive and Negative counts and selects the top N patterns with the highest scores. The confidence of a pattern P is computed as:

$$
\mathrm{belief}(P) = \frac{P.\mathrm{positive}}{P.\mathrm{positive} + P.\mathrm{negative}}
$$

Step 4: Find new seeds for the next iteration
- The pairs produced by each pattern in the selected top-N list form the new seed set and are fed into the next iteration.
- As with patterns, these pairs are scored as follows, where P_0, ..., P_{|P|} are the patterns that generated the pair T:

$$
\mathrm{conf}(T) = 1 - \prod_{i=0}^{|P|} \bigl(1 - \mathrm{belief}(P_i)\bigr)
$$

- The system selects the M best-scored pairs, which are used as seeds for the next round of pattern selection. The system then returns to step 1. The process repeats until no new pair is found or a preset number of iterations is reached.
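The scoring functions of Snowball described above can be sketched as follows. The function names and the representation of context vectors as term-weight dictionaries are assumptions; the sketch also treats a pair whose two parts both occur in the seed list, but not together, as negative, a case the description above leaves open. `math.prod` requires Python 3.8 or later.

```python
from math import prod

def dot(u, v):
    """Inner product of two sparse term-weight vectors stored as dicts."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

def match(p1, p2):
    """Similarity of two patterns given as dicts with prefix, middle and suffix vectors."""
    return sum(dot(p1[part], p2[part]) for part in ("prefix", "suffix", "middle"))

def label_pair(pair, seeds):
    """Classify a generated pair as positive, negative or unknown against the seeds."""
    a, b = pair
    if pair in seeds:
        return "positive"
    if any(a == seed_a or b == seed_b for seed_a, seed_b in seeds):
        return "negative"
    return "unknown"

def belief(positive, negative):
    """Confidence of a pattern from its positive and negative match counts."""
    total = positive + negative
    return positive / total if total else 0.0

def conf(pattern_beliefs):
    """Confidence of a pair from the beliefs of the patterns that produced it."""
    return 1.0 - prod(1.0 - b for b in pattern_beliefs)
```

A pair supported by two patterns with belief 0.5 each receives confidence 0.75, so agreement between independent patterns raises the score of a candidate seed.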
2.4. Remarks
Unsupervised, supervised and semi-supervised learning each have their own strengths and weaknesses. According to Valpola [31], supervised learning gives very good extraction quality on specific data domains, but building the data set is very costly, which makes it hard to extend to new application domains. Unsupervised learning can learn from larger amounts of data and is fast, but its learning model is more complex than that of supervised learning. Semi-supervised learning, in turn, is regarded as a good way to reduce the cost and the resources needed for construction. Which method to choose depends on the application domain and the characteristics of the problem.
In Vietnam, research results and essential products for Vietnamese text processing have become available [2, 38]. They make it possible to apply more processing techniques to semantic relation extraction, such as word segmentation, part-of-speech tagging and, in particular, syntactic parse trees. Moreover, based on a synthesis of recent research results, G. Zhou and M. Zhang [32] affirm that the feature-based approach achieves better results.
These are the reasons why this thesis proposes a relation extraction model that relies on parse trees and follows the feature-based approach.

Summary of Chapter 2
This chapter gave an overview of the methods for solving the relation extraction problem, pointed out their strengths and weaknesses, and explained why the feature-based method was chosen. The relation extraction model of this thesis is presented in detail in the next chapter.

Chapter 3. A relation extraction model on Vietnamese Wikipedia based on parse trees
Based on the analysis of the strengths and weaknesses of relation extraction methods, this thesis chooses a supervised, feature-based approach. The features of a relation are taken from the Vietnamese parse tree and then fed into a classifier that uses the SVM algorithm. In addition, to reduce the effort of building the training set, the characteristics of the data on Vietnamese Wikipedia are exploited. This chapter therefore presents the characteristics of Wikipedia, the Vietnamese parse tree, and the proposed model for relation extraction on Wikipedia.
3.1. Characteristics of Wikipedia
Wikipedia, Wiki for short (pronounced "Uy-ki"; from the Hawaiian word wikiwiki, meaning "quick"; also called an open work), is a kind of application for building and managing information pages developed collaboratively by many people, launched in 2001 by Jimmy Wales and Larry Sanger [24]. Wiki is built on a decentralized principle: anyone can edit, add or supplement information on the pages, and the page itself does not show who contributed the information. It is regarded as the largest and most popular reference encyclopedia on the Internet today [23].
Thanks to its semantically rich representation of information, visible in its data formatting templates, the links between Wiki page entities and the categorization of Wiki pages, Wikipedia has attracted particular interest in data mining and natural language processing [5, 6, 13, 16, 19, 23].
3.1.1. Entities in Wikipedia
On Wiki, an entity is usually linked to a Wiki page that describes it (sometimes called a Wiki page entity). When an entity is created on Wiki, the author creates a link between the entity and the Wiki page describing it; likewise, for every entity that appears in this page, a link to the Wiki page describing that entity is created. This is an important characteristic of Wiki that makes entities easy to identify. The following example is taken from the Wiki page of Trường Đại học Công nghệ, Đại học Quốc gia Hà Nội, and contains links to the entities Đại học Quốc gia Hà Nội and Nguyễn Văn Hiệu:
"Trường Đại học Công nghệ (English name: University of Engineering and Technology, or UET) is a university belonging to Đại học Quốc gia Hà Nội, established by decision of the Prime Minister on 25 May 2004. It is a modern university model. Prof. Dr.Sc. Academician Nguyễn Văn Hiệu is the founding rector of the university."
3.1.2. Infobox
The infobox of a Wiki page is a table designed according to a fixed template defined by Wikipedia and placed in the top right corner of the page. It summarizes information about the page, usually facts and related statistics [33]. The content of the table is usually represented as <attribute - value> pairs [16]. Figure 12 shows the infobox of the Wiki page Trường Đại học Khoa học Tự nhiên. These tables allow information to be extracted accurately and quickly.
3.1.3. Categories
Wikipedia also provides categories, which let authors group pages and create links from pages to the corresponding categories. A page can link to several categories. Each category on Wikipedia has a unique name. A new category can be created by an author who follows Wiki's recommendations on creating a category and linking pages to it. Some important properties of Wikipedia categories are:
- A category can have several subcategories and several parent categories
- A category can contain very many pages, but some categories contain only a few pages.
- A page that belongs to a more specific category usually does not belong to the parent categories of that category. For example, the page Spain does not belong to the category "Người châu Âu" (European people)
- The subcategory relation is not always an is-a relation. For example, "Bản đồ Châu Âu" (Maps of Europe) is a subcategory of "Châu Âu" (Europe), but the two categories are not in an is-a relation
- The graph of categories contains cycles.

3.2. The Vietnamese parse tree
This section presents some basic concepts and components of the parse tree (1), which is the basis for representing the features of a relation.
(1) KC01.01/06-10: "Research and development of essential products for Vietnamese speech and text processing" (VLSP)
3.2.1. Syntactic parsing
Taking as input a sequence of tokens (the output of lexical analysis; in language processing these are usually words), syntactic parsing (or syntactic analysis) is the process of deriving the grammatical structure of the token sequence according to some grammar. The grammatical structure usually takes the form of a tree, because this form makes the dependencies between constituents easy to see. This tree is called the parse tree.

Figure 10: Example of a Vietnamese parse tree
3.2.2. Basic components of the Vietnamese parse tree
The structure of the parse tree is as follows:
- The root node represents the sentence type (declarative, interrogative, exclamatory, imperative)
- The leaf nodes represent the words of the sentence
- The parent of each leaf represents the part-of-speech tag of its child.
- The remaining intermediate nodes represent grammatical functions (noun phrase, verb phrase, complement ...)
Example: for the sentence "Trường Đại học Công nghệ được thành lập ngày 25 tháng 5 năm 2004." (Trường Đại học Công nghệ was established on 25 May 2004), syntactic parsing yields the parse tree in Figure 10. There are 14 part-of-speech tags, 5 phrase labels and 4 sentence labels, listed and described in the appendix.
3.3. A relation extraction model based on parse trees on Vietnamese Wikipedia
3.3.1. Problem statement
The relation extraction problem, stated by Roxana Girju [10] as in Chapter 1, can be rewritten for this setting as follows:
Input:
- A data set D: the set of web pages on Vietnamese Wikipedia
- A set of entities $E = \{e_i\},\ i = 1, \dots, n$, appearing in D
- A set of relation types $\Re = \{R_j\},\ j = 1, \dots, m$
Output:
- All relation triples $(e_{i_1}, R_j, e_{i_2})$ with $1 \le i_1, i_2 \le n$ and $1 \le j \le m$
3.3.2. Approach
Finding all relation triples $(e_{i_1}, R_j, e_{i_2})$ can be done by finding, for each relation $R_j \in \Re$, all entity pairs $(e_{i_1}, e_{i_2})$ that satisfy $R_j$. The problem thus becomes: find all instances of a given relation R. Under the assumption that each instance of a relation is described within a single sentence, the approach is as follows:
- Using the parse tree of the sentence, represent the instances of the relation as relation trees. Each relation tree corresponds to one feature vector.
- Treat each relation R as a set, or class, of relation trees. The label of this class is the name of the relation.
- Build a classifier for relation trees, and use it to extract the instances of the relation.
The relation extraction model is divided into 2 main phases: building the training set, and the application phase.
3.3.3. Building the training set
One of the drawbacks of supervised learning is the high cost of building the data set. Using the characteristics of Wikipedia, this thesis proposes a semi-automatic model for building the training set, which substantially reduces this cost. The model is described in Figure 11:

Figure 11: The process of building the training set
a. Extracting information from the infobox:
As described earlier, the information in an infobox is a structured representation. This makes it possible to extract instances of a relation automatically. Each <attribute - value> pair of the infobox gives a relation triple with the Wiki page entity, of the form <Wiki_page_entity - Attribute - Value>, together with the relation type <attribute> and the entity pair standing in that relation, <Wiki_page_entity - Value>. For example, in the case of Figure 12, we extract the following relation triple, relation type and entity pair:
<Trường Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội - Năm thành lập - 1993>
<Năm thành lập>
<Trường Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội - 1993>
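Turning infobox rows into relation triples can be sketched with the standard-library HTML parser. The class `InfoboxParser` and the function `infobox_triples` are hypothetical names; the sketch assumes each attribute sits in a `<th>` cell and its value in a `<td>` cell of the same row, and it joins values separated by `<br>` with "; ". It is an illustration, not the thesis's RE.Infobox implementation.

```python
from html.parser import HTMLParser

class InfoboxParser(HTMLParser):
    """Collect (attribute, value) pairs from the <th>/<td> cells of each table row."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._cell = None                    # "th" or "td" while inside a cell
        self._row = {"th": [], "td": []}

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"th": [], "td": []}
        elif tag in ("th", "td"):
            self._cell = tag
        elif tag == "br" and self._cell:
            self._row[self._cell].append("; ")   # separator for multi-valued cells

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None
        elif tag == "tr":
            attribute = "".join(self._row["th"]).strip()
            value = "".join(self._row["td"]).strip()
            if attribute and value:          # skip title rows and empty rows
                self.pairs.append((attribute, value))

    def handle_data(self, data):
        if self._cell:
            self._row[self._cell].append(data.strip())

def infobox_triples(page_entity, html):
    """Return <E1 - R - E2> triples for one Wiki page entity."""
    parser = InfoboxParser()
    parser.feed(html)
    return [(page_entity, attribute, value) for attribute, value in parser.pairs]
```

Rows without a `<th>` cell, such as the title row of the infobox, produce no triple.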
b. Searching on Wikipedia
The goal of this step is to find the sentences that contain all three components of the relation <E1 - R - E2>. Because the infobox summarizes the content of the page, sentences expressing the relation <E1 - R - E2> can almost always be found.

Infobox and the corresponding HTML code:

<table class="infobox" >
<tbody>
<tr>
<td><b>Trường Đại học Khoa học Tự nhiên,
Đại học Quốc gia Hà Nội</b><br></td>
</tr>
<tr>
<td colspan="2"></td>
</tr>
<tr>
<th>Tên gọi khác</th>
<td>Trường Đại học Đông Dương<br>
Trường Đại học Khoa học<br>
Trường Đại học Tổng hợp Hà Nội</td>
</tr>
<tr>
<th>Khẩu hiệu</th>
<td>Khẩu hiệu</td>
</tr>
<tr>
<th>Năm thành lập</th>
<td>1993</td>
</tr>
<tr>
<th>Loại hình</th>
<td>Trường Đại học công lập</td>
</tr>
<tr>
<th>Giám đốc</th>
<td>1</td>
</tr>
<tr>
<th>Hiệu trưởng</th>
<td>PGS.TS. Bùi Duy Cam </td>
</tr>
<tr>
<th>Hiệu phó</th>
<td>Nguyễn Hữu Dư<br>
Nguyễn Hoàng Lương<br>
Nguyễn Văn Nội</td>
</tr>
...
<tr>
<th>Email</th>
<td>dhkhtnhn@vnn.vn</td>
</tr>
<tr>
<th>Website</th>
<td>http://www.hus.edu.vn</td>
</tr>
</tbody>
</table>
Figure 12: Structured representation of the information in an infobox
After a set of sentences containing the corresponding relation triples <E1 - R - E2> has been extracted, the sentences are parsed, the tree representing the relation is located, and the corresponding feature vector is generated. These vectors are labelled by hand and used to train the SVM classifier, as described below.
3.3.4. The relation extraction system model
The relation extraction model consists of 3 main phases: preprocessing, feature vector generation, and recognition, as shown in the following figure:


Figure 13: The relation extraction model on Wikipedia
The processing in each phase is as follows:
3.3.4.1. Preprocessing phase
This phase takes as input a set of Wikipedia pages from an application domain of interest and produces a set of candidate sentences that may express relation R. Candidate sentences are sentences that contain a keyword expressing the relation R under consideration.
HTML tags are removed from each page in turn. While the tags are removed, the links to other Wiki page entities are marked.
Sentences are split using the JvnTextPro toolkit [43].
For instance, for the page entity Trường Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội and the relation "năm thành lập" (year of establishment), the candidate sentence found is:
"Trường Đại học Khoa học Tự nhiên thuộc Đại học Quốc gia Hà Nội được thành lập theo nghị định số 97/CP ngày 10/12/1993 của chính phủ."
These sentences are stored for use in the next phase.
3.3.4.2. Feature vector generation phase
This phase consists of 3 sub-steps:
a. Syntactic parsing
Using the Vietnamese sentence parser [38], we obtain the parse tree of each sentence collected in phase one.
b. Generating subtrees that represent relation R
The procedure is based on the following observations:
- Vietnamese sentences have a subject - predicate - complement structure: the subject usually comes first, followed by the predicate and finally the complement [4]. This corresponds to the subject - verb - object structure of English [34].
- In a sentence, the subject is usually a noun or a noun phrase.
- Entities and concepts are nouns or noun phrases
- From the subject - predicate - complement link, we obtain the link noun (phrase) - verb (phrase) - noun (phrase) in the parse tree.
A subtree (of the parse tree) that can represent relation R therefore has three central components: a central phrase expressing R (usually a verb phrase) and two noun phrases expressing the two entities. The procedure for generating these subtrees is:
Input: a parse tree containing the keywords k that express relation R
Output: all candidate subtrees that may express relation R
Processing:
i. Find the smallest node in the tree that contains the keyword k; call it node K
ii. Find all noun phrases NP that satisfy one of the following conditions [2]:
a. The NP branch has depth 1
b. The NP branch has depth 2 and consists of a front part, the head noun and a back part, where the back part is a branch whose label is neither PP (prepositional phrase) nor SBAR (clause)
c. The NP branch has depth 3 and consists only of the head noun followed by an NP of depth 2
d. Branches labelled QP are also considered, as noun phrases of quantity
iii. For each pair (NP_i, NP_j) obtained in step ii, use the parse tree to find the path from NP_i to NP_j that passes through K. This path gives a candidate subtree representing R.
For example, for the sentence "Trường Đại học Công nghệ (tên gọi tiếng Anh: ...) được Thủ tướng chính phủ quyết định thành lập ngày 25 tháng 5 năm 2004", we obtain a subtree representing R of the following form:

Figure 14: Subtree representing the relation thành_lập
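Step iii can be sketched as a path search between two nodes of a tree through their lowest common ancestor. The `Node` class and the function names are assumptions made for illustration, and the tiny tree in the usage note is hypothetical; the thesis's RE.GrammarTree package may be organized differently.

```python
class Node:
    """A parse tree node with a label, children and a parent pointer."""

    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def ancestors(node):
    """The node itself followed by its parent, grandparent, ..., up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def tree_path(a, b):
    """Nodes from a up to the lowest common ancestor and then down to b."""
    up, down = ancestors(a), ancestors(b)
    position_in_down = {id(n): k for k, n in enumerate(down)}
    for i, n in enumerate(up):
        if id(n) in position_in_down:
            j = position_in_down[id(n)]
            return up[:i + 1] + down[:j][::-1]
    return None                              # the nodes are in different trees

def candidate_subtree_path(np_i, np_j, key_node):
    """Keep the path only if it passes through the keyword node K."""
    path = tree_path(np_i, np_j)
    if path and any(n is key_node for n in path):
        return path
    return None
```

For a tree S(NP, VP(V, NP)) with K taken to be the VP node, the path between the two noun phrases is NP, S, VP, NP, and it is kept because it passes through K.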
c. Generating the feature vector
Each subtree above corresponds to one feature vector. The feature vector consists of the following 5 features:
- Central label phrase: the labelled phrase whose content expresses relation R. In Figure 14, this is the VP (the red label)
- Label phrase of E_1: the labelled phrase whose content expresses entity E1. Example: the leftmost NP
- Label phrase of E_2: the labelled phrase whose content expresses entity E2. Example: the rightmost NP
- Label path of E_i: the path from the label phrase expressing E_i to the central label phrase. In the example above, the label paths of E_1 and E_2 are NP -> NP -> VP -> NP -> VP and NP -> VP respectively. This feature has 2 attributes:
o The number of intermediate nodes on the way from the node expressing entity E_i to the central node
o The average length of the path (the average weight of the intermediate nodes on the path from entity E_i to the central node)
- The weight of a node is defined as follows:
o A leaf node has weight 1
o Any other node has a weight equal to the sum of the weights of its children
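The node weight and the two path attributes can be sketched as follows. The tuple representation `(label, children)` and the function names are assumptions of this sketch; with these definitions, the weight of a node equals the number of leaves (words) it covers.

```python
def weight(tree):
    """Weight of a node given as (label, children): 1 for a leaf, else the sum over children."""
    _label, children = tree
    if not children:
        return 1
    return sum(weight(child) for child in children)

def path_attributes(intermediate_nodes):
    """Return (number of intermediate nodes, average weight of those nodes)."""
    count = len(intermediate_nodes)
    if count == 0:
        return 0, 0.0
    return count, sum(weight(node) for node in intermediate_nodes) / count

# Hypothetical fragment: a verb phrase covering three words.
NP = ("NP", [("N", [("ngày", [])]), ("M", [("25", [])])])
VP = ("VP", [("V", [("thành_lập", [])]), NP])
```

Here the NP covers two words and the VP three, so a path with these two intermediate nodes has the attributes (2, 2.5).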
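A feature vector therefore has 7 attributes, described in detail in the following table:
Table 3-1: Attributes of the feature vector

| No. | Group | Value | Meaning |
|---|---|---|---|
| 1 | Central label phrase | [0,1] | Likelihood that the label expresses the relation sought. The higher the value, the higher the likelihood. |
| 2 | Label phrase of E1 | [0,1] | Likelihood that the label expresses a correct entity. The higher the value, the higher the likelihood. |
| 3 | Label phrase of E2 | [0,1] | Likelihood that the label expresses a correct entity. The higher the value, the higher the likelihood. |
| 4 | Label path of E1 | Number of intermediate labels on the way from the label expressing entity E1 to the central label | Relatedness of the entity to the relation, reflected in the distance and the composition of the intermediate labels. The larger the value, the weaker the relatedness. |
| 5 | Label path of E1 | Average length of the path (the average weight of the intermediate nodes on the path from entity E1 to the central node) | Same as attribute 4. |
| 6 | Label path of E2 | Number of intermediate labels on the way from the label expressing entity E2 to the central label | Relatedness of the entity to the relation, reflected in the distance and the composition of the intermediate labels. The larger the value, the weaker the relatedness. |
| 7 | Label path of E2 | Average length of the path (the average weight of the intermediate nodes on the path from entity E2 to the central node) | Same as attribute 6. |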
3.3.4.3. Recognition phase
Recognizing the feature vectors becomes a binary classification task that uses the trained SVM model.
As described in the step on building the training set, the sentences in the training data are parsed, the subtrees representing relation R are generated, and the corresponding feature vectors are produced as in the steps above. The vectors are then labelled by hand: if the generated subtree really expresses relation R, the corresponding vector is labelled c1; otherwise it is labelled c0. Training an SVM model on this training set gives the SVM classifier for relation R.
The feature vectors of the candidate subtrees are classified by this classifier. The candidate subtrees whose vectors receive the label c1 are accepted, and the relation obtained from such a subtree is an answer to the problem.
Summary of Chapter 3
Based on an analysis of the characteristics of Vietnamese Wikipedia data and of the Vietnamese parse tree, this chapter proposed a scheme for building the training set semi-automatically and a relation extraction model based on supervised learning. The experimental results in the next chapter show that the model is feasible.

Chapter 4. Experiments and evaluation
4.1. Experimental environment
4.1.1. Hardware configuration
Table 4-1: Hardware configuration

| Component | Specification |
|---|---|
| CPU | Intel Core 2 Duo 2.0 GHz |
| RAM | 2 GB |
| HDD | 160 GB |
| OS | Windows 7 Professional 32-bit |

4.1.2. Software tools
The system uses the following tools:
Table 4-2: List of software used

| No. | Software | Author | Source |
|---|---|---|---|
| 1 | eclipse-SDK-3.4.0-win32 | | http://www.eclipse.org/downloads |
| 2 | ColtechParser | Nguyễn Phương Thái | |
| 3 | JvnTextPro | Nguyễn Cẩm Tú | |
| 4 | weka-3-6-2 | | http://prdownloads.sourceforge.net/weka/weka-3-6-2.exe |
| 5 | LibSVM | Chih-Chung Chang and Chih-Jen Lin | http://www.csie.ntu.edu.tw/~cjlin/libsvm/ |

4.2. Experimental data
The experimental data consist of more than 4000 Vietnamese Wiki pages taken from [37]. Of these, 300 Wiki pages belong to the domain of universities and colleges nationwide.
4.3. Experiments
4.3.1. Implementation
The program is organized into 4 packages:
- RE.Crawler: collects Wiki pages by domain or by individual page.
- RE.Infobox: extracts relation triples from the infoboxes of Wiki pages
- RE.GrammarTree: procedures for processing parse trees and generating feature vectors
- RE.Util: procedures for text normalization and string processing
4.3.2. Building the training set from Vietnamese Wikipedia
For supervised learning, building the training set is especially important. Following the statistics on the relation types of greatest interest in relation extraction [21], this thesis selects 3 relations for the experiments: "năm thành lập" (year of establishment), "hiệu trưởng" (rector) and "ngày sinh" (date of birth). The training set for each relation contains about 350-400 sentences. The construction process is as follows:
a. Infobox extraction
For each Wiki page, the infobox of the page (if any) is extracted and split into relation triples of the form <E1 - R - E2>, where:
- E1: the Wiki page entity under consideration
- R: a relation that entity E1 has (the attribute component in the infobox table)
- E2: the entity that stands in relation R with E1 (the value component corresponding to the attribute in the infobox table)
For example, for the Wiki page Đại học Quốc gia Hà Nội, the extracted relation triples are:

| No. | Relation triple <E1 - R - E2> |
|---|---|
| 1 | <Đại học Quốc gia Hà Nội - Năm thành lập - 1906> |
| 2 | <Đại học Quốc gia Hà Nội - Địa chỉ - 144 đường Xuân Thủy, Quận Cầu Giấy, Hà Nội, Việt Nam> |
| 3 | <Đại học Quốc gia Hà Nội - Website - www.vnu.edu.vn> |
| 4 | <Đại học Quốc gia Hà Nội - Giám đốc - Mai Trọng Nhuận> |
| 5 | <Đại học Quốc gia Hà Nội - Loại hình - Đại học quốc gia> |
| 6 | <Đai_Hoc_Quoc_Gia_Ha_Noi - Điện thoại - +84-4-7547968> |

After this step, 864 relation triples were obtained.
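The triples expressing the relations "năm thành lập" (year of establishment), "hiệu trưởng" (rector) and "ngày sinh" (date of birth) are then selected in turn. The results are summarized in the following table:
Relation | Count | Example relation triples <E1 - R - E2>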
Hiu
trng
116
<Trng i hc Vn Lang - Hiu trng - TS. Nguyn
Dng>
<Hc Vin Ngn Hng Vit Nam - Hiu trng - Tin
s T Ngc Hng>
<Trng i hc Quc T - i hc Quc Gia thnh
ph H Ch Minh - Hiu trng - H Thanh Phong>
<Trng i hc Kin Trc H Ni - Hiu trng -
TS. nh c>
<Trng i hoc Y Dc Cn Th Hiu trng - PGS.
TS. Bc s CK II Phm Vn Lnh>
<Trng i hc Bch Khoa H Ni - Hiu trng -
GS.Ts. Nguyn Trng Ging>
<Trng i hc S Phm H Ni 2 - Hiu trng -
PGS.TS. Nguyn Vn M>
<Hc Vin K Thut Qun S - Hiu trng - Gio s,
TSKH Phm Th Long.>
<Hc Vin Y Dc Hc C Truyn Vit Nam - Hiu
trng - GS. TS.Trng Vit Bnh>
<Hc Vin Ngoi Giao - Hiu trng - PGS. TS. Dng
Vn Qung>
Nm
thnh lp
132
<Hc Vin Ngn Hng Vit Nam - Nm thnh lp -
1998>
<Trng i hc S Phm, i hc Thi Nguyn - Nm
thnh lp - 25 thng 12 nm 1987>
<Trng i hc Cng nghip H Ni - Nm thnh lp
- 2005>
<Trng i hc H Hoa Tin - Nm thnh lp - 2007>
<Trng i hc B Ra Vng Tu - Nm thnh lp -
2006>
<Hc Vin n Nhc Hu - Nm thnh lp - 26 thng 3
nm 2008>
<Trng i Hc Thnh Ty - Nm thnh lp - 10
thng 10 nm 2007>
<Trng i hc S Phm Nng - Nm thnh lp -
1975>
<Khoa Qun tr Kinh doanh i hc Quc gia H Ni -
Nm thnh lp - 13 thng 7 nm 1995>
<i hc Thi Nguyn - Nm thnh lp - 1994>
<Trng i hc iu Dng Nam nh - Nm thnh lp
- 26 thng 2 nm 2004>
Ngy
sinh
160
<Nguyn Tn Dng ngy sinh - 17 thng 11, 1949>
<Nguyn Vn Hiu ngy sinh - Ngy 21 thng
07,1938>
<Phan Vn Khi ngy sinh - 25 thng 12, 1933>
<H Ch Minh ngy sinh - 19 thng 5, 1890>
<inh Tin Hon ngy sinh 924>
<Nng c Mnh ngy sinh - 11 thng 9, 1940>
<Gia Long ngy sinh - 8 thng 2 nm 1762>
<Minh Mng ngy sinh - 25 thng 5 nm 1791>
<Nguyn Du ngy sinh - 3 thng 1, 1766>
<Trn Thi Tng - ngy sinh - 17 thng 7, 1218>

b. Tm kim trn Wiki
tm cc cu m t b quan h <E1 R E2> va tm c trn, ta tm
trong thc th trang Wiki tng ng. Cc cu cha c ba thnh phn ca b quan h
s ly ra v lu vo trong c s d liu.
Qu trnh ny gm 3 bc sau:
- To truy vn gi ti modul tm kim ca Wiki. T kha ca truy vn
l quan h R v s lng kt qu tr v. Wiki s tr v mt danh sch
cc trang Wiki c cha t kha ny.

Figure 15: Example of a search on Wikipedia

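- The returned pages are collected and passed through the preprocessing step (described in the next part).
- Each extracted sentence belongs to one of the following three types:
o Type 1: the sentence contains all 3 components of the relation
o Type 2: the sentence contains R and E1, or R and E2
o Type 3: the sentence contains only R
These sentences are then parsed, and relation trees and feature vectors are generated from them. Feature vectors obtained from type 1 sentences are labeled automatically. Feature vectors obtained from type 2 and type 3 sentences are labeled by hand.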
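
The three sentence types and the labeling rule above can be sketched as follows. This is only a minimal illustration: the names classify_sentence and needs_manual_label are hypothetical, and case-insensitive substring matching is an assumption, since the text does not say how the components are matched against a sentence.

```python
from enum import Enum


class SentenceType(Enum):
    TYPE_1 = 1  # contains E1, R and E2
    TYPE_2 = 2  # contains R and exactly one of E1, E2
    TYPE_3 = 3  # contains only R


def classify_sentence(sentence, e1, r, e2):
    """Return the sentence type, or None if the sentence does not contain R."""
    text = sentence.lower()
    if r.lower() not in text:
        return None  # sentences without R are not kept at all
    has_e1 = e1.lower() in text
    has_e2 = e2.lower() in text
    if has_e1 and has_e2:
        return SentenceType.TYPE_1
    if has_e1 or has_e2:
        return SentenceType.TYPE_2
    return SentenceType.TYPE_3


def needs_manual_label(sentence_type):
    """Type 1 vectors are labeled automatically; types 2 and 3 by hand."""
    return sentence_type in (SentenceType.TYPE_2, SentenceType.TYPE_3)


if __name__ == "__main__":
    s = "Học Viện Ngân Hàng Việt Nam có năm thành lập là 1998."
    print(classify_sentence(s, "Học Viện Ngân Hàng Việt Nam", "năm thành lập", "1998"))
```

Returning None for sentences without R mirrors the preprocessing step, which keeps only the sentences that contain R.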
Preprocessing
After being collected, the pages are preprocessed as follows:
- Remove HTML tags
- Split sentences
- Extract the sentences that contain R
- Normalize the sentences.
HTML tag removal and sentence splitting are performed with the JvnTextPro toolkit [43]; the sentences that contain R are then saved.
Some special characters that the syntactic parser cannot handle must be removed or replaced with equivalent symbols. The opening parenthesis "(" and the closing parenthesis ")" usually enclose an annotation, so, to preserve the meaning, each pair of parentheses is replaced by a corresponding pair of dashes "-".
Example: the sentence "Trường Đại học Bách khoa Hà Nội (tiếng Anh: Hanoi University of Technology, viết tắt là HUT) là trường đại học kỹ thuật đa ngành, được thành lập tại Hà Nội ngày 15 tháng 10 năm 1956." is normalized to "Trường Đại học Bách khoa Hà Nội - tiếng Anh: Hanoi University of Technology, viết tắt là HUT - là trường đại học kỹ thuật đa ngành, được thành lập tại Hà Nội ngày 15 tháng 10 năm 1956."
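
The parenthesis normalization described above can be sketched with a regular expression. The function name normalize_parentheses is hypothetical, and the handling of surrounding spaces is an assumption chosen so that the output matches the example sentence; the toolkit used in the thesis may behave differently.

```python
import re


def normalize_parentheses(sentence):
    """Replace every "(" or ")" with a dash surrounded by single spaces."""
    # Swallow any spaces around the parenthesis so the dash is spaced evenly.
    replaced = re.sub(r"\s*[()]\s*", " - ", sentence)
    # Collapse runs of spaces and trim the ends.
    return re.sub(r"\s{2,}", " ", replaced).strip()


if __name__ == "__main__":
    raw = ("Trường Đại học Bách khoa Hà Nội (tiếng Anh: Hanoi University of "
           "Technology, viết tắt là HUT) là trường đại học kỹ thuật đa ngành.")
    print(normalize_parentheses(raw))
```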
4.3.3. Generating feature vectors
a. Syntactic parsing
- Word segmentation: using the JvnTextpro word segmenter [43] by Nguyễn Cẩm Tú.
- Convert the sentence to the standard input format of the syntactic parser.
- Syntactic parsing: using the coltechparser parser by Nguyễn Phương Thái et al. [38].
Remarks:
- The experiments show that the parsing result depends heavily on word segmentation.
- Parsing sentences after word segmentation yields better parse trees.
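b. Extracting the subtrees that represent relation R and generating feature vectors
Using the algorithm presented in Section 3.3.4.2, we generate the subtrees that may represent the relation <E1 R E2> (called subtrees for short).
The attributes of the feature vector $v = (v_1, v_2, v_3, v_4, v_5, v_6, v_7)$ express how likely a subtree is to represent relation R. In the experiments they are determined as follows:
- Central label phrase: the likelihood that the subtree expresses the relation R being sought (rather than some other relation R'). The higher the value, the higher the likelihood. If $Node_R$ is the node of the subtree that represents R, let:
o num1 be the number of leaf nodes of $Node_R$
o num2 be the number of leaf nodes of $Node_R$ whose value matches a keyword expressing R
Then $v_1$ is computed by the formula
$v_1 = \begin{cases} 0 & \text{if a leaf node of } Node_R \text{ contains a word such as "không" (not)} \\ \dfrac{num2}{num1} & \text{otherwise} \end{cases}$
- Label phrases expressing E1, E2: the likelihood that the nodes really represent the entities. The higher the value, the higher the likelihood. If $Node_{E_i}$ is the node of the subtree that represents $E_i$, let:
o num1 be the number of leaf nodes of $Node_{E_i}$
o num2 be the number of leaf nodes of $Node_{E_i}$ that represent the entity $E_i$ (which is known beforehand, as assumed in the problem statement)
Then $v_2$ (for E1) and $v_3$ (for E2) are computed by the formula
$v = \dfrac{num2}{num1}$
- Paths to the labels E1, E2:
o $v_4$: the number of nodes on the path from the node representing E1 to the node representing R
o $v_6$: the number of nodes on the path from the node representing E2 to the node representing R
o $v_5 = \dfrac{\sum w_t}{v_4}$, where $w_t$ is the weight of the nodes on the path from the node representing E1 to the node representing R; note that $v_5 = 0$ if $v_4 = 0$
o $v_7 = \dfrac{\sum w_t}{v_6}$, where $w_t$ is the weight of the nodes on the path from the node representing E2 to the node representing R; note that $v_7 = 0$ if $v_6 = 0$
o $w_t$ is computed as described in Section 3.3.4.2
- In the experiments, the weight of a leaf node is set to one, meaning that all the words used are treated as equivalent.
The subtree in Figure 14 has the feature vector v = (0.5; 1.0; 1.0; 3.0; 0.0; 2.0; 0).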
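
The feature definitions above can be sketched in code as follows. All function names are hypothetical, leaves are assumed to be word-segmented tokens, and the path score is assumed to be the total weight of the path nodes divided by the path length. The toy inputs are invented for illustration and are not taken from Figure 14.

```python
# Words treated as negation markers; "không" is the only one named in the text.
NEGATION_WORDS = {"không"}


def ratio(num2, num1):
    """Return num2 / num1, or 0.0 when num1 is zero."""
    return num2 / num1 if num1 else 0.0


def central_score(relation_leaves, relation_keywords):
    """v1: share of Node_R leaves matching a keyword of R; 0 if negated."""
    if any(leaf in NEGATION_WORDS for leaf in relation_leaves):
        return 0.0
    matches = sum(1 for leaf in relation_leaves if leaf in relation_keywords)
    return ratio(matches, len(relation_leaves))


def entity_score(entity_leaves, entity_words):
    """v2 or v3: share of Node_Ei leaves that belong to the known entity Ei."""
    matches = sum(1 for leaf in entity_leaves if leaf in entity_words)
    return ratio(matches, len(entity_leaves))


def path_scores(path_weights):
    """(v4, v5) or (v6, v7): path length and total weight over path length."""
    length = len(path_weights)
    return float(length), ratio(sum(path_weights), length)


def feature_vector(relation_leaves, relation_keywords,
                   e1_leaves, e1_words, e2_leaves, e2_words,
                   e1_path_weights, e2_path_weights):
    """Assemble (v1, ..., v7) from the pieces defined above."""
    v1 = central_score(relation_leaves, relation_keywords)
    v2 = entity_score(e1_leaves, e1_words)
    v3 = entity_score(e2_leaves, e2_words)
    v4, v5 = path_scores(e1_path_weights)
    v6, v7 = path_scores(e2_path_weights)
    return (v1, v2, v3, v4, v5, v6, v7)


if __name__ == "__main__":
    # Hypothetical subtree: two leaves under Node_R, one of them a keyword.
    print(feature_vector(["là", "hiệu_trưởng"], {"hiệu_trưởng"},
                         ["A"], {"A"}, ["B"], {"B"},
                         [0, 0, 0], [0, 0]))
```

Keeping each attribute in its own small function makes it easy to change one definition, for example the list of negation words, without touching the others.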
Remarks:
- The experiments show that the smaller the values of $v_4$, $v_5$, $v_6$ and $v_7$, the more likely the subtree is to express the relation tuple <E1 R E2> correctly. This agrees with the observation that the closer the components are to each other in the parse tree, the stronger the relation between them.
- This also suggests that the formulas proposed for computing the feature vector are reasonable.
- However, some ambiguity remains in the case where the central label phrase contains the keyword representing R but also contains words such as "không" (not).

4.3.4. SVM classifier
The Weka [26] and LibSVM [44] software packages are used to train and test the model.
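
As an illustration of how the feature vectors could be handed to LibSVM, the sketch below writes one vector in the LIBSVM sparse text format (label followed by index:value pairs, indices starting at 1, zero-valued features omitted). The function name to_libsvm_line and the use of 1 and -1 as class labels are assumptions; the thesis does not state which input format or label encoding was used with Weka and LibSVM.

```python
def to_libsvm_line(label, vector):
    """Format one labeled feature vector as a line of LIBSVM sparse input."""
    # Feature indices are 1-based; features equal to zero may be left out.
    features = " ".join(
        f"{index}:{value:g}"
        for index, value in enumerate(vector, start=1)
        if value != 0
    )
    return f"{label} {features}".rstrip()


if __name__ == "__main__":
    # Hypothetical positive example with a seven-dimensional feature vector.
    print(to_libsvm_line(1, (0.5, 1.0, 1.0, 3.0, 0.0, 2.0, 0)))
```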
An example of the training data statistics of the model, for the "năm thành lập" (founding year) relation, is shown in the figure:

Figure 16: Training data statistics for the "ngày sinh" (birth date) relation

4.4. Evaluation
4.4.1. System evaluation
The quality of the system is evaluated with three measures: precision, recall and the F-measure. They are computed by the following formulas:
$pre_{C_i} = \dfrac{correctC_i}{correctC_i + incorrectC_i}$
$rec_{C_1} = \dfrac{correctC_1}{correctC_1 + incorrectC_0}$
$rec_{C_0} = \dfrac{correctC_0}{correctC_0 + incorrectC_1}$
$F_{C_i} = \dfrac{2 \cdot pre_{C_i} \cdot rec_{C_i}}{pre_{C_i} + rec_{C_i}}$
The values $correctC_i$ and $incorrectC_i$ are defined in Table 4-3.
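
The three measures can be computed directly from the four counts of Table 4-3. The sketch below follows the formulas above; the function names and the example counts are hypothetical and are not results from the thesis.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def f_measure(pre, rec):
    """Harmonic mean of precision and recall."""
    return ratio(2 * pre * rec, pre + rec)


def evaluate(correct_c0, incorrect_c0, correct_c1, incorrect_c1):
    """Return precision, recall and F for classes C0 and C1."""
    pre0 = ratio(correct_c0, correct_c0 + incorrect_c0)
    pre1 = ratio(correct_c1, correct_c1 + incorrect_c1)
    # Items of class C0 that were missed are the ones wrongly put into C1,
    # and vice versa, hence the crossed indices in the recall denominators.
    rec0 = ratio(correct_c0, correct_c0 + incorrect_c1)
    rec1 = ratio(correct_c1, correct_c1 + incorrect_c0)
    return {
        "C0": (pre0, rec0, f_measure(pre0, rec0)),
        "C1": (pre1, rec1, f_measure(pre1, rec1)),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, (pre, rec, f) in evaluate(40, 10, 45, 5).items():
        print(f"{name}: precision={pre:.4f} recall={rec:.4f} F={f:.4f}")
```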
4.4.2. Evaluation method
The system is tested using cross-validation. The experimental data is divided into 10 equal parts; in turn, 9 parts are used for training and the remaining part for testing. The results of the 10 runs are recorded and evaluated as a whole.
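
The 10-fold procedure can be sketched as a small generator. The name k_fold_splits is hypothetical, and assigning items to folds by striding is an assumption made for brevity: when the data size is not a multiple of 10, the folds differ in size by at most one item.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each fold serves as the test set once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training data is everything outside the current test fold.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


if __name__ == "__main__":
    data = list(range(100))  # stand-in for the labeled feature vectors
    for run, (train, test) in enumerate(k_fold_splits(data), start=1):
        print(run, len(train), len(test))
```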

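Table 4-3: Values used to evaluate the classifier
(rows: class assigned by the classifier; columns: actual class)

          C_0              C_1
C_0       correctC_0       incorrectC_0
C_1       incorrectC_1     correctC_1

where:
Value            Meaning
correctC_0       Number of results correctly classified into class C_0
incorrectC_0     Number of results incorrectly classified into class C_0
incorrectC_1     Number of results incorrectly classified into class C_1
correctC_1       Number of results correctly classified into class C_1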

4.4.3. Test results
The test results for the 3 relations "năm thành lập" (founding year), "hiệu trưởng" (rector) and "ngày sinh" (birth date) are as follows:

Figure 17: Test results for the "năm thành lập" relation

Figure 18: Test results for the "hiệu trưởng" relation

Figure 19: Test results for the "ngày sinh" relation

Figure 20: Comparison of the average results of the three relations
4.5. Remarks
The initial experiments with the relation extraction system based on syntactic parse trees give fairly promising results. The average F1 score for the tested relations "năm thành lập", "hiệu trưởng" and "ngày sinh" is 91.06%, 89.9% and 83.08%, respectively. Although many ambiguous cases remain, I believe that with a larger training data set, richer lookup sources, additional features, and node weights chosen separately for each relation, the system can achieve even higher accuracy in the future.

Conclusion
Starting from a study of the relation extraction problem, this thesis has proposed a model for extracting entity relations based on syntactic parse trees over the Vietnamese Wikipedia data domain. The experimental results show that the model is feasible and applicable.
In terms of content, the thesis has achieved the following results:
- Introduced the relation extraction problem and the related concepts.
- Studied and analyzed typical relation extraction methods, focusing on those that use syntactic parse trees.
- Based on the characteristics of Vietnamese Wikipedia, proposed a model for building the training data set semi-automatically.
- Applied the supervised learning model SVM to build a relation extraction model based on syntactic parse trees over the Vietnamese Wikipedia data domain, with promising results.
Besides these results, owing to limited time and knowledge, the thesis still has the following limitations:
- The thesis has not yet built a user interface, and in some cases the experimental results have not reached the desired accuracy.
Regarding future research, solving the problem with a supervised approach is a good first step. In the coming time, the thesis will be developed in the following directions:
- First, improve the step of building the training data set so that it can be carried out on many relations, moving toward a multi-class classifier.
- Second, experiment with unsupervised learning models on the feature vectors that have been built.
- Third, integrate this module into a system that automatically builds a Vietnamese ontology for the application domain of Vietnamese universities, in order to support entity-oriented search.

APPENDIX
Table 5-1: Labels used in the syntactic parse tree

Label | Category | Examples

N - Noun
No | Proper noun | Bi Thy Anh, Hà Nội
Ns | Individual noun | quần, áo, bn
Nc | Collective noun | quần áo, bnh lnh, bạn bè
Na | Abstract noun | giai điệu
Nu | Unit-of-measure noun | lít rượu, nắm muối, mẫu đất, phút suy nghĩ

V - Verb
Vt | Transitive verb | ăn bánh, xây nhà
Vi | Intransitive verb | ng, nói, làm việc
Ve | Existential verb | Còn, mất, hết
Va | Receptive verb | Bị, phải, được
Vv | Modal verb | Muốn, dám, quả quyết
Vg | Compound verb | mua bán, đánh đập
Vz | The verb "là" (to be) |

A - Adjective
Ai | Qualitative adjective | Trong vắt, mênh mông
An | Quantitative adjective | Cao (hai mét), rộng (vài sải tay)

P - Pronoun
Pp | Personal pronoun |
Pd | Demonstrative pronoun | y, , kia
Pn | Quantity pronoun | Bấy, bấy nhiêu, tất cả
Pi | Interrogative pronoun | Ai, gì, đâu, bao giờ, bao nhiêu

R - Adverb
Rd | Directional adverb | Vào (nhà), xuống (cầu thang), (sản xuất) ra
Rt | Temporal adverb |

Other word-level labels
C | Preposition | Do, của, với, hay, nếu
M | Particle | Chinish, cht, ngay, tất nhiên, , , h, h
E | Interjection | i ch, i chao, dạ, vâng
Nl | Classifier | Cái, con, cây, người, tấm
Nq | Numeral | Một, hai, ba, dăm, ...
Y | Abbreviation | CHXH, TTCK, CNTT
X | Unknown word |

Phrase-level labels
NP | Noun phrase | Tất cả những chiếc ko
VP | Verb phrase | đang ăn cơm, yêu cô ấy, bán cho họ
AP | Adjective phrase | Xinh quá, mng ci, giỏi về thể thao
RP | Adverb phrase | Vẫn chưa
PP | Prepositional phrase | vào Sài Gòn
QP | Quantity phrase | Năm trăm, hơn 200

Clause-level and sentence-level labels
SBAR | Subordinate clause | Quyển sách mà anh mượn; khỏe vì chơi thể thao đều đặn
S | Declarative sentence | Tôi đi học bằng xe đạp
SQ | Interrogative sentence | Ai đang ở trong nhà?
SE | Exclamatory sentence | i ch,
SC | Imperative sentence | Không được làm ồn, đi đi em

REFERENCES
Vietnamese
[1] Hà Quang Thụy, Phan Xuân Hiếu, Đoàn Sơn, Nguyễn Trí Thành, Nguyễn Thu
Trang, Nguyễn Cẩm Tú (2009). Giáo trình Khai phá dữ liệu Web. NXBGDVN,
10-2009.
[2] Nguyễn Thị Minh Huyền, Phan Xuân Hiếu, Nguyễn Lê Minh, Lê Thanh
Hương (2009). Báo cáo kết quả sản phẩm các công cụ xử lý ngôn ngữ tự nhiên
tiếng Việt. Đề tài KC01.01/06-10 "Nghiên cứu phát triển một số sản phẩm
thiết yếu về xử lý tiếng nói và văn bản tiếng Việt"
[3] Nguyễn Hồng Cổn (2008). Cấu trúc cú pháp câu tiếng Việt: chủ - vị hay đề -
thuyết. Hội nghị khoa học về Việt Nam học.
English
[4] Abdulrahman Almuhareb (2006). Attributes in lexical acquisition. PhD Thesis.
University of Essex.
[5] Adrian Iftene, Alexandra Balahur-Dobrescu (2008). Named Entity Relation
Mining using Wikipedia. The Sixth International Language Resources and
Evaluation LREC08 (2008), European Language Resources Association
(ELRA), Pages: 29517408
[6] Anne-Marie Vercoustre, Jovan Pehcevski, James A. Thom (2007). Using
Wikipedia Categories and Links in Entity Ranking. INEX 2007: 321-335.
[7] Brin, S. (1998). Extracting patterns and relations from the world wide web.
WebDB 1998: 172-183.
[8] Bunescu R. C., and Mooney R. J. (2005). A shortest path dependency kernel
for relation extraction. HLT/EMNLP 2005: 724-731.
[9] Chinchor, N. and Marsh, E. (1998). Information extraction task definition
(version 5.1). The 7th Message Understanding Conference.
http://acl.ldc.upenn.edu/muc7/ ie_task.html.
[10] Corina Roxana Girju (2002). Text mining for semantic relations. PhD. Thesis.
The University of Texas at Dallas, 2002.
[11] Coyle, B., and Sproat, R. (2001). WordsEye: An automatic text-to-scene
conversion system. The Siggraph Conference, Los Angeles, USA.
[12] Daniel Sleator and Davy Temperley (1993). Parsing English with a Link
Grammar. Third International Workshop on Parsing Technologies.
http://www.cs.cmu.edu/ afs/cs.cmu.edu/project/link/pub/www/papers/ps/LG-
IWPT93.pdf.
[13] Dat P. T. Nguyen, Yutaka Matsuo, Mitsuru Ishizuka (2007). Relation
Extraction from Wikipedia Using Subtree Mining. AAAI 2007: 1414-1420
[14] Eugene Agichtein, Luis Gravano (2000). Snowball: Extracting Relations from
Large Plain-Text Collections. ACM DL 2000: 85-94.
[15] Fabian M. Suchanek, Georgiana Ifrim, Gerhard Weikum (2006). LEILA:
Learning to Extract Information by Linguistic Analysis. COLING/ACL 2006
(Workshop On Ontology Learning And Population).
[16] Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum (2008). YAGO: A
Large Ontology from Wikipedia and WordNet. Web Semantics: Science,
Services and Agents on the World Wide Web, 6(3): 203-217.
[17] Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó
Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano and Stan
Szpakowicz (2009). Multi-Way Classification of Semantic Relations Between
Pairs of Nominals. The NAACL-HLT-09 Workshop on Semantic Evaluations:
Recent Achievements and Future Directions (SEW-09), Boulder, USA, May
2009.
[18] Jinxiu Chen, Donghong Ji, Chew Lim Tan, Zhengyu Niu (2005).
Unsupervised Feature Selection for Relation Extraction. The 2nd International
Joint Conference on Natural Language Processing (IJCNLP-05),
http://www.aclweb.org/anthology/I /I05/I05-2045.pdf
[19] Jonathan Yu, James A. Thom and Audrey Tam (2007). Ontology evaluation
using Wikipedia categories for browsing. CIKM 2007: 223-232.
[20] Kai-Hsiang Yang, Chun-Yu Chen, Hahn-Ming Lee, and Jan-Ming Ho (2008).
EFS: Expert Finding System Based on Wikipedia Link Pattern Analysis. The
2008 IEEE International Conference on Systems, Man and Cybernetics (SMC
2008): 631-635.
[21] Kambhatla N. (2004). Combining lexical, syntactic, and semantic features
with maximum entropy models for extracting relations. ACL 2004.
[22] Kim S., Lewis P., Martinez K. and Goodall S. (2004). Question Answering
Towards Automatic Augmentations of Ontology Instances. The Semantic
Web: Research and Applications: First European Semantic Web Symposium,
ESWS: 152-166.
[23] L. Denoyer and P. Gallinari (2006). The Wikipedia XML corpus. SIGIR Forum,
40(1): 64-69.
[24] Larry Sanger (2005). The Early History of Nupedia and Wikipedia: A
Memoir. Open Sources 2.0, ed. DiBona, Cooper, and Stone. O'Reilly, 2005
(Pre-published in slashdot.org, Apr. 2005).
[25] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni
(2007). Open information extraction from the Web. IJCAI 2007: 2670-2676.
[26] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter
Reutemann, Ian H. Witten (2009). The WEKA Data Mining Software: An
Update. SIGKDD Explorations, 11(1):10-18.
[27] Minlie Huang, Xiaoyan Zhu, Yu Hao, Donald G. Payan, Kunbin Qu, Ming Li
(2004). Discovering patterns to extract protein-protein interactions from full
texts. Bioinformatics, 20(18):3604-3612.
[28] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S.
Soderland, D. Weld, and A. Yates (2004). Web-Scale Information Extraction
in KnowItAll. WWW 2004: 100-110.
[29] I. Fahmi (2009). Automatic term and relation extraction for medical question
answering system, PhD Thesis, University of Groningen, Netherlands
[30] Sanghee Kim, Paul H. Lewis, Kirk Martinez (2004). The Impact of Enriched
Linguistic Annotation on the Performance of Extracting Relation Triples.
CICLing 2004: 547-558.
[31] Valpola, H. (2000). Bayesian Ensemble Learning for Nonlinear Factor
Analysis. PhD Thesis, Helsinki University of Technology.
[32] Zhou GuoDong, Zhang Min. Extracting relation information from text
documents by exploring various types of knowledge. Information Processing
and Management 43 (2007): 969-982.
[33] http://en.wikipedia.org/wiki/Help:Infobox
[34] http://en.wikipedia.org/wiki/Subject_Verb_Object
[35] http://en.wikipedia.org/wiki/Dependency_graph
[36] http://inex.is.informatik.uni-duisburg.de/
[37] http://static.wikipedia.org/downloads/2008-06/vi/
[38] http://vlsp.vietlp.org:8080/demo/?page=home
[39] http://wordnet.princeton.edu/
[40] http://www.abisource.com/projects/link-grammar/
[41] http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html . Information about the
sixth Message Understanding Conference.
[42] http://www.db.dk/bh/Lifeboat_KO/CONCEPTS/semantic_relations.htm
[43] Nguyen Cam Tu (2008). JVnTextpro: A Java-based Vietnamese Text
Processing Toolkit
[44] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
