
VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Vũ Tiến Thành

INFORMATION EXTRACTION FOR SEMI-STRUCTURED DATA AND ITS APPLICATION TO BUILDING A PRODUCT PRICE SEARCH SYSTEM

UNDERGRADUATE THESIS (FULL-TIME PROGRAM)

Major: Information Technology

HANOI 2009

Supervisor: Trần Thị Oanh, M.Sc.

Co-supervisor: Trần Mai Vũ, B.Sc.

Acknowledgements
First of all, I would like to express my deepest gratitude to Associate Professor Dr. Hà Quang Thụy, Trần Thị Oanh, M.Sc., and Trần Mai Vũ, B.Sc., who guided me wholeheartedly throughout the work on this thesis.
I sincerely thank the lecturers who created favorable conditions for my study and research at the University of Engineering and Technology.
I would also like to thank the senior and fellow students of the "Data Mining" group, who helped me a great deal with collecting and processing data.
I thank my classmates in K50CA and K50CHTTT for their support and encouragement during my studies at the university.
Finally, I owe endless thanks to my family and friends, the loved ones who were always by my side and encouraged me throughout the work on this thesis.

Thank you all sincerely!

Student
Vũ Tiến Thành

Abstract
Information extraction from semi-structured data is a problem that has received attention at many major international conferences [9],[10],[12],[13]. It is an indispensable component of today's applications for collecting and extracting information. One such application is extracting product information from e-commerce sites to build a price search system that gives consumers the best available information.
This thesis studies information extraction from semi-structured data and applies it to building a product price search system. The thesis defines a set of price extraction rules to solve the problem of extracting the price when the product name is known; on that basis, the problem of automatically extracting both the name and the price of a product is solved. The thesis presents the steps for building a price search system for products on Vietnamese web pages, carries out experiments, and evaluates the results. The experimental results show that the information extracted by the system is reliable.

Table of contents
Abstract
Table of contents
Table of symbols and abbreviations
List of figures
List of tables
Introduction
Chapter 1. Overview of information extraction for semi-structured data
1.1 The information extraction problem
1.1.1 Introduction to the problem
1.1.2 Data of the problem
1.1.3 Approaches to information extraction
1.2 Information extraction for semi-structured data
1.2.1 Problem statement
1.2.2 Some information extraction methods for semi-structured data
1.2.3 Evaluation method
1.2.4 Applications of information extraction for semi-structured data
Chapter 2. Methods used in information extraction for semi-structured data
2.1 Information extraction based on the DOM tree
2.1.1 The DOM tree concept
2.1.2 Building the DOM tree
2.1.3 Using the DOM tree to extract information
2.2 Information extraction based on regular expression patterns
2.2.1 The regular expression concept
2.2.2 Using regular expressions to extract information
2.3 Algorithms for information extraction from semi-structured data
2.3.1 Two presentation types of data-rich pages
2.3.2 Some typical algorithms
Chapter 3. Applying semi-structured information extraction to build a product price search system
3.1 Overview of product price search systems
3.1.1 Concept
3.1.2 Construction methods
3.1.3 Existing systems
3.2 Practical basis
3.3 Scientific basis
3.3.1 Classification of commercial pages
3.3.2 Extracting the price of a given product
3.3.3 Automatically extracting product names and prices from commercial product pages
3.4 Steps for building the system
3.4.1 System model
3.4.2 Extensibility of the system
Chapter 4. Experiments and evaluation
4.1 Hardware and software environment
4.1.1 Hardware configuration
4.1.2 Software tools
4.2 Experimental results
4.2.1 Extracting the price of a given product
4.2.2 Identifying commercial websites
4.2.3 Collecting and extracting information from a website
4.2.4 Information collection capability of the system
Conclusion
References

Table of symbols and abbreviations

Symbol    Meaning
HTML      HyperText Markup Language
URL       Uniform Resource Locator
XPath     XML Path Language
W3C       World Wide Web Consortium

List of figures

Figure 1. Example of the structure of a semi-structured web page
Figure 2. Example of the entity recognition problem
Figure 3. Example of extracting the main content of a web page
Figure 4. Example of a price search system
Figure 5. Example of building a DOM tree using visual cues
Figure 6. Presentation of a list page
Figure 7. Presentation of a detail page
Figure 8. Converting HTML code to an EC tree
Figure 9. Example of the RoadRunner algorithm [12]
Figure 10. Product page for the HP CQ60-203TX
Figure 11. Product page for the HP CQ60-101TX
Figure 12. DOM tree representation of the HTML code of the two HP product pages
Figure 13. Example of an ordinary commercial page
Figure 14. Example of a classified ads page
Figure 15. Example of extracting the price from a web page
Figure 16. Example of a product page that contains incorrect prices
Figure 17. Example of extracting the actual price from a product page
Figure 18. Rule set for extracting the product price
Figure 19. Rule for extracting the product image
Figure 20. Rule for extracting product warranty information
Figure 21. Results returned by Google for the query "nokia 1200"
Figure 22. Results returned by Google for the query "nokia 1200" + "vnđ OR usd"
Figure 23. Overall model of the system
Figure 24. Module for identifying commercial product websites and extraction patterns
Figure 25. Module for collecting data and extracting information
Figure 26. Extracting related URLs
Figure 27. Web page with ambiguous prices
Figure 28. Web page with a clear price

List of tables

Table 1. Hardware configuration used in the experiments
Table 2. Software used in the experiments
Table 3. Description of the program that extracts product prices
Table 4. Experimental results for extracting the actual price of a product
Table 5. Experimental results for identifying commercial product websites
Table 6. Experimental results for product extraction
Table 7. Experimental results for the information collection capability of the system
Table 8. Some extracted products

Introduction
In recent years, with the rapid development of network infrastructure and storage technology, the Internet has become an indispensable part of human life. Many Internet-based applications have appeared to serve people's needs and interests, and e-commerce applications stand out among them. E-commerce helps people minimize the time and cost of trading goods. However, as the amount of information on the Internet grows, e-commerce information has exploded as well: a huge number of online stores, with millions of products and related information, makes searching difficult. Questions such as "Which product is good?", "Which store offers a better price?", and "Where can product information be found?" make it hard to choose a product to buy. The solution is a search system that serves this need; such systems are commonly known as product price search systems.
Because of this practical need, price search systems have attracted the attention of many large companies such as Yahoo, Google, and Amazon, as well as the research community. Many papers on the components of such systems have appeared at major international conferences such as WWW (the International World Wide Web Conference) and SIGMOD (ACM Special Interest Group on Management of Data, http://www.sigmod.org) [1],[3],[7], and there are commercial products such as PriceScan, Kelkoo, and Yahoo! Shopping. Although quite a few such systems exist, the problem still poses many challenges. Most existing systems collect data either from feeds supplied by the stores or through manual data entry, which costs a great deal of time and money. Many studies have been proposed to reduce this cost; most of them focus on applying automatic extraction methods for semi-structured data to build components that automatically collect information from online stores.
Building on these studies, this thesis follows the same direction: it builds an automatic information extraction component based on information extraction from semi-structured data, and proposes a model of a product price search system. Using the proposed model, the author carries out experiments to evaluate the results the model achieves.
The thesis consists of four chapters, outlined below:
Chapter 1. Overview of information extraction for semi-structured data gives an overview of the information extraction problem in general and of the approaches to it by data domain (structured, unstructured, and semi-structured). It introduces information extraction for semi-structured data, the evaluation of extraction quality through recall (R) and precision (P), and the practical applications of the problem.
Chapter 2. Methods used in information extraction for semi-structured data introduces the use of the DOM tree and of regular expressions for extracting information. The chapter also discusses two typical extraction algorithms: the algorithm based on the Stalker system and the RoadRunner algorithm.
Chapter 3. Applying semi-structured information extraction to build a product price search system presents the concept of a price search system and introduces existing systems. The chapter also discusses the practical basis in current web technology. Combining this practical basis with information extraction from semi-structured data, it builds the theoretical basis for extracting product prices, presents the system model, and describes the extensibility of the proposed system.
Chapter 4. Experiments and evaluation evaluates the problems stated in the theoretical part of Chapter 3 concerning product price extraction. The experimental results show the effectiveness of the price extraction method.
The conclusion summarizes the main content of the thesis and outlines directions for future work.

Chapter 1. Overview of information extraction for semi-structured data
The main topic of this thesis is applying information extraction for semi-structured data to build a price search system. This chapter introduces the information extraction problem in general and information extraction for semi-structured data in particular, then presents some applications of the latter. It also introduces the evaluation of extraction quality through recall (R) and precision (P).

1.1 The information extraction problem

1.1.1 Introduction to the problem
Information extraction is the problem of identifying specific pieces of information in a document; these pieces form the core of the document's semantic content [6].
Example: from a weather report, one can extract the regions, the time, and the high and low temperatures. From a web page of an online store, one can extract the product name, the product attributes, and the product price.

1.1.2 Data of the problem

Data is usually divided into three basic types [17]:
Unstructured data: data in free form that needs no predefined structure, for example natural language.
Structured data: data stored in relational database management systems such as MS SQL Server or MySQL, in which entities and attributes are defined in advance.
Semi-structured data: data that has structure, but the structure is not fully explicit. It does not follow the table structures and data models of a database, yet it contains tags and markers that separate the individual semantic elements of records and the individual fields within the data.
Ordinary web pages are a typical form of semi-structured data. The structured parts of a web page are data taken from the underlying (structured) database layer and displayed on the web through HTML tags.
Figure 1 shows semi-structured data on a product page. The data contains product names, product prices, and detailed product information. The information for each product is described by a predefined HTML template. This data is taken from the underlying (structured) database layer and displayed on the web page through HTML tags; it is the structured part of the web page.

[Figure label: identical HTML structure]

Figure 1. Example of the structure of a semi-structured web page

1.1.3 Approaches to information extraction

Information extraction problems are usually approached according to the kind of data they process, which gives the following cases:

Structured data
For structured data, information extraction is fairly simple. Because the information is represented in standard forms such as tables and entities, the required information can easily be obtained with queries.
Example: for structured data stored in the MS SQL or MySQL database management systems, the required information can be extracted with SQL statements such as SELECT and JOIN.
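As a minimal sketch of such a query, the Python example below uses the built-in sqlite3 module in place of MS SQL or MySQL (an assumption made only so that the example is self-contained). The tables product and offer, their columns, and the sample rows are hypothetical.

```python
import sqlite3


def cheapest_offers(conn):
    """Return (product name, shop, price) rows, cheapest offer first."""
    query = """
        SELECT p.name, o.shop, o.price
        FROM product AS p
        JOIN offer AS o ON o.product_id = p.id
        ORDER BY o.price
    """
    return conn.execute(query).fetchall()


# Hypothetical in-memory database with one product offered by two shops.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE offer (product_id INTEGER, shop TEXT, price INTEGER);
    INSERT INTO product VALUES (1, 'Nokia 1200');
    INSERT INTO offer VALUES (1, 'shop-a', 450000), (1, 'shop-b', 430000);
""")
rows = cheapest_offers(conn)
print(rows)
```

Because the schema already names entities and attributes, the extraction reduces to a declarative query, which is exactly what semi-structured pages do not offer.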
Unstructured data
For unstructured data, information extraction includes problems such as recognizing and extracting entities: person names, organization names, and so on.
An example of entity extraction:

Figure 2. Example of the entity recognition problem

Entity extraction can be approached in many ways, for example with HMM, SVM, or CRF. A well-known algorithm is also DIPRE (Dual Iterative Pattern Relation Expansion) by Brin [8], used to extract pairs of related entities, book title and author, from amazon.com.

Semi-structured data
The Web is the typical example of semi-structured data. Web information extraction is the problem of extracting target pieces of information from web pages. A program or an extraction rule for this purpose is usually called a wrapper [2].
This kind of extraction has several approaches, for example using the DOM tree [15]. The method parses the HTML source into a tree of nodes, where each node is an HTML tag; the extraction then relies on the path from the root to the node that contains the information to be extracted.

1.2 Information extraction for semi-structured data

1.2.1 Problem statement
Information extraction for semi-structured data is very useful because it allows us to obtain and integrate data from many sources to provide value-added services such as customizable web information gathering, price search systems, and meta-search. As more and more companies and organizations publish information on the Web, the ability to extract data from those web pages becomes increasingly important.
Research on this problem began in the mid-1990s, carried out by many companies and researchers [2].

1.2.2 Some information extraction methods for semi-structured data
As noted for the approaches in Section 1.1.3, extraction from semi-structured data has several typical methods:

Manual approach
By observing a web page and its source code, the programmer finds some patterns and writes a program to extract the target data. To make this easier for programmers, several pattern specification languages and user interfaces have been built. However, this approach does not scale to a large number of pages [2].

Wrapper induction
This is a semi-automatic approach, proposed around 1995-1996. A set of extraction rules is learned from a collection of manually labeled pages. The rules are then used to extract data items from pages with a similar format. Typical algorithms include Stalker [5] and WIEN [13] (used in the Lycos search engine).

Automatic approach
Proposed in 1998, this approach automatically finds patterns or structures for extracting information from the given pages. Because it needs no manual labeling, it can extract data from a huge number of pages. Typical algorithms include RoadRunner [12] and bootstrapping [1].

1.2.3 Evaluation method

To evaluate the quality of an information extraction method for semi-structured data, measures such as recall (R) and precision (P) are commonly used.
Suppose the extraction is applied to a data set that contains n documents to be extracted, the extractor returns m documents, and q of the returned documents are correct. Recall R and precision P are then computed by formulas (1) and (2):

$$R = \frac{q}{n} \times 100\% \qquad (1)$$

$$P = \frac{q}{m} \times 100\% \qquad (2)$$

Example:
Data to be extracted: 100 documents.
Data actually extracted: 97 documents.
Correctly extracted data: 90 documents.

$$R = \frac{90}{100} \times 100\% = 90\%$$

$$P = \frac{90}{97} \times 100\% \approx 92.78\%$$
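Formulas (1) and (2) can be checked with a few lines of Python. This is only an illustrative sketch; the function name recall_precision and its parameter names are chosen here and do not come from the thesis.

```python
def recall_precision(n_total, m_extracted, q_correct):
    """Return (recall, precision) in percent, following formulas (1) and (2).

    n_total     -- number of documents that should be extracted (n)
    m_extracted -- number of documents the extractor returned (m)
    q_correct   -- number of returned documents that are correct (q)
    """
    if n_total <= 0 or m_extracted <= 0:
        raise ValueError("n and m must be positive")
    recall = 100.0 * q_correct / n_total
    precision = 100.0 * q_correct / m_extracted
    return recall, precision


# The worked example from the text: n = 100, m = 97, q = 90.
r, p = recall_precision(100, 97, 90)
print(f"R = {r:.2f} %, P = {p:.2f} %")
```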

1.2.4 Applications of information extraction for semi-structured data
Recognizing and extracting the main content of a web page
Besides the parts that carry the main information, a web page also contains parts of little informational value, such as advertisements and menus. Recognizing and extracting the main content of a web page reduces storage and improves the results returned by search engines, because the search engine only has to store and search the main content of the page. Proposed algorithms include ContentExtractor and FeatureExtractor by Debnath [9],[10].

[Figure label: main content]

Figure 3. Example of extracting the main content of a web page

Product price search system
Such a system lets users compare the prices of the product they want to buy. The system has to crawl the web pages of online stores to extract useful information about products.

Figure 4. Example of a price search system

Chapter 2. Methods used in information extraction for semi-structured data
Many techniques and algorithms are used to solve the problem of information extraction for semi-structured data. Chapter 2 introduces extraction techniques that use the DOM tree [15],[6] and regular expressions [2]. The chapter also discusses two algorithms for information extraction from semi-structured data, together with their advantages and disadvantages.

2.1 Information extraction based on the DOM tree

2.1.1 The DOM tree concept
According to the W3C, the DOM (Document Object Model) is an application programming interface (API) for valid HTML documents and well-formed XML documents. It defines the logical structure of documents and the way a document is accessed and manipulated [15]. Example of a table taken from an HTML document:

<TABLE>
  <TBODY>
    <TR>
      <TD>Shady Grove</TD>
      <TD>Aeolian</TD>
    </TR>
    <TR>
      <TD>Over the River, Charlie</TD>
      <TD>Dorian</TD>
    </TR>
  </TBODY>
</TABLE>

[Figure: DOM tree representation of the HTML code above]

2.1.2 Building the DOM tree

Building a DOM tree from the input web pages is a necessary step in many data extraction algorithms [2]. There are two basic methods for building DOM trees.

- Using tags alone

Most HTML tags work in pairs. Each pair consists of an opening tag <> and a closing tag </>. A pair of tags may contain other pairs of tags, which results in a nested structure. Building a DOM tree from the HTML code of a web page is therefore a natural approach. In a DOM tree, each pair of tags is a node, and the pairs nested inside it are the children of that node. Two tasks have to be performed:
HTML code cleaning: some tags do not require a closing tag (for example <li>, <hr>, <p>) although they have one. A closing tag should therefore be inserted so that all tags are balanced. Ill-formed tags also need to be repaired. An incorrect tag is usually a closing tag that cuts across nested blocks, for example <tr> ... <td> ... </tr> ... </td>. Such an error is very hard to repair when the overlap spans several levels. There are several open-source programs for cleaning HTML code; commonly used ones are JTidy, NekoHTML, and HTMLCleaner.
Tree building: we can follow the nested blocks of the HTML tags to build the DOM tree.
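The two tasks above can be sketched with Python's standard html.parser module. This is a simplified illustration, not the cleaning logic of JTidy, NekoHTML, or HTMLCleaner: the sets VOID_TAGS and IMPLICIT_CLOSE and the classes Node and DomBuilder are assumptions made for the example, and only the simplest repairs are handled (missing closing tags of the same name and stray closing tags).

```python
from html.parser import HTMLParser

# Tags that never have content or a closing tag (assumed subset).
VOID_TAGS = {"br", "hr", "img", "meta", "input", "link"}
# Tags whose closing tag is often omitted: a new opening tag of the
# same name implicitly closes the previous one (assumed subset).
IMPLICIT_CLOSE = {"li", "p", "tr", "td"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Cleaning step: <li>a<li>b is read as <li>a</li><li>b</li>.
        if tag in IMPLICIT_CLOSE and self.current.tag == tag:
            self.current = self.current.parent
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name;
        # a stray closing tag that matches nothing is ignored.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one line per node."""
    label = node.tag + (f": {node.text}" if node.text else "")
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


root = build_dom("<ul><li>a<li>b</ul><hr><p>c")
print("\n".join(outline(root)))
```

In the mis-nested case <tr> ... <td> ... </tr> ... </td>, this sketch closes the <td> when </tr> arrives and drops the stray </td>, which is one possible repair among several.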
- Using tags and visual cues

Instead of analyzing the HTML code to repair errors, one can use the rendering, that is the visual information (for example, the screen positions at which the tags are rendered), to infer the structural relationships among tags and build the DOM tree. This method can parse HTML code into a DOM tree as long as the browser can render the code correctly.
In a web browser, each HTML element (consisting of an opening tag, optional attributes, optional embedded HTML content, and a closing tag, which may be missing) is rendered as a rectangle. This visual information can be obtained after the HTML code has been rendered by the browser. A DOM tree can then be built from it. The processing steps are as follows:
Find the four boundaries of the rectangle of each HTML element by calling the rendering engine of a browser, for example Internet Explorer.
Follow the sequence of opening tags and check whether one rectangle lies inside another rectangle in order to build the DOM tree.
Illustration of the use of visual cues:
An HTML fragment contains three errors. With the visual information, the DOM tree can still be built easily.

Figure 5. Example of building a DOM tree using visual cues
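The containment check in the second step can be sketched as follows. The rectangles are hypothetical coordinates written by hand; in the method described above they would come from the browser's rendering engine, which this sketch does not call. The names Box, build_tree_from_boxes, and shape are introduced only for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    tag: str
    left: float
    top: float
    right: float
    bottom: float
    children: list = field(default_factory=list)

    def contains(self, other):
        """True if the rectangle of `other` lies inside this rectangle."""
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


def build_tree_from_boxes(boxes):
    """Build a tree from boxes listed in the order of their opening tags."""
    inf = float("inf")
    root = Box("#root", -inf, -inf, inf, inf)
    stack = [root]
    for box in boxes:
        # Pop until the innermost open box that visually contains this one.
        while not stack[-1].contains(box):
            stack.pop()
        stack[-1].children.append(box)
        stack.append(box)
    return root


def shape(box):
    return (box.tag, [shape(child) for child in box.children])


# A 2-row table: the first row has two cells, the second row has one.
boxes = [
    Box("table", 0, 0, 100, 50),
    Box("tr", 0, 0, 100, 25),
    Box("td", 0, 0, 50, 25),
    Box("td", 50, 0, 100, 25),
    Box("tr", 0, 25, 100, 50),
    Box("td", 0, 25, 100, 50),
]
tree = build_tree_from_boxes(boxes)
print(shape(tree))
```

No closing tags are consulted at all, which is why missing or misplaced closing tags do not disturb the result.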

2.1.3 Using the DOM tree to extract information

To extract the required information from a node of the DOM tree, we need to specify the path from the root of the tree to that node. This path is called an XPath [16] or an extraction pattern.
Web information extraction based on the DOM tree therefore starts by building the DOM tree for the HTML code of the page.
An extraction pattern can then be specified as the path from the root of the DOM tree to the node that contains the content to be extracted.

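Example:
Consider the DOM tree of an HTML fragment that contains information about a book: its title and its author. The task is to use this DOM tree to extract the book title and the author's name. The extraction patterns are built as follows:

[Figure: sample DOM tree for extraction. The root element HTML has the children HEADER and BODY; under BODY, the element B holds the character data "Age of Spiritual Machines", and the element A inside FONT holds the character data "Ray Kurzweil".]

Pattern for the book title: HTML → BODY → B → CharacterData
Pattern for the author name: HTML → BODY → FONT → A → CharacterData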
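The two patterns can be tried out with Python's standard xml.etree.ElementTree module, which supports simple child paths. The markup string below is an assumed, well-formed reconstruction of the fragment behind the sample tree (the original HTML is not given in the text), and extract is a helper name chosen for this sketch.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed markup corresponding to the sample DOM tree.
PAGE = (
    "<HTML><HEADER/>"
    "<BODY><B>Age of Spiritual Machines</B>"
    "<FONT><A>Ray Kurzweil</A></FONT></BODY></HTML>"
)


def extract(root, path):
    """Follow a path of tag names below the root; return the text of each match."""
    return [node.text for node in root.findall(path)]


root = ET.fromstring(PAGE)           # root is the HTML element
titles = extract(root, "BODY/B")     # HTML -> BODY -> B -> CharacterData
authors = extract(root, "BODY/FONT/A")  # HTML -> BODY -> FONT -> A -> CharacterData
print(titles, authors)
```

Real pages are rarely well-formed XML, so in practice the path would be evaluated on a DOM tree produced after the cleaning step of Section 2.1.2.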

2.2 Information extraction based on regular expression patterns
2.2.1 The regular expression concept
A regular expression can be used to model the HTML encoding of a page [2]. Given an alphabet of symbols Σ and a special token #text that is not in Σ, a regular expression over Σ is a string over Σ ∪ {#text, *, ?, |, (, )} defined as follows:
The empty string ε and all elements of Σ ∪ {#text} are regular expressions.
If A and B are regular expressions, then AB, (A|B), and (A)? are also regular expressions, where (A|B) stands for A or B and (A)? stands for (A|ε).
If A is a regular expression, then (A)* is also a regular expression, where (A)* stands for ε | A | AA | AAA | ...
We also use (A)+ as a shorthand for A(A)*. A regular expression that contains no (A|B) is called a union-free regular expression. A regular expression is commonly used to represent an extraction pattern.

2.2.2 Using regular expressions to extract information
From a regular expression, a finite-state automaton can be built and used to match occurrences of the expression in the token sequences of web pages. During this matching, data can be extracted.
Example: given the following HTML code:
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Tinh Tong cua cac so tu 1->n</title>
</head>
To obtain the title of this fragment, we can build the following regular expression: <head>.*?<title>(#text)</title>
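A minimal sketch of this pattern in Python's re module is given below. The abstract token #text has no direct counterpart in re, so it is replaced here by the non-greedy capture group (.*?), which is an assumption of this sketch; re.DOTALL lets . match the line breaks between the tags.

```python
import re

HTML = """<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Tinh Tong cua cac so tu 1->n</title>
</head>"""

# (#text) from the pattern in the text is written as the capture group (.*?).
TITLE_PATTERN = re.compile(r"<head>.*?<title>(.*?)</title>",
                           re.DOTALL | re.IGNORECASE)

match = TITLE_PATTERN.search(HTML)
title = match.group(1) if match else None
print(title)
```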

2.3 Algorithms for information extraction from semi-structured data
2.3.1 Two presentation types of data-rich pages
Data-rich pages fall into two types according to how they present data [2]:
- List page: a page that contains one or more lists of objects. Figure 6 shows a list page. The data regions of a list page may be laid out horizontally or vertically. Within each region, the data records are formatted with the same template, while different regions use different templates [2].
- Detail page: a page that presents a single object. For example, Figure 7 is a detail page that presents a product. It contains all attributes of the product, such as name, image, price, technical specifications, and warranty period [2].

Figure 6. Presentation of a list page

Figure 7. Presentation of a detail page

2.3.2 Some typical algorithms

The manual extraction approach is no longer used today. This thesis therefore presents only the automatic and semi-automatic methods for information extraction from semi-structured data.
Wrapper induction: this is a semi-automatic extraction method
The algorithm presented below is based on the Stalker system.
- An example of extraction with the Stalker-based algorithm
A web page can be viewed as an ordered sequence of tokens S (for example words, numbers, and HTML tags). The extraction uses a tree structure called the EC tree (embedded catalog tree), which models the data embedded in an HTML page. The root of the tree is the document that contains the whole token sequence S of the page, and the content of each child node is a subsequence of its parent node. To extract a node, the wrapper uses the EC tree description of the page and a set of extraction rules.
The example below shows the conversion of an HTML fragment to an EC tree. Note that LIST is used here because the set of addresses is always ordered.

Figure 8. Converting HTML code to an EC tree

For each node in the tree, the wrapper identifies or extracts the content of the node from its parent, which contains the token sequence of all its children. Each extraction is performed with two rules, a Start Rule and an End Rule. The Start Rule identifies the beginning of the node and the End Rule identifies its end. This scheme applies to both leaf nodes and list nodes.
The extraction rules are based on the idea of landmarks. A landmark is a sequence of consecutive tokens that is used to locate the beginning or the end of a target item. The HTML code of the web page in Figure 8 is shown below.
<p> Restaurant Name: <b>Good Noodles</b><br><br>
<li> 205 Willow, <i>Glen</i>, Phone 1-<i>773</i>-366-1987</li>
<li> 25 Oak, <i>Forest</i>, Phone (800) 234-7903 </li>
<li> 324 Halsted St., <i>Chicago</i>, Phone 1-<i>800</i>-996-5023 </li>
<li> 700 Lake St., <i>Oak Park</i>, Phone: (708) 798-0008 </li> </p>
To extract the restaurant name "Good Noodles", the extraction rules are:
Start Rule: R1: SkipTo(<b>), which means that the system should start from the beginning of the page and skip all tokens until it sees the first <b> tag. Rules such as SkipTo(:) or SkipTo(<i>) would not be correct. According to the EC tree in Figure 8, R1 is applied within the parent of the node name, which is the root node. The root node contains the token sequence of the whole web page.
Similarly, the End Rule R2: SkipTo(</b>) identifies the end of the restaurant name.
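The effect of the rules R1 and R2 can be sketched in a few lines of Python. This is a simplification under stated assumptions: it works on the raw string instead of a token sequence, it searches for the end landmark forward from the start position, and the helper names skip_to and extract_item are chosen for the example rather than taken from the Stalker system.

```python
PAGE = """<p> Restaurant Name: <b>Good Noodles</b><br><br>
<li> 205 Willow, <i>Glen</i>, Phone 1-<i>773</i>-366-1987</li>
<li> 25 Oak, <i>Forest</i>, Phone (800) 234-7903 </li>
<li> 324 Halsted St., <i>Chicago</i>, Phone 1-<i>800</i>-996-5023 </li>
<li> 700 Lake St., <i>Oak Park</i>, Phone: (708) 798-0008 </li> </p>"""


def skip_to(text, landmark, start=0):
    """Return the index just after the first `landmark` at or after `start`, or -1."""
    pos = text.find(landmark, start)
    return -1 if pos < 0 else pos + len(landmark)


def extract_item(text, start_landmark, end_landmark):
    """Apply a start rule SkipTo(start_landmark) and an end rule SkipTo(end_landmark)."""
    begin = skip_to(text, start_landmark)
    if begin < 0:
        return None
    end = text.find(end_landmark, begin)
    if end < 0:
        return None
    return text[begin:end]


# R1: SkipTo(<b>), R2: SkipTo(</b>)
name = extract_item(PAGE, "<b>", "</b>")
print(name)
```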
- The rule learning process
In a wrapper induction system, learning is the central process. This thesis presents the learning algorithm that the wrapper uses to generate extraction rules.
The basic idea of the rule learning algorithm is as follows:
To generate the Start Rule for a node of the EC tree, some prefix tokens or wildcards of the node are identified as landmarks that can uniquely identify the beginning of the node. To generate the End Rule for a node, some suffix tokens or wildcards of the node are identified as landmarks. The processes of generating the Start Rule and the End Rule are the same.
Given a set of labeled training examples, the learning algorithm generates general extraction rules that extract all target items (positive items) without extracting any other items (negative items).
After this process a wrapper has been generated. It is then applied to other web pages that contain similar data and are formatted in the same way as the training examples.
- Advantages and disadvantages
Advantages:
The user only has to label a small amount of sample data. The learning process generates the extraction rules automatically.
Disadvantages:
If a site changes, how does the wrapper know about the change?
If the change is detected correctly, how can the wrapper be repaired automatically?
Because the method depends on manual labeling, it is not suitable for extracting from a large number of pages. For example, if a shopping site wants to extract all the products sold on the Web, manual labeling is an almost impossible task. Maintaining a wrapper is also very costly, because the Web is a dynamic environment in which sites change constantly.
Automatic extraction
To overcome the disadvantages of wrapper induction, automatic extraction has been studied extensively. Automatic extraction is feasible because the data on a website is usually encoded with a fixed number of templates. The templates can be found by mining the patterns that repeat across many pages of a website.
In some applications, data has to be extracted from detail pages because these pages contain more information. For example, on a list page the information about each product is usually just the name, the image, and the price. If the application needs the product description, it has to be extracted from the detail pages.
A typical automatic extraction algorithm that can handle both detail pages and list pages is RoadRunner.
- Description of the algorithm

Input: a set of sample pages, each containing one or more records (a page may be a list page or a detail page).
Output: an extraction pattern that can extract from all pages in the sample set. In this algorithm the extraction pattern is a union-free regular expression.

- Approach

Initially, the algorithm takes a randomly selected page as the initial extraction pattern W.
The pattern W is then refined by matching it in sequence against the HTML code of each other page pi in the sample set, resolving the mismatches between the pattern and the pages. Finally, the algorithm produces a general wrapper that can extract from all pages in the sample set. This wrapper is then applied to other pages whose structure is similar to that of the sample pages.
A mismatch occurs when some token of page pi differs from the corresponding token of W.
There are two kinds of mismatch during matching:
String mismatch: indicates a data field or item.
Tag mismatch.
The algorithm is illustrated in the figure below:

Figure 9. Example of the RoadRunner algorithm [12]
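The handling of string mismatches can be sketched as follows. This is a deliberately reduced illustration and not the full RoadRunner algorithm: it only generalizes differing text tokens into a #text field and raises an error on tag mismatches, which the real algorithm resolves by discovering optional and repeated parts. The two sample pages and the function names tokenize, is_tag, and generalize are hypothetical.

```python
import re


def tokenize(html):
    """Split HTML into tag tokens and text tokens, dropping empty text."""
    tokens = re.findall(r"<[^>]+>|[^<]+", html)
    return [t.strip() for t in tokens if t.strip()]


def is_tag(token):
    return token.startswith("<")


def generalize(wrapper, page_tokens):
    """Refine the wrapper against one page, resolving string mismatches only."""
    if len(wrapper) != len(page_tokens):
        raise ValueError("tag mismatch (different structure) is not handled in this sketch")
    refined = []
    for w, t in zip(wrapper, page_tokens):
        if w == t:
            refined.append(w)            # tokens agree: keep as template text
        elif not is_tag(w) and not is_tag(t):
            refined.append("#text")      # string mismatch: a data field
        else:
            raise ValueError(f"tag mismatch between {w!r} and {t!r} is not handled in this sketch")
    return refined


page1 = "<html>Books of: <b>John Smith</b></html>"
page2 = "<html>Books of: <b>Paul Jones</b></html>"

wrapper = tokenize(page1)                # the first page is the initial wrapper W
wrapper = generalize(wrapper, tokenize(page2))
print(wrapper)
```

The refined wrapper marks the position of the differing strings as a field, but it carries no information about what the field means, which is the limitation discussed next.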


- Advantages and disadvantages of the algorithm
Advantage: It requires no user labeling of the training samples and can build the extraction pattern automatically.
Disadvantage: It cannot automatically recognize which information entity the user actually wants, so the user still has to label the output by hand. For example, in the figure above the algorithm finds that the <B> tag holds corresponding data in the two pages, but it cannot tell that this is the title of a book; it only knows that it is a character string.
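The string-mismatch step described above can be sketched in a few lines. This is only a simplified illustration, not the RoadRunner implementation: it assumes the two pages have exactly the same tag sequence, and it deliberately stops at any tag mismatch. The names `tokenize` and `generalize` are hypothetical.

```python
from html.parser import HTMLParser


class _Tokenizer(HTMLParser):
    """Flattens a page into ("tag", ...) and ("text", ...) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        if data.strip():
            self.tokens.append(("text", data.strip()))


def tokenize(html):
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def generalize(wrapper, page):
    """Refine wrapper W against one more page, token by token."""
    if len(wrapper) != len(page):
        raise NotImplementedError("different lengths need tag-mismatch handling")
    refined = []
    for (w_kind, w_value), (p_kind, p_value) in zip(wrapper, page):
        if w_kind != p_kind or (w_kind == "tag" and w_value != p_value):
            # Tag mismatches are outside the scope of this sketch.
            raise NotImplementedError("tag mismatch is not handled here")
        if w_kind == "text" and w_value != p_value:
            # String mismatch: the position is a data field.
            refined.append(("text", "#PCDATA"))
        else:
            refined.append((w_kind, w_value))
    return refined
```

Text that is identical in both pages (a label such as "Author:") stays in the wrapper as fixed template text, while text that differs is generalized to a #PCDATA field. As noted in the disadvantage above, nothing in the result says what that field means.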

Chapter 3. Applying semi-structured information extraction to build a product price search system
Applying information extraction for semi-structured data to build a product price search system is the most important part of this thesis. This chapter covers the concept of a price search system, the methods for building one, and an assessment of existing systems.

3.1 Overview of product price search systems

This section presents the concept of a price search system, the construction methods, and the strengths and weaknesses of current price search systems, and from these derives a suitable approach for building such a system.

3.1.1 Concept

A price search system (also known as a price comparison service) is a concept from e-commerce. Such systems let users search for a specific product and compare its price across many different online shops [18]. A price search system is usually not an online store itself; rather, it is an indirect tool that helps shops present their products and helps users buy them.

3.1.2 Construction methods

Because price search systems focus on presenting price information from many different shopping sites, the approaches to this problem all concentrate on creating the best possible environment for collecting and exchanging product information between the shops and the system. There are typically three methods of building such a system [18]:
- Information supplied directly by the shops. Systems of this kind receive product and price information from the shops, and the system administrator enters it into the system database. The shops do not interact with the system directly.
- Shops interacting with the system. Systems of this kind are commonly known as B2C (Business to Customer) and B2B (Business to Business) models in e-commerce. The system provides an interface through which shops interact with it directly to supply information.
- Automatic collection of information from the shops' sales or product pages. Systems of this kind do not rely on the shops to supply information; they access the shops' websites automatically, extract the product information, and store it in the system database.

3.1.3 Existing systems

Of the three approaches introduced in section 3.1.2, the first two are limited because the system's data depends entirely on what the shops supply, while prices change continuously over time and require the database to be updated constantly. In addition, with these two methods the database is limited by the number of shops that supply data to the system. For these reasons they are not the best methods for building a price search system.
With the third approach, data is collected from the product sales pages themselves. The system scans the shops' web pages to obtain product prices instead of relying on data supplied by the sellers. This is therefore the most valuable method at present.
Many works following the third approach have been proposed for building price search systems, for example:
- Bootstrapping Information Extraction from Semi-structured Web Pages, proposed by Andrew Carlson and Charles Schafer and applied to house rental and travel pages [1].
- Automated Price Comparison Shopping Search Engine by Elwin Chai and Rick Jones, applied in the PriceHunter system [3].
- A Scalable Comparison-Shopping Agent for the World-Wide Web by Robert B. Doorenbos, Oren Etzioni, and Daniel S. Weld [7].
Problems with the works above
These works were proposed for building product price search systems, but they share one problem: the product names must be supplied in advance, and the product sales pages must be specified explicitly in the system.
In Vietnam there are currently a few notable systems, such as Vatgia¹ and Aha². However, these two systems follow the second approach and therefore depend heavily on the sellers.
Based on these observations, this thesis adopts the third approach to build the system and addresses some of the shortcomings of current methods for building price search systems.

3.2 Practical basis

Most websites today are built with dynamic programming languages such as PHP and ASP. When a user visits a product sales site and searches for a product, the result is returned and displayed in the browser according to a number of predefined templates, and pages generated from the same template share the same HTML structure. In other words, once we know the extraction pattern for one page of a template, we can use that pattern to extract information from the other pages of the same template.
Example: on the website www.trananh.vn, Figures 10 and 11 show two HP laptop products, each presented on its own detail page.

¹ http://www.vatgia.com
² http://www.aha.vn

Figure 10. Product page for the HP CQ60-203TX

Figure 11. Product page for the HP CQ60-101TX

Although these two detail pages present two different products, they share the same DOM tree representation.

Figure 12. DOM tree of the HTML code of the two HP product pages

Extraction patterns
Product name: HTML → BODY → TABLE → TR[1] → TD[1] → PRODUCT NAME (1).
Product price: HTML → BODY → TABLE → TR[3] → TD[1] → DIV[1] → FONT[1] → PRODUCT PRICE (2).
Remark:
Because the pages of a website follow a small number of fixed templates, pattern (1) can be used to extract the product name and pattern (2) to extract the product price from any other page that has the same DOM tree.
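To make the idea of reusing patterns (1) and (2) concrete, here is a minimal sketch of following such a path through a DOM tree. It assumes well-formed markup so that the standard-library XML parser can stand in for an HTML parser, it treats a step without an index as index 1, and the sample page and its price are invented for illustration. The function name `apply_pattern` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Patterns (1) and (2), written as (tag, 1-based index among same-tag siblings).
NAME_PATTERN = [("html", 1), ("body", 1), ("table", 1), ("tr", 1), ("td", 1)]
PRICE_PATTERN = [("html", 1), ("body", 1), ("table", 1), ("tr", 3),
                 ("td", 1), ("div", 1), ("font", 1)]

# Invented sample page that follows the same template.
PAGE = """<html><body><table>
<tr><td>HP CQ60-101TX</td></tr>
<tr><td>Sample description</td></tr>
<tr><td><div><font>12.345.000 VND</font></div></td></tr>
</table></body></html>"""


def apply_pattern(root, pattern):
    """Follow the path from the root and return the text found, or None."""
    first_tag, _ = pattern[0]
    if root.tag != first_tag:
        return None
    node = root
    for tag, index in pattern[1:]:
        siblings = [child for child in node if child.tag == tag]
        if index > len(siblings):
            return None  # the page does not follow this template
        node = siblings[index - 1]
    return "".join(node.itertext()).strip()


root = ET.fromstring(PAGE)
name = apply_pattern(root, NAME_PATTERN)
price = apply_pattern(root, PRICE_PATTERN)
```

A page generated from a different template simply yields None for one of the patterns, which is a cheap way to detect that the pattern does not apply.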

3.3 Scientific basis
This section states and solves the underlying problems needed to build a price search system. It concentrates on two main problems: determining the real price of a product, and automatically extracting the names and prices of products. The first problem supports the solution of the second. Together they form the core of the product price search system.

3.3.1 Classification of sales pages

Product sales pages fall into two main types:
- Pure product sales pages: pages with a clear layout and presentation, where information is provided according to fixed templates.

Figure 13. Example of an ordinary sales page

- Classified-ad pages: pages without a clear layout, where the presentation depends on the user who posts the ad.

Figure 14. Example of a classified-ad page

3.3.2 Extracting the price of a given product

Preliminary problem: finding the prices in a web page
- Input: The HTML source of a web page.
- Output: The prices contained in that source.
Example: a web page selling the HP Mini-note.

Figure 15. Example of extracting prices from a web page (annotations in the figure: prefix "Giá", suffix "VNĐ")

The extracted prices are:
6,559,000 VNĐ
4,950,000 VNĐ
13,999,000 VNĐ
14,399,000 VNĐ

The method used in this thesis is to build the DOM tree corresponding to the HTML code of the page and then traverse the tree to find the prices it contains. To decide which nodes of the DOM tree contain a price, the thesis builds a set of rules:
- A price is preceded by a prefix such as GIÁ or PRICE.
- A price is followed by a suffix such as VNĐ, USD, VND, đ, or $.
- Price format: numeric, that is, made up of the characters {0, 1, 2, ..., 9, ",", "."}.
- The node that contains the price is a #text node.
While collecting statistics, however, we found many irrelevant prices. In the example above, 300.000 VNĐ is not a price even though it carries the suffix VNĐ.
In some cases, such as the one in Figure 16, a value satisfies the prefix, suffix, and format conditions but is still not a price that is meaningful to the user. The thesis therefore defines exclusion prefixes to discard such prices.
Some exclusion prefixes are: GIÁ CŨ (old price), GIÁ BÌA (cover price), GIÁ THỊ TRƯỜNG (market price).
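The rules above can be sketched as a small filter over the text of #text nodes. This is an illustration under stated assumptions, not the thesis implementation: it assumes the prefix appears in the same text node as the number, the keyword lists contain only the examples named in the text, and the function name `find_prices` is hypothetical.

```python
import re

PREFIXES = ("GIÁ", "PRICE")
SUFFIXES = ("VNĐ", "VND", "USD", "đ", "$")
REJECT_PREFIXES = ("GIÁ CŨ", "GIÁ BÌA", "GIÁ THỊ TRƯỜNG")

# Format rule: digits, optionally grouped with "," or ".".
NUMBER = r"\d+(?:[.,]\d+)*"
# Longest suffix first so that "VNĐ" wins over the single letter "đ".
_SUFFIX = "|".join(re.escape(s) for s in sorted(SUFFIXES, key=len, reverse=True))
PRICE_RE = re.compile(rf"({NUMBER})\s*({_SUFFIX})", re.IGNORECASE)


def find_prices(text_nodes):
    """Return the prices found in an iterable of #text node strings."""
    prices = []
    for text in text_nodes:
        upper = text.upper()
        if any(word in upper for word in REJECT_PREFIXES):
            continue  # exclusion prefix, e.g. an old or list price
        if not any(word in upper for word in PREFIXES):
            continue  # no price prefix in this node
        for match in PRICE_RE.finditer(text):
            prices.append(f"{match.group(1)} {match.group(2)}")
    return prices


nodes = [
    "Giá: 6,559,000 VNĐ",
    "Giá cũ: 7,000,000 VNĐ",
    "Free shipping for orders over 300.000 VNĐ",
]
found = find_prices(nodes)
```

Only the first node passes all the rules: the second is discarded by an exclusion prefix, and the third has a valid suffix and format but no price prefix.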

Figure 16. Example of a product page containing incorrect prices (annotations in the figure: exclusion prefix, incorrect price)

Extracting the price of a product

Problem description
- Input: A product name and a web page related to that product.
- Output: The real price of the product, the extraction pattern for the real price, and the extraction pattern for the product name.
Example: the input is the following web page selling the Nokia 1200.

Figure 17. Example of extracting the real price from a product page (the real price is marked in the figure)

The output is the real price of the product, VNĐ 540.000; the extraction pattern for the product name, HTML → BODY → TABLE[1] → TR[1] → TD[1] → product name; and the extraction pattern for the price, HTML → BODY → TABLE[1] → TR[2] → TD[2] → real product price.
Determining the price of a product is a crucial problem in a price search system. However, there is no single price-recognition standard that works on every page.
Solution method
The price is determined through the following steps, after building the DOM tree corresponding to the HTML code of the input web page:
- Step 1: Find the DOM node that contains the product name and obtain the extraction pattern for the product name.
- Step 2: Find all nodes that contain a price in the page, as described in the preliminary problem, and obtain the extraction pattern for each of these prices.
- Step 3: Discard unsuitable prices.
- Step 4: Determine the real price of the product from the relationship between the product's name and its price.
In step 1, we traverse the DOM tree and find the nodes that contain the product name (the product name is given as input). From these nodes we generate the extraction pattern for the product name.
In step 2, after finding a node that contains a price according to the preliminary problem, we obtain the extraction pattern for that node using the DOM-based extraction method described in section 2.1.
Once all price patterns and the product-name pattern are known, step 3 discards the unsuitable prices, namely those that lie inside a <strike> or <s> tag.
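Steps 1 to 3 can be sketched as one tree traversal that records the root-to-node path of every matching node and skips anything inside <strike> or <s>. The sketch assumes well-formed markup (so the standard-library XML parser can be used), looks only at the text directly inside an element, and uses an invented sample page; `paths_where` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

EXCLUDED_TAGS = {"strike", "s"}  # struck-through prices are discarded


def paths_where(root, predicate):
    """Return the path of every element whose own text satisfies predicate."""
    found = []

    def walk(node, path):
        if node.tag in EXCLUDED_TAGS:
            return  # step 3: ignore everything inside <strike> or <s>
        if node.text and predicate(node.text):
            found.append(" → ".join(path))
        counts = {}
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, path + [f"{child.tag}[{counts[child.tag]}]"])

    walk(root, [f"{root.tag}[1]"])
    return found


PAGE = """<html><body><table>
<tr><td>Nokia 1200</td></tr>
<tr><td><s>600.000 VNĐ</s></td><td>540.000 VNĐ</td></tr>
</table></body></html>"""

root = ET.fromstring(PAGE)
name_paths = paths_where(root, lambda t: "nokia 1200" in t.lower())
price_paths = paths_where(root, lambda t: "vnđ" in t.lower())
```

For the price predicate a real system would plug in the price rules of the preliminary problem; the simple substring test here only keeps the example short.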
A price may appear independently or not. For example, <tag>120.000 vnđ</tag> is an independent price, while <tag>100.000 vnđ (30%)</tag> is not. If there is only one independent price, it is taken as the real price. If there are several independent prices, any of them could be the price of the product, so we must rely on the relationship between the product name and its price. On a product sales page this relationship is structural closeness in the HTML (for example, the two belong to adjacent nodes in the DOM tree).
To measure the closeness of the nodes containing the name and the price in the DOM tree, this thesis uses the overlap between the root-to-node paths of the two extraction patterns.
Example:
Extraction pattern for the product name: HTML → BODY → TABLE → TR → TD → DIV[1] → product name.
Extraction pattern for the product price: HTML → BODY → TABLE → TR → TD → DIV[2] → FONT → product price.
For these two patterns the overlap is 5, corresponding to the 5 shared steps: HTML[1] → BODY[2] → TABLE[3] → TR[4] → TD[5].
The price whose pattern has the largest overlap with the name pattern is taken as the real price of the product. However, some pages do not give the product's price but do contain unrelated prices that are not the price of the product. The question is how to recognize that such a price does not belong to the product.
To handle this, the thesis uses an additional measure of the difference between the two extraction patterns. Only if this difference is below a threshold are the price pattern and the name pattern accepted as a candidate pair for extracting the real price and the product name from the pages.
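The overlap measure and the threshold test can be written down directly. The overlap follows the example in the text; the difference measure is not defined in the text, so the version here (the number of steps in either path that are not shared) is an assumption made for illustration, as are the function names.

```python
def overlap(name_path, price_path):
    """Number of leading steps shared by two root-to-node paths."""
    shared = 0
    for a, b in zip(name_path, price_path):
        if a != b:
            break
        shared += 1
    return shared


def difference(name_path, price_path):
    """Assumed measure: steps of either path that are not shared."""
    shared = overlap(name_path, price_path)
    return (len(name_path) - shared) + (len(price_path) - shared)


def pick_real_price(name_path, price_paths, threshold):
    """Return the price path closest to the name, or None if none is close."""
    candidates = [p for p in price_paths if difference(name_path, p) < threshold]
    if not candidates:
        return None  # only unrelated prices on this page
    return max(candidates, key=lambda p: overlap(name_path, p))


name = ["HTML", "BODY", "TABLE", "TR", "TD", "DIV[1]"]
price = ["HTML", "BODY", "TABLE", "TR", "TD", "DIV[2]", "FONT"]
unrelated = ["HTML", "BODY", "DIV[3]", "SPAN"]
```

With these paths, `overlap(name, price)` is 5 as in the example above, and a threshold of 4 keeps `price` while rejecting `unrelated`.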
For Vietnamese pages, this thesis faces several challenges: prices are written incorrectly or too ambiguously. Some pages write "Giá: VNĐ 120.000" when the correct form is "120.000 VNĐ". Other pages have not yet updated the product price, so it appears only as "Giá:(x)vnđ". In particular, prices on some Vietnamese classified-ad pages follow no writing convention at all, for example "Cần bán nokia 1200 giá 320k" ("Nokia 1200 for sale, price 320k").
Based on statistics gathered from many product sales pages in Vietnam and abroad, in fields such as phones, computers, cosmetics, and jewelry, and especially from classified-ad pages, and by combining the preliminary problem with the real-price problem, this thesis proposes a set of rules for extracting product prices.

Figure 18. Rule set for extracting product prices

The rule set consists of the following main rules:
- FirstRule: the prefixes of a price
- LastRule: the suffixes of a price
- RejectRule: the exclusion prefixes
- Format: the format of a price
- TagName: the name of the HTML tag that contains the price

While building the price extraction rule set, we observed that, besides the price, users also care about other product attributes such as the product image, the warranty period, and promotion details. The way the rule set is organized for prices can be applied to these attributes as well.
Following the general idea of the price extraction method, which uses the product name as an anchor and takes the price closest to it as the real price, this thesis has also built extraction rules for these attributes:
- Extraction rule for the product image

Figure 19. Extraction rule for the product image

- Extraction rule for the warranty period

Figure 20. Extraction rule for product warranty information
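One way to see why the same organization carries over to other attributes is to store the five rules as data. The sketch below is an assumption about how such a rule set could be represented; the field names mirror FirstRule, LastRule, RejectRule, Format, and TagName. The price values come from the examples in the text, while the warranty rule is purely hypothetical, since the content of Figure 20 is not reproduced here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleSet:
    first_rule: tuple      # prefixes
    last_rule: tuple       # suffixes
    reject_rule: tuple     # exclusion prefixes
    format: str            # regular expression for the value
    tag_name: str = "#text"


def extract(rules, tag_name, text):
    """Return the matched value, or None if any rule fails."""
    if tag_name != rules.tag_name:
        return None
    upper = text.upper()
    if any(word in upper for word in rules.reject_rule):
        return None
    if not any(word in upper for word in rules.first_rule):
        return None
    suffixes = "|".join(re.escape(s) for s in rules.last_rule)
    match = re.search(rf"(?:{rules.format})\s*(?:{suffixes})", text, re.IGNORECASE)
    return match.group(0) if match else None


PRICE = RuleSet(("GIÁ", "PRICE"), ("VNĐ", "VND", "USD", "$"),
                ("GIÁ CŨ", "GIÁ BÌA", "GIÁ THỊ TRƯỜNG"), r"\d+(?:[.,]\d+)*")
# Hypothetical rule for the warranty period ("Bảo hành: 12 tháng").
WARRANTY = RuleSet(("BẢO HÀNH",), ("THÁNG",), (), r"\d+")
```

Adding a new attribute then means adding one more RuleSet value rather than writing new extraction code.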

3.3.3 Automatically extracting the names and prices of products from product sales pages
In the extraction problems of section 2.3, the training sample set has to be defined in advance. The semi-automatic extraction methods require this training set to be labeled by hand, and automatic methods such as RoadRunner require the output to be labeled by hand.
The problem posed below, in contrast, determines the training sample set automatically from a set of product names and automatically generates the extraction patterns for product names and prices.
Given a seed set of product names, we can automatically find the set of pages related to those products and then, using the rule set described in 3.3.2, automatically generate the extraction patterns for product names and prices on these pages.

Problem description
- Input: A seed set of product names.
- Output: The product sales websites and the extraction patterns for the names and prices of the products on those websites.
Solution method
To solve this problem the thesis uses the real-price problem described in section 3.3.2.
- Step 1: Find the related pages
For each product name in the seed set we create a query and send it to a search engine; the results returned are pages related to that product.
Example: for the product name nokia 1200, we send the query "nokia 1200" to the Google search engine and obtain the following pages related to the Nokia 1200:

Figure 21. Google results for the query "nokia 1200" (the figure marks one news page and one product sales page)

However, the returned results may be merely introduction or news pages about the product; in the example above the very first result is a product news page. We therefore have to optimize the queries sent to the search engine to get the best result, that is, the largest number of pages that actually sell the product. Using the characteristic features of product sales pages, we can construct better queries.
Example: an optimized query for nokia 1200 is "nokia 1200" + "vnđ OR usd".
The results returned by Google are:

Figure 22. Google results for the query "nokia 1200" + "vnđ OR usd" (the figure marks the product sales pages)

This example shows that optimized queries bring more product sales pages into the results: here the first 6 results are all product sales pages.
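Building the optimized query is a simple string operation. The sketch below reproduces the query form used in the example; the helper names and the `q` parameter name are assumptions for illustration and do not refer to any particular search engine API.

```python
from urllib.parse import urlencode


def build_query(product_name, currency_terms=("vnđ", "usd")):
    """Quote the product name and add the currency words as an OR group."""
    terms = " OR ".join(currency_terms)
    return f'"{product_name}" + "{terms}"'


def query_string(product_name):
    """URL-encode the query under a hypothetical parameter name 'q'."""
    return urlencode({"q": build_query(product_name)})


query = build_query("nokia 1200")
```

The currency words act as a cheap signal that a page shows a price, which is what separates sales pages from news or review pages in the results.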
- Step 2: Obtain the extraction patterns for each page found in step 1.
Each related page found in step 1 is related to one product in the seed set. The pair (product name, related page) is the input to the problem of extracting the price of a given product, and the result is the extraction patterns for that page.
- Step 3: Identify the sales websites and their extraction patterns.
Step 2 yields statistics on the pairs of extraction patterns found on each website.
To decide whether a website is a product sales site, we use a statistical method: we count the number of products whose price could be extracted on that website. If this number exceeds a threshold, the website is considered a product sales site. The threshold is set according to the number of products in the seed set.
Once a product sales website has been identified, the thesis determines the extraction patterns for product names and prices on that website. We count how often each extraction pattern recurs; if the count exceeds a threshold, the pattern can be applied to the other pages of the same website.
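Step 3 amounts to two counts per website. The following sketch assumes that step 2 has produced one record per page on which a price was extracted; the record layout, the function name, and the example domains are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def identify_sales_sites(records, product_threshold, pattern_threshold):
    """records: (page_url, product_name, name_pattern, price_pattern) tuples.

    Returns {site: [(name_pattern, price_pattern), ...]} for every site with
    more than product_threshold distinct products, keeping only the pattern
    pairs seen more than pattern_threshold times.
    """
    products = defaultdict(set)
    patterns = defaultdict(Counter)
    for url, product, name_pattern, price_pattern in records:
        site = urlparse(url).netloc
        products[site].add(product)
        patterns[site][(name_pattern, price_pattern)] += 1

    result = {}
    for site, names in products.items():
        if len(names) > product_threshold:
            result[site] = [pair for pair, count in patterns[site].items()
                            if count > pattern_threshold]
    return result


records = [
    ("http://shop.example/a", "nokia 1200", "N1", "P1"),
    ("http://shop.example/b", "nokia 1202", "N1", "P1"),
    ("http://shop.example/c", "nokia 6300 silver", "N1", "P2"),
    ("http://news.example/x", "nokia 1200", "N9", "P9"),
]
sites = identify_sales_sites(records, product_threshold=2, pattern_threshold=1)
```

Here only shop.example passes the product threshold, and only the pattern pair that recurs on it is kept for later extraction.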

3.4 Steps for building the system

As noted in section 3.1.3, existing systems have several problems, and the thesis sets out to build a price search system that resolves them. Based on the practical and theoretical foundations above, the thesis proposes a fully automatic system model that can automatically identify product sales websites from an initial set of product names and automatically extract the names and prices of the products on those websites.
This section presents the system model and describes the system's ability to expand automatically.

3.4.1 System model

Overview model

Figure 23. Overview model of the system

First, the seed set of product names is passed through the module that identifies product sales websites and extraction patterns, producing a set of product sales websites together with the extraction patterns for product names and prices on those websites.
These websites and their extraction patterns then go through the data collection and information extraction module to obtain product names and prices. This information is stored in the product information database, and the product names are added to the seed set.

Module for identifying product sales websites and extraction patterns

Figure 24. Module for identifying product sales websites and extraction patterns

This module is built on the problem of automatically extracting the names and prices from product pages.
The initial seed set goes through the related-page identification process to produce a set of pages related to the products. These pages then go through the pattern extraction process, which yields extraction patterns and the website each pattern belongs to. The recurrence of these patterns and websites is then counted to obtain the sales websites and the extraction patterns suited to each website.

Data collection and information extraction module

Figure 25. Data collection and information extraction module
Once the websites and their extraction patterns have been identified, the websites are crawled. The collected data is then passed through the information extraction module to obtain the product information: the product name and the product price.
This information is stored in the product database, and the product names are used to extend the seed set.

3.4.2 Extensibility of the system

How many pages and extraction patterns the system can identify depends on the size of the seed set. Because the product names obtained by the data collection and extraction module are fed back into the seed set, both the database and the seed set are continually updated, and the system keeps expanding.
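The feedback loop between the two modules can be sketched as follows. The two modules are passed in as functions so that the loop itself stays independent of how they are implemented; the function names, the round limit, and the stopping rule are assumptions, since the text does not specify when expansion stops.

```python
def bootstrap(seeds, discover_sites, crawl_and_extract, max_rounds=3):
    """Grow the seed set and the product database round by round.

    discover_sites(seeds) -> {site: patterns}
    crawl_and_extract(site, patterns) -> iterable of (name, price)
    """
    seeds = set(seeds)
    database = {}
    for _ in range(max_rounds):
        new_names = set()
        for site, patterns in discover_sites(seeds).items():
            for name, price in crawl_and_extract(site, patterns):
                database[(site, name)] = price
                if name not in seeds:
                    new_names.add(name)
        if not new_names:
            break  # nothing new was learned in this round
        seeds |= new_names
    return seeds, database
```

Each round can discover sites that sell the newly learned products, which is how a seed set limited to one category gradually reaches other categories.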

Chapter 4. Experiments and evaluation

To evaluate the price search system, the thesis concentrates on its ability to collect product names and prices. This chapter presents four experiments that assess this ability, gives the result tables for each experiment, and comments on and evaluates the results.

4.1 Hardware and software environment

4.1.1 Hardware configuration
Table 1. Hardware configuration used in the experiments

Component          Specification
CPU                Intel Celeron CPU 2.66 GHz
RAM                768 MB
OS                 Windows XP Service Pack 2
Secondary storage  40 GB

4.1.2 Software tools

Table 2. Software used in the experiments

No.  Software                Author                 Source
1    Neko HTML               Distributed by Apache  http://sourceforge.net/projects/nekohtml
2    eclipse-SDK3.4.1-win32                         http://www.eclipse.org/downloads/

With the software tools above, the thesis builds a program that extracts product prices. The program is organized into 3 main packages:
- Crawler: collects data.
- GettingPattern: determines the extraction patterns for the product price and name on a web page.
- Extracting: identifies the sales websites and extracts the product names and prices on those websites.
The classes of these 3 packages are described in Table 3 below.

Table 3. Description of the price extraction program

Package         Class                  Function
Crawler         Crawling               Collects data from a website
Crawler         SEProcessing           Collects the URLs returned for a query sent to Google
GettingPattern  StandardHTML           Removes unimportant parts of the HTML code, such as SCRIPT and STYLE blocks
GettingPattern  ParserHTML             Parses the HTML code into a DOM tree (using NekoHTML)
GettingPattern  GettingXPath           Finds all extraction patterns that point to the product name and price
GettingPattern  ProcessingXPath        Determines the exact extraction patterns for the product name and price
Extracting      GettingWebsite         Identifies product sales websites and the extraction patterns of each website
Extracting      ExtractingInformation  Extracts the names and prices of products on the product sales websites

4.2 Experimental results

4.2.1 Experiment: extracting the price of a given product
Experiment description
The purpose of this experiment is to verify the correctness of the real-price method using the rules described in section 3.3.2.
- Input: A product name and a web page containing that product name.
- Output: The price of the product, if the web page contains it.
Experimental data
- The data for extracting the price of a product is collected through the Google search engine.
- For a given product name, we create a query and send it to the search engine.
  Example: for the camera Canon PowerShot G10, the query sent to the search engine is: Canon PowerShot G10 + VNĐ OR USD
- We take a number of the first results returned by the search engine and extract the set of URLs from them.
  Example: for the query Canon PowerShot G10 + VNĐ OR USD, the first 5 results returned by Google and the corresponding URLs are shown in the figure below:

Figure 26. Extracting the related URLs (the figure marks the extracted URLs)

- These URLs are then normalized to a standard form and the page data is downloaded.
- The downloaded data is passed through the price extraction module to produce the product price.
Example:
For the 5 URLs above, the extraction results are:
- http://www.vatgia.com/319/257728/canon-powershot-g10.html
  Product: canon powershot g10
  Price: 8.008.000 vnđ (440,00 usd)
- http://www.raovatmienphi.com/canon-powershot-g10-gia-490-usd.html
  Product: canon powershot g10
  Price: giá 490 usd
- http://www.123mua.com.vn/xem?sp=RXGQRVfReX
  Product: canon powershot g10
  Price: 644 sd
- http://enbac.com/Ky-thuat-so/p167975/May-chup-hinh-Canon-PowerShotG10.html
  Product: máy chụp hình canon powershot g10
  Price: 8.550.000vnđ
- http://www.megabuy.vn/?a=NEWS&news=DETA&hdn_news_id=10434
  Product: canon powershot g10
  Price: 665 usd

Experimental results
The thesis runs the experiment on the products nokia 1200, lenovo thinkpad t61, and canon powershot g10; each product is tested in 3 cases, using the first 10, 30, and 100 results returned by Google. The results are evaluated with recall (R) and precision (P) and are shown in the following table:

Table 4. Results of extracting the real price of a product
(-: value missing in the source table)

Product              Query                             Google results  Actually correct  Extracted  Correct  Run time  Recall  Precision
Nokia 1200           Nokia 1200 + VNĐ OR USD           10              -                 -          -        37.45 s   100%    100%
                                                       30              23                26         23       147.43 s  100%    88.46%
                                                       100             68                70         67       407.17 s  98.53%  95.71%
Lenovo Thinkpad t61  Lenovo Thinkpad t61 + VNĐ OR USD  10              10                10         -        39.67 s   90%     90%
                                                       30              23                25         22       125.25 s  95.6%   88%
                                                       100             43                46         40       1200 s    93.02%  86.95%
Canon PowerShot G10  Canon PowerShot G10 + VNĐ OR USD  10              -                 -          -        52.92 s   100%    100%
                                                       30              19                21         18       86.91 s   94.74%  85.71%
                                                       100             45                50         44       263.33 s  97.78%  88%
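The recall and precision columns are consistent with the usual definitions, with "Correct" as the numerator in both cases. A minimal sketch, using the Nokia 1200 row for 30 Google results as the worked example (the function name is hypothetical):

```python
def precision_recall(actually_correct, extracted, correct):
    """Precision = correct / extracted, recall = correct / actually correct."""
    precision = correct / extracted
    recall = correct / actually_correct
    return precision, recall


# Nokia 1200, 30 Google results: 23 actually correct, 26 extracted, 23 correct.
p, r = precision_recall(actually_correct=23, extracted=26, correct=23)
```

For this row the recall is 23/23 and the precision is 23/26, which rounds to the 88.46% shown in the table.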

Remarks
Across the results, precision is generally lower than recall. The reason is that in some cases the price appears in a highly ambiguous context.
Example:

Figure 27. Web page with ambiguous price information

In this case the system may wrongly report: nokia 1200, price: 599.000 đồng.
In fact the page gives the price of the nokia 1202: 599.000 đồng.
Recall is high because almost every page that shows a correct price can be extracted accurately. A correct price is one that is presented as the real price of the product.
Example:

Figure 28. Web page with a clear price

The extraction result is:
Product name: nokia 1200 black, product price: 520,000 vnđ

4.2.2 Experiment: identifying sales websites

Experiment description
The purpose of this experiment is to verify how accurately product sales websites can be identified from the initial seed set of product names, as described for the automatic name and price extraction problem in section 3.3.3.
- Input: A seed set of product names.
- Output: The product sales websites that sell the products in the seed set, and the extraction patterns for each website.
Experimental data
- A given seed set of product names.
- Google is used as the search engine to find pages related to the products.
- Queries are created from the product names in the seed set and sent to Google to obtain the related pages.
- The related pages are downloaded and the extraction patterns for the product information are determined, giving tuples of the form (website, product name pattern, product price pattern).
The recurrence of each tuple is counted. If a tuple recurs many times, the website in the tuple is a sales website and the patterns in the tuple can be applied to that website.
Experimental results
The seed set consists of 4 product names:
- nokia 1200
- nokia e71 white steel
- nokia 1202
- nokia 6300 silver
With the threshold set to 3, we obtain the following results.

Table 5. Results of identifying product sales websites

Google results  Run time   Sales domains identified
10              288.84 s   www.123mua.com.vn, www.vatgia.com, www.chodientu.vn, www.vinacms.vn
30              708 s      www.123mua.com.vn, www.vatgia.com, www.chodientu.vn, www.vinacms.vn, www.enbac.com
100             3638.76 s  www.123mua.com.vn, www.vatgia.com, www.chodientu.vn, www.vinacms.vn, www.enbac.com, www.quangcaosanpham.com, www.dienthoaididong.com.vn, www.aha.vn, www.trananh.vn

Remarks
The results are encouraging: every website that the system identified is indeed a product sales website.
For the individual cases:
- with 10 Google results, 4 websites were identified
- with 30 Google results, 5 websites were identified
- with 100 Google results, 10 websites were identified
However, because the initial seed set contains only 4 product names, the number of product sales websites identified is still small.

4.2.3 Experiment: collecting and extracting information from a website
Experiment description
The purpose of this experiment is to verify the product information extraction method described for the automatic name and price extraction problem in section 3.3.3. It also helps assess the accuracy of the extraction patterns obtained in experiment 4.2.2.
- Input: The sales websites and their extraction patterns obtained in the sales website identification experiment.
- Output: The names and prices of the products on those websites.
Data used
This experiment uses 2 websites from experiment 4.2.2:
- www.dienthoaididong.com.vn
- www.trananh.vn
Each of the two websites is crawled, 5000 documents per website, and data is extracted from the collected documents using the extraction patterns of the respective website.
Results

Table 6. Results of extracting products

Website                     Products extracted
www.dienthoaididong.com.vn  743
www.trananh.vn              416

Remarks
A fairly large number of products was extracted, and all of the extracted products are correct, which indicates that this extraction method is accurate.
However, the 416 products from www.trananh.vn are all mobile phones, although the website also sells computers. The reason is that the seed set contains only mobile phone names, and on this website the templates for phones and for computers are different.

4.2.4 Experiment: information collection capability of the system

Experiment description
The purpose of this experiment is to evaluate the system's ability to collect product names and prices.
- Input: A seed set of product names.
- Output: The names and prices of the products that can be extracted.
Experimental data
The product names in the seed set are taken from vatgia.com. They are spread evenly over many product categories, such as phones, computers, cameras, jewelry, and household goods.
Results

Table 7. Results of the information collection experiment

Product names in the seed set  Sales websites identified  Products extracted
334                            125 (see Appendix 2)       47,856, of which 34,012 are distinct

Remarks:
The extracted products are spread over many categories, just like the seed set. Some representative products are:

Table 8. Some extracted products

Product name                                   Product price
nokia 2680 slide                               1,530,000 vnđ
canon powershot g10                            8.645.000 vnđ
dell inspiron mini 9 - r560921vn ( pc - dos )  8,029,000 vnđ
Comple nam hiệu Cavil Klein                    14.560.000 vnđ
Phấn trang điểm - Ohui                         575.000 vnđ

These results show that the system collects information effectively.

Conclusion
Results of the thesis
From the study of information extraction for semi-structured data, the thesis proposes a method for automatically extracting product prices. The experimental results demonstrate the usefulness of this method.
In terms of content, the thesis achieved the following:
- Introduced the information extraction problem: the concept, the data domains, and the approaches to the problem.
- Studied information extraction for semi-structured data: presented the methods used for extraction, introduced the two extraction algorithms Stalker and RoadRunner, and analyzed their strengths and weaknesses in order to build a suitable method for extracting product prices.
- Based on the theory developed for extracting product prices, built a model of a product price search system.
- Implemented a program for extracting product prices in Java, in the Eclipse environment, to evaluate the proposed system model.
Because of limited time and knowledge, the thesis still has the following limitations:
- The thesis has not yet built a user interface, and the real-price experiment has not reached the desired accuracy.
Future work
In the future, we will continue to address the limitations above and will work to release the system for public use.

References

[1]. Andrew Carlson and Charles Schafer. Bootstrapping Information Extraction from Semi-structured Web Pages. ECML/PKDD, 2008.
[2]. Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. http://www.cs.uic.edu/~liub/WebMiningBook.html, December 2006.
[3]. Elwin Chai and Rick Jones. Automated Price Comparison Shopping Search Engine: PriceHunter. CSE, 2001.
[4]. U. Irmak and T. Suel. Interactive Wrapper Generation with Minimal User Effort. In Proc. of the 15th Intl. Conf. on World Wide Web (WWW'06), 2006.
[5]. I. Muslea, S. Minton, and C. A. Knoblock. A Hierarchical Approach to Wrapper Induction. In Proc. of the Intl. Conf. on Autonomous Agents (AGENTS'99), pp. 190–197, 1999.
[6]. Jaeyoung Yang, Heekuck Oh, Kyung-Goo Doh, and Joongmin Choi. A Knowledge-Based Information Extraction System for Semi-structured Labeled Documents. In Proc. of the Third International Conference on Intelligent Data Engineering and Automated Learning, 2002.
[7]. Robert B. Doorenbos, Oren Etzioni, and Daniel S. Weld. A Scalable Comparison-Shopping Agent for the World-Wide Web. www.cs.washington.edu/homes/etzioni/papers/agents97.pdf, 1997.
[8]. Sergey Brin. Extracting Patterns and Relations from the World Wide Web. WebDB Workshop at the 6th International Conference on Extending Database Technology, 1998.
[9]. S. Debnath, P. Mitra, N. Pal, and C. L. Giles. Automatic Identification of Informative Sections of Web-pages. IEEE Trans. Knowl. Data Eng. 17, pages 1233–1246, 2005.
[10]. S. Debnath, P. Mitra, and C. L. Giles. Automatic extraction of informative blocks from webpages. In Proc. SAC, pages 1722–1726, 2005.

[12]. V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In Proc. of Very Large Data Bases (VLDB'01), pp. 109–118, 2001.
[13]. N. Kushmerick. Wrapper Induction for Information Extraction (WIEN). Ph.D. Thesis, Dept. of Computer Science, University of Washington, TR UW-CSE-97-11-04, 1997.
[14]. W. Cohen, M. Hurst, and L. S. Jensen. A Flexible Learning System for Wrapping Tables and Lists in HTML Documents. In Proc. of the 11th Intl. World Wide Web Conf. (WWW'02), pp. 232–241, 2002.
[15]. http://www.w3.org/DOM/
[16]. http://www.w3.org/TR/xpath
[17]. http://www.dcs.bbk.ac.uk/~ptw/teaching/ssd/toc.html
[18]. http://en.wikipedia.org/wiki/Price_comparison_service

Appendix
Appendix 1: List of websites surveyed for the characteristics of product prices
Website address
www.amazon.com
www.jr.com
www.imobilecellphones.com
www.220depot.com
www.trananh.vn
www.vatgia.com
www.rongbay.com
www.vinabook.com
www.sieuthitrangsuc.com
www.aodaiminhthu.com
www.goodsmart.vn

Appendix 2: List of some of the sales websites identified in experiment 4.2.4
Website address
www.ducminhmobile.net
www.gsmserver.com
www.gounlock.com
www.123mua.com.vn
www.dienthoaididong.com.vn
www.vatgia.com
www.aha.vn
www.chodientu.vn
www.raovat.net
www.trananh.vn
www.megabuy.vn
